Communication Efficient Compressed and
Accelerated Federated Learning in Open RAN
Intelligent Controllers
Amardip Kumar Singh, Kim Khoa Nguyen
École de Technologie Supérieure, Montreal, Canada
amardip-kumar.singh.1@ens.etsmtl.ca, kimkhoa.nguyen@etsmtl.ca
Abstract—The disaggregated and hierarchical architecture of
Open Radio Access Network (ORAN) with openness paradigm
promises to deliver the ever demanding 5G services. Meanwhile, it
also faces new challenges for the efficient deployment of Machine
Learning (ML) models. Although ORAN has been designed
with built-in Radio Intelligent Controllers (RIC) providing the
capability of training ML models, traditional centralized learning
methods may be no longer appropriate for the RICs due to pri-
vacy issues, computational burden, and communication overhead.
Recently, Federated Learning (FL), a powerful distributed ML
training, has emerged as a new solution for training models in
ORAN systems. 5G use cases such as meeting the network slice
Service Level Agreement (SLA) and Key Performance Indicator
(KPI) monitoring for smart radio resource management can
greatly benefit from FL models. However, training FL models
efficiently in ORAN system is a challenging issue due to the
stringent deadline of ORAN control loops, expensive compute
resources, and limited communication bandwidth. Moreover, to
deliver Grade of Service (GoS), the trained ML models must
converge with acceptable accuracy. In this paper, we propose a
second order gradient descent based FL training method named
MCORANFed that utilizes compression techniques to minimize
the communication cost and yet converges at a faster rate than
state-of-the-art FL variants. We formulate a joint optimization
problem to minimize the overall resource cost and learning time,
and then solve it by the decomposition method. Our experimental
results prove that MCORANFed is communication efficient with
respect to the ORAN system, and outperforms FL methods such as MFL,
FedAvg, and ORANFed in terms of costs and convergence rate.
Index Terms—Federated Learning, ORAN, 5G, Resource Allocation, RAN Intelligent Controller, Network Slicing, RIC
I. INTRODUCTION
Due to an exponential growth in high performing user
devices and a massive increase in connectivity, the demand
of a reliable and scalable data network is ever increasing.
To meet this demand and realise the vision of 5G and
beyond communications, leading telecommunication infras-
tructure providers have taken many initiatives and instituted fo-
cused research studies to redesign their network management
systems [1]. A major part of this ongoing study concentrates
on Radio Access Network (RAN) part. In this direction, O-
RAN Alliance has already published the specifications of
Open RAN (ORAN) which is particularly based on two key
concepts: (i) openness and (ii) in-built intelligence [2]. ORAN
frees the network from vendor lock-in and disaggregates
the processing layers from proprietary hardware to shared
open clouds. On the other hand, it also calls for a smarter
Service and Management Operation by installing intelligent
controllers that utilize the operational and performance data
to optimize the resources [3].
With its unique features, ORAN can support many use cases
such as Quality of Experience (QoE) optimization, ensuring
Service Level Agreements (SLA), providing slice
based ultra Reliable Low Latency Communications (uRLLC)
and enhanced Mobile Broadband (eMBB), and monitoring
Key Performance Indicators (KPI), which are intrinsic parts of
the 5G network [4]. All of these use cases require training of
fast and secure Machine Learning (ML) models. However, a
traditional centralized ML model suffers from two important
limitations: (i) transferring locally collected data to a central
server and (ii) model training time. Further, the privacy issues
of shared processing open clouds and security of raw data pose
additional challenges in the paradigm of ORAN architecture
[5]. For this reason, we turn to Federated Learning (FL),
which has recently emerged as an advanced ML tool [6].
The basic idea of FL is to train an ML model without
sharing the locally collected raw data and communicating only
its model updates to a distant server [7] to iteratively improve
the final model so trained. In this training, an iterative gradient
descent method is used to minimize the loss function, which
is the main objective of any ML model. FL, being in its
nascent stage, is still being studied and progressively improved
by researchers across domains [8]. However, there is
hardly any standard version of it that adheres to ORAN
architectural requirements, and this therefore remains an open
problem [9]. The main questions in this setup are:
(i) How to allocate the ORAN resources required to train such
a model?; (ii) How to meet the stringent deadlines of ORAN
control loops that automate the performance of the network
while training an FL model?; and (iii) How to guarantee the
convergence of this model with accuracy?
Depending on the environment in which it is implemented, most
FL variants improve their performance with respect to two
aspects: (i) selection of participating local trainers, and (ii)
reduction of the communication time between the global server
and the set of local trainers [6]. In case of ORAN architectural
environment, both aspects are important. Moreover, the control
loop deadline varies with respect to the performance of each
network slice. In this paper, we first propose an accelerated
iteration method that utilizes a random sparsification com-
pression operator to optimize the communication resources.
Then, we derive an updated FL algorithm by incorporating
slice specific and deadline aware selection of local trainers
[10].
Our main contributions can be summarized as follows:
1) A novel second order momentum gradient descent based
FL training method that utilizes compressed model
updates in each global round.
2) A mathematical proof of the rate of convergence of this
method.
3) A mathematical formulation of the joint optimization
problem to minimize communication resource usage cost
and total learning time under the constraints of limited
bandwidth, selected local trainers, and compression pa-
rameter.
4) A solution of this problem by decomposition method.
We first solve for the set of local trainers and then
allocate the resources to this set of selected trainers.
5) An updated FL algorithm, called MCORANFed (Momentum
Compressed ORAN FL), to train the model
through several iterations.
6) An analysis of results obtained from extensive experi-
ments that validates our proposed method compared to
the state-of-the-art FL variants.
To the best of our knowledge, this is the first work that
takes into account acceleration and compression techniques
to optimize ML training through federated settings in ORAN.
The remainder of this paper is organized as follows. In Section
II, a summary of the related works is presented. Then, we
briefly describe the concerned ORAN architecture for the FL
setup in section III. In Section IV, the considered system model
and the problem formulation are presented. In Section V, we
describe our proposed solution approach. In Section VI, we
present the numerical results to evaluate the performance of
our proposed solution. Finally, we conclude the paper and
discuss our future work in section VII.
II. RELATED WORKS AND CHALLENGES
[7] introduced the initial concept of federated averaging
to train deep learning models in a distributed setting. It also
proposed a method to select a fraction of clients instead
of all the local trainers. While it solves the challenges of
privacy by not sharing the raw data with the aggregator,
the communication efficiency is still questionable. Based on
this work, several variants of FL have then been proposed
to meet the requirements of edge cloud based systems. [11],
[12], [13], [14], and [15] dealt with the energy efficiency
optimization problems where a joint problem is solved under
limited compute resources. In all of these FL variants, the
local model is processed at the user end devices. Their main
objective was to minimize the energy cost incurred while
training the local model. However, in an ORAN system, the
main objective should be minimizing the learning time to
boost the performance of FL training itself. Nonetheless, it
laid the foundation of mathematically formulating a resource
allocation problem for FL in mobile edge architecture. The
control algorithm in [13] regulates the number of global and
local iterations required to attain a desired level of model
accuracy. The authors in [12] have considered a wireless com-
munication scenario to train an FL model under synchronous
mode of communication between the model update aggregator
and local devices. Latency constraints are considered in [14]
to minimize the computation and transmission energy of user
devices. Through approximated and closed form solutions to
the resource allocation problem, it assigns optimal processing
power for the local training.
In prior works [16], [17], [18], [19], and [20], the authors
have tackled another aspect of FL performance. They focus on
optimizing the number of local trainers in each global round of
FL. Further, deadline in each loop is also taken into account so
that the communication overhead can be minimized. However,
the FL training method has not been improved significantly in
these works. The simple first order iterative gradient descent
method is common in all these papers. [16] investigated the
role of client scheduling and user association problems of
FL training. There are multiple approaches to deal with this
aspect such as Round Robin Scheduling, Proportional Fair
Scheduling, and Random Scheduling. This is done considering
that the set of end users that train the local models is fixed.
However, in an ORAN system, the ML model is not trained by
end users. Rather, the edge node dataset keeps changing
and only the set of local training points remains fixed. That
adds the extra challenge of data heterogeneity as well.
[20] proposed a hierarchical FL algorithm in a single edge cloud
system. ORAN, in contrast, deals with multi-carrier networks,
leading to frequent changes in the available bandwidth. [17] takes into account
the channel states to optimize the client selection.
Unlike these prior works, we propose an FL training method
that suits the architectural requirements of ORAN, which is
quite different from a traditional mobile edge cloud system.
III. ML TRAINING IN ORAN
O-RAN Alliance has specified a promising framework with
intelligent blocks residing within the Service and Management
Orchestration (SMO) layer, termed as Radio Intelligent Con-
trollers (RIC) [21]. There are two types of RICs: near real
time intelligent controller (Near-RT RIC) and non real time
intelligent controller (Non-RT RIC). The newly developed
interfaces like E2 and A1 collect measurement and operational
datasets to monitor the performance of RAN in scenarios
where the expected latency is in range of milliseconds [22].
These datasets are stored in Radio Network Information Base
(RNIB) that are shared with multi-vendor edge clouds. O-RAN
has already listed various services as xApps in the Near-RT RIC
and Non-RT RIC blocks. For example, the QoS and QoE
optimization use case in which the Non-RT RIC checks long-
term patterns and provides policy guidance to the Near-RT
RIC. Supported with ML models and RAN lower level data
reported through E2 nodes, the Near-RT RIC can dynamically
predict the QoE or QoS performance of the ongoing applica-
tion and change the behaviour of RAN to maintain the user
experience [21].

Fig. 1. O-RAN Architecture for Radio Intelligent Controllers based FL

Another important use case determines RAN
slicing and SLA assurance and helps in resource allocation
optimization. Using the slice specific performance metrics
collected through E2 nodes, Non-RT RIC trains an ML model
and guides Near-RT RIC using A1 interface based policies
to meet the slice KPI targets. It can also guide the dynamic
Radio Resource Management (RRM) policies of the Near-RT RIC
through O1 configuration. SMO can also take into account
these ML models to predict the optimal hierarchy ratio for
controlling RAN elements [21]. As illustrated in Figure 1, the
dedicated databases that collect the slice specific performance
measurements are connected to the respective near-RT RICs onto
the RAN analytics and AI platform through E1 and O1
standardized interfaces. These dynamic data collection points
are then processed locally at the edge computing hosts. The
interaction through A1 interface then transfers the updates to
the Non-RT RIC placed in SMO at a regional cloud server.
We formulate the problem of local trainers’ selection, resource
allocation, total FL training time, and resource costs involved
in this framework.
IV. SYSTEM MODEL AND PROBLEM FORMULATION
Consider an O-RAN system with a single regional cloud
and a set $\mathcal{M}$ of $M$ distributed edge clouds cooperatively
training an FL model. As illustrated in Figure 2, in this FL
setup each edge cloud node hosts a near-RT-RIC entity that
processes its locally collected data to train a
local FL model. The Non-RT-RIC placed at the regional cloud
integrates the local FL models from participating edge clouds
and generates an aggregated FL model. This aggregated FL
model is further used to improve local FL models of each near-
RT-RIC enabling the local models to collaboratively perform
TABLE I
SUMMARY OF KEY NOTATIONS.

Notation | System Model Parameters
$\mathcal{M}$ | Set of distributed edges
$D_i$ | Dataset at the $i$th local trainer
$S$ | Total size of all training data
$R^{co}$ | Total communication resource cost
$R^{co}_m$ | Communication resource cost of the $m$th near-RT-RIC
$R^{cp}$ | Total compute resource cost
$c_m$ | CPU cycles required to process one bit of data
$f_m$ | Processing power of the $m$th host
$p^c$ | Compute resource usage cost
$R^{cp}_m$ | Compute resource cost of the $m$th near-RT-RIC
$p^{tr}$ | Bandwidth resource usage cost
$F$ | Loss function at the Non-RT-RIC
$T^{co}_m$ | Transmission time for the $m$th local model updates
$T^{cp}_m$ | $m$th local model computation processing time
$T_m$ | Learning time per global FL round at the $m$th local model
$T^{total}$ | Total FL time
$B^u$ | Uplink bandwidth for allocation
$B^d$ | Downlink bandwidth for allocation
$B$ | Total bandwidth assigned for FL tasks

Notation | Input Parameters
$g_i$ | Local model vector at the $i$th near-RT-RIC
$T$ | Total number of global rounds in FL
$\theta$ | Local accuracy
$\epsilon$ | Prefixed global accuracy
$\rho$ | Pareto parameter
$N$ | Set of selected local trainers
$K_\epsilon$ | Number of global rounds required to attain $\epsilon$ accuracy
$\eta$ | Learning rate
$\gamma$ | Momentum attenuation factor
$S^\omega_m$ | Size of update vectors

Notation | Decision Variables
$a^t$ | Trainers' selection decision vector
$b^t$ | Bandwidth fraction allocation vector
$\omega$ | Compression ratio
a learning algorithm without transferring its raw training data.
We call this aggregated FL model so generated by using the
local FL models as the global FL model. The uplink from near-
RT-RICs to the Non-RT-RIC is used to send the local FL model
update parameters and the downlink is used to broadcast the
global FL model in the global rounds of training. These
communication links are supported by the open interfaces of
O-RAN [22].
Table I summarizes the key notations used, and the complete
system model is described in the following subsections
to address the complexities of each component.
A. The Learning Model
In this model, each near-RT-RIC $i$ collects a dataset $D_i = [x_{i,1}, \ldots, x_{i,S_i}]$ of input data, where $S_i$ is the number of input samples collected by near-RT-RIC $i$ and each element $x_{is}$ is the FL model's input vector. Let $y_{is}$ be the output of $x_{is}$. For simplicity, we consider an FL model with a single output, which can readily be generalized to the case of multiple outputs. The output data vector for training the FL model of near-RT-RIC $i$ is $y_i = [y_{i,1}, \ldots, y_{i,S_i}]$. We assume that the data collected by each near-RT-RIC is different from that of the other near-RT-RICs, i.e., $x_i \neq x_j$ for $i \neq j$, $i, j \in \mathcal{M}$. So, each local trainer will train the model using a different dataset. This is in line with the real scenario, as each local near-RT-RIC collects operational data from the corresponding slice-specific users.

Fig. 2. System Model for FL update interaction
We define a vector $g_i$ to capture the parameters of the local FL model trained on $D_i$ and $y_i$. The vector $g_i$ determines the local FL model of each near-RT-RIC $i$. For example, in a linear regression prediction algorithm, the output of input $x_{is}$ is predicted as $x_{is}^T g_i$, and $g_i$ is the associated weight vector used to calculate the prediction accuracy. So, in each local model, the target is to find the optimal $g_i$ that maximizes the model accuracy. Hence, for a system of $M$ local trainers, i.e., near-RT-RICs, the objective of the global FL training process is to solve the following optimization problem:

$$\min_{g_1, \ldots, g_M} \frac{1}{S} \sum_{i=1}^{M} \sum_{s=1}^{S_i} f(g_i, x_{is}, y_{is}) \quad (1)$$

$$\text{s.t.} \quad g_1 = g_2 = \ldots = g_M = g, \quad \forall i \in \mathcal{M} \quad (1a)$$
where $S = \sum_{i=1}^{M} S_i$ is the total size of training data of all near-RT-RICs, $g$ is the global FL model generated by the Non-RT-RIC, and $f(g_i, x_{is}, y_{is})$ is a loss function that indicates the FL model's training accuracy. The exact expression of the loss function varies depending on the ML model being trained, but its objective remains the same. Constraint (1a) ensures that, once the FL model converges, all of the near-RT-RICs and the Non-RT-RIC share the same global FL model. The Non-RT-RIC transmits the parameters $g$ of the global FL model to its connected near-RT-RICs so that they train their local FL models with the updated weights. Then the near-RT-RICs transmit their local FL models to the Non-RT-RIC to update the global FL model. The update of each near-RT-RIC $i$'s local FL model $g_i$ thus depends on all near-RT-RICs' local FL models.
Generally, the iterative Gradient Descent (GD) method is used to minimize the local model's loss function and approximate its corresponding weights. In this method, the iterative process updates the weights as:

$$g_i(t) = g_i(t-1) - \eta \nabla f_i(g_i(t-1)), \quad (2)$$

where $t$ denotes the iteration count and $\eta$ is the learning rate. Once the optimal solution of (1) is obtained, using the updated weight vectors $g_i$ of all the local trainers, i.e., $i \in \{1, 2, \ldots, M\}$, the global loss function can be updated as:

$$f(g) = \frac{\sum_{i=1}^{M} |S_i| f_i(g_i)}{|S|}, \quad (3)$$

which results in the updated global loss function $f(g)$.
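The GD update (2) and the $|S_i|/|S|$ weighting of (3) can be sketched as follows. Note that (3) aggregates local loss values; the same size-proportional weighting is applied to model vectors in the global aggregation step of Section V. The toy weight vectors and sample counts are illustrative only.

```python
import numpy as np

def local_gd_step(g, grad_fn, eta=0.1):
    """One gradient descent update (Eq. 2): g(t) = g(t-1) - eta * grad f(g(t-1))."""
    return g - eta * grad_fn(g)

def global_aggregate(local_vectors, sample_counts):
    """Data-size weighted aggregation across local trainers:
    each near-RT-RIC contributes in proportion to |S_i| / |S|."""
    total = sum(sample_counts)
    return sum((s / total) * v for v, s in zip(local_vectors, sample_counts))

# Toy usage: three local trainers with different dataset sizes.
local_vecs = [np.array([1.0, 2.0]), np.array([3.0, 0.0]), np.array([0.0, 1.0])]
counts = [10, 30, 60]
g = global_aggregate(local_vecs, counts)   # size-weighted average
```

The trainer holding the most data (60 of 100 samples here) dominates the aggregate, which is exactly the behaviour the $|S_i|/|S|$ coefficients encode.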
To assess the performance of model training, the model
accuracy is calculated in each global round. We present the
defining notion of accuracy in the next subsection.
B. FL Model Accuracy
The target for each of the local model trainers is to attain a level of accuracy $\theta \in (0, 1)$, defined as:

$$||\nabla f_i^t(g_i^t)|| \leq \theta \, ||\nabla f_i^t(g_i^{t-1})||, \quad \forall i \in \{1, 2, \ldots, M\} \quad (4)$$

To attain this accuracy, a near-RT-RIC takes several iterations, so-called local iterations. Correspondingly, in the global model placed at the Non-RT-RIC, the target is to attain the optimal model weights reaching an $\epsilon$ level of global model accuracy, defined as:

$$|f(g^t) - f(g^*)| \leq \epsilon, \quad \forall t \geq X \quad (5)$$

(5) simply states that $g^*$ is the optimal model parameter, i.e., for every global round beyond $X$, the difference between the loss function values falls within the defined accuracy level no matter how long we keep iterating.
Now, the convergence of this iterative method is ensured under a set of conditional bounds on the loss function $f : \mathbb{R}^n \to \mathbb{R}$ s.t.
(i) $f(g)$ is convex.
(ii) $f(g)$ is $\rho$-Lipschitz, i.e., $||f(g) - f(g')|| \leq \rho \, ||g - g'||$, for any $g, g' \in \mathbb{R}^n$.
(iii) $f(g)$ is $\beta$-smooth, i.e., $||\nabla f(g) - \nabla f(g')|| \leq \beta \, ||g - g'||$, for any $g, g'$.
Under the above conditions, it is proven [23] that the number of global iterations required to attain a level of global accuracy $\epsilon$ and local accuracy $\theta$ can be upper bounded by:

$$K(\epsilon, \theta) = \frac{O(\log(1/\epsilon))}{1 - \theta} \quad (6)$$
We use this relationship among the local accuracy level,
global model accuracy, and the upper limit on the number
of required global rounds to model resource cost and the FL
model training time.
C. FL Resource Model
In order to transfer the model updates from the participating local trainers to the global aggregator and vice versa, the available communication resources must be assigned. On the other hand, the local trainers require compute resources in the form of processing capacity to train the individual models locally. While the compute resource at the Non-RT-RIC (hosted by a regional cloud) is not overwhelmed, the compute resources at the local nodes are scarce and provided by the shared edge clouds of O-RAN, so they need to be judiciously allocated and utilised. In effect, the allocation of these resources determines the learning time and communication rounds, and therefore impacts the performance of the FL model so trained. Hence, the two aspects of this model training must be jointly considered.
For the communication part, we consider wireless transmission under orthogonal frequency division multiple access (OFDMA) for local model uploading with a total bandwidth $B$. This connection is provided by the standardised A1 interface between the near-RT-RICs and the Non-RT-RIC [24]. Let $b_m^t \in [0, 1]$ be the bandwidth allocation ratio for trainer $m$ in round $t$; hence its allocated bandwidth is $b_m^t B$. Let $b^t = (b_1^t, \ldots, b_M^t)$. Bandwidth allocation must satisfy $\sum_{m \in \mathcal{M}} b_m^t = 1, \; \forall t$. In each global interaction, the O-RAN system has to decide which local training points, i.e., which near-RT-RICs, participate. This is because at each time interval only a limited number of clients can participate due to delay constraints originating from the control loops of O-RAN [22]. Therefore, the selected clients upload their local FL model updates depending on the wireless media. We define a binary variable $a_m^t \in \{0, 1\}$ to decide whether or not trainer $m$ is selected in round $t$, and $a^t = (a_1^t, \ldots, a_M^t)$ collects the overall trainers' selection decisions. A selected near-RT-RIC in round $t$, i.e., $a_m^t = 1$, consumes compute resources to train locally with the collected data. Clearly, if $a_m^t = 0$, namely trainer $m$ is not selected in round $t$, then no bandwidth is allocated to it, i.e., $b_m^t = 0$. On the other hand, if $a_m^t = 1$, then at least a minimum bandwidth $b_{min}$ must be allocated to trainer $m$, i.e., $b_m^t \geq b_{min}$. To make the problem feasible, we assume $b_{min} \leq \frac{1}{M}$. Therefore, the total resource cost for using communication bandwidth is:

$$R^{co} = \sum_{m=1}^{M} R_m^{co} = K_\epsilon \sum_{m=1}^{M} a_m^t b_m^t B p^{tr} \quad (7)$$

for $K_\epsilon$ global rounds, where $p^{tr}$ is the unit cost of bandwidth usage. For each near-RT-RIC $m$, let $R_m^{cp}$ denote its local training compute resource cost in every round, which depends on its computing host and dataset. To process the local dataset, each near-RT-RIC uses the CPU cycle frequency of the edge host. Let the CPU power of the $m$th host be $f_m$ cycles/s and the per unit time usage cost be $p^c$. Then the total compute resource cost is:

$$R^{cp} = \sum_{m=1}^{M} R_m^{cp} = \sum_{t=1}^{T} a_m^t \frac{D_m c_m}{f_m} p^c \quad (8)$$

where $c_m$ is the number of CPU cycles required for processing one bit of data.
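The two cost terms can be sketched directly from (7) and (8); `compute_cost` below evaluates a single round, so summing it over the rounds in which each trainer is selected yields (8). All numeric values (bandwidth, prices, dataset sizes) are arbitrary illustrative inputs.

```python
def comm_cost(K_eps, a, b, B, p_tr):
    """Total communication resource cost over K_eps global rounds (Eq. 7):
    only selected trainers (a_m = 1) are billed for their bandwidth b_m * B."""
    return K_eps * sum(a_m * b_m * B * p_tr for a_m, b_m in zip(a, b))

def compute_cost(a, D, c, f, p_c):
    """Per-round compute resource cost (one term of Eq. 8): processing D_m bits
    at c_m cycles/bit on a host with f_m cycles/s, priced p_c per unit time."""
    return sum(a_m * (D_m * c_m / f_m) * p_c
               for a_m, D_m, c_m, f_m in zip(a, D, c, f))

a = [1, 0, 1]            # trainers 1 and 3 selected in this round
b = [0.5, 0.0, 0.5]      # bandwidth fractions (unselected trainer gets 0)
R_co = comm_cost(K_eps=10, a=a, b=b, B=20e6, p_tr=1e-8)
R_cp = compute_cost(a, D=[1e6, 1e6, 2e6], c=[100, 100, 100],
                    f=[2e9, 2e9, 2e9], p_c=0.5)
```

The coupling between the two decision vectors is visible here: setting $a_m^t = 0$ zeroes both the bandwidth and the compute charge for trainer $m$.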
D. Latency Model
Since, the number of distributed local edge nodes is ex-
pected to be in large numbers, we consider a synchronous
mode of communication, in other words the tth round of
global aggregation starts only when all the near-RT-RICs have
finished sending their local update vectors to the Non-RT-
RIC. Therefore, before entering this communication round,
all the near-RT-RICs must finish its local ML processing.
In each of the global round, the FL tasks are spanned over
three operations: (i) computation, (ii) communication of local
updates to the Non-RT-RIC using uplink, and (iii) broad-
cast communication to all the involved near-RT-RICs using
downlink. Let the computation time required for one local
round for mth near-RT-RIC be Tcp
m, and there be Kllocal
iterations in each interval of the global communication. Then,
the computation time in one global iteration round is KlTcp
m.
Let the communication time required in transferring the local
update vectors from mth near-RT-RIC to the Non-RT-RIC be
Tco
min the uplink phase. Let Smbe the datasize of the update
vector of mth local trainer. Then, the learning time in one
global round for the mth local FL model trainer is:
Tm=Kl.T cp
m+Tco
m;m M (9)
Where Tco
mis calculated as:
Tco
m=Sm
bt
m.B ;m M (10)
In the downlink phase, we do not consider the delay because
it is negligible as compared to the uplink delay as a result of
high speed downlink communication. Since, Kϵis the total
number of global rounds to attain the global accuracy ϵas
established in (8). Therefore, the total learning time can be
modeled as:
Ttotal =Kϵ.Tmax =Kϵ.max{Tm;m M} (11)
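Equations (9)-(11) compose as below; the synchronous assumption is what makes the per-round time the maximum over trainers (the straggler). The numeric inputs are illustrative.

```python
def learning_time_per_round(K_l, T_cp, S, b, B):
    """Per-trainer learning time in one global round (Eqs. 9-10):
    K_l local computation rounds plus uplink transfer of the update vector."""
    return [K_l * T_cp_m + S_m / (b_m * B)
            for T_cp_m, S_m, b_m in zip(T_cp, S, b)]

def total_learning_time(K_eps, K_l, T_cp, S, b, B):
    """Total FL time (Eq. 11): synchronous rounds are paced by the slowest trainer."""
    return K_eps * max(learning_time_per_round(K_l, T_cp, S, b, B))

# Two trainers: the second computes more slowly and dominates every round.
T = total_learning_time(K_eps=10, K_l=5, T_cp=[0.1, 0.2],
                        S=[1e6, 1e6], b=[0.5, 0.5], B=20e6)
```

Because the slow trainer sets the pace of every round, shrinking its uplink payload (the compression of Section V) or dropping it from the selection both directly reduce $T^{total}$.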
Now, since the two conflicting goals of FL training must be jointly optimized, for the sake of modelling we treat learning time as another component of the total cost along with the resource usage costs. As such, we use a dimension coherent parameter $\rho$ to frame these types of costs into one expression as below:

$$cost(t) = \Big\{ (R^{co} + R^{cp}) + \rho \big( T_m^{co} + K_l \max\{T_m^{cp}\} \big) \Big\}$$
$$= \Big\{ K_\epsilon \sum_{t=1}^{T} \sum_{m=1}^{M} a_m^t \Big( b_m^t B p^{tr} + \frac{D_m c_m}{f_m} p^c \Big) + \rho \Big( \frac{S_m}{b_m^t B} + K_l \max\{T_m^{cp}\} \Big) \Big\} \quad (12)$$
E. Problem Formulation
Our goal is to jointly minimize the resource cost and the
learning time under the constraints of bandwidth resources.
This can be done by optimizing the number of local trainers
i.e. near-RT-RICs and bandwidth allocation as formulated in
the optimization model (13).
$$\mathcal{P}: \min_{a^t, b^t} \; cost(t) \quad (13)$$

subject to:

$$\sum_{m=1}^{M} a_m^t b_m^t B \leq B, \quad (13a)$$

$$\sum_{m=1}^{M} b_m^t = 1, \quad (13b)$$

$$b_{min} \leq b_m^t \leq 1; \quad \forall m \in \mathcal{M}, \quad (13c)$$

$$a_m^t \in \{0, 1\}, \quad (13d)$$

$$f_m \geq f_{min} \quad (13e)$$

The objective function (13) has two components balanced by a trade-off parameter $\rho$ because the two goals are conflicting: the total resource cost, $R^{total} = R^{cp} + R^{co}$, and the FL training time, $T^{total}$, as given by (11). Minimizing the resource cost naturally leads to a higher learning time and vice-versa. Constraint (13a) bounds the total bandwidth allocated for the FL tasks. Constraint (13b) presents the definition of $b_m^t$, i.e., the sum of bandwidth fractions must be 1. (13c) denotes the boundary of the fractional bandwidth allocation. (13d) represents the domain of the selection decision variable. (13e) ensures that the assigned computation resource is greater than the minimum CPU frequency.
V. MCORANFED
In our prior work, we proposed ORANFed [10], which uses an
ORAN-based, deadline-aware and slice-specific selection of local
trainers and then trains the FL model. This method, although
it provides a novel FL algorithm suited to the requirements
of ORAN, could not be extended to a large number of local
trainers as it does not mitigate the communication latency.
Therefore, we address this communication bottleneck
by using compression on the update vectors and further
decreasing the total number of global rounds by expediting
the convergence through momentum gradient descent. Figure
3 illustrates the proposed steps of MCORANFed.
A. Randomized Compression Operator
It has been shown in [25] that a randomized compression operator achieves a convergence rate of $O((1 + \omega)\frac{L}{\epsilon})$ as opposed to (6) in the no-compression scenario. Due to the faster convergence, it saves on communication time too.
An $\omega$-compression operator can be defined as a map $C : \mathbb{R}^d \to \mathbb{R}^d$ s.t. it satisfies the following conditional properties:
(i) $C(\cdot)$ is unbiased, i.e.,

$$E[C(x)|x] = x, \quad \forall x \in \mathbb{R}^d \quad (14)$$

and (ii) its variance is uniformly bounded, i.e., $\exists \, \omega \geq 0$ s.t.

$$E[||C(x) - x||^2] \leq \omega \, ||x||^2, \quad \forall x \in \mathbb{R}^d \quad (15)$$

Fig. 3. MCORANFed model with proposed steps

For the case of no compression ($\omega = 0$), $C(x) \equiv x$.
From this class of compression operators, a random sparsification operator can be defined as:

$$C(x) := \frac{d}{k} (\zeta_k \cdot x), \quad \forall x \in \mathbb{R}^d$$

where $\zeta_k \in \{0, 1\}^d$ is a uniformly random binary vector with $k$ non-zero elements. This specific operator is obtained from the above definition by substituting $\omega = \frac{d}{k} - 1$; $k = d$ implies zero compression.
B. Accelerated Gradient Descent
As can be inferred from (2), GD is a first order approximation method. To increase the convergence rate of this iterative approximation, we impose a second order term using Momentum Gradient Descent (MGD) [26]. MGD improves GD by adding a momentum term, which leads to the following update rule:

$$d_i(t) = \gamma \, d_i(t-1) + \nabla f_i(g_i(t-1)) \quad (16)$$

and

$$g_i(t) = g_i(t-1) - \eta \, d_i(t) \quad (17)$$

where $d_i(t)$ is the momentum term having the same dimension as $g_i(t)$, $\eta$ is the learning step size, $\gamma$ is the momentum attenuation factor, and $t$ is the iteration index. With iterations of (16) and (17), $f(g)$ can converge to the minimum loss function value faster than with GD. MGD converges in the range $-1 < \gamma < 1$ with a bounded $\eta$. It further shows accelerated convergence within the range $0 < \gamma < 1$ with small values of $\eta$ [26].
Local iterations start with initial values for $d_i(0)$ and $g_i(0)$. Then local updates are performed using (16) and (17) for each $t \in [k]$, where $[k]$ is the periodic global aggregation interval. Without loss of generality, we assume that $t = kT$, where $T$ is the number of global rounds, $t$ is the total number of iterations, and $k$ is the period of global model aggregation. Whenever $t$ is a multiple of $T$, global aggregation is performed using the following rules:

$$d(t) = \frac{\sum_{i=1}^{M} |S_i| \, d_i(t)}{|S|} \quad (18)$$

and

$$g(t) = \frac{\sum_{i=1}^{M} |S_i| \, g_i(t)}{|S|} \quad (19)$$
In order to ensure the accelerated convergence rate of MGD based FL over GD based FL training, we require an additional set of conditions on the loss functions f_i. Together with conditions (i) to (iii), two additional conditions have to be fulfilled by the loss functions.

(iv) For any g and i, the difference between the global gradient and the local gradient can be bounded by ||∇f_i(g) − ∇f(g)|| ≤ δ_i, and δ := Σ_{i∈S} |S_i| δ_i / S.

(v) The loss function f : R^d → R is µ-strongly convex, i.e. ∃ L ≥ µ ≥ 0 s.t. µ||g − g′|| ≤ ||∇f(g) − ∇f(g′)||.

These conditions form a class of well behaved functions which aligns with most ML methods' loss functions, and they are in line with recent works [14], [13], [16], [11] on the convergence analysis of FL. This leads to the aggregated model loss function f(g), defined over all the local loss functions as:

f(g) := Σ_{i=1}^{M} (|S_i|/S) f_i(g_i)  (20)
Then the iterative model compression, when applied on the model updates, works as:

g(t) = C(g(t−1)) − γ ∇f(C(g(t−1)))  (21)

This is Momentum Gradient Descent with Compressed Iterates (MGDCI) with compression parameter ω from the class of randomized compression operators [25]. Such an operator is proven to converge linearly to an approximate solution of size O(κω) in the neighbourhood of the optimal solution of (1), provided ω is bounded by the following threshold [27]:

ω ≤ µ(1 − 2γL) / (4 · (2γL² + 2/γ + Lµ))  (22)
C. Updated Optimization Problem
Let S_m^ω be the data size of the update vector of the m-th trainer under the ω-compression operator. Then S_m^ω is calculated as:

S_m^ω = k / (1 − ω),  (23)

where k is the sparsification factor. Therefore, the learning time in one global round of FL training for the m-th local FL model trainer becomes:

T_m^co = S_m^ω / (b_m^t · B),  ∀m ∈ M,  (24)
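A small numeric sketch of (23)-(24) follows; the units and the input values are illustrative, not taken from the paper.

```python
def compressed_size(k_bits: float, omega: float) -> float:
    """Update size under the omega-compression operator, eq. (23):
    S_m^omega = k / (1 - omega), with k the sparsification factor."""
    return k_bits / (1.0 - omega)

def upload_time(s_bits: float, b_frac: float, big_b: float) -> float:
    """Per-round upload time of trainer m, eq. (24):
    T_m^co = S_m^omega / (b_m^t * B)."""
    return s_bits / (b_frac * big_b)

s = compressed_size(7.0, 0.3)   # 7 / 0.7 = 10
t = upload_time(s, 0.5, 4.0)    # 10 / (0.5 * 4) = 5
```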
and the cost function updates as:

cost(t) = K_ε · Σ_{t=1}^{T} Σ_{m=1}^{M} a_m^t · (b_m^t · B · p^tr + (D_m · c_m / f_m) · p^c) + ρ · (S_m^ω / (b_m^t · B) + K_l · max{T_m^cp})  (25)
Hence, the updated joint resource allocation and FL model learning time optimization problem is:

P: min_{a^t, b^t} cost(t)  (26)

subject to:

Σ_{m=1}^{M} a_m^t · b_m^t · B ≤ B,  (26a)
Σ_{m=1}^{M} b_m^t = 1,  (26b)
b_min ≤ b_m^t ≤ 1,  ∀m ∈ M,  (26c)
a_m^t ∈ {0, 1},  (26d)
f_m ≥ f_min,  (26e)
4ω ≤ µ(1 − 2γL) / (2γL² + 2/γ + Lµ)  (26f)
Constraint (26f) bounds the feasible domain of ωto ensure
the convergence of MCORANFed, where µ,γ, and Lare the
parameters as defined in the set of conditions (i) to (v) on the
loss function and as described in (22).
VI. PROPOSED SOLUTION
The problem formulated in (26) is a non-convex optimization problem because the objective function and constraint (26a) are non-convex. Moreover, it is hard to transform this problem to obtain even a closed-form solution. Therefore, we adopt a decomposition approach. As shown in Figure 4, we lay out the solution scheme. We first divide the main optimization problem into two sub-problems, named Local Trainers' Selection (P1) and Resource Allocation (P2), respectively. Then we use the solution of the first sub-problem to reshape the second sub-problem. Finally, we solve the updated second sub-problem.
A. Local Trainers’ Selection
Among the three decision variables, athas been derived
through a deadline aware and slicing based local trainers’
Fig. 4. Schematic Diagram of the Proposed Solution (the main problem (26), minimizing bandwidth usage cost and FL time subject to optimal trainers' selection, bandwidth allocation, and the compression operator, is decomposed into sub-problem P1, solved by the deadline aware local trainers' selection of Algorithm 1, and sub-problem P2, the optimal resource allocation to the selected trainers; their solutions are then used to implement Algorithm 2)
selection algorithm as proposed in [10]. In this step, we solve
the following sub-problem:
P1: min_{a^t} K_ε · Σ_{t=1}^{T} Σ_{m=1}^{M} a_m^t · (b_m^t · B · p^tr + (D_m · c_m / f_m) · p^c)  (27)

subject to:

a_m^t ∈ {0, 1}.  (27a)
The objective of this trainers’ selection algorithm is to maxi-
mize the number of near-RT-RICs to participate in each global
round, and to allow the non-RT-RIC to aggregate all received
local model updates. This is based on the proposition that a
larger fraction of trainers in each round saves the total time
required for a global FL model to attain the desired model
accuracy [7]. As specified in the O-RAN Alliance whitepaper [22], the collected RAN operational data can be separated based on slice-user groups. Each near-RT-RIC is then fed with slice specific network data. The selection of a near-RT-RIC corresponding to a slice must be incorporated in each
iteration of gradient descent training of the model. However,
not all the local models can be accommodated in each iteration
because of the deadline constraint and limited computational
and bandwidth resources to be assigned for this learning task.
Moreover, due to the variation in traffic patterns across the different kinds of slicing services of O-RAN, the local FL model might encounter an inconsistency problem, which may degrade prediction accuracy. We take into account this
differentiation and propose a trainers’ selection algorithm that
respects the formation of slices in O-RAN while maintaining
a deadline awareness. In Algorithm 1, we categorize the set
of near-RT-RICs into three classes corresponding to eMBB,
uRLLC, and mMTC slicing services.
Let N (⊆ M) be the set of selected near-RT-RICs, t_round be the deadline for each global round, t_1 be the time elapsed in performing Algorithm 1, and t_agg be the time taken in aggregating the update parameters at the non-RT-RIC. Therefore, the mathematical optimization problem for the trainer selection becomes:

max_N |N|  (28)

s.t. t_1 + t_agg + (1/2)(t_n^{k−1} + α · t_n^k) ≤ t_round.  (28a)
(28) is a combinatorial optimization problem which makes it
non-trivial. So, we employ a greedy heuristic to solve this
problem as shown in Algorithm 1 [10]. We repeat the steps in
each global round until we get the desired accuracy. Here, constraint (28a) prevents the violation of the deadline by any near-RT-RIC in each global round. The deadline is assigned separately for each slice-user group, while the total deadline in each round is varied experimentally to observe its impact on the overall learning time of the global FL model.
Algorithm 1: Deadline aware and Slicing based Local Trainers' Selection
1: Input: M: set of all near-RT-RICs
2: Initialize N^u, N^e, N^m = Φ
3: for t_round^i defined for i ∈ {N^u, N^e, N^m} do
4:   while |N| > 0 do
5:     x ← argmin_{n∈N} (1/2) · (t_n^{k−1} + α · t_n^k (estimated))
6:     t ← t_1 + t_agg + t_n^k
7:     N ← N \ {x}
8:     if t < t_round^i then
9:       t ← t + t_n^k
10:    end if
11:  end while
12: end for
13: Output: N = N^u ∪ N^e ∪ N^m
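A hedged Python sketch of the greedy heuristic in Algorithm 1: within each slice class, trainers are picked in increasing order of the deadline-aware score from step 5 until adding the next trainer would exceed the per-slice deadline. The data layout (tuples of previous and estimated per-round times) is our own illustration, not the paper's interface.

```python
def select_trainers(slices, t1, t_agg, alpha=0.5):
    """slices: list of (deadline, trainers) pairs, one per slice class
    (e.g. eMBB, uRLLC, mMTC); each trainer is a (t_prev, t_est) pair of
    its previous and estimated per-round times. Returns selected trainers."""
    selected = []
    for deadline, trainers in slices:
        elapsed = t1 + t_agg
        # step 5: rank by 0.5 * (t_n^{k-1} + alpha * t_n^k(estimated))
        for t_prev, t_est in sorted(trainers,
                                    key=lambda p: 0.5 * (p[0] + alpha * p[1])):
            if elapsed + t_est < deadline:   # deadline check of step 8
                selected.append((t_prev, t_est))
                elapsed += t_est
    return selected

chosen = select_trainers(
    slices=[(10.0, [(1.0, 2.0), (1.0, 5.0), (1.0, 9.0)])], t1=1.0, t_agg=1.0)
```

Note that this sketch fills each slice class greedily; the bookkeeping of steps 6-9 in the listing is condensed into the single deadline check.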
B. Resource Allocation
From the trainers’ selection phase, we obtain at, i.e. a
binary valued vector of selected trainers in kth global round.
The next phase is to allocate the compute and bandwidth
resources to support the local training, parameters uploading,
model aggregation, and broadcast of updated model weights.
For this task, we solve the optimization problem (26) with the obtained value of a^t and the unknown variables b^t and ω. Still, (26) is a non-convex optimization problem, so an exact solution is intractable using traditional methods. We therefore employ an approximation approach with equivalent surrogate functions. With these changes and after substituting the defining expressions, the optimization problem (26) reduces to the following mathematical form:
P2: min_{b^t, ω} cost(t)  (29)

subject to: (26a), (26b), (26c), and (26f).
The number of local iterations (K_l) in each global round is determined experimentally, as required to attain the local accuracy value θ. We solve this problem using an iterative approximation method and implement the solution using the Ipopt solver [28]. Here, the constraints are linear with respect to each of the variables b^t and ω.
C. Federated Training in ORAN RICs (MCORANFed)
Using the solutions of trainer selection and resource allocation in (29), we train the FL model as described in Algorithm 2. In each global round, a subset of participating local trainers is selected, followed by resource allocation. This loop continues for K_ε iterations, which is the maximum number of global rounds required to attain the prefixed accuracy of the model.
Algorithm 2: Momentum Compressed ORANFed
Input: The dataset D_i (∀i ∈ M); the number of participants: M; the number of iterations: K_ε
Output: Final model parameter g
1: for k = 1, 2, 3, ..., K_ε do
2:   Non-RT-RIC uses Alg. 1 to get subset N;
3:   Allocate compute and bandwidth resources to the selected near-RT-RICs (N);
4:   Each near-RT-RIC trains using local data until it achieves an accuracy θ and obtains g_{i,k};
5:   for all local trainers in parallel m = 1, 2, 3, ... do
6:     Compress local gradients using (21) and obtain g_i for i ∈ N;
7:     Transmit g_i to the Non-RT-RIC;
8:   end for
9:   Decompress the received compressed weights and aggregate d(t) and g(t) according to (18) and (19);
10:  Calculate the global loss function f(g) through (20);
11:  Non-RT-RIC broadcasts the aggregated parameters;
12:  Non-RT-RIC calculates the global accuracy attained (5);
13: end for
14: Finally trained model is sent to the SMO for deployment
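The loop structure of Algorithm 2 can be exercised end to end on a toy quadratic objective: local momentum GD plays the role of step 4, random-k sparsification stands in for the compression of step 6, and dataset-size weighting implements the aggregation of step 9. All numeric settings here are illustrative, not the paper's experimental values.

```python
import numpy as np

def rand_k(x, k, rng):
    """Unbiased random-k sparsification, C(x) = (d/k) * (mask * x)."""
    d = x.size
    mask = np.zeros(d)
    mask[rng.choice(d, size=k, replace=False)] = 1.0
    return (d / k) * mask * x

def mcoranfed_toy(g_star, sizes, rounds=200, local_iters=5,
                  eta=0.05, gamma=0.5, k=2, seed=0):
    rng = np.random.default_rng(seed)
    g = np.zeros_like(g_star)
    w = np.asarray(sizes, dtype=float)
    w = w / w.sum()
    for _ in range(rounds):                       # global rounds (step 1)
        updates = []
        for _ in sizes:                           # each selected trainer
            gi, di = g.copy(), np.zeros_like(g)
            for _ in range(local_iters):          # local momentum GD (step 4)
                di = gamma * di + (gi - g_star)   # grad of 0.5*||g - g*||^2
                gi = gi - eta * di
            updates.append(rand_k(gi - g, k, rng))        # compress (step 6)
        g = g + sum(wi * u for wi, u in zip(w, updates))  # aggregate (step 9)
    return g

g_star = np.array([1.0, -2.0, 3.0])
g_out = mcoranfed_toy(g_star, sizes=[2, 3, 5])
```

Because the sparsification is unbiased and its variance shrinks with the update norm, the iterates contract toward g* despite the compression noise.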
D. Complexity Analysis
Theorem 1: Suppose a constant learning step size η_k = θ√(N/H), ∀k, is chosen, where θ > 0 is a constant satisfying θ√(N/H) ≤ 1/(2L). The convergence rate of Algorithm 2 is:

E[||z_T||²] ≤ (4(E[F(g_0)] − F*) + 8ζθLδ²) / ((ζ − 1)N H^{3/2}) + (4ω² + 1) · 8√2 L²µ²H / K_ε²,

where H is the total number of overall iterations (local and global), z_T is a random variable which samples a weight parameter g_t with probability 1/(NH), and ζ is a constant.
Proof: Please refer to Appendix
In Theorem 1, we follow a weaker notion of convergence
where the mean expected squared gradient norm is taken
to express the convergence rate because of the non-convex
settings [29].
Theorem 2: Suppose f(x) is convex with an L-Lipschitz continuous gradient and the compression operator C(·) satisfies (14) and (15). Let the learning step size be η = 1/((1 + ω)L); then the number of iterations performed by MCORANFed to find an ε-solution such that E[||z_T||²] ≤ ε is at most K_ε = O((1 + ω)L/ε), where ε is the right hand side of the inequality given in Theorem 1.
Proof: Please refer to Appendix
Momentum Compressed ORANFed consists of trainers' selection in Step 2 and assignment of resources in Step 3. Steps 4 to 13 train the FL model iteratively. So, the complexity of Algorithm 2 can be analysed in two parts. In the first part, (27) is solved using Algorithm 1, which has time complexity O(|M|), where |M| is the cardinality of the set M. In the second part, (29) is solved using the Interior Point Approximation method, which has complexity O(J_Ipopt) [28], where J_Ipopt is the total number of iterations within the Ipopt solver algorithm.
VII. NUMERICAL RES ULTS
In this section, we describe the FL training task, experi-
mental settings, baselines used to compare the results, and an
analysis. Values of the parameters used in our experiment are
listed in Table 2.
TABLE II
EXPERIMENTAL SETTINGS

Parameter | Description                  | Value
N         | Max. no. of local trainers   | 50
B         | BW budget for FL training    | 1 MHz
c_m       | Processing rate              | 15 cycles/bit
f_m       | Max. CPU power               | U(1, 1.6) GHz
p^tr      | Per unit transmission cost   | 1
p^c       | Per unit computation cost    | 1
D_m       | Dataset size                 | U(5, 10) MB
d         | Update vector size           | 20 bits
b_min     | Minimum BW allocation        | 0.1 MHz
k         | Sparsification factor        | 0.35
η         | Learning rate                | (0.1, 0.4)
ρ         | Pareto trade-off parameter   | (0, 1)
A. Federated Learning Task:
We train a supervised data traffic prediction model using a time series dataset [30]. This labelled data is accumulated over a year, and the goal is to predict the incoming traffic for the next hour using the ORAN RICs. This kind of task is generally required to address radio resource assignment problems. We assign the labels for 3 different types of network slices (i.e., uRLLC, eMBB, and mMTC), and uniformly distribute the entire dataset onto 50 local nodes. Then, using a Long Short-Term Memory (LSTM) neural network consisting of 4 layers, we train a regression model. The corresponding loss function is the mean square error (MSE) metric, which is used to measure prediction accuracy as explained with equation (1) in Section IV. Before moving to the federated settings, the model is trained using centralised learning to obtain the benchmark model accuracy, which is around 96.3%. Therefore, the value of ε (the global accuracy for FL) is set to 0.96.
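For concreteness, the next-hour prediction task can be framed as supervised learning by sliding a window over the hourly series; the window length and the synthetic series below are illustrative assumptions, not the dataset's actual format.

```python
import numpy as np

def make_windows(series: np.ndarray, window: int = 24):
    """Turn an hourly traffic series into (X, y) pairs: the previous
    `window` hours predict the next hour's traffic volume."""
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

series = np.arange(30.0)          # stand-in for a stretch of hourly traffic
X, y = make_windows(series)       # X feeds the LSTM, y is the regression target
```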
B. Network Settings:
We run this training on a machine with an Intel(R) Core(TM) i5-8265U CPU having a maximum processing power of 1.16 GHz. For simplicity, all near-RT-RIC nodes have the same data processing rate c_m. A uniform random number generator is used for assigning the CPU frequency (f_m) of each host in the range (1, 1.6). The value of the compression parameter (ω) lies in (0, 1), and the sparsification factor (k) is fixed at 0.35, as obtained from experimental data for lossless compression. Small values of the learning rate (η) in (0.1, 0.4) are taken, and the reported convergence rate is averaged over 10 simulation runs. The main settings are summarized in Table 2.
C. Baselines:
Since every FL variant targets a specific objective corre-
sponding to the mobile edge environment it is trained in, we
consider the following FL models as baselines for comparison:
1) Federated Averaging (FedAvg) [7]: The FedAvg algorithm with a fixed number of clients (N = 50) is considered as the benchmark. This serves as the basic FL model without compression and without any selection of local trainers in the global rounds. It provides a limiting comparison with the other FL variants and our proposed method.
2) Momentum Federated Learning (MFL) [26]: MFL is
used to compare the effect of momentum attenuation factor
on the accelerated convergence model. This method is an
improvement of the traditional gradient descent approach.
Serving as the model without compression operator, this FL
variant does not use any local trainers’ selection algorithm. In
this experiment, two variants of MFL are considered: one with
γ = 0.7 and another with γ = 0.9. These variants are used to show the differing behaviours of the momentum attenuation factor when applied to the iterative convergence of FL.
3) ORAN Federated Learning (ORANFed) [10]: In our prior work, we proposed the ORANFed algorithm, which uses a local trainers' selection to assign the resources for FL training. This baseline serves as an FL variant with selection but without a compression or acceleration operator.
D. Performance Evaluation
Our proposed model is evaluated under the following per-
formance metrics:
1) Convergence Rate: Fig. 5 shows the joint impact of acceleration and compression on the convergence of the model. While FedAvg requires a significantly higher number of global communication rounds to converge than the other three methods, it also performs poorly with respect to the attained model accuracy. ORANFed, thanks to its deadline aware local trainers' selection, requires a smaller number of global rounds and also attains a higher model accuracy. MCORANFed not only attains the highest model accuracy but at the
Fig. 5. Impact of Accelerated Convergence
Fig. 6. Accelerated Convergence w.r.t. Loss function
Fig. 7. Time elapsed to converge
Fig. 8. Impact of Compression Operator
Fig. 9. Objective Cost Comparison
same time it also requires a significantly lower number of global communication rounds to converge. Although MFL performs better than FedAvg and ORANFed, it converges more slowly than our proposed method. MCORANFed takes advantage of both momentum gradient descent and the trainers' selection algorithm. A similar impact can be seen in Fig. 6, where the loss function value is plotted against the number of global communication rounds for each FL variant.
2) Training Time: FL methods can also be evaluated according to the time they take to train the final model. The total time includes the data processing time at the local trainers, the model update transmission delay, and the global aggregation time in each global round. Fig. 7 shows that MCORANFed takes about 120 time units, whereas both variants of MFL take about 135 units. ORANFed and FedAvg require longer learning times: 150 and 175 units, respectively. A smaller number of communication rounds results in a lower transmission delay; therefore, MCORANFed also saves total training time.
3) Impact of Compression: In Fig. 8, the loss function value is plotted against the size of the compressed bits communicated in each global round. FedAvg, which uses no compression operator, transfers 20 bits in every round, whereas MFL and MCORANFed transmit variable compressed bit sizes. The figure shows that MCORANFed reaches a lower loss function value within a smaller number of global rounds while sending fewer bits than MFL. This result demonstrates the efficiency of the compression operator applied in our model.
4) Resource Usage Costs: The objective of the main optimization problem defined in this paper is to minimize the resource cost and the learning time at the same time. Fig. 9 compares our proposed method with the baselines in terms of the bandwidth usage cost of training the FL model. As shown, MCORANFed performs better than MFL, ORANFed, and FedAvg with respect to both aspects of the objective function. These results confirm the superior performance of our proposed method, and the differentiating factors that improve the FL method can also be clearly seen.
VIII. CONCLUSION
In this paper, we proposed a communication efficient federated learning method designed for ORAN systems. Our model takes into account the importance of faster convergence through momentum gradient descent and compressed communication, deadline aware local trainers' selection, and optimal resource allocation for training FL models. The proposed model outperforms state-of-the-art FL methods in terms of learning time and resource cost in the experimental settings. The simulation results show that an FL model trained with MCORANFed can save resource costs, which is substantial for smart radio resource allocation in ORAN. Therefore, it can be deployed in the control loops of ORAN for different use cases, such as guaranteeing slice QoS. In future work, we will investigate the location of the distributed data collection points to further improve MCORANFed in a highly complex yet realistic environment.
ACKNOWLEDGMENT
The authors thank VMware, NSERC, and Mitacs for fund-
ing this research under the ALLRP 577577-22 grant.
REFERENCES
[1] Ericsson, “Accelerating the adoption of ai in programmable
5g networks [whitepaper],” July 2021. [Online]. Available:
https://www.ericsson.com/4a3998/assets/local/reports-papers/white-
papers/08172020-accelerating-the-adoption-of-ai-in-programmable-5g-
networks-whitepaper.pdf
[2] O-RAN-WG1.OAM-Architecture-v02.00, "O-RAN Alliance," www.o-ran.org, accessed on October 11, 2021.
[3] Y. Sun, M. Peng, Y. Ren, L. Chen, L. Yu, and S. Suo, “Harmonizing
artificial intelligence with radio access networks: Advances, case study,
and open issues,” IEEE Network, vol. 35, no. 4, pp. 144–151, 2021.
[4] X. Costa-Perez, J. Swetina, T. Guo, R. Mahindra, and S. Rangarajan,
“Radio access network virtualization for future mobile carrier networks,”
IEEE Communications Magazine, vol. 51, no. 7, pp. 27–35, 2013.
[5] M. Maternia, S. E. El Ayoubi, M. Fallgren, P. Spapis, Y. Qi, D. Martín-Sacristán, Ó. Carrasco, M. Fresia, M. Payaró, M. Schubert et al., "5G PPP use cases and performance evaluation models," 5G-PPP, Tech. Rep., 2016.
[6] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y.-C. Liang, Q. Yang,
D. Niyato, and C. Miao, “Federated learning in mobile edge networks:
A comprehensive survey,” IEEE Communications Surveys Tutorials,
vol. 22, no. 3, pp. 2031–2063, 2020.
[7] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas,
“Communication-efficient learning of deep networks from decentralized
data,” in Artificial intelligence and statistics. PMLR, 2017, pp. 1273–
1282.
[8] Z. Zhao, C. Feng, H. H. Yang, and X. Luo, “Federated-learning-
enabled intelligent fog radio access networks: Fundamental theory, key
techniques, and future trends,” IEEE Wireless Communications, vol. 27,
no. 2, pp. 22–28, 2020.
[9] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y.-C. Liang, Q. Yang,
D. Niyato, and C. Miao, “Federated learning in mobile edge networks:
A comprehensive survey,” IEEE Communications Surveys Tutorials,
vol. 22, no. 3, pp. 2031–2063, 2020.
[10] A. K. Singh and K. Khoa Nguyen, “Joint selection of local trainers
and resource allocation for federated learning in open ran intelligent
controllers,” in 2022 IEEE Wireless Communications and Networking
Conference (WCNC), 2022, pp. 1874–1879.
[11] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, A joint
learning and communications framework for federated learning over
wireless networks,” IEEE Transactions on Wireless Communications,
vol. 20, no. 1, pp. 269–283, 2021.
[12] C. T. Dinh, N. H. Tran, M. N. H. Nguyen, C. S. Hong, W. Bao, A. Y.
Zomaya, and V. Gramoli, “Federated learning over wireless networks:
Convergence analysis and resource allocation, IEEE/ACM Transactions
on Networking, vol. 29, no. 1, pp. 398–409, 2021.
[13] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and
K. Chan, “Adaptive federated learning in resource constrained edge com-
puting systems,” IEEE Journal on Selected Areas in Communications,
vol. 37, no. 6, pp. 1205–1221, 2019.
[14] Z. Yang, M. Chen, W. Saad, C. S. Hong, and M. Shikh-Bahaei, “Energy
efficient federated learning over wireless communication networks,
IEEE Transactions on Wireless Communications, vol. 20, no. 3, pp.
1935–1949, 2021.
[15] X. Mo and J. Xu, “Energy-efficient federated edge learning with joint
communication and computation design,” Journal of Communications
and Information Networks, vol. 6, no. 2, pp. 110–124, 2021.
[16] H. H. Yang, Z. Liu, T. Q. S. Quek, and H. V. Poor, “Scheduling policies
for federated learning in wireless networks,” IEEE Transactions on
Communications, vol. 68, no. 1, pp. 317–333, 2020.
[17] J. Xu and H. Wang, “Client selection and bandwidth allocation in
wireless federated learning networks: A long-term perspective, IEEE
Transactions on Wireless Communications, vol. 20, no. 2, pp. 1188–
1200, 2021.
[18] M. M. Amiri, D. Gündüz, S. R. Kulkarni, and H. V. Poor, "Convergence of update aware device scheduling for federated learning at the wireless edge," IEEE Transactions on Wireless Communications, vol. 20, no. 6, pp. 3643–3658, 2021.
[19] W. Shi, S. Zhou, Z. Niu, M. Jiang, and L. Geng, “Joint device scheduling
and resource allocation for latency constrained wireless federated learn-
ing,” IEEE Transactions on Wireless Communications, vol. 20, no. 1,
pp. 453–467, 2021.
[20] S. Luo, X. Chen, Q. Wu, Z. Zhou, and S. Yu, “Hfel: Joint edge asso-
ciation and resource allocation for cost-efficient hierarchical federated
edge learning,” IEEE Transactions on Wireless Communications, vol. 19,
no. 10, pp. 6535–6548, 2020.
[21] "O-RAN Alliance whitepaper," June 2021. [Online]. Available: https://www.o-ran.org/s/O-RAN-Minimum-Viable-Plan-and-Acceleration-towards-Commercialization-White-Paper-29-June-2021.pdf
[22] O-RAN-WG6.CAD-V01.00.00; Cloud Architecture and Deployment
Scenarios for O-RAN Virtualized RAN specification, “O-RAN
Alliance.” [Online]. Available: www.o-ran.org
[23] J. Konečný, H. B. McMahan, D. Ramage, and P. Richtárik, "Federated optimization: Distributed machine learning for on-device intelligence," arXiv preprint arXiv:1610.02527, 2016.
[24] H. Lee, J. Cha, D. Kwon, M. Jeong, and I. Park, “Hosting ai/ml
workflows on o-ran ric platform, in 2020 IEEE Globecom Workshops
(GC Wkshps, 2020, pp. 1–6.
[25] Z. Li, D. Kovalev, X. Qian, and P. Richtarik, “Acceleration for com-
pressed gradient descent in distributed and federated optimization,” in
Proceedings of the 37th International Conference on Machine Learning,
ser. Proceedings of Machine Learning Research, H. D. III and A. Singh,
Eds., vol. 119. PMLR, 13–18 Jul 2020, pp. 5895–5904.
[26] W. Liu, L. Chen, Y. Chen, and W. Zhang, “Accelerating federated learn-
ing via momentum gradient descent,” IEEE Transactions on Parallel and
Distributed Systems, vol. 31, no. 8, pp. 1754–1766, 2020.
[27] A. Khaled and P. Richtárik, "Gradient descent with compressed iterates," arXiv preprint arXiv:1909.04716, 2019.
[28] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge
university press, 2004.
[29] L. Li, D. Shi, R. Hou, H. Li, M. Pan, and Z. Han, "To talk or to work: Flexible communication compression for energy efficient federated learning over heterogeneous mobile edge devices," in IEEE INFOCOM 2021 - IEEE Conference on Computer Communications, 2021, pp. 1–10.
[30] "5G User prediction data." [Online]. Available: https://www.kaggle.com/liukunxin/dataset
IX. APPENDIX
Theorem 1, Proof: Let z_T be a random variable that samples a weight parameter g_i^{(k)} with probability Pr[z_T = g_i^{(k)}] = 1/(NH). Taking δ = √((1/M) Σ_{m=1}^{M} δ_m²) and η_k = θ√(N/H), we get:

E[||z_T||²] = (1/(NH)) Σ_{k=0}^{H−1} Σ_{m=1}^{M} E[||∇f_m(g_m^{(k)})||²].  (30)
To get the difference between the loss function values of two consecutive iterations, we define the following sequences: (1) q^{(k)} = (1/M) Σ_{m=1}^{M} ∇f_m(g_m^{(k)}; D_m^{(k)}), (2) q̄^{(k)} = E_{D_m^{(k)}}[q^{(k)}] = (1/M) Σ_{m=1}^{M} ∇f_m(g_m^{(k)}), and the update rule of g_m^{(k)} as per (21). Now, using the smoothness condition (v) on the class of loss functions F : R^d → R, we get:

F(g^{(k+1)}) − F(g^{(k)})
≤ ⟨∇F(g^{(k)}), g^{(k+1)} − g^{(k)}⟩ + (L/2)||g^{(k+1)} − g^{(k)}||²
= −η⟨∇F(g^{(k)}), q^{(k)}⟩ + (η²L/2)||q^{(k)}||²
≤ −η⟨∇F(g^{(k)}), q^{(k)}⟩ + η²L||q^{(k)} − q̄^{(k)}||²
= −(η/M) Σ_{m=1}^{M} ⟨∇F(g^{(k)}), ∇f_m(g_m^{(k)}; D_m^{(k)})⟩ + η²L||q^{(k)} − q̄^{(k)}||² + η²L||(1/M) Σ_{m=1}^{M} ∇f_m(g_m^{(k)}; D_m^{(k)})||²  (31)
By taking the expectation w.r.t. the sampled dataset at each near-RT-RIC at time k, we get:

E[F(g^{(k+1)})] − F(g^{(k)})
≤ −(η/2)(||∇F(g^{(k)})||² + ||(1/M) Σ_{m=1}^{M} ∇f_m(g_m^{(k)})||²)
+ (η/2)||∇F(g^{(k)}) − (1/M) Σ_{m=1}^{M} ∇f_m(g_m^{(k)})||²
+ η²L||(1/M) Σ_{m=1}^{M} ∇f_m(g_m^{(k)})||² + η²σ²/(M b^{(k)})  (32)
Theorem 2, Proof: Suppose f(x) is convex with an L-Lipschitz continuous gradient and the compression operator C(·) satisfies (14) and (15). Let the learning step size be η = 1/((1 + ω)L); then the number of iterations performed by CGD to find an ε-solution such that E[f(x_k) − f(x*)] ≤ ε is at most k = O((1 + ω)L/ε).

Proof: According to the CGD update rule x_{k+1} = x_k − η g_k, we have

E[||x_{k+1} − x*||²] = E[||x_k − η g_k − x*||²]
= E[||x_k − ηC(∇f(x_k)) − x*||²]
= E[||x_k − x*||² − 2η⟨C(∇f(x_k)), x_k − x*⟩ + η²||C(∇f(x_k))||²]
= ||x_k − x*||² − 2η⟨∇f(x_k), x_k − x*⟩ + η² E[||C(∇f(x_k))||²]
= ||x_k − x*||² − 2η⟨∇f(x_k), x_k − x*⟩ + η²(||∇f(x_k)||² + E[||C(∇f(x_k)) − ∇f(x_k)||²])
≤ ||x_k − x*||² − 2η⟨∇f(x_k), x_k − x*⟩ + η²(1 + ω)||∇f(x_k)||²
≤ ||x_k − x*||² − 2η(f(x_k) − f(x*)) + η²(1 + ω)||∇f(x_k)||².  (33)

Using the convexity and L-smoothness of f, we have

E[f(x_{k+1}) − f(x*)]
≤ E[f(x_k) − f(x*) + ⟨∇f(x_k), x_{k+1} − x_k⟩ + (L/2)||x_{k+1} − x_k||²]
= E[f(x_k) − f(x*) + ⟨∇f(x_k), −ηC(∇f(x_k))⟩ + (Lη²/2)||C(∇f(x_k))||²]
= E[f(x_k) − f(x*) − η||∇f(x_k)||² + (Lη²/2)||C(∇f(x_k))||²]
≤ E[f(x_k) − f(x*) − η(1 − (1 + ω)Lη/2)||∇f(x_k)||²].  (35)

By multiplying (35) by η(1 + ω)/(1 − (1 + ω)Lη/2) and adding (33), we have

E[(η(1 + ω)/(1 − (1 + ω)Lη/2))(f(x_{k+1}) − f(x*)) + ||x_{k+1} − x*||² + 2η(f(x_k) − f(x*))]
≤ E[(η(1 + ω)/(1 − (1 + ω)Lη/2))(f(x_k) − f(x*)) + ||x_k − x*||²].  (36)

Telescoping (36) over the iterations yields

E[2η Σ_{i=0}^{k} (f(x_i) − f(x*))]
≤ (η(1 + ω)/(1 − (1 + ω)Lη/2))(f(x_0) − f(x*)) + ||x_0 − x*||²  (37)
≤ (η(1 + ω)/(1 − (1 + ω)Lη/2))(L/2)||x_0 − x*||² + ||x_0 − x*||² = 2||x_0 − x*||²/(2 − (1 + ω)Lη).  (38)

Using the L-smoothness and convexity of f at each x_i, i = 0, 1, ..., k, according to the above inferences, we must have:

E[2ηk(f(x_k) − f(x*))] ≤ 2||x_0 − x*||²/(2 − (1 + ω)Lη)  (39)

E[f(x_k) − f(x*)] ≤ ||x_0 − x*||²/((2 − (1 + ω)Lη)ηk) = (1 + ω)L||x_0 − x*||²/k,  (40)

with the choice of η = 1/((1 + ω)L) and k = (1 + ω)L||x_0 − x*||²/ε, so that E[f(x_k) − f(x*)] ≤ ε within O((1 + ω)L/ε) iterations.
ResearchGate has not been able to resolve any citations for this publication.
Conference Paper
Full-text available
Recently, Federated Learning (FL) has been applied in various research domains specially because of its privacy preserving and decentralized approach of model training. However, very few FL applications have been developed for the Radio Access Network (RAN) due to the lack of efficient deployment models. Open RAN (O-RAN) promises a high standard of meeting 5G services through its disaggregated, hierarchical, and distributed network function processing framework. Moreover, it comes with built-in intelligent controllers to instill smart decision making ability into RAN. In this paper, we propose a framework named O-RANFed to deploy and optimize FL tasks in O-RAN to provide 5G slicing services. To improve the performance of FL we formulate a joint mathematical optimization model of local learners selection and resource allocation to perform model training in every iteration. We solve this non-convex problem using the decomposition method. First, we propose a slicing based and deadline aware client selection algorithm. Then, we solve the reduced resource allocation problem by using successive convex approximation (SCA) method. Our simulation results show the proposed model outperforms the state-of-the-art FL methods such as FedAvg and FedProx in terms of convergence, learning time, and resource costs.
Article
Full-text available
In this paper, the problem of energy efficient transmission and computation resource allocation for federated learning (FL) over wireless communication networks is investigated. In the considered model, each user exploits limited local computational resources to train a local FL model with its collected data and, then, sends the trained FL model to a base station (BS) which aggregates the local FL model and broadcasts it back to all of the users. Since FL involves an exchange of a learning model between users and the BS, both computation and communication latencies are determined by the learning accuracy level. Meanwhile, due to the limited energy budget of the wireless users, both local computation energy and transmission energy must be considered during the FL process. This joint learning and communication problem is formulated as an optimization problem whose goal is to minimize the total energy consumption of the system under a latency constraint. To solve this problem, an iterative algorithm is proposed where, at every step, closed-form solutions for time allocation, bandwidth allocation, power control, computation frequency, and learning accuracy are derived. Since the iterative algorithm requires an initial feasible solution, we construct the completion time minimization problem and a bisection-based algorithm is proposed to obtain the optimal solution, which is a feasible solution to the original energy minimization problem. Numerical results show that the proposed algorithms can reduce up to 59.5% energy consumption compared to the conventional FL method.
Article
Full-text available
In this paper, the problem of training federated learning (FL) algorithms over a realistic wireless network is studied. In the considered model, wireless users execute an FL algorithm while training their local FL models using their own data and transmitting the trained local FL models to a base station (BS) that generates a global FL model and sends the model back to the users. Since all training parameters are transmitted over wireless links, the quality of training is affected by wireless factors such as packet errors and the availability of wireless resources. Meanwhile, due to the limited wireless bandwidth, the BS needs to select an appropriate subset of users to execute the FL algorithm so as to build a global FL model accurately. This joint learning, wireless resource allocation, and user selection problem is formulated as an optimization problem whose goal is to minimize an FL loss function that captures the performance of the FL algorithm. To seek the solution, a closed-form expression for the expected convergence rate of the FL algorithm is first derived to quantify the impact of wireless factors on FL. Then, based on the expected convergence rate of the FL algorithm, the optimal transmit power for each user is derived, under a given user selection and uplink resource block (RB) allocation scheme. Finally, the user selection and uplink RB allocation is optimized so as to minimize the FL loss function. Simulation results show that the proposed joint federated learning and communication framework can improve the identification accuracy by up to 1:4%, 3:5% and 4:1%, respectively, compared to: 1) An optimal user selection algorithm with random resource allocation, 2) a standard FL algorithm with random user selection and resource allocation, and 3) a wireless optimization algorithm that minimizes the sum packet error rates of all users while being agnostic to the FL parameters.
Article
This paper studies a federated edge learning system, in which an edge server coordinates a set of edge devices to train a shared machine learning (ML) model based on their locally distributed data samples. During the distributed training, we exploit the joint communication and computation design for improving the system energy efficiency, in which both the communication resource allocation for global ML-parameter aggregation and the computation resource allocation for locally updating ML-parameters are jointly optimized. In particular, we consider two transmission protocols for edge devices to upload ML-parameters to the edge server, based on non-orthogonal multiple access (NOMA) and time division multiple access (TDMA), respectively. Under both protocols, we minimize the total energy consumption at all edge devices over a particular finite training duration subject to a given training accuracy, by jointly optimizing the transmission power and rates at edge devices for uploading ML-parameters and their central processing unit (CPU) frequencies for local updates. We propose efficient algorithms to solve the formulated energy minimization problems using techniques from convex optimization. Numerical results show that, compared to other benchmark schemes, our proposed joint communication and computation design can significantly improve the energy efficiency of the federated edge learning system by properly balancing the energy tradeoff between communication and computation.
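The computation side of the tradeoff in works like this one is commonly captured by a CMOS energy model in which per-round energy grows quadratically with the CPU clock frequency while latency shrinks inversely, so raising the frequency trades energy for speed. The sketch below uses that standard model; the capacitance coefficient `kappa` and the workload numbers are illustrative assumptions, not values from the paper.

```python
def local_round_cost(cycles_per_sample, samples, freq_hz, kappa=1e-28):
    """Common CPU model in FL resource allocation:
    energy ~ kappa * cycles * f^2, latency ~ cycles / f."""
    total_cycles = cycles_per_sample * samples
    energy_j = kappa * total_cycles * freq_hz ** 2
    latency_s = total_cycles / freq_hz
    return energy_j, latency_s

# Illustrative numbers: 1e4 cycles/sample, 100 samples, 1 GHz clock.
energy, latency = local_round_cost(1e4, 100, 1e9)
# → energy 1e-4 J, latency 1e-3 s
```

Halving `freq_hz` here cuts energy by 4x but doubles latency, which is exactly the communication/computation balance the optimization exploits.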
Article
Driven by the demands of efficient network operation and high service availability, the convergence of artificial intelligence (AI) with radio access networks (RANs) has drawn considerable attention. However, current academic research mainly focuses on applying AI to optimizing RANs, with few discussions on architecture design. This article surveys the recent progress achieved by industry in integrating AI into RANs, and proposes an AI-driven fog RAN (F-RAN) paradigm. Specifically, as wrappers of AI-related functionalities, AI capsules are presented as new network functions in the F-RAN domain. With AI capsules, computation and cache resources at various fog nodes can be utilized to facilitate real-time AI-based F-RAN optimization and alleviate the transmission burden incurred by network data collection. At the edge cloud, a centralized AI brain for F-RANs is deployed, which incorporates a wireless-oriented auto-AI platform and a digital twin of the network environment for offline AI model training and evaluation. Through the interplay between the AI capsules and the AI brain, universal and endogenous intelligence can be fully realized within F-RANs, which in turn enhances system performance. Furthermore, we demonstrate the effectiveness of a scalable deep-reinforcement-learning-based method in minimizing energy consumption for a computation offloading use case. Finally, open issues are identified in terms of interface standardization, federated learning, and transfer learning.
Article
We study federated learning (FL) at the wireless edge, where power-limited devices with local datasets collaboratively train a joint model with the help of a remote parameter server (PS). We assume that the devices are connected to the PS through a bandwidth-limited shared wireless channel. At each iteration of FL, a subset of the devices is scheduled to transmit their local model updates to the PS over orthogonal channel resources, and each participating device must compress its model update to accommodate its link capacity. We design novel scheduling and resource allocation policies that decide on the subset of devices to transmit at each round, and how the resources should be allocated among the participating devices, based not only on their channel conditions but also on the significance of their local model updates. We then establish convergence of a wireless FL algorithm with device scheduling, where devices have limited capacity to convey their messages. The results of numerical experiments show that the proposed scheduling policy, based on both the channel conditions and the significance of the local model updates, provides better long-term performance than scheduling policies based on only one of the two metrics. Furthermore, we observe that when the data is independent and identically distributed (i.i.d.) across devices, selecting a single device at each round provides the best performance, while when the data distribution is non-i.i.d., scheduling multiple devices at each round improves the performance. This observation is verified by the convergence result, which shows that the number of scheduled devices should increase for a less diverse and more biased data distribution.
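A scheduling policy that mixes the two metrics above can be sketched as a simple scoring rule: normalize channel gain and update magnitude, combine them with a weight, and pick the top-k devices. This is a hypothetical illustration of the idea, not the paper's actual policy; the score function, `alpha`, and all inputs are assumptions.

```python
import numpy as np

def schedule_devices(channel_gains, update_norms, k, alpha=0.5):
    """Pick k device indices by a score mixing normalized channel
    quality and normalized update significance; alpha weights the two."""
    g = np.asarray(channel_gains, dtype=float)
    u = np.asarray(update_norms, dtype=float)
    score = alpha * g / g.max() + (1.0 - alpha) * u / u.max()
    # indices of the k largest scores, returned in ascending order
    return sorted(np.argsort(score)[-k:].tolist())

chosen = schedule_devices(
    channel_gains=[0.9, 0.1, 0.5], update_norms=[0.2, 1.0, 0.6], k=2)
# → [0, 2]
```

Setting `alpha=1.0` recovers a channel-only policy and `alpha=0.0` a significance-only policy, the two baselines the experiments compare against.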
Article
There is increasing interest in a fast-growing machine learning technique called Federated Learning (FL), in which model training is distributed over mobile user equipment (UEs), exploiting the UEs' local computation and training data. Despite its advantages, such as preserving data privacy, FL still faces challenges of heterogeneity across UEs' data and physical resources. To address these challenges, we first propose FEDL, an FL algorithm that can handle heterogeneous UE data with no assumptions beyond strongly convex and smooth loss functions. We provide a convergence rate characterizing the trade-off between the local computation rounds each UE uses to update its local model and the global communication rounds used to update the FL global model. We then employ FEDL in wireless networks as a resource allocation optimization problem that captures the trade-off between FEDL convergence wall-clock time and the energy consumption of UEs with heterogeneous computing and power resources. Even though the wireless resource allocation problem of FEDL is non-convex, we exploit the problem's structure to decompose it into three sub-problems and analyze their closed-form solutions as well as insights into problem design. Finally, we empirically evaluate the convergence of FEDL with PyTorch experiments, and provide extensive numerical results for the wireless resource allocation sub-problems. Experimental results show that FEDL outperforms the vanilla FedAvg algorithm in terms of convergence rate and test accuracy in various settings.
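The FedAvg baseline mentioned here aggregates local models by a sample-count-weighted average; this is the standard FedAvg global update, shown below on plain parameter vectors. The function name is illustrative, but the weighting rule is the one FedAvg uses.

```python
import numpy as np

def fedavg_aggregate(local_models, num_samples):
    """Standard FedAvg aggregation: the global model is the average of
    local model vectors weighted by each client's sample count."""
    weights = np.asarray(num_samples, dtype=float)
    weights /= weights.sum()
    stacked = np.stack(local_models)              # shape (clients, dim)
    return (weights[:, None] * stacked).sum(axis=0)

global_model = fedavg_aggregate(
    [np.array([1.0, 2.0]), np.array([3.0, 4.0])], num_samples=[1, 3])
# → array([2.5, 3.5]): the second client holds 3/4 of the data
```

FEDL modifies the local objective each UE solves between such aggregation rounds, which is what yields its convergence-rate trade-off, but the global combining step has this same weighted-average form.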
Article
This paper studies federated learning (FL) in a classic wireless network, where learning clients share a common wireless link to a coordinating server to perform federated model training using their local data. In such wireless federated learning networks (WFLNs), optimizing the learning performance depends crucially on how clients are selected and how bandwidth is allocated among the selected clients in every learning round, as both radio and client energy resources are limited. While existing works have made some attempts to allocate the limited wireless resources to optimize FL, they focus on the problem in individual learning rounds, overlooking an inherent yet critical feature of federated learning. This paper brings a new long-term perspective to resource allocation in WFLNs, recognizing that learning rounds are not only temporally interdependent but also vary in significance towards the final learning outcome. To this end, we first design data-driven experiments to show that different temporal client selection patterns lead to considerably different learning performance. With the obtained insights, we formulate a stochastic optimization problem for joint client selection and bandwidth allocation under long-term client energy constraints, and develop a new algorithm that uses only currently available wireless channel information yet achieves a long-term performance guarantee. Experiments show that our algorithm produces the desired temporal client selection pattern, adapts to changing network environments, and far outperforms benchmarks that ignore the long-term effects of FL.