Communication Efficient Compressed and
Accelerated Federated Learning in Open RAN
Intelligent Controllers
Amardip Kumar Singh, Kim Khoa Nguyen
École de Technologie Supérieure, Montreal, Canada
amardip-kumar.singh.1@ens.etsmtl.ca, kimkhoa.nguyen@etsmtl.ca
Abstract—The disaggregated and hierarchical architecture of
Open Radio Access Network (ORAN) with openness paradigm
promises to deliver the ever demanding 5G services. Meanwhile, it
also faces new challenges for the efficient deployment of Machine
Learning (ML) models. Although ORAN has been designed
with built-in Radio Intelligent Controllers (RIC) providing the
capability of training ML models, traditional centralized learning
methods may be no longer appropriate for the RICs due to pri-
vacy issues, computational burden, and communication overhead.
Recently, Federated Learning (FL), a powerful distributed ML
training, has emerged as a new solution for training models in
ORAN systems. 5G use cases such as meeting the network slice
Service Level Agreement (SLA) and Key Performance Indicator
(KPI) monitoring for smart radio resource management can
greatly benefit from FL models. However, training FL models
efficiently in ORAN system is a challenging issue due to the
stringent deadline of ORAN control loops, expensive compute
resources, and limited communication bandwidth. Moreover, to
deliver Grade of Service (GoS), the trained ML models must
converge with acceptable accuracy. In this paper, we propose a
second order gradient descent based FL training method named
MCORANFed that utilizes compression techniques to minimize
the communication cost and yet converges at a faster rate than
state-of-the-art FL variants. We formulate a joint optimization
problem to minimize the overall resource cost and learning time,
and then solve it by the decomposition method. Our experimental
results prove that MCORANFed is communication efficient with
respect to the ORAN system, and outperforms FL methods such as MFL,
FedAvg, and ORANFed in terms of costs and convergence rate.
Index Terms—Federated Learning, ORAN, 5G, Resource Allocation, RAN Intelligent Controller, Network Slicing, RIC
I. INTRODUCTION
Due to an exponential growth in high performing user
devices and a massive increase in connectivity, the demand
of a reliable and scalable data network is ever increasing.
To meet this demand and realise the vision of 5G and
beyond communications, leading telecommunication infras-
tructure providers have taken many initiatives and instituted fo-
cused research studies to redesign their network management
systems [1]. A major part of this ongoing study concentrates
on Radio Access Network (RAN) part. In this direction, O-
RAN Alliance has already published the specifications of
Open RAN (ORAN) which is particularly based on two key
concepts: (i) openness and (ii) in-built intelligence [2]. ORAN
frees the network from vendor lock-in and disaggregates
the processing layers from proprietary hardware to shared
open clouds. On the other hand, it also calls for a smarter
Service and Management Operation by installing intelligent
controllers that utilize the operational and performance data
to optimize the resources [3].
With its unique features, ORAN can support many use cases
such as Quality of Experience (QoE) optimization, ensuring
Service Level Agreements (SLA), providing slice
based ultra Reliable Low Latency Communications (uRLLC)
and enhanced Mobile Broadband (eMBB), and monitoring
Key Performance Indicators (KPI), which are intrinsic parts of
the 5G network [4]. All of these use cases require training of
fast and secure Machine Learning (ML) models. However, a
traditional centralized ML model suffers from two important
limitations: (i) transferring locally collected data to a central
server and (ii) model training time. Further, the privacy issues
of shared processing open clouds and security of raw data pose
additional challenges in the paradigm of ORAN architecture
[5]. For this reason, we turn to Federated Learning (FL),
which has recently emerged as an advanced ML tool [6].
The basic idea of FL is to train an ML model without
sharing the locally collected raw data and communicating only
its model updates to a distant server [7] to iteratively improve
the final model so trained. In this training, an iterative gradient
descent method is used to minimize the loss function, which
is the main objective of any ML model. FL, being in its
nascent stage, is still being studied and progressively improved
by researchers across domains [8]. However, there is
hardly any standard version of it that adheres to ORAN
architectural requirements, and this therefore remains an open
problem [9]. The main questions in this setup are:
(i) How to allocate the ORAN resources required to train such
a model?; (ii) How to meet the stringent deadlines of ORAN
control loops that automate the performance of the network
while training an FL model?; and (iii) How to guarantee the
convergence of this model with accuracy?
Depending on the environment in which it is implemented, most
FL variants improve their performance with respect to two
aspects: (i) selection of participating local trainers, and (ii)
reduction of the communication time between the global server
and the set of local trainers [6]. In case of ORAN architectural
environment, both aspects are important. Moreover, the control
loop deadline varies with respect to the performance of each
network slice. In this paper, we first propose an accelerated
iteration method that utilizes a random sparsification com-
pression operator to optimize the communication resources.
Then, we derive an updated FL algorithm by incorporating
slice specific and deadline aware selection of local trainers
[10].
Our main contributions can be summarized as follows:
1) A novel second order momentum gradient descent based
FL training method that utilizes compressed model
updates in each global round.
2) A mathematical proof of the rate of convergence of this
method.
3) A mathematical formulation of the joint optimization
problem to minimize communication resource usage cost
and total learning time under the constraints of limited
bandwidth, selected local trainers, and compression pa-
rameter.
4) A solution of this problem by decomposition method.
We first solve for the set of local trainers and then
allocate the resources to this set of selected trainers.
5) An updated FL algorithm, called MCORANFed (Momentum
Compressed ORAN FL), to train the model
through several iterations.
6) An analysis of results obtained from extensive experi-
ments that validates our proposed method compared to
the state-of-the-art FL variants.
To the best of our knowledge, this is the first work that
takes into account acceleration and compression techniques
to optimize ML training through federated settings in ORAN.
The remainder of this paper is organized as follows. In Section
II, a summary of the related works is presented. Then, we
briefly describe the concerned ORAN architecture for the FL
setup in section III. In Section IV, the considered system model
and the problem formulation are presented. In Section V, we
describe our proposed solution approach. In Section VI, we
present the numerical results to evaluate the performance of
our proposed solution. Finally, we conclude the paper and
discuss our future work in section VII.
II. RELATED WORKS AND CHALLENGES
[7] introduced the initial concept of federated averaging
to train deep learning models in a distributed setting. It also
proposed a method to select a fraction of clients instead
of all the local trainers. While it solves the challenges of
privacy by not sharing the raw data with the aggregator,
the communication efficiency is still questionable. Based on
this work, several variants of FL have then been proposed
to meet the requirements of edge cloud based systems. [11],
[12], [13], [14], and [15] dealt with the energy efficiency
optimization problems where a joint problem is solved under
limited compute resources. In all of these FL variants, the
local model is processed at the user end devices. Their main
objective was to minimize the energy cost incurred while
training the local model. However, in an ORAN system, the
main objective should be minimizing the learning time to
boost the performance of FL training itself. Nonetheless, it
laid the foundation of mathematically formulating a resource
allocation problem for FL in mobile edge architecture. The
control algorithm in [13] regulates the number of global and
local iterations required to attain a desired level of model
accuracy. The authors in [12] have considered a wireless com-
munication scenario to train an FL model under synchronous
mode of communication between the model update aggregator
and local devices. Latency constraints are considered in [14]
to minimize the computation and transmission energy of user
devices. Through approximated and closed form solutions to
the resource allocation problem, it assigns optimal processing
power for the local training.
In prior works [16], [17], [18], [19], and [20], the authors
have tackled another aspect of FL performance. They focus on
optimizing the number of local trainers in each global round of
FL. Further, deadline in each loop is also taken into account so
that the communication overhead can be minimized. However,
the FL training method has not been improved significantly in
these works. The simple first order iterative gradient descent
method is common in all these papers. [16] investigated the
role of client scheduling and user association problems of
FL training. There are multiple approaches to deal with this
aspect such as Round Robin Scheduling, Proportional Fair
Scheduling, and Random Scheduling. This is done considering
that the set of end users that train the local models is fixed.
However, in an ORAN system, the ML model is not trained by
end users. Rather, the edge node dataset keeps changing
and only the set of local training points remains fixed. That
adds the extra challenge of data heterogeneity as well.
[20] proposed a hierarchical FL algorithm in a single edge cloud
system. ORAN, in contrast, deals with multi-carrier networks,
leading to frequent changes in the available bandwidth. [17] takes into account
the channel states to optimize the client selection.
Unlike these prior works, we propose an FL training method
that suits the architectural requirements of ORAN, which is
quite different from a traditional mobile edge cloud system.
III. ML TRAINING IN ORAN
O-RAN Alliance has specified a promising framework with
intelligent blocks residing within the Service and Management
Orchestration (SMO) layer, termed as Radio Intelligent Con-
trollers (RIC) [21]. There are two types of RICs: near real
time intelligent controller (Near-RT RIC) and non real time
intelligent controller (Non-RT RIC). The newly developed
interfaces like E2 and A1 collect measurement and operational
datasets to monitor the performance of RAN in scenarios
where the expected latency is in range of milliseconds [22].
These datasets are stored in Radio Network Information Base
(RNIB) that are shared with multi-vendor edge clouds. O-RAN
has already listed various services as xApps in the Near-RT RIC
and Non-RT RIC blocks. For example, the QoS and QoE
optimization use case in which the Non-RT RIC checks long-
term patterns and provides policy guidance to the Near-RT
RIC. Supported with ML models and RAN lower level data
reported through E2 nodes, the Near-RT RIC can dynamically
predict the QoE or QoS performance of the ongoing applica-
tion and change the behaviour of RAN to maintain the user
experience [21].

Fig. 1. O-RAN Architecture for Radio Intelligent Controllers based FL

Another important use case determines RAN
slicing and SLA assurance and helps in resource allocation
optimization. Using the slice specific performance metrics
collected through E2 nodes, Non-RT RIC trains an ML model
and guides Near-RT RIC using A1 interface based policies
to meet the slice KPI targets. It can also guide the dynamic
Radio Resource Management (RRM) policies of the Near-RT RIC
through O1 configuration. SMO can also take into account
these ML models to predict the optimal hierarchy ratio for
controlling RAN elements [21]. As illustrated in Figure 1, the
dedicated databases that collect the slice specific performance
measurements are connected to the respective near-RT RICs onto
the RAN analytics and AI platform through E1 and O1
standardized interfaces. These dynamic data collection points
are then processed locally at the edge computing hosts. The
interaction through A1 interface then transfers the updates to
the Non-RT RIC placed in SMO at a regional cloud server.
We formulate the problem of local trainers’ selection, resource
allocation, total FL training time, and resource costs involved
in this framework.
IV. SYSTEM MODEL AND PROBLEM FORMULATION
Consider an O-RAN system with a single regional cloud
and a set $\mathcal{M}$ of $M$ distributed edge clouds cooperatively
training an FL model. As illustrated in Figure 2, in this FL
setup each edge cloud node hosts a near-RT-RIC entity that
processes its locally collected data to train a
local FL model. The Non-RT-RIC placed at the regional cloud
integrates the local FL models from participating edge clouds
and generates an aggregated FL model. This aggregated FL
model is further used to improve local FL models of each near-
RT-RIC enabling the local models to collaboratively perform
TABLE I
SUMMARY OF KEY NOTATIONS.

Notation | System Model Parameters
$\mathcal{M}$ | Set of distributed edges
$D_i$ | Dataset at the $i$th local trainer
$S$ | Total size of all training data
$R^{co}$ | Total communication resource cost
$R^{co}_m$ | Communication resource cost of the $m$th near-RT-RIC
$R^{cp}$ | Total compute resource cost
$c_m$ | CPU cycles required to process one bit of data
$f_m$ | Processing power of the $m$th host
$p^c$ | Compute resource usage cost
$R^{cp}_m$ | Compute resource cost of the $m$th near-RT-RIC
$p^{tr}$ | Bandwidth resource usage cost
$F$ | Loss function at the Non-RT-RIC
$T^{co}_m$ | Transmission time for the $m$th local model updates
$T^{cp}_m$ | $m$th local model computation processing time
$T_m$ | Learning time per global FL round at the $m$th local model
$T^{total}$ | Total FL time
$B^u$ | Uplink bandwidth for allocation
$B^d$ | Downlink bandwidth for allocation
$B$ | Total bandwidth assigned for FL tasks

Notation | Input Parameters
$g_i$ | Local model vector at the $i$th near-RT-RIC
$T$ | Total number of global rounds in FL
$\theta$ | Local accuracy
$\epsilon$ | Prefixed global accuracy
$\rho$ | Pareto parameter
$N$ | Set of selected local trainers
$K_\epsilon$ | Number of global rounds required to attain $\epsilon$ accuracy
$\eta$ | Learning rate
$\gamma$ | Momentum attenuation factor
$S^\omega_m$ | Size of update vectors

Notation | Decision Variables
$a^t$ | Trainers' selection decision vector
$b^t$ | Bandwidth fraction allocation vector
$\omega$ | Compression ratio
a learning algorithm without transferring its raw training data.
We call this aggregated FL model so generated by using the
local FL models as the global FL model. The uplink from near-
RT-RICs to the Non-RT-RIC is used to send the local FL model
update parameters and the downlink is used to broadcast the
global FL model in the global rounds of training. These
communication links are supported by the open interfaces of
O-RAN [22].
Table I summarizes the key notations used, and the complete
system model is described in the following subsections
to address the complexities of each component.
A. The Learning Model
In this model, each near-RT-RIC $i$ collects a dataset $D_i = [x_{i,1}, \ldots, x_{i,S_i}]$ of input data, where $S_i$ is the number of input samples collected by near-RT-RIC $i$ and each element $x_{is}$ is the FL model's input vector. Let $y_{is}$ be the output of $x_{is}$. For simplicity, we consider an FL model with a single output, which can readily be generalized to the case of multiple outputs. The output data vector for training the FL model of near-RT-RIC $i$ is $y_i = [y_{i,1}, \ldots, y_{i,S_i}]$. We assume that the data collected by each near-RT-RIC is different from that of the other near-RT-RICs, i.e., $x_i \neq x_j$ for $i \neq j$, $i, j \in \mathcal{M}$. So, each local trainer will train the model using a different dataset. This is in line with the real scenario, as each local near-RT-RIC collects operational data from the corresponding slice-specific users.

Fig. 2. System Model for FL update interaction
We define a vector $g_i$ to capture the parameters of the local FL model trained on $D_i$ and $y_i$. The vector $g_i$ determines the local FL model of each near-RT-RIC $i$. For example, in a linear regression prediction algorithm, the output of input $x_{is}$ is predicted as $x_{is}^T g_i$, and $g_i$ is the associated weight vector used to calculate the prediction accuracy. So, in each local model, the target is to find the optimal $g_i$ that maximizes the model accuracy. Hence, for a system of $M$ local trainers, i.e., near-RT-RICs, the objective of the global FL training process is to solve the following optimization problem:

$$\min_{g_1, \ldots, g_M} \frac{1}{S} \sum_{i=1}^{M} \sum_{s=1}^{S_i} f(g_i, x_{is}, y_{is}) \quad (1)$$

$$\text{s.t.} \quad g_1 = g_2 = \ldots = g_M = g, \quad \forall i \in \mathcal{M} \quad (1a)$$
where $S = \sum_{i=1}^{M} S_i$ is the total size of training data of all near-RT-RICs, $g$ is the global FL model generated by the Non-RT-RIC, and $f(g_i, x_{is}, y_{is})$ is a loss function that indicates the FL model's training accuracy. The exact expression of the loss function varies depending on the ML model being trained, but its objective remains the same. Constraint (1a) ensures that, once the FL model converges, all of the near-RT-RICs and the Non-RT-RIC share the same global FL model. The Non-RT-RIC transmits the parameters $g$ of the global FL model to its connected near-RT-RICs so that they train their local FL models with the updated weights. Then the near-RT-RICs transmit their local FL models to the Non-RT-RIC to update the global FL model. The update of each near-RT-RIC $i$'s local FL model $g_i$ thus depends on all near-RT-RICs' local FL models.
Generally, the iterative Gradient Descent (GD) method is used to minimize the local model's loss function and approximate its corresponding weights. In this method, the iterative process updates the weights as:

$$g_i(t) = g_i(t-1) - \eta \nabla f_i(g_i(t-1)), \quad (2)$$

where $t$ denotes the iteration count and $\eta$ is the learning rate. Once the optimal solution of (1) is obtained, using the updated weight vectors $g_i$ of all the local trainers, i.e., $i \in \{1, 2, \ldots, M\}$, the global loss function can be updated as:

$$f(g) = \frac{\sum_{i=1}^{M} |S_i| f_i(g_i)}{|S|}, \quad (3)$$

which results in the updated global loss function $f(g)$.
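The GD update (2) and the $|S_i|/|S|$ weighting of (3) can be sketched as follows. Note that (3) aggregates local loss values; the same size-proportional weighting is applied to model vectors in the global aggregation step of Section V. The toy weight vectors and sample counts are illustrative only.

```python
import numpy as np

def local_gd_step(g, grad_fn, eta=0.1):
    """One gradient descent update (Eq. 2): g(t) = g(t-1) - eta * grad f(g(t-1))."""
    return g - eta * grad_fn(g)

def global_aggregate(local_vectors, sample_counts):
    """Data-size weighted aggregation across local trainers:
    each near-RT-RIC contributes in proportion to |S_i| / |S|."""
    total = sum(sample_counts)
    return sum((s / total) * v for v, s in zip(local_vectors, sample_counts))

# Toy usage: three local trainers with different dataset sizes.
local_vecs = [np.array([1.0, 2.0]), np.array([3.0, 0.0]), np.array([0.0, 1.0])]
counts = [10, 30, 60]
g = global_aggregate(local_vecs, counts)   # size-weighted average
```

The trainer holding the most data (60 of 100 samples here) dominates the aggregate, which is exactly the behaviour the $|S_i|/|S|$ coefficients encode.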
To assess the performance of model training, the model
accuracy is calculated in each global round. We present the
defining notion of accuracy in the next subsection.
B. FL Model Accuracy
The target for each of the local model trainers is to attain a level of accuracy $\theta \in (0, 1)$, defined as:

$$||\nabla f_i^t(g_i^t)|| \leq \theta \, ||\nabla f_i^t(g_i^{t-1})||, \quad \forall i \in \{1, 2, \ldots, M\} \quad (4)$$

To attain this accuracy, a near-RT-RIC takes several iterations, so-called local iterations. Correspondingly, in the global model placed at the Non-RT-RIC, the target is to attain the optimal model weights reaching an $\epsilon$ level of global model accuracy, defined as:

$$|f(g^t) - f(g^*)| \leq \epsilon, \quad \forall t \geq X \quad (5)$$

(5) simply states that $g^*$ is the optimal model parameter, i.e., for every global round beyond $X$, the difference between the loss function values falls within the defined accuracy level no matter how long we keep iterating.
Now, the convergence of this iterative method is ensured under a set of conditional bounds on the loss function $f : \mathbb{R}^n \to \mathbb{R}$ s.t.
(i) $f(g)$ is convex.
(ii) $f(g)$ is $\rho$-Lipschitz, i.e., $||f(g) - f(g')|| \leq \rho \, ||g - g'||$, for any $g, g' \in \mathbb{R}^n$.
(iii) $f(g)$ is $\beta$-smooth, i.e., $||\nabla f(g) - \nabla f(g')|| \leq \beta \, ||g - g'||$, for any $g, g'$.
Under the above conditions, it is proven [23] that the number of global iterations required to attain a level of global accuracy $\epsilon$ and local accuracy $\theta$ can be upper bounded by:

$$K(\epsilon, \theta) = \frac{O(\log(1/\epsilon))}{1 - \theta} \quad (6)$$
We use this relationship among the local accuracy level,
global model accuracy, and the upper limit on the number
of required global rounds to model resource cost and the FL
model training time.
C. FL Resource Model
In order to transfer the model updates from the participating local trainers to the global aggregator and vice versa, the available communication resources must be assigned. On the other hand, the local trainers require compute resources in the form of processing capacity to train the individual models locally. While the compute resource at the Non-RT-RIC (hosted by a regional cloud) is not overwhelmed, the compute resources at the local nodes are scarce and provided by the shared edge clouds of O-RAN, so they need to be judiciously allocated and utilised. In effect, the allocation of these resources determines the learning time and communication rounds, and therefore impacts the performance of the FL model so trained. Hence, the two aspects of this model training must be jointly considered.
For the communication part, we consider wireless transmission under orthogonal frequency division multiple access (OFDMA) for local model uploading with a total bandwidth $B$. This connection is provided by the standardised A1 interface between the near-RT-RICs and the Non-RT-RIC [24]. Let $b_m^t \in [0, 1]$ be the bandwidth allocation ratio for trainer $m$ in round $t$; hence its allocated bandwidth is $b_m^t B$. Let $b^t = (b_1^t, \ldots, b_M^t)$. Bandwidth allocation must satisfy $\sum_{m \in \mathcal{M}} b_m^t = 1, \; \forall t$. In each global interaction, the O-RAN system has to decide which local training points, i.e., which near-RT-RICs, participate. This is because at each time interval only a limited number of clients can participate due to delay constraints originating from the control loops of O-RAN [22]. Therefore, the selected clients upload their local FL model updates depending on the wireless media. We define a binary variable $a_m^t \in \{0, 1\}$ to decide whether or not trainer $m$ is selected in round $t$, and $a^t = (a_1^t, \ldots, a_M^t)$ collects the overall trainers' selection decisions. A selected near-RT-RIC in round $t$, i.e., $a_m^t = 1$, consumes compute resources to train locally with the collected data. Clearly, if $a_m^t = 0$, namely trainer $m$ is not selected in round $t$, then no bandwidth is allocated to it, i.e., $b_m^t = 0$. On the other hand, if $a_m^t = 1$, then at least a minimum bandwidth $b_{min}$ must be allocated to trainer $m$, i.e., $b_m^t \geq b_{min}$. To make the problem feasible, we assume $b_{min} \leq \frac{1}{M}$. Therefore, the total resource cost for using communication bandwidth is:

$$R^{co} = \sum_{m=1}^{M} R_m^{co} = K_\epsilon \sum_{m=1}^{M} a_m^t b_m^t B p^{tr} \quad (7)$$

for $K_\epsilon$ global rounds, where $p^{tr}$ is the unit cost of bandwidth usage. For each near-RT-RIC $m$, let $R_m^{cp}$ denote its local training compute resource cost in every round, which depends on its computing host and dataset. To process the local dataset, each near-RT-RIC uses the CPU cycle frequency of the edge host. Let the CPU power of the $m$th host be $f_m$ cycles/s and the per unit time usage cost be $p^c$. Then the total compute resource cost is:

$$R^{cp} = \sum_{m=1}^{M} R_m^{cp} = \sum_{t=1}^{T} a_m^t \frac{D_m c_m}{f_m} p^c \quad (8)$$

where $c_m$ is the number of CPU cycles required for processing one bit of data.
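The two cost terms can be sketched directly from (7) and (8); `compute_cost` below evaluates a single round, so summing it over the rounds in which each trainer is selected yields (8). All numeric values (bandwidth, prices, dataset sizes) are arbitrary illustrative inputs.

```python
def comm_cost(K_eps, a, b, B, p_tr):
    """Total communication resource cost over K_eps global rounds (Eq. 7):
    only selected trainers (a_m = 1) are billed for their bandwidth b_m * B."""
    return K_eps * sum(a_m * b_m * B * p_tr for a_m, b_m in zip(a, b))

def compute_cost(a, D, c, f, p_c):
    """Per-round compute resource cost (one term of Eq. 8): processing D_m bits
    at c_m cycles/bit on a host with f_m cycles/s, priced p_c per unit time."""
    return sum(a_m * (D_m * c_m / f_m) * p_c
               for a_m, D_m, c_m, f_m in zip(a, D, c, f))

a = [1, 0, 1]            # trainers 1 and 3 selected in this round
b = [0.5, 0.0, 0.5]      # bandwidth fractions (unselected trainer gets 0)
R_co = comm_cost(K_eps=10, a=a, b=b, B=20e6, p_tr=1e-8)
R_cp = compute_cost(a, D=[1e6, 1e6, 2e6], c=[100, 100, 100],
                    f=[2e9, 2e9, 2e9], p_c=0.5)
```

The coupling between the two decision vectors is visible here: setting $a_m^t = 0$ zeroes both the bandwidth and the compute charge for trainer $m$.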
D. Latency Model
Since, the number of distributed local edge nodes is ex-
pected to be in large numbers, we consider a synchronous
mode of communication, in other words the tth round of
global aggregation starts only when all the near-RT-RICs have
finished sending their local update vectors to the Non-RT-
RIC. Therefore, before entering this communication round,
all the near-RT-RICs must finish its local ML processing.
In each of the global round, the FL tasks are spanned over
three operations: (i) computation, (ii) communication of local
updates to the Non-RT-RIC using uplink, and (iii) broad-
cast communication to all the involved near-RT-RICs using
downlink. Let the computation time required for one local
round for mth near-RT-RIC be Tcp
m, and there be Kllocal
iterations in each interval of the global communication. Then,
the computation time in one global iteration round is KlTcp
m.
Let the communication time required in transferring the local
update vectors from mth near-RT-RIC to the Non-RT-RIC be
Tco
min the uplink phase. Let Smbe the datasize of the update
vector of mth local trainer. Then, the learning time in one
global round for the mth local FL model trainer is:
Tm=Kl.T cp
m+Tco
m;m M (9)
Where Tco
mis calculated as:
Tco
m=Sm
bt
m.B ;m M (10)
In the downlink phase, we do not consider the delay because
it is negligible as compared to the uplink delay as a result of
high speed downlink communication. Since, Kϵis the total
number of global rounds to attain the global accuracy ϵas
established in (8). Therefore, the total learning time can be
modeled as:
Ttotal =Kϵ.Tmax =Kϵ.max{Tm;m M} (11)
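Equations (9)-(11) compose as below; the synchronous assumption is what makes the per-round time the maximum over trainers (the straggler). The numeric inputs are illustrative.

```python
def learning_time_per_round(K_l, T_cp, S, b, B):
    """Per-trainer learning time in one global round (Eqs. 9-10):
    K_l local computation rounds plus uplink transfer of the update vector."""
    return [K_l * T_cp_m + S_m / (b_m * B)
            for T_cp_m, S_m, b_m in zip(T_cp, S, b)]

def total_learning_time(K_eps, K_l, T_cp, S, b, B):
    """Total FL time (Eq. 11): synchronous rounds are paced by the slowest trainer."""
    return K_eps * max(learning_time_per_round(K_l, T_cp, S, b, B))

# Two trainers: the second computes more slowly and dominates every round.
T = total_learning_time(K_eps=10, K_l=5, T_cp=[0.1, 0.2],
                        S=[1e6, 1e6], b=[0.5, 0.5], B=20e6)
```

Because the slow trainer sets the pace of every round, shrinking its uplink payload (the compression of Section V) or dropping it from the selection both directly reduce $T^{total}$.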
Now, since the two conflicting goals of FL training must be jointly optimized, for the sake of modelling we treat learning time as another component of the total cost along with the resource usage costs. As such, we use a dimension coherent parameter $\rho$ to frame these types of costs into one expression as below:

$$cost(t) = \Big\{ (R^{co} + R^{cp}) + \rho \big( T_m^{co} + K_l \max\{T_m^{cp}\} \big) \Big\}$$
$$= \Big\{ K_\epsilon \sum_{t=1}^{T} \sum_{m=1}^{M} a_m^t \Big( b_m^t B p^{tr} + \frac{D_m c_m}{f_m} p^c \Big) + \rho \Big( \frac{S_m}{b_m^t B} + K_l \max\{T_m^{cp}\} \Big) \Big\} \quad (12)$$
E. Problem Formulation
Our goal is to jointly minimize the resource cost and the
learning time under the constraints of bandwidth resources.
This can be done by optimizing the number of local trainers
i.e. near-RT-RICs and bandwidth allocation as formulated in
the optimization model (13).
$$\mathcal{P}: \min_{a^t, b^t} \; cost(t) \quad (13)$$

subject to:

$$\sum_{m=1}^{M} a_m^t b_m^t B \leq B, \quad (13a)$$

$$\sum_{m=1}^{M} b_m^t = 1, \quad (13b)$$

$$b_{min} \leq b_m^t \leq 1; \quad \forall m \in \mathcal{M}, \quad (13c)$$

$$a_m^t \in \{0, 1\}, \quad (13d)$$

$$f_m \geq f_{min} \quad (13e)$$

The objective function (13) has two components balanced by a trade-off parameter $\rho$ because the two goals are conflicting: the total resource cost, $R^{total} = R^{cp} + R^{co}$, and the FL training time, $T^{total}$, as given by (11). Minimizing the resource cost naturally leads to a higher learning time and vice-versa. Constraint (13a) bounds the total bandwidth allocated for the FL tasks. Constraint (13b) presents the definition of $b_m^t$, i.e., the sum of bandwidth fractions must be 1. (13c) denotes the boundary of the fractional bandwidth allocation. (13d) represents the domain of the selection decision variable. (13e) ensures that the assigned computation resource is greater than the minimum CPU frequency.
V. MCORANFED
In our prior work, we proposed ORANFed [10], which uses an
ORAN-based, deadline-aware and slice-specific selection of local
trainers and then trains the FL model. This method, although
it provides a novel FL algorithm suited to the requirements
of ORAN, could not be extended to a large number of local
trainers as it does not mitigate the communication latency.
Therefore, we address this communication bottleneck
by using compression on the update vectors and further
decreasing the total number of global rounds by expediting
the convergence through momentum gradient descent. Figure
3 illustrates the proposed steps of MCORANFed.
A. Randomized Compression Operator
It has been shown in [25] that a randomized compression operator achieves a convergence rate of $O((1 + \omega)\frac{L}{\epsilon})$ as opposed to (6) in the no-compression scenario. Due to the faster convergence, it saves on communication time too.
An $\omega$-compression operator can be defined as a map $C : \mathbb{R}^d \to \mathbb{R}^d$ s.t. it satisfies the following conditional properties:
(i) $C(\cdot)$ is unbiased, i.e.,

$$E[C(x)|x] = x, \quad \forall x \in \mathbb{R}^d \quad (14)$$

and (ii) its variance is uniformly bounded, i.e., $\exists \, \omega \geq 0$ s.t.

$$E[||C(x) - x||^2] \leq \omega \, ||x||^2, \quad \forall x \in \mathbb{R}^d \quad (15)$$

Fig. 3. MCORANFed model with proposed steps

For the case of no compression ($\omega = 0$), $C(x) \equiv x$.
From this class of compression operators, a random sparsification operator can be defined as:

$$C(x) := \frac{d}{k} (\zeta_k \cdot x), \quad \forall x \in \mathbb{R}^d$$

where $\zeta_k \in \{0, 1\}^d$ is a uniformly random binary vector with $k$ non-zero elements. This specific operator is obtained from the above definition by substituting $\omega = \frac{d}{k} - 1$; $k = d$ implies zero compression.
B. Accelerated Gradient Descent
As can be inferred from (2), GD is a first order approximation method. To increase the convergence rate of this iterative approximation, we impose a second order term using Momentum Gradient Descent (MGD) [26]. MGD improves GD by adding a momentum term, which leads to the following update rule:

$$d_i(t) = \gamma \, d_i(t-1) + \nabla f_i(g_i(t-1)) \quad (16)$$

and

$$g_i(t) = g_i(t-1) - \eta \, d_i(t) \quad (17)$$

where $d_i(t)$ is the momentum term having the same dimension as $g_i(t)$, $\eta$ is the learning step size, $\gamma$ is the momentum attenuation factor, and $t$ is the iteration index. With iterations of (16) and (17), $f(g)$ can converge to the minimum loss function value faster than with GD. MGD converges in the range $-1 < \gamma < 1$ with a bounded $\eta$. It further shows accelerated convergence within the range $0 < \gamma < 1$ with small values of $\eta$ [26].
Local iterations start with initial values for $d_i(0)$ and $g_i(0)$. Then local updates are performed using (16) and (17) for each $t \in [k]$, where $[k]$ is the periodic global aggregation interval. Without loss of generality, we assume that $t = kT$, where $T$ is the number of global rounds, $t$ is the total number of iterations, and $k$ is the period of global model aggregation. Whenever $t$ is a multiple of $T$, global aggregation is performed using the following rules:

$$d(t) = \frac{\sum_{i=1}^{M} |S_i| \, d_i(t)}{|S|} \quad (18)$$

and

$$g(t) = \frac{\sum_{i=1}^{M} |S_i| \, g_i(t)}{|S|} \quad (19)$$
In order to ensure the accelerated convergence rate of MGD based FL over GD based FL training, we require an additional set of conditions on the loss functions f_i. Together with conditions (i) to (iii), two additional conditions have to be fulfilled by the loss functions.

(iv) For any g and i, the difference between the global gradient and the local gradient can be bounded by ||∇f_i(g) − ∇f(g)|| ≤ δ_i, and δ := Σ_{i∈S} |S_i| δ_i / S.

(v) The loss function f : R^d → R is µ-strongly convex, i.e. ∃ L ≥ µ ≥ 0 s.t. µ||g − g′|| ≤ ||∇f(g) − ∇f(g′)||.

These conditions form a class of well behaved functions which aligns with most ML methods' loss functions, and they are in line with recent works [14], [13], [16], [11] on the convergence analysis of FL. This leads to the aggregated model loss function f(g), defined over all the local loss functions as:

f(g) := Σ_{i=1}^{M} (|S_i|/S) f_i(g_i)  (20)
Then the iterative model compression, when applied on the model updates, works as:

g(t) = C(g(t−1)) − γ ∇f(C(g(t−1)))  (21)

This is Momentum Gradient Descent with Compressed Iterates (MGDCI) with compression parameter ω from the class of randomized compression operators [25]. Such an operator is proven to converge linearly to an approximate solution of size O(κω) in the neighbourhood of the optimal solution of (1), provided ω is bounded by the following threshold [27]:

ω ≤ µ(1 − 2γL) / (4 · (2γL² + 2/γ + Lµ))  (22)
C. Updated Optimization Problem
Let S_m^ω be the data size of the update vector of the m-th trainer under the ω-compression operator. Then S_m^ω is calculated as:

S_m^ω = k / (1 − ω),  (23)

where k is the sparsification factor. Therefore, the learning time in one global round of FL training for the m-th local FL model trainer becomes:

T_m^co = S_m^ω / (b_m^t · B),  ∀m ∈ M,  (24)
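A small numeric sketch of (23)-(24) follows; the units and the input values are illustrative, not taken from the paper.

```python
def compressed_size(k_bits: float, omega: float) -> float:
    """Update size under the omega-compression operator, eq. (23):
    S_m^omega = k / (1 - omega), with k the sparsification factor."""
    return k_bits / (1.0 - omega)

def upload_time(s_bits: float, b_frac: float, big_b: float) -> float:
    """Per-round upload time of trainer m, eq. (24):
    T_m^co = S_m^omega / (b_m^t * B)."""
    return s_bits / (b_frac * big_b)

s = compressed_size(7.0, 0.3)   # 7 / 0.7 = 10
t = upload_time(s, 0.5, 4.0)    # 10 / (0.5 * 4) = 5
```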
and the cost function updates as:

cost(t) = K_ε · Σ_{t=1}^{T} Σ_{m=1}^{M} a_m^t · (b_m^t · B · p^tr + (D_m · c_m / f_m) · p^c) + ρ · (S_m^ω / (b_m^t · B) + K_l · max{T_m^cp})  (25)
Hence, the updated joint resource allocation and FL model learning time optimization problem is:

P: min_{a^t, b^t} cost(t)  (26)

subject to:

Σ_{m=1}^{M} a_m^t · b_m^t · B ≤ B,  (26a)
Σ_{m=1}^{M} b_m^t = 1,  (26b)
b_min ≤ b_m^t ≤ 1,  ∀m ∈ M,  (26c)
a_m^t ∈ {0, 1},  (26d)
f_m ≥ f_min,  (26e)
4ω ≤ µ(1 − 2γL) / (2γL² + 2/γ + Lµ)  (26f)
Constraint (26f) bounds the feasible domain of ωto ensure
the convergence of MCORANFed, where µ,γ, and Lare the
parameters as defined in the set of conditions (i) to (v) on the
loss function and as described in (22).
VI. PROPOSED SOLUTION
The problem formulated in (26) is a non-convex optimization problem because the objective function and constraint (26a) are non-convex. Moreover, it is hard to transform this problem to obtain even a closed-form solution. Therefore, we adopt a decomposition approach. As shown in Figure 4, we lay out the solution scheme. We first divide the main optimization problem into two sub-problems, named Local Trainers' Selection (P1) and Resource Allocation (P2), respectively. Then we use the solution of the first sub-problem to reshape the second sub-problem. Finally, we solve the updated second sub-problem.
A. Local Trainers’ Selection
Among the three decision variables, athas been derived
through a deadline aware and slicing based local trainers’
Fig. 4. Schematic Diagram of the Proposed Solution (the main problem (26), minimizing bandwidth usage cost and FL time subject to optimal trainers' selection, bandwidth allocation, and the compression operator, is decomposed into sub-problem P1, solved by the deadline aware local trainers' selection of Algorithm 1, and sub-problem P2, the optimal resource allocation to the selected trainers; their solutions are then used to implement Algorithm 2)
selection algorithm as proposed in [10]. In this step, we solve
the following sub-problem:
P1: min_{a^t} K_ε · Σ_{t=1}^{T} Σ_{m=1}^{M} a_m^t · (b_m^t · B · p^tr + (D_m · c_m / f_m) · p^c)  (27)

subject to:

a_m^t ∈ {0, 1}.  (27a)
The objective of this trainers’ selection algorithm is to maxi-
mize the number of near-RT-RICs to participate in each global
round, and to allow the non-RT-RIC to aggregate all received
local model updates. This is based on the proposition that a
larger fraction of trainers in each round saves the total time
required for a global FL model to attain the desired model
accuracy [7]. As specified in the O-RAN Alliance whitepaper [22], the collected RAN operational data can be separated based on slice-user groups. Each near-RT-RIC is then fed with slice specific network data. The selection of a near-RT-RIC corresponding to a slice must be incorporated in each
iteration of gradient descent training of the model. However,
not all the local models can be accommodated in each iteration
because of the deadline constraint and limited computational
and bandwidth resources to be assigned for this learning task.
Moreover, due to the variation in traffic patterns across the different kinds of slicing services of O-RAN, the local FL model might encounter an inconsistency problem, which may degrade prediction accuracy. We take into account this
differentiation and propose a trainers’ selection algorithm that
respects the formation of slices in O-RAN while maintaining
a deadline awareness. In Algorithm 1, we categorize the set
of near-RT-RICs into three classes corresponding to eMBB,
uRLLC, and mMTC slicing services.
Let N (⊆ M) be the set of selected near-RT-RICs, t_round be the deadline for each global round, t_1 be the time elapsed in performing Algorithm 1, and t_agg be the time taken in aggregating the update parameters at the non-RT-RIC. Therefore, the mathematical optimization problem for the trainer selection becomes:

max_N |N|  (28)

s.t. t_1 + t_agg + (1/2)(t_n^{k−1} + α · t_n^k) ≤ t_round.  (28a)
(28) is a combinatorial optimization problem which makes it
non-trivial. So, we employ a greedy heuristic to solve this
problem as shown in Algorithm 1 [10]. We repeat the steps in
each global round until we get the desired accuracy. Here, constraint (28a) prevents the violation of the deadline by any near-RT-RIC in each global round. The deadline is assigned separately for each slice-user group, while the total deadline in each round is varied experimentally to observe its impact on the overall learning time of the global FL model.
Algorithm 1: Deadline aware and Slicing based Local Trainers' Selection
1: Input: M: set of all near-RT-RICs
2: Initialize N^u, N^e, N^m = Φ
3: for t_round^i defined for i ∈ {N^u, N^e, N^m} do
4:   while |N| > 0 do
5:     x ← argmin_{n∈N} (1/2) · (t_n^{k−1} + α · t_n^k (estimated))
6:     t ← t_1 + t_agg + t_n^k
7:     N ← N \ {x}
8:     if t < t_round^i then
9:       t ← t + t_n^k
10:    end if
11:  end while
12: end for
13: Output: N = N^u ∪ N^e ∪ N^m
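A hedged Python sketch of the greedy heuristic in Algorithm 1: within each slice class, trainers are picked in increasing order of the deadline-aware score from step 5 until adding the next trainer would exceed the per-slice deadline. The data layout (tuples of previous and estimated per-round times) is our own illustration, not the paper's interface.

```python
def select_trainers(slices, t1, t_agg, alpha=0.5):
    """slices: list of (deadline, trainers) pairs, one per slice class
    (e.g. eMBB, uRLLC, mMTC); each trainer is a (t_prev, t_est) pair of
    its previous and estimated per-round times. Returns selected trainers."""
    selected = []
    for deadline, trainers in slices:
        elapsed = t1 + t_agg
        # step 5: rank by 0.5 * (t_n^{k-1} + alpha * t_n^k(estimated))
        for t_prev, t_est in sorted(trainers,
                                    key=lambda p: 0.5 * (p[0] + alpha * p[1])):
            if elapsed + t_est < deadline:   # deadline check of step 8
                selected.append((t_prev, t_est))
                elapsed += t_est
    return selected

chosen = select_trainers(
    slices=[(10.0, [(1.0, 2.0), (1.0, 5.0), (1.0, 9.0)])], t1=1.0, t_agg=1.0)
```

Note that this sketch fills each slice class greedily; the bookkeeping of steps 6-9 in the listing is condensed into the single deadline check.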
B. Resource Allocation
From the trainers’ selection phase, we obtain at, i.e. a
binary valued vector of selected trainers in kth global round.
The next phase is to allocate the compute and bandwidth
resources to support the local training, parameters uploading,
model aggregation, and broadcast of updated model weights.
For this task, we solve the optimization problem (26) with the obtained value of a^t and the unknown variables b^t and ω. Still, (26) is a non-convex optimization problem, so an exact solution is intractable using traditional methods. We therefore employ an approximation approach with equivalent surrogate functions. With these changes and after substituting the defining expressions, the optimization problem (26) reduces to the following mathematical form:
P2: min_{b^t, ω} cost(t)  (29)

subject to: (26a), (26b), (26c), and (26f).
The number of local iterations (K_l) in each global round is determined experimentally, as required to attain the local accuracy value θ. We solve this problem using an iterative approximation method and implement the solution using the Ipopt solver [28]. Here, the constraints are linear with respect to each of the variables b^t and ω.
C. Federated Training in ORAN RICs (MCORANFed)
Using the solutions of trainer selection and resource allocation in (29), we train the FL model as described in Algorithm 2. In each global round, a subset of participating local trainers is selected, followed by resource allocation. This loop continues for K_ε iterations, which is the maximum number of global rounds required to attain the prefixed accuracy of the model.
Algorithm 2: Momentum Compressed ORANFed
Input: The dataset D_i (∀i ∈ M); the number of participants: M; the number of iterations: K_ε
Output: Final model parameter g
1: for k = 1, 2, 3, ..., K_ε do
2:   Non-RT-RIC uses Alg. 1 to get subset N;
3:   Allocate compute and bandwidth resources to the selected near-RT-RICs (N);
4:   Each near-RT-RIC trains using local data until it achieves an accuracy θ and obtains g_{i,k};
5:   for all local trainers in parallel m = 1, 2, 3, ... do
6:     Compress local gradients using (21) and obtain g_i for i ∈ N;
7:     Transmit g_i to the Non-RT-RIC;
8:   end for
9:   Decompress the received compressed weights and aggregate d(t) and g(t) according to (18) and (19);
10:  Calculate the global loss function f(g) through (20);
11:  Non-RT-RIC broadcasts the aggregated parameters;
12:  Non-RT-RIC calculates the global accuracy attained (5);
13: end for
14: Finally trained model is sent to the SMO for deployment
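The loop structure of Algorithm 2 can be exercised end to end on a toy quadratic objective: local momentum GD plays the role of step 4, random-k sparsification stands in for the compression of step 6, and dataset-size weighting implements the aggregation of step 9. All numeric settings here are illustrative, not the paper's experimental values.

```python
import numpy as np

def rand_k(x, k, rng):
    """Unbiased random-k sparsification, C(x) = (d/k) * (mask * x)."""
    d = x.size
    mask = np.zeros(d)
    mask[rng.choice(d, size=k, replace=False)] = 1.0
    return (d / k) * mask * x

def mcoranfed_toy(g_star, sizes, rounds=200, local_iters=5,
                  eta=0.05, gamma=0.5, k=2, seed=0):
    rng = np.random.default_rng(seed)
    g = np.zeros_like(g_star)
    w = np.asarray(sizes, dtype=float)
    w = w / w.sum()
    for _ in range(rounds):                       # global rounds (step 1)
        updates = []
        for _ in sizes:                           # each selected trainer
            gi, di = g.copy(), np.zeros_like(g)
            for _ in range(local_iters):          # local momentum GD (step 4)
                di = gamma * di + (gi - g_star)   # grad of 0.5*||g - g*||^2
                gi = gi - eta * di
            updates.append(rand_k(gi - g, k, rng))        # compress (step 6)
        g = g + sum(wi * u for wi, u in zip(w, updates))  # aggregate (step 9)
    return g

g_star = np.array([1.0, -2.0, 3.0])
g_out = mcoranfed_toy(g_star, sizes=[2, 3, 5])
```

Because the sparsification is unbiased and its variance shrinks with the update norm, the iterates contract toward g* despite the compression noise.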
D. Complexity Analysis
Theorem 1: Suppose a constant learning step size η_k = θ√(N/H), ∀k, is chosen, where θ > 0 is a constant satisfying θ√(N/H) ≤ 1/(2L). The convergence rate of Algorithm 2 is:

E[||z_T||²] ≤ (4(E[F(g_0)] − F*) + 8ζθLδ²) / ((ζ − 1)N H^{3/2}) + (4ω² + 1) · 8√2 L²µ²H / K_ε²,

where H is the total number of overall iterations (local and global), z_T is a random variable which samples a weight parameter g_t with probability 1/(NH), and ζ is a constant.
Proof: Please refer to Appendix
In Theorem 1, we follow a weaker notion of convergence
where the mean expected squared gradient norm is taken
to express the convergence rate because of the non-convex
settings [29].
Theorem 2: Suppose f(x) is convex with an L-Lipschitz continuous gradient and the compression operator C(·) satisfies (14) and (15). Let the learning step size be η = 1/((1 + ω)L); then the number of iterations performed by MCORANFed to find an ε-solution such that E[||z_T||²] ≤ ε is at most K_ε = O((1 + ω)L/ε), where ε is the right hand side of the inequality given in Theorem 1.
Proof: Please refer to Appendix
Momentum Compressed ORANFed consists of trainers' selection in Step 2 and assignment of resources in Step 3. Steps 4 to 13 train the FL model iteratively. So, the complexity of Algorithm 2 can be analysed in two parts. In the first part, (27) is solved using Algorithm 1, which has time complexity O(|M|), where |M| is the cardinality of the set M. In the second part, (29) is solved using the Interior Point Approximation method, which has complexity O(J_Ipopt) [28], where J_Ipopt is the total number of iterations within the Ipopt solver algorithm.
VII. NUMERICAL RES ULTS
In this section, we describe the FL training task, experi-
mental settings, baselines used to compare the results, and an
analysis. Values of the parameters used in our experiment are
listed in Table 2.
TABLE II
EXPERIMENTAL SETTINGS

Parameter | Description                  | Value
N         | Max. no. of local trainers   | 50
B         | BW budget for FL training    | 1 MHz
c_m       | Processing rate              | 15 cycles/bit
f_m       | Max. CPU power               | U(1, 1.6) GHz
p^tr      | Per unit transmission cost   | 1
p^c       | Per unit computation cost    | 1
D_m       | Dataset size                 | U(5, 10) MB
d         | Update vector size           | 20 bits
b_min     | Minimum BW allocation        | 0.1 MHz
k         | Sparsification factor        | 0.35
η         | Learning rate                | (0.1, 0.4)
ρ         | Pareto trade-off parameter   | (0, 1)
A. Federated Learning Task:
We train a supervised data traffic prediction model using a time series dataset [30]. This labelled data is accumulated over a year, and the goal is to predict the incoming traffic for the next hour using the ORAN RICs. This kind of task is generally required to address radio resource assignment problems. We assign the labels for 3 different types of network slices (i.e., uRLLC, eMBB, and mMTC), and uniformly distribute the entire dataset onto 50 local nodes. Then, using a Long Short-Term Memory (LSTM) neural network consisting of 4 layers, we train a regression model. The corresponding loss function is the mean square error (MSE) metric, which is used to measure prediction accuracy as explained with equation (1) in Section IV. Before moving to the federated settings, the model is trained using centralised learning to obtain the benchmark model accuracy, which is around 96.3%. Therefore, the value of ε (the global accuracy for FL) is set to 0.96.
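For concreteness, the next-hour prediction task can be framed as supervised learning by sliding a window over the hourly series; the window length and the synthetic series below are illustrative assumptions, not the dataset's actual format.

```python
import numpy as np

def make_windows(series: np.ndarray, window: int = 24):
    """Turn an hourly traffic series into (X, y) pairs: the previous
    `window` hours predict the next hour's traffic volume."""
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

series = np.arange(30.0)          # stand-in for a stretch of hourly traffic
X, y = make_windows(series)       # X feeds the LSTM, y is the regression target
```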
B. Network Settings:
We run this training on a machine with an Intel(R) Core(TM) i5-8265U CPU having a maximum processing power of 1.16 GHz. For simplicity, all near-RT-RIC nodes have the same data processing rate c_m. A uniform random number generator is used for assigning the CPU frequency (f_m) of each host in the range (1, 1.6). The value of the compression parameter (ω) lies in (0, 1), and the sparsification factor (k) is fixed at 0.35, as obtained from experimental data for lossless compression. Small values of the learning rate (η) in (0.1, 0.4) are taken, and the reported convergence rate is averaged over 10 simulation runs. The main settings are summarized in Table 2.
C. Baselines:
Since every FL variant targets a specific objective corre-
sponding to the mobile edge environment it is trained in, we
consider the following FL models as baselines for comparison:
1) Federated Averaging (FedAvg) [7]: The FedAvg algorithm with a fixed number of clients (N = 50) is considered as the benchmark. This serves as the basic FL model without compression and without any selection of local trainers in the global rounds. It provides a limiting comparison with the other FL variants and our proposed method.
2) Momentum Federated Learning (MFL) [26]: MFL is
used to compare the effect of momentum attenuation factor
on the accelerated convergence model. This method is an
improvement of the traditional gradient descent approach.
Serving as the model without compression operator, this FL
variant does not use any local trainers’ selection algorithm. In
this experiment, two variants of MFL are considered: one with
γ = 0.7 and another with γ = 0.9. These variants are used to show the differing behaviours of the momentum attenuation factor when applied to the iterative convergence of FL.
3) ORAN Federated Learning (ORANFed) [10]: In our prior work, we proposed the ORANFed algorithm, which uses a local trainers' selection to assign the resources for FL training. This baseline serves as an FL variant with selection but without a compression or acceleration operator.
D. Performance Evaluation
Our proposed model is evaluated under the following per-
formance metrics:
1) Convergence Rate: Fig. 5 shows the joint impact of acceleration and compression on the convergence of the model. While FedAvg requires a significantly higher number of global communication rounds to converge than the other three methods, it also performs poorly with respect to the attained model accuracy. ORANFed, thanks to its deadline aware local trainers' selection, requires a smaller number of global rounds and also attains a higher model accuracy. MCORANFed not only attains the highest model accuracy but at the
Fig. 5. Impact of Accelerated Convergence
Fig. 6. Accelerated Convergence w.r.t. Loss function
Fig. 7. Time elapsed to converge
Fig. 8. Impact of Compression Operator
Fig. 9. Objective Cost Comparison
same time it also requires a significantly lower number of global communication rounds to converge. Although MFL performs better than FedAvg and ORANFed, it converges more slowly than our proposed method. MCORANFed takes advantage of both momentum gradient descent and the trainers' selection algorithm. A similar impact can be seen in Fig. 6, where the loss function value is plotted against the number of global communication rounds for each FL variant.
2) Training Time: FL methods can also be evaluated according to the time they take to train the final model. The total time includes the data processing time at the local trainers, the model update transmission delay, and the global aggregation time in each global round. Fig. 7 shows that MCORANFed takes about 120 time units, whereas both variants of MFL take about 135 units. ORANFed and FedAvg require longer learning times: 150 and 175 units, respectively. A smaller number of communication rounds results in a lower transmission delay; therefore, MCORANFed also saves total training time.
3) Impact of Compression: In Fig. 8, the loss function value is plotted against the size of the compressed bits communicated in each global round. FedAvg, which uses no compression operator, transfers 20 bits in every round, whereas MFL and MCORANFed transmit variable compressed bit sizes. The figure shows that MCORANFed reaches a lower loss function value within a smaller number of global rounds while sending fewer bits than MFL. This result demonstrates the efficiency of the compression operator applied in our model.
4) Resource Usage Costs: The objective of the main optimization problem defined in this paper is to minimize the resource cost and the learning time at the same time. Fig. 9 compares our proposed method with the baselines in terms of the bandwidth usage cost of training the FL model. As shown, MCORANFed performs better than MFL, ORANFed, and FedAvg with respect to both aspects of the objective function. These results confirm the superior performance of our proposed method, and the differentiating factors that improve the FL method can also be clearly seen.
VIII. CONCLUSION
In this paper, we proposed a communication efficient federated learning method designed for ORAN systems. Our model takes into account the importance of faster convergence through momentum gradient descent and compressed communication, deadline aware local trainers' selection, and optimal resource allocation for training FL models. The proposed model outperforms state-of-the-art FL methods in terms of learning time and resource cost in the experimental settings. The simulation results show that an FL model trained with MCORANFed can save resource costs, which is substantial for smart radio resource allocation in ORAN. Therefore, it can be deployed in the control loops of ORAN for different use cases, such as guaranteeing slice QoS. In future work, we will investigate the location of the distributed data collection points to further improve MCORANFed in a highly complex yet realistic environment.
ACKNOWLEDGMENT
The authors thank VMware, NSERC, and Mitacs for fund-
ing this research under the ALLRP 577577-22 grant.
REFERENCES
[1] Ericsson, “Accelerating the adoption of ai in programmable
5g networks [whitepaper],” July 2021. [Online]. Available:
https://www.ericsson.com/4a3998/assets/local/reports-papers/white-
papers/08172020-accelerating-the-adoption-of-ai-in-programmable-5g-
networks-whitepaper.pdf
[2] O-RAN-WG1.OAM-Architecture-v02.00, "O-RAN Alliance," www.o-ran.org, accessed on October 11, 2021.
[3] Y. Sun, M. Peng, Y. Ren, L. Chen, L. Yu, and S. Suo, “Harmonizing
artificial intelligence with radio access networks: Advances, case study,
and open issues,” IEEE Network, vol. 35, no. 4, pp. 144–151, 2021.
[4] X. Costa-Perez, J. Swetina, T. Guo, R. Mahindra, and S. Rangarajan,
“Radio access network virtualization for future mobile carrier networks,”
IEEE Communications Magazine, vol. 51, no. 7, pp. 27–35, 2013.
[5] M. Maternia, S. E. El Ayoubi, M. Fallgren, P. Spapis, Y. Qi, D. Martín-Sacristán, Ó. Carrasco, M. Fresia, M. Payaró, M. Schubert et al., "5G PPP use cases and performance evaluation models," 5G-PPP, Tech. Rep., 2016.
[6] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y.-C. Liang, Q. Yang,
D. Niyato, and C. Miao, “Federated learning in mobile edge networks:
A comprehensive survey,” IEEE Communications Surveys Tutorials,
vol. 22, no. 3, pp. 2031–2063, 2020.
[7] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas,
“Communication-efficient learning of deep networks from decentralized
data,” in Artificial intelligence and statistics. PMLR, 2017, pp. 1273–
1282.
[8] Z. Zhao, C. Feng, H. H. Yang, and X. Luo, “Federated-learning-
enabled intelligent fog radio access networks: Fundamental theory, key
techniques, and future trends,” IEEE Wireless Communications, vol. 27,
no. 2, pp. 22–28, 2020.
[9] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y.-C. Liang, Q. Yang,
D. Niyato, and C. Miao, “Federated learning in mobile edge networks:
A comprehensive survey,” IEEE Communications Surveys Tutorials,
vol. 22, no. 3, pp. 2031–2063, 2020.
[10] A. K. Singh and K. Khoa Nguyen, “Joint selection of local trainers
and resource allocation for federated learning in open ran intelligent
controllers,” in 2022 IEEE Wireless Communications and Networking
Conference (WCNC), 2022, pp. 1874–1879.
[11] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, A joint
learning and communications framework for federated learning over
wireless networks,” IEEE Transactions on Wireless Communications,
vol. 20, no. 1, pp. 269–283, 2021.
[12] C. T. Dinh, N. H. Tran, M. N. H. Nguyen, C. S. Hong, W. Bao, A. Y.
Zomaya, and V. Gramoli, “Federated learning over wireless networks:
Convergence analysis and resource allocation, IEEE/ACM Transactions
on Networking, vol. 29, no. 1, pp. 398–409, 2021.
[13] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and
K. Chan, “Adaptive federated learning in resource constrained edge com-
puting systems,” IEEE Journal on Selected Areas in Communications,
vol. 37, no. 6, pp. 1205–1221, 2019.
[14] Z. Yang, M. Chen, W. Saad, C. S. Hong, and M. Shikh-Bahaei, “Energy
efficient federated learning over wireless communication networks,
IEEE Transactions on Wireless Communications, vol. 20, no. 3, pp.
1935–1949, 2021.
[15] X. Mo and J. Xu, “Energy-efficient federated edge learning with joint
communication and computation design,” Journal of Communications
and Information Networks, vol. 6, no. 2, pp. 110–124, 2021.
[16] H. H. Yang, Z. Liu, T. Q. S. Quek, and H. V. Poor, “Scheduling policies
for federated learning in wireless networks,” IEEE Transactions on
Communications, vol. 68, no. 1, pp. 317–333, 2020.
[17] J. Xu and H. Wang, “Client selection and bandwidth allocation in
wireless federated learning networks: A long-term perspective, IEEE
Transactions on Wireless Communications, vol. 20, no. 2, pp. 1188–
1200, 2021.
[18] M. M. Amiri, D. Gündüz, S. R. Kulkarni, and H. V. Poor, "Convergence of update aware device scheduling for federated learning at the wireless edge," IEEE Transactions on Wireless Communications, vol. 20, no. 6, pp. 3643–3658, 2021.
[19] W. Shi, S. Zhou, Z. Niu, M. Jiang, and L. Geng, “Joint device scheduling
and resource allocation for latency constrained wireless federated learn-
ing,” IEEE Transactions on Wireless Communications, vol. 20, no. 1,
pp. 453–467, 2021.
[20] S. Luo, X. Chen, Q. Wu, Z. Zhou, and S. Yu, “Hfel: Joint edge asso-
ciation and resource allocation for cost-efficient hierarchical federated
edge learning,” IEEE Transactions on Wireless Communications, vol. 19,
no. 10, pp. 6535–6548, 2020.
[21] "O-RAN Alliance whitepaper," June 2021. [Online]. Available: https://www.o-ran.org/s/O-RAN-Minimum-Viable-Plan-and-Acceleration-towards-Commercialization-White-Paper-29-June-2021.pdf
[22] O-RAN-WG6.CAD-V01.00.00; Cloud Architecture and Deployment
Scenarios for O-RAN Virtualized RAN specification, “O-RAN
Alliance.” [Online]. Available: www.o-ran.org
[23] J. Konečný, H. B. McMahan, D. Ramage, and P. Richtárik, "Federated optimization: Distributed machine learning for on-device intelligence," arXiv preprint arXiv:1610.02527, 2016.
[24] H. Lee, J. Cha, D. Kwon, M. Jeong, and I. Park, “Hosting ai/ml
workflows on o-ran ric platform, in 2020 IEEE Globecom Workshops
(GC Wkshps, 2020, pp. 1–6.
[25] Z. Li, D. Kovalev, X. Qian, and P. Richtarik, “Acceleration for com-
pressed gradient descent in distributed and federated optimization,” in
Proceedings of the 37th International Conference on Machine Learning,
ser. Proceedings of Machine Learning Research, H. D. III and A. Singh,
Eds., vol. 119. PMLR, 13–18 Jul 2020, pp. 5895–5904.
[26] W. Liu, L. Chen, Y. Chen, and W. Zhang, “Accelerating federated learn-
ing via momentum gradient descent,” IEEE Transactions on Parallel and
Distributed Systems, vol. 31, no. 8, pp. 1754–1766, 2020.
[27] A. Khaled and P. Richtárik, "Gradient descent with compressed iterates," arXiv preprint arXiv:1909.04716, 2019.
[28] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge
university press, 2004.
[29] L. Li, D. Shi, R. Hou, H. Li, M. Pan, and Z. Han, "To talk or to work: Flexible communication compression for energy efficient federated learning over heterogeneous mobile edge devices," in IEEE INFOCOM 2021 - IEEE Conference on Computer Communications, 2021, pp. 1–10.
[30] "5G User prediction data." [Online]. Available: https://www.kaggle.com/liukunxin/dataset
IX. APPENDIX
Theorem 1, Proof: Let z_T be a random variable that samples a weight parameter g_i^{(k)} with probability Pr[z_T = g_i^{(k)}] = 1/(NH). Taking δ = √((1/M) Σ_{m=1}^{M} δ_m²) and η_k = θ√(N/H), we get:

E[||z_T||²] = (1/(NH)) Σ_{k=0}^{H−1} Σ_{m=1}^{M} E[||∇f_m(g_m^{(k)})||²].  (30)
To get the difference between the loss function values of two consecutive iterations, we define the following sequences: (1) q^{(k)} = (1/M) Σ_{m=1}^{M} ∇f_m(g_m^{(k)}; D_m^{(k)}), (2) q̄^{(k)} = E_{D_m^{(k)}}[q^{(k)}] = (1/M) Σ_{m=1}^{M} ∇f_m(g_m^{(k)}), and the update rule of g_m^{(k)} as per (21). Now, using the smoothness condition (v) on the class of loss functions F : R^d → R, we get:

F(g^{(k+1)}) − F(g^{(k)})
≤ ⟨∇F(g^{(k)}), g^{(k+1)} − g^{(k)}⟩ + (L/2)||g^{(k+1)} − g^{(k)}||²
= −η⟨∇F(g^{(k)}), q^{(k)}⟩ + (η²L/2)||q^{(k)}||²
≤ −η⟨∇F(g^{(k)}), q^{(k)}⟩ + η²L||q^{(k)} − q̄^{(k)}||²
= −(η/M) Σ_{m=1}^{M} ⟨∇F(g^{(k)}), ∇f_m(g_m^{(k)}; D_m^{(k)})⟩ + η²L||q^{(k)} − q̄^{(k)}||² + η²L||(1/M) Σ_{m=1}^{M} ∇f_m(g_m^{(k)}; D_m^{(k)})||²  (31)
By taking the expectation w.r.t. the sampled dataset at each near-RT-RIC at time k, we get:

E[F(g^{(k+1)})] − F(g^{(k)})
≤ −(η/2)(||∇F(g^{(k)})||² + ||(1/M) Σ_{m=1}^{M} ∇f_m(g_m^{(k)})||²)
+ (η/2)||∇F(g^{(k)}) − (1/M) Σ_{m=1}^{M} ∇f_m(g_m^{(k)})||²
+ η²L||(1/M) Σ_{m=1}^{M} ∇f_m(g_m^{(k)})||² + η²σ²/(M b^{(k)})  (32)
Theorem 2, Proof: Suppose f(x) is convex with an L-Lipschitz continuous gradient and the compression operator C(·) satisfies (14) and (15). Let the learning step size be η = 1/((1 + ω)L); then the number of iterations performed by CGD to find an ε-solution such that E[f(x_k) − f(x*)] ≤ ε is at most k = O((1 + ω)L/ε).

Proof: According to the CGD update rule x_{k+1} = x_k − η g_k, we have

E[||x_{k+1} − x*||²] = E[||x_k − η g_k − x*||²]
= E[||x_k − ηC(∇f(x_k)) − x*||²]
= E[||x_k − x*||² − 2η⟨C(∇f(x_k)), x_k − x*⟩ + η²||C(∇f(x_k))||²]
= ||x_k − x*||² − 2η⟨∇f(x_k), x_k − x*⟩ + η² E[||C(∇f(x_k))||²]
= ||x_k − x*||² − 2η⟨∇f(x_k), x_k − x*⟩ + η²(||∇f(x_k)||² + E[||C(∇f(x_k)) − ∇f(x_k)||²])
≤ ||x_k − x*||² − 2η⟨∇f(x_k), x_k − x*⟩ + η²(1 + ω)||∇f(x_k)||²
≤ ||x_k − x*||² − 2η(f(x_k) − f(x*)) + η²(1 + ω)||∇f(x_k)||².  (33)

Using the convexity and L-smoothness of f, we have

E[f(x_{k+1}) − f(x*)]
≤ E[f(x_k) − f(x*) + ⟨∇f(x_k), x_{k+1} − x_k⟩ + (L/2)||x_{k+1} − x_k||²]
= E[f(x_k) − f(x*) + ⟨∇f(x_k), −ηC(∇f(x_k))⟩ + (Lη²/2)||C(∇f(x_k))||²]
= E[f(x_k) − f(x*) − η||∇f(x_k)||² + (Lη²/2)||C(∇f(x_k))||²]
≤ E[f(x_k) − f(x*) − η(1 − (1 + ω)Lη/2)||∇f(x_k)||²].  (35)

By multiplying (35) by η(1 + ω)/(1 − (1 + ω)Lη/2) and adding (33), we have

E[(η(1 + ω)/(1 − (1 + ω)Lη/2))(f(x_{k+1}) − f(x*)) + ||x_{k+1} − x*||² + 2η(f(x_k) − f(x*))]
≤ E[(η(1 + ω)/(1 − (1 + ω)Lη/2))(f(x_k) − f(x*)) + ||x_k − x*||²].  (36)

Telescoping (36) over the iterations yields

E[2η Σ_{i=0}^{k} (f(x_i) − f(x*))]
≤ (η(1 + ω)/(1 − (1 + ω)Lη/2))(f(x_0) − f(x*)) + ||x_0 − x*||²  (37)
≤ (η(1 + ω)/(1 − (1 + ω)Lη/2))(L/2)||x_0 − x*||² + ||x_0 − x*||² = 2||x_0 − x*||²/(2 − (1 + ω)Lη).  (38)

Using the L-smoothness and convexity of f at each x_i, i = 0, 1, ..., k, according to the above inferences, we must have:

E[2ηk(f(x_k) − f(x*))] ≤ 2||x_0 − x*||²/(2 − (1 + ω)Lη)  (39)

E[f(x_k) − f(x*)] ≤ ||x_0 − x*||²/((2 − (1 + ω)Lη)ηk) = (1 + ω)L||x_0 − x*||²/k,  (40)

with the choice of η = 1/((1 + ω)L) and k = (1 + ω)L||x_0 − x*||²/ε, so that E[f(x_k) − f(x*)] ≤ ε within O((1 + ω)L/ε) iterations.
ResearchGate has not been able to resolve any citations for this publication.
Conference Paper
Full-text available
Recently, Federated Learning (FL) has been applied in various research domains specially because of its privacy preserving and decentralized approach of model training. However, very few FL applications have been developed for the Radio Access Network (RAN) due to the lack of efficient deployment models. Open RAN (O-RAN) promises a high standard of meeting 5G services through its disaggregated, hierarchical, and distributed network function processing framework. Moreover, it comes with built-in intelligent controllers to instill smart decision making ability into RAN. In this paper, we propose a framework named O-RANFed to deploy and optimize FL tasks in O-RAN to provide 5G slicing services. To improve the performance of FL we formulate a joint mathematical optimization model of local learners selection and resource allocation to perform model training in every iteration. We solve this non-convex problem using the decomposition method. First, we propose a slicing based and deadline aware client selection algorithm. Then, we solve the reduced resource allocation problem by using successive convex approximation (SCA) method. Our simulation results show the proposed model outperforms the state-of-the-art FL methods such as FedAvg and FedProx in terms of convergence, learning time, and resource costs.
Article
Full-text available
In this paper, the problem of energy efficient transmission and computation resource allocation for federated learning (FL) over wireless communication networks is investigated. In the considered model, each user exploits limited local computational resources to train a local FL model with its collected data and, then, sends the trained FL model to a base station (BS) which aggregates the local FL model and broadcasts it back to all of the users. Since FL involves an exchange of a learning model between users and the BS, both computation and communication latencies are determined by the learning accuracy level. Meanwhile, due to the limited energy budget of the wireless users, both local computation energy and transmission energy must be considered during the FL process. This joint learning and communication problem is formulated as an optimization problem whose goal is to minimize the total energy consumption of the system under a latency constraint. To solve this problem, an iterative algorithm is proposed where, at every step, closed-form solutions for time allocation, bandwidth allocation, power control, computation frequency, and learning accuracy are derived. Since the iterative algorithm requires an initial feasible solution, we construct the completion time minimization problem and a bisection-based algorithm is proposed to obtain the optimal solution, which is a feasible solution to the original energy minimization problem. Numerical results show that the proposed algorithms can reduce up to 59.5% energy consumption compared to the conventional FL method.
Article
Full-text available
In this paper, the problem of training federated learning (FL) algorithms over a realistic wireless network is studied. In the considered model, wireless users execute an FL algorithm while training their local FL models using their own data and transmitting the trained local FL models to a base station (BS) that generates a global FL model and sends the model back to the users. Since all training parameters are transmitted over wireless links, the quality of training is affected by wireless factors such as packet errors and the availability of wireless resources. Meanwhile, due to the limited wireless bandwidth, the BS needs to select an appropriate subset of users to execute the FL algorithm so as to build a global FL model accurately. This joint learning, wireless resource allocation, and user selection problem is formulated as an optimization problem whose goal is to minimize an FL loss function that captures the performance of the FL algorithm. To seek the solution, a closed-form expression for the expected convergence rate of the FL algorithm is first derived to quantify the impact of wireless factors on FL. Then, based on the expected convergence rate of the FL algorithm, the optimal transmit power for each user is derived, under a given user selection and uplink resource block (RB) allocation scheme. Finally, the user selection and uplink RB allocation is optimized so as to minimize the FL loss function. Simulation results show that the proposed joint federated learning and communication framework can improve the identification accuracy by up to 1:4%, 3:5% and 4:1%, respectively, compared to: 1) An optimal user selection algorithm with random resource allocation, 2) a standard FL algorithm with random user selection and resource allocation, and 3) a wireless optimization algorithm that minimizes the sum packet error rates of all users while being agnostic to the FL parameters.
Article
This paper studies a federated edge learning system, in which an edge server coordinates a set of edge devices to train a shared machine learning (ML) model based on their locally distributed data samples. During the distributed training, we exploit the joint communication and computation design for improving the system energy efficiency, in which both the communication resource allocation for global ML-parameter aggregation and the computation resource allocation for locally updating ML-parameters are jointly optimized. In particular, we consider two transmission protocols for edge devices to upload ML-parameters to the edge server, based on non-orthogonal multiple access (NOMA) and time division multiple access (TDMA), respectively. Under both protocols, we minimize the total energy consumption at all edge devices over a particular finite training duration subject to a given training accuracy, by jointly optimizing the transmission power and rates at edge devices for uploading ML-parameters and their central processing unit (CPU) frequencies for local updates. We propose efficient algorithms to solve the formulated energy minimization problems using techniques from convex optimization. Numerical results show that, compared to other benchmark schemes, our proposed joint communication and computation design can significantly improve the energy efficiency of the federated edge learning system by properly balancing the energy tradeoff between communication and computation.
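The computation side of the tradeoff in works like this one is commonly captured by a CMOS energy model in which per-round energy grows quadratically with the CPU clock frequency while latency shrinks inversely, so raising the frequency trades energy for speed. The sketch below uses that standard model; the capacitance coefficient `kappa` and the workload numbers are illustrative assumptions, not values from the paper.

```python
def local_round_cost(cycles_per_sample, samples, freq_hz, kappa=1e-28):
    """Common CPU model in FL resource allocation:
    energy ~ kappa * cycles * f^2, latency ~ cycles / f."""
    total_cycles = cycles_per_sample * samples
    energy_j = kappa * total_cycles * freq_hz ** 2
    latency_s = total_cycles / freq_hz
    return energy_j, latency_s

# Illustrative numbers: 1e4 cycles/sample, 100 samples, 1 GHz clock.
energy, latency = local_round_cost(1e4, 100, 1e9)
# → energy 1e-4 J, latency 1e-3 s
```

Halving `freq_hz` here cuts energy by 4x but doubles latency, which is exactly the communication/computation balance the optimization exploits.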
Article
Driven by the demands of efficient network operation and high service availability, the convergence of artificial intelligence (AI) with radio access networks (RANs) has drawn considerable attention. However, current academic research mainly focuses on applying AI to optimizing RANs, with few discussions on architecture design. This article surveys the recent progress achieved by industry in integrating AI into RANs, and proposes an AI-driven fog RAN (F-RAN) paradigm. Specifically, as wrappers of AI-related functionalities, AI capsules are presented as new network functions in the F-RAN domain. With AI capsules, computation and cache resources at various fog nodes can be utilized to facilitate real-time AI-based F-RAN optimization and alleviate the transmission burden incurred by network data collection. At the edge cloud, a centralized AI brain for F-RANs is deployed, which incorporates a wireless-oriented auto-AI platform and a digital twin of the network environment for offline AI model training and evaluation. Through the interplay between the AI capsules and the AI brain, universal and endogenous intelligence can be fully realized within F-RANs, which in turn enhances system performance. Furthermore, we demonstrate the effectiveness of a scalable deep-reinforcement-learning-based method in minimizing energy consumption for a computation offloading use case. Finally, open issues are identified in terms of interface standardization, federated learning, and transfer learning.
Article
We study federated learning (FL) at the wireless edge, where power-limited devices with local datasets collaboratively train a joint model with the help of a remote parameter server (PS). We assume that the devices are connected to the PS through a bandwidth-limited shared wireless channel. At each iteration of FL, a subset of the devices is scheduled to transmit their local model updates to the PS over orthogonal channel resources, and each participating device must compress its model update to accommodate its link capacity. We design novel scheduling and resource allocation policies that decide on the subset of devices to transmit at each round, and how the resources should be allocated among the participating devices, based not only on their channel conditions but also on the significance of their local model updates. We then establish convergence of a wireless FL algorithm with device scheduling, where devices have limited capacity to convey their messages. The results of numerical experiments show that the proposed scheduling policy, based on both the channel conditions and the significance of the local model updates, provides better long-term performance than scheduling policies based on only one of the two metrics. Furthermore, we observe that when the data is independent and identically distributed (i.i.d.) across devices, selecting a single device at each round provides the best performance, while when the data distribution is non-i.i.d., scheduling multiple devices at each round improves the performance. This observation is verified by the convergence result, which shows that the number of scheduled devices should increase for a less diverse and more biased data distribution.
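A scheduling policy that mixes the two metrics above can be sketched as a simple scoring rule: normalize channel gain and update magnitude, combine them with a weight, and pick the top-k devices. This is a hypothetical illustration of the idea, not the paper's actual policy; the score function, `alpha`, and all inputs are assumptions.

```python
import numpy as np

def schedule_devices(channel_gains, update_norms, k, alpha=0.5):
    """Pick k device indices by a score mixing normalized channel
    quality and normalized update significance; alpha weights the two."""
    g = np.asarray(channel_gains, dtype=float)
    u = np.asarray(update_norms, dtype=float)
    score = alpha * g / g.max() + (1.0 - alpha) * u / u.max()
    # indices of the k largest scores, returned in ascending order
    return sorted(np.argsort(score)[-k:].tolist())

chosen = schedule_devices(
    channel_gains=[0.9, 0.1, 0.5], update_norms=[0.2, 1.0, 0.6], k=2)
# → [0, 2]
```

Setting `alpha=1.0` recovers a channel-only policy and `alpha=0.0` a significance-only policy, the two baselines the experiments compare against.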
Article
There is increasing interest in a fast-growing machine learning technique called Federated Learning (FL), in which model training is distributed over mobile user equipment (UEs), exploiting the UEs' local computation and training data. Despite its advantages, such as preserving data privacy, FL still faces challenges of heterogeneity across UEs' data and physical resources. To address these challenges, we first propose FEDL, an FL algorithm that can handle heterogeneous UE data with no assumptions beyond strongly convex and smooth loss functions. We provide a convergence rate characterizing the trade-off between the local computation rounds each UE uses to update its local model and the global communication rounds used to update the FL global model. We then employ FEDL in wireless networks as a resource allocation optimization problem that captures the trade-off between FEDL convergence wall-clock time and the energy consumption of UEs with heterogeneous computing and power resources. Even though the wireless resource allocation problem of FEDL is non-convex, we exploit the problem's structure to decompose it into three sub-problems and analyze their closed-form solutions as well as insights into problem design. Finally, we empirically evaluate the convergence of FEDL with PyTorch experiments, and provide extensive numerical results for the wireless resource allocation sub-problems. Experimental results show that FEDL outperforms the vanilla FedAvg algorithm in terms of convergence rate and test accuracy in various settings.
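The FedAvg baseline mentioned here aggregates local models by a sample-count-weighted average; this is the standard FedAvg global update, shown below on plain parameter vectors. The function name is illustrative, but the weighting rule is the one FedAvg uses.

```python
import numpy as np

def fedavg_aggregate(local_models, num_samples):
    """Standard FedAvg aggregation: the global model is the average of
    local model vectors weighted by each client's sample count."""
    weights = np.asarray(num_samples, dtype=float)
    weights /= weights.sum()
    stacked = np.stack(local_models)              # shape (clients, dim)
    return (weights[:, None] * stacked).sum(axis=0)

global_model = fedavg_aggregate(
    [np.array([1.0, 2.0]), np.array([3.0, 4.0])], num_samples=[1, 3])
# → array([2.5, 3.5]): the second client holds 3/4 of the data
```

FEDL modifies the local objective each UE solves between such aggregation rounds, which is what yields its convergence-rate trade-off, but the global combining step has this same weighted-average form.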
Article
This paper studies federated learning (FL) in a classic wireless network, where learning clients share a common wireless link to a coordinating server to perform federated model training using their local data. In such wireless federated learning networks (WFLNs), optimizing the learning performance depends crucially on how clients are selected and how bandwidth is allocated among the selected clients in every learning round, as both radio and client energy resources are limited. While existing works have made some attempts to allocate the limited wireless resources to optimize FL, they focus on the problem in individual learning rounds, overlooking an inherent yet critical feature of federated learning. This paper brings a new long-term perspective to resource allocation in WFLNs, recognizing that learning rounds are not only temporally interdependent but also vary in significance towards the final learning outcome. To this end, we first design data-driven experiments to show that different temporal client selection patterns lead to considerably different learning performance. With the obtained insights, we formulate a stochastic optimization problem for joint client selection and bandwidth allocation under long-term client energy constraints, and develop a new algorithm that uses only currently available wireless channel information yet achieves a long-term performance guarantee. Experiments show that our algorithm produces the desired temporal client selection pattern, adapts to changing network environments, and far outperforms benchmarks that ignore the long-term effects of FL.