
Joint Optimization for Federated Learning Over the Air

Authors:
Xin Fan1, Yue Wang2, Yan Huo1, and Zhi Tian2
1School of Electronics and Information Engineering, Beijing Jiaotong University, Beijing, China
2Department of Electrical & Computer Engineering, George Mason University, Fairfax, VA, USA
E-mails: {yhuo,fanxin}@bjtu.edu.cn, {ywang56,ztian1}@gmu.edu
Abstract—In this paper, we focus on federated learning (FL) over the air based on analog aggregation transmission in realistic wireless networks. We first derive a closed-form expression for the expected convergence rate of FL over the air, which theoretically quantifies the impact of analog aggregation on FL. Based on that, we further develop a joint optimization model for accurate FL implementation, which allows a parameter server to select a subset of edge devices and determine an appropriate power scaling factor. Such a joint optimization of device selection and power control for FL over the air is then formulated as a mixed-integer programming problem. Finally, we efficiently solve this problem via a simple finite-set search method. Simulation results show that the proposed solutions developed for wireless channels outperform a benchmark method, and achieve performance comparable to the ideal case where FL is implemented over reliable and error-free wireless channels.
Index Terms—Federated learning, analog aggregation, convergence analysis, joint optimization, worker scheduling, power scaling.
I. INTRODUCTION
Recently, federated learning (FL) has been proposed as a widely acknowledged approach for collaborative edge learning [1], [2]. In FL, edge devices (local workers) train local models on their own local data, and then transmit their local updates to a parameter server (PS). After aggregating these received local updates, the PS feeds the averaged update back to the local workers. These iterative updates between the PS and workers can be either model parameters for model averaging [1] or parameter gradients for gradient averaging [2]. In this way, FL relieves communication overhead and protects user privacy compared to the raw data sharing of traditional collaborative learning, especially when the local data is large in volume and privacy-sensitive. Existing work on FL focuses on FL algorithms under idealized link assumptions, but the impact of wireless environments on FL performance should be taken into account in the design of FL deployed in real wireless systems. Otherwise, the inherent characteristics of wireless links may introduce unwanted training errors that dramatically degrade the learning performance in terms of accuracy and convergence rate.
To solve this problem, research efforts have been devoted to optimizing the network resources used for transmitting model updates in FL [3]. These works on FL over wireless networks adopt digital communications, using a transmission-then-aggregation policy. Unfortunately, the communication overhead and transmission latency grow large as the number of active workers increases. On the other hand, it is worth noting that FL aims for global aggregation and hence only utilizes the averaged updates of distributed workers rather than the individual local updates from workers. Alternatively, the waveform-superposition nature of the wireless multiple access channel (MAC) [4] provides a direct and efficient way to transmit the averaged updates in FL, also known as analog aggregation based FL [5]–[10]. As a joint transmission-and-aggregation policy, analog aggregation transmission enables all participating workers to simultaneously upload their local model updates to the PS over the same time-frequency resources, as long as the aggregated waveform represents the averaged updates, thus substantially reducing the overhead of wireless communication for FL [11], [12].
While there exist works on analog aggregation based FL [5]–[10], some of them mainly focus on designing transmission schemes without optimization [5]–[8], adopting preselected participating workers and fixed power allocations. Although optimization issues are considered in [9], [10], the optimization is conducted on the communication side alone, without an underlying connection to FL. Noticeably, while the optimization in existing works boils down to maximizing the number of selected workers, our theoretical results indicate that more workers is not necessarily better over imperfect links and with limited communication resources. Thus, unlike these existing works, we seek to analyze the convergence behavior of analog aggregation based FL, which reveals the specific relationship between communications and FL in the analog aggregation paradigm. Such a meaningful connection leads to a joint optimization framework for analog communications and FL. This paper aims at a comprehensive study of problem formulation, solution development, and algorithm implementation for the joint design and optimization of wireless communication and FL. Our key contributions are summarized as follows:
• We derive closed-form expressions for the expected convergence rate of FL over the air in the convex and non-convex cases, respectively, which not only interpret but also quantify the impact of wireless communications on the convergence and accuracy of FL over the air. Both full-size gradient descent (GD) and mini-batch stochastic gradient descent (SGD) methods are considered in this work.
• Based on the closed-form theoretical results, we formulate a joint optimization problem of learning, worker selection, and power control, with the goal of minimizing the global FL loss function. The optimization formulation turns out to be universal for the convex and non-convex cases with GD and SGD. Further, for practical implementation of the joint optimization problem in the presence of some unobservable parameters, we develop an alternative reformulation that approximates the original unattainable problem as a feasible optimization problem under the operational constraints of analog aggregation.
• To efficiently solve the approximate problem, we identify a tight solution space by exploring the relationship between the number of workers and the power scaling. Thanks to the reduced search space, we propose a simple discrete enumeration method to efficiently find the globally optimal solution.

Fig. 1: FL via analog aggregation from wirelessly distributed data.
II. SYSTEM MODEL
As shown in Fig. 1, we consider a one-hop wireless network
consisting of a single PS at a base station and U user devices as
distributed local workers. Through FL, the PS and all workers
collaborate to train a common model for supervised learning
and data inference, without sharing local data.
A. FL Model
Let $\mathcal{D}_i = \{x_{i,k}, y_{i,k}\}_{k=1}^{K_i}$ denote the local dataset at the $i$-th worker, $i = 1, \ldots, U$, where $x_{i,k}$ is the input data vector, $y_{i,k}$ is the labeled output vector, $k = 1, 2, \ldots, K_i$, and $K_i = |\mathcal{D}_i|$ is the number of data samples available at the $i$-th worker. With $K = \sum_{i=1}^{U} K_i$ samples in total, these $U$ workers seek to collectively train a learning model parameterized by a global model parameter $w = [w^1, \ldots, w^D] \in \mathbb{R}^D$ of dimension $D$, by minimizing the following loss function:

(Global loss function) $F(w; \mathcal{D}) = \frac{1}{K} \sum_{i=1}^{U} \sum_{k=1}^{K_i} f(w; x_{i,k}, y_{i,k})$,  (1)

where the global loss function $F(w; \mathcal{D})$ is a summation of $K$ data-dependent components, each component $f(w; x_{i,k}, y_{i,k})$ is a sample-wise local function that quantifies the model prediction error of the same data model parameterized by the shared model parameter $w$, and $\mathcal{D} = \cup_i \mathcal{D}_i$.
In distributed learning, each worker trains a local model $w_i$ from its local data $\mathcal{D}_i$, which can be viewed as a local copy of the global model $w$. That is, the local loss function is

(Local loss function) $F_i(w_i; \mathcal{D}_i) = \frac{1}{K_i} \sum_{k=1}^{K_i} f(w_i; x_{i,k}, y_{i,k})$,  (2)

where $w_i = [w_i^1, \ldots, w_i^D] \in \mathbb{R}^D$ is the local model parameter. Through collaboration, it is desired to achieve $w_i = w = w^*$, $\forall i$, so that all workers reach the globally optimal model $w^*$. Such distributed learning can be formulated via consensus optimization as [1], [13]

P1: $\min_{w} \frac{1}{K} \sum_{i=1}^{U} \sum_{k=1}^{K_i} f(w_i; x_{i,k}, y_{i,k})$.  (3)
To solve P1, this paper adopts a model-averaging algorithm for FL [1], [13], where gradient descent is applied, and the local model at the $i$-th worker is updated as

(Local model updating) $w_i = w - \frac{\alpha}{K_i} \sum_{k=1}^{K_i} \nabla f(w; x_{i,k}, y_{i,k})$,  (4)

where $\alpha$ is the learning rate, and $\nabla f(w; x_{i,k}, y_{i,k})$ is the gradient of $f(w; x_{i,k}, y_{i,k})$ with respect to $w$.

When local updating is completed, each worker transmits its updated parameter $w_i$ to the PS via wireless uplinks to update the global $w$ as

(Global model updating) $w = \frac{\sum_{i=1}^{U} K_i w_i}{K}$.  (5)

Then, the PS broadcasts $w$ in (5) to all participating workers as their initial value for the next round. FL implements the local model updating in (4) and the global model averaging in (5) iteratively, until convergence.
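To make the update rules concrete, here is a minimal NumPy sketch of one error-free FL round implementing the local update (4) and the sample-weighted global averaging (5). The least-squares sample loss and all variable names are illustrative assumptions for this sketch, not the paper's code.

```python
import numpy as np

def local_update(w, X, y, alpha):
    """Local model updating, eq. (4): one full-batch GD step on worker data.
    Assumes a least-squares sample loss f(w; x, y) = 0.5 * (x @ w - y)**2."""
    grad = X.T @ (X @ w - y) / len(y)   # averaged local gradient
    return w - alpha * grad

def global_average(local_models, sample_counts):
    """Global model updating, eq. (5): average weighted by K_i / K."""
    K = sum(sample_counts)
    return sum(K_i * w_i for K_i, w_i in zip(sample_counts, local_models)) / K

# One FL round over U workers (synthetic example)
rng = np.random.default_rng(0)
U, D, alpha = 20, 2, 0.01
datasets = [(rng.random((50, D)), rng.random(50)) for _ in range(U)]
w = np.zeros(D)                          # global model broadcast by the PS
local = [local_update(w, X, y, alpha) for X, y in datasets]
w = global_average(local, [len(y) for _, y in datasets])
```

In FL over the air, the explicit averaging step above is replaced by the analog aggregation described next in Section II-B.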
B. Analog Aggregation Transmission Model
To avoid heavy communication overhead, we adopt analog aggregation without coding, which allows multiple workers to simultaneously upload their updates to the PS over the same time-frequency resources. The local updates $w_i$'s are aggregated over the air to implement the global model updating step in (5). Such analog aggregation is conducted in an entry-wise manner. That is, the $d$-th entries $w_i^d$ from all workers, $i = 1, \ldots, U$, are aggregated to compute $w^d$ in (5), for any $d \in [1, D]$.
Let $p_{i,t} = [p_{i,t}^1, \ldots, p_{i,t}^d, \ldots, p_{i,t}^D]$ denote the power control vector of worker $i$ at the $t$-th iteration. Noticeably, the choice of $p_{i,t}$ in FL over the air should be made not only to effectively implement the aggregation rule in (5), but also to properly accommodate the need for network resource allocation. Accordingly, we set the power control policy as

$p_{i,t}^d = \frac{\beta_{i,t}^d K_i b_t^d}{h_{i,t}^d}$,  (6)

where $h_{i,t}^d$ is the channel gain between the $i$-th worker and the PS at the $t$-th iteration¹, $b_t^d$ is the power scaling factor, and $\beta_{i,t}^d$ is a transmission scheduling indicator. That is, $\beta_{i,t}^d = 1$ means that the $d$-th entry of the local model parameter $w_{i,t}$ of the $i$-th worker is scheduled to contribute to the FL algorithm at the $t$-th iteration, and $\beta_{i,t}^d = 0$ otherwise. Through power scaling, the transmit power used for uploading the $d$-th entry from the $i$-th worker should not exceed a maximum power limit $P_i^{d,\max} = P_i^{\max}$ for any $d$, as follows:

$|p_{i,t}^d w_{i,t}^d|^2 = \left| \frac{\beta_{i,t}^d K_i b_t^d}{h_{i,t}^d} w_{i,t}^d \right|^2 \leq P_i^{\max}$.  (7)

¹In this paper, we assume the channel state information (CSI) to be constant within each iteration, though it may vary over iterations. We also assume that the CSI is perfectly known at the PS, and leave the imperfect CSI case for future work.
At the PS side, the received signal at the $t$-th iteration can be written as

$y_t = \sum_{i=1}^{U} p_{i,t} \circ w_{i,t} \circ h_{i,t} + z_t = \sum_{i=1}^{U} K_i b_t \circ \beta_{i,t} \circ w_{i,t} + z_t$,

where $\circ$ represents the Hadamard product, $h_{i,t} = [h_{i,t}^1, h_{i,t}^2, \ldots, h_{i,t}^D]$, $\beta_{i,t} = [\beta_{i,t}^1, \beta_{i,t}^2, \ldots, \beta_{i,t}^D]$, $b_t = [b_t^1, b_t^2, \ldots, b_t^D]$, and $z_t \sim \mathcal{CN}(0, \sigma^2 I)$ is additive white Gaussian noise (AWGN).

Given the received $y_t$, the PS estimates $w_t$ via a post-processing operation as

$w_t = \left(\sum_{i=1}^{U} K_i \beta_{i,t} \circ b_t\right)^{-1} \circ y_t = \left(\sum_{i=1}^{U} K_i \beta_{i,t} \circ b_t\right)^{-1} \circ z_t + \left(\sum_{i=1}^{U} K_i \beta_{i,t}\right)^{-1} \circ \sum_{i=1}^{U} K_i \beta_{i,t} \circ w_{i,t}$,  (8)

where $\left(\sum_{i=1}^{U} K_i \beta_{i,t} \circ b_t\right)^{-1}$ is a properly chosen scaling vector that produces equal weighting for the participating $w_i$'s in (8), as desired in (5), and $(X)^{-1}$ represents the inverse Hadamard operation of $X$, which calculates its entry-wise reciprocal. Noticeably, in order to implement the averaging of (5) in FL over the air, such a post-processing operation requires $b_t^d$ to be common to all workers for given $t$ and $d$, which allows $b_t$ to cancel out in the model-averaging (second) term of (8).
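As a sanity check on the transmission model, the following sketch simulates the entry-wise analog aggregation of (6)-(8) for a single model entry: each scheduled worker inverts its channel and applies the common scaling $b_t$, the transmitted signals superpose with AWGN at the PS, and post-processing recovers a noisy weighted average. Real-valued channels and all parameter values are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
U = 20
K = rng.integers(50, 100, size=U)            # K_i: local sample counts
h = rng.exponential(1.0, size=U)             # h_{i,t}^d: channel gains
w_local = rng.normal(0.0, 1.0, size=U)       # w_{i,t}^d: d-th entry at each worker
beta = np.ones(U)                            # beta_{i,t}^d: all workers scheduled
b_t, sigma = 0.1, 1e-2                       # common power scaling, noise std

# Transmit amplitudes per eq. (6): p_i = beta_i * K_i * b_t / h_i
p = beta * K * b_t / h

# Superposition at the PS plus AWGN: y = sum_i p_i h_i w_i + z
y = np.sum(p * h * w_local) + rng.normal(0.0, sigma)

# Post-processing of eq. (8): scale by (sum_i K_i beta_i b_t)^{-1}
w_hat = y / np.sum(K * beta * b_t)

# Ideal weighted average of eq. (5), for comparison
w_ideal = np.sum(K * beta * w_local) / np.sum(K * beta)
print(abs(w_hat - w_ideal))                  # small residual due to channel noise
```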
III. THE CONVERGENCE ANALYSIS OF FL WITH ANALOG AGGREGATION
In this section, we study the effect of analog aggregation
transmission on FL over the air, by analyzing its convergence
behavior for both the convex and the non-convex cases. To
average the effects of instantaneous SNRs, we derive the
expected convergence rate of FL over the air, which quantifies
the impact of wireless communications on FL using analog
aggregation transmissions.
A. Convex Case
We first make the following assumptions, which are commonly adopted in the optimization literature [3], [14], [15].

Assumption 1 (Lipschitz continuity, smoothness): The gradient $\nabla F(w)$ of the loss function $F(w)$ is uniformly Lipschitz continuous with respect to $w$, that is,

$\|\nabla F(w_{t+1}) - \nabla F(w_t)\| \leq L \|w_{t+1} - w_t\|, \quad \forall w_t, w_{t+1}$,  (9)

where $L$ is a positive constant.

Assumption 2 (strong convexity): $F(w)$ is strongly convex with a positive parameter $\mu$, obeying

$F(w_{t+1}) \geq F(w_t) + (w_{t+1} - w_t)^T \nabla F(w_t) + \frac{\mu}{2}\|w_{t+1} - w_t\|^2, \quad \forall w_t, w_{t+1}$.  (10)

Assumption 3 (bounded local gradients): The sample-wise local gradients at local workers are bounded by their global counterpart [14], [15]:

$\|\nabla f(w_t)\|^2 \leq \rho_1 + \rho_2 \|\nabla F(w_t)\|^2$,  (11)

where $\rho_1, \rho_2 \geq 0$.
According to [2], [16], the FL algorithm applied over ideal wireless channels is able to solve P1 and converges to an optimal $w^*$. In the presence of wireless transmission errors, we derive the expected convergence rate of FL over the air with analog aggregation, as in Theorem 1.
Theorem 1. Adopt Assumptions 1-3, and let the model updating rule for $w_t$ of the FL-over-the-air scheme be given by (8), $\forall t$. Given the learning rate $\alpha = \frac{1}{L}$, the expected performance gap $\mathbb{E}[F(w_t) - F(w^*)]$ of $w_t$ at the $t$-th iteration is given by

$\mathbb{E}[F(w_t) - F(w^*)] \leq \underbrace{\sum_{i=1}^{t-1} \prod_{j=1}^{i} A_{t+1-j} B_{t-i} + B_t}_{\Delta_t} + \prod_{j=1}^{t} A_j\, \mathbb{E}[F(w_0) - F(w^*)]$,  (12)

where $A_t = 1 - \frac{\mu}{L} + \rho_2 \sum_{d=1}^{D} \left( \frac{K}{\sum_{i=1}^{U} K_i \beta_{i,t}^d} - 1 \right)$ and $B_t = \frac{\rho_1}{2L} \sum_{d=1}^{D} \left( \frac{K}{\sum_{i=1}^{U} K_i \beta_{i,t}^d} - 1 \right) + \frac{\sigma^2}{2} \left\| \left( \sum_{i=1}^{U} K_i \beta_{i,t} \circ b_t \right)^{-1} \right\|_2^2$.

Proof. All the proofs, which are omitted in this paper due to the page limit, can be found in our journal version at [17]: https://arxiv.org/pdf/2104.03490.pdf
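Numerically, the bound in (12) can be evaluated through the recursion $\Delta_t = B_t + A_t \Delta_{t-1}$ used later in (16). Below is a small sketch that computes $A_t$ and $B_t$ for a given schedule and iterates the recursion; the values of $\mu$, $L$, $\rho_1$, $\rho_2$ are illustrative assumptions, and holding the schedule fixed over iterations is a simplification.

```python
import numpy as np

def gap_bound(beta, K, b_t, mu, L, rho1, rho2, sigma2, T):
    """Iterate Delta_t = B_t + A_t * Delta_{t-1} with the A_t, B_t of Theorem 1,
    for a fixed schedule beta (U x D) and common power scaling b_t (length D)."""
    Ktot = K.sum()
    served = beta.T @ K                      # sum_i K_i beta_{i,t}^d, per entry d
    A = 1 - mu / L + rho2 * np.sum(Ktot / served - 1.0)
    B = rho1 / (2 * L) * np.sum(Ktot / served - 1.0) \
        + 0.5 * sigma2 * np.sum(1.0 / (served * b_t) ** 2)
    delta = 0.0
    for _ in range(T):
        delta = B + A * delta                # recursion behind eq. (12)
    return delta

rng = np.random.default_rng(2)
U, D = 20, 5
K = rng.integers(50, 100, size=U)
beta = np.ones((U, D))                       # all workers scheduled on every entry
print(gap_bound(beta, K, b_t=np.full(D, 0.1), mu=0.5, L=1.0,
                rho1=0.1, rho2=0.01, sigma2=1e-4, T=50))
```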
B. Non-convex Case
When the loss function $F(w)$ is non-convex, such as in the case of convolutional neural networks, we derive the convergence rate of FL over the air with analog aggregation without Assumption 2, as summarized in Theorem 2.

Theorem 2. Under Assumptions 1 and 3 for the non-convex case, given the learning rate $\alpha = \frac{1}{L}$, the convergence at the $T$-th iteration is given by

$\frac{1}{T} \sum_{t=1}^{T} \|\nabla F(w_{t-1})\|^2 \leq \frac{2L\, \mathbb{E}[F(w_0) - F(w^*)]}{T\left(1 - \rho_2 D\left(\frac{K}{K_{\min}} - 1\right)\right)} + \frac{2L \sum_{t=1}^{T} B_t}{T\left(1 - \rho_2 D\left(\frac{K}{K_{\min}} - 1\right)\right)}$.  (13)

Proof. Please refer to our journal version [17].

As we can see from Theorem 2, when $T$ is large enough, we have

$\min_{t \in \{0,1,\ldots,T\}} \mathbb{E}[\|\nabla F(w_{t-1})\|^2] \leq \frac{1}{T} \sum_{t=1}^{T} \|\nabla F(w_{t-1})\|^2 \xrightarrow{T \to \infty} \underbrace{\frac{2L \sum_{t=1}^{T} B_t}{T\left(1 - \rho_2 D\left(\frac{K}{K_{\min}} - 1\right)\right)}}_{\Delta_T^{NC}}$,  (14)

which guarantees convergence of the FL algorithm to a stationary point [13]. Similarly, the performance gap at step $t$ for the non-convex case is given by $\Delta_t^{NC} = \frac{2L \sum_{t=1}^{T} B_t}{T\left(1 - \rho_2 D\left(\frac{K}{K_{\min}} - 1\right)\right)}$.
C. Stochastic Gradient Descent
Our work can be extended to stochastic gradient descent (SGD) as well. Here, we provide a convergence analysis for mini-batch gradient descent with a constant mini-batch size $K_b$. Theorem 3 summarizes the convergence behavior of SGD for the strongly convex case.

Theorem 3. Under Assumptions 1, 2 and 3 for the convex case, given the learning rate $\alpha = \frac{1}{L}$ and the mini-batch size $K_b$, the convergence behavior of the SGD implementation of FL over the air is given by

$\mathbb{E}[F(w_t) - F(w^*)] \leq \underbrace{\sum_{i=1}^{t-1} \prod_{j=1}^{i} A_{t+1-j}^{SGD} B_{t-i}^{SGD} + B_t^{SGD}}_{\Delta_t^{SGD}} + \prod_{j=1}^{t} A_j^{SGD}\, \mathbb{E}[F(w_0) - F(w^*)]$,  (15)

where $A_t^{SGD} = 1 - \frac{\mu}{L} + \rho_2 \Big( \sum_{d=1}^{D} \Big( \frac{(\sum_{i=1}^{U} K_b)^2 - 2K \sum_{i=1}^{U} K_b}{K^2} + \frac{\sum_{i=1}^{U} K_b}{\sum_{i=1}^{U} K_b \beta_{i,t}^d} \Big) + \frac{(\sum_{i=1}^{U} (K_i - K_b))^2}{K^2} \Big)$ and $B_t^{SGD} = \frac{\rho_1}{2L} \Big( \sum_{d=1}^{D} \Big( \frac{(\sum_{i=1}^{U} K_b)^2 - 2K \sum_{i=1}^{U} K_b}{K^2} + \frac{\sum_{i=1}^{U} K_b}{\sum_{i=1}^{U} K_b \beta_{i,t}^d} \Big) + \frac{(\sum_{i=1}^{U} (K_i - K_b))^2}{K^2} \Big) + \frac{\sigma^2}{2} \left\| \left( \sum_{i=1}^{U} K_i \beta_{i,t} \circ b_t \right)^{-1} \right\|_2^2$.

Proof. Please refer to our journal version [17].

From Theorem 3, the cumulative performance gap of FL after the $t$-th iteration for the SGD case is bounded by $\Delta_t^{SGD} = \sum_{i=1}^{t-1} \prod_{j=1}^{i} A_{t+1-j}^{SGD} B_{t-i}^{SGD} + B_t^{SGD}$.
IV. PERFORMANCE OPTIMIZATION FOR FEDERATED LEARNING OVER THE AIR
In this section, we first formulate a joint optimization
problem to reduce the gap for FL over the air, which turns
out to be applicable for both the convex and non-convex
cases, using either GD or SGD implementations. To make it
applicable in practice in the presence of some unobservable
parameters at the PS, we reformulate it into an approximate
problem by imposing a conservative power constraint. To
efficiently solve such an approximate problem, we first identify
a tight solution space and then develop an optimal solution via
discrete programming.
A. Problem Formulation for Joint Optimization
Since we are concerned with convergence accuracy, our optimization problem boils down to minimizing the performance gap for the different cases (i.e., $\Delta_t$, $\Delta_t^{NC}$, and $\Delta_t^{SGD}$) at each iteration. We recognize that solving P1 amounts to iteratively minimizing the gaps $\Delta_t$, $\Delta_t^{NC}$, and $\Delta_t^{SGD}$ under the transmit power constraint in (7). At the $t$-th iteration, the objective functions for the three cases are given by

$\Delta_t = B_t + A_t \Delta_{t-1}$,  (16)
$\Delta_t^{NC} = B_t$,  (17)
$\Delta_t^{SGD} = B_t^{SGD} + A_t^{SGD} \Delta_{t-1}^{SGD}$,  (18)

where $\Delta_0 = 0$ and $\Delta_0^{SGD} = 0$. Note that when the optimization is executed at the $t$-th iteration, $\Delta_{t-1}$ and $\Delta_{t-1}^{SGD}$ can be treated as constants.
Considering the entry-wise transmission of analog aggregation, we remove irrelevant terms and extract the component of the $d$-th entry from the gaps in (16), (17) and (18) as the objective to minimize, given by

$R_t[d] = \frac{\sigma^2}{2\left(\sum_{i=1}^{U} \beta_{i,t}^d K_i b_t^d\right)^2} + \frac{\rho_1 + 2KL\rho_2 \Delta_{t-1}}{2L \sum_{i=1}^{U} K_i \beta_{i,t}^d}, \quad \forall d,$

$R_t^{NC}[d] = \frac{\sigma^2}{2\left(\sum_{i=1}^{U} \beta_{i,t}^d K_i b_t^d\right)^2} + \frac{\rho_1}{2L \sum_{i=1}^{U} K_i \beta_{i,t}^d}, \quad \forall d,$

$R_t^{SGD}[d] = \frac{\sigma^2}{2\left(\sum_{i=1}^{U} \beta_{i,t}^d K_i b_t^d\right)^2} + \frac{U(\rho_1 + 2L\rho_2 \Delta_{t-1}^{SGD})}{2L \sum_{i=1}^{U} K_i \beta_{i,t}^d}, \quad \forall d.$
Since all entries indexed by $d$ are separable with respect to the design parameters, we perform the optimization entry-wise, considering $w_t$ and $w_{i,t}$ one entry at a time; the superscript $d$ and the case labels are omitted hereafter. To determine $\beta_{i,t}$ and $b_t$ at the $t$-th iteration, the PS carries out a joint optimization formulated as follows:

P2: $\min_{\{b_t, \beta_{i,t}\}_{i=1}^{U}} R_t$  (19a)

s.t. $\left|\frac{\beta_{i,t} K_i b_t}{h_{i,t}} w_{i,t}\right|^2 \leq P_i^{\max}$,  (19b)

$\beta_{i,t} \in \{0, 1\}, \quad \forall i \in \{1, 2, \ldots, U\}$,

where $K_i$ should be replaced by $K_b$ in (19b) for the SGD case.

However, the knowledge of $\{w_{i,t}\}_{i=1}^{U}$ required in (19b) is unavailable to the PS due to analog aggregation.
To overcome this issue, we reformulate a practical optimization problem via the approximation $w_{t-1} \approx \frac{1}{U} \sum_{i=1}^{U} w_{i,t}$. According to FL, each local parameter $w_{i,t}$ is updated from the broadcast $w_{t-1}$ along the direction of the averaged gradient over its local data, $-\frac{\alpha}{K_i} \sum_{k=1}^{K_i} \nabla f(w_{t-1}; x_{i,k}, y_{i,k})$. Hence, it is reasonable to make the following common assumption on bounded local gradients, considering that the local gradients can be controlled by adjusting the learning rate or through simple clipping [13], [18].

Assumption 4 (bounded local gradients): The gap between the global parameter $w_{t-1}$ and the local parameter update $w_{i,t}$, $\forall i, t$, is bounded by

$|w_{t-1} - w_{i,t}| \leq \eta$,  (20)

where $\eta \geq 0$ is related to the learning rate $\alpha$ and satisfies the following condition²:

$\eta \geq \max\left\{ \left| \frac{\alpha}{K_i} \sum_{k=1}^{K_i} \nabla f(w; x_{i,k}, y_{i,k}) \right| \right\}_{i=1}^{U}$.  (21)
Under Assumption 4, we reformulate the original optimization problem P2 into the following problem P3, by replacing $w_{i,t}$ in (19b) with its approximation:

P3: $\min_{\{b_t, \beta_{i,t}\}_{i=1}^{U}} R_t$  (22a)

s.t. $\left| \frac{\beta_{i,t} K_i b_t}{h_{i,t}} \right|^2 (|w_{t-1}| + \eta)^2 \leq P_i^{\max}$,  (22b)

$\beta_{i,t} \in \{0, 1\}, \quad \forall i \in \{1, 2, \ldots, U\}$,  (22c)

where the power constraint (22b) is constructed based on the fact that $\left| \frac{\beta_{i,t} K_i b_t}{h_{i,t}} w_{i,t} \right|^2 = \left| \frac{\beta_{i,t} K_i b_t}{h_{i,t}} \right|^2 |w_{i,t}|^2 \leq \left| \frac{\beta_{i,t} K_i b_t}{h_{i,t}} \right|^2 (|w_{t-1}| + \eta)^2$.

²This implies the value range of $\eta$. In practice, $\eta$ can take $|w_{t-1} - w_{t-2}|$. In addition, for the SGD case, we have $\eta \geq \max\left\{ \left| \alpha \mathbb{E}_{\mathcal{D}_i}[\nabla f(w; x_{i,k}, y_{i,k})] \right| \right\}_{i=1}^{U}$.

Since $w_{t-1}$ is always available at the PS, P3 becomes a feasible formulation for adoption in practice. Next, we develop the optimal solution to P3.
B. Optimal Solution to P3 via Discrete Programming
At first glance, a direct solution to P3 leads to mixed integer programming (MIP), which unfortunately incurs high complexity. To solve P3 efficiently, we develop a simple solution by identifying a tight search space without loss of optimality. The tight search space, given in Theorem 4 below, results from the constraints (22b) and (22c), irrespective of the objective function (22a). Hence, it holds universally for any $R_t$, $R_t^{NC}$, and $R_t^{SGD}$.

Theorem 4. When all the required parameters in P3, i.e., $\{P_i^{\max}, w_{t-1}, h_{i,t}, K_i\}_{i=1}^{U}$, are available at the PS, the solution space of $(b_t, \beta_{i,t})$ in P3 can be reduced to the following tight search space without loss of optimality:

$\mathcal{S} = \left\{ \left( b_t^{(k)}, \beta_t^{(k)} \right) \,\middle|\, b_t^{(k)} = \frac{\sqrt{P_k^{\max}}\, h_{k,t}}{K_k (|w_{t-1}| + \eta)},\ \beta_t^{(k)}(b_t^{(k)}) = \left[ \beta_{1,t}^{(k)}, \ldots, \beta_{U,t}^{(k)} \right] \right\}_{k=1}^{U}$,  (23)

where $\beta_t^{(k)}$ is a function of $b_t^{(k)}$, in the form $\beta_{i,t}^{(k)} = H\Big( P_i^{\max} - \big| \frac{K_i b_t^{(k)} (|w_{t-1}| + \eta)}{h_{i,t}} \big|^2 \Big)$, and $H(x)$ is the Heaviside step function, i.e., $H(x) = 1$ for $x \geq 0$, and $H(x) = 0$ otherwise.

Proof. Please see our journal version [17].
Thanks to Theorem 4, we equivalently transform P3 from an MIP into a discrete programming (DP) problem P4 as follows:

P4: $\min_{(b_t, \beta_t) \in \mathcal{S}} R_t = R_t(b_t, \beta_t)$.  (24)

According to P4, the objective $R_t$ can only take on $U$ possible values, corresponding to the $U$ feasible values of $b_t$; meanwhile, given each $b_t$, the value of $\beta_t$ is uniquely determined. Hence the minimum $R_t$ can be obtained via a line search over the $U$ feasible points $(b_t, \beta_t)$ in (23).

Putting it all together, we propose a joint optimization for FL over the air (INFLOTA), which is a dynamic scheduling and power scaling policy. By using the corresponding $R_t$, our INFLOTA can be adapted to all the considered cases, including the convex and non-convex ones, using either GD or SGD implementations.

Remark 1 (Complexity). Our INFLOTA provides a holistic solution for implementing the overall FL at both the PS and worker sides. Its computational complexity is mainly determined by that of the optimization step in P4. The complexity order of the optimization step is low, at $O(U)$, since the search space is reduced to $U$ points only via P4.
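For concreteness, the following is a minimal sketch of the INFLOTA finite-set search for a single entry, using the non-convex objective $R_t^{NC}$: it enumerates the $U$ candidate scalings $b_t^{(k)}$ of (23), derives the feasibility-based scheduling vector, and keeps the minimizer. The constants $L$, $\rho_1$ and all variable names are illustrative assumptions.

```python
import numpy as np

def inflota_search(P_max, h, K, w_prev, eta, sigma2, L, rho1):
    """Enumerate the U candidate points of Theorem 4 and return the minimizer of
    R^NC = sigma^2 / (2 (sum_i beta_i K_i b_t)^2) + rho1 / (2 L sum_i K_i beta_i)."""
    U = len(h)
    best = (np.inf, None, None)
    for k in range(U):
        # Candidate power scaling factor b_t^{(k)} from eq. (23)
        b_k = np.sqrt(P_max[k]) * h[k] / (K[k] * (abs(w_prev) + eta))
        # Scheduling indicator: worker i participates iff its power cap holds
        beta = (P_max - (K * b_k * (abs(w_prev) + eta) / h) ** 2 >= 0).astype(float)
        served = np.sum(K * beta)
        R = sigma2 / (2 * (served * b_k) ** 2) + rho1 / (2 * L * served)
        if R < best[0]:
            best = (R, b_k, beta)
    return best                              # (objective, b_t, beta_t)

rng = np.random.default_rng(3)
U = 20
R, b_t, beta_t = inflota_search(P_max=np.full(U, 10.0), h=rng.exponential(1.0, U),
                                K=rng.integers(50, 100, U).astype(float),
                                w_prev=0.3, eta=0.05, sigma2=1e-4, L=1.0, rho1=0.1)
print(R, b_t, int(beta_t.sum()), "workers scheduled")
```

The loop makes the $O(U)$ complexity of Remark 1 explicit: one objective evaluation per candidate point.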
V. SIMULATION RESULTS AND ANALYSIS
In the simulations, we evaluate the performance of the
proposed INFLOTA for both linear regression and image
classification tasks, which are based on a synthetic dataset
and the MNIST dataset, respectively.
In the considered network, we set $U = 20$, $P_i^{\max} = P^{\max} = 10$ mW for any $i \in [1, U]$, and $\sigma^2 = 10^{-4}$ mW. The wireless channel gain $h_{i,t}$ is generated from an exponential distribution with unit mean, independently over $i$ and $t$.
We use two baseline methods for comparison: a) an FL
algorithm that assumes idealized wireless transmissions with
error-free links to achieve perfect aggregation, and b) an FL
algorithm that randomly determines the power scalar and user
selection. They are referred to as Perfect aggregation and Random policy, respectively.
A. Linear regression experiments
In the linear regression experiments, the synthetic data used to train the FL model is generated randomly from [0, 1]. The input $x$ and the output $y$ follow the function $y = 2x + 1 + 0.4n$, where $n \sim \mathcal{N}(0, 1)$. Since linear regression involves only two parameters, we train a simple two-layer neural network with one neuron in each layer and no activation functions, which is the convex case. The loss function is the MSE between the model prediction $\hat{y}$ and the labeled true output $y$. The learning rate is set to 0.01.
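The synthetic regression data can be reproduced in a few lines; the even split of samples across the 20 workers is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
U, samples_per_worker = 20, 50                  # illustrative split across workers

x = rng.random(U * samples_per_worker)          # inputs drawn from [0, 1]
y = 2 * x + 1 + 0.4 * rng.normal(size=x.size)   # y = 2x + 1 + 0.4n, n ~ N(0, 1)

# Distribute the samples evenly to the U local workers
local_data = list(zip(np.split(x, U), np.split(y, U)))
```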
Fig. 2 shows an example of using FL for linear regression. The optimal result of the linear regression is $y = 2x + 1$, because the original data generation function is $y = 2x + 1 + 0.4n$. In Fig. 2, we can see that the most accurate approximation is achieved by Perfect aggregation, which is the ideal case that ignores the influence of wireless communications. Random policy accounts for wireless communication but performs no optimization; thus, its performance is the worst. Our proposed INFLOTA, which jointly considers the learning and the influence of wireless communication, performs close to the ideal case. This is because our proposed INFLOTA optimizes worker selection and power control so as to reduce the effect of wireless transmission errors on FL.
In Fig. 3, we show how wireless transmission affects the convergence behavior of FL in terms of the value of the loss function. As the number of iterations increases, the MSE values of all the considered learning algorithms decrease at different rates and eventually flatten out, indicating that FL converges. All schemes converge, but to different steady-state values. This behavior shows that channel noise does not prevent the FL algorithm from converging, but it does affect the value that the FL algorithm converges to.
B. Evaluation on the MNIST dataset
Fig. 2: An example of implementing FL for linear regression. Fig. 3: MSE as the number of iterations varies. Fig. 4: Test accuracy as the number of iterations varies.

In order to evaluate the performance of our proposed INFLOTA in realistic application scenarios with real data, we
train a multilayer perceptron (MLP) on the MNIST dataset³ with a 784-neuron input layer, a 64-neuron hidden layer, and a 10-neuron softmax output layer, which is a non-convex case. We adopt cross entropy as the loss function and the rectified linear unit (ReLU) as the activation function. The total number of parameters in the MLP is 50890. The learning rate $\alpha$ is set to 0.1. The MNIST dataset contains 60000 training samples and 10000 test samples. We randomly draw 500-1000 training samples and distribute them to the 20 local workers as their local data. The three trained FL models are then tested on the 10000 test samples. We provide the results of test accuracy versus the iteration index $t$ in Fig. 4. As we can see, our proposed INFLOTA outperforms the Random policy and achieves performance comparable to Perfect aggregation.
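As a quick check on the stated model size, the 784-64-10 MLP indeed has 50890 trainable parameters:

```python
# Parameter count of the 784-64-10 MLP: weights plus biases per layer
n_params = (784 * 64 + 64) + (64 * 10 + 10)
print(n_params)  # 50890
```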
VI. CONCLUSION
In this paper, we have studied the joint optimization of communications and FL over the air with analog aggregation. For the convex and non-convex cases with either GD or SGD implementations, we have derived closed-form expressions for the expected convergence rate of the FL algorithm, which quantify the impact of resource-constrained wireless communications on FL under the analog aggregation paradigm. By analyzing the expected convergence rates, we have proposed a joint optimization scheme of worker selection and power control, which can mitigate the impact of wireless communications on the convergence and performance of FL. More significantly, our joint optimization scheme is applicable to both the convex and non-convex cases, using either GD or SGD implementations. Simulation results show that the proposed optimization scheme is effective in mitigating the impact of wireless communications on FL.
ACKNOWLEDGMENTS
This work was partly supported by the National Natural Sci-
ence Foundation of China (Grants 61871023 and 61931001),
Beijing Natural Science Foundation (Grant 4202054), the
National Science Foundation of the US (Grants 1741338,
1939553, 2003211, 2128596 and 2136202), and VRIF-CCI
(Grant 223996).
3http://yann.lecun.com/exdb/mnist/
REFERENCES
[1] H. B. McMahan, E. Moore, D. Ramage, S. Hampson et al.,
“Communication-efficient learning of deep networks from decentralized
data,” arXiv preprint arXiv:1602.05629, 2016.
[2] J. Konečný, H. B. McMahan, D. Ramage, and P. Richtárik, “Federated optimization: Distributed machine learning for on-device intelligence,” arXiv preprint arXiv:1610.02527, 2016.
[3] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A joint learning and communications framework for federated learning over wireless networks,” IEEE Transactions on Wireless Communications, 2020.
[4] B. Nazer and M. Gastpar, “Computation over multiple-access channels,” IEEE Trans. Inf. Theory, vol. 53, no. 10, pp. 3498–3516, 2007.
[5] M. M. Amiri and D. Gündüz, “Federated learning over wireless fading channels,” IEEE Transactions on Wireless Communications, vol. 19, no. 5, pp. 3546–3557, 2020.
[6] M. M. Amiri, T. M. Duman, and D. Gündüz, “Collaborative machine learning at the wireless edge with blind transmitters,” arXiv preprint arXiv:1907.03909, 2019.
[7] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for low-latency federated edge learning,” IEEE Transactions on Wireless Communications, vol. 19, no. 1, pp. 491–506, 2019.
[8] X. Fan, Y. Wang, Y. Huo, and Z. Tian, “BEV-SGD: Best effort voting SGD for analog aggregation based federated learning against Byzantine attackers,” arXiv preprint arXiv:2110.09660, 2021.
[9] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-
the-air computation,” IEEE Transactions on Wireless Communications,
vol. 19, no. 3, pp. 2022–2035, 2020.
[10] Y. Sun, S. Zhou, and D. Gündüz, “Energy-aware analog aggregation for federated learning with redundant data,” arXiv preprint arXiv:1911.00188, 2019.
[11] X. Fan, Y. Wang, Y. Huo, and Z. Tian, “Communication-efficient federated learning through 1-bit compressive sensing and analog aggregation,” in 2021 IEEE International Conference on Communications Workshops (ICC Workshops). IEEE, 2021, pp. 1–6.
[12] ——, “1-bit compressive sensing for efficient federated learning over the air,” arXiv preprint arXiv:2103.16055, 2021.
[13] J. Wang and G. Joshi, “Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms,” arXiv preprint arXiv:1808.07576, 2018.
[14] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientific, 1996.
[15] M. P. Friedlander and M. Schmidt, “Hybrid deterministic-stochastic
methods for data fitting,” SIAM Journal on Scientific Computing, vol. 34,
no. 3, pp. A1380–A1405, 2012.
[16] K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized
gradient descent,” SIAM Journal on Optimization, vol. 26, no. 3, pp.
1835–1854, 2016.
[17] X. Fan, Y. Wang, Y. Huo, and Z. Tian, “Joint optimization of communications and federated learning over the air,” arXiv preprint arXiv:2104.03490, 2021.
[18] Y. Liu, K. Yuan, G. Wu, Z. Tian, and Q. Ling, “Decentralized dynamic ADMM with quantized and censored communications,” in 2019 53rd Asilomar Conference on Signals, Systems, and Computers. IEEE, 2019, pp. 1496–1500.