Joint Optimization for Federated Learning Over the Air
Xin Fan¹, Yue Wang², Yan Huo¹, and Zhi Tian²
¹ School of Electronics and Information Engineering, Beijing Jiaotong University, Beijing, China
² Department of Electrical & Computer Engineering, George Mason University, Fairfax, VA, USA
E-mails: {yhuo,fanxin}@bjtu.edu.cn, {ywang56,ztian1}@gmu.edu
Abstract—In this paper, we focus on federated learning (FL)
over the air based on analog aggregation transmission in realistic
wireless networks. We first derive a closed-form expression for the
expected convergence rate of FL over the air, which theoretically
quantifies the impact of analog aggregation on FL. Based on
that, we further develop a joint optimization model for accurate
FL implementation, which allows a parameter server to select
a subset of edge devices and determine an appropriate power
scaling factor. Such a joint optimization of device selection and
power control for FL over the air is then formulated as a mixed
integer programming problem. Finally, we efficiently solve this
problem via a simple finite-set search method. Simulation results
show that the proposed solutions developed for wireless channels
outperform a benchmark method and achieve performance comparable
to the ideal case where FL is implemented over
reliable and error-free wireless channels.
Index Terms—Federated learning, analog aggregation, con-
vergence analysis, joint optimization, worker scheduling, power
scaling.
I. INTRODUCTION
Recently, federated learning (FL) has been proposed as a
well acknowledged approach for collaborative edge learning
[1], [2]. In FL, edge devices (local workers) train local models
from their own local data, and then transmit their local updates
to a parameter server (PS). After aggregating these received
local updates, the PS feeds back the averaged update to
the local workers. The updates exchanged iteratively between the PS and
the workers can be either model parameters for model averaging
[1] or parameter gradients for gradient averaging [2]. In
this way, FL relieves communication overhead and protects
user privacy compared to the raw data sharing of traditional
collaborative learning, especially when the local data is large in
volume and privacy-sensitive. Existing work on FL focuses on
FL algorithms given idealized link assumptions, but the impact
of wireless environments on FL performance should be taken
into account in the design of FL deployed in real wireless
systems. Otherwise, the inherent characteristics of wireless
links may introduce unwanted training errors that dramatically
degrade the learning performance in terms of accuracy and
convergence rate.
To solve this problem, research efforts have been spent
on optimizing network resources used for transmitting model
updates in FL [3]. These works of FL over wireless net-
works adopt digital communications, using a transmission-
then-aggregation policy. Unfortunately, the communication
overhead and transmission latency become large as the number
of active workers increases. On the other hand, it is worth
noting that FL aims for global aggregation and hence only
utilizes the averaged updates of distributed workers rather
than the individual local updates from workers. Alternatively,
the nature of waveform superposition in wireless multiple
access channel (MAC) [4] provides a direct and efficient
way for transmission of the averaged updates in FL, also
known as analog aggregation based FL [5]–[10]. As a joint
transmission-and-aggregation policy, analog aggregation trans-
mission enables all the participating workers to simultaneously
upload their local model updates to the PS over the same
time-frequency resources as long as the aggregated waveform
represents the averaged updates, thus substantially reducing
the overhead of wireless communication for FL [11], [12].
While there exist works on analog aggregation based FL [5]–
[10], some of them mainly focus on designing transmission
schemes without optimization [5]–[8], adopting preselected
participating workers and fixed power allocation.
Although optimization issues are considered in [9], [10],
the optimization is conducted on the communication side alone,
without an underlying connection to FL. Notably, while the
optimization in existing works boils down to maximizing the
number of selected workers, our theoretical results indicate
that more workers are not necessarily better over imperfect
links and with limited communication resources. Thus, unlike
these existing works, we seek to analyze the convergence
behavior of analog aggregation based FL, which interprets
the specific relationship between communications and FL
in the paradigm of analog aggregation. Such a meaningful
connection leads to a joint optimization framework for analog
communications and FL. This paper aims at a comprehensive
study on problem formulation, solution development, and
algorithm implementation for the joint design and optimization
of wireless communication and FL. Our key contributions are
summarized as follows:
• We derive closed-form expressions for the expected convergence
rate of FL over the air in both the convex and non-convex cases,
which not only interpret but also quantify the impact of wireless
communications on the convergence and accuracy of FL over the air.
Both full-batch gradient descent (GD) and mini-batch stochastic
gradient descent (SGD) are considered in this work.
• Based on the closed-form theoretical results, we formulate
a joint optimization problem of learning, worker
selection, and power control, with the goal of minimizing
the global FL loss function. The optimization formulation
turns out to be universal for the convex and non-convex
cases with GD and SGD. Further, for practical implementation
of the joint optimization problem in the presence of
some unobservable parameters, we develop an alternative
reformulation that approximates the original unattainable
problem as a feasible optimization problem under the
operational constraints of analog aggregation.
• To efficiently solve the approximate problem, we identify
a tight solution space by exploring the relationship
between the number of workers and the power scaling.
Thanks to the reduced search space, we propose a simple
discrete enumeration method to efficiently find the
globally optimal solution.
Fig. 1: FL via analog aggregation from wirelessly distributed data. [Figure: a parameter server (PS) with the global FL model; model copies $\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_U$ at local workers 1, 2, ..., U; additive noise at the PS.]
II. SYSTEM MODEL
As shown in Fig. 1, we consider a one-hop wireless network
consisting of a single PS at a base station and $U$ user devices as
distributed local workers. Through FL, the PS and all workers
collaborate to train a common model for supervised learning
and data inference, without sharing local data.
A. FL Model
Let $\mathcal{D}_i = \{\mathbf{x}_{i,k}, \mathbf{y}_{i,k}\}_{k=1}^{K_i}$ denote the local dataset at the $i$-th
worker, $i = 1, \ldots, U$, where $\mathbf{x}_{i,k}$ is the input data vector, $\mathbf{y}_{i,k}$
is the labeled output vector, $k = 1, 2, \ldots, K_i$, and $K_i = |\mathcal{D}_i|$
is the number of data samples available at the $i$-th worker.
With $K = \sum_{i=1}^{U} K_i$ samples in total, these $U$ workers seek
to collectively train a learning model parameterized by a global
model parameter $\mathbf{w} = [w^1, \ldots, w^D] \in \mathbb{R}^D$ of dimension $D$,
by minimizing the following loss function

$$\text{(Global loss function)} \quad F(\mathbf{w}; \mathcal{D}) = \frac{1}{K} \sum_{i=1}^{U} \sum_{k=1}^{K_i} f(\mathbf{w}; \mathbf{x}_{i,k}, \mathbf{y}_{i,k}), \tag{1}$$

where the global loss function $F(\mathbf{w}; \mathcal{D})$ is a summation of $K$
data-dependent components, each component $f(\mathbf{w}; \mathbf{x}_{i,k}, \mathbf{y}_{i,k})$
is a sample-wise local function that quantifies the model
prediction error of the same data model parameterized by the
shared model parameter $\mathbf{w}$, and $\mathcal{D} = \bigcup_i \mathcal{D}_i$.
In distributed learning, each worker trains a local model $\mathbf{w}_i$
from its local data $\mathcal{D}_i$, which can be viewed as a local copy
of the global model $\mathbf{w}$. That is, the local loss function is

$$\text{(Local loss function)} \quad F_i(\mathbf{w}_i; \mathcal{D}_i) = \frac{1}{K_i} \sum_{k=1}^{K_i} f(\mathbf{w}_i; \mathbf{x}_{i,k}, \mathbf{y}_{i,k}), \tag{2}$$

where $\mathbf{w}_i = [w_i^1, \ldots, w_i^D] \in \mathbb{R}^D$ is the local model parameter.
Through collaboration, it is desired to achieve $\mathbf{w}_i = \mathbf{w} = \mathbf{w}^*$,
$\forall i$, so that all workers reach the globally optimal model $\mathbf{w}^*$.
Such distributed learning can be formulated via consensus
optimization as [1], [13]

$$\text{P1:} \quad \min_{\mathbf{w}} \ \frac{1}{K} \sum_{i=1}^{U} \sum_{k=1}^{K_i} f(\mathbf{w}_i; \mathbf{x}_{i,k}, \mathbf{y}_{i,k}). \tag{3}$$

To solve P1, this paper adopts a model-averaging algorithm
for FL [1], [13], where gradient descent is applied, and then
the local model at the $i$-th local worker is updated as

$$\text{(Local model updating)} \quad \mathbf{w}_i = \mathbf{w} - \frac{\alpha}{K_i} \sum_{k=1}^{K_i} \nabla f(\mathbf{w}; \mathbf{x}_{i,k}, \mathbf{y}_{i,k}), \tag{4}$$

where $\alpha$ is the learning rate, and $\nabla f(\mathbf{w}; \mathbf{x}_{i,k}, \mathbf{y}_{i,k})$ is the
gradient of $f(\mathbf{w}; \mathbf{x}_{i,k}, \mathbf{y}_{i,k})$ with respect to $\mathbf{w}$.
When local updating is completed, each worker transmits
its updated parameter $\mathbf{w}_i$ to the PS via wireless uplinks to
update the global $\mathbf{w}$ as

$$\text{(Global model updating)} \quad \mathbf{w} = \frac{\sum_{i=1}^{U} K_i \mathbf{w}_i}{K}. \tag{5}$$

Then, the PS broadcasts $\mathbf{w}$ in (5) to all participating workers
as their initial value in the next round. FL implements the
local model updating in (4) and the global model averaging in
(5) iteratively, until convergence.
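The two steps in (4) and (5) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes a scalar linear model with a squared-error sample loss, and the function names (`local_gradient`, `local_update`, `global_average`, `fl_round`) are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair for a scalar linear model (assumption)


def local_gradient(w: List[float], data: List[Sample]) -> List[float]:
    """Average gradient of the assumed loss (w[0]*x + w[1] - y)^2 over local data."""
    g0 = g1 = 0.0
    for x, y in data:
        err = w[0] * x + w[1] - y
        g0 += 2.0 * err * x
        g1 += 2.0 * err
    n = len(data)
    return [g0 / n, g1 / n]


def local_update(w: List[float], data: List[Sample], alpha: float) -> List[float]:
    """Eq. (4): one gradient step starting from the broadcast global model w."""
    g = local_gradient(w, data)
    return [wj - alpha * gj for wj, gj in zip(w, g)]


def global_average(local_models: List[List[float]], sizes: List[int]) -> List[float]:
    """Eq. (5): average of the local models weighted by the sample counts K_i."""
    K = sum(sizes)
    dim = len(local_models[0])
    return [
        sum(K_i * w_i[d] for K_i, w_i in zip(sizes, local_models)) / K
        for d in range(dim)
    ]


def fl_round(w: List[float], datasets: List[List[Sample]], alpha: float) -> List[float]:
    """One FL round: every worker applies (4), then the PS applies (5)."""
    local_models = [local_update(w, d, alpha) for d in datasets]
    return global_average(local_models, [len(d) for d in datasets])
```

Because the weights in (5) are the local sample counts, one such round over error-free links coincides with a single gradient step on the pooled data, which is a convenient sanity check for an implementation.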
B. Analog Aggregation Transmission Model
To avoid heavy communication overhead, we adopt analog
aggregation without coding, which allows multiple workers to
simultaneously upload their updates to the PS over the same
time-frequency resources. The local updates $\mathbf{w}_i$'s are aggregated
over the air to implement the global model updating step
in (5). Such an analog aggregation is conducted in an entry-wise
manner. That is, the $d$-th entries $w_i^d$ from all workers,
$i = 1, \ldots, U$, are aggregated to compute $w^d$ in (5), for any
$d \in [1, D]$.
Let $\mathbf{p}_{i,t} = [p_{i,t}^1, \ldots, p_{i,t}^d, \ldots, p_{i,t}^D]$ denote the power control
vector of worker $i$ at the $t$-th iteration. Notably, the choice
of $\mathbf{p}_{i,t}$ in FL over the air should be made not only to effectively
implement the aggregation rule in (5), but also to properly
accommodate the need for network resource allocation. Accordingly,
we set the power control policy as

$$p_{i,t}^d = \frac{\beta_{i,t}^d K_i b_t^d}{h_{i,t}^d}, \tag{6}$$

where $h_{i,t}^d$ is the channel gain between the $i$-th worker and the
PS at the $t$-th iteration,¹ $b_t^d$ is the power scaling factor, and
$\beta_{i,t}^d$ is a transmission scheduling indicator. That is, $\beta_{i,t}^d = 1$
means that the $d$-th entry of the local model parameter $\mathbf{w}_{i,t}$ of
the $i$-th worker is scheduled to contribute to the FL algorithm
at the $t$-th iteration, and $\beta_{i,t}^d = 0$ otherwise. Through power
scaling, the transmit power used for uploading the $d$-th entry
from the $i$-th worker should not exceed a maximum power
limit $P_i^{d,\max} = P_i^{\max}$ for any $d$, as follows:

$$\left| p_{i,t}^d w_{i,t}^d \right|^2 = \left| \frac{\beta_{i,t}^d K_i b_t^d}{h_{i,t}^d} w_{i,t}^d \right|^2 \le P_i^{\max}. \tag{7}$$

¹ In this paper, we assume the channel state information (CSI) to be constant
within each iteration, though it may vary over iterations. We also assume that the
CSI is perfectly known at the PS, and leave the imperfect CSI case for future
work.
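At the PS side, the received signal at the $t$-th iteration can
be written as

$$\mathbf{y}_t = \sum_{i=1}^{U} \mathbf{p}_{i,t} \odot \mathbf{w}_{i,t} \odot \mathbf{h}_{i,t} + \mathbf{z}_t = \sum_{i=1}^{U} K_i \mathbf{b}_t \odot \boldsymbol{\beta}_{i,t} \odot \mathbf{w}_{i,t} + \mathbf{z}_t,$$

where $\odot$ represents the Hadamard product, $\mathbf{h}_{i,t} = [h_{i,t}^1, h_{i,t}^2, \ldots, h_{i,t}^D]$,
$\boldsymbol{\beta}_{i,t} = [\beta_{i,t}^1, \beta_{i,t}^2, \ldots, \beta_{i,t}^D]$, $\mathbf{b}_t = [b_t^1, b_t^2, \ldots, b_t^D]$,
and $\mathbf{z}_t \sim \mathcal{CN}(\mathbf{0}, \sigma^2 \mathbf{I})$ is additive white
Gaussian noise (AWGN).
Given the received $\mathbf{y}_t$, the PS estimates $\mathbf{w}_t$ via a
post-processing operation as

$$\mathbf{w}_t = \Big( \sum_{i=1}^{U} K_i \boldsymbol{\beta}_{i,t} \odot \mathbf{b}_t \Big)^{-1} \odot \mathbf{y}_t = \Big( \sum_{i=1}^{U} K_i \boldsymbol{\beta}_{i,t} \odot \mathbf{b}_t \Big)^{-1} \odot \mathbf{z}_t + \Big( \sum_{i=1}^{U} K_i \boldsymbol{\beta}_{i,t} \Big)^{-1} \odot \sum_{i=1}^{U} K_i \boldsymbol{\beta}_{i,t} \odot \mathbf{w}_{i,t}, \tag{8}$$

where $(\sum_{i=1}^{U} K_i \boldsymbol{\beta}_{i,t} \odot \mathbf{b}_t)^{-1}$ is a properly chosen scaling
vector to produce equal weighting for the participating $\mathbf{w}_i$'s in (8)
as desired in (5), and $(\mathbf{X})^{-1}$ represents the inverse Hadamard
operation of $\mathbf{X}$ that calculates its entry-wise reciprocal. Notably,
in order to implement the averaging of (5) in FL
over the air, such a post-processing operation requires $\mathbf{b}_t$ to
be the same for all workers for given $t$ and $d$, which allows
$\mathbf{b}_t$ to be eliminated from the model-averaging term in (8).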
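The transmission and post-processing chain in (6) to (8) can be illustrated for a single model entry as follows. This is a simplified sketch under stated assumptions: the noise is passed in as a real number rather than drawn from a complex Gaussian distribution, all quantities are scalars (one entry $d$), and the function names are hypothetical.

```python
from typing import List


def power_scaling(beta_i: int, K_i: int, b: float, h_i: float) -> float:
    """Eq. (6) for one entry: p = beta * K_i * b / h."""
    return beta_i * K_i * b / h_i


def over_the_air_aggregate(
    w_locals: List[float],
    sizes: List[int],
    gains: List[float],
    beta: List[int],
    b: float,
    noise: float = 0.0,
) -> float:
    """Superpose the scaled entries over the channel, then post-process as in (8)."""
    # Received signal: sum_i p_i * w_i * h_i + z (the channel gains cancel out).
    y = noise
    for w_i, K_i, h_i, beta_i in zip(w_locals, sizes, gains, beta):
        y += power_scaling(beta_i, K_i, b, h_i) * w_i * h_i
    # Post-processing: divide by sum_i K_i * beta_i * b.
    scale = sum(K_i * beta_i for K_i, beta_i in zip(sizes, beta)) * b
    if scale == 0.0:
        raise ValueError("at least one worker must be scheduled and b must be nonzero")
    return y / scale
```

With zero noise the estimate equals the sample-weighted average of the scheduled workers, and a nonzero noise sample shifts the estimate by the noise divided by the common scaling, which is why a larger $b_t$ suppresses the noise term in (8).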
III. THE CONVERGENCE ANALYSIS OF FL WITH ANALOG
AGGREGATION
In this section, we study the effect of analog aggregation
transmission on FL over the air, by analyzing its convergence
behavior for both the convex and the non-convex cases. To
average the effects of instantaneous SNRs, we derive the
expected convergence rate of FL over the air, which quantifies
the impact of wireless communications on FL using analog
aggregation transmissions.
A. Convex Case
We first make the following assumptions that are commonly
adopted in the optimization literature [3], [14], [15].
Assumption 1 (Lipschitz continuity, smoothness): The
gradient $\nabla F(\mathbf{w})$ of the loss function $F(\mathbf{w})$ is uniformly
Lipschitz continuous with respect to $\mathbf{w}$, that is,

$$\|\nabla F(\mathbf{w}_{t+1}) - \nabla F(\mathbf{w}_t)\| \le L \|\mathbf{w}_{t+1} - \mathbf{w}_t\|, \quad \forall \mathbf{w}_t, \mathbf{w}_{t+1}, \tag{9}$$

where $L$ is a positive constant.
Assumption 2 (strongly convex): The loss function $F(\mathbf{w})$ is strongly
convex with a positive parameter $\mu$, obeying

$$F(\mathbf{w}_{t+1}) \ge F(\mathbf{w}_t) + (\mathbf{w}_{t+1} - \mathbf{w}_t)^T \nabla F(\mathbf{w}_t) + \frac{\mu}{2} \|\mathbf{w}_{t+1} - \mathbf{w}_t\|^2, \quad \forall \mathbf{w}_t, \mathbf{w}_{t+1}. \tag{10}$$

Assumption 3 (bounded local gradients): The sample-wise
local gradients at local workers are bounded by their
global counterpart [14], [15]

$$\|\nabla f(\mathbf{w}_t)\|^2 \le \rho_1 + \rho_2 \|\nabla F(\mathbf{w}_t)\|^2, \tag{11}$$

where $\rho_1, \rho_2 \ge 0$.
According to [2], [16], the FL algorithm applied over ideal
wireless channels is able to solve P1 and converges to an
optimal $\mathbf{w}^*$. In the presence of wireless transmission errors,
we derive the expected convergence rate of FL over the
air with analog aggregation, as in Theorem 1.
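Theorem 1. Under Assumptions 1–3, let the model updating
rule for $\mathbf{w}_t$ of the FL-over-the-air scheme be given by (8), $\forall t$.
Given the learning rate $\alpha = \frac{1}{L}$, the expected performance gap
$\mathbb{E}[F(\mathbf{w}_t) - F(\mathbf{w}^*)]$ of $\mathbf{w}_t$ at the $t$-th iteration is bounded by

$$\mathbb{E}[F(\mathbf{w}_t) - F(\mathbf{w}^*)] \le \underbrace{\sum_{i=1}^{t-1} \prod_{j=1}^{i} A_{t+1-j} B_{t-i} + B_t}_{\Delta_t} + \prod_{j=1}^{t} A_j \, \mathbb{E}[F(\mathbf{w}_0) - F(\mathbf{w}^*)], \tag{12}$$

where

$$A_t = 1 - \frac{\mu}{L} + \rho_2 \sum_{d=1}^{D} \left( \frac{K}{\sum_{i=1}^{U} K_i \beta_{i,t}^d} - 1 \right)$$

and

$$B_t = \frac{\rho_1}{2L} \sum_{d=1}^{D} \left( \frac{K}{\sum_{i=1}^{U} K_i \beta_{i,t}^d} - 1 \right) + \left\| \Big( \sum_{i=1}^{U} K_i \boldsymbol{\beta}_{i,t} \odot \mathbf{b}_t \Big)^{-1} \right\|^2 \frac{L \sigma^2}{2}.$$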
Proof. All the proofs, which are omitted in this paper due to
the page limit, can be found in our journal version at [17]:
https://arxiv.org/pdf/2104.03490.pdf
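To make the roles of $A_t$ and $B_t$ in Theorem 1 more concrete, the sketch below evaluates both coefficients for a model with a single entry ($D = 1$), so that the norm in $B_t$ reduces to one reciprocal. The function name and the argument layout are hypothetical, and the numbers used in any call are illustrative only.

```python
from typing import List, Tuple


def theorem1_coefficients(
    sizes: List[int],
    beta: List[int],
    b: float,
    mu: float,
    L: float,
    rho1: float,
    rho2: float,
    sigma2: float,
) -> Tuple[float, float]:
    """Return (A_t, B_t) of Theorem 1 for a single model entry (D = 1)."""
    K = sum(sizes)
    K_sel = sum(K_i * beta_i for K_i, beta_i in zip(sizes, beta))
    if K_sel == 0:
        raise ValueError("at least one worker must be selected")
    # K / sum_i K_i * beta_i - 1 vanishes only when every worker is selected.
    selection_penalty = K / K_sel - 1.0
    A = 1.0 - mu / L + rho2 * selection_penalty
    # For D = 1 the squared norm is 1 / (sum_i K_i * beta_i * b)^2.
    B = rho1 / (2.0 * L) * selection_penalty + L * sigma2 / (2.0 * (K_sel * b) ** 2)
    return A, B
```

Dropping a worker raises the selection penalty in both coefficients, while a larger $b_t$ shrinks only the noise part of $B_t$. This is the trade-off between the number of selected workers and the power scaling that the optimization in Section IV resolves.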
B. Non-convex Case
When the loss function $F(\mathbf{w})$ is non-convex, such as in
the case of convolutional neural networks, we derive the
convergence rate of FL over the air with analog aggregation
without Assumption 2, which is
summarized in Theorem 2.
Theorem 2. Under Assumptions 1 and 3 for the non-convex
case, given the learning rate $\alpha = \frac{1}{L}$, the convergence
at the $T$-th iteration is given by

$$\frac{1}{T} \sum_{t=1}^{T} \|\nabla F(\mathbf{w}_{t-1})\|^2 \le \frac{2L}{T \left( 1 - \rho_2 D \left( \frac{K}{K_{\min}} - 1 \right) \right)} \mathbb{E}[F(\mathbf{w}_0) - F(\mathbf{w}^*)] + \frac{2L \sum_{t=1}^{T} B_t}{T \left( 1 - \rho_2 D \left( \frac{K}{K_{\min}} - 1 \right) \right)}. \tag{13}$$
Proof. Please refer to our journal version [17].
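As we can see from Theorem 2, when $T$ is large enough,
we have

$$\min_{0,1,\ldots,T} \mathbb{E}[\|\nabla F(\mathbf{w}_{t-1})\|^2] \le \frac{1}{T} \sum_{t=1}^{T} \|\nabla F(\mathbf{w}_{t-1})\|^2 \overset{T \to \infty}{\le} \underbrace{\frac{2L \sum_{t=1}^{T} B_t}{T \left( 1 - \rho_2 D \left( \frac{K}{K_{\min}} - 1 \right) \right)}}_{\Delta_T^{\mathrm{NC}}}, \tag{14}$$

which guarantees convergence of the FL algorithm to a
stationary point [13]. Similarly, the performance gap at step $t$
for non-convex cases is given by $\Delta_t^{\mathrm{NC}} = \frac{2L \sum_{t=1}^{T} B_t}{T \left( 1 - \rho_2 D \left( \frac{K}{K_{\min}} - 1 \right) \right)}$.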
C. Stochastic gradient descent
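Our work can be extended to stochastic versions of gradient
descent (SGD) as well. Here, we provide convergence analysis
for mini-batch gradient descent with a constant mini-batch size
$K_b$. Theorem 3 summarizes the convergence behavior of SGD
for the strongly convex case.
Theorem 3. Under Assumptions 1, 2 and 3 for the convex
case, given the learning rate $\alpha = \frac{1}{L}$ and the mini-batch size
$K_b$, the convergence behavior of the SGD implementation of
FL over the air is given by

$$\mathbb{E}[F(\mathbf{w}_t) - F(\mathbf{w}^*)] \le \underbrace{\sum_{i=1}^{t-1} \prod_{j=1}^{i} A_{t+1-j}^{\mathrm{SGD}} B_{t-i}^{\mathrm{SGD}} + B_t^{\mathrm{SGD}}}_{\Delta_t^{\mathrm{SGD}}} + \prod_{j=1}^{t} A_j^{\mathrm{SGD}} \, \mathbb{E}[F(\mathbf{w}_0) - F(\mathbf{w}^*)], \tag{15}$$

where

$$A_t^{\mathrm{SGD}} = 1 - \frac{\mu}{L} + \rho_2 \left( \sum_{d=1}^{D} \left( \frac{(\sum_{i=1}^{U} K_b)^2 - 2K (\sum_{i=1}^{U} K_b)}{K^2} + \frac{\sum_{i=1}^{U} K_b}{\sum_{i=1}^{U} K_b \beta_{i,t}^d} \right) + \frac{(\sum_{i=1}^{U} (K_i - K_b))^2}{K^2} \right)$$

and

$$B_t^{\mathrm{SGD}} = \frac{\rho_1}{2L} \left( \sum_{d=1}^{D} \left( \frac{(\sum_{i=1}^{U} K_b)^2 - 2K (\sum_{i=1}^{U} K_b)}{K^2} + \frac{\sum_{i=1}^{U} K_b}{\sum_{i=1}^{U} K_b \beta_{i,t}^d} \right) + \frac{(\sum_{i=1}^{U} (K_i - K_b))^2}{K^2} \right) + \left\| \Big( \sum_{i=1}^{U} K_i \boldsymbol{\beta}_{i,t} \odot \mathbf{b}_t \Big)^{-1} \right\|^2 \frac{L \sigma^2}{2}.$$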
Proof. Please refer to our journal version [17].
From Theorem 3, the cumulative performance gap of FL
after the $t$-th iteration for the SGD case is bounded by
$\Delta_t^{\mathrm{SGD}} = \sum_{i=1}^{t-1} \prod_{j=1}^{i} A_{t+1-j}^{\mathrm{SGD}} B_{t-i}^{\mathrm{SGD}} + B_t^{\mathrm{SGD}}$.
IV. PERFORMANCE OPTIMIZATION FOR FEDERATED
LEARNING OVER THE AIR
In this section, we first formulate a joint optimization
problem to reduce the gap for FL over the air, which turns
out to be applicable for both the convex and non-convex
cases, using either GD or SGD implementations. To make it
applicable in practice in the presence of some unobservable
parameters at the PS, we reformulate it to an approximate
problem by imposing a conservative power constraint. To
efficiently solve such an approximate problem, we first identify
a tight solution space and then develop an optimal solution via
discrete programming.
A. Problem Formulation for Joint Optimization
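Since we are concerned with convergence accuracy, our
optimization problem boils down to minimizing the performance
gap for the different cases (i.e., $\Delta_t$, $\Delta_t^{\mathrm{NC}}$, and $\Delta_t^{\mathrm{SGD}}$)
at each iteration. We recognize that solving P1 amounts to
iteratively minimizing the gaps $\Delta_t$, $\Delta_t^{\mathrm{NC}}$, and $\Delta_t^{\mathrm{SGD}}$ under
the transmit power constraint in (7). At the $t$-th iteration, the
objective functions for those three cases are given by

$$\Delta_t = B_t + A_t \Delta_{t-1}, \tag{16}$$

$$\Delta_t^{\mathrm{NC}} = B_t, \tag{17}$$

$$\Delta_t^{\mathrm{SGD}} = B_t^{\mathrm{SGD}} + A_t^{\mathrm{SGD}} \Delta_{t-1}^{\mathrm{SGD}}, \tag{18}$$

where $\Delta_0 = 0$ and $\Delta_0^{\mathrm{SGD}} = 0$. Note that when the
optimization is executed at the $t$-th iteration, $\Delta_{t-1}$ and $\Delta_{t-1}^{\mathrm{SGD}}$
can be treated as constants.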
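The recursion in (16) is the unrolled form of the gap term in (12). The short check below, a sketch with hypothetical function names and arbitrary illustrative coefficients, computes the gap both ways so the two expressions can be compared numerically.

```python
from typing import Sequence


def gap_recursive(A: Sequence[float], B: Sequence[float]) -> float:
    """Gap from the recursion in (16) with a zero initial gap; A[s-1], B[s-1] hold A_s, B_s."""
    delta = 0.0
    for A_s, B_s in zip(A, B):
        delta = B_s + A_s * delta
    return delta


def gap_closed_form(A: Sequence[float], B: Sequence[float]) -> float:
    """Gap from the sum-of-products expression inside (12)."""
    t = len(A)
    total = B[t - 1]  # the stand-alone B_t term
    for i in range(1, t):
        prod = 1.0
        for j in range(1, i + 1):
            prod *= A[t - j]  # A_{t+1-j}, shifted to 0-based indexing
        total += prod * B[t - i - 1]  # B_{t-i}, shifted to 0-based indexing
    return total
```

The recursive form needs only the previous gap, which is why the PS can treat it as a constant when optimizing at iteration $t$.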
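Considering the entry-wise transmission for analog aggregation,
we remove irrelevant items and extract the component
of the $d$-th entry from the gaps in (16), (17) and (18) as the
objective to minimize, which is given by

$$R_t[d] = \frac{L \sigma^2}{2 \left( \sum_{i=1}^{U} \beta_{i,t}^d K_i b_t^d \right)^2} + \frac{K \rho_1 + 2 K L \rho_2 \Delta_{t-1}}{2L \sum_{i=1}^{U} K_i \beta_{i,t}^d}, \quad \forall d,$$

$$R_t^{\mathrm{NC}}[d] = \frac{L \sigma^2}{2 \left( \sum_{i=1}^{U} \beta_{i,t}^d K_i b_t^d \right)^2} + \frac{K \rho_1}{2L \sum_{i=1}^{U} K_i \beta_{i,t}^d}, \quad \forall d,$$

$$R_t^{\mathrm{SGD}}[d] = \frac{L \sigma^2}{2 \left( \sum_{i=1}^{U} \beta_{i,t}^d K_i b_t^d \right)^2} + \frac{U (\rho_1 + 2 L \rho_2 \Delta_{t-1})}{2L \sum_{i=1}^{U} K_i \beta_{i,t}^d}, \quad \forall d.$$

Since all entries indexed by $d$ are separable with respect
to the design parameters, we perform entry-wise optimization
by considering $\mathbf{w}_t$ and $\mathbf{w}_{i,t}$ one entry at a time, where the
superscript $d$ and the index of different cases are omitted
hereafter. To determine $\beta_{i,t}$ and $b_t$ at the $t$-th iteration, the PS
carries out a joint optimization problem formulated as follows:

$$\text{P2:} \quad \min_{\{b_t, \beta_{i,t}\}_{i=1}^{U}} R_t \tag{19a}$$

$$\text{s.t.} \quad \left| \frac{\beta_{i,t} K_i b_t}{h_{i,t}} w_{i,t} \right|^2 \le P_i^{\max}, \tag{19b}$$

$$\beta_{i,t} \in \{0, 1\}, \quad i \in \{1, 2, \ldots, U\},$$

where $K_i$ should be $K_b$ in (19b) for the SGD case.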
However, in (19b), the knowledge of $\{w_{i,t}\}_{i=1}^{U}$ is needed
but is unavailable to the PS due to analog aggregation.
To overcome this issue, we reformulate a practical optimization
problem via the approximation $w_{t-1} \approx \frac{1}{U} \sum_{i=1}^{U} w_{i,t}$.
According to FL, each local parameter $w_{i,t}$ is updated from the
broadcast $w_{t-1}$ along the direction of the averaged gradient
over its local data, $\frac{\alpha}{K_i} \sum_{k=1}^{K_i} \nabla f(w_{t-1}; \mathbf{x}_{i,k}, \mathbf{y}_{i,k})$. Hence, it
is reasonable to make the following common assumption on
bounded local gradients, considering that the local gradients
can be controlled by adjusting the learning rate or through
simple clipping [13], [18].
Assumption 4 (bounded local gradients): The gap between
the global parameter $w_{t-1}$ and the local parameter
update $w_{i,t}$, $\forall i, t$, is bounded by

$$|w_{t-1} - w_{i,t}| \le \eta, \tag{20}$$

where $\eta \ge 0$ is related to the learning rate $\alpha$ and satisfies the
following condition²

$$\eta \ge \max \left\{ \left| \frac{\alpha}{K_i} \sum_{k=1}^{K_i} \nabla f(w; \mathbf{x}_{i,k}, \mathbf{y}_{i,k}) \right| \right\}_{i=1}^{U}. \tag{21}$$
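Under Assumption 4, we reformulate the original optimization
problem (P2) into the following problem (P3), by
replacing $w_{i,t}$ in (19b) by its approximation:

$$\text{P3:} \quad \min_{\{b_t, \beta_{i,t}\}_{i=1}^{U}} R_t \tag{22a}$$

$$\text{s.t.} \quad \left| \frac{\beta_{i,t} K_i b_t}{h_{i,t}} \right|^2 (|w_{t-1}| + \eta)^2 \le P_i^{\max}, \tag{22b}$$

$$\beta_{i,t} \in \{0, 1\}, \quad i \in \{1, 2, \ldots, U\}, \tag{22c}$$

where the power constraint (22b) is constructed based
on the fact that $\left| \frac{\beta_{i,t} K_i b_t}{h_{i,t}} w_{i,t} \right|^2 = \left| \frac{\beta_{i,t} K_i b_t}{h_{i,t}} \right|^2 |w_{i,t}|^2 \le \left| \frac{\beta_{i,t} K_i b_t}{h_{i,t}} \right|^2 (|w_{t-1}| + \eta)^2$.
² This implies the value range of $\eta$. In practice, $\eta$ can take
$|w_{t-1} - w_{t-2}|$. In addition, for the SGD case, we have
$\eta \ge \max \{ \{ |\alpha \mathbb{E}_{\mathcal{D}_i}[\nabla f(w; \mathbf{x}_{i,k}, \mathbf{y}_{i,k})]| \}_{i=1}^{U} \}$.
Since $w_{t-1}$ is always available at the PS, P3 becomes a
feasible formulation for adoption in practice. Next, we develop
the optimal solution to P3.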
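As an intermediate step that the text does not spell out, constraint (22b) can be rearranged for a scheduled worker ($\beta_{i,t} = 1$). The rearrangement below is a sketch that assumes $b_t > 0$, $h_{i,t} > 0$, and $|w_{t-1}| + \eta > 0$:

$$\left( \frac{K_i b_t}{h_{i,t}} \right)^2 (|w_{t-1}| + \eta)^2 \le P_i^{\max} \iff b_t \le \frac{\sqrt{P_i^{\max}} \, h_{i,t}}{K_i (|w_{t-1}| + \eta)}.$$

Under these assumptions each worker tolerates a largest power scaling factor of its own, and a worker can be scheduled only if the common $b_t$ does not exceed that threshold. Workers with weak channels or large local datasets therefore cap $b_t$ for everyone scheduled alongside them.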
B. Optimal Solution to P3 via Discrete Programming
At first glance, a direct solution to P3 leads to a mixed
integer programming (MIP) problem, which unfortunately incurs high
complexity. To solve P3 in an efficient manner, we develop a
simple solution by identifying a tight search space without loss
of optimality. The tight search space, given in the following
Theorem 4, is a result of the constraints in (22b) and (22c),
irrespective of the objective function (22a). Hence, it holds
universally for any $R_t$, $R_t^{\mathrm{NC}}$, and $R_t^{\mathrm{SGD}}$.
Theorem 4. When all the required parameters in P3, i.e.,
$\{P_i^{\max}, w_{t-1}, h_{i,t}, K_i, \eta\}_{i=1}^{U}$, are available at the PS, the
solution space of $(b_t, \beta_{i,t})$ in P3 can be reduced to the
following tight search space without loss of optimality as

$$\mathcal{S} = \left\{ \left\{ \left( b_t^{(k)}, \beta_{i,t}^{(k)} \right) \right\}_{k=1}^{U} \,\middle|\, b_t^{(k)} = \frac{\sqrt{P_k^{\max}} \, h_{k,t}}{K_k (|w_{t-1}| + \eta)}, \ \boldsymbol{\beta}_t^{(k)}(b_t^{(k)}) = \left[ \beta_{1,t}^{(k)}, \ldots, \beta_{U,t}^{(k)} \right], \ k = 1, \ldots, U \right\}, \tag{23}$$

where $\boldsymbol{\beta}_t^{(k)}$ is a function of $b_t^{(k)}$, in the form $\beta_{i,t}^{(k)} = H\left( \sqrt{P_i^{\max}} - \left| \frac{K_i b_t^{(k)} (|w_{t-1}| + \eta)}{h_{i,t}} \right| \right)$,
and $H(x)$ is the Heaviside step function,
i.e., $H(x) = 1$ for $x > 0$, and $H(x) = 0$ otherwise.
Proof. Please see our journal version [17].
Thanks to Theorem 4, we equivalently transform P3 from an
MIP into a discrete programming (DP) problem P4 as follows:

$$\text{P4:} \quad \min_{(b_t, \boldsymbol{\beta}_t) \in \mathcal{S}} R_t = R_t(b_t, \boldsymbol{\beta}_t). \tag{24}$$

According to P4, the objective $R_t$ can only take on $U$
possible values corresponding to the $U$ feasible values of
$b_t$; meanwhile, given each $b_t$, the value of $\boldsymbol{\beta}_t$ is uniquely
determined. Hence, the minimum $R_t$ can be obtained via a line
search over the $U$ feasible points $(b_t, \boldsymbol{\beta}_t)$ in (23).
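The finite-set search can be sketched as below for one model entry in the convex GD case. This is an illustration under stated assumptions, not the authors' code: each candidate $b_t$ is taken as the largest scaling factor that one worker can sustain under (22b), a worker that meets its power limit with equality is treated as schedulable (an assumption about the boundary case), and the function names `objective` and `inflota_search` are hypothetical.

```python
import math
from typing import List, Tuple


def objective(b: float, beta: List[int], sizes: List[int], L: float,
              sigma2: float, rho1: float, rho2: float, prev_gap: float) -> float:
    """Per-entry objective R_t for the convex GD case: noise term plus selection term."""
    K = sum(sizes)
    K_sel = sum(K_i * beta_i for K_i, beta_i in zip(sizes, beta))
    if K_sel == 0:
        return math.inf  # no worker scheduled: treated as infeasible
    noise_term = L * sigma2 / (2.0 * (K_sel * b) ** 2)
    selection_term = (K * rho1 + 2.0 * K * L * rho2 * prev_gap) / (2.0 * L * K_sel)
    return noise_term + selection_term


def inflota_search(p_max: List[float], gains: List[float], sizes: List[int],
                   w_prev: float, eta: float, L: float, sigma2: float,
                   rho1: float, rho2: float,
                   prev_gap: float) -> Tuple[float, float, List[int]]:
    """Enumerate the U candidate points; return (R_t, b_t, beta_t) with the smallest R_t."""
    bound = abs(w_prev) + eta
    if bound <= 0.0:
        raise ValueError("|w_prev| + eta must be positive")
    # Largest scaling factor each worker can sustain under the power constraint (22b).
    limits = [math.sqrt(P) * h / (K_i * bound)
              for P, h, K_i in zip(p_max, gains, sizes)]
    best: Tuple[float, float, List[int]] = (math.inf, 0.0, [0] * len(sizes))
    for b in limits:
        # Schedule every worker whose own limit is not exceeded by this candidate b.
        beta = [1 if b <= limit else 0 for limit in limits]
        value = objective(b, beta, sizes, L, sigma2, rho1, rho2, prev_gap)
        if value < best[0]:
            best = (value, b, beta)
    return best
```

For a small number of workers, the result can be compared against a brute-force search over all $2^U - 1$ nonempty worker subsets, each paired with the largest $b_t$ its members allow. Swapping in the non-convex or SGD objective only changes the `objective` function, since the candidate set depends on the constraints alone.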
Putting these together, we propose a joint optimization for FL
over the air (INFLOTA), which is a dynamic scheduling and
power scaling policy. By using different $R_t$, INFLOTA
can be adapted to all the considered cases, including the convex
and non-convex cases, using either GD or SGD implementations.
Remark 1. (Complexity) INFLOTA provides a holistic
solution for implementation of the overall FL at both the PS
and worker sides. Its computational complexity is mainly
determined by that of the optimization step in P4. The
complexity order of the optimization step is low at $O(U)$,
since the search space is reduced to only $U$ points via P4.
V. SIMULATION RESULTS AND ANALYSIS
In the simulations, we evaluate the performance of the
proposed INFLOTA for both linear regression and image
classification tasks, which are based on a synthetic dataset
and the MNIST dataset, respectively.
In the considered network, we set $U = 20$, $P_i^{\max} = P^{\max} =
10$ mW for any $i \in [1, U]$, and $\sigma^2 = 10^{-4}$ mW. The wireless
channel gain $h_{i,t}$ is generated from an exponential distribution
with unit mean for different $i$ and $t$.
We use two baseline methods for comparison: a) an FL
algorithm that assumes idealized wireless transmissions with
error-free links to achieve perfect aggregation, and b) an FL
algorithm that randomly determines the power scalar and user
selection. They are named as Perfect aggregation and Random
policy, respectively.
A. Linear regression experiments
In the linear regression experiments, the synthetic data used to
train FL is generated randomly from $[0, 1]$. The input $x$ and
the output $y$ follow the function $y = -2x + 1 + 0.4n$,
where $n \sim \mathcal{N}(0, 1)$. Since linear regression only involves two
parameters, we train a simple two-layer neural network, with
one neuron in each layer and no activation functions, which
is the convex case. The loss function is the MSE between the model
prediction $\hat{y}$ and the labeled true output $y$. The learning rate
is set to 0.01.
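A dataset of this form can be generated with a few lines of standard-library Python. The sketch below assumes that the inputs are drawn uniformly from $[0, 1]$ (the text only says they are generated randomly from that interval), and the function names and seed handling are hypothetical. A closed-form least-squares fit is included only as a reference for the line that training should approach.

```python
import random
from typing import List, Tuple


def make_synthetic_data(n_samples: int, seed: int = 0) -> List[Tuple[float, float]]:
    """Draw (x, y) pairs with y = -2x + 1 + 0.4n, where n is standard normal."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_samples):
        x = rng.uniform(0.0, 1.0)  # assumption: uniform inputs on [0, 1]
        y = -2.0 * x + 1.0 + 0.4 * rng.gauss(0.0, 1.0)
        data.append((x, y))
    return data


def least_squares_line(data: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Closed-form least-squares slope and intercept, used as a reference fit."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

With enough samples the reference fit lands near a slope of $-2$ and an intercept of $1$, the noise-free line underlying the data.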
Fig. 2 shows an example of using FL for linear regression.
The optimal result of the linear regression is $y = -2x + 1$,
because the original data generation function is $y = -2x +
1 + 0.4n$. In Fig. 2, we can see that the most accurate
approximation is achieved by Perfect aggregation, which is
the ideal case without considering the influence of wireless
communications. Random policy considers the influence of
wireless communication but without any optimization. Thus,
its performance is the worst. Our proposed INFLOTA, which
jointly considers the learning and the influence of wireless
communication, performs close to the ideal case. This is because
INFLOTA optimizes worker selection
and power control so as to reduce the effect of wireless
transmission errors on FL.
In Fig. 3, we show how wireless transmission affects the
convergence behavior of FL in terms of the value of the loss
function; FL is regarded as converged once the loss value and
the global FL model remain unchanged. As we can see, as the number of
iterations increases, the MSE values of all the considered
learning algorithms decrease at different rates, and eventually
flatten out to reach their steady state. All schemes converge,
but to different steady-state values. This behavior shows that
the channel noise does not prevent the FL algorithm from
converging, but it affects the value that the FL algorithm
converges to.
B. Evaluation on the MNIST dataset
Fig. 2: An example of implementing FL for linear regression. [Figure: output of the FL algorithm versus input of the FL algorithm; curves: data samples, optimal result, Perfect aggregation, Our INFLOTA, Random policy.]
Fig. 3: MSE as the number of iterations varies. [Figure: MSE versus iteration $t$; curves: Perfect aggregation, Our INFLOTA, Random policy.]
Fig. 4: The test accuracy as the iteration varies. [Figure: test accuracy versus iteration; curves: Perfect aggregation, Our INFLOTA, Random policy.]
In order to evaluate the performance of our proposed
INFLOTA in realistic application scenarios with real data, we
train a multilayer perceptron (MLP) on the MNIST dataset³
with a 784-neuron input layer, a 64-neuron hidden layer, and
a 10-neuron softmax output layer, which is a non-convex case.
We adopt cross entropy as the loss function, and the rectified linear
unit (ReLU) as the activation function. The total number of
parameters in the MLP is 50890. The learning rate $\alpha$ is set
to 0.1. In the MNIST dataset, there are 60000 training samples
and 10000 test samples. We randomly take out 500–1000
training samples and distribute them to 20 local workers as
their local data. Then the three trained FL models are tested with
the 10000 test samples. We provide the results of test accuracy
versus the iteration index $t$ in Fig. 4. As we can see, our
proposed INFLOTA outperforms Random policy, and achieves
performance comparable to Perfect aggregation.
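The parameter count quoted for the MLP can be checked with a short calculation, assuming fully connected layers that each carry one bias per output neuron. The helper name `mlp_parameter_count` is hypothetical.

```python
from typing import Sequence


def mlp_parameter_count(layer_sizes: Sequence[int]) -> int:
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# 784 * 64 + 64 parameters in the hidden layer, 64 * 10 + 10 in the output layer.
print(mlp_parameter_count([784, 64, 10]))  # prints 50890
```

Each of these 50890 entries is aggregated over the air separately, so the entry-wise optimization in P4 is solved once per entry and per iteration.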
VI. CONCLUSION
In this paper, we have studied the joint optimization of
communications and FL over the air with analog aggregation.
Under the convex and non-convex cases with either the GD
or SGD implementations, we respectively derive closed-form
expressions for the expected convergence rate of the FL algo-
rithm, which can quantify the impact of resource-constrained
wireless communications on FL under the analog aggregation
paradigm. Through analyzing the expected convergence rates,
we have proposed a joint optimization scheme of worker
selection and power control, which can mitigate the impact of
wireless communications on the convergence and performance
of FL. More significantly, our joint optimization scheme is
applicable for both the convex and non-convex cases, using
either GD or SGD implementations. Simulation results show
that the proposed optimization scheme is effective in mitigat-
ing the impact of wireless communications on FL.
ACKNOWLEDGMENTS
This work was partly supported by the National Natural Sci-
ence Foundation of China (Grants 61871023 and 61931001),
Beijing Natural Science Foundation (Grant 4202054), the
National Science Foundation of the US (Grants 1741338,
1939553, 2003211, 2128596 and 2136202), and VRIF-CCI
(Grant 223996).
³ http://yann.lecun.com/exdb/mnist/
REFERENCES
[1] H. B. McMahan, E. Moore, D. Ramage, S. Hampson et al.,
“Communication-efficient learning of deep networks from decentralized
data,” arXiv preprint arXiv:1602.05629, 2016.
[2] J. Konečný, H. B. McMahan, D. Ramage, and P. Richtárik, “Federated
optimization: Distributed machine learning for on-device intelligence,”
arXiv preprint arXiv:1610.02527, 2016.
[3] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A joint
learning and communications framework for federated learning over
wireless networks,” IEEE Transactions on Wireless Communications,
2020.
[4] B. Nazer and M. Gastpar, “Computation over multiple-access channels,”
IEEE Trans. Inf. Theory, vol. 53, no. 10, pp. 3498–3516, 2007.
[5] M. M. Amiri and D. Gündüz, “Federated learning over wireless fading
channels,” IEEE Transactions on Wireless Communications, vol. 19,
no. 5, pp. 3546–3557, 2020.
[6] M. M. Amiri, T. M. Duman, and D. Gündüz, “Collaborative machine
learning at the wireless edge with blind transmitters,” arXiv preprint
arXiv:1907.03909, 2019.
[7] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for
low-latency federated edge learning,” IEEE Transactions on Wireless
Communications, vol. 19, no. 1, pp. 491–506, 2019.
[8] X. Fan, Y. Wang, Y. Huo, and Z. Tian, “Bev-sgd: Best effort voting
sgd for analog aggregation based federated learning against byzantine
attackers,” arXiv preprint arXiv:2110.09660, 2021.
[9] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-
the-air computation,” IEEE Transactions on Wireless Communications,
vol. 19, no. 3, pp. 2022–2035, 2020.
[10] Y. Sun, S. Zhou, and D. Gündüz, “Energy-aware analog aggregation
for federated learning with redundant data,” arXiv preprint
arXiv:1911.00188, 2019.
[11] X. Fan, Y. Wang, Y. Huo, and Z. Tian, “Communication-efficient feder-
ated learning through 1-bit compressive sensing and analog aggregation,”
in 2021 IEEE International Conference on Communications Workshops
(ICC Workshops). IEEE, 2021, pp. 1–6.
[12] ——, “1-bit compressive sensing for efficient federated learning over
the air,” arXiv preprint arXiv:2103.16055, 2021.
[13] J. Wang and G. Joshi, “Cooperative sgd: A unified framework for the
design and analysis of communication-efficient sgd algorithms,” arXiv
preprint arXiv:1808.07576, 2018.
[14] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic
Programming. Athena Scientific, 1996.
[15] M. P. Friedlander and M. Schmidt, “Hybrid deterministic-stochastic
methods for data fitting,” SIAM Journal on Scientific Computing, vol. 34,
no. 3, pp. A1380–A1405, 2012.
[16] K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized
gradient descent,” SIAM Journal on Optimization, vol. 26, no. 3, pp.
1835–1854, 2016.
[17] X. Fan, Y. Wang, Y. Huo, and Z. Tian, “Joint optimization of
communications and federated learning over the air,” arXiv preprint
arXiv:2104.03490, 2021.
[18] Y. Liu, K. Yuan, G. Wu, Z. Tian, and Q. Ling, “Decentralized dynamic
admm with quantized and censored communications,” in 2019 53rd
Asilomar Conference on Signals, Systems, and Computers. IEEE, 2019,
pp. 1496–1500.