Joint Optimization for Federated Learning Over the Air

May 2022

DOI:10.1109/ICC45855.2022.9838269

Conference: ICC 2022 - IEEE International Conference on Communications

Authors:

Xin Fan

Beijing Forestry University

Yue Wang

George Mason University

Yan Huo

Beijing Jiaotong University

Joint Optimization for Federated Learning Over the Air

Xin Fan¹, Yue Wang², Yan Huo¹, and Zhi Tian²

¹School of Electronics and Information Engineering, Beijing Jiaotong University, Beijing, China

²Department of Electrical & Computer Engineering, George Mason University, Fairfax, VA, USA

E-mails: {yhuo,fanxin}@bjtu.edu.cn, {ywang56,ztian1}@gmu.edu

Abstract—In this paper, we focus on federated learning (FL)

over the air based on analog aggregation transmission in realistic

wireless networks. We ﬁrst derive a closed-form expression for the

expected convergence rate of FL over the air, which theoretically

quantiﬁes the impact of analog aggregation on FL. Based on

that, we further develop a joint optimization model for accurate

FL implementation, which allows a parameter server to select

a subset of edge devices and determine an appropriate power

scaling factor. Such a joint optimization of device selection and

power control for FL over the air is then formulated as a mixed

integer programming problem. Finally, we efficiently solve this

problem via a simple finite-set search method. Simulation results

show that the proposed solutions developed for wireless channels

outperform a benchmark method, and achieve performance

comparable to the ideal case where FL is implemented over

reliable and error-free wireless channels.

Index Terms—Federated learning, analog aggregation, con-

vergence analysis, joint optimization, worker scheduling, power

scaling.

I. INTRODUCTION

Recently, federated learning (FL) has been proposed as a

well acknowledged approach for collaborative edge learning

[1], [2]. In FL, edge devices (local workers) train local models

from their own local data, and then transmit their local updates

to a parameter server (PS). After aggregating these received

local updates, the PS feeds back the averaged update to

the local workers. These iterative updates between the PS and the

workers can be either model parameters for model averaging

[1] or parameter gradients for gradient averaging [2]. In

this way, FL relieves communication overhead and protects

user privacy compared to the raw data sharing of traditional

collaborative learning, especially when the local data is large in

volume and privacy-sensitive. Existing work on FL focuses on

FL algorithms given idealized link assumptions, but the impact

of wireless environments on FL performance should be taken

into account in the design of FL deployed in real wireless

systems. Otherwise, the inherent characteristics of wireless

links may introduce unwanted training errors that dramatically

degrade the learning performance in terms of accuracy and

convergence rate.

To solve this problem, research efforts have been spent

on optimizing network resources used for transmitting model

updates in FL [3]. These works of FL over wireless net-

works adopt digital communications, using a transmission-

then-aggregation policy. Unfortunately, the communication

overhead and transmission latency become large as the number

of active workers increases. On the other hand, it is worth

noting that FL aims for global aggregation and hence only

utilizes the averaged updates of distributed workers rather

than the individual local updates from workers. Alternatively,

the nature of waveform superposition in wireless multiple

access channel (MAC) [4] provides a direct and efﬁcient

way for transmission of the averaged updates in FL, also

known as analog aggregation based FL [5]–[10]. As a joint

transmission-and-aggregation policy, analog aggregation trans-

mission enables all the participating workers to simultaneously

upload their local model updates to the PS over the same

time-frequency resources as long as the aggregated waveform

represents the averaged updates, thus substantially reducing

the overhead of wireless communication for FL [11], [12].

While there exist works on analog aggregation based FL [5]–

[10], some of them mainly focus on designing transmission

schemes without optimization [5]–[8], adopting preselected

participating workers and fixed power allocation.

Although optimization issues are considered in [9], [10],

the optimization is conducted on the communication side alone,

without an underlying connection to FL. Notably, while the

optimization in existing works boils down to maximizing the

number of selected workers, our theoretical results indicate

that more workers are not necessarily better over imperfect

links and with limited communication resources. Thus, unlike

these existing works, we seek to analyze the convergence

behavior of analog aggregation based FL, which interprets

the speciﬁc relationship between communications and FL

in the paradigm of analog aggregation. Such a meaningful

connection leads to a joint optimization framework for analog

communications and FL. This paper aims at a comprehensive

study on problem formulation, solution development, and

algorithm implementation for the joint design and optimization

of wireless communication and FL. Our key contributions are

summarized as follows:

• We derive closed-form expressions for the expected convergence rate of FL over the air in the convex and non-convex cases, respectively, which not only interpret but also quantify the impact of wireless communications on the convergence and accuracy of FL over the air. Both full-size gradient descent (GD) and mini-batch stochastic gradient descent (SGD) methods are considered in this work.

• Based on the closed-form theoretical results, we formulate a joint optimization problem of learning, worker selection, and power control, with the goal of minimizing the global FL loss function. The optimization formulation turns out to be universal for the convex and non-convex cases with GD and SGD. Further, for practical implementation of the joint optimization problem in the presence of some unobservable parameters, we develop an alternative reformulation that approximates the original unattainable problem as a feasible optimization problem under the operational constraints of analog aggregation.

• To efficiently solve the approximate problem, we identify a tight solution space by exploring the relationship between the number of workers and the power scaling. Thanks to the reduced search space, we propose a simple discrete enumeration method to efficiently find the globally optimal solution.

Fig. 1: FL via analog aggregation from wirelessly distributed data. (Diagram: a parameter server (PS) holding the global FL model receives, over a noisy channel, the local model copies w1, w2, ..., wU of Worker 1, Worker 2, ..., Worker U.)

II. SYSTEM MODEL

As shown in Fig. 1, we consider a one-hop wireless network

consisting of a single PS at a base station and U user devices as

distributed local workers. Through FL, the PS and all workers

collaborate to train a common model for supervised learning

and data inference, without sharing local data.

A. FL Model

Let $\mathcal{D}_i = \{\mathbf{x}_{i,k}, \mathbf{y}_{i,k}\}_{k=1}^{K_i}$ denote the local dataset at the $i$-th worker, $i = 1, \ldots, U$, where $\mathbf{x}_{i,k}$ is the input data vector, $\mathbf{y}_{i,k}$ is the labeled output vector, $k = 1, 2, \ldots, K_i$, and $K_i = |\mathcal{D}_i|$ is the number of data samples available at the $i$-th worker. With $K = \sum_{i=1}^{U} K_i$ samples in total, these $U$ workers seek to collectively train a learning model parameterized by a global model parameter $\mathbf{w} = [w^1, \ldots, w^D] \in \mathbb{R}^D$ of dimension $D$, by minimizing the following loss function

$$\text{(Global loss function)} \quad F(\mathbf{w}; \mathcal{D}) = \frac{1}{K} \sum_{i=1}^{U} \sum_{k=1}^{K_i} f(\mathbf{w}; \mathbf{x}_{i,k}, \mathbf{y}_{i,k}), \tag{1}$$

where the global loss function $F(\mathbf{w}; \mathcal{D})$ is a summation of $K$ data-dependent components, each component $f(\mathbf{w}; \mathbf{x}_{i,k}, \mathbf{y}_{i,k})$ is a sample-wise local function that quantifies the model prediction error of the same data model parameterized by the shared model parameter $\mathbf{w}$, and $\mathcal{D} = \bigcup_i \mathcal{D}_i$.

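In distributed learning, each worker trains a local model $\mathbf{w}_i$ from its local data $\mathcal{D}_i$, which can be viewed as a local copy of the global model $\mathbf{w}$. That is, the local loss function is

$$\text{(Local loss function)} \quad F_i(\mathbf{w}_i; \mathcal{D}_i) = \frac{1}{K_i} \sum_{k=1}^{K_i} f(\mathbf{w}_i; \mathbf{x}_{i,k}, \mathbf{y}_{i,k}), \tag{2}$$

where $\mathbf{w}_i = [w_i^1, \ldots, w_i^D] \in \mathbb{R}^D$ is the local model parameter. Through collaboration, it is desired to achieve $\mathbf{w}_i = \mathbf{w} = \mathbf{w}^*$, $\forall i$, so that all workers reach the globally optimal model $\mathbf{w}^*$. Such distributed learning can be formulated via consensus optimization as [1], [13]

$$\textbf{P1:} \quad \min \sum_{i=1}^{U} \sum_{k=1}^{K_i} f(\mathbf{w}_i; \mathbf{x}_{i,k}, \mathbf{y}_{i,k}). \tag{3}$$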

To solve P1, this paper adopts a model-averaging algorithm for FL [1], [13], where gradient descent is applied, and then the local model at the $i$-th local worker is updated as

$$\text{(Local model updating)} \quad \mathbf{w}_i = \mathbf{w} - \frac{\alpha}{K_i} \sum_{k=1}^{K_i} \nabla f(\mathbf{w}; \mathbf{x}_{i,k}, \mathbf{y}_{i,k}), \tag{4}$$

where $\alpha$ is the learning rate, and $\nabla f(\mathbf{w}; \mathbf{x}_{i,k}, \mathbf{y}_{i,k})$ is the gradient of $f(\mathbf{w}; \mathbf{x}_{i,k}, \mathbf{y}_{i,k})$ with respect to $\mathbf{w}$.

When local updating is completed, each worker transmits its updated parameter $\mathbf{w}_i$ to the PS via wireless uplinks to update the global $\mathbf{w}$ as

$$\text{(Global model updating)} \quad \mathbf{w} = \frac{\sum_{i=1}^{U} K_i \mathbf{w}_i}{K}. \tag{5}$$

Then, the PS broadcasts $\mathbf{w}$ in (5) to all participating workers as their initial value in the next round. FL implements the local model updating in (4) and the global model averaging in (5) iteratively, until convergence.
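As an illustration of how (4) and (5) interact, the following is a minimal sketch of model-averaging FL over ideal (error-free) links. It assumes a squared-error sample loss and a toy noiseless dataset; the function names `local_update`, `global_average`, `run_fl`, and `make_toy_datasets` are hypothetical and do not come from the paper.

```python
import numpy as np

def local_update(w, X, y, alpha):
    """One local step as in (4), assuming the sample loss f = 0.5 * (x^T w - y)^2."""
    grad = X.T @ (X @ w - y) / len(y)      # gradient averaged over the K_i local samples
    return w - alpha * grad

def global_average(local_models, sizes):
    """Sample-size-weighted model averaging as in (5)."""
    sizes = np.asarray(sizes, dtype=float)
    stacked = np.stack(local_models)        # shape (U, D)
    return (sizes[:, None] * stacked).sum(axis=0) / sizes.sum()

def run_fl(datasets, dim, alpha=0.1, rounds=200):
    """Alternate local updating and global averaging for a fixed number of rounds."""
    w = np.zeros(dim)
    sizes = [len(y) for _, y in datasets]
    for _ in range(rounds):
        local_models = [local_update(w, X, y, alpha) for X, y in datasets]
        w = global_average(local_models, sizes)   # broadcast value for the next round
    return w

def make_toy_datasets(seed=0, sizes=(30, 50, 20)):
    """Hypothetical noiseless data y = -2x + 1, split across three workers."""
    rng = np.random.default_rng(seed)
    w_true = np.array([-2.0, 1.0])
    datasets = []
    for k_i in sizes:
        x = rng.uniform(0.0, 1.0, size=k_i)
        X = np.column_stack([x, np.ones(k_i)])    # second column models the intercept
        datasets.append((X, X @ w_true))
    return datasets, w_true
```

Because every worker starts each round from the same broadcast model, the weighted average in (5) of the local steps in (4) coincides with one full-batch gradient step on the global loss in (1), which is why this sketch recovers the generating parameters on noiseless data.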

B. Analog Aggregation Transmission Model

To avoid heavy communication overhead, we adopt analog

aggregation without coding, which allows multiple workers to

simultaneously upload their updates to the PS over the same

time-frequency resources. The local updates $\mathbf{w}_i$'s are aggregated over the air to implement the global model updating step in (5). Such analog aggregation is conducted in an entry-wise manner. That is, the $d$-th entries $w_i^d$ from all workers, $i = 1, \ldots, U$, are aggregated to compute $w^d$ in (5), for any $d \in [1, D]$.

Let $\mathbf{p}_{i,t} = [p_{i,t}^1, \ldots, p_{i,t}^D]$ denote the power control vector of worker $i$ at the $t$-th iteration. Notably, the choice of $\mathbf{p}_{i,t}$ in FL over the air should be made not only to effectively implement the aggregation rule in (5), but also to properly accommodate the need for network resource allocation. Accordingly, we set the power control policy as

$$p_{i,t}^d = \frac{\beta_{i,t}^d K_i b_t^d}{h_{i,t}^d}, \tag{6}$$

where $h_{i,t}^d$ is the channel gain between the $i$-th worker and the PS at the $t$-th iteration (see footnote 1), $b_t^d$ is the power scaling factor, and $\beta_{i,t}^d$ is a transmission scheduling indicator. That is, $\beta_{i,t}^d = 1$ means that the $d$-th entry of the local model parameter $\mathbf{w}_{i,t}$ of the $i$-th worker is scheduled to contribute to the FL algorithm at the $t$-th iteration, and $\beta_{i,t}^d = 0$ otherwise. Through power scaling, the transmit power used for uploading the $d$-th entry from the $i$-th worker should not exceed a maximum power limit $P_i^{d,\max} = P_i^{\max}$ for any $d$, as follows:

$$\left| p_{i,t}^d w_{i,t}^d \right|^2 = \left| \frac{\beta_{i,t}^d K_i b_t^d}{h_{i,t}^d} w_{i,t}^d \right|^2 \le P_i^{\max}. \tag{7}$$

1 In this paper, we assume the channel state information (CSI) to be constant within each iteration, although it may vary over iterations. We also assume that the CSI is perfectly known at the PS, and leave the imperfect CSI case for future work.

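At the PS side, the received signal at the $t$-th iteration can be written as

$$\mathbf{y}_t = \sum_{i=1}^{U} \mathbf{p}_{i,t} \odot \mathbf{w}_{i,t} \odot \mathbf{h}_{i,t} + \mathbf{z}_t = \sum_{i=1}^{U} K_i \mathbf{b}_t \odot \boldsymbol{\beta}_{i,t} \odot \mathbf{w}_{i,t} + \mathbf{z}_t,$$

where $\odot$ represents the Hadamard product, $\mathbf{h}_{i,t} = [h_{i,t}^1, h_{i,t}^2, \ldots, h_{i,t}^D]$, $\boldsymbol{\beta}_{i,t} = [\beta_{i,t}^1, \beta_{i,t}^2, \ldots, \beta_{i,t}^D]$, $\mathbf{b}_t = [b_t^1, b_t^2, \ldots, b_t^D]$, and $\mathbf{z}_t \sim \mathcal{CN}(\mathbf{0}, \sigma^2 \mathbf{I})$ is additive white Gaussian noise (AWGN).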

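Given the received $\mathbf{y}_t$, the PS estimates $\mathbf{w}_t$ via a post-processing operation as

$$\mathbf{w}_t = \left( \sum_{i=1}^{U} K_i \boldsymbol{\beta}_{i,t} \odot \mathbf{b}_t \right)^{-1} \odot \mathbf{y}_t = \left( \sum_{i=1}^{U} K_i \boldsymbol{\beta}_{i,t} \odot \mathbf{b}_t \right)^{-1} \odot \mathbf{z}_t + \left( \sum_{i=1}^{U} K_i \boldsymbol{\beta}_{i,t} \right)^{-1} \odot \sum_{i=1}^{U} K_i \boldsymbol{\beta}_{i,t} \odot \mathbf{w}_{i,t}, \tag{8}$$

where $\left( \sum_{i=1}^{U} K_i \boldsymbol{\beta}_{i,t} \odot \mathbf{b}_t \right)^{-1}$ is a properly chosen scaling vector to produce equal weighting for the participating $\mathbf{w}_i$'s in (8) as desired in (5), and $(\mathbf{X})^{-1}$ represents the inverse Hadamard operation of $\mathbf{X}$ that calculates its entry-wise reciprocal. Notably, in order to implement the averaging of (5) in FL over the air, such a post-processing operation requires $\mathbf{b}_t$ to be the same for all workers for given $t$ and $d$, which allows $\mathbf{b}_t$ to be eliminated from the model-aggregation term in (8).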
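The transmission chain of (6) through (8) can be sketched numerically as follows. This is an illustrative simulation only: the function name `analog_aggregate` is hypothetical, the noise is drawn as real-valued Gaussian for simplicity (the paper models complex AWGN), and every entry is assumed to have at least one scheduled worker so that the scaling vector is nonzero.

```python
import numpy as np

def analog_aggregate(local_models, sizes, beta, b, h, noise_std=0.0, rng=None):
    """Simulate one over-the-air aggregation round, entry by entry.

    local_models: (U, D) local parameters w_{i,t}
    sizes:        (U,)   local sample counts K_i
    beta:         (U, D) 0/1 scheduling indicators
    b:            (D,)   power scaling factors shared by all workers
    h:            (U, D) channel gains, assumed known and nonzero
    """
    rng = np.random.default_rng() if rng is None else rng
    W = np.asarray(local_models, dtype=float)
    K = np.asarray(sizes, dtype=float)[:, None]
    beta = np.asarray(beta, dtype=float)
    b = np.asarray(b, dtype=float)
    h = np.asarray(h, dtype=float)

    p = beta * K * b / h                             # power control policy (6)
    z = noise_std * rng.standard_normal(W.shape[1])  # real-valued noise (simplifying assumption)
    y = (p * W * h).sum(axis=0) + z                  # waveform superposition at the PS
    scale = (K * beta).sum(axis=0) * b               # sum_i K_i * beta_i^d * b^d
    return y / scale                                 # post-processing (8)
```

With `noise_std=0.0` the output reduces to the sample-size-weighted average of the scheduled workers' entries, independent of `b` and `h`; with noise, a larger `b` or more scheduled workers shrinks the noise contribution.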

III. THE CONVERGENCE ANALYSIS OF FL WITH ANALOG

AGGREGATION

In this section, we study the effect of analog aggregation

transmission on FL over the air, by analyzing its convergence

behavior for both the convex and the non-convex cases. To

average the effects of instantaneous SNRs, we derive the

expected convergence rate of FL over the air, which quantiﬁes

the impact of wireless communications on FL using analog

aggregation transmissions.

A. Convex Case

We ﬁrst make the following assumptions that are commonly

adopted in the optimization literature [3], [14], [15].

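Assumption 1 (Lipschitz continuity, smoothness): The gradient $\nabla F(\mathbf{w})$ of the loss function $F(\mathbf{w})$ is uniformly Lipschitz continuous with respect to $\mathbf{w}$, that is,

$$\| \nabla F(\mathbf{w}_{t+1}) - \nabla F(\mathbf{w}_t) \| \le L \| \mathbf{w}_{t+1} - \mathbf{w}_t \|, \quad \forall \mathbf{w}_t, \mathbf{w}_{t+1}, \tag{9}$$

where $L$ is a positive constant.

Assumption 2 (strongly convex): $F(\mathbf{w})$ is strongly convex with a positive parameter $\mu$, obeying

$$F(\mathbf{w}_{t+1}) \ge F(\mathbf{w}_t) + (\mathbf{w}_{t+1} - \mathbf{w}_t)^T \nabla F(\mathbf{w}_t) + \frac{\mu}{2} \| \mathbf{w}_{t+1} - \mathbf{w}_t \|^2, \quad \forall \mathbf{w}_t, \mathbf{w}_{t+1}. \tag{10}$$

Assumption 3 (bounded local gradients): The sample-wise local gradients at local workers are bounded by their global counterpart [14], [15]

$$\| \nabla f(\mathbf{w}_t) \|^2 \le \rho_1 + \rho_2 \| \nabla F(\mathbf{w}_t) \|^2, \tag{11}$$

where $\rho_1, \rho_2 \ge 0$.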

According to [2], [16], the FL algorithm applied over ideal

wireless channels is able to solve P1 and converges to an

optimal w∗. In the presence of wireless transmission errors,

we derive the expected convergence rate of the FL over the

air with analog aggregation, as in Theorem 1.

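Theorem 1. Adopt Assumptions 1-3, and let the model updating rule for $\mathbf{w}_t$ of the FL-over-the-air scheme be given by (8), $\forall t$. Given the learning rate $\alpha = \frac{1}{L}$, the expected performance gap $\mathbb{E}[F(\mathbf{w}_t) - F(\mathbf{w}^*)]$ of $\mathbf{w}_t$ at the $t$-th iteration is bounded by

$$\mathbb{E}[F(\mathbf{w}_t) - F(\mathbf{w}^*)] \le \underbrace{\sum_{i=1}^{t-1} \prod_{j=1}^{i} A_{t+1-j} B_{t-i} + B_t}_{\Delta_t} + \prod_{j=1}^{t} A_j \, \mathbb{E}[F(\mathbf{w}_0) - F(\mathbf{w}^*)], \tag{12}$$

where

$$A_t = 1 - \frac{\mu}{L} + \rho_2 \sum_{d=1}^{D} \left( \frac{K}{\sum_{i=1}^{U} K_i \beta_{i,t}^d} - 1 \right), \qquad B_t = \frac{\rho_1}{2L} \sum_{d=1}^{D} \left( \frac{K}{\sum_{i=1}^{U} K_i \beta_{i,t}^d} - 1 \right) + \left\| \left( \sum_{i=1}^{U} K_i \boldsymbol{\beta}_{i,t} \odot \mathbf{b}_t \right)^{-1} \right\|^2 \frac{L \sigma^2}{2}.$$

Proof. All the proofs, which are omitted in this paper due to the page limit, can be found in our journal version [17]: https://arxiv.org/pdf/2104.03490.pdf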

B. Non-convex Case

When the loss function $F(\mathbf{w})$ is non-convex, such as in the case of convolutional neural networks, we derive the convergence rate of FL over the air with analog aggregation without Assumption 2, as summarized in Theorem 2.

Theorem 2. Under Assumptions 1 and 3 for the non-convex case, given the learning rate $\alpha = \frac{1}{L}$, the convergence at the $T$-th iteration is given by

$$\frac{1}{T} \sum_{t=1}^{T} \| \nabla F(\mathbf{w}_{t-1}) \|^2 \le \frac{2L}{T \left( 1 - \rho_2 D \left( \frac{K}{K_{\min}} - 1 \right) \right)} \mathbb{E}[F(\mathbf{w}_0) - F(\mathbf{w}^*)] + \frac{2L \sum_{t=1}^{T} B_t}{T \left( 1 - \rho_2 D \left( \frac{K}{K_{\min}} - 1 \right) \right)}. \tag{13}$$

Proof. Please refer to our journal version [17].

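As we can see from Theorem 2, when $T$ is large enough, we have

$$\min_{0, 1, \ldots, T} \mathbb{E}[\| \nabla F(\mathbf{w}_{t-1}) \|^2] \le \frac{1}{T} \sum_{t=1}^{T} \| \nabla F(\mathbf{w}_{t-1}) \|^2 \overset{T \to \infty}{\le} \underbrace{\frac{2L \sum_{t=1}^{T} B_t}{T \left( 1 - \rho_2 D \left( \frac{K}{K_{\min}} - 1 \right) \right)}}_{\Delta^{NC}}, \tag{14}$$

which guarantees convergence of the FL algorithm to a stationary point [13]. Similarly, the performance gap at step $t$ for the non-convex case is given by $\Delta_t^{NC} = \frac{2L \sum_{t=1}^{T} B_t}{T \left( 1 - \rho_2 D \left( \frac{K}{K_{\min}} - 1 \right) \right)}$.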

C. Stochastic gradient descent

Our work can be extended to stochastic versions of gradient

descent (SGD) as well. Here, we provide convergence analysis

for mini-batch gradient descent with a constant mini-batch size

Kb. Theorem 3 summarizes the convergence behavior of SGD

for the strongly convex case.

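Theorem 3. Under Assumptions 1, 2 and 3 for the convex case, given the learning rate $\alpha = \frac{1}{L}$ and the mini-batch size $K_b$, the convergence behavior of the SGD implementation of FL over the air is given by

$$\mathbb{E}[F(\mathbf{w}_t) - F(\mathbf{w}^*)] \le \underbrace{\sum_{i=1}^{t-1} \prod_{j=1}^{i} A_{t+1-j}^{SGD} B_{t-i}^{SGD} + B_t^{SGD}}_{\Delta_t^{SGD}} + \prod_{j=1}^{t} A_j^{SGD} \, \mathbb{E}[F(\mathbf{w}_0) - F(\mathbf{w}^*)], \tag{15}$$

where

$$A_t^{SGD} = 1 - \frac{\mu}{L} + \rho_2 \left( \sum_{d=1}^{D} \left( \frac{\left( \sum_{i=1}^{U} K_b \right)^2 - 2K \sum_{i=1}^{U} K_b}{K^2} + \frac{\sum_{i=1}^{U} K_b}{\sum_{i=1}^{U} K_b \beta_{i,t}^d} \right) + \frac{\left( \sum_{i=1}^{U} (K_i - K_b) \right)^2}{K^2} \right)$$

and

$$B_t^{SGD} = \frac{\rho_1}{2L} \left( \sum_{d=1}^{D} \left( \frac{\left( \sum_{i=1}^{U} K_b \right)^2 - 2K \sum_{i=1}^{U} K_b}{K^2} + \frac{\sum_{i=1}^{U} K_b}{\sum_{i=1}^{U} K_b \beta_{i,t}^d} \right) + \frac{\left( \sum_{i=1}^{U} (K_i - K_b) \right)^2}{K^2} \right) + \left\| \left( \sum_{i=1}^{U} K_i \boldsymbol{\beta}_{i,t} \odot \mathbf{b}_t \right)^{-1} \right\|^2 \frac{L \sigma^2}{2}.$$

Proof. Please refer to our journal version [17].

From Theorem 3, the cumulative performance gap of FL after the $t$-th iteration for the SGD case is bounded by $\Delta_t^{SGD} = \sum_{i=1}^{t-1} \prod_{j=1}^{i} A_{t+1-j}^{SGD} B_{t-i}^{SGD} + B_t^{SGD}$.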

IV. PERFORMANCE OPTIMIZATION FOR FEDERATED

LEARNING OVER THE AIR

In this section, we ﬁrst formulate a joint optimization

problem to reduce the gap for FL over the air, which turns

out to be applicable for both the convex and non-convex

cases, using either GD or SGD implementations. To make it

applicable in practice in the presence of some unobservable

parameters at the PS, we reformulate it to an approximate

problem by imposing a conservative power constraint. To

efﬁciently solve such an approximate problem, we ﬁrst identify

a tight solution space and then develop an optimal solution via

discrete programming.

A. Problem Formulation for Joint Optimization

Since we are concerned with convergence accuracy, our optimization problem boils down to minimizing the performance gap for the different cases (i.e., $\Delta_t$, $\Delta_t^{NC}$, and $\Delta_t^{SGD}$) at each iteration. We recognize that solving P1 amounts to iteratively minimizing the gaps $\Delta_t$, $\Delta_t^{NC}$, and $\Delta_t^{SGD}$ under the transmit power constraint in (7). At the $t$-th iteration, the objective functions for the three cases are given by

$$\Delta_t = B_t + A_t \Delta_{t-1}, \tag{16}$$

$$\Delta_t^{NC} = B_t, \tag{17}$$

$$\Delta_t^{SGD} = B_t^{SGD} + A_t^{SGD} \Delta_{t-1}^{SGD}, \tag{18}$$

where $\Delta_0 = 0$ and $\Delta_0^{SGD} = 0$. Note that when the optimization is executed at the $t$-th iteration, $\Delta_{t-1}$ and $\Delta_{t-1}^{SGD}$ can be treated as constants.
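The recursion in (16) with a zero initial gap unrolls to the sum-of-products expression for the gap in Theorem 1. The short check below illustrates this equivalence numerically; the function names are hypothetical and the sequences of $A_t$ and $B_t$ are arbitrary illustrative values, not quantities computed from the paper's model.

```python
import math

def gap_closed_form(A, B):
    """Gap after t rounds via the double sum-product; A[k-1] = A_k and B[k-1] = B_k."""
    t = len(B)
    total = B[t - 1]                                   # the stand-alone B_t term
    for i in range(1, t):                              # i = 1, ..., t-1
        prod = math.prod(A[t - j] for j in range(1, i + 1))  # A_{t+1-j} for j = 1..i
        total += prod * B[t - i - 1]                   # multiplied by B_{t-i}
    return total

def gap_recursive(A, B):
    """Same gap via the recursion gap_t = B_t + A_t * gap_{t-1}, starting from 0."""
    delta = 0.0
    for a_t, b_t in zip(A, B):
        delta = b_t + a_t * delta
    return delta
```

For example, with `A = [0.9, 0.8, 0.7]` and `B = [1.0, 2.0, 3.0]` both functions return 3 + 0.7·2 + 0.7·0.8·1 = 4.96 up to floating-point rounding. The recursive form is the one that makes a per-iteration optimization possible, since only the current $A_t$ and $B_t$ depend on the current decision variables.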

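Considering the entry-wise transmission for analog aggregation, we remove irrelevant items and extract the component of the $d$-th entry from the gaps in (16), (17) and (18) as the objective to minimize, which is given by

$$R_t[d] = \frac{L \sigma^2}{2 \left( \sum_{i=1}^{U} \beta_{i,t}^d K_i b_t^d \right)^2} + \frac{K \rho_1 + 2 K L \rho_2 \Delta_{t-1}}{2L \sum_{i=1}^{U} K_i \beta_{i,t}^d}, \quad \forall d,$$

$$R_t^{NC}[d] = \frac{L \sigma^2}{2 \left( \sum_{i=1}^{U} \beta_{i,t}^d K_i b_t^d \right)^2} + \frac{K \rho_1}{2L \sum_{i=1}^{U} K_i \beta_{i,t}^d}, \quad \forall d,$$

$$R_t^{SGD}[d] = \frac{L \sigma^2}{2 \left( \sum_{i=1}^{U} \beta_{i,t}^d K_i b_t^d \right)^2} + \frac{U (\rho_1 + 2 L \rho_2 \Delta_{t-1})}{2L \sum_{i=1}^{U} K_i \beta_{i,t}^d}, \quad \forall d.$$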

Since all entries indexed by $d$ are separable with respect to the design parameters, we perform entry-wise optimization by considering $\mathbf{w}_t$ and $\mathbf{w}_{i,t}$ one entry at a time, where the superscript $d$ and the index of the different cases are omitted hereafter. To determine $\beta_{i,t}$ and $b_t$ at the $t$-th iteration, the PS solves a joint optimization problem formulated as follows:

$$\textbf{P2:} \quad \min_{\{b_t, \beta_{i,t}\}_{i=1}^{U}} R_t \tag{19a}$$

$$\text{s.t.} \quad \left| \frac{\beta_{i,t} K_i b_t}{h_{i,t}} w_{i,t} \right|^2 \le P_i^{\max}, \tag{19b}$$

$$\beta_{i,t} \in \{0, 1\}, \quad i \in \{1, 2, \ldots, U\},$$

where $K_i$ should be $K_b$ in (19b) for the SGD case.

However, in (19b), the knowledge of $\{w_{i,t}\}_{i=1}^{U}$ is needed but is unavailable to the PS due to analog aggregation.

To overcome this issue, we reformulate a practical optimization problem via the approximation $w_{t-1} \approx \frac{1}{U} \sum_{i=1}^{U} w_{i,t}$. According to FL, each local parameter $w_{i,t}$ is updated from the broadcast $w_{t-1}$ along the direction of the averaged gradient over its local data, $\frac{\alpha}{K_i} \sum_{k=1}^{K_i} \nabla f(w_{t-1}; \mathbf{x}_{i,k}, \mathbf{y}_{i,k})$. Hence, it is reasonable to make the following common assumption on bounded local gradients, considering that the local gradients can be controlled by adjusting the learning rate or through simple clipping [13], [18].

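Assumption 4 (bounded local gradients): The gap between the global parameter $w_{t-1}$ and the local parameter update $w_{i,t}$, $\forall i, t$, is bounded by

$$|w_{t-1} - w_{i,t}| \le \eta, \tag{20}$$

where $\eta \ge 0$ is related to the learning rate $\alpha$ and satisfies the following condition (see footnote 2)

$$\eta \ge \max \left\{ \left\{ \left| \frac{\alpha}{K_i} \sum_{k=1}^{K_i} \nabla f(w; \mathbf{x}_{i,k}, \mathbf{y}_{i,k}) \right| \right\}_{i=1}^{U} \right\}. \tag{21}$$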

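Under Assumption 4, we reformulate the original optimization problem (P2) into the following problem (P3), by replacing $w_{i,t}$ in (19b) by its approximation:

$$\textbf{P3:} \quad \min_{\{b_t, \beta_{i,t}\}_{i=1}^{U}} R_t \tag{22a}$$

$$\text{s.t.} \quad \left| \frac{\beta_{i,t} K_i b_t}{h_{i,t}} \right|^2 (|w_{t-1}| + \eta)^2 \le P_i^{\max}, \tag{22b}$$

$$\beta_{i,t} \in \{0, 1\}, \quad i \in \{1, 2, \ldots, U\}, \tag{22c}$$

where the power constraint (22b) is constructed based on the fact that $\left| \frac{\beta_{i,t} K_i b_t}{h_{i,t}} w_{i,t} \right|^2 = \left| \frac{\beta_{i,t} K_i b_t}{h_{i,t}} \right|^2 |w_{i,t}|^2 \le \left| \frac{\beta_{i,t} K_i b_t}{h_{i,t}} \right|^2 (|w_{t-1}| + \eta)^2$.

2 This implies the value range of $\eta$. In practice, $\eta$ can take $|w_{t-1} - w_{t-2}|$. In addition, for the SGD case, we have $\eta \ge \max \{ \{ | \alpha \mathbb{E}_{\mathcal{D}_i}[\nabla f(w; \mathbf{x}_{i,k}, \mathbf{y}_{i,k})] | \}_{i=1}^{U} \}$.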

Since $w_{t-1}$ is always available at the PS, P3 becomes a feasible formulation for adoption in practice. Next, we develop the optimal solution to P3.

B. Optimal Solution to P3 via Discrete Programming

At first glance, a direct solution to P3 leads to a mixed integer programming (MIP) problem, which unfortunately incurs high complexity. To solve P3 in an efficient manner, we develop a simple solution by identifying a tight search space without loss of optimality. The tight search space, given in the following Theorem 4, is a result of the constraints in (22b) and (22c), irrespective of the objective function (22a). Hence, it holds universally for any $R_t$, $R_t^{NC}$, and $R_t^{SGD}$.

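Theorem 4. When all the required parameters in P3, i.e., $\{P_i^{\max}, w_{t-1}, h_{i,t}, K_i, \eta\}_{i=1}^{U}$, are available at the PS, the solution space of $(b_t, \beta_{i,t})$ in P3 can be reduced to the following tight search space without loss of optimality:

$$\mathcal{S} = \left\{ \left( b_t^{(k)}, \boldsymbol{\beta}_t^{(k)} \right) \right\}_{k=1}^{U}, \quad b_t^{(k)} = \left| \frac{\sqrt{P_k^{\max}} \, h_{k,t}}{K_k (|w_{t-1}| + \eta)} \right|, \quad \boldsymbol{\beta}_t^{(k)} (b_t^{(k)}) = \left[ \beta_{1,t}^{(k)}, \ldots, \beta_{U,t}^{(k)} \right], \quad k = 1, \ldots, U, \tag{23}$$

where $\boldsymbol{\beta}_t^{(k)}$ is a function of $b_t^{(k)}$, in the form $\beta_{i,t}^{(k)} = H \left( P_i^{\max} - \left| \frac{K_i b_t^{(k)} (|w_{t-1}| + \eta)}{h_{i,t}} \right|^2 \right)$, and $H(x)$ is the Heaviside step function, i.e., $H(x) = 1$ for $x > 0$, and $H(x) = 0$ otherwise.

Proof. Please see our journal version [17].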

Thanks to Theorem 4, we equivalently transform P3 from an MIP into a discrete programming (DP) problem P4 as follows:

$$\textbf{P4:} \quad \min_{(b_t, \boldsymbol{\beta}_t) \in \mathcal{S}} R_t = R_t(b_t, \boldsymbol{\beta}_t). \tag{24}$$

According to P4, the objective $R_t$ can only take on $U$ possible values corresponding to the $U$ feasible values of $b_t$; meanwhile, given each $b_t$, the value of $\boldsymbol{\beta}_t$ is uniquely determined. Hence the minimum $R_t$ can be obtained via line search over the $U$ feasible points $(b_t, \boldsymbol{\beta}_t)$ in (23).
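The finite-set search over the $U$ candidate points can be sketched as follows for a single model entry in the convex GD case. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the objective follows the expression for $R_t$ given earlier, and a worker is treated as schedulable when its power budget in (22b) is met with equality or slack (the theorem statement uses a strict Heaviside step, so the boundary handling here is an assumption).

```python
import numpy as np

def objective(b, beta, K_i, L, sigma2, rho1, rho2, delta_prev):
    """R_t for one entry: noise term plus worker-selection term."""
    K = K_i.sum()
    s = (K_i * beta).sum()                   # sum_i K_i * beta_i
    if s == 0:
        return np.inf                        # no worker scheduled
    noise_term = L * sigma2 / (2.0 * (s * b) ** 2)
    selection_term = (K * rho1 + 2.0 * K * L * rho2 * delta_prev) / (2.0 * L * s)
    return noise_term + selection_term

def max_scaling(p_max, h, K_i, w_prev, eta):
    """Largest power scaling each worker can afford under constraint (22b)."""
    return np.sqrt(p_max) * np.abs(h) / (K_i * (abs(w_prev) + eta))

def inflota_search(p_max, h, K_i, w_prev, eta, **obj_params):
    """Evaluate the U candidate (b, beta) pairs and return the best one."""
    b_cand = max_scaling(p_max, h, K_i, w_prev, eta)
    best_r, best_b, best_beta = np.inf, None, None
    for b in b_cand:
        beta = (b_cand >= b).astype(float)   # workers whose budget supports this b
        r = objective(b, beta, K_i, **obj_params)
        if r < best_r:
            best_r, best_b, best_beta = r, b, beta
    return best_r, best_b, best_beta
```

Each candidate sets $b_t$ to the largest value one particular worker can support and schedules every worker that can also support it, so the loop trades a larger scaling factor (less noise) against a larger set of participants (less selection error) in $U$ evaluations.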

Putting these together, we propose a joint optimization for FL over the air (INFLOTA), which is a dynamic scheduling and power scaling policy. By using different $R_t$, our INFLOTA can be adjusted to all the considered cases, including the convex and non-convex cases, using either GD or SGD implementations.

Remark 1 (Complexity). Our INFLOTA provides a holistic solution for implementation of the overall FL at both the PS and the worker sides. Its computational complexity is mainly determined by that of the optimization step in P4. The complexity order of the optimization step is low at $O(U)$, since the search space is reduced to only $U$ points via P4.

V. SIMULATION RESULTS AND ANALYSIS

In the simulations, we evaluate the performance of the

proposed INFLOTA for both linear regression and image

classiﬁcation tasks, which are based on a synthetic dataset

and the MNIST dataset, respectively.

In the considered network, we set $U = 20$, $P_i^{\max} = P^{\max} = 10$ mW for any $i \in [1, U]$, and $\sigma^2 = 10^{-4}$ mW. The wireless channel gain $h_{i,t}$ is generated from an exponential distribution with unit mean for different $i$ and $t$.

We use two baseline methods for comparison: a) an FL

algorithm that assumes idealized wireless transmissions with

error-free links to achieve perfect aggregation, and b) an FL

algorithm that randomly determines the power scalar and user

selection. They are named as Perfect aggregation and Random

policy, respectively.

A. Linear regression experiments

In the linear regression experiments, the synthetic data used to train FL is generated randomly from $[0, 1]$. The input $x$ and the output $y$ follow the function $y = -2x + 1 + 0.4n$, where $n \sim \mathcal{N}(0, 1)$. Since linear regression only involves two parameters, we train a simple two-layer neural network, with one neuron in each layer and without activation functions, which is the convex case. The loss function is the MSE between the model prediction $\hat{y}$ and the labeled true output $y$. The learning rate is set to 0.01.
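For reference, data of this form can be generated and fitted centrally as sketched below. The sketch assumes the inputs are drawn uniformly from $[0, 1]$ (the text only says "randomly"), and the helper names `make_regression_data` and `fit_line` are hypothetical; the least-squares fit is the minimizer of the MSE loss that any of the FL schemes would ideally approach.

```python
import numpy as np

def make_regression_data(num_samples, seed=0):
    """Draw x in [0, 1] and y = -2x + 1 + 0.4 n with standard normal n."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=num_samples)   # uniform sampling is an assumption
    n = rng.standard_normal(num_samples)
    y = -2.0 * x + 1.0 + 0.4 * n
    return x, y

def fit_line(x, y):
    """Centralized least-squares slope and intercept (the MSE minimizer)."""
    X = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1]
```

With enough samples the fitted slope and intercept approach -2 and 1, the noise-free generating line.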

Fig. 2 shows an example of using FL for linear regression. The optimal result of the linear regression is $y = -2x + 1$, because the original data generation function is $y = -2x + 1 + 0.4n$. In Fig. 2, we can see that the most accurate approximation is achieved by Perfect aggregation, which is the ideal case without considering the influence of wireless communications. Random policy considers the influence of wireless communication but without any optimization, so its performance is the worst. Our proposed INFLOTA, which jointly considers the learning and the influence of wireless communication, performs close to the ideal case. This is because INFLOTA optimizes worker selection and power control so as to reduce the effect of wireless transmission errors on FL.

In Fig. 3, we show how wireless transmission affects the convergence behavior of FL in terms of the value of the loss function; FL is considered to have converged once the loss value and the global FL model remain unchanged. As we can see, as the number of iterations increases, the MSE values of all the considered learning algorithms decrease at different rates, and eventually flatten out to reach their steady state. All schemes converge, but to different steady-state values. This behavior shows that the channel noise does not prevent the FL algorithm from converging, but it affects the value that the FL algorithm converges to.

B. Evaluation on the MNIST dataset

Fig. 2: An example of implementing FL for linear regression. (Plot of the output of the FL algorithm; legend: Data samples, Optimal result, Perfect aggregation, Our INFLOTA, Random policy.)

Fig. 3: MSE as the number of iterations varies. (Axes: iteration t versus MSE; legend: Perfect aggregation, Our INFLOTA, Random policy.)

Fig. 4: The test accuracy as the iteration varies. (Axes: iteration versus test accuracy; legend: Perfect aggregation, Our INFLOTA, Random policy.)

In order to evaluate the performance of our proposed INFLOTA in realistic application scenarios with real data, we train a multilayer perceptron (MLP) on the MNIST dataset (see footnote 3) with a 784-neuron input layer, a 64-neuron hidden layer, and a 10-neuron softmax output layer, which is a non-convex case. We adopt cross entropy as the loss function, and the rectified linear unit (ReLU) as the activation function. The total number of parameters in the MLP is 50890. The learning rate $\alpha$ is set to 0.1. The MNIST dataset contains 60000 training samples and 10000 test samples. We randomly take out 500 to 1000 training samples and distribute them to 20 local workers as their local data. Then the three trained FL models are tested with the 10000 test samples. We provide the results of test accuracy versus the iteration index $t$ in Fig. 4. As we can see, our proposed INFLOTA outperforms Random policy, and achieves performance comparable to Perfect aggregation.
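The stated parameter total follows from counting one weight matrix and one bias vector per fully connected layer. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
```

For the 784-64-10 network this gives 784·64 + 64 + 64·10 + 10 = 50890, which is also the model dimension $D$ over which the entry-wise optimization is repeated.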

VI. CONCLUSION

In this paper, we have studied the joint optimization of

communications and FL over the air with analog aggregation.

Under the convex and non-convex cases with either the GD

or SGD implementations, we respectively derive closed-form

expressions for the expected convergence rate of the FL algo-

rithm, which can quantify the impact of resource-constrained

wireless communications on FL under the analog aggregation

paradigm. Through analyzing the expected convergence rates,

we have proposed a joint optimization scheme of worker

selection and power control, which can mitigate the impact of

wireless communications on the convergence and performance

of FL. More signiﬁcantly, our joint optimization scheme is

applicable for both the convex and non-convex cases, using

either GD or SGD implementations. Simulation results show

that the proposed optimization scheme is effective in mitigat-

ing the impact of wireless communications on FL.

ACKNOWLEDGMENTS

This work was partly supported by the National Natural Sci-

ence Foundation of China (Grants 61871023 and 61931001),

Beijing Natural Science Foundation (Grant 4202054), the

National Science Foundation of the US (Grants 1741338,

1939553, 2003211, 2128596 and 2136202), and VRIF-CCI

(Grant 223996).

3 http://yann.lecun.com/exdb/mnist/

REFERENCES

[1] H. B. McMahan, E. Moore, D. Ramage, S. Hampson et al.,

“Communication-efﬁcient learning of deep networks from decentralized

data,” arXiv preprint arXiv:1602.05629, 2016.

[2] J. Konečný, H. B. McMahan, D. Ramage, and P. Richtárik, “Federated

optimization: Distributed machine learning for on-device intelligence,”

arXiv preprint arXiv:1610.02527, 2016.

[3] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A joint

learning and communications framework for federated learning over

wireless networks,” IEEE Transactions on Wireless Communications,

2020.

[4] B. Nazer and M. Gastpar, “Computation over multiple-access channels,”

IEEE Trans. Inf. Theory, vol. 53, no. 10, pp. 3498–3516, 2007.

[5] M. M. Amiri and D. Gündüz, “Federated learning over wireless fading

channels,” IEEE Transactions on Wireless Communications, vol. 19,

no. 5, pp. 3546–3557, 2020.

[6] M. M. Amiri, T. M. Duman, and D. Gündüz, “Collaborative machine

learning at the wireless edge with blind transmitters,” arXiv preprint

arXiv:1907.03909, 2019.

[7] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for

low-latency federated edge learning,” IEEE Transactions on Wireless

Communications, vol. 19, no. 1, pp. 491–506, 2019.

[8] X. Fan, Y. Wang, Y. Huo, and Z. Tian, “Bev-sgd: Best effort voting

sgd for analog aggregation based federated learning against byzantine

attackers,” arXiv preprint arXiv:2110.09660, 2021.

[9] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-

the-air computation,” IEEE Transactions on Wireless Communications,

vol. 19, no. 3, pp. 2022–2035, 2020.

[10] Y. Sun, S. Zhou, and D. Gündüz, “Energy-aware analog aggregation

for federated learning with redundant data,” arXiv preprint

arXiv:1911.00188, 2019.

[11] X. Fan, Y. Wang, Y. Huo, and Z. Tian, “Communication-efﬁcient feder-

ated learning through 1-bit compressive sensing and analog aggregation,”

in 2021 IEEE International Conference on Communications Workshops

(ICC Workshops). IEEE, 2021, pp. 1–6.

[12] ——, “1-bit compressive sensing for efﬁcient federated learning over

the air,” arXiv preprint arXiv:2103.16055, 2021.

[13] J. Wang and G. Joshi, “Cooperative sgd: A uniﬁed framework for the

design and analysis of communication-efﬁcient sgd algorithms,” arXiv

preprint arXiv:1808.07576, 2018.

[14] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic

Programming. Athena Scientific, 1996.

[15] M. P. Friedlander and M. Schmidt, “Hybrid deterministic-stochastic

methods for data ﬁtting,” SIAM Journal on Scientiﬁc Computing, vol. 34,

no. 3, pp. A1380–A1405, 2012.

[16] K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized

gradient descent,” SIAM Journal on Optimization, vol. 26, no. 3, pp.

1835–1854, 2016.

[17] X. Fan, Y. Wang, Y. Huo, and Z. Tian, “Joint optimization of

communications and federated learning over the air,” arXiv preprint

arXiv:2104.03490, 2021.

[18] Y. Liu, K. Yuan, G. Wu, Z. Tian, and Q. Ling, “Decentralized dynamic

admm with quantized and censored communications,” in 2019 53rd

Asilomar Conference on Signals, Systems, and Computers. IEEE, 2019,

pp. 1496–1500.
