
Communication-Efficient Device Scheduling for Federated Learning Using Stochastic Optimization

January 2022


Authors:

Shiqiang Wang

IBM Research

Mingyue Ji

University of Utah


Federated learning (FL) is a useful tool in distributed machine learning that utilizes users' local datasets in a privacy-preserving manner. When deploying FL in a constrained wireless environment, however, training models in a time-efficient manner can be a challenging task due to intermittent connectivity of devices, heterogeneous connection quality, and non-i.i.d. data. In this paper, we provide a novel convergence analysis of non-convex loss functions using FL on both i.i.d. and non-i.i.d. datasets with arbitrary device selection probabilities for each round. Then, using the derived convergence bound, we use stochastic optimization to develop a new client selection and power allocation algorithm that minimizes a function of the convergence bound and the average communication time under a transmit power constraint. We find an analytical solution to the minimization problem. One key feature of the algorithm is that knowledge of the channel statistics is not required and only the instantaneous channel state information needs to be known. Using the FEMNIST and CIFAR-10 datasets, we show through simulations that the communication time can be significantly decreased using our algorithm, compared to uniformly random participation.



Jake Perazzone∗, Shiqiang Wang†, Mingyue Ji‡, Kevin S. Chan∗

∗Army Research Laboratory, Adelphi, MD, USA. Email: {jake.b.perazzone.civ; kevin.s.chan.civ}@army.mil,

†IBM T. J. Watson Research Center, Yorktown Heights, NY, USA. Email: wangshiq@us.ibm.com

‡Department of Electrical & Computer Engineering, University of Utah, Salt Lake City, UT, USA. Email: mingyue.ji@utah.edu


I. Introduction

Federated learning (FL) is a valuable machine learning (ML)

tool that enables distributed training of neural network models

without centralized data by utilizing computation at several

distributed learners who use their own local datasets. Model

training is accomplished through a collaborative procedure in

which the participating learners are sent the current model

and then they each separately perform updates via stochastic

gradient descent (SGD) using their own locally collected

datasets. After a set number of local iterations, the participants

send their updated model weights to an aggregator who updates

the global model, typically through simple averaging of each

participant’s update, as in FedAvg [1]. The process then repeats

by sending out the updated global model to all learners

participating in the next round and continues until a satisfactory

model is obtained. A block diagram of the uplink in a wireless network running FL can be found in Figure 1, where each learner $n$ is a device that has its own independent channel to the aggregator with fading parameter $h_n(t)$.

This research was partly sponsored by the U.S. Army Research Laboratory

and the U.K. Ministry of Defence under Agreement Number W911NF-16-3-0001. The views and conclusions contained in this document are those of

the authors and should not be interpreted as representing the oﬃcial policies,

either expressed or implied, of the U.S. Army Research Laboratory, the U.S.

Government, the U.K. Ministry of Defence or the U.K. Government. The U.S.

and U.K. Governments are authorized to reproduce and distribute reprints for

Government purposes notwithstanding any copyright notation hereon.

Fig. 1: Block diagram of the uplink communication in federated

learning over a wireless network.

One of the major advantages of the FL training process

is that user privacy is preserved since the users’ data never

leaves their device. This allows the end user to take part

in training and ultimately obtain better ML models without

fear of revealing their private data. The orchestration of FL

over large-scale wireless networks, though, has proved to be a

challenging task since the amount of communication required

to converge to an acceptable model creates a large bottleneck in

the process. This is particularly evident in dynamic mobile edge

computing (MEC) environments where poor channel quality

and intermittent connectivity can completely derail training.

For example, if a device loses connection to the server and

does not participate in training for an extended period of time,

the global model will begin to shift away from their locally

optimal model which will negatively aﬀect convergence until

they rejoin. If many devices are absent for extended periods of

time, the global model will converge very slowly, or possibly

not at all depending on the degree of heterogeneity of the

data. Additionally, if a device is available but has a very bad

connection, resources will be wasted if they participate in every

round. Thus, device selection becomes a very important aspect

in the management of FL in practice.

In the original FL algorithm, FedAvg [1], clients are selected

uniformly at random in each round. Although this strategy has

been shown to converge [2], [3], in practice, devices are not

always available for selection due to factors such as energy

and time constraints. Additionally, since this selection policy

is agnostic to channel conditions and other factors, it will lead

to the consumption of more network resources than necessary.

Thus, a more intelligent approach to device selection is needed

to optimize network resource consumption. However, before designing such an approach, the effect of arbitrary device selection on FL convergence must be understood to ensure convergence to a good model.

arXiv:2201.07912v2 [cs.LG] 4 May 2022

In this paper, we derive a novel convergence bound for non-convex loss functions with arbitrary device selection probabilities for each FL round. Our new upper bound shows that as long

as all devices have a non-zero probability of participating in

each round, then FL will converge in expectation to a stationary

point of the loss function. We then use the knowledge of how

the selection probabilities aﬀect this newly found convergence

bound to formulate a stochastic optimization problem that

determines the optimal selection probabilities and transmit

powers. The objective function of the problem minimizes a

weighted sum of the convergence bound and the time spent

for communicating model parameters with a constraint on the

peak and time average transmit power. Communicating the

model parameters over many rounds is the major bottleneck

of FL. Therefore, minimizing the communication time is very

beneﬁcial in speeding up convergence and minimizing the

burden on the network. The form of the convergence bound and our novel problem formulation allow us to utilize the Lyapunov drift-plus-penalty framework to compute an analytical, distributed solution to the minimization problem. A key advantage of our new device selection

algorithm is that it is able to make decisions according to

current channel conditions without knowledge of the underlying

channel statistics.

To show the performance of our algorithm, we run numerous

experiments on the CIFAR-10 and FEMNIST datasets to

demonstrate the saved communication time using our developed

algorithm. We compare our results to the uniform selection

policy of FedAvg and show that the time required to reach a

target accuracy can be decreased by up to 58%. In summary,

our main contributions are as follows:

• We derive a new upper bound for convergence of non-convex loss functions using FL with arbitrary selection probabilities.

• We formulate a novel stochastic optimization problem that minimizes a weighted sum of the newly found convergence bound and the amount of communication time spent on transmitting parameter updates, while satisfying transmit power constraints.

• Using the Lyapunov drift-plus-penalty framework, we derive an analytical and distributed solution to the problem that does not require knowledge of the channel statistics.

• We provide experimental results that demonstrate a communication savings of up to 58% compared to traditional uniform selection strategies.

The rest of the paper is organized as follows. First, we present

some related work in Section II before formally presenting

the FL problem in Section III. Then, convergence analysis is

provided in Section IV and the device scheduling policy is developed

in Section V. Finally, we present experimental results in

Section VI.

II. Related Works

Since its introduction in [1], FL has garnered a lot of attention

in both industry and academia with a major focus on providing

privacy guarantees [4]–[6], characterizing convergence [2],

[3], [7], [8], and enhancing communication eﬃciency [9]

through strategies such as model compression via sparsiﬁcation

[10], [11] and quantization [12], [13]. One of the biggest

challenges with implementing FL at scale is the heterogeneity

that is present in both the system and the data. An alternative

way to address communication eﬃciency and combat system

heterogeneity is through device scheduling, or client selection.

Doing this naively, however, can lead to suboptimal models due

to the skew introduced by the heterogeneous, or non-i.i.d., data

at the devices which is why we design our selection process

based on its eﬀect on convergence.

One of the ﬁrst works speciﬁcally targeting the problem is

[14] where the FL training process is modeled in a MEC

network with a wide variety of devices. The scheduling

procedure presented was designed to speed up convergence

by having as many devices as possible participate in each

round within a desired time window. The presented strategy,

however, does not consider its eﬀect on convergence and results

in much poorer performance for non-i.i.d. datasets as indicated

in their results. Some other empirical studies with similar

approaches include [15], [16], but also do not consider or

derive convergence bounds for their selection strategies.

Some later work began to include analysis of the convergence

of FL with device selection. Among these is [17], but only the

convergence of simple linear regression loss is considered. In

[18], the authors analyze the convergence of strongly convex

loss functions, but unfortunately their bound introduces a non-vanishing term and thus their strategy is not guaranteed to

converge to a stationary point of the loss function. Both [19]

and [20] consider convergence, but only for strongly convex loss

functions. Convergence results for non-convex loss functions

with partial device participation have also been presented in

[7], [21], [22], but they only consider the case where devices

are chosen uniformly at random with or without replacement

and do not allow for arbitrary selection probabilities. Finally,

[23] considers arbitrary probabilities for each device, but these

probabilities are held constant throughout training and are not

reﬂected in the parameter aggregation weights. Additionally,

in [23], all devices must participate in the ﬁrst round for

convergence. We improve upon these results by considering

non-convex loss functions and derive a bound with no non-vanishing term under the condition that all devices have an

arbitrary non-zero probability of participating in each round.

Some works [24], [25] develop frameworks that jointly optimize convergence and communication over wireless networks.

Similarly to our approach, they derive a convergence bound

and then minimize it by ﬁnding the optimal parameter values.

For example, in [24], the FL loss is minimized while meeting

the delay and energy consumption requirements via power

allocation, user selection, and resource block allocation. Both

papers, however, make the unrealistic assumption that the

channel remains constant throughout the training process which

we do not assume here.

In [26], stochastic optimization is used to determine an optimal scheduling and resource block policy that simultaneously

minimizes the FL loss function and CSI uncertainties. The loss

function considered, though, is simple linear regression and

does not readily apply to neural network models. Stochastic

optimization is also considered for FL in [27] and [28], but not

to design an optimal device selection policy that guarantees

convergence of non-convex loss functions like we do here.

III. Problem Formulation

We now explain the FL problem in more detail. Consider a system with $N$ clients, where each client $n$ has a possibly non-convex local objective $f_n(x)$ with parameter $x \in \mathbb{R}^d$. We would like to solve the following finite-sum problem:

$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{N} \sum_{n=1}^{N} f_n(x). \tag{1}$$

To solve (1), we follow the FL paradigm and perform a slightly modified version of FedAvg [1], where instead of uniform sampling with/without replacement, each client has an arbitrary probability of being selected to participate in a given round, denoted as $q_n^t$ for device $n$ at round $t$. This modification allows us to adjust the probability of selection for the participating devices based on system dynamics including channel conditions. Additionally, analyzing FedAvg with arbitrary probabilities allows us to observe the effect that a scheduling policy that controls these probabilities has on convergence in order to design a better policy. The modified algorithm can be found in Algorithm 1.

We let $\mathbb{1}_n^t \in \{0, 1\}$ be a random variable to denote whether client $n$ is sampled in round $t$ such that $q_n^t := \Pr\{\mathbb{1}_n^t = 1\}$. Next, denote $I$ as the synchronization interval, or the number of local SGD updates performed by a device before aggregation, and denote $g_n(x)$ as the stochastic gradient of $f_n(x)$. We denote the learning rate as $\gamma > 0$ and the number of total rounds as $T$. Note that this algorithm is logically equivalent to one where only the participating clients receive the model updates and compute the gradient updates. Notice that each device's aggregation weight is inversely proportional to its probability of being selected. This ensures that the gradient updates remain unbiased. Intuitively, it ensures that devices with low participation can still have sufficient influence on the global model when they do participate.
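The sampling and reweighting steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `fedavg_round`, the list-based parameter vectors, and the toy quadratic objectives are all assumptions made for the example, and the aggregation step assumes each selected client's model delta is scaled by $1/q_n^t$ and averaged over all $N$ clients.

```python
import random


def fedavg_round(x, grad_fns, q, gamma, local_steps, rng):
    """One round of FedAvg with per-client selection probabilities q[n].

    Hypothetical helper: x is a list of floats, grad_fns[n](y) returns the
    (stochastic) gradient of client n at y as a list of floats.
    """
    num_clients = len(grad_fns)
    dim = len(x)
    update = [0.0] * dim
    for n in range(num_clients):
        # Bernoulli participation indicator with Pr{selected} = q[n].
        if rng.random() >= q[n]:
            continue
        y = list(x)
        for _ in range(local_steps):
            g = grad_fns[n](y)
            y = [yi - gamma * gi for yi, gi in zip(y, g)]
        # Scale the model delta by 1/q[n] so the aggregate stays unbiased.
        for k in range(dim):
            update[k] += (y[k] - x[k]) / q[n]
    return [xk + uk / num_clients for xk, uk in zip(x, update)]


# Toy problem (assumption): f_n(x) = 0.5 * (x - c_n)^2, so f is minimized at 2.
centers = [1.0, 3.0]
grad_fns = [lambda y, c=c: [y[0] - c] for c in centers]
x = [0.0]
rng = random.Random(0)
for _ in range(200):
    x = fedavg_round(x, grad_fns, q=[1.0, 1.0], gamma=0.1, local_steps=1, rng=rng)
```

Skipping the local computation for unselected clients relies on the logical equivalence noted above: their updates would be multiplied by a zero indicator anyway.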

IV. Convergence Analysis

In this section, we prove an upper bound on the convergence of (1) using Algorithm 1 for non-convex loss functions. We will assume that $\mathbb{1}_n^t$ and $\mathbb{1}_{n'}^t$ are independent for $n \neq n'$ and that the randomness in client sampling is independent of SGD noise, so that $\mathbb{1}_n^t$ and $g_n(\cdot)$ are independent. We then make the following assumptions on the local loss functions, which are common in the convergence analysis literature.

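Algorithm 1: FedAvg with client sampling

Input: $\gamma$, $x_0$, $I$, $T$, $\{q_n^t\}$
Output: $\{x_t\}$
for $t \leftarrow 0, \ldots, T-1$ do
    Sample $\mathbb{1}_n^t \sim q_n^t$, $\forall n$;
    for $n \leftarrow 1, \ldots, N$ in parallel do
        $y^n_{t,0} \leftarrow x_t$;
        for $i \leftarrow 0, \ldots, I-1$ do
            $y^n_{t,i+1} \leftarrow y^n_{t,i} - \gamma g_n(y^n_{t,i})$;
    $x_{t+1} \leftarrow x_t + \frac{1}{N} \sum_{n=1}^{N} \frac{\mathbb{1}_n^t}{q_n^t} \left( y^n_{t,I} - y^n_{t,0} \right)$;  // global parameter update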

Assumption 1 (L-smoothness).

$$\|\nabla f_n(y_1) - \nabla f_n(y_2)\| \le L \|y_1 - y_2\| \tag{2}$$

for any $y_1$, $y_2$ and some $L > 0$.

Assumption 2 (Unbiased stochastic gradients).

$$\mathbb{E}[g_n(y) \mid y] = \nabla f_n(y) \tag{3}$$

for any $y$.

Now, we state our novel convergence theorem in Theorem 1.

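Theorem 1. Let Assumptions 1 and 2 hold with $\gamma$, $I$, $T$, and $q_n^t$ defined as above. Then, Algorithm 1 satisfies

$$\begin{aligned}
\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\left[\|\nabla f(x_t)\|^2\right] \le{}& \frac{2\left(f(x_0) - f^*\right)}{\gamma T I} + \frac{\gamma^2 L^2 (I-1)}{T I N} \sum_{t=0}^{T-1} \sum_{n=1}^{N} \sum_{i=0}^{I-1} \sum_{j=0}^{i-1} \mathbb{E}\left[\left\|g_n(y^n_{t,j})\right\|^2\right] \\
&+ \frac{\gamma L}{T N} \sum_{t=0}^{T-1} \sum_{n=1}^{N} \frac{1}{q_n^t} \sum_{i=0}^{I-1} \mathbb{E}\left[\left\|g_n(y^n_{t,i})\right\|^2\right]
\end{aligned} \tag{4}$$

where $f^*$ represents the optimal solution to (1).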

Proof. The proof can be found in Appendix A.

Furthermore, by making an assumption of uniformly bounded

stochastic gradients, we can simplify the bound.

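Assumption 3 (Bounded stochastic gradients).

$$\mathbb{E}\left[\|g_n(y)\|^2\right] \le G^2, \quad \forall y, n \tag{5}$$

for some $G > 0$.

Corollary 1. If Assumption 3 holds, then the bound (4) becomes

$$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\left[\|\nabla f(x_t)\|^2\right] \le \frac{2\left(f(x_0) - f^*\right)}{\gamma T I} + \gamma^2 L^2 (I-1)^2 G^2 + \frac{\gamma L I G^2}{T N} \sum_{t=0}^{T-1} \sum_{n=1}^{N} \frac{1}{q_n^t}. \tag{6}$$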

Proof.

The proof is a simple application of Assumption 3 and

Theorem 1.
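As a sketch of that substitution, assume the second term of (4) is a double sum over $i$ and $j < i$ and the third term carries the weight $1/q_n^t$. Replacing every expected squared gradient norm by its bound $G^2$ then gives the following, where the first line uses the fact that the double sum has $I(I-1)/2$ terms:

```latex
\begin{align*}
\frac{\gamma^2 L^2 (I-1)}{T I N} \sum_{t=0}^{T-1} \sum_{n=1}^{N} \sum_{i=0}^{I-1} \sum_{j=0}^{i-1} G^2
  &= \frac{\gamma^2 L^2 (I-1)}{I} \cdot \frac{I (I-1)}{2}\, G^2
   \le \gamma^2 L^2 (I-1)^2 G^2, \\
\frac{\gamma L}{T N} \sum_{t=0}^{T-1} \sum_{n=1}^{N} \frac{1}{q_n^t} \sum_{i=0}^{I-1} G^2
  &= \frac{\gamma L I G^2}{T N} \sum_{t=0}^{T-1} \sum_{n=1}^{N} \frac{1}{q_n^t}.
\end{align*}
```

Only the last expression depends on the selection probabilities, which is why it is the term of interest for scheduling.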

If we set $\gamma = \frac{1}{\sqrt{T}}$, we can guarantee a convergence rate of $O\left(\frac{1}{\sqrt{T}}\right)$ to a stationary point of $f(x)$. The third term in the bound shows the effect of arbitrary client sampling and matches the intuition that the more often devices participate, the fewer iterations will be required to converge. The bound can be minimized by choosing a selection strategy that minimizes the time average $\frac{1}{TN} \sum_{t=0}^{T-1} \sum_{n=1}^{N} \frac{1}{q_n^t}$. While it has a trivial minimum at $q_n^t = 1$ for all $n$ and $t$, i.e., full participation, it is impractical to assume that every device can or will participate in every round due to lack of network resources and the amount of time it would take to receive updates from every device. Now, with a convergence bound that is a function of the device selection probability $q_n^t$, we can design an optimization problem that properly considers the effect that $q_n^t$ has on convergence in addition to minimizing communication overhead.
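To make the dependence on the selection probabilities concrete, the following sketch evaluates the time-average penalty and the right-hand side of the simplified bound. The function names and the nested-list layout `q[t][n]` are assumptions for illustration, and the bound is assumed to have the three-term form of Corollary 1, with `f_gap` standing in for $f(x_0) - f^*$.

```python
def sampling_penalty(q):
    """Time average (1 / (T * N)) * sum over t and n of 1 / q[t][n]."""
    num_rounds = len(q)
    num_clients = len(q[0])
    total = sum(1.0 / q_tn for row in q for q_tn in row)
    return total / (num_rounds * num_clients)


def corollary_bound(f_gap, gamma, L, G, I, q):
    """Right-hand side of the simplified bound; f_gap is f(x_0) - f*."""
    T = len(q)
    term_init = 2.0 * f_gap / (gamma * T * I)
    term_drift = gamma**2 * L**2 * (I - 1) ** 2 * G**2
    term_sampling = gamma * L * I * G**2 * sampling_penalty(q)
    return term_init + term_drift + term_sampling


# Full participation attains the smallest possible penalty of 1;
# halving every probability doubles it.
full = [[1.0, 1.0], [1.0, 1.0]]
half = [[0.5, 0.5], [0.5, 0.5]]
penalty_full = sampling_penalty(full)
penalty_half = sampling_penalty(half)
```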

V. Communication-Efficient Scheduling Policy

In this section, we formulate a novel stochastic optimization problem that chooses the selection probabilities $q_n^t$ and transmit powers $P_n(t)$ in each round to minimize a function of the convergence bound and communication overhead. More specifically, we minimize a weighted sum of the last term of (6) and the average time spent communicating over the channel while satisfying transmission power constraints. Since

communicating the model parameters over many rounds is the

major bottleneck of FL, minimizing the communication time is

very beneﬁcial in speeding up convergence time and minimizing

the burden on the network. Our formulation allows for the

application of the Lyapunov drift-plus-penalty framework which

leads to an analytical solution that does not require knowledge

of the exact dynamics or statistics of the channel; only the

instantaneous channel state information (CSI) is needed. The

solution can also be computed in a distributed fashion in which

each device can determine its own selection probability, and

since selection is done independently, each device can notify

the aggregator when it should be selected.

We consider a simple wireless network model where all

devices are able to communicate with the aggregator and must

take turns in transmitting their parameters as in time-division

multiple access (TDMA). For simplicity, we only consider

the uplink channel, since the downlink is a broadcast by the

aggregator to all the devices that takes much less time. At each round $t$, the devices receive information about their current CSI in the form of channel gain $|h_n(t)|^2$ and noise power $N_0$. The algorithm then uses this information to determine each device's probability of selection $q_n^t$ and transmission power $P_n(t)$ for that round. Additionally, the transmission power is subject to both a peak power constraint $P_{\max}$ and a time average constraint $\bar{P}_n$.

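We formulate the problem as

$$\begin{aligned}
\min_{\{q_n^t\}, \{P_n(t)\}} \quad & \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}[y_0(t)] \\
\text{s.t.} \quad & \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\left[P_n(t)\, q_n^t\right] \le \bar{P}_n, \quad \forall n = 1, \ldots, N \\
& 0 \le P_n(t) \le P_{\max}, \quad n = 1, \ldots, N \\
& q_n^t \in (0, 1]
\end{aligned} \tag{7}$$

where

$$y_0(t) := \sum_{n=1}^{N} \left[ \frac{1}{N q_n^t} + \lambda \cdot \frac{\ell\, q_n^t}{B \log_2\left(1 + \frac{|h_n(t)|^2 P_n(t)}{N_0}\right)} \right], \tag{8}$$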

$\ell$ is the number of bits required to represent the model, $B$ is the bandwidth of the communication channel, $\log_2(\cdot)$ denotes the base 2 logarithm, and $\lambda > 0$ is a tuneable parameter that controls the trade-off between minimizing the convergence bound and the sum transmission time. The first term in the objective is straightforwardly taken from (6) while the second term represents the minimum expected amount of time it takes to transmit the model given $q_n^t$. The denominator of the second

term is the channel capacity and although the communication

rate in practice is not truly equal to the capacity, it gives us a

communication time lower bound that indicates how channel

gain and transmission power aﬀect communication times.
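A minimal sketch of the per-round objective follows, assuming it is the sum over clients of $1/(N q)$ plus $\lambda$ times the selection probability times the capacity-based upload time $\ell / (B \log_2(1 + |h|^2 P / N_0))$. The function and argument names are illustrative assumptions, not identifiers from the paper.

```python
import math


def tx_time(bits, bandwidth, gain, power, noise):
    """Lower bound on upload time: bits / (B * log2(1 + |h|^2 * P / N0))."""
    return bits / (bandwidth * math.log2(1.0 + gain * power / noise))


def y0(q, gains, powers, bits, bandwidth, noise, lam):
    """Per-round objective summed over clients (assumed form, see lead-in)."""
    n_clients = len(q)
    total = 0.0
    for q_n, g_n, p_n in zip(q, gains, powers):
        # Convergence-bound term: rarely selected clients are penalized.
        total += 1.0 / (n_clients * q_n)
        # Expected communication time, weighted by the trade-off parameter.
        total += lam * q_n * tx_time(bits, bandwidth, g_n, p_n, noise)
    return total


# With unit gain, power, and noise the spectral efficiency is 1 bit/s/Hz,
# so a model of 32 * 555178 bits over 22e6 Hz takes under one second.
example_time = tx_time(32 * 555178, 22e6, 1.0, 1.0, 1.0)
```

Raising `lam` makes the second term dominate, so a minimizer would favor smaller selection probabilities for clients with slow links.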

The novelty of our convergence bound and formulation comes

from the fact that both the effect of $q_n^t$ on convergence and

the additional term we add to minimize communication time

are in the form of a time average. This allows us to apply

standard theorems from the Lyapunov stochastic optimization

framework [29] to reformulate

(7)

into a form that we solve

analytically. While the framework specializes in stabilizing

queues in stochastic networks and we have no such queues

here, the framework allows us to convert our transmission

power constraint into a set of virtual queues and apply the

Lyapunov convergence theorem to our problem. The practical

implications of the virtual queues will be explored at the end of

this section and its eﬀect will be further illustrated in Section

VI. So, to put our optimization problem into the Lyapunov

drift-plus-penalty framework and using standard notation, we

turn the constraint into a virtual queue $Z_n(t)$ for each client $n$ such that

$$Z_n(t+1) = \max[Z_n(t) + y_n(t), 0], \tag{9}$$

where

$$y_n(t) = P_n(t)\, q_n^t - \bar{P}_n. \tag{10}$$
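The virtual queue recursion is simple enough to show directly. In this sketch the function name and the example numbers are assumptions; the update itself follows (9) and (10), growing the queue when the expected transmit power $P_n(t) q_n^t$ exceeds the average budget and draining it otherwise.

```python
def update_virtual_queue(z, power, q, avg_power_budget):
    """Z_n(t+1) = max(Z_n(t) + P_n(t) * q_n^t - Pbar_n, 0)."""
    return max(z + power * q - avg_power_budget, 0.0)


# Illustrative trace: overspending for three rounds, then backing off.
z = 0.0
trace = []
for power, q in [(2.0, 1.0), (2.0, 1.0), (2.0, 1.0), (0.5, 1.0), (0.5, 1.0)]:
    z = update_virtual_queue(z, power, q, avg_power_budget=1.0)
    trace.append(z)
```

A large queue value therefore signals accumulated overspending relative to the time average constraint.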

Since we have no actual queues, the Lyapunov function is

$$L(\Theta(t)) := \frac{1}{2} \sum_{n=1}^{N} Z_n(t)^2 \tag{11}$$

where $\Theta(t)$ represents the current queue states, which in this case, is just $\{Z_n(t) : \forall n\}$. Next, we define the Lyapunov drift:

$$\Delta(t+1) = L(t+1) - L(t), \tag{12}$$

where we drop $\Theta(t)$ for simplicity. Finally, we have the Lyapunov drift-plus-penalty function that we wish to minimize:

$$\Delta(t) + V \mathbb{E}[y_0(t) \mid \Theta(t)], \tag{13}$$

where $V > 0$ is another arbitrarily chosen weight that controls the fundamental trade-off between queue stability and optimality of the objective functions.

Algorithm 2: Stochastic client sampling

Input: $h_n(t)$, $N_0$, $\ell$, $B$, $V$, $\lambda$, $P_{\max}$, $\bar{P}_n$
Output: $q_n^t$, $P_n(t)$
$Z_n(0) \leftarrow 0$
$P_n(0) \leftarrow P_{\max}$
$q_n^0 \leftarrow \min\left(\max\left(\sqrt{\frac{B \log_2\left(1 + \frac{|h_n(t)|^2 P_{\max}}{N_0}\right)}{N \lambda \ell}}, 0\right), 1\right)$
for $t \leftarrow 1, \ldots, T-1$ do
    for $n \leftarrow 1, \ldots, N$ in parallel do
        Calculate roots via (16) and (17)
        if $0 \le P_n(t) \le P_{\max}$ and $q_n^t \in (0, 1]$ then
            Perform Hessian determinant test to ensure minimum
        else
            $P_n(t) \leftarrow P_{\max}$
            $q_n^t \leftarrow \min\{(17), 1\}$
        $Z_n(t+1) \leftarrow \max[Z_n(t) + P_n(t)\, q_n^t - \bar{P}_n, 0]$

Now, by utilizing Lemma 4.6 from [29] and assuming that the random event, i.e., channel gain $|h_n(t)|^2$, is i.i.d. with respect to $t$, we can upper bound (13):

$$\Delta(t) + V \mathbb{E}[y_0(t) \mid \Theta(t)] \le C + V \mathbb{E}[y_0(t) \mid \Theta(t)] + \sum_{n=1}^{N} Z_n(t)\, \mathbb{E}[y_n(t) \mid \Theta(t)] \tag{14}$$

where $C > 0$ is a constant. Next, according to the Min Drift-Plus-Penalty Algorithm, we opportunistically minimize the expectation in the right hand side of (14) at each time step $t$:

$$\begin{aligned}
\min_{\{q_n^t\}, \{P_n(t)\}} \quad & f(q_n^t, P_n(t)) := V y_0(t) + \sum_{n=1}^{N} Z_n(t)\, y_n(t) \\
\text{s.t.} \quad & 0 \le P_n(t) \le P_{\max}, \quad \forall n = 1, \ldots, N \\
& q_n^t \in (0, 1].
\end{aligned} \tag{15}$$

Since the objective is an independent sum over $n$, we can perform the minimization separately for each device $n$. Algorithm 2 details the process of determining the optimal $P_n(t)$ and $q_n^t$ in each round. We now present Theorem 2, which gives an analytical solution to (15) that can be computed distributively by the devices.

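Theorem 2. The solution to (15) is given by Algorithm 2, where the optimal values for each $n$ are given by either the endpoints, i.e., $P_n^{\text{opt}}(t) = P_{\max}$, $q_n^t = 1$, or by

$$P_n^{\text{opt}}(t) = \frac{N_0}{|h_n(t)|^2} \left[ \frac{A}{4}\, W_0\left(\sqrt{\frac{A}{4}}\right)^{-2} - 1 \right] \tag{16}$$

where $A = \frac{V \lambda \ell\, |h_n(t)|^2 (\log(2))^2}{N_0 B Z_n(t)}$ and

$$q_n^{t,\text{opt}} = \left( \sqrt{ \frac{\lambda \ell N}{B \log_2\left(1 + \frac{|h_n(t)|^2 P_n^{\text{opt}}(t)}{N_0}\right)} + \frac{N}{V} Z_n(t)\, P_n^{\text{opt}}(t) } \right)^{-1} \tag{17}$$

where $W_0(\cdot)$ is the principal branch of the Lambert $W$ function.

Proof. The proof can be found in Appendix B.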
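As a partial sanity check (not a substitute for the appendix), the stationarity condition in the selection probability can be worked out by hand. The sketch below assumes the per-device part of the objective in (15) is $V$ times a term $1/(Nq)$ plus the weighted capacity-based communication time, plus the virtual queue term; the symbol $\phi_n$ is introduced here only for illustration.

```latex
\begin{align*}
\phi_n(q, P) &= \frac{V}{N q}
  + \frac{V \lambda \ell\, q}{B \log_2\!\left(1 + \frac{|h_n(t)|^2 P}{N_0}\right)}
  + Z_n(t) \left( P q - \bar{P}_n \right), \\
\frac{\partial \phi_n}{\partial q} &= -\frac{V}{N q^2}
  + \frac{V \lambda \ell}{B \log_2\!\left(1 + \frac{|h_n(t)|^2 P}{N_0}\right)}
  + Z_n(t) P = 0 \\
\Longrightarrow \quad q &= \left(
  \frac{\lambda \ell N}{B \log_2\!\left(1 + \frac{|h_n(t)|^2 P}{N_0}\right)}
  + \frac{N}{V} Z_n(t) P \right)^{-1/2}.
\end{align*}
```

Since $\partial^2 \phi_n / \partial q^2 = 2V / (N q^3) > 0$ for $q > 0$, this stationary point is a minimum in $q$ for fixed $P$. Setting $Z_n(t) = 0$ and $P = P_{\max}$ gives the square-root expression used to initialize the selection probability in Algorithm 2.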

Theorem 4.8 in [29] and Theorem 2 guarantee that this algorithm satisfies

$$\limsup_{t \to \infty} \frac{1}{t} \sum_{\tau=0}^{t-1} \mathbb{E}[y_0(\tau)] \le y_0^{\text{opt}} + \frac{C}{V}, \tag{18}$$

where $y_0^{\text{opt}}$ is the minimum of (7). The theorems also guarantee that the transmit power constraint is satisfied as $t \to \infty$. The user-defined parameter $V$ traditionally controls the trade-off between the average queue backlog and the gap from optimality, but since we do not have physical queues in our problem, the trade-off does not exist in the same way. Instead, $V$ controls the speed of convergence in addition to the optimality gap in (18). From (17), we can see that when there is a large virtual queue $Z_n(t)$ or chosen transmit power, the probability of selection is decreased in order to satisfy the transmit power constraint. In this way, the virtual queue represents how far from the time average constraint we are. As $V$ is increased, the effect that the current virtual queue has on selection becomes less important and it takes longer to satisfy the average power constraint. This is also explored experimentally in Section VI-C. A large $\lambda$ favors the minimization of communication time rather than the convergence bound, which naturally leads to lower $q_n^t$, as seen in (17). Finally, since the probability calculation is done independently by each device, it can be computed locally without direct orchestration by the aggregator.
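The qualitative behavior described above can be checked numerically with a small sketch of the selection probability for a given transmit power. It assumes (17) is the inverse square root of the sum of a communication-time term and a virtual-queue term, clipped to at most 1 as in Algorithm 2. The function and keyword names are illustrative assumptions, and the power is taken as an input rather than computed from (16).

```python
import math


def selection_probability(gain, power, z, *, n_clients, bits, bandwidth,
                          noise, lam, v):
    """Stationary selection probability for one device, clipped to (0, 1]."""
    rate = bandwidth * math.log2(1.0 + gain * power / noise)
    # Communication-time term plus virtual-queue (power budget) term.
    inner = lam * bits * n_clients / rate + (n_clients / v) * z * power
    return min(1.0 / math.sqrt(inner), 1.0)


common = dict(n_clients=1, bits=4.0, bandwidth=1.0, noise=1.0, lam=1.0)
q_empty_queue = selection_probability(1.0, 1.0, 0.0, v=1.0, **common)
q_backlogged = selection_probability(1.0, 1.0, 5.0, v=1.0, **common)
q_large_v = selection_probability(1.0, 1.0, 5.0, v=100.0, **common)
```

In this toy setting a backlogged queue lowers the probability, while a larger `v` shrinks the queue's influence, matching the discussion of $V$ above.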

VI. Experiments

In order to demonstrate the advantages of our device

scheduling algorithm, we evaluate it on the CIFAR-10 [30] and FEMNIST [31] datasets and compare the performance to uniform device sampling in terms of total time for communicating model parameters. For simplicity, we assume that

the computation time is much less than communication time

and do not include that in our time measurements. The

FEMNIST dataset is a federated partitioning of the extended

MNIST (EMNIST) dataset [32] that consists of 62 classes

of handwritten letters and digits from 3597 diﬀerent writers.

In the experiments, each device is given data from only one

writer in order to simulate a more realistic heterogeneous data

environment rather than partitioning by class as is sometimes

done, e.g., in [33]. Therefore, for the FEMNIST dataset, we consider $N = 3597$ clients in which we reserve 10% of the data for testing. For the CIFAR-10 dataset, on the other hand, we only consider the i.i.d. case where $N = 100$ clients are given a uniform sampling from the 50,000 color images of 10 classes where 10,000 images are reserved for testing.

In both experiments, we train the same convolutional neural network (CNN) as in [8], [10], which has $d = 555{,}178$ parameters for CIFAR-10 and $d = 444{,}062$ parameters for FEMNIST. Therefore, for Algorithm 2, we set $\ell = 32d$ since each parameter is represented as a 32-bit floating point number. We also set the minibatch size to 32, $\gamma = 0.01$, $I = 10$ and $B = 22 \times 10^6$ to simulate WiFi bandwidth. The power constraints are set to $\bar{P}_n = 1$ and $P_{\max} = 100$ and the noise power is normalized to $N_0 = 1$. For the channel model, we assume each device experiences Rayleigh fading such that $|h_n(t)|$ is distributed as a Rayleigh random variable. In the first set of experiments, we assume that every device has the same Rayleigh parameter, $\sigma = 1$, but change to a more heterogeneous setup for the next group of experiments. Note that our algorithm does not need to know either the parameters or the distribution of the channel gain itself.

In the uniform selection cases, we choose the number of devices to be selected in each round to match the average number of devices selected using our algorithm for different $\lambda$ values. The average number of devices selected by our algorithm, denoted by $M$, is estimated using the Monte Carlo method. Note that the optimal number of selected devices by uniform selection is not known in practice; hence, we consider a stronger benchmark here than the commonly used uniform selection method. To satisfy the transmit power constraint in (7) for the uniform case, we set $P_n(t) = \bar{P}_n \cdot \frac{N}{M_t}$ for all $n$ and $t$, where $M_t$ is the number of devices selected in a given round that is equal to either $\lfloor M \rfloor$ or $\lceil M \rceil$. We use a moving average with a window size of 500 iterations to smooth the curves for a better viewing experience.
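One way to realize a fractional average number of selected devices is to randomize between the floor and the ceiling. The sketch below is an assumption about how such a baseline could be coded, not the authors' script: the function name is hypothetical, and the power rule assumes each selected device transmits at the average budget scaled by the ratio of the total number of clients to the number selected in that round.

```python
import math
import random


def uniform_selection(n_clients, mean_selected, avg_power_budget, rng):
    """Pick floor(M) or ceil(M) clients so that M are chosen on average."""
    low = math.floor(mean_selected)
    frac = mean_selected - low
    # Round up with probability equal to the fractional part of M.
    count = low + (1 if rng.random() < frac else 0)
    chosen = rng.sample(range(n_clients), count)
    # Power that meets the average budget when `count` of n_clients transmit.
    power = avg_power_budget * n_clients / count if count else 0.0
    return chosen, power


rng = random.Random(7)
counts = [len(uniform_selection(100, 2.5, 1.0, rng)[0]) for _ in range(10000)]
average_count = sum(counts) / len(counts)
```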

To keep the communications channel realistic, we upper and lower bound the possible values for $|h_n(t)|^2$. For the upper bound, we set $|h_n(t)|^2 < (2^{10} - 1) N_0 / \bar{P}_n$ since, in practice, even with a very good channel, modern communication systems can only go up to 1024-QAM, which is 10 bits/s/Hz. For the lower bound, we set $|h_n(t)|^2 > (2^{0.25} - 1) N_0 / P_{\max}$ to avoid big outliers that likely will not be chosen by either selection policy and only assume error correction is available at a rate of 0.25 bits/s/Hz at the maximum transmit power. Additionally, in cases where the value of $\lambda$ results in very low selection probabilities, we ensure that at least one device is selected each round by choosing the device with the largest $q_n^t$ if none are chosen during the regular selection process.
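The channel model can be simulated with the standard library alone. In this sketch the Rayleigh amplitude is drawn by inverse-transform sampling and the squared gain is clipped to the two bounds. Whether the experiments clip or resample out-of-range values is not stated, so clipping is an assumption, as are the function names and default arguments.

```python
import math
import random


def gain_bounds(noise, avg_power_budget, max_power,
                max_rate=10.0, min_rate=0.25):
    """Bounds on |h|^2 from the spectral-efficiency limits in bits/s/Hz."""
    upper = (2.0 ** max_rate - 1.0) * noise / avg_power_budget
    lower = (2.0 ** min_rate - 1.0) * noise / max_power
    return lower, upper


def sample_gain(sigma, lower, upper, rng):
    """Draw |h| ~ Rayleigh(sigma) and clip |h|^2 to [lower, upper]."""
    u = 1.0 - rng.random()  # lies in (0, 1], which avoids log(0)
    amplitude = sigma * math.sqrt(-2.0 * math.log(u))
    return min(max(amplitude ** 2, lower), upper)


# Values from the experimental setup: N0 = 1, average budget 1, peak power 100.
lower, upper = gain_bounds(noise=1.0, avg_power_budget=1.0, max_power=100.0)
rng = random.Random(0)
gains = [sample_gain(1.0, lower, upper, rng) for _ in range(1000)]
```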

A. CIFAR-10 Results

First, we present our experimental results for the i.i.d. CIFAR-10 dataset. The results of our experiments are shown in Figure 2. In Figures 2a and 2b, we consider a homogeneous network where each device has the same Rayleigh fading parameter, while in Figures 2c and 2d, we consider a heterogeneous network where the fading parameter is different for each device. More specifically, in the homogeneous case, we set the Rayleigh fading parameter such that all 100 devices have $\sigma = 1$. In the heterogeneous channel case, we set the Rayleigh fading parameter such that 10 devices have $\sigma = 0.2$, 40 have $\sigma = 0.75$ and 50 have $\sigma = 1.2$. In all four plots, we look at the cases where $\lambda = 10$ and $\lambda = 100$ and compare them to uniform selection with $M = 5.99$ and $M = 2.5$, respectively, for the homogeneous channel case, and $M = 5.65$ and $M = 2.41$, respectively, for the heterogeneous channel case. In the uniform case, fractional devices are chosen by choosing the floor or ceiling of $M$ with the appropriate probability. We set $V = 1000$ for our algorithm and justify this choice in Section VI-C.

(a) Testing accuracy over time. (b) Loss function over time.

Fig. 2: Comparison of total communication time for uniform selection vs proposed algorithm on CIFAR-10 dataset.

In this i.i.d. data case, the advantages of our scheme are

readily apparent as our selection policy consistently reaches

testing accuracy values in less time compared to the uniform

equivalent. The achieved training speed up is more noticeable

in the heterogeneous channel case since the algorithm picks

the devices with bad channels less often. When comparing Figures 2a and 2c, this is most clear in the $\lambda = 100$ case. For example, in the homogeneous channel case, our algorithm first reaches an accuracy of 0.7 in 79.2% less time whereas in the heterogeneous channel case, our algorithm first reaches an accuracy of 0.7 in 58.2% less time which is a larger speed up.

We also note that the selection schemes that choose fewer devices per round, e.g., the $\lambda = 100$ and corresponding uniform cases, appear to converge to a higher accuracy faster because they are able to complete more iterations in the given time frame. So, while having fewer devices participate in a round generally results in a poorer quality update due to increased variance, it allows the local models to be aggregated more quickly and thus can result in faster convergence in time. In other words, quantity over quality wins out. To illustrate the worse per-round performance of the regimes with fewer devices per round, we plot the same results from Figure 2 in Figure 3, but versus communication rounds rather than communication time. We reiterate that larger $\lambda$ means fewer devices chosen per round on average. It is clear that as $\lambda$ is increased, the testing accuracy converges more slowly per round and oscillates more intensely. This reveals an interesting unsolved trade-off between the quality and the speed of global updates in federated learning. While the scenario and datasets considered here favor faster, lower quality updates, this might not always be the case. For example, the optimal update policy will depend on factors such as the communication/channel model used and the computation time required to compute updates.

(a) Testing accuracy over rounds. (b) Loss function over rounds.

Fig. 3: Effect of $\lambda$ (CIFAR-10).

B. FEMNIST Results

In our next experiment, we compare the total communication time for the FEMNIST dataset using uniform sampling versus our algorithm. We set the fading parameters such that for the heterogeneous case, 500 clients have $\sigma = 0.2$, 1500 clients have $\sigma = 0.75$, and 1597 clients have $\sigma = 1.2$, while for the homogeneous case, we set $\sigma = 1$ for all devices. We again set $V = 1000$ for our algorithm. In uniform selection, we set $M = 54.36$ and $M = 19.4$ devices for $\lambda = 10$ and $\lambda = 100$, respectively, for the homogeneous case, and we set $M = 52.7$ and $M = 18.62$ devices for $\lambda = 10$ and $\lambda = 100$, respectively, for the heterogeneous case. The results are shown in Figure 4.

Interestingly, in the homogeneous channel case (Figures 4a and 4b), the two selection strategies perform very similarly, with a marginal increase in speed for our algorithm. This is most likely due to the greater number of devices being chosen and the similarity in channel gain, which cause the algorithm to choose devices in a way that is close to uniform selection. For the heterogeneous channel gain case, on the other hand, the more widely varying channel gains cause the algorithm to choose the devices with better channels more frequently. Since our analysis guarantees that the algorithm converges even when training on non-i.i.d. data, the model still converges and benefits from the time saved using our device selection policy. Another interesting note is that, in the heterogeneous case, the percentage of speed-up is better for the $\lambda = 10$ case than the $\lambda = 100$ case. For example, the testing accuracy reaches 0.8 in 69.5% less time in the $\lambda = 10$ case compared to its uniform equivalent and in 86.2% less time in the $\lambda = 100$ case compared to its uniform equivalent.

(a) Testing accuracy over time. (b) Loss function over time.

Fig. 4: Comparison of total communication time for uniform selection vs. proposed algorithm on FEMNIST dataset.
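To illustrate why devices with a small fading parameter dominate the communication time in the heterogeneous setups, the following sketch draws Rayleigh channel gains and evaluates the time needed to send an update at the Shannon rate $B \log_2(1 + |h|^2 P / N_0)$. Several things here are assumptions made for illustration only: $\sigma$ is treated as the Rayleigh scale parameter, and the values of `ell`, `bandwidth`, `noise`, and `power` are placeholders rather than the settings used in our experiments.

```python
import math
import random
import statistics


def rayleigh_gain(sigma, rng=random):
    """Draw |h| from a Rayleigh distribution with scale sigma (inverse CDF)."""
    u = 1.0 - rng.random()  # u lies in (0, 1], so log(u) is finite
    return sigma * math.sqrt(-2.0 * math.log(u))


def transmit_time(h_abs, power, ell=1.0e6, bandwidth=1.0e6, noise=1.0):
    """Time to send ell bits at rate B * log2(1 + |h|^2 * P / N0)."""
    rate = bandwidth * math.log2(1.0 + (h_abs ** 2) * power / noise)
    return math.inf if rate == 0.0 else ell / rate


def median_transmit_time(sigma, power=1.0, draws=2001, rng=random):
    """Median transmit time over independent channel draws."""
    times = [transmit_time(rayleigh_gain(sigma, rng), power) for _ in range(draws)]
    return statistics.median(times)


if __name__ == "__main__":
    random.seed(0)
    for sigma in (0.2, 0.75, 1.2):
        print(sigma, median_transmit_time(sigma))
```

The median is reported instead of the mean because rare deep fades produce very long transmit times that make a sample mean erratic.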

C. The Effect of $V$

In Figure 5, we plot the expected time-average transmit power $\frac{1}{T}\sum_{t=0}^{T-1} P_n(t)\, q_n^t$ over the course of training rounds to show how the parameter $V$ in our algorithm affects the satisfaction of the power constraint. While a large $V$ brings us closer to the optimal values that minimize the weighted sum of the time average from the convergence upper bound and the total communication time, it also takes more rounds for the time-average power constraint to be satisfied. For $V = 1$, the constraint is satisfied very quickly and the average oscillates around $\bar{P} = 1$, while for the $V = 10^5$ case, it takes many more rounds to satisfy the constraint. We also note for comparison purposes that the power allocated in the uniform selection case will always satisfy the constraint by design. Thus, our algorithm tolerates violating the constraint initially, over a finite number of rounds, in order to make gains in performance, but always satisfies the constraint asymptotically. Our gains are not solely attributed to this, however. For the previous experiments, we chose $V = 1000$ since it satisfies the constraint in about the same number of rounds as it takes for the loss function to achieve a desired value.
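The quantity tracked in this section is a running time average. A minimal sketch of how it could be computed from a per-round trace is given below; the input `samples` stands for the per-round values $P_n(t)\, q_n^t$ of one device, and both function names are hypothetical.

```python
def running_time_average(samples):
    """Return (1/T) * sum_{t<T} samples[t] for T = 1, ..., len(samples)."""
    averages, total = [], 0.0
    for t, value in enumerate(samples, start=1):
        total += value
        averages.append(total / t)
    return averages


def first_round_satisfied(samples, p_bar=1.0):
    """First T (1-indexed) with running average <= p_bar, or None if never."""
    for t, avg in enumerate(running_time_average(samples), start=1):
        if avg <= p_bar:
            return t
    return None


if __name__ == "__main__":
    trace = [2.0, 1.0, 0.0, 1.0]  # made-up values of P_n(t) * q_n^t
    print(running_time_average(trace))
    print(first_round_satisfied(trace, p_bar=1.0))
```

Comparing `first_round_satisfied` across traces generated with different $V$ gives the number of rounds needed before the time-average constraint holds.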

Fig. 5: The convergence of the constraint for different values of $V$. The larger the $V$, the more rounds it takes until the constraint is satisfied. Here, the constraint is $\bar{P}_n = 1$ for all $n$.

VII. Conclusions and Future Works

In this paper, we studied the effect of arbitrary selection probabilities for devices in federated learning and noted the challenge of scheduling devices in a heterogeneous wireless environment. After deriving a novel convergence bound for non-convex loss functions, we formulated a stochastic optimization problem that minimizes a weighted sum of the derived convergence bound and the total time spent on transmitting the parameter updates under a transmit power constraint. By using the Lyapunov drift-plus-penalty framework, we developed an algorithm that analytically solves the formulated problem to find the optimal selection probabilities and transmit powers. Our experimental results showed that even without knowledge of the channel statistics, a significant amount of time can be saved during the FL training procedure using our algorithm. We used a realistic non-i.i.d. dataset known as FEMNIST to demonstrate how the algorithm might perform in practice, and the results were very promising for heterogeneous wireless environments. We also showed via the CIFAR-10 dataset that the gains can be even greater when the data is i.i.d. Future work may consider multiple access communication schemes and seek to minimize the transmission time of the slowest of the chosen devices, since aggregation will ultimately be waiting for the last update. There is potential for many interesting directions by considering different objective functions to focus on different aspects of the FL process.

Appendix

A. Proof of Theorem 1

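Preliminary inequalities. By Jensen's inequality:

$$\left\| \frac{1}{M} \sum_{m=1}^{M} y_m \right\|^2 \le \frac{1}{M} \sum_{m=1}^{M} \|y_m\|^2 \tag{19}$$

$$\left\| \sum_{m=1}^{M} y_m \right\|^2 \le M \sum_{m=1}^{M} \|y_m\|^2 \tag{20}$$

By the Peter-Paul inequality:

$$\langle y_1, y_2 \rangle \le \frac{\rho \|y_1\|^2}{2} + \frac{\|y_2\|^2}{2\rho} \tag{21}$$

for $\rho > 0$.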
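As a quick sanity check, the three preliminary inequalities can be verified numerically on random vectors. The sketch below is illustrative only; the helper names are hypothetical, and a small tolerance is added to absorb floating-point rounding.

```python
import random


def sq_norm(v):
    """Squared Euclidean norm of a vector given as a list."""
    return sum(x * x for x in v)


def inner(u, v):
    """Euclidean inner product of two equally long lists."""
    return sum(a * b for a, b in zip(u, v))


def vec_sum(vectors):
    """Coordinate-wise sum of a list of vectors."""
    return [sum(col) for col in zip(*vectors)]


def check_inequalities(vectors, rho, tol=1e-9):
    """Return truth values of (19), (20), and (21) for the given vectors."""
    m = len(vectors)
    s = vec_sum(vectors)
    total = sum(sq_norm(y) for y in vectors)
    ineq19 = sq_norm([x / m for x in s]) <= total / m + tol
    ineq20 = sq_norm(s) <= m * total + tol
    y1, y2 = vectors[0], vectors[1]
    ineq21 = inner(y1, y2) <= rho * sq_norm(y1) / 2 + sq_norm(y2) / (2 * rho) + tol
    return ineq19, ineq20, ineq21


if __name__ == "__main__":
    random.seed(1)
    ys = [[random.gauss(0.0, 1.0) for _ in range(5)] for _ in range(4)]
    print(check_inequalities(ys, rho=0.3))
```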

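First, we note that

$$x_{t+1} - x_t = \frac{1}{N} \sum_{n=1}^{N} \frac{\mathbb{1}_n^t}{q_n^t} \left( y_{t,I}^n - y_{t,0}^n \right) = -\frac{\gamma}{N} \sum_{n=1}^{N} \frac{\mathbb{1}_n^t}{q_n^t} \sum_{i=0}^{I-1} g_n(y_{t,i}^n).$$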

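Then, from $L$-smoothness, we have

$$
\begin{aligned}
\mathbb{E}[f(x_{t+1}) \mid x_t]
&\le f(x_t) + \left\langle \nabla f(x_t), \mathbb{E}[x_{t+1} - x_t \mid x_t] \right\rangle + \frac{L}{2}\, \mathbb{E}\left[ \|x_{t+1} - x_t\|^2 \,\middle|\, x_t \right] \\
&= f(x_t) - \gamma \left\langle \nabla f(x_t), \mathbb{E}\left[ \frac{1}{N} \sum_{n=1}^{N} \frac{\mathbb{1}_n^t}{q_n^t} \sum_{i=0}^{I-1} g_n(y_{t,i}^n) \,\middle|\, x_t \right] \right\rangle
 + \frac{\gamma^2 L}{2N^2}\, \mathbb{E}\left[ \left\| \sum_{n=1}^{N} \frac{\mathbb{1}_n^t}{q_n^t} \sum_{i=0}^{I-1} g_n(y_{t,i}^n) \right\|^2 \,\middle|\, x_t \right] \\
&\overset{(a)}{=} f(x_t) - \gamma \left\langle \nabla f(x_t), \frac{1}{N} \sum_{n=1}^{N} \sum_{i=0}^{I-1} \mathbb{E}\left[ \nabla f_n(y_{t,i}^n) \,\middle|\, x_t \right] \right\rangle
 + \frac{\gamma^2 L}{2N^2}\, \mathbb{E}\left[ \left\| \sum_{n=1}^{N} \frac{\mathbb{1}_n^t}{q_n^t} \sum_{i=0}^{I-1} g_n(y_{t,i}^n) \right\|^2 \,\middle|\, x_t \right] \\
&= f(x_t) - \gamma \sum_{i=0}^{I-1} \mathbb{E}\left[ \left\langle \nabla f(x_t), \frac{1}{N} \sum_{n=1}^{N} \nabla f_n(y_{t,i}^n) \right\rangle \,\middle|\, x_t \right]
 + \frac{\gamma^2 L}{2N^2}\, \mathbb{E}\left[ \left\| \sum_{n=1}^{N} \frac{\mathbb{1}_n^t}{q_n^t} \sum_{i=0}^{I-1} g_n(y_{t,i}^n) \right\|^2 \,\middle|\, x_t \right]
\end{aligned}
\tag{22}
$$

where (a) uses the independence between $\mathbb{1}_n^t$ and the stochastic gradients $g_n(y_{t,i}^n)$, the fact that $\mathbb{E}[\mathbb{1}_n^t \mid x_t] = \mathbb{E}[\mathbb{1}_n^t] = q_n^t$, and the total expectation $\mathbb{E}[g_n(y_{t,i}^n) \mid x_t] = \mathbb{E}\left[ \mathbb{E}[g_n(y_{t,i}^n) \mid y_{t,i}^n, x_t] \,\middle|\, x_t \right] = \mathbb{E}[\nabla f_n(y_{t,i}^n) \mid x_t]$.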

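For the last term, we note that

$$
\begin{aligned}
\mathbb{E}\left[ \left\| \sum_{n=1}^{N} \frac{\mathbb{1}_n^t}{q_n^t} \sum_{i=0}^{I-1} g_n(y_{t,i}^n) \right\|^2 \,\middle|\, x_t \right]
&\le N \sum_{n=1}^{N} \mathbb{E}\left[ \left\| \frac{\mathbb{1}_n^t}{q_n^t} \sum_{i=0}^{I-1} g_n(y_{t,i}^n) \right\|^2 \,\middle|\, x_t \right] \\
&= N \sum_{n=1}^{N} \frac{\mathbb{E}[\mathbb{1}_n^t \mid x_t]}{(q_n^t)^2}\, \mathbb{E}\left[ \left\| \sum_{i=0}^{I-1} g_n(y_{t,i}^n) \right\|^2 \,\middle|\, x_t \right] \\
&\le N I \sum_{n=1}^{N} \frac{q_n^t}{(q_n^t)^2} \sum_{i=0}^{I-1} \mathbb{E}\left[ \left\| g_n(y_{t,i}^n) \right\|^2 \,\middle|\, x_t \right] \\
&\le N I \sum_{n=1}^{N} \frac{1}{q_n^t} \sum_{i=0}^{I-1} \mathbb{E}\left[ \left\| g_n(y_{t,i}^n) \right\|^2 \,\middle|\, x_t \right]
\end{aligned}
$$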
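The step in which the selection probability enters this chain can be written out explicitly. The short derivation below is a sketch under the assumptions already used in the proof, namely that $\mathbb{1}_n^t \in \{0, 1\}$ with $\mathbb{E}[\mathbb{1}_n^t] = q_n^t$:

```latex
\mathbb{E}\!\left[\left(\frac{\mathbb{1}_n^t}{q_n^t}\right)^{2}\right]
  = \frac{\mathbb{E}\big[(\mathbb{1}_n^t)^2\big]}{(q_n^t)^2}
  = \frac{\mathbb{E}\big[\mathbb{1}_n^t\big]}{(q_n^t)^2}
  = \frac{q_n^t}{(q_n^t)^2}
  = \frac{1}{q_n^t}.
```

A device that is selected rarely (small $q_n^t$) therefore contributes a larger second-moment term, which is the source of the tension between the convergence bound and the communication time.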

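Plugging back into (22), we get

$$
\begin{aligned}
\mathbb{E}[f(x_{t+1}) \mid x_t]
&\le f(x_t) - \gamma \sum_{i=0}^{I-1} \mathbb{E}\left[ \left\langle \nabla f(x_t), \frac{1}{N} \sum_{n=1}^{N} \nabla f_n(y_{t,i}^n) \right\rangle \,\middle|\, x_t \right] \\
&\quad + \frac{L I \gamma^2}{2N} \sum_{n=1}^{N} \frac{1}{q_n^t} \sum_{i=0}^{I-1} \mathbb{E}\left[ \left\| g_n(y_{t,i}^n) \right\|^2 \,\middle|\, x_t \right]
\end{aligned}
\tag{23}
$$

Taking total expectation on both sides, we have

$$
\begin{aligned}
\mathbb{E}[f(x_{t+1})]
&\le \mathbb{E}[f(x_t)] - \gamma \sum_{i=0}^{I-1} \mathbb{E}\left[ \left\langle \nabla f(x_t), \frac{1}{N} \sum_{n=1}^{N} \nabla f_n(y_{t,i}^n) \right\rangle \right] \\
&\quad + \frac{L I \gamma^2}{2N} \sum_{n=1}^{N} \frac{1}{q_n^t} \sum_{i=0}^{I-1} \mathbb{E}\left[ \left\| g_n(y_{t,i}^n) \right\|^2 \right]
\end{aligned}
\tag{24}
$$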

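Now, note that

$$
\begin{aligned}
&-\gamma\, \mathbb{E}\left[ \left\langle \nabla f(x_t), \frac{1}{N} \sum_{n=1}^{N} \nabla f_n(y_{t,i}^n) \right\rangle \right] \\
&= -\gamma\, \mathbb{E}\left[ \left\langle \nabla f(x_t), \frac{1}{N} \sum_{n=1}^{N} \nabla f_n(y_{t,i}^n) - \nabla f(x_t) + \nabla f(x_t) \right\rangle \right] \\
&= \gamma\, \mathbb{E}\left[ \left\langle \nabla f(x_t), \nabla f(x_t) - \frac{1}{N} \sum_{n=1}^{N} \nabla f_n(y_{t,i}^n) \right\rangle \right] - \gamma\, \mathbb{E}\left[ \left\langle \nabla f(x_t), \nabla f(x_t) \right\rangle \right] \\
&\le \frac{\gamma}{2}\, \mathbb{E}\left[ \|\nabla f(x_t)\|^2 \right] + \frac{\gamma}{2}\, \mathbb{E}\left[ \left\| \nabla f(x_t) - \frac{1}{N} \sum_{n=1}^{N} \nabla f_n(y_{t,i}^n) \right\|^2 \right] - \gamma\, \mathbb{E}\left[ \|\nabla f(x_t)\|^2 \right] \\
&= \frac{\gamma}{2}\, \mathbb{E}\left[ \|\nabla f(x_t)\|^2 \right] + \frac{\gamma}{2}\, \mathbb{E}\left[ \left\| \frac{1}{N} \sum_{n=1}^{N} \left( \nabla f_n(x_t) - \nabla f_n(y_{t,i}^n) \right) \right\|^2 \right] - \gamma\, \mathbb{E}\left[ \|\nabla f(x_t)\|^2 \right] \\
&\le \frac{\gamma}{2N} \sum_{n=1}^{N} \mathbb{E}\left[ \left\| \nabla f_n(x_t) - \nabla f_n(y_{t,i}^n) \right\|^2 \right] - \frac{\gamma}{2}\, \mathbb{E}\left[ \|\nabla f(x_t)\|^2 \right] \\
&\le \frac{\gamma L^2}{2N} \sum_{n=1}^{N} \mathbb{E}\left[ \left\| x_t - y_{t,i}^n \right\|^2 \right] - \frac{\gamma}{2}\, \mathbb{E}\left[ \|\nabla f(x_t)\|^2 \right] \\
&\le \frac{\gamma L^2}{2N} \sum_{n=1}^{N} \mathbb{E}\left[ \left\| \sum_{j=0}^{i-1} \gamma\, g_n(y_{t,j}^n) \right\|^2 \right] - \frac{\gamma}{2}\, \mathbb{E}\left[ \|\nabla f(x_t)\|^2 \right] \\
&\le \frac{\gamma^3 L^2 (I-1)}{2N} \sum_{n=1}^{N} \sum_{j=0}^{i-1} \mathbb{E}\left[ \left\| g_n(y_{t,j}^n) \right\|^2 \right] - \frac{\gamma}{2}\, \mathbb{E}\left[ \|\nabla f(x_t)\|^2 \right]
\end{aligned}
$$

Here, the first inequality uses (21) with $\rho = 1$, the following equality uses $f = \frac{1}{N} \sum_{n=1}^{N} f_n$, the next inequality uses (19), the next uses $L$-smoothness of each $f_n$, and the last uses (20) together with $i \le I - 1$.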

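Plugging back into (24), we have

$$
\begin{aligned}
\mathbb{E}[f(x_{t+1})]
&\le \mathbb{E}[f(x_t)] + \frac{\gamma^3 L^2 (I-1)}{2N} \sum_{n=1}^{N} \sum_{i=0}^{I-1} \sum_{j=0}^{i-1} \mathbb{E}\left[ \left\| g_n(y_{t,j}^n) \right\|^2 \right] - \frac{\gamma I}{2}\, \mathbb{E}\left[ \|\nabla f(x_t)\|^2 \right] \\
&\quad + \frac{L I \gamma^2}{2N} \sum_{n=1}^{N} \frac{1}{q_n^t} \sum_{i=0}^{I-1} \mathbb{E}\left[ \left\| g_n(y_{t,i}^n) \right\|^2 \right]
\end{aligned}
\tag{25}
$$

Rearranging and summing over $t$ from $0$ to $T-1$, we have

$$
\begin{aligned}
\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\left[ \|\nabla f(x_t)\|^2 \right]
&\le \frac{2 \left( \mathbb{E}[f(x_0)] - \mathbb{E}[f(x_T)] \right)}{\gamma T I}
 + \frac{\gamma^2 L^2 (I-1)}{T I N} \sum_{t=0}^{T-1} \sum_{n=1}^{N} \sum_{i=0}^{I-1} \sum_{j=0}^{i-1} \mathbb{E}\left[ \left\| g_n(y_{t,j}^n) \right\|^2 \right] \\
&\quad + \frac{\gamma L}{T N} \sum_{t=0}^{T-1} \sum_{n=1}^{N} \frac{1}{q_n^t} \sum_{i=0}^{I-1} \mathbb{E}\left[ \left\| g_n(y_{t,i}^n) \right\|^2 \right] \\
&\le \frac{2 \left( f(x_0) - f^* \right)}{\gamma T I}
 + \frac{\gamma^2 L^2 (I-1)}{T I N} \sum_{t=0}^{T-1} \sum_{n=1}^{N} \sum_{i=0}^{I-1} \sum_{j=0}^{i-1} \mathbb{E}\left[ \left\| g_n(y_{t,j}^n) \right\|^2 \right] \\
&\quad + \frac{\gamma L}{T N} \sum_{t=0}^{T-1} \sum_{n=1}^{N} \frac{1}{q_n^t} \sum_{i=0}^{I-1} \mathbb{E}\left[ \left\| g_n(y_{t,i}^n) \right\|^2 \right]
\end{aligned}
\tag{26}
$$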

B. Proof of Theorem 2

Since there are only two variables to solve for and two simple boundary constraints per $n$, we can find the minimizing values of $q_n^t$ and $P_n(t)$ by finding the roots of the gradient of the objective function and ensuring that they are within the upper and lower bounds. If no roots are within that set, one of the end points will minimize the function, so we only need to check those points.

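To find the roots, we compute the gradient of the objective function for each $n$ in (15):

$$
\nabla f(q_n^t, P_n(t)) =
\begin{bmatrix}
-\dfrac{V}{N (q_n^t)^2} + \dfrac{V \lambda \ell}{B \log_2\left( 1 + \frac{|h_n(t)|^2 P_n(t)}{N_0} \right)} + Z_n(t) P_n(t) \\[3ex]
-\dfrac{V \lambda \ell\, |h_n(t)|^2\, q_n^t}{N_0 B \left( 1 + \frac{|h_n(t)|^2 P_n(t)}{N_0} \right) \left[ \log_2\left( 1 + \frac{|h_n(t)|^2 P_n(t)}{N_0} \right) \right]^2} + Z_n(t)\, q_n^t
\end{bmatrix}.
\tag{27}
$$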

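We first look at the partial derivative with respect to $P_n(t)$ and note that setting it equal to zero and dividing by $q_n^t$ gives

$$
0 = -\frac{V \lambda \ell\, |h_n(t)|^2 / (N_0 B)}{\left( 1 + \frac{|h_n(t)|^2 P_n(t)}{N_0} \right) \left[ \log_2\left( 1 + \frac{|h_n(t)|^2 P_n(t)}{N_0} \right) \right]^2} + Z_n(t),
$$

which does not depend on $q_n^t$. Next, let $A = \frac{V \lambda \ell\, |h_n(t)|^2 (\log 2)^2}{N_0 B Z_n(t)}$ and $x = 1 + \frac{|h_n(t)|^2 P_n(t)}{N_0}$; then we have something of the form

$$A = x (\log x)^2 = x \left( \log(1/x) \right)^2.$$

Dividing both sides by 4 gives $A/4 = x (\log \sqrt{x})^2$. Letting $x' = \sqrt{A/(4x)}$, so that $x' = \log \sqrt{x}$ for $x \ge 1$ and hence $\sqrt{x} = e^{x'}$, and rearranging, we have

$$\sqrt{A/4} = x' e^{x'},$$

which has the known solution $x' = W_k\left( \sqrt{A/4} \right)$, where $W_k(\cdot)$ is the Lambert $W$ function, which solves $w \exp(w) = z$ for $w$. To get the critical point for $P_n(t)$, we unwrap the substitutions, i.e., $x = \frac{A}{4} (x')^{-2}$ and $P_n(t) = \frac{N_0}{|h_n(t)|^2} (x - 1)$, to get

$$
P_n^{\mathrm{opt}}(t) = \frac{N_0}{|h_n(t)|^2} \left[ \frac{A}{4}\, W_k\left( \sqrt{\frac{A}{4}} \right)^{-2} - 1 \right],
\tag{28}
$$

which has a single root at $k = 0$ since $\sqrt{A/4} \ge 0$.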
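The closed form in (28) can be checked numerically. The sketch below evaluates the principal branch of the Lambert $W$ function with a plain Newton iteration (a library routine such as SciPy's `scipy.special.lambertw` could be used instead) and then verifies that the resulting $x$ satisfies $A = x (\log x)^2$. The function names and the numerical values are hypothetical and chosen only for illustration; $A > 0$ is assumed.

```python
import math


def lambert_w0(z, tol=1e-12, max_iter=100):
    """Principal branch W_0(z) for z >= 0, via Newton's method on w*exp(w) = z."""
    if z < 0:
        raise ValueError("this sketch only handles z >= 0")
    w = math.log1p(z)  # start to the right of the root, where Newton is monotone
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= tol * (1.0 + abs(w)):
            break
    return w


def optimal_power(a, h_abs_sq, noise):
    """Critical point from (28): N0/|h|^2 * ((A/4) * W0(sqrt(A/4))**-2 - 1), A > 0."""
    w = lambert_w0(math.sqrt(a / 4.0))
    x = (a / 4.0) / (w * w)  # x = 1 + |h|^2 * P / N0
    return noise / h_abs_sq * (x - 1.0)


if __name__ == "__main__":
    a, h_abs_sq, noise = 3.0, 2.0, 0.5  # placeholder values
    p = optimal_power(a, h_abs_sq, noise)
    x = 1.0 + h_abs_sq * p / noise
    print(p, x * math.log(x) ** 2)  # second value should reproduce A
```

As stated at the start of this proof, the value returned here is only the interior critical point; it still has to be compared against the lower and upper bounds on the transmit power.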

Finally, for the critical point for $q_n^t$, we can plug $P_n^{\mathrm{opt}}(t)$ into the partial derivative with respect to $q_n^t$, set it to zero, and solve to get (17).
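For reference, this last step can be written out as follows. This is a sketch derived only from the first component of (27), so it is expected, but not verified here, to match the expression stated in (17); the resulting value must also be checked against the bounds on $q_n^t$.

```latex
0 = -\frac{V}{N (q_n^t)^2}
    + \frac{V \lambda \ell}{B \log_2\!\left(1 + \frac{|h_n(t)|^2 P_n^{\mathrm{opt}}(t)}{N_0}\right)}
    + Z_n(t)\, P_n^{\mathrm{opt}}(t)
\quad\Longrightarrow\quad
q_n^t = \sqrt{\frac{V}{N \left( \dfrac{V \lambda \ell}{B \log_2\!\left(1 + \frac{|h_n(t)|^2 P_n^{\mathrm{opt}}(t)}{N_0}\right)} + Z_n(t)\, P_n^{\mathrm{opt}}(t) \right)}}
```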

