Quantized Federated Learning under Transmission Delay and
Outage Constraints
Yanmeng Wang, Yanqing Xu, Qingjiang Shi, and Tsung-Hui Chang
Abstract
Federated learning (FL) has been recognized as a viable distributed learning paradigm which trains a machine learning model collaboratively with massive mobile devices at the wireless edge while protecting user privacy. Although various communication schemes have been proposed to expedite the FL process, most of them have assumed ideal wireless channels which provide reliable and lossless communication links between the server and mobile clients. Unfortunately, in practical systems with limited radio resources, such as constraints on the training latency and on the transmission power and bandwidth, transmission of a large number of model parameters inevitably suffers from quantization errors (QE) and transmission outage (TO). In this paper, we consider such non-ideal wireless channels and carry out the first analysis showing that the FL convergence can be severely jeopardized by TO and QE, but intriguingly can be alleviated if the clients have uniform outage probabilities. These insightful results motivate us to propose a robust FL scheme, named FedTOE, which performs joint allocation of wireless resources and quantization bits across the clients to minimize the QE while making the clients have the same TO probability. Extensive experimental results are presented to show the superior performance of FedTOE for a deep learning-based classification task with transmission latency constraints.
Keywords: Federated learning, transmission outage, quantization error, convergence rate, wireless resource allocation.
1 Introduction
With the rapid development of mobile communications and artificial intelligence (AI), edge AI, a paradigm that exploits locally generated data to learn a machine learning (ML) model at the wireless edge, has attracted increasing attention from both academia and industry [1-3]. In particular, federated learning (FL) has been proposed to allow an edge server to coordinate massive mobile clients to collaboratively train a shared ML model without accessing the clients' raw data [4]. However, FL faces several critical challenges, including that the mobile clients have dramatically different data distributions (data heterogeneity) and different computation capabilities
Y. Wang, Y. Xu and T.-H. Chang are with the School of Science and Engineering, The Chinese University of
Hong Kong, Shenzhen 518172, China, and also with the Shenzhen Research Institute of Big Data, Shenzhen 518172,
China (e-mail: hiwangym@gmail.com, xuyanqing@cuhk.edu.cn, tsunghui.chang@ieee.org). Q. Shi is with the School
of Software Engineering, Tongji University, Shanghai 201804, China, and also with the Shenzhen Research Institute
of Big Data, Shenzhen 518172, China (e-mail: shiqj@tongji.edu.cn). (Corresponding author: Tsung-Hui Chang.)
(device heterogeneity) [5]. Moreover, the training is subject to latency constraints and limited communication resources when serving a large number of clients. In view of this, the well-known FedAvg algorithm [4], with local stochastic gradient descent (local SGD) and partial participation of clients, is widely adopted to reduce the training latency and communication overhead [6]. Furthermore, several improved FL algorithms have been proposed to reduce the inter-client variance caused by data heterogeneity [7, 8] and device heterogeneity [5, 9].
Recently, wireless resource scheduling has been introduced into FL from different perspectives. Firstly, some works have aimed to reduce the total training latency by improving the data throughput between the clients and the server under a limited resource budget. For example, [10] adopted joint client selection and beamforming design at the server to maximize the number of selected clients while guaranteeing the mean squared error performance of the received data at the server, while [11] introduced a hierarchical FL framework to maximize the uplink transmission rate under bandwidth and transmit power constraints. With a slight difference, [12] proposed a "later-is-better" principle to jointly optimize the client selection and bandwidth allocation throughout the training process under a total energy budget. However, all the above works did not explicitly consider the influence of resource allocation on the FL performance, and thus cannot directly minimize the training latency.
Secondly, some works aimed to achieve high learning performance within a total training latency budget by analyzing the theoretical relation between the number of communication rounds and the achieved learning accuracy. For instance, based on the number of communication rounds required to attain a certain model accuracy, [13] and [14] proposed to optimize the bandwidth allocation to minimize the total latency of the FedAvg algorithm. The work [15] optimized resource allocation under delay constraints and captured two tradeoffs: the tradeoff between computation and communication latencies, and that between training latency and energy consumption of all clients. While these works can minimize the training latency directly, they have assumed ideal wireless channels with reliable and lossless transmissions.
Some recent works have considered FL and wireless resource allocation under non-ideal wireless environments. For example, the work [16] studied the influence of the packet error rate on the convergence of FedAvg, and proposed a joint resource allocation and client selection scheme to improve the convergence speed of FedAvg. The work [17] attempted to redesign the averaging scheme of local models based on the transmission outage (TO) probabilities. The work [18] exploited the waveform-superposition property of broadband channels to reduce the transmission delay, and also investigated the impacts of channel fading and imperfect channel knowledge on the FL convergence. On the other hand, some works considered compressed transmission via quantization and analyzed the influence of the quantization error (QE) on the FL performance. For instance, [19] proposed a communication-efficient FL method, FedPAQ, which sends the quantized global model in the downlink, and then analyzed the effect of QE on the convergence of FL. Besides, the authors of [20] considered layered quantized transmissions for communication-efficient FL, where different quantization levels are assigned to different layers of the trained neural network. It is noted that in the aforementioned works [16-20], the issues of TO and QE have never been considered simultaneously.
In this paper, we highlight the need to study the joint impact of TO and QE on FL, especially when the transmission latency is constrained. Specifically, given a transmission delay constraint, a larger number of quantization bits leads to a smaller QE of the transmitted model but demands a higher transmission rate, which in turn results in a larger TO probability [21]. Therefore, either when the model size is large or when the latency constraint is stringent, it is essential to take both TO and QE into account in the FL process. In view of this, unlike the existing works [16-20], we study the joint effects of TO and QE while also considering that the clients have non-i.i.d. data distributions. To overcome these effects, we propose a new FL scheme, called FedTOE (Federated learning with Transmission Outage and quantization Error), which performs joint allocation of wireless resources and quantization bits to achieve robust FL performance in such a non-ideal learning environment. In particular, our main contributions include:
(1) FL convergence analysis under both TO and QE: We consider a non-convex FL problem, which is more general than the convex problems studied in [16, 17, 20], and consider non-ideal (uplink) wireless channels with both TO and QE. To the best of our knowledge, this paper is the first to analyze the influence of both TO and QE on the FL convergence simultaneously. The derived theoretical results show that non-uniform TO probabilities not only lead to a biased solution [5] but also amplify the negative effects caused by QE and non-i.i.d. data distribution (data heterogeneity). Intriguingly, this undesired property can be alleviated if the clients have the same TO probabilities.

(2) FedTOE: Inspired by this observation, we formulate a resource allocation problem to mitigate the impacts of TO and QE. Specifically, we propose to carefully allocate the (uplink) transmission bandwidth and quantization bits of the clients to minimize the aggregate QE subject to constraints on the transmission latency and TO probabilities. We show that the optimal solution to this problem achieves a uniform TO probability across the clients while minimizing the QE.

(3) Experiments: The proposed FedTOE is implemented for a deep learning-based handwritten-digit recognition task, and the experimental results demonstrate that FedTOE has promising performance over benchmark schemes.
Synopsis: Section 2 introduces the proposed system model of FL in the wireless environment. Then, Section 3 presents the convergence rate analysis of FL under both TO and QE. Based on these results, the wireless resource allocation scheme (i.e., FedTOE) is formulated in Section 4. The experimental results are presented in Section 5. Finally, Section 6 concludes this paper.
2 System model
2.1 Federated Learning Algorithm
Consider a wireless FL network as shown in Fig. 1, where a central server coordinates N mobile clients to solve the following distributed learning problem

$$\min_{\mathbf{w}\in\mathbb{R}^m} F(\mathbf{w}) = \sum_{i=1}^{N} p_i F_i(\mathbf{w}), \qquad (1)$$

where $F_i(\mathbf{w})$ is the (possibly) non-convex local loss function, $\mathbf{w}\in\mathbb{R}^m$ denotes the m-dimensional model parameters to be learned, and $p_i = n_i/\sum_{j=1}^{N} n_j$, in which $n_i$ is the number of data samples stored in client i. Let $\boldsymbol{\xi}_i$ be the mini-batch samples with size b; we denote $F_i(\mathbf{w},\boldsymbol{\xi}_i) = \frac{1}{b}\sum_{j=1}^{b} f(\mathbf{w},\xi_{ij})$, where $\xi_{ij}$ is the j-th randomly selected sample from the dataset of client i, and $f(\mathbf{w},\xi_{ij})$ is the model loss function with respect to $\xi_{ij}$. When $b = n_i$, $\boldsymbol{\xi}_i$ refers to the whole local dataset of client i, and then $F_i(\mathbf{w},\boldsymbol{\xi}_i) = F_i(\mathbf{w})$.

Figure 1: Federated learning in the wireless edge.
We follow the seminal FedAvg algorithm [4]. Specifically, in the r-th communication round, FedAvg executes the following three steps (see Fig. 1):

(a) Broadcasting: The server samples K clients, denoted by the set $S_r$ where $|S_r| = K$, and then broadcasts the global model $\bar{\mathbf{w}}^{r-1}$ of the last communication round to each client $i\in S_r$.

(b) Local model updating: Each client $i\in S_r$ updates its local model by local stochastic gradient descent (SGD) [7]. It contains E consecutive SGD updates as follows:

$$\mathbf{w}_i^{r,0} = \bar{\mathbf{w}}^{r-1},$$
$$\mathbf{w}_i^{r,\ell} = \mathbf{w}_i^{r,\ell-1} - \gamma\nabla F_i(\mathbf{w}_i^{r,\ell-1},\boldsymbol{\xi}_i^{r,\ell}), \quad \ell = 1,\ldots,E, \qquad (2)$$

where γ is the learning rate.

(c) Aggregation: The selected clients upload their local models $\mathbf{w}_i^{r,E}$ to the server for producing a new global model based on a certain aggregation principle.
Specifically, FedAvg considers the following two aggregation schemes, depending on whether all clients participate or not.

(i) Full participation: All clients participate in the aggregation process, i.e., $S_r = \{1,\cdots,N\}$ $\forall r$, and the global model is updated by

$$\tilde{\mathbf{w}}^r = \sum_{i=1}^{N} p_i\mathbf{w}_i^{r,E}. \qquad (3)$$

Considering the massive number of participants in the network, this scheme would not be feasible under limited communication bandwidth in the uplink channels.

(ii) Partial participation: With $|S_r| \ll N$, the global model is updated by

$$\bar{\mathbf{w}}^r = \frac{1}{K}\sum_{i\in S_r}\mathbf{w}_i^{r,E}, \qquad (4)$$

where the K clients ($K \ll N$) in $S_r$ are selected with replacement according to the probability distribution $\{p_1,\cdots,p_N\}$. It should be pointed out that the averaging scheme in (4) leads to an unbiased estimate of $\tilde{\mathbf{w}}^r$ in (3), i.e., $\mathbb{E}[\bar{\mathbf{w}}^r] = \tilde{\mathbf{w}}^r$ [6].
However, the aforementioned schemes are still far from practical. In particular, in digital communication systems, the model parameters need to be quantized before being transmitted, which brings QEs to the learned model. Meanwhile, channel fading could cause TO in the delivery of the model parameters from time to time. Moreover, given a fixed transmission delay, QE is strongly coupled with TO. Specifically, a larger number of quantization bits leads to a smaller QE of the learned model but requires a higher transmission rate, which in turn can further elevate the TO probability. Therefore, it is essential to consider TO and QE simultaneously in wireless FL systems. Motivated by this, in the next two subsections, we incorporate QE and TO in the uplink channels of FL and describe their impacts in detail.¹
2.2 Quantized Transmission

For the local model $\mathbf{w}_i^{r,E}$, we assume that each parameter $w_{ij}^{r,E}$ is bounded with $|w_{ij}^{r,E}| \in [\underline{w}_{ij}^r, \bar{w}_{ij}^r]$, and is quantized by the stochastic quantization method in [22]. In concrete terms, with $B_i^r$ quantization bits, we denote $\{c_0, c_1, \cdots, c_{2^{B_i^r}-1}\}$ as the knobs uniformly distributed in $[\underline{w}_{ij}^r, \bar{w}_{ij}^r]$, where

$$c_u = \underline{w}_{ij}^r + u\times\frac{\bar{w}_{ij}^r - \underline{w}_{ij}^r}{2^{B_i^r}-1}, \quad u = 0,\cdots,2^{B_i^r}-1. \qquad (5)$$

Then, the parameter $w_{ij}^{r,E}$ whose magnitude falls in $[c_u, c_{u+1})$ is quantized by

$$\mathcal{Q}(w_{ij}^{r,E}) = \begin{cases} \mathrm{sign}(w_{ij}^{r,E})\cdot c_u, & \text{w.p. } \dfrac{c_{u+1}-|w_{ij}^{r,E}|}{c_{u+1}-c_u}, \\[2mm] \mathrm{sign}(w_{ij}^{r,E})\cdot c_{u+1}, & \text{w.p. } \dfrac{|w_{ij}^{r,E}|-c_u}{c_{u+1}-c_u}, \end{cases} \qquad (6)$$

where 'w.p.' stands for 'with probability'. In addition, let μ be the number of bits used to represent $\mathrm{sign}(w_{ij}^{r,E})$, $\underline{w}_{ij}^r$ and $\bar{w}_{ij}^r$. Then, the quantized local model $\mathcal{Q}(\mathbf{w}_i^{r,E}) = [\mathcal{Q}(w_{i1}^{r,E}),\cdots,\mathcal{Q}(w_{im}^{r,E})]$ is expressed by a total number of

$$\hat{B}_i^r = mB_i^r + \mu \ \text{bits}, \qquad (7)$$

and is sent to the server.
Lemma 1 With the stochastic quantization method, each local model is unbiasedly estimated as

$$\mathbb{E}[\mathcal{Q}(\mathbf{w}_i^{r,E})] = \mathbf{w}_i^{r,E}, \qquad (8)$$

and the associated QE is bounded by

$$\mathbb{E}[\|\mathcal{Q}(\mathbf{w}_i^{r,E}) - \mathbf{w}_i^{r,E}\|^2] \le \delta_{ir}^2/(2^{B_i^r}-1)^2 \triangleq J_{ir}^2, \qquad (9)$$

where $\delta_{ir} \triangleq \sqrt{\frac{1}{4}\sum_{j=1}^{m}(\bar{w}_{ij}^r - \underline{w}_{ij}^r)^2}$.

Proof: Properties like Lemma 1 have been discussed in the literature; see [19] and [20]. For ease of reference, the proof is presented in Section A of the Supplementary Material.
As one can see from (7) and (9), a higher quantization level $B_i^r$ leads to a larger number of bits $\hat{B}_i^r$ for transmission but a smaller QE.

¹In the current work, we only consider the TO and QE in the uplink transmission, since the server (i.e., base station) is assumed to be powerful enough to provide reliable and lossless communications for the downlink broadcast channels [19].
2.3 Transmission Outage

There are several ways to model TO in wireless channels. For example: 1) without channel state information at the transmitter (CSIT), the transmission may suffer from outage due to large-scale fading such as shadowing [16]; 2) with imperfect CSIT (e.g., imperfect channel estimation or finite-bandwidth feedback), the CSI error could cause transmission outage [23]; 3) with perfect CSIT, due to finite-blocklength transmission, the receiver may fail to decode the message [24]. In this work, for simplicity, we assume no CSIT and focus on the impact of shadowing on the TO of the system.

By assuming that frequency division multiple access (FDMA) is adopted for uplink transmission, the channel capacity of each client $i\in S_r$ is

$$C_i^r = W_i^r\log_2\bigg(1 + \frac{P_i^r|h_i|^2}{W_i^r N_0}\bigg)\ \text{bps}, \qquad (10)$$

where $W_i^r$ and $P_i^r$ denote the allocated bandwidth and transmit power of client i, respectively, $h_i$ is the uplink channel coefficient between the server and client i, and $N_0$ represents the power spectral density (PSD) of the additive noise. According to the channel coding theorem [21], if the transmission rate $R_i^r$ is higher than $C_i^r$, TO occurs and the server fails to decode $\mathcal{Q}(\mathbf{w}_i^{r,E})$ correctly. Suppose that the uplink transmission is subject to a delay constraint $\tau_i$; then $R_i^r = \hat{B}_i^r/\tau_i$. Thus, the outage probability is given by

$$q_i^r \triangleq \Pr(C_i^r < R_i^r). \qquad (11)$$

We model the channel gain in (10) using the classical path loss model with shadowing [21], i.e., $[|h_i|^2]_{\mathrm{dB}} = [K]_{\mathrm{dB}} - \lambda[d_i]_{\mathrm{dB}} + \psi_{\mathrm{dB}}$, where $[x]_{\mathrm{dB}}$ measures x in dB, K is a constant depending on the antenna characteristics and channel attenuation, λ is the path loss exponent, $d_i$ (in meters) is the distance between client i and the server, and $\psi_{\mathrm{dB}} \sim \mathcal{N}(0,\sigma_{\mathrm{dB}}^2)$ is the shadowing. Then, the TO probability in (11) can be computed as

$$q_i = \Pr(\psi_{\mathrm{dB}} < \rho_i) = 1 - Q(\rho_i/\sigma_{\mathrm{dB}}), \qquad (12)$$

where $Q(x) = \int_x^{+\infty}\frac{1}{\sqrt{2\pi}}\exp(-\frac{1}{2}z^2)\,\mathrm{d}z$ is the Q-function and $\rho_i \triangleq [(2^{R_i/W_i}-1)W_iN_0]_{\mathrm{dB}} - [P_i]_{\mathrm{dB}} - [K]_{\mathrm{dB}} + \lambda[d_i]_{\mathrm{dB}}$.
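For illustration, the sketch below evaluates the TO probability (12); the constants σ_dB, [K]_dB, λ, and N_0 mirror the simulation setting of Section 5, while the rate, bandwidth, transmit power, and distance are assumptions made up for this example.

    import math

    def outage_prob(R, W, P_dBm, d, N0_dBm=-174.0, K_dB=-31.54, lam=3.0, sigma_dB=3.65):
        # TO probability (12): q = Pr(psi_dB < rho) = 1 - Q(rho/sigma) = Phi(rho/sigma).
        N0 = 10 ** (N0_dBm / 10) * 1e-3              # noise PSD in W/Hz
        P = 10 ** (P_dBm / 10) * 1e-3                # transmit power in W
        # rho = [(2^{R/W}-1) W N0]_dB - [P]_dB - [K]_dB + lambda [d]_dB
        rho = (10 * math.log10((2 ** (R / W) - 1) * W * N0)
               - 10 * math.log10(P) - K_dB + lam * 10 * math.log10(d))
        # Phi(x) via the error function; Q(x) = 1 - Phi(x)
        return 0.5 * (1 + math.erf(rho / (sigma_dB * math.sqrt(2))))

    # Assumed example: 300 kHz bandwidth, 23 dBm power, 400 m from the server,
    # sending ~143 kbits within 50 ms -> R ~ 2.87 Mbps; yields q ~ 0.15 here.
    print(outage_prob(R=2.87e6, W=3e5, P_dBm=23.0, d=400.0))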
2.4 Federated Learning with QE and TO

Let us reconsider the FedAvg in Section 2.1 in the presence of both TO and QE in the uplink. According to [20] and [22], it is more bit-efficient to transmit the model updates (i.e., $\mathbf{w}_i^{r,E} - \mathbf{w}_i^{r,0}$) than the model $\mathbf{w}_i^{r,E}$ itself in the uplink, since the dynamic range of the model updates can decrease with the number of communication rounds. By adopting this scheme, each client i sends to the server

$$\mathcal{Q}(\Delta\mathbf{w}_i^r) \triangleq \mathcal{Q}\bigg(\frac{1}{\gamma}(\mathbf{w}_i^{r,E} - \mathbf{w}_i^{r,0})\bigg) = \mathcal{Q}\bigg({-\sum_{\ell=1}^{E}\nabla F_i(\mathbf{w}_i^{r,\ell-1},\boldsymbol{\xi}_i^{r,\ell})}\bigg). \qquad (13)$$
Due to TO, the server may fail to receive the uploaded messages. We denote $\mathbb{1}_i^r = 1$ if the server correctly receives the transmitted local model update from client i, and $\mathbb{1}_i^r = 0$ otherwise. Then, with the partial participation scheme in (4), the global model at the server is obtained by

$$\bar{\mathbf{w}}^r = \bar{\mathbf{w}}^{r-1} + \gamma\frac{\sum_{i\in S_r}\mathbb{1}_i^r\mathcal{Q}(\Delta\mathbf{w}_i^r)}{\sum_{i\in S_r}\mathbb{1}_i^r}. \qquad (14)$$

Note that when the channel is ideal without TO and QE, (14) reduces to the simple averaging scheme in (4). We assume that the server can use a cyclic redundancy check (CRC) to detect whether a failure occurs [16]. If $\sum_{i\in S_r}\mathbb{1}_i^r = 0$, i.e., none of the clients successfully transmit their local updates, retransmission is carried out until at least one client's message is correctly received by the server.

In the downlink transmission, the global model $\bar{\mathbf{w}}^r$ is sent to each client $i\in S_r$ (assuming no TO and QE). This consideration is based on the following two reasons. First, the wireless resources of the server for broadcast transmission are arguably abundant enough to transmit the global model parameters reliably with high precision [19]. Second, the selected clients differ from round to round, so an additional caching mechanism would be required to track the latest global model if the server transmitted the model difference $\bar{\mathbf{w}}^r - \bar{\mathbf{w}}^{r-1}$ rather than $\bar{\mathbf{w}}^r$; see [20, 25] for details. The described FL algorithm with uplink TO and QE is summarized in Algorithm 1.

Algorithm 1 FedTOE: FL with uplink TO and QE
1: Initialize the global model $\bar{\mathbf{w}}^0$ at the server.
2: for r = 1, 2, ..., M do
3:   The server samples K clients $S_r$ with replacement based on the probabilities $\{p_1,\cdots,p_N\}$;
4:   The server broadcasts the global model $\bar{\mathbf{w}}^{r-1}$ to the clients in $S_r$;
5:   for client $i\in S_r$ do (in parallel)
6:     $\mathbf{w}_i^{r,0} \leftarrow \bar{\mathbf{w}}^{r-1}$;
7:     for $\ell = 1, 2, \cdots, E$ do
8:       Update the local model by mini-batch SGD in (2);
9:     end for
10:    Send the quantized model update in (13) to the server;
11:  end for
12:  if $\sum_{i\in S_r}\mathbb{1}_i^r = 0$ then
13:    Repeat Step 10 for all clients in $S_r$;
14:  else
15:    The server updates the global model by (14);
16:  end if
17: end for
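The server-side rule (14), together with the retransmission logic of Steps 12-16, can be sketched as follows; the client updates and outage probabilities are placeholder values for illustration, with outage events drawn independently per client.

    import numpy as np

    rng = np.random.default_rng(1)

    def aggregate(w_prev, updates, q, gamma):
        # One global update by (14); retransmit while no client gets through (Steps 12-16).
        while True:
            ok = rng.random(len(updates)) > q       # 1_i^r = 1 with probability 1 - q_i
            if ok.any():                            # at least one successful upload
                break                               # otherwise: all clients retransmit
        received = np.asarray(updates)[ok]
        return w_prev + gamma * received.mean(axis=0)  # average over successful clients only

    w_prev = np.zeros(4)
    updates = [rng.normal(size=4) for _ in range(3)]   # Q(dw_i^r) for i in S_r (placeholders)
    q = np.array([0.1, 0.1, 0.1])                      # uniform TO probabilities
    print(aggregate(w_prev, updates, q, gamma=0.05))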
Remark 1 Fig. 2 illustrates the influence of TO and QE on FL with full participation (i.e., K = N = 100) in the presence of non-i.i.d. data distribution. The ideal scheme suffers neither TO nor QE, while the curves with $B_i = 3$ and 10 refer to schemes that allocate uniform bandwidth and the same quantization level $B_i$ to all clients. For the detailed setting, refer to Section 5.1. One can see from this figure that the scheme with fewer quantization bits (i.e., $B_i = 3$) has impaired performance due to large QE, whereas the one with more quantization bits (i.e., $B_i = 10$) not only has a slower convergence rate but also fails to move toward the right solution due to the bias caused by TO. Therefore, the wireless resources and quantization bits need to be carefully allocated.
Figure 2: Training loss (a) and testing accuracy (b) of different schemes in the wireless environment, where the uplink transmission delay per communication round is constrained to 100 ms.
In view of this, we propose a robust FL scheme in this paper, referred to as FedTOE, which exhibits robustness in such non-ideal wireless channels with TO and QE, as shown in Fig. 2. We first present a novel theoretical analysis of the convergence of Algorithm 1 in the next section, based on which a joint wireless resource and quantization bit allocation scheme is developed in Section 4 to improve the FL performance under TO and QE.
3 Performance analysis

3.1 Assumptions

We consider general smooth non-convex learning problems under the following assumptions.

Assumption 1 Each local function $F_i$ is lower bounded, i.e., $F_i(\mathbf{w}) \ge F^* > -\infty$, and differentiable with $\nabla F_i$ Lipschitz continuous with constant L: $\forall\,\mathbf{v},\mathbf{w}$, $F_i(\mathbf{v}) \le F_i(\mathbf{w}) + (\mathbf{v}-\mathbf{w})^T\nabla F_i(\mathbf{w}) + \frac{L}{2}\|\mathbf{v}-\mathbf{w}\|_2^2$.

Assumption 2 (Unbiasedness and bounded variance of SGD) $\mathbb{E}[\nabla F_i(\mathbf{w},\xi_{ij})] = \nabla F_i(\mathbf{w})$ and $\mathbb{E}[\|\nabla F_i(\mathbf{w},\xi_{ij}) - \nabla F_i(\mathbf{w})\|^2] \le \sigma^2$.

Assumption 3 (Bounded data variance) $\mathbb{E}[\|\nabla F_i(\mathbf{w}) - \nabla F(\mathbf{w})\|^2] \le D_i^2$, $i = 1,\cdots,N$, which measures the heterogeneity of the local datasets [26].
3.2 Theoretical results

For ease of presentation, we consider fixed quantization levels and constant TO probabilities across the training process, i.e., $B_i^r = B_i$ and $q_i^r = q_i$ for all $r = 1,\cdots,M$. As one will see, this simplification is sufficient to reveal how TO and QE impact the algorithm convergence. The extension to the more general case is straightforward and presented in the Supplementary Material.

We first present the following lemma.
Lemma 2 Considering the FL algorithm in Algorithm 1, it holds true that

$$\mathbb{E}\bigg[\frac{\sum_{i\in S_r}\mathbb{1}_i^r\Delta\mathbf{w}_i^r}{\sum_{i\in S_r}\mathbb{1}_i^r}\,\Big|\,\sum_{i\in S_r}\mathbb{1}_i^r \ne 0\bigg] = \sum_{i=1}^{N}\bar{\beta}_i\Delta\mathbf{w}_i^r \qquad (15)$$

for some $\bar{\beta}_i\in[0,1]$ with $\sum_{i=1}^{N}\bar{\beta}_i = 1$, where $\mathbb{E}[\cdot]$ is taken with respect to $S_r$ and $\{\mathbb{1}_i^r\}$. Moreover, we also have

$$\mathbb{E}\bigg[\frac{\sum_{i\in S_r}\mathbb{1}_i^r\Delta\mathbf{w}_i^r}{(\sum_{i\in S_r}\mathbb{1}_i^r)^2}\,\Big|\,\sum_{i\in S_r}\mathbb{1}_i^r \ne 0\bigg] = \sum_{i=1}^{N}\bar{\alpha}_i\Delta\mathbf{w}_i^r \qquad (16)$$

for some $\bar{\alpha}_i \ge 0$, $i = 1,\ldots,N$, and therefore

$$\mathbb{E}\bigg[\frac{1}{\sum_{i\in S_r}\mathbb{1}_i^r}\,\Big|\,\sum_{i\in S_r}\mathbb{1}_i^r \ne 0\bigg] = \sum_{i=1}^{N}\bar{\alpha}_i \triangleq \frac{1}{\bar{K}}. \qquad (17)$$

When $q_i$ is uniform for all clients, i.e., $q_i = q\ \forall i$, then $\bar{\beta}_i = p_i$ and $\bar{\alpha}_i = p_i/\bar{K}\ \forall i$, with

$$\bar{K} = \frac{1-q^K}{\sum_{v=1}^{K}\frac{1}{v}C_K^v(1-q)^v q^{K-v}},$$

where $C_K^v = \frac{K!}{v!(K-v)!}$. In addition, if $q_i = 0\ \forall i$ (no TO), then $\bar{K} = K$.

Proof: See Appendix A.
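As a quick numerical illustration of the uniform-TO case (the values K = 2 and q = 0.5 are chosen only for this example):

$$\frac{1}{\bar{K}} = \frac{\frac{1}{1}C_2^1(1-q)q + \frac{1}{2}C_2^2(1-q)^2}{1-q^2} = \frac{0.5 + 0.125}{0.75} = \frac{5}{6},$$

so $\bar{K} = 1.2$: under 50% outage, the average effective number of active clients lies strictly between one and the K = 2 sampled clients.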
From (15), one can see that $\{\bar{\beta}_i\}$ are the equivalent appearance probabilities of $\{\Delta\mathbf{w}_i^r\}$ in the global aggregation under client sampling and TO, and that they deviate from $\{p_i\}$ when the $\{q_i\}$ are not uniform. Meanwhile, in (17), $\bar{K}$ represents the average effective number of active clients under TO. The main convergence result is stated below.
Theorem 1 Let Assumptions 1 to 3 hold. If one chooses $\gamma = \bar{K}^{\frac{1}{2}}/(8LT^{\frac{1}{2}})$ and $E \le T^{\frac{1}{4}}/\bar{K}^{\frac{3}{4}}$, where $T = ME \ge \max\{\bar{K}^3, 1/\bar{K}\}$ is the total number of SGD updates per client, we have

$$\frac{1}{M}\sum_{r=1}^{M}\mathbb{E}\bigg[\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^2\,\Big|\,\sum_{i\in S_r}\mathbb{1}_i^r \ne 0\bigg] \le \frac{496L\big(\mathbb{E}[F(\bar{\mathbf{w}}^0)]-F^*\big)}{11(T\bar{K})^{\frac{1}{2}}} + \bigg(\frac{39}{88(T\bar{K})^{\frac{1}{2}}}+\frac{1}{88(T\bar{K})^{\frac{3}{4}}}\bigg)\frac{\sigma^2}{b}$$
$$+\underbrace{\frac{31\bar{K}^{\frac{1}{2}}}{88T^{\frac{3}{2}}}\sum_{r=1}^{M}\sum_{i=1}^{N}\bar{\alpha}_iJ_{ir}^2}_{(a)\ \text{caused by QE}} +\underbrace{\frac{31}{22(T\bar{K})^{\frac{1}{4}}}\sum_{i=1}^{N}\bar{\alpha}_iD_i^2}_{(b)\ \text{caused by partial participation and data variance}} +\underbrace{\bigg(\frac{4}{11(T\bar{K})^{\frac{1}{2}}}+\frac{1}{22(T\bar{K})^{\frac{3}{4}}}\bigg)\sum_{i=1}^{N}\bar{\beta}_iD_i^2}_{(c)\ \text{caused by data variance}}$$
$$+\underbrace{\frac{62}{11}\chi^2_{\beta\|p}\sum_{i=1}^{N}p_iD_i^2}_{(d)\ \text{caused by TO and data variance}} +\underbrace{\frac{31}{22(T\bar{K})^{\frac{1}{4}}}\sum_{v=2}^{K}\frac{(q_{\max})^{K-v}C_K^v}{1-(q_{\max})^K}\sum_{i=1}^{N}p_i\|q_i-\bar{q}\|^2D_i^2}_{(e)\ \text{caused by TO and data variance}}, \qquad (18)$$

where $J_{ir}^2$ is given in (9), $\chi^2_{\beta\|p}\triangleq\sum_{i=1}^{N}(\bar{\beta}_i-p_i)^2/p_i$ is the chi-square divergence [5], and $q_{\max}=\max\{q_1,\ldots,q_N\}$ and $\bar{q}=\sum_{i=1}^{N}p_iq_i$ are the maximum and average TO probabilities, respectively.

Proof: Unlike the existing works [16-20, 27, 28], we consider a non-convex FL problem with both TO and QE, which makes Theorem 1 much more challenging to prove. In particular, we adopt the analysis frameworks in [26, 27] and develop several new techniques to deal with the difficulties brought by the TO variables $\mathbb{1}_i^r$ and the deviated probabilities $\bar{\beta}_i$ and $\bar{\alpha}_i$. The details are presented in Appendix B.
The upper bound on the right-hand side (RHS) of (18) reveals several important insights.

Firstly, the upper bound depends on the effective number of clients $\bar{K}$ instead of K, and thus larger TO probabilities directly slow down the algorithm convergence.

Secondly, we observe that, except for the first two terms, the terms (a)-(e) are caused by either QE, non-i.i.d. data distribution, TO, or partial client participation. Therefore, in ideal wireless channels without QE and TO and with full client participation, the terms (a), (b), (d) and (e) can be removed, whereas the term (c), due to the non-i.i.d. data distribution, still impedes the convergence.

Thirdly, the term (d) does not decrease with T. Since it is caused by non-uniform TO probabilities and non-i.i.d. data distribution, this implies that the former amplifies the negative effects of the latter and makes the algorithm converge to a biased solution. Intriguingly, this phenomenon is analogous to the objective inconsistency issue analyzed in [5], where the clients adopt different numbers of local SGD steps.

Last but not least, when the clients have a uniform TO probability, i.e., $q_i = q\ \forall i$, the terms (d) and (e) vanish, showing that the algorithm can still converge to a proper stationary solution. Specifically, by combining with Lemma 2, we can derive the following result:
Corollary 1 Under the same conditions as Theorem 1, if all clients have a uniform TO probability q, we have

$$\frac{1}{M}\sum_{r=1}^{M}\mathbb{E}\bigg[\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^2\,\Big|\,\sum_{i\in S_r}\mathbb{1}_i^r \ne 0\bigg] \le \frac{496L}{11(T\bar{K})^{\frac{1}{2}}}\big(\mathbb{E}[F(\bar{\mathbf{w}}^0)]-F^*\big) + \bigg(\frac{39}{88(T\bar{K})^{\frac{1}{2}}}+\frac{1}{88(T\bar{K})^{\frac{3}{4}}}\bigg)\frac{\sigma^2}{b}$$
$$+\frac{31}{88T^{\frac{3}{2}}\bar{K}^{\frac{1}{2}}}\sum_{r=1}^{M}\sum_{i=1}^{N}p_iJ_{ir}^2 +\bigg(\frac{4}{11(T\bar{K})^{\frac{1}{2}}}+\frac{1}{22(T\bar{K})^{\frac{3}{4}}}+\frac{31}{22T^{\frac{1}{4}}\bar{K}^{\frac{5}{4}}}\bigg)\sum_{i=1}^{N}p_iD_i^2. \qquad (19)$$
From the RHS of (19), we can observe that, with uniform TO probabilities, the impact of QE can be reduced with a larger effective number of clients $\bar{K}$, and the FL algorithm can also achieve a linear speed-up with respect to $\bar{K}$ even when both TO and QE are present. This inspiring result implies that balancing the client TO probabilities is crucial for achieving fast and robust FL in non-ideal wireless channels.
Remark 2 To the best of our knowledge, the claims in Theorem 1 and Corollary 1 and the associated insights have not been discovered in the literature. Note that these results can readily be extended to the general case where the quantization levels $\{B_i^r\}$ and TO probabilities $\{q_i^r\}$ vary with the communication round r. For example, the associated upper bound for Corollary 1 can be obtained by simply replacing $\sum_{i=1}^{N}p_iJ_{ir}^2$ in the RHS of (19) with $\mathbb{E}_{S_r}\big[\frac{1}{K}\sum_{i\in S_r}J_{ir}^2\big]$. More details are shown in Section B of the Supplementary Material.
4 Wireless Resource Allocation
Since both TO and QE inevitably occur in delay-constrained wireless communication systems, we aim to minimize their effects on FL at the wireless edge. Based on the theoretical results in Theorem 1 and Corollary 1, we propose to carefully allocate the wireless resources and quantization bits across the clients to minimize the impact of QE while achieving a uniform TO probability for the clients.
4.1 Proposed FedTOE
Let us first consider an offline scenario, where the bandwidth $W_i$, transmit power $P_i$, quantization level $B_i$, and uplink transmission rate $R_i$ of each client are optimized offline and applied throughout the model learning process. Online scheduling will be considered in Section 4.2.
4.1.1 Problem formulation

Based on Corollary 1 and the definition of QE in (9), the proposed FedTOE considers the following resource allocation problem:

$$\min_{\substack{W_i,P_i,B_i,R_i\\ i=1,\cdots,N}}\ \sum_{i=1}^{N}p_i\cdot\frac{\sum_{r=1}^{M}\delta_{ir}^2}{(2^{B_i}-1)^2} \qquad (20a)$$
$$\text{s.t.}\ \sum_{i=1}^{N}W_i \le W_{\text{total}},\ W_i \ge 0,\ i = 1,\cdots,N, \qquad (20b)$$
$$0 \le P_i \le P_{\max},\ i = 1,\cdots,N, \qquad (20c)$$
$$0 \le \tau_i \le \tau_{\max},\ i = 1,\cdots,N, \qquad (20d)$$
$$0 \le q_i \le q_{\max},\ i = 1,\cdots,N, \qquad (20e)$$
$$B_i \in \mathbb{Z}^+,\ i = 1,\cdots,N, \qquad (20f)$$

where $W_{\text{total}}$ is the total bandwidth of the uplink channel, $P_{\max}$ is the maximum transmit power of each client, $\tau_i$ is the uplink transmission delay per communication round of client i, $\tau_{\max}$ and $q_{\max}$ are the constraints on the uplink transmission delay and TO probabilities, respectively, and $\mathbb{Z}^+$ is the positive integer set.
4.1.2 Uplink delay

Since retransmission is performed if all selected clients encounter outage in the uplink transmission (i.e., $\sum_{i\in S_r}\mathbb{1}_i^r = 0$), the average transmission delay of each selected client $i\in S_r$ at the r-th communication round can be shown to be

$$\bar{\tau}_i^r = \frac{1}{1-\prod_{j\in S_r}q_j}\max_{j\in S_r}\frac{\hat{B}_j}{R_j}, \qquad (21)$$

where the derivation of (21) is presented in Section C of the Supplementary Material. One can see that $\prod_{j\in S_r}q_j \approx 0$ for a large K or small $q_j < 1$, and thus $\bar{\tau}_i^r \approx \max_{j\in S_r}\hat{B}_j/R_j$. To approximately meet the transmission delay constraint in (20d), we therefore replace (20d) by $0 \le \hat{B}_i/R_i \le \tau_{\max}$, $i = 1,\ldots,N$.
4.1.3 Optimality condition

One can show that the solution to (20) satisfies Proposition 1.

Proposition 1 (Optimality condition) After relaxing $B_i\in\mathbb{Z}^+$ to $B_i \ge 1$, $i = 1,\ldots,N$, the optimal solution of problem (20) satisfies: (a) the transmit power $P_i = P_{\max}\ \forall i$; (b) the uplink delay $\tau_i = \hat{B}_i/R_i = \tau_{\max}\ \forall i$; and (c) the outage probability $q_i = q_{\max}\ \forall i$. Moreover, based on (7) and (12), the optimal transmission rate $R_i$ satisfies

$$R_i = \bar{R}_i(W_i) \triangleq W_i\log_2\bigg(1+\frac{\theta_iP_{\max}}{W_iN_0}\bigg), \qquad (22)$$

where $\theta_i \triangleq 10^{\frac{1}{10}(\sigma_{\mathrm{dB}}\cdot Q^{-1}(1-q_{\max}) + [K]_{\mathrm{dB}} - \lambda[d_i]_{\mathrm{dB}})}$, and the optimal quantization level satisfies

$$B_i = (\bar{R}_i(W_i)\tau_{\max} - \mu)/m. \qquad (23)$$

Furthermore, (23) can be equivalently written as $W_i = W_i(B_i)$ for some continuously differentiable and increasing function $W_i(\cdot)$.

Proof: Conditions (a)-(c) can be easily proved by contradiction, based on the monotonicity of (20a) with respect to $B_i$. The existence of $W_i(\cdot)$ and its monotonically increasing property follow from the implicit function theorem [29]. The detailed proof is presented in Section D of the Supplementary Material.

Following Proposition 1, the optimal solution of (20) automatically makes all clients have the same TO probability.
4.1.4 Optimization method

By Proposition 1, problem (20), after relaxing $B_i\in\mathbb{Z}^+$ to $B_i \ge 1$, $i = 1,\ldots,N$, can be reformulated as

$$\min_{\substack{W_i\\ i=1,\cdots,N}}\ \sum_{i=1}^{N}\frac{p_i\sum_{r=1}^{M}\delta_{ir}^2}{\Big(2^{\frac{\tau_{\max}}{m}\bar{R}_i(W_i)-\frac{\mu}{m}}-1\Big)^2} \qquad (24a)$$
$$\text{s.t.}\ \sum_{i=1}^{N}W_i \le W_{\text{total}},\quad W_i \ge W_i(1),\ i = 1,\ldots,N. \qquad (24b)$$

Proposition 2 Problem (24) is convex.

Proof: It can be proved by showing that the second-order derivative of each term in the summation of (24a) with respect to $W_i$ is non-negative. The details are relegated to Section E of the Supplementary Material.

Based on Proposition 2, problem (24) can be efficiently solved by a simple gradient projection method [30] with an initial point in the feasible region of (24b); in practice, the value of $W_i(1)$ can be computed by bisection search based on (23) and the monotonicity of $W_i(B_i)$. Since $B_i$ is a positive integer, after each gradient descent step in optimizing (24), each $B_i$ obtained by (23) is floored to the nearest integer $\lfloor B_i\rfloor$. Then, the bandwidth supporting $\lfloor B_i\rfloor$ with TO probability $q_{\max}$ is given by $W_i(\lfloor B_i\rfloor)$, which is further used as the starting point for the next gradient descent step.
Algorithm 2 FedTOE: Algorithm to solve (20)
1: j = 0
2: while j < maximum iteration number do
3:   Update $W_i$ with a one-step gradient descent and projection on (24);
4:   Compute each $B_i$ (i = 1, ..., N) by (23);
5:   Set each $B_i = \lfloor B_i\rfloor$;
6:   Find each $W_i = W_i(\lfloor B_i\rfloor)$ by bisection search;
7:   j = j + 1;
8: end while
9: Compute each $R_i$ by (22);
Output: Transmit power $P_{\max}$, bandwidth $W_i$, quantization level $B_i$, and transmission rate $R_i$ of each client.

The details of our proposed wireless resource allocation method for offline scheduling are summarized in Algorithm 2. We refer to the FL process in Algorithm 1 with the wireless resource allocation solution given by Algorithm 2 as FedTOE.
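Below is a compact sketch of the loop in Algorithm 2 under stated assumptions: rate() and B_of_W() implement (22)-(23); W_of_B() inverts (23) by bisection (exploiting its monotonicity); the gradient of (24a) is taken numerically, which is possible because the objective is separable across clients; and the budget projection is a crude rescaling stand-in rather than an exact Euclidean projection. The constants, in particular theta and the step size, are invented for this toy example.

    import numpy as np

    # Illustrative constants (not the paper's exact experiment values)
    N, m, mu = 4, 23860, 3 * 64
    W_total, tau_max, Pmax = 20e6, 0.05, 0.2                 # Hz, s, W
    N0 = 10 ** (-17.4) * 1e-3                                # -174 dBm/Hz in W/Hz
    theta = np.array([8e-14, 4e-14, 2e-14, 1e-14])           # theta_i of (22), assumed
    delta2, p = np.ones(N), np.full(N, 1.0 / N)              # constant sum_r delta_ir^2

    def rate(W):          # (22): R_i(W_i) = W_i log2(1 + theta_i Pmax / (W_i N0))
        return W * np.log2(1.0 + theta * Pmax / (W * N0))

    def B_of_W(W):        # (23): B_i = (R_i(W_i) tau_max - mu) / m
        return (rate(W) * tau_max - mu) / m

    def W_of_B(B):        # invert (23) per client by bisection (B_of_W is increasing)
        lo, hi = np.full(N, 1.0), np.full(N, W_total)
        for _ in range(60):
            mid = 0.5 * (lo + hi)
            low = B_of_W(mid) < B
            lo, hi = np.where(low, mid, lo), np.where(low, hi, mid)
        return 0.5 * (lo + hi)

    def grad(W, eps=1.0): # numerical gradient of (24a); separable, so one shifted eval suffices
        f = lambda w: p * delta2 / (2.0 ** B_of_W(w) - 1.0) ** 2
        return (f(W + eps) - f(W)) / eps

    W = np.full(N, W_total / N)                    # feasible initial point
    for _ in range(100):
        W = W - 1e15 * grad(W)                     # Step 3: gradient step (size tuned for this toy)
        W = np.maximum(W, W_of_B(np.ones(N)))      # keep W_i >= W_i(1)
        W = W * min(1.0, W_total / W.sum())        # stand-in projection onto the budget (24b)
        B = np.maximum(np.floor(B_of_W(W)), 1.0)   # Steps 4-5: floor each B_i (at least 1)
        W = W_of_B(B)                              # Step 6: W_i(floor(B_i)) by bisection
    R = rate(W)                                    # Step 9: transmission rates via (22)
    print(B, np.round(W / 1e3), R)

In this toy run, the loop simply reallocates bandwidth toward clients with a higher marginal QE reduction, mirroring how FedTOE favors the distant clients (cf. Fig. 8).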
4.2 Online scheduling

In this subsection, let us investigate the online scenario, where the bandwidth $W_i^r$, transmit power $P_i^r$, quantization level $B_i^r$, and uplink transmission rate $R_i^r$ of each client are optimized for every communication round r. Such online scheduling can make better use of the wireless resources by dynamically allocating bandwidth and quantization bits to the selected clients in $S_r$ at each communication round r. According to Remark 2, we can consider the following QE minimization problem at each communication round:

$$\min_{\substack{W_i^r,P_i^r,B_i^r,R_i^r\\ i\in S_r}}\ \frac{1}{K}\sum_{i\in S_r}\frac{\delta_{ir}^2}{(2^{B_i^r}-1)^2} \qquad (25a)$$
$$\text{s.t.}\ \sum_{i\in S_r}W_i^r \le W_{\text{total}},\ W_i^r \ge 0,\ i\in S_r, \qquad (25b)$$
$$0 \le P_i^r \le P_{\max},\ i\in S_r, \qquad (25c)$$
$$0 \le q_i^r \le q_{\max},\ i\in S_r, \qquad (25d)$$
$$0 \le \bar{\tau}_i^r \le \tau_{\max},\ i\in S_r, \qquad (25e)$$
$$B_i^r \in \mathbb{Z}^+,\ i\in S_r. \qquad (25f)$$

Then, following similar derivations as for the offline scheme in the previous subsection, (25) can be handled by solving

$$\min_{W_i^r,\,i\in S_r}\ \sum_{i\in S_r}\frac{\delta_{ir}^2}{\Big(2^{\frac{\tau_{\max}}{m}\bar{R}_i(W_i^r)-\frac{\mu}{m}}-1\Big)^2} \qquad (26a)$$
$$\text{s.t.}\ \sum_{i\in S_r}W_i^r \le W_{\text{total}},\quad W_i^r \ge W_i(1),\ i\in S_r. \qquad (26b)$$
Table 1: Parameter Setting

  b = 128             N_0 = -174 dBm/Hz      W_total = 20 MHz
  E = 5               [K]_dB = -31.54        q_max = 0.1
  γ = 0.05            λ = 3                  B_min = 64 bits
  σ_dB = 3.65                                B_max = 64 bits
The procedure for solving (25) is similar to Algorithm 2, except that (24) in Step 3 is replaced with (26), $i = 1,\cdots,N$ in Step 4 is replaced with $i\in S_r$, and $W_i$, $B_i$, and $R_i$ are replaced with $W_i^r$, $B_i^r$, and $R_i^r$, respectively.
5 Numerical results

5.1 Parameter setting

In the simulations, we assume that the server (i.e., base station) is located at the cell center with a cell radius of 600 m, and N = 100 clients are uniformly distributed within the cell. The server employs Algorithm 1 to train a 3-layer neural network of size 784 × 30 × 10 for the classification of digits in the MNIST database [31]. In the experiments, we consider two types of local datasets, i.e., the i.i.d. and the non-i.i.d. local datasets. Specifically, in the i.i.d. case, the 60000 training samples in the MNIST database are shuffled and then randomly distributed to the clients, while in the non-i.i.d. case, the training samples are reordered by their digit labels from 0 to 9 and then partitioned so that each client possesses training samples of at most 2 digits. Besides, each client is assumed to possess the same number of training samples, i.e., $n_i = 600$, $i = 1,\ldots,N$.

In the simulations, the size of the quantized local model update is represented by

$$\hat{B}_i^r = m(1+B_i^r) + n_{\min}B_{\min} + n_{\max}B_{\max}\ \text{(bits)}, \qquad (27)$$

where the total number of model parameters is m = 23860, which consists of 23820 (= 784 × 30 + 30 × 10) weights and 40 (= 30 + 10) biases in the adopted neural network, and 1 bit, $B_{\min}$ bits, and $B_{\max}$ bits are used to represent the sign, the lower limit $\underline{w}_{ij}^r$, and the upper limit $\bar{w}_{ij}^r$ of each parameter update, respectively. In the quantization process (6), the weight updates belonging to the same layer share the same range $[\underline{w}_{ij}^r, \bar{w}_{ij}^r]$, and so do the bias updates. In this way, with a hidden layer and an output layer in the trained network, there are in total $n_{\max} = n_{\min} = 4$ different lower and upper limits, respectively, adopted by each client to quantize its local model update. For simplicity, we assume that the clients in $S_r$ have a similar constant $\delta_{ir}$ in (9), which leads to a constant $\sum_{r=1}^{M}\delta_{ir}^2$ for all clients in (20a). The other simulation parameters are listed in Table 1 [16, 21, 32], and all results were obtained by averaging over 5 independent experiments.
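For instance, plugging the adopted network into (27) with $B_i^r = 5$ quantization bits gives

$$\hat{B}_i^r = 23860\times(1+5) + 4\times 64 + 4\times 64 = 143672\ \text{bits} \approx 143.7\ \text{kbits},$$

so meeting a 50 ms per-round delay constraint would require an uplink rate of roughly 2.87 Mbps.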
Three baselines and the ideal scheme are considered for comparison with FedTOE.

Baseline 1. This scheme performs FL by Algorithm 1 with all clients adopting the maximum transmit power $P_{\max}$, the same quantization level $B_i$, uniform bandwidth $W_i = W_{\text{total}}/N$ (offline scheduling) or $W_i = W_{\text{total}}/K$ (online scheduling), and data rate $R_i = \hat{B}_i/\tau_{\max}$.

Baseline 2. Based on [17], the global model is updated by $\bar{\mathbf{w}}^r = \bar{\mathbf{w}}^{r-1} + \frac{\gamma}{K}\sum_{i\in S_r}\frac{p_i}{\hat{p}_i(1-q_i)}\mathbb{1}_i^r\Delta\mathbf{w}_i^r$, where $p_i$ is the weight of client i defined in (1) and $\hat{p}_i$ is the selection probability. For the full-participation case, $\hat{p}_i = 1$, while for the partial-participation case, $\hat{p}_i$ is optimized by formulation (13) in [17]. Since [17] only considers the influence of TO but not quantization, for a fair comparison, we modify the global updating scheme as

$$\bar{\mathbf{w}}^r = \bar{\mathbf{w}}^{r-1} + \frac{\gamma}{K}\sum_{i\in S_r}\frac{p_i}{\hat{p}_i(1-q_i)}\mathbb{1}_i^r\mathcal{Q}(\Delta\mathbf{w}_i^r). \qquad (28)$$

Other settings are the same as in Baseline 1.

Figure 3: TO probability of each client under different schemes, with (a) τ_max = 50 ms and (b) τ_max = 200 ms (a client with a larger index is farther away from the server).
Baseline 3. This scheme considers (20) but with fixed uniform bandwidth $W_i = W_{\text{total}}/N$ (offline) or $W_i = W_{\text{total}}/K$ (online). Thus, only $B_i$ is optimized, as determined by (23).

Ideal. The ideal scheme suffers neither TO nor QE and acts as the performance upper bound in the simulations.
5.2 Performance Comparison with Offline Resource Allocation

5.2.1 TO versus quantization level

To examine the effectiveness of FedTOE, the performance of different schemes is compared under two different constraints on the total uplink transmission delay $\tau_{\text{total}}$: a tight one with $\tau_{\text{total}} = 25$ s and a loose one with $\tau_{\text{total}} = 100$ s. Then, given the total number of communication rounds M = 500, the constraints on the uplink transmission delay per communication round (i.e., $\tau_{\max}$) for the above two cases are 50 ms and 200 ms, respectively.

Based on the above settings, Fig. 3 compares the TO probabilities of the proposed FedTOE and Baseline 1 (with different values of $B_i$). It can be seen from Fig. 3(a) that all clients in FedTOE have uniform TO probabilities, which is consistent with Proposition 1. In contrast, for Baseline 1, the clients farther from the server have larger TO probabilities. This is because the data rate $R_i$ is the same for all clients in Baseline 1, so the client with a longer distance from the server has a larger TO probability in (12). Meanwhile, as shown in Fig. 3(a), Baseline 1 with a larger quantization level $B_i$ incurs a higher TO probability. The reason is that, given a fixed uplink delay, transmitting more bits requires a higher data rate, which increases the TO probability. Further, it can be observed from Fig. 3(b) that under a relaxed delay constraint ($\tau_{\max} = 200$ ms), the TO probabilities in Baseline 1 with all $B_i$ are reduced significantly, since a smaller transmission rate $R_i$ can be used under $\tau_{\max} = 200$ ms, leading to lower TO probabilities.

Figure 4: Comparison between the baselines and FedTOE with τ_max = 50 ms for offline scheduling under the i.i.d. data (panels (a)-(b): K = 10; panels (c)-(d): K = 100).
Next, we evaluate the performance of FedTOE with respect to the communication rounds. In Figs. 4 to 6, the training loss and testing accuracy of the proposed FedTOE, Baseline 1 and Baseline 2 are compared. The performance of the ideal scheme is also shown in the figures. In the simulations, K = 10 refers to partial participation with replacement and K = N = 100 corresponds to full participation of all clients. It should be pointed out that the retransmission rounds caused when all clients experience TO are also counted.

The i.i.d. data case. One can see from Fig. 4(a) and Fig. 4(b) that under the i.i.d. case, both FedTOE and Baseline 1 with smaller $B_i = 2, 5$ perform closely to the ideal scheme. Specifically, under the i.i.d. case with data variance $D_i^2 \approx 0$, the objective inconsistency in Theorem 1 vanishes and the model learned by Baseline 1 can converge in the right direction even with TO. However, the TO probabilities affect the average effective number of active clients $\bar{K}$; thus Baseline 1 with $B_i = 10$ in Fig. 4(a) and Fig. 4(b) has deteriorated performance due to the higher TO probabilities and the large number of retransmission rounds. Interestingly, as shown in Fig. 4(c)-(d), when the number of selected clients K increases to 100, the effect of the outage probabilities in Baseline 1 is alleviated, since more clients can transmit their local model updates successfully.
Figure 5: Comparison between the baselines and FedTOE with τ_max = 50 ms for offline scheduling under the non-i.i.d. case (panels (a)-(b): K = 10; panels (c)-(d): K = 100).
It can also be observed from Fig. 4 that Baseline 2 [17] with $B_i = 5$ and 10 fails to learn the model. This is because, for partial participation with K = 10, higher selection probabilities in Baseline 2 are allocated to the clients with larger TO probabilities, thus reducing the effective number of active clients $\bar{K}$ and consequently slowing down the convergence of FL. Meanwhile, for full participation with K = 100, Baseline 2 with larger $B_i = 5$ and 10 still cannot correctly update the global model, since the averaging scheme (28) in Baseline 2 becomes unstable if the outage probability $q_i$ is large.

The non-i.i.d. data case. Comparing Fig. 4 with Fig. 5, we find that non-i.i.d. data degrade all curves, but the proposed FedTOE still performs closely to the ideal scheme and outperforms both Baselines 1 and 2. Specifically, one can observe from Fig. 5 that Baselines 1 and 2 with $B_i = 2$ have deteriorated performance, since the non-i.i.d. data amplify the effect of QE and $B_i = 2$ is not enough to accurately represent the model update. Different from the previous i.i.d. case, the reason why Baseline 1 with $B_i = 5$ and 10 fails to learn the model with non-i.i.d. data is that not only do the high TO probabilities decrease $\bar{K}$, but the non-uniform TO probabilities among clients also cause the objective inconsistency discussed in Theorem 1. Meanwhile, as shown in Fig. 5(c) and Fig. 5(d), the influence of non-uniform TO on Baseline 1 under the non-i.i.d. case cannot be alleviated by increasing the number of selected clients K to 100. Besides, different from Baselines 1 and 2, FedTOE can adaptively determine the quantization levels via (20) to achieve superior performance.

Figure 6: Comparison between the baselines and FedTOE with τ_max = 200 ms for offline scheduling under the non-i.i.d. case (K = 10).
Finally, it can be observed from Fig. 6 that under a looser per-round delay constraint ($\tau_{\max} = 200$ ms), Baselines 1 and 2 with $B_i = 5$ and 10 can also perform well, since the TO probabilities under $\tau_{\max} = 200$ ms are no longer high and become similar among clients, as shown in Fig. 3(b). In this situation, QE becomes the dominant factor in the FL performance; thus Baselines 1 and 2 with $B_i = 2$ still perform worse owing to large QE.

As a brief summary, the proposed FedTOE can automatically find the optimal bandwidth allocation $W_i$, quantization level $B_i$, and transmission rate $R_i$ for each client under different transmission delay constraints, and delivers robust FL performance in both the i.i.d. and non-i.i.d. cases.
5.2.2 Necessity of optimizing the bandwidth allocation

In this part, we demonstrate the necessity of optimizing the bandwidth allocation for FL. First of all, Fig. 7 compares the training loss and testing accuracy of FedTOE and Baseline 3 with respect to the total uplink transmission time $\tau_{\text{total}} = M\tau_{\max}$, under various per-round delay constraints $\tau_{\max}$. One can observe that for $\tau_{\max} = 50$ ms, FedTOE performs significantly better than Baseline 3, while for $\tau_{\max} \ge 100$ ms, the two schemes perform comparably. However, both schemes do not converge well for $\tau_{\max} = 40$ ms, due to the insufficient number of quantization bits under the stringent delay constraint.

To analyze why FedTOE outperforms Baseline 3, we plot in Fig. 8 the uplink bandwidth and quantization level allocated to the clients by the two schemes, where a client with a larger index is farther from the server. In the optimal wireless resource allocation of both FedTOE and Baseline 3, the outage probabilities of all clients achieve $q_{\max} = 0.1$. Under this condition, it can be seen from Fig. 8(a) that FedTOE prefers to allocate more bandwidth to the clients farther away from the server and less bandwidth to the clients close to the server, thus allowing a more uniform allocation of quantization bits, as shown in Fig. 8(b). On the contrary, Baseline 3 (which has a uniform bandwidth allocation) allocates larger $B_i$ to the clients close to the server, since they have larger channel capacities, whereas it has to allocate smaller $B_i$ to the distant clients due to the delay constraint, which causes significant QE. Therefore, when $\tau_{\max}$ is large, FedTOE and Baseline 3 perform equally well; however, when $\tau_{\max}$ is small, FedTOE can greatly outperform Baseline 3, as seen in Fig. 7.

Figure 7: Comparison between Baseline 3 and FedTOE with different τ_max for offline scheduling under the non-i.i.d. case (K = 10).

Figure 8: Allocated bandwidth $W_i$ (a) and quantization level $B_i$ (b) of each client for offline scheduling (a client with a larger index is farther from the server).
Lastly, one can see from Fig. 7 that a tighter per-round delay $\tau_{\max}$ can speed up the learning process if the total uplink transmission time $\tau_{\text{total}}$ is constrained. For example, FedTOE under $\tau_{\max} = 50$ ms has a faster learning speed than under $\tau_{\max} \ge 100$ ms. This is because a smaller $\tau_{\max}$ allows a larger number of communication rounds M under a fixed $\tau_{\text{total}}$. Similarly, Baseline 3 under a smaller $\tau_{\max}$ converges faster than under $\tau_{\max} \ge 100$ ms.
5.3 Performance Comparison with Online Scheduling

In this subsection, the performance of the proposed FedTOE with online scheduling is evaluated. In online scheduling, the total 20 MHz bandwidth is allocated to only the K = 10 selected clients per round, instead of to all 100 clients as in the offline scheme. The larger bandwidth allocated to each client improves its transmission rate and thus reduces the uplink transmission delay. Hence, compared with the per-round uplink delay constraint $\tau_{\max}$ adopted for offline scheduling in Fig. 5, we choose a much tighter $\tau_{\max} = 9$ ms to compare the training loss and testing accuracy of FedTOE, Baseline 1, and Baseline 2 under online scheduling. It can be seen from Fig. 9 that FedTOE still outperforms Baselines 1 and 2 under online scheduling. Specifically, Baselines 1 and 2 with $B_i = 2$ have poorer performance because of higher QE, while $B_i = 10$ fails to update the global model due to high TO probabilities. Meanwhile, Baseline 2 with $B_i = 5$ converges more slowly and fluctuates considerably because of the unstable averaging scheme (28) under high TO probabilities. While Baseline 1 with $B_i = 5$ gradually approaches FedTOE, FedTOE has a faster convergence rate and can dynamically adjust the quantization levels by (25) at each communication round.

Finally, Fig. 10 compares the performance of FedTOE and Baseline 3 under online scheduling with different uplink delay constraints. It can be observed from Fig. 10 that for a smaller uplink delay $\tau_{\max} = 6$ ms or 9 ms, FedTOE has a significant advantage over Baseline 3.

Figure 9: Comparison between the baselines and FedTOE with τ_max = 9 ms for online scheduling under the non-i.i.d. case (K = 10).

Figure 10: Comparison between Baseline 3 and FedTOE with different τ_max for online scheduling under the non-i.i.d. case (K = 10).
6 Conclusion

In this paper, we have investigated FL over non-ideal wireless channels in the presence of both TO and QE. We have carried out a novel convergence analysis which shows that TO and QE, together with non-i.i.d. data distribution, can significantly impede the FL process. In particular, we have shown that when the clients have heterogeneous TO probabilities, not only can the negative effects of QE and non-i.i.d. data distribution be enlarged, but the algorithm can also converge to a biased solution. On the contrary, when the clients have a uniform TO probability, these issues can be alleviated. Inspired by this result, we have proposed FedTOE, which performs joint allocation of bandwidth and quantization bits to minimize the QE while satisfying the transmission delay constraint and uniform TO probabilities. The presented experimental results have demonstrated that FedTOE exhibits superior robustness against TO and QE compared to existing schemes. Moreover, the experimental results have also shown that a tighter transmission delay constraint per communication round may speed up the FL process.
Appendices

A Proof of Lemma 2

A.1 Proof of (15) and (16)

At each communication round, K clients are selected independently and with replacement based on the probability distribution $\{p_i\}_{i=1}^{N}$. As a result, there are $N^K$ different possibilities for the set $S_r$ (denoted by $S_r^g$, $g = 1,\ldots,N^K$), and the appearance probability of each set $S_r^g$ is $\Pr(S_r = S_r^g) = \prod_{i\in S_r^g}p_i$. Meanwhile, since TOs occur independently across the clients, we have $\Pr\big(\sum_{i\in S_r}\mathbb{1}_i^r \ne 0\big) = 1 - \prod_{i\in S_r}q_i$. Then, we can obtain (15) for some non-negative $\bar{\beta}_i$, $i = 1,\ldots,N$, according to the derivations in (29):

$$\mathbb{E}\bigg[\frac{\sum_{i\in S_r}\mathbb{1}_i^r\Delta\mathbf{w}_i^r}{\sum_{i\in S_r}\mathbb{1}_i^r}\,\Big|\,\sum_{i\in S_r}\mathbb{1}_i^r \ne 0\bigg] \qquad (29a)$$
$$= \mathbb{E}_{S_r}\bigg[\mathbb{E}_{\mathrm{TO}}\bigg[\frac{\sum_{i\in S_r}\mathbb{1}_i^r\Delta\mathbf{w}_i^r}{\sum_{i\in S_r}\mathbb{1}_i^r}\,\Big|\,\sum_{i\in S_r}\mathbb{1}_i^r \ne 0\bigg]\bigg] \qquad (29b)$$
$$= \mathbb{E}_{S_r}\Bigg[\sum_{v=1}^{K}\sum_{\substack{\mathcal{B}_r\cup\bar{\mathcal{B}}_r=S_r\\|\mathcal{B}_r|=v,\,|\bar{\mathcal{B}}_r|=K-v}}\Pr\Big(\mathbb{1}_{k_1}^r=1\ \forall k_1\in\mathcal{B}_r,\ \mathbb{1}_{k_2}^r=0\ \forall k_2\in\bar{\mathcal{B}}_r\,\Big|\,\sum_{i\in S_r}\mathbb{1}_i^r\ne 0\Big)\cdot\frac{\sum_{k_1\in\mathcal{B}_r}\Delta\mathbf{w}_{k_1}^r}{v}\Bigg] \qquad (29c)$$
$$= \sum_{g=1}^{N^K}\bigg(\prod_{i\in S_r^g}p_i\bigg)\cdot\Bigg(\sum_{v=1}^{K}\sum_{\substack{\mathcal{B}_r^g\cup\bar{\mathcal{B}}_r^g=S_r^g\\|\mathcal{B}_r^g|=v,\,|\bar{\mathcal{B}}_r^g|=K-v}}\frac{\prod_{k_1\in\mathcal{B}_r^g}(1-q_{k_1})\prod_{k_2\in\bar{\mathcal{B}}_r^g}q_{k_2}}{1-\prod_{i\in S_r^g}q_i}\cdot\frac{\sum_{k_1\in\mathcal{B}_r^g}\Delta\mathbf{w}_{k_1}^r}{v}\Bigg) \qquad (29d)$$
$$\triangleq \sum_{i=1}^{N}\bar{\beta}_i\Delta\mathbf{w}_i^r, \qquad (29e)$$

where in (29c), $\mathcal{B}_r$ is the set of selected clients without TO while $\bar{\mathcal{B}}_r$ is that of the clients with TO, and in (29d), $\prod_{k_1\in\mathcal{B}_r^g}(1-q_{k_1})\prod_{k_2\in\bar{\mathcal{B}}_r^g}q_{k_2}$ is the probability of the event that solely the clients in $\mathcal{B}_r^g$ have successful transmissions. By letting $\Delta\mathbf{w}_i^r = 1$ in (29a), we then have $\sum_{i=1}^{N}\bar{\beta}_i = 1$. In the same fashion as (29), we can obtain

$$\mathbb{E}\bigg[\frac{\sum_{i\in S_r}\mathbb{1}_i^r\Delta\mathbf{w}_i^r}{(\sum_{i\in S_r}\mathbb{1}_i^r)^2}\,\Big|\,\sum_{i\in S_r}\mathbb{1}_i^r\ne 0\bigg] \qquad (30a)$$
$$= \sum_{g=1}^{N^K}\bigg(\prod_{i\in S_r^g}p_i\bigg)\cdot\Bigg(\sum_{v=1}^{K}\sum_{\substack{\mathcal{B}_r^g\cup\bar{\mathcal{B}}_r^g=S_r^g\\|\mathcal{B}_r^g|=v,\,|\bar{\mathcal{B}}_r^g|=K-v}}\frac{\prod_{k_1\in\mathcal{B}_r^g}(1-q_{k_1})\prod_{k_2\in\bar{\mathcal{B}}_r^g}q_{k_2}}{1-\prod_{i\in S_r^g}q_i}\cdot\frac{\sum_{k_1\in\mathcal{B}_r^g}\Delta\mathbf{w}_{k_1}^r}{v^2}\Bigg) \triangleq \sum_{i=1}^{N}\bar{\alpha}_i\Delta\mathbf{w}_i^r \qquad (30b)$$

for some $\bar{\alpha}_i \ge 0$, $i = 1,\cdots,N$, which is (16).

A.2 Computing the values of $\bar{\beta}_i$, $\bar{\alpha}_i$ and $\bar{K}$ under uniform TO

With the same TO probability q for all clients, (29) becomes

$$(29\text{a}) = \mathbb{E}_{S_r}\Bigg[\sum_{v=1}^{K}\sum_{\substack{\mathcal{B}_r\cup\bar{\mathcal{B}}_r=S_r\\|\mathcal{B}_r|=v,\,|\bar{\mathcal{B}}_r|=K-v}}\frac{(1-q)^v q^{K-v}}{1-q^K}\cdot\frac{\sum_{k_1\in\mathcal{B}_r}\Delta\mathbf{w}_{k_1}^r}{v}\Bigg]$$
$$= \mathbb{E}_{S_r}\Bigg[\sum_{v=1}^{K}\frac{(1-q)^v q^{K-v}}{1-q^K}\cdot\frac{1}{v}\sum_{\substack{\mathcal{B}_r\cup\bar{\mathcal{B}}_r=S_r\\|\mathcal{B}_r|=v,\,|\bar{\mathcal{B}}_r|=K-v}}\sum_{k_1\in\mathcal{B}_r}\Delta\mathbf{w}_{k_1}^r\Bigg]$$
$$= \mathbb{E}_{S_r}\Bigg[\sum_{v=1}^{K}\frac{(1-q)^v q^{K-v}}{1-q^K}\cdot\frac{1}{v}\sum_{i\in S_r}C_{K-1}^{v-1}\Delta\mathbf{w}_i^r\Bigg]$$
$$\stackrel{(a)}{=}\mathbb{E}_{S_r}\Bigg[\sum_{v=1}^{K}\frac{C_K^v(1-q)^v q^{K-v}}{1-q^K}\cdot\frac{1}{K}\sum_{i\in S_r}\Delta\mathbf{w}_i^r\Bigg] \stackrel{(b)}{=}\mathbb{E}_{S_r}\bigg[\frac{1}{K}\sum_{i\in S_r}\Delta\mathbf{w}_i^r\bigg] \stackrel{(c)}{=}\sum_{i=1}^{N}p_i\Delta\mathbf{w}_i^r, \qquad (31)$$

where equality (a) follows from $\frac{1}{v}C_{K-1}^{v-1}=\frac{1}{v}\cdot\frac{(K-1)!}{(v-1)!(K-v)!}=\frac{1}{K}\cdot\frac{K!}{v!(K-v)!}=\frac{1}{K}C_K^v$, equality (b) is by $\sum_{v=1}^{K}\frac{C_K^v(1-q)^v q^{K-v}}{1-q^K}=1$ since $\sum_{v=0}^{K}C_K^v(1-q)^v q^{K-v}=1$, and equality (c) is by the fact that the clients are independently sampled with replacement following the distribution $\{p_i\}_{i=1}^{N}$ [6]. After comparing (29e) with (31), we have $\bar{\beta}_i = p_i\ \forall i$ under the uniform-TO case.

Similarly to the proof of (31), with the same TO probability q for all clients, (30) becomes

$$(30\text{a}) = \mathbb{E}_{S_r}\Bigg[\sum_{v=1}^{K}\sum_{\substack{\mathcal{B}_r\cup\bar{\mathcal{B}}_r=S_r\\|\mathcal{B}_r|=v,\,|\bar{\mathcal{B}}_r|=K-v}}\frac{(1-q)^v q^{K-v}}{1-q^K}\cdot\frac{\sum_{k_1\in\mathcal{B}_r}\Delta\mathbf{w}_{k_1}^r}{v^2}\Bigg] = \sum_{v=1}^{K}\frac{1}{v}\cdot\frac{C_K^v(1-q)^v q^{K-v}}{1-q^K}\Bigg[\sum_{i=1}^{N}p_i\Delta\mathbf{w}_i^r\Bigg], \qquad (32)$$

and letting $\Delta\mathbf{w}_i^r = 1$ in (30a) and (32) gives rise to

$$\frac{1}{\bar{K}} = \mathbb{E}\bigg[\frac{1}{\sum_{i\in S_r}\mathbb{1}_i^r}\,\Big|\,\sum_{i\in S_r}\mathbb{1}_i^r\ne 0\bigg] = \sum_{v=1}^{K}\frac{1}{v}\cdot\frac{C_K^v(1-q)^v q^{K-v}}{1-q^K}.$$

Finally, by comparing (30b) and (32), we have $\bar{\alpha}_i = p_i/\bar{K}$ under the uniform-TO case.
B Proof of Theorem 1

Our analysis considers only the "successful" communication rounds where at least one client in $S_r$ communicates with the server successfully, and therefore the derivations are all based on the conditional events that $\sum_{i\in S_r}\mathbb{1}_i^r \ne 0$, $r = 1,\cdots,M$. In the following proof, without further clarification, we simply write $\mathbb{E}[\cdot]$ and $\Pr[\cdot]$ for the conditional $\mathbb{E}[\,\cdot\,|\sum_{i\in S_r}\mathbb{1}_i^r \ne 0]$ and $\Pr[\,\cdot\,|\sum_{i\in S_r}\mathbb{1}_i^r \ne 0]$, respectively.

B.1 Proof of convergence rate

With Assumption 1, we have

$$\mathbb{E}[F(\bar{\mathbf{w}}^r)] \le \mathbb{E}[F(\bar{\mathbf{w}}^{r-1})] + \mathbb{E}[\langle\nabla F(\bar{\mathbf{w}}^{r-1}),\bar{\mathbf{w}}^r-\bar{\mathbf{w}}^{r-1}\rangle] + \frac{L}{2}\mathbb{E}\|\bar{\mathbf{w}}^r-\bar{\mathbf{w}}^{r-1}\|^2. \qquad (33)$$

We need the following three key lemmas, which are proved in subsequent subsections.

Lemma 3 Under Assumptions 1 and 3, it holds that

$$\mathbb{E}[\langle\nabla F(\bar{\mathbf{w}}^{r-1}),\bar{\mathbf{w}}^r-\bar{\mathbf{w}}^{r-1}\rangle] \le -\frac{\gamma E}{2}\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^2 + \gamma E\chi^2_{\beta\|p}\sum_{i=1}^{N}p_iD_i^2 + \gamma L^2\sum_{i=1}^{N}\bar{\beta}_i\sum_{\ell=1}^{E}\mathbb{E}\big[\|\mathbf{w}_i^{r,\ell-1}-\bar{\mathbf{w}}^{r-1}\|^2\big], \qquad (34)$$

where $\chi^2_{\beta\|p} = \sum_{i=1}^{N}(\bar{\beta}_i-p_i)^2/p_i$ is the chi-square divergence between $\mathbf{p} = [p_1,\cdots,p_N]$ and $\boldsymbol{\beta} = [\bar{\beta}_1,\cdots,\bar{\beta}_N]$ [5].

Lemma 4 With $q_{\max}=\max\{q_1,\ldots,q_N\}$ and $\bar{q}=\sum_{i=1}^{N}p_iq_i$ the maximum and the average TO probabilities, we have

$$\mathbb{E}\|\bar{\mathbf{w}}^r-\bar{\mathbf{w}}^{r-1}\|^2 \le 4\gamma^2E^2\,\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^2 + \frac{\gamma^2E}{\bar{K}}\cdot\frac{\sigma^2}{b} + \gamma^2\sum_{i=1}^{N}\bar{\alpha}_iJ_{ir}^2 + 4\gamma^2E^2\sum_{i=1}^{N}\bar{\alpha}_iD_i^2$$
$$+ 4\gamma^2E^2\sum_{v=2}^{K}\frac{(q_{\max})^{K-v}C_K^v}{1-(q_{\max})^K}\sum_{i=1}^{N}p_i\|q_i-\bar{q}\|^2D_i^2 + 2\gamma^2EL^2\sum_{i=1}^{N}\bar{\beta}_i\sum_{\ell=1}^{E}\mathbb{E}\big[\|\mathbf{w}_i^{r,\ell-1}-\bar{\mathbf{w}}^{r-1}\|^2\big]. \qquad (35)$$

Lemma 5 The difference between the local model at round r and the global model at the previous round is bounded by

$$\sum_{\ell=1}^{E}\mathbb{E}\big[\|\mathbf{w}_i^{r,\ell-1}-\bar{\mathbf{w}}^{r-1}\|^2\big] \le \frac{\gamma^2E^3\frac{\sigma^2}{b} + 4\gamma^2E^3D_i^2 + 4\gamma^2E^3\,\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^2}{1-2\gamma^2E^2L^2}. \qquad (36)$$

By substituting (34) into the second term on the RHS of (33), (35) into the third term, and by (36), we have

$$\mathbb{E}[F(\bar{\mathbf{w}}^r)] \le \mathbb{E}[F(\bar{\mathbf{w}}^{r-1})] - \bigg(\frac{\gamma E}{2} - 2\gamma^2E^2L - \frac{4\gamma^3E^3L^2 + 4\gamma^4E^4L^3}{1-2\gamma^2E^2L^2}\bigg)\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^2$$
$$+ \bigg(\frac{\gamma^2EL}{2\bar{K}} + \frac{\gamma^3E^3L^2 + \gamma^4E^4L^3}{1-2\gamma^2E^2L^2}\bigg)\frac{\sigma^2}{b} + \frac{\gamma^2L}{2}\sum_{i=1}^{N}\bar{\alpha}_iJ_{ir}^2 + 2\gamma^2E^2L\sum_{i=1}^{N}\bar{\alpha}_iD_i^2$$
$$+ \frac{4\gamma^3E^3L^2 + 4\gamma^4E^4L^3}{1-2\gamma^2E^2L^2}\sum_{i=1}^{N}\bar{\beta}_iD_i^2 + \gamma E\chi^2_{\beta\|p}\sum_{i=1}^{N}p_iD_i^2 + 2\gamma^2E^2L\sum_{v=2}^{K}\frac{(q_{\max})^{K-v}C_K^v}{1-(q_{\max})^K}\sum_{i=1}^{N}p_i\|q_i-\bar{q}\|^2D_i^2.$$

Next, summing the above from r = 1 to M and dividing both sides by the total number of local mini-batch SGD steps T = ME yields

$$\bigg(\frac{\gamma}{2} - 2\gamma^2EL - \frac{4\gamma^3E^2L^2 + 4\gamma^4E^3L^3}{1-2\gamma^2E^2L^2}\bigg)\frac{\sum_{r=1}^{M}\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^2}{M} \le \frac{\mathbb{E}[F(\bar{\mathbf{w}}^0)]-\mathbb{E}[F(\bar{\mathbf{w}}^M)]}{T}$$
$$+ \bigg(\frac{\gamma^2L}{2\bar{K}} + \frac{\gamma^3E^2L^2 + \gamma^4E^3L^3}{1-2\gamma^2E^2L^2}\bigg)\frac{\sigma^2}{b} + \frac{\gamma^2L}{2T}\sum_{r=1}^{M}\sum_{i=1}^{N}\bar{\alpha}_iJ_{ir}^2 + 2\gamma^2EL\sum_{i=1}^{N}\bar{\alpha}_iD_i^2$$
$$+ \frac{4\gamma^3E^2L^2 + 4\gamma^4E^3L^3}{1-2\gamma^2E^2L^2}\sum_{i=1}^{N}\bar{\beta}_iD_i^2 + \gamma\chi^2_{\beta\|p}\sum_{i=1}^{N}p_iD_i^2 + 2\gamma^2EL\sum_{v=2}^{K}\frac{(q_{\max})^{K-v}C_K^v}{1-(q_{\max})^K}\sum_{i=1}^{N}p_i\|q_i-\bar{q}\|^2D_i^2. \qquad (37)$$

Further dividing both sides of (37) by γ leads to

$$\underbrace{\bigg(\frac{1}{2} - 2\gamma EL - \frac{4\gamma^2E^2L^2 + 4\gamma^3E^3L^3}{1-2\gamma^2E^2L^2}\bigg)}_{\triangleq H_1}\frac{\sum_{r=1}^{M}\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^2}{M} \le \underbrace{\frac{1}{\gamma T}}_{\triangleq H_2}\big(\mathbb{E}[F(\bar{\mathbf{w}}^0)]-\mathbb{E}[F(\bar{\mathbf{w}}^M)]\big)$$
$$+ \underbrace{\bigg(\frac{\gamma L}{2\bar{K}} + \frac{\gamma^2E^2L^2 + \gamma^3E^3L^3}{1-2\gamma^2E^2L^2}\bigg)}_{\triangleq H_3}\frac{\sigma^2}{b} + \underbrace{\frac{\gamma L}{2T}}_{\triangleq H_4}\sum_{r=1}^{M}\sum_{i=1}^{N}\bar{\alpha}_iJ_{ir}^2 + \underbrace{2\gamma EL}_{\triangleq H_6}\sum_{i=1}^{N}\bar{\alpha}_iD_i^2$$
$$+ \underbrace{\frac{4\gamma^2E^2L^2 + 4\gamma^3E^3L^3}{1-2\gamma^2E^2L^2}}_{\triangleq H_5}\sum_{i=1}^{N}\bar{\beta}_iD_i^2 + \chi^2_{\beta\|p}\sum_{i=1}^{N}p_iD_i^2 + \underbrace{2\gamma EL}_{\triangleq H_6}\sum_{v=2}^{K}\frac{(q_{\max})^{K-v}C_K^v}{1-(q_{\max})^K}\sum_{i=1}^{N}p_i\|q_i-\bar{q}\|^2D_i^2. \qquad (38)$$

Let the learning rate $\gamma = \bar{K}^{\frac{1}{2}}/(8LT^{\frac{1}{2}})$ and the number of local updating steps $E \le T^{\frac{1}{4}}/\bar{K}^{\frac{3}{4}}$, where $T \ge \max\{\bar{K}^3, 1/\bar{K}\}$ in order to guarantee $E \ge 1$. By this, $H_2 = 8L(T\bar{K})^{-\frac{1}{2}}$ and $H_4 = \bar{K}^{\frac{1}{2}}T^{-\frac{3}{2}}/16$. Since $\gamma EL \le (T\bar{K})^{-\frac{1}{4}}/8$, we have $H_6 \le (T\bar{K})^{-\frac{1}{4}}/4$ and

$$H_5 \le \frac{\frac{4}{8^2}(T\bar{K})^{-\frac{1}{2}} + \frac{4}{8^3}(T\bar{K})^{-\frac{3}{4}}}{1-\frac{2}{8^2}(T\bar{K})^{-\frac{1}{2}}} \stackrel{(a)}{\le} \frac{\frac{4}{8^2}(T\bar{K})^{-\frac{1}{2}} + \frac{4}{8^3}(T\bar{K})^{-\frac{3}{4}}}{1-\frac{2}{8^2}} = \frac{2}{31}(T\bar{K})^{-\frac{1}{2}} + \frac{1}{124}(T\bar{K})^{-\frac{3}{4}},$$

where inequality (a) is due to $T \ge 1/\bar{K}$. Then,

$$H_1 = \frac{1}{2} - H_6 - H_5 \ge \frac{1}{2} - \frac{1}{4}(T\bar{K})^{-\frac{1}{4}} - \frac{2}{31}(T\bar{K})^{-\frac{1}{2}} - \frac{1}{124}(T\bar{K})^{-\frac{3}{4}} \ge \frac{1}{2} - \frac{1}{4} - \frac{2}{31} - \frac{1}{124} = \frac{11}{62},$$

$$H_3 = \frac{\gamma L}{2\bar{K}} + \frac{H_5}{4} \le \frac{1}{16(T\bar{K})^{\frac{1}{2}}} + \frac{1}{62(T\bar{K})^{\frac{1}{2}}} + \frac{1}{496(T\bar{K})^{\frac{3}{4}}} \le \frac{39}{496(T\bar{K})^{\frac{1}{2}}} + \frac{1}{496(T\bar{K})^{\frac{3}{4}}}.$$

Finally, by substituting the above coefficients and $\mathbb{E}[F(\bar{\mathbf{w}}^M)] \ge F^*$ from Assumption 1 into (38), Theorem 1 is proved.
B.2 Proof of Lemma 3

We have

$$\mathbb{E}[\langle\nabla F(\bar{\mathbf{w}}^{r-1}),\bar{\mathbf{w}}^r-\bar{\mathbf{w}}^{r-1}\rangle] = \mathbb{E}\bigg[\Big\langle\nabla F(\bar{\mathbf{w}}^{r-1}),\frac{\gamma\sum_{i\in S_r}\mathbb{1}_i^r\mathcal{Q}(\Delta\mathbf{w}_i^r)}{\sum_{i\in S_r}\mathbb{1}_i^r}\Big\rangle\bigg]$$
$$\stackrel{(a)}{=}\mathbb{E}\bigg[\Big\langle\nabla F(\bar{\mathbf{w}}^{r-1}),-\frac{\gamma\sum_{i\in S_r}\mathbb{1}_i^r\sum_{\ell=1}^{E}\nabla F_i(\mathbf{w}_i^{r,\ell-1},\boldsymbol{\xi}_i^{r,\ell})}{\sum_{i\in S_r}\mathbb{1}_i^r}\Big\rangle\bigg]$$
$$\stackrel{(b)}{=}\mathbb{E}\bigg[\Big\langle\nabla F(\bar{\mathbf{w}}^{r-1}),-\frac{\gamma\sum_{i\in S_r}\mathbb{1}_i^r\sum_{\ell=1}^{E}\nabla F_i(\mathbf{w}_i^{r,\ell-1})}{\sum_{i\in S_r}\mathbb{1}_i^r}\Big\rangle\bigg]$$
$$\stackrel{(c)}{=}-\gamma\sum_{\ell=1}^{E}\mathbb{E}\bigg[\Big\langle\nabla F(\bar{\mathbf{w}}^{r-1}),\sum_{i=1}^{N}\bar{\beta}_i\nabla F_i(\mathbf{w}_i^{r,\ell-1})\Big\rangle\bigg]$$
$$\stackrel{(d)}{=}-\frac{\gamma}{2}\sum_{\ell=1}^{E}\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^2 - \frac{\gamma}{2}\sum_{\ell=1}^{E}\mathbb{E}\bigg[\Big\|\sum_{i=1}^{N}\bar{\beta}_i\nabla F_i(\mathbf{w}_i^{r,\ell-1})\Big\|^2\bigg] + \frac{\gamma}{2}\sum_{\ell=1}^{E}\mathbb{E}\bigg[\Big\|\nabla F(\bar{\mathbf{w}}^{r-1})-\sum_{i=1}^{N}\bar{\beta}_i\nabla F_i(\mathbf{w}_i^{r,\ell-1})\Big\|^2\bigg]$$
$$\le -\frac{\gamma E}{2}\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^2 + \frac{\gamma}{2}\sum_{\ell=1}^{E}\mathbb{E}\bigg[\Big\|\nabla F(\bar{\mathbf{w}}^{r-1})-\sum_{i=1}^{N}\bar{\beta}_i\nabla F_i(\mathbf{w}_i^{r,\ell-1})\Big\|^2\bigg]$$
$$\stackrel{(e)}{\le}-\frac{\gamma E}{2}\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^2 + \gamma\sum_{\ell=1}^{E}\underbrace{\mathbb{E}\bigg[\Big\|\nabla F(\bar{\mathbf{w}}^{r-1})-\sum_{i=1}^{N}\bar{\beta}_i\nabla F_i(\bar{\mathbf{w}}^{r-1})\Big\|^2\bigg]}_{\triangleq A_1} + \gamma\sum_{\ell=1}^{E}\underbrace{\mathbb{E}\bigg[\Big\|\sum_{i=1}^{N}\bar{\beta}_i\big(\nabla F_i(\bar{\mathbf{w}}^{r-1})-\nabla F_i(\mathbf{w}_i^{r,\ell-1})\big)\Big\|^2\bigg]}_{\triangleq A_2}, \qquad (39)$$

where equality (a) is due to the unbiased quantization in (8) and the definition of $\Delta\mathbf{w}_i^r$ in (13), equality (b) is due to $\mathbb{E}[\nabla F_i(\mathbf{w}_i^{r,\ell-1},\boldsymbol{\xi}_i^{r,\ell})]=\nabla F_i(\mathbf{w}_i^{r,\ell-1})$ in Assumption 2, equality (c) is obtained by (15), equality (d) follows from the basic identity $\langle\mathbf{x}_1,\mathbf{x}_2\rangle=\frac{1}{2}(\|\mathbf{x}_1\|^2+\|\mathbf{x}_2\|^2-\|\mathbf{x}_1-\mathbf{x}_2\|^2)$, and inequality (e) is due to $\|\mathbf{x}_1+\mathbf{x}_2\|^2\le 2\|\mathbf{x}_1\|^2+2\|\mathbf{x}_2\|^2$.

In (39), the term $A_1$ can be further bounded as

$$A_1=\mathbb{E}\bigg[\Big\|\sum_{i=1}^{N}p_i\nabla F_i(\bar{\mathbf{w}}^{r-1})-\sum_{i=1}^{N}\bar{\beta}_i\nabla F_i(\bar{\mathbf{w}}^{r-1})\Big\|^2\bigg]$$
$$\stackrel{(a)}{=}\mathbb{E}\bigg[\Big\|\sum_{i=1}^{N}(p_i-\bar{\beta}_i)\nabla F_i(\bar{\mathbf{w}}^{r-1})-\sum_{i=1}^{N}(p_i-\bar{\beta}_i)\nabla F(\bar{\mathbf{w}}^{r-1})\Big\|^2\bigg]$$
$$=\mathbb{E}\bigg[\Big\|\sum_{i=1}^{N}\frac{p_i-\bar{\beta}_i}{p_i}\,p_i\big(\nabla F_i(\bar{\mathbf{w}}^{r-1})-\nabla F(\bar{\mathbf{w}}^{r-1})\big)\Big\|^2\bigg]$$
$$\stackrel{(b)}{\le}\bigg(\sum_{i=1}^{N}\frac{(\bar{\beta}_i-p_i)^2}{p_i}\bigg)\bigg(\sum_{i=1}^{N}p_i\,\mathbb{E}\|\nabla F_i(\bar{\mathbf{w}}^{r-1})-\nabla F(\bar{\mathbf{w}}^{r-1})\|^2\bigg) \stackrel{(c)}{\le}\chi^2_{\beta\|p}\sum_{i=1}^{N}p_iD_i^2, \qquad (40)$$

where equality (a) holds because $\sum_{i=1}^{N}(p_i-\bar{\beta}_i)=0$, inequality (b) is due to the Cauchy-Schwarz inequality, and inequality (c) is due to Assumption 3 and the definition of $\chi^2_{\beta\|p}$ in Lemma 3. Besides, $A_2$ is bounded as

$$A_2\stackrel{(a)}{\le}\sum_{i=1}^{N}\bar{\beta}_i\,\mathbb{E}\big[\|\nabla F_i(\bar{\mathbf{w}}^{r-1})-\nabla F_i(\mathbf{w}_i^{r,\ell-1})\|^2\big]\stackrel{(b)}{\le}L^2\sum_{i=1}^{N}\bar{\beta}_i\,\mathbb{E}\big[\|\mathbf{w}_i^{r,\ell-1}-\bar{\mathbf{w}}^{r-1}\|^2\big], \qquad (41)$$

where inequality (a) is by Jensen's inequality and inequality (b) is due to Assumption 1. Finally, by substituting (40) and (41) into (39), we obtain Lemma 3 directly.
25
B.3 Proof of Lemma 4
We have
E[k¯
wr¯
wr1k2]
=E
γPi∈Sr
1r
iQ(∆wr
i)
Pi∈Sr
1r
i
2
(a)
=γ2E
Pi∈Sr
1r
iwr
i
Pi∈Sr
1r
i
2
+
Pi∈Sr
1r
i(Q(∆wr
i)wr
i)
Pi∈Sr
1r
i
2
(b)
=γ2E
Pi∈Sr
1r
iPE
`=1(Fi(wr,`1
i,ξr,`
i)−∇Fi(wr,`1
i))
Pi∈Sr
1r
i
2
| {z }
,G1(caused by SGD)
+γ2E
Pi∈Sr
1r
iPE
`=1Fi(wr,`1
i)
Pi∈Sr
1r
i
2
| {z }
,G2
+γ2E
Pi∈Sr
1r
i(Q(∆wr
i)wr
i)
Pi∈Sr
1r
i
2
| {z }
,G3(caused by quantization error)
, (42)
where equality (a) is by E[kxk2] = E[kxE[x]k2] + kE[x]k2and (8); equality (b) is obtained
similarly but using ∆wr
i=PE
`=1 Fi(wr,`1
i,ξr,`
i) in (13) and E[Fi(wr,`1
i,ξr,`
i)] = Fi(wr,`1
i)
in Assumption 2.
In (42), the term G1can be shown as
G1
(a)
=E"Pi∈Sr
1r
iPE
`=1k∇Fi(wr,`1
i,ξr,`
i)− ∇Fi(wr,`1
i)k2
Pi∈Sr
1r
i2#
(b)
=E"Pi∈Sr
1r
iPE
`=1
σ2
b
Pi∈Sr
1r
i2#=2
bE"1
Pi∈Sr
1r
i#(c)
=2
¯
Kb , (43)
where equality (a) is due to E[Fi(wr,`1
i,ξr,`
i)] = Fi(wr,`1
i) in Assumption 2, equality (b) is
due to the bounded variance of SGD in Assumption 2, and equality (c) is due to (17). For G2in
(42), we have
G22E
Pi∈Sr
1r
iPE
`=1(Fi(wr,`1
i)− ∇Fi(¯
wr1))
Pi∈Sr
1r
i
2
| {z }
,G21
+2 E
Pi∈Sr
1r
iPE
`=1Fi(¯
wr1)
Pi∈Sr
1r
i
2
| {z }
,G22
, (44)
where
G21 E·E"Pi∈Sr
1r
iPE
`=1k∇Fi(wr,`1
i)− ∇Fi(¯
wr1)k2
Pi∈Sr
1r
i#
(a)
=EXN
i=1
¯
βiXE
`=1
Ehk∇Fi(wr,`1
i)− ∇Fi(¯
wr1)k2i(b)
EL2XN
i=1
¯
βiXE
`=1
Ehkwr,`1
i¯
wr1k2i,
(45)
in which equality (a) is due to (15) in Lemma 2, and inequality (b) is due to Assumption 2.
26
G22 =E2E
Pi∈Sr
1r
iFi(¯
wr1)
Pi∈Sr
1r
i
2
=2E2E
Pi∈Sr
1r
i(Fi(¯
wr1)− ∇F(¯
wr1))
Pi∈Sr
1r
i
2
| {z }
(caused by partial participation)
+2E2Ehk∇F(¯
wr1)k2i
=2E2E"Pi∈Sr
1r
ik∇Fi(¯
wr1)− ∇F(¯
wr1)k2
Pi∈Sr
1r
i2#
| {z }
,G23
+ 2E2E
Pk0∈SrPk∈Sr
k6=k0
1r
k1r
k0(Fk(¯
wr1)− ∇F(¯
wr1))(Fk0(¯
wr1)− ∇F(¯
wr1))
Pi∈Sr
1r
i2
| {z }
,G24
+ 2E2Ek∇F(¯
wr1)k2. (46)
Next, we bound G23 and G24 in (46) as follows. Firstly
G23
(a)
=XN
i=1 ¯αiEk∇Fi(¯
wr1)− ∇F(¯
wr1)k2(b)
XN
i=1 ¯αiD2
i, (47)
where equality (a) is due to (16) in Lemma 2, and inequality (b) is due to Assumption 3. Secondly,
G24 =EK
X
v=1
Pr X
i∈Sr
1r
i=v1
v2X
k∈SrX
k0∈Sr
k06=k
Eh1r
k1r
k0(Fk(¯
wr1)
− ∇F(¯
wr1))(Fk0(¯
wr1)− ∇F(¯
wr1))X
i∈Sr
1r
i=vi
(a)
=EK
X
v=1
1
v2X
k∈SrX
k0∈Sr
k06=kPr 1r
k= 1,1r
k0= 1,X
i∈Sr
1r
i=v
·(Fk(¯
wr1)− ∇F(¯
wr1))(Fk0(¯
wr1)− ∇F(¯
wr1)),
where equality (a) follows because if 1r
k= 0 or 1r
k0= 0, then 1r
k1r
k0(Fk(¯
wr1)−∇F(¯
wr1))(Fk0(¯
wr1)
F(¯
wr1)) = 0. In addition, when v= 1, there is only one selected client with successful transmis-
sion, and 1r
kand 1r
k0cannot equal to 1 at the same time, thus Pr[1r
k= 1,1r
k0= 1,Pi∈Sr1r
i= 1] = 0.
When v2,
Pr h1r
k= 1,1r
k0= 1,Xi∈Sr
1r
i=vi
=
(1 qk)(1 qk0)PBrS¯
Br={Sr\{k,k0}}
|Br|=v2,|¯
Br|=KvQ
k1∈Br
(1 qk1)Q
k2¯
Br
qk2
1Qi∈Srqk
(a)
(1 qk)(1 qk0)PBrS¯
Br={Sr\{k,k0}}
|Br|=v2,|¯
Br|=Kv
(qmax)Kv
1(qmax)K
(b)
=(1 qk)(1 qk0)(qmax)KvCv2
K2
1(qmax)K, (48)
27
where Bris the set of selected clients (except kand k0) in Srtransmitting their local model updates
successfully while ¯
Bris the one that suffers from TO; inequality (a) is due to 1 qk11 and
qk2qmax = max{q1, . . . , qN}, and in equality (b), Cv2
K2=(K2)!
(v2)!(Kv)! . Thus,
G24 EK
X
v=2
(qmax)KvCv2
K2
(1 (qmax)K)v2X
k∈SrX
k0∈Sr
k06=k
(1qk)(1qk0)(Fk(¯
wr1)−∇F(¯
wr1))(Fk0(¯
wr1)−∇F(¯
wr1))
=EK
X
v=2
(qmax)KvCv2
K2
(1(qmax )K)v2X
k∈Sr
(1qk)(Fk(¯
wr1)−∇F(¯
wr1)) X
k0∈Sr
k06=k
(1 qk0)(Fk0(¯
wr1)− ∇F(¯
wr1))
(a)
=EK
X
v=2
(qmax)KvK(K1)Cv2
K2
(1 (qmax)K)v2
N
X
j=1
pj(1 qj)(Fj(¯
wr1)
− ∇F(¯
wr1))
N
X
j0=1
pj0(1 qj0)(Fj0(¯
wr1)− ∇F(¯
wr1))
(b)
EK
X
v=2
(qmax)KvCv
K
1(qmax)K
N
X
j=1
N
X
j0=1
pjpj0(1qj)(1qj0)(Fj(¯
wr1)−∇F(¯
wr1))(Fj0(¯
wr1)−∇F(¯
wr1)),
(49)
where equality (a) can be obtained based on the same reason as obtaining (c) in (31) since the
clients k, k0∈ Srare selected independently and with replacement. The above inequality (b) is
obtained by K(K1)Cv2
K2
v2K(K1)
v(v1) Cv2
K2=Cv
Kfor v2. Then, with the average TO probability
¯q=PN
i=1 piqi, we have (1 qj)(1 qj0) = (1 ¯q+ ¯qqj)(1 ¯q+ ¯qqj0) = (1 ¯q)2+ (1 ¯q)(¯q
qj) + (1 ¯q)(¯qqj0) + ( ¯qqj)( ¯qqj0). Thus, with F(¯
wr1) = PN
i=1piFi(¯
wr1), (49) turns into
G24 EXK
v=2
(qmax)KvCv
K
1(qmax)K
·(1¯q)2
N
X
j=1
pj Fj(¯
wr1)
N
X
i=1
piFi(¯
wr1)!
| {z }
=0
N
X
j0=1
pj0 Fj0(¯
wr1)
N
X
i=1
piFi(¯
wr1)!
| {z }
=0
+ (1 ¯q)
N
X
j=1
pjqqj) (Fj(¯
wr1)− ∇F(¯
wr1))
N
X
j0=1
pj0 Fj0(¯
wr1)
N
X
i=1
piFi(¯
wr1)!
| {z }
=0
+ (1 ¯q)
N
X
j=1
pj Fj(¯
wr1)
N
X
i=1
piFi(¯
wr1)!
| {z }
=0
N
X
j0=1
qqj0) Fj0(¯
wr1)
N
X
i=1
piFi(¯
wr1)!
+
N
X
j=1
N
X
j0=1
pjpj0qqj)(¯qqj0) (Fj(¯
wr1)− ∇F(¯
wr1)) (Fj0(¯
wr1)− ∇F(¯
wr1)) 
=E"K
X
v=2
(qmax)KvCv
K
1(qmax)K
N
X
j=1
N
X
j0=1
pjpj0qqj)(¯qqj0)(Fj(¯
wr1)−∇F(¯
wr1))(Fj0(¯
wr1)−∇F(¯
wr1))#
(a)
XK
v=2
(qmax)KvCv
K
1(qmax)KXN
i=1pikqi¯qk2Ek∇Fi(¯
wr1)− ∇F(¯
wr1)k2
28
(b)
XK
v=2
(qmax)KvCv
K
1(qmax)KXN
i=1pikqi¯qk2D2
i, (50)
where inequality (a) is due to the Young’s inequality, i.e., (¯qqj)(¯qqj0)(Fj(¯
wr1)F(¯
wr1))
(Fj0(¯
wr1)F(¯
wr1)) 1
2k¯qqjk2k∇Fj(¯
wr1)F(¯
wr1)k2+1
2k¯qqj0k2k∇Fj0(¯
wr1)
F(¯
wr1)k2, and inequality (b) is by Assumption 3.
Substituting (45), (46), (47), and (50) into (44), we have
G22EL2XN
i=1
¯
βiXE
`=1
Eh
wr,`1
i¯
wr1
2i+ 4E2XK
v=2
(qmax)KvCv
K
1(qmax)KXN
i=1pikqi¯qk2D2
i
+ 4E2XN
i=1 ¯αiD2
i+ 4E2Ehk∇F(¯
wr1)k2i. (51)
Besides, for the term G3in (42), we have
G3
(a)
=E"Pi∈Sr
1r
ikQ(∆wr
i)wr
ik2
(Pi∈Sr
1r
i)2#(b)
=XN
i=1 ¯αiEkQ(∆wr
i)wr
ik2(c)
XN
i=1 ¯αiJ2
ir , (52)
where equality (a) is due to the unbiased quantization in (8), equality (b) is by (16) in Lemma 2,
and inequality (c) is due to the bounded QE in (9).
Finally, by substituting (43), (51) and (52) into (42), we obtain Lemma 4.
B.4 Proof of Lemma 5
According to (2), the local model in the (r+ 1)-th communication round are updated by
wr,`1
i=¯wr1γX`1
t=1 Fi(wr,t1
i,ξr,t
i) .
Therefore,
Eh
wr,`1
i¯
wr1
2i
=E
γX`1
t=1 Fi(wr,t1
i,ξr,t
i)
2
γ2(`1)X`1
t=1
E
Fi(wr,t1
i,ξr,t
i)
2
(a)
=γ2(`1)X`1
t=1
E
Fi(wr,t1
i,ξr,t
i)− ∇Fi(wr,t1
i)
2+γ2(`1)X`1
t=1
E
Fi(wr,t1
i)
2
(b)
γ2(`1)2σ2
b+γ2(`1)X`1
t=1
E
Fi(wr,t1
i)
2
γ2E2σ2
b+γ2EX`1
t=1
E
Fi(wr,t1
i)
2
γ2E2σ2
b+ 2γ2EX`1
t=1
E
Fi(wr,t1
i)− ∇Fi(¯wr1)
2+ 2γ2EX`1
t=1
Ehk∇Fi(¯wr1)k2i
γ2E2σ2
b+ 2γ2EL2X`1
t=1
E
wr,t1
i¯wr1
2+ 4γ2E2Ehk∇Fi(¯wr1)F(¯wr1)k2+k∇F(¯wr1)k2i
(c)
γ2E2σ2
b+ 2γ2EL2X`1
t=1
E
wr,t1
i¯wr1
2+ 4γ2E2D2
i+ 4γ2E2Ehk∇F(¯wr1)k2i, (53)
29
where equality (a) is due to E[kxk2] = E[kxE[x]k2]+kE[x]k2and E[Fi(wr,t1
i,ξr,t
i)] = Fi(wr,t1
i),
equality (b) is by Assumption 2 given the mini-batch size b, and inequality (c) is by Assumption
3. Then, summing both sides of (53) from `= 1 to Eyields
XE
`=1
Eh
wr,`1
i¯
wr1
2i
γ2E3σ2
b+ 2γ2EL2XE
`=1X`1
t=1
E
wr,t1
i¯wr1
2
| {z }
(a)
+4γ2E3D2
i+ 4γ2E3Ehk∇F(¯wr1)k2i
(b)
γ2E3σ2
b+ 2γ2E2L2XE
`=1
Eh
wr,`1
i¯wr1
2i+ 4γ2E3D2
i+ 4γ2E3Ehk∇F(¯wr1)k2i, (54)
where inequality (b) is because the occurrence number of E[kwr,`1
i¯
wr1k2] for each `[1, E]
in term (a) is less than the number of local updating steps E, and thus (a) EPE
`=1 E[kwr,`1
i
¯wr1k2].
Finally, rearranging the terms in (54) yields Lemma 5.
References
[1] X. Wang, Y. Han, C. Wang, Q. Zhao, X. Chen, and M. Chen, “In-edge ai: Intelligentizing
mobile edge computing, caching and communication by federated learning,” IEEE Network,
vol. 33, no. 5, pp. 156–165, 2019.
[2] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, “Toward an intelligent edge: Wireless
communication meets machine learning,” IEEE Commun. Mag., vol. 58, no. 1, pp. 19–25, 2020.
[3] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y.-C. Liang, Q. Yang, D. Niyato, and
C. Miao, “Federated learning in mobile edge networks: A comprehensive survey,” IEEE Com-
mun. Surveys Tuts., vol. 22, no. 3, pp. 2031–2063, 2020.
[4] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient
learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics,
2017, pp. 1273–1282.
[5] J. Wang, Q. Liu, H. Liang, G. Joshi, and H. V. Poor, “Tackling the objective inconsistency
problem in heterogeneous federated optimization,” arXiv preprint arXiv:2007.07481, 2020.
[6] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the convergence of fedavg on non-iid
data,” in ICLR, 2019.
[7] X. Liang, S. Shen, J. Liu, Z. Pan, E. Chen, and Y. Cheng, “Variance reduced local SGD with
lower communication complexity,” arXiv preprint arXiv:1912.12844, 2019.
[8] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, “SCAFFOLD:
Stochastic controlled averaging for federated learning,” in ICML, 2020, pp. 5132–5143.
[9] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization
in heterogeneous networks,” arXiv preprint arXiv:1812.06127, 2018.
30
[10] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,”
IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 2022–2035, 2020.
[11] M. S. H. Abad, E. Ozfatura, D. Gunduz, and O. Ercetin, “Hierarchical federated learning
across heterogeneous cellular networks,” in IEEE ICASSP, 2020, pp. 8866–8870.
[12] J. Xu and H. Wang, “Client selection and bandwidth allocation in wireless federated learning
networks: A long-term perspective,” IEEE Trans. Wireless Commun., 2020.
[13] W. Shi, S. Zhou, and Z. Niu, “Device Scheduling with Fast Convergence for Wireless Federated
Learning,” in IEEE ICC, 2020, pp. 1–6.
[14] Z. Yang, M. Chen, W. Saad, C. S. Hong, M. Shikh-Bahaei, H. V. Poor, and S. Cui, “De-
lay Minimization for Federated Learning Over Wireless Communication Networks,” in ICML
Workshop on Federated Learning, 2020.
[15] N. H. Tran, W. Bao, A. Zomaya, M. N. Nguyen, and C. S. Hong, “Federated learning over
wireless networks: Optimization model design and analysis,” in IEEE INFOCOM, 2019, pp.
1387–1395.
[16] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A joint learning and com-
munications framework for federated learning over wireless networks,” IEEE Trans. Wireless
Commun., vol. 20, no. 1, pp. 269–283, 2021.
[17] M. Salehi and E. Hossain, “Federated Learning in Unreliable and Resource-Constrained Cel-
lular Wireless Networks,” arXiv preprint arXiv:2012.05137, 2020.
[18] G. Zhu, Y. Du, D. Gunduz, and K. Huang, “One-bit over-the-air aggregation for
communication-efficient federated edge learning: Design and convergence analysis,” IEEE
Trans. Wireless Commun., vol. 20, no. 3, pp. 2120–2135, 2021.
[19] A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, and R. Pedarsani, “Fedpaq: A
communication-efficient federated learning method with periodic averaging and quantization,”
in International Conference on Artificial Intelligence and Statistics, 2020, pp. 2021–2031.
[20] S. Zheng, C. Shen, and X. Chen, “Design and Analysis of Uplink and Downlink Communica-
tions for Federated Learning,” IEEE J. Sel. Areas Commun., pp. 1–1, 2020.
[21] A. Goldsmith, Wireless communications. Cambridge university press, 2005.
[22] M. M. Amiri, D. Gunduz, S. R. Kulkarni, and H. V. Poor, “Federated learning with quantized
global model updates,” arXiv preprint arXiv:2006.10672, 2020.
[23] K.-Y. Wang, A. M.-C. So, T.-H. Chang, W.-K. Ma, and C.-Y. Chi, “Outage constrained
robust transmit optimization for multiuser MISO downlinks: Tractable approximations by
conic optimization,” IEEE Trans. Signal Process., vol. 62, no. 21, pp. 5690–5705, 2014.
[24] Y. Xu, C. Shen, T.-H. Chang, S.-C. Lin, Y. Zhao, and G. Zhu, “Transmission energy min-
imization for heterogeneous low-latency noma downlink,” IEEE Trans. Wireless Commun.,
vol. 19, no. 2, pp. 1054–1069, 2020.
31
[25] F. Sattler, S. Wiedemann, K.-R. M¨uller, and W. Samek, “Robust and communication-efficient
federated learning from non-iid data,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 9,
pp. 3400–3413, 2019.
[26] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms
outperform centralized algorithms? a case study for decentralized parallel stochastic gradient
descent,” in NeurIPS, 2017, pp. 5336–5346.
[27] H. Yu, S. Yang, and S. Zhu, “Parallel restarted sgd with faster convergence and less communi-
cation: Demystifying why model averaging works for deep learning,” in AAAI, vol. 33, no. 01,
2019, pp. 5693–5700.
[28] J. Liu, C. Zhang et al., “Distributed learning systems with first-order methods,” Foundations
and Trends®in Databases, vol. 9, no. 1, pp. 1–100, 2020.
[29] S. G. Krantz and H. R. Parks, The Implicit Function Theorem: History, Theory, and Appli-
cations. Boston, MA: Birkh¨auser, 2002.
[30] M. A. Figueiredo, R. D. Nowak, and S. J. Wright, “Gradient projection for sparse reconstruc-
tion: Application to compressed sensing and other inverse problems,” IEEE J. Sel. Topics
Signal Process., vol. 1, no. 4, pp. 586–597, 2007.
[31] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document
recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[32] I. S. Misra, Wireless communications and networks: 3G and beyond. McGraw Hill Education
(India) Pvt Ltd, 2013.
32
Supplementary Material
A Proof of Lemma 1
With |wr,E
ij | ∈ [wr
ij,¯wr
ij] and quantization level Br
i, the quantized wr,E
ij is unbiasedly estimated since
E[Q(wr,E
ij )] =sign(wr,E
ij )·cu·Pr wr,E
ij = sign(wr,E
ij )·cu+ sign(wr,E
ij )·cu+1 ·Pr wr,E
j+1 = sign(wr,E
ij )·cu+1
=sign(wr,E
ij )·cu
cu+1 − |wr,E
ij |
cu+1 cu
+cu+1 |wr,E
ij | − cu
cu+1 cu= sign(wr,E
ij )· |wr,E
ij |=wr,E
ij . (55)
Based on this, we have
E[Q(wr,E
i)] = hE[Q(wr,E
i1)],E[Q(wr,E
i2)],··· ,E[Q(wr,E
im )]i=hwr,E
i1, wr,E
i2,··· , wr,E
im i=wr,E
i.
With the stochastic quantization method in (6), the quantization error is bounded by
Eh|Q(wr,E
ij )wr,E
ij |2i=(cu− |wr,E
ij |)2·cu+1 − |wr,E
ij |
cu+1 cu
+ (cu+1 − |wr,E
ij |)2·|wr,E
ij | − cu
cu+1 cu
=(|wr,E
ij | − cu)(cu+1 − |wr,E
ij |)(|wr,E
ij | − cu+cu+1 − |wr,E
ij |)
cu+1 cu
=(|wr,E
ij | − cu)(cu+1 − |wr,E
ij |)
=(|wr,E
ij |)2+ (cu+cu+1)|wr,E
ij | − cucu+1
=|wr,E
ij | − cu+cu+1
22
+cucu+1
22
cucu+1
22
, (56)
where with cudefined in (5), the interval between neighboring knobs is
|cucu+1|=|¯wr
ij wr
ij |
2Br
i1. (57)
Then, substituting (57) into (56), we have
Eh|Q(wr,E
ij )wr,E
ij |2i( ¯wr
ij wr
ij )2
4(2Br
i1)2, (58)
and the total QE of local model can be bounded by
Eh|Q(wr,E
i)wr,E
i|2i=E"
m
X
j=1 Q(wr,E
ij )wr,E
ij
2#(a)
=
m
X
j=1
Eh|Q(wr,E
ij )wr,E
ij |2i(b)
Pm
j=1( ¯wr
ij wr
ij )2
4(2Br
i1)2,
where equality (a) is due to the unbiased quantization in (55), and inequality (b) is due to the error
bound in (58).
B Extended Discussion of Remark 2
B.1 Performance analysis of general case
For the general case, we consider the unfixed quantization level Br
iand the changed TO probabilities
qr
iduring the training process for different communication rounds. Similar to the Lemma 2, we
have some properties for the general case as shown in Lemma 6.
33
Lemma 6 Considering FL algorithm in Algorithm 1, it holds true that
E"Pi∈Sr
1r
iwr
i
Pi∈Sr
1r
iXi∈Sr
1r
i6= 0#(a)
=ESrhXi∈Sr
βr
iwr
ii(b)
=XN
i=1
¯
βiwr
i(59)
for some βr
i,¯
βi[0,1] with Pi∈Srβr
i= 1 and PN
i=1 ¯
βi= 1, where equality (a) is taken expected
with respect to {1r
i}while equality (b) is taken expected with respect to Sr.
Moreover, we also have
E"Pi∈Sr
1r
iwr
i
Pi∈Sr
1r
i2Xi∈Sr
1r
i6= 0#=ESrhXi∈Sr
αr
iwr
ii=XN
i=1 ¯αiwr
i(60)
for some αr
i,¯αi0i= 1,··· , N and r= 1,· ·· , M .
Finally, same with (17), we denote
E"1
Pi∈Sr
1r
iXi∈Sr
1r
i6= 0#=XN
i=1 ¯αi,1
¯
K,
where ¯
Krepresents the average effective number of active clients at each communication round.
If qr
iis uniform for all clients at all communication rounds, i.e., qr
i=qi= 1,··· , N and r=
1,··· , M , then βr
i= 1/K and αr
i= 1/(K¯
K)i∈ Sr,¯
βi=piand ¯αi=pi/¯
Ki∈ {1,··· , N }, and
¯
K=1(q)K
PK
v=1 1
v(Cv
K(1q)v(q)Kv)with Cv
K=K!
v!(Kv)! . In addition, if qr
i= 0 i∈ Srand r= 1,·· · , M
(no TO), then ¯
K=K.
From (59), one can see that {βr
i}is the equivalent appearance probability of {wr
i}transmitted
by each selected client i∈ Srin the global aggregation due to TO, while βiis that of ∆wr
i
transmitted by each client i∈ {1,··· , N }in the global aggregation due to client sampling and TO.
The main convergence result is stated below.
Theorem 2 (General case) Let Assumptions 1 to 3 hold. If one chooses γ=¯
K1
2/(8LT 1
2)and
ET1
4/¯
K3
4where T=ME max{¯
K3,1/¯
K}is the total number of SGD updates per client, we
have
1
MXM
r=1
Ehk∇F(¯
wr1)k2Xi∈Sr
1r
i6= 0i
496L(E[F(¯
w0)] F)
11 T¯
K1
2
+ 39
88 T¯
K1
2
+1
88 T¯
K3
4!σ2
b+31 ¯
K1
2
88T3
2XM
r=1
ESr"Xi∈Sr
αr
iJ2
ir#
| {z }
(a)(caused by QE)
+31
22 T¯
K1
4XN
i=1 ¯αiD2
i
| {z }
(b)(caused by partial participation
and data variance)
+ 4
11 T¯
K1
2
+1
22 T¯
K3
4!XN
i=1
¯
βiD2
i
| {z }
(c)(caused by data variance)
+62
11χ2
βkpXN
i=1piD2
i
| {z }
(d)(caused by TO and
data variance)
+31
22T¯
KXK
v=2
(qmax)KvCv
K
1(qmax)KXM
r=1
ESr1
KXi∈Srkqr
i¯qk2D2
i
| {z }
(e)(caused by TO and data variance)
,(61)
where χ2
βkp
,PN
i=1 (¯
βipi)2/piis the chi-square divergence [5], qmax = maxi∈Sr,∀Sr{qr
i}and
¯q=ESr1
KPi∈Srqr
iare the maximum and average TO probabilities, respectively.
34
Proof: See the subsequent Subsection B.2.
The upper bound in (61) reveals similar insights as discussed in Theorem 1. Also, when the
clients have a uniform TO probability, the terms (d) and (e) would vanish. Then, combining with
Lemma 6, we can derive the following Corollary 2 for the uniform-TO case with unfixed quantization
level Br
i. As shown in (62), the FL algorithm can also achieve a linear speed-up with respect to ¯
K
even when both TO and QE are present.
Corollary 2 Under the same conditions as Theorem 2, if all clients have a uniform TO probability
q, we have
1
MXM
r=1
E"k∇F(¯
wr1)k2X
i∈Sr
1r
i6= 0#
496L
11(T¯
K)1
2
(E[F(¯
w0)] F) + 39
88(T¯
K)1
2
+1
88(T¯
K)3
4σ2
b+31
88T3
2¯
K1
2XM
r=1
ESr1
KXi∈Sr
J2
ir
+4
11(T¯
K)1
2
+1
22(T¯
K)3
4
+31
22T1
4¯
K5
4XN
i=1piD2
i.(62)
B.2 Proof of Theorem 2
In the general case, for the same client i, its TO probability qr
iand quantization level Br
iwould
vary with the selected client set Sr. For example, the TO probability and quantization level of
client 1 in Sg
r={1,2,3,··· , K }and those in Sg
r={1,3,4,··· , K + 1}are different. Based on
this, since different communication rounds correspond to different Sr, the TO probability qr
iand
quantization level Br
iof the same selected client ivary with the communication round.
For simplicity, we assume that for each possible set Sg
r, both the wireless resource (including
bandwidth and transmit power) and quantization level follow a fixed allocation scheme whenever
Sg
rappears. In this way, for each possible set Sg
r, there is a unique set of the TO probabilities and
quantization levels for the clients in Sg
r. Then, with denoting qgi and Bgi as the TO probability
and the quantization level of the client i∈ Sg
r, we have qr
i=qgi and Br
i=Bgi if Sr=Sg
r.
The proof of Theorem 2 is similar to that of Theorem 1 (Appendix B) except for the following
differences.
B.2.1 Difference 1
The formulation (52) in Appendix B becomes
G3
(a)
=E"Pi∈Sr
1r
ikQ(∆wr
i)wr
ik2
(Pi∈Sr
1r
i)2#(b)
=ESrhXi∈Sr
αr
iEkQ(∆wr
i)wr
ik2i(c)
ESrhXi∈Sr
αr
iJ2
iri,
(63)
where equality (a) is due to the unbiased quantization in (8), equality (b) is caused by (60) in
Lemma 6, and inequality (c) is due to the bounded QE in (9). Based on (63), the term (a) in
Theorem 1 (i.e., 31 ¯
K1/2
88T3/2PM
r=1 PN
i=1 ¯αiJ2
ir) turns into 31 ¯
K1/2
88T3/2PM
r=1 ESrPi∈Srαr
iJ2
irin Theorem 2.
35
B.2.2 Difference 2
With the maximum TO probability qmax = max
i∈Sr,∀Sr{qr
i}= max
g∈{1,2,···,N K}max
i∈Sg
r
qgi, (48) in Appendix
B becomes
Pr h1r
k= 1,1r
k0= 1,Xi∈Sr
1r
i=vi(1 qr
k)(1 qr
k0)(qmax)KvCv2
K2
1(qmax)K.
Then, with the average TO probability ¯q=ESr1
KPi∈Srqr
i=PNK
g=1 Qi∈Sg
rpi·1
KPi∈Sg
rqgi, the
formulation (49) in Appendix B turns into
G24 =XK
v=2
(qmax)KvCv2
K2
(1 (qmax)K)v2
·EXk∈SrXk0∈Sr
k06=k(1 qr
k)(1 qr
k0)(Fk(¯
wr1)− ∇F(¯
wr1))(Fk0(¯
wr1)− ∇F(¯
wr1))
(a)
=XK
v=2
(qmax)KvCv2
K2
(1 (qmax)K)v2
·E(1 ¯q)2Xk∈SrXk0∈Sr
k06=k
(Fk(¯
wr1)− ∇F(¯
wr1)) (Fk0(¯
wr1)− ∇F(¯
wr1))
+ (1 ¯q)Xk∈SrXk0∈Sr
k06=k
qqr
k) (Fk(¯
wr1)− ∇F(¯
wr1)) (Fk0(¯
wr1)− ∇F(¯
wr1))
+ (1 ¯q)Xk∈SrXk0∈Sr
k06=k
qqr
k0) (Fk(¯
wr1)− ∇F(¯
wr1)) (Fk0(¯
wr1)− ∇F(¯
wr1))
+Xk∈SrXk0∈Sr
k06=k
qqr
k)(¯qqr
k0) (Fk(¯
wr1)− ∇F(¯
wr1)) (Fk0(¯
wr1)− ∇F(¯
wr1))
=XK
v=2
(qmax)KvCv2
K2
(1 (qmax)K)v2
·E(1 ¯q)2Xk∈Sr
(Fk(¯
wr1)− ∇F(¯
wr1)) Xk0∈Sr
k06=k
(Fk0(¯
wr1)− ∇F(¯
wr1))
+ (1 ¯q)Xk∈Sr
qqr
k) (Fk(¯
wr1)− ∇F(¯
wr1)) Xk0∈Sr
k06=k
(Fk0(¯
wr1)− ∇F(¯
wr1))
+ (1 ¯q)Xk0∈Sr
qqr
k0) (Fk0(¯
wr1)− ∇F(¯
wr1)) Xk∈Sr
k6=k0
(Fk(¯
wr1)− ∇F(¯
wr1))
+Xk∈Sr
qqr
k) (Fk(¯
wr1)− ∇F(¯
wr1)) Xk0∈Sr
k06=k
qqr
k0) (Fk0(¯
wr1)− ∇F(¯
wr1)),
(64)
where equality (a) follows from (1qr
k)(1 qr
k0) = (1 ¯q+ ¯qqr
k)(1 ¯q+ ¯qqr
k0) = (1 ¯q)2+ (1
¯q)(¯qqr
k) + (1 ¯q)(¯qqr
k0) + (¯qqr
k)(¯qqr
k0).
Next, since the clients k, k0∈ Srare selected independently and with replacement, then based
on F(¯
wr1) = PN
i=1piFi(¯
wr1) and the same reason as obtaining (c) in (31), (64) becomes
G24 XK
v=2
(qmax)KvCv2
K2
(1 (qmax)K)v2
·E"(1 ¯q)2K2
N
X
j=1
pjFj(¯
wr1)
N
X
i=1
piFi(¯
wr1)
| {z }
=0
N
X
j0=1
pj0Fj0(¯
wr1)
N
X
i=1
piFi(¯
wr1)
| {z }
=0
36
+ (1¯q)(K1) X
k∈Sr
qqr
k) (Fk(¯
wr1)−∇F(¯
wr1))
N
X
j0=1
pj0Fj0(¯
wr1)
N
X
i=1
piFi(¯
wr1)
|{z }
=0
+ (1¯q)(K1) X
k0∈Sr
qqr
k0) (Fk0(¯
wr1)−∇F(¯
wr1))
N
X
j=1
pjFj(¯
wr1)
N
X
i=1
piFi(¯
wr1)
|{z }
=0
+Xk∈SrXk0∈Sr
k06=k
qqr
k)(¯qqr
k0) (Fk(¯
wr1)− ∇F(¯
wr1)) (Fk0(¯
wr1)− ∇F(¯
wr1)) #
=
K
X
v=2
(qmax)KvCv2
K2
(1 (qmax)K)v2·E"X
k∈SrX
k0∈Sr
k06=k
qqr
k)(¯qqr
k0) (Fk(¯
wr1)−∇F(¯
wr1)) (Fk0(¯
wr1)−∇F(¯
wr1))
#
(a)
XK
v=2
(qmax)KvCv2
K2
(1(qmax )K)v2·EXk∈SrXk0∈Sr
k06=k
1
2kqr
k¯qk2k∇Fk(¯
wr1)−∇F(¯
wr1)k2
+kqr
k0¯qk2k∇Fk0(¯
wr1)−∇F(¯
wr1)k2
=XK
v=2
(qmax)KvCv2
K2
(1 (qmax)K)v2·K1
2·EXk∈Srkqr
k¯qk2k∇Fk(¯
wr1)− ∇F(¯
wr1)k2
+Xk0∈Srkqr
k0¯qk2k∇Fk0(¯
wr1)− ∇F(¯
wr1)k2
=XK
v=2
(qmax)KvK(K1)Cv2
K2
(1 (qmax)K)v2·E1
KXi∈Srkqr
i¯qk2k∇Fi(¯
wr1)− ∇F(¯
wr1)k2
(b)
XK
v=2
(qmax)KvCv
K
1(qmax)KESr1
KXi∈Srkqr
i¯qk2D2
i, (65)
where inequality (a) is due to Young’s Inequality, and inequality (b) is obtained by K(K1)Cv2
K2
v2
K(K1)
v(v1) Cv2
K2=Cv
Kand Assumption 3.
Based on (65), the last term of (38) in Appendix B becomes
2γEL
M·
K
X
v=2
(qmax)KvCv
K
1(qmax)KXM
r=1
ESr1
KXi∈Srkqr
i¯qk2D2
i.
and the coefficient H6in (38) is redefined as H6,2γEL
M=2γE2L
T. If one chooses γ=¯
K1
2/(8LT 1
2)
and ET1
4/¯
K3
4, we have
H62
8Lr¯
K
T· T1
4
¯
K3
4!2
·L
T=1
4T¯
K.
Therefore, the term (e) (i.e., 31
22(T¯
K)1/4PK
v=2
(qmax)KvCv
K
1(qmax)KPN
i=1 pikqi¯qk2D2
i) in Theorem 1 be-
comes 31
22T¯
KPK
v=2
(qmax)KvCv
K
1(qmax)KPM
r=1 ESrh1
KPi∈Srkqr
i¯qk2D2
iiin Theorem 2.
37
C Average uplink transmission delay in (21)
C.1 Derivation process of ¯τr
i
If the TO probabilities of the selected clients in Srall equal to 1, the probability that all selected
clients fail to transmit data without TO is Pr(Pi∈Sr1r
i) = Qi∈Srqi= 1. In such case, the re-
transmission process will be repeated infinitely, and the transmission delay will become infinite.
However, this extreme situation can be easily avoided in the wireless system if the conditions in
Lemma 7 are satisfied.
Lemma 7 With the definition of TO probability in (12), if the uplink data rate Ri<+, the
allocated bandwidth Wi>0and the transmit power Pi>0(in Watt) for each client iare satisfied,
then the outage probability of each client qi<1.
Actually, as shown in the Proposition 1, the above conditions are satisfied in the optimal
condition of problem (20).
Then, since retransmission is performed if all selected clients experience outage in the uplink
transmission (i.e., Pj∈Sr1r
j= 0), the average transmission delay of the client i∈ Sris computed
by
¯τr
i=
X
k=1 Yj∈Sr
qjk1
| {z }
(a) 1Yj∈Sr
qj
| {z }
(b)
k·max
j∈Sr
ˆ
Bj
Rj
| {z }
(c)
=1Yj∈Sr
qj
X
k=1
kYj∈Sr
qjk1
| {z }
(d)
·max
j∈Sr
ˆ
Bj
Rj
,
(66)
where (a) is the probability of Pj∈Sr1r
j= 0 for (k1) consecutive times of retransmission, (b)
denotes the probability of Pj∈Sr1r
j1, and (c) is the uplink delay of ksuccessive transmissions.
Next, with
1Yj∈Sr
qjN
X
k=1
kYj∈Sr
qjk1
=
N
X
k=1
kYj∈Sr
qjk1
N
X
k=1
kYj∈Sr
qjk
= 1 + Yj∈Sr
qj1+· ·· +Yj∈Sr
qjN1NYj∈Sr
qjN
=
1Qj∈SrqjN
1Qj∈SrqjNYj∈Sr
qjN
=
1(1 + N)Qj∈SrqjN
+NQj∈SrqjN+1
1Qj∈Srqj
,
and Qj∈Srqj<1, the term (d) in (66) is given by
(d) = lim
N→∞ 1Yj∈Sr
qjN
X
k=1
kYj∈Sr
qjk1=1
1Qj∈Srqj
. (67)
Combining (66) and (67), we can obtain
¯τr
i=1
1Qj∈Srqj
max
j∈Sr
ˆ
Bj
Rj
.
38
C.2 Proof of Lemma 7
According to (12), if ρi<+, we have Q(ρidB)> Q(+) = 0 and then qi<1. Therefore, if
we want qi<1, the following conditions need to be satisfied to make ρi<+.
(i) The data rate Ri<+. Otherwise, according to the definition of ρiin (12), i.e., ρi,
[(2Ri/Wi1)WiN0]dB [Pi]dB [K]dB +λ[di]dB, if Ri= +, we have ρi= +.
(ii) The transmit power Pi>0 (Watt). Otherwise, if Pi= 0 (Watt), we have [Pi]dB =−∞ and
then ρi= +.
(iii) The allocated bandwidth Wi>0. Otherwise, if Wi= 0, then ρi= +since
lim
Wi0(2
Ri
Wi1)Wi= lim
Wi0
2
Ri
Wi1
1
Wi
(a)
= lim
Wi0
2
Ri
Wi·ln 2 ·Ri
W2
i
1
W2
i
= lim
Wi02
Ri
WiRiln 2 = +(68)
where (a) is due to the L’Hospital’s Rule.
Therefore, with Ri<+,Pi>0 (in Watt), and Wi>0, we have qi<1.
D Monotonically increasing property of Wi(Bi)
According to (22) and (23), we have the quantization level satisfies
Bi=¯
Bi(Wi) = τmax
mWilog21 + θiPmax
WiN0µ
m. (69)
Based on this, first-order derivative of ¯
Bi(Wi) with respect to the allocated bandwidth Wiis
¯
Bi(Wi)
∂Wi
=τmax
mlog21 + θiPmax
WiN0+τmax
m
Wi
1 + θiPmax
WiN0ln 2 ·θiPmax
W2
iN0
=τmax
mlog21 + θiPmax
WiN0τmaxθiPmax
m(WiN0+θiPmax) ln 2 , (70)
and then the associated second-order derivative is
2¯
Bi(Wi)
∂W 2
i
=τmax
m1 + θiPmax
WiN0ln 2 ·θiPmax
W2
iN0+τmaxθiPmax N0
m(WiN0+θiPmax)2ln 2
=τmaxθiPmax
m(WiN0+θiPmax)Wiln 2 +τmax θiPmaxN0
m(WiN0+θiPmax)2ln 2 =τmaxθ2
iP2
max
m(WiN0+θiPmax)2Wiln 2 .
In the practical wireless environment, the shadowing variance σdB >0, the constant [K]dB >
−∞, the distance di<+(in meter), and it is reasonable to set the TO probability constraint
qmax (0,1]. Thus, the parameter θi,10 1
10 (σdB·Q1(1qmax )+[K]dBλ[di]dB )defined in (22) satisfies
θi(0,+). Meanwhile, in the real communication systems, the number of parameters m
(0,+), the delay constraint τmax (0,+), and the transmit power constraint Pmax (0,+)
(in Watt). Therefore, 2¯
Bi(Wi)
∂W 2
i<0 with the allocated bandwidth Wi[0,+), which means
39
that ¯
Bi(Wi)
∂Wimonotonically decreases with the increasing Wi[0,+). Then, combining with
limWi→∞ ¯
Bi(Wi)
∂Wi= 0 in (70), we have
¯
Bi(Wi)
∂Wi
>0 (71)
for Wi[0,+), which means that Biin (69) monotonically increases with Wi[0,+).
Next, based on (69) and the implicit function theorem [29], we can define a function Ψi(Wi, Bi)
to describe the relation between Wiand Bias
Ψi(Wi, Bi)=Ψi(¯
Wi(Bi), Bi) = ¯
Bi(Wi)Bi= 0 . (72)
Then, taking the derivatives of both sides in (72) with respect to Bi, we have
Ψi(Wi, Bi)
∂Bi
+Ψi(Wi, Bi)
∂Wi·¯
Wi(Bi)
∂Bi
= 0 .
Thus, combining with Ψi(Wi,Bi)
∂Wi=¯
Bi(Wi)
∂Wiand Ψi(Wi,Bi)
∂Bi=1, we can obtain that
¯
Wi(Bi)
∂Bi
=
Ψi(Wi,Bi)
∂Bi
Ψi(Wi,Bi)
∂Wi
=1
¯
Bi(Wi)
∂Wi
(a)
>0
where (a) is due to (71). Therefore, ¯
Wi(Bi) monotonically increasing Bi.
E Proof of Proposition 2
Based on (22) and (24a), we can denote φi,1
2τmax
m¯
Ri(Wi)µ
m12=1
2
τmax
mWilog21+ θiPmax
WiN0µ
m1!2.
Then, we have
∂φi
∂Wi
=2
2τmax
m¯
Ri(Wi)µ
m13·
2τmax
m¯
Ri(Wi)µ
m1
∂Wi
=2
2τmax
m¯
Ri(Wi)µ
m13·2τmax
m¯
Ri(Wi)µ
mln 2 ·τmax
mlog21 + θiPmax
WiN0θiPmax
(WiN0+θiPmax) ln 2
=2
2τmax
m¯
Ri(Wi)µ
m13·2τmax
m¯
Ri(Wi)µ
m·τmax
mln 1 + θiPmax
WiN0θiPmax
WiN0+θiPmax
=2τmax
m·
,ϕi
z }| {
2τmax
m¯
Ri(Wi)µ
m·θiPmax
WiN0+θiPmax ln 1 + θiPmax
WiN0
2τmax
m¯
Ri(Wi)µ
m13
| {z }
,ρi
.
Based on this, we have
2φi
∂W 2
i
=2τmax
m·
∂ϕi
∂Wiρi ρi
∂Wiϕi
ρ2
i
,
40
where ρ2
i=2τmax
m¯
Ri(Wi)µ
m161 since the quantization level Bi=τmax
m¯
Ri(Wi)µ
m1,
∂ρi
∂Wi
=
2τmax
m¯
Ri(Wi)µ
m13
∂Wi
=3 2τmax
m¯
Ri(Wi)µ
m12·2τmax
m¯
Ri(Wi)µ
m·τmax
mln 1 + θiPmax
WiN0θiPmax
WiN0+θiPmax
=3 2τmax
m¯
Ri(Wi)µ
m12·2τmax
m¯
Ri(Wi)µ
m1+1·τmax
mln 1 + θiPmax
WiN0θiPmax
WiN0+θiPmax
=3τmax
mln 1 + θiPmax
WiN0θiPmax
WiN0+θiPmax ·2τmax
m¯
Ri(Wi)µ
m13+2τmax
m¯
Ri(Wi)µ
m12,
and
∂ϕi
∂Wi
=
2τmax
m¯
Ri(Wi)µ
m·θiPmax
WiN0+θiPmax ln 1 + θiPmax
WiN0
∂Wi
=2τmax
m¯
Ri(Wi)µ
m·τmax
mln 1 + θiPmax
WiN0θiPmax
WiN0+θiPmax 2
+ 2τmax
m¯
Ri(Wi)µ
mθ2
iP2
max
(WiN0+θiPmax)2Wi
.
Thus,
∂ϕi
∂Wi
ρi∂ρi
∂Wi
ϕi
=τmax
m2τmax
m¯
Ri(Wi)µ
m·ln 1 + θiPmax
WiN0θiPmax
WiN0+θiPmax 2
·2τmax
m¯
Ri(Wi)µ
m13
+ 2τmax
m¯
Ri(Wi)µ
mθ2
iP2
max
(WiN0+θiPmax)2Wi·2τmax
m¯
Ri(Wi)µ
m13
+3τmax
m2τmax
m¯
Ri(Wi)µ
m·ln 1 + θiPmax
WiN0θiPmax
WiN0+θiPmax 2
·2τmax
m¯
Ri(Wi)µ
m13
+3τmax
m2τmax
m¯
Ri(Wi)µ
m·ln 1 + θiPmax
WiN0θiPmax
WiN0+θiPmax 2
·2τmax
m¯
Ri(Wi)µ
m12
=2τmax
m2τmax
m¯
Ri(Wi)µ
m·ln 1 + θiPmax
WiN0θiPmax
WiN0+θiPmax 2
·2τmax
m¯
Ri(Wi)µ
m13
+ 2τmax
m¯
Ri(Wi)µ
mθ2
iP2
max
(WiN0+θiPmax)2Wi·2τmax
m¯
Ri(Wi)µ
m13
+3τmax
m2τmax
m¯
Ri(Wi)µ
m·ln 1 + θiPmax
WiN0θiPmax
WiN0+θiPmax 2
·2τmax
m¯
Ri(Wi)µ
m12.
Since the quantization level Bi=τmax
m¯
Ri(Wi)µ
m1, we have 2 τmax
m¯
Ri(Wi)µ
m11. Besides,
with the allocated bandwidth Wi[Wi(1),+) in the constraint (24b), as well as the number of
parameters m(0,+) and the delay constraint τmax (0,+) in the practical communication
systems, we have 2φi
∂W 2
i0, which means φiis convex with respect to Wiin the feasible region of
(24b). Therefore, the objective function (24) is convex.
41
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
In this paper, the problem of training federated learning (FL) algorithms over a realistic wireless network is studied. In the considered model, wireless users execute an FL algorithm while training their local FL models using their own data and transmitting the trained local FL models to a base station (BS) that generates a global FL model and sends the model back to the users. Since all training parameters are transmitted over wireless links, the quality of training is affected by wireless factors such as packet errors and the availability of wireless resources. Meanwhile, due to the limited wireless bandwidth, the BS needs to select an appropriate subset of users to execute the FL algorithm so as to build a global FL model accurately. This joint learning, wireless resource allocation, and user selection problem is formulated as an optimization problem whose goal is to minimize an FL loss function that captures the performance of the FL algorithm. To seek the solution, a closed-form expression for the expected convergence rate of the FL algorithm is first derived to quantify the impact of wireless factors on FL. Then, based on the expected convergence rate of the FL algorithm, the optimal transmit power for each user is derived, under a given user selection and uplink resource block (RB) allocation scheme. Finally, the user selection and uplink RB allocation is optimized so as to minimize the FL loss function. Simulation results show that the proposed joint federated learning and communication framework can improve the identification accuracy by up to 1:4%, 3:5% and 4:1%, respectively, compared to: 1) An optimal user selection algorithm with random resource allocation, 2) a standard FL algorithm with random user selection and resource allocation, and 3) a wireless optimization algorithm that minimizes the sum packet error rates of all users while being agnostic to the FL parameters.
Article
With growth in the number of smart devices and advancements in their hardware, in recent years, data-driven machine learning techniques have drawn significant attention. However, due to privacy and communication issues, it is not possible to collect this data at a centralized location. Federated learning is a machine learning setting where the centralized location trains a learning model over remote devices. Federated learning algorithms cannot be employed in the real world scenarios unless they consider unreliable and resource-constrained nature of the wireless medium. In this paper, we propose a federated learning algorithm that is suitable for cellular wireless networks. We prove its convergence, and provide a sub-optimal scheduling policy that improves the convergence rate. We also study the effect of local computation steps and communication steps on the convergence of the proposed algorithm. We prove, in practice, federated learning algorithms may solve a different problem than the one that they have been employed for if the unreliability of wireless channels is neglected. Finally, through numerous experiments on real and synthetic datasets, we demonstrate the convergence of our proposed algorithm.
Article
Communication has been known to be one of the primary bottlenecks of federated learning (FL), and yet existing studies have not addressed the efficient communication design, particularly in wireless FL where both uplink and downlink communications have to be considered. In this paper, we focus on the design and analysis of physical layer quantization and transmission methods for wireless FL. We answer the question of what and how to communicate between clients and the parameter server and evaluate the impact of the various quantization and transmission options of the updated model on the learning performance. We provide new convergence analysis of the well-known FED AVG under non-i.i.d. dataset distributions, partial clients participation, and finite-precision quantization in uplink and downlink communications. These analyses reveal that, in order to achieve an O (1/T) convergence rate with quantization, transmitting the weight requires increasing the quantization level at a logarithmic rate, while transmitting the weight differential can keep a constant quantization level. Comprehensive numerical evaluation on various real-world datasets reveals that the benefit of a FL-tailored uplink and downlink communication design is enormous - a carefully designed quantization and transmission achieves more than 98% of the floating-point baseline accuracy with fewer than 10% of the baseline bandwidth, for majority of the experiments on both i.i.d. and non-i.i.d. datasets. In particular, 1-bit quantization (3.1% of the floating-point baseline bandwidth) achieves 99.8% of the floating-point baseline accuracy at almost the same convergence rate on MNIST, representing the best known bandwidth-accuracy tradeoff to the best of the authors' knowledge.
Article
Federated edge learning (FEEL) is a popular framework for model training at an edge server using data distributed at edge devices (e.g., smart-phones and sensors) without compromising their privacy. In the FEEL framework, edge devices periodically transmit high-dimensional stochastic gradients to the edge server, where these gradients are aggregated and used to update a global model. When the edge devices share the same communication medium, the multiple access channel (MAC) from the devices to the edge server induces a communication bottleneck. To overcome this bottleneck, an efficient broadband analog transmission scheme has been recently proposed, featuring the aggregation of analog modulated gradients (or local models) via the waveform-superposition property of the wireless medium. However, the assumed linear analog modulation makes it difficult to deploy this technique in modern wireless systems that exclusively use digital modulation. To address this issue, we propose in this work a novel digital version of broadband over-the-air aggregation, called one-bit broadband digital aggregation (OBDA). The new scheme features one-bit gradient quantization followed by digital quadrature amplitude modulation (QAM) at edge devices and over-the-air majority-voting based decoding at edge server. We provide a comprehensive analysis of the effects of wireless channel hostilities (channel noise, fading, and channel estimation errors) on the convergence rate of the proposed FEEL scheme. The analysis shows that the hostilities slow down the convergence of the learning process by introducing a scaling factor and a bias term into the gradient norm. However, we show that all the negative effects vanish as the number of participating devices grows, but at a different rate for each type of channel hostility.
Article
This paper studies federated learning (FL) in a classic wireless network, where learning clients share a common wireless link to a coordinating server to perform federated model training using their local data. In such wireless federated learning networks (WFLNs), optimizing the learning performance depends crucially on how clients are selected and how bandwidth is allocated among the selected clients in every learning round, as both radio and client energy resources are limited. While existing works have made some attempts to allocate the limited wireless resources to optimize FL, they focus on the problem in individual learning rounds, overlooking an inherent yet critical feature of federated learning. This paper brings a new long-term perspective to resource allocation in WFLNs, realizing that learning rounds are not only temporally interdependent but also have varying significance towards the final learning outcome. To this end, we first design data-driven experiments to show that different temporal client selection patterns lead to considerably different learning performance. With the obtained insights, we formulate a stochastic optimization problem for joint client selection and bandwidth allocation under long-term client energy constraints, and develop a new algorithm that utilizes only currently available wireless channel information but can achieve long-term performance guarantee. Experiments show that our algorithm results in the desired temporal client selection pattern, is adaptive to changing network environments and far outperforms benchmarks that ignore the long-term effect of FL.
Article
In recent years, mobile devices are equipped with increasingly advanced sensing and computing capabilities. Coupled with advancements in Deep Learning (DL), this opens up countless possibilities for meaningful applications, e.g., for medical purposes and in vehicular networks. Traditional cloud-based Machine Learning (ML) approaches require the data to be centralized in a cloud server or data center. However, this results in critical issues related to unacceptable latency and communication inefficiency. To this end, Mobile Edge Computing (MEC) has been proposed to bring intelligence closer to the edge, where data is produced. However, conventional enabling technologies for ML at mobile edge networks still require personal data to be shared with external parties, e.g., edge servers. Recently, in light of increasingly stringent data privacy legislations and growing privacy concerns, the concept of Federated Learning (FL) has been introduced. In FL, end devices use their local data to train an ML model required by the server. The end devices then send the model updates rather than raw data to the server for aggregation. FL can serve as an enabling technology in mobile edge networks since it enables the collaborative training of an ML model and also enables DL for mobile edge network optimization. However, in a large-scale and complex mobile edge network, heterogeneous devices with varying constraints are involved. This raises challenges of communication costs, resource allocation, and privacy and security in the implementation of FL at scale. In this survey, we begin with an introduction to the background and fundamentals of FL. Then, we highlight the aforementioned challenges of FL implementation and review existing solutions. Furthermore, we present the applications of FL for mobile edge network optimization. Finally, we discuss the important challenges and future research directions in FL.
Article
The recent revival of AI is revolutionizing almost every branch of science and technology. Given the ubiquitous smart mobile gadgets and IoT devices, it is expected that a majority of intelligent applications will be deployed at the edge of wireless networks. This trend has generated strong interest in realizing an "intelligent edge" to support AI-enabled applications at various edge devices. Accordingly, a new research area, called edge learning, has emerged, which crosses and revolutionizes two disciplines: wireless communication and machine learning. A major theme in edge learning is to overcome the limited computing power, as well as limited data, at each edge device. This is accomplished by leveraging the mobile edge computing platform and exploiting the massive data distributed over a large number of edge devices. In such systems, learning from distributed data and communicating between the edge server and devices are two critical and coupled aspects, and their fusion poses many new research challenges. This article advocates a new set of design guidelines for wireless communication in edge learning, collectively called learning- driven communication. Illustrative examples are provided to demonstrate the effectiveness of these design guidelines. Unique research opportunities are identified.