Quantized Federated Learning under Transmission Delay and
Outage Constraints
Yanmeng Wang, Yanqing Xu, Qingjiang Shi, and Tsung-Hui Chang
Abstract
Federated learning (FL) has been recognized as a viable distributed learning paradigm which trains a machine learning model collaboratively with massive mobile devices at the wireless edge while protecting user privacy. Although various communication schemes have been proposed to expedite the FL process, most of them have assumed ideal wireless channels which provide reliable and lossless communication links between the server and mobile clients. Unfortunately, in practical systems with limited radio resources, such as constraints on the training latency and on the transmission power and bandwidth, transmission of a large number of model parameters inevitably suffers from quantization errors (QE) and transmission outage (TO). In this paper, we consider such non-ideal wireless channels and carry out the first analysis showing that the FL convergence can be severely jeopardized by TO and QE, but intriguingly can be alleviated if the clients have uniform outage probabilities. These insightful results motivate us to propose a robust FL scheme, named FedTOE, which performs joint allocation of wireless resources and quantization bits across the clients to minimize the QE while making the clients have the same TO probability. Extensive experimental results are presented to show the superior performance of FedTOE for a deep learning-based classification task with transmission latency constraints.
Keywords: Federated learning, transmission outage, quantization error, convergence rate, wireless resource allocation.
1 Introduction
With the rapid development of mobile communications and artificial intelligence (AI), edge AI, a paradigm that exploits locally generated data to learn a machine learning (ML) model at the wireless edge, has attracted increasing attention from both academia and industry [1-3]. In particular, federated learning (FL) has been proposed to allow an edge server to coordinate massive mobile clients to collaboratively train a shared ML model without accessing the clients' raw data [4]. However, FL faces several critical challenges, including that the mobile clients have dramatically different data distributions (data heterogeneity) and different computation capabilities
Y. Wang, Y. Xu and T.-H. Chang are with the School of Science and Engineering, The Chinese University of
Hong Kong, Shenzhen 518172, China, and also with the Shenzhen Research Institute of Big Data, Shenzhen 518172,
China (e-mail: hiwangym@gmail.com, xuyanqing@cuhk.edu.cn, tsunghui.chang@ieee.org). Q. Shi is with the School
of Software Engineering, Tongji University, Shanghai 201804, China, and also with the Shenzhen Research Institute
of Big Data, Shenzhen 518172, China (e-mail: shiqj@tongji.edu.cn). (Corresponding author: Tsung-Hui Chang.)
(device heterogeneity) [5]. Moreover, the training is subject to latency constraints and limited communication resources when serving a large number of clients. In view of this, the well-known FedAvg algorithm [4], with local stochastic gradient descent (local SGD) and partial participation of clients, is widely adopted to reduce the training latency and communication overhead [6]. Furthermore, several improved FL algorithms have been proposed to reduce the inter-client variance caused by data heterogeneity [7, 8] and device heterogeneity [5, 9].
Recently, wireless resource scheduling has been introduced into FL from different perspectives. Firstly, some works have aimed to reduce the total training latency by improving the data throughput between the clients and the server under a limited resource budget. For example, [10] adopted joint client selection and beamforming design at the server to maximize the number of selected clients while guaranteeing the mean squared error performance of the received data at the server, while [11] introduced a hierarchical FL framework to maximize the uplink transmission rate under bandwidth and transmit power constraints. With a slight difference, [12] proposed a "later-is-better" principle to jointly optimize the client selection and bandwidth allocation throughout the training process under a total energy budget. However, all the above works did not explicitly consider the influence of resource allocation on the FL performance, and thus cannot directly minimize the training latency.
Secondly, some works aimed to achieve high learning performance within a total training latency budget by analyzing the theoretical relation between the number of communication rounds and the achieved learning accuracy. For instance, based on the number of communication rounds required to attain a certain model accuracy, [13] and [14] proposed to optimize the bandwidth allocation to minimize the total latency of the FedAvg algorithm. The work [15] optimized resource allocation under delay constraints and captured two tradeoffs: the tradeoff between computation and communication latencies, and that between training latency and energy consumption of all clients. While these works can minimize the training latency directly, they have assumed ideal wireless channels with reliable and lossless transmissions.
Some recent works have considered FL and wireless resource allocation under non-ideal wireless environments. For example, the work [16] studied the influence of the packet error rate on the convergence of FedAvg, and proposed a joint resource allocation and client selection scheme to improve the convergence speed of FedAvg. The work [17] attempted to redesign the averaging scheme of local models based on the transmission outage (TO) probabilities. The work [18] exploited the waveform-superposition property of broadband channels to reduce the transmission delay, and also investigated the impacts of channel fading and imperfect channel knowledge on the FL convergence. On the other hand, some works considered compressed transmission via quantization and analyzed the influence of the quantization error (QE) on the FL performance. For instance, [19] proposed a communication-efficient FL method, FedPAQ, which sends the quantized global model in the downlink, and then analyzed the effect of QE on the convergence of FL. Besides, the authors of [20] considered layered quantized transmissions for communication-efficient FL, where different quantization levels are assigned to different layers of the trained neural network. It is noted that in the aforementioned works [16-20], the issues of TO and QE have never been considered simultaneously.
In this paper, we highlight the need to study the joint impact of TO and QE on FL, especially when the transmission latency is constrained. Specifically, given a transmission delay constraint, a larger number of quantization bits leads to a smaller QE of the transmitted model but demands a higher transmission rate, which in turn results in a larger TO probability [21]. Therefore, either when the model size is large or when the latency constraint is stringent, it is essential to take both TO and QE into account in the FL process. In view of this, unlike the existing works [16-20], we study the joint effects of TO and QE while also considering that the clients have non-i.i.d. data distributions. To overcome these effects, we propose a new FL scheme, called FedTOE (Federated learning with Transmission Outage and quantization Error), which performs joint allocation of wireless resources and quantization bits to achieve robust FL performance in such a non-ideal learning environment. In particular, our main contributions include:
(1) FL convergence analysis under both TO and QE: We consider a non-convex FL problem, which is more general than the convex problems studied in [16, 17, 20], and consider non-ideal (uplink) wireless channels with both TO and QE. To the best of our knowledge, this paper is the first to analyze the influence of both TO and QE on the FL convergence simultaneously. The derived theoretical results show that non-uniform TO probabilities not only lead to a biased solution [5] but also amplify the negative effects caused by QE and non-i.i.d. data distribution (data heterogeneity). Intriguingly, this undesired property can be alleviated if the clients have the same TO probabilities.

(2) FedTOE: Inspired by this observation, we formulate a resource allocation problem to mitigate the impacts of TO and QE. Specifically, we propose to carefully allocate the (uplink) transmission bandwidth and quantization bits of the clients to minimize the aggregate QE subject to constraints on the transmission latency and TO probabilities. We show that the optimal solution to this problem achieves a uniform TO probability across the clients while minimizing the QE.

(3) Experiments: The proposed FedTOE is implemented for a deep learning-based handwritten-digit recognition task, and the experimental results demonstrate that FedTOE has promising performance over benchmark schemes.
Synopsis: Section 2 introduces the proposed system model of FL in the wireless environment. Then, Section 3 presents the convergence rate analysis of FL under both TO and QE. Based on these results, the wireless resource allocation scheme (i.e., FedTOE) is formulated in Section 4. The experimental results are presented in Section 5. Finally, Section 6 concludes this paper.
2 System model
2.1 Federated Learning Algorithm
Consider a wireless FL network as shown in Fig. 1, where a central server coordinates N mobile clients to solve the following distributed learning problem

$$\min_{\mathbf{w}\in\mathbb{R}^m} F(\mathbf{w}) = \sum_{i=1}^{N} p_i F_i(\mathbf{w}), \qquad (1)$$

where $F_i(\mathbf{w})$ is the (possibly) non-convex local loss function, $\mathbf{w}\in\mathbb{R}^m$ denotes the m-dimensional model parameters to be learned, and $p_i = n_i/\sum_{j=1}^{N} n_j$, in which $n_i$ is the number of data samples stored in client i. Let $\boldsymbol{\xi}_i$ be the mini-batch samples with size b; we denote $F_i(\mathbf{w},\boldsymbol{\xi}_i) = \frac{1}{b}\sum_{j=1}^{b} f(\mathbf{w},\xi_{ij})$, where $\xi_{ij}$ is the j-th randomly selected sample from the dataset of client i, and $f(\mathbf{w},\xi_{ij})$ is the model loss function with respect to $\xi_{ij}$. When $b = n_i$, $\boldsymbol{\xi}_i$ refers to the whole local dataset of client i, and then $F_i(\mathbf{w},\boldsymbol{\xi}_i) = F_i(\mathbf{w})$.

Figure 1: Federated learning in the wireless edge.
We follow the seminal FedAvg algorithm [4]. Specifically, in the r-th communication round, FedAvg executes the following three steps (see Fig. 1):

(a) Broadcasting: The server samples K clients, denoted by the set $S_r$ where $|S_r| = K$, and then broadcasts the global model $\bar{\mathbf{w}}^{r-1}$ of the last communication round to each client $i\in S_r$.

(b) Local model updating: Each client $i\in S_r$ updates its local model by local stochastic gradient descent (SGD) [7]. It contains E consecutive SGD updates as follows:

$$\mathbf{w}_i^{r,0} = \bar{\mathbf{w}}^{r-1},$$
$$\mathbf{w}_i^{r,\ell} = \mathbf{w}_i^{r,\ell-1} - \gamma\nabla F_i(\mathbf{w}_i^{r,\ell-1},\boldsymbol{\xi}_i^{r,\ell}), \quad \ell = 1,\ldots,E, \qquad (2)$$

where γ is the learning rate.

(c) Aggregation: The selected clients upload their local models $\mathbf{w}_i^{r,E}$ to the server for producing a new global model based on a certain aggregation principle.
Specifically, FedAvg considers the following two aggregation schemes, depending on whether all clients participate or not.

(i) Full participation: All clients participate in the aggregation process, i.e., $S_r = \{1,\cdots,N\}$ $\forall r$, and the global model is updated by

$$\tilde{\mathbf{w}}^r = \sum_{i=1}^{N} p_i\mathbf{w}_i^{r,E}. \qquad (3)$$

Considering the massive number of participants in the network, this scheme would not be feasible under limited communication bandwidth in the uplink channels.

(ii) Partial participation: With $|S_r| \ll N$, the global model is updated by

$$\bar{\mathbf{w}}^r = \frac{1}{K}\sum_{i\in S_r}\mathbf{w}_i^{r,E}, \qquad (4)$$

where the K clients ($K \ll N$) in $S_r$ are selected with replacement according to the probability distribution $\{p_1,\cdots,p_N\}$. It should be pointed out that the averaging scheme in (4) leads to an unbiased estimate of $\tilde{\mathbf{w}}^r$ in (3), i.e., $\mathbb{E}[\bar{\mathbf{w}}^r] = \tilde{\mathbf{w}}^r$ [6].
However, the aforementioned schemes are still far from practical. In particular, in digital communication systems, the model parameters need to be quantized before being transmitted, which brings QEs to the learned model. Meanwhile, channel fading could cause TO in the delivery of the model parameters from time to time. Moreover, given a fixed transmission delay, QE is strongly coupled with TO. Specifically, a larger number of quantization bits leads to a smaller QE of the learned model but requires a higher transmission rate, which in turn can further elevate the TO probability. Therefore, it is essential to consider TO and QE simultaneously in wireless FL systems. Motivated by this, in the next two subsections, we incorporate QE and TO in the uplink channels of FL and describe their impacts in detail.¹
2.2 Quantized Transmission

For the local model $\mathbf{w}_i^{r,E}$, we assume that each parameter $w_{ij}^{r,E}$ is bounded with $|w_{ij}^{r,E}| \in [\underline{w}_{ij}^r, \bar{w}_{ij}^r]$, and is quantized by the stochastic quantization method in [22]. In concrete terms, with $B_i^r$ quantization bits, we denote $\{c_0, c_1, \cdots, c_{2^{B_i^r}-1}\}$ as the knobs uniformly distributed in $[\underline{w}_{ij}^r, \bar{w}_{ij}^r]$, where

$$c_u = \underline{w}_{ij}^r + u\times\frac{\bar{w}_{ij}^r - \underline{w}_{ij}^r}{2^{B_i^r}-1}, \quad u = 0,\cdots,2^{B_i^r}-1. \qquad (5)$$

Then, the parameter $w_{ij}^{r,E}$ whose magnitude falls in $[c_u, c_{u+1})$ is quantized by

$$\mathcal{Q}(w_{ij}^{r,E}) = \begin{cases} \mathrm{sign}(w_{ij}^{r,E})\cdot c_u, & \text{w.p. } \dfrac{c_{u+1}-|w_{ij}^{r,E}|}{c_{u+1}-c_u}, \\[2mm] \mathrm{sign}(w_{ij}^{r,E})\cdot c_{u+1}, & \text{w.p. } \dfrac{|w_{ij}^{r,E}|-c_u}{c_{u+1}-c_u}, \end{cases} \qquad (6)$$

where 'w.p.' stands for 'with probability'. In addition, let μ be the number of bits used to represent $\mathrm{sign}(w_{ij}^{r,E})$, $\underline{w}_{ij}^r$ and $\bar{w}_{ij}^r$. Then, the quantized local model $\mathcal{Q}(\mathbf{w}_i^{r,E}) = [\mathcal{Q}(w_{i1}^{r,E}),\cdots,\mathcal{Q}(w_{im}^{r,E})]$ is expressed by a total number of

$$\hat{B}_i^r = mB_i^r + \mu \ \text{bits}, \qquad (7)$$

and is sent to the server.
Lemma 1 With the stochastic quantization method, each local model is unbiasedly estimated as

$$\mathbb{E}[\mathcal{Q}(\mathbf{w}_i^{r,E})] = \mathbf{w}_i^{r,E}, \qquad (8)$$

and the associated QE is bounded by

$$\mathbb{E}[\|\mathcal{Q}(\mathbf{w}_i^{r,E}) - \mathbf{w}_i^{r,E}\|^2] \le \delta_{ir}^2/(2^{B_i^r}-1)^2 \triangleq J_{ir}^2, \qquad (9)$$

where $\delta_{ir} \triangleq \sqrt{\frac{1}{4}\sum_{j=1}^{m}(\bar{w}_{ij}^r - \underline{w}_{ij}^r)^2}$.

Proof: Properties like Lemma 1 have been discussed in the literature; see [19] and [20]. For ease of reference, the proof is presented in Section A of the Supplementary Material.
As one can see from (7) and (9), a higher quantization level $B_i^r$ leads to a larger number of bits $\hat{B}_i^r$ for transmission but a smaller QE.

¹In the current work, we only consider the TO and QE in the uplink transmission, since the server (i.e., base station) is assumed to be powerful enough to provide reliable and lossless communications for the downlink broadcast channels [19].
2.3 Transmission Outage

There are several ways to model TO in wireless channels. For example: 1) without channel state information at the transmitter (CSIT), the transmission may suffer from outage due to large-scale fading such as shadowing [16]; 2) with imperfect CSIT (e.g., imperfect channel estimation or finite-bandwidth feedback), the CSI error could cause transmission outage [23]; 3) with perfect CSIT, due to finite-blocklength transmission, the receiver may fail to decode the message [24]. In this work, for simplicity, we assume no CSIT and focus on the impact of shadowing on the TO of the system.

By assuming that frequency division multiple access (FDMA) is adopted for uplink transmission, the channel capacity of each client $i\in S_r$ is

$$C_i^r = W_i^r\log_2\bigg(1 + \frac{P_i^r|h_i|^2}{W_i^r N_0}\bigg)\ \text{bps}, \qquad (10)$$

where $W_i^r$ and $P_i^r$ denote the allocated bandwidth and transmit power of client i, respectively, $h_i$ is the uplink channel coefficient between the server and client i, and $N_0$ represents the power spectral density (PSD) of the additive noise. According to the channel coding theorem [21], if the transmission rate $R_i^r$ is higher than $C_i^r$, TO occurs and the server fails to decode $\mathcal{Q}(\mathbf{w}_i^{r,E})$ correctly. Suppose that the uplink transmission is subject to a delay constraint $\tau_i$; then $R_i^r = \hat{B}_i^r/\tau_i$. Thus, the outage probability is given by

$$q_i^r \triangleq \Pr(C_i^r < R_i^r). \qquad (11)$$

We model the channel gain in (10) using the classical path loss model with shadowing [21], i.e., $[|h_i|^2]_{\mathrm{dB}} = [K]_{\mathrm{dB}} - \lambda[d_i]_{\mathrm{dB}} + \psi_{\mathrm{dB}}$, where $[x]_{\mathrm{dB}}$ measures x in dB, K is a constant depending on the antenna characteristics and channel attenuation, λ is the path loss exponent, $d_i$ (in meters) is the distance between client i and the server, and $\psi_{\mathrm{dB}} \sim \mathcal{N}(0,\sigma_{\mathrm{dB}}^2)$ is the shadowing. Then, the TO probability in (11) can be computed as

$$q_i = \Pr(\psi_{\mathrm{dB}} < \rho_i) = 1 - Q(\rho_i/\sigma_{\mathrm{dB}}), \qquad (12)$$

where $Q(x) = \int_x^{+\infty}\frac{1}{\sqrt{2\pi}}\exp(-\frac{1}{2}z^2)\,\mathrm{d}z$ is the Q-function and $\rho_i \triangleq [(2^{R_i/W_i}-1)W_iN_0]_{\mathrm{dB}} - [P_i]_{\mathrm{dB}} - [K]_{\mathrm{dB}} + \lambda[d_i]_{\mathrm{dB}}$.
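For illustration, the sketch below evaluates the TO probability (12); the constants σ_dB, [K]_dB, λ, and N_0 mirror the simulation setting of Section 5, while the rate, bandwidth, transmit power, and distance are assumptions made up for this example.

    import math

    def outage_prob(R, W, P_dBm, d, N0_dBm=-174.0, K_dB=-31.54, lam=3.0, sigma_dB=3.65):
        # TO probability (12): q = Pr(psi_dB < rho) = 1 - Q(rho/sigma) = Phi(rho/sigma).
        N0 = 10 ** (N0_dBm / 10) * 1e-3              # noise PSD in W/Hz
        P = 10 ** (P_dBm / 10) * 1e-3                # transmit power in W
        # rho = [(2^{R/W}-1) W N0]_dB - [P]_dB - [K]_dB + lambda [d]_dB
        rho = (10 * math.log10((2 ** (R / W) - 1) * W * N0)
               - 10 * math.log10(P) - K_dB + lam * 10 * math.log10(d))
        # Phi(x) via the error function; Q(x) = 1 - Phi(x)
        return 0.5 * (1 + math.erf(rho / (sigma_dB * math.sqrt(2))))

    # Assumed example: 300 kHz bandwidth, 23 dBm power, 400 m from the server,
    # sending ~143 kbits within 50 ms -> R ~ 2.87 Mbps; yields q ~ 0.15 here.
    print(outage_prob(R=2.87e6, W=3e5, P_dBm=23.0, d=400.0))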
2.4 Federated Learning with QE and TO

Let us reconsider the FedAvg in Section 2.1 in the presence of both TO and QE in the uplink. According to [20] and [22], it is more bit-efficient to transmit the model updates (i.e., $\mathbf{w}_i^{r,E} - \mathbf{w}_i^{r,0}$) than the model $\mathbf{w}_i^{r,E}$ itself in the uplink, since the dynamic range of the model updates can decrease with the number of communication rounds. By adopting this scheme, each client i sends to the server

$$\mathcal{Q}(\Delta\mathbf{w}_i^r) \triangleq \mathcal{Q}\bigg(\frac{1}{\gamma}(\mathbf{w}_i^{r,E} - \mathbf{w}_i^{r,0})\bigg) = \mathcal{Q}\bigg({-\sum_{\ell=1}^{E}\nabla F_i(\mathbf{w}_i^{r,\ell-1},\boldsymbol{\xi}_i^{r,\ell})}\bigg). \qquad (13)$$
Due to TO, the server may fail to receive the uploaded messages. We denote $\mathbb{1}_i^r = 1$ if the server correctly receives the transmitted local model update from client i, and $\mathbb{1}_i^r = 0$ otherwise. Then, with the partial participation scheme in (4), the global model at the server is obtained by

$$\bar{\mathbf{w}}^r = \bar{\mathbf{w}}^{r-1} + \gamma\frac{\sum_{i\in S_r}\mathbb{1}_i^r\mathcal{Q}(\Delta\mathbf{w}_i^r)}{\sum_{i\in S_r}\mathbb{1}_i^r}. \qquad (14)$$

Note that when the channel is ideal without TO and QE, (14) reduces to the simple averaging scheme in (4). We assume that the server can use a cyclic redundancy check (CRC) to detect whether a failure occurs [16]. If $\sum_{i\in S_r}\mathbb{1}_i^r = 0$, i.e., none of the clients successfully transmit their local updates, retransmission is carried out until at least one client's message is correctly received by the server.

In the downlink transmission, the global model $\bar{\mathbf{w}}^r$ is sent to each client $i\in S_r$ (assuming no TO and QE). This consideration is based on the following two reasons. First, the wireless resources of the server for broadcast transmission are arguably abundant enough to transmit the global model parameters reliably with high precision [19]. Second, the selected clients differ from round to round, so an additional caching mechanism would be required to track the latest global model if the server transmitted the model difference $\bar{\mathbf{w}}^r - \bar{\mathbf{w}}^{r-1}$ rather than $\bar{\mathbf{w}}^r$; see [20, 25] for details. The described FL algorithm with uplink TO and QE is summarized in Algorithm 1.

Algorithm 1 FedTOE: FL with uplink TO and QE
1: Initialize the global model $\bar{\mathbf{w}}^0$ at the server.
2: for r = 1, 2, ..., M do
3:   The server samples K clients $S_r$ with replacement based on the probabilities $\{p_1,\cdots,p_N\}$;
4:   The server broadcasts the global model $\bar{\mathbf{w}}^{r-1}$ to the clients in $S_r$;
5:   for client $i\in S_r$ do (in parallel)
6:     $\mathbf{w}_i^{r,0} \leftarrow \bar{\mathbf{w}}^{r-1}$;
7:     for $\ell = 1, 2, \cdots, E$ do
8:       Update the local model by mini-batch SGD in (2);
9:     end for
10:    Send the quantized model update in (13) to the server;
11:  end for
12:  if $\sum_{i\in S_r}\mathbb{1}_i^r = 0$ then
13:    Repeat Step 10 for all clients in $S_r$;
14:  else
15:    The server updates the global model by (14);
16:  end if
17: end for
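The server-side rule (14), together with the retransmission logic of Steps 12-16, can be sketched as follows; the client updates and outage probabilities are placeholder values for illustration, with outage events drawn independently per client.

    import numpy as np

    rng = np.random.default_rng(1)

    def aggregate(w_prev, updates, q, gamma):
        # One global update by (14); retransmit while no client gets through (Steps 12-16).
        while True:
            ok = rng.random(len(updates)) > q       # 1_i^r = 1 with probability 1 - q_i
            if ok.any():                            # at least one successful upload
                break                               # otherwise: all clients retransmit
        received = np.asarray(updates)[ok]
        return w_prev + gamma * received.mean(axis=0)  # average over successful clients only

    w_prev = np.zeros(4)
    updates = [rng.normal(size=4) for _ in range(3)]   # Q(dw_i^r) for i in S_r (placeholders)
    q = np.array([0.1, 0.1, 0.1])                      # uniform TO probabilities
    print(aggregate(w_prev, updates, q, gamma=0.05))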
Remark 1 Fig. 2 illustrates the influence of TO and QE on FL with full participation (i.e., K = N = 100) in the presence of non-i.i.d. data distribution. The ideal scheme suffers neither TO nor QE, while the curves with $B_i = 3$ and 10 refer to schemes that allocate uniform bandwidth and the same quantization level $B_i$ to all clients. For the detailed setting, refer to Section 5.1. One can see from this figure that the scheme with fewer quantization bits (i.e., $B_i = 3$) has impaired performance due to large QE, whereas the one with more quantization bits (i.e., $B_i = 10$) not only has a slower convergence rate but also fails to move toward the right solution due to the bias caused by TO. Therefore, the wireless resources and quantization bits need to be carefully allocated.
Figure 2: Training loss (a) and testing accuracy (b) of different schemes in the wireless environment, where the uplink transmission delay per communication round is constrained to 100 ms.
In view of this, we propose a robust FL scheme in this paper, referred to as FedTOE, which exhibits robustness in such non-ideal wireless channels with TO and QE, as shown in Fig. 2. We first present a novel theoretical analysis of the convergence of Algorithm 1 in the next section, based on which a joint wireless resource and quantization bit allocation scheme is developed in Section 4 to improve the FL performance under TO and QE.
3 Performance analysis

3.1 Assumptions

We consider general smooth non-convex learning problems under the following assumptions.

Assumption 1 Each local function $F_i$ is lower bounded, i.e., $F_i(\mathbf{w}) \ge F^* > -\infty$, and differentiable with $\nabla F_i$ Lipschitz continuous with constant L: $\forall\,\mathbf{v},\mathbf{w}$, $F_i(\mathbf{v}) \le F_i(\mathbf{w}) + (\mathbf{v}-\mathbf{w})^T\nabla F_i(\mathbf{w}) + \frac{L}{2}\|\mathbf{v}-\mathbf{w}\|_2^2$.

Assumption 2 (Unbiasedness and bounded variance of SGD) $\mathbb{E}[\nabla F_i(\mathbf{w},\xi_{ij})] = \nabla F_i(\mathbf{w})$ and $\mathbb{E}[\|\nabla F_i(\mathbf{w},\xi_{ij}) - \nabla F_i(\mathbf{w})\|^2] \le \sigma^2$.

Assumption 3 (Bounded data variance) $\mathbb{E}[\|\nabla F_i(\mathbf{w}) - \nabla F(\mathbf{w})\|^2] \le D_i^2$, $i = 1,\cdots,N$, which measures the heterogeneity of the local datasets [26].
3.2 Theoretical results

For ease of presentation, we consider fixed quantization levels and constant TO probabilities across the training process, i.e., $B_i^r = B_i$ and $q_i^r = q_i$ for all $r = 1,\cdots,M$. As one will see, this simplification is sufficient to reveal how TO and QE impact the algorithm convergence. The extension to the more general case is straightforward and presented in the Supplementary Material.

We first present the following lemma.
Lemma 2 Considering the FL algorithm in Algorithm 1, it holds true that

$$\mathbb{E}\bigg[\frac{\sum_{i\in S_r}\mathbb{1}_i^r\Delta\mathbf{w}_i^r}{\sum_{i\in S_r}\mathbb{1}_i^r}\,\Big|\,\sum_{i\in S_r}\mathbb{1}_i^r \ne 0\bigg] = \sum_{i=1}^{N}\bar{\beta}_i\Delta\mathbf{w}_i^r \qquad (15)$$

for some $\bar{\beta}_i\in[0,1]$ with $\sum_{i=1}^{N}\bar{\beta}_i = 1$, where $\mathbb{E}[\cdot]$ is taken with respect to $S_r$ and $\{\mathbb{1}_i^r\}$. Moreover, we also have

$$\mathbb{E}\bigg[\frac{\sum_{i\in S_r}\mathbb{1}_i^r\Delta\mathbf{w}_i^r}{(\sum_{i\in S_r}\mathbb{1}_i^r)^2}\,\Big|\,\sum_{i\in S_r}\mathbb{1}_i^r \ne 0\bigg] = \sum_{i=1}^{N}\bar{\alpha}_i\Delta\mathbf{w}_i^r \qquad (16)$$

for some $\bar{\alpha}_i \ge 0$, $i = 1,\ldots,N$, and therefore

$$\mathbb{E}\bigg[\frac{1}{\sum_{i\in S_r}\mathbb{1}_i^r}\,\Big|\,\sum_{i\in S_r}\mathbb{1}_i^r \ne 0\bigg] = \sum_{i=1}^{N}\bar{\alpha}_i \triangleq \frac{1}{\bar{K}}. \qquad (17)$$

When $q_i$ is uniform for all clients, i.e., $q_i = q\ \forall i$, then $\bar{\beta}_i = p_i$ and $\bar{\alpha}_i = p_i/\bar{K}\ \forall i$, with

$$\bar{K} = \frac{1-q^K}{\sum_{v=1}^{K}\frac{1}{v}C_K^v(1-q)^v q^{K-v}},$$

where $C_K^v = \frac{K!}{v!(K-v)!}$. In addition, if $q_i = 0\ \forall i$ (no TO), then $\bar{K} = K$.

Proof: See Appendix A.
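As a quick numerical illustration of the uniform-TO case (the values K = 2 and q = 0.5 are chosen only for this example):

$$\frac{1}{\bar{K}} = \frac{\frac{1}{1}C_2^1(1-q)q + \frac{1}{2}C_2^2(1-q)^2}{1-q^2} = \frac{0.5 + 0.125}{0.75} = \frac{5}{6},$$

so $\bar{K} = 1.2$: under 50% outage, the average effective number of active clients lies strictly between one and the K = 2 sampled clients.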
From (15), one can see that $\{\bar{\beta}_i\}$ are the equivalent appearance probabilities of $\{\Delta\mathbf{w}_i^r\}$ in the global aggregation under client sampling and TO, and that they deviate from $\{p_i\}$ when the $\{q_i\}$ are not uniform. Meanwhile, in (17), $\bar{K}$ represents the average effective number of active clients under TO. The main convergence result is stated below.
Theorem 1 Let Assumptions 1 to 3 hold. If one chooses $\gamma = \bar{K}^{\frac{1}{2}}/(8LT^{\frac{1}{2}})$ and $E \le T^{\frac{1}{4}}/\bar{K}^{\frac{3}{4}}$, where $T = ME \ge \max\{\bar{K}^3, 1/\bar{K}\}$ is the total number of SGD updates per client, we have

$$\frac{1}{M}\sum_{r=1}^{M}\mathbb{E}\bigg[\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^2\,\Big|\,\sum_{i\in S_r}\mathbb{1}_i^r \ne 0\bigg] \le \frac{496L\big(\mathbb{E}[F(\bar{\mathbf{w}}^0)]-F^*\big)}{11(T\bar{K})^{\frac{1}{2}}} + \bigg(\frac{39}{88(T\bar{K})^{\frac{1}{2}}}+\frac{1}{88(T\bar{K})^{\frac{3}{4}}}\bigg)\frac{\sigma^2}{b}$$
$$+\underbrace{\frac{31\bar{K}^{\frac{1}{2}}}{88T^{\frac{3}{2}}}\sum_{r=1}^{M}\sum_{i=1}^{N}\bar{\alpha}_iJ_{ir}^2}_{(a)\ \text{caused by QE}} +\underbrace{\frac{31}{22(T\bar{K})^{\frac{1}{4}}}\sum_{i=1}^{N}\bar{\alpha}_iD_i^2}_{(b)\ \text{caused by partial participation and data variance}} +\underbrace{\bigg(\frac{4}{11(T\bar{K})^{\frac{1}{2}}}+\frac{1}{22(T\bar{K})^{\frac{3}{4}}}\bigg)\sum_{i=1}^{N}\bar{\beta}_iD_i^2}_{(c)\ \text{caused by data variance}}$$
$$+\underbrace{\frac{62}{11}\chi^2_{\beta\|p}\sum_{i=1}^{N}p_iD_i^2}_{(d)\ \text{caused by TO and data variance}} +\underbrace{\frac{31}{22(T\bar{K})^{\frac{1}{4}}}\sum_{v=2}^{K}\frac{(q_{\max})^{K-v}C_K^v}{1-(q_{\max})^K}\sum_{i=1}^{N}p_i\|q_i-\bar{q}\|^2D_i^2}_{(e)\ \text{caused by TO and data variance}}, \qquad (18)$$

where $J_{ir}^2$ is given in (9), $\chi^2_{\beta\|p}\triangleq\sum_{i=1}^{N}(\bar{\beta}_i-p_i)^2/p_i$ is the chi-square divergence [5], and $q_{\max}=\max\{q_1,\ldots,q_N\}$ and $\bar{q}=\sum_{i=1}^{N}p_iq_i$ are the maximum and average TO probabilities, respectively.

Proof: Unlike the existing works [16-20, 27, 28], we consider a non-convex FL problem with both TO and QE, which makes Theorem 1 much more challenging to prove. In particular, we adopt the analysis frameworks in [26, 27] and develop several new techniques to deal with the difficulties brought by the TO variables $\mathbb{1}_i^r$ and the deviated probabilities $\bar{\beta}_i$ and $\bar{\alpha}_i$. The details are presented in Appendix B.
The upper bound on the right-hand side (RHS) of (18) reveals several important insights.

Firstly, the upper bound depends on the effective number of clients $\bar{K}$ instead of K, and thus larger TO probabilities directly slow down the algorithm convergence.

Secondly, we observe that, except for the first two terms, the terms (a)-(e) are caused by either QE, non-i.i.d. data distribution, TO, or partial client participation. Therefore, in ideal wireless channels without QE and TO and with full client participation, the terms (a), (b), (d) and (e) can be removed, whereas the term (c), due to the non-i.i.d. data distribution, still impedes the convergence.

Thirdly, the term (d) does not decrease with T. Since it is caused by non-uniform TO probabilities and non-i.i.d. data distribution, this implies that the former amplifies the negative effects of the latter and makes the algorithm converge to a biased solution. Intriguingly, this phenomenon is analogous to the objective inconsistency issue analyzed in [5], where the clients adopt different numbers of local SGD steps.

Last but not least, when the clients have a uniform TO probability, i.e., $q_i = q\ \forall i$, the terms (d) and (e) vanish, showing that the algorithm can still converge to a proper stationary solution. Specifically, by combining with Lemma 2, we can derive the following result:
Corollary 1 Under the same conditions as Theorem 1, if all clients have a uniform TO probability q, we have

$$\frac{1}{M}\sum_{r=1}^{M}\mathbb{E}\bigg[\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^2\,\Big|\,\sum_{i\in S_r}\mathbb{1}_i^r \ne 0\bigg] \le \frac{496L}{11(T\bar{K})^{\frac{1}{2}}}\big(\mathbb{E}[F(\bar{\mathbf{w}}^0)]-F^*\big) + \bigg(\frac{39}{88(T\bar{K})^{\frac{1}{2}}}+\frac{1}{88(T\bar{K})^{\frac{3}{4}}}\bigg)\frac{\sigma^2}{b}$$
$$+\frac{31}{88T^{\frac{3}{2}}\bar{K}^{\frac{1}{2}}}\sum_{r=1}^{M}\sum_{i=1}^{N}p_iJ_{ir}^2 +\bigg(\frac{4}{11(T\bar{K})^{\frac{1}{2}}}+\frac{1}{22(T\bar{K})^{\frac{3}{4}}}+\frac{31}{22T^{\frac{1}{4}}\bar{K}^{\frac{5}{4}}}\bigg)\sum_{i=1}^{N}p_iD_i^2. \qquad (19)$$
From the RHS of (19), we can observe that, with uniform TO probabilities, the impact of QE can be reduced with a larger effective number of clients $\bar{K}$, and the FL algorithm can also achieve a linear speed-up with respect to $\bar{K}$ even when both TO and QE are present. This inspiring result implies that balancing the client TO probabilities is crucial for achieving fast and robust FL in non-ideal wireless channels.
Remark 2 To the best of our knowledge, the claims in Theorem 1 and Corollary 1 and the associated insights have not been discovered in the literature. Note that these results can readily be extended to the general case where the quantization levels $\{B_i^r\}$ and TO probabilities $\{q_i^r\}$ vary with the communication round r. For example, the associated upper bound for Corollary 1 can be obtained by simply replacing $\sum_{i=1}^{N}p_iJ_{ir}^2$ in the RHS of (19) with $\mathbb{E}_{S_r}\big[\frac{1}{K}\sum_{i\in S_r}J_{ir}^2\big]$. More details are shown in Section B of the Supplementary Material.
4 Wireless Resource Allocation
Since both TO and QE inevitably occur in delay-constrained wireless communication systems, we aim to minimize their effects on FL at the wireless edge. Based on the theoretical results in Theorem 1 and Corollary 1, we propose to carefully allocate the wireless resources and quantization bits across the clients to minimize the impact of QE while achieving a uniform TO probability for the clients.
4.1 Proposed FedTOE
Let us first consider an offline scenario, where the bandwidth $W_i$, transmit power $P_i$, quantization level $B_i$, and uplink transmission rate $R_i$ of each client are optimized offline and applied throughout the model learning process. Online scheduling will be considered in Section 4.2.
4.1.1 Problem formulation

Based on Corollary 1 and the definition of QE in (9), the proposed FedTOE considers the following resource allocation problem:

$$\min_{\substack{W_i,P_i,B_i,R_i\\ i=1,\cdots,N}}\ \sum_{i=1}^{N}p_i\cdot\frac{\sum_{r=1}^{M}\delta_{ir}^2}{(2^{B_i}-1)^2} \qquad (20a)$$
$$\text{s.t.}\ \sum_{i=1}^{N}W_i \le W_{\text{total}},\ W_i \ge 0,\ i = 1,\cdots,N, \qquad (20b)$$
$$0 \le P_i \le P_{\max},\ i = 1,\cdots,N, \qquad (20c)$$
$$0 \le \tau_i \le \tau_{\max},\ i = 1,\cdots,N, \qquad (20d)$$
$$0 \le q_i \le q_{\max},\ i = 1,\cdots,N, \qquad (20e)$$
$$B_i \in \mathbb{Z}^+,\ i = 1,\cdots,N, \qquad (20f)$$

where $W_{\text{total}}$ is the total bandwidth of the uplink channel, $P_{\max}$ is the maximum transmit power of each client, $\tau_i$ is the uplink transmission delay per communication round of client i, $\tau_{\max}$ and $q_{\max}$ are the constraints on the uplink transmission delay and TO probabilities, respectively, and $\mathbb{Z}^+$ is the positive integer set.
4.1.2 Uplink delay

Since retransmission is performed if all selected clients encounter outage in the uplink transmission (i.e., $\sum_{i\in S_r}\mathbb{1}_i^r = 0$), the average transmission delay of each selected client $i\in S_r$ at the r-th communication round can be shown to be

$$\bar{\tau}_i^r = \frac{1}{1-\prod_{j\in S_r}q_j}\max_{j\in S_r}\frac{\hat{B}_j}{R_j}, \qquad (21)$$

where the derivation of (21) is presented in Section C of the Supplementary Material. One can see that $\prod_{j\in S_r}q_j \approx 0$ for a large K or small $q_j < 1$, and thus $\bar{\tau}_i^r \approx \max_{j\in S_r}\hat{B}_j/R_j$. To approximately meet the transmission delay constraint in (20d), we therefore replace (20d) by $0 \le \hat{B}_i/R_i \le \tau_{\max}$, $i = 1,\ldots,N$.
4.1.3 Optimality condition

One can show that the solution to (20) satisfies Proposition 1.

Proposition 1 (Optimality condition) After relaxing $B_i\in\mathbb{Z}^+$ to $B_i \ge 1$, $i = 1,\ldots,N$, the optimal solution of problem (20) satisfies: (a) the transmit power $P_i = P_{\max}\ \forall i$; (b) the uplink delay $\tau_i = \hat{B}_i/R_i = \tau_{\max}\ \forall i$; and (c) the outage probability $q_i = q_{\max}\ \forall i$. Moreover, based on (7) and (12), the optimal transmission rate $R_i$ satisfies

$$R_i = \bar{R}_i(W_i) \triangleq W_i\log_2\bigg(1+\frac{\theta_iP_{\max}}{W_iN_0}\bigg), \qquad (22)$$

where $\theta_i \triangleq 10^{\frac{1}{10}(\sigma_{\mathrm{dB}}\cdot Q^{-1}(1-q_{\max}) + [K]_{\mathrm{dB}} - \lambda[d_i]_{\mathrm{dB}})}$, and the optimal quantization level satisfies

$$B_i = (\bar{R}_i(W_i)\tau_{\max} - \mu)/m. \qquad (23)$$

Furthermore, (23) can be equivalently written as $W_i = W_i(B_i)$ for some continuously differentiable and increasing function $W_i(\cdot)$.

Proof: Conditions (a)-(c) can be easily proved by contradiction, based on the monotonicity of (20a) with respect to $B_i$. The existence of $W_i(\cdot)$ and its monotonically increasing property follow from the implicit function theorem [29]. The detailed proof is presented in Section D of the Supplementary Material.

Following Proposition 1, the optimal solution of (20) automatically makes all clients have the same TO probability.
4.1.4 Optimization method

By Proposition 1, problem (20), after relaxing $B_i\in\mathbb{Z}^+$ to $B_i \ge 1$, $i = 1,\ldots,N$, can be reformulated as

$$\min_{\substack{W_i\\ i=1,\cdots,N}}\ \sum_{i=1}^{N}\frac{p_i\sum_{r=1}^{M}\delta_{ir}^2}{\Big(2^{\frac{\tau_{\max}}{m}\bar{R}_i(W_i)-\frac{\mu}{m}}-1\Big)^2} \qquad (24a)$$
$$\text{s.t.}\ \sum_{i=1}^{N}W_i \le W_{\text{total}},\quad W_i \ge W_i(1),\ i = 1,\ldots,N. \qquad (24b)$$

Proposition 2 Problem (24) is convex.

Proof: It can be proved by showing that the second-order derivative of each term in the summation of (24a) with respect to $W_i$ is non-negative. The details are relegated to Section E of the Supplementary Material.

Based on Proposition 2, problem (24) can be efficiently solved by a simple gradient projection method [30] with an initial point in the feasible region of (24b); in practice, the value of $W_i(1)$ can be computed by bisection search based on (23) and the monotonicity of $W_i(B_i)$. Since $B_i$ is a positive integer, after each gradient descent step in optimizing (24), each $B_i$ obtained by (23) is floored to the nearest integer $\lfloor B_i\rfloor$. Then, the bandwidth supporting $\lfloor B_i\rfloor$ with TO probability $q_{\max}$ is given by $W_i(\lfloor B_i\rfloor)$, which is further used as the starting point for the next gradient descent step.
Algorithm 2 FedTOE: Algorithm to solve (20)
1: j = 0
2: while j < maximum iteration number do
3:   Update $W_i$ with a one-step gradient descent and projection on (24);
4:   Compute each $B_i$ (i = 1, ..., N) by (23);
5:   Set each $B_i = \lfloor B_i\rfloor$;
6:   Find each $W_i = W_i(\lfloor B_i\rfloor)$ by bisection search;
7:   j = j + 1;
8: end while
9: Compute each $R_i$ by (22);
Output: Transmit power $P_{\max}$, bandwidth $W_i$, quantization level $B_i$, and transmission rate $R_i$ of each client.

The details of our proposed wireless resource allocation method for offline scheduling are summarized in Algorithm 2. We refer to the FL process in Algorithm 1 with the wireless resource allocation solution given by Algorithm 2 as FedTOE.
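Below is a compact sketch of the loop in Algorithm 2 under stated assumptions: rate() and B_of_W() implement (22)-(23); W_of_B() inverts (23) by bisection (exploiting its monotonicity); the gradient of (24a) is taken numerically, which is possible because the objective is separable across clients; and the budget projection is a crude rescaling stand-in rather than an exact Euclidean projection. The constants, in particular theta and the step size, are invented for this toy example.

    import numpy as np

    # Illustrative constants (not the paper's exact experiment values)
    N, m, mu = 4, 23860, 3 * 64
    W_total, tau_max, Pmax = 20e6, 0.05, 0.2                 # Hz, s, W
    N0 = 10 ** (-17.4) * 1e-3                                # -174 dBm/Hz in W/Hz
    theta = np.array([8e-14, 4e-14, 2e-14, 1e-14])           # theta_i of (22), assumed
    delta2, p = np.ones(N), np.full(N, 1.0 / N)              # constant sum_r delta_ir^2

    def rate(W):          # (22): R_i(W_i) = W_i log2(1 + theta_i Pmax / (W_i N0))
        return W * np.log2(1.0 + theta * Pmax / (W * N0))

    def B_of_W(W):        # (23): B_i = (R_i(W_i) tau_max - mu) / m
        return (rate(W) * tau_max - mu) / m

    def W_of_B(B):        # invert (23) per client by bisection (B_of_W is increasing)
        lo, hi = np.full(N, 1.0), np.full(N, W_total)
        for _ in range(60):
            mid = 0.5 * (lo + hi)
            low = B_of_W(mid) < B
            lo, hi = np.where(low, mid, lo), np.where(low, hi, mid)
        return 0.5 * (lo + hi)

    def grad(W, eps=1.0): # numerical gradient of (24a); separable, so one shifted eval suffices
        f = lambda w: p * delta2 / (2.0 ** B_of_W(w) - 1.0) ** 2
        return (f(W + eps) - f(W)) / eps

    W = np.full(N, W_total / N)                    # feasible initial point
    for _ in range(100):
        W = W - 1e15 * grad(W)                     # Step 3: gradient step (size tuned for this toy)
        W = np.maximum(W, W_of_B(np.ones(N)))      # keep W_i >= W_i(1)
        W = W * min(1.0, W_total / W.sum())        # stand-in projection onto the budget (24b)
        B = np.maximum(np.floor(B_of_W(W)), 1.0)   # Steps 4-5: floor each B_i (at least 1)
        W = W_of_B(B)                              # Step 6: W_i(floor(B_i)) by bisection
    R = rate(W)                                    # Step 9: transmission rates via (22)
    print(B, np.round(W / 1e3), R)

In this toy run, the loop simply reallocates bandwidth toward clients with a higher marginal QE reduction, mirroring how FedTOE favors the distant clients (cf. Fig. 8).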
4.2 Online scheduling

In this subsection, let us investigate the online scenario, where the bandwidth $W_i^r$, transmit power $P_i^r$, quantization level $B_i^r$, and uplink transmission rate $R_i^r$ of each client are optimized for every communication round r. Such online scheduling can make better use of the wireless resources by dynamically allocating bandwidth and quantization bits to the selected clients in $S_r$ at each communication round r. According to Remark 2, we can consider the following QE minimization problem at each communication round:

$$\min_{\substack{W_i^r,P_i^r,B_i^r,R_i^r\\ i\in S_r}}\ \frac{1}{K}\sum_{i\in S_r}\frac{\delta_{ir}^2}{(2^{B_i^r}-1)^2} \qquad (25a)$$
$$\text{s.t.}\ \sum_{i\in S_r}W_i^r \le W_{\text{total}},\ W_i^r \ge 0,\ i\in S_r, \qquad (25b)$$
$$0 \le P_i^r \le P_{\max},\ i\in S_r, \qquad (25c)$$
$$0 \le q_i^r \le q_{\max},\ i\in S_r, \qquad (25d)$$
$$0 \le \bar{\tau}_i^r \le \tau_{\max},\ i\in S_r, \qquad (25e)$$
$$B_i^r \in \mathbb{Z}^+,\ i\in S_r. \qquad (25f)$$

Then, following similar derivations as for the offline scheme in the previous subsection, (25) can be handled by solving

$$\min_{W_i^r,\,i\in S_r}\ \sum_{i\in S_r}\frac{\delta_{ir}^2}{\Big(2^{\frac{\tau_{\max}}{m}\bar{R}_i(W_i^r)-\frac{\mu}{m}}-1\Big)^2} \qquad (26a)$$
$$\text{s.t.}\ \sum_{i\in S_r}W_i^r \le W_{\text{total}},\quad W_i^r \ge W_i(1),\ i\in S_r. \qquad (26b)$$
Table 1: Parameter Setting

  b = 128             N_0 = -174 dBm/Hz      W_total = 20 MHz
  E = 5               [K]_dB = -31.54        q_max = 0.1
  γ = 0.05            λ = 3                  B_min = 64 bits
  σ_dB = 3.65                                B_max = 64 bits
The procedure for solving (25) is similar to Algorithm 2, except that (24) in Step 3 is replaced with (26), $i = 1,\cdots,N$ in Step 4 is replaced with $i\in S_r$, and $W_i$, $B_i$, and $R_i$ are replaced with $W_i^r$, $B_i^r$, and $R_i^r$, respectively.
5 Numerical results

5.1 Parameter setting

In the simulations, we assume that the server (i.e., base station) is located at the cell center with a cell radius of 600 m, and N = 100 clients are uniformly distributed within the cell. The server employs Algorithm 1 to train a 3-layer neural network of size 784 × 30 × 10 for the classification of digits in the MNIST database [31]. In the experiments, we consider two types of local datasets, i.e., the i.i.d. and the non-i.i.d. local datasets. Specifically, in the i.i.d. case, the 60000 training samples in the MNIST database are shuffled and then randomly distributed to the clients, while in the non-i.i.d. case, the training samples are reordered by their digit labels from 0 to 9 and then partitioned so that each client possesses training samples of at most 2 digits. Besides, each client is assumed to possess the same number of training samples, i.e., $n_i = 600$, $i = 1,\ldots,N$.

In the simulations, the size of the quantized local model update is represented by

$$\hat{B}_i^r = m(1+B_i^r) + n_{\min}B_{\min} + n_{\max}B_{\max}\ \text{(bits)}, \qquad (27)$$

where the total number of model parameters is m = 23860, which consists of 23820 (= 784 × 30 + 30 × 10) weights and 40 (= 30 + 10) biases in the adopted neural network, and 1 bit, $B_{\min}$ bits, and $B_{\max}$ bits are used to represent the sign, the lower limit $\underline{w}_{ij}^r$, and the upper limit $\bar{w}_{ij}^r$ of each parameter update, respectively. In the quantization process (6), the weight updates belonging to the same layer share the same range $[\underline{w}_{ij}^r, \bar{w}_{ij}^r]$, and so do the bias updates. In this way, with a hidden layer and an output layer in the trained network, there are in total $n_{\max} = n_{\min} = 4$ different lower and upper limits, respectively, adopted by each client to quantize its local model update. For simplicity, we assume that the clients in $S_r$ have a similar constant $\delta_{ir}$ in (9), which leads to a constant $\sum_{r=1}^{M}\delta_{ir}^2$ for all clients in (20a). The other simulation parameters are listed in Table 1 [16, 21, 32], and all results were obtained by averaging over 5 independent experiments.
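For instance, plugging the adopted network into (27) with $B_i^r = 5$ quantization bits gives

$$\hat{B}_i^r = 23860\times(1+5) + 4\times 64 + 4\times 64 = 143672\ \text{bits} \approx 143.7\ \text{kbits},$$

so meeting a 50 ms per-round delay constraint would require an uplink rate of roughly 2.87 Mbps.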
Three baselines and the ideal scheme are considered for comparison with FedTOE.

Baseline 1. This scheme performs FL by Algorithm 1 with all clients adopting the maximum transmit power $P_{\max}$, the same quantization level $B_i$, uniform bandwidth $W_i = W_{\text{total}}/N$ (offline scheduling) or $W_i = W_{\text{total}}/K$ (online scheduling), and data rate $R_i = \hat{B}_i/\tau_{\max}$.

Baseline 2. Based on [17], the global model is updated by $\bar{\mathbf{w}}^r = \bar{\mathbf{w}}^{r-1} + \frac{\gamma}{K}\sum_{i\in S_r}\frac{p_i}{\hat{p}_i(1-q_i)}\mathbb{1}_i^r\Delta\mathbf{w}_i^r$, where $p_i$ is the weight of client i defined in (1) and $\hat{p}_i$ is the selection probability. For the full-participation case, $\hat{p}_i = 1$, while for the partial-participation case, $\hat{p}_i$ is optimized by formulation (13) in [17]. Since [17] only considers the influence of TO but not quantization, for a fair comparison, we modify the global updating scheme as

$$\bar{\mathbf{w}}^r = \bar{\mathbf{w}}^{r-1} + \frac{\gamma}{K}\sum_{i\in S_r}\frac{p_i}{\hat{p}_i(1-q_i)}\mathbb{1}_i^r\mathcal{Q}(\Delta\mathbf{w}_i^r). \qquad (28)$$

Other settings are the same as in Baseline 1.

Figure 3: TO probability of each client under different schemes, with (a) τ_max = 50 ms and (b) τ_max = 200 ms (a client with a larger index is farther away from the server).
Baseline 3. This scheme considers (20) but with fixed uniform bandwidth $W_i = W_{\text{total}}/N$ (offline) or $W_i = W_{\text{total}}/K$ (online). Thus, only $B_i$ is optimized, as determined by (23).

Ideal. The ideal scheme suffers neither TO nor QE and acts as the performance upper bound in the simulations.
5.2 Performance Comparison with Offline Resource Allocation

5.2.1 TO versus quantization level

To examine the effectiveness of FedTOE, the performance of different schemes is compared under two different constraints on the total uplink transmission delay $\tau_{\text{total}}$: a tight one with $\tau_{\text{total}} = 25$ s and a loose one with $\tau_{\text{total}} = 100$ s. Then, given the total number of communication rounds M = 500, the constraints on the uplink transmission delay per communication round (i.e., $\tau_{\max}$) for the above two cases are 50 ms and 200 ms, respectively.

Based on the above settings, Fig. 3 compares the TO probabilities of the proposed FedTOE and Baseline 1 (with different values of $B_i$). It can be seen from Fig. 3(a) that all clients in FedTOE have uniform TO probabilities, which is consistent with Proposition 1. In contrast, for Baseline 1, the clients farther from the server have larger TO probabilities. This is because the data rate $R_i$ is the same for all clients in Baseline 1, so the client with a longer distance from the server has a larger TO probability in (12). Meanwhile, as shown in Fig. 3(a), Baseline 1 with a larger quantization level $B_i$ incurs a higher TO probability. The reason is that, given a fixed uplink delay, transmitting more bits requires a higher data rate, which increases the TO probability. Further, it can be observed from Fig. 3(b) that under a relaxed delay constraint ($\tau_{\max} = 200$ ms), the TO probabilities in Baseline 1 with all $B_i$ are reduced significantly, since a smaller transmission rate $R_i$ can be used under $\tau_{\max} = 200$ ms, leading to lower TO probabilities.

Figure 4: Comparison between the baselines and FedTOE with τ_max = 50 ms for offline scheduling under the i.i.d. data (panels (a)-(b): K = 10; panels (c)-(d): K = 100).
Next, we evaluate the performance of FedTOE with respect to the communication rounds. In Figs. 4 to 6, the training loss and testing accuracy of the proposed FedTOE, Baseline 1 and Baseline 2 are compared. The performance of the ideal scheme is also shown in the figures. In the simulations, K = 10 refers to partial participation with replacement and K = N = 100 corresponds to full participation of all clients. It should be pointed out that the retransmission rounds caused when all clients experience TO are also counted.

The i.i.d. data case. One can see from Fig. 4(a) and Fig. 4(b) that under the i.i.d. case, both FedTOE and Baseline 1 with smaller $B_i = 2, 5$ perform closely to the ideal scheme. Specifically, under the i.i.d. case with data variance $D_i^2 \approx 0$, the objective inconsistency in Theorem 1 vanishes and the model learned by Baseline 1 can converge in the right direction even with TO. However, the TO probabilities affect the average effective number of active clients $\bar{K}$; thus Baseline 1 with $B_i = 10$ in Fig. 4(a) and Fig. 4(b) has deteriorated performance due to the higher TO probabilities and the large number of retransmission rounds. Interestingly, as shown in Fig. 4(c)-(d), when the number of selected clients K increases to 100, the effect of the outage probabilities in Baseline 1 is alleviated, since more clients can transmit their local model updates successfully.
Figure 5: Comparison between the baselines and FedTOE with τ_max = 50 ms for offline scheduling under the non-i.i.d. case (panels (a)-(b): K = 10; panels (c)-(d): K = 100).
It can also be observed from Fig. 4 that Baseline 2 [17] with $B_i = 5$ and 10 fails to learn the model. This is because, for partial participation with K = 10, higher selection probabilities in Baseline 2 are allocated to the clients with larger TO probabilities, thus reducing the effective number of active clients $\bar{K}$ and consequently slowing down the convergence of FL. Meanwhile, for full participation with K = 100, Baseline 2 with larger $B_i = 5$ and 10 still cannot correctly update the global model, since the averaging scheme (28) in Baseline 2 becomes unstable if the outage probability $q_i$ is large.

The non-i.i.d. data case. Comparing Fig. 4 with Fig. 5, we find that non-i.i.d. data degrade all curves, but the proposed FedTOE still performs closely to the ideal scheme and outperforms both Baselines 1 and 2. Specifically, one can observe from Fig. 5 that Baselines 1 and 2 with $B_i = 2$ have deteriorated performance, since the non-i.i.d. data amplify the effect of QE and $B_i = 2$ is not enough to accurately represent the model update. Different from the previous i.i.d. case, the reason why Baseline 1 with $B_i = 5$ and 10 fails to learn the model with non-i.i.d. data is that not only do the high TO probabilities decrease $\bar{K}$, but the non-uniform TO probabilities among clients also cause the objective inconsistency discussed in Theorem 1. Meanwhile, as shown in Fig. 5(c) and Fig. 5(d), the influence of non-uniform TO on Baseline 1 under the non-i.i.d. case cannot be alleviated by increasing the number of selected clients K to 100. Besides, different from Baselines 1 and 2, FedTOE can adaptively determine the quantization levels via (20) to achieve superior performance.

Figure 6: Comparison between the baselines and FedTOE with τ_max = 200 ms for offline scheduling under the non-i.i.d. case (K = 10).
Finally, it can be observed from Fig. 6 that under a looser per-round delay constraint ($\tau_{\max} = 200$ ms), Baselines 1 and 2 with $B_i = 5$ and 10 can also perform well, since the TO probabilities under $\tau_{\max} = 200$ ms are no longer high and become similar among clients, as shown in Fig. 3(b). In this situation, QE becomes the dominant factor in the FL performance; thus Baselines 1 and 2 with $B_i = 2$ still perform worse owing to large QE.

As a brief summary, the proposed FedTOE can automatically find the optimal bandwidth allocation $W_i$, quantization level $B_i$, and transmission rate $R_i$ for each client under different transmission delay constraints, and delivers robust FL performance in both the i.i.d. and non-i.i.d. cases.
5.2.2 Necessity of optimizing the bandwidth allocation

In this part, we demonstrate the necessity of optimizing the bandwidth allocation for FL. First of all, Fig. 7 compares the training loss and testing accuracy of FedTOE and Baseline 3 with respect to the total uplink transmission time $\tau_{\text{total}} = M\tau_{\max}$, under various per-round delay constraints $\tau_{\max}$. One can observe that for $\tau_{\max} = 50$ ms, FedTOE performs significantly better than Baseline 3, while for $\tau_{\max} \ge 100$ ms, the two schemes perform comparably. However, both schemes do not converge well for $\tau_{\max} = 40$ ms, due to the insufficient number of quantization bits under the stringent delay constraint.

To analyze why FedTOE outperforms Baseline 3, we plot in Fig. 8 the uplink bandwidth and quantization level allocated to the clients by the two schemes, where a client with a larger index is farther from the server. In the optimal wireless resource allocation of both FedTOE and Baseline 3, the outage probabilities of all clients achieve $q_{\max} = 0.1$. Under this condition, it can be seen from Fig. 8(a) that FedTOE prefers to allocate more bandwidth to the clients farther away from the server and less bandwidth to the clients close to the server, thus allowing a more uniform allocation of quantization bits, as shown in Fig. 8(b). On the contrary, Baseline 3 (which has a uniform bandwidth allocation) allocates larger $B_i$ to the clients close to the server, since they have larger channel capacities, whereas it has to allocate smaller $B_i$ to the distant clients due to the delay constraint, which causes significant QE. Therefore, when $\tau_{\max}$ is large, FedTOE and Baseline 3 perform equally well; however, when $\tau_{\max}$ is small, FedTOE can greatly outperform Baseline 3, as seen in Fig. 7.

Figure 7: Comparison between Baseline 3 and FedTOE with different τ_max for offline scheduling under the non-i.i.d. case (K = 10).

Figure 8: Allocated bandwidth $W_i$ (a) and quantization level $B_i$ (b) of each client for offline scheduling (a client with a larger index is farther from the server).
Lastly, one can see from Fig. 7 that a tighter per-round delay $\tau_{\max}$ can speed up the learning process if the total uplink transmission time $\tau_{\text{total}}$ is constrained. For example, FedTOE under $\tau_{\max} = 50$ ms has a faster learning speed than under $\tau_{\max} \ge 100$ ms. This is because a smaller $\tau_{\max}$ allows a larger number of communication rounds M under a fixed $\tau_{\text{total}}$. Similarly, Baseline 3 under a smaller $\tau_{\max}$ converges faster than under $\tau_{\max} \ge 100$ ms.
5.3 Performance Comparison with Online Scheduling

In this subsection, the performance of the proposed FedTOE with online scheduling is evaluated. In online scheduling, the total 20 MHz bandwidth is allocated to only the K = 10 selected clients per round, instead of to all 100 clients as in the offline scheme. The larger bandwidth allocated to each client improves its transmission rate and thus reduces the uplink transmission delay. Hence, compared with the per-round uplink delay constraint $\tau_{\max}$ adopted for offline scheduling in Fig. 5, we choose a much tighter $\tau_{\max} = 9$ ms to compare the training loss and testing accuracy of FedTOE, Baseline 1, and Baseline 2 under online scheduling. It can be seen from Fig. 9 that FedTOE still outperforms Baselines 1 and 2 under online scheduling. Specifically, Baselines 1 and 2 with $B_i = 2$ have poorer performance because of higher QE, while $B_i = 10$ fails to update the global model due to high TO probabilities. Meanwhile, Baseline 2 with $B_i = 5$ converges more slowly and fluctuates considerably because of the unstable averaging scheme (28) under high TO probabilities. While Baseline 1 with $B_i = 5$ gradually approaches FedTOE, FedTOE has a faster convergence rate and can dynamically adjust the quantization levels by (25) at each communication round.

Finally, Fig. 10 compares the performance of FedTOE and Baseline 3 under online scheduling with different uplink delay constraints. It can be observed from Fig. 10 that for a smaller uplink delay $\tau_{\max} = 6$ ms or 9 ms, FedTOE has a significant advantage over Baseline 3.

Figure 9: Comparison between the baselines and FedTOE with τ_max = 9 ms for online scheduling under the non-i.i.d. case (K = 10).

Figure 10: Comparison between Baseline 3 and FedTOE with different τ_max for online scheduling under the non-i.i.d. case (K = 10).
6 Conclusion

In this paper, we have investigated FL over non-ideal wireless channels in the presence of both TO and QE. We have carried out a novel convergence analysis which shows that TO and QE, together with non-i.i.d. data distribution, can significantly impede the FL process. In particular, we have shown that when the clients have heterogeneous TO probabilities, not only can the negative effects of QE and non-i.i.d. data distribution be enlarged, but the algorithm can also converge to a biased solution. On the contrary, when the clients have a uniform TO probability, these issues can be alleviated. Inspired by this result, we have proposed FedTOE, which performs joint allocation of bandwidth and quantization bits to minimize the QE while satisfying the transmission delay constraint and uniform TO probabilities. The presented experimental results have demonstrated that FedTOE exhibits superior robustness against TO and QE compared to existing schemes. Moreover, the experimental results have also shown that a tighter transmission delay constraint per communication round may speed up the FL process.
Appendices

A Proof of Lemma 2

A.1 Proof of (15) and (16)

At each communication round, K clients are selected independently and with replacement based on the probability distribution $\{p_i\}_{i=1}^{N}$. As a result, there are $N^K$ different possibilities for the set $S_r$ (denoted by $S_r^g$, $g = 1,\ldots,N^K$), and the appearance probability of each set $S_r^g$ is $\Pr(S_r = S_r^g) = \prod_{i\in S_r^g}p_i$. Meanwhile, since TOs occur independently across the clients, we have $\Pr\big(\sum_{i\in S_r}\mathbb{1}_i^r \ne 0\big) = 1 - \prod_{i\in S_r}q_i$. Then, we can obtain (15) for some non-negative $\bar{\beta}_i$, $i = 1,\ldots,N$, according to the derivations in (29):

$$\mathbb{E}\bigg[\frac{\sum_{i\in S_r}\mathbb{1}_i^r\Delta\mathbf{w}_i^r}{\sum_{i\in S_r}\mathbb{1}_i^r}\,\Big|\,\sum_{i\in S_r}\mathbb{1}_i^r \ne 0\bigg] \qquad (29a)$$
$$= \mathbb{E}_{S_r}\bigg[\mathbb{E}_{\mathrm{TO}}\bigg[\frac{\sum_{i\in S_r}\mathbb{1}_i^r\Delta\mathbf{w}_i^r}{\sum_{i\in S_r}\mathbb{1}_i^r}\,\Big|\,\sum_{i\in S_r}\mathbb{1}_i^r \ne 0\bigg]\bigg] \qquad (29b)$$
$$= \mathbb{E}_{S_r}\Bigg[\sum_{v=1}^{K}\sum_{\substack{\mathcal{B}_r\cup\bar{\mathcal{B}}_r=S_r\\|\mathcal{B}_r|=v,\,|\bar{\mathcal{B}}_r|=K-v}}\Pr\Big(\mathbb{1}_{k_1}^r=1\ \forall k_1\in\mathcal{B}_r,\ \mathbb{1}_{k_2}^r=0\ \forall k_2\in\bar{\mathcal{B}}_r\,\Big|\,\sum_{i\in S_r}\mathbb{1}_i^r\ne 0\Big)\cdot\frac{\sum_{k_1\in\mathcal{B}_r}\Delta\mathbf{w}_{k_1}^r}{v}\Bigg] \qquad (29c)$$
$$= \sum_{g=1}^{N^K}\bigg(\prod_{i\in S_r^g}p_i\bigg)\cdot\Bigg(\sum_{v=1}^{K}\sum_{\substack{\mathcal{B}_r^g\cup\bar{\mathcal{B}}_r^g=S_r^g\\|\mathcal{B}_r^g|=v,\,|\bar{\mathcal{B}}_r^g|=K-v}}\frac{\prod_{k_1\in\mathcal{B}_r^g}(1-q_{k_1})\prod_{k_2\in\bar{\mathcal{B}}_r^g}q_{k_2}}{1-\prod_{i\in S_r^g}q_i}\cdot\frac{\sum_{k_1\in\mathcal{B}_r^g}\Delta\mathbf{w}_{k_1}^r}{v}\Bigg) \qquad (29d)$$
$$\triangleq \sum_{i=1}^{N}\bar{\beta}_i\Delta\mathbf{w}_i^r, \qquad (29e)$$

where in (29c), $\mathcal{B}_r$ is the set of selected clients without TO while $\bar{\mathcal{B}}_r$ is that of the clients with TO, and in (29d), $\prod_{k_1\in\mathcal{B}_r^g}(1-q_{k_1})\prod_{k_2\in\bar{\mathcal{B}}_r^g}q_{k_2}$ is the probability of the event that solely the clients in $\mathcal{B}_r^g$ have successful transmissions. By letting $\Delta\mathbf{w}_i^r = 1$ in (29a), we then have $\sum_{i=1}^{N}\bar{\beta}_i = 1$. In the same fashion as (29), we can obtain

$$\mathbb{E}\bigg[\frac{\sum_{i\in S_r}\mathbb{1}_i^r\Delta\mathbf{w}_i^r}{(\sum_{i\in S_r}\mathbb{1}_i^r)^2}\,\Big|\,\sum_{i\in S_r}\mathbb{1}_i^r\ne 0\bigg] \qquad (30a)$$
$$= \sum_{g=1}^{N^K}\bigg(\prod_{i\in S_r^g}p_i\bigg)\cdot\Bigg(\sum_{v=1}^{K}\sum_{\substack{\mathcal{B}_r^g\cup\bar{\mathcal{B}}_r^g=S_r^g\\|\mathcal{B}_r^g|=v,\,|\bar{\mathcal{B}}_r^g|=K-v}}\frac{\prod_{k_1\in\mathcal{B}_r^g}(1-q_{k_1})\prod_{k_2\in\bar{\mathcal{B}}_r^g}q_{k_2}}{1-\prod_{i\in S_r^g}q_i}\cdot\frac{\sum_{k_1\in\mathcal{B}_r^g}\Delta\mathbf{w}_{k_1}^r}{v^2}\Bigg) \triangleq \sum_{i=1}^{N}\bar{\alpha}_i\Delta\mathbf{w}_i^r \qquad (30b)$$

for some $\bar{\alpha}_i \ge 0$, $i = 1,\cdots,N$, which is (16).

A.2 Computing the values of $\bar{\beta}_i$, $\bar{\alpha}_i$ and $\bar{K}$ under uniform TO

With the same TO probability q for all clients, (29) becomes

$$(29\text{a}) = \mathbb{E}_{S_r}\Bigg[\sum_{v=1}^{K}\sum_{\substack{\mathcal{B}_r\cup\bar{\mathcal{B}}_r=S_r\\|\mathcal{B}_r|=v,\,|\bar{\mathcal{B}}_r|=K-v}}\frac{(1-q)^v q^{K-v}}{1-q^K}\cdot\frac{\sum_{k_1\in\mathcal{B}_r}\Delta\mathbf{w}_{k_1}^r}{v}\Bigg]$$
$$= \mathbb{E}_{S_r}\Bigg[\sum_{v=1}^{K}\frac{(1-q)^v q^{K-v}}{1-q^K}\cdot\frac{1}{v}\sum_{\substack{\mathcal{B}_r\cup\bar{\mathcal{B}}_r=S_r\\|\mathcal{B}_r|=v,\,|\bar{\mathcal{B}}_r|=K-v}}\sum_{k_1\in\mathcal{B}_r}\Delta\mathbf{w}_{k_1}^r\Bigg]$$
$$= \mathbb{E}_{S_r}\Bigg[\sum_{v=1}^{K}\frac{(1-q)^v q^{K-v}}{1-q^K}\cdot\frac{1}{v}\sum_{i\in S_r}C_{K-1}^{v-1}\Delta\mathbf{w}_i^r\Bigg]$$
$$\stackrel{(a)}{=}\mathbb{E}_{S_r}\Bigg[\sum_{v=1}^{K}\frac{C_K^v(1-q)^v q^{K-v}}{1-q^K}\cdot\frac{1}{K}\sum_{i\in S_r}\Delta\mathbf{w}_i^r\Bigg] \stackrel{(b)}{=}\mathbb{E}_{S_r}\bigg[\frac{1}{K}\sum_{i\in S_r}\Delta\mathbf{w}_i^r\bigg] \stackrel{(c)}{=}\sum_{i=1}^{N}p_i\Delta\mathbf{w}_i^r, \qquad (31)$$

where equality (a) follows from $\frac{1}{v}C_{K-1}^{v-1}=\frac{1}{v}\cdot\frac{(K-1)!}{(v-1)!(K-v)!}=\frac{1}{K}\cdot\frac{K!}{v!(K-v)!}=\frac{1}{K}C_K^v$, equality (b) is by $\sum_{v=1}^{K}\frac{C_K^v(1-q)^v q^{K-v}}{1-q^K}=1$ since $\sum_{v=0}^{K}C_K^v(1-q)^v q^{K-v}=1$, and equality (c) is by the fact that the clients are independently sampled with replacement following the distribution $\{p_i\}_{i=1}^{N}$ [6]. After comparing (29e) with (31), we have $\bar{\beta}_i = p_i\ \forall i$ under the uniform-TO case.

Similarly to the proof of (31), with the same TO probability q for all clients, (30) becomes

$$(30\text{a}) = \mathbb{E}_{S_r}\Bigg[\sum_{v=1}^{K}\sum_{\substack{\mathcal{B}_r\cup\bar{\mathcal{B}}_r=S_r\\|\mathcal{B}_r|=v,\,|\bar{\mathcal{B}}_r|=K-v}}\frac{(1-q)^v q^{K-v}}{1-q^K}\cdot\frac{\sum_{k_1\in\mathcal{B}_r}\Delta\mathbf{w}_{k_1}^r}{v^2}\Bigg] = \sum_{v=1}^{K}\frac{1}{v}\cdot\frac{C_K^v(1-q)^v q^{K-v}}{1-q^K}\Bigg[\sum_{i=1}^{N}p_i\Delta\mathbf{w}_i^r\Bigg], \qquad (32)$$

and letting $\Delta\mathbf{w}_i^r = 1$ in (30a) and (32) gives rise to

$$\frac{1}{\bar{K}} = \mathbb{E}\bigg[\frac{1}{\sum_{i\in S_r}\mathbb{1}_i^r}\,\Big|\,\sum_{i\in S_r}\mathbb{1}_i^r\ne 0\bigg] = \sum_{v=1}^{K}\frac{1}{v}\cdot\frac{C_K^v(1-q)^v q^{K-v}}{1-q^K}.$$

Finally, by comparing (30b) and (32), we have $\bar{\alpha}_i = p_i/\bar{K}$ under the uniform-TO case.
B Proof of Theorem 1

Our analysis considers only the "successful" communication rounds where at least one client in $S_r$ communicates with the server successfully, and therefore the derivations are all based on the conditional events that $\sum_{i\in S_r}\mathbb{1}_i^r \ne 0$, $r = 1,\cdots,M$. In the following proof, without further clarification, we simply write $\mathbb{E}[\cdot]$ and $\Pr[\cdot]$ for the conditional $\mathbb{E}[\,\cdot\,|\sum_{i\in S_r}\mathbb{1}_i^r \ne 0]$ and $\Pr[\,\cdot\,|\sum_{i\in S_r}\mathbb{1}_i^r \ne 0]$, respectively.

B.1 Proof of convergence rate

With Assumption 1, we have

$$\mathbb{E}[F(\bar{\mathbf{w}}^r)] \le \mathbb{E}[F(\bar{\mathbf{w}}^{r-1})] + \mathbb{E}[\langle\nabla F(\bar{\mathbf{w}}^{r-1}),\bar{\mathbf{w}}^r-\bar{\mathbf{w}}^{r-1}\rangle] + \frac{L}{2}\mathbb{E}\|\bar{\mathbf{w}}^r-\bar{\mathbf{w}}^{r-1}\|^2. \qquad (33)$$

We need the following three key lemmas, which are proved in subsequent subsections.

Lemma 3 Under Assumptions 1 and 3, it holds that

$$\mathbb{E}[\langle\nabla F(\bar{\mathbf{w}}^{r-1}),\bar{\mathbf{w}}^r-\bar{\mathbf{w}}^{r-1}\rangle] \le -\frac{\gamma E}{2}\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^2 + \gamma E\chi^2_{\beta\|p}\sum_{i=1}^{N}p_iD_i^2 + \gamma L^2\sum_{i=1}^{N}\bar{\beta}_i\sum_{\ell=1}^{E}\mathbb{E}\big[\|\mathbf{w}_i^{r,\ell-1}-\bar{\mathbf{w}}^{r-1}\|^2\big], \qquad (34)$$

where $\chi^2_{\beta\|p} = \sum_{i=1}^{N}(\bar{\beta}_i-p_i)^2/p_i$ is the chi-square divergence between $\mathbf{p} = [p_1,\cdots,p_N]$ and $\boldsymbol{\beta} = [\bar{\beta}_1,\cdots,\bar{\beta}_N]$ [5].

Lemma 4 With $q_{\max}=\max\{q_1,\ldots,q_N\}$ and $\bar{q}=\sum_{i=1}^{N}p_iq_i$ the maximum and the average TO probabilities, we have

$$\mathbb{E}\|\bar{\mathbf{w}}^r-\bar{\mathbf{w}}^{r-1}\|^2 \le 4\gamma^2E^2\,\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^2 + \frac{\gamma^2E}{\bar{K}}\cdot\frac{\sigma^2}{b} + \gamma^2\sum_{i=1}^{N}\bar{\alpha}_iJ_{ir}^2 + 4\gamma^2E^2\sum_{i=1}^{N}\bar{\alpha}_iD_i^2$$
$$+ 4\gamma^2E^2\sum_{v=2}^{K}\frac{(q_{\max})^{K-v}C_K^v}{1-(q_{\max})^K}\sum_{i=1}^{N}p_i\|q_i-\bar{q}\|^2D_i^2 + 2\gamma^2EL^2\sum_{i=1}^{N}\bar{\beta}_i\sum_{\ell=1}^{E}\mathbb{E}\big[\|\mathbf{w}_i^{r,\ell-1}-\bar{\mathbf{w}}^{r-1}\|^2\big]. \qquad (35)$$

Lemma 5 The difference between the local model at round r and the global model at the previous round is bounded by

$$\sum_{\ell=1}^{E}\mathbb{E}\big[\|\mathbf{w}_i^{r,\ell-1}-\bar{\mathbf{w}}^{r-1}\|^2\big] \le \frac{\gamma^2E^3\frac{\sigma^2}{b} + 4\gamma^2E^3D_i^2 + 4\gamma^2E^3\,\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^2}{1-2\gamma^2E^2L^2}. \qquad (36)$$

By substituting (34) into the second term on the RHS of (33), (35) into the third term, and by (36), we have

$$\mathbb{E}[F(\bar{\mathbf{w}}^r)] \le \mathbb{E}[F(\bar{\mathbf{w}}^{r-1})] - \bigg(\frac{\gamma E}{2} - 2\gamma^2E^2L - \frac{4\gamma^3E^3L^2 + 4\gamma^4E^4L^3}{1-2\gamma^2E^2L^2}\bigg)\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^2$$
$$+ \bigg(\frac{\gamma^2EL}{2\bar{K}} + \frac{\gamma^3E^3L^2 + \gamma^4E^4L^3}{1-2\gamma^2E^2L^2}\bigg)\frac{\sigma^2}{b} + \frac{\gamma^2L}{2}\sum_{i=1}^{N}\bar{\alpha}_iJ_{ir}^2 + 2\gamma^2E^2L\sum_{i=1}^{N}\bar{\alpha}_iD_i^2$$
$$+ \frac{4\gamma^3E^3L^2 + 4\gamma^4E^4L^3}{1-2\gamma^2E^2L^2}\sum_{i=1}^{N}\bar{\beta}_iD_i^2 + \gamma E\chi^2_{\beta\|p}\sum_{i=1}^{N}p_iD_i^2 + 2\gamma^2E^2L\sum_{v=2}^{K}\frac{(q_{\max})^{K-v}C_K^v}{1-(q_{\max})^K}\sum_{i=1}^{N}p_i\|q_i-\bar{q}\|^2D_i^2.$$

Next, summing the above from r = 1 to M and dividing both sides by the total number of local mini-batch SGD steps T = ME yields

$$\bigg(\frac{\gamma}{2} - 2\gamma^2EL - \frac{4\gamma^3E^2L^2 + 4\gamma^4E^3L^3}{1-2\gamma^2E^2L^2}\bigg)\frac{\sum_{r=1}^{M}\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^2}{M} \le \frac{\mathbb{E}[F(\bar{\mathbf{w}}^0)]-\mathbb{E}[F(\bar{\mathbf{w}}^M)]}{T}$$
$$+ \bigg(\frac{\gamma^2L}{2\bar{K}} + \frac{\gamma^3E^2L^2 + \gamma^4E^3L^3}{1-2\gamma^2E^2L^2}\bigg)\frac{\sigma^2}{b} + \frac{\gamma^2L}{2T}\sum_{r=1}^{M}\sum_{i=1}^{N}\bar{\alpha}_iJ_{ir}^2 + 2\gamma^2EL\sum_{i=1}^{N}\bar{\alpha}_iD_i^2$$
$$+ \frac{4\gamma^3E^2L^2 + 4\gamma^4E^3L^3}{1-2\gamma^2E^2L^2}\sum_{i=1}^{N}\bar{\beta}_iD_i^2 + \gamma\chi^2_{\beta\|p}\sum_{i=1}^{N}p_iD_i^2 + 2\gamma^2EL\sum_{v=2}^{K}\frac{(q_{\max})^{K-v}C_K^v}{1-(q_{\max})^K}\sum_{i=1}^{N}p_i\|q_i-\bar{q}\|^2D_i^2. \qquad (37)$$

Further dividing both sides of (37) by γ leads to

$$\underbrace{\bigg(\frac{1}{2} - 2\gamma EL - \frac{4\gamma^2E^2L^2 + 4\gamma^3E^3L^3}{1-2\gamma^2E^2L^2}\bigg)}_{\triangleq H_1}\frac{\sum_{r=1}^{M}\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^2}{M} \le \underbrace{\frac{1}{\gamma T}}_{\triangleq H_2}\big(\mathbb{E}[F(\bar{\mathbf{w}}^0)]-\mathbb{E}[F(\bar{\mathbf{w}}^M)]\big)$$
$$+ \underbrace{\bigg(\frac{\gamma L}{2\bar{K}} + \frac{\gamma^2E^2L^2 + \gamma^3E^3L^3}{1-2\gamma^2E^2L^2}\bigg)}_{\triangleq H_3}\frac{\sigma^2}{b} + \underbrace{\frac{\gamma L}{2T}}_{\triangleq H_4}\sum_{r=1}^{M}\sum_{i=1}^{N}\bar{\alpha}_iJ_{ir}^2 + \underbrace{2\gamma EL}_{\triangleq H_6}\sum_{i=1}^{N}\bar{\alpha}_iD_i^2$$
$$+ \underbrace{\frac{4\gamma^2E^2L^2 + 4\gamma^3E^3L^3}{1-2\gamma^2E^2L^2}}_{\triangleq H_5}\sum_{i=1}^{N}\bar{\beta}_iD_i^2 + \chi^2_{\beta\|p}\sum_{i=1}^{N}p_iD_i^2 + \underbrace{2\gamma EL}_{\triangleq H_6}\sum_{v=2}^{K}\frac{(q_{\max})^{K-v}C_K^v}{1-(q_{\max})^K}\sum_{i=1}^{N}p_i\|q_i-\bar{q}\|^2D_i^2. \qquad (38)$$

Let the learning rate $\gamma = \bar{K}^{\frac{1}{2}}/(8LT^{\frac{1}{2}})$ and the number of local updating steps $E \le T^{\frac{1}{4}}/\bar{K}^{\frac{3}{4}}$, where $T \ge \max\{\bar{K}^3, 1/\bar{K}\}$ in order to guarantee $E \ge 1$. By this, $H_2 = 8L(T\bar{K})^{-\frac{1}{2}}$ and $H_4 = \bar{K}^{\frac{1}{2}}T^{-\frac{3}{2}}/16$. Since $\gamma EL \le (T\bar{K})^{-\frac{1}{4}}/8$, we have $H_6 \le (T\bar{K})^{-\frac{1}{4}}/4$ and

$$H_5 \le \frac{\frac{4}{8^2}(T\bar{K})^{-\frac{1}{2}} + \frac{4}{8^3}(T\bar{K})^{-\frac{3}{4}}}{1-\frac{2}{8^2}(T\bar{K})^{-\frac{1}{2}}} \stackrel{(a)}{\le} \frac{\frac{4}{8^2}(T\bar{K})^{-\frac{1}{2}} + \frac{4}{8^3}(T\bar{K})^{-\frac{3}{4}}}{1-\frac{2}{8^2}} = \frac{2}{31}(T\bar{K})^{-\frac{1}{2}} + \frac{1}{124}(T\bar{K})^{-\frac{3}{4}},$$

where inequality (a) is due to $T \ge 1/\bar{K}$. Then,

$$H_1 = \frac{1}{2} - H_6 - H_5 \ge \frac{1}{2} - \frac{1}{4}(T\bar{K})^{-\frac{1}{4}} - \frac{2}{31}(T\bar{K})^{-\frac{1}{2}} - \frac{1}{124}(T\bar{K})^{-\frac{3}{4}} \ge \frac{1}{2} - \frac{1}{4} - \frac{2}{31} - \frac{1}{124} = \frac{11}{62},$$

$$H_3 = \frac{\gamma L}{2\bar{K}} + \frac{H_5}{4} \le \frac{1}{16(T\bar{K})^{\frac{1}{2}}} + \frac{1}{62(T\bar{K})^{\frac{1}{2}}} + \frac{1}{496(T\bar{K})^{\frac{3}{4}}} \le \frac{39}{496(T\bar{K})^{\frac{1}{2}}} + \frac{1}{496(T\bar{K})^{\frac{3}{4}}}.$$

Finally, by substituting the above coefficients and $\mathbb{E}[F(\bar{\mathbf{w}}^M)] \ge F^*$ from Assumption 1 into (38), Theorem 1 is proved.
B.2 Proof of Lemma 3

We have

$$\mathbb{E}[\langle\nabla F(\bar{\mathbf{w}}^{r-1}),\bar{\mathbf{w}}^r-\bar{\mathbf{w}}^{r-1}\rangle] = \mathbb{E}\bigg[\Big\langle\nabla F(\bar{\mathbf{w}}^{r-1}),\frac{\gamma\sum_{i\in S_r}\mathbb{1}_i^r\mathcal{Q}(\Delta\mathbf{w}_i^r)}{\sum_{i\in S_r}\mathbb{1}_i^r}\Big\rangle\bigg]$$
$$\stackrel{(a)}{=}\mathbb{E}\bigg[\Big\langle\nabla F(\bar{\mathbf{w}}^{r-1}),-\frac{\gamma\sum_{i\in S_r}\mathbb{1}_i^r\sum_{\ell=1}^{E}\nabla F_i(\mathbf{w}_i^{r,\ell-1},\boldsymbol{\xi}_i^{r,\ell})}{\sum_{i\in S_r}\mathbb{1}_i^r}\Big\rangle\bigg]$$
$$\stackrel{(b)}{=}\mathbb{E}\bigg[\Big\langle\nabla F(\bar{\mathbf{w}}^{r-1}),-\frac{\gamma\sum_{i\in S_r}\mathbb{1}_i^r\sum_{\ell=1}^{E}\nabla F_i(\mathbf{w}_i^{r,\ell-1})}{\sum_{i\in S_r}\mathbb{1}_i^r}\Big\rangle\bigg]$$
$$\stackrel{(c)}{=}-\gamma\sum_{\ell=1}^{E}\mathbb{E}\bigg[\Big\langle\nabla F(\bar{\mathbf{w}}^{r-1}),\sum_{i=1}^{N}\bar{\beta}_i\nabla F_i(\mathbf{w}_i^{r,\ell-1})\Big\rangle\bigg]$$
$$\stackrel{(d)}{=}-\frac{\gamma}{2}\sum_{\ell=1}^{E}\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^2 - \frac{\gamma}{2}\sum_{\ell=1}^{E}\mathbb{E}\bigg[\Big\|\sum_{i=1}^{N}\bar{\beta}_i\nabla F_i(\mathbf{w}_i^{r,\ell-1})\Big\|^2\bigg] + \frac{\gamma}{2}\sum_{\ell=1}^{E}\mathbb{E}\bigg[\Big\|\nabla F(\bar{\mathbf{w}}^{r-1})-\sum_{i=1}^{N}\bar{\beta}_i\nabla F_i(\mathbf{w}_i^{r,\ell-1})\Big\|^2\bigg]$$
$$\le -\frac{\gamma E}{2}\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^2 + \frac{\gamma}{2}\sum_{\ell=1}^{E}\mathbb{E}\bigg[\Big\|\nabla F(\bar{\mathbf{w}}^{r-1})-\sum_{i=1}^{N}\bar{\beta}_i\nabla F_i(\mathbf{w}_i^{r,\ell-1})\Big\|^2\bigg]$$
$$\stackrel{(e)}{\le}-\frac{\gamma E}{2}\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^2 + \gamma\sum_{\ell=1}^{E}\underbrace{\mathbb{E}\bigg[\Big\|\nabla F(\bar{\mathbf{w}}^{r-1})-\sum_{i=1}^{N}\bar{\beta}_i\nabla F_i(\bar{\mathbf{w}}^{r-1})\Big\|^2\bigg]}_{\triangleq A_1} + \gamma\sum_{\ell=1}^{E}\underbrace{\mathbb{E}\bigg[\Big\|\sum_{i=1}^{N}\bar{\beta}_i\big(\nabla F_i(\bar{\mathbf{w}}^{r-1})-\nabla F_i(\mathbf{w}_i^{r,\ell-1})\big)\Big\|^2\bigg]}_{\triangleq A_2}, \qquad (39)$$

where equality (a) is due to the unbiased quantization in (8) and the definition of $\Delta\mathbf{w}_i^r$ in (13), equality (b) is due to $\mathbb{E}[\nabla F_i(\mathbf{w}_i^{r,\ell-1},\boldsymbol{\xi}_i^{r,\ell})]=\nabla F_i(\mathbf{w}_i^{r,\ell-1})$ in Assumption 2, equality (c) is obtained by (15), equality (d) follows from the basic identity $\langle\mathbf{x}_1,\mathbf{x}_2\rangle=\frac{1}{2}(\|\mathbf{x}_1\|^2+\|\mathbf{x}_2\|^2-\|\mathbf{x}_1-\mathbf{x}_2\|^2)$, and inequality (e) is due to $\|\mathbf{x}_1+\mathbf{x}_2\|^2\le 2\|\mathbf{x}_1\|^2+2\|\mathbf{x}_2\|^2$.

In (39), the term $A_1$ can be further bounded as

$$A_1=\mathbb{E}\bigg[\Big\|\sum_{i=1}^{N}p_i\nabla F_i(\bar{\mathbf{w}}^{r-1})-\sum_{i=1}^{N}\bar{\beta}_i\nabla F_i(\bar{\mathbf{w}}^{r-1})\Big\|^2\bigg]$$
$$\stackrel{(a)}{=}\mathbb{E}\bigg[\Big\|\sum_{i=1}^{N}(p_i-\bar{\beta}_i)\nabla F_i(\bar{\mathbf{w}}^{r-1})-\sum_{i=1}^{N}(p_i-\bar{\beta}_i)\nabla F(\bar{\mathbf{w}}^{r-1})\Big\|^2\bigg]$$
$$=\mathbb{E}\bigg[\Big\|\sum_{i=1}^{N}\frac{p_i-\bar{\beta}_i}{p_i}\,p_i\big(\nabla F_i(\bar{\mathbf{w}}^{r-1})-\nabla F(\bar{\mathbf{w}}^{r-1})\big)\Big\|^2\bigg]$$
$$\stackrel{(b)}{\le}\bigg(\sum_{i=1}^{N}\frac{(\bar{\beta}_i-p_i)^2}{p_i}\bigg)\bigg(\sum_{i=1}^{N}p_i\,\mathbb{E}\|\nabla F_i(\bar{\mathbf{w}}^{r-1})-\nabla F(\bar{\mathbf{w}}^{r-1})\|^2\bigg) \stackrel{(c)}{\le}\chi^2_{\beta\|p}\sum_{i=1}^{N}p_iD_i^2, \qquad (40)$$

where equality (a) holds because $\sum_{i=1}^{N}(p_i-\bar{\beta}_i)=0$, inequality (b) is due to the Cauchy-Schwarz inequality, and inequality (c) is due to Assumption 3 and the definition of $\chi^2_{\beta\|p}$ in Lemma 3. Besides, $A_2$ is bounded as

$$A_2\stackrel{(a)}{\le}\sum_{i=1}^{N}\bar{\beta}_i\,\mathbb{E}\big[\|\nabla F_i(\bar{\mathbf{w}}^{r-1})-\nabla F_i(\mathbf{w}_i^{r,\ell-1})\|^2\big]\stackrel{(b)}{\le}L^2\sum_{i=1}^{N}\bar{\beta}_i\,\mathbb{E}\big[\|\mathbf{w}_i^{r,\ell-1}-\bar{\mathbf{w}}^{r-1}\|^2\big], \qquad (41)$$

where inequality (a) is by Jensen's inequality and inequality (b) is due to Assumption 1. Finally, by substituting (40) and (41) into (39), we obtain Lemma 3 directly.
25
B.3 Proof of Lemma 4
We have
E[k¯
wr¯
wr1k2]
=E
γPi∈Sr
1r
iQ(∆wr
i)
Pi∈Sr
1r
i
2
(a)
=γ2E
Pi∈Sr
1r
iwr
i
Pi∈Sr
1r
i
2
+
Pi∈Sr
1r
i(Q(∆wr
i)wr
i)
Pi∈Sr
1r
i
2
(b)
=γ2E
Pi∈Sr
1r
iPE
`=1(Fi(wr,`1
i,ξr,`
i)−∇Fi(wr,`1
i))
Pi∈Sr
1r
i
2
| {z }
,G1(caused by SGD)
+γ2E
Pi∈Sr
1r
iPE
`=1Fi(wr,`1
i)
Pi∈Sr
1r
i
2
| {z }
,G2
+γ2E
Pi∈Sr
1r
i(Q(∆wr
i)wr
i)
Pi∈Sr
1r
i
2
| {z }
,G3(caused by quantization error)
, (42)
where equality (a) is by E[kxk2] = E[kxE[x]k2] + kE[x]k2and (8); equality (b) is obtained
similarly but using ∆wr
i=PE
`=1 Fi(wr,`1
i,ξr,`
i) in (13) and E[Fi(wr,`1
i,ξr,`
i)] = Fi(wr,`1
i)
in Assumption 2.
In (42), the term G1can be shown as
G1
(a)
=E"Pi∈Sr
1r
iPE
`=1k∇Fi(wr,`1
i,ξr,`
i)− ∇Fi(wr,`1
i)k2
Pi∈Sr
1r
i2#
(b)
=E"Pi∈Sr
1r
iPE
`=1
σ2
b
Pi∈Sr
1r
i2#=2
bE"1
Pi∈Sr
1r
i#(c)
=2
¯
Kb , (43)
where equality (a) is due to E[Fi(wr,`1
i,ξr,`
i)] = Fi(wr,`1
i) in Assumption 2, equality (b) is
due to the bounded variance of SGD in Assumption 2, and equality (c) is due to (17). For G2in
(42), we have
G22E
Pi∈Sr
1r
iPE
`=1(Fi(wr,`1
i)− ∇Fi(¯
wr1))
Pi∈Sr
1r
i
2
| {z }
,G21
+2 E
Pi∈Sr
1r
iPE
`=1Fi(¯
wr1)
Pi∈Sr
1r
i
2
| {z }
,G22
, (44)
where
G21 E·E"Pi∈Sr
1r
iPE
`=1k∇Fi(wr,`1
i)− ∇Fi(¯
wr1)k2
Pi∈Sr
1r
i#
(a)
=EXN
i=1
¯
βiXE
`=1
Ehk∇Fi(wr,`1
i)− ∇Fi(¯
wr1)k2i(b)
EL2XN
i=1
¯
βiXE
`=1
Ehkwr,`1
i¯
wr1k2i,
(45)
in which equality (a) is due to (15) in Lemma 2, and inequality (b) is due to Assumption 2.
26
G22 =E2E
Pi∈Sr
1r
iFi(¯
wr1)
Pi∈Sr
1r
i
2
=2E2E
Pi∈Sr
1r
i(Fi(¯
wr1)− ∇F(¯
wr1))
Pi∈Sr
1r
i
2
| {z }
(caused by partial participation)
+2E2Ehk∇F(¯
wr1)k2i
=2E2E"Pi∈Sr
1r
ik∇Fi(¯
wr1)− ∇F(¯
wr1)k2
Pi∈Sr
1r
i2#
| {z }
,G23
+ 2E2E
Pk0∈SrPk∈Sr
k6=k0
1r
k1r
k0(Fk(¯
wr1)− ∇F(¯
wr1))(Fk0(¯
wr1)− ∇F(¯
wr1))
Pi∈Sr
1r
i2
| {z }
,G24
+ 2E2Ek∇F(¯
wr1)k2. (46)
Next, we bound G23 and G24 in (46) as follows. Firstly
G23
(a)
=XN
i=1 ¯αiEk∇Fi(¯
wr1)− ∇F(¯
wr1)k2(b)
XN
i=1 ¯αiD2
i, (47)
where equality (a) is due to (16) in Lemma 2, and inequality (b) is due to Assumption 3. Secondly,
G24 =EK
X
v=1
Pr X
i∈Sr
1r
i=v1
v2X
k∈SrX
k0∈Sr
k06=k
Eh1r
k1r
k0(Fk(¯
wr1)
− ∇F(¯
wr1))(Fk0(¯
wr1)− ∇F(¯
wr1))X
i∈Sr
1r
i=vi
(a)
=EK
X
v=1
1
v2X
k∈SrX
k0∈Sr
k06=kPr 1r
k= 1,1r
k0= 1,X
i∈Sr
1r
i=v
·(Fk(¯
wr1)− ∇F(¯
wr1))(Fk0(¯
wr1)− ∇F(¯
wr1)),
where equality (a) follows because if 1r
k= 0 or 1r
k0= 0, then 1r
k1r
k0(Fk(¯
wr1)−∇F(¯
wr1))(Fk0(¯
wr1)
F(¯
wr1)) = 0. In addition, when v= 1, there is only one selected client with successful transmis-
sion, and 1r
kand 1r
k0cannot equal to 1 at the same time, thus Pr[1r
k= 1,1r
k0= 1,Pi∈Sr1r
i= 1] = 0.
When v2,
Pr h1r
k= 1,1r
k0= 1,Xi∈Sr
1r
i=vi
=
(1 qk)(1 qk0)PBrS¯
Br={Sr\{k,k0}}
|Br|=v2,|¯
Br|=KvQ
k1∈Br
(1 qk1)Q
k2¯
Br
qk2
1Qi∈Srqk
(a)
(1 qk)(1 qk0)PBrS¯
Br={Sr\{k,k0}}
|Br|=v2,|¯
Br|=Kv
(qmax)Kv
1(qmax)K
(b)
=(1 qk)(1 qk0)(qmax)KvCv2
K2
1(qmax)K, (48)
27
where Bris the set of selected clients (except kand k0) in Srtransmitting their local model updates
successfully while ¯
Bris the one that suffers from TO; inequality (a) is due to 1 qk11 and
qk2qmax = max{q1, . . . , qN}, and in equality (b), Cv2
K2=(K2)!
(v2)!(Kv)! . Thus,
G24 EK
X
v=2
(qmax)KvCv2
K2
(1 (qmax)K)v2X
k∈SrX
k0∈Sr
k06=k
(1qk)(1qk0)(Fk(¯
wr1)−∇F(¯
wr1))(Fk0(¯
wr1)−∇F(¯
wr1))
=EK
X
v=2
(qmax)KvCv2
K2
(1(qmax )K)v2X
k∈Sr
(1qk)(Fk(¯
wr1)−∇F(¯
wr1)) X
k0∈Sr
k06=k
(1 qk0)(Fk0(¯
wr1)− ∇F(¯
wr1))
(a)
=EK
X
v=2
(qmax)KvK(K1)Cv2
K2
(1 (qmax)K)v2
N
X
j=1
pj(1 qj)(Fj(¯
wr1)
− ∇F(¯
wr1))
N
X
j0=1
pj0(1 qj0)(Fj0(¯
wr1)− ∇F(¯
wr1))
(b)
EK
X
v=2
(qmax)KvCv
K
1(qmax)K
N
X
j=1
N
X
j0=1
pjpj0(1qj)(1qj0)(Fj(¯
wr1)−∇F(¯
wr1))(Fj0(¯
wr1)−∇F(¯
wr1)),
(49)
where equality (a) can be obtained based on the same reason as obtaining (c) in (31) since the
clients k, k0∈ Srare selected independently and with replacement. The above inequality (b) is
obtained by K(K1)Cv2
K2
v2K(K1)
v(v1) Cv2
K2=Cv
Kfor v2. Then, with the average TO probability
¯q=PN
i=1 piqi, we have (1 qj)(1 qj0) = (1 ¯q+ ¯qqj)(1 ¯q+ ¯qqj0) = (1 ¯q)2+ (1 ¯q)(¯q
qj) + (1 ¯q)(¯qqj0) + ( ¯qqj)( ¯qqj0). Thus, with F(¯
wr1) = PN
i=1piFi(¯
wr1), (49) turns into
G24 EXK
v=2
(qmax)KvCv
K
1(qmax)K
·(1¯q)2
N
X
j=1
pj Fj(¯
wr1)
N
X
i=1
piFi(¯
wr1)!
| {z }
=0
N
X
j0=1
pj0 Fj0(¯
wr1)
N
X
i=1
piFi(¯
wr1)!
| {z }
=0
+ (1 ¯q)
N
X
j=1
pjqqj) (Fj(¯
wr1)− ∇F(¯
wr1))
N
X
j0=1
pj0 Fj0(¯
wr1)
N
X
i=1
piFi(¯
wr1)!
| {z }
=0
+ (1 ¯q)
N
X
j=1
pj Fj(¯
wr1)
N
X
i=1
piFi(¯
wr1)!
| {z }
=0
N
X
j0=1
qqj0) Fj0(¯
wr1)
N
X
i=1
piFi(¯
wr1)!
+
N
X
j=1
N
X
j0=1
pjpj0qqj)(¯qqj0) (Fj(¯
wr1)− ∇F(¯
wr1)) (Fj0(¯
wr1)− ∇F(¯
wr1)) 
=E"K
X
v=2
(qmax)KvCv
K
1(qmax)K
N
X
j=1
N
X
j0=1
pjpj0qqj)(¯qqj0)(Fj(¯
wr1)−∇F(¯
wr1))(Fj0(¯
wr1)−∇F(¯
wr1))#
(a)
XK
v=2
(qmax)KvCv
K
1(qmax)KXN
i=1pikqi¯qk2Ek∇Fi(¯
wr1)− ∇F(¯
wr1)k2
28
(b)
XK
v=2
(qmax)KvCv
K
1(qmax)KXN
i=1pikqi¯qk2D2
i, (50)
where inequality (a) is due to the Young’s inequality, i.e., (¯qqj)(¯qqj0)(Fj(¯
wr1)F(¯
wr1))
(Fj0(¯
wr1)F(¯
wr1)) 1
2k¯qqjk2k∇Fj(¯
wr1)F(¯
wr1)k2+1
2k¯qqj0k2k∇Fj0(¯
wr1)
F(¯
wr1)k2, and inequality (b) is by Assumption 3.
Substituting (45), (46), (47), and (50) into (44), we have
G22EL2XN
i=1
¯
βiXE
`=1
Eh
wr,`1
i¯
wr1
2i+ 4E2XK
v=2
(qmax)KvCv
K
1(qmax)KXN
i=1pikqi¯qk2D2
i
+ 4E2XN
i=1 ¯αiD2
i+ 4E2Ehk∇F(¯
wr1)k2i. (51)
Besides, for the term G3in (42), we have
G3
(a)
=E"Pi∈Sr
1r
ikQ(∆wr
i)wr
ik2
(Pi∈Sr
1r
i)2#(b)
=XN
i=1 ¯αiEkQ(∆wr
i)wr
ik2(c)
XN
i=1 ¯αiJ2
ir , (52)
where equality (a) is due to the unbiased quantization in (8), equality (b) is by (16) in Lemma 2,
and inequality (c) is due to the bounded QE in (9).
Finally, by substituting (43), (51) and (52) into (42), we obtain Lemma 4.
B.4 Proof of Lemma 5
According to (2), the local model in the (r+ 1)-th communication round are updated by
wr,`1
i=¯wr1γX`1
t=1 Fi(wr,t1
i,ξr,t
i) .
Therefore,
Eh
wr,`1
i¯
wr1
2i
=E
γX`1
t=1 Fi(wr,t1
i,ξr,t
i)
2
γ2(`1)X`1
t=1
E
Fi(wr,t1
i,ξr,t
i)
2
(a)
=γ2(`1)X`1
t=1
E
Fi(wr,t1
i,ξr,t
i)− ∇Fi(wr,t1
i)
2+γ2(`1)X`1
t=1
E
Fi(wr,t1
i)
2
(b)
γ2(`1)2σ2
b+γ2(`1)X`1
t=1
E
Fi(wr,t1
i)
2
γ2E2σ2
b+γ2EX`1
t=1
E
Fi(wr,t1
i)
2
γ2E2σ2
b+ 2γ2EX`1
t=1
E
Fi(wr,t1
i)− ∇Fi(¯wr1)
2+ 2γ2EX`1
t=1
Ehk∇Fi(¯wr1)k2i
γ2E2σ2
b+ 2γ2EL2X`1
t=1
E
wr,t1
i¯wr1
2+ 4γ2E2Ehk∇Fi(¯wr1)F(¯wr1)k2+k∇F(¯wr1)k2i
(c)
γ2E2σ2
b+ 2γ2EL2X`1
t=1
E
wr,t1
i¯wr1
2+ 4γ2E2D2
i+ 4γ2E2Ehk∇F(¯wr1)k2i, (53)
29
where equality (a) is due to E[kxk2] = E[kxE[x]k2]+kE[x]k2and E[Fi(wr,t1
i,ξr,t
i)] = Fi(wr,t1
i),
equality (b) is by Assumption 2 given the mini-batch size b, and inequality (c) is by Assumption
3. Then, summing both sides of (53) from `= 1 to Eyields
XE
`=1
Eh
wr,`1
i¯
wr1
2i
γ2E3σ2
b+ 2γ2EL2XE
`=1X`1
t=1
E
wr,t1
i¯wr1
2
| {z }
(a)
+4γ2E3D2
i+ 4γ2E3Ehk∇F(¯wr1)k2i
(b)
γ2E3σ2
b+ 2γ2E2L2XE
`=1
Eh
wr,`1
i¯wr1
2i+ 4γ2E3D2
i+ 4γ2E3Ehk∇F(¯wr1)k2i, (54)
where inequality (b) is because the occurrence number of E[kwr,`1
i¯
wr1k2] for each `[1, E]
in term (a) is less than the number of local updating steps E, and thus (a) EPE
`=1 E[kwr,`1
i
¯wr1k2].
Finally, rearranging the terms in (54) yields Lemma 5.
References
[1] X. Wang, Y. Han, C. Wang, Q. Zhao, X. Chen, and M. Chen, “In-edge ai: Intelligentizing
mobile edge computing, caching and communication by federated learning,” IEEE Network,
vol. 33, no. 5, pp. 156–165, 2019.
[2] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, “Toward an intelligent edge: Wireless
communication meets machine learning,” IEEE Commun. Mag., vol. 58, no. 1, pp. 19–25, 2020.
[3] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y.-C. Liang, Q. Yang, D. Niyato, and
C. Miao, “Federated learning in mobile edge networks: A comprehensive survey,” IEEE Com-
mun. Surveys Tuts., vol. 22, no. 3, pp. 2031–2063, 2020.
[4] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient
learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics,
2017, pp. 1273–1282.
[5] J. Wang, Q. Liu, H. Liang, G. Joshi, and H. V. Poor, “Tackling the objective inconsistency
problem in heterogeneous federated optimization,” arXiv preprint arXiv:2007.07481, 2020.
[6] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the convergence of fedavg on non-iid
data,” in ICLR, 2019.
[7] X. Liang, S. Shen, J. Liu, Z. Pan, E. Chen, and Y. Cheng, “Variance reduced local SGD with
lower communication complexity,” arXiv preprint arXiv:1912.12844, 2019.
[8] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, “SCAFFOLD:
Stochastic controlled averaging for federated learning,” in ICML, 2020, pp. 5132–5143.
[9] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization
in heterogeneous networks,” arXiv preprint arXiv:1812.06127, 2018.
30
[10] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,”
IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 2022–2035, 2020.
[11] M. S. H. Abad, E. Ozfatura, D. Gunduz, and O. Ercetin, “Hierarchical federated learning
across heterogeneous cellular networks,” in IEEE ICASSP, 2020, pp. 8866–8870.
[12] J. Xu and H. Wang, “Client selection and bandwidth allocation in wireless federated learning
networks: A long-term perspective,” IEEE Trans. Wireless Commun., 2020.
[13] W. Shi, S. Zhou, and Z. Niu, “Device Scheduling with Fast Convergence for Wireless Federated
Learning,” in IEEE ICC, 2020, pp. 1–6.
[14] Z. Yang, M. Chen, W. Saad, C. S. Hong, M. Shikh-Bahaei, H. V. Poor, and S. Cui, “De-
lay Minimization for Federated Learning Over Wireless Communication Networks,” in ICML
Workshop on Federated Learning, 2020.
[15] N. H. Tran, W. Bao, A. Zomaya, M. N. Nguyen, and C. S. Hong, “Federated learning over
wireless networks: Optimization model design and analysis,” in IEEE INFOCOM, 2019, pp.
1387–1395.
[16] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A joint learning and com-
munications framework for federated learning over wireless networks,” IEEE Trans. Wireless
Commun., vol. 20, no. 1, pp. 269–283, 2021.
[17] M. Salehi and E. Hossain, “Federated Learning in Unreliable and Resource-Constrained Cel-
lular Wireless Networks,” arXiv preprint arXiv:2012.05137, 2020.
[18] G. Zhu, Y. Du, D. Gunduz, and K. Huang, “One-bit over-the-air aggregation for
communication-efficient federated edge learning: Design and convergence analysis,” IEEE
Trans. Wireless Commun., vol. 20, no. 3, pp. 2120–2135, 2021.
[19] A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, and R. Pedarsani, “Fedpaq: A
communication-efficient federated learning method with periodic averaging and quantization,”
in International Conference on Artificial Intelligence and Statistics, 2020, pp. 2021–2031.
[20] S. Zheng, C. Shen, and X. Chen, “Design and Analysis of Uplink and Downlink Communica-
tions for Federated Learning,” IEEE J. Sel. Areas Commun., pp. 1–1, 2020.
[21] A. Goldsmith, Wireless communications. Cambridge university press, 2005.
[22] M. M. Amiri, D. Gunduz, S. R. Kulkarni, and H. V. Poor, “Federated learning with quantized
global model updates,” arXiv preprint arXiv:2006.10672, 2020.
[23] K.-Y. Wang, A. M.-C. So, T.-H. Chang, W.-K. Ma, and C.-Y. Chi, “Outage constrained
robust transmit optimization for multiuser MISO downlinks: Tractable approximations by
conic optimization,” IEEE Trans. Signal Process., vol. 62, no. 21, pp. 5690–5705, 2014.
[24] Y. Xu, C. Shen, T.-H. Chang, S.-C. Lin, Y. Zhao, and G. Zhu, “Transmission energy min-
imization for heterogeneous low-latency noma downlink,” IEEE Trans. Wireless Commun.,
vol. 19, no. 2, pp. 1054–1069, 2020.
31
[25] F. Sattler, S. Wiedemann, K.-R. M¨uller, and W. Samek, “Robust and communication-efficient
federated learning from non-iid data,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 9,
pp. 3400–3413, 2019.
[26] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms
outperform centralized algorithms? a case study for decentralized parallel stochastic gradient
descent,” in NeurIPS, 2017, pp. 5336–5346.
[27] H. Yu, S. Yang, and S. Zhu, “Parallel restarted sgd with faster convergence and less communi-
cation: Demystifying why model averaging works for deep learning,” in AAAI, vol. 33, no. 01,
2019, pp. 5693–5700.
[28] J. Liu, C. Zhang et al., “Distributed learning systems with first-order methods,” Foundations
and Trends®in Databases, vol. 9, no. 1, pp. 1–100, 2020.
[29] S. G. Krantz and H. R. Parks, The Implicit Function Theorem: History, Theory, and Appli-
cations. Boston, MA: Birkh¨auser, 2002.
[30] M. A. Figueiredo, R. D. Nowak, and S. J. Wright, “Gradient projection for sparse reconstruc-
tion: Application to compressed sensing and other inverse problems,” IEEE J. Sel. Topics
Signal Process., vol. 1, no. 4, pp. 586–597, 2007.
[31] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document
recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[32] I. S. Misra, Wireless communications and networks: 3G and beyond. McGraw Hill Education
(India) Pvt Ltd, 2013.
32
Supplementary Material
A Proof of Lemma 1
With |wr,E
ij | ∈ [wr
ij,¯wr
ij] and quantization level Br
i, the quantized wr,E
ij is unbiasedly estimated since
E[Q(wr,E
ij )] =sign(wr,E
ij )·cu·Pr wr,E
ij = sign(wr,E
ij )·cu+ sign(wr,E
ij )·cu+1 ·Pr wr,E
j+1 = sign(wr,E
ij )·cu+1
=sign(wr,E
ij )·cu
cu+1 − |wr,E
ij |
cu+1 cu
+cu+1 |wr,E
ij | − cu
cu+1 cu= sign(wr,E
ij )· |wr,E
ij |=wr,E
ij . (55)
Based on this, we have
E[Q(wr,E
i)] = hE[Q(wr,E
i1)],E[Q(wr,E
i2)],··· ,E[Q(wr,E
im )]i=hwr,E
i1, wr,E
i2,··· , wr,E
im i=wr,E
i.
With the stochastic quantization method in (6), the quantization error is bounded by
Eh|Q(wr,E
ij )wr,E
ij |2i=(cu− |wr,E
ij |)2·cu+1 − |wr,E
ij |
cu+1 cu
+ (cu+1 − |wr,E
ij |)2·|wr,E
ij | − cu
cu+1 cu
=(|wr,E
ij | − cu)(cu+1 − |wr,E
ij |)(|wr,E
ij | − cu+cu+1 − |wr,E
ij |)
cu+1 cu
=(|wr,E
ij | − cu)(cu+1 − |wr,E
ij |)
=(|wr,E
ij |)2+ (cu+cu+1)|wr,E
ij | − cucu+1
=|wr,E
ij | − cu+cu+1
22
+cucu+1
22
cucu+1
22
, (56)
where with cudefined in (5), the interval between neighboring knobs is
|cucu+1|=|¯wr
ij wr
ij |
2Br
i1. (57)
Then, substituting (57) into (56), we have
Eh|Q(wr,E
ij )wr,E
ij |2i( ¯wr
ij wr
ij )2
4(2Br
i1)2, (58)
and the total QE of local model can be bounded by
Eh|Q(wr,E
i)wr,E
i|2i=E"
m
X
j=1 Q(wr,E
ij )wr,E
ij
2#(a)
=
m
X
j=1
Eh|Q(wr,E
ij )wr,E
ij |2i(b)
Pm
j=1( ¯wr
ij wr
ij )2
4(2Br
i1)2,
where equality (a) is due to the unbiased quantization in (55), and inequality (b) is due to the error
bound in (58).
B Extended Discussion of Remark 2
B.1 Performance analysis of general case
For the general case, we consider the unfixed quantization level Br
iand the changed TO probabilities
qr
iduring the training process for different communication rounds. Similar to the Lemma 2, we
have some properties for the general case as shown in Lemma 6.
33
Lemma 6 Considering FL algorithm in Algorithm 1, it holds true that
E"Pi∈Sr
1r
iwr
i
Pi∈Sr
1r
iXi∈Sr
1r
i6= 0#(a)
=ESrhXi∈Sr
βr
iwr
ii(b)
=XN
i=1
¯
βiwr
i(59)
for some βr
i,¯
βi[0,1] with Pi∈Srβr
i= 1 and PN
i=1 ¯
βi= 1, where equality (a) is taken expected
with respect to {1r
i}while equality (b) is taken expected with respect to Sr.
Moreover, we also have
E"Pi∈Sr
1r
iwr
i
Pi∈Sr
1r
i2Xi∈Sr
1r
i6= 0#=ESrhXi∈Sr
αr
iwr
ii=XN
i=1 ¯αiwr
i(60)
for some αr
i,¯αi0i= 1,··· , N and r= 1,· ·· , M .
Finally, same with (17), we denote
E"1
Pi∈Sr
1r
iXi∈Sr
1r
i6= 0#=XN
i=1 ¯αi,1
¯
K,
where ¯
Krepresents the average effective number of active clients at each communication round.
If qr
iis uniform for all clients at all communication rounds, i.e., qr
i=qi= 1,··· , N and r=
1,··· , M , then βr
i= 1/K and αr
i= 1/(K¯
K)i∈ Sr,¯
βi=piand ¯αi=pi/¯
Ki∈ {1,··· , N }, and
¯
K=1(q)K
PK
v=1 1
v(Cv
K(1q)v(q)Kv)with Cv
K=K!
v!(Kv)! . In addition, if qr
i= 0 i∈ Srand r= 1,·· · , M
(no TO), then ¯
K=K.
From (59), one can see that {βr
i}is the equivalent appearance probability of {wr
i}transmitted
by each selected client i∈ Srin the global aggregation due to TO, while βiis that of ∆wr
i
transmitted by each client i∈ {1,··· , N }in the global aggregation due to client sampling and TO.
The main convergence result is stated below.
Theorem 2 (General case) Let Assumptions 1 to 3 hold. If one chooses γ=¯
K1
2/(8LT 1
2)and
ET1
4/¯
K3
4where T=ME max{¯
K3,1/¯
K}is the total number of SGD updates per client, we
have
1
MXM
r=1
Ehk∇F(¯
wr1)k2Xi∈Sr
1r
i6= 0i
496L(E[F(¯
w0)] F)
11 T¯
K1
2
+ 39
88 T¯
K1
2
+1
88 T¯
K3
4!σ2
b+31 ¯
K1
2
88T3
2XM
r=1
ESr"Xi∈Sr
αr
iJ2
ir#
| {z }
(a)(caused by QE)
+31
22 T¯
K1
4XN
i=1 ¯αiD2
i
| {z }
(b)(caused by partial participation
and data variance)
+ 4
11 T¯
K1
2
+1
22 T¯
K3
4!XN
i=1
¯
βiD2
i
| {z }
(c)(caused by data variance)
+62
11χ2
βkpXN
i=1piD2
i
| {z }
(d)(caused by TO and
data variance)
+31
22T¯
KXK
v=2
(qmax)KvCv
K
1(qmax)KXM
r=1
ESr1
KXi∈Srkqr
i¯qk2D2
i
| {z }
(e)(caused by TO and data variance)
,(61)
where χ2
βkp
,PN
i=1 (¯
βipi)2/piis the chi-square divergence [5], qmax = maxi∈Sr,∀Sr{qr
i}and
¯q=ESr1
KPi∈Srqr
iare the maximum and average TO probabilities, respectively.
34
Proof: See the subsequent Subsection B.2.
The upper bound in (61) reveals similar insights as discussed in Theorem 1. Also, when the
clients have a uniform TO probability, the terms (d) and (e) would vanish. Then, combining with
Lemma 6, we can derive the following Corollary 2 for the uniform-TO case with unfixed quantization
level Br
i. As shown in (62), the FL algorithm can also achieve a linear speed-up with respect to ¯
K
even when both TO and QE are present.
Corollary 2 Under the same conditions as Theorem 2, if all clients have a uniform TO probability
q, we have
1
MXM
r=1
E"k∇F(¯
wr1)k2X
i∈Sr
1r
i6= 0#
496L
11(T¯
K)1
2
(E[F(¯
w0)] F) + 39
88(T¯
K)1
2
+1
88(T¯
K)3
4σ2
b+31
88T3
2¯
K1
2XM
r=1
ESr1
KXi∈Sr
J2
ir
+4
11(T¯
K)1
2
+1
22(T¯
K)3
4
+31
22T1
4¯
K5
4XN
i=1piD2
i.(62)
B.2 Proof of Theorem 2
In the general case, for the same client i, its TO probability qr
iand quantization level Br
iwould
vary with the selected client set Sr. For example, the TO probability and quantization level of
client 1 in Sg
r={1,2,3,··· , K }and those in Sg
r={1,3,4,··· , K + 1}are different. Based on
this, since different communication rounds correspond to different Sr, the TO probability qr
iand
quantization level Br
iof the same selected client ivary with the communication round.
For simplicity, we assume that for each possible set Sg
r, both the wireless resource (including
bandwidth and transmit power) and quantization level follow a fixed allocation scheme whenever
Sg
rappears. In this way, for each possible set Sg
r, there is a unique set of the TO probabilities and
quantization levels for the clients in Sg
r. Then, with denoting qgi and Bgi as the TO probability
and the quantization level of the client i∈ Sg
r, we have qr
i=qgi and Br
i=Bgi if Sr=Sg
r.
The proof of Theorem 2 is similar to that of Theorem 1 (Appendix B) except for the following
differences.
B.2.1 Difference 1
The formulation (52) in Appendix B becomes
G3
(a)
=E"Pi∈Sr
1r
ikQ(∆wr
i)wr
ik2
(Pi∈Sr
1r
i)2#(b)
=ESrhXi∈Sr
αr
iEkQ(∆wr
i)wr
ik2i(c)
ESrhXi∈Sr
αr
iJ2
iri,
(63)
where equality (a) is due to the unbiased quantization in (8), equality (b) is caused by (60) in
Lemma 6, and inequality (c) is due to the bounded QE in (9). Based on (63), the term (a) in
Theorem 1 (i.e., 31 ¯
K1/2
88T3/2PM
r=1 PN
i=1 ¯αiJ2
ir) turns into 31 ¯
K1/2
88T3/2PM
r=1 ESrPi∈Srαr
iJ2
irin Theorem 2.
35
B.2.2 Difference 2
With the maximum TO probability qmax = max
i∈Sr,∀Sr{qr
i}= max
g∈{1,2,···,N K}max
i∈Sg
r
qgi, (48) in Appendix
B becomes
Pr h1r
k= 1,1r
k0= 1,Xi∈Sr
1r
i=vi(1 qr
k)(1 qr
k0)(qmax)KvCv2
K2
1(qmax)K.
Then, with the average TO probability ¯q=ESr1
KPi∈Srqr
i=PNK
g=1 Qi∈Sg
rpi·1
KPi∈Sg
rqgi, the
formulation (49) in Appendix B turns into
G24 =XK
v=2
(qmax)KvCv2
K2
(1 (qmax)K)v2
·EXk∈SrXk0∈Sr
k06=k(1 qr
k)(1 qr
k0)(Fk(¯
wr1)− ∇F(¯
wr1))(Fk0(¯
wr1)− ∇F(¯
wr1))
(a)
=XK
v=2
(qmax)KvCv2
K2
(1 (qmax)K)v2
·E(1 ¯q)2Xk∈SrXk0∈Sr
k06=k
(Fk(¯
wr1)− ∇F(¯
wr1)) (Fk0(¯
wr1)− ∇F(¯
wr1))
+ (1 ¯q)Xk∈SrXk0∈Sr
k06=k
qqr
k) (Fk(¯
wr1)− ∇F(¯
wr1)) (Fk0(¯
wr1)− ∇F(¯
wr1))
+ (1 ¯q)Xk∈SrXk0∈Sr
k06=k
qqr
k0) (Fk(¯
wr1)− ∇F(¯
wr1)) (Fk0(¯
wr1)− ∇F(¯
wr1))
+Xk∈SrXk0∈Sr
k06=k
qqr
k)(¯qqr
k0) (Fk(¯
wr1)− ∇F(¯
wr1)) (Fk0(¯
wr1)− ∇F(¯
wr1))
=XK
v=2
(qmax)KvCv2
K2
(1 (qmax)K)v2
·E(1 ¯q)2Xk∈Sr
(Fk(¯
wr1)− ∇F(¯
wr1)) Xk0∈Sr
k06=k
(Fk0(¯
wr1)− ∇F(¯
wr1))
+ (1 ¯q)Xk∈Sr
qqr
k) (Fk(¯
wr1)− ∇F(¯
wr1)) Xk0∈Sr
k06=k
(Fk0(¯
wr1)− ∇F(¯
wr1))
+ (1 ¯q)Xk0∈Sr
qqr
k0) (Fk0(¯
wr1)− ∇F(¯
wr1)) Xk∈Sr
k6=k0
(Fk(¯
wr1)− ∇F(¯
wr1))
+Xk∈Sr
qqr
k) (Fk(¯
wr1)− ∇F(¯
wr1)) Xk0∈Sr
k06=k
qqr
k0) (Fk0(¯
wr1)− ∇F(¯
wr1)),
(64)
where equality (a) follows from (1qr
k)(1 qr
k0) = (1 ¯q+ ¯qqr
k)(1 ¯q+ ¯qqr
k0) = (1 ¯q)2+ (1
¯q)(¯qqr
k) + (1 ¯q)(¯qqr
k0) + (¯qqr
k)(¯qqr
k0).
Next, since the clients k, k0∈ Srare selected independently and with replacement, then based
on F(¯
wr1) = PN
i=1piFi(¯
wr1) and the same reason as obtaining (c) in (31), (64) becomes
G24 XK
v=2
(qmax)KvCv2
K2
(1 (qmax)K)v2
·E"(1 ¯q)2K2
N
X
j=1
pjFj(¯
wr1)
N
X
i=1
piFi(¯
wr1)
| {z }
=0
N
X
j0=1
pj0Fj0(¯
wr1)
N
X
i=1
piFi(¯
wr1)
| {z }
=0
36
+ (1¯q)(K1) X
k∈Sr
qqr
k) (Fk(¯
wr1)−∇F(¯
wr1))
N
X
j0=1
pj0Fj0(¯
wr1)
N
X
i=1
piFi(¯
wr1)
|{z }
=0
+ (1¯q)(K1) X
k0∈Sr
qqr
k0) (Fk0(¯
wr1)−∇F(¯
wr1))
N
X
j=1
pjFj(¯
wr1)
N
X
i=1
piFi(¯
wr1)
|{z }
=0
+Xk∈SrXk0∈Sr
k06=k
qqr
k)(¯qqr
k0) (Fk(¯
wr1)− ∇F(¯
wr1)) (Fk0(¯
wr1)− ∇F(¯
wr1)) #
=
K
X
v=2
(qmax)KvCv2
K2
(1 (qmax)K)v2·E"X
k∈SrX
k0∈Sr
k06=k
qqr
k)(¯qqr
k0) (Fk(¯
wr1)−∇F(¯
wr1)) (Fk0(¯
wr1)−∇F(¯
wr1))
#
(a)
XK
v=2
(qmax)KvCv2
K2
(1(qmax )K)v2·EXk∈SrXk0∈Sr
k06=k
1
2kqr
k¯qk2k∇Fk(¯
wr1)−∇F(¯
wr1)k2
+kqr
k0¯qk2k∇Fk0(¯
wr1)−∇F(¯
wr1)k2
=XK
v=2
(qmax)KvCv2
K2
(1 (qmax)K)v2·K1
2·EXk∈Srkqr
k¯qk2k∇Fk(¯
wr1)− ∇F(¯
wr1)k2
+Xk0∈Srkqr
k0¯qk2k∇Fk0(¯
wr1)− ∇F(¯
wr1)k2
=XK
v=2
(qmax)KvK(K1)Cv2
K2
(1 (qmax)K)v2·E1
KXi∈Srkqr
i¯qk2k∇Fi(¯
wr1)− ∇F(¯
wr1)k2
(b)
XK
v=2
(qmax)KvCv
K
1(qmax)KESr1
KXi∈Srkqr
i¯qk2D2
i, (65)
where inequality (a) is due to Young’s Inequality, and inequality (b) is obtained by K(K1)Cv2
K2
v2
K(K1)
v(v1) Cv2
K2=Cv
Kand Assumption 3.
Based on (65), the last term of (38) in Appendix B becomes
2γEL
M·
K
X
v=2
(qmax)KvCv
K
1(qmax)KXM
r=1
ESr1
KXi∈Srkqr
i¯qk2D2
i.
and the coefficient H6in (38) is redefined as H6,2γEL
M=2γE2L
T. If one chooses γ=¯
K1
2/(8LT 1
2)
and ET1
4/¯
K3
4, we have
H62
8Lr¯
K
T· T1
4
¯
K3
4!2
·L
T=1
4T¯
K.
Therefore, the term (e) (i.e., 31
22(T¯
K)1/4PK
v=2
(qmax)KvCv
K
1(qmax)KPN
i=1 pikqi¯qk2D2
i) in Theorem 1 be-
comes 31
22T¯
KPK
v=2
(qmax)KvCv
K
1(qmax)KPM
r=1 ESrh1
KPi∈Srkqr
i¯qk2D2
iiin Theorem 2.
37
C Average uplink transmission delay in (21)
C.1 Derivation process of ¯τr
i
If the TO probabilities of the selected clients in Srall equal to 1, the probability that all selected
clients fail to transmit data without TO is Pr(Pi∈Sr1r
i) = Qi∈Srqi= 1. In such case, the re-
transmission process will be repeated infinitely, and the transmission delay will become infinite.
However, this extreme situation can be easily avoided in the wireless system if the conditions in
Lemma 7 are satisfied.
Lemma 7 With the definition of TO probability in (12), if the uplink data rate Ri<+, the
allocated bandwidth Wi>0and the transmit power Pi>0(in Watt) for each client iare satisfied,
then the outage probability of each client qi<1.
Actually, as shown in the Proposition 1, the above conditions are satisfied in the optimal
condition of problem (20).
Then, since retransmission is performed if all selected clients experience outage in the uplink
transmission (i.e., Pj∈Sr1r
j= 0), the average transmission delay of the client i∈ Sris computed
by
¯τr
i=
X
k=1 Yj∈Sr
qjk1
| {z }
(a) 1Yj∈Sr
qj
| {z }
(b)
k·max
j∈Sr
ˆ
Bj
Rj
| {z }
(c)
=1Yj∈Sr
qj
X
k=1
kYj∈Sr
qjk1
| {z }
(d)
·max
j∈Sr
ˆ
Bj
Rj
,
(66)
where (a) is the probability of Pj∈Sr1r
j= 0 for (k1) consecutive times of retransmission, (b)
denotes the probability of Pj∈Sr1r
j1, and (c) is the uplink delay of ksuccessive transmissions.
Next, with
1Yj∈Sr
qjN
X
k=1
kYj∈Sr
qjk1
=
N
X
k=1
kYj∈Sr
qjk1
N
X
k=1
kYj∈Sr
qjk
= 1 + Yj∈Sr
qj1+· ·· +Yj∈Sr
qjN1NYj∈Sr
qjN
=
1Qj∈SrqjN
1Qj∈SrqjNYj∈Sr
qjN
=
1(1 + N)Qj∈SrqjN
+NQj∈SrqjN+1
1Qj∈Srqj
,
and Qj∈Srqj<1, the term (d) in (66) is given by
(d) = lim
N→∞ 1Yj∈Sr
qjN
X
k=1
kYj∈Sr
qjk1=1
1Qj∈Srqj
. (67)
Combining (66) and (67), we can obtain
¯τr
i=1
1Qj∈Srqj
max
j∈Sr
ˆ
Bj
Rj
.
38
C.2 Proof of Lemma 7
According to (12), if ρi<+, we have Q(ρidB)> Q(+) = 0 and then qi<1. Therefore, if
we want qi<1, the following conditions need to be satisfied to make ρi<+.
(i) The data rate Ri<+. Otherwise, according to the definition of ρiin (12), i.e., ρi,
[(2Ri/Wi1)WiN0]dB [Pi]dB [K]dB +λ[di]dB, if Ri= +, we have ρi= +.
(ii) The transmit power Pi>0 (Watt). Otherwise, if Pi= 0 (Watt), we have [Pi]dB =−∞ and
then ρi= +.
(iii) The allocated bandwidth Wi>0. Otherwise, if Wi= 0, then ρi= +since
lim
Wi0(2
Ri
Wi1)Wi= lim
Wi0
2
Ri
Wi1
1
Wi
(a)
= lim
Wi0
2
Ri
Wi·ln 2 ·Ri
W2
i
1
W2
i
= lim
Wi02
Ri
WiRiln 2 = +(68)
where (a) is due to the L’Hospital’s Rule.
Therefore, with Ri<+,Pi>0 (in Watt), and Wi>0, we have qi<1.
D Monotonically increasing property of Wi(Bi)
According to (22) and (23), we have the quantization level satisfies
Bi=¯
Bi(Wi) = τmax
mWilog21 + θiPmax
WiN0µ
m. (69)
Based on this, first-order derivative of ¯
Bi(Wi) with respect to the allocated bandwidth Wiis
¯
Bi(Wi)
∂Wi
=τmax
mlog21 + θiPmax
WiN0+τmax
m
Wi
1 + θiPmax
WiN0ln 2 ·θiPmax
W2
iN0
=τmax
mlog21 + θiPmax
WiN0τmaxθiPmax
m(WiN0+θiPmax) ln 2 , (70)
and then the associated second-order derivative is
2¯
Bi(Wi)
∂W 2
i
=τmax
m1 + θiPmax
WiN0ln 2 ·θiPmax
W2
iN0+τmaxθiPmax N0
m(WiN0+θiPmax)2ln 2
=τmaxθiPmax
m(WiN0+θiPmax)Wiln 2 +τmax θiPmaxN0
m(WiN0+θiPmax)2ln 2 =τmaxθ2
iP2
max
m(WiN0+θiPmax)2Wiln 2 .
In the practical wireless environment, the shadowing variance σdB >0, the constant [K]dB >
−∞, the distance di<+(in meter), and it is reasonable to set the TO probability constraint
qmax (0,1]. Thus, the parameter θi,10 1
10 (σdB·Q1(1qmax )+[K]dBλ[di]dB )defined in (22) satisfies
θi(0,+). Meanwhile, in the real communication systems, the number of parameters m
(0,+), the delay constraint τmax (0,+), and the transmit power constraint Pmax (0,+)
(in Watt). Therefore, 2¯
Bi(Wi)
∂W 2
i<0 with the allocated bandwidth Wi[0,+), which means
39
that ¯
Bi(Wi)
∂Wimonotonically decreases with the increasing Wi[0,+). Then, combining with
limWi→∞ ¯
Bi(Wi)
∂Wi= 0 in (70), we have
¯
Bi(Wi)
∂Wi
>0 (71)
for Wi[0,+), which means that Biin (69) monotonically increases with Wi[0,+).
Next, based on (69) and the implicit function theorem [29], we can define a function Ψi(Wi, Bi)
to describe the relation between Wiand Bias
Ψi(Wi, Bi)=Ψi(¯
Wi(Bi), Bi) = ¯
Bi(Wi)Bi= 0 . (72)
Then, taking the derivatives of both sides in (72) with respect to Bi, we have
Ψi(Wi, Bi)
∂Bi
+Ψi(Wi, Bi)
∂Wi·¯
Wi(Bi)
∂Bi
= 0 .
Thus, combining with Ψi(Wi,Bi)
∂Wi=¯
Bi(Wi)
∂Wiand Ψi(Wi,Bi)
∂Bi=1, we can obtain that
¯
Wi(Bi)
∂Bi
=
Ψi(Wi,Bi)
∂Bi
Ψi(Wi,Bi)
∂Wi
=1
¯
Bi(Wi)
∂Wi
(a)
>0
where (a) is due to (71). Therefore, ¯
Wi(Bi) monotonically increasing Bi.
E Proof of Proposition 2
Based on (22) and (24a), we can denote φi,1
2τmax
m¯
Ri(Wi)µ
m12=1
2
τmax
mWilog21+ θiPmax
WiN0µ
m1!2.
Then, we have
∂φi
∂Wi
=2
2τmax
m¯
Ri(Wi)µ
m13·
2τmax
m¯
Ri(Wi)µ
m1
∂Wi
=2
2τmax
m¯
Ri(Wi)µ
m13·2τmax
m¯
Ri(Wi)µ
mln 2 ·τmax
mlog21 + θiPmax
WiN0θiPmax
(WiN0+θiPmax) ln 2
=2
2τmax
m¯
Ri(Wi)µ
m13·2τmax
m¯
Ri(Wi)µ
m·τmax
mln 1 + θiPmax
WiN0θiPmax
WiN0+θiPmax
=2τmax
m·
,ϕi
z }| {
2τmax
m¯
Ri(Wi)µ
m·θiPmax
WiN0+θiPmax ln 1 + θiPmax
WiN0
2τmax
m¯
Ri(Wi)µ
m13
| {z }
,ρi
.
Based on this, we have
2φi
∂W 2
i
=2τmax
m·
∂ϕi
∂Wiρi ρi
∂Wiϕi
ρ2
i
,
40
where ρ2
i=2τmax
m¯
Ri(Wi)µ
m161 since the quantization level Bi=τmax
m¯
Ri(Wi)µ
m1,
∂ρi
∂Wi
=
2τmax
m¯
Ri(Wi)µ
m13
∂Wi
=3 2τmax
m¯
Ri(Wi)µ
m12·2τmax
m¯
Ri(Wi)µ
m·τmax
mln 1 + θiPmax
WiN0θiPmax
WiN0+θiPmax
=3 2τmax
m¯
Ri(Wi)µ
m12·2τmax
m¯
Ri(Wi)µ
m1+1·τmax
mln 1 + θiPmax
WiN0θiPmax
WiN0+θiPmax
=3τmax
mln 1 + θiPmax
WiN0θiPmax
WiN0+θiPmax ·2τmax
m¯
Ri(Wi)µ
m13+2τmax
m¯
Ri(Wi)µ
m12,
and
∂ϕi
∂Wi
=
2τmax
m¯
Ri(Wi)µ
m·θiPmax
WiN0+θiPmax ln 1 + θiPmax
WiN0
∂Wi
=2τmax
m¯
Ri(Wi)µ
m·τmax
mln 1 + θiPmax
WiN0θiPmax
WiN0+θiPmax 2
+ 2τmax
m¯
Ri(Wi)µ
mθ2
iP2
max
(WiN0+θiPmax)2Wi
.
Thus,
∂ϕi
∂Wi
ρi∂ρi
∂Wi
ϕi
=τmax
m2τmax
m¯
Ri(Wi)µ
m·ln 1 + θiPmax
WiN0θiPmax
WiN0+θiPmax 2
·2τmax
m¯
Ri(Wi)µ
m13
+ 2τmax
m¯
Ri(Wi)µ
mθ2
iP2
max
(WiN0+θiPmax)2Wi·2τmax
m¯
Ri(Wi)µ
m13
+3τmax
m2τmax
m¯
Ri(Wi)µ
m·ln 1 + θiPmax
WiN0θiPmax
WiN0+θiPmax 2
·2τmax
m¯
Ri(Wi)µ
m13
+3τmax
m2τmax
m¯
Ri(Wi)µ
m·ln 1 + θiPmax
WiN0θiPmax
WiN0+θiPmax 2
·2τmax
m¯
Ri(Wi)µ
m12
=2τmax
m2τmax
m¯
Ri(Wi)µ
m·ln 1 + θiPmax
WiN0θiPmax
WiN0+θiPmax 2
·2τmax
m¯
Ri(Wi)µ
m13
+ 2τmax
m¯
Ri(Wi)µ
mθ2
iP2
max
(WiN0+θiPmax)2Wi·2τmax
m¯
Ri(Wi)µ
m13
+3τmax
m2τmax
m¯
Ri(Wi)µ
m·ln 1 + θiPmax
WiN0θiPmax
WiN0+θiPmax 2
·2τmax
m¯
Ri(Wi)µ
m12.
Since the quantization level Bi=τmax
m¯
Ri(Wi)µ
m1, we have 2 τmax
m¯
Ri(Wi)µ
m11. Besides,
with the allocated bandwidth Wi[Wi(1),+) in the constraint (24b), as well as the number of
parameters m(0,+) and the delay constraint τmax (0,+) in the practical communication
systems, we have 2φi
∂W 2
i0, which means φiis convex with respect to Wiin the feasible region of
(24b). Therefore, the objective function (24) is convex.
41
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
In this paper, the problem of training federated learning (FL) algorithms over a realistic wireless network is studied. In the considered model, wireless users execute an FL algorithm while training their local FL models using their own data and transmitting the trained local FL models to a base station (BS) that generates a global FL model and sends the model back to the users. Since all training parameters are transmitted over wireless links, the quality of training is affected by wireless factors such as packet errors and the availability of wireless resources. Meanwhile, due to the limited wireless bandwidth, the BS needs to select an appropriate subset of users to execute the FL algorithm so as to build a global FL model accurately. This joint learning, wireless resource allocation, and user selection problem is formulated as an optimization problem whose goal is to minimize an FL loss function that captures the performance of the FL algorithm. To seek the solution, a closed-form expression for the expected convergence rate of the FL algorithm is first derived to quantify the impact of wireless factors on FL. Then, based on the expected convergence rate of the FL algorithm, the optimal transmit power for each user is derived, under a given user selection and uplink resource block (RB) allocation scheme. Finally, the user selection and uplink RB allocation is optimized so as to minimize the FL loss function. Simulation results show that the proposed joint federated learning and communication framework can improve the identification accuracy by up to 1:4%, 3:5% and 4:1%, respectively, compared to: 1) An optimal user selection algorithm with random resource allocation, 2) a standard FL algorithm with random user selection and resource allocation, and 3) a wireless optimization algorithm that minimizes the sum packet error rates of all users while being agnostic to the FL parameters.
Article
With growth in the number of smart devices and advancements in their hardware, in recent years, data-driven machine learning techniques have drawn significant attention. However, due to privacy and communication issues, it is not possible to collect this data at a centralized location. Federated learning is a machine learning setting where the centralized location trains a learning model over remote devices. Federated learning algorithms cannot be employed in the real world scenarios unless they consider unreliable and resource-constrained nature of the wireless medium. In this paper, we propose a federated learning algorithm that is suitable for cellular wireless networks. We prove its convergence, and provide a sub-optimal scheduling policy that improves the convergence rate. We also study the effect of local computation steps and communication steps on the convergence of the proposed algorithm. We prove, in practice, federated learning algorithms may solve a different problem than the one that they have been employed for if the unreliability of wireless channels is neglected. Finally, through numerous experiments on real and synthetic datasets, we demonstrate the convergence of our proposed algorithm.
Article
Communication has been known to be one of the primary bottlenecks of federated learning (FL), and yet existing studies have not addressed the efficient communication design, particularly in wireless FL where both uplink and downlink communications have to be considered. In this paper, we focus on the design and analysis of physical layer quantization and transmission methods for wireless FL. We answer the question of what and how to communicate between clients and the parameter server and evaluate the impact of the various quantization and transmission options of the updated model on the learning performance. We provide new convergence analysis of the well-known FED AVG under non-i.i.d. dataset distributions, partial clients participation, and finite-precision quantization in uplink and downlink communications. These analyses reveal that, in order to achieve an O (1/T) convergence rate with quantization, transmitting the weight requires increasing the quantization level at a logarithmic rate, while transmitting the weight differential can keep a constant quantization level. Comprehensive numerical evaluation on various real-world datasets reveals that the benefit of a FL-tailored uplink and downlink communication design is enormous - a carefully designed quantization and transmission achieves more than 98% of the floating-point baseline accuracy with fewer than 10% of the baseline bandwidth, for majority of the experiments on both i.i.d. and non-i.i.d. datasets. In particular, 1-bit quantization (3.1% of the floating-point baseline bandwidth) achieves 99.8% of the floating-point baseline accuracy at almost the same convergence rate on MNIST, representing the best known bandwidth-accuracy tradeoff to the best of the authors' knowledge.
Article
Federated edge learning (FEEL) is a popular framework for model training at an edge server using data distributed at edge devices (e.g., smart-phones and sensors) without compromising their privacy. In the FEEL framework, edge devices periodically transmit high-dimensional stochastic gradients to the edge server, where these gradients are aggregated and used to update a global model. When the edge devices share the same communication medium, the multiple access channel (MAC) from the devices to the edge server induces a communication bottleneck. To overcome this bottleneck, an efficient broadband analog transmission scheme has been recently proposed, featuring the aggregation of analog modulated gradients (or local models) via the waveform-superposition property of the wireless medium. However, the assumed linear analog modulation makes it difficult to deploy this technique in modern wireless systems that exclusively use digital modulation. To address this issue, we propose in this work a novel digital version of broadband over-the-air aggregation, called one-bit broadband digital aggregation (OBDA). The new scheme features one-bit gradient quantization followed by digital quadrature amplitude modulation (QAM) at edge devices and over-the-air majority-voting based decoding at edge server. We provide a comprehensive analysis of the effects of wireless channel hostilities (channel noise, fading, and channel estimation errors) on the convergence rate of the proposed FEEL scheme. The analysis shows that the hostilities slow down the convergence of the learning process by introducing a scaling factor and a bias term into the gradient norm. However, we show that all the negative effects vanish as the number of participating devices grows, but at a different rate for each type of channel hostility.
Article
This paper studies federated learning (FL) in a classic wireless network, where learning clients share a common wireless link to a coordinating server to perform federated model training using their local data. In such wireless federated learning networks (WFLNs), optimizing the learning performance depends crucially on how clients are selected and how bandwidth is allocated among the selected clients in every learning round, as both radio and client energy resources are limited. While existing works have made some attempts to allocate the limited wireless resources to optimize FL, they focus on the problem in individual learning rounds, overlooking an inherent yet critical feature of federated learning. This paper brings a new long-term perspective to resource allocation in WFLNs, realizing that learning rounds are not only temporally interdependent but also have varying significance towards the final learning outcome. To this end, we first design data-driven experiments to show that different temporal client selection patterns lead to considerably different learning performance. With the obtained insights, we formulate a stochastic optimization problem for joint client selection and bandwidth allocation under long-term client energy constraints, and develop a new algorithm that utilizes only currently available wireless channel information but can achieve long-term performance guarantee. Experiments show that our algorithm results in the desired temporal client selection pattern, is adaptive to changing network environments and far outperforms benchmarks that ignore the long-term effect of FL.
Article
In recent years, mobile devices are equipped with increasingly advanced sensing and computing capabilities. Coupled with advancements in Deep Learning (DL), this opens up countless possibilities for meaningful applications, e.g., for medical purposes and in vehicular networks. Traditional cloud-based Machine Learning (ML) approaches require the data to be centralized in a cloud server or data center. However, this results in critical issues related to unacceptable latency and communication inefficiency. To this end, Mobile Edge Computing (MEC) has been proposed to bring intelligence closer to the edge, where data is produced. However, conventional enabling technologies for ML at mobile edge networks still require personal data to be shared with external parties, e.g., edge servers. Recently, in light of increasingly stringent data privacy legislations and growing privacy concerns, the concept of Federated Learning (FL) has been introduced. In FL, end devices use their local data to train an ML model required by the server. The end devices then send the model updates rather than raw data to the server for aggregation. FL can serve as an enabling technology in mobile edge networks since it enables the collaborative training of an ML model and also enables DL for mobile edge network optimization. However, in a large-scale and complex mobile edge network, heterogeneous devices with varying constraints are involved. This raises challenges of communication costs, resource allocation, and privacy and security in the implementation of FL at scale. In this survey, we begin with an introduction to the background and fundamentals of FL. Then, we highlight the aforementioned challenges of FL implementation and review existing solutions. Furthermore, we present the applications of FL for mobile edge network optimization. Finally, we discuss the important challenges and future research directions in FL.
Article
The recent revival of AI is revolutionizing almost every branch of science and technology. Given the ubiquitous smart mobile gadgets and IoT devices, it is expected that a majority of intelligent applications will be deployed at the edge of wireless networks. This trend has generated strong interest in realizing an "intelligent edge" to support AI-enabled applications at various edge devices. Accordingly, a new research area, called edge learning, has emerged, which crosses and revolutionizes two disciplines: wireless communication and machine learning. A major theme in edge learning is to overcome the limited computing power, as well as limited data, at each edge device. This is accomplished by leveraging the mobile edge computing platform and exploiting the massive data distributed over a large number of edge devices. In such systems, learning from distributed data and communicating between the edge server and devices are two critical and coupled aspects, and their fusion poses many new research challenges. This article advocates a new set of design guidelines for wireless communication in edge learning, collectively called learning- driven communication. Illustrative examples are provided to demonstrate the effectiveness of these design guidelines. Unique research opportunities are identified.