Quality- and Availability-Based Device
Scheduling and Resource Allocation
for Federated Edge Learning
Wanli Wen, Yi Zhang, Chen Chen, Yunjian Jia, Lu Luo, and Lei Tang
Abstract—To achieve an efficient federated edge learning
(FEEL) system, the scheme of device scheduling and resource
allocation should jointly perceive the device availability, wireless
channel quality, and local gradient quality. The existing literature on FEEL rarely considers these three aspects simultaneously, so the schemes proposed therein still leave room for improving the efficiency of FEEL, which motivates our work. In this paper, by
mathematically modeling the device availability, wireless channel
quality, and gradient quality, and deriving the convergence
bound for model training in the FEEL system, we formulate a
joint device scheduling and resource allocation problem, aiming
to improve the FEEL efficiency. The formulated problem is
a challenging non-convex problem. By exploring its structural
properties and utilizing the KKT conditions, we obtain an optimal
solution in closed-form. The analytical results enable us to gain
some important insights into how the device availability, wireless
channel quality, and gradient quality affect device scheduling and
resource allocation in the FEEL system.
Index Terms—Federated edge learning, device availability,
channel and gradient quality, scheduling, resource allocation.
I. INTRODUCTION
With the development of machine learning (ML), there is a
growing trend to deploy ML algorithms at the wireless edge
to extract useful knowledge from massive data generated on
end-user devices such as smartphones and cars. For traditional
ML algorithms, the training data needs to be gathered at the
edge server, such as the base station or access point, for
model training. However, the devices may be reluctant to share
sensitive data with the server due to concerns about privacy
disclosure. To address this issue, federated edge learning
(FEEL) has been proposed in recent years [1]–[3]. The model
training process of FEEL is an iterative process, where an
iteration is also called a communication round. In an arbitrary
round, each device downloads a global ML model from the
server, computes an updated model/gradient based on the local
dataset, and then submits the resultant model/gradient to the
server for parameter aggregation. An improved global model is then sent back from the server to the devices for another round of model training. Since FEEL does not expose the devices’ data during model training, it can well protect data privacy, which has attracted widespread attention from the industry and academia [4].

This work is sponsored by the National Natural Science Foundation of China under Grant 61971077, the Natural Science Foundation of Chongqing, China under Grants cstc2021jcyj-msxmX0458 and cstc2021jcyj-msxmX0480, and the open research fund of the National Mobile Communications Research Laboratory, Southeast University, under Grant 2022D06. (Corresponding author: Yunjian Jia.)
Wanli Wen is with the School of Microelectronics and Communication Engineering, Chongqing University, Chongqing, China, and also with the National Mobile Communications Research Laboratory, Southeast University, Nanjing, China (wanli_wen@cqu.edu.cn).
Yi Zhang, Chen Chen, and Yunjian Jia are with the School of Microelectronics and Communication Engineering, Chongqing University, Chongqing, China (never_zy@cqu.edu.cn, c.chen@cqu.edu.cn, yunjian@cqu.edu.cn).
Lu Luo and Lei Tang are with CSSC Haizhuang Windpower Co., Ltd., Chongqing, China (lu.luo@hzwindpower.com, tanglei@hzwindpower.com).
To achieve an efficient FEEL system with high training performance and low training energy consumption, device scheduling and wireless resource allocation should be carefully designed. The existing research in this direction can be roughly divided into three categories: device scheduling [5]–[7], resource allocation [8], and joint device scheduling and resource allocation [9]–[15]. Specifically, the authors in [5] analyzed the convergence of FEEL under some conventional scheduling schemes. In [6], an online device scheduling method based on the theory of multi-armed bandits was proposed to minimize the training latency of FEEL. The authors in [7] proposed a device scheduling scheme based on the quality of the wireless channel and the gradient. The authors in [8] investigated the trade-off between the training latency and energy consumption of FEEL. In [9]–[15], several different joint device scheduling and resource allocation problems were formulated to improve the training performance [9]–[12], [15] or to save energy [12]–[14].
However, in practical wireless networks, the design of device scheduling and resource allocation in FEEL systems faces some major challenges. For instance, some devices may temporarily leave the training process for reasons such as losing connection, making phone calls, or having a low battery, so the devices may not always be available to participate in the training process. Furthermore, some devices may suffer from poor wireless channel and/or model/gradient quality, so scheduling these devices will not only consume more energy for training but also prolong the training time. Therefore, to achieve an efficient FEEL system in practical wireless networks, the scheme of device scheduling and resource allocation should jointly perceive the device availability, wireless channel quality, and local gradient quality. Nonetheless, the existing literature on FEEL rarely considers these three aspects simultaneously, so the schemes proposed therein still leave room for improving the efficiency of FEEL, which motivates this letter.
Our main contributions are twofold: 1) A novel scheme of
device scheduling and resource allocation is devised for FEEL,
which can simultaneously perceive the device availability,
wireless channel quality, and gradient quality. 2) We obtain
some important theoretical insights into how the device availability, wireless channel quality, and gradient quality affect the
device scheduling and resource allocation in the FEEL system.
ĞǀŝĐĞϭĞǀŝĐĞϱĞǀŝĐĞϲĞǀŝĐĞϮĞǀŝĐĞϰĞǀŝĐĞϯ
n
Z
Ö
n
J
hŶĂǀĂŝůĂďůĞĞǀŝĐĞĞǀŝĐĞůƵƐƚĞƌ'ƌĂĚŝĞŶƚŽŵƉƵƚĂƚŝŽŶ'ƌĂĚŝĞŶƚŐŐƌĞŐĂƚŝŽŶ
n
Z
Ö
n
J
ÖÖ
n
n
§·¨¸©¹
J
J
n
Z
6
ĚŐĞƐĞƌǀĞƌ
Ön
J
n
Z
Fig. 1. An illustration of the FEEL system with one edge server
and K= 6 edge devices. Devices {1,3,5}and {2,4,6}belong to
two different clusters, respectively, where devices 3, 4, and 6 are
temporarily unavailable to participate in model training in round n.
II. SYSTEM MODEL

We consider an FEEL system composed of one edge server and $K$ devices, denoted by $\mathcal{K} \triangleq \{1, 2, \cdots, K\}$. The server and the devices are each equipped with one antenna. Let $\mathcal{D}_k \triangleq \{\xi_d\}_{d=1}^{|\mathcal{D}_k|}$ be the dataset of device $k$, where $|\mathcal{D}_k|$ denotes the cardinality of $\mathcal{D}_k$ and $\xi_d$ is the $d$-th data point in $\mathcal{D}_k$. Denote $\mathcal{D} \triangleq \bigcup_{k \in \mathcal{K}} \mathcal{D}_k$ as the whole dataset. An example of this system is depicted in Fig. 1. In the FEEL system, we aim to learn a supervised machine learning (ML) model over $\mathcal{D}$, which is mathematically possible by solving the following problem:
$$\mathbf{w}^\star \triangleq \arg\min_{\mathbf{w}} L(\mathbf{w}), \qquad (1)$$
where the vector $\mathbf{w}$ is a parameter relating to the specific ML model and $L(\mathbf{w}) \triangleq \frac{1}{|\mathcal{D}|} \sum_{k \in \mathcal{K}} |\mathcal{D}_k| L_k(\mathbf{w})$ denotes the global loss function over $\mathcal{D}$. Here, $L_k(\mathbf{w}) \triangleq \frac{1}{|\mathcal{D}_k|} \sum_{d=1}^{|\mathcal{D}_k|} l_k(\mathbf{w}, \xi_d)$ is the local loss function over $\mathcal{D}_k$, where $l_k(\mathbf{w}, \xi_d)$ is the loss function for the data point $\xi_d \in \mathcal{D}_k$. In the context of FEEL, solving the problem in (1) consists of a series of iterations, also known as communication rounds. Denote $\mathbf{w}^n$ as the model vector after the $n$-th round with $n = 1, 2, \cdots$, and $\mathbf{w}^0$ as the initial model vector. Then, in each round $n$, the solving process contains three stages: i) Gradient Calculation, ii) Gradient Submission, and iii) Gradient Aggregation. In the following, we elaborate on each of these stages.
A. Gradient Calculation

In this stage, device $k \in \mathcal{K}$ computes the gradient of $L_k(\mathbf{w})$ at $\mathbf{w} = \mathbf{w}^n$, where $\mathbf{w}^n$ is the global model vector broadcast from the server. Note that in practice, device $k$ may not always be available to perform model training due to various reasons, e.g., losing connection to the server, making phone calls, or having a low battery. To reflect this behavior, we introduce a binary random variable $X_k \in \{0, 1\}$ to model the availability state of device $k$.¹ Specifically, $X_k = 1$ means that device $k$ is available to compute its gradient, and $X_k = 0$ otherwise. Let $\rho_k \triangleq \Pr(X_k = 1)$ represent the availability probability of device $k$ and $\boldsymbol{\rho} \triangleq (\rho_k)_{k \in \mathcal{K}}$ the corresponding probability vector. Then, based on $\rho_k$, device $k$ generates a specific availability state, denoted by $x_k^n \in \{0, 1\}$, in the $n$-th communication round. As a result, the computed gradient at device $k$ is given by $\hat{\mathbf{g}}_k^n = x_k^n \mathbf{g}_k^n$, where $\mathbf{g}_k^n \triangleq \nabla_{\mathbf{w}} L_k(\mathbf{w})|_{\mathbf{w} = \mathbf{w}^n}$ with $\nabla$ denoting the gradient operator.

¹ Through modeling the device availability, the impact of the transmission of the global model vector from the edge server to the devices can be captured in our system model.
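To make the availability model concrete, the following minimal NumPy sketch (with hypothetical helper names, not code from the paper) draws the availability states $x_k^n \sim \mathrm{Bernoulli}(\rho_k)$ and forms the masked local gradients $\hat{\mathbf{g}}_k^n = x_k^n \mathbf{g}_k^n$.

```python
import numpy as np

def draw_availability(rho, rng):
    """Draw x_k^n ~ Bernoulli(rho_k) for every device k."""
    return (rng.random(rho.shape) < rho).astype(float)

def masked_local_gradients(local_grads, x):
    """Return g_hat_k^n = x_k^n * g_k^n for all devices.

    local_grads: (K, d) array holding the true local gradients g_k^n.
    x:           length-K vector of availability states x_k^n in {0, 1}.
    """
    return x[:, None] * local_grads

# Toy usage: 6 devices, a 4-dimensional model, arbitrary availability probabilities.
rng = np.random.default_rng(0)
rho = np.array([0.9, 0.2, 0.8, 0.5, 0.7, 0.3])
g = rng.standard_normal((6, 4))          # stand-in for the true local gradients
x = draw_availability(rho, rng)
g_hat = masked_local_gradients(g, x)     # rows of unavailable devices are all zero
```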
B. Gradient Submission

In this stage, device $k$ expends a certain amount of energy to upload $\hat{\mathbf{g}}_k^n$ to the server via the wireless channel. Due to the constraints of device availability, channel quality, and gradient quality, it is important to select the appropriate devices to perform gradient submission in each communication round.

1) Device Scheduling: Let $C$ be the number of devices that are scheduled for gradient submission. To select $C$ different devices, we consider that every $C$ devices in $\mathcal{K}$ form a cluster, which exactly generates $M \triangleq \binom{K}{C}$ clusters in total, denoted by $\mathcal{M} \triangleq \{1, 2, \cdots, M\}$. Let $\mathcal{K}_m$ denote the set of $C$ devices included in cluster $m \in \mathcal{M}$; note that each device is thus contained in exactly $\Pi \triangleq \binom{K-1}{C-1}$ clusters. Apparently, scheduling cluster $m$ is equivalent to scheduling the $C$ devices in $\mathcal{K}_m$. Let $p_m^n$ be the scheduling probability of cluster $m$ in round $n$, where
$$0 \le p_m^n \le 1, \quad \forall m \in \mathcal{M}, \qquad (2)$$
$$\sum_{m \in \mathcal{M}} p_m^n = 1. \qquad (3)$$
Define $\mathbf{p}^n \triangleq (p_m^n)_{m \in \mathcal{M}}$ to be the device scheduling design. Note that the scheduling design illustrated in our work is different from those proposed in the existing literature on FEEL. Here, we focus on scheduling a cluster of devices instead of a single device, and thus scheduling one cluster is equivalent to scheduling multiple devices. Later, we shall see that this scheduling design greatly facilitates the construction of an unbiased global gradient in the Gradient Aggregation stage.
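As an illustration of this cluster-based design (a sketch, not the authors' code), the snippet below enumerates all $M = \binom{K}{C}$ clusters, verifies that each device belongs to $\Pi = \binom{K-1}{C-1}$ of them, and samples one cluster per round from a scheduling design $\mathbf{p}^n$.

```python
import numpy as np
from itertools import combinations
from math import comb

K, C = 6, 2
clusters = list(combinations(range(K), C))   # K_m for m = 1, ..., M
M = len(clusters)
assert M == comb(K, C)

# Each device is contained in exactly Pi = binom(K-1, C-1) clusters.
Pi = comb(K - 1, C - 1)
assert all(sum(k in Km for Km in clusters) == Pi for k in range(K))

# A valid scheduling design p^n: entries in [0, 1] that sum to one (uniform here).
p = np.full(M, 1.0 / M)

# Scheduling in round n means drawing one cluster index m with probability p_m^n.
rng = np.random.default_rng(1)
m = rng.choice(M, p=p)
print(f"Round schedules cluster {m}: devices {clusters[m]}")
```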
2) Energy Consumption: We consider time division multiple access in this work.² In the $n$-th communication round, let $H_k^n$ and $P_k^n$ represent the channel power gain and transmission power of device $k$, respectively.³ Then, the transmission rate of device $k$ can be calculated as $R_k^n \triangleq B \log_2\left(1 + P_k^n H_k^n / \sigma^2\right)$ (in bps). Here, $B$ is the available bandwidth and $\sigma^2$ denotes the noise power. Denote by $\ell$ the number of bits required to encode the gradient. To ensure that device $k$ in cluster $m$ can successfully submit its local gradient to the server, $R_k^n$ should satisfy $R_k^n = i_k(m) x_k^n \ell / t_k^n(m)$, where $i_k(m) \in \{0, 1\}$ is used to check whether device $k$ is in cluster $m$, i.e., $i_k(m) = 1$ if $k \in \mathcal{K}_m$, and $i_k(m) = 0$ otherwise. Additionally, $t_k^n(m)$ is the allocated time for device $k$ to perform gradient submission and satisfies
$$0 \le t_k^n(m) \le i_k(m) x_k^n T, \quad \forall k \in \mathcal{K}, \; m \in \mathcal{M}, \qquad (4)$$
$$\sum_{k \in \mathcal{K}} t_k^n(m) = T \max_{k \in \mathcal{K}_m} \{x_k^n\}, \quad \forall m \in \mathcal{M}. \qquad (5)$$
Here, $T$ denotes the time duration of the gradient submission stage. As a result, within the time duration $t_k^n(m)$, the transmission energy consumed by device $k$ can be calculated as
$$E_k(t_k^n(m)) \triangleq P_k^n t_k^n(m) = \frac{t_k^n(m)}{H_k^n} f\!\left(\frac{i_k(m) x_k^n \ell}{t_k^n(m)}\right),$$
where $f(x) \triangleq \sigma^2 \left(2^{x/B} - 1\right)$. The total transmission energy consumption of all devices is given by $E(\mathbf{t}^n(m)) = \sum_{k \in \mathcal{K}} E_k(t_k^n(m))$ with $\mathbf{t}^n(m) \triangleq (t_k^n(m))_{k \in \mathcal{K}}$. Since cluster $m$ is selected in accordance with the probability $p_m^n$, by using the total probability theorem, the average total transmission energy consumption of all devices is given by
$$\bar{E}(\mathbf{p}^n, \mathbf{t}^n) \triangleq \sum_{m \in \mathcal{M}} p_m^n E(\mathbf{t}^n(m)), \qquad (6)$$
where $\mathbf{t}^n \triangleq (\mathbf{t}^n(m))_{m \in \mathcal{M}}$ denotes the resource allocation design. Let $\hat{E}$ denote the computing energy consumed by all devices in the Gradient Calculation stage. Note that $\hat{E}$ does not depend on $(\mathbf{p}^n, \mathbf{t}^n)$. Then, with (6), the total energy consumption of all devices can be expressed as
$$\tilde{E}(\mathbf{p}^n, \mathbf{t}^n) \triangleq \bar{E}(\mathbf{p}^n, \mathbf{t}^n) + \hat{E}. \qquad (7)$$

² The analysis and optimization framework proposed in this paper can easily be extended to other advanced access technologies such as orthogonal frequency division multiple access and non-orthogonal multiple access.
³ The server needs to estimate the channel information by using various methods. The estimation error may cause system performance loss. However, based on Berge's Maximum Theorem, it is easy to prove that the performance loss can be arbitrarily small as long as the estimation error is small.
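As a concrete illustration of this energy model (a sketch with placeholder values, not the paper's code), the snippet below evaluates $E_k(t_k^n(m))$, $E(\mathbf{t}^n(m))$, and the average energy $\bar{E}(\mathbf{p}^n, \mathbf{t}^n)$ in (6).

```python
import numpy as np

B = 10e6        # bandwidth in Hz (placeholder)
sigma2 = 1e-9   # noise power in W (placeholder)
ell = 9e5       # number of bits used to encode a gradient (placeholder)

def f(x):
    """f(x) = sigma^2 * (2^{x/B} - 1): power needed to sustain a rate of x bps."""
    return sigma2 * (2.0 ** (x / B) - 1.0)

def device_energy(t, H, i_km, x):
    """E_k(t_k^n(m)) = (t / H) * f(i_k(m) x_k^n ell / t), zero if nothing is sent."""
    if t <= 0.0 or i_km == 0 or x == 0:
        return 0.0
    return (t / H) * f(i_km * x * ell / t)

def cluster_energy(t_alloc, H, members, x):
    """E(t^n(m)): total transmission energy when cluster m is scheduled."""
    return sum(device_energy(t_alloc[k], H[k], int(k in members), x[k])
               for k in range(len(H)))

def average_energy(p, t_alloc_all, H, clusters, x):
    """Bar{E}(p^n, t^n) in (6): expectation of E(t^n(m)) over the cluster choice."""
    return sum(p[m] * cluster_energy(t_alloc_all[m], H, clusters[m], x)
               for m in range(len(clusters)))
```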
C. Gradient Aggregation

In this stage, the server aggregates the gradients from the scheduled devices and then generates a new global model for the next round of local training. Moreover, the aggregation of gradients shall be unbiased so as to achieve convergence of FEEL. To this end, we devise a novel gradient aggregation scheme as follows.

Gradient Aggregation Scheme: In case of scheduling cluster $m$, the server calculates the global gradient, denoted by $\hat{\mathbf{g}}_m^n$, based on
$$\hat{\mathbf{g}}_m^n = \frac{1}{|\mathcal{D}| \Pi p_m^n} \sum_{k \in \mathcal{K}} i_k(m) \frac{|\mathcal{D}_k|}{\rho_k} \hat{\mathbf{g}}_k^n. \qquad (8)$$
The following result establishes the unbiasedness of $\hat{\mathbf{g}}_m^n$, which will greatly help to prove the convergence of FEEL.

Lemma 1 (Unbiasedness of $\hat{\mathbf{g}}_m^n$): $\hat{\mathbf{g}}_m^n$ is an unbiased estimate of the ground-truth global gradient $\mathbf{g}^n \triangleq \nabla_{\mathbf{w}} L(\mathbf{w}^n)$.

Proof: See Appendix A.

Then, based on the global gradient $\hat{\mathbf{g}}_m^n$, the ML model in round $n+1$ can be calculated as
$$\mathbf{w}^{n+1} = \mathbf{w}^n - \eta \hat{\mathbf{g}}_m^n. \qquad (9)$$
Here, $\eta > 0$ denotes the learning rate.
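For concreteness, here is a minimal sketch of the aggregation rule (8) and the update (9); the array shapes and helper names are illustrative rather than part of the paper.

```python
import numpy as np

def aggregate_global_gradient(g_hat, cluster, p_m, rho, data_sizes, Pi):
    """Aggregation rule (8) when cluster m is scheduled.

    g_hat:      (K, d) array of masked local gradients g_hat_k^n.
    cluster:    iterable with the device indices in K_m.
    p_m:        scheduling probability p_m^n of the selected cluster.
    rho:        length-K availability probabilities rho_k.
    data_sizes: length-K local dataset sizes |D_k|.
    Pi:         number of clusters containing each device, binom(K-1, C-1).
    """
    D = data_sizes.sum()
    weights = np.zeros(len(rho))
    for k in cluster:
        weights[k] = data_sizes[k] / rho[k]      # i_k(m) |D_k| / rho_k
    return (weights[:, None] * g_hat).sum(axis=0) / (D * Pi * p_m)

def update_model(w, g_hat_m, eta):
    """Model update (9): w^{n+1} = w^n - eta * g_hat_m^n."""
    return w - eta * g_hat_m
```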
D. One-Round Convergence Bound

The phases of Gradient Calculation, Gradient Submission, and Gradient Aggregation will be repeated until the convergence of FEEL. Using Lemma 1, we have the following result.

Lemma 2 (One-Round Convergence Bound of FEEL): If $\nabla_{\mathbf{w}} L(\mathbf{w})$ is Lipschitz continuous with a positive modulus $\mu$, then we have
$$\mathbb{E}\left[L(\mathbf{w}^{n+1}) - L(\mathbf{w}^\star)\right] \le \mathbb{E}\left[L(\mathbf{w}^n) - L(\mathbf{w}^\star)\right] - \eta \|\mathbf{g}^n\|^2 + \frac{1}{2} \mu \eta^2 \mathbb{E}\left[g(\mathbf{p}^n)\right], \qquad (10)$$
where $\mathbf{w}^\star$ denotes an optimal solution of the problem in (1) and $g(\mathbf{p}^n) \triangleq \frac{1}{(|\mathcal{D}| \Pi)^2} \sum_{m \in \mathcal{M}} \frac{C}{p_m^n} \sum_{k \in \mathcal{K}_m} \left(\frac{|\mathcal{D}_k|}{\rho_k}\right)^2 \|\hat{\mathbf{g}}_k^n\|^2$.

Proof: See Appendix B.

In Lemma 2, $g(\mathbf{p}^n)$ is directly related to the device scheduling $\mathbf{p}^n$. In particular, a smaller $g(\mathbf{p}^n)$ (e.g., due to higher device availability $\rho_k$) will lead to faster convergence of FEEL.
III. PROBLEM ESTABLISHMENT AND SOLUTION

A. Problem Establishment

Based on (7) and (10), we observe that the scheduling and resource allocation design $(\mathbf{p}^n, \mathbf{t}^n)$ has an impact on both the energy consumption and the convergence of FEEL. This observation leads to a natural question: how to design an appropriate scheduling and resource allocation scheme that can minimize the energy consumption of FEEL while simultaneously accelerating its convergence? To answer this question, we establish an optimization problem as follows.

Problem 1 (Joint Device Scheduling and Resource Allocation):
$$\min_{\mathbf{p}^n, \mathbf{t}^n} \; (1 - \lambda) \tilde{E}(\mathbf{p}^n, \mathbf{t}^n) + \lambda g(\mathbf{p}^n) \quad \text{s.t.} \;\; (2), (3), (4), (5),$$
where $\lambda \in [0, 1]$ is a weight coefficient. Let $(\mathbf{p}^{n\star}, \mathbf{t}^{n\star})$ be an optimal solution of Problem 1. Note that in Problem 1, since $\hat{E}$ is independent of $(\mathbf{p}^n, \mathbf{t}^n)$, it is a constant with respect to $(\mathbf{p}^n, \mathbf{t}^n)$. So, from now on, we exclude the term $\hat{E}$ from the objective function for simplicity.

Problem 1 is a challenging non-convex problem. To solve Problem 1 optimally, we propose to decompose it into two subproblems, namely, a Resource Allocation subproblem and a Device Scheduling subproblem, by using the structural properties of Problem 1. The two subproblems are specified below.
Problem 2 (Resource Allocation for Each $m \in \mathcal{M}$):
$$\mathbf{t}^{n\star}(m) \triangleq \arg\min_{\mathbf{t}^n(m)} \; E(\mathbf{t}^n(m))$$
$$\text{s.t.} \quad 0 \le t_k^n(m) \le i_k(m) x_k^n T, \; \forall k \in \mathcal{K}, \qquad (11)$$
$$\sum_{k \in \mathcal{K}} t_k^n(m) = T \max_{k \in \mathcal{K}_m} \{x_k^n\}, \qquad (12)$$
where $\mathbf{t}^{n\star}(m)$ denotes an optimal solution. Note that we have $\mathbf{t}^{n\star} = (\mathbf{t}^{n\star}(m))_{m \in \mathcal{M}}$.

Problem 3 (Device Scheduling for Given $\mathbf{t}^{n\star}$):
$$\mathbf{p}^{n\star} = \arg\min_{\mathbf{p}^n} \; (1 - \lambda) \bar{E}(\mathbf{p}^n, \mathbf{t}^{n\star}) + \lambda g(\mathbf{p}^n) \quad \text{s.t.} \;\; (2), (3).$$
The relationship between Problem 1 and Problems 2 and 3 is as follows. It is easy to verify that if $\mathbf{t}^n(m)$ and $\mathbf{p}^n$ are in the feasible sets of Problems 2 and 3, respectively, then the point $(\mathbf{p}^n, \mathbf{t}^n)$ is a feasible point of Problem 1, and vice versa. So, Problem 1 and Problems 2 and 3 have identical feasible sets. Moreover, the point $(\mathbf{p}^{n\star}, \mathbf{t}^{n\star})$ is optimal for Problem 1 if and only if it is optimal for Problems 2 and 3. Thus, we conclude that Problem 1 and Problems 2 and 3 are equivalent. On this basis, to solve Problem 1, we only need to solve Problems 2 and 3 separately without losing any optimality, as elaborated in the sequel.
B. Problem Solution
First, we solve Problem 2. Since Problem 2 is convex, we
can get an optimal solution of it by using the KKT conditions,
as summarized below. Note that we have omitted the details
of the proof due to page limitation.
Algorithm 1 The Algorithm to Solve Problem 1
1: Require: $\mathcal{K}$, $B$, $\lambda$, $C$, $T$, $\sigma^2$, $\mathcal{D}_k$, $\rho_k$, $H_k^n$, $\hat{\mathbf{g}}_k^n$, and $x_k^n$.
2: Obtain $\mathbf{t}^{n\star}$ by solving Problem 2 via Lemma 3.
3: Obtain $\mathbf{p}^{n\star}$ by solving Problem 3 via Lemma 4.
4: Return: $(\mathbf{p}^{n\star}, \mathbf{t}^{n\star})$
Lemma 3 (Optimal Solution of Problem 2): An optimal solution of Problem 2 is given by
$$t_k^{n\star}(m) = \min\left\{ \max\left\{ \frac{\ell \ln 2 / B}{W_0\!\left(\frac{H_k^n \bar{\nu} - \sigma^2}{\sigma^2 e}\right) + 1}, \, 0 \right\}, \; i_k(m) x_k^n T \right\},$$
where $W_0(\cdot)$ denotes the principal branch of the Lambert W function and $\bar{\nu}$ satisfies $\sum_{k \in \mathcal{K}} t_k^{n\star}(m) = T \max_{k \in \mathcal{K}_m} \{x_k^n\}$.
Remark 1 (Quality-Aware Resource Allocation): We can observe from Lemma 3 that the time allocation $t_k^{n\star}(m)$ is only aware of the channel quality $H_k^n$ and is independent of the gradient quality $\|\hat{\mathbf{g}}_k^n\|$ and the device availability $\rho_k$. In particular, $t_k^{n\star}(m)$ decreases with a growing channel quality of the available device $k$ in cluster $m$, since a higher-quality channel can support a larger transmission rate.
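The closed form in Lemma 3 can be evaluated numerically, for instance as in the sketch below: for a trial multiplier $\bar{\nu}$ the candidate times follow from the Lambert W expression, and $\bar{\nu}$ is adjusted by bisection until the allocated times meet the budget in (5). This is a hedged sketch of one possible implementation (using SciPy's `lambertw`), not the authors' code, and the bisection bracket is a heuristic.

```python
import numpy as np
from scipy.special import lambertw

def time_allocation(H, in_cluster, x, B, sigma2, ell, T,
                    nu_lo=1e-12, nu_hi=1e3, iters=100):
    """Evaluate the Lemma 3 allocation for one cluster by bisecting on nu_bar.

    H:          length-K channel power gains H_k^n.
    in_cluster: length-K indicators i_k(m) in {0, 1}.
    x:          length-K availability states x_k^n in {0, 1}.
    Returns the per-device submission times t_k^n(m) in seconds.
    """
    caps = in_cluster * x * T                    # upper bounds i_k(m) x_k^n T in (4)
    budget = T if caps.max() > 0 else 0.0        # T * max_{k in K_m} x_k^n in (5)
    if budget == 0.0:
        return np.zeros_like(H, dtype=float)

    def alloc(nu_bar):
        arg = np.maximum((H * nu_bar - sigma2) / (sigma2 * np.e), -1.0 / np.e)
        w0 = np.real(lambertw(arg))              # principal branch W_0
        t = (ell * np.log(2) / B) / (w0 + 1.0 + 1e-30)
        return np.minimum(np.maximum(t, 0.0), caps)

    for _ in range(iters):                       # larger nu_bar -> smaller times
        nu_mid = 0.5 * (nu_lo + nu_hi)
        if alloc(nu_mid).sum() > budget:
            nu_lo = nu_mid
        else:
            nu_hi = nu_mid
    return alloc(0.5 * (nu_lo + nu_hi))
```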
Next, we solve Problem 3. Similar to Problem 2, Problem 3 is also convex, and thus we have the following result.

Lemma 4 (Optimal Solution of Problem 3): An optimal solution of Problem 3 is given by
$$p_m^{n\star} = \min\left\{ \max\left\{ \sqrt{\frac{\lambda v_m}{(1 - \lambda) E(\mathbf{t}^{n\star}(m)) + \bar{\phi}}}, \, 0 \right\}, \; 1 \right\},$$
where $v_m \triangleq \frac{C}{(|\mathcal{D}| \Pi)^2} \sum_{k \in \mathcal{K}_m} \left(\frac{|\mathcal{D}_k|}{\rho_k}\right)^2 \|\hat{\mathbf{g}}_k^n\|^2$ and $\bar{\phi}$ satisfies $\sum_{m \in \mathcal{M}} p_m^{n\star} = 1$.
Remark 2 (Quality- and Availability-Aware Device Scheduling): From Lemma 4, we can observe that the device scheduling probability $p_m^{n\star}$ is determined by the channel quality (captured in $E(\mathbf{t}^{n\star}(m))$) and by the gradient quality as well as the device availability (both captured in $v_m$). In particular, $p_m^{n\star}$ increases with the improvement of the channel quality and the gradient quality of all available devices in cluster $m$. This can be explained as follows: high-quality gradients contain richer local data information that can contribute to the convergence of FEEL, and high-quality channels can support the submission of more high-quality gradients. The probability $p_m^{n\star}$ also increases with a decreasing availability probability of all active devices in cluster $m$. This is because lazy devices with low availability have more fresh data, and scheduling these devices will accelerate the convergence of FEEL.
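Analogously to Lemma 3, the scheduling probabilities of Lemma 4 can be evaluated by bisecting on the multiplier $\bar{\phi}$ until they sum to one, as in the sketch below (assuming $0 < \lambda < 1$ and precomputed $v_m$ and $E(\mathbf{t}^{n\star}(m))$; a hedged illustration, not the authors' code).

```python
import numpy as np

def scheduling_probabilities(v, E, lam, iters=100):
    """Evaluate the Lemma 4 solution by bisecting on the multiplier phi_bar.

    v:   length-M vector of v_m (gradient/availability statistic of each cluster).
    E:   length-M vector of E(t^{n*}(m)) (optimal transmission energy per cluster).
    lam: weight coefficient lambda, assumed to lie strictly between 0 and 1.
    """
    def p_of(phi):
        denom = np.maximum((1.0 - lam) * E + phi, 1e-30)
        return np.clip(np.sqrt(lam * v / denom), 0.0, 1.0)

    # phi_bar just above -(1 - lam) * min(E) pushes some p_m to 1 (sum >= 1),
    # while a very large phi_bar drives all p_m toward 0; bisect in between.
    phi_lo = -(1.0 - lam) * E.min() + 1e-12
    phi_hi = phi_lo + 1e6 * (lam * v.max() + 1.0)
    for _ in range(iters):
        phi_mid = 0.5 * (phi_lo + phi_hi)
        if p_of(phi_mid).sum() > 1.0:
            phi_lo = phi_mid
        else:
            phi_hi = phi_mid
    return p_of(0.5 * (phi_lo + phi_hi))
```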
Finally, by combining Lemmas 3 and 4, we can obtain an optimal joint device scheduling and resource allocation scheme $(\mathbf{p}^{n\star}, \mathbf{t}^{n\star})$, as summarized in Algorithm 1. Note that Algorithm 1 is executed by the edge server during the Gradient Submission stage of FEEL. Since $(\mathbf{p}^{n\star}, \mathbf{t}^{n\star})$ is obtained in closed form, Algorithm 1 can be very efficient and scalable, and has the potential to solve large-scale problems. We believe that the edge server has sufficient processing power to execute Algorithm 1.
IV. NUMERICAL RESULTS
Fig. 2. Impact of cluster size. (a) Test accuracy. (b) Energy consumption. [Curves shown for C = 2, 4, 6, and 10; figure graphics omitted in this text version.]
Fig. 3. Performance comparison with baselines. (a) MNIST dataset. (b) Fashion-MNIST dataset. [Curves shown for the proposed scheme and Baselines 1–3; figure graphics omitted in this text version.]
The simulation settings are as follows: $K = 10$, $B = 10$ MHz, $\sigma^2 = 10^{-9}$ W, $T = 60$ ms, $\lambda = 1 \times 10^{-6}$, and $H_k^n$ is modeled as an independent exponentially distributed random variable with mean $\hat{H} = 10^{-5}$, where $\hat{H}$ reflects the mean channel quality.
We use FEEL to train a classification model on the MNIST (built into Matlab 2021a) and Fashion-MNIST datasets, respectively. The MNIST dataset contains 10000 handwritten images of the digits 0 to 9, with 1000 images per digit. The Fashion-MNIST dataset consists of 70000 images of fashion items from 10 categories, including “t-shirt”, “bag”, and so on, with 7000 images per category. The learning rate $\eta$ and the momentum weight are set to 0.001 and 0, respectively. Then, using Matlab, we obtain the size of the gradient $\hat{\mathbf{g}}_k^n$ as $\ell = 9 \times 10^5$ bits (on the MNIST dataset) or $\ell = 1 \times 10^6$ bits (on the Fashion-MNIST dataset). To ensure that the training data distribution of each device is non-IID, we first randomly assign a label to each device, then randomly select $|\mathcal{D}_k|$ images from all images under this label as the training set, and use the rest as the test set. Here, we set $|\mathcal{D}_k| = 800$ if $k$ is odd and $|\mathcal{D}_k| = 200$ otherwise. We use the test set to evaluate the classification performance of FEEL during model training. We employ a convolutional neural network (CNN) to perform the image classification task; the concrete architecture of the CNN is the same as that in [1]. All numerical results are averaged over 100 trials.
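The label-based non-IID partition described above can be sketched as follows (a simplified Python illustration with hypothetical array names, not the authors' Matlab code; devices sharing a label draw disjoint samples here, which is one possible reading of the description).

```python
import numpy as np

def label_based_partition(labels, num_devices, sizes, rng):
    """Give each device one label and sample its |D_k| training images from that label.

    labels:      length-N array with the class label of every image.
    num_devices: number of devices K.
    sizes:       length-K array with |D_k| for each device.
    Returns per-device training index arrays and the remaining (test) indices.
    """
    classes = np.unique(labels)
    device_labels = rng.choice(classes, size=num_devices)    # one label per device
    train_idx, used = [], set()
    for k in range(num_devices):
        pool = np.array([i for i in np.where(labels == device_labels[k])[0]
                         if i not in used])
        chosen = rng.choice(pool, size=sizes[k], replace=False)
        used.update(chosen.tolist())
        train_idx.append(chosen)
    test_idx = np.array([i for i in range(len(labels)) if i not in used])
    return train_idx, test_idx

# Toy usage: 70000 images with 10 classes; |D_k| = 800 for odd k, 200 otherwise.
rng = np.random.default_rng(0)
toy_labels = rng.integers(0, 10, size=70000)
sizes = np.array([800 if (k + 1) % 2 == 1 else 200 for k in range(10)])
train_idx, test_idx = label_based_partition(toy_labels, 10, sizes, rng)
```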
Fig. 2 shows the impact of the cluster size C on the accuracy
and energy consumption of the FEEL system. From Fig. 2, we
can see that the larger Cis, the higher the test accuracy is, but
at the cost of consuming more energy. This indicates that in
practical applications, the number of scheduled devices should
be adjusted appropriately to balance the accuracy and energy
consumption.
Fig. 3 compares the performance of FEEL under the proposed Algorithm 1 with those under three representative baselines, where $C = 2$ for the MNIST dataset and $C = 6$ for the Fashion-MNIST dataset, and $\rho_k = 0.8$ if $k$ is odd and $\rho_k = 0.2$ otherwise. Here, Baseline 1 refers to a uniform scheduling design, i.e., $p_m^n = 1/M$; Baseline 2 is a scheduling design with availability and gradient quality awareness, i.e., $p_m^n = G_m^n / \sum_{m \in \mathcal{M}} G_m^n$, where $G_m^n \triangleq \frac{1}{C} \sum_{k \in \mathcal{K}_m} \|\hat{\mathbf{g}}_k^n\|$ represents the mean gradient norm of cluster $m$; Baseline 3
adopts a scheduling design with availability and channel quality awareness, i.e., $p_m^n = J_m^n / \sum_{m \in \mathcal{M}} J_m^n$, where $J_m^n \triangleq \frac{1}{C} \sum_{k \in \mathcal{K}_m} x_k^n H_k^n$ denotes the mean channel power gain of cluster $m$. Note that the baselines also consider the optimization of resource allocation via Lemma 3. From Fig. 3, it is clear that the proposed scheme is significantly better than the baselines in terms of test accuracy and energy consumption on both the MNIST and Fashion-MNIST datasets. The underlying reasons are as follows. Baseline 1 fails to perceive the device availability and completely overlooks the wireless channel quality as well as the updated gradient quality during model training. Although Baseline 2 (Baseline 3) perceives the device availability and is conscious of the gradient quality (channel quality), the channel quality (gradient quality) is completely ignored. In contrast, as pointed out in Remark 1 and Remark 2, our proposed scheme can well adapt to the changes of device availability, wireless channel quality, and local gradient quality, so it achieves higher accuracy and lower energy consumption.
V. CONCLUSIONS
In this paper, we first mathematically model the device
availability, wireless channel quality, and gradient quality, and
derive the convergence bound of FEEL. Then, we formulate a
joint device scheduling and resource allocation problem, aiming to improve the FEEL efficiency. The formulated problem is
a challenging non-convex problem. By exploring its structural
properties and utilizing the KKT conditions, we obtain an
optimal solution in closed-form. Finally, the analytical results
enable us to gain some important insights into how the device
availability, wireless channel quality, and gradient quality
affect the device scheduling and resource allocation in the
FEEL system.
APPENDIX A
PROOF OF LEMMA 1

By taking the derivative of $L(\mathbf{w})$ at $\mathbf{w} = \mathbf{w}^n$, we have $\mathbf{g}^n \triangleq \frac{1}{|\mathcal{D}|} \sum_{k \in \mathcal{K}} |\mathcal{D}_k| \nabla_{\mathbf{w}} L_k(\mathbf{w})|_{\mathbf{w} = \mathbf{w}^n} = \frac{1}{|\mathcal{D}|} \sum_{k \in \mathcal{K}} |\mathcal{D}_k| \mathbf{g}_k^n$. Then, by taking the expectation of $\hat{\mathbf{g}}_m^n$, we have
$$\mathbb{E}[\hat{\mathbf{g}}_m^n] = \sum_{u \in \mathcal{M}} \frac{p_u^n}{|\mathcal{D}| \Pi p_u^n} \sum_{k \in \mathcal{K}} i_k(u) |\mathcal{D}_k| \rho_k^{-1} \mathbb{E}[x_k^n] \mathbf{g}_k^n \overset{(a)}{=} \frac{1}{|\mathcal{D}| \Pi} \sum_{k \in \mathcal{K}} |\mathcal{D}_k| \mathbf{g}_k^n \sum_{u \in \mathcal{M}} i_k(u) \overset{(b)}{=} \frac{1}{|\mathcal{D}|} \sum_{k \in \mathcal{K}} |\mathcal{D}_k| \mathbf{g}_k^n,$$
where (a) follows from $\mathbb{E}[x_k^n] = \rho_k$ and (b) is due to $\sum_{u \in \mathcal{M}} i_k(u) = \Pi$ for each $k \in \mathcal{K}$. Therefore, we have $\mathbb{E}[\hat{\mathbf{g}}_m^n] = \mathbf{g}^n$, namely, $\hat{\mathbf{g}}_m^n$ is an unbiased estimate of $\mathbf{g}^n$, which completes the proof.
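As a quick numerical sanity check of Lemma 1 (not part of the paper), one can average the aggregated gradient in (8) over many simulated rounds and compare it against $\mathbf{g}^n = \frac{1}{|\mathcal{D}|} \sum_{k} |\mathcal{D}_k| \mathbf{g}_k^n$; the sketch below uses arbitrary toy values.

```python
import numpy as np
from itertools import combinations
from math import comb

rng = np.random.default_rng(0)
K, C, d = 5, 2, 3
clusters = list(combinations(range(K), C))
M, Pi = len(clusters), comb(K - 1, C - 1)
rho = rng.uniform(0.4, 0.9, size=K)                     # availability probabilities rho_k
sizes = rng.integers(100, 500, size=K).astype(float)    # dataset sizes |D_k|
g = rng.standard_normal((K, d))                         # true local gradients g_k^n
p = np.full(M, 1.0 / M)                                 # a valid scheduling design p^n
D = sizes.sum()

g_true = (sizes[:, None] * g).sum(axis=0) / D           # ground-truth global gradient g^n
acc, rounds = np.zeros(d), 200000
for _ in range(rounds):
    x = (rng.random(K) < rho).astype(float)             # availability states x_k^n
    m = rng.choice(M, p=p)                               # scheduled cluster
    w = np.zeros(K)
    for k in clusters[m]:
        w[k] = sizes[k] / rho[k]                         # i_k(m) |D_k| / rho_k
    acc += (w[:, None] * (x[:, None] * g)).sum(axis=0) / (D * Pi * p[m])
print(np.allclose(acc / rounds, g_true, atol=2e-2))      # empirical mean ~= g^n
```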
APPENDIX B
PROOF OF LEMMA 2

First, since $\nabla_{\mathbf{w}} L(\mathbf{w})$ is Lipschitz continuous with a positive modulus $\mu$, we have
$$L(\mathbf{w}^{n+1}) \le L(\mathbf{w}^n) + (\mathbf{w}^{n+1} - \mathbf{w}^n)^T \mathbf{g}^n + \frac{\mu}{2} \left\|\mathbf{w}^{n+1} - \mathbf{w}^n\right\|^2 \overset{(a)}{=} L(\mathbf{w}^n) + (-\eta \hat{\mathbf{g}}_m^n)^T \mathbf{g}^n + \frac{\mu}{2} \left\|-\eta \hat{\mathbf{g}}_m^n\right\|^2.$$
Here, $(\cdot)^T$ is the transposition operator and (a) follows from (9). Then, taking the expectation on both sides of the above inequality, we have
$$\mathbb{E}\left[L(\mathbf{w}^{n+1}) - L(\mathbf{w}^\star)\right] \overset{(b)}{\le} \mathbb{E}\left[L(\mathbf{w}^n) - L(\mathbf{w}^\star)\right] - \eta \|\mathbf{g}^n\|^2 + \frac{\mu \eta^2}{2} \mathbb{E}\left[\|\hat{\mathbf{g}}_m^n\|^2\right] = \mathbb{E}\left[L(\mathbf{w}^n) - L(\mathbf{w}^\star)\right] - \eta \|\mathbf{g}^n\|^2 + \frac{\mu \eta^2}{2 (|\mathcal{D}| \Pi)^2} \sum_{m \in \mathcal{M}} \frac{\mathcal{E}_m}{p_m^n},$$
where $\mathcal{E}_m \triangleq \mathbb{E}\left[\left\| \sum_{k \in \mathcal{K}_m} \frac{|\mathcal{D}_k|}{\rho_k} \hat{\mathbf{g}}_k^n \right\|^2\right]$ and (b) follows from the unbiasedness of $\hat{\mathbf{g}}_m^n$. Next, using the generalized triangle inequality of the second kind, $\left\|\sum_{j=1}^{n} \mathbf{x}_j\right\|^2 \le n \sum_{j=1}^{n} \|\mathbf{x}_j\|^2$, we have $\mathcal{E}_m \le C \sum_{k \in \mathcal{K}_m} \left(\frac{|\mathcal{D}_k|}{\rho_k}\right)^2 \mathbb{E}\left[\|\hat{\mathbf{g}}_k^n\|^2\right]$. Finally, using the definition of $g(\mathbf{p}^n)$, we complete the proof.
REFERENCES
[1] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. Int. Conf. Artificial Intell. Stat. (AISTATS), vol. 54, 2017, pp. 1273–1282.
[2] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, “Toward an intelligent edge: Wireless communication meets machine learning,” IEEE Commun. Mag., vol. 58, no. 1, pp. 19–25, Jan. 2020.
[3] Y. Liu, Y. Zhu, and J. J. Yu, “Resource-constrained federated learning with heterogeneous data: Formulation and analysis,” IEEE Trans. Netw. Sci. Eng., pp. 1–1, 2021.
[4] J. Kang, Z. Xiong, D. Niyato, S. Xie, and J. Zhang, “Incentive mechanism for reliable federated learning: A joint optimization approach to combining reputation and contract theory,” IEEE Internet Things J., vol. 6, no. 6, pp. 10700–10714, 2019.
[5] H. H. Yang, Z. Liu, T. Q. S. Quek, and H. V. Poor, “Scheduling policies for federated learning in wireless networks,” IEEE Trans. Commun., vol. 68, no. 1, pp. 317–333, Jan. 2020.
[6] W. Xia, T. Q. S. Quek, K. Guo, W. Wen, H. H. Yang, and H. Zhu, “Multi-armed bandit-based client scheduling for federated learning,” IEEE Trans. Wireless Commun., vol. 19, no. 11, pp. 7108–7123, Nov. 2020.
[7] J. Leng, Z. Lin, M. Ding, P. Wang, D. Smith, and B. Vucetic, “Client scheduling in wireless federated learning based on channel and learning qualities,” IEEE Wireless Commun., pp. 1–1, 2022.
[8] Z. Yang, M. Chen, W. Saad, C. S. Hong, and M. Shikh-Bahaei, “Energy efficient federated learning over wireless communication networks,” IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 1935–1949, Mar. 2021.
[9] W. Shi, S. Zhou, and Z. Niu, “Device scheduling with fast convergence for wireless federated learning,” in Proc. IEEE ICC, Jun. 2020, pp. 1–6.
[10] J. Xu and H. Wang, “Client selection and bandwidth allocation in wireless federated learning networks: A long-term perspective,” IEEE Trans. Wireless Commun., vol. 20, no. 2, pp. 1188–1200, Feb. 2021.
[11] M. M. Wadu, S. Samarakoon, and M. Bennis, “Joint client scheduling and resource allocation under channel uncertainty in federated learning,” IEEE Trans. Commun., pp. 1–1, 2021.
[12] J. Ren, Y. He, D. Wen, G. Yu, K. Huang, and D. Guo, “Scheduling for cellular federated edge learning with importance and channel awareness,” IEEE Trans. Wireless Commun., vol. 19, no. 11, pp. 7690–7703, Nov. 2020.
[13] Q. Zeng, Y. Du, K. Huang, and K. K. Leung, “Energy-efficient radio resource allocation for federated edge learning,” in Proc. IEEE ICC Workshops, Jun. 2020, pp. 1–6.
[14] W. Wen, Z. Chen, H. H. Yang, W. Xia, and T. Q. S. Quek, “Joint scheduling and resource allocation for hierarchical federated edge learning,” IEEE Trans. Wireless Commun., pp. 1–1, 2022.
[15] H.-S. Lee, “Device selection and resource allocation for layerwise federated learning in wireless networks,” IEEE Syst. J., pp. 1–4, 2022.