
Gradient Statistics Aware Power Control for Over-the-Air Federated Learning

arXiv:2003.02089v2 [eess.SP] 8 May 2020
Naifu Zhang and Meixia Tao
Abstract
Federated learning (FL) is a promising technique that enables many edge devices to train a machine
learning model collaboratively in wireless networks. By exploiting the superposition nature of wireless
waveforms, over-the-air computation (AirComp) can accelerate model aggregation and hence facilitate
communication-efficient FL. Due to channel fading, power control is crucial in AirComp. Prior works
assume that the signal to be aggregated from each device, i.e., local gradient, can be normalized with
zero mean and unit variance. In FL, however, gradient statistics vary over both training iterations and
feature dimensions, and are unknown in advance. This paper studies the power control problem for
over-the-air FL by taking gradient statistics into account. The goal is to minimize the aggregation error
by jointly optimizing the transmit power at each device and the denoising factor at the edge server.
We obtain the optimal policy in closed form when gradient statistics are given. Notably, we show that
the optimal transmit power at each device is continuous and monotonically decreases with the squared
multivariate coefficient of variation (SMCV) of gradient vectors. We also propose an estimation method
of gradient statistics with negligible communication cost. Experimental results demonstrate high learning
performance by using the proposed scheme.
Index Terms
Federated learning, over-the-air computation, power control, fading channel.
I. INTRODUCTION
The proliferation of mobile devices such as smartphones, tablets, and wearable devices has
revolutionized people’s daily lives. Owing to the growing computation and sensing capabilities
of these devices, a wealth of data is generated each day. This has promoted a wide
range of artificial intelligence (AI) applications such as image recognition and natural language
processing. The traditional machine learning procedure, including both training and inference, relies
on cloud computing in a centralized data center with computing, storage, and full access to the
entire data set. Wireless edge devices are thus required to transmit their collected data to a central
parameter server, which can be very costly in terms of energy and bandwidth consumption, and
unfavorable due to response delay and privacy concerns. It is thus increasingly desired to let
edge devices engage in the learning process by keeping the collected data locally and performing
training/inference either collaboratively or individually. This emerging technology is known as
Edge Machine Learning [2] or Edge Intelligence [3].
Federated learning [4]–[6] is a new edge learning framework that enables many edge devices
to collaboratively train a machine learning model without exchanging datasets under the coordi-
nation of an edge server in wireless networks. Compared with traditional learning at a centralized
data center, FL offers several distinct advantages, such as preserving privacy, reducing network
congestion, and leveraging distributed on-device computation. In FL, each edge device downloads
a shared model from the edge server, computes an update to the current model by learning from
its local dataset, and then sends this update to the edge server. Therein, the updates are averaged to
improve the shared model.

N. Zhang and M. Tao are with the Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, 200240, P.
R. China (email: arthaslery@sjtu.edu.cn; mxtao@sjtu.edu.cn). Part of this work is to be presented at the IEEE ICC 2020 workshop
[1].
The communication cost is the main bottleneck in FL since a large number of participating
edge devices send their updates to the edge server at each round of the model training. Existing
methods to obtain communication-efficient FL can be mainly divided into three categories: model
parameter compression [7], [8], gradient sparsification [9], [10], and infrequent local update
[4], [11]. Nevertheless, the communication cost of FL is still proportional to the number of
edge devices, and thus inefficient in large-scale environments. Recently, a fast model aggregation
approach has been proposed for FL by applying the over-the-air computation (AirComp) principle [12],
such as in [13]–[16]. This is accomplished by exploiting the waveform superposition nature of
the wireless medium to compute the desired function of the distributed local gradients (i.e., the
weighted average function) by concurrent transmission. Such AirComp-based FL, referred to as
over-the-air FL in this work, can dramatically save the uplink communication bandwidth.
Due to the channel fading, device selection and power control are crucial to achieve a reliable
and high-performance over-the-air FL. In [17], the authors jointly optimize the transmit power
at edge devices and the receive scaling factor (known as denoising factor) at the edge server for
minimizing the mean square error (MSE) of the aggregated signal. It is shown that the optimal
transmit power in static channels exhibits a threshold-based structure. Namely, each device
applies channel-inversion power control if its quality indicator exceeds the optimal threshold,
and applies full power transmission otherwise. The work [17], however, is purely for AirComp-
based signal aggregation (not in the context of learning), where the signal on each device is
assumed to be independent and identically distributed (IID) with zero mean and unit variance.
For AirComp-based gradient aggregation in FL, the work [14] introduces a truncation-based
approach for excluding the edge devices with deep fading channels to strike a good balance
between learning performance and aggregation error. The work [13] proposes a joint device
selection and receiver beamforming design method to find the maximum number of selected
devices with MSE requirements to improve the learning performance. As in [17], it is assumed
in both [13] and [14] that the signal (i.e., the local gradient) to be aggregated from each device
is IID, and normalized with zero mean and unit variance. By exploiting the sparsity pattern
in gradient vectors, the work [16] projects the gradient estimate in each device into a low-
dimensional vector and transmits only the important gradient entries while accumulating the
error from previous iterations. Therein, a channel-inversion like power control scheme, similar
to those in [13], [14], [17] is designed so that the gradient vectors sent from the selected devices
are aligned at the edge server.
Note that all the existing works on power control for over-the-air FL have overlooked the
following statistical characteristics of gradients: the gradient distribution over training iterations
is independent but not necessarily identical; and even in the same iteration, the distribution of
each entry of the gradient vector can be non-identical. A general observation is that the gradient
distribution changes over iterations and is different in each feature dimension. In addition, if the
gradient distribution is unknown for each device, normalizing the gradient to a distribution with
zero mean and unit variance is infeasible. As such, due to the neglect of the above characteristics
of gradient distribution, the existing power control methods for over-the-air FL may not perform
efficiently in practice.
Motivated by the above issue, in this paper, we study the optimal power control problem
for over-the-air FL in fading channels by taking gradient statistics into account. Our goal is to
minimize the MSE of the aggregated model, and hence improve the accuracy of FL, by jointly
optimizing the transmit power at each device and the denoising factor at the edge server given
the first-order and second-order statistics of gradients at each iteration. The main contributions
of this work are outlined below:
• Optimal power control with known gradient statistics: We first derive the MSE expression of
gradient aggregation at each iteration of the model training when the first-order and second-
order statistics of the gradient vectors are known. We then formulate a joint optimization
problem of transmit power at edge devices and denoising factor at the edge server for MSE
minimization subject to individual peak power constraints at edge devices. By decomposing
this non-convex problem into subproblems defined on different subregions, we obtain the
optimal power control strategy in closed form. Unlike the existing threshold-based power
control for normalized signal in [17], our optimal transmit power depends not only on the
channel quality and noise power, but also heavily on the gradient statistics. In particular,
we find that the relative dispersion of the gradient values, i.e., the squared multivariate
coefficient of variation (SMCV) of the gradient vectors, plays a key role in the optimal
transmit power control. Specifically, we prove that the optimal transmit power of each
device is a continuous and monotonically decreasing function of the gradient SMCV.
• Optimal power control in special cases: In the special case where the gradient SMCV
approaches infinity, which could happen when the model training converges and/or the
dataset is highly non-IID, we show that there is an optimal threshold for the aggregation
capabilities of the devices, below which the devices transmit with full power and above
which the devices transmit at the power to equalize the weights of their gradients for
aggregation to one. In the other special case where the gradient SMCV approaches zero,
which could happen when the model training just begins, we show that the optimal power
control is to let all the devices transmit with their peak powers.
• Adaptive power control with unknown gradient statistics: We propose an adaptive power
control algorithm that estimates the gradient statistics based on the historical aggregated
gradients and then dynamically adjusts the power values in each iteration based on the
estimation results. The communication cost consumed by estimating the gradient statistics
is negligible compared to the transmission of the entire gradient vector.
To evaluate the efficiency of the proposed power control scheme, we implement the FL in
PyTorch for AI applications of three datasets: MNIST, CIFAR-10 and SVHN. Experimental
results demonstrate that the over-the-air FL with the proposed adaptive power control obtains
higher model accuracy than that with existing power control methods (full power transmission
and threshold-based power control for normalized signal [17]). In particular, while the full power
transmission performs poorly in high SNR region and non-IID data distribution and the threshold-
based power control for normalized signal [17] performs poorly in low SNR region and IID data
distribution, the proposed power control can perform very well over a wide range of scenarios
by exploiting the gradient statistics.
The rest of this paper is organized as follows. The over-the-air FL system is modeled in Section
II. Section III describes the optimal power control strategy with known gradient statistics. In
Section IV, we introduce an adaptive power control scheme when the gradient statistics are
unknown in advance. Section V provides experimental results. Finally we conclude the paper in
Section VI.
[Figure: an edge server and edge devices 1, 2, ..., K; the devices upload by over-the-air computation and the edge server broadcasts the aggregated model.]
Fig. 1. Illustration of over-the-air federated learning.
II. SYSTEM MODEL
A. Federated Learning Over Wireless Networks
We consider a wireless FL framework as illustrated in Fig. 1, where a shared AI model (e.g.,
a classifier) is trained collaboratively across $K$ single-antenna edge devices via the coordination
of a single-antenna edge server. Let $\mathcal{K} = \{1, \ldots, K\}$ denote the set of edge devices. Each device
$k \in \mathcal{K}$ collects a fraction of labelled training data via interaction with its own users, constituting
a local dataset, denoted as $\mathcal{D}_k$. The loss function measuring the model error is defined as
$$L(\mathbf{w}) = \sum_{k \in \mathcal{K}} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} L_k(\mathbf{w}), \tag{1}$$
where $\mathbf{w} \in \mathbb{R}^D$ denotes the $D$-dimensional model parameter to be learned,
$L_k(\mathbf{w}) = \frac{1}{|\mathcal{D}_k|} \sum_{i \in \mathcal{D}_k} l_i(\mathbf{w})$
is the loss function of device $k$ quantifying the prediction error of the model $\mathbf{w}$ on the local dataset
collected at the $k$-th device, with $l_i(\mathbf{w})$ being the sample-wise loss function, and
$\mathcal{D} = \bigcup_{k \in \mathcal{K}} \mathcal{D}_k$
is the union of all datasets. The minimization of $L(\mathbf{w})$ is typically carried out through the stochastic
gradient descent (SGD) algorithm, where device $k$’s local dataset $\mathcal{D}_k$ is split into mini-batches
of size $B$ and at each iteration $t = 1, 2, \ldots$, we draw one mini-batch $\mathcal{B}_k(t)$ randomly and update
the model parameter as
$$\mathbf{w}(t+1) = \mathbf{w}(t) - \gamma \frac{1}{K} \sum_{k \in \mathcal{K}} \nabla L^{\mathrm{SGD}}_{k,t}(\mathbf{w}(t)), \tag{2}$$
with $\gamma$ being the learning rate and
$L^{\mathrm{SGD}}_{k,t}(\mathbf{w}) = \frac{1}{B} \sum_{i \in \mathcal{B}_k(t)} l_i(\mathbf{w})$.
Note that the mean of the gradient $\nabla L^{\mathrm{SGD}}_{k,t}(\mathbf{w}(t))$ in SGD is equal to the
gradient $\nabla L(\mathbf{w}(t))$ in gradient descent (GD)
while the variance depends on the mini-batch size and distribution of data (IID or non-IID).
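To make update (2) concrete, the following is a minimal sketch in Python. It assumes a toy least-squares sample loss $l_i(\mathbf{w}) = \frac{1}{2}(\mathbf{w}^T \mathbf{x}_i - y_i)^2$ and synthetic noise-free data; the function names `minibatch_gradient` and `fl_sgd_step` and all numerical values are illustrative assumptions, not part of the system model.

```python
import random

def minibatch_gradient(w, batch):
    """Gradient of (1/B) * sum_i 0.5 * (w.x_i - y_i)^2 over one mini-batch."""
    grad = [0.0] * len(w)
    for x, y in batch:
        err = sum(wj * xj for wj, xj in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj / len(batch)
    return grad

def fl_sgd_step(w, local_datasets, batch_size, lr, rng):
    """One iteration of (2): average the K local mini-batch gradients, then step."""
    K = len(local_datasets)
    avg = [0.0] * len(w)
    for data in local_datasets:
        batch = rng.sample(data, batch_size)      # draw one mini-batch at random
        g_k = minibatch_gradient(w, batch)
        avg = [a + g / K for a, g in zip(avg, g_k)]
    return [wj - lr * a for wj, a in zip(w, avg)]

# Hypothetical setup: K = 4 devices, 50 noise-free samples each, D = 2.
rng = random.Random(0)
true_w = [2.0, -1.0]

def make_dataset(n):
    data = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        data.append((x, sum(a * b for a, b in zip(true_w, x))))
    return data

local_datasets = [make_dataset(50) for _ in range(4)]
w = [0.0, 0.0]
for t in range(500):
    w = fl_sgd_step(w, local_datasets, batch_size=10, lr=0.5, rng=rng)
```

Because the synthetic labels are noise-free, every mini-batch gradient vanishes at `true_w`, so the iterates in this toy setting approach `true_w`; with real data the mini-batch gradients keep a nonzero variance, which is the quantity the rest of the paper tracks.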
B. Over-The-Air Computation for Gradient Aggregation
We consider block fading channels, where the wireless channels remain unchanged within the
duration of each iteration in FL but may change independently from one iteration to another.
We define the duration of one iteration as one time block, indexed by $t \in \mathbb{N}$. It is assumed
that the channel coefficients over different time blocks are generated from a stationary and
ergodic process. Let $\mathbf{g}_k(t) \triangleq \nabla L^{\mathrm{SGD}}_{k,t}(\mathbf{w}(t)) \in \mathbb{R}^D$ denote the gradient vector computed on
device $k$ at time block $t$. The following are key assumptions on the distribution of each entry
$g_{k,d}(t)$, $d \in \{1, \ldots, D\}$, of $\mathbf{g}_k(t)$:
• The gradient elements $\{g_{k,d}(t)\}$, $\forall k \in \mathcal{K}$, are independent and identically distributed over
devices $k$. This is a default assumption since the distributions of the local datasets at
different devices, whether identical or non-identical, are unknown to the edge server and
thus the distributions of the local gradients trained from these local datasets are treated
equally.
• The gradient elements $\{g_{k,d}(t)\}$, $\forall t \in \mathbb{N}$, are independent but non-identically distributed
over iterations $t$. In other words, the gradient distribution is non-stationary over time. The
non-stationary distribution is valid since the gradient values in general change rapidly at
the beginning, then gradually approach zero as the training goes on.
• The gradient elements $\{g_{k,d}(t)\}$, $\forall d \in \{1, 2, \ldots, D\}$, are independent but non-identically
distributed over gradient vector dimensions $d$. This assumption is valid as long as the
features in a data sample are independent but non-identically distributed, which is typically
the case.
The gradient of interest at the edge server at each time block $t$ is given by
$$\mathbf{g}(t) = \frac{1}{K} \sum_{k \in \mathcal{K}} \mathbf{g}_k(t). \tag{3}$$
To obtain (3), all the devices transmit their gradient vectors $\mathbf{g}_k(t)$ concurrently in an analog
manner, following the AirComp principle as shown in Fig. 1. Each transmission block takes a
duration of $D$ slots, one slot for one entry in the $D$-dimensional gradient vector. Each gradient
vector $\mathbf{g}_k(t)$ is multiplied with a pre-processing factor, denoted as $b_k(t)$. The received signal
vector at the edge server is given by
$$\mathbf{y}(t) = \sum_{k \in \mathcal{K}} b_k(t) h_k(t) \mathbf{g}_k(t) + \mathbf{n}(t), \tag{4}$$
where $h_k(t)$ denotes the complex-valued channel coefficient from device $k$ to the edge server
and $\mathbf{n}(t)$ denotes the additive white Gaussian noise (AWGN) vector at the edge server with
each element having zero mean and variance of $\sigma_n^2$. To compensate for the channel phase offset and
control the actual transmit power at each device, we let
$b_k(t) = \frac{\sqrt{p_k(t)}\, e^{-j\theta_k(t)}}{B_k(t)}$, where $p_k(t) \ge 0$
denotes the transmit power at device $k \in \mathcal{K}$ at time block $t$, $\theta_k(t)$ is the phase of $h_k(t)$, and
$B_k(t) \triangleq \|\mathbf{g}_k(t)\| = \sqrt{\sum_{d=1}^{D} g_{k,d}^2(t)}$ denotes the gradient norm of device $k$. Here, we have
assumed that each device $k$ can estimate perfectly its own channel phase $\theta_k(t)$. To design the
optimal power control policy $p_k(t)$, for $k \in \mathcal{K}$, we also assume that the edge server knows the
channel amplitude $|h_k(t)|$ of all devices. By such design of $b_k(t)$, we can rewrite (4) as
$$\mathbf{y}(t) = \sum_{k \in \mathcal{K}} \frac{\sqrt{p_k(t)}\, |h_k(t)|}{B_k(t)} \mathbf{g}_k(t) + \mathbf{n}(t). \tag{5}$$
Each device $k \in \mathcal{K}$ has a peak power budget $P_k$, i.e.,
$$p_k(t) \le P_k, \quad \forall k \in \mathcal{K}, \ \forall t \in \mathbb{N}. \tag{6}$$
Upon receiving $\mathbf{y}(t)$, the edge server applies a denoising factor, denoted by $\eta(t)$, to recover the
gradient of interest as
$$\hat{\mathbf{g}}(t) = \frac{\mathbf{y}(t)}{K \sqrt{\eta(t)}}, \tag{7}$$
where the factor $1/K$ is employed for the averaging purpose.
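The signal chain (4)-(7) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the noise is drawn as real-valued Gaussian per entry, the function name `aircomp_aggregate` is hypothetical, and the channel and gradient values are made up.

```python
import cmath
import math
import random

def aircomp_aggregate(grads, h, p, eta, sigma_n, rng):
    """Simulate (4)-(7): phase-compensated analog superposition, then denoising.

    grads: list of K gradient vectors, h: K complex channel coefficients,
    p: K transmit powers, eta: denoising factor, sigma_n: noise std per entry.
    """
    K, D = len(grads), len(grads[0])
    y = [0.0] * D
    for g_k, h_k, p_k in zip(grads, h, p):
        B_k = math.sqrt(sum(v * v for v in g_k))                 # gradient norm B_k(t)
        b_k = math.sqrt(p_k) * cmath.exp(-1j * cmath.phase(h_k)) / B_k
        gain = (b_k * h_k).real                                  # sqrt(p_k)|h_k|/B_k, as in (5)
        for d in range(D):
            y[d] += gain * g_k[d]
    y = [v + rng.gauss(0.0, sigma_n) for v in y]                 # assumed real-valued AWGN
    return [v / (K * math.sqrt(eta)) for v in y]                 # recovered gradient (7)

# Hypothetical example with K = 3 devices and D = 2.
grads = [[3.0, 4.0], [0.0, 2.0], [1.0, 0.0]]
h = [1 + 1j, -2j, 0.5 + 0j]
eta = 1.0
# Choose p_k so that sqrt(p_k)|h_k| / (sqrt(eta) B_k) = 1 for every device.
p = [eta * (math.sqrt(sum(v * v for v in g)) / abs(hk)) ** 2 for g, hk in zip(grads, h)]
g_hat = aircomp_aggregate(grads, h, p, eta, sigma_n=0.0, rng=random.Random(1))
g_true = [sum(g[d] for g in grads) / len(grads) for d in range(2)]
```

With zero noise and unit weights, `g_hat` coincides with the average in (3) up to floating-point error. In practice the peak power budget (6) prevents devices in deep fades from reaching unit weight, which is why power control matters.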
C. Performance Measure
We are interested in minimizing the distortion of the recovered gradient $\hat{\mathbf{g}}(t)$ with respect to
(w.r.t.) the ground-truth gradient $\mathbf{g}(t)$. The distortion at a given iteration $t$ is measured by the
instantaneous MSE defined as
$$\begin{aligned}
\mathrm{MSE}(t) &\triangleq \mathbb{E}\left[\|\hat{\mathbf{g}}(t) - \mathbf{g}(t)\|^2\right] \\
&= \frac{1}{K^2} \mathbb{E}\left[\left\| \frac{\mathbf{y}(t)}{\sqrt{\eta(t)}} - \sum_{k \in \mathcal{K}} \mathbf{g}_k(t) \right\|^2\right] \\
&= \frac{1}{K^2} \left[ \sum_{d=1}^{D} \sigma_d^2(t) \sum_{k \in \mathcal{K}} \left( \frac{\sqrt{p_k(t)}\,|h_k(t)|}{\sqrt{\eta(t)}\,B_k(t)} - 1 \right)^2
+ \sum_{d=1}^{D} m_d^2(t) \left( \frac{1}{\sqrt{\eta(t)}} \sum_{k \in \mathcal{K}} \frac{\sqrt{p_k(t)}\,|h_k(t)|}{B_k(t)} - K \right)^2
+ \frac{D\sigma_n^2}{\eta(t)} \right],
\end{aligned} \tag{8}$$
where the expectation is over the distribution of the transmitted gradients $\mathbf{g}_k(t)$ and the received
noise $\mathbf{n}(t)$, and $m_d(t)$ and $\sigma_d^2(t)$ denote the mean (first-order statistics) and variance (second-order
statistics) of the $d$-th entry of gradient $\mathbf{g}(t)$ at iteration $t$, respectively. Notice that the gradient
norm $B_k(t)$ of each device $k$ can be transmitted to the edge server with negligible communication
cost, thus it is considered as a known value in (8).
Observing (8) closely, we find that the MSE consists of three components, representing the
individual misalignment error (the first term), the composite misalignment error (the second
term), and the noise-induced error (the third term), respectively. The individual misalignment
error is weighted by the gradient variance $\sum_{d=1}^{D} \sigma_d^2(t)$ while the composite misalignment error
is weighted by the squared gradient mean $\sum_{d=1}^{D} m_d^2(t)$. In the special case when the gradient has zero
mean, the MSE in (8) reduces to that in [17] where the composite misalignment error is absent.
Our objective is to minimize the MSE in (8), by jointly optimizing the transmit power $p_k(t)$ at all
devices and the denoising factor $\eta(t)$ at the edge server, subject to the individual power budget
in (6).
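The three-term structure of (8) can be evaluated directly. The sketch below is illustrative only; the function name `mse_terms` and the input values are assumptions, and the per-dimension statistics are taken as given.

```python
import math

def mse_terms(sigma2, m, p, h_abs, B, eta, sigma_n2):
    """Return the three terms of (8), each including the 1/K^2 factor.

    sigma2, m: per-dimension gradient variances and means (length D);
    p, h_abs, B: per-device powers, channel amplitudes, gradient norms (length K).
    """
    K, D = len(p), len(sigma2)
    # Weight of device k in the recovered gradient: sqrt(p_k)|h_k| / (sqrt(eta) B_k)
    w = [math.sqrt(p[k]) * h_abs[k] / (math.sqrt(eta) * B[k]) for k in range(K)]
    individual = sum(sigma2) * sum((wk - 1.0) ** 2 for wk in w) / K ** 2
    composite = sum(md * md for md in m) * (sum(w) - K) ** 2 / K ** 2
    noise = D * sigma_n2 / eta / K ** 2
    return individual, composite, noise

# Hypothetical values: D = 2, K = 2; device weights come out as 1 and 2.
individual, composite, noise = mse_terms(
    sigma2=[1.0, 3.0], m=[1.0, 1.0],
    p=[1.0, 4.0], h_abs=[1.0, 1.0], B=[1.0, 1.0],
    eta=1.0, sigma_n2=0.5,
)
```

In this example the misaligned second device contributes to both misalignment terms, while the noise term depends only on $\eta$, $D$, and $\sigma_n^2$.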
D. Gradient Statistics
In general, the individual misalignment error and the composite misalignment error in the MSE of
the gradient aggregation (8) cannot be minimized simultaneously due to the peak power budget
on each device. It is difficult to capture the tradeoff between the two errors by directly using
their respective weights, namely, the gradient variance and gradient mean. To tackle this issue,
we introduce two alternative parameters of gradient statistics in this subsection.
Fig. 2. Experimental results of MSN $\alpha(t)$ (left y-axis in linear scale) and SMCV $\beta(t)$ (right y-axis) over iterations
for dataset MNIST, where the number of edge devices is 10 and the local mini-batch size is 20.
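Let $\alpha(t)$ denote the mean squared norm (MSN) of $\mathbf{g}(t)$, i.e., $\mathbb{E}[\|\mathbf{g}(t)\|^2]$, which is given by
$$\alpha(t) = \sum_{d=1}^{D} \left( \sigma_d^2(t) + m_d^2(t) \right), \tag{9}$$
and let $\beta(t)$ denote the squared multivariate coefficient of variation (SMCV) of $\mathbf{g}(t)$, which is
given by
$$\beta(t) = \frac{\sum_{d=1}^{D} \sigma_d^2(t)}{\sum_{d=1}^{D} m_d^2(t)}. \tag{10}$$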
While $\alpha(t)$ measures the average absolute gradient values at iteration $t$, $\beta(t)$ is a measure of
relative dispersion of the gradient values at iteration $t$. Shortly, we shall show that the MSE
of the model aggregation is mainly dominated by $\beta(t)$, rather than $\alpha(t)$. Figs. 2-4 illustrate
the experimental results of the alternative gradient statistics $\alpha(t)$ and $\beta(t)$ of three datasets,
MNIST, CIFAR-10, and SVHN, respectively, where the gradients are updated ideally without
any transmission error. Both IID and non-IID partitions are considered for the training dataset.
Each value of $\alpha(t)$ and $\beta(t)$ is obtained by averaging over 300 model trainings. It is observed
that as the training time increases, the gradient MSN $\alpha(t)$ decreases while the gradient SMCV
$\beta(t)$ increases for all the three datasets. This result is in agreement with the intuition that the
absolute gradient values in SGD-based learning gradually vanish as the training iterations go
on, but their relative variation remains significant over each iteration. It is also observed that the
gradient SMCV $\beta(t)$ in non-IID partition is much larger than that in IID partition for all the
three datasets. This indicates that the gradient distribution with non-IID dataset partition is more
dispersive than that with IID dataset partition, as expected.
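As an illustration of definitions (9) and (10), the sketch below computes $\alpha$ and $\beta$ from a set of sampled gradient vectors using empirical per-dimension means and variances. This is an assumed offline estimator for illustration only (it is not the low-overhead estimation method the paper proposes for unknown statistics), and the name `gradient_statistics` is hypothetical.

```python
def gradient_statistics(samples):
    """Empirical MSN alpha (9) and SMCV beta (10) from gradient samples.

    samples: list of gradient vectors (each of length D) observed at one iteration.
    Uses the population variance per dimension; beta is undefined if all means are zero.
    """
    n, D = len(samples), len(samples[0])
    mean = [sum(s[d] for s in samples) / n for d in range(D)]
    var = [sum((s[d] - mean[d]) ** 2 for s in samples) / n for d in range(D)]
    sum_var = sum(var)
    sum_mean_sq = sum(mu * mu for mu in mean)
    alpha = sum_var + sum_mean_sq
    beta = sum_var / sum_mean_sq
    return alpha, beta

# Two hypothetical gradient samples in D = 2 dimensions:
# per-dimension means are [2, 0] and variances are [1, 0].
alpha, beta = gradient_statistics([[1.0, 0.0], [3.0, 0.0]])
```

Here `alpha` is 5 and `beta` is 0.25; the identities $\sum_d \sigma_d^2 = \beta\alpha/(\beta+1)$ and $\sum_d m_d^2 = \alpha/(\beta+1)$ recover the variance sum 1 and the squared-mean sum 4.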
By (9) and (10), the MSE in (8) can be rewritten as (omitting the constant coefficient $1/K^2$
for convenience)
$$\mathrm{MSE}(t) = \frac{\beta(t)\alpha(t)}{\beta(t)+1} \sum_{k \in \mathcal{K}} \left( \frac{\sqrt{p_k(t)}\,|h_k(t)|}{\sqrt{\eta(t)}\,B_k(t)} - 1 \right)^2
+ \frac{\alpha(t)}{\beta(t)+1} \left( \frac{1}{\sqrt{\eta(t)}} \sum_{k \in \mathcal{K}} \frac{\sqrt{p_k(t)}\,|h_k(t)|}{B_k(t)} - K \right)^2
+ \frac{D\sigma_n^2}{\eta(t)}. \tag{11}$$
Fig. 3. Experimental results of MSN $\alpha(t)$ (left y-axis in log scale) and SMCV $\beta(t)$ (right y-axis) over iterations
for dataset CIFAR-10, where the number of edge devices is 10 and the local mini-batch size is 200.
Fig. 4. Experimental results of MSN $\alpha(t)$ (left y-axis in log scale) and SMCV $\beta(t)$ (right y-axis) over iterations
for dataset SVHN, where the number of edge devices is 10 and the local mini-batch size is 300.
It is seen from (11) that while the gradient MSN $\alpha(t)$ appears linearly in the weights of
both individual and composite misalignment errors, the gradient SMCV $\beta(t)$ plays a more
distinguishing role in the MSE expression. In particular, when $\beta(t) \to 0$, which could be the
case when the model training just begins (as could be verified by Figs. 2-4), the individual signal
misalignment error can be neglected. When $\beta(t) \to \infty$, which could be the case when the model
training converges or during the middle of the training when the dataset is highly non-IID (as
could be verified by Figs. 2-4), the composite signal misalignment error disappears.
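The role of $\beta$ in (11) can be checked numerically. The following sketch evaluates (11) for given device weights $\sqrt{p_k}|h_k|/(\sqrt{\eta}B_k)$; the function name `mse_from_alpha_beta` and the test values are assumptions chosen for illustration.

```python
def mse_from_alpha_beta(alpha, beta, weights, eta, D, sigma_n2):
    """Evaluate (11) (without the 1/K^2 factor) for given per-device weights."""
    K = len(weights)
    individual = beta * alpha / (beta + 1.0) * sum((w - 1.0) ** 2 for w in weights)
    composite = alpha / (beta + 1.0) * (sum(weights) - K) ** 2
    noise = D * sigma_n2 / eta
    return individual + composite + noise

# Weights that are individually misaligned but sum to K = 2:
# only the individual term and the noise term contribute.
mse_balanced = mse_from_alpha_beta(6.0, 2.0, [0.5, 1.5], eta=4.0, D=2, sigma_n2=1.0)

# beta = 0: the individual misalignment term vanishes, leaving the
# composite term and the noise term.
mse_beta_zero = mse_from_alpha_beta(6.0, 0.0, [0.5, 1.0], eta=4.0, D=2, sigma_n2=1.0)
```

The first call gives 2.5 (individual term 2.0 plus noise 0.5), and the second gives 2.0 (composite term 1.5 plus noise 0.5), matching the two limiting behaviours described above.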
III. OPTIMAL POWER CONTROL WITH KNOWN GRADIENT STATISTICS
In this section, we formulate and solve the optimal power control problem for minimizing
the MSE when the gradient statistics $\alpha(t)$ and $\beta(t)$ are known. For convenience, we omit the iteration
index $t$ in this section. For each device $k \in \mathcal{K}$, we define its aggregation level with power $p$ and
denoising factor $\eta$ as
$$G_k(p, \eta) = \frac{\sqrt{p}\,|h_k|}{\sqrt{\eta}\,B_k}, \tag{12}$$
which indicates the weight of the gradient from device $k$ in the global gradient aggregation (7).¹
Furthermore, we define the aggregation capability of device $k$ as its aggregation level with peak
power $P_k$ and unit denoising factor $\eta = 1$, i.e., $C_k = G_k(P_k, 1) = \frac{\sqrt{P_k}\,|h_k|}{B_k}$. Without loss of
generality, we assume that
$$C_1 \le \ldots \le C_k \le \ldots \le C_K. \tag{13}$$
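A short sketch of definitions (12) and (13) follows. The function names and the device parameters are hypothetical; the sort simply enforces the ordering assumed in (13).

```python
import math

def aggregation_level(p, eta, h_abs, B):
    """G_k(p, eta) in (12): weight of device k's gradient in the aggregate."""
    return math.sqrt(p) * h_abs / (math.sqrt(eta) * B)

def aggregation_capability(P, h_abs, B):
    """C_k = G_k(P_k, 1): aggregation level at peak power and unit denoising factor."""
    return aggregation_level(P, 1.0, h_abs, B)

# Hypothetical devices: (peak power P_k, channel amplitude |h_k|, gradient norm B_k).
devices = [(1.0, 2.0, 1.0), (1.0, 0.5, 1.0), (1.0, 1.0, 1.0)]
capabilities = [aggregation_capability(P, h, B) for P, h, B in devices]

# Relabel devices so that C_1 <= ... <= C_K, as assumed in (13).
order = sorted(range(len(devices)), key=lambda k: capabilities[k])
sorted_capabilities = [capabilities[k] for k in order]
```

A device with a weak channel or a large gradient norm has a small $C_k$ and ends up first in the ordering.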
A. Power Control Problem for General Case
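In this subsection, we consider the optimal power control problem for MSE minimization in
the general case with arbitrary $\beta$. The problem is formulated as
$$\begin{aligned}
\mathcal{P}_1: \ \min_{\{p_k\},\, \eta} \quad & \frac{\beta\alpha}{\beta+1} \sum_{k \in \mathcal{K}} \left( G_k(p_k, \eta) - 1 \right)^2 + \frac{\alpha}{\beta+1} \left( \sum_{k \in \mathcal{K}} G_k(p_k, \eta) - K \right)^2 + \frac{D\sigma_n^2}{\eta} && \text{(14a)} \\
\text{s.t.} \quad & 0 \le p_k \le P_k, \ \forall k \in \mathcal{K} && \text{(14b)} \\
& \eta \ge 0. && \text{(14c)}
\end{aligned}$$
Different from the power control problem in [17], the objective function in (14a) contains not
only the individual misalignment error ($\frac{\beta\alpha}{\beta+1} \sum_{k \in \mathcal{K}} (G_k(p_k, \eta) - 1)^2$), but also the composite
misalignment error ($\frac{\alpha}{\beta+1} (\sum_{k \in \mathcal{K}} G_k(p_k, \eta) - K)^2$). Problem $\mathcal{P}_1$ is non-convex in general. Even
if the denoising factor $\eta$ is given, problem $\mathcal{P}_1$ is still hard to solve due to the coupling of the
transmit powers $p_k$. In the following, we present some properties of the optimal solution.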
Lemma 1: The optimal denoising factor $\eta^*$ for problem $\mathcal{P}_1$ satisfies $\eta^* \ge C_1^2$.
Proof: Please refer to Appendix A.
Lemma 1 reduces the range of $\eta$ and shows that the optimal transmit power of the first device
is $p_1^* = P_1$.
Lemma 2: The optimal power control policy satisfies $p_k^* = P_k$, $\forall k \in \{1, \ldots, l\}$, and
$p_k^* < P_k$, $\forall k \in \{l+1, \ldots, K\}$, for some $l \in \mathcal{K}$.
Proof: Please refer to Appendix B.
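Based on Lemma 2, solving problem $\mathcal{P}_1$ is equivalent to minimizing the objective function
in the following $K$ mutually exclusive subregions of the global power region, denoted as $\{\mathcal{M}_l\}_{l \in \mathcal{K}}$, and
comparing their corresponding optimal solutions to obtain the global optimal solution:
$$\mathcal{M}_l = \left\{ [p_1, \cdots, p_K] \in \mathbb{R}^K \,\middle|\, p_k = P_k, \forall k \in \{1, \ldots, l\}, \ 0 \le p_k < P_k, \forall k \in \{l+1, \ldots, K\} \right\}. \tag{15}$$
To facilitate the derivation, we denote $\tilde{\mathcal{M}}_l$ as a relaxed region of $\mathcal{M}_l$ obtained by removing the condition
$p_k < P_k$, for $k \in \{l+1, \ldots, K\}$, i.e.,
$$\tilde{\mathcal{M}}_l = \left\{ [p_1, \cdots, p_K] \in \mathbb{R}^K \,\middle|\, p_k = P_k, \forall k \in \{1, \ldots, l\}, \ p_k \ge 0, \forall k \in \{l+1, \ldots, K\} \right\}. \tag{16}$$
¹ The weight should be 1 for all devices in the ideal case.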
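For the sub-problem defined in each relaxed subregion $\tilde{\mathcal{M}}_l$, for $l \in \mathcal{K}$, taking the derivative of
the objective function (14a) w.r.t. $p_k$ and equating it to zero for all $k \in \{l+1, \ldots, K\}$, we obtain
the optimal transmit power $\tilde{p}^*_{l,k}$ at any given $\eta$ as
$$\tilde{p}^*_{l,k} = \left[ \frac{\beta + K - \sum_{i=1}^{l} G_i(P_i, \eta)}{\beta + K - l} \right]^2 \cdot \frac{B_k^2 \eta}{|h_k|^2}, \quad k \in \{l+1, \ldots, K\}. \tag{17}$$
Note that by such power control in (17), the aggregation level $G_k(\tilde{p}^*_{l,k}, \eta)$ of each device
$k \in \{l+1, \ldots, K\}$ is the same, given by
$$G_0(l) = \frac{\beta + K - \frac{1}{\sqrt{\eta}} \sum_{i=1}^{l} C_i}{\beta + K - l}. \tag{18}$$
Substituting (17) back into (14a) and letting its derivative w.r.t. $\eta$ be zero, we derive the optimal
denoising factor $\eta$ in closed form for problem $\mathcal{P}_1$ defined in the $l$-th relaxed subregion, i.e.,
$$\sqrt{\tilde{\eta}^*_l} = \frac{\frac{\beta\alpha}{\beta+1} \sum_{i=1}^{l} C_i^2 + \frac{\beta\alpha}{(\beta+K-l)(\beta+1)} \left( \sum_{i=1}^{l} C_i \right)^2 + D\sigma_n^2}{\frac{\beta(\beta+K)\alpha}{(\beta+K-l)(\beta+1)} \sum_{i=1}^{l} C_i}. \tag{19}$$
Note that $\tilde{p}^*_{l,k}$ may be not less than its power constraint $P_k$ for some $k \in \{l+1, \ldots, K\}$ and thus the
corresponding $\tilde{\mathbf{p}}^*_l$ may not lie in the subregion $\mathcal{M}_l$. If this happens, the optimal transmit power
of the sub-problem defined in subregion $\mathcal{M}_l$ is irrelevant and does not need to be considered.
This is given in the following lemma.
Lemma 3: For the power control problem $\mathcal{P}_1$ defined in the $l$-th relaxed subregion $\tilde{\mathcal{M}}_l$, if
$\exists k \in \{l+1, \ldots, K\}$ such that $\tilde{p}^*_{l,k} \ge P_k$, the global optimal power $\mathbf{p}^* \triangleq [p_1^*, \ldots, p_K^*]$ of problem
$\mathcal{P}_1$ must not be in $\mathcal{M}_l$.
Proof: Please refer to Appendix C.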
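The closed-form expressions (17)-(19) for one relaxed subregion can be evaluated as in the sketch below. It assumes $\beta > 0$, devices already sorted as in (13), and a 1-based index $l$; the function name `relaxed_solution` and the numerical values are illustrative assumptions.

```python
import math

def relaxed_solution(l, P, h_abs, B, alpha, beta, D, sigma_n2):
    """Evaluate (19), (18) and (17) on the l-th relaxed subregion (l is 1-based).

    Assumes beta > 0 and devices sorted so that C_1 <= ... <= C_K.
    Returns (eta, G0, p_free), where p_free lists the powers of devices l+1..K.
    """
    K = len(P)
    C = [math.sqrt(P[k]) * h_abs[k] / B[k] for k in range(K)]
    S1 = sum(C[:l])                       # sum of C_i over full-power devices
    S2 = sum(c * c for c in C[:l])        # sum of C_i^2 over full-power devices
    r = beta + K - l
    num = (beta * alpha / (beta + 1.0) * S2
           + beta * alpha / (r * (beta + 1.0)) * S1 ** 2
           + D * sigma_n2)
    den = beta * (beta + K) * alpha / (r * (beta + 1.0)) * S1
    sqrt_eta = num / den                  # eq. (19)
    eta = sqrt_eta ** 2
    G0 = (beta + K - S1 / sqrt_eta) / r   # eq. (18): common aggregation level
    p_free = [(G0 * B[k]) ** 2 * eta / h_abs[k] ** 2 for k in range(l, K)]  # eq. (17)
    return eta, G0, p_free

# Hypothetical example: K = 3, unit peak powers and gradient norms, so C = [0.5, 1, 2].
P, h_abs, B = [1.0, 1.0, 1.0], [0.5, 1.0, 2.0], [1.0, 1.0, 1.0]
eta, G0, p_free = relaxed_solution(1, P, h_abs, B, alpha=1.0, beta=1.0, D=1, sigma_n2=0.1)
# Lemma 3 check: the candidate is kept only if every free device is below its peak power.
is_valid = all(pk < Pk for pk, Pk in zip(p_free, P[1:]))
```

For these values the sketch gives $\tilde{\eta}^*_1 = 0.64$, a common aggregation level of 1.125, and free-device powers 0.81 and 0.2025, both below the unit peak power, so the candidate for $l = 1$ is retained.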
Lemma 3 shows that only the transmit power vectors $\tilde{\mathbf{p}}^*_l$ with elements satisfying
$\tilde{p}^*_{l,k} < P_k$, $\forall k \in \{l+1, \ldots, K\}$, are legal candidates of problem $\mathcal{P}_1$. Let $\mathcal{L}$ denote the index set of
the corresponding relaxed subregions. Note that $\mathcal{L}$ is non-empty because $\tilde{\mathbf{p}}^*_K$ is always a legal
transmit power candidate. Then, we only need to compare the legal candidate values to obtain
the minimum MSE:
$$l^* = \arg\min_{l \in \mathcal{L}} V_l, \tag{20}$$
where $V_l$ is the optimal value of (14a) in subregion $\mathcal{M}_l$ of $\mathcal{P}_1$ and can be easily obtained by
substituting (17) and (19) back into (14a). The optimal solution to problem $\mathcal{P}_1$ is given as follows.
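Theorem 1: The optimal transmit power at each device that solves problem $\mathcal{P}_1$ is given by
$$p_k^* = \begin{cases}
P_k, & \forall k \in \{1, \ldots, l^*\} \\
\left[ \dfrac{\beta + K - \sum_{i=1}^{l^*} G_i(P_i, \eta^*)}{\beta + K - l^*} \right]^2 \cdot \dfrac{B_k^2 \eta^*}{|h_k|^2}, & \forall k \in \{l^*+1, \ldots, K\},
\end{cases} \tag{21}$$
and the optimal denoising factor at the edge server is given by
$$\sqrt{\eta^*} = \frac{\frac{\beta\alpha}{\beta+1} \sum_{i=1}^{l^*} C_i^2 + \frac{\beta\alpha}{(\beta+K-l^*)(\beta+1)} \left( \sum_{i=1}^{l^*} C_i \right)^2 + D\sigma_n^2}{\frac{\beta(\beta+K)\alpha}{(\beta+K-l^*)(\beta+1)} \sum_{i=1}^{l^*} C_i}, \tag{22}$$
where $l^*$ is given in (20).
Proof: Please refer to Appendix D.
Remark 1: Theorem 1 shows that the devices $k \in \{1, \ldots, l^*\}$ with aggregation capability
not higher than that of device $l^*$ should transmit their gradients with full power, i.e., $p_k^* = P_k$,
while devices $k \in \{l^*+1, \ldots, K\}$ with aggregation capability higher than that of device $l^*$ should
transmit with the power such that they have the same aggregation level, given by
$$G_0^* = \frac{\beta + K - \sum_{i=1}^{l^*} G_i(P_i, \eta^*)}{\beta + K - l^*}, \tag{23}$$
somewhat analogous to channel inversion.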
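Putting Theorem 1 together with the candidate search in (20), the whole procedure can be sketched as follows. This is a minimal illustration under stated assumptions ($\beta > 0$, known $\alpha$, $\beta$, $|h_k|$, $B_k$); the function names and the example values are hypothetical, and the code is not taken from the authors' implementation.

```python
import math

def p1_objective(G, sqrt_eta, alpha, beta, D, sigma_n2):
    """Objective (14a), written in terms of the aggregation levels G_k."""
    K = len(G)
    individual = beta * alpha / (beta + 1.0) * sum((g - 1.0) ** 2 for g in G)
    composite = alpha / (beta + 1.0) * (sum(G) - K) ** 2
    return individual + composite + D * sigma_n2 / sqrt_eta ** 2

def optimal_power_control(P, h_abs, B, alpha, beta, D, sigma_n2):
    """Search l = 1..K as in (20); requires beta > 0. Returns (p*, eta*, l*)."""
    K = len(P)
    cap = [math.sqrt(P[k]) * h_abs[k] / B[k] for k in range(K)]
    order = sorted(range(K), key=lambda k: cap[k])     # enforce ordering (13)
    C = [cap[k] for k in order]
    best = None
    for l in range(1, K + 1):
        S1 = sum(C[:l])
        S2 = sum(c * c for c in C[:l])
        r = beta + K - l
        num = (beta * alpha / (beta + 1.0) * S2
               + beta * alpha / (r * (beta + 1.0)) * S1 ** 2 + D * sigma_n2)
        den = beta * (beta + K) * alpha / (r * (beta + 1.0)) * S1
        sqrt_eta = num / den                           # (22) evaluated for this l
        G0 = (beta + K - S1 / sqrt_eta) / r            # (23) evaluated for this l
        # Lemma 3: keep the candidate only if all free devices stay below peak power,
        # i.e. the common level G0 is below the smallest full-power level among them.
        if l < K and not (0.0 <= G0 < C[l] / sqrt_eta):
            continue
        G = [c / sqrt_eta for c in C[:l]] + [G0] * (K - l)
        V = p1_objective(G, sqrt_eta, alpha, beta, D, sigma_n2)
        if best is None or V < best[0]:
            best = (V, l, sqrt_eta, G0)
    _, l_star, sqrt_eta, G0 = best
    eta = sqrt_eta ** 2
    p = [0.0] * K
    for rank, k in enumerate(order):                   # map back to the original labels
        p[k] = P[k] if rank < l_star else (G0 * B[k]) ** 2 * eta / h_abs[k] ** 2   # (21)
    return p, eta, l_star

# Hypothetical example with unsorted devices: capabilities are [2.0, 0.5, 1.0].
p_opt, eta_opt, l_star = optimal_power_control(
    P=[1.0, 1.0, 1.0], h_abs=[2.0, 0.5, 1.0], B=[1.0, 1.0, 1.0],
    alpha=1.0, beta=1.0, D=1, sigma_n2=0.1,
)
```

In this example only the weakest device (the second one) transmits at full power, and the other two back off to powers 0.81 and 0.2025 so that their aggregation levels coincide. The candidate $l = K$ is always evaluated, which mirrors the observation that the candidate set in (20) is never empty.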
B. On The Optimal Transmit Power Function
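In this subsection, we analyse the optimal transmit power $\mathbf{p}^*$ as a vector function of the
gradient SMCV $\beta$ and the noise variance $\sigma_n^2$, i.e., $\mathbf{p}^*(\beta, \sigma_n^2) = [p_1^*(\beta, \sigma_n^2), \ldots, p_K^*(\beta, \sigma_n^2)]$. Note
that the vector function $\mathbf{p}^*(\beta, \sigma_n^2)$ might have abrupt changes at some $(\beta, \sigma_n^2)$ due to the discrete
nature of the optimal device threshold $l^*$ in Theorem 1. In this work, however, we can show
that $\mathbf{p}^*(\beta, \sigma_n^2)$ is continuous² everywhere for all $(\beta \ge 0, \sigma_n^2 \ge 0)$. Define $\mathcal{X}_l \subset \mathbb{R}^2$ as the domain
of the vector function $\mathbf{p}^*(\beta, \sigma_n^2)$ when $\mathbf{p}^*(\beta, \sigma_n^2)$ lies in the subregion $\mathcal{M}_l$, for $l \in \mathcal{K}$. That is,
$\forall (\beta, \sigma_n^2) \in \mathcal{X}_l$, one has $l^*(\beta, \sigma_n^2) = l$.
We first show that $\mathbf{p}^*(\beta, \sigma_n^2)$ is continuous and monotonic within each domain $\mathcal{X}_l$, $\forall l \in \mathcal{K}$. Let
$\tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)$ denote the optimal power of the sub-problem defined in the $l$-th relaxed subregion $\tilde{\mathcal{M}}_l$,
$\forall l \in \mathcal{K}$. Based on (17) and (19), it is obvious that $\tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)$ is continuous within each domain
$\mathcal{X}_l$. Moreover, $\tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)$ increases monotonically with the noise variance $\sigma_n^2$ since enlarging $\sigma_n^2$
increases both $\tilde{\eta}^*_l$ and $\tilde{p}^*_{l,k}$, for $k \in \{l+1, \ldots, K\}$. Furthermore, $\tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)$ decreases monotonically
with the gradient SMCV $\beta$ since it can be easily shown that $\frac{\partial \tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)}{\partial \beta}$ is always negative. Thus
$\tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)$ is monotonic within each domain $\mathcal{X}_l$, $\forall l \in \mathcal{K}$. By definition, within each domain
$\mathcal{X}_l$, $\mathbf{p}^*(\beta, \sigma_n^2)$ is equal to $\tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)$; thus, the vector function $\mathbf{p}^*(\beta, \sigma_n^2)$ is also continuous and
monotonic within each domain $\mathcal{X}_l$, for $l \in \mathcal{K}$.
Then we find the boundary of each domain $\mathcal{X}_l$, for $l \in \mathcal{K}$, and the corresponding optimal
transmit power $\mathbf{p}^*$. To this end, we need the following lemma on the lower and upper bounds
of the optimal transmit power values.
Lemma 4: The optimal transmit power $\mathbf{p}^*$ lies in subregion $\mathcal{M}_l$ if and only if the optimal
transmit power $\tilde{\mathbf{p}}^*_l$ in the $l$-th relaxed subregion $\tilde{\mathcal{M}}_l$ satisfies
$\left( \frac{C_l B_k}{|h_k|} \right)^2 \le \tilde{p}^*_{l,k} < \left( \frac{C_{l+1} B_k}{|h_k|} \right)^2$, $\forall k \in \{l+1, \ldots, K\}$.
Proof: Please refer to Appendix E.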
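Lemma 4 shows that in each domain $\mathcal{X}_l$, $l \in \mathcal{K}$, the range of $\mathbf{p}^*(\beta, \sigma_n^2)$ consists of the left-closed and
right-open intervals $\left[ \left( \frac{C_l B_k}{|h_k|} \right)^2, \left( \frac{C_{l+1} B_k}{|h_k|} \right)^2 \right)$, $\forall k \in \{l+1, \ldots, K\}$. Recall that $\tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)$ is continuous
and monotonic. The optimal transmit power $\mathbf{p}^*(\beta, \sigma_n^2) = \tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)$ sits on the lower
bound of the range when $(\beta, \sigma_n^2)$ is at the lower boundary of domain $\mathcal{X}_l$, denoted as
$\mathcal{L}_l \triangleq \left\{ (\beta, \sigma_n^2) \,\middle|\, \tilde{p}^*_{l,k}(\beta, \sigma_n^2) = \left( \frac{C_l B_k}{|h_k|} \right)^2, \forall k \in \{l+1, \ldots, K\} \right\}$,
and $\mathbf{p}^*(\beta, \sigma_n^2) = \tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)$ approaches the
upper bound of the range when $(\beta, \sigma_n^2)$ approaches the upper boundary of domain $\mathcal{X}_l$, denoted
as $\mathcal{U}_l \triangleq \left\{ (\beta, \sigma_n^2) \,\middle|\, \tilde{p}^*_{l,k}(\beta, \sigma_n^2) = \left( \frac{C_{l+1} B_k}{|h_k|} \right)^2, \forall k \in \{l+1, \ldots, K\} \right\}$.
² $\mathbf{p}^*(\beta, \sigma_n^2)$ is a continuous vector-valued function if and only if each element $p_k^*(\beta, \sigma_n^2)$, for $k \in \{1, \ldots, K\}$, is a continuous
function.
Fig. 5. Illustration of the optimal transmit power $\mathbf{p}^*$ as a function of gradient SMCV $\beta$ (y-axis: transmit power).
(a) Average received SNR = 0 dB. (b) Average received SNR = 10 dB.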
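Next, we consider the continuity of $\mathbf{p}^*(\beta, \sigma_n^2)$ at boundaries $\mathcal{L}_l$ and $\mathcal{U}_l$ for each $l \in \mathcal{K}$ in the
following lemma.
Lemma 5: The optimal transmit power function $\mathbf{p}^*(\beta, \sigma_n^2)$ is continuous at $\mathcal{U}_l = \mathcal{L}_{l+1}$, for
$l \in \{1, \ldots, K-1\}$.
Proof: Please refer to Appendix F.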
Finally, we can formally conclude the following property of the optimal transmit power
function with respect to the gradient statistics and noise variance.
Property 1: The optimal transmit power $\mathbf{p}^*(\beta, \sigma_n^2)$ of problem $\mathcal{P}_1$ is a continuous and monotonic
vector function in the entire domain of $(\beta \ge 0, \sigma_n^2 \ge 0)$. Moreover, it decreases with $\beta$
and increases with $\sigma_n^2$.
Take a system with $K = 6$ devices for illustration. Fig. 5 shows the optimal transmit power of each device, $p^*_k$, as a function of the gradient SMCV $\beta$, together with the corresponding optimal index $l^*$. The results are based on one channel realization per device, $|h_k|$, drawn independently from a normalized Rayleigh distribution and given by [0.50, 0.82, 0.85, 1.16, 2.09, 2.83]. The peak power constraint is the same for all devices, resulting in the same average received SNR, given by $P_k/\sigma_n^2 = 5$ dB and 10 dB, respectively. The gradient norms of the devices, $B_k$, are [0.23, 0.31, 0.26, 0.28, 0.28, 0.16], the noise variance is $\sigma_n^2 = 1$, and $D = 1$. Fig. 5 clearly verifies that the optimal transmit power $\mathbf{p}^*(\beta, \sigma_n^2)$ is a continuous and monotonically decreasing vector function of the gradient SMCV $\beta$. In particular, when $\beta \to 0$, all the devices transmit with full power; as $\beta$ increases, the power of the devices with large aggregation capability decreases and then gradually approaches a constant, and the larger the aggregation capability, the smaller this constant transmit power. In addition, Fig. 5 shows that when the peak power budget increases, more devices transmit with less power than their peak values. Equivalently, the optimal transmit power decreases when the noise variance decreases.
C. Power Control Problem for Special Cases
In this subsection, we discuss the optimal power control policy in the two special cases where the gradient SMCV $\beta \to \infty$ and $\beta \to 0$, respectively.
1) $\beta \to \infty$: As discussed in Section II-D, this case may happen when the model training converges and/or the dataset is highly non-IID. In this case, the composite signal misalignment error in the MSE expression disappears. Thus problem P1 reduces to the power control problem in [17]. To be self-contained, we restate the problem and the solution as follows:
$$\text{P2:} \quad \min \ \alpha \sum_{k \in \mathcal{K}} \left(G_k(p_k, \eta) - 1\right)^2 + \frac{D\sigma_n^2}{\eta} \tag{24a}$$
$$\text{s.t.} \quad 0 \le p_k \le P_k, \ \forall k \in \mathcal{K} \tag{24b}$$
$$\eta \ge 0. \tag{24c}$$
Theorem 2 ([17]): The optimal transmit power that solves problem P2 has a threshold-based structure, i.e.,
$$p^*_k = \begin{cases} P_k, & \forall k \in \{1, ..., l^*\} \\ \dfrac{B_k^2 \eta^*}{|h_k|^2}, & \forall k \in \{l^*+1, ..., K\}, \end{cases} \tag{25}$$
where the optimal denoising factor is given as
$$\eta^* = \left(\frac{\alpha \sum_{i=1}^{l^*} C_i^2 + D\sigma_n^2}{\alpha \sum_{i=1}^{l^*} C_i}\right)^2, \tag{26}$$
and $l^*$ is given in (20). Furthermore, it holds that $C_{l^*}^2 \le \eta^* \le C_{l^*+1}^2$.
Proof: Please refer to [17, Theorem 1].

Note that the optimal denoising factor $\eta^*$ for problem P2 can be interpreted as a threshold, since whether a device transmits with full power or not depends entirely on the comparison between its aggregation capability $C_k$ and $\eta^*$. Moreover, this threshold increases with $\sigma_n^2$. In the extreme case when the noise power is very large, and so is the threshold, all the devices shall transmit with full power. Nevertheless, this threshold interpretation of $\eta^*$ does not hold for problem P1 with general $\beta$.
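The threshold structure of (25) and (26) can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in this subsection: the signal alignment factor is taken as $G_k = \sqrt{p_k}|h_k| / (\sqrt{\eta} B_k)$ and the aggregation capability as $C_k = \sqrt{P_k}|h_k| / B_k$ (both inferred from the appendix expressions), devices are sorted by increasing $C_k$, and the index $l^*$ is found by exhaustive search over $l$ instead of the rule in (20). The function names are hypothetical.

```python
import numpy as np

def eta_for_l(C, l, alpha, D, sigma2):
    """Denoising factor of (26) when the first l devices use full power."""
    s1 = np.sum(C[:l])        # sum of C_i for i = 1..l
    s2 = np.sum(C[:l] ** 2)   # sum of C_i^2 for i = 1..l
    return ((alpha * s2 + D * sigma2) / (alpha * s1)) ** 2

def mse_p2(p, eta, h, B, alpha, D, sigma2):
    """Objective (24a), with G_k = sqrt(p_k)|h_k| / (sqrt(eta) B_k) (assumed)."""
    G = np.sqrt(p) * h / (np.sqrt(eta) * B)
    return alpha * np.sum((G - 1.0) ** 2) + D * sigma2 / eta

def threshold_power_control(P, h, B, alpha, D, sigma2):
    """Try every l, build the policy of (25)-(26), keep the lowest objective.

    Exhaustive search replaces the index rule (20), which is an assumption
    made only for this illustration.
    """
    C = np.sqrt(P) * h / B
    assert np.all(np.diff(C) >= 0), "devices must be sorted by increasing C_k"
    best = None
    for l in range(1, len(C) + 1):
        eta = eta_for_l(C, l, alpha, D, sigma2)
        p = P.copy()
        # Channel inversion for devices k > l, clipped at the peak power.
        p[l:] = np.minimum(P[l:], B[l:] ** 2 * eta / h[l:] ** 2)
        val = mse_p2(p, eta, h, B, alpha, D, sigma2)
        if best is None or val < best[0]:
            best = (val, l, eta, p)
    return best  # (objective value, l, eta, power vector)

# Channel gains and gradient norms quoted for Fig. 5; the peak power is illustrative.
h = np.array([0.50, 0.82, 0.85, 1.16, 2.09, 2.83])
B = np.array([0.23, 0.31, 0.26, 0.28, 0.28, 0.16])
P = np.full(6, 10.0)
value, l_best, eta_best, p_best = threshold_power_control(P, h, B, 1.0, 1, 1.0)
```

For a fixed $l$, `eta_for_l` grows with `sigma2`, which matches the remark that the threshold increases with the noise variance.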
2) $\beta \to 0$: As discussed in Section II-D, $\beta \to 0$ could happen when the model training just begins. In this case, the individual signal misalignment error disappears in the MSE expression. The original problem P1 reduces to
$$\text{P3:} \quad \min \ \alpha \left(\sum_{k \in \mathcal{K}} G_k(p_k, \eta) - K\right)^2 + \frac{D\sigma_n^2}{\eta} \tag{27a}$$
$$\text{s.t.} \quad 0 \le p_k \le P_k, \ \forall k \in \mathcal{K} \tag{27b}$$
$$\eta \ge 0. \tag{27c}$$

Theorem 3: The optimal transmit power that solves problem P3 is full power transmission, i.e.,
$$p^*_k = P_k, \ \forall k \in \mathcal{K}, \tag{28}$$
and the optimal denoising factor is given by
$$\eta^* = \left(\frac{\alpha \left(\sum_{i \in \mathcal{K}} C_i\right)^2 + D\sigma_n^2}{\alpha K \sum_{i \in \mathcal{K}} C_i}\right)^2. \tag{29}$$
Proof: Please refer to Appendix G.

Remark 2: The optimal solution of problem P3 is the special case of the solution of problem P1 with $l^* = K$. Note that the direction of the gradient vector received from each device at the edge server is independent of the power of the transmitting device. Thus, increasing the power of all devices can reduce the noise-induced error when the composite signal misalignment error is fixed.
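As a quick numerical check of (29), the following Python sketch evaluates the objective (27a) at full power and compares the closed-form denoising factor against nearby values. It assumes that at full power $G_k = C_k / \sqrt{\eta}$, which follows from the alignment factor inferred from the appendix expressions; the function names and the example values of $C_k$ are hypothetical.

```python
import numpy as np

def eta_full_power(C, alpha, D, sigma2):
    """Closed-form denoising factor of (29) under full power transmission."""
    K = len(C)
    s = np.sum(C)
    return ((alpha * s ** 2 + D * sigma2) / (alpha * K * s)) ** 2

def mse_p3(eta, C, alpha, D, sigma2):
    """Objective (27a) at full power, assuming G_k = C_k / sqrt(eta)."""
    K = len(C)
    return alpha * (np.sum(C) / np.sqrt(eta) - K) ** 2 + D * sigma2 / eta

# Illustrative aggregation capabilities (not taken from the paper).
C = np.array([2.0, 3.0, 4.5, 7.0])
eta_star = eta_full_power(C, alpha=1.0, D=1, sigma2=1.0)

# The closed form should not be beaten by any value on a grid around it.
grid = eta_star * np.linspace(0.5, 2.0, 31)
gap = min(mse_p3(e, C, 1.0, 1, 1.0) for e in grid) - mse_p3(eta_star, C, 1.0, 1, 1.0)
```

Because the objective is quadratic in $1/\sqrt{\eta}$, `gap` is non-negative up to floating-point error.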
IV. ADAPTIVE POWER CONTROL WITH UNKNOWN GRADIENT STATISTICS
In this section, we consider the practical scenario where the gradient statistics $\alpha(t)$ and $\beta(t)$ are unknown. We propose a method to estimate $\alpha(t)$ and $\beta(t)$ in each time block and then devise an adaptive power control scheme based on the optimal solution of problem P1 by using the estimated $\alpha(t)$ and $\beta(t)$.
A. Parameters Estimation
In this subsection, we propose a method to estimate $\alpha(t)$ and $\beta(t)$ at each time block $t$ directly based on their definitions in (9) and (10), respectively.
1) Estimation of $\alpha(t)$: Note that the instantaneous gradient norm of each device, $B_k(t)$, is assumed to be available at the edge server with negligible cost. By definition (9), we can estimate the gradient MSN as
$$\hat{\alpha}(t) = \frac{1}{K} \sum_{k \in \mathcal{K}} B_k^2(t). \tag{30}$$
2) Estimation of $\beta(t)$: By the definition in (10), the gradient SMCV $\beta(t)$ depends on $m_d(t)$ and $\sigma_d(t)$. Since these are not available before the devices send their gradients at time block $t$, $\beta(t)$ cannot be estimated from the gradients of the same block. However, from the experimental results on real datasets in Figs. 2-4, it can be observed that $\beta(t)$ changes slowly over iterations $t$. Thus we propose to estimate $\beta(t)$ based on the aggregated gradient at time block $t-1$ as
$$\hat{\beta}(t) = \frac{\hat{\alpha}(t-1) - \sum_{d=1}^{D} \hat{g}_d^2(t-1)}{\sum_{d=1}^{D} \hat{g}_d^2(t-1)}, \tag{31}$$
where $\hat{\alpha}(t-1)$ estimates $\sum_d \left(\sigma_d^2(t-1) + m_d^2(t-1)\right)$ and $\sum_d \hat{g}_d^2(t-1)$ estimates $\sum_d m_d^2(t-1)$.
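The two estimators in (30) and (31) amount to a few lines of array arithmetic. The following Python sketch is an illustration only; the function names are hypothetical, and the inputs are assumed to be the reported gradient norms $B_k(t)$ and the aggregated gradient $\hat{g}(t-1)$ recovered in the previous time block.

```python
import numpy as np

def estimate_alpha(B):
    """Gradient MSN estimate of (30): mean of the squared gradient norms."""
    B = np.asarray(B, dtype=float)
    return float(np.mean(B ** 2))

def estimate_beta(alpha_prev, g_hat_prev):
    """Gradient SMCV estimate of (31) from the previous time block.

    alpha_prev is the estimate of (30) at block t-1 and g_hat_prev is the
    aggregated gradient recovered at block t-1.
    """
    m2 = float(np.sum(np.asarray(g_hat_prev, dtype=float) ** 2))
    return (alpha_prev - m2) / m2

alpha_hat = estimate_alpha([3.0, 4.0])             # (9 + 16) / 2 = 12.5
beta_hat = estimate_beta(alpha_hat, [1.0, 2.0])    # (12.5 - 5) / 5 = 1.5
```

With a noisy aggregated gradient the ratio in (31) can become negative; clipping the result at zero would be one possible safeguard, although the text does not specify it.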
B. FL with Adaptive Power Control
In this subsection, we propose the FL process with adaptive power control, which is presented in Algorithm 1. The algorithm has three steps. First, each device locally takes one step of SGD on the current model using its local dataset (line 5). After that, each device calculates the norm of its local gradient and uploads it to the edge server with conventional digital transmission (lines 6 and 7). Second, the edge server estimates the parameters $\alpha(t)$ and $\beta(t)$ based on the gradient norms received at time block $t$ and the historical aggregated gradient (lines 9 and 16). Then the optimal transmit power and denoising factor are obtained based on (21) and (22), respectively (line 10). Third, the edge server informs each device of its optimal transmit power, and the devices simultaneously transmit their local gradients with the assigned power to the edge server using AirComp in an analog manner (lines 11-14).
Algorithm 1 FL Process with Adaptive Power Control
1: Initialize $w(0)$ in edge server, $\hat{\beta}(1)$;
2: for time block $t = 1, ..., T$ do
3:   Edge server broadcasts the global model $w(t)$ to all edge devices $k \in \mathcal{K}$;
4:   for each device $k \in \mathcal{K}$ in parallel do
5:     $g_k(t) = \nabla L^{\mathrm{SGD}}_{k,t}(w(t))$;
6:     $B_k(t) = \sqrt{\sum_d g_{k,d}^2(t)}$;
7:     Upload $B_k(t)$ to edge server;
8:   end for
9:   Edge server estimates $\hat{\alpha}(t)$ based on (30);
10:  Edge server obtains the optimal transmit power $\mathbf{p}^*(t)$ based on (21) and the optimal denoising factor $\eta^*(t)$ based on (22);
11:  Edge server sends $p^*_k(t)$ to device $k$ for all $k \in \mathcal{K}$;
12:  for each device $k \in \mathcal{K}$ in parallel do
13:    Transmit gradient $g_k(t)$ with power $p^*_k(t)$ to edge server using AirComp;
14:  end for
15:  Edge server receives $y(t)$ and recovers $\hat{g}(t)$ based on (7);
16:  Edge server estimates $\hat{\beta}(t+1)$ based on (31);
17:  Edge server updates global model $w(t+1) = w(t) - \gamma \hat{g}(t)$;
18: end for
19: Edge server returns $w(T+1)$;
V. EXPERIMENTAL RESULTS
In this section, we provide experimental results to validate the performance of the proposed
power control for AirComp-based FL over fading channels.
A. Experiment Setup
We conducted experiments in a simulated environment where the number of edge devices is $K = 10$ unless specified otherwise. The wireless channels from each device to the edge server follow IID Rayleigh fading, such that the $h_k$'s are modeled as IID complex Gaussian variables with zero mean and unit variance. For each device $k \in \mathcal{K}$, we define $\mathrm{SNR}_k = \mathbb{E}\left[\frac{P_k |h_k|^2}{\sigma_n^2}\right] = \frac{P_k}{\sigma_n^2}$ as the average received SNR.
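The channel model and the SNR definition above can be sketched in Python as follows. This is an illustration under stated assumptions: the function names are hypothetical, the noise variance defaults to 1, and the unit-variance complex Gaussian is generated with independent real and imaginary parts of variance 1/2 each.

```python
import numpy as np

def peak_power_from_snr_db(snr_db, sigma2=1.0):
    """Peak power P_k such that P_k / sigma2 equals the target average SNR."""
    return sigma2 * 10.0 ** (snr_db / 10.0)

def sample_rayleigh_channels(K, rng):
    """Draw K IID CN(0, 1) channel coefficients, so that E[|h_k|^2] = 1."""
    return (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2.0)

rng = np.random.default_rng(0)
h = sample_rayleigh_channels(10, rng)       # one realization for K = 10 devices
P = peak_power_from_snr_db(10.0)            # peak power for a 10 dB average SNR
instantaneous_snr = P * np.abs(h) ** 2      # per-device received SNR (sigma2 = 1)
```

Since $\mathbb{E}[|h_k|^2] = 1$, the expectation of `instantaneous_snr` equals $P_k/\sigma_n^2$, which is the average received SNR defined in the text.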
1) Baselines: We compare the proposed power control scheme with the following baseline approaches:
• Error-free transmission: the aggregated gradient is updated without any transmission error, which is equivalent to the centralized SGD algorithm.
• Threshold-based power control in [17]: this is the power control scheme given in [17], which assumes that signals are normalized. Note that it is actually the special case of our proposed power control scheme with $\beta \to \infty$, obtained by considering the individual misalignment error only in problem P1.
• Full power transmission: all devices transmit with full power $P_k$ and the edge server applies the optimal denoising factor in (22), where $l^* = K$.
2) Datasets: We evaluate the training of a convolutional neural network on three datasets: MNIST, CIFAR-10 and SVHN. The MNIST dataset consists of 10 categories ranging from digit 0 to 9 and a total of 70000 labeled data samples (60000 for training and 10000 for testing). The CIFAR-10 dataset includes 60000 color images (50000 for training and 10000 for testing) of 10 different types of objects. SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting, which includes 99289 labeled data samples (73257 for training and 26032 for testing).
3) Data Distribution: To study the impact of the gradient SMCV $\beta$ on the optimal transmit power, we simulate two types of dataset partitions among the mobile devices, i.e., the IID setting and the non-IID one. For the former, we randomly partition the training samples into 100 equal shards, each of which is assigned to one particular device. For the latter, we first sort the data by digit label, divide it into 200 equal shards, and randomly assign 2 shards to each device.
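The two partitions can be sketched in Python as below. This is a minimal illustration, not the authors' code: the function names and the toy label vector are hypothetical, the IID variant simply splits a random permutation into one part per device, and the non-IID variant follows the sort-by-label, split-into-shards, assign-a-few-shards-per-device recipe described above.

```python
import numpy as np

def partition_iid(num_samples, num_devices, rng):
    """Shuffle the sample indices and split them into near-equal parts."""
    idx = rng.permutation(num_samples)
    return np.array_split(idx, num_devices)

def partition_non_iid(labels, num_devices, shards_per_device, num_shards, rng):
    """Sort by label, cut into shards, give each device a few random shards."""
    order = np.argsort(labels, kind="stable")   # sample indices sorted by label
    shards = np.array_split(order, num_shards)
    chosen = rng.permutation(num_shards)[: num_devices * shards_per_device]
    parts = []
    for d in range(num_devices):
        own = chosen[d * shards_per_device:(d + 1) * shards_per_device]
        parts.append(np.concatenate([shards[s] for s in own]))
    return parts

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 100)          # toy dataset: 10 classes x 100 samples
iid_parts = partition_iid(len(labels), 10, rng)
non_iid_parts = partition_non_iid(labels, 10, 2, 200, rng)
```

In the toy example each shard holds samples of a single class, so every device in the non-IID partition sees at most two classes, which is what drives the larger gradient SMCV in that setting.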
4) Training and Control Parameters: In all our experiments, the number of local update steps
between two global aggregations is 1. Local batch size of each edge device is 10. The gradient
descent step size is γ= 0.01.
B. Experimental Results
Fig. 6 compares the test accuracy for the three considered datasets with IID and non-IID dataset partitions, respectively, where the average received SNR is set to 10 dB for all devices. It is observed that the model accuracy of the proposed scheme is better than that of threshold-based power control in [17] and full power transmission. From Figs. 2-4, we know that the averaged gradient SMCV $\beta(t)$ in the IID dataset partition is less than that in the non-IID dataset partition, and that the gradient SMCV $\beta(t)$ increases over iterations. Threshold-based power control in [17] has a significant accuracy loss in the IID partition or at the beginning of training. This is because in this case, the gradient SMCV is small and thus the MSE is dominated by the composite misalignment error. As a result, the threshold-based power control scheme in [17], which considers the individual misalignment error only, is much inferior. Besides, full power transmission has a significant accuracy loss in the non-IID partition or at the end of training. This is because the gradient SMCV is large and therefore the full power transmission scheme fails to minimize the individual misalignment error that dominates the MSE in this case.
Fig. 7 illustrates the test accuracy for MNIST with non-IID data partition at the average received SNR = 5 dB. It is observed that the overall performance of the proposed scheme is still better than that of the two baseline approaches in the low SNR region. Specifically, full power transmission performs better than the threshold-based power control scheme in [17]. This is mainly because when the noise variance is large, full power transmission can strongly suppress the noise error that dominates the MSE.
Finally, Fig. 8 compares the test accuracy of different power control schemes for a varying number of devices $K$. Here, the MNIST dataset with non-IID partition is used, the average received SNR of all the devices is set as $\mathrm{SNR}_k = 10$ dB, and the results are averaged over 50 model trainings. First, it is observed that the test accuracy achieved by all four schemes increases as $K$ increases, due to the fact that the edge server can aggregate more data for averaging. Second, the proposed scheme considerably outperforms both threshold-based power control in [17] and full power transmission throughout the whole regime of $K$. Full power transmission approaches threshold-based power control in [17] when $K$ is small (i.e., $K = 4$ in Fig. 8), but its performance degrades as $K$ increases, due to the lack of power adaptation to reduce the misalignment error.

[Fig. 6 panels, each plotting test accuracy versus training rounds for the threshold-based scheme in [17], full power transmission, the proposed scheme, and error-free transmission: (a) MNIST dataset with IID partition; (b) MNIST dataset with non-IID partition; (c) CIFAR-10 dataset with IID partition; (d) CIFAR-10 dataset with non-IID partition; (e) SVHN dataset with IID partition; (f) SVHN dataset with non-IID partition.]
Fig. 6. Performance comparison on different dataset partitions, for $K = 10$ and $\mathrm{SNR}_k = 10$ dB, $\forall k \in \mathcal{K}$.
[Fig. 7 plots test accuracy versus training rounds for the threshold-based scheme in [17], full power transmission, the proposed scheme, and error-free transmission.]
Fig. 7. Performance comparison for the MNIST dataset with non-IID partition at the average received SNR = 5 dB.

[Fig. 8 plots test accuracy versus the number of devices $K$ for the same four schemes.]
Fig. 8. Performance comparison over the number of devices, where the MNIST dataset is non-IID partitioned and $\mathrm{SNR}_k = 10$ dB, $\forall k \in \mathcal{K}$.

VI. CONCLUSION
This work studied the power control optimization problem for over-the-air federated learning over fading channels by taking the gradient statistics into account. The optimal power control policy is derived in closed form when the first- and second-order gradient statistics are known. It is shown that the optimal transmit power on each device decreases with the gradient SMCV and increases with the noise variance. In the special cases where $\beta$ approaches infinity and zero, the optimal transmit power reduces to threshold-based power control in [17] and full power transmission, respectively. We propose an adaptive power control algorithm that dynamically adjusts the transmit power in each iteration based on the estimation results. Experimental results show that our proposed adaptive power control scheme outperforms the existing schemes.
APPENDIX A
PROOF OF LEMMA 1
We prove this lemma by contradiction. Suppose the optimal denoising factor satisfies $\eta^* \le C_1^2$. Both the individual and composite signal misalignment errors can be forced to zero by letting $p_k = \frac{\eta B_k^2}{|h_k|^2}$, $\forall k \in \mathcal{K}$. The problem P1 can thus be expressed as
$$\min_{\eta \le C_1^2} \ \frac{D\sigma_n^2}{\eta}. \tag{32}$$
It is obvious that the optimal solution of the above problem is $\eta^* = C_1^2$. Thus, it must hold that $\eta^* \ge C_1^2$ for problem P1.
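APPENDIX B
PROOF OF LEMMA 2
To prove this lemma, we need to prove that if $p^*_{k_2} = P_{k_2}$ for some device $k_2$, we shall have $p^*_{k_1} = P_{k_1}$, $\forall k_1 < k_2$. We prove this by contradiction. Let $\mathbf{p}^* = [p^*_1, ..., p^*_K]$ denote the optimal transmit power to problem P1. We assume that there are two devices $k_1 < k_2$ satisfying $p^*_{k_1} < P_{k_1}$ and $p^*_{k_2} = P_{k_2}$. Then there always exists a modified transmit power $\mathbf{p}' = [p^*_1, ..., p^*_{k_1-1}, p'_{k_1}, p^*_{k_1+1}, ..., p^*_{k_2-1}, p'_{k_2}, p^*_{k_2+1}, ..., p^*_K]$, where $p'_{k_1} = P_{k_1}$ and $p'_{k_2} < P_{k_2}$, satisfying
$$\frac{\sqrt{p^*_{k_1}}|h_{k_1}|}{\sqrt{\eta}B_{k_1}} + \frac{\sqrt{p^*_{k_2}}|h_{k_2}|}{\sqrt{\eta}B_{k_2}} = \frac{\sqrt{p'_{k_1}}|h_{k_1}|}{\sqrt{\eta}B_{k_1}} + \frac{\sqrt{p'_{k_2}}|h_{k_2}|}{\sqrt{\eta}B_{k_2}}.$$
The resulting MSE obtained by $\mathbf{p}'$ only differs from the minimum MSE by $\mathbf{p}^*$ in the individual misalignment error term. The difference is given by
$$\mathrm{MSE}(\mathbf{p}^*) - \mathrm{MSE}(\mathbf{p}')$$
$$= \frac{\beta\alpha}{\beta+1}\left[\left(\frac{\sqrt{p^*_{k_1}}|h_{k_1}|}{\sqrt{\eta}B_{k_1}} - 1\right)^2 + \left(\frac{\sqrt{p^*_{k_2}}|h_{k_2}|}{\sqrt{\eta}B_{k_2}} - 1\right)^2 - \left(\frac{\sqrt{p'_{k_1}}|h_{k_1}|}{\sqrt{\eta}B_{k_1}} - 1\right)^2 - \left(\frac{\sqrt{p'_{k_2}}|h_{k_2}|}{\sqrt{\eta}B_{k_2}} - 1\right)^2\right] \tag{33a}$$
$$= \frac{\beta\alpha}{\beta+1}\left[\left(\frac{\sqrt{p^*_{k_1}}|h_{k_1}|}{\sqrt{\eta}B_{k_1}}\right)^2 + \left(\frac{\sqrt{p^*_{k_2}}|h_{k_2}|}{\sqrt{\eta}B_{k_2}}\right)^2 - \left(\frac{\sqrt{p'_{k_1}}|h_{k_1}|}{\sqrt{\eta}B_{k_1}}\right)^2 - \left(\frac{\sqrt{p'_{k_2}}|h_{k_2}|}{\sqrt{\eta}B_{k_2}}\right)^2\right] \tag{33b}$$
$$= \frac{\beta\alpha}{\beta+1}\left(\frac{\sqrt{p^*_{k_2}}|h_{k_2}|}{\sqrt{\eta}B_{k_2}} - \frac{\sqrt{p'_{k_2}}|h_{k_2}|}{\sqrt{\eta}B_{k_2}}\right)\left(\frac{\sqrt{p^*_{k_2}}|h_{k_2}|}{\sqrt{\eta}B_{k_2}} + \frac{\sqrt{p'_{k_2}}|h_{k_2}|}{\sqrt{\eta}B_{k_2}} - \frac{\sqrt{p^*_{k_1}}|h_{k_1}|}{\sqrt{\eta}B_{k_1}} - \frac{\sqrt{p'_{k_1}}|h_{k_1}|}{\sqrt{\eta}B_{k_1}}\right) \tag{33c}$$
$$= \frac{2\beta\alpha}{\beta+1}\left(\frac{\sqrt{p^*_{k_2}}|h_{k_2}|}{\sqrt{\eta}B_{k_2}} - \frac{\sqrt{p'_{k_2}}|h_{k_2}|}{\sqrt{\eta}B_{k_2}}\right)\left(\frac{\sqrt{p^*_{k_2}}|h_{k_2}|}{\sqrt{\eta}B_{k_2}} - \frac{\sqrt{p'_{k_1}}|h_{k_1}|}{\sqrt{\eta}B_{k_1}}\right) \tag{33d}$$
$$\ge 0. \tag{33e}$$
The inequality in (33e) holds as $p^*_{k_2} = P_{k_2} > p'_{k_2}$ and $\frac{\sqrt{p^*_{k_2}}|h_{k_2}|}{\sqrt{\eta}B_{k_2}} \ge \frac{\sqrt{p'_{k_1}}|h_{k_1}|}{\sqrt{\eta}B_{k_1}}$ by the aggregation capability ranking in (13). This indicates that $\mathbf{p}'$ is a better solution than $\mathbf{p}^*$, which contradicts the assumption. Therefore, for all pairs of two devices $k_1 < k_2$, if $p^*_{k_2} = P_{k_2}$, we must have $p^*_{k_1} = P_{k_1}$. Lemma 2 is proved.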
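APPENDIX C
PROOF OF LEMMA 3
Proving this lemma is equivalent to proving that if the global optimal solution $\mathbf{p}^*$ of problem P1 lies in subregion $\mathcal{M}_l$, the optimal solution $\tilde{\mathbf{p}}^*_l$ of the sub-problem defined in the relaxed subregion $\tilde{\mathcal{M}}_l$ should also lie in $\mathcal{M}_l$. We prove this by contradiction. Assume that $\mathbf{p}^* \in \mathcal{M}_l$ but $\tilde{\mathbf{p}}^*_l \notin \mathcal{M}_l$. The derivative of (14a) w.r.t. $p_k$ and $\eta$ shows that $\tilde{\mathbf{p}}^*_l$ is the only local optimal solution in $\tilde{\mathcal{M}}_l$. As $\mathbf{p}^*$ is the optimal solution in $\mathcal{M}_l$ and cannot be on the boundary of $\mathcal{M}_l$, $\mathbf{p}^*$ is also a local optimal solution in $\mathcal{M}_l \subseteq \tilde{\mathcal{M}}_l$. This contradicts that $\tilde{\mathbf{p}}^*_l$ is the only local optimal solution in $\tilde{\mathcal{M}}_l$. Therefore, if the global optimal transmit power $\mathbf{p}^*$ is in the $l$-th subregion $\mathcal{M}_l$, $\tilde{\mathbf{p}}^*_l$ must be in $\mathcal{M}_l$. The proof of Lemma 3 is completed.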
APPENDIX D
PROOF OF THEOREM 1
To complete the proof, we need to show that with $l^*$ defined in (20), the optimal transmit power is $\tilde{\mathbf{p}}^*_{l^*}$ in the $l^*$-th relaxed subregion $\tilde{\mathcal{M}}_{l^*}$. By Lemma 3, for all $l$ in $\mathcal{L}$, since $\tilde{\mathbf{p}}^*_l$ is also in subregion $\mathcal{M}_l$, $\tilde{\mathbf{p}}^*_l$ is the optimal transmit power in the subregion $\mathcal{M}_l$. Therefore, the candidate transmit power $\tilde{\mathbf{p}}^*_l$ with the smallest value $V_l$ is the optimal transmit power of problem P1. According to (17) and (19), the optimal transmit power and denoising factor take the forms given in Theorem 1.
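APPENDIX E
PROOF OF LEMMA 4
When $\mathbf{p}^*$ lies in subregion $\mathcal{M}_l$, $\mathbf{p}^*$ is equal to $\tilde{\mathbf{p}}^*_l$. Thus, proving the sufficiency of this lemma is equivalent to proving that the optimal transmit power $\mathbf{p}^*$ satisfies
$$C_l \le \frac{\sqrt{p^*_k}|h_k|}{B_k} < C_{l+1}, \quad \forall k \in \{l+1, ..., K\}. \tag{34}$$
Based on (21), $p^*_k < P_k$ for $k \in \{l+1, ..., K\}$, and when $k = l+1$, we have $\frac{\sqrt{p^*_{l+1}}|h_{l+1}|}{B_{l+1}} < C_{l+1}$. Since $\frac{\sqrt{p^*_k}|h_k|}{B_k}$ is the same for all $k \in \{l+1, ..., K\}$, we have the inequality $\frac{\sqrt{p^*_k}|h_k|}{B_k} < C_{l+1}$ for $k \in \{l+1, ..., K\}$. We prove $C_l \le \frac{\sqrt{p^*_k}|h_k|}{B_k}$, $\forall k \in \{l+1, ..., K\}$, by contradiction. We assume that $C_l > \frac{\sqrt{p^*_{l+1}}|h_{l+1}|}{B_{l+1}}$. As $p^*_k = P_k$ for $k \in \{1, ..., l\}$, we have $\frac{\sqrt{p^*_l}|h_l|}{B_l} = C_l > \frac{\sqrt{p^*_{l+1}}|h_{l+1}|}{B_{l+1}}$. Then there always exists a modified transmit power $\mathbf{p}' = [p^*_1, ..., p^*_{l-1}, p'_l, p'_{l+1}, p^*_{l+2}, ..., p^*_K]$, where the transmit powers $p'_l$ and $p'_{l+1}$ satisfy
$$\frac{\sqrt{p'_l}|h_l|}{B_l} = \frac{\sqrt{p'_{l+1}}|h_{l+1}|}{B_{l+1}} = \frac{1}{2}\left(\frac{\sqrt{p^*_l}|h_l|}{B_l} + \frac{\sqrt{p^*_{l+1}}|h_{l+1}|}{B_{l+1}}\right).$$
The difference between the MSE of the transmit powers $\mathbf{p}^*$ and $\mathbf{p}'$ is given by
$$\mathrm{MSE}(\mathbf{p}^*) - \mathrm{MSE}(\mathbf{p}')$$
$$= \frac{\beta\alpha}{\beta+1}\left[\left(\frac{\sqrt{p^*_l}|h_l|}{\sqrt{\eta}B_l} - 1\right)^2 + \left(\frac{\sqrt{p^*_{l+1}}|h_{l+1}|}{\sqrt{\eta}B_{l+1}} - 1\right)^2 - \left(\frac{\sqrt{p'_l}|h_l|}{\sqrt{\eta}B_l} - 1\right)^2 - \left(\frac{\sqrt{p'_{l+1}}|h_{l+1}|}{\sqrt{\eta}B_{l+1}} - 1\right)^2\right] \tag{35a}$$
$$= \frac{\beta\alpha}{\beta+1}\left[\left(\frac{\sqrt{p^*_l}|h_l|}{\sqrt{\eta}B_l}\right)^2 + \left(\frac{\sqrt{p^*_{l+1}}|h_{l+1}|}{\sqrt{\eta}B_{l+1}}\right)^2 - \left(\frac{\sqrt{p'_l}|h_l|}{\sqrt{\eta}B_l}\right)^2 - \left(\frac{\sqrt{p'_{l+1}}|h_{l+1}|}{\sqrt{\eta}B_{l+1}}\right)^2\right] \tag{35b}$$
$$= \frac{\beta\alpha}{2(\beta+1)}\left(\frac{\sqrt{p^*_l}|h_l|}{\sqrt{\eta}B_l} - \frac{\sqrt{p^*_{l+1}}|h_{l+1}|}{\sqrt{\eta}B_{l+1}}\right)^2 > 0. \tag{35c}$$
This indicates that $\mathbf{p}'$ is a better solution than $\mathbf{p}^*$, which contradicts the assumption. We have the inequality $C_l \le \frac{\sqrt{p^*_k}|h_k|}{B_k}$, for $k \in \{l+1, ..., K\}$. Thus, the sufficiency of this lemma has been proved.

To prove the necessity of this lemma, we prove its contrapositive: if the optimal transmit power $\mathbf{p}^*$ does not lie in subregion $\mathcal{M}_l$, i.e., $l \in \{1, ..., l^*-1, l^*+1, ..., K\}$, then $\tilde{\mathbf{p}}^*_l$ satisfies $\frac{\sqrt{\tilde{p}^*_{l,k}}|h_k|}{B_k} \ge C_{l+1}$ or $\frac{\sqrt{\tilde{p}^*_{l,k}}|h_k|}{B_k} < C_l$ for all $k \in \{l+1, ..., K\}$.

First, we prove that $\tilde{\mathbf{p}}^*_l$, for $l \in \{1, ..., l^*-1\}$, satisfies
$$\frac{\sqrt{\tilde{p}^*_{l,k}}|h_k|}{B_k} \ge C_{l+1}, \quad \forall k \in \{l+1, ..., K\}. \tag{36}$$
We prove this by contradiction. We assume that $\exists l \in \{1, ..., l^*-1\}$ such that $\frac{\sqrt{\tilde{p}^*_{l,k}}|h_k|}{B_k} < C_{l+1}$, $\forall k \in \{l+1, ..., K\}$. As $C_{l+1} \le C_k$ for $k \in \{l+1, ..., K\}$, as given in (13), we have $\frac{\sqrt{\tilde{p}^*_{l,k}}|h_k|}{B_k} < C_k$ and hence $\tilde{p}^*_{l,k} < P_k$ for $k \in \{l+1, ..., K\}$, i.e., $\tilde{\mathbf{p}}^*_l$ is a feasible transmit power. Note that $\tilde{\mathbf{p}}^*_l$ is the optimal transmit power in $\tilde{\mathcal{M}}_l$ and $\mathbf{p}^*$ is the optimal transmit power in $\mathcal{M}_{l^*}$. As the constraint of $\tilde{\mathbf{p}}^*_l$ is less restrictive than the constraint of $\mathbf{p}^*$, i.e., $\mathcal{M}_{l^*} \subseteq \tilde{\mathcal{M}}_l$, the feasible transmit power $\tilde{\mathbf{p}}^*_l$ is better than the optimal transmit power $\mathbf{p}^*$, which contradicts the assumption. We have the inequality (36).

Second, we prove that $\tilde{\mathbf{p}}^*_l$, for $l \in \{l^*+1, ..., K\}$, satisfies
$$\frac{\sqrt{\tilde{p}^*_{l,k}}|h_k|}{B_k} < C_l, \quad \forall k \in \{l+1, ..., K\}. \tag{37}$$
We prove this by contradiction. When $l = l^*+1$, we assume that $\frac{\sqrt{\tilde{p}^*_{l,k}}|h_k|}{B_k} \ge C_l$, $\forall k \in \{l+1, ..., K\}$. Let $\mathbf{p}^{avg} = [\tilde{p}^*_{l,1}, ..., \tilde{p}^*_{l,l-1}, p^{avg}_l, ..., p^{avg}_K]$ denote a modified $\tilde{\mathbf{p}}^*_l$, where
$$\frac{\sqrt{p^{avg}_l}|h_l|}{B_l} = ... = \frac{\sqrt{p^{avg}_K}|h_K|}{B_K} = \frac{1}{K-l+1}\sum_{i=l}^{K}\frac{\sqrt{\tilde{p}^*_{l,i}}|h_i|}{B_i}.$$
Using a proof similar to that of (35c), we can prove that $\mathrm{MSE}(\mathbf{p}^{avg}) \le \mathrm{MSE}(\tilde{\mathbf{p}}^*_l)$. Let $\mathbf{p}^{fes} = [\tilde{p}^*_{l,1}, ..., \tilde{p}^*_{l,l-1}, p^{fes}_l, ..., p^{fes}_K]$ denote a modified $\tilde{\mathbf{p}}^*_l$, where $\frac{\sqrt{p^{fes}_k}|h_k|}{B_k} = C_l$, for $k \in \{l, ..., K\}$. The derivative of (14a) w.r.t. $p_k$ for all $k \in \{l+1, ..., K\}$ shows that the MSE increases with $p_k$ when $p_k \ge p^*_k$. Note that $\mathbf{p}^{avg}$ and $\mathbf{p}^{fes}$ are in the $l$-th relaxed subregion $\tilde{\mathcal{M}}_l$ and $p^{avg}_k \ge p^{fes}_k > p^*_k$, for $k \in \{l, ..., K\}$, so that $\mathrm{MSE}(\mathbf{p}^{fes}) \le \mathrm{MSE}(\mathbf{p}^{avg})$ and $\mathrm{MSE}(\mathbf{p}^{fes}) \le \mathrm{MSE}(\tilde{\mathbf{p}}^*_l)$. The transmit power $\mathbf{p}^{fes} \in \tilde{\mathcal{M}}_l$ is better than the optimal transmit power $\tilde{\mathbf{p}}^*_l \in \tilde{\mathcal{M}}_l$, which contradicts that $\tilde{\mathbf{p}}^*_l$ is the optimal solution in $\tilde{\mathcal{M}}_l$. We have the inequality (37) when $l = l^*+1$, and it can be extended to $l \in \{l^*+2, ..., K\}$ by mathematical induction.

Thus, the necessity of this lemma has been proved. We complete the proof of Lemma 4.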
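APPENDIX F
PROOF OF LEMMA 5
We first prove that the upper boundary $\mathcal{U}_l$ is equal to the lower boundary $\mathcal{L}_{l+1}$, for $l \in \{1, ..., K-1\}$. Then we prove that the optimal transmit power function $\mathbf{p}^*(\beta, \sigma_n^2)$ is continuous at $\mathcal{U}_l = \mathcal{L}_{l+1}$, for $l \in \{1, ..., K-1\}$.

For all $(\beta, \sigma_n^2) \in \mathcal{L}_{l+1}$, the corresponding optimal transmit power is $\tilde{p}^*_{l+1,k}(\beta, \sigma_n^2) = \left(\frac{C_{l+1}B_k}{|h_k|}\right)^2$, i.e., $\frac{\sqrt{\tilde{p}^*_{l+1,k}(\beta, \sigma_n^2)}|h_k|}{B_k} = C_{l+1}$, for $k \in \{l+2, ..., K\}$. Based on (17) and (19), we have
$$\sqrt{\tilde{\eta}^*_{l+1}(\beta, \sigma_n^2)} = \frac{\sum_{i=1}^{l+1} C_i + (\beta+K-l-1)C_{l+1}}{\beta+K} \tag{38a}$$
$$\Rightarrow \frac{\frac{\beta\alpha}{\beta+1}\sum_{i=1}^{l+1} C_i^2 + \frac{\beta\alpha}{(\beta+K-l-1)(\beta+1)}\left(\sum_{i=1}^{l+1} C_i\right)^2 + D\sigma_n^2}{\frac{\beta(\beta+K)\alpha}{(\beta+K-l-1)(\beta+1)}\sum_{i=1}^{l+1} C_i} = \frac{\sum_{i=1}^{l+1} C_i + (\beta+K-l-1)C_{l+1}}{\beta+K} \tag{38b}$$
$$\Rightarrow \frac{\frac{\beta\alpha}{\beta+1}\sum_{i=1}^{l+1} C_i^2 + D\sigma_n^2}{\frac{\beta(\beta+K)\alpha}{(\beta+K-l-1)(\beta+1)}\sum_{i=1}^{l+1} C_i} = \frac{(\beta+K-l-1)C_{l+1}}{\beta+K} \tag{38c}$$
$$\Rightarrow \frac{\beta\alpha}{\beta+1}\sum_{i=1}^{l+1} C_i^2 + D\sigma_n^2 = \frac{\beta\alpha}{\beta+1}C_{l+1}\sum_{i=1}^{l+1} C_i \tag{38d}$$
$$\Rightarrow \frac{\beta\alpha}{\beta+1}\sum_{i=1}^{l} C_i^2 + D\sigma_n^2 = \frac{\beta\alpha}{\beta+1}C_{l+1}\sum_{i=1}^{l} C_i \tag{38e}$$
$$\Rightarrow \frac{\frac{\beta\alpha}{\beta+1}\sum_{i=1}^{l} C_i^2 + D\sigma_n^2}{\frac{\beta(\beta+K)\alpha}{(\beta+K-l)(\beta+1)}\sum_{i=1}^{l} C_i} = \frac{(\beta+K-l)C_{l+1}}{\beta+K} \tag{38f}$$
$$\Rightarrow \frac{\frac{\beta\alpha}{\beta+1}\sum_{i=1}^{l} C_i^2 + \frac{\beta\alpha}{(\beta+K-l)(\beta+1)}\left(\sum_{i=1}^{l} C_i\right)^2 + D\sigma_n^2}{\frac{\beta(\beta+K)\alpha}{(\beta+K-l)(\beta+1)}\sum_{i=1}^{l} C_i} = \frac{\sum_{i=1}^{l} C_i + (\beta+K-l)C_{l+1}}{\beta+K} \tag{38g}$$
$$\Rightarrow \sqrt{\tilde{\eta}^*_l(\beta, \sigma_n^2)} = \frac{\sum_{i=1}^{l} C_i + (\beta+K-l)C_{l+1}}{\beta+K} \tag{38h}$$
$$\Rightarrow \frac{\sqrt{\tilde{p}^*_{l,k}(\beta, \sigma_n^2)}|h_k|}{B_k} = C_{l+1}, \quad \forall k \in \{l+1, ..., K\} \tag{38i}$$
$$\Rightarrow \tilde{p}^*_{l,k}(\beta, \sigma_n^2) = \left(\frac{C_{l+1}B_k}{|h_k|}\right)^2, \quad \forall k \in \{l+1, ..., K\}, \tag{38j}$$
then we have $(\beta, \sigma_n^2) \in \mathcal{U}_l$ by the definition of $\mathcal{U}_l$, i.e., $\mathcal{L}_{l+1} \subseteq \mathcal{U}_l$. For all $(\beta, \sigma_n^2) \in \mathcal{U}_l$, we have $(\beta, \sigma_n^2) \in \mathcal{L}_{l+1}$, i.e., $\mathcal{U}_l \subseteq \mathcal{L}_{l+1}$, by reversing the derivation of (38). Therefore, we have $\mathcal{L}_{l+1} = \mathcal{U}_l$, for $l \in \{1, ..., K-1\}$.

For all $(\beta, \sigma_n^2)_0 \in \mathcal{L}_{l+1} = \mathcal{U}_l$, define $(\beta, \sigma_n^2)_+ \in \mathcal{X}_{l+1}$ and $(\beta, \sigma_n^2)_- \in \mathcal{X}_l$ as two points infinitely close to $(\beta, \sigma_n^2)_0$, respectively. Then the left limit of $\mathbf{p}^*((\beta, \sigma_n^2)_0)$ is given by
$$\lim_{(\beta,\sigma_n^2)_- \to (\beta,\sigma_n^2)_0} \mathbf{p}^*((\beta, \sigma_n^2)_-) = \lim_{(\beta,\sigma_n^2)_- \to (\beta,\sigma_n^2)_0} \tilde{\mathbf{p}}^*_l((\beta, \sigma_n^2)_-) = \tilde{\mathbf{p}}^*_l((\beta, \sigma_n^2)_0) = \tilde{\mathbf{p}}^*_{l+1}((\beta, \sigma_n^2)_0) = \mathbf{p}^*((\beta, \sigma_n^2)_0). \tag{39}$$
Similarly, the right limit of $\mathbf{p}^*((\beta, \sigma_n^2)_0)$ is also equal to $\mathbf{p}^*((\beta, \sigma_n^2)_0)$. Therefore, the optimal transmit power function $\mathbf{p}^*(\beta, \sigma_n^2)$ is continuous at $\mathcal{U}_l = \mathcal{L}_{l+1}$, for $l \in \{1, ..., K-1\}$. We complete the proof of Lemma 5.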
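APPENDIX G
PROOF OF THEOREM 3
For any denoising factor $\eta \ge \frac{1}{K^2}\left(\sum_{k \in \mathcal{K}} C_k\right)^2$, it must hold that the composite signal alignment satisfies $\sum_{k \in \mathcal{K}} G_k(p_k, \eta) \le K$ for problem P3. Therefore, to minimize the composite signal misalignment error $\left(\sum_{k \in \mathcal{K}} G_k(p_k, \eta) - K\right)^2$, all the devices should always transmit with full power, i.e., $p^*_k = P_k$, $\forall k \in \mathcal{K}$. The problem P3 can be expressed as
$$\min_{\eta \ge 0} \ \alpha\left(\sum_{k \in \mathcal{K}} G_k(p_k, \eta) - K\right)^2 + \frac{D\sigma_n^2}{\eta}. \tag{40}$$
This is a univariate quadratic function of $\frac{1}{\sqrt{\eta}}$, thus it is easy to derive the optimal solution, i.e.,
$$\eta^* = \left(\frac{\alpha\left(\sum_{i \in \mathcal{K}} C_i\right)^2 + D\sigma_n^2}{\alpha K \sum_{i \in \mathcal{K}} C_i}\right)^2. \tag{41}$$
We complete the proof of Theorem 3.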
REFERENCES
[1] N. Zhang and M. Tao, “Gradient statistics aware power control for over-the-air federated learning in fading channels,” in Proc. IEEE ICC Workshops, 2020, pp. 1–6.
[2] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, “Wireless network intelligence at the edge,” Proceedings of the IEEE, vol. 107, no. 11, pp. 2204–2239, 2019.
[3] Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo, and J. Zhang, “Edge intelligence: Paving the last mile of artificial intelligence with edge computing,” Proceedings of the IEEE, vol. 107, no. 8, pp. 1738–1762, 2019.
[4] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
[5] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konecny, S. Mazzocchi, H. B. McMahan et al., “Towards federated learning at scale: System design,” arXiv preprint arXiv:1902.01046, 2019.
[6] Q. Yang, Y. Liu, T. Chen, and Y. Tong, “Federated machine learning: Concept and applications,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 10, no. 2, p. 12, 2019.
[7] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via randomized quantization and encoding,” Advances in Neural Information Processing Systems 30, vol. 3, pp. 1710–1721, 2018.
[8] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[9] A. F. Aji and K. Heafield, “Sparse communication for distributed gradient descent,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 440–445.
[10] Y. Tsuzuku, H. Imachi, and T. Akiba, “Variance-based gradient compression for efficient distributed deep learning,” arXiv preprint arXiv:1802.06058, 2018.
[11] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive federated learning in resource constrained edge computing systems,” IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1205–1221, 2019.
[12] B. Nazer and M. Gastpar, “Computation over multiple-access channels,” IEEE Trans. Inf. Theory, vol. 53, no. 10, pp. 3498–3516, 2007.
[13] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,” IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 2022–2035, March 2020.
[14] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for low-latency federated edge learning,” IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 491–506, Jan. 2020.
[15] M. M. Amiri and D. Gunduz, “Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air,” in Proc. IEEE ISIT, July 2019, pp. 1432–1436.
[16] M. M. Amiri and D. Gunduz, “Federated learning over wireless fading channels,” IEEE Trans. Wireless Commun., pp. 1–1, 2020.
[17] X. Cao, G. Zhu, J. Xu, and K. Huang, “Optimal power control for over-the-air computation,” in Proc. IEEE GLOBECOM, Dec. 2019, pp. 1–6.
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
Fueled by the availability of more data and computing power, recent breakthroughs in cloud-based machine learning (ML) have transformed every aspect of our lives from face recognition and medical diagnosis to natural language processing. However, classical ML exerts severe demands in terms of energy, memory and computing resources, limiting their adoption for resource constrained edge devices. The new breed of intelligent devices and high-stake applications (drones, augmented/virtual reality, autonomous systems, etc.), requires a novel paradigm change calling for distributed, low-latency and reliable ML at the wireless network edge (referred to as edge ML). In edge ML, training data is unevenly distributed over a large number of edge nodes, which have access to a tiny fraction of the data. Moreover training and inference are carried out collectively over wireless links, where edge devices communicate and exchange their learned models (not their private data). In a first of its kind, this article explores key building blocks of edge ML, different neural network architectural splits and their inherent tradeoffs, as well as theoretical and technical enablers stemming from a wide range of mathematical disciplines. Finally, several case studies pertaining to various high-stake applications are presented demonstrating the effectiveness of edge ML in unlocking the full potential of 5G and beyond.
Article
Full-text available
Emerging technologies and applications including Internet of Things (IoT), social networking, and crowd-sourcing generate large amounts of data at the network edge. Machine learning models are often built from the collected data, to enable the detection, classification, and prediction of future events. Due to bandwidth, storage, and privacy concerns, it is often impractical to send all the data to a centralized location. In this paper, we consider the problem of learning model parameters from data distributed across multiple edge nodes, without sending raw data to a centralized place. Our focus is on a generic class of machine learning models that are trained using gradientdescent based approaches. We analyze the convergence bound of distributed gradient descent from a theoretical point of view, based on which we propose a control algorithm that determines the best trade-off between local update and global parameter aggregation to minimize the loss function under a given resource budget. The performance of the proposed algorithm is evaluated via extensive experiments with real datasets, both on a networked prototype system and in a larger-scale simulated environment. The experimentation results show that our proposed approach performs near to the optimum with various machine learning models and different data distributions.
Article
The stringent requirements for low-latency and privacy of the emerging high-stake applications with intelligent devices such as drones and smart vehicles make the cloud computing inapplicable in these scenarios. Instead, edge machine learning becomes increasingly attractive for performing training and inference directly at network edges without sending data to a centralized data center. This stimulates a nascent field termed as federated learning for training a machine learning model on computation, storage, energy and bandwidth limited mobile devices in a distributed manner. To preserve data privacy and address the issues of unbalanced and non-IID data points across different devices, the federated averaging algorithm has been proposed for global model aggregation by computing the weighted average of locally updated model at each selected device. However, the limited communication bandwidth becomes the main bottleneck for aggregating the locally computed updates. We thus propose a novel over-the-air computation based approach for fast global model aggregation via exploring the superposition property of a wireless multiple-access channel. This is achieved by joint device selection and beamforming design, which is modeled as a sparse and low-rank optimization problem to support efficient algorithms design. To achieve this goal, we provide a difference-of-convex-functions (DC) representation for the sparse and low-rank function to enhance sparsity and accurately detect the fixed-rank constraint in the procedure of device selection. A DC algorithm is further developed to solve the resulting DC program with global convergence guarantees. The algorithmic advantages and admirable performance of the proposed methodologies are demonstrated through extensive numerical results.
Article
To leverage rich data distributed at the network edge, a new machine-learning paradigm, called edge learning, has emerged where learning algorithms are deployed at the edge for providing intelligent services to mobile users. While computing speeds are advancing rapidly, the communication latency is becoming the bottleneck of fast edge learning. To address this issue, this work is focused on designing a low-latency multi-access scheme for edge learning. To this end, we consider a popular privacy-preserving framework, federated edge learning (FEEL), where a global AI-model at an edge-server is updated by aggregating (averaging) local models trained at edge devices. It is proposed that the updates simultaneously transmitted by devices over broadband channels should be analog aggregated “over-the-air” by exploiting the waveform-superposition property of a multi-access channel. Such broadband analog aggregation (BAA) results in dramatical communication-latency reduction compared with the conventional orthogonal access (i.e., OFDMA). In this work, the effects of BAA on learning performance are quantified targeting a single-cell random network. First, we derive two tradeoffs between communication-and-learning metrics, which are useful for network planning and optimization. The power control (“truncated channel inversion”) required for BAA results in a tradeoff between the update-reliability [as measured by the receive signal-to-noise ratio (SNR)] and the expected update-truncation ratio. Consider the scheduling of cell-interior devices to constrain path loss. This gives rise to the other tradeoff between the receive SNR and fraction of data exploited in learning. Next, the latency-reduction ratio of the proposed BAA with respect to the traditional OFDMA scheme is proved to scale almost linearly with the device population. Experiments based on a neural network and a real dataset are conducted for corroborating the theoretical results.
With the breakthroughs in deep learning, recent years have witnessed a boom in artificial intelligence (AI) applications and services, spanning from personal assistants to recommendation systems to video/audio surveillance. More recently, with the proliferation of mobile computing and the Internet of Things (IoT), billions of mobile and IoT devices are connected to the Internet, generating zillions of bytes of data at the network edge. Driven by this trend, there is an urgent need to push the AI frontiers to the network edge so as to fully unleash the potential of edge big data. To meet this demand, edge computing, an emerging paradigm that pushes computing tasks and services from the network core to the network edge, has been widely recognized as a promising solution. The resulting interdisciplinary field, edge AI or edge intelligence (EI), is beginning to receive a tremendous amount of interest. However, research on EI is still in its infancy, and a dedicated venue for exchanging the recent advances of EI is highly desired by both the computer system and AI communities. To this end, we conduct a comprehensive survey of the recent research efforts on EI. Specifically, we first review the background and motivation for AI running at the network edge. We then provide an overview of the overarching architectures, frameworks, and emerging key technologies for deep learning model training and inference at the network edge. Finally, we discuss future research opportunities on EI. We believe that this survey will attract growing attention, stimulate fruitful discussions, and inspire further research ideas on EI.
Today’s artificial intelligence still faces two major challenges. One is that, in most industries, data exists in the form of isolated islands. The other is the strengthening of data privacy and security. We propose a possible solution to these challenges: secure federated learning. Beyond the federated-learning framework first proposed by Google in 2016, we introduce a comprehensive secure federated-learning framework, which includes horizontal federated learning, vertical federated learning, and federated transfer learning. We provide definitions, architectures, and applications for the federated-learning framework, and provide a comprehensive survey of existing works on this subject. In addition, we propose building data networks among organizations based on federated mechanisms as an effective solution to allowing knowledge to be shared without compromising user privacy.