arXiv:2003.02089v2 [eess.SP] 8 May 2020
Gradient Statistics Aware Power Control for
Over-the-Air Federated Learning
Naifu Zhang and Meixia Tao
Abstract
Federated learning (FL) is a promising technique that enables many edge devices to train a machine
learning model collaboratively in wireless networks. By exploiting the superposition nature of wireless
waveforms, over-the-air computation (AirComp) can accelerate model aggregation and hence facilitate
communication-efficient FL. Due to channel fading, power control is crucial in AirComp. Prior works assume that the signal to be aggregated from each device, i.e., the local gradient, can be normalized to zero mean and unit variance. In FL, however, gradient statistics vary over both training iterations and
feature dimensions, and are unknown in advance. This paper studies the power control problem for
over-the-air FL by taking gradient statistics into account. The goal is to minimize the aggregation error
by jointly optimizing the transmit power at each device and the denoising factor at the edge server.
We obtain the optimal policy in closed form when gradient statistics are given. Notably, we show that
the optimal transmit power at each device is continuous and monotonically decreases with the squared
multivariate coefficient of variation (SMCV) of gradient vectors. We also propose an estimation method
of gradient statistics with negligible communication cost. Experimental results demonstrate the high learning performance achieved by the proposed scheme.
Index Terms
Federated learning, over-the-air computation, power control, fading channel.
I. INTRODUCTION
The proliferation of mobile devices such as smartphones, tablets, and wearable devices has
revolutionized people’s daily lives. Due to the growing computation and sensing capabilities of these devices, a wealth of data is generated each day. This has promoted a wide
range of artificial intelligence (AI) applications such as image recognition and natural language
processing. Traditional machine learning procedure, including both training and inference, relies
on cloud computing on a centralized data center with computing, storage, and full access to the
entire data set. Wireless edge devices are thus required to transmit their collected data to a central
parameter server, which can be very costly in terms of energy and bandwidth consumption, and
unfavorable due to response delay and privacy concerns. It is thus increasingly desired to let
edge devices engage in the learning process by keeping the collected data locally and performing
training/inference either collaboratively or individually. This emerging technology is known as
Edge Machine Learning [2] or Edge Intelligence [3].
Federated learning [4]–[6] is a new edge learning framework that enables many edge devices
to collaboratively train a machine learning model without exchanging datasets under the coordi-
nation of an edge server in wireless networks. Compared with traditional learning at a centralized
data center, FL offers several distinct advantages, such as preserving privacy, reducing network
congestion, and leveraging distributed on-device computation. In FL, each edge device downloads
N. Zhang and M. Tao are with the Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, 200240, P. R. China (email: arthaslery@sjtu.edu.cn; mxtao@sjtu.edu.cn). Part of this work is to be presented at the IEEE ICC 2020 workshop [1].
a shared model from the edge server, computes an update to the current model by learning from
its local dataset, then sends this update to the edge server. Therein, the updates are averaged to
improve the shared model.
The communication cost is the main bottleneck in FL since a large number of participating
edge devices send their updates to the edge server at each round of the model training. Existing
methods to obtain communication-efficient FL can be mainly divided into three categories: model
parameter compression [7], [8], gradient sparsification [9], [10], and infrequent local update
[4], [11]. Nevertheless, the communication cost of FL is still proportional to the number of edge devices, and is thus inefficient in large-scale environments. Recently, a fast model aggregation approach has been proposed for FL by applying the over-the-air computation (AirComp) principle [12],
such as in [13]–[16]. This is accomplished by exploiting the waveform superposition nature of
the wireless medium to compute the desired function of the distributed local gradients (i.e., the
weighted average function) by concurrent transmission. Such AirComp-based FL, referred to as
over-the-air FL in this work, can dramatically save the uplink communication bandwidth.
Due to channel fading, device selection and power control are crucial to achieving reliable
and high-performance over-the-air FL. In [17], the authors jointly optimize the transmit power
at edge devices and the receive scaling factor (known as denoising factor) at the edge server for
minimizing the mean square error (MSE) of the aggregated signal. It is shown that the optimal
transmit power in static channels exhibits a threshold-based structure. Namely, each device
applies channel-inversion power control if its quality indicator exceeds the optimal threshold,
and applies full power transmission otherwise. The work [17], however, is purely for AirComp-
based signal aggregation (not in the context of learning), where the signal on each device is
assumed to be independent and identically distributed (IID) with zero mean and unit variance.
For AirComp-based gradient aggregation in FL, the work [14] introduces a truncation-based
approach for excluding the edge devices with deep fading channels to strike a good balance
between learning performance and aggregation error. The work [13] proposes a joint device
selection and receiver beamforming design method to find the maximum number of selected
devices with MSE requirements to improve the learning performance. As in [17], it is assumed
in both [13] and [14] that the signal (i.e., the local gradient) to be aggregated from each device
is IID, and normalized with zero mean and unit variance. By exploiting the sparsity pattern
in gradient vectors, the work [16] projects the gradient estimate in each device into a low-
dimensional vector and transmits only the important gradient entries while accumulating the
error from previous iterations. Therein, a channel-inversion like power control scheme, similar
to those in [13], [14], [17] is designed so that the gradient vectors sent from the selected devices
are aligned at the edge server.
Note that all the existing works on power control for over-the-air FL have overlooked the
following statistical characteristics of gradients: the gradient distribution over training iterations
is independent but not necessarily identical; and even in the same iteration, the distribution of
each entry of the gradient vector can be non-identical. A general observation is that the gradient
distribution changes over iterations and is different in each feature dimension. In addition, if the
gradient distribution is unknown for each device, normalizing the gradient to a distribution with
zero mean and unit variance is infeasible. As such, due to the neglect of the above characteristics
of gradient distribution, the existing power control methods for over-the-air FL may not perform
efficiently in practice.
Motivated by the above issue, in this paper, we study the optimal power control problem
for over-the-air FL in fading channels by taking gradient statistics into account. Our goal is to
minimize the MSE of the aggregated model, and hence improve the accuracy of FL, by jointly
optimizing the transmit power at each device and the denoising factor at the edge server given
the first-order and second-order statistics of gradients at each iteration. The main contributions
of this work are outlined below:
• Optimal power control with known gradient statistics: We first derive the MSE expression of
gradient aggregation at each iteration of the model training when the first-order and second-
order statistics of the gradient vectors are known. We then formulate a joint optimization
problem of transmit power at edge devices and denoising factor at the edge server for MSE
minimization subject to individual peak power constraints at edge devices. By decomposing
this non-convex problem into subproblems defined on different subregions, we obtain the
optimal power control strategy in closed form. Unlike the existing threshold-based power
control for normalized signal in [17], our optimal transmit power depends not only on the
channel quality and noise power, but also heavily on the gradient statistics. In particular,
we find that the relative dispersion of the gradient values, i.e., the squared multivariate
coefficient of variation (SMCV) of the gradient vectors, plays a key role in the optimal
transmit power control. Specifically, we prove that the optimal transmit power of each
device is a continuous and monotonically decreasing function of the gradient SMCV.
• Optimal power control in special cases: In the special case where the gradient SMCV
approaches infinity, which could happen when the model training converges and/or the
dataset is highly non-IID, we show that there is an optimal threshold for the aggregation
capabilities of the devices, below which the devices transmit with full power and above
which the devices transmit at the power to equalize the weights of their gradients for
aggregation to one. In the other special case where the gradient SMCV approaches zero,
which could happen when the model training just begins, we show that the optimal power
control is to let all the devices transmit with their peak powers.
• Adaptive power control with unknown gradient statistics: We propose an adaptive power
control algorithm that estimates the gradient statistics based on the historical aggregated
gradients and then dynamically adjusts the power values in each iteration based on the
estimation results. The communication cost consumed by estimating the gradient statistics
is negligible compared to the transmission of the entire gradient vector.
To evaluate the efficiency of the proposed power control scheme, we implement FL in PyTorch for AI applications on three datasets: MNIST, CIFAR-10, and SVHN. Experimental results demonstrate that over-the-air FL with the proposed adaptive power control attains higher model accuracy than with the existing power control methods (full power transmission and threshold-based power control for normalized signals [17]). In particular, while full power transmission performs poorly in the high-SNR region and under non-IID data distributions, and the threshold-based power control for normalized signals [17] performs poorly in the low-SNR region and under IID data distributions, the proposed power control performs very well over a wide range of scenarios by exploiting the gradient statistics.
The rest of this paper is organized as follows. The over-the-air FL system is modeled in Section
II. Section III describes the optimal power control strategy with known gradient statistics. In
Section IV, we introduce an adaptive power control scheme when the gradient statistics are
unknown in advance. Section V provides experimental results. Finally we conclude the paper in
Section VI.
Fig. 1. Illustration of over-the-air federated learning.
II. SYSTEM MODEL
A. Federated Learning Over Wireless Networks
We consider a wireless FL framework as illustrated in Fig. 1, where a shared AI model (e.g., a classifier) is trained collaboratively across $K$ single-antenna edge devices via the coordination of a single-antenna edge server. Let $\mathcal{K} = \{1, \ldots, K\}$ denote the set of edge devices. Each device $k \in \mathcal{K}$ collects a fraction of labelled training data via interaction with its own users, constituting a local dataset, denoted as $\mathcal{D}_k$. The loss function measuring the model error is defined as
$$ L(\mathbf{w}) = \sum_{k \in \mathcal{K}} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} L_k(\mathbf{w}), \qquad (1) $$
where $\mathbf{w} \in \mathbb{R}^D$ denotes the $D$-dimensional model parameter to be learned, $L_k(\mathbf{w}) = \frac{1}{|\mathcal{D}_k|} \sum_{i \in \mathcal{D}_k} l_i(\mathbf{w})$ is the loss function of device $k$ quantifying the prediction error of the model $\mathbf{w}$ on the local dataset collected at the $k$-th device, with $l_i(\mathbf{w})$ being the sample-wise loss function, and $\mathcal{D} = \bigcup_{k \in \mathcal{K}} \mathcal{D}_k$
is the union of all datasets. The minimization of $L(\mathbf{w})$ is typically carried out through the stochastic gradient descent (SGD) algorithm, where device $k$'s local dataset $\mathcal{D}_k$ is split into mini-batches of size $B$ and, at each iteration $t = 1, 2, \ldots$, we draw one mini-batch $\mathcal{B}_k(t)$ randomly and update the model parameter as
$$ \mathbf{w}(t+1) = \mathbf{w}(t) - \gamma \frac{1}{K} \sum_{k \in \mathcal{K}} \nabla L^{\mathrm{SGD}}_{k,t}(\mathbf{w}(t)), \qquad (2) $$
with $\gamma$ being the learning rate and $L^{\mathrm{SGD}}_{k,t}(\mathbf{w}) = \frac{1}{B} \sum_{i \in \mathcal{B}_k(t)} l_i(\mathbf{w})$. Note that the mean of the gradient $\nabla L^{\mathrm{SGD}}_{k,t}(\mathbf{w}(t))$ in SGD is equal to the gradient $\nabla L(\mathbf{w}(t))$ in gradient descent (GD), while the variance depends on the mini-batch size and the distribution of the data (IID or non-IID).
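For concreteness, the update rule (2) can be sketched as follows (an illustrative toy snippet with made-up values, not the training code used in the experiments):

```python
import numpy as np

def federated_sgd_step(w, local_grads, lr):
    """One error-free update of eq. (2): average the K local mini-batch
    gradients and take a single gradient-descent step."""
    g = np.mean(local_grads, axis=0)  # ideal aggregation, as in eq. (3)
    return w - lr * g

# Toy usage: K = 2 devices, D = 2 model parameters.
w = np.zeros(2)
local_grads = np.array([[1.0, 1.0], [3.0, 3.0]])
w_next = federated_sgd_step(w, local_grads, lr=0.5)  # -> [-1., -1.]
```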
B. Over-The-Air Computation for Gradient Aggregation

We consider block fading channels, where the wireless channels remain unchanged within the duration of each iteration in FL but may change independently from one iteration to another. We define the duration of one iteration as one time block, indexed by $t \in \mathbb{N}$. It is assumed that the channel coefficients over different time blocks are generated from a stationary and ergodic process. Let $\mathbf{g}_k(t) \triangleq \nabla L^{\mathrm{SGD}}_{k,t}(\mathbf{w}(t)) \in \mathbb{R}^D$ denote the gradient vector computed on device $k$ at time block $t$. The following are key assumptions on the distribution of each entry $g_{k,d}(t)$, $d \in \{1, \ldots, D\}$, of $\mathbf{g}_k(t)$:
• The gradient elements $\{g_{k,d}(t)\}$, $\forall k \in \mathcal{K}$, are independent and identically distributed across devices. This is a default assumption since the distributions of the local datasets at different devices, whether identical or not, are unknown to the edge server, and thus the distributions of the local gradients trained from these local datasets are treated equally.
• The gradient elements $\{g_{k,d}(t)\}$, $\forall t \in \mathbb{N}$, are independent but non-identically distributed across iterations. In other words, the gradient distribution is non-stationary over time. This is valid since the gradient values in general change rapidly at the beginning and then gradually approach zero as the training goes on.
• The gradient elements $\{g_{k,d}(t)\}$, $\forall d \in \{1, 2, \ldots, D\}$, are independent but non-identically distributed across the gradient dimensions. This assumption is valid as long as the features in a data sample are independent but non-identically distributed, which is typically the case.
The gradient of interest at the edge server at each time block $t$ is given by
$$ \mathbf{g}(t) = \frac{1}{K} \sum_{k \in \mathcal{K}} \mathbf{g}_k(t). \qquad (3) $$
To obtain (3), all the devices transmit their gradient vectors $\mathbf{g}_k(t)$ concurrently in an analog manner, following the AirComp principle as shown in Fig. 1. Each transmission block takes a duration of $D$ slots, one slot for each entry of the $D$-dimensional gradient vector. Each gradient vector $\mathbf{g}_k(t)$ is multiplied by a pre-processing factor, denoted as $b_k(t)$. The received signal vector at the edge server is given by
$$ \mathbf{y}(t) = \sum_{k \in \mathcal{K}} b_k(t) h_k(t) \mathbf{g}_k(t) + \mathbf{n}(t), \qquad (4) $$
where $h_k(t)$ denotes the complex-valued channel coefficient from device $k$ to the edge server and $\mathbf{n}(t)$ denotes the additive white Gaussian noise (AWGN) vector at the edge server, with each element having zero mean and variance $\sigma_n^2$. To compensate the channel phase offset and control the actual transmit power at each device, we let $b_k(t) = \frac{\sqrt{p_k(t)}\, e^{-j\theta_k(t)}}{B_k(t)}$, where $p_k(t) \geq 0$ denotes the transmit power at device $k \in \mathcal{K}$ at time block $t$, $\theta_k(t)$ is the phase of $h_k(t)$, and $B_k(t) \triangleq \|\mathbf{g}_k(t)\| = \sqrt{\sum_{d=1}^D g_{k,d}^2(t)}$ denotes the gradient norm of device $k$. Here, we have assumed that each device $k$ can estimate its own channel phase $\theta_k(t)$ perfectly. To design the optimal power control policy $p_k(t)$, for $k \in \mathcal{K}$, we also assume that the edge server knows the channel amplitudes $|h_k|$ of all devices. With this design of $b_k(t)$, we can rewrite (4) as
$$ \mathbf{y}(t) = \sum_{k \in \mathcal{K}} \frac{\sqrt{p_k(t)}\, |h_k(t)|}{B_k(t)} \mathbf{g}_k(t) + \mathbf{n}(t). \qquad (5) $$
Each device $k \in \mathcal{K}$ has a peak power budget $P_k$, i.e.,
$$ p_k(t) \leq P_k, \quad \forall k \in \mathcal{K}, \forall t \in \mathbb{N}. \qquad (6) $$
Upon receiving $\mathbf{y}(t)$, the edge server applies a denoising factor, denoted by $\eta(t)$, to recover the gradient of interest as
$$ \hat{\mathbf{g}}(t) = \frac{\mathbf{y}(t)}{K \sqrt{\eta(t)}}, \qquad (7) $$
where the factor $1/K$ is employed for averaging purposes.
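The transmit-receive chain in (5) and (7) can be simulated as follows (an illustrative sketch under the above assumptions, with the channel phase already compensated so that only the amplitudes $|h_k|$ appear; all numeric values are made up):

```python
import numpy as np

def aircomp_round(grads, h_abs, p, eta, sigma_n, rng):
    """Simulate one AirComp aggregation round, eqs. (5) and (7).
    grads: K x D local gradients; h_abs: channel amplitudes |h_k|;
    p: transmit powers; eta: denoising factor at the edge server."""
    K, D = grads.shape
    B = np.linalg.norm(grads, axis=1)        # gradient norms B_k(t)
    scale = np.sqrt(p) * h_abs / B           # sqrt(p_k)|h_k|/B_k per device
    y = scale @ grads + rng.normal(0.0, sigma_n, D)  # superposed signal, eq. (5)
    return y / (K * np.sqrt(eta))            # recovered gradient, eq. (7)

# Noise-free sanity check: with p_k = (B_k/|h_k|)^2 * eta, every device's
# aggregation weight equals 1 and the true average of eq. (3) is recovered.
rng = np.random.default_rng(0)
grads = np.array([[1.0, 1.0], [3.0, 3.0]])
h_abs = np.array([1.0, 2.0])
eta = 1.0
B = np.linalg.norm(grads, axis=1)
p = (B / h_abs) ** 2 * eta
g_hat = aircomp_round(grads, h_abs, p, eta, sigma_n=0.0, rng=rng)  # -> [2., 2.]
```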
C. Performance Measure
We are interested in minimizing the distortion of the recovered gradient $\hat{\mathbf{g}}(t)$ with respect to (w.r.t.) the ground-truth gradient $\mathbf{g}(t)$. The distortion at a given iteration $t$ is measured by the instantaneous MSE, defined as
$$ \mathrm{MSE}(t) \triangleq \mathbb{E}\left[ \|\hat{\mathbf{g}}(t) - \mathbf{g}(t)\|^2 \right] = \frac{1}{K^2} \mathbb{E}\left[ \left\| \frac{\mathbf{y}(t)}{\sqrt{\eta(t)}} - \sum_{k \in \mathcal{K}} \mathbf{g}_k(t) \right\|^2 \right] $$
$$ = \frac{1}{K^2} \left[ \sum_{d=1}^D \sigma_d^2(t) \sum_{k \in \mathcal{K}} \left( \frac{\sqrt{p_k(t)}\, |h_k(t)|}{\sqrt{\eta(t)}\, B_k(t)} - 1 \right)^2 + \sum_{d=1}^D m_d^2(t) \left( \frac{1}{\sqrt{\eta(t)}} \sum_{k \in \mathcal{K}} \frac{\sqrt{p_k(t)}\, |h_k(t)|}{B_k(t)} - K \right)^2 + \frac{D \sigma_n^2}{\eta(t)} \right], \qquad (8) $$
where the expectation is over the distribution of the transmitted gradients $\mathbf{g}_k(t)$ and the received noise $\mathbf{n}(t)$, and $m_d(t)$ and $\sigma_d^2(t)$ denote the mean (first-order statistic) and variance (second-order statistic) of the $d$-th entry of the gradient $\mathbf{g}(t)$ at iteration $t$, respectively. Notice that the gradient norm $B_k(t)$ of each device $k$ can be transmitted to the edge server with negligible communication cost; thus it is treated as a known value in (8).
Observing (8) closely, we find that the MSE consists of three components: the individual misalignment error (the first term), the composite misalignment error (the second term), and the noise-induced error (the third term). The individual misalignment error is weighted by the gradient variance $\sum_{d=1}^D \sigma_d^2(t)$, while the composite misalignment error is weighted by the squared gradient mean $\sum_{d=1}^D m_d^2(t)$. In the special case when the gradient has zero mean, the MSE in (8) reduces to that in [17], where the composite misalignment error is absent. Our objective is to minimize the MSE in (8) by jointly optimizing the transmit power $p_k(t)$ at all devices and the denoising factor $\eta(t)$ at the edge server, subject to the individual power budget in (6).
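The three components can be evaluated separately; the sketch below (an illustrative decomposition with hypothetical parameter values) mirrors (8) term by term:

```python
import numpy as np

def mse_components(p, eta, h_abs, B, m, var, sigma_n2):
    """Evaluate the three terms of eq. (8): individual misalignment error,
    composite misalignment error, and noise-induced error (with 1/K^2)."""
    K = len(p)
    D = len(m)
    G = np.sqrt(p) * h_abs / (np.sqrt(eta) * B)   # per-device aggregation weights
    ind = var.sum() * np.sum((G - 1.0) ** 2)      # weighted by the gradient variance
    comp = (m ** 2).sum() * (G.sum() - K) ** 2    # weighted by the squared gradient mean
    noise = D * sigma_n2 / eta
    return np.array([ind, comp, noise]) / K ** 2

# With perfectly aligned weights (G_k = 1 for all k), only the noise term remains.
h_abs = np.array([0.8, 1.5]); B = np.array([0.3, 0.4]); eta = 2.0
p = (B / h_abs) ** 2 * eta                        # enforces G_k = 1
errs = mse_components(p, eta, h_abs, B,
                      m=np.array([0.1, 0.2]), var=np.array([0.05, 0.05]),
                      sigma_n2=0.1)
# errs[0] and errs[1] vanish; errs[2] = 2 * 0.1 / 2.0 / 2**2 = 0.025
```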
D. Gradient Statistics
In general, the individual misalignment error and the composite misalignment error in MSE of
the gradient aggregation (8) cannot be minimized simultaneously due to the peak power budget
on each device. It is difficult to capture the tradeoff between the two errors by directly using
Fig. 2. Experimental results of MSN $\alpha(t)$ (left y-axis in linear scale) and SMCV $\beta(t)$ (right y-axis) over iterations for dataset MNIST, where the number of edge devices is 10 and the local mini-batch size is 20.
their respective weights, namely, the gradient variance and the gradient mean. To tackle this issue, we introduce two alternative parameters of gradient statistics in this subsection.
Let $\alpha(t)$ denote the mean squared norm (MSN) of $\mathbf{g}(t)$, i.e., $\mathbb{E}[\|\mathbf{g}(t)\|^2]$, which is given by
$$ \alpha(t) = \sum_{d=1}^D \left( \sigma_d^2(t) + m_d^2(t) \right), \qquad (9) $$
and let $\beta(t)$ denote the squared multivariate coefficient of variation (SMCV) of $\mathbf{g}(t)$, which is given by
$$ \beta(t) = \frac{\sum_{d=1}^D \sigma_d^2(t)}{\sum_{d=1}^D m_d^2(t)}. \qquad (10) $$
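Both statistics can be computed from the per-dimension means and variances; the following sketch (an illustration with made-up samples) estimates them empirically from sampled realizations of $\mathbf{g}(t)$:

```python
import numpy as np

def gradient_msn_smcv(grad_samples):
    """Estimate the MSN alpha(t) of eq. (9) and the SMCV beta(t) of eq. (10)
    from sampled realizations of g(t) (one row per realization)."""
    m = grad_samples.mean(axis=0)             # per-dimension mean m_d(t)
    var = grad_samples.var(axis=0)            # per-dimension variance sigma_d^2(t)
    alpha = float(np.sum(var + m ** 2))       # eq. (9): E[||g(t)||^2]
    beta = float(var.sum() / (m ** 2).sum())  # eq. (10): relative dispersion
    return alpha, beta

samples = np.array([[0.0, 2.0], [2.0, 0.0]])
alpha, beta = gradient_msn_smcv(samples)      # alpha = 4.0, beta = 1.0
```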
While $\alpha(t)$ measures the average absolute gradient values at iteration $t$, $\beta(t)$ is a measure of the relative dispersion of the gradient values at iteration $t$. Shortly, we shall show that the MSE of the model aggregation is mainly dominated by $\beta(t)$ rather than $\alpha(t)$. Figs. 2-4 illustrate the experimental results of the alternative gradient statistics $\alpha(t)$ and $\beta(t)$ on three datasets, MNIST, CIFAR-10, and SVHN, respectively, where the gradients are updated ideally without any transmission error. Both IID and non-IID partitions are considered for the training dataset. Each value of $\alpha(t)$ and $\beta(t)$ is obtained by averaging over 300 model trainings. It is observed that as the training time increases, the gradient MSN $\alpha(t)$ decreases while the gradient SMCV $\beta(t)$ increases for all three datasets. This result agrees with the intuition that the absolute gradient values in SGD-based learning gradually vanish as the training goes on, but their relative variation remains significant at each iteration. It is also observed that the gradient SMCV $\beta(t)$ under the non-IID partition is much larger than that under the IID partition for all three datasets. This indicates that the gradient distribution with non-IID dataset partition is more dispersive than that with IID partition, as expected.
Fig. 3. Experimental results of MSN $\alpha(t)$ (left y-axis in log scale) and SMCV $\beta(t)$ (right y-axis) over iterations for dataset CIFAR-10, where the number of edge devices is 10 and the local mini-batch size is 200.

Fig. 4. Experimental results of MSN $\alpha(t)$ (left y-axis in log scale) and SMCV $\beta(t)$ (right y-axis) over iterations for dataset SVHN, where the number of edge devices is 10 and the local mini-batch size is 300.

By (9) and (10), the MSE in (8) can be rewritten as (omitting the constant coefficient $1/K^2$ for convenience)
$$ \mathrm{MSE}(t) = \frac{\beta(t)\alpha(t)}{\beta(t)+1} \sum_{k \in \mathcal{K}} \left( \frac{\sqrt{p_k(t)}\, |h_k(t)|}{\sqrt{\eta(t)}\, B_k(t)} - 1 \right)^2 + \frac{\alpha(t)}{\beta(t)+1} \left( \frac{1}{\sqrt{\eta(t)}} \sum_{k \in \mathcal{K}} \frac{\sqrt{p_k(t)}\, |h_k(t)|}{B_k(t)} - K \right)^2 + \frac{D \sigma_n^2}{\eta(t)}. \qquad (11) $$
It is seen from (11) that while the gradient MSN $\alpha(t)$ appears linearly in the weights of both the individual and composite misalignment errors, the gradient SMCV $\beta(t)$ plays a more distinguishing role in the MSE expression. In particular, when $\beta(t) \to 0$, which could be the case when the model training just begins (as verified by Figs. 2-4), the individual signal misalignment error can be neglected. When $\beta(t) \to \infty$, which could be the case when the model training converges or in the middle of training when the dataset is highly non-IID (as verified by Figs. 2-4), the composite signal misalignment error disappears.
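The weights in (11) are consistent with those in (8), since $\sum_d \sigma_d^2(t) = \frac{\beta(t)\alpha(t)}{\beta(t)+1}$ and $\sum_d m_d^2(t) = \frac{\alpha(t)}{\beta(t)+1}$; a quick numerical check with made-up per-dimension statistics:

```python
import numpy as np

# Made-up per-dimension gradient statistics (illustration only).
m = np.array([0.3, -0.1, 0.2])        # means m_d
var = np.array([0.04, 0.09, 0.01])    # variances sigma_d^2

alpha = np.sum(var + m ** 2)          # MSN, eq. (9)
beta = var.sum() / (m ** 2).sum()     # SMCV, eq. (10)

# The weights of eq. (11) recover those of eq. (8).
assert np.isclose(beta * alpha / (beta + 1), var.sum())
assert np.isclose(alpha / (beta + 1), (m ** 2).sum())
```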
III. OPTIMAL POWER CONTROL WITH KNOWN GRADIENT STATISTICS
In this section, we formulate and solve the optimal power control problem for minimizing the MSE when the gradient statistics $\alpha(t)$ and $\beta(t)$ are known. For convenience, we omit the iteration index $t$ in this section. For each device $k \in \mathcal{K}$, we define its aggregation level with power $p$ and denoising factor $\eta$ as
$$ G_k(p, \eta) = \frac{\sqrt{p}\, |h_k|}{\sqrt{\eta}\, B_k}, \qquad (12) $$
which indicates the weight of the gradient from device $k$ in the global gradient aggregation (7).¹ Furthermore, we define the aggregation capability of device $k$ as its aggregation level with peak power $P_k$ and unit denoising factor $\eta = 1$, i.e., $C_k = G_k(P_k, 1) = \frac{\sqrt{P_k}\, |h_k|}{B_k}$. Without loss of generality, we assume that
$$ C_1 \leq \ldots \leq C_k \leq \ldots \leq C_K. \qquad (13) $$
A. Power Control Problem for the General Case

In this subsection, we consider the optimal power control problem for MSE minimization in the general case with arbitrary $\beta$. The problem is formulated as
$$ \mathrm{P1:} \quad \min \; \frac{\beta\alpha}{\beta+1} \sum_{k \in \mathcal{K}} \left( G_k(p_k, \eta) - 1 \right)^2 + \frac{\alpha}{\beta+1} \left( \sum_{k \in \mathcal{K}} G_k(p_k, \eta) - K \right)^2 + \frac{D\sigma_n^2}{\eta} \qquad (14a) $$
$$ \text{s.t.} \quad 0 \leq p_k \leq P_k, \; \forall k \in \mathcal{K}, \qquad (14b) $$
$$ \eta \geq 0. \qquad (14c) $$
Different from the power control problem in [17], the objective function in (14a) contains not only the individual misalignment error $\frac{\beta\alpha}{\beta+1} \sum_{k \in \mathcal{K}} (G_k(p_k,\eta)-1)^2$, but also the composite misalignment error $\frac{\alpha}{\beta+1} \left( \sum_{k \in \mathcal{K}} G_k(p_k,\eta) - K \right)^2$. Problem P1 is non-convex in general. Even if the denoising factor $\eta$ is given, problem P1 is still hard to solve due to the coupling of the transmit powers $p_k$. In the following, we present some properties of the optimal solution.
Lemma 1: The optimal denoising factor $\eta^*$ for problem P1 satisfies $\eta^* \geq C_1^2$.
Proof: Please refer to Appendix A.
Lemma 1 reduces the range of $\eta^*$ and shows that the optimal transmit power of the first device is $p_1^* = P_1$.
Lemma 2: The optimal power control policy satisfies $p_k^* = P_k$, $\forall k \in \{1, \ldots, l\}$, and $p_k^* < P_k$, $\forall k \in \{l+1, \ldots, K\}$, for some $l \in \mathcal{K}$.
Proof: Please refer to Appendix B.
Based on Lemma 2, solving problem P1 is equivalent to minimizing the objective function in each of the following $K$ exclusive subregions of the global power region, denoted as $\{\mathcal{M}_l\}_{l \in \mathcal{K}}$, and comparing the corresponding optimal solutions to obtain the globally optimal solution:
$$ \mathcal{M}_l = \left\{ [p_1, \cdots, p_K] \in \mathbb{R}^K \,\middle|\, p_k = P_k, \forall k \in \{1, \ldots, l\}, \; 0 \leq p_k < P_k, \forall k \in \{l+1, \ldots, K\} \right\}. \qquad (15) $$
To facilitate the derivation, we denote by $\tilde{\mathcal{M}}_l$ the relaxed region of $\mathcal{M}_l$ obtained by removing the condition $p_k < P_k$ for $k \in \{l+1, \ldots, K\}$, i.e.,
$$ \tilde{\mathcal{M}}_l = \left\{ [p_1, \cdots, p_K] \in \mathbb{R}^K \,\middle|\, p_k = P_k, \forall k \in \{1, \ldots, l\}, \; p_k \geq 0, \forall k \in \{l+1, \ldots, K\} \right\}. \qquad (16) $$
¹The weight should be 1 for all devices in the ideal case.
For the sub-problem defined in each relaxed subregion $\tilde{\mathcal{M}}_l$, $l \in \mathcal{K}$, taking the derivative of the objective function (14a) w.r.t. $p_k$ and equating it to zero for all $k \in \{l+1, \ldots, K\}$, we obtain the optimal transmit power $\tilde{p}^*_{l,k}$ at any given $\eta$ as
$$ \tilde{p}^*_{l,k} = \left[ \frac{\beta + K - \sum_{i=1}^l G_i(P_i, \eta)}{\beta + K - l} \right]^2 \cdot \frac{B_k^2\, \eta}{|h_k|^2}, \quad k \in \{l+1, \ldots, K\}. \qquad (17) $$
Note that under the power control in (17), the aggregation level $G_k(\tilde{p}^*_{l,k}, \eta)$ of each device $k \in \{l+1, \ldots, K\}$ is the same, given by
$$ G_0(l) = \frac{\beta + K - \frac{1}{\sqrt{\eta}} \sum_{i=1}^l C_i}{\beta + K - l}. \qquad (18) $$
Substituting (17) back into (14a) and setting its derivative w.r.t. $\eta$ to zero, we derive the optimal denoising factor $\eta$ in closed form for problem P1 defined in the $l$-th relaxed subregion, i.e.,
$$ \sqrt{\tilde{\eta}^*_l} = \frac{ \frac{\beta\alpha}{\beta+1} \sum_{i=1}^l C_i^2 + \frac{\beta\alpha}{(\beta+K-l)(\beta+1)} \left( \sum_{i=1}^l C_i \right)^2 + D\sigma_n^2 }{ \frac{\beta(\beta+K)\alpha}{(\beta+K-l)(\beta+1)} \sum_{i=1}^l C_i }. \qquad (19) $$
Note that $\tilde{p}^*_{l,k}$ may be no less than the power constraint $P_k$ for some $k \in \{l+1, \ldots, K\}$, and thus the corresponding $\tilde{\mathbf{p}}^*_l$ may not lie in the subregion $\mathcal{M}_l$. If this happens, the optimal transmit power of the sub-problem defined in subregion $\mathcal{M}_l$ is irrelevant and need not be considered. This is stated in the following lemma.
Lemma 3: For the power control problem P1 defined in the $l$-th relaxed subregion $\tilde{\mathcal{M}}_l$, if $\exists k \in \{l+1, \ldots, K\}$ such that $\tilde{p}^*_{l,k} \geq P_k$, the globally optimal power $\mathbf{p}^* \triangleq [p_1^*, \ldots, p_K^*]$ of problem P1 must not be in $\mathcal{M}_l$.
Proof: Please refer to Appendix C.
Lemma 3 shows that only the transmit power vectors $\tilde{\mathbf{p}}^*_l$ with elements satisfying $\tilde{p}^*_{l,k} < P_k$, $\forall k \in \{l+1, \ldots, K\}$, are legal candidates for problem P1. Let $\mathcal{L}$ denote the index set of the corresponding relaxed subregions. Note that $\mathcal{L}$ is non-empty because $\tilde{\mathbf{p}}^*_K$ is always a legal transmit power candidate. Then, we only need to compare the legal candidate values to obtain the minimum MSE:
$$ l^* = \arg\min_{l \in \mathcal{L}} V_l, \qquad (20) $$
where $V_l$ is the optimal value of (14a) in subregion $\mathcal{M}_l$ of P1 and can be easily obtained by substituting (17) and (19) back into (14a). The optimal solution to problem P1 is given as follows.
Theorem 1: The optimal transmit power at each device that solves problem P1 is given by
$$ p_k^* = \begin{cases} P_k, & \forall k \in \{1, \ldots, l^*\}, \\ \left[ \frac{\beta + K - \sum_{i=1}^{l^*} G_i(P_i, \eta^*)}{\beta + K - l^*} \right]^2 \cdot \frac{B_k^2\, \eta^*}{|h_k|^2}, & \forall k \in \{l^*+1, \ldots, K\}, \end{cases} \qquad (21) $$
and the optimal denoising factor at the edge server is given by
$$ \sqrt{\eta^*} = \frac{ \frac{\beta\alpha}{\beta+1} \sum_{i=1}^{l^*} C_i^2 + \frac{\beta\alpha}{(\beta+K-l^*)(\beta+1)} \left( \sum_{i=1}^{l^*} C_i \right)^2 + D\sigma_n^2 }{ \frac{\beta(\beta+K)\alpha}{(\beta+K-l^*)(\beta+1)} \sum_{i=1}^{l^*} C_i }, \qquad (22) $$
where $l^*$ is given in (20).
Proof: Please refer to Appendix D.
Remark 1: Theorem 1 shows that the devices $k \in \{1, \ldots, l^*\}$ with aggregation capability not higher than that of device $l^*$ should transmit their gradients with full power, i.e., $p_k = P_k$, while the devices $k \in \{l^*+1, \ldots, K\}$ with aggregation capability higher than that of device $l^*$ should transmit with the power such that they all have the same aggregation level, given by
$$ G_0^* = \frac{\beta + K - \sum_{i=1}^{l^*} G_i(P_i, \eta^*)}{\beta + K - l^*}, \qquad (23) $$
which is somewhat analogous to channel inversion.
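The solution in Theorem 1 can be computed by enumerating the candidate subregions; the following sketch (an illustrative implementation of (17)-(22) and (20), assuming the devices are pre-sorted by $C_k$ and $\beta > 0$; all variable names are ours) returns the minimum-MSE power vector and denoising factor:

```python
import numpy as np

def aggregation_mse(p, eta, h_abs, B, alpha, beta, D, sigma_n2):
    """Objective (14a), i.e., eq. (11) without the 1/K^2 factor."""
    K = len(p)
    G = np.sqrt(p) * h_abs / (np.sqrt(eta) * B)
    return (beta * alpha / (beta + 1)) * np.sum((G - 1.0) ** 2) \
         + (alpha / (beta + 1)) * (G.sum() - K) ** 2 + D * sigma_n2 / eta

def optimal_power_control(h_abs, B, P, alpha, beta, D, sigma_n2):
    """Enumerate the subregions M_l (Lemmas 2-3) and apply eqs. (17)-(22).
    Devices must be sorted so that C_1 <= ... <= C_K; requires beta > 0."""
    K = len(h_abs)
    C = np.sqrt(P) * h_abs / B                 # aggregation capabilities
    best = None
    for l in range(1, K + 1):
        S1, S2 = C[:l].sum(), (C[:l] ** 2).sum()
        # Optimal denoising factor in the l-th relaxed subregion, eq. (19).
        num = (beta * alpha / (beta + 1)) * S2 \
            + (beta * alpha / ((beta + K - l) * (beta + 1))) * S1 ** 2 + D * sigma_n2
        den = (beta * (beta + K) * alpha / ((beta + K - l) * (beta + 1))) * S1
        sqrt_eta = num / den
        eta = sqrt_eta ** 2
        p = P.astype(float)
        if l < K:
            G0 = (beta + K - S1 / sqrt_eta) / (beta + K - l)      # eq. (18)
            p[l:] = G0 ** 2 * B[l:] ** 2 * eta / h_abs[l:] ** 2   # eq. (17)
            if np.any(p[l:] >= P[l:]):
                continue                        # Lemma 3: not a legal candidate
        v = aggregation_mse(p, eta, h_abs, B, alpha, beta, D, sigma_n2)
        if best is None or v < best[0]:
            best = (v, p, eta, l)
    return best  # (minimum MSE, powers p*, denoising factor eta*, index l*)
```

Since the enumeration covers all legal candidates of Lemma 3, the returned MSE should not exceed that of any feasible $(\mathbf{p}, \eta)$ pair.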
B. On The Optimal Transmit Power Function

In this subsection, we analyse the optimal transmit power $\mathbf{p}^*$ as a vector function of the gradient SMCV $\beta$ and the noise variance $\sigma_n^2$, i.e., $\mathbf{p}^*(\beta, \sigma_n^2) = [p_1^*(\beta, \sigma_n^2), \ldots, p_K^*(\beta, \sigma_n^2)]$. Note that the vector function $\mathbf{p}^*(\beta, \sigma_n^2)$ might have abrupt changes at some $(\beta, \sigma_n^2)$ due to the discrete nature of the optimal device threshold $l^*$ in Theorem 1. In this work, however, we can show that $\mathbf{p}^*(\beta, \sigma_n^2)$ is continuous² everywhere for all $(\beta \geq 0, \sigma_n^2 \geq 0)$. Define $\mathcal{X}_l \subseteq \mathbb{R}^2$ as the domain of the vector function $\mathbf{p}^*(\beta, \sigma_n^2)$ on which $\mathbf{p}^*(\beta, \sigma_n^2)$ lies in the subregion $\mathcal{M}_l$, for $l \in \mathcal{K}$. That is, $\forall (\beta, \sigma_n^2) \in \mathcal{X}_l$, we have $l^*(\beta, \sigma_n^2) = l$.
We first show that $\mathbf{p}^*(\beta, \sigma_n^2)$ is continuous and monotonic within each domain $\mathcal{X}_l$, $l \in \mathcal{K}$. Let $\tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)$ denote the optimal power of the sub-problem defined in the $l$-th relaxed subregion $\tilde{\mathcal{M}}_l$, $\forall l \in \mathcal{K}$. Based on (17) and (19), it is obvious that $\tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)$ is continuous within each domain $\mathcal{X}_l$. Moreover, $\tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)$ increases monotonically with the noise variance $\sigma_n^2$, since enlarging $\sigma_n^2$ increases both $\tilde{\eta}^*_l$ and $\tilde{p}^*_{l,k}$, for $k \in \{l+1, \ldots, K\}$. Furthermore, $\tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)$ decreases monotonically with the gradient SMCV $\beta$, since it can be easily shown that $\frac{\partial \tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)}{\partial \beta}$ is always negative. Thus $\tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)$ is monotonic within each domain $\mathcal{X}_l$, $l \in \mathcal{K}$. By definition, within each domain $\mathcal{X}_l$, $\mathbf{p}^*(\beta, \sigma_n^2)$ is equal to $\tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)$; thus, the vector function $\mathbf{p}^*(\beta, \sigma_n^2)$ is also continuous and monotonic within each domain $\mathcal{X}_l$, for $l \in \mathcal{K}$.
Then we find the boundary of each domain $\mathcal{X}_l$, for $l \in \mathcal{K}$, and the corresponding optimal transmit power $\mathbf{p}^*$. To this end, we need the following lemma on the lower and upper bounds of the optimal transmit power values.
Lemma 4: The optimal transmit power $\mathbf{p}^*$ lies in subregion $\mathcal{M}_l$ if and only if the optimal transmit power $\tilde{\mathbf{p}}^*_l$ in the $l$-th relaxed subregion $\tilde{\mathcal{M}}_l$ satisfies $\left( \frac{C_l B_k}{|h_k|} \right)^2 \leq \tilde{p}^*_{l,k} < \left( \frac{C_{l+1} B_k}{|h_k|} \right)^2$, $\forall k \in \{l+1, \ldots, K\}$.
Proof: Please refer to Appendix E.
Lemma 4 shows that in each domain $\mathcal{X}_l$, $l \in \mathcal{K}$, the range of $\mathbf{p}^*(\beta, \sigma_n^2)$ lies in the left-closed, right-open intervals $\left[ \left( \frac{C_l B_k}{|h_k|} \right)^2, \left( \frac{C_{l+1} B_k}{|h_k|} \right)^2 \right)$, $\forall k \in \{l+1, \ldots, K\}$. Recall that $\tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)$ is continuous and monotonic. The optimal transmit power $\mathbf{p}^*(\beta, \sigma_n^2) = \tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)$ sits at the lower bound of this range when $(\beta, \sigma_n^2)$ is at the lower boundary of domain $\mathcal{X}_l$, denoted as $\mathcal{L}_l \triangleq \left\{ (\beta, \sigma_n^2) \,\middle|\, \tilde{p}^*_{l,k}(\beta, \sigma_n^2) = \left( \frac{C_l B_k}{|h_k|} \right)^2, \forall k \in \{l+1, \ldots, K\} \right\}$, and approaches the upper bound of this range when $(\beta, \sigma_n^2)$ approaches the upper boundary of domain $\mathcal{X}_l$, denoted as $\mathcal{U}_l \triangleq \left\{ (\beta, \sigma_n^2) \,\middle|\, \tilde{p}^*_{l,k}(\beta, \sigma_n^2) = \left( \frac{C_{l+1} B_k}{|h_k|} \right)^2, \forall k \in \{l+1, \ldots, K\} \right\}$.

²$\mathbf{p}^*(\beta, \sigma_n^2)$ is a continuous vector-valued function if and only if each element $p_k^*(\beta, \sigma_n^2)$, for $k \in \{1, \ldots, K\}$, is a continuous function.

Fig. 5. Illustration of the optimal transmit power $\mathbf{p}^*$ as a function of gradient SMCV $\beta$. (a) Average received SNR = 0 dB. (b) Average received SNR = 10 dB.
Next, we consider the continuity of $\mathbf{p}^*(\beta, \sigma_n^2)$ at the boundaries $\mathcal{L}_l$ and $\mathcal{U}_l$ for each $l \in \mathcal{K}$ in the following lemma.
Lemma 5: The optimal transmit power function $\mathbf{p}^*(\beta, \sigma_n^2)$ is continuous at $\mathcal{U}_l = \mathcal{L}_{l+1}$, for $l \in \{1, \ldots, K-1\}$.
Proof: Please refer to Appendix F.
Finally, we can formally conclude the following property of the optimal transmit power function with respect to the gradient statistics and the noise variance.
Property 1: The optimal transmit power $\mathbf{p}^*(\beta, \sigma_n^2)$ of problem P1 is a continuous and monotonic vector function on the entire domain $(\beta \geq 0, \sigma_n^2 \geq 0)$. Moreover, it decreases with $\beta$ and increases with $\sigma_n^2$.
Take a system with $K = 6$ devices for illustration. Fig. 5 shows the optimal transmit power of each device, $p_k^*$, as a function of the gradient SMCV $\beta$, together with the corresponding optimal index $l^*$. The results are based on one channel realization of each device, $|h_k|$, drawn independently from the normalized Rayleigh distribution, given by $[0.50, 0.82, 0.85, 1.16, 2.09, 2.83]$. The peak power constraint of each device is set to be the same, resulting in the same average received SNR, given by $\frac{P_k}{D\sigma_n^2} = 5$ dB and 10 dB, respectively. The gradient norms of the devices, $B_k$, are $[0.23, 0.31, 0.26, 0.28, 0.28, 0.16]$, the noise variance is $\sigma_n^2 = 1$, and $D = 1$. Fig. 5 clearly verifies that the optimal transmit power $\mathbf{p}^*(\beta, \sigma_n^2)$ is a continuous and monotonically decreasing vector function of the gradient SMCV $\beta$. In particular, it is seen that when $\beta \to 0$, all the devices transmit with full power; as $\beta$ increases, the power of the devices with large aggregation capability decreases and then gradually approaches a constant, and the larger the aggregation capability, the smaller the constant transmit power. In addition, it is observed from Fig. 5 that when the peak power budget increases, more devices transmit with less power than their peak values. Equivalently, the optimal transmit power decreases when the noise variance decreases.
C. Power Control Problem for Special Cases
In this subsection, we discuss the optimal power control policy in the two special cases where the gradient SMCV β → ∞ and β → 0, respectively.
1) β → ∞: As discussed in Section II-D, this case may happen when the model training converges and/or the dataset is highly non-IID. In this case, the composite signal misalignment error in the MSE expression disappears. Thus problem P1 reduces to the power control problem in [17]. To be self-contained, we restate the problem and its solution as follows:

P2:  min  α Σ_{k∈K} (G_k(p_k, η) − 1)^2 + Dσ_n^2/η   (24a)
     s.t.  0 ≤ p_k ≤ P_k, ∀k ∈ K   (24b)
           η ≥ 0.   (24c)
Theorem 2 ([17]): The optimal transmit power that solves problem P2 has a threshold-based structure, i.e.,

p*_k = P_k, ∀k ∈ {1, ..., l*};   p*_k = B_k^2 η*/|h_k|^2, ∀k ∈ {l*+1, ..., K},   (25)

where the optimal denoising factor is given by

η* = ( (α Σ_{i=1}^{l*} C_i^2 + Dσ_n^2) / (α Σ_{i=1}^{l*} C_i) )^2,   (26)

and l* is given in (20). Furthermore, it holds that C_{l*}^2 ≤ η* ≤ C_{l*+1}^2.
Proof: Please refer to [17, Theorem 1].
Note that the optimal denoising factor η* for problem P2 can be interpreted as a threshold, since whether a device transmits with full power or not depends entirely on the comparison between its aggregation capability C_k and √η*. Moreover, this threshold increases with σ_n^2. In the extreme case when the noise power is very large and so is the threshold, all the devices shall transmit with full power. Nevertheless, such a threshold interpretation of η* does not hold for problem P1 with general β.
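As a concrete illustration of this threshold structure, the following sketch computes the policy of Theorem 2. Since (20) is not restated in this section, the sketch instead scans for the index l* at which the stated sandwich condition C_{l*}^2 ≤ η* ≤ C_{l*+1}^2 holds; C_k = √P_k |h_k|/B_k is assumed as the aggregation capability, and devices are assumed pre-sorted so that C_1 ≤ ... ≤ C_K.

```python
import math

def threshold_policy(P, h, B, alpha, D, sigma2_n):
    """Threshold-based power control of Theorem 2 (the beta -> infinity case).

    Devices are assumed sorted by increasing aggregation capability
    C_k = sqrt(P_k)|h_k|/B_k.  Instead of (20), l* is located by scanning
    for the index where C_{l*}^2 <= eta* <= C_{l*+1}^2, as Theorem 2 states.
    """
    K = len(P)
    C = [math.sqrt(P[k]) * h[k] / B[k] for k in range(K)]
    for l in range(1, K + 1):
        # Candidate denoising factor (26) with the first l devices at full power.
        num = alpha * sum(c * c for c in C[:l]) + D * sigma2_n
        den = alpha * sum(C[:l])
        eta = (num / den) ** 2
        if C[l - 1] ** 2 <= eta and (l == K or eta <= C[l] ** 2):
            # Power allocation (25): full power up to l*, channel inversion after.
            p = [P[k] if k < l else B[k] ** 2 * eta / h[k] ** 2 for k in range(K)]
            return l, eta, p
    raise ValueError("no index satisfies the sandwich condition")
```

With α = 1 and the Fig. 5 channel realization at 10 dB, for instance, this returns l* = 1: only the device with the smallest aggregation capability transmits at peak power, while the others invert their channels.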
2) β → 0: As discussed in Section II-D, β → 0 could happen when the model training just begins. In this case, the individual signal misalignment error disappears in the MSE expression. The original problem P1 reduces to

P3:  min  α ( Σ_{k∈K} G_k(p_k, η) − K )^2 + Dσ_n^2/η   (27a)
     s.t.  0 ≤ p_k ≤ P_k, ∀k ∈ K   (27b)
           η ≥ 0.   (27c)
Theorem 3: The optimal transmit power that solves problem P3 is full power transmission, i.e.,

p*_k = P_k, ∀k ∈ K,   (28)

and the optimal denoising factor is given by

η* = ( (α (Σ_{i∈K} C_i)^2 + Dσ_n^2) / (αK Σ_{i∈K} C_i) )^2.   (29)
Proof: Please refer to Appendix G.
Remark 2: The optimal solution of problem P3 is the special case of the solution of problem P1 with l* = K. Note that the direction of the gradient vector received from each device at the edge server is independent of that device's transmit power. Thus, increasing the power of all devices reduces the noise-induced error while the composite signal misalignment error remains fixed.
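The closed form (29) is easy to check numerically: under full-power transmission it is the exact minimizer of the β → 0 objective in (27a). A minimal sketch, assuming C_k = √P_k |h_k|/B_k so that Σ_k G_k(P_k, η) = Σ_k C_k/√η:

```python
def full_power_denoiser(C, alpha, D, sigma2_n):
    """Optimal denoising factor (29) for beta -> 0, with all devices at
    peak power; C holds the aggregation capabilities C_k (assumed
    C_k = sqrt(P_k)|h_k|/B_k, as in the ranking (13))."""
    K, S = len(C), sum(C)
    return ((alpha * S ** 2 + D * sigma2_n) / (alpha * K * S)) ** 2

# Sanity check on toy numbers: the beta -> 0 MSE as a function of eta,
# with p_k = P_k fixed, is minimized exactly at eta*.
C, alpha, D, s2 = [1.0, 2.0, 3.0], 0.5, 1, 0.2
mse = lambda eta: alpha * (sum(C) / eta ** 0.5 - len(C)) ** 2 + D * s2 / eta
eta_star = full_power_denoiser(C, alpha, D, s2)
# Perturbing eta in either direction can only increase the MSE.
assert mse(eta_star) <= min(mse(0.9 * eta_star), mse(1.1 * eta_star))
```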
IV. ADAPTIVE POWER CONTROL WITH UNKNOWN GRADIENT STATISTICS
In this section, we consider the practical scenario where the gradient statistics α(t) and β(t) are unknown. We propose a method to estimate α(t) and β(t) in each time block and then devise an adaptive power control scheme based on the optimal solution of problem P1, using the estimated α(t) and β(t).
A. Parameters Estimation
In this subsection, we propose a method to estimate α(t) and β(t) at each time block t directly from their definitions in (9) and (10), respectively.
1) Estimation of α(t): Note that the instantaneous gradient norm of each device, B_k(t), is assumed to be available at the edge server with negligible cost. By definition (9), we can estimate the gradient MSN as

α̂(t) = (1/K) Σ_{k∈K} B_k^2(t).   (30)
2) Estimation of β(t): By definition (10), the gradient SMCV β(t) depends on m_d(t) and σ_d(t). Before each device sends its gradient at time block t, β(t) cannot be estimated in advance. However, from the experimental results on real datasets in Figs. 2-4, it can be observed that β(t) changes slowly over the iterations t. Thus we propose to estimate β(t) based on the aggregated gradient at time block t−1 as

β̂(t) = [ α̂(t−1) − Σ_{d=1}^D ĝ_d^2(t−1) ] / Σ_{d=1}^D ĝ_d^2(t−1),   (31)

where α̂(t−1) estimates Σ_d (σ_d^2(t−1) + m_d^2(t−1)) and Σ_d ĝ_d^2(t−1) estimates Σ_d m_d^2(t−1).
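Both estimators are one-line computations. A sketch, with gradient vectors represented as plain Python lists purely for illustration:

```python
def estimate_alpha(grad_norms):
    """Gradient MSN estimate (30): the average squared norm B_k(t)^2
    over the K devices."""
    return sum(b * b for b in grad_norms) / len(grad_norms)

def estimate_beta(alpha_prev, agg_grad_prev):
    """Gradient SMCV estimate (31), computed from the previous round:
    alpha_hat(t-1) estimates sum_d (sigma_d^2 + m_d^2), while the squared
    aggregated gradient estimates sum_d m_d^2."""
    m2 = sum(g * g for g in agg_grad_prev)
    return (alpha_prev - m2) / m2

# Two devices with gradient norms 1 and 2: alpha_hat = (1 + 4)/2 = 2.5.
# If the previous aggregated gradient was [1, 1]: beta_hat = (2.5 - 2)/2.
assert estimate_alpha([1.0, 2.0]) == 2.5
assert estimate_beta(2.5, [1.0, 1.0]) == 0.25
```

Note that β̂(t) grows as the aggregated gradient shrinks relative to α̂, matching the observation that the SMCV increases as training converges.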
B. FL with Adaptive Power Control
In this subsection, we present the FL process with adaptive power control, summarized in Algorithm 1. The algorithm has three steps. First, each device locally takes one step of SGD on the current model using its local dataset (line 5). After that, each device calculates the norm of its local gradient and uploads it to the edge server via conventional digital transmission (lines 6 and 7). Second, the edge server estimates the parameters α(t) and β(t) based on the received gradient norms at time block t and the historical aggregated gradient (lines 9 and 16). Then the optimal transmit power and denoising factor are obtained based on (21) and (22), respectively (line 10). Third, the edge server informs each device of its optimal transmit power, and all devices simultaneously transmit their local gradients with the assigned power to the edge server in an analog manner using AirComp (lines 12-14).
Algorithm 1 FL Process with Adaptive Power Control
1: Initialize w(0) at the edge server and β̂(1);
2: for time block t = 1, ..., T do
3:   Edge server broadcasts the global model w(t) to all edge devices k ∈ K;
4:   for each device k ∈ K in parallel do
5:     g_k(t) = ∇L^SGD_{k,t}(w(t));
6:     B_k(t) = √( Σ_d g_{k,d}^2(t) );
7:     Upload B_k(t) to the edge server;
8:   end for
9:   Edge server estimates α̂(t) based on (30);
10:  Edge server obtains the optimal transmit power p*(t) based on (21) and the optimal denoising factor η*(t) based on (22);
11:  Edge server sends p*_k(t) to device k for all k ∈ K;
12:  for each device k ∈ K in parallel do
13:    Transmit gradient g_k(t) with power p*_k(t) to the edge server using AirComp;
14:  end for
15:  Edge server receives y(t) and recovers ĝ(t) based on (7);
16:  Edge server estimates β̂(t+1) based on (31);
17:  Edge server updates the global model: w(t+1) = w(t) − γ ĝ(t);
18: end for
19: Edge server returns w(T+1);
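Algorithm 1 can be exercised end to end on a toy problem. The sketch below is an illustrative stand-in, not the paper's setup: each device holds a small scalar-quadratic dataset, noisy full-power averaging replaces the AirComp step of lines 10-15 (the true policy would come from (21) and (22), which require the fading model not restated here), and the β̂ update of line 16 is omitted for brevity. The numbered comments map steps back to Algorithm 1.

```python
import math, random

random.seed(0)
K, D, T, gamma, sigma_n = 4, 3, 40, 0.3, 0.05

# Toy local data: each device holds 5 points around a device-specific mean,
# so the global optimum of the quadratic loss is the mean of all samples.
means = [[random.gauss(1.0, 0.5) for _ in range(D)] for _ in range(K)]
data = [[[random.gauss(m, 0.1) for m in means[k]] for _ in range(5)]
        for k in range(K)]

def local_grad(w, pts):
    # One local step: gradient of (1/2)||w - x||^2 averaged over the points.
    return [sum(w[d] - x[d] for x in pts) / len(pts) for d in range(D)]

w = [0.0] * D
for t in range(1, T + 1):                                        # time blocks
    grads = [local_grad(w, data[k]) for k in range(K)]           # line 5
    norms = [math.sqrt(sum(g * g for g in gk)) for gk in grads]  # line 6
    alpha_hat = sum(b * b for b in norms) / K                    # line 9, (30)
    # Stand-in for lines 10-15: noisy full-power aggregation over the air.
    g_hat = [sum(gk[d] for gk in grads) / K + random.gauss(0, sigma_n)
             for d in range(D)]
    w = [w[d] - gamma * g_hat[d] for d in range(D)]              # line 17
```

Despite the noisy aggregation, the model converges to a small neighborhood of the global data mean, mirroring the behavior Algorithm 1 targets at scale.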
V. EXPERIMENTAL RESULTS
In this section, we provide experimental results to validate the performance of the proposed
power control for AirComp-based FL over fading channels.
A. Experiment Setup
We conducted experiments on a simulated environment where the number of edge devices is
K= 10 if not specified otherwise. The wireless channels from each device to the edge server
follow IID Rayleigh fading, such that hk’s are modeled as IID complex Gaussian variables with
zero mean and unit variance. For each device k∈ K, we define SNRk=EhPk|hk|2
Dσ2
ni=Pk
Dσ2
nas
the average received SNR.
1) Baselines: We compare the proposed power control scheme with the following baseline
approaches:
•Error-free transmission: the aggregated gradient is updated without any transmission error,
which is equivalent to the centralized SGD algorithm.
• Threshold-based power control in [17]: the power control scheme given in [17], which assumes normalized signals. It is in fact the special case of our proposed power control scheme with β → ∞, i.e., it accounts only for the individual misalignment error in problem P1.
• Full power transmission: all devices transmit with full power P_k, and the edge server applies the optimal denoising factor in (22) with l* = K.
2) Datasets: We evaluate the training of a convolutional neural network on three datasets: MNIST, CIFAR-10, and SVHN. The MNIST dataset consists of 10 categories, the digits 0 to 9, with a total of 70,000 labeled samples (60,000 for training and 10,000 for testing). The CIFAR-10 dataset includes 60,000 color images (50,000 for training and 10,000 for testing) of 10 different types of objects. SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirements on data preprocessing and formatting; it includes 99,289 labeled samples (73,257 for training and 26,032 for testing).
3) Data Distribution: To study the impact of the gradient SMCV β on the optimal transmit power, we simulate two types of dataset partitions among the mobile devices: an IID setting and a non-IID one. For the former, we randomly partition the training samples into 100 equal shards, each of which is assigned to one particular device. For the latter, we first sort the data by digit label, divide it into 200 equal shards, and randomly assign 2 shards to each device.
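The two partitions can be sketched as one shard-based routine (an illustrative reimplementation of the description above, not the authors' code). Note that with 200 shards, 2 shards per device, and K = 10, some shards simply remain unassigned.

```python
import random

def shard_partition(labels, K, n_shards, shards_per_device, iid, seed=0):
    """IID: shuffle, cut into equal shards.  Non-IID: sort by label first,
    so each device ends up holding samples from only a few classes."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    if iid:
        rng.shuffle(idx)
    else:
        idx.sort(key=lambda i: labels[i])
    size = len(idx) // n_shards
    shards = [idx[s * size:(s + 1) * size] for s in range(n_shards)]
    rng.shuffle(shards)
    return [sum(shards[d * shards_per_device:(d + 1) * shards_per_device], [])
            for d in range(K)]

# 1000 samples over 10 balanced classes: with the non-IID split each device
# holds at most two distinct labels here.
labels = [c for c in range(10) for _ in range(100)]
parts = shard_partition(labels, K=10, n_shards=200, shards_per_device=2, iid=False)
assert all(len({labels[i] for i in p}) <= 2 for p in parts)
```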
4) Training and Control Parameters: In all our experiments, the number of local update steps between two global aggregations is 1, the local batch size of each edge device is 10, and the gradient descent step size is γ = 0.01.
B. Experimental Results
Fig. 6 compares the test accuracy on the three considered datasets under the IID and non-IID dataset partitions, where the average received SNR is set to 10 dB for all devices. It is observed that the model accuracy of the proposed scheme is better than both the threshold-based power control in [17] and full power transmission. From Figs. 2-4, we know that the averaged gradient SMCV β(t) under the IID partition is smaller than under the non-IID partition, and that β(t) increases over the iterations. The threshold-based power control in [17] suffers a significant accuracy loss under the IID partition and at the beginning of training. This is because the gradient SMCV is then small, so the MSE is dominated by the composite misalignment error; a scheme that considers only the individual misalignment error is therefore much inferior. Conversely, full power transmission suffers a significant accuracy loss under the non-IID partition and at the end of training. There the gradient SMCV is large, and full power transmission fails to minimize the individual misalignment error that dominates the MSE in this case.
Fig. 7 illustrates the test accuracy for MNIST with the non-IID data partition at an average received SNR of 5 dB. The overall performance of the proposed scheme remains better than the two baseline approaches in the low-SNR regime. Specifically, full power transmission performs better than the threshold-based power control in [17]. This is mainly because, when the noise variance is large, full power transmission strongly suppresses the noise-induced error that dominates the MSE.
Finally, Fig. 8 compares the test accuracy of the different power control schemes for a varying number of devices K. Here, the MNIST dataset with the non-IID partition is used, the average received SNR of all devices is set to SNR_k = 10 dB, and the results are averaged over 50 model trainings. First, it is observed that the test accuracy achieved by all four schemes increases with K, due to the fact that the edge server can aggregate more data for averaging. Second, the proposed scheme considerably outperforms both the threshold-based power control in [17] and full power transmission throughout the whole range of K. Full power transmission approaches the threshold-based power control in [17] when K is small (i.e., K = 4 in Fig. 8), but
[Fig. 6 shows six panels of test accuracy versus training rounds, each comparing the threshold-based scheme in [17], full power transmission, the proposed scheme, and error-free transmission: (a) MNIST with IID partition; (b) MNIST with non-IID partition; (c) CIFAR-10 with IID partition; (d) CIFAR-10 with non-IID partition; (e) SVHN with IID partition; (f) SVHN with non-IID partition.]
Fig. 6. Performance comparison on different dataset partitions, for K = 10 and SNR_k = 10 dB, ∀k ∈ K.
its performance degrades as K increases, due to the lack of power adaptation to reduce the misalignment error.
VI. CONCLUSION
This work studied the power control optimization problem for over-the-air federated learning over fading channels by taking the gradient statistics into account. The optimal power
[Fig. 7 plots test accuracy versus training rounds for the threshold-based scheme in [17], full power transmission, the proposed scheme, and error-free transmission.]
Fig. 7. Performance comparison for MNIST dataset with non-IID partition at the average received SNR = 5 dB.
[Fig. 8 plots test accuracy versus the number of devices K (from 4 to 20) for the four schemes.]
Fig. 8. Performance comparison over the number of devices, where the MNIST dataset has a non-IID partition and SNR_k = 10 dB, ∀k ∈ K.
control policy is derived in closed form when the first- and second-order gradient statistics are known. It is shown that the optimal transmit power of each device decreases with the gradient SMCV and increases with the noise variance. In the special cases where β approaches infinity and zero, the optimal transmit power reduces to the threshold-based power control in [17] and full power transmission, respectively. We also proposed an adaptive power control algorithm that dynamically adjusts the transmit power in each iteration based on the estimated gradient statistics. Experimental results show that the proposed adaptive power control scheme outperforms the existing schemes.
APPENDIX A
PROOF OF LEMMA 1
We prove this lemma by contradiction. Suppose the optimal denoising factor satisfies η* ≤ C_1^2. Then both the individual and composite signal misalignment errors can be forced to zero by letting p*_k = ηB_k^2/|h_k|^2, ∀k ∈ K. Problem P1 can thus be expressed as

min_{η ≤ C_1^2}  Dσ_n^2/η.   (32)

It is obvious that the optimal solution of the above problem is η* = C_1^2. Thus, it must hold that η* ≥ C_1^2 for problem P1.
APPENDIX B
PROOF OF LEMMA 2
To prove this lemma, we need to show that if p*_{k2} = P_{k2} for some device k2, then p*_{k1} = P_{k1} for all k1 < k2. We prove this by contradiction. Let p* = [p*_1, ..., p*_K] denote the optimal transmit power for problem P1, and assume that there are two devices k1 < k2 satisfying p*_{k1} < P_{k1} and p*_{k2} = P_{k2}. Then there always exists a modified transmit power p′ = [p*_1, ..., p*_{k1−1}, p′_{k1}, p*_{k1+1}, ..., p*_{k2−1}, p′_{k2}, p*_{k2+1}, ..., p*_K], where p′_{k1} = P_{k1} and p′_{k2} < P_{k2}, satisfying

√p*_{k1} |h_{k1}| / (√η* B_{k1}) + √p*_{k2} |h_{k2}| / (√η* B_{k2}) = √p′_{k1} |h_{k1}| / (√η* B_{k1}) + √p′_{k2} |h_{k2}| / (√η* B_{k2}).

The resulting MSE obtained under p′ differs from the minimum MSE under p* only in the individual misalignment error term. The difference is given by

MSE(p*) − MSE(p′)
= (βα/(β+1)) [ ( √p*_{k1}|h_{k1}|/(√η* B_{k1}) − 1 )^2 + ( √p*_{k2}|h_{k2}|/(√η* B_{k2}) − 1 )^2 − ( √p′_{k1}|h_{k1}|/(√η* B_{k1}) − 1 )^2 − ( √p′_{k2}|h_{k2}|/(√η* B_{k2}) − 1 )^2 ]   (33a)
= (βα/(β+1)) [ ( √p*_{k1}|h_{k1}|/(√η* B_{k1}) )^2 + ( √p*_{k2}|h_{k2}|/(√η* B_{k2}) )^2 − ( √p′_{k1}|h_{k1}|/(√η* B_{k1}) )^2 − ( √p′_{k2}|h_{k2}|/(√η* B_{k2}) )^2 ]   (33b)
= (βα/(β+1)) [ √p*_{k2}|h_{k2}|/(√η* B_{k2}) − √p′_{k2}|h_{k2}|/(√η* B_{k2}) ] [ √p*_{k2}|h_{k2}|/(√η* B_{k2}) + √p′_{k2}|h_{k2}|/(√η* B_{k2}) − √p*_{k1}|h_{k1}|/(√η* B_{k1}) − √p′_{k1}|h_{k1}|/(√η* B_{k1}) ]   (33c)
= (2βα/(β+1)) [ √p*_{k2}|h_{k2}|/(√η* B_{k2}) − √p′_{k2}|h_{k2}|/(√η* B_{k2}) ] [ √p*_{k2}|h_{k2}|/(√η* B_{k2}) − √p′_{k1}|h_{k1}|/(√η* B_{k1}) ]   (33d)
≥ 0.   (33e)

The inequality in (33e) holds since p*_{k2} = P_{k2} > p′_{k2} and √p*_{k2}|h_{k2}|/(√η* B_{k2}) ≥ √p′_{k1}|h_{k1}|/(√η* B_{k1}) by the aggregation capability ranking in (13). This indicates that p′ is at least as good as p*, which contradicts the assumption. Therefore, for every pair of devices k1 < k2, if p*_{k2} = P_{k2}, we must have p*_{k1} = P_{k1}. Lemma 2 is proved.
APPENDIX C
PROOF OF LEMMA 3
Proving this lemma is equivalent to proving that if the global optimal solution p* of problem P1 lies in subregion M_l, then the optimal solution p̃*_l of the sub-problem defined on the relaxed subregion M̃_l also lies in M_l. We prove this by contradiction. Assume that p* ∈ M_l but p̃*_l ∉ M_l. The derivative of (14a) with respect to p_k and η shows that p̃*_l is the only local optimal solution in M̃_l. Since p* is the optimal solution in M_l and cannot lie on the boundary of M_l, p* is also a local optimal solution in M_l ⊆ M̃_l. This contradicts the fact that p̃*_l is the only local optimal solution in M̃_l. Therefore, if the global optimal transmit power p* is in the l-th subregion M_l, p̃*_l must be in M_l. The proof of Lemma 3 is complete.
APPENDIX D
PROOF OF THEOREM 1
To complete the proof, we need to show that, with l* defined in (20), the optimal transmit power is p̃*_{l*} in the l*-th relaxed subregion M̃_{l*}. By Lemma 3, for every l ∈ L, since p̃*_l also lies in subregion M_l, p̃*_l is the optimal transmit power within M_l. Therefore, the candidate transmit power p̃*_{l*} with the smallest objective value V_{l*} is the optimal transmit power of problem P1. According to (17) and (19), the optimal transmit power and denoising factor take the forms given in Theorem 1.
APPENDIX E
PROOF OF LEMMA 4
When p* lies in subregion M_l, p* is equal to p̃*_l. Thus, proving the sufficiency of this lemma amounts to proving that the optimal transmit power p* satisfies

C_{l*} ≤ √p*_k |h_k| / B_k < C_{l*+1}, ∀k ∈ {l*+1, ..., K}.   (34)

Based on (21), p*_k < P_k for k ∈ {l*+1, ..., K}, and for k = l*+1 we have √p*_{l*+1} |h_{l*+1}| / B_{l*+1} < C_{l*+1}. Since √p*_k |h_k| / B_k is the same for all k ∈ {l*+1, ..., K}, we have √p*_k |h_k| / B_k < C_{l*+1} for all k ∈ {l*+1, ..., K}. We next prove C_{l*} ≤ √p*_k |h_k| / B_k, ∀k ∈ {l*+1, ..., K}, by contradiction. Assume that C_{l*} > √p*_{l*+1} |h_{l*+1}| / B_{l*+1}. As p*_k = P_k for k ∈ {1, ..., l*}, we have √p*_{l*} |h_{l*}| / B_{l*} = C_{l*} > √p*_{l*+1} |h_{l*+1}| / B_{l*+1}. Then there always exists a modified transmit power p′ = [p*_1, ..., p*_{l*−1}, p′_{l*}, p′_{l*+1}, p*_{l*+2}, ..., p*_K], where p′_{l*} and p′_{l*+1} satisfy

√p′_{l*} |h_{l*}| / B_{l*} = √p′_{l*+1} |h_{l*+1}| / B_{l*+1} = (1/2) ( √p*_{l*} |h_{l*}| / B_{l*} + √p*_{l*+1} |h_{l*+1}| / B_{l*+1} ).

The difference between the MSE under p* and under p′ is given by

MSE(p*) − MSE(p′)
= (βα/(β+1)) [ ( √p*_{l*}|h_{l*}|/(√η* B_{l*}) − 1 )^2 + ( √p*_{l*+1}|h_{l*+1}|/(√η* B_{l*+1}) − 1 )^2 − ( √p′_{l*}|h_{l*}|/(√η* B_{l*}) − 1 )^2 − ( √p′_{l*+1}|h_{l*+1}|/(√η* B_{l*+1}) − 1 )^2 ]   (35a)
= (βα/(β+1)) [ ( √p*_{l*}|h_{l*}|/(√η* B_{l*}) )^2 + ( √p*_{l*+1}|h_{l*+1}|/(√η* B_{l*+1}) )^2 − ( √p′_{l*}|h_{l*}|/(√η* B_{l*}) )^2 − ( √p′_{l*+1}|h_{l*+1}|/(√η* B_{l*+1}) )^2 ]   (35b)
= (βα/(2(β+1))) ( √p*_{l*}|h_{l*}|/(√η* B_{l*}) − √p*_{l*+1}|h_{l*+1}|/(√η* B_{l*+1}) )^2 > 0.   (35c)

This indicates that p′ is a better solution than p*, which contradicts the assumption. Hence, C_{l*} ≤ √p*_k |h_k| / B_k for k ∈ {l*+1, ..., K}, and the sufficiency of this lemma is proved.

To prove the necessity, we prove the contrapositive: if the optimal transmit power p* does not lie in subregion M_l, i.e., l ∈ {1, ..., l*−1, l*+1, ..., K}, then p̃*_l satisfies √p̃*_{l,k} |h_k| / B_k ≥ C_{l+1} or √p̃*_{l,k} |h_k| / B_k < C_l for all k ∈ {l+1, ..., K}.

First, we prove that p̃*_l, for all l ∈ {1, ..., l*−1}, satisfies

√p̃*_{l,k} |h_k| / B_k ≥ C_{l+1}, ∀k ∈ {l+1, ..., K}.   (36)

We prove this by contradiction. Assume that there exists l ∈ {1, ..., l*−1} with √p̃*_{l,k} |h_k| / B_k < C_{l+1}, ∀k ∈ {l+1, ..., K}. As C_{l+1} ≤ C_k for all k ∈ {l+1, ..., K} by (13), it follows that √p̃*_{l,k} |h_k| / B_k < C_k and hence p̃*_{l,k} < P_k for all k ∈ {l+1, ..., K}, i.e., p̃*_l is a feasible transmit power. Note that p̃*_l is the optimal transmit power in M̃_l and p* is the optimal transmit power in M_{l*}. As the constraint on p̃*_l is less restrictive than that on p*, i.e., M_{l*} ⊆ M̃_l, the feasible transmit power p̃*_l is better than the optimal transmit power p*, which contradicts the assumption. This establishes the inequality (36).

Second, we prove that p̃*_l, for all l ∈ {l*+1, ..., K}, satisfies

√p̃*_{l,k} |h_k| / B_k < C_l, ∀k ∈ {l+1, ..., K}.   (37)

We prove this by contradiction. When l = l*+1, assume that √p̃*_{l,k} |h_k| / B_k ≥ C_l, ∀k ∈ {l+1, ..., K}. Let p^avg = [p̃*_{l,1}, ..., p̃*_{l,l−1}, p^avg_l, ..., p^avg_K] denote a modified p̃*_l, where √p^avg_l |h_l| / B_l = ... = √p^avg_K |h_K| / B_K = (1/(K−l+1)) Σ_{i=l}^K √p̃*_{l,i} |h_i| / B_i. Using a proof similar to that of (35c), we can show that MSE(p^avg) ≤ MSE(p̃*_l). Let p^fes = [p̃*_{l,1}, ..., p̃*_{l,l−1}, p^fes_l, ..., p^fes_K] denote another modified p̃*_l, where √p^fes_k |h_k| / B_k = C_l for k ∈ {l, ..., K}. The derivative of (14a) with respect to p_k for all k ∈ {l+1, ..., K} shows that the MSE increases with p_k when p_k ≥ p*_k. Note that p^avg and p^fes are in the l*-th relaxed subregion M̃_{l*} and p^avg_k ≥ p^fes_k > p*_k for k ∈ {l, ..., K}, so MSE(p^fes) ≤ MSE(p^avg) and MSE(p^fes) ≤ MSE(p̃*_l). The transmit power p^fes ∈ M̃_l is thus better than the optimal transmit power p̃*_l ∈ M̃_l, which contradicts the optimality of p̃*_l in M̃_l. This establishes the inequality (37) for l = l*+1, and it extends to l ∈ {l*+2, ..., K} by mathematical induction.

Thus, the necessity of this lemma is proved. This completes the proof of Lemma 4.
APPENDIX F
PROOF OF LEMMA 5
We first prove that the upper boundary U_l is equal to the lower boundary L_{l+1}, for l ∈ {1, ..., K−1}. Then we prove that the optimal transmit power function p*(β, σ_n^2) is continuous at U_l = L_{l+1}, for l ∈ {1, ..., K−1}.

For all (β, σ_n^2) ∈ L_{l+1}, the corresponding optimal transmit power satisfies p̃*_{l+1,k}(β, σ_n^2) = (C_{l+1} B_k/|h_k|)^2, i.e., √p̃*_{l+1,k}(β, σ_n^2) |h_k| / B_k = C_{l+1}, for k ∈ {l+2, ..., K}. Based on (17) and (19), we have

√η̃*_{l+1}(β, σ_n^2) = [ Σ_{i=1}^{l+1} C_i + (β+K−l−1) C_{l+1} ] / (β+K)   (38a)

⇔ [ (βα/(β+1)) Σ_{i=1}^{l+1} C_i^2 + (βα/((β+K−l−1)(β+1))) (Σ_{i=1}^{l+1} C_i)^2 + Dσ_n^2 ] / [ (β(β+K)α/((β+K−l−1)(β+1))) Σ_{i=1}^{l+1} C_i ] = [ Σ_{i=1}^{l+1} C_i + (β+K−l−1) C_{l+1} ] / (β+K)   (38b)

⇔ [ (βα/(β+1)) Σ_{i=1}^{l+1} C_i^2 + Dσ_n^2 ] / [ (β(β+K)α/((β+K−l−1)(β+1))) Σ_{i=1}^{l+1} C_i ] = (β+K−l−1) C_{l+1} / (β+K)   (38c)

⇔ (βα/(β+1)) Σ_{i=1}^{l+1} C_i^2 + Dσ_n^2 = (βα/(β+1)) C_{l+1} Σ_{i=1}^{l+1} C_i   (38d)

⇔ (βα/(β+1)) Σ_{i=1}^{l} C_i^2 + Dσ_n^2 = (βα/(β+1)) C_{l+1} Σ_{i=1}^{l} C_i   (38e)

⇔ [ (βα/(β+1)) Σ_{i=1}^{l} C_i^2 + Dσ_n^2 ] / [ (β(β+K)α/((β+K−l)(β+1))) Σ_{i=1}^{l} C_i ] = (β+K−l) C_{l+1} / (β+K)   (38f)

⇔ [ (βα/(β+1)) Σ_{i=1}^{l} C_i^2 + (βα/((β+K−l)(β+1))) (Σ_{i=1}^{l} C_i)^2 + Dσ_n^2 ] / [ (β(β+K)α/((β+K−l)(β+1))) Σ_{i=1}^{l} C_i ] = [ Σ_{i=1}^{l} C_i + (β+K−l) C_{l+1} ] / (β+K)   (38g)

⇔ √η̃*_l(β, σ_n^2) = [ Σ_{i=1}^{l} C_i + (β+K−l) C_{l+1} ] / (β+K)   (38h)

⇔ √p̃*_{l,k}(β, σ_n^2) |h_k| / B_k = C_{l+1}, ∀k ∈ {l+1, ..., K}   (38i)

⇔ p̃*_{l,k}(β, σ_n^2) = (C_{l+1} B_k/|h_k|)^2, ∀k ∈ {l+1, ..., K}.   (38j)

Then (β, σ_n^2) ∈ U_l by the definition of U_l, i.e., L_{l+1} ⊆ U_l. Conversely, for all (β, σ_n^2) ∈ U_l, reversing the derivation in (38) gives (β, σ_n^2) ∈ L_{l+1}, i.e., U_l ⊆ L_{l+1}. Therefore, L_{l+1} = U_l, for l ∈ {1, ..., K−1}.

For any (β, σ_n^2)_0 ∈ L_{l+1} = U_l, define (β, σ_n^2)_+ ∈ X_{l+1} and (β, σ_n^2)_− ∈ X_l as two points infinitely close to (β, σ_n^2)_0. Then the left limit of p* at (β, σ_n^2)_0 is given by

lim_{(β,σ_n^2)_− → (β,σ_n^2)_0} p*((β, σ_n^2)_−) = lim_{(β,σ_n^2)_− → (β,σ_n^2)_0} p̃*_l((β, σ_n^2)_−) = p̃*_l((β, σ_n^2)_0) = p̃*_{l+1}((β, σ_n^2)_0) = p*((β, σ_n^2)_0).   (39)

Similarly, the right limit of p* at (β, σ_n^2)_0 is also equal to p*((β, σ_n^2)_0). Therefore, the optimal transmit power function p*(β, σ_n^2) is continuous at U_l = L_{l+1}, for l ∈ {1, ..., K−1}. This completes the proof of Lemma 5.
APPENDIX G
PROOF OF THEOREM 3
For any denoising factor η ≥ (1/K^2)(Σ_{k∈K} C_k)^2, the composite signal alignment satisfies Σ_{k∈K} G_k(p_k, η) ≤ K for problem P3. Therefore, to minimize the composite signal misalignment error (Σ_{k∈K} G_k(p_k, η) − K)^2, all the devices should always transmit with full power, i.e., p*_k = P_k, ∀k ∈ K. Problem P3 can then be expressed as

min_{η ≥ 0}  α ( Σ_{k∈K} G_k(P_k, η) − K )^2 + Dσ_n^2/η.   (40)

This objective is a quadratic function of 1/√η, so the optimal solution is readily derived as

η* = ( (α (Σ_{i∈K} C_i)^2 + Dσ_n^2) / (αK Σ_{i∈K} C_i) )^2.   (41)

This completes the proof of Theorem 3.
REFERENCES
[1] N. Zhang and M. Tao, “Gradient statistics aware power control for over-the-air federated learning in fading channels,” in
Proc. IEEE ICC Workshops, 2020, pp. 1–6.
[2] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, “Wireless network intelligence at the edge,” Proceedings of the IEEE,
vol. 107, no. 11, pp. 2204–2239, 2019.
[3] Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo, and J. Zhang, “Edge intelligence: Paving the last mile of artificial intelligence
with edge computing,” Proceedings of the IEEE, vol. 107, no. 8, pp. 1738–1762, 2019.
[4] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks
from decentralized data,” in Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
[5] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konecny, S. Mazzocchi, H. B.
McMahan et al., “Towards federated learning at scale: System design,” arXiv preprint arXiv:1902.01046, 2019.
[6] Q. Yang, Y. Liu, T. Chen, and Y. Tong, “Federated machine learning: Concept and applications,” ACM Transactions on
Intelligent Systems and Technology (TIST), vol. 10, no. 2, p. 12, 2019.
[7] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via randomized quantization and encoding,” Advances in Neural Information Processing Systems 30, vol. 3, pp. 1710–1721, 2018.
[8] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[9] A. F. Aji and K. Heafield, “Sparse communication for distributed gradient descent,” in Proceedings of the 2017 Conference
on Empirical Methods in Natural Language Processing, 2017, pp. 440–445.
[10] Y. Tsuzuku, H. Imachi, and T. Akiba, “Variance-based gradient compression for efficient distributed deep learning,” arXiv
preprint arXiv:1802.06058, 2018.
[11] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive federated learning in resource
constrained edge computing systems,” IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1205–1221, 2019.
[12] B. Nazer and M. Gastpar, “Computation over multiple-access channels,” IEEE Trans. Inf. Theory, vol. 53, no. 10, pp.
3498–3516, 2007.
[13] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,” IEEE Trans. Wireless Commun.,
vol. 19, no. 3, pp. 2022–2035, March 2020.
[14] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for low-latency federated edge learning,” IEEE Trans.
Wireless Commun., vol. 19, no. 1, pp. 491–506, Jan. 2020.
[15] M. M. Amiri and D. Gunduz, “Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air,”
in Proc. IEEE ISIT, July 2019, pp. 1432–1436.
[16] ——, “Federated learning over wireless fading channels,” IEEE Trans. Wireless Commun., pp. 1–1, 2020.
[17] X. Cao, G. Zhu, J. Xu, and K. Huang, “Optimal power control for over-the-air computation,” in Proc. IEEE GLOBECOM,
Dec. 2019, pp. 1–6.