arXiv:2003.02089v2 [eess.SP] 8 May 2020
Gradient Statistics Aware Power Control for
Over-the-Air Federated Learning
Naifu Zhang and Meixia Tao
Abstract
Federated learning (FL) is a promising technique that enables many edge devices to train a machine
learning model collaboratively in wireless networks. By exploiting the superposition nature of wireless
waveforms, over-the-air computation (AirComp) can accelerate model aggregation and hence facilitate
communication-efficient FL. Due to channel fading, power control is crucial in AirComp. Prior works assume that the signal to be aggregated from each device, i.e., the local gradient, can be normalized to zero mean and unit variance. In FL, however, gradient statistics vary over both training iterations and
feature dimensions, and are unknown in advance. This paper studies the power control problem for
over-the-air FL by taking gradient statistics into account. The goal is to minimize the aggregation error
by jointly optimizing the transmit power at each device and the denoising factor at the edge server.
We obtain the optimal policy in closed form when gradient statistics are given. Notably, we show that
the optimal transmit power at each device is continuous and monotonically decreases with the squared
multivariate coefficient of variation (SMCV) of gradient vectors. We also propose an estimation method
of gradient statistics with negligible communication cost. Experimental results demonstrate the high learning performance achieved by the proposed scheme.
Index Terms
Federated learning, over-the-air computation, power control, fading channel.
I. INTRODUCTION
The proliferation of mobile devices such as smartphones, tablets, and wearable devices has
revolutionized people’s daily lives. Due to the growing computation and sensing capabilities of these devices, a wealth of data is generated each day. This has promoted a wide
range of artificial intelligence (AI) applications such as image recognition and natural language
processing. Traditional machine learning procedure, including both training and inference, relies
on cloud computing on a centralized data center with computing, storage, and full access to the
entire data set. Wireless edge devices are thus required to transmit their collected data to a central
parameter server, which can be very costly in terms of energy and bandwidth consumption, and
unfavorable due to response delay and privacy concerns. It is thus increasingly desired to let
edge devices engage in the learning process by keeping the collected data locally and performing
training/inference either collaboratively or individually. This emerging technology is known as
Edge Machine Learning [2] or Edge Intelligence [3].
Federated learning [4]–[6] is a new edge learning framework that enables many edge devices
to collaboratively train a machine learning model without exchanging datasets under the coordi-
nation of an edge server in wireless networks. Compared with traditional learning at a centralized
data center, FL offers several distinct advantages, such as preserving privacy, reducing network
congestion, and leveraging distributed on-device computation. In FL, each edge device downloads
N. Zhang and M. Tao are with the Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, 200240, P. R. China (email: arthaslery@sjtu.edu.cn; mxtao@sjtu.edu.cn). Part of this work is to be presented at the IEEE ICC 2020 workshop [1].
a shared model from the edge server, computes an update to the current model by learning from
its local dataset, then sends this update to the edge server. Therein, the updates are averaged to
improve the shared model.
The communication cost is the main bottleneck in FL since a large number of participating
edge devices send their updates to the edge server at each round of the model training. Existing
methods to obtain communication-efficient FL can be mainly divided into three categories: model
parameter compression [7], [8], gradient sparsification [9], [10], and infrequent local update
[4], [11]. Nevertheless, the communication cost of FL is still proportional to the number of edge devices, and is thus inefficient in large-scale environments. Recently, a fast model aggregation approach has been proposed for FL by applying the over-the-air computation (AirComp) principle [12],
such as in [13]–[16]. This is accomplished by exploiting the waveform superposition nature of
the wireless medium to compute the desired function of the distributed local gradients (i.e., the
weighted average function) by concurrent transmission. Such AirComp-based FL, referred to as
over-the-air FL in this work, can dramatically save the uplink communication bandwidth.
Due to channel fading, device selection and power control are crucial to achieving reliable
and high-performance over-the-air FL. In [17], the authors jointly optimize the transmit power
at edge devices and the receive scaling factor (known as denoising factor) at the edge server for
minimizing the mean square error (MSE) of the aggregated signal. It is shown that the optimal
transmit power in static channels exhibits a threshold-based structure. Namely, each device
applies channel-inversion power control if its quality indicator exceeds the optimal threshold,
and applies full power transmission otherwise. The work [17], however, is purely for AirComp-
based signal aggregation (not in the context of learning), where the signal on each device is
assumed to be independent and identically distributed (IID) with zero mean and unit variance.
For AirComp-based gradient aggregation in FL, the work [14] introduces a truncation-based
approach for excluding the edge devices with deep fading channels to strike a good balance
between learning performance and aggregation error. The work [13] proposes a joint device
selection and receiver beamforming design method to find the maximum number of selected
devices with MSE requirements to improve the learning performance. As in [17], it is assumed
in both [13] and [14] that the signal (i.e., the local gradient) to be aggregated from each device
is IID, and normalized with zero mean and unit variance. By exploiting the sparsity pattern
in gradient vectors, the work [16] projects the gradient estimate in each device into a low-
dimensional vector and transmits only the important gradient entries while accumulating the
error from previous iterations. Therein, a channel-inversion like power control scheme, similar
to those in [13], [14], [17] is designed so that the gradient vectors sent from the selected devices
are aligned at the edge server.
Note that all the existing works on power control for over-the-air FL have overlooked the
following statistical characteristics of gradients: the gradient distribution over training iterations
is independent but not necessarily identical; and even in the same iteration, the distribution of
each entry of the gradient vector can be non-identical. A general observation is that the gradient
distribution changes over iterations and is different in each feature dimension. In addition, if the
gradient distribution is unknown for each device, normalizing the gradient to a distribution with
zero mean and unit variance is infeasible. As such, due to the neglect of the above characteristics
of gradient distribution, the existing power control methods for over-the-air FL may not perform
efficiently in practice.
Motivated by the above issue, in this paper, we study the optimal power control problem
for over-the-air FL in fading channels by taking gradient statistics into account. Our goal is to
minimize the MSE of the aggregated model, and hence improve the accuracy of FL, by jointly
optimizing the transmit power at each device and the denoising factor at the edge server given
the first-order and second-order statistics of gradients at each iteration. The main contributions
of this work are outlined below:
• Optimal power control with known gradient statistics: We first derive the MSE expression of
gradient aggregation at each iteration of the model training when the first-order and second-
order statistics of the gradient vectors are known. We then formulate a joint optimization
problem of transmit power at edge devices and denoising factor at the edge server for MSE
minimization subject to individual peak power constraints at edge devices. By decomposing
this non-convex problem into subproblems defined on different subregions, we obtain the
optimal power control strategy in closed form. Unlike the existing threshold-based power
control for normalized signal in [17], our optimal transmit power depends not only on the
channel quality and noise power, but also heavily on the gradient statistics. In particular,
we find that the relative dispersion of the gradient values, i.e., the squared multivariate
coefficient of variation (SMCV) of the gradient vectors, plays a key role in the optimal
transmit power control. Specifically, we prove that the optimal transmit power of each
device is a continuous and monotonically decreasing function of the gradient SMCV.
• Optimal power control in special cases: In the special case where the gradient SMCV
approaches infinity, which could happen when the model training converges and/or the
dataset is highly non-IID, we show that there is an optimal threshold for the aggregation
capabilities of the devices, below which the devices transmit with full power and above
which the devices transmit at the power to equalize the weights of their gradients for
aggregation to one. In the other special case where the gradient SMCV approaches zero,
which could happen when the model training just begins, we show that the optimal power
control is to let all the devices transmit with their peak powers.
• Adaptive power control with unknown gradient statistics: We propose an adaptive power
control algorithm that estimates the gradient statistics based on the historical aggregated
gradients and then dynamically adjusts the power values in each iteration based on the
estimation results. The communication cost consumed by estimating the gradient statistics
is negligible compared to the transmission of the entire gradient vector.
To evaluate the efficiency of the proposed power control scheme, we implement FL in PyTorch for AI applications on three datasets: MNIST, CIFAR-10, and SVHN. Experimental results demonstrate that over-the-air FL with the proposed adaptive power control attains higher model accuracy than with the existing power control methods (full power transmission and threshold-based power control for normalized signals [17]). In particular, while full power transmission performs poorly in the high-SNR region and under non-IID data distributions, and the threshold-based power control for normalized signals [17] performs poorly in the low-SNR region and under IID data distributions, the proposed power control performs very well over a wide range of scenarios by exploiting the gradient statistics.
The rest of this paper is organized as follows. The over-the-air FL system is modeled in Section
II. Section III describes the optimal power control strategy with known gradient statistics. In
Section IV, we introduce an adaptive power control scheme when the gradient statistics are
unknown in advance. Section V provides experimental results. Finally we conclude the paper in
Section VI.
Fig. 1. Illustration of over-the-air federated learning.
II. SYSTEM MODEL
A. Federated Learning Over Wireless Networks
We consider a wireless FL framework as illustrated in Fig. 1, where a shared AI model (e.g., a classifier) is trained collaboratively across $K$ single-antenna edge devices via the coordination of a single-antenna edge server. Let $\mathcal{K} = \{1, \ldots, K\}$ denote the set of edge devices. Each device $k \in \mathcal{K}$ collects a fraction of labelled training data via interaction with its own users, constituting a local dataset, denoted as $\mathcal{D}_k$. The loss function measuring the model error is defined as
$$ L(\mathbf{w}) = \sum_{k \in \mathcal{K}} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} L_k(\mathbf{w}), \qquad (1) $$
where $\mathbf{w} \in \mathbb{R}^D$ denotes the $D$-dimensional model parameter to be learned, $L_k(\mathbf{w}) = \frac{1}{|\mathcal{D}_k|} \sum_{i \in \mathcal{D}_k} l_i(\mathbf{w})$ is the loss function of device $k$ quantifying the prediction error of the model $\mathbf{w}$ on the local dataset collected at the $k$-th device, with $l_i(\mathbf{w})$ being the sample-wise loss function, and $\mathcal{D} = \bigcup_{k \in \mathcal{K}} \mathcal{D}_k$
is the union of all datasets. The minimization of $L(\mathbf{w})$ is typically carried out through the stochastic gradient descent (SGD) algorithm, where device $k$'s local dataset $\mathcal{D}_k$ is split into mini-batches of size $B$ and, at each iteration $t = 1, 2, \ldots$, we draw one mini-batch $\mathcal{B}_k(t)$ randomly and update the model parameter as
$$ \mathbf{w}(t+1) = \mathbf{w}(t) - \gamma \frac{1}{K} \sum_{k \in \mathcal{K}} \nabla L^{\mathrm{SGD}}_{k,t}(\mathbf{w}(t)), \qquad (2) $$
with $\gamma$ being the learning rate and $L^{\mathrm{SGD}}_{k,t}(\mathbf{w}) = \frac{1}{B} \sum_{i \in \mathcal{B}_k(t)} l_i(\mathbf{w})$. Note that the mean of the gradient $\nabla L^{\mathrm{SGD}}_{k,t}(\mathbf{w}(t))$ in SGD is equal to the gradient $\nabla L(\mathbf{w}(t))$ in gradient descent (GD), while the variance depends on the mini-batch size and the distribution of the data (IID or non-IID).
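For concreteness, the update rule (2) can be sketched as follows (an illustrative toy snippet with made-up values, not the training code used in the experiments):

```python
import numpy as np

def federated_sgd_step(w, local_grads, lr):
    """One error-free update of eq. (2): average the K local mini-batch
    gradients and take a single gradient-descent step."""
    g = np.mean(local_grads, axis=0)  # ideal aggregation, as in eq. (3)
    return w - lr * g

# Toy usage: K = 2 devices, D = 2 model parameters.
w = np.zeros(2)
local_grads = np.array([[1.0, 1.0], [3.0, 3.0]])
w_next = federated_sgd_step(w, local_grads, lr=0.5)  # -> [-1., -1.]
```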
B. Over-The-Air Computation for Gradient Aggregation

We consider block fading channels, where the wireless channels remain unchanged within the duration of each iteration in FL but may change independently from one iteration to another. We define the duration of one iteration as one time block, indexed by $t \in \mathbb{N}$. It is assumed that the channel coefficients over different time blocks are generated from a stationary and ergodic process. Let $\mathbf{g}_k(t) \triangleq \nabla L^{\mathrm{SGD}}_{k,t}(\mathbf{w}(t)) \in \mathbb{R}^D$ denote the gradient vector computed on device $k$ at time block $t$. The following are key assumptions on the distribution of each entry $g_{k,d}(t)$, $d \in \{1, \ldots, D\}$, of $\mathbf{g}_k(t)$:
• The gradient elements $\{g_{k,d}(t)\}$, $\forall k \in \mathcal{K}$, are independent and identically distributed across devices. This is a default assumption since the distributions of the local datasets at different devices, whether identical or not, are unknown to the edge server, and thus the distributions of the local gradients trained from these local datasets are treated equally.
• The gradient elements $\{g_{k,d}(t)\}$, $\forall t \in \mathbb{N}$, are independent but non-identically distributed across iterations. In other words, the gradient distribution is non-stationary over time. This is valid since the gradient values in general change rapidly at the beginning and then gradually approach zero as the training goes on.
• The gradient elements $\{g_{k,d}(t)\}$, $\forall d \in \{1, 2, \ldots, D\}$, are independent but non-identically distributed across the gradient dimensions. This assumption is valid as long as the features in a data sample are independent but non-identically distributed, which is typically the case.
The gradient of interest at the edge server at each time block $t$ is given by
$$ \mathbf{g}(t) = \frac{1}{K} \sum_{k \in \mathcal{K}} \mathbf{g}_k(t). \qquad (3) $$
To obtain (3), all the devices transmit their gradient vectors $\mathbf{g}_k(t)$ concurrently in an analog manner, following the AirComp principle as shown in Fig. 1. Each transmission block takes a duration of $D$ slots, one slot for each entry of the $D$-dimensional gradient vector. Each gradient vector $\mathbf{g}_k(t)$ is multiplied by a pre-processing factor, denoted as $b_k(t)$. The received signal vector at the edge server is given by
$$ \mathbf{y}(t) = \sum_{k \in \mathcal{K}} b_k(t) h_k(t) \mathbf{g}_k(t) + \mathbf{n}(t), \qquad (4) $$
where $h_k(t)$ denotes the complex-valued channel coefficient from device $k$ to the edge server and $\mathbf{n}(t)$ denotes the additive white Gaussian noise (AWGN) vector at the edge server, with each element having zero mean and variance $\sigma_n^2$. To compensate the channel phase offset and control the actual transmit power at each device, we let $b_k(t) = \frac{\sqrt{p_k(t)}\, e^{-j\theta_k(t)}}{B_k(t)}$, where $p_k(t) \geq 0$ denotes the transmit power at device $k \in \mathcal{K}$ at time block $t$, $\theta_k(t)$ is the phase of $h_k(t)$, and $B_k(t) \triangleq \|\mathbf{g}_k(t)\| = \sqrt{\sum_{d=1}^D g_{k,d}^2(t)}$ denotes the gradient norm of device $k$. Here, we have assumed that each device $k$ can estimate its own channel phase $\theta_k(t)$ perfectly. To design the optimal power control policy $p_k(t)$, for $k \in \mathcal{K}$, we also assume that the edge server knows the channel amplitudes $|h_k|$ of all devices. With this design of $b_k(t)$, we can rewrite (4) as
$$ \mathbf{y}(t) = \sum_{k \in \mathcal{K}} \frac{\sqrt{p_k(t)}\, |h_k(t)|}{B_k(t)} \mathbf{g}_k(t) + \mathbf{n}(t). \qquad (5) $$
Each device $k \in \mathcal{K}$ has a peak power budget $P_k$, i.e.,
$$ p_k(t) \leq P_k, \quad \forall k \in \mathcal{K}, \forall t \in \mathbb{N}. \qquad (6) $$
Upon receiving $\mathbf{y}(t)$, the edge server applies a denoising factor, denoted by $\eta(t)$, to recover the gradient of interest as
$$ \hat{\mathbf{g}}(t) = \frac{\mathbf{y}(t)}{K \sqrt{\eta(t)}}, \qquad (7) $$
where the factor $1/K$ is employed for averaging purposes.
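The transmit-receive chain in (5) and (7) can be simulated as follows (an illustrative sketch under the above assumptions, with the channel phase already compensated so that only the amplitudes $|h_k|$ appear; all numeric values are made up):

```python
import numpy as np

def aircomp_round(grads, h_abs, p, eta, sigma_n, rng):
    """Simulate one AirComp aggregation round, eqs. (5) and (7).
    grads: K x D local gradients; h_abs: channel amplitudes |h_k|;
    p: transmit powers; eta: denoising factor at the edge server."""
    K, D = grads.shape
    B = np.linalg.norm(grads, axis=1)        # gradient norms B_k(t)
    scale = np.sqrt(p) * h_abs / B           # sqrt(p_k)|h_k|/B_k per device
    y = scale @ grads + rng.normal(0.0, sigma_n, D)  # superposed signal, eq. (5)
    return y / (K * np.sqrt(eta))            # recovered gradient, eq. (7)

# Noise-free sanity check: with p_k = (B_k/|h_k|)^2 * eta, every device's
# aggregation weight equals 1 and the true average of eq. (3) is recovered.
rng = np.random.default_rng(0)
grads = np.array([[1.0, 1.0], [3.0, 3.0]])
h_abs = np.array([1.0, 2.0])
eta = 1.0
B = np.linalg.norm(grads, axis=1)
p = (B / h_abs) ** 2 * eta
g_hat = aircomp_round(grads, h_abs, p, eta, sigma_n=0.0, rng=rng)  # -> [2., 2.]
```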
C. Performance Measure
We are interested in minimizing the distortion of the recovered gradient $\hat{\mathbf{g}}(t)$ with respect to (w.r.t.) the ground-truth gradient $\mathbf{g}(t)$. The distortion at a given iteration $t$ is measured by the instantaneous MSE, defined as
$$ \mathrm{MSE}(t) \triangleq \mathbb{E}\left[ \|\hat{\mathbf{g}}(t) - \mathbf{g}(t)\|^2 \right] = \frac{1}{K^2} \mathbb{E}\left[ \left\| \frac{\mathbf{y}(t)}{\sqrt{\eta(t)}} - \sum_{k \in \mathcal{K}} \mathbf{g}_k(t) \right\|^2 \right] $$
$$ = \frac{1}{K^2} \left[ \sum_{d=1}^D \sigma_d^2(t) \sum_{k \in \mathcal{K}} \left( \frac{\sqrt{p_k(t)}\, |h_k(t)|}{\sqrt{\eta(t)}\, B_k(t)} - 1 \right)^2 + \sum_{d=1}^D m_d^2(t) \left( \frac{1}{\sqrt{\eta(t)}} \sum_{k \in \mathcal{K}} \frac{\sqrt{p_k(t)}\, |h_k(t)|}{B_k(t)} - K \right)^2 + \frac{D \sigma_n^2}{\eta(t)} \right], \qquad (8) $$
where the expectation is over the distribution of the transmitted gradients $\mathbf{g}_k(t)$ and the received noise $\mathbf{n}(t)$, and $m_d(t)$ and $\sigma_d^2(t)$ denote the mean (first-order statistic) and variance (second-order statistic) of the $d$-th entry of the gradient $\mathbf{g}(t)$ at iteration $t$, respectively. Notice that the gradient norm $B_k(t)$ of each device $k$ can be transmitted to the edge server with negligible communication cost; thus it is treated as a known value in (8).
Observing (8) closely, we find that the MSE consists of three components: the individual misalignment error (the first term), the composite misalignment error (the second term), and the noise-induced error (the third term). The individual misalignment error is weighted by the gradient variance $\sum_{d=1}^D \sigma_d^2(t)$, while the composite misalignment error is weighted by the squared gradient mean $\sum_{d=1}^D m_d^2(t)$. In the special case when the gradient has zero mean, the MSE in (8) reduces to that in [17], where the composite misalignment error is absent. Our objective is to minimize the MSE in (8) by jointly optimizing the transmit power $p_k(t)$ at all devices and the denoising factor $\eta(t)$ at the edge server, subject to the individual power budget in (6).
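The three components can be evaluated separately; the sketch below (an illustrative decomposition with hypothetical parameter values) mirrors (8) term by term:

```python
import numpy as np

def mse_components(p, eta, h_abs, B, m, var, sigma_n2):
    """Evaluate the three terms of eq. (8): individual misalignment error,
    composite misalignment error, and noise-induced error (with 1/K^2)."""
    K = len(p)
    D = len(m)
    G = np.sqrt(p) * h_abs / (np.sqrt(eta) * B)   # per-device aggregation weights
    ind = var.sum() * np.sum((G - 1.0) ** 2)      # weighted by the gradient variance
    comp = (m ** 2).sum() * (G.sum() - K) ** 2    # weighted by the squared gradient mean
    noise = D * sigma_n2 / eta
    return np.array([ind, comp, noise]) / K ** 2

# With perfectly aligned weights (G_k = 1 for all k), only the noise term remains.
h_abs = np.array([0.8, 1.5]); B = np.array([0.3, 0.4]); eta = 2.0
p = (B / h_abs) ** 2 * eta                        # enforces G_k = 1
errs = mse_components(p, eta, h_abs, B,
                      m=np.array([0.1, 0.2]), var=np.array([0.05, 0.05]),
                      sigma_n2=0.1)
# errs[0] and errs[1] vanish; errs[2] = 2 * 0.1 / 2.0 / 2**2 = 0.025
```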
D. Gradient Statistics
In general, the individual misalignment error and the composite misalignment error in MSE of
the gradient aggregation (8) cannot be minimized simultaneously due to the peak power budget
on each device. It is difficult to capture the tradeoff between the two errors by directly using
Fig. 2. Experimental results of MSN $\alpha(t)$ (left y-axis in linear scale) and SMCV $\beta(t)$ (right y-axis) over iterations for dataset MNIST, where the number of edge devices is 10 and the local mini-batch size is 20.
their respective weights, namely, the gradient variance and the gradient mean. To tackle this issue, we introduce two alternative parameters of gradient statistics in this subsection.
Let $\alpha(t)$ denote the mean squared norm (MSN) of $\mathbf{g}(t)$, i.e., $\mathbb{E}[\|\mathbf{g}(t)\|^2]$, which is given by
$$ \alpha(t) = \sum_{d=1}^D \left( \sigma_d^2(t) + m_d^2(t) \right), \qquad (9) $$
and let $\beta(t)$ denote the squared multivariate coefficient of variation (SMCV) of $\mathbf{g}(t)$, which is given by
$$ \beta(t) = \frac{\sum_{d=1}^D \sigma_d^2(t)}{\sum_{d=1}^D m_d^2(t)}. \qquad (10) $$
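Both statistics can be computed from the per-dimension means and variances; the following sketch (an illustration with made-up samples) estimates them empirically from sampled realizations of $\mathbf{g}(t)$:

```python
import numpy as np

def gradient_msn_smcv(grad_samples):
    """Estimate the MSN alpha(t) of eq. (9) and the SMCV beta(t) of eq. (10)
    from sampled realizations of g(t) (one row per realization)."""
    m = grad_samples.mean(axis=0)             # per-dimension mean m_d(t)
    var = grad_samples.var(axis=0)            # per-dimension variance sigma_d^2(t)
    alpha = float(np.sum(var + m ** 2))       # eq. (9): E[||g(t)||^2]
    beta = float(var.sum() / (m ** 2).sum())  # eq. (10): relative dispersion
    return alpha, beta

samples = np.array([[0.0, 2.0], [2.0, 0.0]])
alpha, beta = gradient_msn_smcv(samples)      # alpha = 4.0, beta = 1.0
```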
While $\alpha(t)$ measures the average absolute gradient values at iteration $t$, $\beta(t)$ is a measure of the relative dispersion of the gradient values at iteration $t$. Shortly, we shall show that the MSE of the model aggregation is mainly dominated by $\beta(t)$ rather than $\alpha(t)$. Figs. 2-4 illustrate the experimental results of the alternative gradient statistics $\alpha(t)$ and $\beta(t)$ on three datasets, MNIST, CIFAR-10, and SVHN, respectively, where the gradients are updated ideally without any transmission error. Both IID and non-IID partitions are considered for the training dataset. Each value of $\alpha(t)$ and $\beta(t)$ is obtained by averaging over 300 model trainings. It is observed that as the training time increases, the gradient MSN $\alpha(t)$ decreases while the gradient SMCV $\beta(t)$ increases for all three datasets. This result agrees with the intuition that the absolute gradient values in SGD-based learning gradually vanish as the training goes on, but their relative variation remains significant at each iteration. It is also observed that the gradient SMCV $\beta(t)$ under the non-IID partition is much larger than that under the IID partition for all three datasets. This indicates that the gradient distribution with non-IID dataset partition is more dispersive than that with IID partition, as expected.
Fig. 3. Experimental results of MSN $\alpha(t)$ (left y-axis in log scale) and SMCV $\beta(t)$ (right y-axis) over iterations for dataset CIFAR-10, where the number of edge devices is 10 and the local mini-batch size is 200.

Fig. 4. Experimental results of MSN $\alpha(t)$ (left y-axis in log scale) and SMCV $\beta(t)$ (right y-axis) over iterations for dataset SVHN, where the number of edge devices is 10 and the local mini-batch size is 300.

By (9) and (10), the MSE in (8) can be rewritten as (omitting the constant coefficient $1/K^2$ for convenience)
$$ \mathrm{MSE}(t) = \frac{\beta(t)\alpha(t)}{\beta(t)+1} \sum_{k \in \mathcal{K}} \left( \frac{\sqrt{p_k(t)}\, |h_k(t)|}{\sqrt{\eta(t)}\, B_k(t)} - 1 \right)^2 + \frac{\alpha(t)}{\beta(t)+1} \left( \frac{1}{\sqrt{\eta(t)}} \sum_{k \in \mathcal{K}} \frac{\sqrt{p_k(t)}\, |h_k(t)|}{B_k(t)} - K \right)^2 + \frac{D \sigma_n^2}{\eta(t)}. \qquad (11) $$
It is seen from (11) that while the gradient MSN $\alpha(t)$ appears linearly in the weights of both the individual and composite misalignment errors, the gradient SMCV $\beta(t)$ plays a more distinguishing role in the MSE expression. In particular, when $\beta(t) \to 0$, which could be the case when the model training just begins (as verified by Figs. 2-4), the individual signal misalignment error can be neglected. When $\beta(t) \to \infty$, which could be the case when the model training converges or in the middle of training when the dataset is highly non-IID (as verified by Figs. 2-4), the composite signal misalignment error disappears.
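The weights in (11) are consistent with those in (8), since $\sum_d \sigma_d^2(t) = \frac{\beta(t)\alpha(t)}{\beta(t)+1}$ and $\sum_d m_d^2(t) = \frac{\alpha(t)}{\beta(t)+1}$; a quick numerical check with made-up per-dimension statistics:

```python
import numpy as np

# Made-up per-dimension gradient statistics (illustration only).
m = np.array([0.3, -0.1, 0.2])        # means m_d
var = np.array([0.04, 0.09, 0.01])    # variances sigma_d^2

alpha = np.sum(var + m ** 2)          # MSN, eq. (9)
beta = var.sum() / (m ** 2).sum()     # SMCV, eq. (10)

# The weights of eq. (11) recover those of eq. (8).
assert np.isclose(beta * alpha / (beta + 1), var.sum())
assert np.isclose(alpha / (beta + 1), (m ** 2).sum())
```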
III. OPTIMAL POWER CONTROL WITH KNOWN GRADIENT STATISTICS
In this section, we formulate and solve the optimal power control problem for minimizing the MSE when the gradient statistics $\alpha(t)$ and $\beta(t)$ are known. For convenience, we omit the iteration index $t$ in this section. For each device $k \in \mathcal{K}$, we define its aggregation level with power $p$ and denoising factor $\eta$ as
$$ G_k(p, \eta) = \frac{\sqrt{p}\, |h_k|}{\sqrt{\eta}\, B_k}, \qquad (12) $$
which indicates the weight of the gradient from device $k$ in the global gradient aggregation (7).¹ Furthermore, we define the aggregation capability of device $k$ as its aggregation level with peak power $P_k$ and unit denoising factor $\eta = 1$, i.e., $C_k = G_k(P_k, 1) = \frac{\sqrt{P_k}\, |h_k|}{B_k}$. Without loss of generality, we assume that
$$ C_1 \leq \ldots \leq C_k \leq \ldots \leq C_K. \qquad (13) $$
A. Power Control Problem for the General Case

In this subsection, we consider the optimal power control problem for MSE minimization in the general case with arbitrary $\beta$. The problem is formulated as
$$ \mathrm{P1:} \quad \min \; \frac{\beta\alpha}{\beta+1} \sum_{k \in \mathcal{K}} \left( G_k(p_k, \eta) - 1 \right)^2 + \frac{\alpha}{\beta+1} \left( \sum_{k \in \mathcal{K}} G_k(p_k, \eta) - K \right)^2 + \frac{D\sigma_n^2}{\eta} \qquad (14a) $$
$$ \text{s.t.} \quad 0 \leq p_k \leq P_k, \; \forall k \in \mathcal{K}, \qquad (14b) $$
$$ \eta \geq 0. \qquad (14c) $$
Different from the power control problem in [17], the objective function in (14a) contains not only the individual misalignment error $\frac{\beta\alpha}{\beta+1} \sum_{k \in \mathcal{K}} (G_k(p_k,\eta)-1)^2$, but also the composite misalignment error $\frac{\alpha}{\beta+1} \left( \sum_{k \in \mathcal{K}} G_k(p_k,\eta) - K \right)^2$. Problem P1 is non-convex in general. Even if the denoising factor $\eta$ is given, problem P1 is still hard to solve due to the coupling of the transmit powers $p_k$. In the following, we present some properties of the optimal solution.
Lemma 1: The optimal denoising factor $\eta^*$ for problem P1 satisfies $\eta^* \geq C_1^2$.
Proof: Please refer to Appendix A.
Lemma 1 reduces the range of $\eta^*$ and shows that the optimal transmit power of the first device is $p_1^* = P_1$.
Lemma 2: The optimal power control policy satisfies $p_k^* = P_k$, $\forall k \in \{1, \ldots, l\}$, and $p_k^* < P_k$, $\forall k \in \{l+1, \ldots, K\}$, for some $l \in \mathcal{K}$.
Proof: Please refer to Appendix B.
Based on Lemma 2, solving problem P1 is equivalent to minimizing the objective function in each of the following $K$ exclusive subregions of the global power region, denoted as $\{\mathcal{M}_l\}_{l \in \mathcal{K}}$, and comparing the corresponding optimal solutions to obtain the globally optimal solution:
$$ \mathcal{M}_l = \left\{ [p_1, \cdots, p_K] \in \mathbb{R}^K \,\middle|\, p_k = P_k, \forall k \in \{1, \ldots, l\}, \; 0 \leq p_k < P_k, \forall k \in \{l+1, \ldots, K\} \right\}. \qquad (15) $$
To facilitate the derivation, we denote by $\tilde{\mathcal{M}}_l$ the relaxed region of $\mathcal{M}_l$ obtained by removing the condition $p_k < P_k$ for $k \in \{l+1, \ldots, K\}$, i.e.,
$$ \tilde{\mathcal{M}}_l = \left\{ [p_1, \cdots, p_K] \in \mathbb{R}^K \,\middle|\, p_k = P_k, \forall k \in \{1, \ldots, l\}, \; p_k \geq 0, \forall k \in \{l+1, \ldots, K\} \right\}. \qquad (16) $$
¹The weight should be 1 for all devices in the ideal case.
For the sub-problem defined in each relaxed subregion $\tilde{\mathcal{M}}_l$, $l \in \mathcal{K}$, taking the derivative of the objective function (14a) w.r.t. $p_k$ and equating it to zero for all $k \in \{l+1, \ldots, K\}$, we obtain the optimal transmit power $\tilde{p}^*_{l,k}$ at any given $\eta$ as
$$ \tilde{p}^*_{l,k} = \left[ \frac{\beta + K - \sum_{i=1}^l G_i(P_i, \eta)}{\beta + K - l} \right]^2 \cdot \frac{B_k^2\, \eta}{|h_k|^2}, \quad k \in \{l+1, \ldots, K\}. \qquad (17) $$
Note that under the power control in (17), the aggregation level $G_k(\tilde{p}^*_{l,k}, \eta)$ of each device $k \in \{l+1, \ldots, K\}$ is the same, given by
$$ G_0(l) = \frac{\beta + K - \frac{1}{\sqrt{\eta}} \sum_{i=1}^l C_i}{\beta + K - l}. \qquad (18) $$
Substituting (17) back into (14a) and setting its derivative w.r.t. $\eta$ to zero, we derive the optimal denoising factor $\eta$ in closed form for problem P1 defined in the $l$-th relaxed subregion, i.e.,
$$ \sqrt{\tilde{\eta}^*_l} = \frac{ \frac{\beta\alpha}{\beta+1} \sum_{i=1}^l C_i^2 + \frac{\beta\alpha}{(\beta+K-l)(\beta+1)} \left( \sum_{i=1}^l C_i \right)^2 + D\sigma_n^2 }{ \frac{\beta(\beta+K)\alpha}{(\beta+K-l)(\beta+1)} \sum_{i=1}^l C_i }. \qquad (19) $$
Note that $\tilde{p}^*_{l,k}$ may be no less than the power constraint $P_k$ for some $k \in \{l+1, \ldots, K\}$, and thus the corresponding $\tilde{\mathbf{p}}^*_l$ may not lie in the subregion $\mathcal{M}_l$. If this happens, the optimal transmit power of the sub-problem defined in subregion $\mathcal{M}_l$ is irrelevant and need not be considered. This is stated in the following lemma.
Lemma 3: For the power control problem P1 defined in the $l$-th relaxed subregion $\tilde{\mathcal{M}}_l$, if $\exists k \in \{l+1, \ldots, K\}$ such that $\tilde{p}^*_{l,k} \geq P_k$, the globally optimal power $\mathbf{p}^* \triangleq [p_1^*, \ldots, p_K^*]$ of problem P1 must not be in $\mathcal{M}_l$.
Proof: Please refer to Appendix C.
Lemma 3 shows that only the transmit power vectors $\tilde{\mathbf{p}}^*_l$ with elements satisfying $\tilde{p}^*_{l,k} < P_k$, $\forall k \in \{l+1, \ldots, K\}$, are legal candidates for problem P1. Let $\mathcal{L}$ denote the index set of the corresponding relaxed subregions. Note that $\mathcal{L}$ is non-empty because $\tilde{\mathbf{p}}^*_K$ is always a legal transmit power candidate. Then, we only need to compare the legal candidate values to obtain the minimum MSE:
$$ l^* = \arg\min_{l \in \mathcal{L}} V_l, \qquad (20) $$
where $V_l$ is the optimal value of (14a) in subregion $\mathcal{M}_l$ of P1 and can be easily obtained by substituting (17) and (19) back into (14a). The optimal solution to problem P1 is given as follows.
Theorem 1: The optimal transmit power at each device that solves problem P1 is given by
$$ p_k^* = \begin{cases} P_k, & \forall k \in \{1, \ldots, l^*\}, \\ \left[ \frac{\beta + K - \sum_{i=1}^{l^*} G_i(P_i, \eta^*)}{\beta + K - l^*} \right]^2 \cdot \frac{B_k^2\, \eta^*}{|h_k|^2}, & \forall k \in \{l^*+1, \ldots, K\}, \end{cases} \qquad (21) $$
and the optimal denoising factor at the edge server is given by
$$ \sqrt{\eta^*} = \frac{ \frac{\beta\alpha}{\beta+1} \sum_{i=1}^{l^*} C_i^2 + \frac{\beta\alpha}{(\beta+K-l^*)(\beta+1)} \left( \sum_{i=1}^{l^*} C_i \right)^2 + D\sigma_n^2 }{ \frac{\beta(\beta+K)\alpha}{(\beta+K-l^*)(\beta+1)} \sum_{i=1}^{l^*} C_i }, \qquad (22) $$
where $l^*$ is given in (20).
Proof: Please refer to Appendix D.
Remark 1: Theorem 1 shows that the devices $k \in \{1, \ldots, l^*\}$ with aggregation capability not higher than that of device $l^*$ should transmit their gradients with full power, i.e., $p_k = P_k$, while the devices $k \in \{l^*+1, \ldots, K\}$ with aggregation capability higher than that of device $l^*$ should transmit with the power such that they all have the same aggregation level, given by
$$ G_0^* = \frac{\beta + K - \sum_{i=1}^{l^*} G_i(P_i, \eta^*)}{\beta + K - l^*}, \qquad (23) $$
which is somewhat analogous to channel inversion.
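The solution in Theorem 1 can be computed by enumerating the candidate subregions; the following sketch (an illustrative implementation of (17)-(22) and (20), assuming the devices are pre-sorted by $C_k$ and $\beta > 0$; all variable names are ours) returns the minimum-MSE power vector and denoising factor:

```python
import numpy as np

def aggregation_mse(p, eta, h_abs, B, alpha, beta, D, sigma_n2):
    """Objective (14a), i.e., eq. (11) without the 1/K^2 factor."""
    K = len(p)
    G = np.sqrt(p) * h_abs / (np.sqrt(eta) * B)
    return (beta * alpha / (beta + 1)) * np.sum((G - 1.0) ** 2) \
         + (alpha / (beta + 1)) * (G.sum() - K) ** 2 + D * sigma_n2 / eta

def optimal_power_control(h_abs, B, P, alpha, beta, D, sigma_n2):
    """Enumerate the subregions M_l (Lemmas 2-3) and apply eqs. (17)-(22).
    Devices must be sorted so that C_1 <= ... <= C_K; requires beta > 0."""
    K = len(h_abs)
    C = np.sqrt(P) * h_abs / B                 # aggregation capabilities
    best = None
    for l in range(1, K + 1):
        S1, S2 = C[:l].sum(), (C[:l] ** 2).sum()
        # Optimal denoising factor in the l-th relaxed subregion, eq. (19).
        num = (beta * alpha / (beta + 1)) * S2 \
            + (beta * alpha / ((beta + K - l) * (beta + 1))) * S1 ** 2 + D * sigma_n2
        den = (beta * (beta + K) * alpha / ((beta + K - l) * (beta + 1))) * S1
        sqrt_eta = num / den
        eta = sqrt_eta ** 2
        p = P.astype(float)
        if l < K:
            G0 = (beta + K - S1 / sqrt_eta) / (beta + K - l)      # eq. (18)
            p[l:] = G0 ** 2 * B[l:] ** 2 * eta / h_abs[l:] ** 2   # eq. (17)
            if np.any(p[l:] >= P[l:]):
                continue                        # Lemma 3: not a legal candidate
        v = aggregation_mse(p, eta, h_abs, B, alpha, beta, D, sigma_n2)
        if best is None or v < best[0]:
            best = (v, p, eta, l)
    return best  # (minimum MSE, powers p*, denoising factor eta*, index l*)
```

Since the enumeration covers all legal candidates of Lemma 3, the returned MSE should not exceed that of any feasible $(\mathbf{p}, \eta)$ pair.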
B. On The Optimal Transmit Power Function

In this subsection, we analyse the optimal transmit power $\mathbf{p}^*$ as a vector function of the gradient SMCV $\beta$ and the noise variance $\sigma_n^2$, i.e., $\mathbf{p}^*(\beta, \sigma_n^2) = [p_1^*(\beta, \sigma_n^2), \ldots, p_K^*(\beta, \sigma_n^2)]$. Note that the vector function $\mathbf{p}^*(\beta, \sigma_n^2)$ might have abrupt changes at some $(\beta, \sigma_n^2)$ due to the discrete nature of the optimal device threshold $l^*$ in Theorem 1. In this work, however, we can show that $\mathbf{p}^*(\beta, \sigma_n^2)$ is continuous² everywhere for all $(\beta \geq 0, \sigma_n^2 \geq 0)$. Define $\mathcal{X}_l \subseteq \mathbb{R}^2$ as the domain of the vector function $\mathbf{p}^*(\beta, \sigma_n^2)$ on which $\mathbf{p}^*(\beta, \sigma_n^2)$ lies in the subregion $\mathcal{M}_l$, for $l \in \mathcal{K}$. That is, $\forall (\beta, \sigma_n^2) \in \mathcal{X}_l$, we have $l^*(\beta, \sigma_n^2) = l$.
We first show that $\mathbf{p}^*(\beta, \sigma_n^2)$ is continuous and monotonic within each domain $\mathcal{X}_l$, $l \in \mathcal{K}$. Let $\tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)$ denote the optimal power of the sub-problem defined in the $l$-th relaxed subregion $\tilde{\mathcal{M}}_l$, $\forall l \in \mathcal{K}$. Based on (17) and (19), it is obvious that $\tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)$ is continuous within each domain $\mathcal{X}_l$. Moreover, $\tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)$ increases monotonically with the noise variance $\sigma_n^2$, since enlarging $\sigma_n^2$ increases both $\tilde{\eta}^*_l$ and $\tilde{p}^*_{l,k}$, for $k \in \{l+1, \ldots, K\}$. Furthermore, $\tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)$ decreases monotonically with the gradient SMCV $\beta$, since it can be easily shown that $\frac{\partial \tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)}{\partial \beta}$ is always negative. Thus $\tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)$ is monotonic within each domain $\mathcal{X}_l$, $l \in \mathcal{K}$. By definition, within each domain $\mathcal{X}_l$, $\mathbf{p}^*(\beta, \sigma_n^2)$ is equal to $\tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)$; thus, the vector function $\mathbf{p}^*(\beta, \sigma_n^2)$ is also continuous and monotonic within each domain $\mathcal{X}_l$, for $l \in \mathcal{K}$.
Then we find the boundary of each domain $\mathcal{X}_l$, for $l \in \mathcal{K}$, and the corresponding optimal transmit power $\mathbf{p}^*$. To this end, we need the following lemma on the lower and upper bounds of the optimal transmit power values.
Lemma 4: The optimal transmit power $\mathbf{p}^*$ lies in subregion $\mathcal{M}_l$ if and only if the optimal transmit power $\tilde{\mathbf{p}}^*_l$ in the $l$-th relaxed subregion $\tilde{\mathcal{M}}_l$ satisfies $\left( \frac{C_l B_k}{|h_k|} \right)^2 \leq \tilde{p}^*_{l,k} < \left( \frac{C_{l+1} B_k}{|h_k|} \right)^2$, $\forall k \in \{l+1, \ldots, K\}$.
Proof: Please refer to Appendix E.
Lemma 4 shows that in each domain $\mathcal{X}_l$, $l \in \mathcal{K}$, the range of $\mathbf{p}^*(\beta, \sigma_n^2)$ lies in the left-closed, right-open intervals $\left[ \left( \frac{C_l B_k}{|h_k|} \right)^2, \left( \frac{C_{l+1} B_k}{|h_k|} \right)^2 \right)$, $\forall k \in \{l+1, \ldots, K\}$. Recall that $\tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)$ is continuous and monotonic. The optimal transmit power $\mathbf{p}^*(\beta, \sigma_n^2) = \tilde{\mathbf{p}}^*_l(\beta, \sigma_n^2)$ sits at the lower bound of this range when $(\beta, \sigma_n^2)$ is at the lower boundary of domain $\mathcal{X}_l$, denoted as $\mathcal{L}_l \triangleq \left\{ (\beta, \sigma_n^2) \,\middle|\, \tilde{p}^*_{l,k}(\beta, \sigma_n^2) = \left( \frac{C_l B_k}{|h_k|} \right)^2, \forall k \in \{l+1, \ldots, K\} \right\}$, and approaches the upper bound of this range when $(\beta, \sigma_n^2)$ approaches the upper boundary of domain $\mathcal{X}_l$, denoted as $\mathcal{U}_l \triangleq \left\{ (\beta, \sigma_n^2) \,\middle|\, \tilde{p}^*_{l,k}(\beta, \sigma_n^2) = \left( \frac{C_{l+1} B_k}{|h_k|} \right)^2, \forall k \in \{l+1, \ldots, K\} \right\}$.

²$\mathbf{p}^*(\beta, \sigma_n^2)$ is a continuous vector-valued function if and only if each element $p_k^*(\beta, \sigma_n^2)$, for $k \in \{1, \ldots, K\}$, is a continuous function.

Fig. 5. Illustration of the optimal transmit power $\mathbf{p}^*$ as a function of gradient SMCV $\beta$. (a) Average received SNR = 0 dB. (b) Average received SNR = 10 dB.
Next, we consider the continuity of $\mathbf{p}^*(\beta, \sigma_n^2)$ at the boundaries $\mathcal{L}_l$ and $\mathcal{U}_l$ for each $l \in \mathcal{K}$ in the following lemma.
Lemma 5: The optimal transmit power function $\mathbf{p}^*(\beta, \sigma_n^2)$ is continuous at $\mathcal{U}_l = \mathcal{L}_{l+1}$, for $l \in \{1, \ldots, K-1\}$.
Proof: Please refer to Appendix F.
Finally, we can formally conclude the following property of the optimal transmit power function with respect to the gradient statistics and the noise variance.
Property 1: The optimal transmit power $\mathbf{p}^*(\beta, \sigma_n^2)$ of problem P1 is a continuous and monotonic vector function on the entire domain $(\beta \geq 0, \sigma_n^2 \geq 0)$. Moreover, it decreases with $\beta$ and increases with $\sigma_n^2$.
Take a system with $K = 6$ devices for illustration. Fig. 5 shows the optimal transmit power of each device, $p_k^*$, as a function of the gradient SMCV $\beta$, together with the corresponding optimal index $l^*$. The results are based on one channel realization of each device, $|h_k|$, drawn independently from the normalized Rayleigh distribution, given by $[0.50, 0.82, 0.85, 1.16, 2.09, 2.83]$. The peak power constraint of each device is set to be the same, resulting in the same average received SNR, given by $\frac{P_k}{D\sigma_n^2} = 5$ dB and 10 dB, respectively. The gradient norms of the devices, $B_k$, are $[0.23, 0.31, 0.26, 0.28, 0.28, 0.16]$, the noise variance is $\sigma_n^2 = 1$, and $D = 1$. Fig. 5 clearly verifies that the optimal transmit power $\mathbf{p}^*(\beta, \sigma_n^2)$ is a continuous and monotonically decreasing vector function of the gradient SMCV $\beta$. In particular, it is seen that when $\beta \to 0$, all the devices transmit with full power; as $\beta$ increases, the power of the devices with large aggregation capability decreases and then gradually approaches a constant, and the larger the aggregation capability, the smaller the constant transmit power. In addition, it is observed from Fig. 5 that when the peak power budget increases, more devices transmit with less power than their peak values. Equivalently, the optimal transmit power decreases when the noise variance decreases.
C. Power Control Problem for Special Cases
In this subsection, we discuss the optimal power control policy in the two special cases where the gradient SMCV β → ∞ and β → 0, respectively.
1) β → ∞: As discussed in Section II-D, this case may happen when the model training converges and/or the dataset is highly non-IID. In this case, the composite signal misalignment error in the MSE expression disappears. Thus problem P1 reduces to the power control problem in [17]. To be self-contained, we restate the problem and its solution as follows:

P2:  min  α Σ_{k∈K} (G_k(p_k, η) − 1)^2 + Dσ_n^2/η   (24a)
     s.t.  0 ≤ p_k ≤ P_k, ∀k ∈ K   (24b)
           η ≥ 0.   (24c)
Theorem 2 ([17]): The optimal transmit power that solves problem P2 has a threshold-based structure, i.e.,

p*_k = P_k, ∀k ∈ {1, ..., l*};   p*_k = B_k^2 η*/|h_k|^2, ∀k ∈ {l*+1, ..., K},   (25)

where the optimal denoising factor is given by

η* = ( (α Σ_{i=1}^{l*} C_i^2 + Dσ_n^2) / (α Σ_{i=1}^{l*} C_i) )^2,   (26)

and l* is given in (20). Furthermore, it holds that C_{l*}^2 ≤ η* ≤ C_{l*+1}^2.
Proof: Please refer to [17, Theorem 1].
Note that the optimal denoising factor η* for problem P2 can be interpreted as a threshold, since whether a device transmits with full power or not depends entirely on the comparison between its aggregation capability C_k and √η*. Moreover, this threshold increases with σ_n^2. In the extreme case when the noise power is very large and so is the threshold, all the devices shall transmit with full power. Nevertheless, such a threshold interpretation of η* does not hold for problem P1 with general β.
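As a concrete illustration of this threshold structure, the following sketch computes the policy of Theorem 2. Since (20) is not restated in this section, the sketch instead scans for the index l* at which the stated sandwich condition C_{l*}^2 ≤ η* ≤ C_{l*+1}^2 holds; C_k = √P_k |h_k|/B_k is assumed as the aggregation capability, and devices are assumed pre-sorted so that C_1 ≤ ... ≤ C_K.

```python
import math

def threshold_policy(P, h, B, alpha, D, sigma2_n):
    """Threshold-based power control of Theorem 2 (the beta -> infinity case).

    Devices are assumed sorted by increasing aggregation capability
    C_k = sqrt(P_k)|h_k|/B_k.  Instead of (20), l* is located by scanning
    for the index where C_{l*}^2 <= eta* <= C_{l*+1}^2, as Theorem 2 states.
    """
    K = len(P)
    C = [math.sqrt(P[k]) * h[k] / B[k] for k in range(K)]
    for l in range(1, K + 1):
        # Candidate denoising factor (26) with the first l devices at full power.
        num = alpha * sum(c * c for c in C[:l]) + D * sigma2_n
        den = alpha * sum(C[:l])
        eta = (num / den) ** 2
        if C[l - 1] ** 2 <= eta and (l == K or eta <= C[l] ** 2):
            # Power allocation (25): full power up to l*, channel inversion after.
            p = [P[k] if k < l else B[k] ** 2 * eta / h[k] ** 2 for k in range(K)]
            return l, eta, p
    raise ValueError("no index satisfies the sandwich condition")
```

With α = 1 and the Fig. 5 channel realization at 10 dB, for instance, this returns l* = 1: only the device with the smallest aggregation capability transmits at peak power, while the others invert their channels.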
2) β → 0: As discussed in Section II-D, β → 0 could happen when the model training just begins. In this case, the individual signal misalignment error disappears in the MSE expression. The original problem P1 reduces to

P3:  min  α ( Σ_{k∈K} G_k(p_k, η) − K )^2 + Dσ_n^2/η   (27a)
     s.t.  0 ≤ p_k ≤ P_k, ∀k ∈ K   (27b)
           η ≥ 0.   (27c)
Theorem 3: The optimal transmit power that solves problem P3 is full power transmission, i.e.,

p*_k = P_k, ∀k ∈ K,   (28)

and the optimal denoising factor is given by

η* = ( (α (Σ_{i∈K} C_i)^2 + Dσ_n^2) / (αK Σ_{i∈K} C_i) )^2.   (29)
Proof: Please refer to Appendix G.
Remark 2: The optimal solution of problem P3 is the special case of the solution of problem P1 with l* = K. Note that the direction of the gradient vector received from each device at the edge server is independent of that device's transmit power. Thus, increasing the power of all devices reduces the noise-induced error while the composite signal misalignment error remains fixed.
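The closed form (29) is easy to check numerically: under full-power transmission it is the exact minimizer of the β → 0 objective in (27a). A minimal sketch, assuming C_k = √P_k |h_k|/B_k so that Σ_k G_k(P_k, η) = Σ_k C_k/√η:

```python
def full_power_denoiser(C, alpha, D, sigma2_n):
    """Optimal denoising factor (29) for beta -> 0, with all devices at
    peak power; C holds the aggregation capabilities C_k (assumed
    C_k = sqrt(P_k)|h_k|/B_k, as in the ranking (13))."""
    K, S = len(C), sum(C)
    return ((alpha * S ** 2 + D * sigma2_n) / (alpha * K * S)) ** 2

# Sanity check on toy numbers: the beta -> 0 MSE as a function of eta,
# with p_k = P_k fixed, is minimized exactly at eta*.
C, alpha, D, s2 = [1.0, 2.0, 3.0], 0.5, 1, 0.2
mse = lambda eta: alpha * (sum(C) / eta ** 0.5 - len(C)) ** 2 + D * s2 / eta
eta_star = full_power_denoiser(C, alpha, D, s2)
# Perturbing eta in either direction can only increase the MSE.
assert mse(eta_star) <= min(mse(0.9 * eta_star), mse(1.1 * eta_star))
```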
IV. ADAPTIVE POWER CONTROL WITH UNKNOWN GRADIENT STATISTICS
In this section, we consider the practical scenario where the gradient statistics α(t) and β(t) are unknown. We propose a method to estimate α(t) and β(t) in each time block and then devise an adaptive power control scheme based on the optimal solution of problem P1, using the estimated α(t) and β(t).
A. Parameters Estimation
In this subsection, we propose a method to estimate α(t) and β(t) at each time block t directly from their definitions in (9) and (10), respectively.
1) Estimation of α(t): Note that the instantaneous gradient norm of each device, B_k(t), is assumed to be available at the edge server with negligible cost. By definition (9), we can estimate the gradient MSN as

α̂(t) = (1/K) Σ_{k∈K} B_k^2(t).   (30)
2) Estimation of β(t): By definition (10), the gradient SMCV β(t) depends on m_d(t) and σ_d(t). Before each device sends its gradient at time block t, β(t) cannot be estimated in advance. However, from the experimental results on real datasets in Figs. 2-4, it can be observed that β(t) changes slowly over the iterations t. Thus we propose to estimate β(t) based on the aggregated gradient at time block t−1 as

β̂(t) = [ α̂(t−1) − Σ_{d=1}^D ĝ_d^2(t−1) ] / Σ_{d=1}^D ĝ_d^2(t−1),   (31)

where α̂(t−1) estimates Σ_d (σ_d^2(t−1) + m_d^2(t−1)) and Σ_d ĝ_d^2(t−1) estimates Σ_d m_d^2(t−1).
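Both estimators are one-line computations. A sketch, with gradient vectors represented as plain Python lists purely for illustration:

```python
def estimate_alpha(grad_norms):
    """Gradient MSN estimate (30): the average squared norm B_k(t)^2
    over the K devices."""
    return sum(b * b for b in grad_norms) / len(grad_norms)

def estimate_beta(alpha_prev, agg_grad_prev):
    """Gradient SMCV estimate (31), computed from the previous round:
    alpha_hat(t-1) estimates sum_d (sigma_d^2 + m_d^2), while the squared
    aggregated gradient estimates sum_d m_d^2."""
    m2 = sum(g * g for g in agg_grad_prev)
    return (alpha_prev - m2) / m2

# Two devices with gradient norms 1 and 2: alpha_hat = (1 + 4)/2 = 2.5.
# If the previous aggregated gradient was [1, 1]: beta_hat = (2.5 - 2)/2.
assert estimate_alpha([1.0, 2.0]) == 2.5
assert estimate_beta(2.5, [1.0, 1.0]) == 0.25
```

Note that β̂(t) grows as the aggregated gradient shrinks relative to α̂, matching the observation that the SMCV increases as training converges.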
B. FL with Adaptive Power Control
In this subsection, we present the FL process with adaptive power control, summarized in Algorithm 1. The algorithm has three steps. First, each device locally takes one step of SGD on the current model using its local dataset (line 5). After that, each device calculates the norm of its local gradient and uploads it to the edge server via conventional digital transmission (lines 6 and 7). Second, the edge server estimates the parameters α(t) and β(t) based on the received gradient norms at time block t and the historical aggregated gradient (lines 9 and 16). Then the optimal transmit power and denoising factor are obtained based on (21) and (22), respectively (line 10). Third, the edge server informs each device of its optimal transmit power, and all devices simultaneously transmit their local gradients with the assigned power to the edge server in an analog manner using AirComp (lines 12-14).
Algorithm 1 FL Process with Adaptive Power Control
1: Initialize w(0) at the edge server and β̂(1);
2: for time block t = 1, ..., T do
3:   Edge server broadcasts the global model w(t) to all edge devices k ∈ K;
4:   for each device k ∈ K in parallel do
5:     g_k(t) = ∇L^SGD_{k,t}(w(t));
6:     B_k(t) = √( Σ_d g_{k,d}^2(t) );
7:     Upload B_k(t) to the edge server;
8:   end for
9:   Edge server estimates α̂(t) based on (30);
10:  Edge server obtains the optimal transmit power p*(t) based on (21) and the optimal denoising factor η*(t) based on (22);
11:  Edge server sends p*_k(t) to device k for all k ∈ K;
12:  for each device k ∈ K in parallel do
13:    Transmit gradient g_k(t) with power p*_k(t) to the edge server using AirComp;
14:  end for
15:  Edge server receives y(t) and recovers ĝ(t) based on (7);
16:  Edge server estimates β̂(t+1) based on (31);
17:  Edge server updates the global model: w(t+1) = w(t) − γ ĝ(t);
18: end for
19: Edge server returns w(T+1);
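Algorithm 1 can be exercised end to end on a toy problem. The sketch below is an illustrative stand-in, not the paper's setup: each device holds a small scalar-quadratic dataset, noisy full-power averaging replaces the AirComp step of lines 10-15 (the true policy would come from (21) and (22), which require the fading model not restated here), and the β̂ update of line 16 is omitted for brevity. The numbered comments map steps back to Algorithm 1.

```python
import math, random

random.seed(0)
K, D, T, gamma, sigma_n = 4, 3, 40, 0.3, 0.05

# Toy local data: each device holds 5 points around a device-specific mean,
# so the global optimum of the quadratic loss is the mean of all samples.
means = [[random.gauss(1.0, 0.5) for _ in range(D)] for _ in range(K)]
data = [[[random.gauss(m, 0.1) for m in means[k]] for _ in range(5)]
        for k in range(K)]

def local_grad(w, pts):
    # One local step: gradient of (1/2)||w - x||^2 averaged over the points.
    return [sum(w[d] - x[d] for x in pts) / len(pts) for d in range(D)]

w = [0.0] * D
for t in range(1, T + 1):                                        # time blocks
    grads = [local_grad(w, data[k]) for k in range(K)]           # line 5
    norms = [math.sqrt(sum(g * g for g in gk)) for gk in grads]  # line 6
    alpha_hat = sum(b * b for b in norms) / K                    # line 9, (30)
    # Stand-in for lines 10-15: noisy full-power aggregation over the air.
    g_hat = [sum(gk[d] for gk in grads) / K + random.gauss(0, sigma_n)
             for d in range(D)]
    w = [w[d] - gamma * g_hat[d] for d in range(D)]              # line 17
```

Despite the noisy aggregation, the model converges to a small neighborhood of the global data mean, mirroring the behavior Algorithm 1 targets at scale.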
V. EXPERIMENTAL RESULTS
In this section, we provide experimental results to validate the performance of the proposed
power control for AirComp-based FL over fading channels.
A. Experiment Setup
We conducted experiments on a simulated environment where the number of edge devices is
K= 10 if not specified otherwise. The wireless channels from each device to the edge server
follow IID Rayleigh fading, such that hk’s are modeled as IID complex Gaussian variables with
zero mean and unit variance. For each device k∈ K, we define SNRk=EhPk|hk|2
Dσ2
ni=Pk
Dσ2
nas
the average received SNR.
1) Baselines: We compare the proposed power control scheme with the following baseline
approaches:
•Error-free transmission: the aggregated gradient is updated without any transmission error,
which is equivalent to the centralized SGD algorithm.
• Threshold-based power control in [17]: the power control scheme given in [17], which assumes normalized signals. It is in fact the special case of our proposed power control scheme with β → ∞, i.e., it accounts only for the individual misalignment error in problem P1.
• Full power transmission: all devices transmit with full power P_k, and the edge server applies the optimal denoising factor in (22) with l* = K.
2) Datasets: We evaluate the training of a convolutional neural network on three datasets: MNIST, CIFAR-10, and SVHN. The MNIST dataset consists of 10 categories, the digits 0 to 9, with a total of 70,000 labeled samples (60,000 for training and 10,000 for testing). The CIFAR-10 dataset includes 60,000 color images (50,000 for training and 10,000 for testing) of 10 different types of objects. SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirements on data preprocessing and formatting; it includes 99,289 labeled samples (73,257 for training and 26,032 for testing).
3) Data Distribution: To study the impact of the gradient SMCV β on the optimal transmit power, we simulate two types of dataset partitions among the mobile devices: an IID setting and a non-IID one. For the former, we randomly partition the training samples into 100 equal shards, each of which is assigned to one particular device. For the latter, we first sort the data by digit label, divide it into 200 equal shards, and randomly assign 2 shards to each device.
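The two partitions can be sketched as one shard-based routine (an illustrative reimplementation of the description above, not the authors' code). Note that with 200 shards, 2 shards per device, and K = 10, some shards simply remain unassigned.

```python
import random

def shard_partition(labels, K, n_shards, shards_per_device, iid, seed=0):
    """IID: shuffle, cut into equal shards.  Non-IID: sort by label first,
    so each device ends up holding samples from only a few classes."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    if iid:
        rng.shuffle(idx)
    else:
        idx.sort(key=lambda i: labels[i])
    size = len(idx) // n_shards
    shards = [idx[s * size:(s + 1) * size] for s in range(n_shards)]
    rng.shuffle(shards)
    return [sum(shards[d * shards_per_device:(d + 1) * shards_per_device], [])
            for d in range(K)]

# 1000 samples over 10 balanced classes: with the non-IID split each device
# holds at most two distinct labels here.
labels = [c for c in range(10) for _ in range(100)]
parts = shard_partition(labels, K=10, n_shards=200, shards_per_device=2, iid=False)
assert all(len({labels[i] for i in p}) <= 2 for p in parts)
```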
4) Training and Control Parameters: In all our experiments, the number of local update steps between two global aggregations is 1, the local batch size of each edge device is 10, and the gradient descent step size is γ = 0.01.
B. Experimental Results
Fig. 6 compares the test accuracy on the three considered datasets under the IID and non-IID dataset partitions, where the average received SNR is set to 10 dB for all devices. It is observed that the model accuracy of the proposed scheme is better than both the threshold-based power control in [17] and full power transmission. From Figs. 2-4, we know that the averaged gradient SMCV β(t) under the IID partition is smaller than under the non-IID partition, and that β(t) increases over the iterations. The threshold-based power control in [17] suffers a significant accuracy loss under the IID partition and at the beginning of training. This is because the gradient SMCV is then small, so the MSE is dominated by the composite misalignment error; a scheme that considers only the individual misalignment error is therefore much inferior. Conversely, full power transmission suffers a significant accuracy loss under the non-IID partition and at the end of training. There the gradient SMCV is large, and full power transmission fails to minimize the individual misalignment error that dominates the MSE in this case.
Fig. 7 illustrates the test accuracy for MNIST with the non-IID data partition at an average received SNR of 5 dB. The overall performance of the proposed scheme remains better than the two baseline approaches in the low-SNR regime. Specifically, full power transmission performs better than the threshold-based power control in [17]. This is mainly because, when the noise variance is large, full power transmission strongly suppresses the noise-induced error that dominates the MSE.
Finally, Fig. 8 compares the test accuracy of the different power control schemes for a varying number of devices K. Here, the MNIST dataset with the non-IID partition is used, the average received SNR of all devices is set to SNR_k = 10 dB, and the results are averaged over 50 model trainings. First, it is observed that the test accuracy achieved by all four schemes increases with K, due to the fact that the edge server can aggregate more data for averaging. Second, the proposed scheme considerably outperforms both the threshold-based power control in [17] and full power transmission throughout the whole range of K. Full power transmission approaches the threshold-based power control in [17] when K is small (i.e., K = 4 in Fig. 8), but
[Fig. 6 shows six panels of test accuracy versus training rounds, each comparing the threshold-based scheme in [17], full power transmission, the proposed scheme, and error-free transmission: (a) MNIST with IID partition; (b) MNIST with non-IID partition; (c) CIFAR-10 with IID partition; (d) CIFAR-10 with non-IID partition; (e) SVHN with IID partition; (f) SVHN with non-IID partition.]
Fig. 6. Performance comparison on different dataset partitions, for K = 10 and SNR_k = 10 dB, ∀k ∈ K.
its performance degrades as K increases, due to the lack of power adaptation to reduce the misalignment error.
VI. CONCLUSION
This work studied the power control optimization problem for over-the-air federated learning over fading channels by taking the gradient statistics into account. The optimal power
[Fig. 7 plots test accuracy versus training rounds for the threshold-based scheme in [17], full power transmission, the proposed scheme, and error-free transmission.]
Fig. 7. Performance comparison for MNIST dataset with non-IID partition at the average received SNR = 5 dB.
[Fig. 8 plots test accuracy versus the number of devices K (from 4 to 20) for the four schemes.]
Fig. 8. Performance comparison over the number of devices, where the MNIST dataset has a non-IID partition and SNR_k = 10 dB, ∀k ∈ K.
control policy is derived in closed form when the first- and second-order gradient statistics are known. It is shown that the optimal transmit power of each device decreases with the gradient SMCV and increases with the noise variance. In the special cases where β approaches infinity and zero, the optimal transmit power reduces to the threshold-based power control in [17] and full power transmission, respectively. We also proposed an adaptive power control algorithm that dynamically adjusts the transmit power in each iteration based on the estimated gradient statistics. Experimental results show that the proposed adaptive power control scheme outperforms the existing schemes.
APPENDIX A
PROOF OF LEMMA 1
We prove this lemma by contradiction. Suppose the optimal denoising factor satisfies η* ≤ C_1^2. Then both the individual and composite signal misalignment errors can be forced to zero by letting p*_k = ηB_k^2/|h_k|^2, ∀k ∈ K. Problem P1 can thus be expressed as

min_{η ≤ C_1^2}  Dσ_n^2/η.   (32)

It is obvious that the optimal solution of the above problem is η* = C_1^2. Thus, it must hold that η* ≥ C_1^2 for problem P1.
APPENDIX B
PROOF OF LEMMA 2
To prove this lemma, we need to show that if p*_{k2} = P_{k2} for some device k2, then p*_{k1} = P_{k1} for all k1 < k2. We prove this by contradiction. Let p* = [p*_1, ..., p*_K] denote the optimal transmit power for problem P1, and assume that there are two devices k1 < k2 satisfying p*_{k1} < P_{k1} and p*_{k2} = P_{k2}. Then there always exists a modified transmit power p′ = [p*_1, ..., p*_{k1−1}, p′_{k1}, p*_{k1+1}, ..., p*_{k2−1}, p′_{k2}, p*_{k2+1}, ..., p*_K], where p′_{k1} = P_{k1} and p′_{k2} < P_{k2}, satisfying

√p*_{k1} |h_{k1}| / (√η* B_{k1}) + √p*_{k2} |h_{k2}| / (√η* B_{k2}) = √p′_{k1} |h_{k1}| / (√η* B_{k1}) + √p′_{k2} |h_{k2}| / (√η* B_{k2}).

The resulting MSE obtained under p′ differs from the minimum MSE under p* only in the individual misalignment error term. The difference is given by

MSE(p*) − MSE(p′)
= (βα/(β+1)) [ ( √p*_{k1}|h_{k1}|/(√η* B_{k1}) − 1 )^2 + ( √p*_{k2}|h_{k2}|/(√η* B_{k2}) − 1 )^2 − ( √p′_{k1}|h_{k1}|/(√η* B_{k1}) − 1 )^2 − ( √p′_{k2}|h_{k2}|/(√η* B_{k2}) − 1 )^2 ]   (33a)
= (βα/(β+1)) [ ( √p*_{k1}|h_{k1}|/(√η* B_{k1}) )^2 + ( √p*_{k2}|h_{k2}|/(√η* B_{k2}) )^2 − ( √p′_{k1}|h_{k1}|/(√η* B_{k1}) )^2 − ( √p′_{k2}|h_{k2}|/(√η* B_{k2}) )^2 ]   (33b)
= (βα/(β+1)) [ √p*_{k2}|h_{k2}|/(√η* B_{k2}) − √p′_{k2}|h_{k2}|/(√η* B_{k2}) ] [ √p*_{k2}|h_{k2}|/(√η* B_{k2}) + √p′_{k2}|h_{k2}|/(√η* B_{k2}) − √p*_{k1}|h_{k1}|/(√η* B_{k1}) − √p′_{k1}|h_{k1}|/(√η* B_{k1}) ]   (33c)
= (2βα/(β+1)) [ √p*_{k2}|h_{k2}|/(√η* B_{k2}) − √p′_{k2}|h_{k2}|/(√η* B_{k2}) ] [ √p*_{k2}|h_{k2}|/(√η* B_{k2}) − √p′_{k1}|h_{k1}|/(√η* B_{k1}) ]   (33d)
≥ 0.   (33e)

The inequality in (33e) holds since p*_{k2} = P_{k2} > p′_{k2} and √p*_{k2}|h_{k2}|/(√η* B_{k2}) ≥ √p′_{k1}|h_{k1}|/(√η* B_{k1}) by the aggregation capability ranking in (13). This indicates that p′ is at least as good as p*, which contradicts the assumption. Therefore, for every pair of devices k1 < k2, if p*_{k2} = P_{k2}, we must have p*_{k1} = P_{k1}. Lemma 2 is proved.
APPENDIX C
PROOF OF LEMMA 3
Proving this lemma is equivalent to proving that if the global optimal solution p* of problem P1 lies in subregion M_l, then the optimal solution p̃*_l of the sub-problem defined on the relaxed subregion M̃_l also lies in M_l. We prove this by contradiction. Assume that p* ∈ M_l but p̃*_l ∉ M_l. The derivative of (14a) with respect to p_k and η shows that p̃*_l is the only local optimal solution in M̃_l. Since p* is the optimal solution in M_l and cannot lie on the boundary of M_l, p* is also a local optimal solution in M_l ⊆ M̃_l. This contradicts the fact that p̃*_l is the only local optimal solution in M̃_l. Therefore, if the global optimal transmit power p* is in the l-th subregion M_l, p̃*_l must be in M_l. The proof of Lemma 3 is complete.
APPENDIX D
PROOF OF THEOREM 1
To complete the proof, we need to show that, with l* defined in (20), the optimal transmit power is p̃*_{l*} in the l*-th relaxed subregion M̃_{l*}. By Lemma 3, for every l ∈ L, since p̃*_l also lies in subregion M_l, p̃*_l is the optimal transmit power within M_l. Therefore, the candidate transmit power p̃*_{l*} with the smallest objective value V_{l*} is the optimal transmit power of problem P1. According to (17) and (19), the optimal transmit power and denoising factor take the forms given in Theorem 1.
APPENDIX E
PROOF OF LEMMA 4
When p* lies in subregion M_l, p* is equal to p̃*_l. Thus, proving the sufficiency of this lemma amounts to proving that the optimal transmit power p* satisfies

C_{l*} ≤ √p*_k |h_k| / B_k < C_{l*+1}, ∀k ∈ {l*+1, ..., K}.   (34)

Based on (21), p*_k < P_k for k ∈ {l*+1, ..., K}, and for k = l*+1 we have √p*_{l*+1} |h_{l*+1}| / B_{l*+1} < C_{l*+1}. Since √p*_k |h_k| / B_k is the same for all k ∈ {l*+1, ..., K}, we have √p*_k |h_k| / B_k < C_{l*+1} for all k ∈ {l*+1, ..., K}. We next prove C_{l*} ≤ √p*_k |h_k| / B_k, ∀k ∈ {l*+1, ..., K}, by contradiction. Assume that C_{l*} > √p*_{l*+1} |h_{l*+1}| / B_{l*+1}. As p*_k = P_k for k ∈ {1, ..., l*}, we have √p*_{l*} |h_{l*}| / B_{l*} = C_{l*} > √p*_{l*+1} |h_{l*+1}| / B_{l*+1}. Then there always exists a modified transmit power p′ = [p*_1, ..., p*_{l*−1}, p′_{l*}, p′_{l*+1}, p*_{l*+2}, ..., p*_K], where p′_{l*} and p′_{l*+1} satisfy

√p′_{l*} |h_{l*}| / B_{l*} = √p′_{l*+1} |h_{l*+1}| / B_{l*+1} = (1/2) ( √p*_{l*} |h_{l*}| / B_{l*} + √p*_{l*+1} |h_{l*+1}| / B_{l*+1} ).

The difference between the MSE under p* and under p′ is given by

MSE(p*) − MSE(p′)
= (βα/(β+1)) [ ( √p*_{l*}|h_{l*}|/(√η* B_{l*}) − 1 )^2 + ( √p*_{l*+1}|h_{l*+1}|/(√η* B_{l*+1}) − 1 )^2 − ( √p′_{l*}|h_{l*}|/(√η* B_{l*}) − 1 )^2 − ( √p′_{l*+1}|h_{l*+1}|/(√η* B_{l*+1}) − 1 )^2 ]   (35a)
= (βα/(β+1)) [ ( √p*_{l*}|h_{l*}|/(√η* B_{l*}) )^2 + ( √p*_{l*+1}|h_{l*+1}|/(√η* B_{l*+1}) )^2 − ( √p′_{l*}|h_{l*}|/(√η* B_{l*}) )^2 − ( √p′_{l*+1}|h_{l*+1}|/(√η* B_{l*+1}) )^2 ]   (35b)
= (βα/(2(β+1))) ( √p*_{l*}|h_{l*}|/(√η* B_{l*}) − √p*_{l*+1}|h_{l*+1}|/(√η* B_{l*+1}) )^2 > 0.   (35c)

This indicates that p′ is a better solution than p*, which contradicts the assumption. Hence, C_{l*} ≤ √p*_k |h_k| / B_k for k ∈ {l*+1, ..., K}, and the sufficiency of this lemma is proved.

To prove the necessity, we prove the contrapositive: if the optimal transmit power p* does not lie in subregion M_l, i.e., l ∈ {1, ..., l*−1, l*+1, ..., K}, then p̃*_l satisfies √p̃*_{l,k} |h_k| / B_k ≥ C_{l+1} or √p̃*_{l,k} |h_k| / B_k < C_l for all k ∈ {l+1, ..., K}.

First, we prove that p̃*_l, for all l ∈ {1, ..., l*−1}, satisfies

√p̃*_{l,k} |h_k| / B_k ≥ C_{l+1}, ∀k ∈ {l+1, ..., K}.   (36)

We prove this by contradiction. Assume that there exists l ∈ {1, ..., l*−1} with √p̃*_{l,k} |h_k| / B_k < C_{l+1}, ∀k ∈ {l+1, ..., K}. As C_{l+1} ≤ C_k for all k ∈ {l+1, ..., K} by (13), it follows that √p̃*_{l,k} |h_k| / B_k < C_k and hence p̃*_{l,k} < P_k for all k ∈ {l+1, ..., K}, i.e., p̃*_l is a feasible transmit power. Note that p̃*_l is the optimal transmit power in M̃_l and p* is the optimal transmit power in M_{l*}. As the constraint on p̃*_l is less restrictive than that on p*, i.e., M_{l*} ⊆ M̃_l, the feasible transmit power p̃*_l is better than the optimal transmit power p*, which contradicts the assumption. This establishes the inequality (36).

Second, we prove that p̃*_l, for all l ∈ {l*+1, ..., K}, satisfies

√p̃*_{l,k} |h_k| / B_k < C_l, ∀k ∈ {l+1, ..., K}.   (37)

We prove this by contradiction. When l = l*+1, assume that √p̃*_{l,k} |h_k| / B_k ≥ C_l, ∀k ∈ {l+1, ..., K}. Let p^avg = [p̃*_{l,1}, ..., p̃*_{l,l−1}, p^avg_l, ..., p^avg_K] denote a modified p̃*_l, where √p^avg_l |h_l| / B_l = ... = √p^avg_K |h_K| / B_K = (1/(K−l+1)) Σ_{i=l}^K √p̃*_{l,i} |h_i| / B_i. Using a proof similar to that of (35c), we can show that MSE(p^avg) ≤ MSE(p̃*_l). Let p^fes = [p̃*_{l,1}, ..., p̃*_{l,l−1}, p^fes_l, ..., p^fes_K] denote another modified p̃*_l, where √p^fes_k |h_k| / B_k = C_l for k ∈ {l, ..., K}. The derivative of (14a) with respect to p_k for all k ∈ {l+1, ..., K} shows that the MSE increases with p_k when p_k ≥ p*_k. Note that p^avg and p^fes are in the l*-th relaxed subregion M̃_{l*} and p^avg_k ≥ p^fes_k > p*_k for k ∈ {l, ..., K}, so MSE(p^fes) ≤ MSE(p^avg) and MSE(p^fes) ≤ MSE(p̃*_l). The transmit power p^fes ∈ M̃_l is thus better than the optimal transmit power p̃*_l ∈ M̃_l, which contradicts the optimality of p̃*_l in M̃_l. This establishes the inequality (37) for l = l*+1, and it extends to l ∈ {l*+2, ..., K} by mathematical induction.

Thus, the necessity of this lemma is proved. This completes the proof of Lemma 4.
APPENDIX F
PROOF OF LEMMA 5
We first prove that the upper boundary U_l is equal to the lower boundary L_{l+1}, for l ∈ {1, ..., K−1}. Then we prove that the optimal transmit power function p*(β, σ_n^2) is continuous at U_l = L_{l+1}, for l ∈ {1, ..., K−1}.

For all (β, σ_n^2) ∈ L_{l+1}, the corresponding optimal transmit power satisfies p̃*_{l+1,k}(β, σ_n^2) = (C_{l+1} B_k/|h_k|)^2, i.e., √p̃*_{l+1,k}(β, σ_n^2) |h_k| / B_k = C_{l+1}, for k ∈ {l+2, ..., K}. Based on (17) and (19), we have

√η̃*_{l+1}(β, σ_n^2) = [ Σ_{i=1}^{l+1} C_i + (β+K−l−1) C_{l+1} ] / (β+K)   (38a)

⇔ [ (βα/(β+1)) Σ_{i=1}^{l+1} C_i^2 + (βα/((β+K−l−1)(β+1))) (Σ_{i=1}^{l+1} C_i)^2 + Dσ_n^2 ] / [ (β(β+K)α/((β+K−l−1)(β+1))) Σ_{i=1}^{l+1} C_i ] = [ Σ_{i=1}^{l+1} C_i + (β+K−l−1) C_{l+1} ] / (β+K)   (38b)

⇔ [ (βα/(β+1)) Σ_{i=1}^{l+1} C_i^2 + Dσ_n^2 ] / [ (β(β+K)α/((β+K−l−1)(β+1))) Σ_{i=1}^{l+1} C_i ] = (β+K−l−1) C_{l+1} / (β+K)   (38c)

⇔ (βα/(β+1)) Σ_{i=1}^{l+1} C_i^2 + Dσ_n^2 = (βα/(β+1)) C_{l+1} Σ_{i=1}^{l+1} C_i   (38d)

⇔ (βα/(β+1)) Σ_{i=1}^{l} C_i^2 + Dσ_n^2 = (βα/(β+1)) C_{l+1} Σ_{i=1}^{l} C_i   (38e)

⇔ [ (βα/(β+1)) Σ_{i=1}^{l} C_i^2 + Dσ_n^2 ] / [ (β(β+K)α/((β+K−l)(β+1))) Σ_{i=1}^{l} C_i ] = (β+K−l) C_{l+1} / (β+K)   (38f)

⇔ [ (βα/(β+1)) Σ_{i=1}^{l} C_i^2 + (βα/((β+K−l)(β+1))) (Σ_{i=1}^{l} C_i)^2 + Dσ_n^2 ] / [ (β(β+K)α/((β+K−l)(β+1))) Σ_{i=1}^{l} C_i ] = [ Σ_{i=1}^{l} C_i + (β+K−l) C_{l+1} ] / (β+K)   (38g)

⇔ √η̃*_l(β, σ_n^2) = [ Σ_{i=1}^{l} C_i + (β+K−l) C_{l+1} ] / (β+K)   (38h)

⇔ √p̃*_{l,k}(β, σ_n^2) |h_k| / B_k = C_{l+1}, ∀k ∈ {l+1, ..., K}   (38i)

⇔ p̃*_{l,k}(β, σ_n^2) = (C_{l+1} B_k/|h_k|)^2, ∀k ∈ {l+1, ..., K}.   (38j)

Then (β, σ_n^2) ∈ U_l by the definition of U_l, i.e., L_{l+1} ⊆ U_l. Conversely, for all (β, σ_n^2) ∈ U_l, reversing the derivation in (38) gives (β, σ_n^2) ∈ L_{l+1}, i.e., U_l ⊆ L_{l+1}. Therefore, L_{l+1} = U_l, for l ∈ {1, ..., K−1}.

For any (β, σ_n^2)_0 ∈ L_{l+1} = U_l, define (β, σ_n^2)_+ ∈ X_{l+1} and (β, σ_n^2)_− ∈ X_l as two points infinitely close to (β, σ_n^2)_0. Then the left limit of p* at (β, σ_n^2)_0 is given by

lim_{(β,σ_n^2)_− → (β,σ_n^2)_0} p*((β, σ_n^2)_−) = lim_{(β,σ_n^2)_− → (β,σ_n^2)_0} p̃*_l((β, σ_n^2)_−) = p̃*_l((β, σ_n^2)_0) = p̃*_{l+1}((β, σ_n^2)_0) = p*((β, σ_n^2)_0).   (39)

Similarly, the right limit of p* at (β, σ_n^2)_0 is also equal to p*((β, σ_n^2)_0). Therefore, the optimal transmit power function p*(β, σ_n^2) is continuous at U_l = L_{l+1}, for l ∈ {1, ..., K−1}. This completes the proof of Lemma 5.
APPENDIX G
PROOF OF THEOREM 3
For any denoising factor η ≥ (1/K^2)(Σ_{k∈K} C_k)^2, the composite signal alignment satisfies Σ_{k∈K} G_k(p_k, η) ≤ K for problem P3. Therefore, to minimize the composite signal misalignment error (Σ_{k∈K} G_k(p_k, η) − K)^2, all the devices should always transmit with full power, i.e., p*_k = P_k, ∀k ∈ K. Problem P3 can then be expressed as

min_{η ≥ 0}  α ( Σ_{k∈K} G_k(P_k, η) − K )^2 + Dσ_n^2/η.   (40)

This objective is a quadratic function of 1/√η, so the optimal solution is readily derived as

η* = ( (α (Σ_{i∈K} C_i)^2 + Dσ_n^2) / (αK Σ_{i∈K} C_i) )^2.   (41)

This completes the proof of Theorem 3.
REFERENCES
[1] N. Zhang and M. Tao, “Gradient statistics aware power control for over-the-air federated learning in fading channels,” in
Proc. IEEE ICC Workshops, 2020, pp. 1–6.
[2] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, “Wireless network intelligence at the edge,” Proceedings of the IEEE,
vol. 107, no. 11, pp. 2204–2239, 2019.
[3] Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo, and J. Zhang, “Edge intelligence: Paving the last mile of artificial intelligence
with edge computing,” Proceedings of the IEEE, vol. 107, no. 8, pp. 1738–1762, 2019.
[4] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks
from decentralized data,” in Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
[5] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konecny, S. Mazzocchi, H. B.
McMahan et al., “Towards federated learning at scale: System design,” arXiv preprint arXiv:1902.01046, 2019.
[6] Q. Yang, Y. Liu, T. Chen, and Y. Tong, “Federated machine learning: Concept and applications,” ACM Transactions on
Intelligent Systems and Technology (TIST), vol. 10, no. 2, p. 12, 2019.
[7] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via randomized quantization and encoding,” Advances in Neural Information Processing Systems 30, vol. 3, pp. 1710–1721, 2018.
[8] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[9] A. F. Aji and K. Heafield, “Sparse communication for distributed gradient descent,” in Proceedings of the 2017 Conference
on Empirical Methods in Natural Language Processing, 2017, pp. 440–445.
[10] Y. Tsuzuku, H. Imachi, and T. Akiba, “Variance-based gradient compression for efficient distributed deep learning,” arXiv
preprint arXiv:1802.06058, 2018.
[11] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive federated learning in resource
constrained edge computing systems,” IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1205–1221, 2019.
[12] B. Nazer and M. Gastpar, “Computation over multiple-access channels,” IEEE Trans. Inf. Theory, vol. 53, no. 10, pp.
3498–3516, 2007.
[13] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,” IEEE Trans. Wireless Commun.,
vol. 19, no. 3, pp. 2022–2035, March 2020.
[14] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for low-latency federated edge learning,” IEEE Trans.
Wireless Commun., vol. 19, no. 1, pp. 491–506, Jan. 2020.
[15] M. M. Amiri and D. Gunduz, “Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air,”
in Proc. IEEE ISIT, July 2019, pp. 1432–1436.
[16] ——, “Federated learning over wireless fading channels,” IEEE Trans. Wireless Commun., pp. 1–1, 2020.
[17] X. Cao, G. Zhu, J. Xu, and K. Huang, “Optimal power control for over-the-air computation,” in Proc. IEEE GLOBECOM,
Dec. 2019, pp. 1–6.