Communication-efficient Distributed Stochastic Gradient Descent with
Pooling Operator
Zhengao Cai a, Aiguo Chen a,b,∗, Yi Luo c and Jiahao Li a
a Shenzhen Institute of Advanced Study, University of Electronic Science and Technology of China, China
b School of Computer Science and Engineering, University of Electronic Science and Technology of China, China
c School of Information and Software Engineering, University of Electronic Science and Technology of China, China
ARTICLE INFO
Keywords:
Distributed Machine Learning
Distributed SGD
Communication Efficient SGD
Gradient Sparsification
ABSTRACT
Training deep neural networks on large datasets can be accelerated by distributing the computation
across multiple worker nodes. The distributed stochastic gradient descent (D-SGD) algorithm combined
with gradient sparsification is a typical algorithm for this distributed training model: it guarantees
good convergence and can significantly reduce communication bandwidth. However, we find that
existing gradient sparsification algorithms ignore the local correlation between gradient values
under certain conditions, which can incur a loss of accuracy because the update region of the global
parameters stays in a particular area for many epochs. Based on this observation, we introduce the
pooling operator to solve this problem and combine it with the error-feedback method to design a new
communication-efficient distributed gradient descent algorithm. We prove that our algorithm converges
at the same rate as vanilla SGD when equipped with error feedback. Experiments on a CNN, an LSTM,
and a ResNet model demonstrate that our algorithm can compress the number of uploaded gradient
bits by two or three orders of magnitude while preserving high model accuracy.
1. Introduction
In recent years, deep neural networks (DNNs) have been
used as powerful tools for machine learning and artificial
intelligence in a large number of domains[1,2], especially
in computer vision and natural language processing[3,4].
With the increasing volume of training data and the growing
complexity of DNN structures, distributed machine
learning[5,6,7,8,9,10] is widely used to train these
large-scale DNN models. It is a multi-node system in which a
server node and multiple worker nodes collaborate to run
the Distributed Stochastic Gradient Descent (D-SGD)
algorithm. The server node is responsible for scheduling
the model's parameters and the worker nodes' behavior[11,12].
Each worker node uses its local data to compute the gradient of
the current parameters and transmits the computed gradient
matrix to the server node.
However, D-SGD suffers from one main drawback: the
communication overhead has become a bottleneck in
training large-scale DNN models[13,14,5,15,16].
For instance, ResNet[3,17] has more than 25 million
parameters, and the communication overhead may be in
the range of gigabytes per epoch[18,19,20,21,15], so
the cost of transmitting the gradient can be
prohibitive[22]. Consequently, a number of gradient
compression methods have been proposed to reduce the volume
of gradient data and the time spent transmitting it
while preserving the convergence of the model[23,24,25,14,6,26].
Gradient compression methods are mainly divided into
two types: gradient quantization and gradient sparsification.
The quantization method uses the Sign operator to quantize
the original 32-bit or 64-bit floating-point numbers to 1 bit. Due to the
nature of computer character encoding, the quantization method
cannot obtain a gradient compression rate of more than 64
times[5,27,23,28]. Another class
of gradient compression methods, gradient sparsification,
is more feasible. Typically, the Top-K sparsification
algorithm[6,24] can compress the uploaded gradient traffic
by two or three orders of magnitude, and a large number of
papers[6,7] have shown that it can obtain the same convergence
and accuracy as the typical D-SGD algorithm for both
convex and non-convex optimization problems. However,
for some large-scale DNN models (e.g., ResNet[3]) with a
high sparsification rate, the Top-K algorithm may lead
to a severe accuracy penalty and worse convergence. In this
case, the iteration epochs of the Top-K algorithm must be
multiplied to achieve the same accuracy as the typical D-SGD
algorithm[6,29].
∗ Corresponding author: agchen@uestc.edu.cn (A. Chen)
One reason for the convergence degradation is that the
Top-K operator, in some cases, entirely discards the local
correlation features of the gradient when sparsifying the gradient
matrix. For highly locally correlated data such as images, there
is also a local correlation in the gradients obtained during the
training of the DNN. Since the Top-K operator
naturally retains only the K largest absolute values in the
gradient matrix, the selected gradient
regions may stay in a local region for many epochs, while
the parameters in the other regions stop updating.
Based on this observation, we exploited the dimensionality-reduction
operation of pooling operators in convolutional
neural networks (CNN)[30] and introduced the pooling
operators (MaxPool and RandPool) to sparsify the gradient
matrix. The MaxPool operator filters out the largest-absolute-value
gradient element from the region covered by the filter, rather
than the K largest values from the whole gradient. The
RandPool operator filters out a random element covered by
the filter. This approach preserves the local correlation
features of the gradient while sparsifying the gradient matrix
and eliminates the long regional adhesion that the Top-K
operator may cause. Note that our proposed MaxPool
operator differs from the typical MaxPool operator in that the
results obtained by the typical MaxPool operator have no
negative values. Since the gradient has negative values, we
modify the typical MaxPool operator by setting the selected
elements to the ones with the largest absolute values. The
MaxPool operator referred to later in this article is the one
modified in this way.
First Author et al.: Preprint submitted to Elsevier. Page 1 of 9
This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4327869
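As a concrete illustration, the modified operator described above can be sketched as follows. This is a minimal NumPy sketch under our own assumptions (a 2D gradient matrix whose dimensions are divisible by R); the name `signed_maxpool` is ours, not the paper's:

```python
import numpy as np

def signed_maxpool(grad, R):
    """Modified MaxPool: in each R x R filter region, keep the element with
    the largest *absolute* value (preserving its sign), zero out the rest.
    Assumes grad's dimensions are divisible by R."""
    h, w = grad.shape
    out = np.zeros_like(grad)
    for i in range(0, h, R):
        for j in range(0, w, R):
            block = grad[i:i + R, j:j + R]
            # index of the largest-|.| entry inside this filter window
            bi, bj = np.unravel_index(np.argmax(np.abs(block)), block.shape)
            out[i + bi, j + bj] = block[bi, bj]
    return out

g = np.array([[0.1, -0.9],
              [0.2,  0.3]])
print(signed_maxpool(g, 2))  # keeps -0.9; a standard MaxPool would pick 0.3
```

The toy call shows exactly the difference the paper points out: a standard MaxPool would select 0.3, whereas the modified operator keeps the negative entry -0.9 because it has the largest magnitude.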
The pooling operators can compress the gradient matrix
by two or three orders of magnitude without a severe
accuracy penalty. We also prove that the pooling
sparsification algorithm obtains a linear convergence
rate when combined with the error-feedback method[31]
commonly used in gradient compression. The experimental
results demonstrate that, for the same compression ratio, our
algorithm achieves the same communication savings and better
convergence and accuracy on computer-vision DNNs (e.g., CNN,
ResNet) compared with the existing sparsification schemes.
For typical recurrent neural networks (RNNs) in natural
language processing (e.g., LSTM[32]), our algorithm obtains a
convergence rate and accuracy similar to the Top-K algorithm.
The remainder of this paper is organized as follows.
Section 2 presents research related to our algorithm.
In Section 3, we introduce the rationale and details of our
algorithm and present its proofs. In Section 4, we
experimentally demonstrate the advantages of our algorithm
over the currently proven feasible and generally optimal
algorithms. Finally, we conclude with a discussion in
Section 5.
2. Related Work
Minibatch-SGD. Many D-SGD algorithms optimize
large-scale DNNs in distributed machine-learning
environments, and Minibatch-SGD is currently the most
popular one. In the Minibatch-SGD algorithm, the server
sends the DNN structure and global parameters to each
worker node. When a worker node receives the parameters,
it uses its local data to compute multiple stochastic
gradients and sends their average
to the server. Finally, the server aggregates these gradients
to obtain the global parameters for the next epoch[11]. Other
optimization algorithms for DNNs in distributed machine
learning, such as Local-SGD[12], have been shown to perform
much worse than Minibatch-SGD in many cases[11].
Top-K and Rand-K Sparsification. The key idea of the
Top-K sparsification algorithm is to discard some "useless"
(tiny and less helpful for updating the global model) gradient
values via the Top-K operator. Researchers found that
99% of the small gradients in the stochastic gradient descent
algorithm are not very useful for updating the global model.
Therefore, the role of the Top-K operator is to filter the
gradient matrix and transmit its K largest absolute values to the
global server[24,33,34]. Also, several recent papers have
theoretically indicated that such Top-K sparsification
algorithms are capable of obtaining linear convergence rates
in convex and nonconvex problems[35,7,36]. The Rand-K
algorithm differs from the Top-K algorithm
only in that, when sparsifying the matrix, the transmitted
elements are selected randomly instead of as the K elements
with the largest absolute values.
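The two operators can be sketched as follows, for a flattened (1D) gradient vector. This is a minimal NumPy sketch; the function names are ours:

```python
import numpy as np

def top_k(x, k):
    """Keep the k entries of x with the largest absolute value; zero the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]      # indices of the k largest |x_i|
    out[idx] = x[idx]
    return out

def rand_k(x, k, rng):
    """Keep k uniformly chosen entries of x; zero the rest."""
    out = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)
    out[idx] = x[idx]
    return out

x = np.array([0.05, -2.0, 0.3, 1.5, -0.1, 0.02])
print(top_k(x, 2))   # only -2.0 and 1.5 survive
```

For this vector, the discarded mass satisfies the contraction bound proved later in Lemma A.1: the residual energy is at most a (1 - k/d) fraction of the total.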
Error-feedback. In the traditional sparsification method,
each worker node transmits only the selected elements to the
global server in each communication epoch; the elements that
are not selected are discarded immediately. Inspired by the
traditional momentum approach of stochastic gradient descent,
a typical practice is to record the gradient residuals that are
not transmitted to the global server in the current epoch and
add them to the newly computed result in the next epoch.
Experiments have shown that this error-feedback method can
significantly improve the final model's convergence rate and
accuracy[6,31,26]. Error feedback is also known as error
compensation, gradient with memory, and gradient
residuals[31,26].
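One worker-side epoch of this error-feedback pattern can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation; the Top-1 compressor is only a stand-in for any sparsifier:

```python
import numpy as np

def step_with_error_feedback(grad, memory, compress):
    """One worker-side step of error feedback: add the residual carried over
    from previous epochs, compress, and store what was left out."""
    corrected = memory + grad
    sent = compress(corrected)
    new_memory = corrected - sent   # gradient residual kept locally
    return sent, new_memory

def top1(v):
    """Toy compressor: keep only the single largest-|.| coordinate."""
    out = np.zeros_like(v)
    i = np.argmax(np.abs(v))
    out[i] = v[i]
    return out

m = np.zeros(3)
sent, m = step_with_error_feedback(np.array([0.5, 0.1, 0.2]), m, top1)
# 0.1 and 0.2 are not lost: they wait in memory for the next epoch
sent2, m = step_with_error_feedback(np.array([0.0, 0.3, 0.0]), m, top1)
print(sent2)  # the accumulated value in coordinate 1 is now transmitted
```

The second step shows why the method is also called "gradient with memory": the 0.1 left out of the first epoch combines with the new 0.3 and wins the selection in the next epoch.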
3. Our Method
3.1. Observation of gradient space-correlation and algorithm insight
It is well known that, in the study of DNNs for computer
vision, the local correlation of the image cannot be ignored,
so researchers designed pooling and convolution operators to
distill the overall image information without ignoring the
implicit local information of the image. Subsequent
research on techniques such as anchor boxes and the Feature
Pyramid Network[37,38,39] also proved the importance of the
local correlation of the image. Based on this phenomenon, we
find that there is also a correlation between the gradient and
the training data during the training of DNN models, and the
gradient values obtained during the backpropagation process
are not randomly and disorderly distributed.
We designed a simple experiment on a single-layer CNN
model trained with some typical simple images;
the results are shown in Figure 1. Apparently, for a triangle
in a binary image, the gradients have great similarity to
the original image from the first epoch to the last.
The gradients of the remaining typical images also follow
distributions similar to the original images. This suggests
that there is also a local correlation between the gradient
values, and it also explains the loss of accuracy and
convergence degradation incurred by the Top-K algorithm in
some specific cases. When the learning rate of the local
gradient descent step run by each worker node is low and
the sparsity rate of the Top-K algorithm is high, the gradient
values selected for transmission may stay in a region with
larger absolute values for many global epochs, so the global
parameters of the other regions may not be updated
in time.
Figure 1: The training data and the gradients of the corresponding layer. For each training image at the top, the
10×10 gradient-value matrix of the convolution layer is visualized by the corresponding image below, with darker colors
indicating larger values.
Figure 2: The Principle of the MaxPool Operator with stride = 2 and kernel-size = 2
To address this adhesion phenomenon, we introduce the
pooling operator commonly used in CNNs to replace the
Top-K operator. The extensive use of the pooling operator in
CNNs stems from the local correlation of images, and
the pooling operator can compress images while maintaining
their local correlation features[40]. Based on this, we exploit
the dimensionality-reduction operation of
pooling operators to compress the gradient matrix.
We modified the Minibatch-SGD algorithm so that each
worker node compresses its computed gradient before
sending it to the global server, which then performs a
simple unpooling operation to restore the matrix to its
original size. We also introduce the error-feedback method into
our algorithm: the values of the gradient entries not selected
in each epoch are saved locally and added to the newly
computed gradient matrix in the next epoch.
3.2. Details and Proofs of Pooling Algorithm
Generally, the distributed machine learning model is
often expressed in the following optimization form. For the
input x ∈ ℝ^d and objective function f: ℝ^d → ℝ:

    f(x) = (1/n) ∑_{i=1}^{n} f_i(x),   x* := argmin_{x∈ℝ^d} f(x),   f* := f(x*)        (1)

where each f_i is L-smooth and μ-strongly convex:

    f_i(y) ≤ f_i(x) + ⟨∇f_i(x), y − x⟩ + (L/2)‖y − x‖²,   ∀x, y ∈ ℝ^d, i ∈ [n]        (2)

    f_i(y) ≥ f_i(x) + ⟨∇f_i(x), y − x⟩ + (μ/2)‖y − x‖²,   ∀x, y ∈ ℝ^d, i ∈ [n]        (3)
The above optimization problem is usually solved with
Minibatch-SGD[11]. For simplicity, consider
the Minibatch-SGD algorithm in the following form:

    x^{k+1} = x^k − γ ∑_{i∈𝕄} ∇f(x^k_i)        (4)

where x^k denotes the global parameters of the k-th epoch, x^k_i
is the local parameter of the k-th epoch of the i-th worker
node, γ is the learning rate, and ∇f(x^k_i) is the gradient at x^k_i.
The target to sparsify is ∇f(x^k_i): this gradient
is transmitted to the global server for updating
the global parameters x^k in each epoch, which is the main
communication bottleneck[6].
The pooling operator compresses the gradient matrix ∇f(x^k_i)
according to the layer structure before uploading. For a
pooling filter with stride = R and kernel-size = R, the original
gradient matrix can be compressed to 1/R² of its original
size. The gradient matrix g^k_i obtained by this operation is
submitted to the server via the network, and the server then
performs an UnPool operation to restore it to its original size.
In addition, the elements α∇f(x^k_i) − g^k_i that are not transmitted
are added to the current gradient residual m^k_i to form the
next epoch's gradient residual m^{k+1}_i, which can greatly speed up
the training process.
The formal representation is as follows:

    x^{k+1} = x^k − γ ∑_{i∈𝕄} UnPool(g^k_i)        (5)

    g^k_i = MaxPool(m^k_i + α∇f(x^k_i))        (6)

    m^{k+1}_i = m^k_i + α∇f(x^k_i) − g^k_i        (7)

where the compression operator MaxPool_R: ℝ^d → ℝ^d is defined, for X ∈ ℝ^d, as

    MaxPool_R(X)_{i,j} := x_{i,j} if x_{i,j} is the maximum absolute value in its pooling filter, and 0 otherwise        (8)
and each symbol has the same meaning as in Equation 1.
The principle of the pooling operator is sketched in Figure 2.
Since the RandPool operator is very similar
to the MaxPool operator, we omit its details.
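A minimal sketch of the client-side compression of Equation 8 and a matching server-side UnPool is given below. We assume here that the flat indices of the kept entries are transmitted alongside the values, as in standard max-unpooling; the paper does not spell out its UnPool implementation, so this is one plausible reading:

```python
import numpy as np

def maxpool_compress(grad, R):
    """Client side: per R x R filter, keep the signed max-|.| value.
    Returns the compressed values plus their flat indices so the server
    can restore positions (assumed; analogous to torch's MaxUnpool)."""
    h, w = grad.shape
    vals, idxs = [], []
    for i in range(0, h, R):
        for j in range(0, w, R):
            block = grad[i:i + R, j:j + R]
            bi, bj = np.unravel_index(np.argmax(np.abs(block)), block.shape)
            vals.append(block[bi, bj])
            idxs.append((i + bi) * w + (j + bj))
    return np.array(vals), np.array(idxs)

def unpool(vals, idxs, shape):
    """Server side: scatter the received values back into a zero matrix."""
    out = np.zeros(shape)
    out.flat[idxs] = vals
    return out

g = np.arange(16, dtype=float).reshape(4, 4) - 8.0   # toy 4x4 gradient
vals, idxs = maxpool_compress(g, 2)                  # 16 floats -> 4 floats
restored = unpool(vals, idxs, g.shape)
```

Note that transmitting indices adds roughly log2(R²) bits per kept value on top of the 1/R² value compression; this overhead is shared by Top-K style schemes as well.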
The compression operator MaxPool_R: ℝ^d → ℝ^d satisfies the
contraction property for pooling stride = R and kernel-size = R:

    E‖X − MaxPool_R(X)‖² ≤ (1 − 1/R)‖X‖²,   ∀X ∈ ℝ^d        (9)

The specific steps of the proof are shown in Appendix B.1, which then yields the result in Equation 10:

    E[f(x^T)] − f* ≤ O( G²/(μT) + R²G²/(μT²) + R³G²/(μT³) )        (10)

where κ = L/μ and μ‖x⁰ − x*‖ ≤ 2G. For L-smooth (see
Equation 2) and μ-strongly convex (see Equation 3) f, and
for carefully chosen learning rates, we observe that for
T = Ω(κR^{1/2}) the dominating term of the convergence rate is
G²/(μT), as derived in [[6], Remark 2.6], the same rate as
vanilla SGD[41].
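The contraction property (9) can be checked numerically. The sketch below is our own illustration using 1D filters of R consecutive elements, matching the decomposition into Top-1 operators used in Appendix B.1; for such filters, the (1 − 1/R) factor is exactly attained by a uniform vector:

```python
import numpy as np

def maxpool_1d(x, R):
    """Keep the signed max-|.| element in each consecutive window of R entries."""
    out = np.zeros_like(x)
    for s in range(0, x.size, R):
        w = x[s:s + R]
        j = np.argmax(np.abs(w))
        out[s + j] = w[j]
    return out

rng = np.random.default_rng(0)
R = 5
for _ in range(1000):
    x = rng.normal(size=20)
    residual = np.sum((x - maxpool_1d(x, R)) ** 2)
    # in each window the kept square is >= the window's mean square,
    # so the residual keeps at most a (1 - 1/R) fraction of the energy
    assert residual <= (1 - 1 / R) * np.sum(x ** 2) + 1e-12

# worst case: a uniform vector loses exactly a (1 - 1/R) fraction
u = np.ones(20)
assert np.isclose(np.sum((u - maxpool_1d(u, R)) ** 2), (1 - 1 / R) * 20)
```

The bound holds because the kept element's square is at least the mean of the squares in its window, so each window retains at least a 1/R fraction of its energy.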
The pooling algorithm is presented in Algorithm 1 and
Algorithm 2. Each worker node computes the gradient using
its local dataset, performs the pooling operator to obtain the
compressed gradient g^k_i, and uses m^k_i to store the gradient
residual to accelerate the training process. Finally, it sends
the compressed gradient g^k_i to the global server for updating
the global parameter x^{k+1}. The server continuously receives
the compressed gradients g^k_i, performs the UnPool operation
to obtain the estimated gradient, and executes the gradient
descent step to update the global parameters x^{k+1}.
Algorithm 1 Distributed Parallel Pooling-SGD with Memory (Server)
Require: global model x⁰, global learning rate γ
Ensure: trained global model
for global epoch t in 0...T do
    for each client i do
        ∇^t_i ⇐ PoolSGDClient_i(x^t)
        // get the compressed matrix from each client i
        ∇^t_i ⇐ UnPool(∇^t_i)
        // perform the UnPool operation to obtain an estimate of the gradient matrix
    end for
    x^{t+1} ⇐ x^t − γ ∑_{i=1}^{M} ∇^t_i
    // execute the parallel stochastic gradient descent step
end for
Algorithm 2 PoolSGDClient
Require: client model x^t_i, local learning rate γ_i
Ensure: compressed gradient matrix
Initialize gradient memory m^t_i, pooling stride R, and α
1: compute the local gradient ∇f(x^t_i)
2: g^t_i ⇐ Pool_R(m^t_i + α∇f(x^t_i))
    // compute the compressed gradient matrix
3: m^{t+1}_i ⇐ m^t_i + α∇f(x^t_i) − g^t_i
    // compute the gradient residual
4: Server ⇐ g^t_i
    // send the compressed gradient matrix to the global server
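Algorithms 1 and 2 can be exercised end to end on a toy problem. The sketch below is ours, not the paper's code: it runs distributed least squares with 1D pooling windows, α, γ, and the problem sizes are illustrative choices, and the UnPool step is the identity because the sparse vector already has full length:

```python
import numpy as np

# Toy run of Algorithms 1-2 on f_i(x) = 0.5*||A_i x - b_i||^2 / m_i,
# using 1D pooling windows of R elements and error-feedback memory.

def pool_1d(x, R):
    """Keep the signed max-|.| element in each window of R entries."""
    out = np.zeros_like(x)
    for s in range(0, x.size, R):
        w = x[s:s + R]
        j = np.argmax(np.abs(w))
        out[s + j] = w[j]
    return out

rng = np.random.default_rng(1)
d, n_clients, R, gamma, alpha = 20, 4, 5, 0.05, 1.0
x_true = rng.normal(size=d)
A = [rng.normal(size=(30, d)) for _ in range(n_clients)]
b = [A[i] @ x_true for i in range(n_clients)]        # realizable targets

x = np.zeros(d)
memory = [np.zeros(d) for _ in range(n_clients)]     # per-client residuals

for epoch in range(200):
    update = np.zeros(d)
    for i in range(n_clients):
        grad = A[i].T @ (A[i] @ x - b[i]) / len(b[i])   # local gradient
        g = pool_1d(memory[i] + alpha * grad, R)        # Algorithm 2, line 2
        memory[i] += alpha * grad - g                   # Algorithm 2, line 3
        update += g      # server receives g; UnPool is the identity here
    x -= gamma * update  # Algorithm 1 server step

loss0 = sum(0.5 * np.sum(b[i] ** 2) for i in range(n_clients))
loss = sum(0.5 * np.sum((A[i] @ x - b[i]) ** 2) for i in range(n_clients))
```

Despite sending only 1/R of the gradient entries per epoch, the error-feedback memory lets the iterate approach the least-squares solution, which is the qualitative behavior the convergence result above predicts.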
4. Experiments
4.1. Experimental Setup
Implementation. We designed several experiments based
on the Minibatch-SGD framework[11]. We set up 100 worker nodes with
independently and identically distributed datasets. Each
worker node performs multiple backpropagation operations
on its local dataset to obtain the gradient of the current
epoch and then runs the communication-efficient
distributed gradient descent algorithm of Algorithm 1
and Algorithm 2 with a learning rate of 0.05. Each experiment
uses 50 epochs, ensuring that our algorithm performs
the same number of backpropagation and gradient descent
operations as the Top-K and Rand-K algorithms at the same
compression rate.
Task and Dataset. To cover a broad spectrum of deep
learning problems, we consider several typical CNN and
RNN models on different typical data sets, including
MNIST[42], Cifar-10[43], and a Part-of-Speech tagging corpus. To verify
the correctness of our algorithm on CNN models, we run the
task on the MNIST data set with a simple CNN model and on the
Cifar-10 data set with the ResNet-20[3] model. To verify the
correctness of our algorithm on RNN models, we run the
Part-of-Speech tagging task with the BiLSTM model[44,45] on the
UDPOS dataset[46].
Figure 3: The effects of the same compression rate on the convergence rate of different algorithms (i. Minibatch-SGD;
ii. Top-K SGD with or without memory; iii. MaxPool SGD with or without memory; iv. Rand-K SGD with or without memory;
v. RandPool SGD with or without memory) for the MNIST dataset with the CNN model. In (a), the compression rate is set to
0.04 except for Minibatch-SGD; in (b), it is 0.01 except for Minibatch-SGD. Since the random selection algorithms
(Rand-K and RandPool) work poorly without gradient residual memory, all subsequent experiments with the random selection
algorithms are set up only with gradient residual memory acceleration.
Compression rate 1/25:
    Compress operator              Test Accuracy
    none                           96.78 %
    5 MaxPool                      95.28 %
    5 MaxPool with Memory          95.56 %
    TopK 0.04                      93.15 %
    TopK 0.04 with Memory          95.54 %
    5 RandPool                     40.80 %
    5 RandPool with Memory         94.16 %
    RandK 0.04                     32.78 %
    RandK 0.04 with Memory         94.35 %

Compression rate 1/100:
    Compress operator              Test Accuracy
    none                           96.78 %
    10 MaxPool                     94.10 %
    10 MaxPool with Memory         95.04 %
    TopK 0.01                      89.99 %
    TopK 0.01 with Memory          94.06 %
    10 RandPool                    No convergence
    10 RandPool with Memory        93.05 %
    RandK 0.01                     No convergence
    RandK 0.01 with Memory         91.16 %

Table 1
Effects of the same compression rate on accuracy in different algorithms (i. Minibatch-SGD; ii. Top-K SGD with or without
memory; iii. MaxPool SGD with or without memory; iv. Rand-K SGD with or without memory; v. RandPool SGD with or without
memory) for the MNIST dataset with the CNN model.
Figure 4: The effects of the same compression rate on the convergence rate of different algorithms (i. Minibatch-SGD;
ii. Top-K SGD with or without memory; iii. MaxPool SGD with or without memory; iv. Rand-K SGD with or without memory;
v. RandPool SGD with or without memory) for the Cifar-10 dataset with the ResNet-20 model. In (a), the compression rate
is set to 0.04 except for Minibatch-SGD; in (b), it is 0.01 except for Minibatch-SGD.
Compression rate 1/25:
    Compress operator              Test Accuracy
    none                           71.58 %
    5 MaxPool                      59.88 %
    5 MaxPool with Memory          69.87 %
    TopK 0.04                      56.26 %
    TopK 0.04 with Memory          69.02 %
    5 RandPool with Memory         66.86 %
    RandK 0.04 with Memory         64.28 %

Compression rate 1/100:
    Compress operator              Test Accuracy
    none                           71.58 %
    10 MaxPool                     58.12 %
    10 MaxPool with Memory         67.33 %
    TopK 0.01                      43.28 %
    TopK 0.01 with Memory          66.08 %
    10 RandPool with Memory        66.86 %
    RandK 0.01 with Memory         52.74 %

Table 2
Effects of the same compression rate on accuracy in different algorithms (i. Minibatch-SGD; ii. Top-K SGD with or without
memory; iii. MaxPool SGD with or without memory; iv. Rand-K SGD with memory; v. RandPool SGD with memory) for the
experiments on the Cifar-10 dataset with the ResNet-20 model.
Baselines. The Top-K and Rand-K algorithms
are selected as comparisons with our algorithms: they
are used to compare the convergence rate and the accuracy
of the final model with ours for the same amount of
communication savings. The Minibatch-SGD algorithm is
also selected as the benchmark for optimal accuracy. The
RandPool operator is similar to the Rand-K operator; we
can also swap the MaxPool operator for the RandPool
operator when sparsifying the gradient matrix.
Compression Rate Settings. We choose two
communication compression rates, 1/25 and 1/100. For our
algorithms, the pooling stride R and kernel size are both
set equally to 5 and to 10, so the pooling operator
compresses the target gradient matrices to 1/5² and 1/10²
of their original sizes. For the Top-K and Rand-K algorithms, K
is set to 0.04 and 0.01 times the number of elements
in the original gradient matrix. This allows us to compare the
advantages of our algorithms over the Top-K and
Rand-K algorithms at the same compression ratio.
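The correspondence between the pooling strides and the Top-K ratios is simple arithmetic, sketched below for clarity:

```python
# A pooling filter with stride R and kernel size R keeps one value per
# R x R region, i.e. a fraction 1/R**2 of the gradient entries, so R = 5
# and R = 10 match Top-K ratios of 0.04 and 0.01 respectively.
for R in (5, 10):
    print(f"R = {R:2d}: keeps 1/{R**2} = {1 / R**2:.2%} of the entries")
```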
4.2. Results of several experiments
Convergence on the MNIST data set. The convergence
on the MNIST dataset with the CNN model is shown in Figure 3 and
Table 1. As described above, we conducted experiments against the
Minibatch-SGD algorithm with compression rates of 1/5² and 1/10²,
respectively. The results show that, with or without
gradient residual memory, in a typical CNN network our
algorithm obtains the same convergence rate as the Top-K
algorithm. The model loss of our algorithm is slightly
lower than that of the Top-K algorithm for the same training epoch,
and the final model accuracy is slightly higher than that of
the Top-K algorithm.
It can also be seen that the RandPool algorithm obtains better
results than the Rand-K algorithm, both with and without
gradient memory.
Convergence on the Cifar-10 data set. We perform
the image recognition task on the Cifar-10 dataset with the
ResNet model, as shown in Figure 4 and Table 2. Similarly,
we conducted experiments against the Minibatch-SGD algorithm
with compression rates of 1/5² and 1/10², respectively. The
experimental setup is similar to that of the
CNN model above. The results also show that the pooling operator
obtains relatively better convergence speed and final
accuracy, both with and without gradient residual memory.
The MaxPool algorithm with gradient residual memory comes
closest to the convergence rate of the Minibatch-SGD
algorithm.
Convergence on the POS task. To verify the performance
of our algorithm on RNN models, we select the
Part-of-Speech tagging task and run a bidirectional LSTM
neural network model. The results, shown in Figure 5
and Table 3, indicate that our algorithm and the Top-K
algorithm have the same performance on the BiLSTM model for the
same compression rate. The reason may be that the
gradient values have no local correlation for RNN models.
Summary. The experiments on CNN and RNN models
over different data sets indicate that our algorithm does
not damage model accuracy during training and outperforms
the Top-K algorithm. Also, our local random selection
strategy, the RandPool algorithm, outperforms the Rand-K
algorithm when the gradient elements to transmit are
selected randomly, although it does not work as
well as the MaxPool operator, as shown in Figure 3 and
Figure 4. From the comparison between the Rand-K
and RandPool methods, we can conclude that the RandPool
algorithm obtains relatively better results than the Rand-K
algorithm: the only difference between the two
operators is that the RandPool operator selects a random
element in each filter, and because there is
some correlation between the gradient values, the RandPool
operator better represents the global characteristics of the
original gradient matrix. This further supports why
the MaxPool operator obtains better results than the Top-K
operator.
5. Conclusions
In this work, we find one possible reason for the un-
stable experimental convergence of traditional sparsification
Figure 5: The effects of the same compression rate on the convergence rate of different algorithms (i. Minibatch-SGD;
ii. Top-K SGD with or without memory; iii. MaxPool SGD with or without memory; iv. Rand-K SGD with or without memory;
v. RandPool SGD with or without memory) for the UDPOS dataset with the BiLSTM model. In (a), the compression rate is set
to 0.04 except for Minibatch-SGD; in (b), it is 0.01 except for Minibatch-SGD.
Compression rate 1/25:
    Compress operator              End Loss
    none                           0.6672
    5 MaxPool                      0.6939
    5 MaxPool with Memory          0.6914
    TopK 0.04                      0.7154
    TopK 0.04 with Memory          0.7035
    5 RandPool with Memory         0.7125
    RandK 0.04 with Memory         0.7016

Compression rate 1/100:
    Compress operator              End Loss
    none                           0.6672
    10 MaxPool                     0.7497
    10 MaxPool with Memory         0.6837
    TopK 0.01                      0.7512
    TopK 0.01 with Memory          0.7116
    10 RandPool with Memory        1.0064
    RandK 0.01 with Memory         1.1218

Table 3
Effects of the same compression rate on the final loss in different algorithms (i. Minibatch-SGD; ii. Top-K SGD with or
without memory; iii. MaxPool SGD with or without memory; iv. Rand-K SGD with memory; v. RandPool SGD with memory) for the
experiments on the UDPOS dataset with the BiLSTM model.
algorithms: the gradient distribution is locally correlated
in some cases, and this correlation is ignored by traditional
sparsification operators. To address this, we introduced a
pooling operator and combined it with the error-feedback
method to design a gradient sparsification pooling algorithm.
We considered both CNN and RNN models with different
degrees of communication reduction: (i) CNN and
ResNet models on the MNIST and Cifar-10 datasets demonstrate
better convergence performance than the Top-K algorithm;
(ii) LSTM networks on the POS task perform similarly
to the Top-K algorithm. The results show that our algorithm
performs better than the Top-K algorithm on image-related
datasets and CNN models.
Declaration of competing interest
The authors declare that they have no known competing
financial interests or personal relationships that could have
appeared to influence the work reported in this paper.
Acknowledgement
This research work was supported by the National Nat-
ural Science Foundation of China (NSFC) (U19A2059),
and the Sichuan Science and Technology Program (No.
206999977).
Appendix
Lemma A.1. For x ∈ ℝ^d, 1 ≤ k ≤ d, and operator comp_k ∈ {top_k, rand_k}, it holds that

    E‖comp_k(x) − x‖² ≤ (1 − k/d)‖x‖²        (11)

Proof. By the definition of the operators, for all x ∈ ℝ^d we have

    ‖x − top_k(x)‖² ≤ ‖x − rand_k(x)‖²        (12)

and, applying the expectation,

    E_ω‖x − rand_k(x)‖² = (1/|Ω_k|) ∑_{ω∈Ω_k} ∑_{i=1}^{d} x_i² 𝕀{i ∉ ω}
                        = ∑_{i=1}^{d} x_i² ∑_{ω∈Ω_k} 𝕀{i ∉ ω} / |Ω_k|
                        = (1 − k/d)‖x‖²        (13)

which concludes the proof.
Proof B.1. For X ∈ ℝ^d and the operator MaxPool_R: ℝ^d → ℝ^d with pooling stride R and kernel size R, the contraction property holds:

    E‖X − MaxPool_R(X)‖² ≤ (1 − 1/R)‖X‖²,   ∀X ∈ ℝ^d        (14)

Proof. From the definitions of the MaxPool and Top-K operators, the MaxPool operator is the combination of d/R Top-1 operators: it simply takes the Top-1 absolute value in each filter with kernel size R. Hence

    E‖X − MaxPool_R(X)‖² = ∑_{x_i∈X} E‖x_i − Top-1(x_i)‖²
                         ≤ ∑_{x_i∈X} (1 − 1/R)‖x_i‖²
                         = (1 − 1/R)‖X‖²        (15)

where x_i denotes the submatrix covered by each filter.
References
[1] David Harwath, Antonio Torralba, and James Glass. Unsupervised
learning of spoken language with visual context. Advances in Neural
Information Processing Systems, 29, 2016.
[2] Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. Deep learning
based recommender system: A survey and new perspectives. ACM
Computing Surveys (CSUR), 52(1):1–38, 2019.
[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep
residual learning for image recognition. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 770–
778, 2016.
[4] Karen Simonyan and Andrew Zisserman. Very deep convolutional
networks for large-scale image recognition. arXiv preprint
arXiv:1409.1556, 2014.
[5] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and
Animashree Anandkumar. signsgd: Compressed optimisation for
non-convex problems. In International Conference on Machine
Learning, pages 560–569. PMLR, 2018.
[6] Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi.
Sparsified SGD with memory. Advances in Neural Information
Processing Systems, 31, 2018.
[7] Dan Alistarh, Torsten Hoefler, Mikael Johansson, Nikola
Konstantinov, Sarit Khirirat, and Cedric Renggli. The convergence of
sparsified gradient methods. Advances in Neural Information
Processing Systems, 31, 2018.
[8] István Hegedűs, Gábor Danner, and Márk Jelasity. Decentralized
learning works: An empirical comparison of gossip learning and
federated learning. Journal of Parallel and Distributed Computing,
148:109–124, 2021.
[9] Daniel Rosendo, Alexandru Costan, Patrick Valduriez, and Gabriel
Antoniu. Distributed intelligence on the edge-to-cloud continuum:
A systematic literature review. Journal of Parallel and Distributed
Computing, 166:71–94, 2022.
[10] Andrzej Goscinski, Flavia C. Delicato, Giancarlo Fortino, Anna
Kobusińska, and Gautam Srivastava. Special issue on distributed
intelligence at the edge for the future internet of things. Journal of
Parallel and Distributed Computing, 171:157–162, 2023.
[11] Blake E Woodworth, Kumar Kshitij Patel, and Nati Srebro. Minibatch
vs local sgd for heterogeneous distributed learning. Advances in
Neural Information Processing Systems, 33:6281–6292, 2020.
[12] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[13] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.
[14] Ananda Theertha Suresh, X Yu Felix, Sanjiv Kumar, and H Brendan
McMahan. Distributed mean estimation with limited communication.
In International conference on machine learning, pages 3329–3337.
PMLR, 2017.
[15] Longxin Lin, Zhenxiong Xu, Chien-Ming Chen, Ke Wang, Md. Rafiul
Hassan, Md. Golam Rabiul Alam, Mohammad Mehedi Hassan, and
Giancarlo Fortino. Understanding the impact on convolutional neural
networks with different model scales in aiot domain. Journal of
Parallel and Distributed Computing, 170:1–12, 2022.
[16] Shuo Ouyang, Dezun Dong, Yemao Xu, and Liquan Xiao. Commu-
nication optimization strategies for distributed deep neural network
training: A survey. Journal of Parallel and Distributed Computing,
149:52–65, 2021.
[17] Yaser Mansouri and M. Ali Babar. A review of edge computing: Fea-
tures and resource virtualization. Journal of Parallel and Distributed
Computing, 150:155–183, 2021.
[18] Felix Sattler, Simon Wiedemann, Klaus-Robert Müller, and Wojciech Samek. Robust and communication-efficient federated learning from non-iid data. IEEE Transactions on Neural Networks and Learning Systems, 31(9):3400–3413, 2019.
[19] Farzin Haddadpour, Mohammad Mahdi Kamani, Aryan Mokhtari,
and Mehrdad Mahdavi. Federated learning with compression: Uni-
fied analysis and sharp guarantees. In International Conference on
Artificial Intelligence and Statistics, pages 2350–2358. PMLR, 2021.
[20] Jay H Park, Sunghwan Kim, Jinwon Lee, Myeongjae Jeon, and
Sam H Noh. Accelerated training for cnn distributed deep learning
through automatic resource-aware layer placement. arXiv preprint
arXiv:1901.05803, 2019.
[21] Enmao Diao, Jie Ding, and Vahid Tarokh. Heterofl: Computation and
communication efficient federated learning for heterogeneous clients.
arXiv preprint arXiv:2010.01264, 2020.
[22] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and
Yoshua Bengio. Quantized neural networks: Training neural networks
with low precision weights and activations. The Journal of Machine
Learning Research, 18(1):6869–6898, 2017.
[23] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally.
Deep gradient compression: Reducing the communication bandwidth
for distributed training. arXiv preprint arXiv:1712.01887, 2017.
[24] Alham Fikri Aji and Kenneth Heafield. Sparse communication for
distributed gradient descent. arXiv preprint arXiv:1704.05021, 2017.
[25] Hongyi Wang, Scott Sievert, Shengchao Liu, Zachary Charles, Dim-
itris Papailiopoulos, and Stephen Wright. Atomo: Communication-
efficient learning via atomic sparsification. Advances in Neural
Information Processing Systems, 31, 2018.
[26] Enda Yu, Dezun Dong, Yemao Xu, Shuo Ouyang, and Xiangke Liao.
Cp-sgd: Distributed stochastic gradient descent with compression
and periodic compensation. Journal of Parallel and Distributed
Computing, 169:42–57, 2022.
[27] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. Advances in Neural Information Processing Systems, 30, 2017.
First Author et al.: Preprint submitted to Elsevier Page 8 of 9
This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4327869
[28] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and
Martin Jaggi. Error feedback fixes signsgd and other gradient com-
pression schemes. In International Conference on Machine Learning,
pages 3252–3261. PMLR, 2019.
[29] Nikita Ivkin, Daniel Rothchild, Enayat Ullah, Ion Stoica, Raman
Arora, et al. Communication-efficient distributed sgd with sketching.
Advances in Neural Information Processing Systems, 32, 2019.
[30] Kunihiko Fukushima and Sei Miyake. Neocognitron: A self-
organizing neural network model for a mechanism of visual pattern
recognition. In Competition and cooperation in neural nets, pages
267–285. Springer, 1982.
[31] Eduard Gorbunov, Dmitry Kovalev, Dmitry Makarenko, and Peter Richtárik. Linearly converging error compensated sgd. Advances in Neural Information Processing Systems, 33:20889–20900, 2020.
[32] Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-
term memory-networks for machine reading. arXiv preprint
arXiv:1601.06733, 2016.
[33] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloé Kiddon, Jakub Konečný, Stefano Mazzocchi, Brendan McMahan, et al. Towards federated learning at scale: System design. Proceedings of Machine Learning and Systems, 1:374–388, 2019.
[34] Peng Jiang and Gagan Agrawal. A linear speedup analysis of
distributed deep learning with sparse and quantized communication.
Advances in Neural Information Processing Systems, 31, 2018.
[35] Farzin Haddadpour, Mohammad Mahdi Kamani, Aryan Mokhtari, and Mehrdad Mahdavi. Federated learning with compression: Unified analysis and sharp guarantees. In International Conference on Artificial Intelligence and Statistics, pages 2350–2358. PMLR, 2021.
[36] Aleksandr Beznosikov, Samuel Horváth, Peter Richtárik, and Mher Safaryan. On biased compression for distributed learning. 2020.
[37] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao.
Yolov4: Optimal speed and accuracy of object detection. arXiv
preprint arXiv:2004.10934, 2020.
[38] Tao Kong, Fuchun Sun, Huaping Liu, Yuning Jiang, Lei Li, and
Jianbo Shi. Foveabox: Beyound anchor-based object detection. IEEE
Transactions on Image Processing, 29:7389–7398, 2020.
[39] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath
Hariharan, and Serge Belongie. Feature pyramid networks for object
detection. In Proceedings of the IEEE conference on computer vision
and pattern recognition, pages 2117–2125, 2017.
[40] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet
classification with deep convolutional neural networks. Communica-
tions of the ACM, 60(6):84–90, 2017.
[41] Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012.
[42] Li Deng. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
[43] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet
classification with deep convolutional neural networks. Communica-
tions of the ACM, 60(6):84–90, 2017.
[44] Libin Shen, Giorgio Satta, and Aravind Joshi. Guided learning for
bidirectional sequence classification. In ACL, volume 7, pages 760–
767, 2007.
[45] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.
[46] Milan Straka and Jana Straková. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99, 2017.