Communication-efficient Distributed Stochastic Gradient Descent with Pooling Operator

Zhengao Cai(a), Aiguo Chen(a,b,*), Yi Luo(c) and Jiahao Li(a)

(a) Shenzhen Institute of Advanced Study, University of Electronic Science and Technology of China, China
(b) School of Computer Science and Engineering, University of Electronic Science and Technology of China, China
(c) School of Information and Software Engineering, University of Electronic Science and Technology of China, China
(*) Corresponding author: agchen@uestc.edu.cn (A. Chen)
ARTICLE INFO
Keywords:
Distributed Machine Learning
Distributed SGD
Communication Efficient SGD
Gradient Sparsification
ABSTRACT
Training deep neural networks on large datasets can be accelerated by distributing the computations across multiple worker nodes. The distributed stochastic gradient descent (D-SGD) algorithm combined with gradient sparsification is a typical algorithm for this distributed training model: it guarantees good convergence and can significantly reduce communication bandwidth. However, we find that existing gradient sparsification algorithms ignore the local correlation between gradient values under certain conditions, which can incur a loss of accuracy because the update region of the global parameters stays in a particular area for many consecutive epochs. Based on this observation, we introduce the pooling operator to solve this problem and combine it with the error-feedback method to design a new communication-efficient distributed gradient descent algorithm. We prove that our algorithm converges at the same rate as vanilla SGD when equipped with error feedback. Experiments on a CNN, an LSTM, and a ResNet model demonstrate that our algorithm can compress the number of uploaded gradient bits by two to three orders of magnitude while preserving high model accuracy.
1. Introduction
In recent years, deep neural networks (DNNs) have been used as powerful tools for machine learning and artificial intelligence in a large number of domains [1, 2], especially in computer vision and natural language processing [3, 4]. With the increasing volume of training data and the growing complexity of DNN structures, distributed machine learning [5, 6, 7, 8, 9, 10] is widely used to train these large-scale DNN models. It is a multi-node system in which a server node and multiple worker nodes collaborate to run the Distributed Stochastic Gradient Descent (D-SGD) algorithm. The server node is responsible for scheduling the model's parameters and the worker nodes' behavior [11, 12]. Each worker node uses its local data to compute the gradient of the current parameters and transmits the computed gradient matrix to the server node.
However, D-SGD suffers from one main drawback: the communication overhead has become a bottleneck in the process of training large-scale DNN models [13, 14, 5, 15, 16]. For instance, ResNet [3, 17] has more than 25 million parameters, and the communication overhead may be in the range of gigabytes per epoch [18, 19, 20, 21, 15], so the cost of transmitting the gradient can be prohibitive [22]. Consequently, a number of gradient compression methods have been proposed to reduce the volume of gradient data and the time spent transmitting it while ensuring the convergence of the model [23, 24, 25, 14, 6, 26].
Gradient compression methods are mainly divided into two types: gradient quantization and gradient sparsification. Quantization methods use the Sign operator to quantize each original 32-bit or 64-bit floating-point gradient value to a single bit. Because of the way numbers are encoded, quantization alone cannot achieve a gradient compression rate of more than 64 times [5, 27, 23, 28]. The other class of gradient compression methods is gradient sparsification, which is more flexible. Typically, the Top-K sparsification algorithm [6, 24] can compress the uploaded gradient traffic by two to three orders of magnitude, and a large number of papers [6, 7] have shown that it can obtain the same convergence and accuracy as the typical D-SGD algorithm for both convex and non-convex optimization problems. However, for some large-scale DNN models (e.g., ResNet [3]) with a high sparsification rate, the Top-K algorithm may lead to a severe accuracy penalty and worse convergence. In this case, the Top-K algorithm needs many more training epochs to achieve the same accuracy as the typical D-SGD algorithm [6, 29].
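To illustrate the gap between the two compression families, here is a rough payload sketch of ours (not from the paper), assuming 64-bit floats and a 32-bit index per transmitted Top-K entry:

import numpy as np

def sign_quantize(grad):
    # 1-bit quantization: transmit only the signs plus one shared scale factor.
    return np.sign(grad), np.mean(np.abs(grad))

def top_k(grad, k):
    # Transmit only the k entries with the largest absolute value (values + indices).
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return grad[idx], idx

grad = np.random.randn(1_000_000)    # stand-in for one layer's gradient
k = grad.size // 100                 # keep 1% of the entries

dense_bits = grad.size * 64
sign_bits = grad.size * 1            # at best a 64x reduction
topk_bits = k * (64 + 32)            # roughly a 100x reduction here, more for smaller k

print(dense_bits, sign_bits, topk_bits)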
One reason for the convergence degradation is that, in some cases, the Top-K operator entirely discards the locally correlated features of the gradient when sparsifying the gradient matrix. For highly locally correlated data such as images, there is also a local correlation in the gradients obtained during the training of the DNN. Since the Top-K operator naturally retains only the K largest absolute values in the gradient matrix, the selected gradient entries may stay within a small local region for many epochs, while the parameters in the other regions stop updating.
Based on this observation, we exploit the dimensionality reduction operation of pooling operators in convolutional neural networks (CNNs) [30] and introduce the pooling operators (MaxPool and RandPool) to sparsify the gradient matrix. The MaxPool operator selects the gradient element with the largest absolute value from the region covered by each filter, rather than the K largest values from the whole gradient. The RandPool operator selects one random element from the region covered by each filter. This approach preserves the locally correlated features of the gradient while sparsifying the gradient matrix, and it eliminates the long-lasting regional adhesion that the Top-K operator may cause. Note that our proposed MaxPool operator differs from the typical MaxPool operator in that the results obtained by the typical MaxPool operator contain no negative values. Since the gradient has negative values, we modify the typical MaxPool operator so that the selected elements are the ones with the largest absolute values. The MaxPool operator mentioned later in this article refers to the one modified in this way.
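A minimal NumPy sketch of the two operators as we read them (ours, not the authors' code; non-overlapping R x R windows are assumed, and the matrix is assumed to tile evenly):

import numpy as np

def abs_maxpool(grad, R):
    # In each non-overlapping R x R window, keep only the element with the
    # largest absolute value (sign preserved); all other entries become zero.
    H, W = grad.shape
    out = np.zeros_like(grad)
    for i in range(0, H, R):
        for j in range(0, W, R):
            b = grad[i:i + R, j:j + R]
            r, c = np.unravel_index(np.argmax(np.abs(b)), b.shape)
            out[i + r, j + c] = b[r, c]
    return out

def rand_pool(grad, R, rng=np.random.default_rng()):
    # In each non-overlapping R x R window, keep one uniformly chosen element.
    H, W = grad.shape
    out = np.zeros_like(grad)
    for i in range(0, H, R):
        for j in range(0, W, R):
            r, c = rng.integers(0, R), rng.integers(0, R)
            out[i + r, j + c] = grad[i + r, j + c]
    return out

g = np.random.randn(10, 10)
print(np.count_nonzero(abs_maxpool(g, 5)))   # 4: one survivor per 5 x 5 window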
The pooling operators can compress the gradient matrix by two to three orders of magnitude without suffering severe accuracy degradation. We also prove that the pooling sparsification algorithm converges at the same rate as vanilla SGD when combined with the error-feedback method [31] commonly used in gradient compression. The experimental results demonstrate that, at the same compression ratio, our algorithm achieves the same communication savings as existing sparsification schemes while obtaining a better convergence rate and accuracy on computer-vision-related DNNs (e.g., CNN, ResNet). For typical recurrent neural networks (RNNs) used in natural language processing (e.g., LSTM [32]), our algorithm achieves a convergence rate and accuracy similar to those of the Top-K algorithm.
The remainder of this paper is organized as follows. Section 2 presents research related to our algorithm. In Section 3, we introduce the rationale and details of our algorithm and present its convergence proof. In Section 4, we experimentally demonstrate the advantages of our algorithm over established and generally strong sparsification baselines. Finally, we conclude with a discussion in Section 5.
2. Related Work
Minibatch-SGD. Many D-SGD algorithms exist for optimizing large-scale DNNs in distributed machine-learning environments, and Minibatch-SGD is currently the most popular one. In the Minibatch-SGD algorithm, the server sends the DNN structure and the global parameters to each worker node. When a worker node receives the parameters, it uses its local data to compute multiple stochastic gradients and sends their average to the server. Finally, the server aggregates these gradients to obtain the global parameters for the next epoch [11]. Other optimization algorithms for DNNs in distributed machine learning, such as Local-SGD [12], have been shown to perform much worse than Minibatch-SGD in many cases [11].
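For concreteness, one synchronous Minibatch-SGD round reduces to roughly the following toy sketch (ours; a quadratic per-worker loss stands in for the DNN, and all names are hypothetical):

import numpy as np

class Worker:
    # Toy worker holding a local quadratic loss f_i(x) = 0.5 * ||x - a_i||^2.
    def __init__(self, a):
        self.a = a
    def local_gradient(self, x, batches=4):
        # Average of several noisy stochastic gradients, as in Minibatch-SGD.
        return np.mean([(x - self.a) + 0.1 * np.random.randn(*x.shape)
                        for _ in range(batches)], axis=0)

workers = [Worker(np.random.randn(5)) for _ in range(8)]
x = np.zeros(5)
for epoch in range(100):
    grads = [w.local_gradient(x) for w in workers]   # computed in parallel on the workers
    x = x - 0.1 * np.mean(grads, axis=0)             # server aggregates and takes the step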
Top-K and Rand-K Sparsification. The key idea of the Top-K sparsification algorithm is to discard "useless" (tiny and less helpful for updating the global model) gradient values with the Top-K operator. Researchers found that 99% of the small gradients in stochastic gradient descent contribute little to updating the global model. The role of the Top-K operator is therefore to select and transmit only the K largest absolute values of the gradient matrix to the global server [24, 33, 34]. Several recent papers have also theoretically shown that such Top-K sparsification algorithms retain convergence guarantees for convex and non-convex problems [35, 7, 36]. There is also a Rand-K algorithm, which differs from the Top-K algorithm only in that, when sparsifying the matrix, the transmitted elements are selected uniformly at random instead of being the K elements with the largest absolute value.
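A sketch of the two operators in the sense described above (our own flattened-tensor illustration, not code from the cited papers):

import numpy as np

def top_k_op(grad, k):
    # Keep only the k entries with the largest absolute value, zero the rest.
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    out = np.zeros_like(flat)
    out[idx] = flat[idx]
    return out.reshape(grad.shape)

def rand_k_op(grad, k, rng=np.random.default_rng()):
    # Keep k uniformly chosen entries, regardless of magnitude.
    flat = grad.ravel()
    idx = rng.choice(flat.size, size=k, replace=False)
    out = np.zeros_like(flat)
    out[idx] = flat[idx]
    return out.reshape(grad.shape)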
Error-feedback. In the traditional sparsification method, each worker node only transmits the selected elements to the global server in each communication epoch, and the elements that are not selected are discarded immediately. Inspired by the traditional momentum approach in stochastic gradient descent, a typical practice is to record the gradient residuals that are not transmitted to the global server in the current epoch and add them to the newly computed gradient in the next epoch. Experiments have shown that this error-feedback method can significantly improve the final convergence rate and accuracy of the model [6, 31, 26]. Error feedback is also known as error compensation, gradient with memory, or gradient residuals [31, 26].
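Per worker and per epoch, the error-feedback pattern described above amounts to roughly the following (a sketch of ours under the notation used later in Section 3; compress is any sparsifier such as Top-K or a pooling operator):

import numpy as np

def step_with_error_feedback(memory, grad, compress, alpha=1.0):
    # Add the residual memory to the fresh gradient, compress the sum, and keep
    # whatever was not transmitted as the next epoch's residual.
    corrected = memory + alpha * grad
    sent = compress(corrected)
    new_memory = corrected - sent
    return sent, new_memory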
3. Our Method
3.1. Observation of Gradient Spatial Correlation and Algorithm Insight
It is well known that, in the study of DNNs for computer vision, the local correlation of images cannot be ignored, so researchers designed pooling and convolution operators to distill the overall image information without discarding the implicit local information of the image. Subsequent research on techniques such as anchor boxes and Feature Pyramid Networks [37, 38, 39] has also demonstrated the importance of the local correlation of images. Motivated by this, we find that there is also a correlation between the gradient and the training data during the training of DNN models: the gradient values obtained during backpropagation are not randomly and disorderly distributed.
We designed a simple experiment on a single-layer CNN model trained on some typical simple images, and the results are shown in Figure 1. For a triangle in a binary image, the gradients show great similarity to the original image from the first epoch to the last. The gradients of the other typical images also follow distributions similar to the original images. This suggests that there is a local correlation between gradient values, and it also explains the loss of accuracy and the convergence degradation incurred by the Top-K algorithm in some specific cases. When the learning rate of the local gradient descent step run by each worker node is low and the sparsity rate of the Top-K algorithm is high, the gradient values selected for transmission may stay in a region with large absolute values for many global epochs, so the global parameters of the other regions may not be updated in time.
Figure 1: The training data and the gradients of the corresponding layer. For each training image at the top, the 10×10 gradient value matrix of the convolution layer is visualized by the corresponding image below, with darker colors indicating larger values.
Figure 2: The principle of the MaxPool operator with stride = 2 and kernel size = 2.
To address this adhesion phenomenon, we introduce the pooling operator commonly used in CNNs to replace the Top-K operator. The extensive use of the pooling operator in CNNs stems from the local correlation of images, and the pooling operator can compress images while maintaining their locally correlated features [40]. Based on this, we exploit the dimensionality reduction operation of pooling operators to compress the gradient matrix.
We modify the Minibatch-SGD algorithm so that each worker node compresses its computed gradient before sending it to the global server, and the server then performs a simple unpooling operation to restore the matrix to its original size. We also introduce the error-feedback method into our algorithm: the gradient values that are not selected in each epoch are saved locally and added to the newly computed gradient matrix in the next epoch.
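One practical detail implied here is that the server's unpooling step needs the position of each kept element inside the gradient matrix; a compact round-trip sketch (ours, under the same non-overlapping-window assumption as before):

import numpy as np

def pool_with_indices(g, R):
    # Keep the largest-magnitude entry of each R x R window, plus its flat position.
    H, W = g.shape
    vals, pos = [], []
    for i in range(0, H, R):
        for j in range(0, W, R):
            b = g[i:i + R, j:j + R]
            r, c = np.unravel_index(np.argmax(np.abs(b)), b.shape)
            vals.append(b[r, c])
            pos.append((i + r) * W + (j + c))
    return np.array(vals), np.array(pos)

def unpool(vals, pos, shape):
    # Scatter the received values back to their original positions; the rest stay zero.
    out = np.zeros(shape)
    out.ravel()[pos] = vals
    return out

g = np.random.randn(20, 20)
vals, pos = pool_with_indices(g, R=5)      # 16 values and 16 positions are uploaded
restored = unpool(vals, pos, g.shape)      # same shape as g, zero outside the kept entries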
3.2. Details and Proofs of Pooling Algorithm
Generally, the distributed machine learning model is expressed in the following optimization form. For the input x and objective function f : \mathbb{R}^d \to \mathbb{R}:

f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad x^* := \arg\min_{x \in \mathbb{R}^d} f(x), \qquad f^* := f(x^*)    (1)

where each f_i is L-smooth and \mu-strongly convex:

f_i(y) \le f_i(x) + \langle \nabla f_i(x), y - x \rangle + \frac{L}{2} \| y - x \|^2, \qquad \forall x, y \in \mathbb{R}^d,\ i \in [n]    (2)

f_i(y) \ge f_i(x) + \langle \nabla f_i(x), y - x \rangle + \frac{\mu}{2} \| y - x \|^2, \qquad \forall x, y \in \mathbb{R}^d,\ i \in [n]    (3)
The above optimization problem is usually solved with Minibatch-SGD [11]. For simplicity, consider the Minibatch-SGD algorithm in the following form:

x_{k+1} = x_k - \gamma \sum_{i \in \mathbb{M}} \nabla f\!\left(x_i^k\right)    (4)

where x_k denotes the global parameters at the k-th epoch, x_i^k is the local parameter of the i-th worker node at the k-th epoch, \gamma is the learning rate, and \nabla f(x_i^k) is the gradient at x_i^k. The quantity to be sparsified is \nabla f(x_i^k): this gradient is transmitted to the global server to update the global parameters x_k in every epoch, and it constitutes the main communication bottleneck [6].
Before uploading, the pooling operator compresses the gradient matrix ∇f(x_i^k) according to the layer structure. For a pooling filter with stride R and kernel size R, the original gradient matrix is compressed to 1/R^2 of its original size. The compressed gradient matrix g_i^k is then submitted to the server via the network, and the server performs an UnPool operation to restore it to its original size. In addition, the elements α∇f(x_i^k) − g_i^k that are not transmitted are added to the current gradient residual m_i^k to form the next epoch's gradient residual m_i^{k+1}, which speeds up the training process to a great extent.
The formal representation is as follows:

x_{k+1} = x_k - \gamma \sum_{i \in \mathbb{M}} \mathrm{UnPool}\!\left(g_i^k\right)    (5)

g_i^k = \mathrm{MaxPool}\!\left(m_i^k + \alpha \nabla f\!\left(x_i^k\right)\right)    (6)

m_i^{k+1} = m_i^k + \alpha \nabla f\!\left(x_i^k\right) - g_i^k    (7)

where the compression operator \mathrm{MaxPool}_R : \mathbb{R}^d \to \mathbb{R}^d is defined, for X \in \mathbb{R}^d, as

\mathrm{MaxPool}_R(X)_{i,j} := \begin{cases} x_{i,j}, & \text{if } x_{i,j} \text{ has the largest absolute value in its pooling filter} \\ 0, & \text{otherwise} \end{cases}    (8)

and each symbol has the same meaning as in Equation 1.
The principle of the pooling operator is illustrated in Figure 2. Since the RandPool operator is very similar to the MaxPool operator, differing only in that it keeps one randomly chosen element per pooling filter instead of the element with the largest absolute value, we do not describe it separately.
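As a small worked instance of Equation 8 (our own numbers, mirroring the stride-2, kernel-size-2 setting of Figure 2), each 2×2 window keeps only its largest-magnitude entry, sign included:

\mathrm{MaxPool}_2\!\begin{pmatrix} 0.1 & -0.7 & 0.2 & 0.0 \\ 0.3 & 0.4 & -0.9 & 0.1 \\ -0.2 & 0.5 & 0.6 & 0.2 \\ 0.1 & 0.0 & -0.3 & 0.8 \end{pmatrix} = \begin{pmatrix} 0 & -0.7 & 0 & 0 \\ 0 & 0 & -0.9 & 0 \\ 0 & 0.5 & 0 & 0 \\ 0 & 0 & 0 & 0.8 \end{pmatrix}

Only the four surviving values (and their positions, needed later for UnPool) would be uploaded in this example.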
The compression operator \mathrm{MaxPool}_R : \mathbb{R}^d \to \mathbb{R}^d with pooling stride R and kernel size R satisfies the contraction property

\mathbb{E}\,\left\| X - \mathrm{MaxPool}_R(X) \right\|^2 \le \left(1 - \frac{1}{R}\right) \|X\|^2, \qquad \forall X \in \mathbb{R}^d    (9)

The detailed proof is given in Appendix B.1. We then obtain the result in Equation 10:

\mathbb{E}\, f(x_T) - f^* \le \mathcal{O}\!\left( \frac{G^2}{\mu T} + \frac{\kappa R^2 G^2}{\mu T^2} + \frac{R^3 G^2}{\mu T^3} \right)    (10)

where \kappa := L/\mu and \mu\,\mathbb{E}\|x_0 - x^*\|^2 \le G^2. For L-smooth (see Equation 2) and \mu-strongly convex (see Equation 3) f and carefully chosen learning rates, we observe that for T = \Omega\!\left(\kappa R^{1/2}\right) the dominating term of the convergence rate is \mathcal{O}\!\left(\frac{G^2}{\mu T}\right), as derived in [6, Remark 2.6], which is the same rate as vanilla SGD [41].
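As an informal sanity check of Equation 9 (ours, not part of the paper's proof), the following sketch measures the contraction ratio of the absolute-value MaxPool on random Gaussian matrices and compares the observed worst case with the stated (1 − 1/R) bound:

import numpy as np

def abs_maxpool(x, R):
    # Keep only the largest-magnitude entry of each non-overlapping R x R window.
    H, W = x.shape
    out = np.zeros_like(x)
    for i in range(0, H, R):
        for j in range(0, W, R):
            b = x[i:i + R, j:j + R]
            r, c = np.unravel_index(np.argmax(np.abs(b)), b.shape)
            out[i + r, j + c] = b[r, c]
    return out

rng = np.random.default_rng(0)
R = 5
ratios = []
for _ in range(200):
    X = rng.standard_normal((20, 20))
    ratios.append(np.sum((X - abs_maxpool(X, R)) ** 2) / np.sum(X ** 2))
print(max(ratios), "<=", 1 - 1 / R)   # observed worst case vs the stated bound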
The pooling algorithm is summarized in Algorithm 1 and Algorithm 2. Each worker node computes the gradient on its local dataset and applies the pooling operator to obtain the compressed gradient g_i^k, while using m_i^k to store the gradient residual that accelerates the training process. It then sends the compressed gradient g_i^k to the global server for updating the global parameter x_{k+1}. The server continuously receives the compressed gradients g_i^k, performs the UnPool operation to obtain the unbiased estimated gradient, and executes the gradient descent step to update the global parameters x_{k+1}.
Algorithm 1 Distributed Parallel Pooling-SGD with Memory — Server
Require: global model x^0, global learning rate γ
Ensure: trained global model
for global epoch t in 0 ... T do
    for each client i do
        g_i^t ← PoolSGDClient_i(x^t)
            // Get the compressed matrix from each client i
        ĝ_i^t ← UnPool(g_i^t)
            // Perform the UnPool operation to obtain an unbiased estimate of the gradient matrix
    end for
    x^{t+1} ← x^t − γ Σ_{i=1}^{M} ĝ_i^t
        // Execute the parallel stochastic gradient descent step
end for

Algorithm 2 PoolSGDClient
Require: client model x_i^t, local learning rate γ_i
Ensure: compressed gradient matrix
Initialize gradient memory m_i^t, pooling stride R and α
1: Compute the local gradient ∇f(x_i^t)
2: g_i^t ← Pool_R(m_i^t + α∇f(x_i^t))
    // Compute the compressed gradient matrix
3: m_i^{t+1} ← m_i^t + α∇f(x_i^t) − g_i^t
    // Compute the gradient residual
4: Send g_i^t to the global server
    // Send the compressed gradient matrix to the global server
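To make the interaction between Algorithm 1 and Algorithm 2 concrete, here is a self-contained simulation sketch in Python/NumPy (ours, not the authors' implementation). It uses a toy quadratic objective, an index-carrying variant of the pooling operator so the server can UnPool, and the error-feedback memory of Algorithm 2; all names are hypothetical.

import numpy as np

def pool_with_indices(g, R):
    # Abs-max pooling over non-overlapping R x R windows: keep one value per window
    # together with its flat position, which is what would be transmitted.
    H, W = g.shape
    vals, pos = [], []
    for i in range(0, H, R):
        for j in range(0, W, R):
            b = g[i:i + R, j:j + R]
            r, c = np.unravel_index(np.argmax(np.abs(b)), b.shape)
            vals.append(b[r, c])
            pos.append((i + r) * W + (j + c))
    return np.array(vals), np.array(pos)

def unpool(vals, pos, shape):
    # Server-side UnPool: scatter the received values back to their positions.
    out = np.zeros(shape)
    out.ravel()[pos] = vals
    return out

rng = np.random.default_rng(0)
targets = [rng.standard_normal((20, 20)) for _ in range(4)]    # f_i(x) = 0.5 * ||x - a_i||^2
x = np.zeros((20, 20))                                         # global model
memory = [np.zeros_like(x) for _ in targets]                   # per-worker gradient residual m_i
gamma, alpha, R = 0.2, 1.0, 5

for epoch in range(300):
    update = np.zeros_like(x)
    for i, a in enumerate(targets):
        grad = (x - a) + 0.05 * rng.standard_normal(x.shape)   # local stochastic gradient
        corrected = memory[i] + alpha * grad                   # input of Eq. 6
        vals, pos = pool_with_indices(corrected, R)            # Algorithm 2, line 2
        sent = unpool(vals, pos, x.shape)
        memory[i] = corrected - sent                           # Algorithm 2, line 3 (Eq. 7)
        update += sent                                         # server-side UnPool and sum
    x = x - gamma * update                                     # Algorithm 1 global step (Eq. 5)

print(np.linalg.norm(x - np.mean(targets, axis=0)))            # should shrink toward zero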
4. Experiments
4.1. Experimental Setup
Implementation. We designed several experiments based on Minibatch-SGD [11]. We set up 100 worker nodes with independently and identically distributed datasets. Each worker node performs multiple backpropagation operations on its local dataset to obtain the gradient for the current epoch and then runs the communication-efficient distributed gradient descent algorithm of Algorithm 1 and Algorithm 2 with a learning rate of 0.05. Each experiment uses 50 epochs, which ensures that our algorithm performs the same number of backpropagation and gradient descent operations as the Top-K and Rand-K algorithms at the same compression rate.
Task and Dataset. To cover a broad spectrum of deep learning problems, we consider several typical CNN and RNN models on different typical datasets, including a Part-of-Speech tagging corpus, MNIST [42], and Cifar-10 [43]. To verify the correctness of our algorithm on CNN models, we run the classification task on the MNIST dataset with a simple CNN model and on the Cifar-10 dataset with a ResNet-20 [3] model. To verify the correctness of our algorithm on RNN models, we run the Part-of-Speech tagging task with a BiLSTM model [44, 45] on the UDPOS dataset [46].
Figure 3: The effect of the same compression rate on the convergence rate of different algorithms (i. Minibatch-SGD; ii. Top-K SGD with or without memory; iii. MaxPool SGD with or without memory; iv. Rand-K SGD with or without memory; v. RandPool SGD with or without memory) for the MNIST dataset with the CNN model. In (a), the compression rate is set to 0.04 for all algorithms except Minibatch-SGD. In (b), the compression rate is 0.01 except for Minibatch-SGD. Since the random selection algorithms (Rand-K and RandPool) work poorly without gradient residual memory, all subsequent experiments with the random selection algorithms (Rand-K and RandPool) are run only with gradient residual memory acceleration.
Table 1
Investigating the effects of the same compression rate on the accuracy of different algorithms (i. Minibatch-SGD; ii. Top-K SGD with or without memory; iii. MaxPool SGD with or without memory; iv. Rand-K SGD with or without memory; v. RandPool SGD with or without memory) for the MNIST dataset with the CNN model.

(a) Compression rate 0.04
Compress operator            Test Accuracy
no compression               96.78 %
5 MaxPool                    95.28 %
5 MaxPool with Memory        95.56 %
TopK 0.04                    93.15 %
TopK 0.04 with Memory        95.54 %
5 RandPool                   40.80 %
5 RandPool with Memory       94.16 %
RandK 0.04                   32.78 %
RandK 0.04 with Memory       94.35 %

(b) Compression rate 0.01
Compress operator            Test Accuracy
no compression               96.78 %
10 MaxPool                   94.10 %
10 MaxPool with Memory       95.04 %
TopK 0.01                    89.99 %
TopK 0.01 with Memory        94.06 %
10 RandPool                  no convergence
10 RandPool with Memory      93.05 %
RandK 0.01                   no convergence
RandK 0.01 with Memory       91.16 %
Figure 4: The effect of the same compression rate on the convergence rate of different algorithms (i. Minibatch-SGD; ii. Top-K SGD with or without memory; iii. MaxPool SGD with or without memory; iv. Rand-K SGD with or without memory; v. RandPool SGD with or without memory) for the Cifar-10 dataset with the ResNet-20 model. In (a), the compression rate is set to 0.04 for all algorithms except Minibatch-SGD. In (b), the compression rate is 0.01 except for Minibatch-SGD.
Table 2
Investigating the effects of the same compression rate on the accuracy of different algorithms (i. Minibatch-SGD; ii. Top-K SGD with or without memory; iii. MaxPool SGD with or without memory; iv. Rand-K SGD with memory; v. RandPool SGD with memory) for the experiments on the Cifar-10 dataset with the ResNet-20 model.

(a) Compression rate 0.04
Compress operator            Test Accuracy
no compression               71.58 %
5 MaxPool                    59.88 %
5 MaxPool with Memory        69.87 %
TopK 0.04                    56.26 %
TopK 0.04 with Memory        69.02 %
5 RandPool with Memory       66.86 %
RandK 0.04 with Memory       64.28 %

(b) Compression rate 0.01
Compress operator            Test Accuracy
no compression               71.58 %
10 MaxPool                   58.12 %
10 MaxPool with Memory       67.33 %
TopK 0.01                    43.28 %
TopK 0.01 with Memory        66.08 %
10 RandPool with Memory      66.86 %
RandK 0.01 with Memory       52.74 %
Baselines. The Top-K and Rand-K algorithms are selected as baselines for our algorithms: they are used to compare the convergence rate and the final model accuracy with our algorithms at the same amount of communication savings. The Minibatch-SGD algorithm is also selected as the benchmark for optimal accuracy. The RandPool operator is analogous to the Rand-K operator; we can also replace the MaxPool operator with the RandPool operator when sparsifying the gradient matrix.
Compression Rate Settings. We choose two different communication compression rates, 1/25 and 1/100. For our algorithms, the pooling stride R and kernel size are both set to 5 or 10, so the pooling operator compresses the target gradient matrices to 1/5^2 or 1/10^2 of their original size. For the Top-K and Rand-K algorithms, K is set to 0.04 or 0.01 times the number of elements in the original gradient matrix. This allows us to compare the advantages of our algorithms over the Top-K and Rand-K algorithms at the same compression ratio.
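For reference, the element-count arithmetic behind these settings (index overhead ignored) is

\frac{1}{R^2}\Big|_{R=5} = \frac{1}{25} = 0.04, \qquad \frac{1}{R^2}\Big|_{R=10} = \frac{1}{100} = 0.01,

so Top-K and Rand-K with K equal to 0.04 and 0.01 times the number of gradient elements transmit the same number of elements as the pooling operators with strides 5 and 10, respectively.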
4.2. Results of several experiments
Convergence on the MNIST dataset. The convergence results for the MNIST dataset on the CNN model are shown in Figure 3 and Table 1. We conducted experiments based on the Minibatch-SGD algorithm with compression rates of 1/5^2 and 1/10^2, respectively, as described above. The results show that, with or without gradient residual memory, our algorithm obtains the same convergence rate as the Top-K algorithm on a typical CNN network. The model loss of our algorithm is slightly lower than that of the Top-K algorithm at the same training epoch, and the final model accuracy is slightly higher than that of the Top-K algorithm.
It can also be seen that the RandPool algorithm obtains better results than the Rand-K algorithm, both with and without gradient memory.
Convergence on the Cifar-10 dataset. We perform the image recognition task for the Cifar-10 dataset on the ResNet model; the results are shown in Figure 4 and Table 2. Similarly, we conducted experiments based on the Minibatch-SGD algorithm with compression rates of 1/5^2 and 1/10^2, respectively. The experimental setup is similar to that of the CNN model above. The results again show that the pooling operator obtains a relatively better convergence speed and final accuracy both with and without gradient residual memory. The MaxPool algorithm with gradient residual memory comes closest to the convergence rate of the Minibatch-SGD algorithm.
Convergence on the POS task. To verify the performance of our algorithm on RNN models, we select the Part-of-Speech tagging task and run a bidirectional LSTM neural network model. The results are shown in Figure 5 and Table 3, which indicate that, on the BiLSTM model, our algorithm and the Top-K algorithm have the same performance at the same compression rate. The reason may be that the gradient values have no local correlation in RNN models.
Summary. The experiments on CNN and RNN models with different datasets indicate that our algorithm does not damage model accuracy during training and outperforms the Top-K algorithm. Our local random selection strategy, i.e., the RandPool algorithm, also outperforms the Rand-K algorithm when the transmitted gradient elements are selected randomly, although it does not work as well as the MaxPool operator, as shown in Figure 3 and Figure 4. From the comparison between the Rand-K and RandPool methods, we conclude that the RandPool algorithm obtains relatively better results than the Rand-K algorithm: the only difference between the two operators is that RandPool selects one random element in each filter, and because there is some correlation between the gradient values, the RandPool operator better represents the global characteristics of the original gradient matrix. This further supports why the MaxPool operator can obtain better results than the Top-K operator.
5. Conclusions
In this work, we identify one possible reason for the unstable convergence observed experimentally with traditional sparsification algorithms.
Figure 5: The effect of the same compression rate on the convergence rate of different algorithms (i. Minibatch-SGD; ii. Top-K SGD with or without memory; iii. MaxPool SGD with or without memory; iv. Rand-K SGD with or without memory; v. RandPool SGD with or without memory) for the UDPOS dataset with the BiLSTM model. In (a), the compression rate is set to 0.04 for all algorithms except Minibatch-SGD. In (b), the compression rate is 0.01 except for Minibatch-SGD.
Table 3
Investigating the effects of the same compression rate on the final loss of different algorithms (i. Minibatch-SGD; ii. Top-K SGD with or without memory; iii. MaxPool SGD with or without memory; iv. Rand-K SGD with memory; v. RandPool SGD with memory) for the experiments on the UDPOS dataset with the BiLSTM model.

(a) Compression rate 0.04
Compress operator            End Loss
no compression               0.6672
5 MaxPool                    0.6939
5 MaxPool with Memory        0.6914
TopK 0.04                    0.7154
TopK 0.04 with Memory        0.7035
5 RandPool with Memory       0.7125
RandK 0.04 with Memory       0.7016

(b) Compression rate 0.01
Compress operator            End Loss
no compression               0.6672
10 MaxPool                   0.7497
10 MaxPool with Memory       0.6837
TopK 0.01                    0.7512
TopK 0.01 with Memory        0.7116
10 RandPool with Memory      1.0064
RandK 0.01 with Memory       1.1218
The cause is that the gradient distribution is locally correlated in some cases, and existing sparsification operators ignore this structure. To address this problem, we introduce a pooling operator and combine it with the error-feedback method to design a gradient sparsification pooling algorithm.
We considered both CNN and RNN models with different degrees of communication reduction: (i) the CNN and ResNet models on the MNIST and Cifar-10 datasets demonstrate better convergence performance than the Top-K algorithm; (ii) the LSTM network on the POS task performs similarly to the Top-K algorithm. The results show that our algorithm performs better than the Top-K algorithm on image-related datasets and CNN models.
Declaration of competing interest
The authors declare that they have no known competing
financial interests or personal relationships that could have
appeared to influence the work reported in this paper.
Acknowledgement
This research work was supported by the National Nat-
ural Science Foundation of China (NSFC) (U19A2059),
and the Sichuan Science and Technology Program (No.
206999977).
Appendix
Lemma A.1. For \mathbf{x} \in \mathbb{R}^d, 1 \le k \le d, and operator \mathrm{comp}_k \in \{\mathrm{top}_k, \mathrm{rand}_k\}, it holds that

\mathbb{E}\,\| \mathrm{comp}_k(\mathbf{x}) - \mathbf{x} \|^2 \le \left(1 - \frac{k}{d}\right) \|\mathbf{x}\|^2    (11)

Proof. From the definition of the operators, for all \mathbf{x} \in \mathbb{R}^d we have

\| \mathbf{x} - \mathrm{top}_k(\mathbf{x}) \|^2 \le \| \mathbf{x} - \mathrm{rand}_k(\mathbf{x}) \|^2    (12)

and, applying the expectation over the random index set \omega,

\mathbb{E}_\omega \| \mathbf{x} - \mathrm{rand}_k(\mathbf{x}) \|^2 = \frac{1}{|\Omega_k|} \sum_{\omega \in \Omega_k} \sum_{i=1}^{d} \mathbf{x}_i^2 \, \mathbb{I}\{i \notin \omega\} = \sum_{i=1}^{d} x_i^2 \, \frac{\sum_{\omega \in \Omega_k} \mathbb{I}\{i \notin \omega\}}{|\Omega_k|} = \left(1 - \frac{k}{d}\right) \|\mathbf{x}\|^2    (13)

which concludes the proof.
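As an informal numerical illustration of Lemma A.1 (ours, not part of the proof), the rand_k case holds with equality in expectation, which a quick Monte-Carlo estimate reproduces:

import numpy as np

rng = np.random.default_rng(1)
d, k = 100, 10
x = rng.standard_normal(d)

def rand_k(v, k):
    # Keep k uniformly chosen coordinates, zero the rest.
    keep = rng.choice(v.size, size=k, replace=False)
    out = np.zeros_like(v)
    out[keep] = v[keep]
    return out

errs = [np.sum((x - rand_k(x, k)) ** 2) for _ in range(20000)]
print(np.mean(errs) / np.sum(x ** 2), "vs the bound", 1 - k / d)   # both approximately 0.9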
Proof B.1. For X \in \mathbb{R}^d, the MaxPool operator with pooling stride R and kernel size R satisfies the contraction property

\mathbb{E}\,\| X - \mathrm{MaxPool}_R(X) \|^2 \le \left(1 - \frac{1}{R}\right) \|X\|^2, \qquad \forall X \in \mathbb{R}^d    (14)

Proof. From the definitions of the MaxPool and Top-K operators, the MaxPool operator is the combination of d/R Top-1 operators: it simply takes the Top-1 (largest absolute) value in each filter of kernel size R. Therefore

\mathbb{E}\,\| X - \mathrm{MaxPool}_R(X) \|^2 = \sum_{x_i \in X} \mathbb{E}\,\| x_i - \mathrm{Top}\text{-}1(x_i) \|^2 \le \sum_{x_i \in X} \left(1 - \frac{1}{d}\right) \| x_i \|^2 = \sum_{x_i \in X} \left(1 - \frac{1}{R}\right) \| x_i \|^2 = \left(1 - \frac{1}{R}\right) \|X\|^2    (15)

where x_i denotes the submatrix covered by each filter.
References
[1] David Harwath, Antonio Torralba, and James Glass. Unsupervised
learning of spoken language with visual context. Advances in Neural
Information Processing Systems, 29, 2016.
[2] Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. Deep learning
based recommender system: A survey and new perspectives. ACM
Computing Surveys (CSUR), 52(1):1–38, 2019.
[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep
residual learning for image recognition. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 770–
778, 2016.
[4] Karen Simonyan and Andrew Zisserman. Very deep convolu-
tional networks for large-scale image recognition. arXiv preprint
arXiv:1409.1556, 2014.
[5] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and
Animashree Anandkumar. signsgd: Compressed optimisation for
non-convex problems. In International Conference on Machine
Learning, pages 560–569. PMLR, 2018.
[6] Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Spar-
sified sgd with memory. Advances in Neural Information Processing
Systems, 31, 2018.
[7] Dan Alistarh, Torsten Hoefler, Mikael Johansson, Nikola Konstanti-
nov, Sarit Khirirat, and Cedric Renggli. The convergence of sparsi-
fied gradient methods. Advances in Neural Information Processing
Systems, 31, 2018.
[8] István Hegedűs, Gábor Danner, and Márk Jelasity. Decentralized
learning works: An empirical comparison of gossip learning and
federated learning. Journal of Parallel and Distributed Computing,
148:109–124, 2021.
[9] Daniel Rosendo, Alexandru Costan, Patrick Valduriez, and Gabriel
Antoniu. Distributed intelligence on the edge-to-cloud continuum:
A systematic literature review. Journal of Parallel and Distributed
Computing, 166:71–94, 2022.
[10] Andrzej Goscinski, Flavia C. Delicato, Giancarlo Fortino, Anna
Kobusińska, and Gautam Srivastava. Special issue on distributed
intelligence at the edge for the future internet of things. Journal of
Parallel and Distributed Computing, 171:157–162, 2023.
[11] Blake E Woodworth, Kumar Kshitij Patel, and Nati Srebro. Minibatch
vs local sgd for heterogeneous distributed learning. Advances in
Neural Information Processing Systems, 33:6281–6292, 2020.
[12] Priya Goyal, Piotr Dollar, Ross Girshick, Pieter Noordhuis, Lukasz
Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaim-
ing He. Accurate, large minibatch sgd: Training imagenet in 1 hour.
arXiv preprint arXiv:1706.02677, 2017.
[13] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson,
and Blaise Aguera y Arcas. Communication-efficient learning of
deep networks from decentralized data. In Artificial intelligence and
statistics, pages 1273–1282. PMLR, 2017.
[14] Ananda Theertha Suresh, X Yu Felix, Sanjiv Kumar, and H Brendan
McMahan. Distributed mean estimation with limited communication.
In International conference on machine learning, pages 3329–3337.
PMLR, 2017.
[15] Longxin Lin, Zhenxiong Xu, Chien-Ming Chen, Ke Wang, Md. Rafiul
Hassan, Md. Golam Rabiul Alam, Mohammad Mehedi Hassan, and
Giancarlo Fortino. Understanding the impact on convolutional neural
networks with different model scales in aiot domain. Journal of
Parallel and Distributed Computing, 170:1–12, 2022.
[16] Shuo Ouyang, Dezun Dong, Yemao Xu, and Liquan Xiao. Commu-
nication optimization strategies for distributed deep neural network
training: A survey. Journal of Parallel and Distributed Computing,
149:52–65, 2021.
[17] Yaser Mansouri and M. Ali Babar. A review of edge computing: Fea-
tures and resource virtualization. Journal of Parallel and Distributed
Computing, 150:155–183, 2021.
[18] Felix Sattler, Simon Wiedemann, Klaus-Robert Muller, and Wojciech
Samek. Robust and communication-efficient federated learning from
non-iid data. IEEE transactions on neural networks and learning
systems, 31(9):3400–3413, 2019.
[19] Farzin Haddadpour, Mohammad Mahdi Kamani, Aryan Mokhtari,
and Mehrdad Mahdavi. Federated learning with compression: Uni-
fied analysis and sharp guarantees. In International Conference on
Artificial Intelligence and Statistics, pages 2350–2358. PMLR, 2021.
[20] Jay H Park, Sunghwan Kim, Jinwon Lee, Myeongjae Jeon, and
Sam H Noh. Accelerated training for cnn distributed deep learning
through automatic resource-aware layer placement. arXiv preprint
arXiv:1901.05803, 2019.
[21] Enmao Diao, Jie Ding, and Vahid Tarokh. Heterofl: Computation and
communication efficient federated learning for heterogeneous clients.
arXiv preprint arXiv:2010.01264, 2020.
[22] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and
Yoshua Bengio. Quantized neural networks: Training neural networks
with low precision weights and activations. The Journal of Machine
Learning Research, 18(1):6869–6898, 2017.
[23] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally.
Deep gradient compression: Reducing the communication bandwidth
for distributed training. arXiv preprint arXiv:1712.01887, 2017.
[24] Alham Fikri Aji and Kenneth Heafield. Sparse communication for
distributed gradient descent. arXiv preprint arXiv:1704.05021, 2017.
[25] Hongyi Wang, Scott Sievert, Shengchao Liu, Zachary Charles, Dim-
itris Papailiopoulos, and Stephen Wright. Atomo: Communication-
efficient learning via atomic sparsification. Advances in Neural
Information Processing Systems, 31, 2018.
[26] Enda Yu, Dezun Dong, Yemao Xu, Shuo Ouyang, and Xiangke Liao.
Cp-sgd: Distributed stochastic gradient descent with compression
and periodic compensation. Journal of Parallel and Distributed
Computing, 169:42–57, 2022.
[27] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran
Chen, and Hai Li. Terngrad: Ternary gradients to reduce communi-
cation in distributed deep learning. Advances in neural information
processing systems, 30, 2017.
[28] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and
Martin Jaggi. Error feedback fixes signsgd and other gradient com-
pression schemes. In International Conference on Machine Learning,
pages 3252–3261. PMLR, 2019.
[29] Nikita Ivkin, Daniel Rothchild, Enayat Ullah, Ion Stoica, Raman
Arora, et al. Communication-efficient distributed sgd with sketching.
Advances in Neural Information Processing Systems, 32, 2019.
[30] Kunihiko Fukushima and Sei Miyake. Neocognitron: A self-
organizing neural network model for a mechanism of visual pattern
recognition. In Competition and cooperation in neural nets, pages
267–285. Springer, 1982.
[31] Eduard Gorbunov, Dmitry Kovalev, Dmitry Makarenko, and Peter
Richtarik. Linearly converging error compensated sgd. Advances
in Neural Information Processing Systems, 33:20889–20900, 2020.
[32] Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-
term memory-networks for machine reading. arXiv preprint
arXiv:1601.06733, 2016.
[33] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry
Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub
Konecy, Stefano Mazzocchi, Brendan McMahan, et al. Towards
federated learning at scale: System design. Proceedings of Machine
Learning and Systems, 1:374–388, 2019.
[34] Peng Jiang and Gagan Agrawal. A linear speedup analysis of
distributed deep learning with sparse and quantized communication.
Advances in Neural Information Processing Systems, 31, 2018.
[35] Farzin Haddadpour, Mohammad Mahdi Kamani, Aryan Mokhtari,
and Mehrdad Mahdavi. Federated learning with compression: Unified
analysis and sharp guarantees. 130:2350–2358, 2021.
[36] Beznosikov Aleksandr, Horváth Samuel, Richtárik Peter, and Sa-
faryan Mher. On biased compression for distributed learning. 2020.
[37] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao.
Yolov4: Optimal speed and accuracy of object detection. arXiv
preprint arXiv:2004.10934, 2020.
[38] Tao Kong, Fuchun Sun, Huaping Liu, Yuning Jiang, Lei Li, and
Jianbo Shi. Foveabox: Beyound anchor-based object detection. IEEE
Transactions on Image Processing, 29:7389–7398, 2020.
[39] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath
Hariharan, and Serge Belongie. Feature pyramid networks for object
detection. In Proceedings of the IEEE conference on computer vision
and pattern recognition, pages 2117–2125, 2017.
[40] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet
classification with deep convolutional neural networks. Communica-
tions of the ACM, 60(6):84–90, 2017.
[41] Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simpler
approach to obtaining an o (1/t) convergence rate for the projected
stochastic subgradient method. arXiv preprint arXiv:1212.2002,
2012.
[42] Li Deng. The mnist database of handwritten digit images for machine
learning research [best of the web]. IEEE signal processing magazine,
29(6):141–142, 2012.
[43] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet
classification with deep convolutional neural networks. Communica-
tions of the ACM, 60(6):84–90, 2017.
[44] Libin Shen, Giorgio Satta, and Aravind Joshi. Guided learning for
bidirectional sequence classification. In ACL, volume 7, pages 760–
767, 2007.
[45] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural
networks. IEEE transactions on Signal Processing, 45(11):2673–
2681, 1997.
[46] Milan Straka and Jana Strakova. Tokenizing, pos tagging, lemmatiz-
ing and parsing ud 2.0 with udpipe. In Proceedings of the CoNLL
2017 shared task: Multilingual Parsing from raw text to universal
dependencies, pages 88–99, 2017.
Humans learn to speak before they can read or write, so why can't computers do the same? In this paper, we present a deep neural network model capable of rudimentary spoken language acquisition using untranscribed audio training data, whose only supervision comes in the form of contextually relevant visual images. We describe the collection of our data comprised of over 120,000 spoken audio captions for the Places image dataset and evaluate our model on an image search and annotation task. We also provide some visualizations which suggest that our model is learning to recognize meaningful words within the caption spectrograms.