Communication-efficient Distributed Stochastic Gradient Descent with
Pooling Operator
Zhengao Cai a, Aiguo Chen a,b,∗, Yi Luo c and Jiahao Li a
a Shenzhen Institute of Advanced Study, University of Electronic Science and Technology of China, China
b School of Computer Science and Engineering, University of Electronic Science and Technology of China, China
c School of Information and Software Engineering, University of Electronic Science and Technology of China, China
ARTICLE INFO
Keywords:
Distributed Machine Learning
Distributed SGD
Communication Efficient SGD
Gradient Sparsification
ABSTRACT
Training deep neural networks on large datasets can be accelerated by distributing the computation
across multiple worker nodes. The distributed stochastic gradient descent (D-SGD) algorithm combined
with gradient sparsification is a typical algorithm for this distributed training model: it guarantees
good convergence and can significantly reduce communication bandwidth. However, we find that
existing gradient sparsification algorithms ignore the local correlation between gradient values
under certain conditions, which can incur a loss of accuracy because the update region of the global
parameters stays in a particular area for many epochs. Based on this observation, we introduce the
pooling operator to solve this problem and combine it with the error-feedback method to design a new
communication-efficient distributed gradient descent algorithm. We prove that our algorithm converges
at the same rate as vanilla SGD when equipped with error feedback. Experiments on a CNN, an LSTM,
and a ResNet model demonstrate that our algorithm can compress the number of uploaded gradient
bits by two or three orders of magnitude while preserving high model accuracy.
1. Introduction
In recent years, deep neural networks (DNNs) have been
used as powerful tools for machine learning and artificial
intelligence in a large number of domains[1,2], especially
in computer vision and natural language processing[3,4].
With the increasing volume of training data and the growing
complexity of DNN structures, distributed machine
learning[5,6,7,8,9,10] is widely used to train these
large-scale DNN models. It is a multi-node system in which a
server node and multiple worker nodes collaborate to run
the Distributed Stochastic Gradient Descent (D-SGD)
algorithm. The server node is responsible for scheduling
the model's parameters and the worker nodes' behavior[11,12].
Each worker node uses its local data to compute the gradient of
the current parameters and transmits the computed gradient
matrix to the server node.
However, D-SGD suffers from one main drawback: the
communication overhead has become a bottleneck in
training large-scale DNN models[13,14,5,15,16].
For instance, ResNet[3,17] has more than 25 million
parameters, and the communication overhead may be in
the range of gigabytes per epoch[18,19,20,21,15], so
the cost of transmitting the gradient can be
prohibitive[22]. Consequently, a number of gradient
compression methods have been proposed to reduce the volume
of gradient data and the time spent transmitting it
while preserving the convergence of the model[23,24,25,14,6,26].
Gradient compression methods are mainly divided into
two types: gradient quantization and gradient sparsification.
The quantization method uses the Sign operator to quantize
the original 32-bit or 64-bit floating-point numbers to 1 bit. Due to the
nature of computer character encoding, the quantization method
cannot obtain a gradient compression rate of more than 64
times[5,27,23,28]. Another class
of gradient compression methods, gradient sparsification,
is more feasible. Typically, the Top-K sparsification
algorithm[6,24] can compress the uploaded gradient traffic
by two or three orders of magnitude, and a large number of
papers[6,7] have shown that it can obtain the same convergence
and accuracy as the typical D-SGD algorithm for both
convex and non-convex optimization problems. However,
for some large-scale DNN models (e.g., ResNet[3]) with a
high sparsification rate, the Top-K algorithm may lead
to a severe accuracy penalty and worse convergence. In this
case, the iteration epochs of the Top-K algorithm must be
multiplied to achieve the same accuracy as the typical D-SGD
algorithm[6,29].
∗ Corresponding author: agchen@uestc.edu.cn (A. Chen)
One reason for the convergence degradation is that the
Top-K operator, in some cases, entirely discards the local
correlation features of the gradient when sparsifying the gradient
matrix. For highly locally correlated data such as images, there
is also a local correlation in the gradients obtained during the
training of the DNN. Since the Top-K operator
naturally retains only the K largest absolute values in the
gradient matrix, the selected gradient
regions may stay in a local region for many epochs, while
the parameters in the other regions stop updating.
Based on this observation, we exploited the dimensionality-reduction
operation of pooling operators in convolutional
neural networks (CNN)[30] and introduced the pooling
operators (MaxPool and RandPool) to sparsify the gradient
matrix. The MaxPool operator filters out the largest-absolute-value
gradient element from the region covered by the filter, rather
than the K largest values from the whole gradient. The
RandPool operator filters out a random element covered by
the filter. This approach preserves the local correlation
features of the gradient while sparsifying the gradient matrix
and eliminates the long regional adhesion that the Top-K
operator may cause. Note that our proposed MaxPool
operator differs from the typical MaxPool operator in that the
results obtained by the typical MaxPool operator have no
negative values. Since the gradient has negative values, we
modify the typical MaxPool operator by setting the selected
elements to the ones with the largest absolute values. The
MaxPool operator referred to later in this article is the one
modified in this way.
First Author et al.: Preprint submitted to Elsevier. Page 1 of 9
This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4327869
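As a concrete illustration, the modified operator described above can be sketched as follows. This is a minimal NumPy sketch under our own assumptions (a 2D gradient matrix whose dimensions are divisible by R); the name `signed_maxpool` is ours, not the paper's:

```python
import numpy as np

def signed_maxpool(grad, R):
    """Modified MaxPool: in each R x R filter region, keep the element with
    the largest *absolute* value (preserving its sign), zero out the rest.
    Assumes grad's dimensions are divisible by R."""
    h, w = grad.shape
    out = np.zeros_like(grad)
    for i in range(0, h, R):
        for j in range(0, w, R):
            block = grad[i:i + R, j:j + R]
            # index of the largest-|.| entry inside this filter window
            bi, bj = np.unravel_index(np.argmax(np.abs(block)), block.shape)
            out[i + bi, j + bj] = block[bi, bj]
    return out

g = np.array([[0.1, -0.9],
              [0.2,  0.3]])
print(signed_maxpool(g, 2))  # keeps -0.9; a standard MaxPool would pick 0.3
```

The toy call shows exactly the difference the paper points out: a standard MaxPool would select 0.3, whereas the modified operator keeps the negative entry -0.9 because it has the largest magnitude.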
The pooling operators can compress the gradient matrix
by two or three orders of magnitude without a severe
accuracy penalty. We also prove that the pooling
sparsification algorithm obtains a linear convergence
rate when combined with the error-feedback method[31]
commonly used in gradient compression. The experimental
results demonstrate that, for the same compression ratio, our
algorithm achieves the same communication savings and better
convergence and accuracy on computer-vision DNNs (e.g., CNN,
ResNet) compared with the existing sparsification schemes.
For typical recurrent neural networks (RNNs) in natural
language processing (e.g., LSTM[32]), our algorithm obtains a
convergence rate and accuracy similar to the Top-K algorithm.
The remainder of this paper is organized as follows.
Section 2 presents research related to our algorithm.
In Section 3, we introduce the rationale and details of our
algorithm and present its proofs. In Section 4, we
experimentally demonstrate the advantages of our algorithm
over the currently proven feasible and generally optimal
algorithms. Finally, we conclude with a discussion in
Section 5.
2. Related Work
Minibatch-SGD. Many D-SGD algorithms optimize
large-scale DNNs in distributed machine-learning
environments, and Minibatch-SGD is currently the most
popular one. In the Minibatch-SGD algorithm, the server
sends the DNN structure and global parameters to each
worker node. When a worker node receives the parameters,
it uses its local data to compute multiple stochastic
gradients and sends their average
to the server. Finally, the server aggregates these gradients
to obtain the global parameters for the next epoch[11]. Other
optimization algorithms for DNNs in distributed machine
learning, such as Local-SGD[12], have been shown to perform
much worse than Minibatch-SGD in many cases[11].
Top-K and Rand-K Sparsification. The key idea of the
Top-K sparsification algorithm is to discard some "useless"
(tiny and less helpful for updating the global model) gradient
values via the Top-K operator. Researchers found that
99% of the small gradients in the stochastic gradient descent
algorithm are not very useful for updating the global model.
Therefore, the role of the Top-K operator is to filter the
gradient matrix and transmit its K largest absolute values to the
global server[24,33,34]. Also, several recent papers have
theoretically indicated that such Top-K sparsification
algorithms are capable of obtaining linear convergence rates
in convex and nonconvex problems[35,7,36]. The Rand-K
algorithm differs from the Top-K algorithm
only in that, when sparsifying the matrix, the transmitted
elements are selected randomly instead of as the K elements
with the largest absolute values.
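The two operators can be sketched as follows, for a flattened (1D) gradient vector. This is a minimal NumPy sketch; the function names are ours:

```python
import numpy as np

def top_k(x, k):
    """Keep the k entries of x with the largest absolute value; zero the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]      # indices of the k largest |x_i|
    out[idx] = x[idx]
    return out

def rand_k(x, k, rng):
    """Keep k uniformly chosen entries of x; zero the rest."""
    out = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)
    out[idx] = x[idx]
    return out

x = np.array([0.05, -2.0, 0.3, 1.5, -0.1, 0.02])
print(top_k(x, 2))   # only -2.0 and 1.5 survive
```

For this vector, the discarded mass satisfies the contraction bound proved later in Lemma A.1: the residual energy is at most a (1 - k/d) fraction of the total.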
Error-feedback. In the traditional sparsification method,
each worker node transmits only the selected elements to the
global server in each communication epoch; the elements that
are not selected are discarded immediately. Inspired by the
traditional momentum approach of stochastic gradient descent,
a typical practice is to record the gradient residuals that are
not transmitted to the global server in the current epoch and
add them to the newly computed result in the next epoch.
Experiments have shown that this error-feedback method can
significantly improve the final model's convergence rate and
accuracy[6,31,26]. Error feedback is also known as error
compensation, gradient with memory, and gradient
residuals[31,26].
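One worker-side epoch of this error-feedback pattern can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation; the Top-1 compressor is only a stand-in for any sparsifier:

```python
import numpy as np

def step_with_error_feedback(grad, memory, compress):
    """One worker-side step of error feedback: add the residual carried over
    from previous epochs, compress, and store what was left out."""
    corrected = memory + grad
    sent = compress(corrected)
    new_memory = corrected - sent   # gradient residual kept locally
    return sent, new_memory

def top1(v):
    """Toy compressor: keep only the single largest-|.| coordinate."""
    out = np.zeros_like(v)
    i = np.argmax(np.abs(v))
    out[i] = v[i]
    return out

m = np.zeros(3)
sent, m = step_with_error_feedback(np.array([0.5, 0.1, 0.2]), m, top1)
# 0.1 and 0.2 are not lost: they wait in memory for the next epoch
sent2, m = step_with_error_feedback(np.array([0.0, 0.3, 0.0]), m, top1)
print(sent2)  # the accumulated value in coordinate 1 is now transmitted
```

The second step shows why the method is also called "gradient with memory": the 0.1 left out of the first epoch combines with the new 0.3 and wins the selection in the next epoch.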
3. Our Method
3.1. Observation of gradient space-correlation and algorithm insight
It is well known that, in the study of DNNs for computer
vision, the local correlation of the image cannot be ignored,
so researchers designed pooling and convolution operators to
distill the overall image information without ignoring the
implicit local information of the image. Subsequent
research on techniques such as anchor boxes and the Feature
Pyramid Network[37,38,39] also proved the importance of the
local correlation of the image. Based on this phenomenon, we
find that there is also a correlation between the gradient and
the training data during the training of DNN models, and the
gradient values obtained during the backpropagation process
are not randomly and disorderly distributed.
We designed a simple experiment on a single-layer CNN
model trained with some typical simple images;
the results are shown in Figure 1. Apparently, for a triangle
in a binary image, the gradients have great similarity to
the original image from the first epoch to the last.
The gradients of the remaining typical images also follow
distributions similar to the original images. This suggests
that there is also a local correlation between the gradient
values, and it also explains the loss of accuracy and
convergence degradation incurred by the Top-K algorithm in
some specific cases. When the learning rate of the local
gradient descent step run by each worker node is low and
the sparsity rate of the Top-K algorithm is high, the gradient
values selected for transmission may stay in a region with
larger absolute values for many global epochs, so the global
parameters of the other regions may not be updated
in time.
Figure 1: The training data and the gradients of the corresponding layer. For each training image at the top, the
10×10 gradient-value matrix of the convolution layer is visualized by the corresponding image below, with darker colors
indicating larger values.
Figure 2: The Principle of the MaxPool Operator with stride = 2 and kernel-size = 2
To address this adhesion phenomenon, we introduce the
pooling operator commonly used in CNNs to replace the
Top-K operator. The extensive use of the pooling operator in
CNNs stems from the local correlation of images, and
the pooling operator can compress images while maintaining
their local correlation features[40]. Based on this, we exploit
the dimensionality-reduction operation of
pooling operators to compress the gradient matrix.
We modified the Minibatch-SGD algorithm so that each
worker node compresses its computed gradient before
sending it to the global server, which then performs a
simple unpooling operation to restore the matrix to its
original size. We also introduce the error-feedback method into
our algorithm: the values of the gradient entries not selected
in each epoch are saved locally and added to the newly
computed gradient matrix in the next epoch.
3.2. Details and Proofs of Pooling Algorithm
Generally, the distributed machine learning model is
often expressed in the following optimization form. For the
input x ∈ ℝ^d and objective function f: ℝ^d → ℝ:

    f(x) = (1/n) ∑_{i=1}^{n} f_i(x),   x* := argmin_{x∈ℝ^d} f(x),   f* := f(x*)        (1)

where each f_i is L-smooth and μ-strongly convex:

    f_i(y) ≤ f_i(x) + ⟨∇f_i(x), y − x⟩ + (L/2)‖y − x‖²,   ∀x, y ∈ ℝ^d, i ∈ [n]        (2)

    f_i(y) ≥ f_i(x) + ⟨∇f_i(x), y − x⟩ + (μ/2)‖y − x‖²,   ∀x, y ∈ ℝ^d, i ∈ [n]        (3)
The above optimization problem is usually solved with
Minibatch-SGD[11]. For simplicity, consider
the Minibatch-SGD algorithm in the following form:

    x^{k+1} = x^k − γ ∑_{i∈𝕄} ∇f(x^k_i)        (4)

where x^k denotes the global parameters of the k-th epoch, x^k_i
is the local parameter of the k-th epoch of the i-th worker
node, γ is the learning rate, and ∇f(x^k_i) is the gradient at x^k_i.
The target to sparsify is ∇f(x^k_i): this gradient
is transmitted to the global server for updating
the global parameters x^k in each epoch, which is the main
communication bottleneck[6].
The pooling operator compresses the gradient matrix ∇f(x^k_i)
according to the layer structure before uploading. For a
pooling filter with stride = R and kernel-size = R, the original
gradient matrix can be compressed to 1/R² of its original
size. The gradient matrix g^k_i obtained by this operation is
submitted to the server via the network, and the server then
performs an UnPool operation to restore it to its original size.
In addition, the elements α∇f(x^k_i) − g^k_i that are not transmitted
are added to the current gradient residual m^k_i to form the
next epoch's gradient residual m^{k+1}_i, which can greatly speed up
the training process.
The formal representation is as follows:

    x^{k+1} = x^k − γ ∑_{i∈𝕄} UnPool(g^k_i)        (5)

    g^k_i = MaxPool(m^k_i + α∇f(x^k_i))        (6)

    m^{k+1}_i = m^k_i + α∇f(x^k_i) − g^k_i        (7)

where the compression operator MaxPool_R: ℝ^d → ℝ^d is defined, for X ∈ ℝ^d, as

    MaxPool_R(X)_{i,j} := x_{i,j} if x_{i,j} is the maximum absolute value in its pooling filter, and 0 otherwise        (8)
and each symbol has the same meaning as in Equation 1.
The principle of the pooling operator is sketched in Figure 2.
Since the RandPool operator is very similar
to the MaxPool operator, we omit its details.
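A minimal sketch of the client-side compression of Equation 8 and a matching server-side UnPool is given below. We assume here that the flat indices of the kept entries are transmitted alongside the values, as in standard max-unpooling; the paper does not spell out its UnPool implementation, so this is one plausible reading:

```python
import numpy as np

def maxpool_compress(grad, R):
    """Client side: per R x R filter, keep the signed max-|.| value.
    Returns the compressed values plus their flat indices so the server
    can restore positions (assumed; analogous to torch's MaxUnpool)."""
    h, w = grad.shape
    vals, idxs = [], []
    for i in range(0, h, R):
        for j in range(0, w, R):
            block = grad[i:i + R, j:j + R]
            bi, bj = np.unravel_index(np.argmax(np.abs(block)), block.shape)
            vals.append(block[bi, bj])
            idxs.append((i + bi) * w + (j + bj))
    return np.array(vals), np.array(idxs)

def unpool(vals, idxs, shape):
    """Server side: scatter the received values back into a zero matrix."""
    out = np.zeros(shape)
    out.flat[idxs] = vals
    return out

g = np.arange(16, dtype=float).reshape(4, 4) - 8.0   # toy 4x4 gradient
vals, idxs = maxpool_compress(g, 2)                  # 16 floats -> 4 floats
restored = unpool(vals, idxs, g.shape)
```

Note that transmitting indices adds roughly log2(R²) bits per kept value on top of the 1/R² value compression; this overhead is shared by Top-K style schemes as well.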
The compression operator MaxPool_R: ℝ^d → ℝ^d satisfies the
contraction property for pooling stride = R and kernel-size = R:

    E‖X − MaxPool_R(X)‖² ≤ (1 − 1/R)‖X‖²,   ∀X ∈ ℝ^d        (9)

The specific steps of the proof are shown in Appendix B.1, which then yields the result in Equation 10:

    E[f(x^T)] − f* ≤ O( G²/(μT) + R²G²/(μT²) + R³G²/(μT³) )        (10)

where κ = L/μ and μ‖x⁰ − x*‖ ≤ 2G. For L-smooth (see
Equation 2) and μ-strongly convex (see Equation 3) f, and
for carefully chosen learning rates, we observe that for
T = Ω(κR^{1/2}) the dominating term of the convergence rate is
G²/(μT), as derived in [[6], Remark 2.6], the same rate as
vanilla SGD[41].
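The contraction property (9) can be checked numerically. The sketch below is our own illustration using 1D filters of R consecutive elements, matching the decomposition into Top-1 operators used in Appendix B.1; for such filters, the (1 − 1/R) factor is exactly attained by a uniform vector:

```python
import numpy as np

def maxpool_1d(x, R):
    """Keep the signed max-|.| element in each consecutive window of R entries."""
    out = np.zeros_like(x)
    for s in range(0, x.size, R):
        w = x[s:s + R]
        j = np.argmax(np.abs(w))
        out[s + j] = w[j]
    return out

rng = np.random.default_rng(0)
R = 5
for _ in range(1000):
    x = rng.normal(size=20)
    residual = np.sum((x - maxpool_1d(x, R)) ** 2)
    # in each window the kept square is >= the window's mean square,
    # so the residual keeps at most a (1 - 1/R) fraction of the energy
    assert residual <= (1 - 1 / R) * np.sum(x ** 2) + 1e-12

# worst case: a uniform vector loses exactly a (1 - 1/R) fraction
u = np.ones(20)
assert np.isclose(np.sum((u - maxpool_1d(u, R)) ** 2), (1 - 1 / R) * 20)
```

The bound holds because the kept element's square is at least the mean of the squares in its window, so each window retains at least a 1/R fraction of its energy.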
The pooling algorithm is presented in Algorithm 1 and
Algorithm 2. Each worker node computes the gradient using
its local dataset, performs the pooling operator to obtain the
compressed gradient g^k_i, and uses m^k_i to store the gradient
residual to accelerate the training process. Finally, it sends
the compressed gradient g^k_i to the global server for updating
the global parameter x^{k+1}. The server continuously receives
the compressed gradients g^k_i, performs the UnPool operation
to obtain the estimated gradient, and executes the gradient
descent step to update the global parameters x^{k+1}.
Algorithm 1 Distributed Parallel Pooling-SGD with Memory (Server)
Require: global model x⁰, global learning rate γ
Ensure: trained global model
for global epoch t in 0...T do
    for each client i do
        ∇^t_i ⇐ PoolSGDClient_i(x^t)
        // get the compressed matrix from each client i
        ∇^t_i ⇐ UnPool(∇^t_i)
        // perform the UnPool operation to obtain an estimate of the gradient matrix
    end for
    x^{t+1} ⇐ x^t − γ ∑_{i=1}^{M} ∇^t_i
    // execute the parallel stochastic gradient descent step
end for
Algorithm 2 PoolSGDClient
Require: client model x^t_i, local learning rate γ_i
Ensure: compressed gradient matrix
Initialize gradient memory m^t_i, pooling stride R, and α
1: compute the local gradient ∇f(x^t_i)
2: g^t_i ⇐ Pool_R(m^t_i + α∇f(x^t_i))
    // compute the compressed gradient matrix
3: m^{t+1}_i ⇐ m^t_i + α∇f(x^t_i) − g^t_i
    // compute the gradient residual
4: Server ⇐ g^t_i
    // send the compressed gradient matrix to the global server
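Algorithms 1 and 2 can be exercised end to end on a toy problem. The sketch below is ours, not the paper's code: it runs distributed least squares with 1D pooling windows, α, γ, and the problem sizes are illustrative choices, and the UnPool step is the identity because the sparse vector already has full length:

```python
import numpy as np

# Toy run of Algorithms 1-2 on f_i(x) = 0.5*||A_i x - b_i||^2 / m_i,
# using 1D pooling windows of R elements and error-feedback memory.

def pool_1d(x, R):
    """Keep the signed max-|.| element in each window of R entries."""
    out = np.zeros_like(x)
    for s in range(0, x.size, R):
        w = x[s:s + R]
        j = np.argmax(np.abs(w))
        out[s + j] = w[j]
    return out

rng = np.random.default_rng(1)
d, n_clients, R, gamma, alpha = 20, 4, 5, 0.05, 1.0
x_true = rng.normal(size=d)
A = [rng.normal(size=(30, d)) for _ in range(n_clients)]
b = [A[i] @ x_true for i in range(n_clients)]        # realizable targets

x = np.zeros(d)
memory = [np.zeros(d) for _ in range(n_clients)]     # per-client residuals

for epoch in range(200):
    update = np.zeros(d)
    for i in range(n_clients):
        grad = A[i].T @ (A[i] @ x - b[i]) / len(b[i])   # local gradient
        g = pool_1d(memory[i] + alpha * grad, R)        # Algorithm 2, line 2
        memory[i] += alpha * grad - g                   # Algorithm 2, line 3
        update += g      # server receives g; UnPool is the identity here
    x -= gamma * update  # Algorithm 1 server step

loss0 = sum(0.5 * np.sum(b[i] ** 2) for i in range(n_clients))
loss = sum(0.5 * np.sum((A[i] @ x - b[i]) ** 2) for i in range(n_clients))
```

Despite sending only 1/R of the gradient entries per epoch, the error-feedback memory lets the iterate approach the least-squares solution, which is the qualitative behavior the convergence result above predicts.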
4. Experiments
4.1. Experimental Setup
Implementation. We designed several experiments based
on the Minibatch-SGD framework[11]. We set up 100 worker nodes with
independently and identically distributed datasets. Each
worker node performs multiple backpropagation operations
on its local dataset to obtain the gradient of the current
epoch and then runs the communication-efficient
distributed gradient descent algorithm of Algorithm 1
and Algorithm 2 with a learning rate of 0.05. Each experiment
uses 50 epochs, ensuring that our algorithm performs
the same number of backpropagation and gradient descent
operations as the Top-K and Rand-K algorithms at the same
compression rate.
Task and Dataset. To cover a broad spectrum of deep
learning problems, we consider several typical CNN and
RNN models on different typical data sets, including
MNIST[42], Cifar-10[43], and a Part-of-Speech tagging corpus. To verify
the correctness of our algorithm on CNN models, we run the
task on the MNIST data set with a simple CNN model and on the
Cifar-10 data set with the ResNet-20[3] model. To verify the
correctness of our algorithm on RNN models, we run the
Part-of-Speech tagging task with the BiLSTM model[44,45] on the
UDPOS dataset[46].
Figure 3: The effects of the same compression rate on the convergence rate of different algorithms (i. Minibatch-SGD;
ii. Top-K SGD with or without memory; iii. MaxPool SGD with or without memory; iv. Rand-K SGD with or without memory;
v. RandPool SGD with or without memory) for the MNIST dataset with the CNN model. In (a), the compression rate is set to
0.04 except for Minibatch-SGD; in (b), it is 0.01 except for Minibatch-SGD. Since the random selection algorithms
(Rand-K and RandPool) work poorly without gradient residual memory, all subsequent experiments with the random selection
algorithms are set up only with gradient residual memory acceleration.
Compression rate 1/25:
    Compress operator              Test Accuracy
    none                           96.78 %
    5 MaxPool                      95.28 %
    5 MaxPool with Memory          95.56 %
    TopK 0.04                      93.15 %
    TopK 0.04 with Memory          95.54 %
    5 RandPool                     40.80 %
    5 RandPool with Memory         94.16 %
    RandK 0.04                     32.78 %
    RandK 0.04 with Memory         94.35 %

Compression rate 1/100:
    Compress operator              Test Accuracy
    none                           96.78 %
    10 MaxPool                     94.10 %
    10 MaxPool with Memory         95.04 %
    TopK 0.01                      89.99 %
    TopK 0.01 with Memory          94.06 %
    10 RandPool                    No convergence
    10 RandPool with Memory        93.05 %
    RandK 0.01                     No convergence
    RandK 0.01 with Memory         91.16 %

Table 1
Effects of the same compression rate on accuracy in different algorithms (i. Minibatch-SGD; ii. Top-K SGD with or without
memory; iii. MaxPool SGD with or without memory; iv. Rand-K SGD with or without memory; v. RandPool SGD with or without
memory) for the MNIST dataset with the CNN model.
Figure 4: The effects of the same compression rate on the convergence rate of different algorithms (i. Minibatch-SGD;
ii. Top-K SGD with or without memory; iii. MaxPool SGD with or without memory; iv. Rand-K SGD with or without memory;
v. RandPool SGD with or without memory) for the Cifar-10 dataset with the ResNet-20 model. In (a), the compression rate
is set to 0.04 except for Minibatch-SGD; in (b), it is 0.01 except for Minibatch-SGD.
Compression rate 1/25:
    Compress operator              Test Accuracy
    none                           71.58 %
    5 MaxPool                      59.88 %
    5 MaxPool with Memory          69.87 %
    TopK 0.04                      56.26 %
    TopK 0.04 with Memory          69.02 %
    5 RandPool with Memory         66.86 %
    RandK 0.04 with Memory         64.28 %

Compression rate 1/100:
    Compress operator              Test Accuracy
    none                           71.58 %
    10 MaxPool                     58.12 %
    10 MaxPool with Memory         67.33 %
    TopK 0.01                      43.28 %
    TopK 0.01 with Memory          66.08 %
    10 RandPool with Memory        66.86 %
    RandK 0.01 with Memory         52.74 %

Table 2
Effects of the same compression rate on accuracy in different algorithms (i. Minibatch-SGD; ii. Top-K SGD with or without
memory; iii. MaxPool SGD with or without memory; iv. Rand-K SGD with memory; v. RandPool SGD with memory) for the
experiments on the Cifar-10 dataset with the ResNet-20 model.
Baselines. The Top-K and Rand-K algorithms
are selected as comparisons with our algorithms: they
are used to compare the convergence rate and the accuracy
of the final model with ours for the same amount of
communication savings. The Minibatch-SGD algorithm is
also selected as the benchmark for optimal accuracy. The
RandPool operator is similar to the Rand-K operator; we
can also swap the MaxPool operator for the RandPool
operator when sparsifying the gradient matrix.
Compression Rate Settings. We choose two
communication compression rates, 1/25 and 1/100. For our
algorithms, the pooling stride R and kernel size are both
set equally to 5 and to 10, so the pooling operator
compresses the target gradient matrices to 1/5² and 1/10²
of their original sizes. For the Top-K and Rand-K algorithms, K
is set to 0.04 and 0.01 times the number of elements
in the original gradient matrix. This allows us to compare the
advantages of our algorithms over the Top-K and
Rand-K algorithms at the same compression ratio.
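The correspondence between the pooling strides and the Top-K ratios is simple arithmetic, sketched below for clarity:

```python
# A pooling filter with stride R and kernel size R keeps one value per
# R x R region, i.e. a fraction 1/R**2 of the gradient entries, so R = 5
# and R = 10 match Top-K ratios of 0.04 and 0.01 respectively.
for R in (5, 10):
    print(f"R = {R:2d}: keeps 1/{R**2} = {1 / R**2:.2%} of the entries")
```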
4.2. Results of several experiments
Convergence on the MNIST data set. The convergence
on the MNIST dataset with the CNN model is shown in Figure 3 and
Table 1. As described above, we conducted experiments against the
Minibatch-SGD algorithm with compression rates of 1/5² and 1/10²,
respectively. The results show that, with or without
gradient residual memory, in a typical CNN network our
algorithm obtains the same convergence rate as the Top-K
algorithm. The model loss of our algorithm is slightly
lower than that of the Top-K algorithm for the same training epoch,
and the final model accuracy is slightly higher than that of
the Top-K algorithm.
It can also be seen that the RandPool algorithm obtains better
results than the Rand-K algorithm, both with and without
gradient memory.
Convergence on the Cifar-10 data set. We perform
the image recognition task on the Cifar-10 dataset with the
ResNet model, as shown in Figure 4 and Table 2. Similarly,
we conducted experiments against the Minibatch-SGD algorithm
with compression rates of 1/5² and 1/10², respectively. The
experimental setup is similar to that of the
CNN model above. The results also show that the pooling operator
obtains relatively better convergence speed and final
accuracy, both with and without gradient residual memory.
The MaxPool algorithm with gradient residual memory comes
closest to the convergence rate of the Minibatch-SGD
algorithm.
Convergence on the POS task. To verify the performance
of our algorithm on RNN models, we select the
Part-of-Speech tagging task and run a bidirectional LSTM
neural network model. The results, shown in Figure 5
and Table 3, indicate that our algorithm and the Top-K
algorithm have the same performance on the BiLSTM model for the
same compression rate. The reason may be that the
gradient values have no local correlation for RNN models.
Summary. The experiments on CNN and RNN models
over different data sets indicate that our algorithm does
not damage model accuracy during training and outperforms
the Top-K algorithm. Also, our local random selection
strategy, the RandPool algorithm, outperforms the Rand-K
algorithm when the gradient elements to transmit are
selected randomly, although it does not work as
well as the MaxPool operator, as shown in Figure 3 and
Figure 4. From the comparison between the Rand-K
and RandPool methods, we can conclude that the RandPool
algorithm obtains relatively better results than the Rand-K
algorithm: the only difference between the two
operators is that the RandPool operator selects a random
element in each filter, and because there is
some correlation between the gradient values, the RandPool
operator better represents the global characteristics of the
original gradient matrix. This further supports why
the MaxPool operator obtains better results than the Top-K
operator.
5. Conclusions
In this work, we find one possible reason for the un-
stable experimental convergence of traditional sparsification
Figure 5: The effects of the same compression rate on the convergence rate of different algorithms (i. Minibatch-SGD;
ii. Top-K SGD with or without memory; iii. MaxPool SGD with or without memory; iv. Rand-K SGD with or without memory;
v. RandPool SGD with or without memory) for the UDPOS dataset with the BiLSTM model. In (a), the compression rate is set
to 0.04 except for Minibatch-SGD; in (b), it is 0.01 except for Minibatch-SGD.
Compression rate 1/25:
    Compress operator              End Loss
    none                           0.6672
    5 MaxPool                      0.6939
    5 MaxPool with Memory          0.6914
    TopK 0.04                      0.7154
    TopK 0.04 with Memory          0.7035
    5 RandPool with Memory         0.7125
    RandK 0.04 with Memory         0.7016

Compression rate 1/100:
    Compress operator              End Loss
    none                           0.6672
    10 MaxPool                     0.7497
    10 MaxPool with Memory         0.6837
    TopK 0.01                      0.7512
    TopK 0.01 with Memory          0.7116
    10 RandPool with Memory        1.0064
    RandK 0.01 with Memory         1.1218

Table 3
Effects of the same compression rate on the final loss in different algorithms (i. Minibatch-SGD; ii. Top-K SGD with or
without memory; iii. MaxPool SGD with or without memory; iv. Rand-K SGD with memory; v. RandPool SGD with memory) for the
experiments on the UDPOS dataset with the BiLSTM model.
algorithms: the gradient distribution is locally correlated
in some cases, and this correlation is ignored by traditional
sparsification operators. To address this, we introduced a
pooling operator and combined it with the error-feedback
method to design a gradient sparsification pooling algorithm.
We considered both CNN and RNN models with different
degrees of communication reduction: (i) CNN and
ResNet models on the MNIST and Cifar-10 datasets demonstrate
better convergence performance than the Top-K algorithm;
(ii) LSTM networks on the POS task perform similarly
to the Top-K algorithm. The results show that our algorithm
performs better than the Top-K algorithm on image-related
datasets and CNN models.
Declaration of competing interest
The authors declare that they have no known competing
financial interests or personal relationships that could have
appeared to influence the work reported in this paper.
Acknowledgement
This research work was supported by the National Nat-
ural Science Foundation of China (NSFC) (U19A2059),
and the Sichuan Science and Technology Program (No.
206999977).
Appendix
Lemma A.1. For x ∈ ℝ^d, 1 ≤ k ≤ d, and operator comp_k ∈ {top_k, rand_k}, it holds that

    E‖comp_k(x) − x‖² ≤ (1 − k/d)‖x‖²        (11)

Proof. By the definition of the operators, for all x ∈ ℝ^d we have

    ‖x − top_k(x)‖² ≤ ‖x − rand_k(x)‖²        (12)

and, applying the expectation,

    E_ω‖x − rand_k(x)‖² = (1/|Ω_k|) ∑_{ω∈Ω_k} ∑_{i=1}^{d} x_i² 𝕀{i ∉ ω}
                        = ∑_{i=1}^{d} x_i² ∑_{ω∈Ω_k} 𝕀{i ∉ ω} / |Ω_k|
                        = (1 − k/d)‖x‖²        (13)

which concludes the proof.
Proof B.1. For X ∈ ℝ^d and the operator MaxPool_R: ℝ^d → ℝ^d with pooling stride R and kernel size R, the contraction property holds:

    E‖X − MaxPool_R(X)‖² ≤ (1 − 1/R)‖X‖²,   ∀X ∈ ℝ^d        (14)

Proof. From the definitions of the MaxPool and Top-K operators, the MaxPool operator is the combination of d/R Top-1 operators: it simply takes the Top-1 absolute value in each filter with kernel size R. Hence

    E‖X − MaxPool_R(X)‖² = ∑_{x_i∈X} E‖x_i − Top-1(x_i)‖²
                         ≤ ∑_{x_i∈X} (1 − 1/R)‖x_i‖²
                         = (1 − 1/R)‖X‖²        (15)

where x_i denotes the submatrix covered by each filter.
References
[1] David Harwath, Antonio Torralba, and James Glass. Unsupervised
learning of spoken language with visual context. Advances in Neural
Information Processing Systems, 29, 2016.
[2] Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. Deep learning
based recommender system: A survey and new perspectives. ACM
Computing Surveys (CSUR), 52(1):1–38, 2019.
[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep
residual learning for image recognition. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 770–
778, 2016.
[4] Karen Simonyan and Andrew Zisserman. Very deep convolutional
networks for large-scale image recognition. arXiv preprint
arXiv:1409.1556, 2014.
[5] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and
Animashree Anandkumar. signsgd: Compressed optimisation for
non-convex problems. In International Conference on Machine
Learning, pages 560–569. PMLR, 2018.
[6] Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi.
Sparsified SGD with memory. Advances in Neural Information
Processing Systems, 31, 2018.
[7] Dan Alistarh, Torsten Hoefler, Mikael Johansson, Nikola
Konstantinov, Sarit Khirirat, and Cedric Renggli. The convergence of
sparsified gradient methods. Advances in Neural Information
Processing Systems, 31, 2018.
[8] István Hegedűs, Gábor Danner, and Márk Jelasity. Decentralized
learning works: An empirical comparison of gossip learning and
federated learning. Journal of Parallel and Distributed Computing,
148:109–124, 2021.
[9] Daniel Rosendo, Alexandru Costan, Patrick Valduriez, and Gabriel
Antoniu. Distributed intelligence on the edge-to-cloud continuum:
A systematic literature review. Journal of Parallel and Distributed
Computing, 166:71–94, 2022.
[10] Andrzej Goscinski, Flavia C. Delicato, Giancarlo Fortino, Anna
Kobusińska, and Gautam Srivastava. Special issue on distributed
intelligence at the edge for the future internet of things. Journal of
Parallel and Distributed Computing, 171:157–162, 2023.
[11] Blake E Woodworth, Kumar Kshitij Patel, and Nati Srebro. Minibatch
vs local sgd for heterogeneous distributed learning. Advances in
Neural Information Processing Systems, 33:6281–6292, 2020.
[12] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[13] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.
[14] Ananda Theertha Suresh, X Yu Felix, Sanjiv Kumar, and H Brendan
McMahan. Distributed mean estimation with limited communication.
In International conference on machine learning, pages 3329–3337.
PMLR, 2017.
[15] Longxin Lin, Zhenxiong Xu, Chien-Ming Chen, Ke Wang, Md. Rafiul
Hassan, Md. Golam Rabiul Alam, Mohammad Mehedi Hassan, and
Giancarlo Fortino. Understanding the impact on convolutional neural
networks with different model scales in aiot domain. Journal of
Parallel and Distributed Computing, 170:1–12, 2022.
[16] Shuo Ouyang, Dezun Dong, Yemao Xu, and Liquan Xiao. Commu-
nication optimization strategies for distributed deep neural network
training: A survey. Journal of Parallel and Distributed Computing,
149:52–65, 2021.
[17] Yaser Mansouri and M. Ali Babar. A review of edge computing: Fea-
tures and resource virtualization. Journal of Parallel and Distributed
Computing, 150:155–183, 2021.
[18] Felix Sattler, Simon Wiedemann, Klaus-Robert Müller, and Wojciech Samek. Robust and communication-efficient federated learning from non-iid data. IEEE Transactions on Neural Networks and Learning Systems, 31(9):3400–3413, 2019.
[19] Farzin Haddadpour, Mohammad Mahdi Kamani, Aryan Mokhtari,
and Mehrdad Mahdavi. Federated learning with compression: Uni-
fied analysis and sharp guarantees. In International Conference on
Artificial Intelligence and Statistics, pages 2350–2358. PMLR, 2021.
[20] Jay H Park, Sunghwan Kim, Jinwon Lee, Myeongjae Jeon, and
Sam H Noh. Accelerated training for cnn distributed deep learning
through automatic resource-aware layer placement. arXiv preprint
arXiv:1901.05803, 2019.
[21] Enmao Diao, Jie Ding, and Vahid Tarokh. Heterofl: Computation and
communication efficient federated learning for heterogeneous clients.
arXiv preprint arXiv:2010.01264, 2020.
[22] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and
Yoshua Bengio. Quantized neural networks: Training neural networks
with low precision weights and activations. The Journal of Machine
Learning Research, 18(1):6869–6898, 2017.
[23] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally.
Deep gradient compression: Reducing the communication bandwidth
for distributed training. arXiv preprint arXiv:1712.01887, 2017.
[24] Alham Fikri Aji and Kenneth Heafield. Sparse communication for
distributed gradient descent. arXiv preprint arXiv:1704.05021, 2017.
[25] Hongyi Wang, Scott Sievert, Shengchao Liu, Zachary Charles, Dim-
itris Papailiopoulos, and Stephen Wright. Atomo: Communication-
efficient learning via atomic sparsification. Advances in Neural
Information Processing Systems, 31, 2018.
[26] Enda Yu, Dezun Dong, Yemao Xu, Shuo Ouyang, and Xiangke Liao.
Cp-sgd: Distributed stochastic gradient descent with compression
and periodic compensation. Journal of Parallel and Distributed
Computing, 169:42–57, 2022.
[27] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. Advances in Neural Information Processing Systems, 30, 2017.
First Author et al.: Preprint submitted to Elsevier Page 8 of 9
This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4327869
[28] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and
Martin Jaggi. Error feedback fixes signsgd and other gradient com-
pression schemes. In International Conference on Machine Learning,
pages 3252–3261. PMLR, 2019.
[29] Nikita Ivkin, Daniel Rothchild, Enayat Ullah, Ion Stoica, Raman
Arora, et al. Communication-efficient distributed sgd with sketching.
Advances in Neural Information Processing Systems, 32, 2019.
[30] Kunihiko Fukushima and Sei Miyake. Neocognitron: A self-
organizing neural network model for a mechanism of visual pattern
recognition. In Competition and cooperation in neural nets, pages
267–285. Springer, 1982.
[31] Eduard Gorbunov, Dmitry Kovalev, Dmitry Makarenko, and Peter Richtárik. Linearly converging error compensated sgd. Advances in Neural Information Processing Systems, 33:20889–20900, 2020.
[32] Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-
term memory-networks for machine reading. arXiv preprint
arXiv:1601.06733, 2016.
[33] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloé Kiddon, Jakub Konečný, Stefano Mazzocchi, Brendan McMahan, et al. Towards federated learning at scale: System design. Proceedings of Machine Learning and Systems, 1:374–388, 2019.
[34] Peng Jiang and Gagan Agrawal. A linear speedup analysis of
distributed deep learning with sparse and quantized communication.
Advances in Neural Information Processing Systems, 31, 2018.
[35] Farzin Haddadpour, Mohammad Mahdi Kamani, Aryan Mokhtari, and Mehrdad Mahdavi. Federated learning with compression: Unified analysis and sharp guarantees. In International Conference on Artificial Intelligence and Statistics, pages 2350–2358. PMLR, 2021.
[36] Aleksandr Beznosikov, Samuel Horváth, Peter Richtárik, and Mher Safaryan. On biased compression for distributed learning. 2020.
[37] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao.
Yolov4: Optimal speed and accuracy of object detection. arXiv
preprint arXiv:2004.10934, 2020.
[38] Tao Kong, Fuchun Sun, Huaping Liu, Yuning Jiang, Lei Li, and
Jianbo Shi. Foveabox: Beyound anchor-based object detection. IEEE
Transactions on Image Processing, 29:7389–7398, 2020.
[39] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath
Hariharan, and Serge Belongie. Feature pyramid networks for object
detection. In Proceedings of the IEEE conference on computer vision
and pattern recognition, pages 2117–2125, 2017.
[40] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet
classification with deep convolutional neural networks. Communica-
tions of the ACM, 60(6):84–90, 2017.
[41] Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012.
[42] Li Deng. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
[43] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet
classification with deep convolutional neural networks. Communica-
tions of the ACM, 60(6):84–90, 2017.
[44] Libin Shen, Giorgio Satta, and Aravind Joshi. Guided learning for
bidirectional sequence classification. In ACL, volume 7, pages 760–
767, 2007.
[45] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.
[46] Milan Straka and Jana Straková. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99, 2017.