GRACE: A Compressed Communication Framework for Distributed Machine Learning

Authors:
Hang Xu, Chen-Yu Ho, Ahmed M. Abdelmoniem, Aritra Dutta,
El Houcine Bergou, Konstantinos Karatsenidis, Marco Canini and Panos Kalnis
KAUST
Abstract—Powerful computer clusters are used nowadays to train complex deep neural networks (DNN) on large datasets. Distributed training increasingly becomes communication bound. For this reason, many lossy compression techniques have been proposed to reduce the volume of transferred data. Unfortunately, it is difficult to reason about the behavior of compression methods, because existing work relies on inconsistent evaluation testbeds and largely ignores the performance impact of practical system configurations. In this paper, we present a comprehensive survey of the most influential compressed communication methods for DNN training, together with an intuitive classification (i.e., quantization, sparsification, hybrid and low-rank). Next, we propose GRACE, a unified framework and API that allows for consistent and easy implementation of compressed communication on popular machine learning toolkits. We instantiate GRACE on TensorFlow and PyTorch, and implement 16 such methods. Finally, we present a thorough quantitative evaluation with a variety of DNNs (convolutional and recurrent), datasets and system configurations. We show that the DNN architecture affects the relative performance among methods. Interestingly, depending on the underlying communication library and the computational cost of compression/decompression, we demonstrate that some methods may be impractical. GRACE and the entire benchmarking suite are available as open-source.
Index Terms—Survey, Distributed Machine Learning, Deep
Learning, Gradient Compression, Benchmark.
I. INTRODUCTION
Deep Neural Networks (DNNs) are becoming more complex.
For example, ResNet-152 has 152 layers and 60.2M parameters
[1], VGG-19 has 19 layers and 143M parameters [2], while
BERT-Large has 24 layers and 340M parameters [3]. Combined
with the large sizes of the training sets, parallelism during
training becomes a necessity. Consequently, popular deep learning toolkits, including TensorFlow, PyTorch and MXNet, support data parallelism¹: the DNN model under training is replicated on several compute nodes, a.k.a. workers, typically equipped with powerful GPUs. Each worker independently processes a partition of the data called a mini-batch. Then, local intermediate results (typically, the local gradients) are exchanged through the network, and the aggregated values are sent back to the workers; the process is repeated over many epochs (i.e., full iterations over the training data).
Since the above-mentioned communication involves large amounts of data, the network becomes the bottleneck [5]–[7]. Luo et al. [5] argue that computation speed improves faster than network bandwidth; therefore, modern GPUs (e.g., NVIDIA V100) experience long idle times while waiting for communication. This causes inefficient utilization of the computational resources, longer training times and/or higher financial cost for cloud-based operations.

¹Model and pipeline parallelism [4], which partition one replica of the model to multiple compute nodes, is orthogonal to data parallelism, but outside the scope of this paper.

Fig. 1: Top-1 accuracy for VGG16 on CIFAR-10 with TensorFlow on 8 workers via 25 Gbps network links. (a) x-axis: epochs; (b) x-axis: wall-time (s). In (b), Randk converges in 450 s, but 8-bit quantization needs 1200 s. (Plots omitted; curves: Baseline, Randk(0.01), 8-bit.)
To alleviate this problem, many works propose lossy compression during communication to reduce the volume of transferred data. Typically, the so-called back-propagation phase of DNN training employs variants of the Stochastic Gradient Descent (SGD) [8] optimizer. Since training is stochastic in nature, intuitively, it should manage to converge despite the small errors introduced by the lossy compression. We identify four main classes of compressors in the literature: (i) Quantization [9]–[14], which reduces the number of bits of each element in the gradient tensor (e.g., cast float32 to float8); (ii) Sparsification [15]–[19], which transmits only a few elements per tensor (e.g., only the top-k largest elements); (iii) Hybrid methods [20]–[24], which combine quantization with sparsification; and (iv) Low-rank methods [25]–[28], which decompose the gradient into low-rank matrices.
Despite the abundance of compressed communication methods, it is unclear which one is more suitable and under what circumstances, or what the relative trade-offs are. Figure 1 demonstrates the problem on a standard TensorFlow benchmark (see §V for details) running on 8 workers with NVIDIA V100 GPUs and a 25 Gbps network. Two common compression methods, Random-k [17] and 8-bit quantization [11], are compared against a baseline without compression. Most existing papers present an accuracy versus training epochs analysis, similar to Figure 1a, which shows almost equivalent effectiveness for all methods. However, in practice, users care about the actual elapsed time of the training process, shown in Figure 1b. Random-k converges in roughly 450 s
and is obviously preferable to the baseline, which requires 850 s. Interestingly, 8-bit quantization converges after 1200 s, rendering it worse than using no compression at all.

TABLE I: Classification of surveyed gradient compression methods. Note that ‖g̃‖₀ and ‖g‖₀ are the number of elements in the compressed and uncompressed gradient, respectively; the nature of operator Q is random (Rand) or deterministic (Det); EF-On indicates whether error feedback is used in our experiments (✓ = used, ✗ = not used; blank where the method is not implemented). We implement 16 methods on TensorFlow and PyTorch.

Compression | Ref. | Similar Methods | ‖g̃‖₀ | Nature of Q | EF-On | Implementation
Quantization | 8-bit quantization [11] | | ‖g‖₀ | Det | ✓ | TFlow
Quantization | 1-bit SGD [13] | [10], [21], [24] | ‖g‖₀ | Det | ✓ | TFlow, PyTorch
Quantization | SignSGD [10] | [13], [29] | ‖g‖₀ | Det | ✗ | TFlow, PyTorch
Quantization | SIGNUM [30] | [10], [29] | ‖g‖₀ | Det | ✗ | TFlow, PyTorch
Quantization | QSGD [9] | [14], [27], [31], [32]–[34] | ‖g‖₀ | Rand | ✗ | TFlow, PyTorch
Quantization | LPC-SVRG [33] | [9], [31], [34] | ‖g‖₀ | Rand | |
Quantization | Natural [31] | [9], [33], [34] | ‖g‖₀ | Rand | ✓ | TFlow, PyTorch
Quantization | TernGrad [14] | [9], [27], [33] | ‖g‖₀ | Rand | ✗ | TFlow, PyTorch
Quantization | EFsignSGD [12] | [29] | ‖g‖₀ | –NA– | ✓ | TFlow, PyTorch
Quantization | INCEPTIONN [35] | | ‖g‖₀ | Det | ✗ | TFlow
Sparsification | Random-k [17] | | k | Rand | ✓ | TFlow, PyTorch
Sparsification | Top-k [15] | [17] | k | Det | ✓ | TFlow, PyTorch
Sparsification | Threshold-v [36] | [15] | Adaptive | Det | ✓ | TFlow, PyTorch
Sparsification | Deep Gradient Compression (DGC) [16] | | Adaptive | Det | ✓ | TFlow, PyTorch
Sparsification | Adaptive sparsification [19] | [27] | Adaptive | Rand | |
Sparsification | Variance-based sparsification [18] | | Adaptive | Det | |
Sparsification | Sketched-SGD [37] | [15], [17] | k | Det | |
Hybrid | Adaptive threshold SGD [21] | [10], [13], [24] | Adaptive | Det | ✓ | TFlow
Hybrid | SketchML [22] | [21], [24] | Adaptive | Rand | ✓ | TFlow
Hybrid | 3LC [23] | [14] | Adaptive | Det | |
Hybrid | Qsparse-local-SGD [20] | | Adaptive | Rand | |
LowRank | ATOMO [27] | [19] | sparsity budget | Rand | |
LowRank | GradiVeQ [28] | [27] | (m+L)r | Det | |
LowRank | PowerSGD [26] | [25] | (m+L)r | Det | ✓ | TFlow, PyTorch
LowRank | GradZip [25] | [26] | (m+L)r | Det | |
In general, the majority of the existing work exhibits one or
more of the following shortcomings: (i) Theoretical analysis
is based on unrealistic assumptions, such as convexity; (ii)
Implementation is stand-alone and does not reflect real-world
scenarios that utilize one of the popular deep learning toolkits;
(iii) Experimental evaluation ignores the computational cost of compression/decompression, which, in some cases, is larger than the savings from the reduced communication; (iv) Only
convergence versus the number of epochs is reported, whereas
actual execution time is ignored; (v) Experimental evaluation
is performed on non-standard benchmarks; or, for a restricted
set of models (e.g., only convolutional neural networks); or,
even without considering DNNs at all.
Motivated by these shortcomings, in this paper, we follow
a systematic approach to survey, categorize and evaluate
quantitatively the existing work on compressed communication
for Deep Learning under an extensive range of real-world
models, datasets, and system configurations. We also propose
the GRACE framework that allows (i) researchers to easily
implement novel methods using our API and evaluate them
on a standard testbed, and (ii) practitioners to investigate the
trade-offs and select the method that suits the characteristics
of their particular model and dataset. Our contributions are:
Survey.
In §III, we present a comprehensive survey of the
most influential works in compressed gradient communication;
refer to Table I for a summary.
Framework and API.
In §IV, we propose GRACE, a unified framework and programming API that exposes the necessary functions (e.g., compress, decompress, and memory_compensate) for implementing a variety of compressed communication methods. We embed GRACE in TensorFlow and PyTorch and implement 16 representative methods (see Table I). We release our code, execution scripts, evaluation metrics, and raw data; and provide the models and datasets.² Essentially, we develop a self-contained testbed that can be the standard for evaluating future compression methods.
Quantitative evaluation.
In §V, we use a variety of models
that include convolutional (CNN) as well as recurrent neural
networks (RNN); and datasets from diverse domains that
include image classification and segmentation, recommendation
systems, and language modeling. We vary the number of
workers as well as the network bandwidth; and report a rich
set of metrics including throughput, data volume, accuracy,
and computation overhead. While it is not possible to be
exhaustive, we believe our results spanning a comprehensive
set of 5 benchmarks, 7 model architectures and 4 ML tasks
offer insights and draw lessons that are broadly applicable.
²Public release at https://github.com/sands-lab/grace.
Our results reveal that the speed and accuracy of each
compression method depend on the particular DNN under
training. Performance is also influenced by the underlying
network communication libraries (e.g., OpenMPI [38] or
NCCL [39]) and network bandwidth. Interestingly, many
methods fail to match the no-compression baseline in terms
of accuracy, as well as in terms of execution time, due to
the computation overhead of compression/decompression; this
issue is more pronounced for faster networks.
II. BACKGROUND
We focus on data-parallel distributed training [9], [10], [40]–[43], where each worker possesses a local copy of the entire model; computes local updates; and communicates regularly with all other workers to synchronize with the aggregated global state. Global aggregation is commonly implemented through a collective communication library (e.g., Horovod [44]) in a peer-to-peer topology.³
Distributed data-parallel learning.
A distributed optimization problem minimizes a function f:

$\min_{x \in \mathbb{R}^d} f(x) \stackrel{\text{def}}{=} \frac{1}{n}\sum_{i=1}^{n} f_i(x),$   (1)

where n is the number of workers. Each worker has a local copy of the model and a partition of the training data. The workers jointly update the model parameters $x \in \mathbb{R}^d$, where d is the number of parameters.
Consider a deep neural network (DNN), and let $x \stackrel{\text{def}}{=} \{W, b\}$ be the space that contains the model parameters (also known as weights W and biases b). Given a set of input data D with their corresponding true labels, the training phase learns x for each layer of the network. Let

$f(x) \stackrel{\text{def}}{=} \frac{1}{n}\sum_{i=1}^{n} \underbrace{\sum_{j=1}^{m} \mathcal{L}_j\big(\hat{y}_j(x, \hat{x}_{i,j}),\, y_j\big)}_{:= f_i(x)} + R(x)$   (2)

be the loss function such that, at each worker i, $\hat{x}_{i,j}$ is the input from its data partition $D_i$, $y_j$ is the true label, $\mathcal{L}_j$ is the loss function (e.g., squared loss, cross-entropy loss, etc.) that calculates the discrepancy between the true label $y_j$ and the predicted value $\hat{y}_j$, and R is a regularizer. Calculating the loss function for each training sample is called the forward pass. During training, the parameter space x is updated by minimizing Equation (2) via a stochastic optimization algorithm that calculates the gradients of the loss function with respect to each layer of the DNN; a process known as back-propagation.
In practice, each data partition $D_i$ is further split into mini-batches, each with m data points. Each worker i performs the forward pass for all input data in a mini-batch; then performs back-propagation to calculate the stochastic gradients over the entire mini-batch; communicates with all other workers to aggregate all local gradients; and finally, uses the aggregated global state to update its parameters x.
³Our work is also applicable to master-worker architectures, where aggregation is performed in a central parameter server.
Fig. 2: (a) DNN architecture at node i. (b) Gradient compression mechanism for the Lth layer of a DNN. (Diagram omitted: the forward pass on a mini-batch produces the outputs and the loss; back-propagation computes each layer's stochastic gradient, which is sent layer-wise in compressed form through the communication backend, and the decompressed aggregate is used to update that layer's parameters.)
Stochastic gradient descent (SGD).
SGD [8] is a first-order iterative optimization algorithm. At iteration k+1, SGD updates the model parameters as:

$x_{k+1} = x_k - \eta_k g_k$   (3)

where $\eta_k > 0$ is the learning rate and $g_k$ is the stochastic gradient at iteration k (i.e., an unbiased estimator of the gradient of f). To converge faster, SGD is often equipped with a short-term memory z, called momentum. For instance, Nesterov [45] computes the gradient g at a look-ahead point $x_k + \gamma z_k$ as: $z_{k+1} = \gamma z_k - \eta_k g(x_k + \gamma z_k)$, where $0 \le \gamma \le 1$. Then, Equation (3) of SGD becomes $x_{k+1} = x_k + z_{k+1}$. In addition to SGD, several accelerated versions, such as ADAM [46] or ADAGrad [47], are used for DNN training.
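For concreteness, a minimal NumPy sketch of these two updates (our illustration; loss_grad stands for the stochastic gradient computed on a mini-batch and is an assumed callable):

import numpy as np

def sgd_step(x, g, lr):
    # Plain SGD, Equation (3): x_{k+1} = x_k - eta_k * g_k
    return x - lr * g

def nesterov_step(x, z, loss_grad, lr, gamma=0.9):
    # Nesterov momentum: evaluate the gradient at the look-ahead point x + gamma*z,
    # update z_{k+1} = gamma*z_k - eta_k*g(x + gamma*z_k), then x_{k+1} = x_k + z_{k+1}.
    g = loss_grad(x + gamma * z)
    z_new = gamma * z - lr * g
    return x + z_new, z_new

# Example: one step on f(x) = 0.5*||x||^2, whose gradient is x itself
x, z = np.ones(3), np.zeros(3)
x, z = nesterov_step(x, z, loss_grad=lambda v: v, lr=0.1)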
III. GRADIENT COMPRESSION
We focus on gradient compression.⁴ Let $g_k^{i,L}$ be the local gradient⁵ in worker i at layer L of the DNN during training iteration k. Instead of transmitting $g_k^{i,L}$, the worker sends $Q(g_k^{i,L})$, where Q is a compression operator (see Figure 2). The receiver has a decompression operator $Q^{-1}$ that reconstructs the gradient. Typically, this process is lossy; for this reason, several methods incorporate a memory (or error feedback) mechanism to compensate for the accumulated errors.
Formally, a compressor is a random operator $Q(\cdot): \mathbb{R}^d \to \mathbb{R}^d$ that satisfies $\mathbb{E}_Q \|x - Q(x)\|^2 \le \Omega \|x\|^2$, where $\Omega > 0$ is the compression factor and the expectation is taken over the randomness of Q. If $\Omega = 1 - \delta$ and $\delta \in (0,1]$, then Q is a δ-compressor; many sparsifiers belong to this category. If $\mathbb{E}(Q(x)) = x$, then Q is unbiased; otherwise it is biased. We classify gradient compression techniques into four categories, shown in Table I: quantization, sparsification, hybrid and low-rank. The most influential methods are presented below. For more details, refer to our companion technical report [48].
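As a worked example of this definition (our illustration, not taken from the paper), consider unscaled Random-k, which keeps k of the d coordinates chosen uniformly at random and zeroes the rest. Each coordinate is dropped with probability $1 - k/d$, so

$\mathbb{E}_Q\,\|x - Q(x)\|^2 = \sum_{i=1}^{d} \Pr[x_i \text{ dropped}]\; x_i^2 = \Big(1 - \tfrac{k}{d}\Big)\|x\|^2,$

i.e., Q is a δ-compressor with δ = k/d; the rescaled variant $(d/k)\,Q(x)$ is unbiased, since $\mathbb{E}[(d/k)\,Q(x)] = x$.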
A. Quantization
Quantization reduces the number of bits of each element of the gradient, either by truncation or by mapping to a predefined set of code-words.

⁴For the orthogonal topic of parameter compression, see §VI.
⁵For simplicity, we will omit i, L from $g_k^{i,L}$ when possible.
Fig. 3: QSGD example with s = 4, l = 3. The possible code-words are 0, 1/4, 1/2, 3/4, 1 and are represented by 3 bits. (Worked example omitted.)
8-bit quantization.
Dettmers [11] maps each float32 element
of the gradient to 8 bits: 1 sign, 3 exponent and 4 mantissa bits.
To minimize precision loss, Dettmers also proposed a dynamic
scheme, where exponent bits range from 0 to 6.
1-bit SGD.
Seide et al. [13] propose an extreme form of quantization: all gradient elements that are less than a user-defined threshold τ (0 by default) are quantized to '0'; all other elements are quantized to '1'. $Q^{-1}$ decodes '0's and '1's to the mean of the negative and non-negative values of the local gradient, respectively. This work also introduces a memory mechanism, $m_k = g_k - Q^{-1}(\tilde{g}_k)$, to compensate for the accumulated error. Let $\tilde{g}_k$ be the compressed gradient at iteration k; then $\tilde{g}_k = Q(g_k + m_k)$.
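The same memory (error-feedback) pattern, written as a minimal sketch (ours, not the authors' code); compress and decompress stand for Q and Q⁻¹, and the update matches the general form of Equation (4) later in the paper with β = γ = 1:

def step_with_memory(grad, memory, compress, decompress):
    # Compensate the gradient with the carried-over error, compress it,
    # and keep whatever the compressor lost as the new memory.
    corrected = grad + memory                        # g_k + m_k
    compressed = compress(corrected)                 # ~g_k = Q(g_k + m_k)
    new_memory = corrected - decompress(compressed)  # error carried to the next iteration
    return compressed, new_memory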
SignSGD, SIGNUM and EFsignSGD.
SignSGD [10] transmits the sign of the gradient elements, quantizing the negative components to −1 and the others to +1. SIGNUM [30] is a momentum version (see §II) of SignSGD. EFsignSGD [12] improves SignSGD's convergence via a memory mechanism. Zheng et al. [29] extended the error feedback approach to a bidirectional blockwise scheme with Nesterov momentum.
Ternary gradient.
TernGrad [14] uses three values {−1, 0, 1} scaled by the infinity norm of the stochastic gradient g. First, the elements of a bit-mask b are selected with probability $P(b_i = 1 \mid g[i]) = |g[i]| / \|g\|_\infty$. Then, g is quantized to $\tilde{g} = \|g\|_\infty \cdot \mathrm{sign}(g) \odot b$, where $\odot$ denotes the element-wise product. TernGrad tends to achieve a better convergence rate if the gradient components are evenly distributed.
Quantized SGD.
QSGD is a codebook-based scheme by Alistarh et al. [9]. Wu et al. [32] extend QSGD with error feedback. QSGD quantizes each component g[i] of the stochastic gradient via randomized rounding to a discrete set of values (i.e., code-words):

$\tilde{g}[i] = \begin{cases} \|g\|_2\,\mathrm{sign}(g[i])\cdot \frac{l}{s} & \text{with probability } 1 - p_i \\ \|g\|_2\,\mathrm{sign}(g[i])\cdot \frac{l+1}{s} & \text{with probability } p_i = s\frac{|g[i]|}{\|g\|_2} - l \end{cases}$

where $\|\cdot\|_2$ is the Euclidean norm, and $s \ge 1$ and $l \in \mathbb{N}$ are user-defined parameters such that $0 \le l < s$ and $\frac{|g[i]|}{\|g\|_2} \in [l/s, (l+1)/s]$. An example is shown in Figure 3; there are 5 code-words, therefore each element g[i] of the original stochastic gradient is quantized to 3 bits.
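A minimal NumPy sketch of this randomized rounding (our illustration; it omits the lossless encoding of levels that QSGD applies to the final bit stream):

import numpy as np

def qsgd_quantize(g, s, rng=np.random.default_rng(0)):
    # Map each |g[i]|/||g||_2 onto the grid {0, 1/s, ..., 1} by randomized
    # rounding, so that the reconstruction is unbiased in expectation.
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return norm, np.sign(g), np.zeros(g.shape, dtype=np.int64)
    ratio = np.abs(g) / norm                # in [0, 1]
    lower = np.floor(ratio * s)             # code-word index l
    p_up = ratio * s - lower                # probability of rounding up
    levels = lower + (rng.random(g.shape) < p_up)
    return norm, np.sign(g), levels.astype(np.int64)

def qsgd_dequantize(norm, signs, levels, s):
    # ~g[i] = ||g||_2 * sign(g[i]) * (level_i / s)
    return norm * signs * (levels / s)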
Fig. 4: Example of Top-k compression: 20% of the gradient components and the corresponding indices are sent. (Illustration omitted.)

LPC-SVRG and Natural Compression.
LPC-SVRG [33] is a codebook-based approach that combines gradient clipping with quantization. For bit-width w and scaling factor δ > 0, a gradient component $g[i] \in [\varepsilon, \varepsilon + \delta]$ is quantized to ε with probability $p_i = \frac{\varepsilon + \delta - g[i]}{\delta}$, or to ε + δ otherwise, where $\varepsilon \in \{-2^{w-1}\delta, \ldots, -\delta, 0, \ldots, (2^{w-1}-1)\delta\}$. Quantized-SVRG [9] is a related method with a variance reduction mechanism. Horvath et al. [31] proposed a similar scheme, called natural compression, that rounds the input to one of the two closest integer powers of 2.
INCEPTIONN
[35] quantizes each 32-bit floating-point gradient element into four different levels (i.e., 32, 16, 8 and 0 bits) plus a 2-bit tag indicating the compression level. This work is implemented on FPGA-based network interface cards (NICs) to reduce the computational overhead of compression/decompression.
B. Sparsification
Sparsification methods select only a subset of the elements of the original stochastic gradient g, resulting in a sparse vector. Let b be a bitmask vector with the same number of elements as g. A 1 bit in b[i] indicates that the corresponding gradient element g[i] is selected. The element-wise product $g \odot b$ generates a sparse version of the original stochastic gradient. The sparse vector can be represented as two vectors: one contains the values of the selected elements of g, whereas the other contains the indices of the corresponding 1s in b.
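For illustration, a minimal NumPy sketch of this values-plus-indices representation, using Top-k as the selection rule (our sketch, not GRACE's code):

import numpy as np

def sparsify_topk(g, k):
    # Keep the k largest-magnitude elements and transmit (values, indices)
    # together with the original length, so the receiver can rebuild the tensor.
    idx = np.argpartition(np.abs(g), -k)[-k:]
    return g[idx], idx, g.size

def desparsify(values, indices, size):
    # Fill a zero vector of the original length with the received values.
    dense = np.zeros(size, dtype=values.dtype)
    dense[indices] = values
    return dense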
Random-k
[17]. Let d be the size of the bitmap b. A set of k indices is randomly selected out of the d possible ones, and the k corresponding bits of b are set to '1'. By design, Random-k is biased, but can be made unbiased by multiplying g with a factor d/k. There is also a version with error feedback.
Top-k, Sketched-SGD and Threshold-v.
Top-k [15] selects the bitmask b such that b[i] = 1 if |g[i]| belongs to the k largest values of g (in absolute value); otherwise, b[i] = 0; Figure 4 shows an example. Stich et al. [17] propose a similar scheme with memory. Ivkin et al. [37] propose Sketched-SGD, which uses a count-sketch to select the "heavy hitters" that approximate the Top-k components of the gradient. In contrast to Top-k, Threshold-v [36] selects the elements whose absolute values are larger than a fixed threshold. However, an appropriate threshold is hard to determine as it depends on the model.
Fig. 5: Low-rank compression: matrix M is decomposed into two low-rank matrices P, R, each of rank r. (Diagram omitted.)

Deep gradient compression (DGC) [16]. Each worker i calculates the local gradient $g_k^i$ and updates it as $u_k^i = \beta u_{k-1}^i + g_k^i$. One can think of the above step as momentum added to the local gradient, a form of error feedback. Then, the gradient is accumulated via $v_k^i = v_{k-1}^i + u_k^i$. Only gradient elements with $g[i] < -\tau$ or $g[i] > \tau$ are transmitted, where τ is a user-defined threshold. To identify the threshold while incurring low overhead, Abdelmoniem et al. [49] propose an estimation technique based on modeling the gradient according to sparsity-inducing distributions.
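A minimal sketch of this accumulate-and-threshold step (our reading of the description above; clearing the transmitted entries follows DGC's momentum factor masking, which is not spelled out here):

import numpy as np

def dgc_step(g, u, v, beta, tau):
    # Momentum correction and accumulation, then threshold-based selection.
    u = beta * u + g                 # u_k = beta * u_{k-1} + g_k
    v = v + u                        # v_k = v_{k-1} + u_k
    mask = np.abs(v) > tau           # transmit only large accumulated entries
    values, indices = v[mask], np.nonzero(mask)[0]
    v = np.where(mask, 0.0, v)       # clear what was sent
    u = np.where(mask, 0.0, u)       # momentum factor masking (from the DGC paper)
    return values, indices, u, v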
Variance-based sparsification.
Wangni et al. [19] observe that the variance of the gradient affects the convergence rate; they propose an unbiased sparse coding to maximize sparsity and control the variance. They assign a probability $p_i$ and generate a bitmask vector b such that $P\{b[i] = 1\} = p_i$, to obtain the compressed gradient element $\tilde{g}[i] = Z_i \frac{g[i]}{p_i}$, where $Z_i = b[i]$. A similar method is proposed by Tsuzuku et al. [18].
C. Hybrid compressors
Hybrid methods combine quantization with sparsification.
Qsparse-local-SGD.
Basu et al. [20] combine quantization with Top-k or Random-k sparsification. They implement synchronous and asynchronous versions with error feedback.
Hard and adaptive threshold SGD.
Strom et al. [24] employ a user-defined threshold τ. Gradient elements $g[i] \in [-\tau, \tau]$ are omitted; therefore, the gradient is sparsified. For the remaining gradient elements, if $g[i] < -\tau$, the element is quantized to '0'; else, if $g[i] > \tau$, it is quantized to '1'. Those elements are then packed into 32-bit words with one bit for the quantized value ('0' or '1') and 31 bits for the element index. During decompression, '0's and '1's are decoded to −τ and τ, respectively. Note that the appropriate value of τ is model-specific and hard to determine in practice. Instead of a fixed τ, Adaptive [21] uses a ratio α < 1 of the proportion of negative and non-negative gradient elements. Adaptive samples the gradient to determine dynamically, for each mini-batch, two thresholds $\tau_-$ and $\tau_+$ that satisfy the α-ratio.
SketchML.
Jiang et al. [22] propose sketch-based compression.
The algorithm selects only the non-zero elements of the
gradient (i.e., sparsification) and builds a non-uniform quantile
sketch [50]. Gradient values of each bucket are encoded with
the bucket’s index (i.e., quantization). The algorithm further
compresses the bucket indices through hashing.
3LC
[23] first calculates $M = s\|g\|_\infty$, the highest-magnitude gradient element scaled by a sparsity-multiplier parameter $s \in [1, 2)$. Then the quantized gradient is obtained by rounding the scaled gradient $(1/M)\,g$. The output is further compressed by aggressive lossless encoding. 3LC also implements error compensation.
D. Low-rank decomposition
DNNs are over-parameterized and exhibit low-rank structure [51], [52]. Based on this observation [53], [54], low-rank methods represent the gradient as a matrix $M \in \mathbb{R}^{m \times L}$ and factorize it into two low-rank matrices $P \in \mathbb{R}^{m \times r}$ and $R \in \mathbb{R}^{r \times L}$ that are smaller than M (see Figure 5). Typically, the factorization is approximate.
ATOMO and GradiVeQ.
ATOMO [27] factorizes the gradient matrix M in a way that minimizes the variance of the quantized stochastic gradient. Let $\tilde{g}$ be an unbiased estimator of the stochastic gradient g that has an atomic decomposition $g = \sum_i \lambda_i a_i$, where $A = \{a_i\} \subset V$ are atoms in an inner product space V with $\|a_i\| = 1$. If, for each i and $0 \le p_i \le 1$, $t_i \sim \mathrm{Bernoulli}(p_i)$, then ATOMO uses the estimator $\tilde{g} = \sum_i \frac{\lambda_i t_i}{p_i} a_i$ and, using a sparsity budget $\|p\|_1 = s$, solves a meta-optimization problem. This controls the gradient variance and represents g with a set of fewer basis elements that yield a low-rank approximation of g. The same authors proposed spectral-ATOMO, based on the singular value decomposition (SVD) of the gradient. GradiVeQ (gradient vector quantizer) [28] is also based on SVD.
Remark 1. With respect to the standard basis (atoms), set q = 2 and q = ∞, respectively, in $s = \|g\|_1 / \|g\|_q$ and probability $p_i = |g[i]| / \|g\|_q$. Then one can recover QSGD and TernGrad, respectively, from ATOMO.
PowerSGD and GradZip.
PowerSGD [26] uses power iteration to decompose the original gradient matrix M into two rank-r matrices P and R. The scheme is biased, and the authors propose to use post-compression momentum. A similar method, GradZip [25], employs an extra regularizer $\|P\|_F^2 + \|R\|_F^2$ and uses an alternating direction method to obtain the factors P and R.
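To make the factorization concrete, a single-step power-iteration sketch in NumPy (our illustration; PowerSGD additionally reuses the test matrix across iterations and applies error feedback):

import numpy as np

def rank_r_factor(M, r, rng=np.random.default_rng(0)):
    # Approximate the m x L gradient matrix M by P @ R with P (m x r) and R (r x L);
    # only P and R need to be communicated.
    L = M.shape[1]
    Q = rng.standard_normal((L, r))    # random test matrix
    P, _ = np.linalg.qr(M @ Q)         # orthonormal basis for the range of M @ Q
    R = P.T @ M
    return P, R

# Example: compress a 256 x 512 gradient matrix to rank 4
M = np.random.randn(256, 512)
P, R = rank_r_factor(M, r=4)
approx = P @ R                         # low-rank reconstruction of M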
E. General comment on convergence
While some compressed distributed SGD algorithms are analyzed in the non-convex setup, some papers only provide the convergence guarantee when f is convex, under standard assumptions such as L-smoothness of f (e.g., see [17], [32]). Under these assumptions, for convex f, the convergence of compressed distributed SGD is $O(1/\sqrt{K})$, the same as no-compression vanilla SGD, where K is the iteration count. For a non-convex function f (as is the case with DNNs), it is typical to show that the quantity $\min_{k \in [K]} \mathbb{E}(\|\nabla f_k\|^2) \to 0$ as $K \to \infty$. With compressed and distributed SGD, the majority of the work shows the classical convergence rate $O(1/\sqrt{K})$ for non-convex functions. We refer to [36] for a general non-convex convergence analysis of distributed SGD without error feedback for both biased and unbiased compressors. However, with error feedback, the convergence analyses of compressed distributed SGD algorithms are more mathematically involved. Stich et al. [17] show that, with error feedback, sparsified SGD maintains the same convergence rate as no-compression vanilla SGD in the single-node case and for strongly convex f. Additionally, for the single-node case, Karimireddy et al. [12] show that error feedback can alleviate the convergence issues of any arbitrary compression operator. Many works generalize the aforementioned scenarios to the distributed setting [55].
IV. GRACE - A UNIFIED FRAMEWORK
We develop GRACE, a unified framework for compressed communication for distributed deep learning. We instantiate GRACE within two popular ML toolkits, TensorFlow and PyTorch. GRACE encompasses a wide range of compression methods, capturing all the methods discussed in §III, and yet it exposes a simple programming API with which one can implement compression methods succinctly. GRACE provides a reference for fair quantitative evaluation across diverse methods and serves as a platform for rapid prototyping of new ones.
A. Distributed training loop
Our framework builds upon the distributed training loop with compressed communication depicted as pseudo-code in Algorithm 1. Each node executes the training loop in parallel and periodically synchronizes with other nodes.
Customizable components.
Algorithm 1 references the following components that are customized for different compressors:
• $Q(\cdot)$ and $Q^{-1}(\cdot)$: the compression and decompression operators, respectively.
• $\phi(\cdot)$: the memory compensation function, which compensates at each iteration the node-local gradient with the previous iteration's error feedback.
• $\psi(\cdot)$: the memory update function, which combines at each iteration the memory with the node-local gradient and error feedback.
• Communication strategy: two types of collective communication strategies are explicitly supported, with support for custom gradient aggregation functions (Agg).
Training loop process.
Each node locally computes a stochastic gradient $g_k^i$ based on a mini-batch of training samples (Line 4). Then, it combines $g_k^i$ with its memory $m_k^i$ via $\phi(\cdot)$.⁶ Next, the node applies the compression operator Q on $\phi(g_k^i, m_k^i)$ to produce $\tilde{g}_k^i$ (Line 5). Memory $m_k^i$ is updated using $\psi(\cdot)$ (Line 6). Now, each node communicates its $\tilde{g}_k^i$ using a collective communication primitive (Lines 8 and 11). Subsequently, every node obtains an aggregate of the decompressed gradient $g_k$, typically $g_k = \frac{1}{n}\sum_i Q^{-1}(\tilde{g}_k^i)$. At this point, we distinguish the case of Allreduce from Broadcast or Allgather, because the former results in the aggregate of the compressed gradients, whereas the latter involves a one-to-all or all-to-all communication, followed by a local aggregation step (the Agg function), which is customized for different methods. Finally, with $g_k$, each node updates its model parameters x by Equation (3) (Line 15). The loop repeats until convergence.

⁶The case with no memory compensation (hence, no memory) is a special case, where $\phi(g_k^i, m_k^i) = g_k^i$ and $\psi(m_k^i, g_k^i, \tilde{g}_k^i) = 0$.
Layer-wise gradient as tensors.
We denote the stochastic gradient $g_k^i$ of a model as a single vector (at node i). This is merely for ease of presentation. Our framework equally applies to modern ML toolkits, where it is common during back-propagation to compute $g_k^i$ incrementally for each DNN layer, as a sequence $\hat{g}_k^{i,j}$ for decreasing j.
Different optimizers.
Although we cast our training loop as a distributed SGD process, we note that the customizable components ($Q$, $Q^{-1}$, $\phi$, $\psi$) are optimizer independent. Instead of SGD, any stochastic algorithm, such as AdaGrad [47] or ADAM [46], can be used as the optimizer in Algorithm 1. Our experiments use different optimizers, including SGD, RMSProp and SGD with momentum.

Algorithm 1 Distributed Training Loop
Input: Number of nodes n, learning rate $\eta_k$, compressor Q, decompressor $Q^{-1}$, memory compensation function $\phi(\cdot)$, and memory update function $\psi(\cdot)$
Output: Trained model x
1: On each node i:
2: Initialize: $m_0^i = 0$ {vector of zeros}
3: for $k = 0, 1, \ldots$ do
4:   Calculate stochastic gradient $g_k^i$
5:   $\tilde{g}_k^i = Q(\phi(m_k^i, g_k^i))$
6:   $m_{k+1}^i = \psi(m_k^i, g_k^i, \tilde{g}_k^i)$
7:   if compressor uses Allreduce then
8:     $\tilde{g}_k = \mathrm{Allreduce}(\tilde{g}_k^i)$
9:     $g_k = Q^{-1}(\tilde{g}_k) / n$
10:  else if compressor uses Broadcast | Allgather then
11:    $[\tilde{g}_k^1, \tilde{g}_k^2, \ldots, \tilde{g}_k^n] = \mathrm{Broadcast}(\tilde{g}_k^i)\ |\ \mathrm{Allgather}(\tilde{g}_k^i)$
12:    $[g_k^1, g_k^2, \ldots, g_k^n] = Q^{-1}([\tilde{g}_k^1, \tilde{g}_k^2, \ldots, \tilde{g}_k^n])$
13:    $g_k = \mathrm{Agg}([g_k^1, g_k^2, \ldots, g_k^n])$
14:  end if
15:  $x_{k+1}^i = x_k^i - \eta_k g_k$
16: end for
17: return x {each node has the same view of the model}
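The following single-process sketch mimics one iteration of Algorithm 1 by simulating the n workers in a list (our illustration, not the GRACE implementation); it follows the Allgather-style branch with Agg fixed to the mean, and the memory update is applied to the decompressed gradient so that shapes match:

def training_iteration(x, grads, memories, Q, Q_inv, phi, psi, lr):
    # grads[i] and memories[i] play the roles of g_k^i and m_k^i on worker i.
    n = len(grads)
    decompressed = []
    for i in range(n):
        compensated = phi(memories[i], grads[i])            # memory compensation
        c = Q(compensated)                                   # Line 5: ~g_k^i = Q(phi(m, g))
        memories[i] = psi(memories[i], grads[i], Q_inv(c))   # Line 6: memory update (on the reconstruction)
        decompressed.append(Q_inv(c))                        # Lines 11-12: exchange and decompress
    g = sum(decompressed) / n                                # Line 13: Agg = mean
    return x - lr * g, memories                              # Line 15: x_{k+1} = x_k - eta_k * g_k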
Memory compensation functions.
We use the following form of the functions $\phi(\cdot)$ and $\psi(\cdot)$ in this paper:

$\phi(m_k^i, g_k^i) = \beta m_k^i + \gamma g_k^i$
$\psi(m_k^i, g_k^i, \tilde{g}_k^i) = \phi(m_k^i, g_k^i) - \tilde{g}_k^i$   (4)

where β > 0 is the memory decay factor and γ > 0 weighs the relevance of the latest stochastic gradient. We use β = γ = 1 unless otherwise noted. Users may customize these functions.
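Read literally, Equation (4) translates into the following two-line sketch (ours); here g_tilde is interpreted in the original gradient space, i.e., the decompressed reconstruction:

def memory_compensate(m, g, beta=1.0, gamma=1.0):
    # phi: combine the previous error feedback with the latest gradient.
    return beta * m + gamma * g

def memory_update(m, g, g_tilde, beta=1.0, gamma=1.0):
    # psi: whatever was compensated but not transmitted becomes the new memory.
    return memory_compensate(m, g, beta, gamma) - g_tilde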
Communication with parameter server.
Our framework is compatible with parameter server-based communication. Conceptually, a parameter server provides a gradient aggregation function equivalent to Allreduce. However, Horovod, on which we base our implementation, exclusively supports collective communication libraries.
B. Programming interface
We provide an API for the compress (Q), decompress ($Q^{-1}$), memory_compensate (φ), memory_update (ψ) and aggregate (Agg) functions that are mentioned in the pseudo-code. The framework treats the context ctx as an opaque object that carries any metadata necessary for decompression, which should return a tensor with the same data type and shape as the original tensor. For instance, in sparsification methods, ctx contains the shape and size of the original tensor. Below is an example function definition that takes a tensor with a unique name and returns a list of compressed objects together with the context needed to decompress them:

compress : tensor, name → [comp], ctx

Our framework provides defaults for aggregate, as well as memory_compensate and memory_update (Equation 4). The user needs to implement compress and decompress for each compression method, and indicate to the framework the communication strategy to use.
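As an illustration of this contract, a scaled-sign compressor written against it (class and method names here are our assumptions for the sketch, not necessarily GRACE's exact classes):

import torch

class ScaledSignCompressor:
    # Transmits only the signs plus a single scale, and reconstructs a tensor
    # with the same data type and shape as the original.
    def compress(self, tensor, name):
        scale = tensor.abs().mean()
        ctx = (name, tensor.shape, scale)
        return [torch.sign(tensor)], ctx

    def decompress(self, tensors, ctx):
        (sign,) = tensors
        _, shape, scale = ctx
        return (scale * sign).reshape(shape)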
Compression typically produces tensors of different dimensions or data types than the original ones. For instance, sparsification results in smaller tensors, while quantization results in either different data types or bit-packed elements. As these manipulations are common across several methods, our API implements the following helper functions:

API | Description
quantize | Quantizes tensor values and returns values in lower bits
dequantize | Dequantizes a tensor and restores the original bits
sparsify | Sparsifies a tensor with a certain selection algorithm
desparsify | De-sparsifies and restores the original shape by filling zeros
pack | Encodes several lower-bit values into one higher-bit value
unpack | Unpacks and restores the original decoded form
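For example (our sketch, not GRACE's implementation), pack and unpack for 2-bit codes could store four codes per byte:

import numpy as np

def pack_2bit(codes):
    # codes: integers in {0, 1, 2, 3}; pad to a multiple of 4, then pack
    # four 2-bit codes into each uint8.
    codes = np.asarray(codes, dtype=np.uint8)
    n = len(codes)
    codes = np.concatenate([codes, np.zeros((-n) % 4, dtype=np.uint8)])
    c = codes.reshape(-1, 4)
    packed = (c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)).astype(np.uint8)
    return packed, n

def unpack_2bit(packed, n):
    # Recover the first n 2-bit codes from the packed bytes.
    p = packed[:, None]
    codes = np.concatenate([(p >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1)
    return codes.reshape(-1)[:n]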
We support both TensorFlow and PyTorch; however, they
have different APIs. Following Horovod’s strategy, we create
two similar yet distinct implementations.
Tensor manipulation operations.
Both TensorFlow and Py-
Torch provide high-level tensor-manipulation APIs in Python
as well as a C++ library to define custom tensor operations.
We adopt the Python API since it is typically used by model
creators and is backed by efficient low-level kernels for GPUs
or other hardware accelerators; however, this does not prevent
the user from integrating custom C++ operators.
Communication primitives.
We leverage Horovod [44] for communication, which exposes three collective communication primitives: Allreduce, Allgather, and Broadcast. On the backend, these are implemented by several alternate libraries: OpenMPI, NVIDIA NCCL and Facebook Gloo [56].
Communication strategies.
We support two types of communication strategies: (i) Allreduce is the most efficient operation. However, it is not readily suitable for several scenarios. The main limitations in the underlying implementation are that it does not support sparse tensors and requires that input tensors be of the same data type and dimension. Moreover, it can only aggregate tensors by summing. In contrast, (ii) Allgather and Broadcast do not perform any aggregation and support input tensors of different forms. This is well suited for quantization, when aggregation needs to be performed on dequantized values, as well as for sparsification, when different nodes select gradient elements at non-overlapping indices.
C. Implemented compressors
We implement within GRACE 16 representative compressors
(see Table I). Where available, we draw from publicly available
implementations; however, in many cases these are not released,
although we reached out to the original authors to acquire them.
We follow faithfully the algorithm descriptions presented in
their corresponding papers, and we try to reproduce the original
accuracy results. Our implementation is a best-effort approach
that reflects the intention of the respective methods, although it
may not be as efficient as the original ones. We avoid creating
custom C++ tensor operations because the Python API is
functionally sufficient and because it would have required an
excessive effort given the large number of methods. Below, we
highlight noteworthy implementation details.
Quantization. quantize converts the original 32-bit floating-point values into a lower-bit representation. ctx stores additional information (such as the mean and different norms) needed to dequantize. In some scenarios, pack can further compress the data by encoding several lower-bit values into a single 32-bit value. dequantize transforms quantized values into an approximation of the original values. unpack decodes the packed data into their original representation.
Sparsification. sparsify flattens the original gradient into a rank-1 tensor. Then, it selects m out of d elements, and creates two 1×m rank-1 tensors to represent the selected values and indices. ctx stores the shape of the original tensor. desparsify restores the values into a rank-1 tensor of size d, fills missing values with zeros, and reshapes the tensor to the original gradient shape.
Adaptive [21]. Adaptive splits the gradient into a positive-value and a negative-value part. We apply quantize to encode values in a ternary format and separate the +1 and −1 values. We use sparsify to select elements according to a sparsification ratio α. As the values are all ones, we only send the mean and the selected indices of each part.
DGC [16]. The momentum correction used in DGC is similar to memory compensation. We implement it by customizing the memory functions. memory_compensate adjusts the values by both memory and momentum. memory_update uses the minimum absolute value in the compressed gradient as the threshold to get the mask and to update the memory.
V. EXPERIMENTAL EVALUATION
We perform a comprehensive quantitative evaluation of the 16 implemented compressors listed in Table I. Our results cover 5 benchmarks, 7 model architectures and 4 deep learning tasks (i.e., image classification, segmentation, recommendation, and language modeling). Due to space limits, we present below the most representative results; for the complete set, refer to our technical report [48].
We break down the efficiency gains of compression along two metrics: (i) the data volume that each worker generates, and (ii) the training throughput (in terms of training samples per second). Measuring data volumes characterizes the intrinsic communication-level algorithmic efficiency of a method, whereas throughput offers the extrinsic measure of performance gains while other practical system artifacts are at play (e.g., computational overheads of compression).
A. Experimental setup and methodology
Environment.
We run most of the experiments on 8 dedicated
machines with Ubuntu 18.04.2 LTS and Linux v.4.15.0-74, 16-
Core Intel Xeon Silver 4112 at 2.6 GHz, 512 GB RAM and one
NVIDIA Tesla V100 GPU card with 16 GB on-board memory.
They are interconnected via network links at 1, 10 and 25 Gbps.
For time-insensitive metrics (e.g., accuracy, data volume) we
also use a shared cluster with a heterogeneous group of nodes
each equipped with at least one NVIDIA Tesla V100 GPU
card. We deploy CUDA 10.1, PyTorch 1.3, TensorFlow 1.14,
Horovod 0.18.2, OpenMPI 4.0 and NCCL 2.4.8.
TABLE II: Summary of the benchmarks and quality metrics used in this work.

Task | Model | Dataset | Training parameters | Gradient vectors | Epochs | Quality metric | Baseline quality
Image Classification | ResNet-20 [1] | CIFAR-10 [57] | 269,467 | 51 | 328 | Top-1 Accuracy | 90.86%
Image Classification | DenseNet40-K12 [58] | CIFAR-10 [57] | 357,491 | 158 | 328 | Top-1 Accuracy | 92.07%
Image Classification | Custom ResNet-9 [59] | CIFAR-10 [57] | 6,573,120 | 25 | 24 | Top-1 Accuracy | 91.67%
Image Classification | VGG16 [2] | CIFAR-10 [57] | 14,982,987 | 30 | 328 | Top-1 Accuracy | 86.32%
Image Classification | ResNet-50 [1] | ImageNet [60] | 25,559,081 | 161 | 90 | Top-1 Accuracy | 75.37%
Image Classification | VGG19 [2] | ImageNet [60] | 143,671,337 | 38 | 90 | Top-1 Accuracy | 68.90%
Recommendation | NCF [61] | Movielens-20M [62] | 31,832,577 | 10 | 30 | Best Hit Rate | 95.98%
Language Modeling | LSTM [63] | PTB [64] | 19,775,200 | 7 | 25 | Test Perplexity | 100.168
Image Segmentation | U-Net [65] | DAGM2007 [66] | 1,850,305 | 46 | 2,500 | Intersection over Union (IoU) | 96.4%
Benchmarks.
We use industry-standard benchmarks from TensorFlow [67], [68] and NVIDIA [69]. These benchmarks span 4 common deep learning tasks from different domains and involve a mix of convolutional and recurrent neural networks, as well as models with large embedding layers. The trainable parameters span 3 orders of magnitude. The number of communicated gradient vectors ranges from 7 to 161. The quality of the model is reported under diverse nomenclatures, according to the benchmark-specific metrics shown in Table II.
Methodology.
We run each experiment for a fixed number
of training epochs (complete iterations over the training set)
according to every benchmark’s specification. The reported
quality of the model (e.g., accuracy), which is based on a
held-out test set, is the best one witnessed throughout training.
We use no compression as the baseline for comparison.
We ensure our baselines converge to state-of-the-art results
(Table II). The default optimizers are: SGD with momentum for
image classification, RMSProp for segmentation, ADAM for
recommendation and SGD for language modeling. Compressors
use the same optimizer as the baseline, except for image
classification, whereby PowerSGD, Random-k, DGC, SignSGD
and SIGNUM use vanilla SGD as it achieves better quality.
When reporting relative results, they are normalized to the
relevant metrics measured for the baseline case. We repeat experiments to statistically validate the model quality, except when doing so is too time-consuming (as in training with ImageNet, for instance). We focus mainly on TensorFlow
results and comment on the differences with PyTorch where
relevant; refer to our technical report [48] for PyTorch results.
Unless otherwise noted, we use the default configuration
with each benchmark. We keep all hyperparameters the same
as the baseline, except for the cases where specific settings
are stated in a compressor’s original paper. Specifically, for
EFsignSGD, we set
β= 1
and
γ
equal to the initial learning
rate. The performance of compressors is sensitive to a range of
factors such as the optimizer (e.g., SGD or ADAM), standard
hyperparameters (e.g, learning rate) and varying degree of
compression. Where practical, we experiment with multiple
values of these factors and report on their effects; however, a
complete sensitivity analysis is out of scope.
We report throughput as the average measured at steady state over the last 100 iterations of training. We measure the transmitted data volume in bytes based on the input size and a standard representation of data types (e.g., 4 bytes for float32, or 1 byte for 256-level quantized data).
The results shown in this section refer to experiments with 10 Gbps network links and OpenMPI over TCP as the collective library.⁷ We also run experiments with 25 Gbps links and observe mild improvements in throughput (on average, 1.3%).

⁷NCCL is faster than OpenMPI, but it constrains input sizes, preventing a fair comparison.
B. Model quality vs. training throughput
Figure 6 shows the effects of compression on model quality
as a function of throughput, normalized to the baseline
(highlighted with a vertical line in red). We present results
across different benchmarks. Compressors that achieve poor quality (below the y-axis cut-off) are omitted. In general, we
observe that training converges to solutions with quality metrics
comparable to the respective baselines in most cases. In some
cases, the model quality is slightly higher than the baseline.
This can be attributed to the stochastic nature of the process,
allowing compression to cancel out bad gradient directions;
[16] also reported this phenomenon.
With respect to throughput, several compressors perform
worse than the baseline. This happens for any benchmark
where the trained model is primarily computation-bound (e.g.,
ResNet, DenseNet, U-Net). In contrast, for communication-
bound models (e.g., NCF, VGG), there are several combinations
of compressors that mark a significant throughput improve-
ment. Note that, computation-bound models can also become
communication-bound due to lower speed network (e.g., in the
case of federated learning).
The recommendation benchmark (Figure 6d) is particularly
interesting. First, this is a previously unexplored benchmark in
the literature on compressed communication (which primarily
has focused on convolutional NNs). Second, it highlights that
there exists, in this particular case, a definite trade-off between
model quality and throughput: while many compressors achieve
1.5× to 4.5× speedup, quality lowers by up to 10%. Third, it illustrates that, for compressors with a tunable degree of compression, quality lowers as compression becomes more aggressive.

Fig. 6: Performance of compressors in terms of model quality vs. training throughput. (Plots omitted; panels: (a) Image Classification, ResNet-20, CIFAR-10; (b) Image Classification, DenseNet40-K12, CIFAR-10; (c) Image Classification, ResNet-50, ImageNet; (d) Recommendation, NCF, MovieLens-20M, highlighting Top-k with and without EF; (e) Language Modeling, LSTM, PTB; (f) Image Segmentation, U-Net, DAGM2007.)

Fig. 7: Performance of compressors in terms of model quality vs. data volume. (Plots omitted; panels: (a) Image Classification, ResNet-50, ImageNet; (b) Language Modeling, LSTM, PTB; (c) Recommendation, NCF, MovieLens-20M, highlighting Top-k with and without EF.)
Interestingly, these observations are not common in other benchmarks. For instance, QSGD and Top-k in the CIFAR-10 experiments achieve roughly the same model quality across varying degrees of compression.
Table I indicates with a ✓ where error feedback (EF, a.k.a. memory) is applied; we find that EF improves accuracy for those compressors, in particular with sparsification. However, our results empirically establish that EF harms the convergence of several quantization methods (SignSGD, SIGNUM, QSGD and TernGrad). In the case of SignSGD and SIGNUM, the issue is known and is fixed by design by EFsignSGD. Interestingly, and exclusively for the recommendation task, applying EF with Top-k, 8-bit, and Natural Compression leads to worsened model quality. We highlight the difference for Top-k in Figure 6d.
Takeaways.
No method consistently performs well across
all benchmarks and there is no strong correlation between
throughput and model quality.
C. Model quality vs. transmitted data volume
We now consider the trade-off between model quality and the transmitted data volume.⁸ Figure 7 shows, for each compressor, its best model quality and the average communicated data volume per iteration needed to achieve that quality (refer to [48] for the full set of results). In general, we observe that a compressor that sends more data achieves higher model quality. This holds in most cases, especially in the image classification on ImageNet, language modeling, and recommendation tasks, as shown in Figures 7a, 7b and 7c, respectively. However, we observe that for some compressors (e.g., Adaptive), a higher data volume results in lower model quality. This is consistent with previously published results [16], [21], [26].

⁸Because we do not implement packing, the data volumes are inflated for quantization methods. However, in a relative sense our results still hold.
Takeaways.
The quality vs. data volume trade-off is non-trivial; therefore, compression should be tuned carefully to deliver the best benefits for a given scenario.
D. Computational overheads of compression
We run a micro-benchmark experiment that measures the combined latency of compress and decompress in isolation; Figure 8 shows the results as a violin plot over 30 repetitions; the operators run on the GPU. The results show that compressors induce non-negligible overheads. We profile the code and observe the following: (i) Both Adaptive and DGC involve a loop to adjust the threshold to best match the target ratio. This is expensive; throughput improved by 2× when executing only one iteration. (ii) As shown in Figure 8, Random-k shows high overhead because the tf.random.shuffle operation executes on the CPU due to the lack of a GPU kernel. However, during real training, TensorFlow can schedule the device-host data transfer so that it overlaps with GPU computation, so this overhead is at times mitigated. (iii) In Random-k, tf.random.shuffle takes excessively long on the CPU for both the large embedding and fully-connected layers in recommendation and language modeling. The execution time far exceeds that of the forward pass, and hence the communication phase stalls waiting for this operation. (iv) 8-bit invokes a find_bins operation for each quantized value which, due to the lack of a GPU implementation, is executed on the CPU. (v) We also observe that some methods rely on expensive operations (e.g., tf.where). These are sparsification methods that rely on a threshold (e.g., Threshold-v, DGC) and quantization methods that choose target elements meeting a criterion (e.g., 1-bit SGD, TernGrad, 8-bit, Natural Compression). SketchML also imposes high overhead due to sketch operations.
Takeaways.
Implementing compressors requires careful engi-
neering, with custom GPU or well-optimized CPU code, to
account for their intrinsic computational overheads.
E. Machine learning toolkit, transport and links
Figure 9 shows the throughput of different compressors in the CIFAR-10 image classification task using PyTorch with different communication protocols: TCP and remote direct memory access (RDMA). Throughput is mostly consistent yet
higher than what we observe for most compressors in the
TensorFlow image classification tasks. The RDMA transport
protocol is consistently better than TCP.
Figure 10 shows the relative throughput for the same
experimental setup as Figure 6c, except it uses 1 Gbps network
links. As this setup emphasizes the network bottleneck, there
is now a large number of compressors that obtain a throughput
speedup over the baseline.
Takeaways.
The machine learning toolkit, as well as the transport protocol and network speed, affect compressor throughput.
F. Summary of observations
• No particular compression method outperforms every other across all experimental scenarios.
• The computational overheads of compression are not negligible. At higher communication bandwidths (10 Gbps or more), avoiding compression typically results in faster training, which agrees with the results of [35], [70].
• With some exceptions, error feedback (EF) is widely applicable to sparsification and improves accuracy significantly. However, its side-effect is memory overhead, which may lead to smaller mini-batch sizes.
• Higher data volume does not imply higher accuracy; however, we observe that when compression is heavy, a low data volume tends to decrease accuracy.
• The hosting ML framework influences performance only to a minor extent; the major performance variations are due to the underlying collective communication libraries.
VI. RELATED WORK
Yang [71] was one of the first to study the trade-off between computation and communication for distributed stochastic optimization. Since then, numerous approaches have been proposed; refer to the survey by Ben-Nun and Hoefler [72] for details. Below we cite those that are relevant to our work.
Compression for ad-hoc P2P overlays.
Unlike our work, which assumes all-to-all aggregation semantics (e.g., Allreduce), others [43], [73] consider an ad-hoc peer-to-peer network overlay, where nodes communicate only with neighbours. Although some of these methods use familiar techniques, like sparsification and quantization, their main characteristic is that they redefine the aggregation semantics to involve only a subset of workers at a time. We leave it as future work to integrate into our framework communication primitives that accommodate the P2P overlay setting.
Fewer communication rounds.
Some methods reduce the volume of transmitted data by communicating less often. CoCoA [74] is a dual coordinate ascent algorithm that performs several local steps before communicating with other workers. Wang and Joshi [75] propose periodic-averaging SGD, which updates the local model at each worker node and then uses periodic averaging to update the final parameters.
Asynchronous communication.
Hogwild! [76] proposes asynchronous parallel SGD, where the computing nodes access shared memory and can modify the parameters at any time without locking. De Sa et al. [77] develop a low-precision asynchronous SGD method and provide an FPGA implementation. Asynchronous communication is outside the scope of our paper.
Fig. 8: Latency of compress and decompress for different compressors with a range of input sizes (1 MB, 10 MB, 100 MB). (Violin plots omitted.)

Fig. 9: Throughput for ResNet-9 on CIFAR-10 contrasting TCP vs. RDMA performance in PyTorch. (Plot omitted.)

Fig. 10: Performance of compressors for ResNet-50 on ImageNet via a 1 Gbps network; legend as in Figure 6. (Plot omitted.)
Communication primitives.
SwitchML [7] uses a programmable network switch to implement in-network aggregation. Instead of compression, SwitchML reduces the transmitted data by computing on the network switch. A similar idea is explored in DAIET [78]. SparCML [79], on the other hand, implements a stream structure to support sparse tensors.
Other communication strategies.
OmniReduce [80] imple-
ments sparse Allreduce and sends the non-zero gradient blocks
to the workers. Gajjala et al. [81] use Huffman encoding
for efficiently packing and transmitting the quantized vectors.
DeepReduce [82] is a compressed communication framework
that allows both independent and combined compression of
values and indices of sparse tensors.
Model compression.
Instead of compressing the communi-
cated gradient, many papers propose to compress the model
parameters. ZipML [34], in particular, applies compression
similar to that of QSGD to the model, data, and gradient.
Model compression is orthogonal to our work and out of
scope; we refer to a survey by Guo [83].
VII. CONCLUSION
We survey the most influential methods for gradient compression in distributed, data-parallel DNN training. We propose GRACE, a unified framework with corresponding TensorFlow and PyTorch APIs, and implement 16 representative compression methods. We use convolutional and recurrent DNNs, as well as a variety of datasets and system configurations, to perform a thorough quantitative evaluation and report metrics that include accuracy, throughput and communication volume. We observe that the computational overhead of compression/decompression is non-trivial and may render several methods inapplicable in practice. We release our API, code and experimental results, as well as the DNN models and datasets. We envision that our work will benefit: (i) researchers, who will use it as the basis for consistent implementation and evaluation of new methods; and (ii) practitioners, who need an appropriate compression method for their training setup.
REFERENCES
[1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in CVPR, 2015.
[2] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in ICLR, 2015.
[3] J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in NAACL, 2018.
[4] D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, P. B. Gibbons, and M. Zaharia, “PipeDream: Generalized Pipeline Parallelism for DNN Training,” in SOSP, 2019.
[5] L. Luo, J. Nelson, L. Ceze, A. Phanishayee, and A. Krishnamurthy, “PHub: Rack-Scale Parameter Server for Distributed Deep Neural Network Training,” in SoCC, 2018.
[6] Y. Peng, Y. Zhu, Y. Chen, Y. Bao, B. Yi, C. Lan, C. Wu, and C. Guo, “A Generic Communication Scheduler for Distributed DNN Training Acceleration,” in SOSP, 2019.
[7] A. Sapio, M. Canini, C.-Y. Ho, J. Nelson, P. Kalnis, C. Kim, A. Krishnamurthy et al., “Scaling Distributed Machine Learning with In-Network Aggregation,” in NSDI, 2021.
[8] H. Robbins and S. Monro, “A Stochastic Approximation Method,” Annals of Mathematical Statistics, vol. 22, 1951.
[9] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding,” in NeurIPS, 2017.
[10] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, “signSGD: Compressed Optimisation for Non-Convex Problems,” in ICML, 2018.
[11] T. Dettmers, “8-Bit Approximations for Parallelism in Deep Learning,” in ICLR, 2016.
[12] S. P. Karimireddy, Q. Rebjock, S. Stich, and M. Jaggi, “Error Feedback Fixes SignSGD and other Gradient Compression Schemes,” in ICML, 2019.
[13] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-Bit Stochastic Gradient Descent and Application to Data-Parallel Distributed Training of Speech DNNs,” in INTERSPEECH, 2014.
[14] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, “TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning,” in NeurIPS, 2017.
[15] A. F. Aji and K. Heafield, “Sparse Communication for Distributed Gradient Descent,” in EMNLP-IJCNLP, 2017.
[16] Y. Lin, S. Han, H. Mao, Y. Wang, and W. Dally, “Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training,” in ICLR, 2018.
[17] S. U. Stich, J.-B. Cordonnier, and M. Jaggi, “Sparsified SGD with Memory,” in NeurIPS, 2018.
[18] Y. Tsuzuku, H. Imachi, and T. Akiba, “Variance-based Gradient Compression for Efficient Distributed Deep Learning,” in ICLR, 2018.
[19] J. Wangni, J. Wang, J. Liu, and T. Zhang, “Gradient Sparsification for Communication-Efficient Distributed Optimization,” in NeurIPS, 2018.
[20] D. Basu, D. Data, C. Karakus, and S. Diggavi, “Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification and Local Computations,” in NeurIPS, 2019.
[21] N. Dryden, S. A. Jacobs, T. Moon, and B. Van Essen, “Communication Quantization for Data-Parallel Training of Deep Neural Networks,” in MLHPC, 2016.
[22] J. Jiang, F. Fu, T. Yang, and B. Cui, “SketchML: Accelerating Distributed Machine Learning with Data Sketches,” in SIGMOD, 2018.
[23] H. Lim, D. Andersen, and M. Kaminsky, “3LC: Lightweight and Effective Traffic Compression for Distributed Machine Learning,” in MLSys, 2019.
[24] N. Strom, “Scalable Distributed DNN Training using Commodity GPU Cloud Computing,” in INTERSPEECH, 2015.
[25] M. Cho, V. Muthusamy, B. Nemanich, and R. Puri, “GradZip: Gradient Compression using Alternating Matrix Factorization for Large-scale Deep Learning,” in NeurIPS, 2019.
[26] T. Vogels, S. P. Karimireddy, and M. Jaggi, “PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization,” in NeurIPS, 2019.
[27] H. Wang, S. Sievert, S. Liu, Z. Charles, D. Papailiopoulos, and S. Wright, “ATOMO: Communication Efficient Learning via Atomic Sparsification,” in NeurIPS, 2018.
[28] M. Yu, Z. Lin, K. Narra, S. Li, Y. Li, N. S. Kim, A. Schwing et al., “GradiVeQ: Vector Quantization for Bandwidth-Efficient Gradient Aggregation in Distributed CNN Training,” in NeurIPS, 2018.
[29] S. Zheng, Z. Huang, and J. Kwok, “Communication-Efficient Distributed Blockwise Momentum SGD with Error-Feedback,” in NeurIPS, 2019.
[30] J. Bernstein, J. Zhao, K. Azizzadenesheli, and A. Anandkumar, “signSGD with Majority Vote is Communication Efficient and Fault Tolerant,” in ICLR, 2019.
[31] S. Horvath, C.-Y. Ho, L. Horvath, A. N. Sahu, M. Canini, and P. Richtárik, “Natural Compression for Distributed Deep Learning,” arXiv preprint arXiv:1905.10988v2, 2019.
[32] J. Wu, W. Huang, J. Huang, and T. Zhang, “Error Compensated Quantized SGD and its Applications to Large-scale Distributed Optimization,” in ICML, 2018.
[33] Y. Yu, J. Wu, and J. Huang, “Exploring Fast and Communication-Efficient Algorithms in Large-Scale Distributed Networks,” in AISTATS, 2019.
[34] H. Zhang, J. Li, K. Kara, D. Alistarh, J. Liu, and C. Zhang, “ZipML: Training Linear Models with End-to-End Low Precision, and a Little Bit of Deep Learning,” in ICML, 2017.
[35] Y. Li, J. Park, M. Alian, Y. Yuan, Z. Qu, P. Pan, R. Wang et al., “A Network-Centric Hardware/Algorithm Co-Design to Accelerate Distributed Training of Deep Neural Networks,” in MICRO, 2018.
[36] A. Dutta, E. Bergou, A. Abdelmoniem, C. Ho, A. Sahu, M. Canini, and P. Kalnis, “On the Discrepancy between the Theoretical Analysis and Practical Implementations of Compressed Communication for Distributed Deep Learning,” in AAAI, 2020.
[37] N. Ivkin, D. Rothchild, E. Ullah, V. Braverman, I. Stoica, and R. Arora, “Communication-efficient Distributed SGD with Sketching,” in NeurIPS, 2019.
[38] “Open MPI,” https://www.open-mpi.org/.
[39] “NCCL,” https://developer.nvidia.com/nccl.
[40] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola, “Parallelized Stochastic Gradient Descent,” in NeurIPS, 2010.
[41] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato et al., “Large Scale Distributed Deep Networks,” in NeurIPS, 2012.
[42] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long et al., “Scaling Distributed Machine Learning with the Parameter Server,” in OSDI, 2014.
[43] H. Tang, S. Gan, C. Zhang, T. Zhang, and J. Liu, “Communication Compression for Decentralized Training,” in NeurIPS, 2018.
[44] A. Sergeev and M. D. Balso, “Horovod: fast and easy distributed deep learning in TensorFlow,” arXiv preprint arXiv:1802.05799, 2018.
[45] Y. Nesterov, “Gradient Methods for Minimizing Composite Functions,” Mathematical Programming, vol. 140, no. 1, 2013.
[46] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in ICLR, 2015.
[47] J. Duchi, E. Hazan, and Y. Singer, “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization,” JMLR, vol. 12, 2011.
[48] H. Xu, C.-Y. Ho, A. M. Abdelmoniem, A. Dutta, E. H. Bergou, K. Karatsenidis, M. Canini, and P. Kalnis, “Compressed Communication for Distributed Deep Learning: Survey and Quantitative Evaluation,” KAUST, Tech. Rep., 2020. [Online]. Available: http://hdl.handle.net/10754/662495
[49] A. M. Abdelmoniem, A. Elzanaty, M.-S. Alouini, and M. Canini, “An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems,” in MLSys, 2021.
[50] M. Greenwald and S. Khanna, “Space-Efficient Online Computation of Quantile Summaries,” in SIGMOD, 2001.
[51] S. Arora, R. Ge, B. Neyshabur, and Y. Zhang, “Stronger Generalization Bounds for Deep Nets via a Compression Approach,” in ICML, 2018.
[52] Y. Li, T. Ma, and H. Zhang, “Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations,” in COLT, 2018.
[53] P. Jain, C. Jin, S. M. Kakade, P. Netrapalli, and A. Sidford, “Streaming PCA: Matching Matrix Bernstein and Near-Optimal Finite Sample Guarantees for Oja’s Algorithm,” in COLT, 2016.
[54] E. Oja, “Simplified Neuron Model as a Principal Component Analyzer,” Journal of Mathematical Biology, vol. 15, no. 3, 1982.
[55] D. Alistarh, T. Hoefler, M. Johansson, S. Khirirat, N. Konstantinov, and C. Renggli, “The Convergence of Sparsified Gradient Methods,” in NeurIPS, 2018.
[56] “Gloo,” https://github.com/facebookincubator/gloo.
[57] A. Krizhevsky and G. Hinton, “Learning Multiple Layers of Features From Tiny Images,” Technical report, UToronto, vol. 1, no. 4, 2009.
[58] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely Connected Convolutional Networks,” in CVPR, 2017.
[59] D. Page, “CIFAR10-fast,” https://github.com/davidcpage/cifar10-fast.
[60] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F. F. Li, “ImageNet: a Large-Scale Hierarchical Image Database,” in CVPR, 2009.
[61] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, “Neural Collaborative Filtering,” in WWW, 2017.
[62] “Movielens,” https://grouplens.org/datasets/movielens/.
[63] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, 1997.
[64] M. Marcus, B. Santorini, M. Marcinkiewicz, and A. Taylor, “Treebank-3,” https://catalog.ldc.upenn.edu/LDC99T42.
[65] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” in MICCAI, 2015.
[66] “29th Annual Symposium of the German Association for Pattern Recognition,” https://resources.mpi-inf.mpg.de/conference/dagm/2007/index.html.
[67] “LSTM-PTB,” https://github.com/tensorflow/models/tree/master/tutorials/rnn.
[68] “TensorFlow benchmark,” https://github.com/tensorflow/benchmarks.
[69] “NVIDIA deep learning examples,” https://github.com/NVIDIA/DeepLearningExamples.
[70] L. Luo, P. West, A. Krishnamurthy, L. Ceze, and J. Nelson, “PLink: Discovering and Exploiting Datacenter Network Locality for Efficient Cloud-based Distributed Training,” in MLSys, 2020.
[71] T. Yang, “Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent,” in NeurIPS, 2013.
[72] T. Ben-Nun and T. Hoefler, “Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis,” ACM Computing Surveys, vol. 52, no. 4, 2019.
[73] A. Koloskova, S. Stich, and M. Jaggi, “Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication,” in ICML, 2019.
[74] M. Jaggi, V. Smith, M. Takac, J. Terhorst, S. Krishnan, T. Hofmann, and M. I. Jordan, “Communication-Efficient Distributed Dual Coordinate Ascent,” in NeurIPS, 2014.
[75] J. Wang and G. Joshi, “Adaptive Communication Strategies to Achieve the Best Error-Runtime Trade-off in Local-Update SGD,” in MLSys, 2019.
[76] B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent,” in NeurIPS, 2011.
[77] C. De Sa, M. Feldman, C. Ré, and K. Olukotun, “Understanding and Optimizing Asynchronous Low-precision Stochastic Gradient Descent,” in ISCA, 2017.
[78] A. Sapio, I. Abdelaziz, A. Aldilaijan, M. Canini, and P. Kalnis, “In-Network Computation is a Dumb Idea Whose Time Has Come,” in HotNets, 2017.
[79] C. Renggli, S. Ashkboos, M. Aghagolzadeh, D. Alistarh, and T. Hoefler, “SparCML: High-Performance Sparse Communication for Machine Learning,” in SC, 2019.
[80] J. Fei, C.-Y. Ho, A. N. Sahu, M. Canini, and A. Sapio, “Efficient Sparse Collective Communication and its Application to Accelerate Distributed Deep Learning,” KAUST, Tech. Rep., 2020. [Online]. Available: http://hdl.handle.net/10754/665369
[81] R. R. Gajjala, S. Banchhor, A. M. Abdelmoniem, A. Dutta, M. Canini, and P. Kalnis, “Huffman Coding Based Encoding Techniques for Fast Distributed Deep Learning,” in ACM CoNEXT Dist. ML, 2020.
[82] K. Kostopoulou, H. Xu, A. Dutta, X. Li, A. Ntoulas, and P. Kalnis, “DeepReduce: A Sparse-tensor Communication Framework for Distributed Deep Learning,” arXiv preprint arXiv:2102.03112, 2021.
[83] Y. Guo, “A Survey on Methods and Theories of Quantized Neural Networks,” arXiv preprint arXiv:1808.04752v2, 2018.