A Survey of Deep Learning on CPUs: Opportunities and
Co-optimizations
Sparsh Mittal?, Poonam Rajput, Sreenivas Subramoney§
?ECE department, IIT Roorkee, CSE department, IIT Hyderabad,
§Processor Architecture Research Lab, Intel Labs, India.
sparshfec@iitr.ac.in,cs17mtech11019@iith.ac.in,sreenivas.subramoney@intel.com.
Abstract—CPU is a powerful, pervasive, and indispensable platform
for running deep learning (DL) workloads in systems ranging from
mobile to extreme-end servers. In this paper, we present a survey of
techniques for optimizing DL applications on CPUs. We include the
methods proposed for both inference and training and those offered
in the context of mobile, desktop/server, and distributed-systems. We
identify the areas of strength and weakness of CPUs in the field of
DL. This paper will interest practitioners and researchers in the areas
of artificial intelligence, computer architecture, mobile systems, and
parallel computing.
I. INTRODUCTION
The recent success of deep learning has made it ubiquitous in all
fields of human endeavor. Deep learning (DL) applications have
unique architectural characteristics and efficiency requirements
and hence, the choice of computing system can have a profound
impact on how large a piece of the DL pie a user can finally
enjoy. Even though accelerators may provide higher throughput
than general-purpose computing systems (CPUs) on DL applica-
tions, they may not be optimal on other metrics. For example,
the programming models of DL accelerators have not been
standardized, and their diversity leads to a lack of portability [1–3].
In comparison, the hardware/software stack of CPUs is already
well-established and understood and CPUs are also inevitably
present in any system. They can provide reasonable speedups on
a broad range of applications. In mobile and embedded domains,
CPU is still the most widely used computing system due to its high
availability, portability and software support. These features have
motivated researchers, including those from leading datacenter and
cloud-service provider companies such as Amazon [4], Facebook
[2, 3, 5–7], Google [8, 9], Microsoft [10–16] and Samsung [17–
19] to benchmark and optimize DL on CPUs.
Optimizing DL applications on CPUs, however, presents its
challenges. Achieving high efficiency requires carefully matching
the strength of CPUs with the architectural characteristics of
DL applications. For example, pruned deep neural networks
(DNNs) have sparse data-structure and perform sparse matrix
multiplication (MM), and hence, achieving high resource utiliza-
tion requires careful optimization. Similarly, convolution (CONV)
algorithm parameters, data layouts and numeric precision need to
be carefully chosen to strike the right balance between latency,
throughput and accuracy depending on the system configuration.
Further, DNN layer-specific optimizations are required for achiev-
ing best performance [6, 15, 20–24]. Systems at different scales
such as mobile and data-center, single vs. multi-node systems have
different properties and challenges, as do different DL algorithm-
s/applications such as convolution neural networks (CNNs) and
recurrent neural networks (RNNs), and different phases of DL, viz.,
inference and training. Evidently, addressing these challenges calls
for CPU-specific study and optimization of DL applications.
(This work is supported by Semiconductor Research Corporation.)
Contributions: This paper presents a survey on optimizing DL
on CPUs and CPUs for DL. Section II highlights the advantages of
CPUs for accelerating deep learning workloads and also classifies
the works based on critical parameters. Section III reviews tech-
niques for co-optimizing DNNs on CPUs. Section IV presents
techniques for optimizing DNNs, which are especially relevant
in mobile and data-center CPUs. It also discusses techniques for
data, thread and node-level parallelization. Section V concludes
the paper with a discussion of future outlook. Our goal in this
paper is not to delve into accelerator vs. CPU performance debate,
but to highlight the unique features, limitations and optimization
techniques of CPUs. We include studies performed on mobile,
server and cluster of CPUs. We also include works that profile or
benchmark DNNs on CPUs to gain insights. We include works
that utilize the computing power of CPU and not those that merely
use its memory, use it as a host, or only for parameter aggregation
in distributed training.
II. MOTIVATION AND OVERVIEW
A. Motivation for using CPUs for running DNNs
We now discuss the factors that motivate the use of CPUs for
accelerating DNNs.
High memory capacity of CPUs: 3D CONV, and even 2D
CONV with a large batch-size, requires a massive amount of mem-
ory. On GPUs, these fundamental primitives often get severely
memory-bottlenecked due to the limited memory capacity of
GPUs [25]. This forces researchers to use less accurate 2D CONV
operations. Since CPU-managed hosts in cloud and datacenter
scenarios have much larger memory capacities, running memory-
hungry operations such as 3D CONV on CPUs is not merely
attractive, but often imperative [26, 27].
Usefulness for medium-parallelism and sparse DNNs: In
some workloads such as RNN, the amount of computations
increases with rising sequence length. However, the parallelization
of RNN is challenging because of the dependencies between the
steps and the use of small batch size. Similarly, DNNs such
as InceptionNet variants have filter shapes of 1x1, 3x3, 1x3,
3x1, etc., which lead to irregular memory accesses and variable-
amount of parallelism across the layers. Such applications with
limited parallelism fit more naturally to CPUs, which have a few
fast cores, than to GPUs, which have many slow cores [10].
Likewise, task-parallelism is more suited to CPUs, whereas SIMD
(single instruction multiple data) parallelism is better exploited by
GPUs [28]. Similarly, sparse DNNs [15, 29–34] are inefficient on
massively parallel processors. It is because operations on sparse
data-structures have poor spatial locality due to irregular memory
accesses. They also prohibit effective use of optimizations such as
vectorization and cache tiling. Since, for hiding memory latency,
CPUs depend more on caches and out-of-order execution than
on parallelization, they can generally be more effective for sparse
DNNs.
Usefulness in mobile systems: Embedded/mobile computing
systems are rising in prominence; for example, more than 90% of
the advertisement earnings of Facebook come from mobile [2]. In
mobile computing systems, the CPU can provide similar or higher
performance than GPU [2, 17]. Also, for applications requiring
frequent or continuous inference, GPUs may not be most suitable
as they can quickly dissipate the battery [23]. Further, in smart-
phones, most DL frameworks do not support all operations on
GPU or DSP [2]. The unsupported computations are executed on
CPUs, which leads to CPU-GPU data-transfer overheads. These
factors motivate the use of CPU or CPU-accelerator heterogeneous
computing for accelerating NN computations [35].
Usefulness in data-center scenarios: Data-centers supporting
web services such as social networks see a significant fluctuation
in computing demand over time. CPUs can allow meeting this
variability in demand due to their high availability, efficiency for
both DL and non-DL tasks, and the ability to provide low latency
inference [6]. This allows datacenter and cloud service providers
to readily amortize their CPU-based server investments on DL
and non-DL tasks and optimize their business.
Usefulness in extreme environments: Processing systems
used in harsh environments such as space and defense require
a high-degree of fault-tolerance and security certifications [36].
Since radiation-hardening of accelerators has not been sufficiently
researched, CPUs remain the processing system of choice for
executing DNNs in such extreme environments [37].
Challenges in over-specialization: Real-life applications use
DNNs with heterogeneous properties, e.g., image classification
and understanding are performed using AlexNet and “long-term
recurrent convolutional network”, respectively, which have dif-
ferent architectural characteristics. For such scenarios, a general-
purpose architecture may provide higher efficiency than an over-
specialized design. In fact, application-specific accelerators are in-
feasible in IoT devices and wearables due to their tight power/area
budgets. For example, a smartwatch chip cannot host separate
accelerators for speech/audio/image/video processing [38]. Indeed,
the area of the Eyeriss accelerator is 30 times that of a Cortex A35 core
[30], and for such scenarios, a general-purpose core may be more
suitable than an accelerator.
Limitations of accelerators: Although accelerators provide
high throughput, they may not be optimal on other metrics.
Accelerators require long design cycles and massive investment,
and hence, they cannot be used in fast-changing domains such as
DL where algorithms and network architectures are undergoing
massive advances [38]. Also, reliable cost-sensitive deployment
of custom DL accelerators into real-world DL usage scenarios
is challenging due to complexities from heterogeneity, custom
software and programming support and integration overheads into
existing ecosystems. Further, achieving economies of scale and
business continuity both for customers and providers of datacenter
and cloud services of DL, is fraught with risk considering the
fragmented DL accelerator space. Furthermore, upgrading the
entire service from CPUs to accelerators requires high costs and
engineering work. Hence, CPUs are still used for many product
features. While large-scale companies such as Google, Amazon,
Facebook, Microsoft, etc. have the resources to build and maintain
their custom-accelerators from bottom-up, for other companies,
CPUs (or GPUs) remain the most feasible platform.
Also, accelerators have large data-transfer and network setup
overheads, which may nullify their performance advantage [39].
FPGAs run at low clock frequencies and are often difficult to
deploy and maintain [40]. GPU and “tensor processing unit”
(TPU) achieve high resource utilization and throughput when the
batch size is large. However, the strict latency constraints of real-
time inference prohibit the use of large batch sizes and CPUs can
generally provide the least or comparable latency at low batch
sizes [41]. Also, the model loading overhead of GPUs is higher
than that of CPUs [2, 42]. Hence, CPUs may be more suitable
for real-time inference and for scenarios where training times are
not very long [43]. Due to the low-precision operation of DSPs,
they may not be acceptable for applications requiring high accuracy.
Finally, the per-hour usage cost of accelerators is higher than that
of general-purpose CPU-based compute, e.g., a GPU costs 8× more
than a CPU on Amazon Web Services [44].
B. Classification
Tables I-V classify the works based on key characteristics. The
arithmetic intensity (AmI) of an application is defined as the ratio
of the number of computations performed to the amount of data
transferred to/from main memory [5, 15, 26, 45, 46].
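To make this definition concrete, the short sketch below (our own illustration, not taken from the surveyed works) estimates the AmI of a direct CONV layer from its shape, counting two FLOPs per multiply-accumulate and assuming each tensor crosses the memory bus exactly once as 4-byte floats.

# Rough arithmetic-intensity (AmI) estimate for a direct CONV layer.
# Assumes each tensor crosses the memory bus exactly once (no cache-reuse model).
def conv_arithmetic_intensity(n, c, h, w, k, r, s, bytes_per_elem=4):
    """n: batch, c: in-channels, h/w: output spatial dims, k: out-channels, r/s: kernel dims."""
    flops = 2 * n * k * h * w * c * r * s          # multiply + add per MAC
    data = bytes_per_elem * (
        n * c * (h + r - 1) * (w + s - 1)          # input fmap (valid CONV)
        + k * c * r * s                            # weights
        + n * k * h * w                            # output fmap
    )
    return flops / data

# Example: a VGG-style 3x3 CONV layer, batch size 1.
print(round(conv_arithmetic_intensity(1, 64, 56, 56, 64, 3, 3), 1))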
TABLE I
EVALUATION PLATFORM, CONTEXT AND METRICS
Evaluation on Simulator [34, 38, 47–49], Real CPU (all others)
Optimization
metric
Energy [1, 17, 21, 22, 48, 50–58], thermal issues
[2, 53], financial cost [43, 59], performance (nearly
all)
Comparison
with
execution in cloud [23, 46, 60], GPU [8, 19, 21, 52,
59, 61–63], FPGA [62], Xeon Phi [61, 62], neural
compute stick [19]
DL phase
evaluated
training [1, 8, 11, 13, 15, 25, 28–30, 33, 34, 42, 46,
49, 52, 61, 63–77], inference [1, 2, 4, 6, 9, 10, 13,
14, 16–21, 24, 25, 27, 30–32, 34, 37, 39, 42, 45, 47–
51, 53–57, 59, 64, 66, 67, 70, 76–87]
CPU model/vendor
Wearable [23, 38, 60]
Mobile ARM CPU [2, 4, 18, 20, 21, 23, 38, 39, 50, 53–
57, 68, 72, 81–83, 88–90], Qualcomm CPU [19, 50,
70, 81, 83, 89–91], Intel Atom [32, 50]
Desktop AMD CPU: EPYC [4], Opteron [73], A10-7850K
APU [52], IBM CPU: Power8 [61, 84], Power9
[68], SPARC CPU [37], Intel CPU: Coffee Lake [42],
Skylake [4, 19, 27, 37, 45, 49, 59, 64, 67–69, 71, 75,
76, 80], Broadwell [14, 26, 32, 45, 58, 66, 78], Ivy
Bridge [61, 74], Haswell [16, 31, 45, 48, 59, 63, 75,
80], Sandy Bridge [6, 10, 15, 25, 29, 63, 65, 73],
Westmere [9]
III. TECHNIQUES FOR CO-OPTIMIZING DNNS ON CPUS
We now review techniques for selecting optimal CONV scheme
(Section III-A) and optimizing data-reuse (Section III-B). In con-
text of sparse DNNs, we review hardware-aware pruning schemes
(Section III-C), schemes for avoiding ineffectual operations (Sec-
tion III-D) and for using randomized algorithms (Section III-E).
A. Choosing optimal CONV strategy
CONV can be performed using one of the four strategies [96]:
direct, lowering (based on “generalized matrix multiplication” or
GEMM), fast Fourier transform (FFT), and Winograd.
Zlateski et al. [45] compare Gauss-FFT, regular-FFT, and Wino-
grad based CONV implementations on seven CPU models. The
compute-to-memory ratio of these CPUs ranges between 14 to 41.
The highest AmI of FFT and Winograd transforms are 5.55 and
2.38, respectively, which are much lower than the compute-to-
memory ratio of these CPUs. They evaluate forward propagation
(FWP) time for each distinct layer of VGG and AlexNet. They
TABLE II
BENCHMARKS, DATASETS AND FRAMEWORKS (HMM = HIDDEN
MARKOV MODEL, MLP = MULTI-LAYER PERCEPTRON)
Benchmark/NN-model
CNN AlexNet [18, 19, 22, 24, 32, 45, 48, 50, 57, 61, 63,
64, 68, 69, 73, 82, 86, 92], GoogLeNet [18, 22, 24,
32, 47, 53, 57, 61, 76, 77, 81, 86], ResNet [4, 18–
20, 22, 27, 42, 47, 48, 53, 56, 63, 67–71, 76, 77,
81, 83, 86], VGG [4, 19, 20, 45, 46, 48, 61, 64, 71,
76, 81, 83, 85, 92], SqueezeNet [24, 37, 53, 77, 86],
LeNet [33, 51, 54, 57, 61, 74], Overfeat [46, 48, 64],
YOLO [22, 87], MobileNet [19, 39, 42, 58, 83, 86, 88],
NiN [48, 82], ResNext [77], DenseNet [77], ShuffleNet
[58], MNASnet [58], 3D CNN [28, 51, 59, 75]
RNN [10, 14, 27, 63, 80]
Attention-
based
[6, 16, 67]
NN+HMM [9, 46, 62, 65, 72, 79]
MLP [51, 52, 67]
Autoencoder [33, 52]
Dataset used
ImageNet [1, 2, 4, 6, 8, 15, 17–20, 22–25, 27, 29–32, 45–48,
51, 53, 58, 61, 64, 67–69, 71, 73, 75, 77, 81, 82, 84–
86, 88, 90, 91]
MNIST [1, 15, 25, 33, 37, 51, 54, 61]
CIFAR10 [1, 15, 25, 30, 49, 51, 55, 56, 61, 73]
Others Pascal VOC 2007 [22, 87], switchboard + hub500 [46],
One billion words benchmark [78], automatic speaker
verification spoofing and countermeasures challenge
dataset [23], Google streetview [23], Oxford flowers
[49], OpenBLAS [29], UEC-Food100 [82], KITTI
[21], Penn Treebank [80], Amazon-670K [66] and
Delicious-200K [66]
Frameworks used
Torch/Pytorch [2, 21, 22, 25, 29, 31, 39, 50, 58, 59, 63, 88]
TensorFlow [4, 10, 17, 25–27, 33, 37, 49, 53, 59, 61, 63, 66,
68, 76, 77, 84, 89]
Caffe [6, 15, 18, 24–26, 28, 30, 32, 34, 53, 55, 57, 58,
61, 63, 69, 73, 74, 81, 83, 90, 93]
Others MXNet [4, 63], CNTK [10, 14, 63], Theano [25, 28],
Apache SINGA [61]
Language/library used
MKL/MKL-
DNN
[6, 10, 14–16, 26, 28, 33, 34, 48, 49, 52, 59, 64, 67,
68, 71, 73, 75, 76, 77, 80, 81]
OpenBLAS [6, 15, 18, 29–31, 53, 57, 74, 82]
Others Eigen [9, 45, 70, 76, 81, 89], C++ [20, 75, 77, 89],
OpenCL [1, 17, 85], NNPACK [19, 81]
TABLE III
DNN-ARCHITECTURE/ALGORITHM/SOFTWARE-LEVEL OPTIMIZATIONS
(NAS = NEURAL ARCHITECTURE SEARCH)
CONV strate-
gies
Direct [15, 28, 56, 71], FFT [2, 28, 45, 56], Winograd
[2, 26, 45, 81], GEMM (nearly all others)
Optimizations
for sparse
NNs
Pruning [2, 51, 80, 94], weight sharing (k-means
clustering) [2], singular value decomposition [20,
23, 72], use of modified compressed-sparse row
(CSR) format [15, 95], avoiding transfer of [51] and
computations on [15] zero values, decomposing an
MM into a sparse and a dense MM [33]
Loop
optimizations
unrolling [9, 13, 14, 38, 64, 71, 89], interchange [13],
fusion [14]
Compiler op-
timizations
liveness analysis [47], automatic code generation
[48], static dependency analysis [30], instruction
reordering [30], template-metaprogramming [75, 89],
pre-computing constant functions/expressions [16,
89]
Algorithm/
heuristic used
linear regression model [24, 60], dynamic program-
ming [4], mixed-integer linear programming [23],
diamond search algorithm [22], load-balancing using
dynamic scheduling [8, 26, 85]
Hardware-
aware NAS
[58, 90, 91]
TABLE IV
MEMORY-RELATED OPTIMIZATIONS
Changing
data-layout/
alignment
[9, 11, 13, 16, 28, 46, 48, 55, 75, 81, 89]
Prefetching [34, 38, 48, 67, 71, 75]
Reducing
need of
temporary
memory
avoiding the need of lowering [15], expanding only
few columns during lowering [55], performing max
pooling in in-situ manner [55], after completion of a
layer, reallocating the memory of a layer to the next
layer [47]
Padding [9, 14, 59, 75]
Others
optimizations
removing nodes or code irrelevant to inference [14,
87], increasing register file (RF) bandwidth [48],
TLB optimizations [15, 67], NUMA-aware policies
[11, 73], Managed memory allocation in CPU-GPU
heterogeneous computing [17, 33, 87]
Improving reuse/locality
Tiling [13, 14, 29, 38, 46, 64, 65, 67, 71, 81]
Layer/operation-
fusion
[18, 20, 31, 64, 81, 89]
Scheduling scheduling computations so as to achieve reuse of
weights [28], executing computations of different
phases on same core [10], assigning consecutive
parts of the work on the same worker [8], executing
layers that supply data to other layer on the same
cluster [86]
Matrix-
related
MM fusion [10, 78], flattening 2D matrix to 1D
matrix to improve spatial locality [48]
Others Caching in low-motion videos [22], reusing register
operands within and across VFMA units [48]
TABLE V
COMPUTE-RELATED OPTIMIZATIONS
Improving
vectorization
efficiency
vectorizing CONV kernels across depth direction
[54], vectorizing across dense input in sparse-dense
MM [15], in cache-tiling, choosing one of the tile
dimensions to be a multiple of SIMD width [46],
mapping sparse data-structure as the same input to
all the lanes, so that the computations of all the lanes
become redundant and entire vector instruction can
be skipped [30]
Quantization [2, 6, 9, 13, 14, 16, 17, 20, 21, 31, 55, 70, 79, 82, 88]
Batching [1, 5, 6, 8–10, 16, 26, 34, 46, 65, 66, 68, 73, 78, 95]
Improving
concurrency
Pipelining [9, 14, 20, 34, 60, 65, 71, 72, 86, 87],
double-buffering [20, 89]
Approximate
computing
using lookup table to implement sigmoid/tanh func-
tions [16, 55] and arithmetic operations [79], im-
plementing tanh/sigmoid using fraction expansion
and controlling the number of terms to achieve
desired precision [10], inaccurate handling of de-
normal/NaN/infinity/zero [24], ignoring overflow [9],
performing quantization with a power-of-two scaling
[55], using cropped or resized input images [82],
low-rank decomposition [29]
Other
optimizations
using partial out-of-order core [38], choosing optimal
thread-granularity [24, 26]
find that overall, FFT CONV provides higher performance than
Winograd, although for different layers, FFT or Winograd has
higher performance. The data-movement between the highest level
of on-chip, core-exclusive cache (e.g., L2 in CPUs) and off-chip
memory accounts for memory loads, regular and streaming stores
to main memory and prefetches to main memory. By using the
AmI, data-movement, and number of FLOPs of different stages,
they compute the speedup of FFT CONV over Winograd CONV.
These theoretical estimates agree with their experimental results
that, overall, FFT CONV outperforms Winograd CONV.
In both Winograd and FFT CONVs, in most cases, transform
stages have poor utilization of computing resources. These stages
have low AmI and hence, become bottlenecked by the memory
bandwidth (BW). Winograd CONV performs fewer FLOPs than
FFT since it operates on real numbers. Yet, Winograd CONV can
work with tiny transform sizes (e.g., up to 6×6) only since it
becomes numerically unstable for larger sizes. By contrast, FFT
CONV can work with very large tile sizes (e.g., 31) since it does
not have instability issues. For large tile sizes, the image can be
partitioned into overlapping tiles with minimal padding overhead.
This reduces the number of FLOPs and data-movement in FFT
CONV below that of Winograd CONV. Thus, the type of CONV
layer and processor architecture decide whether Winograd or FFT
CONV is better, but on average, FFT CONV provides higher
performance than Winograd CONV. As the compute-to-memory
ratio of processors increases, FFT CONV will become faster than
Winograd CONV because it has higher AmI due to the use of
complex numbers.
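This roofline-style reasoning can be written down explicitly. The sketch below is our own illustration (the FLOP counts are made-up placeholders; the AmI values 2.38 and 5.55 are the transform AmIs quoted above): each algorithm's time is bounded either by compute or by memory traffic, and the faster one is chosen for the given CPU.

# Roofline-style comparison of two CONV algorithms on one CPU.
# ami: FLOPs per byte moved; flops: total work. Numbers are illustrative only.
def attainable_time(flops, ami, peak_flops, mem_bw):
    # Time is bounded either by compute or by memory traffic (FLOPs / (ami * BW)).
    return max(flops / peak_flops, flops / (ami * mem_bw))

def pick_conv(candidates, peak_flops, mem_bw):
    return min(candidates, key=lambda c: attainable_time(c["flops"], c["ami"],
                                                         peak_flops, mem_bw))

candidates = [
    {"name": "winograd", "flops": 1.0e9, "ami": 2.38},   # fewer FLOPs, low AmI
    {"name": "fft",      "flops": 1.4e9, "ami": 5.55},   # more FLOPs, higher AmI
]
# A CPU with a high compute-to-memory ratio favors the higher-AmI transform.
best = pick_conv(candidates, peak_flops=1.7e12, mem_bw=60e9)
print(best["name"])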
Budden et al. [26] implement fast Winograd CONV algorithms
on N-dimensional CONV. These algorithms prepare the matrices
with “simple” values such as integers. It helps in improving nu-
merical stability and also allows for preparing minimal algorithms
by hand. These algorithms identify and remove redundant sub-
expressions, which lowers the overhead of applying transform
matrices. Theoretically, a large speedup can be obtained for N-
dimensional tensors. So, they first extend fast CONV algorithms
to the general case of N-dimensional tensors. They note that for
N-dimensional CONV on CPUs, manually reducing the transform
overhead is not very important. That is because the MM overhead
can be amortized over more number of kernels and channels since
memory constraints are much less severe on CPU than they are
on GPU. They show that compared to direct CONV, fast tensor
CONV can provide up to 8× speedup, ignoring the overhead of
the matrix-tensor products. The sparsity of matrices allows for
increasing this speedup even further.
To achieve the peak throughput on CPU, one needs to ensure (1)
optimal utilization of single-core using SIMD instructions and (2)
scaling to multiple cores. Fast CONV algorithms perform sparse
computations, which lead to poor AmI. To improve AmI, they
perform CONV for multiple batches together. To further improve
the throughput, they use vectorization and “fused multiply add”
(FMA) operations. If the N-dimensional data-tensor is D_N, then
full utilization can be realized if D_N is an integer multiple
of the SIMD vector width. On a Xeon E7-8890 CPU, their
optimizations allow reaching 70% utilization, whereas the MKL
CONV primitives reach only 20% utilization.
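The SIMD-width requirement mentioned above can be met by padding. The snippet below is a hypothetical helper (ours, not from [26]) that zero-pads the innermost dimension of a tile to the next multiple of the vector width.

# Pad the innermost tensor dimension to a multiple of the SIMD width so that
# vector FMA lanes are fully utilized (hypothetical helper, not the authors' code).
import numpy as np

def pad_to_simd(x, simd_width=16):            # 16 fp32 lanes for AVX-512
    last = x.shape[-1]
    pad = (-last) % simd_width
    if pad == 0:
        return x
    widths = [(0, 0)] * (x.ndim - 1) + [(0, pad)]
    return np.pad(x, widths)                  # zero-pad; extra lanes contribute nothing

tile = np.random.rand(32, 30).astype(np.float32)
print(pad_to_simd(tile).shape)                # (32, 32): innermost dim is now 2*16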
They further parallelize the code such that different threads
process different subsets of tiles. Work-stealing scheduling is
used to achieve load-balancing. Let Q denote the number of
tiles processed by a thread. A too-large value of Q leads to
contention at the shared last-level cache (LLC), and a too-small
value degrades AmI. The final value of Q is selected based on
these considerations. This parallelization strategy provides near-
linear performance scaling, such as 17× speedup with 18 cores.
They further evaluate the execution time of three CONV layers
using their methodology and compare it with TensorFlow (TF)
with AVX (“advanced vector extensions”) support and Caffe using
MKL. Every layer has 32 channels and 32 kernels. The dimension
of each kernel is 4×4. With 18 cores, the throughput provided by
their technique is 10.9 MVox/s, whereas that with TF and Caffe
is 1.77 and 0.41 MVox/s, respectively.
Rajbhandari et al. [15] study the performance, multicore scal-
ing, and goodput (ratio of useful computation) of CNNs as a
function of their sparsity and arithmetic intensity (AmI). AmI
is approximated to twice the number of features in the dataset.
They note that the lowering scheme reduces AmI by virtue
of increasing the memory accesses. Hence, lowering+parallel-
GEMM leads to poor single-core performance for medium and
small-size CONVs. Also, it cannot skip computations on zero-
data and, therefore, leads to low goodput. The “Parallel-GEMM”
scheme divides the training inputs across cores. AmI per core is
reduced with increasing core count. Hence, parallel-GEMM shows
poor multicore scaling.
They present three techniques for improving CNN training
performance on multicore CPUs. (1) “GEMM-in-parallel” scheme
runs multiple single-threaded GEMMs simultaneously on different
cores. This scheme does not reduce AmI of each core and hence,
provides better scalability with the increasing number of cores.
This scheme is especially beneficial for CONVs with a smaller
number of output features since the “Parallel-GEMM” scheme
reduces their AmI even further.
(2) “Sparse Kernel:” They leverage sparsity of error gradients
to enhance the goodput of backward propagation (BWP) com-
putations by avoiding computations on zero values. They store
the error gradients in “column tiled compressed sparse row”
(CT-CSR) style, whereby sparse-matrix is tiled in a column-
wise manner, and then, every tile is stored in the CSR style.
This format is shown in Figure 1. By virtue of tiling in both
row and column dimension, CT-CSR achieves higher reuse of
tile values than CSR format. Without column-wise tiling, values
of two nearby rows may be mutually far, as decided by the
column width of the whole matrix. By contrast, CT-CSR stores
two adjacent elements in memory in adjacent rows within a
tile, which reduces TLB (“translation lookaside buffer”) misses.
Further, their technique generates vector instructions using “Intel
intrinsics” for vectorizing across the dense input and optimizing
cache reuse. Their code-generation engine utilizes “sparse-dense
MM” as the underlying code unit for efficiently executing CONV
with vectorization. The output of these code units is obtained in
place without requiring lowering. Thus, AmI is not reduced.
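The column-tiled CSR idea can be sketched in a few lines. The following is our reading of the CT-CSR layout of Figure 1, not the authors' code: each column tile is stored with its own Value, IA and JA arrays, and the example matrix mirrors the one in Figure 1.

# Column-tiled CSR (CT-CSR): tile the sparse matrix along columns, then store
# each tile in ordinary CSR form (values, row-pointer IA, column-index JA).
def to_ct_csr(matrix, tile_cols):
    n_cols = len(matrix[0])
    tiles = []
    for start in range(0, n_cols, tile_cols):
        vals, ja, ia = [], [], [0]
        for row in matrix:
            for j in range(start, min(start + tile_cols, n_cols)):
                if row[j] != 0:
                    vals.append(row[j])
                    ja.append(j - start)       # column index local to the tile
            ia.append(len(vals))               # row pointer for this tile
        tiles.append({"Value": vals, "IA": ia, "JA": ja})
    return tiles

M = [["A", 0,   0,   0,   0,   0],
     [0,   "B", "C", 0,   "D", 0],
     [0,   "B", 0,   "B", 0,   0]]
for t in to_ct_csr(M, tile_cols=3):
    print(t)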
(3) “Stencil-kernel:” For small CNNs, they use direct CONV,
which avoids lowering and hence, does not reduce AmI. It uses
stencil-style processing to spatially reuse inputs while they are in
the cache, for computing multiple output values. By comparison,
lowering forgoes reuse because of replicating the inputs. This
technique works in two steps: first, the “basic block generator”
produces register tiled vector instructions. It lowers the number
of loads and improves the reuse of loaded vector inputs. Second,
input and output are copied to contiguous memory regions to
improve TLB efficiency and then tiled to improve cache efficiency.
Overall, their technique profiles every layer with “parallel-
GEMM”, “GEMM-in-parallel”, stencil-kernel (FWP only) and
sparse-kernel (BWP only). Based on this comparison, the tech-
nique with the least latency is chosen for every layer. For BWP,
this comparison is repeated after a few layers to adapt to the
changes in the sparsity of the error gradient. For their system
and parameters, GEMM-in-parallel is superior to parallel-GEMM
when a layer has less than 1024 features, sparse-kernel is better
than “GEMM-in-parallel” for layers with more than 75% sparsity,
and stencil-kernel is superior to “GEMM-in-parallel” when a layer
has less than 128 output features. Their technique improves the
performance of a range of DNNs.
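The layer-wise selection logic just described can be summarized as a small decision rule. The sketch below uses the thresholds reported in [15] (1024 features, 75% sparsity, 128 output features), but the exact ordering of the checks is our paraphrase, not the authors' code.

# Layer-wise kernel selection following the thresholds reported in [15]
# (decision order is our paraphrase; the actual technique profiles each option).
def pick_kernel(phase, out_features, sparsity):
    """phase: 'fwp' or 'bwp'; sparsity: fraction of zeros in the error gradient."""
    if phase == "bwp" and sparsity > 0.75:
        return "sparse-kernel"                 # skip work on zero gradients
    if phase == "fwp" and out_features < 128:
        return "stencil-kernel"                # direct CONV, no lowering
    if out_features < 1024:
        return "GEMM-in-parallel"              # keeps per-core AmI intact
    return "parallel-GEMM"

print(pick_kernel("bwp", out_features=256, sparsity=0.9))   # sparse-kernel
print(pick_kernel("fwp", out_features=64,  sparsity=0.0))   # stencil-kernel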
B. Optimizing data-reuse
The parallel GEMM libraries aim at improving the performance
of GEMMs with high (e.g., >1000) data-reuse. However, in RNN
inference, the batch size is kept small for meeting the latency
SLA. Hence, the reuse in RNN remains small (e.g., <10). On
a modern CPU, RNNs fit completely in the L3 cache. Since the
shared L3 cache feeds to multiple private L2 caches, the data
required by multiple cores is transferred repeatedly between these
caches.
Fig. 1. Illustration of the CSR and CT-CSR formats [15]: the original sparse matrix is tiled along the column dimension, and each tile is then stored in CSR form (Value, IA, JA arrays per tile).
The data-transfer volume is decided by how the GEMM
computations are partitioned on the cores. For example, if two
cores compute the upper and lower half of the output matrix,
respectively, input matrix Q needs to be copied on the L2 cache
of both the cores. Similarly, if two cores compute the left and
right half of the output matrix, respectively, then input matrix
P needs to be replicated. However, parallel-GEMM libraries fail
to partition in a manner that fully leverages this data-reuse. On
Xeon E5-2650 CPU, the BW between L2 and L3 caches is only
62.5 GigaFloats/s. For a batch size of 1, the maximum data-reuse
of LSTM (“long short term memory”) is only 2, and hence, its
theoretical peak performance is 125 GigaFlops (=2×62.5). It is
below 8% of the peak performance of Xeon E5-2650 CPU (1.69
TeraFlops). Further, the parallel-GEMM libraries do not reuse the
weight matrix of RNN across different sequences.
Zhang et al. [10] present a technique for addressing the above
challenges. They model RNN computation as a directed acyclic
graph where each node is an MM computation and edges show the
dependencies. The building blocks of a schedule are the MMs. A
valid schedule is made up of a sequence of phases such that every
phase has a non-overlapping subset of nodes. If i < j, then the
nodes of phase i have to be run before the nodes of phase j. The
nodes of a phase can be run in parallel. Figure 2(a) illustrates two
valid phased schedules for LSTM. In the first schedule, all MMs
at time t are in phase t. If the MMs of a phase take hidden state
h_t as input, then the phase is termed a time-dependent phase.
Otherwise, the phase has no dependency across the sequence and
is termed a time-independent phase. For instance, in the second
schedule of Figure 2(a), phase 1 is time-independent, whereas the
remaining phases are time-dependent since they need the value of
h_{t-1} to find h_t.
Listing 1. Phased LSTM Schedules 1 and 2

// Phased LSTM Schedule 1
for t:
    Phase t:                      // time-dependent
        Wi.x_t, Wf.x_t, Wc.x_t, Wo.x_t
        Ui.h_{t-1}, Uf.h_{t-1}, Uo.h_{t-1}

// Phased LSTM Schedule 2
Phase 1:                          // time-independent
    Wi.x_0, ..., Wi.x_t; Wf.x_0, ..., Wf.x_t;
    Wc.x_0, ..., Wc.x_t; Wo.x_0, ..., Wo.x_t
for t:
    Phase (t+1):                  // time-dependent
        Ui.h_{t-1}, Uf.h_{t-1}, Uo.h_{t-1}
Fig. 2. (a) Two phased LSTM schedules (b) Parallel-GEMMs-in-sequence (c) Parallel-GEMMs-in-parallel [10]
To reduce the search-space, they leverage three heuristics: (1)
Due to the symmetry of time-dependent phases across timesteps,
the least-latency schedule is the same in every timestep. (2) If
two consecutive phases have no dependency, then their MMs can
be seen as part of a single phase. (3) Time-independent phases
are computed before all the dependent phases, as illustrated in
the second schedule of Figure 2(a). Further, they apply the following
four optimizations to improve data-reuse:
1. MM fusion: In every phase, a pair of MMs with a
common input is fused into a single MM. Assume MM1
computes C1[M,P] = A1[M,N] × B1[N,P] and MM2 computes C2[M,Q]
= A1[M,N] × B2[N,Q]. Then, they are fused by concatenating
B1 and B2 along the column dimension, such that C12[M,(P+Q)] =
A1[M,N] × B12[N,(P+Q)]. This improves data-reuse by enabling the
reuse of matrix A1.
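The fusion step itself is straightforward to express; the sketch below (ours) concatenates the two right-hand matrices, performs one GEMM, and splits the result, so A1 is read only once.

# MM fusion for two GEMMs sharing the left operand A1 [10]:
# C1 = A1 @ B1 and C2 = A1 @ B2 become C12 = A1 @ [B1 | B2].
import numpy as np

M, N, P, Q = 4, 8, 5, 3
A1 = np.random.rand(M, N)
B1 = np.random.rand(N, P)
B2 = np.random.rand(N, Q)

B12 = np.concatenate([B1, B2], axis=1)   # [N, P+Q]
C12 = A1 @ B12                           # one fused GEMM
C1, C2 = C12[:, :P], C12[:, P:]          # split back into the two outputs

assert np.allclose(C1, A1 @ B1) and np.allclose(C2, A1 @ B2)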
2. Finding parallelism-degree to optimize reuse: The naive
“parallel-GEMMs-in-sequence” scheme, shown in Figure 2(b),
seeks to parallelize the MMs on all the cores. It initially runs
the first MM on all the cores and then runs the second MM,
which leads to low performance due to the large overhead of data
movement. Since MMs in a phase can be executed independently,
they run multiple MMs concurrently such that every MM runs in
parallel. For instance, two MMs are run in parallel such that each
uses half the cores. This “parallel-GEMMs-in-parallel” scheme is
shown in Figure 2(c). Here, each MM leverages a limited set of
cores, which reduces data-duplication and boosts data-reuse. Since
both the number of MMs in RNNs and cores in CPUs are small,
the optimal degree of parallelism can be easily found.
3. Partitioning to reduce data-movement: They propose
a partitioning scheme which, given the parallelism degree D,
generates a D-partitioning of the MM computation such that the data-
movement between the L2 and L3 caches is minimized. Assume that
the MM C[i,j] = Σ_k A[i,k] × B[k,j] has D partitions. Here,
D_i, D_j and D_k are the number of partitions across the i, j and k
dimensions, and D_i × D_j × D_k = D. The data-movement is a
function of the input and output matrix sizes and the L2 and L3 cache
sizes. In their experiments, the input matrix fits in the L2 cache,
and all matrices together fit in the L3 cache. For such cases,
the combined data-movement is D_j|A| + D_i|B| + 2D_k|C|. This
quantity can be minimized by intelligently choosing the values of
D_i, D_j and D_k.
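Because the search space is tiny, the best partitioning can be found by brute force. The sketch below (ours) enumerates all factorizations of the parallelism degree D and scores them with the data-movement expression above; the matrix sizes are arbitrary placeholders.

# Choose (Di, Dj, Dk) with Di*Dj*Dk = D minimizing the modeled L2<->L3 traffic
# Dj*|A| + Di*|B| + 2*Dk*|C|, following the cost expression in [10].
def best_partition(D, size_A, size_B, size_C):
    best, best_cost = None, float("inf")
    for Di in range(1, D + 1):
        if D % Di:
            continue
        for Dj in range(1, D // Di + 1):
            if (D // Di) % Dj:
                continue
            Dk = D // (Di * Dj)
            cost = Dj * size_A + Di * size_B + 2 * Dk * size_C
            if cost < best_cost:
                best, best_cost = (Di, Dj, Dk), cost
    return best, best_cost

# Example: RNN-style GEMM with a large weight matrix B and small A and C.
print(best_partition(D=8, size_A=4_096, size_B=1_048_576, size_C=4_096))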
4. Weight-based streamlining: They further extend the above
partitioning scheme to leverage the reuse of weight matrices
(B) across time-dependent phases (TiDP) of a sequence. This
scheme ensures that weights needed for computing the partition
can be accommodated in the L2 cache of a single core, so
they can be reused before getting evicted. For this, it ensures
that parallel partitions are executed on the same core across
TiDPs. For this purpose, in OpenMP, they create a parallel region
spanning the whole RNN sequence of computation. It also reduces
the thread-creation overhead. Every thread executes at most one
partition in each TiDP, and its partition remains the same across
the TiDPs. They implement each partition using single-threaded
BLAS, which leverages vectorization and improves L1 cache
efficiency. From all the schedules generated above, their technique
chooses the one with the least latency. The above optimization
approach is applied only once for each RNN model and is used
whenever inference is performed.
Their technique (running on CPU) outperforms TF and CNTK
(running on CPU or GPU). Their technique is especially effective
for small batch sizes where reuse in a single phase is small. When
the batch size or matrix dimension is large, the entire weight
matrix does not fit in the L2 cache. However, individual weight
blocks fit in the L2 cache, and their technique exploits reuse of
weights across TiDPs. Their technique is better than cuDNN on
GPU for batch size below 15. cuDNN uses a single kernel for the
whole RNN sequence, whereas TF generates many nodes in the
computation graph, and the movement of tensors between nodes
leads to high overhead.
Fig. 3. (a) PRF design with 2 VFMA units (used in the Haswell processor) (b) a simple extension to 4 VFMA units leads to 12 reads/cycle from the PRF, which is infeasible (c) the technique of Jain et al. [48] inserts a "VFMA remote register" and across-VFMA connections for reducing the number of reads (d) architecture of a VFMA unit in their proposed design
Jain et al. [48] optimize the performance of the GEMM kernel on
CPUs. They note that increasing the number of VFMA ("vector
fused multiply add") units to increase CPU FLOPs presents the
challenge of supplying data to these units in each clock cycle.
Since GEMM uses register tiling, it repeatedly uses the data in
registers before accessing the L1 cache. Hence, a change in cache
size and/or BW alone has no impact on GEMM performance.
Both the register BW and the number of architectural registers need
to be increased together to obtain a significant speedup. Higher
register BW allows providing data to the VFMAs, and more
architectural registers allow using a larger tile size. GEMM
operations are performed using VFMA units, and each VFMA unit
needs 3 additional read ports in the physical register file (PRF).
However, increasing read ports increases the access energy and
latency of PRF [97].
Their technique focuses on reducing the number of PRF reads
by exploiting the data-reuse in GEMM operation. Figure 3(a)
shows the baseline CPU with 2 VFMA units. This design reads six
register operands from PRF in each cycle. Doubling the number
of VFMAs will require 12 PRF reads/cycle, as shown in Figure
3(b). In GEMM operations, matrix elements are reused. To exploit
temporal reuse to reduce the number of PRF reads, they add an
“architecturally visible register” termed “VFMA remote register”.
The proposed design is shown in Figure 3(c)-(d). This register
allows reusing an input at the same unit over multiple operations.
It can be written from other registers or the caches, but not from
the VFMA itself, and this restriction simplifies the hardware.
Further, they add uni-directional links between different VFMA
units (Figure 3(c)-(d)). These links allow reading an input operand
only once from the PRF and then reusing it across VFMA units.
These links do not transfer the VFMA output. Their changes allow
reusing the data within and across VFMA units, which reduces
the number of reads to RF and allows increasing the number of
VMFA units. They also discuss the “instruction set architecture”
(ISA) extensions and changes to instruction scheduling required
for utilizing the microarchitectural improvements.
They further present an “automatic code generation technique”
that generates code for GEMM based on the number of “archi-
tectural registers” and VFMA units to optimize data reuse. This
technique applies two strategies: (1) register tiling of both input
and output matrices, as shown in Figure 4(a). It allows data-reuse
in PRF, as shown in Figure 4(b). (2) prefetching-aware layout
transformation: They note that for many CONV layers, VFMA
utilization remains low despite the use of register tiling. Although
the memory access pattern is predictable, the CPU gets stalled on
cache misses because the prefetchers cannot prefetch across page
limits. In CONV layers, the matrix sizes are large, and hence, on
moving to the next row, the stride crosses a page boundary.
To mitigate this inefficiency, their technique flattens the 2D
input matrices A and B into 1D matrices such that memory
accesses are to contiguous locations. It is shown in Figure 4(c).
This transformation allows the prefetcher to prefetch the data in
the L1 cache, which improves VFMA utilization. This transfor-
mation can be performed before all the GEMM operations, and its
overhead is amortized over computations. To further reduce the
overhead, it can be interleaved with the GEMM operations. Their
technique improves the performance and “energy delay product”
of several NNs compared to an Intel Haswell server baseline.
Further, their technique benefits not only CONV layers but also
“fully connected” (FC) and LSTM layers. Also, the performance
of their automatically generated code is close to that of highly-
optimized Intel MKL.
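The prefetching-aware layout transformation has a simple software analogue: copy the register tiles of a matrix into one contiguous buffer in the order the GEMM loops will visit them. The sketch below is our illustration (tile sizes are arbitrary), not the authors' code generator.

# Prefetching-aware layout: pack the 2D tiles of a matrix into a contiguous 1D
# buffer so that the GEMM inner loops walk memory sequentially (a software
# analogue of the transformation described in [48]).
import numpy as np

def flatten_tiles(A, tile_rows, tile_cols):
    R, C = A.shape
    assert R % tile_rows == 0 and C % tile_cols == 0
    out = np.empty(R * C, dtype=A.dtype)
    pos = 0
    for i in range(0, R, tile_rows):
        for j in range(0, C, tile_cols):
            tile = A[i:i + tile_rows, j:j + tile_cols]
            out[pos:pos + tile.size] = tile.ravel()   # each tile becomes contiguous
            pos += tile.size
    return out

A = np.arange(64, dtype=np.float32).reshape(8, 8)
packed = flatten_tiles(A, tile_rows=4, tile_cols=4)
print(packed[:16])    # the first 4x4 tile, laid out contiguously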
C. Hardware-aware pruning techniques
Yu et al. [95] evaluate a hardware-unaware pruning scheme
using five CNNs on GPU, CPU and microcontroller. They find that
the performance improvement from pruning is much lower than
the fraction of reduction in computations. Sparse matrices lead
to irregular memory accesses, and decoding them requires extra
computations, which is evident from Figures 5(a)-(c). Since matrix
tiling and memory-coalescing cannot be performed on sparse MM,
pruning harms the performance of all CNNs on the GPU. Also,
pruning precludes the use of parallelization. On CPU, pruning
boosts the performance of FC layers by reducing the memory
accesses but harms the performance of CONV layers by reducing
the opportunity for weight-reuse. Since the simple architecture of
the microcontroller cannot hide memory latency, pruning boosts
performance on microcontroller by reducing model size.
Yu et al. present an architecture-aware pruning technique for
three classes of processors: (1) Highly-parallel processors such
as GPU rely on thread-level parallelism. For them, they perform
vertex-pruning, which utilizes mask layers for selecting unimpor-
tant vertices so that their output can be blocked. After training
of mask layers, the blocked vertices are removed, and once all
redundant vertices are removed, mask layers are removed, and
CNN is retrained. Node pruning does not make the CNN sparse,
and hence, it provides higher throughput than weight pruning on
GPU. (2) Processors with low-parallelism, such as Cortex-M4,
have in-order cores with only a few SIMD lanes and no caches. On
such processors, they prune weights based on SIMD awareness,
as shown in Figures 5(d)-(e). For this, weights are grouped in
size of SIMD width. Then, those groups are pruned whose “root-
mean-square” of weights is lower than a threshold. Pruning and
retraining are iteratively performed for maintaining the original
accuracy. A modified CSR scheme, shown in Figure 5(f), is used
for storing the "sparse weight matrix", which decreases the model
size. Also, loading and multiplication can be done using SIMD
instructions. (3) Moderately-parallel processors such as CPUs
leverage instruction/memory-level parallelism along with SIMD
units. For CPUs, they first use vertex pruning on CONV layers
and then use "SIMD-based weight pruning" on FC layers. Their
technique achieves higher performance than hardware-unaware
pruning with no accuracy loss.
Fig. 4. (a) Register tiling steps performed by the automatic code generator [48] (b) data orchestration (c) changing layout to improve prefetching efficiency
Fig. 5. (a)-(b) Hardware-unaware pruning (c) multiplication of an input vector with a "sparse weight matrix" (stored in CSR style) (d) weight-alignment according to SIMD width [95] (e) hardware-aware pruning (f) "sparse weight matrix" stored in a modified CSR format that enables SIMD multiplication
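A minimal sketch of the SIMD-aware group pruning described above (ours, not the implementation of [95]): weights in a row are grouped in chunks of the SIMD width, and a whole group is zeroed when its root-mean-square falls below a threshold; the iterative prune-and-retrain loop is omitted.

# SIMD-aware weight pruning in the spirit of [95]: prune whole SIMD-width groups
# whose root-mean-square is below a threshold, so surviving weights stay aligned
# for vector loads.
import numpy as np

def simd_aware_prune(W, simd_width=4, threshold=0.1):
    W = W.copy()
    rows, cols = W.shape
    assert cols % simd_width == 0
    groups = W.reshape(rows, cols // simd_width, simd_width)
    rms = np.sqrt((groups ** 2).mean(axis=2))          # one RMS per group
    groups[rms < threshold] = 0.0                      # drop weak groups entirely
    return groups.reshape(rows, cols)

W = np.random.randn(8, 16).astype(np.float32) * 0.2
print(simd_aware_prune(W, simd_width=4, threshold=0.15))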
Zou et al. [51] propose a technique to reduce data-
communication overheads while parallelizing CNN training on
a multicore processor. Figures 6(a)-(b) show two baseline tech-
niques. (a) Conventional parallelization: Here, a group of kernels
runs on each core and their output fmaps are broadcast to other
cores for allowing processing of the next layer. This scheme leads
to large data-transfer overhead. (b) Structure-level parallelization:
In this scheme, the DNN is transformed into a partially-connected
design. The cores do not broadcast the fmaps for certain layers, but
the output is consumed by neurons mapped to the same core. This
scheme reduces computation and data-transfer overhead at the cost
of accuracy loss. Also, it requires manually deciding which and
how many layers should be partitioned.
(c) “Communication-aware parallelization”: If the parameters
of a kernel are pruned to be zero during training, the output fmap
will be zero irrespective of the value of input fmap. Hence, the
input fmaps of this layer (i.e., output fmaps of the previous layer)
that will be multiplied with zero need not be transferred across
the cores. Their pruning scheme intentionally distributes the non-
zero weights at specific positions, which enables avoiding the
communication for zero-weights/fmaps. For achieving structured
pruning, they use the “group Lasso regularization” scheme. On
using mesh topology in the interconnect, the data-transfer cost
between two cores is decided by their Hamming distance. Hence,
the performance depends on the location of non-zero parameters in
a kernel. Therefore, their pruning technique also takes into account
the Hamming distances between the cores and, thus, reduces data-
transfer between distant cores. Their technique saves energy and
improves performance by reducing data-transfer overheads for a
negligible reduction in DNN accuracy.
Fig. 6. (a) Conventional (b) structure-aware and (c) communication-aware parallelization [51]
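The structured-sparsity ingredient of this technique, group Lasso regularization, can be sketched directly; the snippet below (ours) computes the penalty over per-core kernel groups, while the Hamming-distance-aware placement itself is hardware-specific and not modeled here.

# Group-Lasso regularizer over per-core kernel groups, the structured-sparsity
# ingredient of [51]. Penalizing the L2 norm of each group drives whole groups
# to zero, which lets the corresponding inter-core fmap transfers be skipped.
import numpy as np

def group_lasso_penalty(weights_by_group, lam=1e-3):
    return lam * sum(np.linalg.norm(g) for g in weights_by_group)

# Example: kernels of one layer partitioned across 4 cores.
rng = np.random.default_rng(0)
groups = [rng.standard_normal((16, 3, 3)) for _ in range(4)]
print(group_lasso_penalty(groups))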
Liu et al. [29] present a two-stage decomposition technique to
decrease the redundancy at inter-channel and intra-channel level.
In the first stage, decomposition is performed depending on the
“reconstruction error” of kernel weights. Then, the fine-tuning of
the network is performed while applying the sparsity condition. In
the second stage, the training error, sparsity of CONV kernels, and
the number of CONV bases are together optimized by minimizing
a “sparse group-lasso objective function”. Thus, they use low-
rank decomposition and also seek to achieve sparsity in filter
weights. With only a few bases, their technique achieves above
90% sparsity in CONV kernels with below 1% loss of accuracy.
Figure 7(a)-(b) contrast the operation of a CONV layer in the
conventional scheme and their technique, respectively.
Let R = P × Q, where P ∈ R^{m×k} is a (dense) input fmap
matrix and Q ∈ R^{k×n} is a (sparse) weight matrix. They further
present a sparse-dense MM scheme for efficiently running the
sparse CONV kernels. P and Q are split into tiles that can
be accommodated in the L2 cache. Further, every tile of P is
split into "row bands" of 8 elements, and every tile of Q is
split into "column bands" of 8 elements. Then, two bands are
multiplied to obtain an 8×8 matrix. Let this MM be written as
R' = P' × Q', with P' ∈ R^{8×k}, Q' ∈ R^{k×8} and R' ∈ R^{8×8}. For any matrix
M, let m_{i,*} be the ith row of M and m_{*,j} be the jth column
of M. The MM can be represented as r'_{*,j} = Σ_{i=1..k} p'_{*,i} q'_{i,j},
where 1 ≤ j ≤ 8. Here, every r'_{*,j} and p'_{*,i} is stored in one
AVX vector. For every non-zero value q'_{i,j}, i shows which p'_{*,i}
to multiply with and j shows which r'_{*,j} to save to. Since each
of them corresponds to an AVX register and Q' is a sparse matrix
which is fixed after training, they embed i and j into the code as
the index of registers. Figure 7(c) shows an example of the code
generated by their technique.
Fig. 7. (a) A CONV layer in a conventional CNN operates on a large number of CONV kernels (b) Liu et al. [29] apply decomposition on the channels and CONV kernels to obtain a highly-sparse kernel matrix (c) Pseudo-code generated for multiplying a sparse matrix with a dense matrix [29]
For MM operation, their algorithm achieves close to ideal
speedup from the sparsity. For all the layers of a CNN, their
technique achieves more than 90% sparsity and high speedup.
Their technique can accelerate both large kernels (e.g., 11×11)
and small kernels (e.g., 3×3), whereas the previous technique
can accelerate only large kernels.
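The code-generation idea can be imitated at a high level. The sketch below (ours, mirroring the style of Figure 7(c)) walks a fixed sparse matrix Q once and emits one fused multiply-accumulate statement per non-zero, with the row and column indices baked into the generated code.

# Emit unrolled multiply-accumulate statements for a fixed sparse matrix Q,
# in the spirit of the code generator of [29]. Each non-zero q[i][j] becomes
# "r_j += p_i * q_{i,j}", with i and j embedded as register indices.
def generate_sparse_dense_mm(Q):
    lines = []
    for i, row in enumerate(Q):
        for j, q in enumerate(row):
            if q != 0:
                lines.append(f"r{j} += p{i} * q_{i}_{j}")
    return "\n".join(lines)

Q = [[0, 0, 0, 0, 0, 0, 0, 1],    # non-zero at (0, 7)
     [0, 0, 0, 1, 0, 0, 1, 0],    # non-zeros at (1, 3) and (1, 6)
     [0, 0, 0, 1, 0, 1, 0, 0]]    # non-zeros at (2, 3) and (2, 5)
print(generate_sparse_dense_mm(Q))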
D. Skipping redundant instructions and memory accesses
Sen et al. [30] note that by leveraging dynamic sparsity, 25% to
60% of the computations in a DNN can be elided. However, their dynamic
nature and the fact that the sparsity levels are not very high
make it inefficient to leverage them in software. They propose
ISA and microarchitectural extensions to CPUs for exploiting
sparsity at the hardware level. To leverage sparsity, the processor
needs to dynamically detect whether an instruction (say I0)
produces a zero result and, if so, skip all future instructions that
become redundant due to I0 producing a zero value. For example,
in Figure 8(a), if the input (r8) is zero, then instructions 4, 6, and 7
become redundant. However, such instructions may not occur right
after I0 and may not occur in succession. Also, the instruction
to be skipped should not even be fetched, because squashing an
instruction after it is fetched leads to a pipeline bubble. However,
squashing a multi-cycle instruction after fetching can still provide
a performance improvement.
Figure 8(b) shows the overview of their technique. Their
technique adds a “sparsity register file” (sRF) and a “sparsity-
based skip address” (SBSA) table, which are shown in Figure
8(c). The sRF uses an isSparse bit to record which registers in
the RF store zero values. The "regUpdInFlight" bit shows if an
instruction updating the register is in flight inside the pipeline.
The SBSA table tracks under what conditions which instructions
can be avoided. An entry of the SBSA table has three fields:
(1) precedingPC: PC of the instruction just before the redundant
instruction sequence, (2) sRFCondition: an instruction is skipped
only if this condition is satisfied, and (3) instsToSkip: the length of the
redundant instruction sequence. For instance, the entry for skipping
instructions 6-7 in Figure 8(a) is shown in the second row of the
SBSA table shown in Figure 8(d).
They add a new instruction termed SBSA-LD, which loads a
specific memory region into the SBSA table. By using this instruc-
tion, the SBSA table can be pre-loaded at program startup. Since
CONV kernels use only a few library functions such as BLAS,
using only 20 entries in the SBSA table suffices for capturing all
the redundant sequences. Their technique is executed in parallel
to the instruction-cache access. The instructions/registers that operate
on or store values of sparse data-structures are identified during
compilation. Using a static dependency analysis, the instructions
that may become redundant due to the above instructions/registers
are marked. Based on these, the SBSA-LD instruction is added to the
program assembly.
Fig. 8. (a) Assembly-language code for vector dot-product (b) Overview of the technique of Sen et al. [30] (c) design of the sRF and SBSA table (d) example SBSA table entries. The dot-product code of panel (a) is:
(1)  LD r2, [p2]        // Load OUT
(2)  LOOP: LD r0, [p0]  // Load INP
(3)  ADD p0, p0, #1
(4)  LD r1, [p1]        // Load KER
(5)  ADD p1, p1, #1
(6)  FMUL r3, r1, r0    // r3 = INP * KER
(7)  FADD r2, r2, r3    // OUT += r3
(8)  INC INDEX
(9)  BNE INDEX, #N, LOOP
(10) ST r2, [p2]
If the result of instruction 2 is zero, instructions 4, 6 and 7 become redundant; if the result of instruction 4 is zero, instructions 6 and 7 become redundant; if the result of instruction 6 is zero, instruction 7 becomes redundant. Example SBSA entries (precedingPC, instsToSkip, sRF condition): (4067, 2, sRF[Rs1] | sRF[Rs2]); (4100, 5, sRF[Rs3] & sRF[Rs4]); (4250, 2, sRF[Rs3]). Each sRF entry holds an isSparse bit and a regUpdInFlight bit.
They perform instruction reordering so that the execution of I0
is finished before the fetching of redundant instructions may start. In
vector processors, generally, one input operand is the same for all
the eight lanes, whereas the other input is different. The sparse
data-structure is mapped as the same input to all lanes, which
ensures that the computations of all the lanes become redundant,
and the whole vector instruction can be skipped. They show the
detailed working of their technique for GEMM function from the
OpenBLAS library on an ARM processor. On both scalar and
vector processors, their technique accelerates both CONV and FC
layers having a wide range of sparsity.
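A software model of the skip decision may clarify the mechanism. The sketch below (ours; the register names refer to the dot-product listing reconstructed under Figure 8, and the PC-to-entry mapping is purely illustrative) consults an SBSA-like table after each instruction and jumps over the redundant sequence when the sparsity condition holds.

# Software model of the sparsity-based skip decision of [30]. The real mechanism
# is in hardware, performed in parallel with the instruction-cache access.
SBSA = {
    # preceding PC -> (sRF condition over producing registers, instructions to skip)
    5: (lambda srf: srf.get("r0", False) or srf.get("r1", False), 2),  # skip 6 and 7
}

def next_pc(pc, srf):
    entry = SBSA.get(pc)
    if entry and entry[0](srf):
        return pc + 1 + entry[1]     # jump over the redundant FMUL/FADD sequence
    return pc + 1

print(next_pc(5, {"r1": True}))      # 8: instructions 6 and 7 are skipped
print(next_pc(5, {}))                # 6: nothing to skip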
Akin et al. [34] propose a technique to (de)compress dynamic
data produced by DNNs on CPUs. They propose two SIMD
instructions, compL and compS. compS compresses and stores
the data, and compL loads and decompresses the data compressed
by compS. compS takes one 512b input register R1, one register
R2 as a pointer, and a condition flag (CF). It reads the input via
R1, compresses it, and stores it at the memory location pointed
to by R2. The CF condition can be either “equal-to-zero” or
“less-than-or-equal-to-zero”. The latter condition allows fusing
ReLU activation and compression into a single instruction. As
shown in Figure 9, compS compares each 32b element of 512b
vector in R1 based on CF. It creates a 16b (=512b/32) mask as
the compression metadata. The mask is concatenated with the
uncompressed elements of the vector and stored at the location
pointed by R2. Also, R2 is incremented by the amount of data
written plus header-size, which allows repeatedly applying compS
to compress and store more values.
For the example shown in Figure 9, assume that CF checks
elements that are equal to or less than zero and 7 out of 16
elements meet this condition. Then, the mask becomes 0x2D94.
Based on the mask, uncompressed elements are gathered together
and along with the mask, stored at the location pointed by R2. R2
is incremented by 7*4B+2B = 30B. These instructions can replace
the regular load/store instructions.
Fig. 9. Implementation of compS [R2], R1, #CF instruction [34]
They can be inserted at the
beginning and end of various layers so that data is (de)compressed
before writing to or reading from memory.
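The following is a minimal Python sketch of the compS/compL semantics described above (a mask header plus the gathered retained lanes); the bit ordering and helper names are illustrative assumptions rather than the actual ISA definition.

import numpy as np

def comp_store(vec16, cond="le_zero"):
    # vec16: the sixteen 32b lanes of a 512b register
    drop = (vec16 <= 0) if cond == "le_zero" else (vec16 == 0)
    mask = sum(1 << i for i in range(16) if not drop[i])   # 1 = lane retained
    kept = vec16[~drop]                                    # gather retained lanes
    nbytes = 2 + 4 * kept.size                             # 16b header + 32b values
    return mask, kept, nbytes                              # R2 would advance by nbytes

def comp_load(mask, kept, dtype=np.float32):
    # compL: scatter the retained values back; dropped lanes become zero
    out = np.zeros(16, dtype=dtype)
    out[[i for i in range(16) if (mask >> i) & 1]] = kept
    return out

vec = np.array([1, 0, -2, 3, 0, 0, 5, 0, 0, 7, 0, 0, 4, 0, 0, 9], dtype=np.float32)
mask, kept, nbytes = comp_store(vec)
assert np.array_equal(comp_load(mask, kept), np.maximum(vec, 0))   # fused ReLU view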
Since different steps of compression/decompression happen
sequentially, they can lead to large memory latency. This latency
can be hidden by using a stream prefetcher, since a layer writes
the fmaps sequentially, and another layer reads them sequentially
without performing random accesses. They slice the fmap into
chunks, and different threads compress their own chunks in
parallel using different compressed data pointers. Their technique
reduces physical memory usage but not virtual memory usage.
Their benchmarks show sparsity ratios above 50%, and hence,
the overhead of metadata is easily amortized. Use of pipelining,
prefetching, parallel execution and bulk communication of large
fmaps helps in hiding the latency of logic micro-ops. Their com-
pression technique allows the fmaps to be stored in the on-chip
cache, which reduces off-chip traffic. Compared to vcompress
and vexpand instructions in AVX512 ISA, their instructions use
fewer static instructions and registers. Hence, their technique can
work for both small and large fmaps sizes.
E. Use of randomized algorithms
During training, for each training data point, performing FWP
and BWP operation only on very few sampled neurons is suf-
ficient. Locality sensitive hashing (LSH) functions are those for
which collision probability increases monotonically with increas-
ing similarity. LSH algorithm provides a natural approach for
adaptive sampling since it allows sampling neurons in proportion
to weights without calculating the activations, i.e., without know-
ing the input. Since this sampling approach makes the network
sparse, it forgoes the parallelism advantage of GPUs, and hence,
it is more suited for implementation on CPUs.
Chen et al. [66] propose using randomized algorithms for
accelerating NN computations on CPU. In the initialization phase,
their technique initializes K×L LSH functions and L hash tables
for every layer. K denotes the number of hash codes in every hash
table. Let N_l^j denote neuron j in layer l, h_l denote the hash function,
and x_l denote the input for layer l. The hash bucket mapped by the LSH
function h_l(w_l^a) saves the ID a of the neuron. Each bucket has
B entries, and by choosing a small value of B, the memory usage
and overhead of parallel accumulation can be kept low. During the
FWP phase, in each layer, instead of computing all the activations,
their technique computes h_l(x_l). By using the hash codes, the
IDs of sampled (and hence, active) neurons are retrieved from
the matching “buckets” in hash tables. For instance, in Figure
10, h_1(x_1) is calculated and is used for retrieving N_1^2 and N_1^4
as the sampled neurons. Only their activations are computed
and propagated as inputs to the subsequent layer. Remaining
activations are taken as 0 and hence, not computed. In their
technique, zero values are not accessed and no computations
happen on them.
Fig. 10. Working of the FWP phase in the technique of Chen et al. [66]. For an input, first its H1 hash code is obtained and, from the first hidden layer, active neurons are ascertained. Activations are found only for these neurons. The same procedure is repeated for successive layers. Each layer uses multiple hash tables; only one is shown in the figure (L=1).
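The sketch below illustrates, under simplifying assumptions, the FWP sampling just described for one FC layer: signed random projections serve as the LSH family (the actual hash family and the bucket-size cap B differ in [66]), neuron IDs are stored in hash buckets, and only the retrieved neurons' activations are computed.

import numpy as np

class LshLayerSampler:
    def __init__(self, weights, K=6, L=4, seed=0):
        # weights: (num_neurons, fan_in); one row per neuron of this layer
        rng = np.random.default_rng(seed)
        self.weights = weights
        self.proj = rng.standard_normal((L, K, weights.shape[1]))
        self.tables = []
        for l in range(L):
            table = {}
            codes = (weights @ self.proj[l].T) > 0          # (neurons, K) sign bits
            for nid, bits in enumerate(codes):
                table.setdefault(tuple(bits), []).append(nid)   # bucket stores IDs
            self.tables.append(table)

    def active_neurons(self, x):
        # hash the layer input and return neuron IDs from the matching buckets
        active = set()
        for l, table in enumerate(self.tables):
            key = tuple((self.proj[l] @ x) > 0)
            active.update(table.get(key, []))
        return sorted(active)

    def forward(self, x):
        # compute activations only for sampled neurons; the rest are taken as 0
        out = np.zeros(self.weights.shape[0])
        ids = self.active_neurons(x)
        out[ids] = self.weights[ids] @ x
        return out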
BWP phase proceeds similarly to the FWP phase. The error is
back-propagated layer-by-layer for computing the gradients and
updating the weights. Partial gradients are propagated only to
active neurons in earlier layers through the connected weights.
Thus, any inactive neuron or weight, which is not part of the
FWP phase for input, is not accessed. Thus, the sparsity is fully
exploited, which reduces the number of arithmetic operations in
their technique to much less than that performed in GEMM. After
updating of weights, the positions of neurons in the “hash tables”
have to be modified accordingly. It requires the removal of a
neuron from the old bucket and insertion to a new bucket. The
update frequency of hash tables is reduced exponentially to reduce
its overhead, since the magnitude of gradient updates decreases
over training iterations.
Gradients are calculated independently across different input-
items in the batch. The randomness and high degree of sparsity in
gradient updates allow parallelizing gradient-accumulation asyn-
chronously over the training data-items since updates are unlikely
to create conflicts. This feature enables their technique to achieve
near-linear speedup with a rising number of cores. A small degree
of overlap is tolerated based on the HOGWILD idea. No memory
access or computation is performed on zero values.
They experiment with different hash functions in the LSH
family. They also experiment with different sampling strategies,
e.g., taking the most frequently occurring neurons in the L hash tables,
taking neurons occurring with a frequency above a threshold,
etc. They compare the performance of their technique on a 44-
core Xeon E5-2699A CPU with that of TF on the same CPU
and Tesla V100 32GB GPU. They use FC networks on extreme-
classification datasets viz., Amazon-670K, and Delicious-200K.
TF on CPU uses vectorization and has the least performance. Their
technique uses multithreading but no vectorization, and it outper-
forms GPU. The speedup is more significant on the Amazon-670K
dataset since it is a bigger dataset. In their technique, less than
0.5% of neurons are active, which reduces memory accesses and
computations. Still, their technique does not compromise on the
accuracy per iteration. With increasing batch size, the advantage
of their technique over TF-GPU grows further. Note that high
sparsity of these datasets may undermine the benefit of GPU. They
also evaluate the sampled softmax algorithm, which performs
static sampling of neurons. This algorithm requires sampling
of 20% of the total number of classes for achieving the same
accuracy as their technique. It confirms the advantage of their
LSH-based adaptive sampling technique.
IV. OPTIMIZATIONS AT VARIOUS SCALES
We now discuss techniques for optimizing CNNs in mobile
(Section IV-A) and data-center scale (Section IV-B) CPUs. We
also discuss parallelization techniques at data, thread (Section
IV-C) and node-levels (Section IV-D).
A. Optimizations for Mobile CPUs
Wu et al. [2] discuss the challenges faced by Facebook in
running the Facebook app, which runs multiple DL applications,
on edge devices such as smartphones. They focus on smartphone
models that account for 85% of the market. Mobile chipset types
and performance: No single chipset dominates the market. The
number of different “system-on-chips” (SoCs) on which Facebook
app runs is more than 2000 for Android and only about 12
for iOS. Also, there is a significant difference in the peak
performance of various mobile SoCs. Optimization techniques are
required for enabling DL inference on devices with widely varying
performance to achieve satisfactory user-experience.
Prospects of CPU and GPU: On a median mobile SoC, the
theoretical peak performance of CPUs equals that of GPUs and
only on 11% of mobile SoCs, the performance of GPU is 3× that of
CPU. Further, compared to CPUs in high-end SoCs, CPUs in mid-
end SoCs are only 20% slower, but the GPUs in mid-end SoCs
are up to 4× slower than the GPUs in high-end SoCs. Further, the
lack of high-BW memory and sharing of the BW between CPU
and GPU constrain the performance of GPU.
Further, in smartphones running Android, the programming
support for mobile GPUs is not fully mature. For example,
some smartphones run older versions of OpenCL/OpenGL; in
some smartphones, loading the library leads to failure or crashes,
whereas others have no library. Hence, in the mobile domain,
accelerators such as DSP/GPU have limited scope. In fact, in the
mobile realm, a significant fraction of inference runs on CPU
because of its wide availability, standardization, and software
support. The use of mobile GPUs is feasible where the software
support is mature such as in iPhones.
CPU architecture: In 2018, 72% of smartphones still used
CPU cores designed before 2013. Further, Android smartphones
generally have a higher number of less powerful cores, whereas
iOS smartphones have fewer but more powerful cores. The two
most widely-used CPU models are Cortex A53 (48% share) and
Cortex A7 (15% share). Both these cores are superscalar, in-
order and allow using one to four cores per cluster. Most SoCs
have big.LITTLE architecture [98] where one cluster has “high-
performance cores” and another cluster has “energy-efficient
cores”. There is a shared cache between cores in the same
clusters but not between cores in different clusters. Hence, the
synchronization between clusters incurs a high cost, and as such,
Facebook apps are targeted to run on the cluster with high-
performance cores.
Optimizations to Facebook app: To account for the limited
memory capacity of mobile, they use techniques for reducing
model and binary size, e.g., weight/channel pruning, quantization,
selecting optimal spatial resolution and reducing the complexity
of DL algorithm. Further, the app uses NNPACK [99] and
QNNPACK [7] libraries, which offer efficient implementation of
CONV and other DNN primitives on mobile CPUs. The use of
two libraries allows achieving high performance on a range of
smartphones and usage scenarios.
Comparison of CPU and DSP: They evaluate DNN models
used for virtual reality on the CPU and DSP of a smartphone. The
memory space is shared between DSP and CPU, but they have a
separate layer of caches. For all DNNs, DSP outperforms CPU,
with a mean speedup of 1.9×. The speedup is high for DNNs
with simple CONV operations, such as image classification. The
speedup of DSP over CPU is lower for DNNs with memory-
bound operations such as depth-wise CONV and “pose estimation
models”. In DSP, load-store operations happen at the granularity
of 128B or more. It requires extra memory transformation op-
erations. Also, for memory-intensive layers, e.g., group CONV,
additional computations are needed for optimizing the memory
layout of filters and activations for reaping the full benefits of
vectorization.
They also compare the power, thermal, and performance profile
of CPU and DSP. CPU consumes 2× the power of DSP, and thus,
due to its high power-dissipation, thermal throttling is performed
on CPU. It reduces CPU power, but it still stays higher than that
of DSP. Throttling reduces the throughput of CPU significantly.
Compared to CPU, DSP also has a lower variation in inference
latency, which leads to better user experience. Due to these rea-
sons, the Facebook app executes virtual-reality related models on
DSP. The limitation of DSP is its higher programming overhead,
the need for optimizing the layout and lower accuracy due to the
use of fixed-point (FxP) data/computations.
Lai et al. [55] propose software kernels for executing NNs on
ARM Cortex-M CPUs that implement SIMD instructions, such
as 16-bit MAC (multiply accumulate) operations. Quantization:
They develop kernels that can work with both 8b and 16b data.
A few Arm Cortex-M processors lack a dedicated “floating-point
unit”. Hence, they implement quantization with a power-of-two
scaling, which only requires bit-shift operations. Converting from
8b to 16b data-type requires data-transformation. They optimize
data-transformation to reduce the reordering overhead. Also, MM
is implemented with 2x2 kernels to achieve data-reuse and reduce
the number of load instructions.
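As an illustration of power-of-two scaling, the following Python sketch (an assumption about the general idea, not the CMSIS-NN source) quantizes values to q7 with a chosen number of fractional bits and rescales a 32b accumulator with a simple right shift, so no floating-point unit is needed at inference time.

import numpy as np

def quantize_q7(x, frac_bits):
    # value is represented as q * 2^(-frac_bits)
    return np.clip(np.round(x * (1 << frac_bits)), -128, 127).astype(np.int8)

def requantize_shift(acc32, out_shift):
    # rescale a 32b accumulator of 8b x 8b products back to q7 by a right shift
    return np.clip(acc32 >> out_shift, -128, 127).astype(np.int8)

w_q = quantize_q7(np.array([0.5, -0.25]), 7)                 # weights, 7 frac bits
a_q = quantize_q7(np.array([0.75, 0.1]), 5)                  # activations, 5 frac bits
acc = w_q.astype(np.int32) @ a_q.astype(np.int32)            # 32b accumulation
print(requantize_shift(acc, 7))                              # output with 5 frac bits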
Weight reordering and partial-lowering: Although imple-
menting larger kernels improves the scope for data-reuse, the
availability of limited registers prohibits the use of large kernels.
As the weights are constant during inference, they reorder the
weight matrix to interleave the row data so that it can be read in
single pointer access. Performing CONV using lowering-method
involves reordering and expanding the image input using im2col
and then performing GEMM operation. However, the im2col
operation requires a large amount of temporary memory. To avoid
the memory overhead, they expand only a few (e.g., two) columns.
This partial im2col approach brings the memory footprint
of the CNN to 133KB, whereas the naive im2col consumes
332KB memory.
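The following numpy sketch illustrates the partial im2col idea under simplifying assumptions (stride 1, no padding, hypothetical function names): only a few im2col columns are materialized at a time, and a small GEMM is applied to each partial buffer, so the temporary memory stays proportional to the number of expanded columns rather than to the whole lowered matrix.

import numpy as np

def conv2d_partial_im2col(x, w, cols_per_step=2):
    # x: (C, H, W) input; w: (K, C, kh, kw) kernels
    C, H, W = x.shape
    K, _, kh, kw = w.shape
    oh, ow = H - kh + 1, W - kw + 1
    wmat = w.reshape(K, C * kh * kw)                    # flattened, reordered weights
    out = np.zeros((K, oh * ow), dtype=x.dtype)
    for start in range(0, oh * ow, cols_per_step):
        idx = range(start, min(start + cols_per_step, oh * ow))
        cols = np.empty((C * kh * kw, len(idx)), dtype=x.dtype)   # small buffer
        for j, p in enumerate(idx):
            r, c = divmod(p, ow)
            cols[:, j] = x[:, r:r + kh, c:c + kw].reshape(-1)
        out[:, start:start + len(idx)] = wmat @ cols    # GEMM on the partial buffer
    return out.reshape(K, oh, ow)

x = np.random.rand(6, 12, 12).astype(np.float32)
w = np.random.rand(16, 6, 5, 5).astype(np.float32)
y = conv2d_partial_im2col(x, w)                         # shape (16, 8, 8)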
Choice of layout: With a batch size of one, there are two
data-layouts: CHW and HWC (C=channel, H= height, W=width).
GEMM performance is independent of the layout. However, data-
movement operations in im2col are more efficient with HWC-
layout. im2col is executed only along the height and width
dimensions. With HWC-layout, data can be copied efficiently
since the data for every pixel is stored at contiguous locations.
Hence, they assume an HWC layout for applying CONV kernel.
Pooling: Pooling can be implemented in window-based (i.e.,
traditional) or split-x/y manner. In the split-x/y style, pooling is
performed first in x-dimension (along the width) and then in y-
dimension (along with the height). It allows reusing the average/-
max operations performed in x-dimension for the y-dimension
also. To avoid the need for extra memory for storing interim
result after x-dimension pooling, they overwrite the input value.
Compared to window-based pooling, split-x/y pooling improves
performance without any extra memory cost.
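A minimal Python sketch of split-x/y pooling, assuming a single-channel fmap whose dimensions are divisible by the pooling size: the x-dimension results are written over the input buffer, and the y-dimension pass then reuses them, so no interim buffer is needed.

import numpy as np

def split_xy_maxpool_inplace(x, k):
    # x: (H, W) fmap, overwritten in place; H and W assumed divisible by k
    H, W = x.shape
    for j in range(0, W, k):
        x[:, j // k] = x[:, j:j + k].max(axis=1)        # x-dimension pooling
    for i in range(0, H, k):
        x[i // k, :W // k] = x[i:i + k, :W // k].max(axis=0)   # reuse row results
    return x[:H // k, :W // k]

fmap = np.random.rand(8, 8)
pooled = split_xy_maxpool_inplace(fmap, 2)              # shape (4, 4)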
Activation functions: To implement the ReLU layer, they
create a mask based on the sign bit of a q7_t number. For this,
the byte-level subtraction instruction (QSUB8) is used. It is
performed in a SIMD manner, which offers a 4× performance-
boost over the scalar operation. Sigmoid and tanh functions are
implemented using lookup tables with fixed-point input/output.
They use the “CMSIS-NN kernels” [55] for a CNN which is
trained on the CIFAR-10 dataset. On a Cortex-M7 core with
216MHz frequency, they achieve a frame-rate of 10.1 frames
per second. Their implementation provides a large improvement in
throughput and energy efficiency compared to a baseline that uses the
arm_conv 1D CONV function from the CMSIS-DSP library.
Lee et al. [54] present a vectorization-friendly CONV scheme
for improving CNN throughput. LeNet-5 has three CONV layers,
denoted as C1, C2, and C3. Let N denote CONV kernel type,
and W/D/H denote kernel width/depth/height. C1 has N types of
H×W CONV kernels. C2 and C3 have N types of 3D CONV
kernels (D×H×W). For C1, N=6, H=5, and W=5. For C2, N=16,
H=5, W=5, and D=6 and for C3, N=120, H=5, W=5, and D=16.
Kernel C2 is shown in Figure 11(a). In the traditional scheme,
there are N×H kernels of size W in 2D CONV and N×D×H
kernels of size W in 3D CONV. With 128B vector register and
16b FxP weights/inputs, there are 8 lanes in each vector. With
the traditional approach, W=5 and thus, 3 out of 8 lanes remain
unused for C1, C2, and C3.
Their proposed approach works by vectorizing the CONV
kernels in the depth direction. With this approach, there are W×H
kernels of size N in 2D CONV and N×W×H kernels of size
D in 3D CONV. Figure 11(b) shows the shape of kernel C2
after reshaping. Here, for C1 and C2, the number of lanes used
increases from 5 to 6. For C3, the vector size is 16, and thus,
each row uses 2 vector registers, and there is no idle lane. Figure
11(c)-(d) illustrates the working of their technique for the C3 layer.
Figure 11(e) shows the usage of SIMD lanes in the traditional and
proposed technique. They evaluate their technique on a Cortex-
A53 quad-core processor using LeNet-5 trained on the MNIST
dataset. On both single-core and multi-core (with OpenMP par-
allelization), their technique provides higher performance and
energy efficiency than a traditional SIMD implementation.
Xu et al. [22] note that during continuous vision scenarios,
mobile devices have no or low motion. Hence, the adjacent frames
in a video have regions with a high degree of similarity. They
exploit this locality by using a cache, which reduces latency and
energy consumption. They note that with increasing layer-ID, the
size of a reusable region is reduced, and this is referred to as
“cache decay”. As shown in Figure 12(a), if the input to a CONV
layer has a reusable portion of 5x5 pixels, then its output has a
reusable portion of only 3x3 pixels. Pooling and “local response
normalization” (LRN) layers also create cache decay. In fact, FC
layers delete the reusable portion since every element in its input
determines the value of the output element. Thus, due to cache
decay in multiple layers, a large reusable portion vanishes after a
few layers. Hence, they perform reuse only on CONV layers to
bound the memory overhead and also because CONV layers take
the largest fraction of execution time. Since the inputs of each
layer have different dimensions and semantics, their technique
matches only the input image and passes on the cached portions
for the entire inference.
Matching decisions: They note that pixel-level matching and
reuse are ineffective since similarity scores of matching pixels
are inadequate even for similar scenes in two overlapped images.
Hence, they perform matching at the granularity of a chunk of 8x8
pixels. Thus, a single or few changed pixel(s) do not impact the
chunk-wise matching result if the remaining pixels in the chunk
show a match. The matched chunks are chosen in a way that they
can be combined into larger chunks. For example, in Figure 12(b),
the similarity score may be highest for the match shown in (i), but
this match is not suitable for caching. It is because small chunks
disappear after a few layers due to cache decay. For example, for
a 3x3 CONV kernel and 5x5 chunks B1 and B2, the reusable
portion has two 3x3 boxes, i.e., 18 pixels. By comparison, the
match shown in Figure 12(b)(ii) finds two neighboring chunks
in the current frame that bear similarity to the chunks in the
prior frame and hence, these two chunks can be consolidated into
a single one. Thus, the reusable portion becomes a 3x10 box,
which has 30 pixels.
Their technique searches the same-size chunk in the previous
frame with the highest match using the “diamond search algo-
rithm”. The “average chunk movement” is computed as the mean
movement of the matched chunks whose “peak signal to noise
ratio” (PSNR) exceeds a threshold. Then, for every chunk in the
present frame, its PSNR is computed with the chunk at an offset
of “average chunk movement” in the previous frame. If the PSNR
exceeds the threshold, they are assumed to be matched. Finally,
the neighboring matched chunks are merged into larger chunks.
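The sketch below illustrates the chunk-matching step under simplifying assumptions (grayscale frames, a fixed 8x8 chunk size, a hypothetical PSNR threshold): each chunk of the current frame is compared, at the average-chunk-movement offset, with the previous frame, and chunks whose PSNR exceeds the threshold are treated as matched.

import numpy as np

def psnr(a, b, peak=255.0):
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10 * np.log10(peak * peak / mse)

def matched_chunks(prev, cur, offset, chunk=8, thresh=30.0):
    # offset = (dy, dx): the average chunk movement estimated beforehand
    H, W = cur.shape
    dy, dx = offset
    matches = []
    for y in range(0, H - chunk + 1, chunk):
        for x in range(0, W - chunk + 1, chunk):
            py, px = y + dy, x + dx
            if 0 <= py <= H - chunk and 0 <= px <= W - chunk:
                if psnr(cur[y:y+chunk, x:x+chunk],
                        prev[py:py+chunk, px:px+chunk]) >= thresh:
                    matches.append((y, x))      # reusable chunk of the current frame
    return matches

prev = np.random.randint(0, 256, (64, 64)).astype(np.uint8)
cur = np.roll(prev, 8, axis=1)                  # simulate slight horizontal motion
print(len(matched_chunks(prev, cur, offset=(0, -8))))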
During inference, they adjust the cache mapping based on the
properties of each layer. Figure 12(c) illustrates propagation of an
exemplary reusable portion across different layers. To understand
the figure, assume that a block (100, 40, <100, 100>) is matched
to a block (100, 40, <120, 120>) in the last frame, where the
format for showing a block is (width, height, <coordinates of top-
left point>). This image is the input of a CONV layer. For this
layer, the reusable portion of the output is computed as (45, 15,
53, 53). The ReLU layer does not change the reusable portion as
it operates individually on the value of each input. But the pooling
layer reduces the size of the reusable portion due to the padding.
During CONV operation, first, the reusable portions are copied
from the cache of the previous frame, and then, CONV is
performed only on the remaining pixels. Further, the output fmap
of every CONV layer is cached until the completion of the
inference of the subsequent frame. Although their technique reuses
similar image patches, accuracy is lost since the patches may
be numerically different. To avoid the accuracy loss due to the
use of cached data from very old frames, they flush the cache
and compute an entirely new frame after every 10 frames. For a
range of DNNs, their technique reduces energy consumption and
processing latency with only small accuracy loss.
Lu et al. [100] study the characteristics of AlexNet, ResNet-50,
GoogleNet and VGG-16 on CPUs of Jetson TK1 and TX1. TK1
has a 2.5GHz 32b quad-core Cortex-A15 CPU and 2GB DDR3L
memory. TX1 has a 1.9GHz 64b quad-core Cortex-A57 CPU and
4GB LPDDR4 memory. For AlexNet, CONV layers run faster on
TX1 CPU, whereas FC layers run faster on TK1 CPU. The FWP
latency is lower on TK1 even though its CPU is weaker than that
of TX1. Although the clock frequency of TX1 is lower, it executes
more instructions in each cycle than TK1 (3 vs. 2). Hence, it has
higher performance for CONV layers, which are compute-bound.
TK1 has a larger L2 cache and since TX1 uses 64b address, more
memory is consumed for addressing and less memory remains for
saving the data in the cache. Hence, data needs to be fetched more
frequently when running FC layers on TX1 and this is responsible
for large latency.
For the remaining three DNNs, TX1 CPU provides lower FWP
latency than TK1 CPU. Since VGG-16 performs multiplications
of large-size matrices, the throughput on VGG-16 is double of that
on AlexNet.
Fig. 11. (a) Original dimensions of CONV kernel C2 (b) dimensions of C2 after vectorization-friendly reshaping [54] (c) with conventional shapes, on vectorizing the C3 layer, only 5 out of 8 lanes are used (d) after reshaping [54], all 8 lanes are used (e) utilization levels [54]: the idle lanes per vector fall from 3/3/3 (C1/C2/C3) with conventional shapes to 2/2/0 with depth-directional reshaping.
Fig. 12. (a) Illustration of cache-decay due to CONV: a reusable region of 5x5 pixels in the input fmap shrinks to 3x3 pixels in the output fmap (b)(i) a match with the highest similarity score but a low number of reusable pixels (b)(ii) a match with a high number of reusable pixels [22] (c) change in the reusable portion as it passes through different layers.
On TX1, the inference latency of ResNet is lower
than that of GoogleNet, even though ResNet has twice the number
of FLOPs than GoogleNet. It is because GoogleNet uses LRN
layers, which account for 55% of FWP latency on TX1. Also,
GoogleNet has a higher number of CONV layers and multiplies
matrices of smaller size. Since TK1 and TX1 run 32b and 64b
OS (respectively), TX1 needs more memory due to the higher
needs of the framework itself.
Motamedi et al. [24] present a technique to select the best
thread-granularity while running a CNN on a mobile SoC. Here,
threads imply threads of CPU, DSP and GPU. They note that
launching the highest number of logical threads reduces data-
reuse and exacerbates thread scheduling overhead. Also, due to the
limited resources of mobile SoC, thread-execution is serialized.
Hence, this approach does not provide the highest performance.
Further, various CNNs and even various layers of a CNN show
the highest performance with different numbers of threads. Also,
for the same CNN, the optimal thread-count is generally higher
for more powerful SoCs and smaller for less powerful SoCs.
Android devices allow two modes of approximate computing
[101]: (1) relaxed mode where denormalized numbers are handled
inaccurately. (2) imprecise mode where NaN, infinity and ±0
are also handled inaccurately. By using these modes, SIMD
instructions can be enabled and the application can be parallelized
over a higher number of threads. Their technique evaluates the
impact of changing the mode of each layer on the overall accuracy
of the CNN. Based on this, for every layer, either exact or one
of the approximate computing modes is selected. Further, their
technique exploits parallelization opportunities at different levels,
e.g., all output fmaps of a layer are obtained parallelly. Also,
multiple pixels in an output fmap are computed in parallel. In
their technique, a separate thread computes every pixel and thus,
the total number of threads equals the total number of pixels in
all output fmaps of a layer. The computation of a thread is further
accelerated using sub-word parallelization. Also, loop-order is
optimized for improving cache performance and a suitable data-
layout is chosen for improving BW utilization.
They study the use of coarse thread-granularities, which is
shown as the number of pixels (say Q) computed by each thread.
For deciding which Q pixels should be computed by a thread, they
discuss two schemes: (1) A pixel is selected from another position
in the same output fmap. Here, the kernel is reused Q times, which
reduces memory accesses. (2) A pixel is selected from the same
position in another output fmap. These pixels are produced from
CONV of the same input fmap with different kernels, and thus,
the input fmap is reused Q times. From these schemes, they select
one which leads to fewer memory accesses.
They develop a linear regression model for correlating the
“computational complexity” of each layer with its latency under
a fixed frequency (1497MHz on Snapdragon 800). A separate
regression model is developed for each value of Q. Also, separate
models need to be developed for every SoC. They find using
Q = 1 always leads to the highest latency and the best latency
value is up to 4× lower than the latency with Q = 1. Using
this model, they select the value of Q for each layer, which
provides the lowest latency. They perform experiments on Nexus 5
(with Snapdragon 800) and Nexus 6 (with Snapdragon 810) using
AlexNet, SqueezeNet, and GoogLeNet. Their technique reduces
the energy and latency compared to baseline execution. In fact,
the latency with their technique is close to that obtained using the
exhaustive search.
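The following Python sketch shows the flavor of the per-Q regression models (the calibration numbers and function names are hypothetical): for each candidate Q, latency is fit as a linear function of a layer's computational complexity, and the Q with the lowest predicted latency is chosen for a new layer.

import numpy as np

def fit_models(calib_complexity, calib_latency_per_q):
    # calib_latency_per_q: {Q: measured latencies of the calibration layers}
    return {q: np.polyfit(calib_complexity, lat, 1)
            for q, lat in calib_latency_per_q.items()}

def best_q(models, layer_complexity):
    # pick the Q whose model predicts the lowest latency for this layer
    return min(models, key=lambda q: np.polyval(models[q], layer_complexity))

# hypothetical calibration data for one SoC at a fixed frequency
complexity = np.array([1e6, 5e6, 2e7, 8e7])
latency = {1: [4.0, 19.0, 80.0, 330.0], 4: [1.5, 6.0, 23.0, 95.0],
           16: [1.2, 4.5, 18.0, 70.0]}
models = fit_models(complexity, latency)
print(best_q(models, 3e7))            # Q selected for a new layer of this complexity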
Mehta et al. [38] note that the battery capacities of a typical
smartphone and smartwatch are 2550mAh and 300mAh, respec-
tively. Hence, making the right choice of core-type, core-count,
and their capabilities are essential to meet the quality-of-service
(QoS) in the processors used for wearables such as smartwatches.
They study the characteristics of wearables to find the optimal
core design for them. They first identify 10 frequently-used and
computation-intensive applications running on wearables, which
have strict QoS requirements. These applications termed Wear-
Bench come from domains such as image/video/audio/speech
processing. Most of these applications are parallel and can ben-
efit from using multiple cores. Also, they are data/computation-
intensive, but not control-intensive. Also, they have high L1 data
cache miss-rate. 25% of their operations are vector operations,
and hence, cache misses have a crucial impact on performance.
Based on these factors and constraints, a smartwatch needs to
use multiple (e.g., 4) simple cores that are reasonably efficient
for applications running on them. An inorder core is stalled on
each miss, and hence, it is inefficient due to the high data-
cache misses of WearBench. Therefore, they use a partial out-
of-order core, which avoids read-after-write hazards by using a
scoreboard and can have at most two outstanding cache misses.
It is less complicated than a full out-of-order core due to not
using speculation or renaming and having a smaller L2 cache.
Since scoreboarding is not enough for reaching the performance
of an out-of-order core, they use a “software-assisted hardware
prefetcher” for prefetching the data to the L1 cache. It reduces the
number of L1 cache misses. The application-developer needs to
insert a prefetching instruction in the program. Since the instruc-
tion is inserted before the loop, its overhead remains insignificant.
The hardware need not track the data-item to be prefetched as
the instruction itself specifies it. Instead, the hardware finds the
correct prefetch distance. Also, prior information on the number
of cache-blocks to be prefetched allows the hardware to prefetch
across physical pages. Results confirm that on the metric of
performance/area and performance/power, their core design is
better than both inorder and out-of-order cores. Also, “software-
assisted hardware prefetching” is crucial for high performance and
is more effective than hardware-prefetching.
B. Optimizations at Data-center scale
Hazelwood et al. [3] discuss the hardware used by Facebook
data-centers for supporting a diverse range of machine learning-
based services/products. Hardware infrastructure: Their data-
centers have nearly eight major types of compute and storage
racks. For instance, a 2U chassis has three compute sleds, which
can support two types of servers. One option is a “single-socket
CPU” (1xCPU), which supports web-tier. It is a throughput-driven
state-less application and hence, works well with a power-efficient
CPU (e.g., Broadwell-D) with somewhat small memory and
storage capacity. Another sled alternative is a “dual-socket CPU”
(2X Skylake SP or Broadwell-EP) with high memory capacity for
supporting memory and computation-intensive applications.
Training: Offline training of different services is performed
over different platforms. Sigma (used for classification and
anomaly detection) and “News Feed” are trained on CPUs.
Language translation, speech recognition and Lumos (used for
extracting high-level features from images) are trained on GPUs.
For Facer (face detection and recognition framework), the generic
model is trained on GPUs after many months due to its sta-
bility. The user-specific model is trained on 1xCPUs when a
sufficient number of new images have been generated. Similarly,
for “Search”, both CPU and GPU are used for training. GPU
is primarily used for offline training due to its throughput-
optimized architecture. Although GPUs provide higher throughput
than CPUs, the availability of a large number of CPUs makes them
an attractive target for running DL applications, especially during
off-peak hours. They also leverage distributed training techniques
for scaling to CPU-GPU heterogeneous computing platforms.
The global scale services need hundreds of terabytes of data
and complex processing. The inability to rapidly and continuously
train DL models can have severe consequences such as presenting
irrelevant/stale news and ads, and not blocking spam and offensive
contents. Further, while data-workload is rapidly-changing and
complicated, training operations show higher stability (only a
few key operations) and regularity and prefer processors with-
out cache/thread-contention. Hence, these workloads are run on
different processors. The data-processing servers read the data,
process and compress them and transfer them to training servers
that exclusively focus on efficient training.
Inference: Different services have different memory and
computation requirements. As an example, the “ads ranking
scheme” performs multiple rounds of screening using a multi-
layer perceptron-like model. This model has a sparse embedding
layer and hence, has a high memory footprint. Therefore, the
later rounds, where the memory footprint becomes even higher,
are executed on a different server. Most of the online inference
is done on 1xCPUs or 2xCPUs. Since 1xCPUs have higher
energy and cost-efficiency, they are preferred over 2xCPUs. Some
services can be run on the powerful mobile devices of end-user,
which reduces communication overheads. 2xCPUs are required
for running some bulky services. The latency SLAs also determine
the compute-platform chosen for running a service.
Gupta et al. [5] study three recommendation models (RMs),
termed RM1, RM2, and RM3, which are representative of the
production-class RMs used at Facebook. Characteristics of RMs:
The input to RMs is the interaction between the user and the
content, e.g., the user preferences for videos. These inputs include
both dense features (e.g., videos seen by many users) and sparse
features (videos seen by a user). Sparse features are shown as
multiple vectors of sparse IDs. Dealing with sparse features
requires the use of “embedding tables” (ETs), which convert from
sparse to dense format.
The bottom-FC layers process dense features. Their outcomes
are concatenated and processed by the top-FC layers, which
predicts the “click-through-rate” of the user and the content
(video/post). The requests for different user-post pairs are batched
together. A single RM may require up to 20GB memory. Also,
in RM1, RM2 and RM3, the size of ETs are 100MB, 10GB and
1GB, respectively. Further, embedding table (ET) operations lead
to irregular memory accesses. Hence, on a Broadwell CPU, this
causes an LLC MPKI (misses per kilo-instruction) of 8, which
is orders of magnitude higher than that of FC layers of RM or
a typical CONV layer in a DNN. Similarly, the element-wise
addition has very poor AmI.
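To make the access pattern concrete, the following numpy sketch mimics the SparseLengthsSum semantics (gather embedding rows by sparse IDs and sum them per bag); the table sizes are illustrative, and the scattered gathers over a large table are the irregular accesses responsible for the high LLC MPKI.

import numpy as np

def sparse_lengths_sum(emb_table, ids, lengths):
    # emb_table: (num_rows, dim); ids: flat list of sparse IDs; lengths: IDs per bag
    out = np.zeros((len(lengths), emb_table.shape[1]), dtype=emb_table.dtype)
    start = 0
    for i, n in enumerate(lengths):
        out[i] = emb_table[ids[start:start + n]].sum(axis=0)   # irregular gathers
        start += n
    return out

table = np.random.rand(100000, 32).astype(np.float32)          # a large ET
ids = np.random.randint(0, 100000, size=50)
pooled = sparse_lengths_sum(table, ids, lengths=[20, 30])      # shape (2, 32)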
The recommendation is done in two steps: filtering and ranking.
The filtering step uses lightweight machine learning models or
DNN-based RM1 to reduce the number of candidate posts sig-
nificantly. Then, using DNN-based RMs, tens of posts are finally
selected. The RMs used for ranking (RM2 and RM3) are bulkier
than those used for filtering, e.g., the bottom-FC layers of RM3
are larger since it uses more dense features. By comparison, RM2
has more ETs since it uses more sparse features. RM3 is compute-
intensive and RM2 is memory-intensive. RM3 is utilized for
recommending social media posts, which possess dense features.
By contrast, RM1 and RM2 are being used in services with several
sparse features. Hence, the number of ET lookups per input is
higher in RM1 and RM2 than that in RM3. Also, the memory
accesses of RM1 and RM2 show higher irregularity and number
of cache misses.
They perform experiments on CPUs of three generations,
namely, Haswell, Broadwell and Skylake. Their architectures are
shown in Table VI. Results on latency: On Broadwell CPU,
the latencies of RM1, RM2 and RM3 are 0.04ms, 0.30ms and
0.60ms, respectively. RM2 has many ETs and RM3 has wide
TABLE VI
CPU characteristics [5] (all three have 2 sockets and 256GB DRAM)
Processor          Haswell        Broadwell      Skylake
Frequency          2.5GHz         2.4GHz         2.0GHz
SIMD               AVX-2          AVX-2          AVX-512
Cores-per-socket   12             14             20
L1/L2/L3 (KB)      32/256/30720   32/256/35840   32/1024/28160
L2/L3 type         Inclusive      Inclusive      Exclusive
DRAM Freq.         1600 MHz       2400 MHz       2666 MHz
DRAM type          DDR3           DDR4           DDR4
DRAM BW            51 GB/s        77 GB/s        85 GB/s
FC layers. The operations responsible for highest fraction of
latency are SparseLengthsSum (an ET operation) in RM2
and BatchMatMul and FC computations in RM3. Clearly,
accelerating only MM, such as FC or BatchMatMul, will not
boost the performance of all three RMs. Acceleration of memory-
bound operations, e.g., ET lookups, is also important.
Results at different batch sizes: In the data-center context,
“throughput under a latency constraint” is a more important
metric than latency. For improving throughput, multiple queries
are batched, or multiple instances of RM are co-located on a
machine. They find latency of RMs on three CPUs with a batch
size of 16, 128 and 256. At a batch size of 16, the inference latency
is smallest on Broadwell. Compared to Skylake, Broadwell has a
higher frequency and the vectorization capability of Skylake is
not fully utilized at the low batch size. Broadwell has 2400MHz
DDR4 memory, whereas Haswell has 1600MHz DDR3 memory.
The memory-intensive SparseLengthsSum operator leads to
low BW utilization. Hence, the reason the performance is lower
on Haswell than on Broadwell is not the memory BW, but the
lower DRAM frequency and hence, higher memory latency on Haswell.
At higher batch-size, the latency is lowest on Skylake due to
the use of AVX-512. For the computation-intensive RM3, Skylake
starts outperforming at a batch size of 64, but for memory-
intensive RM1 and RM2, it does so only at a batch size of 128.
The speedup of Skylake over Broadwell/Haswell is lower than
the ratio of their SIMD width, which is because of the irregular
memory accesses. Thus, even for inference, batching is vital to
achieving high throughput.
Effect of co-locating RMs: Here, throughput is measured as the
number of inferences per second, bound by a latency constraint.
On performing co-location, the CPU with an exclusive cache
hierarchy (Skylake) shows lower performance loss and variability
than those with inclusive hierarchy (Haswell and Broadwell).
Thus, although co-location increases the throughput, “service-
level agreement” (SLA) constraints may not be met due to in-
creased variability. Broadwell and Skylake are the best at low and
high (respectively) levels of co-location and this trend is similar
to that observed with batching. Due to the irregular memory
accesses present in RMs, the L2 cache miss-rate is higher in the
inclusive hierarchy than an exclusive hierarchy. For Broadwell,
the L2 miss-rate with single and 16 co-located inferences is 17
and 22 MPKI, respectively. For Skylake, these values are only
13 and 13.2, respectively. Broadwell has a smaller L2 cache,
but more importantly, it sees a higher amount of cache back-
invalidations due to its inclusive hierarchy. Further, RM2 shows
higher performance loss than RM1/RM3. Due to co-location, in
RM2, the latency of SparseLengthsSum and FC increases by
3× and 1.6×, respectively. Since SparseLengthsSum already
has poor cache reuse, resource-contention has a severe impact on
it. Based on these observations, the right degree of co-location
can be decided.
Effect of hyperthreading: On using hyperthreading, the num-
ber of inferences on each physical core is doubled. This increases
the latency of FC and SparseLengthsSum by 1.6× and 1.3×,
respectively. Since FC leverages SIMD hardware, which is now
time-multiplexed between threads, hyperthreading leads to larger
performance loss in computation-intensive RM3 than in memory-
intensive RM1 or RM2. Since only a few cores in a data-center
execute two hyperthreaded RMs, the impact of hyperthreading is
higher on 99-percentile latency than on average latency.
Park et al. [6] characterize DL models powering social me-
dia services at Facebook. They identify the hardware demands
of RMs, visual understanding and natural language processing
workloads. (1) ETs used in RMs require high memory capacity
(more than tens of GBs) and memory BW. Operations on ETs
involve multiplication between sparse and dense matrices. These
RMs usually predict event-probabilities for several ads for a single
user, within a few hundreds of milliseconds. Hence, batching can
be used in FCs. ET lookups dominate the inference latency and
they involve random accesses across table columns.
(2) The image resolution required for object detection is higher
than that required for image classification. Also, the number
of detected objects can be increased by increasing the number
of region proposals, at the cost of increased computation cost.
Similarly, video understanding benefits from increased spatial and
temporal resolution. Although depthwise CONVs have the low
computation, they are memory-bound due to their low data reuse.
(3) The dependencies inherent in RNNs prevent parallelization.
Inference with the low and high batch size is useful for instant
and offline translation, respectively.
Larger on-chip memory can benefit many of these DL models.
Computer vision models have a high number of operations per
weight, but the number of operations per activation is not high
since it depends on the output feature dimension. Hence, their
performance depends on on-chip memory BW. Further, apart from
square matrices, matrices/vectors of other dimensions also arise
frequently due to depth/group-wise CONV and low batch size.
Hence, in addition to MM modules, vector-operation modules are
essential for handling the remaining computations.
On CPU, FCs have the highest latency, followed by ET lookups.
They develop a library for performing low-precision linear algebra
on CPU. For FP16 (16-bit floating point) and “8-bit multipli-
cations with 32-bit accumulation” GEMM, this library provides
higher performance than FP32 GEMM in Intel MKL.
They propose schemes to avoid accuracy loss due to quanti-
zation, such as doing quantization at fine-granularity (e.g., for
each output channel in CONVs, for each entry in ET), skipping
quantization for a layer if the error becomes high, performing
“quantization-aware training”, etc. They run a “frequent subgraph
mining” algorithm on the DL graph for finding the frequently-
executed subgraphs. From this, the subgraph groups that are
estimated to provide the highest speedup from fusion are finally
chosen. For example, they merge (computation-bound) batched
MM with (memory BW-bound) “tensor manipulation” computa-
tions, which brings a 10% reduction in inference latency.
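As an illustration of fine-granularity quantization, the sketch below quantizes CONV weights with one int8 scale per output channel; the symmetric scaling and rounding choices here are assumptions for illustration, not the scheme implemented in Facebook's library.

import numpy as np

def quantize_per_out_channel(w):
    # w: (K, C, kh, kw) FP32 kernels; one scale per output channel K
    scales = np.abs(w).reshape(w.shape[0], -1).max(axis=1) / 127.0
    scales = np.maximum(scales, 1e-12)                    # guard all-zero channels
    q = np.round(w / scales[:, None, None, None]).clip(-127, 127).astype(np.int8)
    return q, scales                                      # dequantize: q * scale per channel

w = np.random.randn(64, 32, 3, 3).astype(np.float32)
q, scales = quantize_per_out_channel(w)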
C. Data- and thread-level parallelization
Table VII highlights the attributes of parallelization techniques.
Liu et al. [4] note that kernel libraries such as OpenBLAS
and Intel MKL-DNN optimize only common operations such as
CONV, but do not perform DNN graph-level optimizations. The
graph-level improvements are achieved by the DL frameworks
such as TF. However, these frameworks have a limited scope of
TABLE VII
An overview of parallelization approaches
Data-level parallelization (vectorization): [9, 11, 13-15, 15-17, 20, 21, 24, 26, 28, 29, 34, 38, 46, 49, 53-55, 57, 59, 64, 70-72, 75, 77, 81, 82, 89, 93]
Thread-level parallelization: language not mentioned [4, 8, 11, 24, 26, 46, 63, 70, 82, 84, 86, 89]; OpenMP [20, 34, 64, 73, 74, 76, 77, 81, 85, 93]; Pthread [62, 65, 73, 81]; Intel TBB [14]
Node-level parallelization: using data-parallelism [8, 11, 46, 73, 77, 78]; model-parallelism [8, 11, 46, 77]
Reducing synchronization overhead: performing it periodically [8, 65]; leveraging the Hogwild idea [66, 78]; performing only pointer operations and not arithmetic operations inside a critical section [28]
performing graph-level improvements since the implementation
of operations is provided by the third-party libraries. Hence,
the optimizations performed at kernel and graph-level are not
synergistic with each other. Also, different CPU designs use
different libraries and integrating a library into a DL framework is
cumbersome. Finally, the use of these libraries as plug-ins leads
to contention with other libraries used by the frameworks. For
instance, TF uses both MKL-DNN and Eigen libraries and the
threads of these libraries contend for the resources.
They propose a technique for jointly optimizing at the level
of individual operations and the whole graph. Instead of writing
code in assembly language or using intrinsics, they utilize high-
level features such as vectorization, which allows easily extending
the optimizations to the whole DNN graph. To improve data-
access locality, they reorder the memory access dimensions and
also perform register blocking for enhancing the usage of vector-
instructions. FMA operation is used for performing CONV. CONV
is implemented as a template whose arguments are a loop-
unrolling factor, block size, and the number of utilized registers. It
allows adapting the implementation to CPUs with different vector-
length (e.g., AVX2, AVX-512 and NEON), cache size, etc., and
to different parameters of CONV (e.g., kernel dimensions).
To achieve thread-level parallelization on Q cores, they use a
thread-pool with Q threads. The outermost loop of the compu-
tation is uniformly partitioned into Q portions. Each thread runs
a portion on a different core, which avoids resource contention.
Thread-coordination is achieved using C++11 atomics. For global
data-structures, false sharing between threads is avoided using
cache line padding. Overall, by avoiding resource contention
and reducing thread-launching cost, their parallelization approach
obtains higher performance than OpenMP.
They classify the CNN operations into three types: (1) layout-
unaware, that can handle the data in any layout, e.g., ReLU (2)
layout-tolerant that can work with different layouts, e.g., CONV,
pooling, batch-normalization (3) layout-specific that work with
only one layout, e.g., flatten, reshape. They note that, in general,
the operations between CONV layers are either layout-unaware or
layout-tolerant. Hence, their technique transforms the layout only
for layout-specific layers but otherwise keeps the same layout as
that used in the CONV layer. Figure 13 illustrates the working
of their technique. Here, N, C, H and W refer to batch size,
number of input channels, fmap height and width, respectively.
The kernel layout is KCRS, where K, R, S refer to the output
channel, kernel width and height, respectively. For improving
cache efficiency, they organize fmap layout as NCHW[x]c, where
c is a subdimension of channel C. Also, x equals sizeOf(c) and
the number of channels equals the product of sizeOf(C) and
sizeOf(c). The layout of output is NCHW[y]c, which is similar
to that of input, except that the factor of the split could differ.
The organization of output kernel is KCRS[x]c[y]k, where y is
the sub-dimension of output channel K.
Fig. 13. (a) Data-layout used in a conventional CNN (b) selective layout-transformation scheme of Liu et al. [4]: the optimized layout (NCHW16c) passes through layout-tolerant operators (e.g., pooling, ReLU, broadcast operators) without any layout-transform overhead.
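A minimal numpy sketch of the NCHW-to-NCHW[x]c blocking with x=16 (matching the NCHW16c example in Fig. 13); the function names are illustrative, and a real implementation would also pre-transform the kernels to the corresponding blocked layout.

import numpy as np

def to_nchw16c(x, c_block=16):
    # x: (N, C, H, W); C must be divisible by c_block so that 16 channels
    # become the innermost, contiguous dimension for vectorization
    N, C, H, W = x.shape
    return x.reshape(N, C // c_block, c_block, H, W).transpose(0, 1, 3, 4, 2)

def from_nchw16c(x5):
    N, Cb, H, W, cb = x5.shape
    return x5.transpose(0, 1, 4, 2, 3).reshape(N, Cb * cb, H, W)

x = np.random.rand(1, 64, 28, 28).astype(np.float32)
assert np.array_equal(from_nchw16c(to_nchw16c(x)), x)     # round-trip check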
Finally, they propose a scheme for automating the search for
optimal parameter values. This scheme works in two phases: (1)
for each compute-intensive operation such as CONV, the best
parameters are individually found (2) for optimizing the end-to-
end performance, dynamic programming algorithm is used for
intelligently combining the results from individual schemes. They
perform experiments using 15 DNNs. On AMD EPYC and Intel
Skylake CPUs, their technique outperforms OpenVINO, MXNet
and TF for most DNNs. On ARM Cortex A72 CPU, it outperforms
MXNet and TF for all DNNs. Their technique works well for
all DNNs on all CPU designs, whereas framework-neutral (e.g.,
OpenVINO) and framework-specific (e.g., TF) approaches work
well only in some cases.
Vanhoucke et al. [9] discuss optimizations for accelerating NNs
on CPUs. For matrix multiplication C=AB, they store matrix
A in row-major order and B in column-major order. Also, they
perform loop-unrolling for the inner loop, which performs the
R += P[i]*Q[i] operation. Multiple accumulations are done in parallel,
which allows the compiler to execute them in a pipelined manner.
They vectorize multiply-and-add operations using 128b SIMD
instructions (“streaming SIMD extensions” or SSE). To achieve
16B (128b) alignment of memory blocks, the calls to malloc()
can be replaced with posix_memalign(), or the special al-
locators can be used from the “standard template library”. Also,
zero-padding is performed to ensure that data-vector operands are
multiple of 16B. They quantize activations into 8b unsigned values
and weights into 8b signed value. The biases are stored as 32b
int and the input layer is stored as FP since their values span a
broad dynamic range. They find that the use of FxP datatype alone
does not provide higher performance than FP implementation on
a CPU. They use the pmaddubsw instruction from the Intel
SSSE3 set, which does a parallel multiply-and-add operation on
sixteen unsigned 8b integer activations and sixteen signed 8b
integer weights and produces eight 16b integers. If the result value
overflows, it is saturated to 16b. To further optimize this, they
use SSE4.1 set, which provides a single instruction for converting
from 16b to 32b.
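The following Python sketch models the behavior of the pmaddubsw-style step described above (sixteen unsigned 8b activations times sixteen signed 8b weights, adjacent pairs added, sums saturated to signed 16b); it is a behavioral illustration, not the SSE intrinsic itself.

import numpy as np

def pmaddubsw_like(act_u8, wgt_s8):
    prod = act_u8.astype(np.int32) * wgt_s8.astype(np.int32)    # 16 products
    pair_sums = prod[0::2] + prod[1::2]                         # 8 adjacent-pair sums
    return np.clip(pair_sums, -32768, 32767).astype(np.int16)   # saturate to 16b

act = np.array([200] * 16, dtype=np.uint8)
wgt = np.array([127, 127] * 8, dtype=np.int8)
print(pmaddubsw_like(act, wgt))          # 200*127*2 = 50800 saturates to 32767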
Without batching, their CPU implementation achieves slightly
better performance than GPU, although with batching, GPU
provides a significant performance improvement. Batching also
benefits CPU by allowing reuse of both activations and weights.
Their optimizations improve NN performance by 4×over an FP
baseline, with a negligible loss of accuracy.
D. Node-level parallelization
Dean et al. [8] present a framework that allows implementing
model parallelism across machines and different threads of a
machine. Their framework also allows data-parallelism whereby
multiple replicas of a model optimize a single objective. The user
specifies the computations performed at every node in every layer
of the model, and the messages communicated during the FWP
and BWP phases of computation. Large models can be partitioned
on multiple machines, which is shown in Figure 14(a). Here,
the states of only those vertices need to be transferred across
machines that cross partition boundaries. Inside every partition,
the computations are parallelized using available cores. Their
technique also manages communication and synchronization of
machines in both inference and training phases.
They propose two techniques, shown in Figure 14(b)-(c),
for large-scale distributed training using this framework: Down-
pour SGD (“stochastic gradient descent”), which is an on-
line scheme and Sandblaster L-BFGS (“limited-memory Broy-
den–Fletcher–Goldfarb–Shanno algorithm”) which is a batch
scheme. Both techniques utilize a centralized sharded parameter
server (PS). Also, they can work well even when model replicas
have a different speed or when the number of replicas changes
due to failure/restart. Different replicas compute different training
instances and their outputs are periodically combined to achieve
data parallelism.
1. “Downpour SGD” The traditional SGD performs serial exe-
cution. Their proposed Downpour SGD is a type of asynchronous
SGD that utilizes several replicas of a single model. The training
data is divided into multiple parts and a model-copy is run on
every part. The models achieve communication via the PS, which
stores the present state of all parameters divided across multiple
processors. For example, with 20 PS shards, every shard stores and
applies updates to 1/20th of the model parameters. Here, model
replicas are mutually independent and PS shards are also mutually
independent. It introduces stochasticity in an optimization scheme
because, at any point in time, the parameters of every shard may
have seen a different number of updates applied in a different
order. For reducing the data-transfer costs, parameter-fetching and
gradient-pushing can be done after multiple steps.
To increase the robustness of this technique, they use the
“Adagrad adaptive learning rate” scheme, which uses a different
learning rate (η in Figure 14(b)) for every parameter. Since these
learning rates are obtained only from the sum of the square of
gradients of every parameter, Adagrad can be easily applied inside
every PS shard. Adagrad increases the number of replicas that
can function together. Further, they start the training with only
one replica and later start other replicas. These two optimizations
avoid instability in training DNNs with “Downpour SGD”.
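A minimal Python sketch of a sharded parameter server with per-parameter Adagrad, in the spirit of "Downpour SGD"; the sharding granularity, update-rule details, and class names are illustrative assumptions rather than the original implementation.

import numpy as np

class ParamShard:
    def __init__(self, params, lr=0.01, eps=1e-8):
        self.w = params.copy()                # this shard's slice of the model
        self.g2 = np.zeros_like(params)       # running sum of squared gradients
        self.lr, self.eps = lr, eps
    def fetch(self):
        return self.w.copy()                  # replicas pull parameters
    def push(self, grad):
        # Adagrad: per-parameter learning rate from the accumulated squared gradient
        self.g2 += grad * grad
        self.w -= self.lr * grad / (np.sqrt(self.g2) + self.eps)

# the model is split across shards; each replica periodically fetches, computes
# a gradient on its own data shard, and pushes asynchronously (no locking)
shards = [ParamShard(np.zeros(1000)) for _ in range(20)]
shards[0].push(np.random.randn(1000))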
2. “Sandblaster L-BFGS”: It is a distributed realization of L-
BFGS and uses both model and data parallelism. L-BFGS uses a
“coordinator process” which sends commands such as multiply,
add, and dot-product to different PS shards. PS shards execute
these commands independently and store the output locally. It
allows executing huge models without communicating with a
centralized PS.
In the traditional parallel implementation of L-BFGS, the
slowest processor becomes a scaling bottleneck. To mitigate this
issue, they assign work to each replica as it becomes free. This
dynamic scheduling approach achieves load-balancing. Towards
the completion of a batch, remaining work is assigned to multiple
replicas and the result from the replica that finishes first is used.
Consecutive parts of work are assigned to the same worker,
which avoids data-access issues. In “Downpour SGD”, there is
frequent synchronization with the PS, whereas, in this technique,
parameters are fetched only at the beginning of every batch after
the coordinator has updated them. Similarly, gradients are sent
only after a certain number of portions are done.
They evaluate their techniques for image recognition and au-
dio processing benchmarks. Models use at most 20 cores per
processor. They find that for the smallest model, which has FC
structure, the highest speedup of 2.2×is obtained on 8 processors.
The largest model has locally-connected design, and hence, it
provides increasing speedup with a rising number of processors.
The highest speedup is 12×, which is obtained for 81 processors.
Overall, their techniques achieve a large speedup over traditional
versions of SGD and L-BFGS. For the largest model, they use
32 CPUs with 16 cores in each CPU and by combining this with
their proposed optimizations, tens of thousands of CPU cores can
be used for training a large model.
Ji et al. [78] propose a technique for parallelizing the Word2Vec
algorithm using both shared and distributed memory. This al-
gorithm uses the Hogwild approach for parallelizing the SGD
algorithm, which avoids the need for synchronizing between
updates. However, a cacheline with a particular model entry can
be updated by multiple threads, which leads to the shuttling of
cachelines across cores and large access latency. Also, although a
target word is utilized in the model updates for input words, only
one update is performed at a time. In fact, the algorithm performs
multiple dot-products and, thus, fails to leverage the opportunity
of data reuse.
They propose batching both the input context words and
the negative samples, which converts the dot-product (“level-1
BLAS” operations) into GEMM (“level-3 BLAS” operations). For
multithreading of GEMM operations, Hogwild strategy is used.
Their technique allows the use of vectorized multiply-add instruc-
tions and optimized libraries. While the baseline implementation
performs model updates after every dot-product, their technique
performs model updates after the whole GEMM operation. Thus,
the model update frequency is reduced, and hence, their technique
shows much better scaling.
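The sketch below contrasts, under simplifying assumptions, the per-pair dot-product formulation with the batched GEMM formulation: batching the context words and negative samples lets one level-3 BLAS call compute all scores before the model is updated.

import numpy as np

def scores_dot(ctx_vecs, out_vecs):
    # baseline: one level-1 BLAS dot-product per (context, output) pair
    return np.array([[np.dot(u, v) for v in out_vecs] for u in ctx_vecs])

def scores_gemm(ctx_vecs, out_vecs):
    # batched: one GEMM over the context batch and the target + negative samples
    return ctx_vecs @ out_vecs.T

ctx = np.random.rand(16, 128)      # batched input context words
out = np.random.rand(1 + 5, 128)   # target word plus negative samples
assert np.allclose(scores_dot(ctx, out), scores_gemm(ctx, out))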
However, differences in model updates can affect the conver-
gence rate and the magnitude of difference depends on the batch-
size used for the inputs. Since they use batch sizes below 20, the
impact on convergence remains small. For scaling the computation
to multiple nodes, they note that model parallelism is not useful
because each GEMM is small. Hence, they use data-parallelism.
Since the network BW is much lower than the CPU memory
BW, complete model synchronization over four nodes takes 0.5
seconds, which is too high. They note that in Word2vec, the
update frequency of a word depends on its popularity. Hence,
their technique seeks to update the model with the same frequency
as the word frequency and uses the “sub-model” and not “full-
model” synchronization approach. Their technique achieves near-
linear scaling with the number of cores and nodes and provides
a throughput of 110M words per second using 32 nodes of
Broadwell CPUs.
Das et al. [46] optimize the SGD algorithm on a single node and
then scale it to multiple nodes. They note that FWP, BWP and
weight update operations have similar memory access patterns,
and hence, a similar cache-blocking scheme should work well
for them. If weights and activations do not fit in cache, they
need to be fetched from the main memory.
Fig. 14. (a) Model-parallelism [8] (b) "Downpour SGD": model replicas pull parameters w and push gradients ∆w to the parameter server (PS) in an asynchronous manner (c) "Sandblaster L-BFGS": a single "coordinator" transmits small messages to the PS and replicas for achieving batch optimization.
Hence, the cache
capacity determines the AmI of computations. They model the
cache-blocking problem as a “constrained minimization problem”
and solve it using a brute-force search. Further, while choosing the
block sizes that reside in the cache, one of the dimensions is chosen
to be a multiple of the SIMD width. They observe that on a Xeon
CPU with a 128KB cache per thread, an AmI value above 25 can be
achieved for a majority of CONV layers, even when minibatch
size is set to one.
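A brute-force search of this kind can be sketched as follows. The cost model, layer dimensions, and cache budget below are illustrative stand-ins; the exact constrained-minimization formulation of [46] differs, but the structure is the same: enumerate block sizes, discard blocks that exceed the per-thread cache capacity, keep one dimension a multiple of the SIMD width, and pick the block with the best estimated reuse.

```cpp
// Illustrative brute-force search for cache-blocking factors of a CONV layer.
#include <cstdio>

int main() {
    const int C = 256, K = 256, R = 3, S = 3;   // channels and kernel size (example)
    const int simd_width = 16;                  // fp32 lanes in a 512-bit register
    const int cache_bytes = 128 * 1024;         // per-thread cache budget (example)

    double best_ami = 0.0;
    int best_kb = 0, best_cb = 0, best_hb = 0, best_wb = 0;

    for (int kb = simd_width; kb <= K; kb += simd_width)   // SIMD-multiple dimension
      for (int cb = 1; cb <= C; ++cb)
        for (int hb = 1; hb <= 14; ++hb)
          for (int wb = 1; wb <= 14; ++wb) {
            // Working set of one block: weights + input patch + output (fp32 bytes).
            long bytes = 4L * (kb * cb * R * S
                               + cb * (hb + R - 1) * (wb + S - 1)
                               + kb * hb * wb);
            if (bytes > cache_bytes) continue;              // capacity constraint
            double flops = 2.0 * kb * cb * R * S * hb * wb; // MACs in the block
            double ami = flops / bytes;    // proxy: block footprint is loaded once
            if (ami > best_ami) {
                best_ami = ami;
                best_kb = kb; best_cb = cb; best_hb = hb; best_wb = wb;
            }
          }
    std::printf("block (K=%d, C=%d, H=%d, W=%d), est. AmI = %.1f flops/byte\n",
                best_kb, best_cb, best_hb, best_wb, best_ami);
    return 0;
}
```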
After blocking, the data is laid out such that accesses in the innermost
loops are maximally contiguous. This improves spatial locality, BW
utilization and prefetching efficiency. For all datatypes, the inner-
most dimension is laid-out over groups of SIMD-width output
fmaps, which allows vectorizing the operations. They further
perform “register blocking” for improving the ratio of VFMA
computations to load/store operations. They also partition the
work among different threads.
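The following sketch illustrates such a layout for CONV weights, re-ordering a KCRS tensor so that the innermost dimension holds a SIMD-width group of output fmaps. The dimensions and the choice of weights (rather than activations) are illustrative.

```cpp
// Re-layout of CONV weights from KCRS to a blocked [ceil(K/VB)][C][R][S][VB]
// format, so that the innermost VB elements are consecutive output fmaps.
#include <cstddef>
#include <vector>

constexpr int VB = 16;  // fp32 lanes per 512-bit vector register

std::vector<float> block_weights(const std::vector<float>& w,
                                 int K, int C, int R, int S) {
    const std::size_t kb = (K + VB - 1) / VB;               // padded K-blocks
    std::vector<float> blocked(kb * (std::size_t)C * R * S * VB, 0.0f);
    for (int k = 0; k < K; ++k)
      for (int c = 0; c < C; ++c)
        for (int r = 0; r < R; ++r)
          for (int s = 0; s < S; ++s) {
            std::size_t src = (((std::size_t)k * C + c) * R + r) * S + s;
            std::size_t dst = (((((std::size_t)(k / VB) * C + c) * R + r) * S + s)
                               * VB) + (k % VB);
            blocked[dst] = w[src];
          }
    return blocked;  // innermost VB elements are contiguous output fmaps
}
```

With this layout, the innermost loop over the VB output fmaps maps directly onto one vector FMA per iteration of the surrounding loops.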
Then, they perform strong scaling of synchronous SGD to
multiple nodes. For CONV layers, model-parallelism is preferred
when minibatch size is small and kernel size is large. For
FC layers, model-parallelism is preferred unless the minibatch
size becomes huge (e.g., above 5000). They further explore the
“hybrid parallelism” approach where the nodes are divided into
groups; nodes inside a group use model-parallelism, while data-
parallelism is used across the node-groups. This approach divides
the work along both fmap and minibatch dimensions. Hybrid
parallelism leads to lower data-traffic than either data or model-
parallelism. For OverFeat-FAST and VGG-A CNNs, they achieve
a speedup of 42× and 53×, respectively, on a 64-node machine.
Finally, on a DNN for speech recognition, their technique achieves
a 6.5× speedup with 16 nodes.
Roy et al. [73] present a “non-uniform memory access”
(NUMA)-aware technique for improving the performance of
DNNs on multicore CPUs. When the CONV operation is performed
on input images, different CPUs can process different images or
different portions of an image. These are referred to as batch-level and
BLAS-level parallelization, respectively. They conduct experi-
ments with one to four NUMA nodes on both AMD and Intel
CPUs. They note that BVLC-Caffe performs only BLAS-level
parallelization, whereas Intel-Caffe also performs batch-level paral-
lelization. Due to this and other architectural optimizations,
Intel-Caffe provides higher performance for all NUMA node-
counts. However, both frameworks show poor scalability with an
increasing number of NUMA nodes due to the increased count
of accesses to remote NUMA domains, which incur high latency.
Also, out of four NUMA domains, one domain itself consumes
a large fraction of read/write BW. In the default Caffe, a thread
can run on any core in any domain. The memory is allocated
in the domain where the first memory access is issued. However,
a thread's inputs, temporary buffers and outputs may reside in different
NUMA domains. Hence, during CNN processing, memory ac-
cesses happen across NUMA boundaries. This is shown in Figure
15(a).
Their NUMA-aware technique performs hierarchical paral-
lelization, which is shown in Figure 15(b). It works as follows: (1)
In every NUMA domain, a Pthread-based SGD routine is created;
(2) each routine is parallelized at batch-level using OpenMP
threads, which have affinity to their parent Pthread routine; (3)
every OpenMP thread performs BLAS operations via MKL threads,
which keep the affinity of their parent OpenMP threads. The
data corresponding to the computations is also distributed at the
granularity of domains and threads. They perform data-parallelism,
whereby each SGD routine running in a different domain processes
different input samples, and after a fixed number of samples, the
network parameters are updated. Thus, memory accesses issued
by the computations in a domain are served by the data in
the same domain. Inter-domain communication is required only for
gradient-update. Their NUMA-aware Caffe design provides better
throughput and scaling than Intel-Caffe.
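A conceptual sketch of this hierarchical mapping is shown below, using nested OpenMP parallelism in place of the Pthread/OpenMP/MKL hierarchy of [73]. Thread pinning (e.g., via OMP_PLACES and OMP_PROC_BIND, or numactl) and the MKL-level threads are assumed but not shown, and the gradient computation is a toy placeholder.

```cpp
// Conceptual sketch: one solver replica per NUMA domain, each parallelized
// over its local mini-batch; cross-domain traffic occurs only for gradients.
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const int num_domains = 2;          // NUMA domains (example)
    const int samples_per_domain = 64;
    std::vector<std::vector<float>> grads(num_domains);

    omp_set_max_active_levels(2);       // allow nested parallel regions
    #pragma omp parallel num_threads(num_domains)   // one solver per domain
    {
        int d = omp_get_thread_num();
        // First-touch allocation: assuming this thread is pinned to domain d,
        // initializing the buffers here places their pages in local memory.
        std::vector<float> local_batch(samples_per_domain, 1.0f);
        std::vector<float> local_grad(samples_per_domain, 0.0f);

        #pragma omp parallel for num_threads(4)     // batch-level parallelism
        for (int i = 0; i < samples_per_domain; ++i)
            local_grad[i] = 0.1f * local_batch[i];  // stand-in for FWP/BWP

        grads[d] = local_grad;  // the only cross-domain traffic: gradient exchange
    }
    std::printf("collected gradients from %zu domains\n", grads.size());
    return 0;
}
```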
V. CONCLUDING REMARKS
The landscape of next-generation deep learning demands that
CPU play a bigger role than merely a host, and we believe
that CPU is ready to take on this challenge. In this paper, we
surveyed the techniques for optimizing DL on CPUs and DL-
aware optimizations to CPUs. We organized the works into several
categories to show their similarities and differences. We conclude
this paper with a brief mention of directions for future research.
In recent years, the vector width in CPUs has increased from
64b to 512b. Increasing this further would mean that each vector
register load accesses multiple cache lines, which leads to a severe
penalty. Even then, it would benefit only a few applications that
have such high SIMD parallelism. Thus, merely increasing peak
performance will not be sufficient; more revolutionary improve-
ments are required to boost the performance of a broad range
of DL applications. For example, although existing CPUs allow
conversion between FP16 and FP32, they do not natively support
FP16 computations. Also, they need multiple instructions for
implementing INT8 multiplications with 32b accumulation, and
hence, INT8 provides only a small improvement over FP32 computa-
tions. To address these limitations, CPU vendors have recently
introduced hardware and software support for low-precision com-
puting [102, 103].
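As an illustration of the instruction-count gap, the sketch below shows INT8 dot products with 32-bit accumulation on AVX-512, assuming a CPU and compiler flags that enable AVX-512BW and AVX-512VNNI; the pre-VNNI sequence needs three instructions per step, whereas VNNI fuses them into one.

```cpp
// INT8 dot products with 32-bit accumulation on AVX-512.
#include <immintrin.h>

// a holds unsigned 8-bit activations, b holds signed 8-bit weights.
__m512i int8_mac_pre_vnni(__m512i acc, __m512i a, __m512i b) {
    const __m512i ones = _mm512_set1_epi16(1);
    __m512i prod16 = _mm512_maddubs_epi16(a, b);      // u8*s8 -> pairwise-summed s16
                                                      // (can saturate at 16 bits)
    __m512i prod32 = _mm512_madd_epi16(prod16, ones); // widen/sum s16 pairs -> s32
    return _mm512_add_epi32(acc, prod32);             // accumulate in 32 bits
}

__m512i int8_mac_vnni(__m512i acc, __m512i a, __m512i b) {
    return _mm512_dpbusd_epi32(acc, a, b);            // single fused multiply-accumulate
}
```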
Since CPUs devote a large amount of chip-area to caches,
leveraging in-memory computing capabilities can significantly
increase their performance. Further, apart from MM,
modern DNNs also perform computations with other patterns,
such as sparse lookups [5], vector operations [6] and deconvolu-
tion [104]. Future CPUs can also host dedicated DL accelerators
to accelerate such operations and, thus, bring the best of both
worlds together. Vendor-optimized libraries will remain essential
for extracting the last bit of performance from a processor, and
this has been confirmed by observations such as Intel-Caffe
outperforming BVLC-Caffe [73, 93] and Intel-MKL boosting the
performance of TF [68]. Research works that do not use optimized
libraries such as Intel MKL, or that use an earlier version of
TF without FMA/SIMD support [63], may offer a
misleading picture of CPU performance.

Fig. 15. (a) Current Caffe designs lead to large communication across NUMA domains. (b) Hierarchical mapping [73] reduces this overhead.
Some of the optimizations proposed for accelerators may not be
necessary or effective on CPUs. For example, even though low-
bitwidth DNNs facilitate efficient computation on accelerators,
they offer little benefit on existing systems with FMA instructions.
This is because, when FMA instructions are pipelined, their cost is
nearly the same as that of additions alone. Also, multiplications
are costly only when the operand bitwidth is high. Evidently,
a CPU-specific study of recently-proposed DL techniques is
required to assess their potential on CPUs. Further, efforts are
required for accelerating recent DL algorithms such as generative
adversarial networks [104] and reinforcement learning on CPUs.
Large companies, startups and academic institutes have recently
proposed a range of AI accelerators and optimization techniques.
However, a key obstacle in achieving synergy between these
efforts is the lack of open-source ISAs and tools. Due to the
proprietary nature of these platforms and products, even the
best ideas have not translated into widely adopted technologies.
We believe that the development of an open-source ISA such as
RISC-V can help greatly in breaking these barriers. By virtue
of its open-source nature, RISC-V will reduce royalty overheads
and promote reproducibility and reuse. Further, RISC-V makes
it possible to design “domain specific extensions”, encouraging
CPU-accelerator heterogeneous computing. Due to these and
several other features of RISC-V, it is expected to boost the entire
ecosystem of AI computing.
CONV involves several levels of nested loops, which are
difficult to optimize manually. Some researchers use assembly-
level primitives, such as Intel intrinsics, to optimize them.
However, this approach is not scalable. Polyhedral compilers
[105–107] can model these loops in terms of polyhedra and
then apply sophisticated affine transformations (such as loop tiling)
on them to improve both cache locality and parallelism. This can
provide a significant boost in performance.
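As an illustration, the sketch below shows a hand-tiled direct-convolution loop nest of the kind such compilers can derive automatically; the tile sizes, the valid-convolution form, and the layouts are illustrative choices.

```cpp
// Hand-tiled direct CONV loop nest (valid convolution; out is assumed to be
// zero-initialized and sized K*H*W, with only the valid region written).
#include <algorithm>
#include <vector>

void conv_tiled(const std::vector<float>& in,   // [C][H][W]
                const std::vector<float>& wgt,  // [K][C][R][S]
                std::vector<float>& out,        // [K][H][W]
                int C, int K, int H, int W, int R, int S) {
    const int TK = 16, TH = 8, TW = 8;          // tile sizes (illustrative)
    for (int k0 = 0; k0 < K; k0 += TK)
      for (int h0 = 0; h0 < H - R + 1; h0 += TH)
        for (int w0 = 0; w0 < W - S + 1; w0 += TW)
          // Within a tile, the weight block and the input patch stay cache-resident.
          for (int k = k0; k < std::min(k0 + TK, K); ++k)
            for (int c = 0; c < C; ++c)
              for (int h = h0; h < std::min(h0 + TH, H - R + 1); ++h)
                for (int w = w0; w < std::min(w0 + TW, W - S + 1); ++w)
                  for (int r = 0; r < R; ++r)
                    for (int s = 0; s < S; ++s)
                      out[(k * H + h) * W + w] +=
                          wgt[((k * C + c) * R + r) * S + s] *
                          in[(c * H + (h + r)) * W + (w + s)];
}
```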
The choice of a computing system for DL applications will be
made based on multiple metrics such as throughput, latency and
energy efficiency over a range of applications/operations, ease of
use and development, portability, reliability, availability, cost, etc.
Since over-emphasizing one or a few metrics is likely to provide a
misleading picture, future studies should provide a comprehensive
evaluation of all metrics of interest. In fact, due to the diversity of
metrics and model characteristics (layer-count, layer-type, batch
size, precision, etc.), a single processing unit may not be optimal
for all scenarios. Evidently, high-level interfaces that intelligently
map a query to suitable processing unit(s), such as a CPU and/or an
accelerator, will be vital for achieving the highest efficiency. Further,
development of high-level APIs such as OpenVINO that allow
execution on a range of processing units will be important for
reducing the programmer effort.
REFERENCES
[1] O. Valery et al., “Low Precision Deep Learning Training on
Mobile Heterogeneous Platform,” in PDP, 2018, pp. 109–
117.
[2] C.-J. Wu et al., “Machine learning at Facebook: Under-
standing inference at the edge,” in HPCA, 2019, pp. 331–
344.
[3] K. Hazelwood et al., “Applied machine learning at Face-
book: A datacenter infrastructure perspective,” in HPCA,
2018, pp. 620–629.
[4] Y. Liu et al., “Optimizing CNN Model Inference on CPUs,”
in ATC, 2019, pp. 1025–1040.
[5] U. Gupta et al., “The Architectural Implications of Face-
book’s DNN-based Personalized Recommendation,” arXiv
preprint arXiv:1906.03109, 2019.
[6] J. Park et al., “Deep learning inference in Facebook data
centers: Characterization, performance optimizations and
hardware implications,” arXiv preprint arXiv:1811.09886,
2018.
[7] M. Dukhan et al., “QNNPACK: Open source library for op-
timized mobile deep learning,” http://bitly.ws/8SyQ, 2018.
[8] J. Dean et al., “Large scale distributed deep networks,” in
Advances in neural information processing systems, 2012,
pp. 1223–1231.
[9] V. Vanhoucke et al., “Improving the speed of neural net-
works on CPUs,” 2011.
[10] M. Zhang et al., “DeepCPU: Serving RNN-based deep
learning models 10x faster,” in USENIX ATC, 2018, pp.
951–965.
[11] T. Chilimbi et al., “Project ADAM: Building an efficient
and scalable deep learning training system,” in OSDI, 2014,
pp. 571–582.
[12] A. Gujarati et al., “Swayam: distributed autoscaling to meet
SLAs of machine learning inference services with resource
efficiency,” in ACM/IFIP/USENIX Middleware Conference,
2017, pp. 109–120.
[13] W. Bao et al., “NGEMM: Optimizing GEMM for Deep
Learning via Compiler-based Techniques,” arXiv preprint
arXiv:1910.00178, 2019.
[14] L.-W. Chang et al., “Accelerating recurrent neural networks
through compiler techniques and quantization,” Workshop
on Systems for ML and Open Source Software, 2018.
[15] S. Rajbhandari et al., “Optimizing CNNs on multicores for
scalability, performance and goodput,” in ACM SIGPLAN
Notices, vol. 52, no. 4, 2017, pp. 267–280.
[16] J. Devlin, “Sharp models on dull hardware: Fast and
accurate neural machine translation decoding on the CPU,”
arXiv preprint arXiv:1705.01991, 2017.
[17] Y. Kim et al., “µLayer: Low Latency On-Device Infer-
ence Using Cooperative Single-Layer Acceleration and
Processor-Friendly Quantization,” in EuroSys Conference,
2019, p. 45.
[18] D. Li et al., “DeepRebirth: Accelerating deep neural net-
work execution on mobile devices,” in AAAI Conference on
Artificial Intelligence, 2018.
[19] M. Almeida et al., “EmBench: Quantifying Performance
Variations of Deep Neural Networks across Modern Com-
modity Devices,” International Workshop on Embedded and
Mobile Deep Learning, 2019.
[20] P. Meloni et al., “NEURAghe: Exploiting CPU-FPGA
Synergies for Efficient and Flexible CNN Inference Ac-
celeration on Zynq SoCs,” TRETS, vol. 11, no. 3, p. 18,
2018.
[21] V. Peluso et al., “Enabling energy-efficient unsupervised
monocular depth estimation on ARMv7-based platforms,”
in DATE, 2019, pp. 1703–1708.
[22] M. Xu et al., “Accelerating convolutional neural networks
for continuous mobile vision via cache reuse,” arXiv
preprint arXiv:1712.01670, 2017.
[23] N. D. Lane et al., “DeepX: A software accelerator for low-
power deep learning inference on mobile devices,” in IPSN,
2016, p. 23.
[24] M. Motamedi et al., “Machine intelligence on resource-
constrained IoT devices: The case of thread granularity
optimization for CNN inference,” TECS, vol. 16, no. 5s,
p. 151, 2017.
[25] Y. Wu et al., “Experimental Characterizations and Analysis
of Deep Learning Frameworks,” in Big Data. IEEE, 2018,
pp. 372–377.
[26] D. Budden et al., “Deep tensor convolution on multicores,”
in ICML, 2017, pp. 615–624.
[27] Y. E. Wang et al., “Benchmarking TPU, GPU, and
CPU Platforms for Deep Learning,” arXiv preprint
arXiv:1907.10701, 2019.
[28] A. Zlateski et al., “ZNN–A Fast and Scalable Algorithm
for Training 3D Convolutional Networks on Multi-core and
Many-Core Shared Memory Machines,” in IPDPS. IEEE,
2016, pp. 801–811.
[29] B. Liu et al., “Sparse convolutional neural networks,” in
CVPR, 2015, pp. 806–814.
[30] S. Sen et al., “SparCE: Sparsity Aware General-Purpose
Core Extensions to Accelerate Deep Neural Networks,”
TOCS, vol. 68, no. 6, pp. 912–925, 2018.
[31] S. Cao et al., “SeerNet: Predicting CNN Feature-Map
Sparsity Through Low-Bit Quantization,” in CVPR, 2019,
pp. 11 216–11 225.
[32] J. Park et al., “Faster CNNs with direct sparse convolutions
and guided pruning,” arXiv preprint arXiv:1608.01409,
2016.
[33] K.-Y. Peng et al., “Adaptive runtime exploiting sparsity in
tensor of deep learning neural network on heterogeneous
systems,” in SAMOS, 2017, pp. 105–112.
[34] B. Akin et al., “ZCOMP: Reducing DNN Cross-Layer
Memory Footprint Using Vector Extensions,” in MICRO,
2019, pp. 126–138.
[35] S. Mittal et al., “A Survey of CPU-GPU Heterogeneous
Computing Techniques,” ACM Computing Surveys, vol. 47,
no. 4, pp. 69:1–69:35, 2015.
[36] S. Mittal et al., “A Survey of Techniques for Modeling
and Improving Reliability of Computing Systems,” TPDS,
2015.
[37] P. Blacker et al., “Rapid Prototyping of Deep Learning
Models on Radiation Hardened CPUs,” in AHS, 2019, pp.
25–32.
[38] S. Mehta et al., “WearCore: A core for wearable work-
loads?” in PACT. IEEE, 2016, pp. 153–164.
[39] J. Hanhirova et al., “Latency and throughput characteriza-
tion of convolutional neural networks for mobile computer
vision,” in ACM Multimedia Systems Conference, 2018, pp.
204–215.
[40] S. Mittal, “A Survey of FPGA-based Accelerators for
Convolutional Neural Networks,” Neural Computing and
Applications, 2018.
[41] N. Rao, “Intel Excels in First MLPerf Inference Results,”
https://www.intel.ai/mlperf-nov2019/, 2019.
[42] Y. Ma et al., “Moving Deep Learning into Web Browser:
How Far Can We Go?” in The World Wide Web Conference,
2019, pp. 1234–1244.
[43] C. Zhang et al., “MArk: Exploiting Cloud Services for
Cost-Effective, SLO-Aware Machine Learning Inference
Serving,” in ATC, 2019.
[44] “AWS EC2 Pricing,” http://bitly.ws/8SyN.
[45] A. Zlateski et al., “The anatomy of efficient FFT and
winograd convolutions on modern CPUs,” in ICS, 2019,
pp. 414–424.
[46] D. Das et al., “Distributed deep learning using syn-
chronous stochastic gradient descent,” arXiv preprint
arXiv:1602.06709, 2016.
[47] A. Sarma et al., “CASH: Compiler Assisted Hardware
Design for Improving DRAM Energy Efficiency in CNN
Inference,” MEMSYS, 2019.
[48] A. Jain et al., “Architectural support for convolutional
neural networks on modern cpus,” in PACT, 2018, p. 16.
[49] Z. Chishti et al., “Memory system characterization of deep
learning workloads,” in MemSys, 2019, pp. 497–505.
[50] N. D. Lane et al., “An early resource characterization of
deep learning on wearables, smartphones and internet-of-
things devices,” in international workshop on internet of
things towards applications, 2015, pp. 7–12.
[51] K. Zou et al., “Learn-to-scale: Parallelizing deep learning
inference on chip multiprocessor architecture,” in DATE,
2019, pp. 1172–1177.
[52] J. Gu et al., “Implementation and evaluation of deep neural
networks (DNN) on mainstream heterogeneous systems,” in
Asia-Pacific Workshop on Systems, 2014, p. 12.
[53] D. Velasco-Montero et al., “On the Correlation of CNN
Performance and Hardware Metrics for Visual Inference
on a Low-Cost CPU-based Platform,” in IWSSIP, 2019, pp.
249–254.
[54] S.-J. Lee et al., “Efficient SIMD implementation for ac-
celerating convolutional neural network,” in International
Conference on Communication and Information Process-
ing, 2018, pp. 174–179.
[55] L. Lai et al., “CMSIS-NN: Efficient neural network kernels
for ARM Cortex-M CPUs,” arXiv preprint arXiv:1801.06601,
2018.
[56] T. Abtahi et al., “Accelerating convolutional neural network
with FFT on embedded hardware,” IEEE T VLSI SYST,
vol. 26, no. 9, pp. 1737–1749, 2018.
[57] C. F. Rodrigues et al., “Fine-Grained Energy and Perfor-
mance Profiling framework for Deep Convolutional Neural
Networks,” arXiv preprint arXiv:1803.11151, 2018.
[58] X. Dai et al., “ChamNet: Towards Efficient Network Design
Through Platform-Aware Model Adaptation,” in Confer-
ence on Computer Vision and Pattern Recognition (CVPR),
2019.
[59] S. Popovych et al., “PZnet: Efficient 3D ConvNet Inference
on Manycore CPUs,” in Science and Information Confer-
ence, 2019, pp. 369–383.
[60] M. Xu et al., “DeepWear: Adaptive Local Offloading for
On-Wearable Deep Learning,” IEEE Transactions on Mo-
bile Computing, 2019.
[61] S. Shams et al., “Evaluation of deep learning frameworks
over different HPC architectures,” in ICDCS, 2017, pp.
1389–1396.
[62] J. Hauswald et al., “Sirius: An open end-to-end voice and
vision personal assistant and its implications for future
warehouse scale computers,” in ACM SIGPLAN Notices,
vol. 50, no. 4, 2015, pp. 223–238.
[63] S. Shi et al., “Benchmarking state-of-the-art deep learning
software tools,” in CCBD, 2016, pp. 99–104.
[64] A. Venkat et al., “SWIRL: High-performance many-core
CPU code generation for deep neural networks,” IJHPCA,
p. 1094342019866247, 2019.
[65] S. Fan et al., “Parallel Computing in DNNs Using CPU
and MIC,” in ISPA/IUCC, 2017, pp. 646–652.
[66] B. Chen et al., “SLIDE: In Defense of Smart Algorithms
over Hardware Acceleration for Large-Scale Deep Learning
Systems,” arXiv preprint arXiv:1903.03129, 2019.
[67] E. Georganas et al., “High-performance deep learning via
a single building block,” arXiv preprint arXiv:1906.06440,
2019.
[68] G. Ramirez-Gargallo et al., “TensorFlow on state-of-the-art
HPC clusters: a machine learning use case,” 2019.
[69] Y. You et al., “Imagenet training in minutes,” in ICPP,
2018, p. 1.
[70] B. Jacob et al., “Quantization and training of neural net-
works for efficient integer-arithmetic-only inference,” in
CVPR, 2018, pp. 2704–2713.
[71] Z. Gong et al., “SparseTrain: Leveraging Dynamic Sparsity
in Training DNNs on General-Purpose SIMD Processors,”
arXiv preprint arXiv:1911.10175, 2019.
[72] A. Xing et al., “Speeding up deep neural networks for
speech recognition on ARM Cortex-A series processors,”
in ICNC, 2014, pp. 123–127.
[73] P. Roy et al., “NUMA-Caffe: NUMA-aware deep learning
neural networks,” TACO, vol. 15, no. 2, p. 24, 2018.
[74] M. G. Tallada, “Coarse grain parallelization of deep neural
networks,” in ACM SIGPLAN Notices, vol. 51, no. 8, 2016,
p. 1.
[75] A. Zlateski et al., “Compile-time optimized and statically
scheduled ND convnet primitives for multi-core and many-
core (Xeon Phi) CPUs,” in ICS, 2017, p. 8.
[76] N. Hasabnis, “Auto-tuning TensorFlow Threading Model
for CPU Backend,” in MLHPC. IEEE, 2018, pp. 14–25.
[77] Y. E. Wang et al., “Exploiting parallelism opportu-
nities with deep learning frameworks,” arXiv preprint
arXiv:1908.04705, 2019.
[78] S. Ji et al., “Parallelizing word2vec in shared and dis-
tributed memory,” IEEE Transactions on Parallel and Dis-
tributed Systems, 2019.
[79] R. Takeda et al., “Acoustic model training based on node-
wise weight boundary model increasing speed of discrete
neural networks,” in ASRU, 2015, pp. 52–58.
[80] H. Yin et al., “Hardware-Guided Symbiotic Training for
Compact, Accurate, yet Execution-Efficient LSTM,” arXiv
preprint arXiv:1901.10997, 2019.
[81] H. Lan et al., “FeatherCNN: Fast Inference Computation
with TensorGEMM on ARM Architectures,” IEEE T PAR-
ALL DISTR, 2019.
[82] K. Yanai et al., “Efficient mobile implementation of a CNN-
based object recognition system,” in ACM international
conference on Multimedia, 2016, pp. 362–366.
[83] A. Ignatov et al., “AI benchmark: Running deep neural
networks on android smartphones,” in ECCV, 2018, pp. 0–
0.
[84] M. Guignard et al., “Performance characterization of state-
of-the-art deep learning workloads on an IBM “Minsky”
platform,” in Proceedings of the 51st Hawaii International
Conference on System Sciences, 2018.
[85] M. Loukadakis et al., “Accelerating deep neural networks
on low power heterogeneous architectures,” 2018.
[86] S. Wang et al., “High-Throughput CNN Inference on Em-
bedded ARM big.LITTLE Multi-Core Processors,” arXiv
preprint arXiv:1903.05898, 2019.
[87] S. Rallapalli et al., “Are very deep neural networks feasible
on mobile devices,” IEEE Trans. Circ. Syst. Video Technol.,
2016.
[88] M. Rusci et al., “Memory-Driven Mixed Low Precision
Quantization For Enabling Deep Network Inference On
Microcontrollers,” arXiv preprint arXiv:1905.13082, 2019.
[89] D. Frajberg et al., “Accelerating deep learning inference
on mobile systems,” in International Conference on AI and
Mobile Services, 2019, pp. 118–134.
[90] B. Wu et al., “FBNet: Hardware-aware efficient convnet
design via differentiable neural architecture search,” in
Conference on Computer Vision and Pattern Recognition,
2019, pp. 10 734–10 742.
[91] L. L. Zhang et al., “Fast hardware-aware neural architecture
search,” in Conference on Computer Vision and Pattern
Recognition (CVPR) Workshops, 2020.
[92] A. Zlateski et al., “FFT Convolutions are Faster than
Winograd on Modern CPUs, Here is Why,” arXiv preprint
arXiv:1809.07851, 2018.
[93] J. J. K., “Benefits of Intel Optimized Caffe in comparison
with BVLC Caffe,” http://bitly.ws/8Szz, 2017.
[94] B. Li et al., “Efficient transformer-based large scale lan-
guage representations using hardware-friendly block struc-
tured pruning,” arXiv preprint arXiv:2009.08065, 2020.
[95] J. Yu et al., “Scalpel: Customizing DNN pruning to the
underlying hardware parallelism,” in ACM SIGARCH Com-
puter Architecture News, vol. 45, no. 2, 2017, pp. 548–560.
[96] S. Mittal et al., “A Survey of Techniques for Optimizing
Deep Learning on GPUs,” J SYST ARCHITECT, 2019.
[97] S. Mittal, “A Survey of Techniques for Designing and
Managing CPU Register File,” CONCURR COMP-PRACT
E, 2016.
[98] S. Mittal, “A Survey Of Techniques for Architecting and
Managing Asymmetric Multicore Processors,” ACM Com-
puting Surveys, vol. 48, no. 3, pp. 45:1–45:38, January
2016.
[99] M. Dukhan, “Acceleration package for neural networks on
multi-core CPUs,” http://bitly.ws/8SyP.
[100] Z. Lu et al., “Modeling the resource requirements of
convolutional neural networks on mobile devices,” in ACM
international conference on Multimedia, 2017, pp. 1663–
1671.
[101] S. Mittal, “A Survey Of Techniques for Approximate Com-
puting,” ACM Comput. Surv., vol. 48, no. 4, pp. 62:1–62:33,
2016.
[102] N. Stephens, “BFloat16 processing for Neural Networks on
Armv8-A,” http://bitly.ws/8Szv, 2019.
[103] https://intel.ly/3ihvw9W, 2020.
[104] D. Xu et al., “Accelerating generative neural networks
on unmodified deep learning processors—a software ap-
proach,” IEEE Transactions on Computers, vol. 69, no. 8,
pp. 1172–1184, 2020.
[105] S. Verdoolaege et al., “Polyhedral parallel code generation
for cuda,” ACM Transactions on Architecture and Code
Optimization (TACO), vol. 9, no. 4, pp. 1–23, 2013.
[106] J. Ragan-Kelley et al., “Halide: a language and compiler
for optimizing parallelism, locality, and recomputation in
image processing pipelines,” ACM SIGPLAN Notices, vol. 48,
no. 6, pp. 519–530, 2013.
[107] N. Vasilache et al., “Tensor comprehensions: Framework-
agnostic high-performance machine learning abstractions,”
arXiv preprint arXiv:1802.04730, 2018.