A Survey of Deep Learning on CPUs: Opportunities and
Co-optimizations
Sparsh Mittal?, Poonam Rajput, Sreenivas Subramoney§
?ECE department, IIT Roorkee, CSE department, IIT Hyderabad,
§Processor Architecture Research Lab, Intel Labs, India.
sparshfec@iitr.ac.in,cs17mtech11019@iith.ac.in,sreenivas.subramoney@intel.com.
Abstract—CPU is a powerful, pervasive, and indispensable platform
for running deep learning (DL) workloads in systems ranging from
mobile to extreme-end servers. In this paper, we present a survey of
techniques for optimizing DL applications on CPUs. We include the
methods proposed for both inference and training and those offered
in the context of mobile, desktop/server, and distributed-systems. We
identify the areas of strength and weakness of CPUs in the field of
DL. This paper will interest practitioners and researchers in the areas
of artificial intelligence, computer architecture, mobile systems, and
parallel computing.
I. INTRODUCTION
The recent success of deep learning has made it ubiquitous in all
fields of human endeavor. Deep learning (DL) applications have
unique architectural characteristics and efficiency requirements
and hence, the choice of computing system can have a profound
impact on how large a piece of the DL pie a user can finally
enjoy. Even though accelerators may provide higher throughput
than general-purpose computing systems (CPUs) on DL applica-
tions, they may not be optimal on other metrics. For example,
the programming models of DL accelerators have not been
standardized, and their diversity leads to a lack of portability [1–3].
In comparison, the hardware/software stack of CPUs is already
well-established and understood and CPUs are also inevitably
present in any system. They can provide reasonable speedups on
a broad range of applications. In mobile and embedded domains,
CPU is still the most widely used computing system due to its high
availability, portability and software support. These features have
motivated researchers, including those from leading datacenter and
cloud-service provider companies such as Amazon [4], Facebook
[2, 3, 5–7], Google [8, 9], Microsoft [10–16] and Samsung [17–
19] to benchmark and optimize DL on CPUs.
Optimizing DL applications on CPUs, however, presents its
challenges. Achieving high efficiency requires carefully matching
the strength of CPUs with the architectural characteristics of
DL applications. For example, pruned deep neural networks
(DNNs) have sparse data-structure and perform sparse matrix
multiplication (MM), and hence, achieving high resource utiliza-
tion requires careful optimization. Similarly, convolution (CONV)
algorithm parameters, data layouts and numeric precision need to
be carefully chosen to strike the right balance between latency,
throughput and accuracy depending on the system configuration.
Further, DNN layer-specific optimizations are required for achiev-
ing best performance [6, 15, 20–24]. Systems at different scales
such as mobile and data-center, single vs. multi-node systems have
different properties and challenges, as do different DL algorithm-
s/applications such as convolution neural networks (CNNs) and
recurrent neural networks (RNNs), and different phases of DL, viz.,
inference and training. Evidently, addressing these challenges calls
for CPU-specific study and optimization of DL applications.
(This work is supported by Semiconductor Research Corporation.)
Contributions: This paper presents a survey on optimizing DL
on CPUs and CPUs for DL. Section II highlights the advantages of
CPUs for accelerating deep learning workloads and also classifies
the works based on critical parameters. Section III reviews tech-
niques for co-optimizing DNNs on CPUs. Section IV presents
techniques for optimizing DNNs, which are especially relevant
in mobile and data-center CPUs. It also discusses techniques for
data, thread and node-level parallelization. Section V concludes
the paper with a discussion of future outlook. Our goal in this
paper is not to delve into accelerator vs. CPU performance debate,
but to highlight the unique features, limitations and optimization
techniques of CPUs. We include studies performed on mobile,
server and cluster of CPUs. We also include works that profile or
benchmark DNNs on CPUs to gain insights. We include works
that utilize the computing power of CPU and not those that merely
use its memory, use it as a host, or only for parameter aggregation
in distributed training.
II. MOTIVATION AND OVERVIEW
A. Motivation for using CPUs for running DNNs
We now discuss the factors that motivate the use of CPUs for
accelerating DNNs.
High memory capacity of CPUs: 3D CONV, and even 2D
CONV with a large batch-size, requires a massive amount of mem-
ory. On GPUs, these fundamental primitives often get severely
memory-bottlenecked due to the limited memory capacity of
GPUs [25]. This forces researchers to use less accurate 2D CONV
operations. Since CPU-managed hosts in cloud and datacenter
scenarios have much larger memory capacities, running memory-
hungry operations such as 3D CONV on CPUs is not merely
attractive, but often imperative [26, 27].
Usefulness for medium-parallelism and sparse DNNs: In
some workloads such as RNN, the amount of computations
increases with rising sequence length. However, the parallelization
of RNN is challenging because of the dependencies between the
steps and the use of small batch size. Similarly, DNNs such
as InceptionNet variants have filter shapes of 1x1, 3x3, 1x3,
3x1, etc., which lead to irregular memory accesses and variable-
amount of parallelism across the layers. Such applications with
limited parallelism fit more naturally to CPUs, which have a few
fast cores, than to GPUs, which have many slow cores [10].
Likewise, task-parallelism is more suited to CPUs, whereas SIMD
(single instruction multiple data) parallelism is better exploited by
GPUs [28]. Similarly, sparse DNNs [15, 29–34] are inefficient on
massively parallel processors. It is because operations on sparse
data-structures have poor spatial locality due to irregular memory
accesses. They also prohibit effective use of optimizations such as
vectorization and cache tiling. Since, for hiding memory latency,
CPUs depend more on caches and out-of-order execution than
on parallelization, they can generally be more effective for sparse
DNNs.
Usefulness in mobile systems: Embedded/mobile computing
systems are rising in prominence; for example, more than 90% of
the advertisement earnings of Facebook come from mobile [2]. In
mobile computing systems, the CPU can provide similar or higher
performance than GPU [2, 17]. Also, for applications requiring
frequent or continuous inference, GPUs may not be most suitable
as they can quickly dissipate the battery [23]. Further, in smart-
phones, most DL frameworks do not support all operations on
GPU or DSP [2]. The unsupported computations are executed on
CPUs, which leads to CPU-GPU data-transfer overheads. These
factors motivate the use of CPU or CPU-accelerator heterogeneous
computing for accelerating NN computations [35].
Usefulness in data-center scenarios: Data-centers supporting
web services such as social networks see a significant fluctuation
in computing demand over time. CPUs can allow meeting this
variability in demand due to their high availability, efficiency for
both DL and non-DL tasks, and the ability to provide low latency
inference [6]. This allows datacenter and cloud service providers
to readily amortize their CPU-based server investments on DL
and non-DL tasks and optimize their business.
Usefulness in extreme environments: Processing systems
used in harsh environments such as space and defense require
a high-degree of fault-tolerance and security certifications [36].
Since radiation-hardening of accelerators has not been sufficiently
researched, CPUs remain the processing system of choice for
executing DNNs in such extreme environments [37].
Challenges in over-specialization: Real-life applications use
DNNs with heterogeneous properties, e.g., image classification
and understanding are performed using AlexNet and “long-term
recurrent convolutional network”, respectively, which have dif-
ferent architectural characteristics. For such scenarios, a general-
purpose architecture may provide higher efficiency than an over-
specialized design. In fact, application-specific accelerators are in-
feasible in IoT devices and wearables due to their tight power/area
budgets. For example, a smartwatch chip cannot host separate
accelerators for speech/audio/image/video processing [38]. Indeed,
the area of the Eyeriss accelerator is 30 times that of a Cortex A35 core
[30], and for such scenarios, a general-purpose core may be more
suitable than an accelerator.
Limitations of accelerators: Although accelerators provide
high throughput, they may not be optimal on other metrics.
Accelerators require long design cycles and massive investment,
and hence, they cannot be used in fast-changing domains such as
DL where algorithms and network architectures are undergoing
massive advances [38]. Also, reliable cost-sensitive deployment
of custom DL accelerators into real-world DL usage scenarios
is challenging due to complexities from heterogeneity, custom
software and programming support and integration overheads into
existing ecosystems. Further, achieving economies of scale and
business continuity both for customers and providers of datacenter
and cloud services of DL, is fraught with risk considering the
fragmented DL accelerator space. Furthermore, upgrading the
entire service from CPUs to accelerators requires high costs and
engineering work. Hence, CPUs are still used for many product
features. While large-scale companies such as Google, Amazon,
Facebook, Microsoft, etc. have the resources to build and maintain
their custom-accelerators from bottom-up, for other companies,
CPUs (or GPUs) remain the most feasible platform.
Also, accelerators have large data-transfer and network setup
overheads, which may nullify their performance advantage [39].
FPGAs run at low clock frequencies and are often difficult to
deploy and maintain [40]. GPU and “tensor processing unit”
(TPU) achieve high resource utilization and throughput when the
batch size is large. However, the strict latency constraints of real-
time inference prohibit the use of large batch sizes and CPUs can
generally provide the least or comparable latency at low batch
sizes [41]. Also, the model loading overhead of GPUs is higher
than that of CPUs [2, 42]. Hence, CPUs may be more suitable
for real-time inference and for scenarios where training times are
not very long [43]. Due to the low-precision operation of DSPs,
they may not be acceptable for applications requiring high accuracy.
Finally, the per-hour usage cost of accelerators is higher than that
of general-purpose CPU-based compute, e.g., a GPU costs 8× more
than a CPU on Amazon Web Services [44].
B. Classification
Tables I-V classify the works based on key characteristics. The
arithmetic intensity (AmI) of an application is defined as the ratio
of the number of computations performed to the amount of data
transferred to/from main memory [5, 15, 26, 45, 46].
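To make this definition concrete, the short sketch below (our own illustration, not taken from the surveyed works) estimates the AmI of a direct CONV layer from its shape, counting two FLOPs per multiply-accumulate and assuming each tensor crosses the memory bus exactly once as 4-byte floats.

# Rough arithmetic-intensity (AmI) estimate for a direct CONV layer.
# Assumes each tensor crosses the memory bus exactly once (no cache-reuse model).
def conv_arithmetic_intensity(n, c, h, w, k, r, s, bytes_per_elem=4):
    """n: batch, c: in-channels, h/w: output spatial dims, k: out-channels, r/s: kernel dims."""
    flops = 2 * n * k * h * w * c * r * s          # multiply + add per MAC
    data = bytes_per_elem * (
        n * c * (h + r - 1) * (w + s - 1)          # input fmap (valid CONV)
        + k * c * r * s                            # weights
        + n * k * h * w                            # output fmap
    )
    return flops / data

# Example: a VGG-style 3x3 CONV layer, batch size 1.
print(round(conv_arithmetic_intensity(1, 64, 56, 56, 64, 3, 3), 1))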
TABLE I
EVALUATION PLATFORM, CONTEXT AND METRICS
Evaluation on Simulator [34, 38, 47–49], Real CPU (all others)
Optimization
metric
Energy [1, 17, 21, 22, 48, 50–58], thermal issues
[2, 53], financial cost [43, 59], performance (nearly
all)
Comparison
with
execution in cloud [23, 46, 60], GPU [8, 19, 21, 52,
59, 61–63], FPGA [62], Xeon Phi [61, 62], neural
compute stick [19]
DL phase
evaluated
training [1, 8, 11, 13, 15, 25, 28–30, 33, 34, 42, 46,
49, 52, 61, 63–77], inference [1, 2, 4, 6, 9, 10, 13,
14, 16–21, 24, 25, 27, 30–32, 34, 37, 39, 42, 45, 47–
51, 53–57, 59, 64, 66, 67, 70, 76–87]
CPU model/vendor
Wearable [23, 38, 60]
Mobile ARM CPU [2, 4, 18, 20, 21, 23, 38, 39, 50, 53–
57, 68, 72, 81–83, 88–90], Qualcomm CPU [19, 50,
70, 81, 83, 89–91], Intel Atom [32, 50]
Desktop AMD CPU: EPYC [4], Opteron [73], A10-7850K
APU [52], IBM CPU: Power8 [61, 84], Power9
[68], SPARC CPU [37], Intel CPU: Coffee Lake [42],
Skylake [4, 19, 27, 37, 45, 49, 59, 64, 67–69, 71, 75,
76, 80], Broadwell [14, 26, 32, 45, 58, 66, 78], Ivy
Bridge [61, 74], Haswell [16, 31, 45, 48, 59, 63, 75,
80], Sandy Bridge [6, 10, 15, 25, 29, 63, 65, 73],
Westmere [9]
III. TECHNIQUES FOR CO-OPTIMIZING DNNS ON CPUS
We now review techniques for selecting optimal CONV scheme
(Section III-A) and optimizing data-reuse (Section III-B). In con-
text of sparse DNNs, we review hardware-aware pruning schemes
(Section III-C), schemes for avoiding ineffectual operations (Sec-
tion III-D) and for using randomized algorithms (Section III-E).
A. Choosing optimal CONV strategy
CONV can be performed using one of the four strategies [96]:
direct, lowering (based on “generalized matrix multiplication” or
GEMM), fast Fourier transform (FFT), and Winograd.
Zlateski et al. [45] compare Gauss-FFT, regular-FFT, and Wino-
grad based CONV implementations on seven CPU models. The
compute-to-memory ratio of these CPUs ranges between 14 to 41.
The highest AmI of FFT and Winograd transforms are 5.55 and
2.38, respectively, which are much lower than the compute-to-
memory ratio of these CPUs. They evaluate forward propagation
(FWP) time for each distinct layer of VGG and AlexNet. They
TABLE II
BENCHMARKS, DATASETS AND FRAMEWORKS (HMM = HIDDEN
MARKOV MODEL, MLP = MULTI-LAYER PERCEPTRON)
Benchmark/NN-model
CNN AlexNet [18, 19, 22, 24, 32, 45, 48, 50, 57, 61, 63,
64, 68, 69, 73, 82, 86, 92], GoogLeNet [18, 22, 24,
32, 47, 53, 57, 61, 76, 77, 81, 86], ResNet [4, 18–
20, 22, 27, 42, 47, 48, 53, 56, 63, 67–71, 76, 77,
81, 83, 86], VGG [4, 19, 20, 45, 46, 48, 61, 64, 71,
76, 81, 83, 85, 92], SqueezeNet [24, 37, 53, 77, 86],
LeNet [33, 51, 54, 57, 61, 74], Overfeat [46, 48, 64],
YOLO [22, 87], MobileNet [19, 39, 42, 58, 83, 86, 88],
NiN [48, 82], ResNext [77], DenseNet [77], ShuffleNet
[58], MNASnet [58], 3D CNN [28, 51, 59, 75]
RNN [10, 14, 27, 63, 80]
Attention-
based
[6, 16, 67]
NN+HMM [9, 46, 62, 65, 72, 79]
MLP [51, 52, 67]
Autoencoder [33, 52]
Dataset used
ImageNet [1, 2, 4, 6, 8, 15, 17–20, 22–25, 27, 29–32, 45–48,
51, 53, 58, 61, 64, 67–69, 71, 73, 75, 77, 81, 82, 84–
86, 88, 90, 91]
MNIST [1, 15, 25, 33, 37, 51, 54, 61]
CIFAR10 [1, 15, 25, 30, 49, 51, 55, 56, 61, 73]
Others Pascal VOC 2007 [22, 87], switchboard + hub500 [46],
One billion words benchmark [78], automatic speaker
verification spoofing and countermeasures challenge
dataset [23], Google streetview [23], Oxford flowers
[49], OpenBLAS [29], UEC-Food100 [82], KITTI
[21], Penn Treebank [80], Amazon-670K [66] and
Delicious-200K [66]
Frameworks used
Torch/Pytorch [2, 21, 22, 25, 29, 31, 39, 50, 58, 59, 63, 88]
TensorFlow [4, 10, 17, 25–27, 33, 37, 49, 53, 59, 61, 63, 66,
68, 76, 77, 84, 89]
Caffe [6, 15, 18, 24–26, 28, 30, 32, 34, 53, 55, 57, 58,
61, 63, 69, 73, 74, 81, 83, 90, 93]
Others MXNet [4, 63], CNTK [10, 14, 63], Theano [25, 28],
Apache SINGA [61]
Language/library used
MKL/MKL-
DNN
[6, 10, 14–16, 26, 28, 33, 34, 48, 49, 52, 59, 64, 67,
68, 71, 73, 75, 76, 77, 80, 81]
OpenBLAS [6, 15, 18, 29–31, 53, 57, 74, 82]
Others Eigen [9, 45, 70, 76, 81, 89], C++ [20, 75, 77, 89],
OpenCL [1, 17, 85], NNPACK [19, 81]
TABLE III
DNN-ARCHITECTURE/ALGORITHM/SOFTWARE-LEVEL OPTIMIZATIONS
(NAS = NEURAL ARCHITECTURE SEARCH)
CONV strate-
gies
Direct [15, 28, 56, 71], FFT [2, 28, 45, 56], Winograd
[2, 26, 45, 81], GEMM (nearly all others)
Optimizations
for sparse
NNs
Pruning [2, 51, 80, 94], weight sharing (k-means
clustering) [2], singular value decomposition [20,
23, 72], use of modified compressed-sparse row
(CSR) format [15, 95], avoiding transfer of [51] and
computations on [15] zero values, decomposing an
MM into a sparse and a dense MM [33]
Loop
optimizations
unrolling [9, 13, 14, 38, 64, 71, 89], interchange [13],
fusion [14]
Compiler op-
timizations
liveness analysis [47], automatic code generation
[48], static dependency analysis [30], instruction
reordering [30], template-metaprogramming [75, 89],
pre-computing constant functions/expressions [16,
89]
Algorithm/
heuristic used
linear regression model [24, 60], dynamic program-
ming [4], mixed-integer linear programming [23],
diamond search algorithm [22], load-balancing using
dynamic scheduling [8, 26, 85]
Hardware-
aware NAS
[58, 90, 91]
TABLE IV
MEMORY-RELATED OPTIMIZATIONS
Changing
data-layout/
alignment
[9, 11, 13, 16, 28, 46, 48, 55, 75, 81, 89]
Prefetching [34, 38, 48, 67, 71, 75]
Reducing
need of
temporary
memory
avoiding the need of lowering [15], expanding only
few columns during lowering [55], performing max
pooling in in-situ manner [55], after completion of a
layer, reallocating the memory of a layer to the next
layer [47]
Padding [9, 14, 59, 75]
Others
optimizations
removing nodes or code irrelevant to inference [14,
87], increasing register file (RF) bandwidth [48],
TLB optimizations [15, 67], NUMA-aware policies
[11, 73], Managed memory allocation in CPU-GPU
heterogeneous computing [17, 33, 87]
Improving reuse/locality
Tiling [13, 14, 29, 38, 46, 64, 65, 67, 71, 81]
Layer/operation-
fusion
[18, 20, 31, 64, 81, 89]
Scheduling scheduling computations so as to achieve reuse of
weights [28], executing computations of different
phases on same core [10], assigning consecutive
parts of the work on the same worker [8], executing
layers that supply data to other layer on the same
cluster [86]
Matrix-
related
MM fusion [10, 78], flattening 2D matrix to 1D
matrix to improve spatial locality [48]
Others Caching in low-motion videos [22], reusing register
operands within and across VFMA units [48]
TABLE V
COMPUTE-RELATED OPTIMIZATIONS
Improving
vectorization
efficiency
vectorizing CONV kernels across depth direction
[54], vectorizing across dense input in sparse-dense
MM [15], in cache-tiling, choosing one of the tile
dimensions to be a multiple of SIMD width [46],
mapping sparse data-structure as the same input to
all the lanes, so that the computations of all the lanes
become redundant and entire vector instruction can
be skipped [30]
Quantization [2, 6, 9, 13, 14, 16, 17, 20, 21, 31, 55, 70, 79, 82, 88]
Batching [1, 5, 6, 8–10, 16, 26, 34, 46, 65, 66, 68, 73, 78, 95]
Improving
concurrency
Pipelining [9, 14, 20, 34, 60, 65, 71, 72, 86, 87],
double-buffering [20, 89]
Approximate
computing
using lookup table to implement sigmoid/tanh func-
tions [16, 55] and arithmetic operations [79], im-
plementing tanh/sigmoid using fraction expansion
and controlling the number of terms to achieve
desired precision [10], inaccurate handling of de-
normal/NaN/infinity/zero [24], ignoring overflow [9],
performing quantization with a power-of-two scaling
[55], using cropped or resized input images [82],
low-rank decomposition [29]
Other
optimizations
using partial out-of-order core [38], choosing optimal
thread-granularity [24, 26]
find that overall, FFT CONV provides higher performance than
Winograd, although for different layers, FFT or Winograd has
higher performance. The data-movement between the highest level
of on-chip, core-exclusive cache (e.g., L2 in CPUs) and off-chip
memory accounts for memory loads, regular and streaming stores
to main memory and prefetches to main memory. By using the
AmI, data-movement, and number of FLOPs of different stages,
they compute the speedup of FFT CONV over Winograd CONV.
These theoretical estimates agree with their experimental results
that, overall, FFT CONV outperforms Winograd CONV.
In both Winograd and FFT CONVs, in most cases, transform
stages have poor utilization of computing resources. These stages
have low AmI and hence, become bottlenecked by the memory
bandwidth (BW). Winograd CONV performs fewer FLOPs than
FFT since it operates on real numbers. Yet, Winograd CONV can
work with tiny transform sizes (e.g., up to 6×6) only since it
becomes numerically unstable for larger sizes. By contrast, FFT
CONV can work with very large tile sizes (e.g., 31) since it does
not have instability issues. For large tile sizes, the image can be
partitioned into overlapping tiles with minimal padding overhead.
This reduces the number of FLOPs and data-movement in FFT
CONV below that of Winograd CONV. Thus, the type of CONV
layer and processor architecture decide whether Winograd or FFT
CONV is better, but on average, FFT CONV provides higher
performance than Winograd CONV. As the compute-to-memory
ratio of processors increases, FFT CONV will become faster than
Winograd CONV because it has higher AmI due to the use of
complex numbers.
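This roofline-style reasoning can be written down explicitly. The sketch below is our own illustration (the FLOP counts are made-up placeholders; the AmI values 2.38 and 5.55 are the transform AmIs quoted above): each algorithm's time is bounded either by compute or by memory traffic, and the faster one is chosen for the given CPU.

# Roofline-style comparison of two CONV algorithms on one CPU.
# ami: FLOPs per byte moved; flops: total work. Numbers are illustrative only.
def attainable_time(flops, ami, peak_flops, mem_bw):
    # Time is bounded either by compute or by memory traffic (FLOPs / (ami * BW)).
    return max(flops / peak_flops, flops / (ami * mem_bw))

def pick_conv(candidates, peak_flops, mem_bw):
    return min(candidates, key=lambda c: attainable_time(c["flops"], c["ami"],
                                                         peak_flops, mem_bw))

candidates = [
    {"name": "winograd", "flops": 1.0e9, "ami": 2.38},   # fewer FLOPs, low AmI
    {"name": "fft",      "flops": 1.4e9, "ami": 5.55},   # more FLOPs, higher AmI
]
# A CPU with a high compute-to-memory ratio favors the higher-AmI transform.
best = pick_conv(candidates, peak_flops=1.7e12, mem_bw=60e9)
print(best["name"])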
Budden et al. [26] implement fast Winograd CONV algorithms
on N-dimensional CONV. These algorithms prepare the matrices
with “simple” values such as integers. It helps in improving nu-
merical stability and also allows for preparing minimal algorithms
by hand. These algorithms identify and remove redundant sub-
expressions, which lowers the overhead of applying transform
matrices. Theoretically, a large speedup can be obtained for N-
dimensional tensors. So, they first extend fast CONV algorithms
to the general case of N-dimensional tensors. They note that for
N-dimensional CONV on CPUs, manually reducing the transform
overhead is not very important. That is because the MM overhead
can be amortized over more number of kernels and channels since
memory constraints are much less severe on CPU than they are
on GPU. They show that compared to direct CONV, fast tensor
CONV can provide up to 8× speedup, ignoring the overhead of
the matrix-tensor products. The sparsity of matrices allows for
increasing this speedup even further.
To achieve the peak throughput on CPU, one needs to ensure (1)
optimal utilization of single-core using SIMD instructions and (2)
scaling to multiple cores. Fast CONV algorithms perform sparse
computations, which lead to poor AmI. To improve AmI, they
perform CONV for multiple batches together. To further improve
the throughput, they use vectorization and “fused multiply add”
(FMA) operations. If the N-dimensional data-tensor is D_N, then
full utilization can be realized if D_N is an integer multiple
of the SIMD vector width. On a Xeon E7-8890 CPU, their
optimizations allow reaching 70% utilization, whereas the MKL
CONV primitives reach only 20% utilization.
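The SIMD-width requirement mentioned above can be met by padding. The snippet below is a hypothetical helper (ours, not from [26]) that zero-pads the innermost dimension of a tile to the next multiple of the vector width.

# Pad the innermost tensor dimension to a multiple of the SIMD width so that
# vector FMA lanes are fully utilized (hypothetical helper, not the authors' code).
import numpy as np

def pad_to_simd(x, simd_width=16):            # 16 fp32 lanes for AVX-512
    last = x.shape[-1]
    pad = (-last) % simd_width
    if pad == 0:
        return x
    widths = [(0, 0)] * (x.ndim - 1) + [(0, pad)]
    return np.pad(x, widths)                  # zero-pad; extra lanes contribute nothing

tile = np.random.rand(32, 30).astype(np.float32)
print(pad_to_simd(tile).shape)                # (32, 32): innermost dim is now 2*16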
They further parallelize the code such that different threads
process different subsets of tiles. Work-stealing scheduling is
used to achieve load-balancing. Let Q denote the number of
tiles processed by a thread. A too-large value of Q leads to
contention at the shared last-level cache (LLC), and a too-small
value degrades AmI. The final value of Q is selected based on
these considerations. This parallelization strategy provides near-
linear performance scaling, such as 17× speedup with 18 cores.
They further evaluate the execution time of three CONV layers
using their methodology and compare it with TensorFlow (TF)
with AVX (“advanced vector extensions”) support and Caffe using
MKL. Every layer has 32 channels and 32 kernels. The dimension
of each kernel is 4×4. With 18 cores, the throughput provided by
their technique is 10.9 MVox/s, whereas that with TF and Caffe
is 1.77 and 0.41 MVox/s, respectively.
Rajbhandari et al. [15] study the performance, multicore scal-
ing, and goodput (ratio of useful computation) of CNNs as a
function of their sparsity and arithmetic intensity (AmI). AmI
is approximated to twice the number of features in the dataset.
They note that the lowering scheme reduces AmI by virtue
of increasing the memory accesses. Hence, lowering+parallel-
GEMM leads to poor single-core performance for medium and
small-size CONVs. Also, it cannot skip computations on zero-
data and, therefore, leads to low goodput. The “Parallel-GEMM”
scheme divides the training inputs across cores. AmI per core is
reduced with increasing core count. Hence, parallel-GEMM shows
poor multicore scaling.
They present three techniques for improving CNN training
performance on multicore CPUs. (1) “GEMM-in-parallel” scheme
runs multiple single-threaded GEMMs simultaneously on different
cores. This scheme does not reduce AmI of each core and hence,
provides better scalability with the increasing number of cores.
This scheme is especially beneficial for CONVs with a smaller
number of output features since the “Parallel-GEMM” scheme
reduces their AmI even further.
(2) “Sparse Kernel:” They leverage sparsity of error gradients
to enhance the goodput of backward propagation (BWP) com-
putations by avoiding computations on zero values. They store
the error gradients in “column tiled compressed sparse row”
(CT-CSR) style, whereby sparse-matrix is tiled in a column-
wise manner, and then, every tile is stored in the CSR style.
This format is shown in Figure 1. By virtue of tiling in both
row and column dimension, CT-CSR achieves higher reuse of
tile values than CSR format. Without column-wise tiling, values
of two nearby rows may be mutually far, as decided by the
column width of the whole matrix. By contrast, CT-CSR stores
two adjacent elements in memory in adjacent rows within a
tile, which reduces TLB (“translation lookaside buffer”) misses.
Further, their technique generates vector instructions using “Intel
intrinsics” for vectorizing across the dense input and optimizing
cache reuse. Their code-generation engine utilizes “sparse-dense
MM” as the underlying code unit for efficiently executing CONV
with vectorization. The output of these code units is obtained in
place without requiring lowering. Thus, AmI is not reduced.
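The column-tiled CSR idea can be sketched in a few lines. The following is our reading of the CT-CSR layout of Figure 1, not the authors' code: each column tile is stored with its own Value, IA and JA arrays, and the example matrix mirrors the one in Figure 1.

# Column-tiled CSR (CT-CSR): tile the sparse matrix along columns, then store
# each tile in ordinary CSR form (values, row-pointer IA, column-index JA).
def to_ct_csr(matrix, tile_cols):
    n_cols = len(matrix[0])
    tiles = []
    for start in range(0, n_cols, tile_cols):
        vals, ja, ia = [], [], [0]
        for row in matrix:
            for j in range(start, min(start + tile_cols, n_cols)):
                if row[j] != 0:
                    vals.append(row[j])
                    ja.append(j - start)       # column index local to the tile
            ia.append(len(vals))               # row pointer for this tile
        tiles.append({"Value": vals, "IA": ia, "JA": ja})
    return tiles

M = [["A", 0,   0,   0,   0,   0],
     [0,   "B", "C", 0,   "D", 0],
     [0,   "B", 0,   "B", 0,   0]]
for t in to_ct_csr(M, tile_cols=3):
    print(t)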
(3) “Stencil-kernel:” For small CNNs, they use direct CONV,
which avoids lowering and hence, does not reduce AmI. It uses
stencil-style processing to spatially reuse inputs while they are in
the cache, for computing multiple output values. By comparison,
lowering forgoes reuse because of replicating the inputs. This
technique works in two steps: first, the “basic block generator”
produces register tiled vector instructions. It lowers the number
of loads and improves the reuse of loaded vector inputs. Second,
input and output are copied to contiguous memory regions to
improve TLB efficiency and then tiled to improve cache efficiency.
Overall, their technique profiles every layer with “parallel-
GEMM”, “GEMM-in-parallel”, stencil-kernel (FWP only) and
sparse-kernel (BWP only). Based on this comparison, the tech-
nique with the least latency is chosen for every layer. For BWP,
this comparison is repeated after a few layers to adapt to the
changes in the sparsity of the error gradient. For their system
and parameters, GEMM-in-parallel is superior to parallel-GEMM
when a layer has less than 1024 features, sparse-kernel is better
than “GEMM-in-parallel” for layers with more than 75% sparsity,
and stencil-kernel is superior to “GEMM-in-parallel” when a layer
has less than 128 output features. Their technique improves the
performance of a range of DNNs.
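The layer-wise selection logic just described can be summarized as a small decision rule. The sketch below uses the thresholds reported in [15] (1024 features, 75% sparsity, 128 output features), but the exact ordering of the checks is our paraphrase, not the authors' code.

# Layer-wise kernel selection following the thresholds reported in [15]
# (decision order is our paraphrase; the actual technique profiles each option).
def pick_kernel(phase, out_features, sparsity):
    """phase: 'fwp' or 'bwp'; sparsity: fraction of zeros in the error gradient."""
    if phase == "bwp" and sparsity > 0.75:
        return "sparse-kernel"                 # skip work on zero gradients
    if phase == "fwp" and out_features < 128:
        return "stencil-kernel"                # direct CONV, no lowering
    if out_features < 1024:
        return "GEMM-in-parallel"              # keeps per-core AmI intact
    return "parallel-GEMM"

print(pick_kernel("bwp", out_features=256, sparsity=0.9))   # sparse-kernel
print(pick_kernel("fwp", out_features=64,  sparsity=0.0))   # stencil-kernel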
B. Optimizing data-reuse
The parallel GEMM libraries aim at improving the performance
of GEMMs with high (e.g., >1000) data-reuse. However, in RNN
inference, the batch size is kept small for meeting the latency
SLA. Hence, the reuse in RNN remains small (e.g., <10). On
a modern CPU, RNNs fit completely in the L3 cache. Since the
shared L3 cache feeds to multiple private L2 caches, the data
required by multiple cores is transferred repeatedly between these
caches.
Fig. 1. Illustration of the CSR and CT-CSR formats [15]: the original sparse matrix is tiled along the column dimension, and each tile is then stored in CSR form (Value, IA, JA arrays per tile).
The data-transfer volume is decided by how the GEMM
computations are partitioned on the cores. For example, if two
cores compute the upper and lower half of the output matrix,
respectively, input matrix Q needs to be copied on the L2 cache
of both the cores. Similarly, if two cores compute the left and
right half of the output matrix, respectively, then input matrix
P needs to be replicated. However, parallel-GEMM libraries fail
to partition in a manner that fully leverages this data-reuse. On
Xeon E5-2650 CPU, the BW between L2 and L3 caches is only
62.5 GigaFloats/s. For a batch size of 1, the maximum data-reuse
of LSTM (“long short term memory”) is only 2, and hence, its
theoretical peak performance is 125 GigaFlops (=2×62.5). It is
below 8% of the peak performance of Xeon E5-2650 CPU (1.69
TeraFlops). Further, the parallel-GEMM libraries do not reuse the
weight matrix of RNN across different sequences.
Zhang et al. [10] present a technique for addressing the above
challenges. They model RNN computation as a directed acyclic
graph where each node is an MM computation and edges show the
dependencies. The building blocks of a schedule are the MMs. A
valid schedule is made up of a sequence of phases such that every
phase has a non-overlapping subset of nodes. If i < j, then the
nodes of phase i have to be run before the nodes of phase j. The
nodes of a phase can be run in parallel. Figure 2(a) illustrates two
valid phased schedules for LSTM. In the first schedule, all MMs
at time t are in phase t. If the MMs of a phase take hidden state
h_t as input, then the phase is termed a time-dependent phase.
Otherwise, the phase has no dependency across the sequence and
is termed a time-independent phase. For instance, in the second
schedule of Figure 2(a), phase 1 is time-independent, whereas the
remaining phases are time-dependent since they need the value of
h_{t-1} to find h_t.
Listing 1. Phased LSTM Schedules 1 and 2

// Phased LSTM Schedule 1
for t:
    Phase t:                      // time-dependent
        Wi.x_t, Wf.x_t, Wc.x_t, Wo.x_t
        Ui.h_{t-1}, Uf.h_{t-1}, Uo.h_{t-1}

// Phased LSTM Schedule 2
Phase 1:                          // time-independent
    Wi.x_0, ..., Wi.x_t; Wf.x_0, ..., Wf.x_t;
    Wc.x_0, ..., Wc.x_t; Wo.x_0, ..., Wo.x_t
for t:
    Phase (t+1):                  // time-dependent
        Ui.h_{t-1}, Uf.h_{t-1}, Uo.h_{t-1}
Fig. 2. (a) Two phased LSTM schedules (b) Parallel-GEMMs-in-sequence (c) Parallel-GEMMs-in-parallel [10]
To reduce the search-space, they leverage three heuristics: (1)
Due to the symmetry of time-dependent phases across timesteps,
the least-latency schedule is the same in every timestep. (2) If
two consecutive phases have no dependency, then their MMs can
be seen as part of a single phase. (3) Time-independent phases
are computed before all the dependent phases, as illustrated in
the second schedule of Figure 2(a). Further, they apply the following
four optimizations to improve data-reuse:
1. MM fusion: In every phase, a pair of MMs with a
common input is fused into a single MM. Assume MM1
computes C1[M,P] = A1[M,N] × B1[N,P] and MM2 computes C2[M,Q]
= A1[M,N] × B2[N,Q]. Then, they are fused by concatenating
B1 and B2 along the column dimension, such that C12[M,(P+Q)] =
A1[M,N] × B12[N,(P+Q)]. This improves data-reuse by enabling the
reuse of matrix A1.
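The fusion step itself is straightforward to express; the sketch below (ours) concatenates the two right-hand matrices, performs one GEMM, and splits the result, so A1 is read only once.

# MM fusion for two GEMMs sharing the left operand A1 [10]:
# C1 = A1 @ B1 and C2 = A1 @ B2 become C12 = A1 @ [B1 | B2].
import numpy as np

M, N, P, Q = 4, 8, 5, 3
A1 = np.random.rand(M, N)
B1 = np.random.rand(N, P)
B2 = np.random.rand(N, Q)

B12 = np.concatenate([B1, B2], axis=1)   # [N, P+Q]
C12 = A1 @ B12                           # one fused GEMM
C1, C2 = C12[:, :P], C12[:, P:]          # split back into the two outputs

assert np.allclose(C1, A1 @ B1) and np.allclose(C2, A1 @ B2)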
2. Finding parallelism-degree to optimize reuse: The naive
“parallel-GEMMs-in-sequence” scheme, shown in Figure 2(b),
seeks to parallelize the MMs on all the cores. It initially runs
the first MM on all the cores and then runs the second MM,
which leads to low performance due to the large overhead of data
movement. Since MMs in a phase can be executed independently,
they run multiple MMs concurrently such that every MM runs in
parallel. For instance, two MMs are run in parallel such that each
uses half the cores. This “parallel-GEMMs-in-parallel” scheme is
shown in Figure 2(c). Here, each MM leverages a limited set of
cores, which reduces data-duplication and boosts data-reuse. Since
both the number of MMs in RNNs and cores in CPUs are small,
the optimal degree of parallelism can be easily found.
3. Partitioning to reduce data-movement: They propose
a partitioning scheme which, given the parallelism degree D,
generates a D-partitioning of the MM computation such that the data-
movement between the L2 and L3 caches is minimized. Assume that
the MM C[i,j] = Σ_k A[i,k] × B[k,j] has D partitions. Here,
D_i, D_j and D_k are the number of partitions across the i, j and k
dimensions, and D_i × D_j × D_k = D. The data-movement is a
function of the input and output matrix sizes and the L2 and L3 cache
sizes. In their experiments, the input matrix fits in the L2 cache,
and all matrices together fit in the L3 cache. For such cases,
the combined data-movement is D_j|A| + D_i|B| + 2D_k|C|. This
quantity can be minimized by intelligently choosing the values of
D_i, D_j and D_k.
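Because the search space is tiny, the best partitioning can be found by brute force. The sketch below (ours) enumerates all factorizations of the parallelism degree D and scores them with the data-movement expression above; the matrix sizes are arbitrary placeholders.

# Choose (Di, Dj, Dk) with Di*Dj*Dk = D minimizing the modeled L2<->L3 traffic
# Dj*|A| + Di*|B| + 2*Dk*|C|, following the cost expression in [10].
def best_partition(D, size_A, size_B, size_C):
    best, best_cost = None, float("inf")
    for Di in range(1, D + 1):
        if D % Di:
            continue
        for Dj in range(1, D // Di + 1):
            if (D // Di) % Dj:
                continue
            Dk = D // (Di * Dj)
            cost = Dj * size_A + Di * size_B + 2 * Dk * size_C
            if cost < best_cost:
                best, best_cost = (Di, Dj, Dk), cost
    return best, best_cost

# Example: RNN-style GEMM with a large weight matrix B and small A and C.
print(best_partition(D=8, size_A=4_096, size_B=1_048_576, size_C=4_096))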
4. Weight-based streamlining: They further extend the above
partitioning scheme to leverage the reuse of weight matrices
(B) across time-dependent phases (TiDP) of a sequence. This
scheme ensures that weights needed for computing the partition
can be accommodated in the L2 cache of a single core, so
they can be reused before getting evicted. For this, it ensures
that parallel partitions are executed on the same core across
TiDPs. For this purpose, in OpenMP, they create a parallel region
spanning the whole RNN sequence of computation. It also reduces
the thread-creation overhead. Every thread executes at most one
partition in each TiDP, and its partition remains the same across
the TiDPs. They implement each partition using single-threaded
BLAS, which leverages vectorization and improves L1 cache
efficiency. From all the schedules generated above, their technique
chooses the one with the least latency. The above optimization
approach is applied only once for each RNN model and is used
whenever inference is performed.
Their technique (running on CPU) outperforms TF and CNTK
(running on CPU or GPU). Their technique is especially effective
for small batch sizes where reuse in a single phase is small. When
the batch size or matrix dimension is large, the entire weight
matrix does not fit in the L2 cache. However, individual weight
blocks fit in the L2 cache, and their technique exploits reuse of
weights across TiDPs. Their technique is better than cuDNN on
GPU for batch size below 15. cuDNN uses a single kernel for the
whole RNN sequence, whereas TF generates many nodes in the
computation graph, and the movement of tensors between nodes
leads to high overhead.
Fig. 3. (a) PRF design with 2 VFMA units (used in the Haswell processor) (b) a simple extension to 4 VFMA units leads to 12 reads/cycle from the PRF, which is infeasible (c) the technique of Jain et al. [48] inserts a "VFMA remote register" and across-VFMA connections for reducing the number of reads (d) architecture of a VFMA unit in their proposed design
Jain et al. [48] optimize the performance of the GEMM kernel on
CPUs. They note that increasing the number of VFMA ("vector
fused multiply add") units to increase CPU FLOPs presents the
challenge of supplying data to these units in each clock cycle.
Since GEMM uses register tiling, it repeatedly uses the data in
registers before accessing the L1 cache. Hence, a change in cache
size and/or BW alone has no impact on GEMM performance.
Both the register BW and the number of architectural registers need
to be increased together to obtain a significant speedup. Higher
register BW allows providing data to the VFMAs, and more
architectural registers allow using a larger tile size. GEMM
operations are performed using VFMA units, and each VFMA unit
needs 3 additional read ports in the physical register file (PRF).
However, increasing read ports increases the access energy and
latency of PRF [97].
Their technique focuses on reducing the number of PRF reads
by exploiting the data-reuse in GEMM operation. Figure 3(a)
shows the baseline CPU with 2 VFMA units. This design reads six
register operands from PRF in each cycle. Doubling the number
of VFMAs will require 12 PRF reads/cycle, as shown in Figure
3(b). In GEMM operations, matrix elements are reused. To exploit
temporal reuse to reduce the number of PRF reads, they add an
“architecturally visible register” termed “VFMA remote register”.
The proposed design is shown in Figure 3(c)-(d). This register
allows reusing an input at the same unit over multiple operations.
It can be written from other registers or the caches, but not from
the VFMA itself, and this restriction simplifies the hardware.
Further, they add uni-directional links between different VFMA
units (Figure 3(c)-(d)). These links allow reading an input operand
only once from the PRF and then reusing it across VFMA units.
These links do not transfer the VFMA output. Their changes allow
reusing the data within and across VFMA units, which reduces
the number of reads to RF and allows increasing the number of
VMFA units. They also discuss the “instruction set architecture”
(ISA) extensions and changes to instruction scheduling required
for utilizing the microarchitectural improvements.
They further present an “automatic code generation technique”
that generates code for GEMM based on the number of “archi-
tectural registers” and VFMA units to optimize data reuse. This
technique applies two strategies: (1) register tiling of both input
and output matrices, as shown in Figure 4(a). It allows data-reuse
in PRF, as shown in Figure 4(b). (2) prefetching-aware layout
transformation: They note that for many CONV layers, VFMA
utilization remains low despite the use of register tiling. Although
the memory access pattern is predictable, the CPU gets stalled on
cache misses because the prefetchers cannot prefetch across page
limits. In CONV layers, the matrix sizes are large, and hence, on
moving to the next row, the stride crosses a page boundary.
To mitigate this inefficiency, their technique flattens the 2D
input matrices A and B into 1D matrices such that memory
accesses are to contiguous locations. It is shown in Figure 4(c).
This transformation allows the prefetcher to prefetch the data in
the L1 cache, which improves VFMA utilization. This transfor-
mation can be performed before all the GEMM operations, and its
overhead is amortized over computations. To further reduce the
overhead, it can be interleaved with the GEMM operations. Their
technique improves the performance and “energy delay product”
of several NNs compared to an Intel Haswell server baseline.
Further, their technique benefits not only CONV layers but also
“fully connected” (FC) and LSTM layers. Also, the performance
of their automatically generated code is close to that of highly-
optimized Intel MKL.
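The prefetching-aware layout transformation has a simple software analogue: copy the register tiles of a matrix into one contiguous buffer in the order the GEMM loops will visit them. The sketch below is our illustration (tile sizes are arbitrary), not the authors' code generator.

# Prefetching-aware layout: pack the 2D tiles of a matrix into a contiguous 1D
# buffer so that the GEMM inner loops walk memory sequentially (a software
# analogue of the transformation described in [48]).
import numpy as np

def flatten_tiles(A, tile_rows, tile_cols):
    R, C = A.shape
    assert R % tile_rows == 0 and C % tile_cols == 0
    out = np.empty(R * C, dtype=A.dtype)
    pos = 0
    for i in range(0, R, tile_rows):
        for j in range(0, C, tile_cols):
            tile = A[i:i + tile_rows, j:j + tile_cols]
            out[pos:pos + tile.size] = tile.ravel()   # each tile becomes contiguous
            pos += tile.size
    return out

A = np.arange(64, dtype=np.float32).reshape(8, 8)
packed = flatten_tiles(A, tile_rows=4, tile_cols=4)
print(packed[:16])    # the first 4x4 tile, laid out contiguously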
C. Hardware-aware pruning techniques
Yu et al. [95] evaluate a hardware-unaware pruning scheme
using five CNNs on GPU, CPU and microcontroller. They find that
the performance improvement from pruning is much lower than
the fraction of reduction in computations. Sparse matrices lead
to irregular memory accesses, and decoding them requires extra
computations, which is evident from Figures 5(a)-(c). Since matrix
tiling and memory-coalescing cannot be performed on sparse MM,
pruning harms the performance of all CNNs on the GPU. Also,
pruning precludes the use of parallelization. On CPU, pruning
boosts the performance of FC layers by reducing the memory
accesses but harms the performance of CONV layers by reducing
the opportunity for weight-reuse. Since the simple architecture of
the microcontroller cannot hide memory latency, pruning boosts
performance on microcontroller by reducing model size.
Yu et al. present an architecture-aware pruning technique for
three classes of processors: (1) Highly-parallel processors such
as GPU rely on thread-level parallelism. For them, they perform
vertex-pruning, which utilizes mask layers for selecting unimpor-
tant vertices so that their output can be blocked. After training
of mask layers, the blocked vertices are removed, and once all
redundant vertices are removed, mask layers are removed, and
CNN is retrained. Node pruning does not make the CNN sparse,
and hence, it provides higher throughput than weight pruning on
GPU. (2) Processors with low-parallelism, such as Cortex-M4,
have in-order cores with only a few SIMD lanes and no caches. On
such processors, they prune weights based on SIMD awareness,
as shown in Figures 5(d)-(e). For this, weights are grouped in
size of SIMD width. Then, those groups are pruned whose “root-
mean-square” of weights is lower than a threshold. Pruning and
retraining are iteratively performed for maintaining the original
accuracy. A modified CSR scheme, shown in Figure 5(f), is used
for storing the "sparse weight matrix", which decreases the model
size. Also, loading and multiplication can be done using SIMD
instructions. (3) Moderately-parallel processors such as CPUs
leverage instruction/memory-level parallelism along with SIMD
units. For CPUs, they first use vertex pruning on CONV layers
and then use "SIMD-based weight pruning" on FC layers. Their
technique achieves higher performance than hardware-unaware
pruning with no accuracy loss.
Fig. 4. (a) Register tiling steps performed by the automatic code generator [48] (b) data orchestration (c) changing layout to improve prefetching efficiency
Fig. 5. (a)-(b) Hardware-unaware pruning (c) multiplication of an input vector with a "sparse weight matrix" (stored in CSR style) (d) weight-alignment according to SIMD width [95] (e) hardware-aware pruning (f) "sparse weight matrix" stored in a modified CSR format that enables SIMD multiplication
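A minimal sketch of the SIMD-aware group pruning described above (ours, not the implementation of [95]): weights in a row are grouped in chunks of the SIMD width, and a whole group is zeroed when its root-mean-square falls below a threshold; the iterative prune-and-retrain loop is omitted.

# SIMD-aware weight pruning in the spirit of [95]: prune whole SIMD-width groups
# whose root-mean-square is below a threshold, so surviving weights stay aligned
# for vector loads.
import numpy as np

def simd_aware_prune(W, simd_width=4, threshold=0.1):
    W = W.copy()
    rows, cols = W.shape
    assert cols % simd_width == 0
    groups = W.reshape(rows, cols // simd_width, simd_width)
    rms = np.sqrt((groups ** 2).mean(axis=2))          # one RMS per group
    groups[rms < threshold] = 0.0                      # drop weak groups entirely
    return groups.reshape(rows, cols)

W = np.random.randn(8, 16).astype(np.float32) * 0.2
print(simd_aware_prune(W, simd_width=4, threshold=0.15))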
Zou et al. [51] propose a technique to reduce data-
communication overheads while parallelizing CNN training on
a multicore processor. Figures 6(a)-(b) show two baseline tech-
niques. (a) Conventional parallelization: Here, a group of kernels
runs on each core and their output fmaps are broadcast to other
cores for allowing processing of the next layer. This scheme leads
to large data-transfer overhead. (b) Structure-level parallelization:
In this scheme, the DNN is transformed into a partially-connected
design. The cores do not broadcast the fmaps for certain layers, but
the output is consumed by neurons mapped to the same core. This
scheme reduces computation and data-transfer overhead at the cost
of accuracy loss. Also, it requires manually deciding which and
how many layers should be partitioned.
(c) “Communication-aware parallelization”: If the parameters
of a kernel are pruned to be zero during training, the output fmap
will be zero irrespective of the value of input fmap. Hence, the
input fmaps of this layer (i.e., output fmaps of the previous layer)
that will be multiplied with zero need not be transferred across
the cores. Their pruning scheme intentionally distributes the non-
zero weights at specific positions, which enables avoiding the
communication for zero-weights/fmaps. For achieving structured
pruning, they use the “group Lasso regularization” scheme. On
using mesh topology in the interconnect, the data-transfer cost
between two cores is decided by their Hamming distance. Hence,
the performance depends on the location of non-zero parameters in
a kernel. Therefore, their pruning technique also takes into account
the Hamming distances between the cores and, thus, reduces data-
transfer between distant cores. Their technique saves energy and
improves performance by reducing data-transfer overheads for a
negligible reduction in DNN accuracy.
Fig. 6. (a) Conventional (b) structure-aware and (c) communication-aware parallelization [51]
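The structured-sparsity ingredient of this technique, group Lasso regularization, can be sketched directly; the snippet below (ours) computes the penalty over per-core kernel groups, while the Hamming-distance-aware placement itself is hardware-specific and not modeled here.

# Group-Lasso regularizer over per-core kernel groups, the structured-sparsity
# ingredient of [51]. Penalizing the L2 norm of each group drives whole groups
# to zero, which lets the corresponding inter-core fmap transfers be skipped.
import numpy as np

def group_lasso_penalty(weights_by_group, lam=1e-3):
    return lam * sum(np.linalg.norm(g) for g in weights_by_group)

# Example: kernels of one layer partitioned across 4 cores.
rng = np.random.default_rng(0)
groups = [rng.standard_normal((16, 3, 3)) for _ in range(4)]
print(group_lasso_penalty(groups))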
Liu et al. [29] present a two-stage decomposition technique to
decrease the redundancy at inter-channel and intra-channel level.
In the first stage, decomposition is performed depending on the
“reconstruction error” of kernel weights. Then, the fine-tuning of
the network is performed while applying the sparsity condition. In
the second stage, the training error, sparsity of CONV kernels, and
the number of CONV bases are together optimized by minimizing
a “sparse group-lasso objective function”. Thus, they use low-
rank decomposition and also seek to achieve sparsity in filter
weights. With only a few bases, their technique achieves above
90% sparsity in CONV kernels with below 1% loss of accuracy.
Figure 7(a)-(b) contrast the operation of a CONV layer in the
conventional scheme and their technique, respectively.
Let R = P × Q, where P ∈ R^{m×k} is a (dense) input fmap
matrix and Q ∈ R^{k×n} is a (sparse) weight matrix. They further
present a sparse-dense MM scheme for efficiently running the
sparse CONV kernels. P and Q are split into tiles that can
be accommodated in the L2 cache. Further, every tile of P is
split into "row bands" of 8 elements, and every tile of Q is
split into "column bands" of 8 elements. Then, two bands are
multiplied to obtain an 8×8 matrix. Let this MM be written as
R' = P' × Q', with P' ∈ R^{8×k}, Q' ∈ R^{k×8} and R' ∈ R^{8×8}. For any matrix
M, let m_{i,*} be the ith row of M and m_{*,j} be the jth column
of M. The MM can be represented as r'_{*,j} = Σ_{i=1..k} p'_{*,i} q'_{i,j},
where 1 ≤ j ≤ 8. Here, every r'_{*,j} and p'_{*,i} is stored in one
AVX vector. For every non-zero value q'_{i,j}, i shows which p'_{*,i}
to multiply with and j shows which r'_{*,j} to save to. Since each
of them corresponds to an AVX register and Q' is a sparse matrix
which is fixed after training, they embed i and j into the code as
the index of registers. Figure 7(c) shows an example of the code
generated by their technique.
Fig. 7. (a) A CONV layer in a conventional CNN operates on a large number of CONV kernels (b) Liu et al. [29] apply decomposition on the channels and CONV kernels to obtain a highly-sparse kernel matrix (c) Pseudo-code generated for multiplying a sparse matrix with a dense matrix [29]
For MM operation, their algorithm achieves close to ideal
speedup from the sparsity. For all the layers of a CNN, their
technique achieves more than 90% sparsity and high speedup.
Their technique can accelerate both large kernels (e.g., 11×11)
and small kernels (e.g., 3×3), whereas the previous technique
can accelerate only large kernels.
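The code-generation idea can be imitated at a high level. The sketch below (ours, mirroring the style of Figure 7(c)) walks a fixed sparse matrix Q once and emits one fused multiply-accumulate statement per non-zero, with the row and column indices baked into the generated code.

# Emit unrolled multiply-accumulate statements for a fixed sparse matrix Q,
# in the spirit of the code generator of [29]. Each non-zero q[i][j] becomes
# "r_j += p_i * q_{i,j}", with i and j embedded as register indices.
def generate_sparse_dense_mm(Q):
    lines = []
    for i, row in enumerate(Q):
        for j, q in enumerate(row):
            if q != 0:
                lines.append(f"r{j} += p{i} * q_{i}_{j}")
    return "\n".join(lines)

Q = [[0, 0, 0, 0, 0, 0, 0, 1],    # non-zero at (0, 7)
     [0, 0, 0, 1, 0, 0, 1, 0],    # non-zeros at (1, 3) and (1, 6)
     [0, 0, 0, 1, 0, 1, 0, 0]]    # non-zeros at (2, 3) and (2, 5)
print(generate_sparse_dense_mm(Q))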
D. Skipping redundant instructions and memory accesses
Sen et al. [30] note that by leveraging dynamic sparsity, 25% to
60% of the computations in a DNN can be elided. However, their dynamic
nature and the fact that the sparsity levels are not very high
make it inefficient to leverage them in software. They propose
ISA and microarchitectural extensions to CPUs for exploiting
sparsity at the hardware level. To leverage sparsity, the processor
needs to dynamically detect whether an instruction (say I0)
produces a zero result and, if so, skip all future instructions that
become redundant due to I0 producing a zero value. For example,
in Figure 8(a), if the input (r8) is zero, then instructions 4, 6, and 7
become redundant. However, such instructions may not occur right
after I0 and may not occur in succession. Also, the instruction
to be skipped should not even be fetched, because squashing an
instruction after it is fetched leads to a pipeline bubble. However,
squashing a multi-cycle instruction after fetching can still provide
a performance improvement.
Figure 8(b) shows the overview of their technique. Their
technique adds a “sparsity register file” (sRF) and a “sparsity-
based skip address” (SBSA) table, which are shown in Figure
8(c). The sRF uses an isSparse bit to record which registers in
the RF store zero values. The "regUpdInFlight" bit shows if an
instruction updating the register is in flight inside the pipeline.
The SBSA table tracks under what conditions which instructions
can be avoided. An entry of the SBSA table has three fields:
(1) precedingPC: PC of the instruction just before the redundant
instruction sequence, (2) sRFCondition: an instruction is skipped
only if this condition is satisfied, and (3) instsToSkip: the length of the
redundant instruction sequence. For instance, the entry for skipping
instructions 6-7 in Figure 8(a) is shown in the second row of the
SBSA table shown in Figure 8(d).
They add a new instruction termed SBSA-LD, which loads a
specific memory region into the SBSA table. By using this instruc-
tion, the SBSA table can be pre-loaded at program startup. Since
CONV kernels use only a few library functions such as BLAS,
using only 20 entries in the SBSA table suffices for capturing all
the redundant sequences. Their technique is executed in parallel
to the instruction-cache access. The instructions/registers that operate
on or store values of sparse data-structures are identified during
compilation. Using a static dependency analysis, the instructions
that may become redundant due to the above instructions/registers
are marked. Based on these, the SBSA-LD instruction is added to the
program assembly.
Fig. 8. (a) Assembly-language code for vector dot-product (b) Overview of the technique of Sen et al. [30] (c) design of the sRF and SBSA table (d) example SBSA table entries. The dot-product code of panel (a) is:
(1)  LD r2, [p2]        // Load OUT
(2)  LOOP: LD r0, [p0]  // Load INP
(3)  ADD p0, p0, #1
(4)  LD r1, [p1]        // Load KER
(5)  ADD p1, p1, #1
(6)  FMUL r3, r1, r0    // r3 = INP * KER
(7)  FADD r2, r2, r3    // OUT += r3
(8)  INC INDEX
(9)  BNE INDEX, #N, LOOP
(10) ST r2, [p2]
If the result of instruction 2 is zero, instructions 4, 6 and 7 become redundant; if the result of instruction 4 is zero, instructions 6 and 7 become redundant; if the result of instruction 6 is zero, instruction 7 becomes redundant. Example SBSA entries (precedingPC, instsToSkip, sRF condition): (4067, 2, sRF[Rs1] | sRF[Rs2]); (4100, 5, sRF[Rs3] & sRF[Rs4]); (4250, 2, sRF[Rs3]). Each sRF entry holds an isSparse bit and a regUpdInFlight bit.
They perform instruction reordering so that the execution of I0
is finished before the fetching of redundant instructions may start. In
vector processors, generally, one input operand is the same for all
the eight lanes, whereas the other input is different. The sparse
data-structure is mapped as the same input to all lanes, which
ensures that the computations of all the lanes become redundant,
and the whole vector instruction can be skipped. They show the
detailed working of their technique for GEMM function from the
OpenBLAS library on an ARM processor. On both scalar and
vector processors, their technique accelerates both CONV and FC
layers having a wide range of sparsity.
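A software model of the skip decision may clarify the mechanism. The sketch below (ours; the register names refer to the dot-product listing reconstructed under Figure 8, and the PC-to-entry mapping is purely illustrative) consults an SBSA-like table after each instruction and jumps over the redundant sequence when the sparsity condition holds.

# Software model of the sparsity-based skip decision of [30]. The real mechanism
# is in hardware, performed in parallel with the instruction-cache access.
SBSA = {
    # preceding PC -> (sRF condition over producing registers, instructions to skip)
    5: (lambda srf: srf.get("r0", False) or srf.get("r1", False), 2),  # skip 6 and 7
}

def next_pc(pc, srf):
    entry = SBSA.get(pc)
    if entry and entry[0](srf):
        return pc + 1 + entry[1]     # jump over the redundant FMUL/FADD sequence
    return pc + 1

print(next_pc(5, {"r1": True}))      # 8: instructions 6 and 7 are skipped
print(next_pc(5, {}))                # 6: nothing to skip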
Akin et al. [34] propose a technique to (de)compress dynamic
data produced by DNNs on CPUs. They propose two SIMD
instructions, compL and compS. compS compresses and stores
the data, and compL loads and decompresses the data compressed
by compS. compS takes one 512b input register R1, one register
R2 as a pointer, and a condition flag (CF). It reads the input via
R1, compresses it, and stores it at the memory location pointed
to by R2. The CF condition can be either “equal-to-zero” or
“less-than-or-equal-to-zero”. The latter condition allows fusing
ReLU activation and compression into a single instruction. As
shown in Figure 9, compS compares each 32b element of 512b
vector in R1 based on CF. It creates a 16b (=512b/32) mask as
the compression metadata. The mask is concatenated with the
uncompressed elements of the vector and stored at the location
pointed by R2. Also, R2 is incremented by the amount of data
written plus header-size, which allows repeatedly applying compS
to compress and store more values.
For the example shown in Figure 9, assume that CF checks
elements that are equal to or less than zero and 7 out of 16
elements meet this condition. Then, the mask becomes 0x2D94.
Based on the mask, uncompressed elements are gathered together
and along with the mask, stored at the location pointed by R2. R2
is incremented by 7*4B+2B = 30B. These instructions can replace
the regular load/store instructions.
Fig. 9. Implementation of compS [R2], R1, #CF instruction [34]
They can be inserted at the
beginning and end of various layers so that data is (de)compressed
before writing to or reading from memory.
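The following is a minimal Python sketch of the compS/compL semantics described above (a mask header plus the gathered retained lanes); the bit ordering and helper names are illustrative assumptions rather than the actual ISA definition.

import numpy as np

def comp_store(vec16, cond="le_zero"):
    # vec16: the sixteen 32b lanes of a 512b register
    drop = (vec16 <= 0) if cond == "le_zero" else (vec16 == 0)
    mask = sum(1 << i for i in range(16) if not drop[i])   # 1 = lane retained
    kept = vec16[~drop]                                    # gather retained lanes
    nbytes = 2 + 4 * kept.size                             # 16b header + 32b values
    return mask, kept, nbytes                              # R2 would advance by nbytes

def comp_load(mask, kept, dtype=np.float32):
    # compL: scatter the retained values back; dropped lanes become zero
    out = np.zeros(16, dtype=dtype)
    out[[i for i in range(16) if (mask >> i) & 1]] = kept
    return out

vec = np.array([1, 0, -2, 3, 0, 0, 5, 0, 0, 7, 0, 0, 4, 0, 0, 9], dtype=np.float32)
mask, kept, nbytes = comp_store(vec)
assert np.array_equal(comp_load(mask, kept), np.maximum(vec, 0))   # fused ReLU view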
Since different steps of compression/decompression happen
sequentially, they can lead to large memory latency. This latency
can be hidden by using a stream prefetcher, since a layer writes
the fmaps sequentially, and another layer reads them sequentially
without performing random accesses. They slice the fmap into
chunks, and different threads compress their own chunks in
parallel using different compressed data pointers. Their technique
reduces physical memory usage but not virtual memory usage.
Their benchmarks show sparsity ratios above 50%, and hence,
the overhead of metadata is easily amortized. Use of pipelining,
prefetching, parallel execution and bulk communication of large
fmaps helps in hiding the latency of logic micro-ops. Their com-
pression technique allows the fmaps to be stored in the on-chip
cache, which reduces off-chip traffic. Compared to vcompress
and vexpand instructions in AVX512 ISA, their instructions use
fewer static instructions and registers. Hence, their technique can
work for both small and large fmaps sizes.
E. Use of randomized algorithms
During training, for each training data point, performing FWP
and BWP operation only on very few sampled neurons is suf-
ficient. Locality sensitive hashing (LSH) functions are those for
which collision probability increases monotonically with increas-
ing similarity. LSH algorithm provides a natural approach for
adaptive sampling since it allows sampling neurons in proportion
to weights without calculating the activations, i.e., without know-
ing the input. Since this sampling approach makes the network
sparse, it forgoes the parallelism advantage of GPUs, and hence,
it is more suited for implementation on CPUs.
Chen et al. [66] propose using randomized algorithms for
accelerating NN computations on CPU. In the initialization phase,
their technique initializes K×L LSH functions and L hash tables
for every layer. K denotes the number of hash codes in every hash
table. Let N_l^j denote neuron j in layer l, h_l denote the hash function,
and x_l denote the input for layer l. The hash bucket mapped by the LSH
function h_l(w_l^a) saves the ID a of the neuron. Each bucket has
B entries, and by choosing a small value of B, the memory usage
and overhead of parallel accumulation can be kept low. During the
FWP phase, in each layer, instead of computing all the activations,
their technique computes h_l(x_l). By using the hash codes, the
IDs of sampled (and hence, active) neurons are retrieved from
the matching “buckets” in hash tables. For instance, in Figure
10, h_1(x_1) is calculated and is used for retrieving N_1^2 and N_1^4
as the sampled neurons. Only their activations are computed
and propagated as inputs to the subsequent layer. Remaining
activations are taken as 0 and hence, not computed. In their
technique, zero values are not accessed and no computations
happen on them.
Fig. 10. Working of the FWP phase in the technique of Chen et al. [66]. For an input, first its H1 hash code is obtained and, from the first hidden layer, active neurons are ascertained. Activations are found only for these neurons. The same procedure is repeated for successive layers. Each layer uses multiple hash tables; only one is shown in the figure (L=1).
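The sketch below illustrates, under simplifying assumptions, the FWP sampling just described for one FC layer: signed random projections serve as the LSH family (the actual hash family and the bucket-size cap B differ in [66]), neuron IDs are stored in hash buckets, and only the retrieved neurons' activations are computed.

import numpy as np

class LshLayerSampler:
    def __init__(self, weights, K=6, L=4, seed=0):
        # weights: (num_neurons, fan_in); one row per neuron of this layer
        rng = np.random.default_rng(seed)
        self.weights = weights
        self.proj = rng.standard_normal((L, K, weights.shape[1]))
        self.tables = []
        for l in range(L):
            table = {}
            codes = (weights @ self.proj[l].T) > 0          # (neurons, K) sign bits
            for nid, bits in enumerate(codes):
                table.setdefault(tuple(bits), []).append(nid)   # bucket stores IDs
            self.tables.append(table)

    def active_neurons(self, x):
        # hash the layer input and return neuron IDs from the matching buckets
        active = set()
        for l, table in enumerate(self.tables):
            key = tuple((self.proj[l] @ x) > 0)
            active.update(table.get(key, []))
        return sorted(active)

    def forward(self, x):
        # compute activations only for sampled neurons; the rest are taken as 0
        out = np.zeros(self.weights.shape[0])
        ids = self.active_neurons(x)
        out[ids] = self.weights[ids] @ x
        return out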
BWP phase proceeds similarly to the FWP phase. The error is
back-propagated layer-by-layer for computing the gradients and
updating the weights. Partial gradients are propagated only to
active neurons in earlier layers through the connected weights.
Thus, any inactive neuron or weight, which is not part of the
FWP phase for input, is not accessed. Thus, the sparsity is fully
exploited, which reduces the number of arithmetic operations in
their technique to much less than that performed in GEMM. After
updating of weights, the positions of neurons in the “hash tables”
have to be modified accordingly. It requires the removal of a
neuron from the old bucket and insertion to a new bucket. The
update frequency of hash tables is reduced exponentially to reduce
its overhead, since the magnitude of gradient updates decreases
over training iterations.
Gradients are calculated independently across different input-
items in the batch. The randomness and high degree of sparsity in
gradient updates allow parallelizing gradient-accumulation asyn-
chronously over the training data-items since updates are unlikely
to create conflicts. This feature enables their technique to achieve
near-linear speedup with a rising number of cores. A small degree
of overlap is tolerated based on the HOGWILD idea. No memory
access or computation is performed on zero values.
They experiment with different hash functions in the LSH
family. They also experiment with different sampling strategies,
e.g., taking the most frequently occurring neurons in the L hash tables,
taking neurons occurring with a frequency above a threshold,
etc. They compare the performance of their technique on a 44-
core Xeon E5-2699A CPU with that of TF on the same CPU
and Tesla V100 32GB GPU. They use FC networks on extreme-
classification datasets viz., Amazon-670K, and Delicious-200K.
TF on CPU uses vectorization and has the least performance. Their
technique uses multithreading but no vectorization, and it outper-
forms GPU. The speedup is more significant on the Amazon-670K
dataset since it is a bigger dataset. In their technique, less than
0.5% of neurons are active, which reduces memory accesses and
computations. Still, their technique does not compromise on the
accuracy per iteration. With increasing batch size, the advantage
of their technique over TF-GPU grows further. Note that high
sparsity of these datasets may undermine the benefit of GPU. They
also evaluate the sampled softmax algorithm, which performs
static sampling of neurons. This algorithm requires sampling
of 20% of the total number of classes for achieving the same
accuracy as their technique. It confirms the advantage of their
LSH-based adaptive sampling technique.
IV. OPTIMIZATIONS AT VARIOUS SCALES
We now discuss techniques for optimizing CNNs in mobile
(Section IV-A) and data-center scale (Section IV-B) CPUs. We
also discuss parallelization techniques at data, thread (Section
IV-C) and node-levels (Section IV-D).
A. Optimizations for Mobile CPUs
Wu et al. [2] discuss the challenges faced by Facebook in
running the Facebook app, which runs multiple DL applications,
on edge devices such as smartphones. They focus on smartphone
models that account for 85% of the market. Mobile chipset types
and performance: No single chipset dominates the market. The
number of different “system-on-chips” (SoCs) on which Facebook
app runs is more than 2000 for Android and only about 12
for iOS. Also, there is a significant difference in the peak
performance of various mobile SoCs. Optimization techniques are
required for enabling DL inference on devices with widely varying
performance to achieve satisfactory user-experience.
Prospects of CPU and GPU: On a median mobile SoC, the
theoretical peak performance of CPUs equals that of GPUs and
only on 11% of mobile SoCs, the performance of GPU is 3× that of
CPU. Further, compared to CPUs in high-end SoCs, CPUs in mid-
end SoCs are only 20% slower, but the GPUs in mid-end SoCs
are up to 4× slower than the GPUs in high-end SoCs. Further, the
lack of high-BW memory and sharing of the BW between CPU
and GPU constrain the performance of GPU.
Further, in smartphones running Android, the programming
support for mobile GPUs is not fully mature. For example,
some smartphones run older versions of OpenCL/OpenGL; in
some smartphones, loading the library leads to failure or crashes,
whereas others have no library. Hence, in the mobile domain,
accelerators such as DSP/GPU have limited scope. In fact, in the
mobile realm, a significant fraction of inference runs on CPU
because of its wide availability, standardization, and software
support. The use of mobile GPUs is feasible where the software
support is mature such as in iPhones.
CPU architecture: In 2018, 72% of smartphones still used
CPU cores designed before 2013. Further, Android smartphones
generally have a higher number of less powerful cores, whereas
iOS smartphones have fewer but more powerful cores. The two
most widely-used CPU models are Cortex A53 (48% share) and
Cortex A7 (15% share). Both these cores are superscalar, in-
order and allow using one to four cores per cluster. Most SoCs
have big.LITTLE architecture [98] where one cluster has “high-
performance cores” and another cluster has “energy-efficient
cores”. There is a shared cache between cores in the same
clusters but not between cores in different clusters. Hence, the
synchronization between clusters incurs a high cost, and as such,
Facebook apps are targeted to run on the cluster with high-
performance cores.
Optimizations to Facebook app: To account for the limited
memory capacity of mobile, they use techniques for reducing
model and binary size, e.g., weight/channel pruning, quantization,
selecting optimal spatial resolution and reducing the complexity
of DL algorithm. Further, the app uses NNPACK [99] and
QNNPACK [7] libraries, which offer efficient implementation of
CONV and other DNN primitives on mobile CPUs. The use of
two libraries allows achieving high performance on a range of
smartphones and usage scenarios.
Comparison of CPU and DSP: They evaluate DNN models
used for virtual reality on the CPU and DSP of a smartphone. The
memory space is shared between DSP and CPU, but they have a
separate layer of caches. For all DNNs, DSP outperforms CPU,
with a mean speedup of 1.9×. The speedup is high for DNNs
with simple CONV operations, such as image classification. The
speedup of DSP over CPU is lower for DNNs with memory-
bound operations such as depth-wise CONV and “pose estimation
models”. In DSP, load-store operations happen at the granularity
of 128B or more. It requires extra memory transformation op-
erations. Also, for memory-intensive layers, e.g., group CONV,
additional computations are needed for optimizing the memory
layout of filters and activations for reaping the full benefits of
vectorization.
They also compare the power, thermal, and performance profile
of CPU and DSP. CPU consumes 2× the power of DSP, and thus,
due to its high power-dissipation, thermal throttling is performed
on CPU. It reduces CPU power, but it still stays higher than that
of DSP. Throttling reduces the throughput of CPU significantly.
Compared to CPU, DSP also has a lower variation in inference
latency, which leads to better user experience. Due to these rea-
sons, the Facebook app executes virtual-reality related models on
DSP. The limitation of DSP is its higher programming overhead,
the need for optimizing the layout and lower accuracy due to the
use of fixed-point (FxP) data/computations.
Lai et al. [55] propose software kernels for executing NNs on
ARM Cortex-M CPUs that implement SIMD instructions, such
as 16-bit MAC (multiply accumulate) operations. Quantization:
They develop kernels that can work with both 8b and 16b data.
A few Arm Cortex-M processors lack a dedicated “floating-point
unit”. Hence, they implement quantization with a power-of-two
scaling, which only requires bit-shift operations. Converting from
8b to 16b data-type requires data-transformation. They optimize
data-transformation to reduce the reordering overhead. Also, MM
is implemented with 2x2 kernels to achieve data-reuse and reduce
the number of load instructions.
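As an illustration of power-of-two scaling, the following Python sketch (an assumption about the general idea, not the CMSIS-NN source) quantizes values to q7 with a chosen number of fractional bits and rescales a 32b accumulator with a simple right shift, so no floating-point unit is needed at inference time.

import numpy as np

def quantize_q7(x, frac_bits):
    # value is represented as q * 2^(-frac_bits)
    return np.clip(np.round(x * (1 << frac_bits)), -128, 127).astype(np.int8)

def requantize_shift(acc32, out_shift):
    # rescale a 32b accumulator of 8b x 8b products back to q7 by a right shift
    return np.clip(acc32 >> out_shift, -128, 127).astype(np.int8)

w_q = quantize_q7(np.array([0.5, -0.25]), 7)                 # weights, 7 frac bits
a_q = quantize_q7(np.array([0.75, 0.1]), 5)                  # activations, 5 frac bits
acc = w_q.astype(np.int32) @ a_q.astype(np.int32)            # 32b accumulation
print(requantize_shift(acc, 7))                              # output with 5 frac bits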
Weight reordering and partial-lowering: Although imple-
menting larger kernels improves the scope for data-reuse, the
availability of limited registers prohibits the use of large kernels.
As the weights are constant during inference, they reorder the
weight matrix to interleave the row data so that it can be read in
single pointer access. Performing CONV using lowering-method
involves reordering and expanding the image input using im2col
and then performing GEMM operation. However, the im2col
operation requires a large amount of temporary memory. To avoid
the memory overhead, they expand only a few (e.g., two) columns.
This partial im2col approach brings the memory footprint
of the CNN to 133KB, whereas the naive im2col consumes
332KB memory.
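The following numpy sketch illustrates the partial im2col idea under simplifying assumptions (stride 1, no padding, hypothetical function names): only a few im2col columns are materialized at a time, and a small GEMM is applied to each partial buffer, so the temporary memory stays proportional to the number of expanded columns rather than to the whole lowered matrix.

import numpy as np

def conv2d_partial_im2col(x, w, cols_per_step=2):
    # x: (C, H, W) input; w: (K, C, kh, kw) kernels
    C, H, W = x.shape
    K, _, kh, kw = w.shape
    oh, ow = H - kh + 1, W - kw + 1
    wmat = w.reshape(K, C * kh * kw)                    # flattened, reordered weights
    out = np.zeros((K, oh * ow), dtype=x.dtype)
    for start in range(0, oh * ow, cols_per_step):
        idx = range(start, min(start + cols_per_step, oh * ow))
        cols = np.empty((C * kh * kw, len(idx)), dtype=x.dtype)   # small buffer
        for j, p in enumerate(idx):
            r, c = divmod(p, ow)
            cols[:, j] = x[:, r:r + kh, c:c + kw].reshape(-1)
        out[:, start:start + len(idx)] = wmat @ cols    # GEMM on the partial buffer
    return out.reshape(K, oh, ow)

x = np.random.rand(6, 12, 12).astype(np.float32)
w = np.random.rand(16, 6, 5, 5).astype(np.float32)
y = conv2d_partial_im2col(x, w)                         # shape (16, 8, 8)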
Choice of layout: With a batch size of one, there are two
data-layouts: CHW and HWC (C=channel, H= height, W=width).
GEMM performance is independent of the layout. However, data-
movement operations in im2col are more efficient with HWC-
layout. im2col is executed only along the height and width
dimensions. With HWC-layout, data can be copied efficiently
since the data for every pixel is stored at contiguous locations.
Hence, they assume an HWC layout for applying CONV kernel.
Pooling: Pooling can be implemented in window-based (i.e.,
traditional) or split-x/y manner. In the split-x/y style, pooling is
performed first in x-dimension (along the width) and then in y-
dimension (along with the height). It allows reusing the average/-
max operations performed in x-dimension for the y-dimension
also. To avoid the need for extra memory for storing interim
result after x-dimension pooling, they overwrite the input value.
Compared to window-based pooling, split-x/y pooling improves
performance without any extra memory cost.
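A minimal Python sketch of split-x/y pooling, assuming a single-channel fmap whose dimensions are divisible by the pooling size: the x-dimension results are written over the input buffer, and the y-dimension pass then reuses them, so no interim buffer is needed.

import numpy as np

def split_xy_maxpool_inplace(x, k):
    # x: (H, W) fmap, overwritten in place; H and W assumed divisible by k
    H, W = x.shape
    for j in range(0, W, k):
        x[:, j // k] = x[:, j:j + k].max(axis=1)        # x-dimension pooling
    for i in range(0, H, k):
        x[i // k, :W // k] = x[i:i + k, :W // k].max(axis=0)   # reuse row results
    return x[:H // k, :W // k]

fmap = np.random.rand(8, 8)
pooled = split_xy_maxpool_inplace(fmap, 2)              # shape (4, 4)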
Activation functions: To implement the ReLU layer, they
create a mask based on the sign bit of a q7_t number. For this,
the byte-level subtraction instruction (QSUB8) is used. It is
performed in a SIMD manner, which offers a 4× performance-
boost over the scalar operation. Sigmoid and tanh functions are
implemented using lookup tables with fixed-point input/output.
They use the “CMSIS-NN kernels” [55] for a CNN which is
trained on the CIFAR-10 dataset. On a Cortex-M7 core with
216MHz frequency, they achieve a frame-rate of 10.1 frames
per second. Their implementation provides a large improvement in
throughput and energy efficiency compared to a baseline that uses the
arm_conv 1D CONV function from the CMSIS-DSP library.
Lee et al. [54] present a vectorization-friendly CONV scheme
for improving CNN throughput. LeNet-5 has three CONV layers,
denoted as C1, C2, and C3. Let N denote CONV kernel type,
and W/D/H denote kernel width/depth/height. C1 has N types of
H×W CONV kernels. C2 and C3 have N types of 3D CONV
kernels (D×H×W). For C1, N=6, H=5, and W=5. For C2, N=16,
H=5, W=5, and D=6 and for C3, N=120, H=5, W=5, and D=16.
Kernel C2 is shown in Figure 11(a). In the traditional scheme,
there are N×H kernels of size W in 2D CONV and N×D×H
kernels of size W in 3D CONV. With 128B vector register and
16b FxP weights/inputs, there are 8 lanes in each vector. With
the traditional approach, W=5 and thus, 3 out of 8 lanes remain
unused for C1, C2, and C3.
Their proposed approach works by vectorizing the CONV
kernels in the depth direction. With this approach, there are W×H
kernels of size N in 2D CONV and N×W×H kernels of size
D in 3D CONV. Figure 11(b) shows the shape of kernel C2
after reshaping. Here, for C1 and C2, the number of lanes used
increases from 5 to 6. For C3, the vector size is 16, and thus,
each row uses 2 vector registers, and there is no idle lane. Figure
11(c)-(d) illustrates the working of their technique for the C3 layer.
Figure 11(e) shows the usage of SIMD lanes in the traditional and
proposed technique. They evaluate their technique on a Cortex-
A53 quad-core processor using LeNet-5 trained on the MNIST
dataset. On both single-core and multi-core (with OpenMP par-
allelization), their technique provides higher performance and
energy efficiency than a traditional SIMD implementation.
Xu et al. [22] note that during continuous vision scenarios,
mobile devices have no or low motion. Hence, the adjacent frames
in a video have regions with a high degree of similarity. They
exploit this locality by using a cache, which reduces latency and
energy consumption. They note that with increasing layer-ID, the
size of a reusable region is reduced, and this is referred to as
“cache decay”. As shown in Figure 12(a), if the input to a CONV
layer has a reusable portion of 5x5 pixels, then its output has a
reusable portion of only 3x3 pixels. Pooling and “local response
normalization” (LRN) layers also create cache decay. In fact, FC
layers delete the reusable portion since every element in its input
determines the value of the output element. Thus, due to cache
decay in multiple layers, a large reusable portion vanishes after a
few layers. Hence, they perform reuse only on CONV layers to
bound the memory overhead and also because CONV layers take
the largest fraction of execution time. Since the inputs of each
layer have different dimensions and semantics, their technique
matches only the input image and passes on the cached portions
for the entire inference.
Matching decisions: They note that pixel-level matching and
reuse are ineffective since similarity scores of matching pixels
are inadequate even for similar scenes in two overlapped images.
Hence, they perform matching at the granularity of a chunk of 8x8
pixels. Thus, a single or few changed pixel(s) do not impact the
chunk-wise matching result if the remaining pixels in the chunk
show a match. The matched chunks are chosen in a way that they
can be combined into larger chunks. For example, in Figure 12(b),
the similarity score may be highest for the match shown in (i), but
this match is not suitable for caching. It is because small chunks
disappear after a few layers due to cache decay. For example, for
a 3x3 CONV kernel and 5x5 chunks B1 and B2, the reusable
portion has two 3x3 boxes, i.e., 18 pixels. By comparison, the
match shown in Figure 12(b)(ii) finds two neighboring chunks
in the current frame that bear similarity to the chunks in the
prior frame and hence, these two chunks can be consolidated into
a single one. Thus, the reusable portion becomes a 3x10 box,
which has 30 pixels.
Their technique searches the same-size chunk in the previous
frame with the highest match using the “diamond search algo-
rithm”. The “average chunk movement” is computed as the mean
movement of the matched chunks whose “peak signal to noise
ratio” (PSNR) exceeds a threshold. Then, for every chunk in the
present frame, its PSNR is computed with the chunk at an offset
of “average chunk movement” in the previous frame. If the PSNR
exceeds the threshold, they are assumed to be matched. Finally,
the neighboring matched chunks are merged into larger chunks.
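The sketch below illustrates the chunk-matching step under simplifying assumptions (grayscale frames, a fixed 8x8 chunk size, a hypothetical PSNR threshold): each chunk of the current frame is compared, at the average-chunk-movement offset, with the previous frame, and chunks whose PSNR exceeds the threshold are treated as matched.

import numpy as np

def psnr(a, b, peak=255.0):
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10 * np.log10(peak * peak / mse)

def matched_chunks(prev, cur, offset, chunk=8, thresh=30.0):
    # offset = (dy, dx): the average chunk movement estimated beforehand
    H, W = cur.shape
    dy, dx = offset
    matches = []
    for y in range(0, H - chunk + 1, chunk):
        for x in range(0, W - chunk + 1, chunk):
            py, px = y + dy, x + dx
            if 0 <= py <= H - chunk and 0 <= px <= W - chunk:
                if psnr(cur[y:y+chunk, x:x+chunk],
                        prev[py:py+chunk, px:px+chunk]) >= thresh:
                    matches.append((y, x))      # reusable chunk of the current frame
    return matches

prev = np.random.randint(0, 256, (64, 64)).astype(np.uint8)
cur = np.roll(prev, 8, axis=1)                  # simulate slight horizontal motion
print(len(matched_chunks(prev, cur, offset=(0, -8))))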
During inference, they adjust the cache mapping based on the
properties of each layer. Figure 12(c) illustrates propagation of an
exemplary reusable portion across different layers. To understand
the figure, assume that a block (100, 40, <100, 100>) is matched
to a block (100, 40, <120, 120>) in the last frame, where the
format for showing a block is (width, height, <coordinates of top-
left point>). This image is the input of a CONV layer. For this
layer, the reusable portion of the output is computed as (45, 15,
53, 53). The ReLU layer does not change the reusable portion as
it operates individually on the value of each input. But the pooling
layer reduces the size of the reusable portion due to the padding.
During CONV operation, first, the reusable portions are copied
from the cache of the previous frame, and then, CONV is
performed only on the remaining pixels. Further, the output fmap
of every CONV layer is cached until the completion of the
inference of the subsequent frame. Although their technique reuses
similar image patches, accuracy is lost since the patches may
be numerically different. To avoid the accuracy loss due to the
use of cached data from very old frames, they flush the cache
and compute an entirely new frame after every 10 frames. For a
range of DNNs, their technique reduces energy consumption and
processing latency with only small accuracy loss.
Lu et al. [100] study the characteristics of AlexNet, ResNet-50,
GoogleNet and VGG-16 on CPUs of Jetson TK1 and TX1. TK1
has a 2.5GHz 32b quad-core Cortex-A15 CPU and 2GB DDR3L
memory. TX1 has a 1.9GHz 64b quad-core Cortex-A57 CPU and
4GB LPDDR4 memory. For AlexNet, CONV layers run faster on
TX1 CPU, whereas FC layers run faster on TK1 CPU. The FWP
latency is lower on TK1 even though its CPU is weaker than that
of TX1. Although the clock frequency of TX1 is lower, it executes
more instructions in each cycle than TK1 (3 vs. 2). Hence, it has
higher performance for CONV layers, which are compute-bound.
TK1 has a larger L2 cache and since TX1 uses 64b address, more
memory is consumed for addressing and less memory remains for
saving the data in the cache. Hence, data needs to be fetched more
frequently when running FC layers on TX1 and this is responsible
for large latency.
For the remaining three DNNs, TX1 CPU provides lower FWP
latency than TK1 CPU. Since VGG-16 performs multiplications
of large-size matrices, the throughput on VGG-16 is double of that
on AlexNet.
Fig. 11. (a) Original dimensions of CONV kernel C2 (b) dimensions of C2 after vectorization-friendly reshaping [54] (c) with conventional shapes, on vectorizing the C3 layer, only 5 out of 8 lanes are used (d) after reshaping [54], all 8 lanes are used (e) utilization levels [54]: the idle lanes per vector fall from 3/3/3 (C1/C2/C3) with conventional shapes to 2/2/0 with depth-directional reshaping.
Fig. 12. (a) Illustration of cache-decay due to CONV: a reusable region of 5x5 pixels in the input fmap shrinks to 3x3 pixels in the output fmap (b)(i) a match with the highest similarity score but a low number of reusable pixels (b)(ii) a match with a high number of reusable pixels [22] (c) change in the reusable portion as it passes through different layers.
On TX1, the inference latency of ResNet is lower
than that of GoogleNet, even though ResNet has twice the number
of FLOPs than GoogleNet. It is because GoogleNet uses LRN
layers, which account for 55% of FWP latency on TX1. Also,
GoogleNet has a higher number of CONV layers and multiplies
matrices of smaller size. Since TK1 and TX1 run 32b and 64b
OS (respectively), TX1 needs more memory due to the higher
needs of the framework itself.
Motamedi et al. [24] present a technique to select the best
thread-granularity while running a CNN on a mobile SoC. Here,
threads imply threads of CPU, DSP and GPU. They note that
launching the highest number of logical threads reduces data-
reuse and exacerbates thread scheduling overhead. Also, due to the
limited resources of mobile SoC, thread-execution is serialized.
Hence, this approach does not provide the highest performance.
Further, various CNNs and even various layers of a CNN show
the highest performance with different numbers of threads. Also,
for the same CNN, the optimal thread-count is generally higher
for more powerful SoCs and smaller for less powerful SoCs.
Android devices allow two modes of approximate computing
[101]: (1) relaxed mode where denormalized numbers are handled
inaccurately. (2) imprecise mode where NaN, infinity and ±0
are also handled inaccurately. By using these modes, SIMD
instructions can be enabled and the application can be parallelized
over a higher number of threads. Their technique evaluates the
impact of changing the mode of each layer on the overall accuracy
of the CNN. Based on this, for every layer, either exact or one
of the approximate computing modes is selected. Further, their
technique exploits parallelization opportunities at different levels,
e.g., all output fmaps of a layer are obtained parallelly. Also,
multiple pixels in an output fmap are computed in parallel. In
their technique, a separate thread computes every pixel and thus,
the total number of threads equals the total number of pixels in
all output fmaps of a layer. The computation of a thread is further
accelerated using sub-word parallelization. Also, loop-order is
optimized for improving cache performance and a suitable data-
layout is chosen for improving BW utilization.
They study the use of coarse thread-granularities, which is
shown as the number of pixels (say Q) computed by each thread.
For deciding which Q pixels should be computed by a thread, they
discuss two schemes: (1) A pixel is selected from another position
in the same output fmap. Here, the kernel is reused Q times, which
reduces memory accesses. (2) A pixel is selected from the same
position in another output fmap. These pixels are produced from
CONV of the same input fmap with different kernels, and thus,
the input fmap is reused Q times. From these schemes, they select
one which leads to fewer memory accesses.
They develop a linear regression model for correlating the
“computational complexity” of each layer with its latency under
a fixed frequency (1497MHz on Snapdragon 800). A separate
regression model is developed for each value of Q. Also, separate
models need to be developed for every SoC. They find using
Q = 1 always leads to the highest latency and the best latency
value is up to 4× lower than the latency with Q = 1. Using
this model, they select the value of Q for each layer, which
provides the lowest latency. They perform experiments on Nexus 5
(with Snapdragon 800) and Nexus 6 (with Snapdragon 810) using
AlexNet, SqueezeNet, and GoogLeNet. Their technique reduces
the energy and latency compared to baseline execution. In fact,
the latency with their technique is close to that obtained using the
exhaustive search.
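The following Python sketch shows the flavor of the per-Q regression models (the calibration numbers and function names are hypothetical): for each candidate Q, latency is fit as a linear function of a layer's computational complexity, and the Q with the lowest predicted latency is chosen for a new layer.

import numpy as np

def fit_models(calib_complexity, calib_latency_per_q):
    # calib_latency_per_q: {Q: measured latencies of the calibration layers}
    return {q: np.polyfit(calib_complexity, lat, 1)
            for q, lat in calib_latency_per_q.items()}

def best_q(models, layer_complexity):
    # pick the Q whose model predicts the lowest latency for this layer
    return min(models, key=lambda q: np.polyval(models[q], layer_complexity))

# hypothetical calibration data for one SoC at a fixed frequency
complexity = np.array([1e6, 5e6, 2e7, 8e7])
latency = {1: [4.0, 19.0, 80.0, 330.0], 4: [1.5, 6.0, 23.0, 95.0],
           16: [1.2, 4.5, 18.0, 70.0]}
models = fit_models(complexity, latency)
print(best_q(models, 3e7))            # Q selected for a new layer of this complexity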
Mehta et al. [38] note that the battery capacities of a typical
smartphone and smartwatch are 2550mAh and 300mAh, respec-
tively. Hence, making the right choice of core-type, core-count,
and their capabilities are essential to meet the quality-of-service
(QoS) in the processors used for wearables such as smartwatches.
They study the characteristics of wearables to find the optimal
core design for them. They first identify 10 frequently-used and
computation-intensive applications running on wearables, which
have strict QoS requirements. These applications termed Wear-
Bench come from domains such as image/video/audio/speech
processing. Most of these applications are parallel and can ben-
efit from using multiple cores. Also, they are data/computation-
intensive, but not control-intensive. Also, they have high L1 data
cache miss-rate. 25% of their operations are vector operations,
and hence, cache misses have a crucial impact on performance.
Based on these factors and constraints, a smartwatch needs to
use multiple (e.g., 4) simple cores that are reasonably efficient
for applications running on them. An inorder core is stalled on
each miss, and hence, it is inefficient due to the high data-
cache misses of WearBench. Therefore, they use a partial out-
of-order core, which avoids read-after-write hazards by using a
scoreboard and can have at most two outstanding cache misses.
It is less complicated than a full out-of-order core due to not
using speculation or renaming and having a smaller L2 cache.
Since scoreboarding is not enough for reaching the performance
of an out-of-order core, they use a “software-assisted hardware
prefetcher” for prefetching the data to the L1 cache. It reduces the
number of L1 cache misses. The application-developer needs to
insert a prefetching instruction in the program. Since the instruc-
tion is inserted before the loop, its overhead remains insignificant.
The hardware need not track the data-item to be prefetched as
the instruction itself specifies it. Instead, the hardware finds the
correct prefetch distance. Also, prior information on the number
of cache-blocks to be prefetched allows the hardware to prefetch
across physical pages. Results confirm that on the metric of
performance/area and performance/power, their core design is
better than both inorder and out-of-order cores. Also, “software-
assisted hardware prefetching” is crucial for high performance and
is more effective than hardware-prefetching.
B. Optimizations at Data-center scale
Hazelwood et al. [3] discuss the hardware used by Facebook
data-centers for supporting a diverse range of machine learning-
based services/products. Hardware infrastructure: Their data-
centers have nearly eight major types of compute and storage
racks. For instance, a 2U chassis has three compute sleds, which
can support two types of servers. One option is a “single-socket
CPU” (1xCPU), which supports web-tier. It is a throughput-driven
state-less application and hence, works well with a power-efficient
CPU (e.g., Broadwell-D) with somewhat small memory and
storage capacity. Another sled alternative is a “dual-socket CPU”
(2X Skylake SP or Broadwell-EP) with high memory capacity for
supporting memory and computation-intensive applications.
Training: Offline training of different services is performed
over different platforms. Sigma (used for classification and
anomaly detection) and “News Feed” are trained on CPUs.
Language translation, speech recognition and Lumos (used for
extracting high-level features from images) are trained on GPUs.
For Facer (face detection and recognition framework), the generic
model is trained on GPUs after many months due to its sta-
bility. The user-specific model is trained on 1xCPUs when a
sufficient number of new images have been generated. Similarly,
for “Search”, both CPU and GPU are used for training. GPU
is primarily used for offline training due to its throughput-
optimized architecture. Although GPUs provide higher throughput
than CPUs, the availability of a large number of CPUs makes them
an attractive target for running DL applications, especially during
off-peak hours. They also leverage distributed training techniques
for scaling to CPU-GPU heterogeneous computing platforms.
The global scale services need hundreds of terabytes of data
and complex processing. The inability to rapidly and continuously
train DL models can have severe consequences such as presenting
irrelevant/stale news and ads, and not blocking spam and offensive
contents. Further, while data-workload is rapidly-changing and
complicated, training operations show higher stability (only a
few key operations) and regularity and prefer processors with-
out cache/thread-contention. Hence, these workloads are run on
different processors. The data-processing servers read the data,
process and compress them and transfer them to training servers
that exclusively focus on efficient training.
Inference: Different services have different memory and
computation requirements. As an example, the “ads ranking
scheme” performs multiple rounds of screening using a multi-
layer perceptron-like model. This model has a sparse embedding
layer and hence, has a high memory footprint. Therefore, the
later rounds, where the memory footprint becomes even higher,
are executed on a different server. Most of the online inference
is done on 1xCPUs or 2xCPUs. Since 1xCPUs have higher
energy and cost-efficiency, they are preferred over 2xCPUs. Some
services can be run on the powerful mobile devices of end-user,
which reduces communication overheads. 2xCPUs are required
for running some bulky services. The latency SLAs also determine
the compute-platform chosen for running a service.
Gupta et al. [5] study three recommendation models (RMs),
termed RM1, RM2, and RM3, which are representative of the
production-class RMs used at Facebook. Characteristics of RMs:
The input to RMs is the interaction between the user and the
content, e.g., the user preferences for videos. These inputs include
both dense features (e.g., videos seen by many users) and sparse
features (videos seen by a user). Sparse features are shown as
multiple vectors of sparse IDs. Dealing with sparse features
requires the use of “embedding tables” (ETs), which convert from
sparse to dense format.
The bottom-FC layers process dense features. Their outcomes
are concatenated and processed by the top-FC layers, which
predicts the “click-through-rate” of the user and the content
(video/post). The requests for different user-post pairs are batched
together. A single RM may require up to 20GB memory. Also,
in RM1, RM2 and RM3, the size of ETs are 100MB, 10GB and
1GB, respectively. Further, embedding table (ET) operations lead
to irregular memory accesses. Hence, on a Broadwell CPU, this
causes an LLC MPKI (misses per kilo-instruction) of 8, which
is orders of magnitude higher than that of FC layers of RM or
a typical CONV layer in a DNN. Similarly, the element-wise
addition has very poor AmI.
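To make the access pattern concrete, the following numpy sketch mimics the SparseLengthsSum semantics (gather embedding rows by sparse IDs and sum them per bag); the table sizes are illustrative, and the scattered gathers over a large table are the irregular accesses responsible for the high LLC MPKI.

import numpy as np

def sparse_lengths_sum(emb_table, ids, lengths):
    # emb_table: (num_rows, dim); ids: flat list of sparse IDs; lengths: IDs per bag
    out = np.zeros((len(lengths), emb_table.shape[1]), dtype=emb_table.dtype)
    start = 0
    for i, n in enumerate(lengths):
        out[i] = emb_table[ids[start:start + n]].sum(axis=0)   # irregular gathers
        start += n
    return out

table = np.random.rand(100000, 32).astype(np.float32)          # a large ET
ids = np.random.randint(0, 100000, size=50)
pooled = sparse_lengths_sum(table, ids, lengths=[20, 30])      # shape (2, 32)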
The recommendation is done in two steps: filtering and ranking.
The filtering step uses lightweight machine learning models or
DNN-based RM1 to reduce the number of candidate posts sig-
nificantly. Then, using DNN-based RMs, tens of posts are finally
selected. The RMs used for ranking (RM2 and RM3) are bulkier
than those used for filtering, e.g., the bottom-FC layers of RM3
are larger since it uses more dense features. By comparison, RM2
has more ETs since it uses more sparse features. RM3 is compute-
intensive and RM2 is memory-intensive. RM3 is utilized for
recommending social media posts, which possess dense features.
By contrast, RM1 and RM2 are being used in services with several
sparse features. Hence, the number of ET lookups per input is
higher in RM1 and RM2 than that in RM3. Also, the memory
accesses of RM1 and RM2 show higher irregularity and number
of cache misses.
They perform experiments on CPUs of three generations,
namely, Haswell, Broadwell and Skylake. Their architectures are
shown in Table VI. Results on latency: On Broadwell CPU,
the latencies of RM1, RM2 and RM3 are 0.04ms, 0.30ms and
0.60ms, respectively. RM2 has many ETs and RM3 has wide
TABLE VI
CPU characteristics [5] (all three have 2 sockets and 256GB DRAM)
Processor          Haswell        Broadwell      Skylake
Frequency          2.5GHz         2.4GHz         2.0GHz
SIMD               AVX-2          AVX-2          AVX-512
Cores-per-socket   12             14             20
L1/L2/L3 (KB)      32/256/30720   32/256/35840   32/1024/28160
L2/L3 type         Inclusive      Inclusive      Exclusive
DRAM Freq.         1600 MHz       2400 MHz       2666 MHz
DRAM type          DDR3           DDR4           DDR4
DRAM BW            51 GB/s        77 GB/s        85 GB/s
FC layers. The operations responsible for highest fraction of
latency are SparseLengthsSum (an ET operation) in RM2
and BatchMatMul and FC computations in RM3. Clearly,
accelerating only MM, such as FC or BatchMatMul, will not
boost the performance of all three RMs. Acceleration of memory-
bound operations, e.g., ET lookups, is also important.
Results at different batch sizes: In the data-center context,
“throughput under a latency constraint” is a more important
metric than latency. For improving throughput, multiple queries
are batched, or multiple instances of RM are co-located on a
machine. They find latency of RMs on three CPUs with a batch
size of 16, 128 and 256. At a batch size of 16, the inference latency
is smallest on Broadwell. Compared to Skylake, Broadwell has a
higher frequency and the vectorization capability of Skylake is
not fully utilized at the low batch size. Broadwell has 2400MHz
DDR4 memory, whereas Haswell has 1600MHz DDR3 memory.
The memory-intensive SparseLengthsSum operator leads to
low BW utilization. Hence, the reason the performance is lower
on Haswell than on Broadwell is not the memory BW, but the
lower DRAM frequency and hence, higher memory latency on Haswell.
At higher batch-size, the latency is lowest on Skylake due to
the use of AVX-512. For the computation-intensive RM3, Skylake
starts outperforming at a batch size of 64, but for memory-
intensive RM1 and RM2, it does so only at a batch size of 128.
The speedup of Skylake over Broadwell/Haswell is lower than
the ratio of their SIMD width, which is because of the irregular
memory accesses. Thus, even for inference, batching is vital to
achieving high throughput.
Effect of co-locating RMs: Here, throughput is measured as the
number of inferences per second, bound by a latency constraint.
On performing co-location, the CPU with an exclusive cache
hierarchy (Skylake) shows lower performance loss and variability
than those with inclusive hierarchy (Haswell and Broadwell).
Thus, although co-location increases the throughput, “service-
level agreement” (SLA) constraints may not be met due to in-
creased variability. Broadwell and Skylake are the best at low and
high (respectively) levels of co-location and this trend is similar
to that observed with batching. Due to the irregular memory
accesses present in RMs, the L2 cache miss-rate is higher in the
inclusive hierarchy than an exclusive hierarchy. For Broadwell,
the L2 miss-rate with single and 16 co-located inferences is 17
and 22 MPKI, respectively. For Skylake, these values are only
13 and 13.2, respectively. Broadwell has a smaller L2 cache,
but more importantly, it sees a higher amount of cache back-
invalidations due to its inclusive hierarchy. Further, RM2 shows
higher performance loss than RM1/RM3. Due to co-location, in
RM2, the latency of SparseLengthsSum and FC increases by
3× and 1.6×, respectively. Since SparseLengthsSum already
has poor cache reuse, resource-contention has a severe impact on
it. Based on these observations, the right degree of co-location
can be decided.
Effect of hyperthreading: On using hyperthreading, the num-
ber of inferences on each physical core is doubled. This increases
the latency of FC and SparseLengthsSum by 1.6× and 1.3×,
respectively. Since FC leverages SIMD hardware, which is now
time-multiplexed between threads, hyperthreading leads to larger
performance loss in computation-intensive RM3 than in memory-
intensive RM1 or RM2. Since only a few cores in a data-center
execute two hyperthreaded RMs, the impact of hyperthreading is
higher on 99-percentile latency than on average latency.
Park et al. [6] characterize DL models powering social me-
dia services at Facebook. They identify the hardware demands
of RMs, visual understanding and natural language processing
workloads. (1) ETs used in RMs require high memory capacity
(more than tens of GBs) and memory BW. Operations on ETs
involve multiplication between sparse and dense matrices. These
RMs usually predict event-probabilities for several ads for a single
user, within a few hundreds of milliseconds. Hence, batching can
be used in FCs. ET lookups dominate the inference latency and
they involve random accesses across table columns.
(2) The image resolution required for object detection is higher
than that required for image classification. Also, the number
of detected objects can be increased by increasing the number
of region proposals, at the cost of increased computation cost.
Similarly, video understanding benefits from increased spatial and
temporal resolution. Although depthwise CONVs have the low
computation, they are memory-bound due to their low data reuse.
(3) The dependencies inherent in RNNs prevent parallelization.
Inference with the low and high batch size is useful for instant
and offline translation, respectively.
Larger on-chip memory can benefit many of these DL models.
Computer vision models have a high number of operations per
weight, but the number of operations per activation is not high
since it depends on the output feature dimension. Hence, their
performance depends on on-chip memory BW. Further, apart from
square matrices, matrices/vectors of other dimensions also arise
frequently due to depth/group-wise CONV and low batch size.
Hence, in addition to MM modules, vector-operation modules are
essential for handling the remaining computations.
On CPU, FCs have the highest latency, followed by ET lookups.
They develop a library for performing low-precision linear algebra
on CPU. For FP16 (16-bit floating point) and “8-bit multipli-
cations with 32-bit accumulation” GEMM, this library provides
higher performance than FP32 GEMM in Intel MKL.
They propose schemes to avoid accuracy loss due to quanti-
zation, such as doing quantization at fine-granularity (e.g., for
each output channel in CONVs, for each entry in ET), skipping
quantization for a layer if the error becomes high, performing
“quantization-aware training”, etc. They run a “frequent subgraph
mining” algorithm on the DL graph for finding the frequently-
executed subgraphs. From this, the subgraph groups that are
estimated to provide the highest speedup from fusion are finally
chosen. For example, they merge (computation-bound) batched
MM with (memory BW-bound) “tensor manipulation” computa-
tions, which brings a 10% reduction in inference latency.
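As an illustration of fine-granularity quantization, the sketch below quantizes CONV weights with one int8 scale per output channel; the symmetric scaling and rounding choices here are assumptions for illustration, not the scheme implemented in Facebook's library.

import numpy as np

def quantize_per_out_channel(w):
    # w: (K, C, kh, kw) FP32 kernels; one scale per output channel K
    scales = np.abs(w).reshape(w.shape[0], -1).max(axis=1) / 127.0
    scales = np.maximum(scales, 1e-12)                    # guard all-zero channels
    q = np.round(w / scales[:, None, None, None]).clip(-127, 127).astype(np.int8)
    return q, scales                                      # dequantize: q * scale per channel

w = np.random.randn(64, 32, 3, 3).astype(np.float32)
q, scales = quantize_per_out_channel(w)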
C. Data- and thread-level parallelization
Table VII highlights the attributes of parallelization techniques.
Liu et al. [4] note that kernel libraries such as OpenBLAS
and Intel MKL-DNN optimize only common operations such as
CONV, but do not perform DNN graph-level optimizations. The
graph-level improvements are achieved by the DL frameworks
such as TF. However, these frameworks have a limited scope of
TABLE VII
An overview of parallelization approaches
Data-level parallelization (vectorization): [9, 11, 13-15, 15-17, 20, 21, 24, 26, 28, 29, 34, 38, 46, 49, 53-55, 57, 59, 64, 70-72, 75, 77, 81, 82, 89, 93]
Thread-level parallelization: language not mentioned [4, 8, 11, 24, 26, 46, 63, 70, 82, 84, 86, 89]; OpenMP [20, 34, 64, 73, 74, 76, 77, 81, 85, 93]; Pthread [62, 65, 73, 81]; Intel TBB [14]
Node-level parallelization: using data-parallelism [8, 11, 46, 73, 77, 78]; model-parallelism [8, 11, 46, 77]
Reducing synchronization overhead: performing it periodically [8, 65]; leveraging the Hogwild idea [66, 78]; performing only pointer operations and not arithmetic operations inside a critical section [28]
performing graph-level improvements since the implementation
of operations is provided by the third-party libraries. Hence,
the optimizations performed at kernel and graph-level are not
synergistic with each other. Also, different CPU designs use
different libraries and integrating a library into a DL framework is
cumbersome. Finally, the use of these libraries as plug-ins leads
to contention with other libraries used by the frameworks. For
instance, TF uses both MKL-DNN and Eigen libraries and the
threads of these libraries contend for the resources.
They propose a technique for jointly optimizing at the level
of individual operations and the whole graph. Instead of writing
code in assembly language or using intrinsics, they utilize high-
level features such as vectorization, which allows easily extending
the optimizations to the whole DNN graph. To improve data-
access locality, they reorder the memory access dimensions and
also perform register blocking for enhancing the usage of vector-
instructions. FMA operation is used for performing CONV. CONV
is implemented as a template whose arguments are a loop-
unrolling factor, block size, and the number of utilized registers. It
allows adapting the implementation to CPUs with different vector-
length (e.g., AVX2, AVX-512 and NEON), cache size, etc., and
to different parameters of CONV (e.g., kernel dimensions).
To achieve thread-level parallelization on Q cores, they use a
thread-pool with Q threads. The outermost loop of the compu-
tation is uniformly partitioned into Q portions. Each thread runs
a portion on a different core, which avoids resource contention.
Thread-coordination is achieved using C++11 atomics. For global
data-structures, false sharing between threads is avoided using
cache line padding. Overall, by avoiding resource contention
and reducing thread-launching cost, their parallelization approach
obtains higher performance than OpenMP.
They classify the CNN operations into three types: (1) layout-
unaware, that can handle the data in any layout, e.g., ReLU (2)
layout-tolerant that can work with different layouts, e.g., CONV,
pooling, batch-normalization (3) layout-specific that work with
only one layout, e.g., flatten, reshape. They note that, in general,
the operations between CONV layers are either layout-unaware or
layout-tolerant. Hence, their technique transforms the layout only
for layout-specific layers but otherwise keeps the same layout as
that used in the CONV layer. Figure 13 illustrates the working
of their technique. Here, N, C, H and W refer to batch size,
number of input channels, fmap height and width, respectively.
The kernel layout is KCRS, where K, R, S refer to the output
channel, kernel width and height, respectively. For improving
cache efficiency, they organize fmap layout as NCHW[x]c, where
c is a subdimension of channel C. Also, x equals sizeOf(c) and
the number of channels equals the product of sizeOf(C) and
sizeOf(c). The layout of output is NCHW[y]c, which is similar
to that of input, except that the factor of the split could differ.
The organization of output kernel is KCRS[x]c[y]k, where y is
the sub-dimension of output channel K.
Fig. 13. (a) Data-layout used in a conventional CNN (b) selective layout-transformation scheme of Liu et al. [4]: the optimized layout (NCHW16c) passes through layout-tolerant operators (e.g., pooling, ReLU, broadcast operators) without any layout-transform overhead.
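A minimal numpy sketch of the NCHW-to-NCHW[x]c blocking with x=16 (matching the NCHW16c example in Fig. 13); the function names are illustrative, and a real implementation would also pre-transform the kernels to the corresponding blocked layout.

import numpy as np

def to_nchw16c(x, c_block=16):
    # x: (N, C, H, W); C must be divisible by c_block so that 16 channels
    # become the innermost, contiguous dimension for vectorization
    N, C, H, W = x.shape
    return x.reshape(N, C // c_block, c_block, H, W).transpose(0, 1, 3, 4, 2)

def from_nchw16c(x5):
    N, Cb, H, W, cb = x5.shape
    return x5.transpose(0, 1, 4, 2, 3).reshape(N, Cb * cb, H, W)

x = np.random.rand(1, 64, 28, 28).astype(np.float32)
assert np.array_equal(from_nchw16c(to_nchw16c(x)), x)     # round-trip check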
Finally, they propose a scheme for automating the search for
optimal parameter values. This scheme works in two phases: (1)
for each compute-intensive operation such as CONV, the best
parameters are individually found (2) for optimizing the end-to-
end performance, dynamic programming algorithm is used for
intelligently combining the results from individual schemes. They
perform experiments using 15 DNNs. On AMD EPYC and Intel
Skylake CPUs, their technique outperforms OpenVINO, MXNet
and TF for most DNNs. On ARM Cortex A72 CPU, it outperforms
MXNet and TF for all DNNs. Their technique works well for
all DNNs on all CPU designs, whereas framework-neutral (e.g.,
OpenVINO) and framework-specific (e.g., TF) approaches work
well only in some cases.
Vanhoucke et al. [9] discuss optimizations for accelerating NNs
on CPUs. For matrix multiplication C=AB, they store matrix
A in row-major order and B in column-major order. Also, they
perform loop-unrolling for the inner loop, which performs the
R += P[i]*Q[i] operation. Multiple accumulations are done in parallel,
which allows the compiler to execute them in a pipelined manner.
They vectorize multiply-and-add operations using 128b SIMD
instructions (“streaming SIMD extensions” or SSE). To achieve
16B (128b) alignment of memory blocks, the calls to malloc()
can be replaced with posix_memalign(), or the special al-
locators can be used from the “standard template library”. Also,
zero-padding is performed to ensure that data-vector operands are
multiple of 16B. They quantize activations into 8b unsigned values
and weights into 8b signed value. The biases are stored as 32b
int and the input layer is stored as FP since their values span a
broad dynamic range. They find that the use of FxP datatype alone
does not provide higher performance than FP implementation on
a CPU. They use the pmaddubsw instruction from the Intel
SSSE3 set, which does a parallel multiply-and-add operation on
sixteen unsigned 8b integer activations and sixteen signed 8b
integer weights and produces eight 16b integers. If the result value
overflows, it is saturated to 16b. To further optimize this, they
use SSE4.1 set, which provides a single instruction for converting
from 16b to 32b.
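The following Python sketch models the behavior of the pmaddubsw-style step described above (sixteen unsigned 8b activations times sixteen signed 8b weights, adjacent pairs added, sums saturated to signed 16b); it is a behavioral illustration, not the SSE intrinsic itself.

import numpy as np

def pmaddubsw_like(act_u8, wgt_s8):
    prod = act_u8.astype(np.int32) * wgt_s8.astype(np.int32)    # 16 products
    pair_sums = prod[0::2] + prod[1::2]                         # 8 adjacent-pair sums
    return np.clip(pair_sums, -32768, 32767).astype(np.int16)   # saturate to 16b

act = np.array([200] * 16, dtype=np.uint8)
wgt = np.array([127, 127] * 8, dtype=np.int8)
print(pmaddubsw_like(act, wgt))          # 200*127*2 = 50800 saturates to 32767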
Without batching, their CPU implementation achieves slightly
better performance than GPU, although with batching, GPU
provides a significant performance improvement. Batching also
benefits CPU by allowing reuse of both activations and weights.
Their optimizations improve NN performance by 4×over an FP
baseline, with a negligible loss of accuracy.
D. Node-level parallelization
Dean et al. [8] present a framework that allows implementing
model parallelism across machines and different threads of a
machine. Their framework also allows data-parallelism whereby
multiple replicas of a model optimize a single objective. The user
specifies the computations performed at every node in every layer
of the model, and the messages communicated during the FWP
and BWP phases of computation. Large models can be partitioned
on multiple machines, which is shown in Figure 14(a). Here,
the states of only those vertices need to be transferred across
machines that cross partition boundaries. Inside every partition,
the computations are parallelized using available cores. Their
technique also manages communication and synchronization of
machines in both inference and training phases.
They propose two techniques, shown in Figure 14(b)-(c),
for large-scale distributed training using this framework: Down-
pour SGD (“stochastic gradient descent”), which is an on-
line scheme and Sandblaster L-BFGS (“limited-memory Broy-
den–Fletcher–Goldfarb–Shanno algorithm”) which is a batch
scheme. Both techniques utilize a centralized sharded parameter
server (PS). Also, they can work well even when model replicas
have a different speed or when the number of replicas changes
due to failure/restart. Different replicas compute different training
instances and their outputs are periodically combined to achieve
data parallelism.
1. “Downpour SGD” The traditional SGD performs serial exe-
cution. Their proposed Downpour SGD is a type of asynchronous
SGD that utilizes several replicas of a single model. The training
data is divided into multiple parts and a model-copy is run on
every part. The models achieve communication via the PS, which
stores the present state of all parameters divided across multiple
processors. For example, with 20 PS shards, every shard stores and
applies updates to 1/20th of the model parameters. Here, model
replicas are mutually independent and PS shards are also mutually
independent. It introduces stochasticity in an optimization scheme
because, at any point in time, the parameters of every shard may
have seen a different number of updates applied in a different
order. For reducing the data-transfer costs, parameter-fetching and
gradient-pushing can be done after multiple steps.
To increase the robustness of this technique, they use the
“Adagrad adaptive learning rate” scheme, which uses a different
learning rate (η in Figure 14(b)) for every parameter. Since these
learning rates are obtained only from the sum of the square of
gradients of every parameter, Adagrad can be easily applied inside
every PS shard. Adagrad increases the number of replicas that
can function together. Further, they start the training with only
one replica and later start other replicas. These two optimizations
avoid instability in training DNNs with “Downpour SGD”.
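A minimal Python sketch of a sharded parameter server with per-parameter Adagrad, in the spirit of "Downpour SGD"; the sharding granularity, update-rule details, and class names are illustrative assumptions rather than the original implementation.

import numpy as np

class ParamShard:
    def __init__(self, params, lr=0.01, eps=1e-8):
        self.w = params.copy()                # this shard's slice of the model
        self.g2 = np.zeros_like(params)       # running sum of squared gradients
        self.lr, self.eps = lr, eps
    def fetch(self):
        return self.w.copy()                  # replicas pull parameters
    def push(self, grad):
        # Adagrad: per-parameter learning rate from the accumulated squared gradient
        self.g2 += grad * grad
        self.w -= self.lr * grad / (np.sqrt(self.g2) + self.eps)

# the model is split across shards; each replica periodically fetches, computes
# a gradient on its own data shard, and pushes asynchronously (no locking)
shards = [ParamShard(np.zeros(1000)) for _ in range(20)]
shards[0].push(np.random.randn(1000))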
2. “Sandblaster L-BFGS”: It is a distributed realization of L-
BFGS and uses both model and data parallelism. L-BFGS uses a
“coordinator process” which sends commands such as multiply,
add, and dot-product to different PS shards. PS shards execute
these commands independently and store the output locally. It
allows executing huge models without communicating with a
centralized PS.
In the traditional parallel implementation of L-BFGS, the
slowest processor becomes a scaling bottleneck. To mitigate this
issue, they assign work to each replica as it becomes free. This
dynamic scheduling approach achieves load-balancing. Towards
the completion of a batch, remaining work is assigned to multiple
replicas and the result from the replica that finishes first is used.
Consecutive parts of work are assigned to the same worker,
which avoids data-access issues. In “Downpour SGD”, there is
frequent synchronization with the PS, whereas, in this technique,
parameters are fetched only at the beginning of every batch after
the coordinator has updated them. Similarly, gradients are sent
only after a certain number of portions are done.
They evaluate their techniques for image recognition and au-
dio processing benchmarks. Models use at most 20 cores per
processor. They find that for the smallest model, which has FC
structure, the highest speedup of 2.2×is obtained on 8 processors.
The largest model has locally-connected design, and hence, it
provides increasing speedup with a rising number of processors.
The highest speedup is 12×, which is obtained for 81 processors.
Overall, their techniques achieve a large speedup over traditional
versions of SGD and L-BFGS. For the largest model, they use
32 CPUs with 16 cores in each CPU and by combining this with
their proposed optimizations, tens of thousands of CPU cores can
be used for training a large model.
Ji et al. [78] propose a technique for parallelizing the Word2Vec
algorithm using both shared and distributed memory. This al-
gorithm uses the Hogwild approach for parallelizing the SGD
algorithm, which avoids the need for synchronizing between
updates. However, a cacheline with a particular model entry can
be updated by multiple threads, which leads to the shuttling of
cachelines across cores and large access latency. Also, although a
target word is utilized in the model updates for input words, only
one update is performed at a time. In fact, the algorithm performs
multiple dot-products and, thus, fails to leverage the opportunity
of data reuse.
They propose batching both the input context words and
the negative samples, which converts the dot-product (“level-1
BLAS” operations) into GEMM (“level-3 BLAS” operations). For
multithreading of GEMM operations, Hogwild strategy is used.
Their technique allows the use of vectorized multiply-add instruc-
tions and optimized libraries. While the baseline implementation
performs model updates after every dot-product, their technique
performs model updates after the whole GEMM operation. Thus,
the model update frequency is reduced, and hence, their technique
shows much better scaling.
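The sketch below contrasts, under simplifying assumptions, the per-pair dot-product formulation with the batched GEMM formulation: batching the context words and negative samples lets one level-3 BLAS call compute all scores before the model is updated.

import numpy as np

def scores_dot(ctx_vecs, out_vecs):
    # baseline: one level-1 BLAS dot-product per (context, output) pair
    return np.array([[np.dot(u, v) for v in out_vecs] for u in ctx_vecs])

def scores_gemm(ctx_vecs, out_vecs):
    # batched: one GEMM over the context batch and the target + negative samples
    return ctx_vecs @ out_vecs.T

ctx = np.random.rand(16, 128)      # batched input context words
out = np.random.rand(1 + 5, 128)   # target word plus negative samples
assert np.allclose(scores_dot(ctx, out), scores_gemm(ctx, out))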
However, differences in model updates can affect the conver-
gence rate and the magnitude of difference depends on the batch-
size used for the inputs. Since they use batch sizes below 20, the
impact on convergence remains small. For scaling the computation
to multiple nodes, they note that model parallelism is not useful
because each GEMM is small. Hence, they use data-parallelism.
Since the network BW is much lower than the CPU memory
BW, complete model synchronization over four nodes takes 0.5
seconds, which is too high. They note that in Word2vec, the
update frequency of a word depends on its popularity. Hence,
their technique seeks to update the model with the same frequency
as the word frequency and uses the “sub-model” and not “full-
model” synchronization approach. Their technique achieves near-
linear scaling with the number of cores and nodes and provides
a throughput of 110M words per second using 32 nodes of
Broadwell CPUs.
Das et al. [46] optimize the SGD algorithm on a single node and
then scale it to multiple nodes. They note that FWP, BWP and
weight update operations have similar memory access patterns,
and hence, a similar cache-blocking scheme should work well
for them. If weights and activations do not fit in cache, they
need to be fetched from the main memory.
Fig. 14. (a) Model-parallelism [8] (b) "Downpour SGD": model replicas pull parameters w and push gradients ∆w to the parameter server (PS) in an asynchronous manner (c) "Sandblaster L-BFGS": a single "coordinator" transmits small messages to the PS and replicas for achieving batch optimization.
Hence, the cache
capacity determines the AmI of computations. They model the
cache-blocking problem as a “constrained minimization problem”
and solve it using a brute-force search. Further, while choosing the
block sizes that reside in the cache, one of the dimensions is chosen
to be a multiple of the SIMD width. They observe that on a Xeon
CPU with a 128KB cache per thread, an AmI value above 25 can be
achieved for a majority of CONV layers, even when minibatch
size is set to one.
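A brute-force search of this kind can be sketched as follows. The cost model, layer dimensions, and cache budget below are illustrative stand-ins; the exact constrained-minimization formulation of [46] differs, but the structure is the same: enumerate block sizes, discard blocks that exceed the per-thread cache capacity, keep one dimension a multiple of the SIMD width, and pick the block with the best estimated reuse.

```cpp
// Illustrative brute-force search for cache-blocking factors of a CONV layer.
#include <cstdio>

int main() {
    const int C = 256, K = 256, R = 3, S = 3;   // channels and kernel size (example)
    const int simd_width = 16;                  // fp32 lanes in a 512-bit register
    const int cache_bytes = 128 * 1024;         // per-thread cache budget (example)

    double best_ami = 0.0;
    int best_kb = 0, best_cb = 0, best_hb = 0, best_wb = 0;

    for (int kb = simd_width; kb <= K; kb += simd_width)   // SIMD-multiple dimension
      for (int cb = 1; cb <= C; ++cb)
        for (int hb = 1; hb <= 14; ++hb)
          for (int wb = 1; wb <= 14; ++wb) {
            // Working set of one block: weights + input patch + output (fp32 bytes).
            long bytes = 4L * (kb * cb * R * S
                               + cb * (hb + R - 1) * (wb + S - 1)
                               + kb * hb * wb);
            if (bytes > cache_bytes) continue;              // capacity constraint
            double flops = 2.0 * kb * cb * R * S * hb * wb; // MACs in the block
            double ami = flops / bytes;    // proxy: block footprint is loaded once
            if (ami > best_ami) {
                best_ami = ami;
                best_kb = kb; best_cb = cb; best_hb = hb; best_wb = wb;
            }
          }
    std::printf("block (K=%d, C=%d, H=%d, W=%d), est. AmI = %.1f flops/byte\n",
                best_kb, best_cb, best_hb, best_wb, best_ami);
    return 0;
}
```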
After blocking, the data is laid out such that accesses in the innermost
loops are maximally contiguous. This improves spatial locality, BW
utilization and prefetching efficiency. For all datatypes, the inner-
most dimension is laid-out over groups of SIMD-width output
fmaps, which allows vectorizing the operations. They further
perform “register blocking” for improving the ratio of VFMA
computations to load/store operations. They also partition the
work among different threads.
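The following sketch illustrates such a layout for CONV weights, re-ordering a KCRS tensor so that the innermost dimension holds a SIMD-width group of output fmaps. The dimensions and the choice of weights (rather than activations) are illustrative.

```cpp
// Re-layout of CONV weights from KCRS to a blocked [ceil(K/VB)][C][R][S][VB]
// format, so that the innermost VB elements are consecutive output fmaps.
#include <cstddef>
#include <vector>

constexpr int VB = 16;  // fp32 lanes per 512-bit vector register

std::vector<float> block_weights(const std::vector<float>& w,
                                 int K, int C, int R, int S) {
    const std::size_t kb = (K + VB - 1) / VB;               // padded K-blocks
    std::vector<float> blocked(kb * (std::size_t)C * R * S * VB, 0.0f);
    for (int k = 0; k < K; ++k)
      for (int c = 0; c < C; ++c)
        for (int r = 0; r < R; ++r)
          for (int s = 0; s < S; ++s) {
            std::size_t src = (((std::size_t)k * C + c) * R + r) * S + s;
            std::size_t dst = (((((std::size_t)(k / VB) * C + c) * R + r) * S + s)
                               * VB) + (k % VB);
            blocked[dst] = w[src];
          }
    return blocked;  // innermost VB elements are contiguous output fmaps
}
```

With this layout, the innermost loop over the VB output fmaps maps directly onto one vector FMA per iteration of the surrounding loops.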
Then, they perform strong scaling of synchronous SGD to
multiple nodes. For CONV layers, model-parallelism is preferred
when minibatch size is small and kernel size is large. For
FC layers, model-parallelism is preferred unless the minibatch
size becomes huge (e.g., above 5000). They further explore the
“hybrid parallelism” approach where the nodes are divided into
groups; nodes inside a group use model-parallelism, while data-
parallelism is used across the node-groups. This approach divides
the work along both fmap and minibatch dimensions. Hybrid
parallelism leads to lower data-traffic than either data or model-
parallelism. For OverFeat-FAST and VGG-A CNNs, they achieve
a speedup of 42× and 53×, respectively, on a 64-node machine.
Finally, on a DNN for speech recognition, their technique achieves
a 6.5× speedup with 16 nodes.
Roy et al. [73] present a “non-uniform memory access”
(NUMA)-aware technique for improving the performance of
DNNs on multicore CPUs. When the CONV operation is performed
on input images, different CPUs can process different images or
different portions of an image. These are referred to as batch-level and
BLAS-level parallelization, respectively. They conduct experi-
ments with one to four NUMA nodes on both AMD and Intel
CPUs. They note that BVLC-Caffe performs only BLAS-level
parallelization, whereas Intel-Caffe also performs batch-level paral-
lelization. Due to this and other architectural optimizations,
Intel-Caffe provides higher performance for all NUMA node-
counts. However, both frameworks show poor scalability with an
increasing number of NUMA nodes due to the increased count
of accesses to remote NUMA domains, which incur high latency.
Also, out of four NUMA domains, one domain itself consumes
a large fraction of read/write BW. In the default Caffe, a thread
can run on any core in any domain. The memory is allocated
in the domain where the first memory access is issued. However,
a thread's inputs, temporary buffers and outputs may reside in different
NUMA domains. Hence, during CNN processing, memory ac-
cesses happen across NUMA boundaries. This is shown in Figure
15(a).
Their NUMA-aware technique performs hierarchical paral-
lelization, which is shown in Figure 15(b). It works as follows: (1)
In every NUMA domain, a Pthread-based SGD routine is created;
(2) each routine is parallelized at batch-level using OpenMP
threads, which have affinity to their parent Pthread routine; (3)
every OpenMP thread performs BLAS operations via MKL threads,
which keep the affinity of their parent OpenMP threads. The
data corresponding to the computations is also distributed at the
granularity of domains and threads. They perform data-parallelism,
whereby each SGD routine running in a different domain processes
different input samples, and after a fixed number of samples, the
network parameters are updated. Thus, memory accesses issued
by the computations in a domain are served by the data in
the same domain. Inter-domain communication is required only for
gradient-update. Their NUMA-aware Caffe design provides better
throughput and scaling than Intel-Caffe.
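A conceptual sketch of this hierarchical mapping is shown below, using nested OpenMP parallelism in place of the Pthread/OpenMP/MKL hierarchy of [73]. Thread pinning (e.g., via OMP_PLACES and OMP_PROC_BIND, or numactl) and the MKL-level threads are assumed but not shown, and the gradient computation is a toy placeholder.

```cpp
// Conceptual sketch: one solver replica per NUMA domain, each parallelized
// over its local mini-batch; cross-domain traffic occurs only for gradients.
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const int num_domains = 2;          // NUMA domains (example)
    const int samples_per_domain = 64;
    std::vector<std::vector<float>> grads(num_domains);

    omp_set_max_active_levels(2);       // allow nested parallel regions
    #pragma omp parallel num_threads(num_domains)   // one solver per domain
    {
        int d = omp_get_thread_num();
        // First-touch allocation: assuming this thread is pinned to domain d,
        // initializing the buffers here places their pages in local memory.
        std::vector<float> local_batch(samples_per_domain, 1.0f);
        std::vector<float> local_grad(samples_per_domain, 0.0f);

        #pragma omp parallel for num_threads(4)     // batch-level parallelism
        for (int i = 0; i < samples_per_domain; ++i)
            local_grad[i] = 0.1f * local_batch[i];  // stand-in for FWP/BWP

        grads[d] = local_grad;  // the only cross-domain traffic: gradient exchange
    }
    std::printf("collected gradients from %zu domains\n", grads.size());
    return 0;
}
```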
V. CONCLUDING REMARKS
The landscape of next-generation deep learning demands that
CPU play a bigger role than merely a host, and we believe
that CPU is ready to take on this challenge. In this paper, we
surveyed the techniques for optimizing DL on CPUs and DL-
aware optimizations to CPUs. We organized the works into several
categories to show their similarities and differences. We conclude
this paper with a brief mention of directions for future research.
In recent years, the vector width in CPUs has increased from
64b to 512b. Increasing this further would mean that each vector
register load accesses multiple cache lines, which leads to a severe
penalty. Even then, it would benefit only a few applications that
have such high SIMD parallelism. Thus, merely increasing peak
performance will not be sufficient; more revolutionary improve-
ments are required to boost the performance of a broad range
of DL applications. For example, although existing CPUs allow
conversion between FP16 and FP32, they do not natively support
FP16 computations. Also, they need multiple instructions for
implementing INT8 multiplications with 32b accumulation, and
hence, INT8 provides only a small improvement over FP32 computa-
tions. To address these limitations, CPU vendors have recently
introduced hardware and software support for low-precision com-
puting [102, 103].
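As an illustration of the instruction-count gap, the sketch below shows INT8 dot products with 32-bit accumulation on AVX-512, assuming a CPU and compiler flags that enable AVX-512BW and AVX-512VNNI; the pre-VNNI sequence needs three instructions per step, whereas VNNI fuses them into one.

```cpp
// INT8 dot products with 32-bit accumulation on AVX-512.
#include <immintrin.h>

// a holds unsigned 8-bit activations, b holds signed 8-bit weights.
__m512i int8_mac_pre_vnni(__m512i acc, __m512i a, __m512i b) {
    const __m512i ones = _mm512_set1_epi16(1);
    __m512i prod16 = _mm512_maddubs_epi16(a, b);      // u8*s8 -> pairwise-summed s16
                                                      // (can saturate at 16 bits)
    __m512i prod32 = _mm512_madd_epi16(prod16, ones); // widen/sum s16 pairs -> s32
    return _mm512_add_epi32(acc, prod32);             // accumulate in 32 bits
}

__m512i int8_mac_vnni(__m512i acc, __m512i a, __m512i b) {
    return _mm512_dpbusd_epi32(acc, a, b);            // single fused multiply-accumulate
}
```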
Since CPUs devote a large amount of chip-area to caches,
leveraging in-memory computing capabilities can significantly
increase their performance. Further, apart from MM,
modern DNNs also perform computations with other patterns,
such as sparse lookups [5], vector operations [6] and deconvolu-
tion [104]. Future CPUs can also host dedicated DL accelerators
to accelerate such operations and, thus, bring the best of both
worlds together. Vendor-optimized libraries will remain essential
for extracting the last bit of performance from a processor, and
this has been confirmed by observations such as Intel-Caffe
outperforming BVLC-Caffe [73, 93] and Intel-MKL boosting the
performance of TF [68]. Research works that do not use optimized
libraries such as Intel MKL, or that use an earlier version of
TF without FMA/SIMD support [63], may offer a
misleading picture of CPU performance.

Fig. 15. (a) Current Caffe designs lead to large communication across NUMA domains. (b) Hierarchical mapping [73] reduces this overhead.
Some of the optimizations proposed for accelerators may not be
necessary or effective on CPUs. For example, even though low-
bitwidth DNNs facilitate efficient computation on accelerators,
they offer little benefit on existing systems with FMA instructions.
This is because, when FMA instructions are pipelined, their cost is
nearly the same as that of additions alone. Also, multiplications
are costly only when the operand bitwidth is high. Evidently,
a CPU-specific study of recently-proposed DL techniques is
required to assess their potential on CPUs. Further, efforts are
required for accelerating recent DL algorithms such as generative
adversarial networks [104] and reinforcement learning on CPUs.
Large companies, startups and academic institutes have recently
proposed a range of AI accelerators and optimization techniques.
However, a key obstacle in achieving synergy between these
efforts is the lack of open-source ISAs and tools. Due to the
proprietary nature of these platforms and products, even the
best ideas have not translated into widely adopted technologies.
We believe that the development of an open-source ISA such as
RISC-V can help greatly in breaking these barriers. By virtue
of its open-source nature, RISC-V will reduce royalty overheads
and promote reproducibility and reuse. Further, RISC-V makes
it possible to design “domain specific extensions”, encouraging
CPU-accelerator heterogeneous computing. Due to these and
several other features of RISC-V, it is expected to boost the entire
ecosystem of AI computing.
CONV involves several levels of nested loops, which are
difficult to optimize manually. Some researchers use assembly-
level primitives, such as Intel intrinsics, to optimize them.
However, this approach is not scalable. Polyhedral compilers
[105–107] can model these loops in terms of polyhedra and
then apply sophisticated affine transformations (such as loop tiling)
on them to improve both cache locality and parallelism. This can
provide a significant boost in performance.
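As an illustration, the sketch below shows a hand-tiled direct-convolution loop nest of the kind such compilers can derive automatically; the tile sizes, the valid-convolution form, and the layouts are illustrative choices.

```cpp
// Hand-tiled direct CONV loop nest (valid convolution; out is assumed to be
// zero-initialized and sized K*H*W, with only the valid region written).
#include <algorithm>
#include <vector>

void conv_tiled(const std::vector<float>& in,   // [C][H][W]
                const std::vector<float>& wgt,  // [K][C][R][S]
                std::vector<float>& out,        // [K][H][W]
                int C, int K, int H, int W, int R, int S) {
    const int TK = 16, TH = 8, TW = 8;          // tile sizes (illustrative)
    for (int k0 = 0; k0 < K; k0 += TK)
      for (int h0 = 0; h0 < H - R + 1; h0 += TH)
        for (int w0 = 0; w0 < W - S + 1; w0 += TW)
          // Within a tile, the weight block and the input patch stay cache-resident.
          for (int k = k0; k < std::min(k0 + TK, K); ++k)
            for (int c = 0; c < C; ++c)
              for (int h = h0; h < std::min(h0 + TH, H - R + 1); ++h)
                for (int w = w0; w < std::min(w0 + TW, W - S + 1); ++w)
                  for (int r = 0; r < R; ++r)
                    for (int s = 0; s < S; ++s)
                      out[(k * H + h) * W + w] +=
                          wgt[((k * C + c) * R + r) * S + s] *
                          in[(c * H + (h + r)) * W + (w + s)];
}
```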
The choice of a computing system for DL applications will be
made based on multiple metrics such as throughput, latency and
energy efficiency over a range of applications/operations, ease of
use and development, portability, reliability, availability, cost, etc.
Since over-emphasizing one or a few metrics is likely to provide a
misleading picture, future studies should provide a comprehensive
evaluation of all metrics of interest. In fact, due to the diversity of
metrics and model characteristics (layer-count, layer-type, batch
size, precision, etc.), a single processing unit may not be optimal
for all scenarios. Evidently, high-level interfaces that intelligently
map a query to suitable processing unit(s), such as a CPU and/or an
accelerator, will be vital for achieving the highest efficiency. Further,
development of high-level APIs such as OpenVINO that allow
execution on a range of processing units will be important for
reducing the programmer effort.
REFERENCES
[1] O. Valery et al., “Low Precision Deep Learning Training on
Mobile Heterogeneous Platform,” in PDP, 2018, pp. 109–
117.
[2] C.-J. Wu et al., “Machine learning at Facebook: Under-
standing inference at the edge,” in HPCA, 2019, pp. 331–
344.
[3] K. Hazelwood et al., “Applied machine learning at Face-
book: A datacenter infrastructure perspective,” in HPCA,
2018, pp. 620–629.
[4] Y. Liu et al., “Optimizing CNN Model Inference on CPUs,”
in ATC, 2019, pp. 1025–1040.
[5] U. Gupta et al., “The Architectural Implications of Face-
book’s DNN-based Personalized Recommendation,” arXiv
preprint arXiv:1906.03109, 2019.
[6] J. Park et al., “Deep learning inference in Facebook data
centers: Characterization, performance optimizations and
hardware implications,” arXiv preprint arXiv:1811.09886,
2018.
[7] M. Dukhan et al., “QNNPACK: Open source library for op-
timized mobile deep learning,” http://bitly.ws/8SyQ, 2018.
[8] J. Dean et al., “Large scale distributed deep networks,” in
Advances in neural information processing systems, 2012,
pp. 1223–1231.
[9] V. Vanhoucke et al., “Improving the speed of neural net-
works on CPUs,” 2011.
[10] M. Zhang et al., “DeepCPU: Serving RNN-based deep
learning models 10x faster,” in USENIX ATC, 2018, pp.
951–965.
[11] T. Chilimbi et al., “Project ADAM: Building an efficient
and scalable deep learning training system,” in OSDI, 2014,
pp. 571–582.
[12] A. Gujarati et al., “Swayam: distributed autoscaling to meet
SLAs of machine learning inference services with resource
efficiency,” in ACM/IFIP/USENIX Middleware Conference,
2017, pp. 109–120.
[13] W. Bao et al., “NGEMM: Optimizing GEMM for Deep
Learning via Compiler-based Techniques,” arXiv preprint
arXiv:1910.00178, 2019.
[14] L.-W. Chang et al., “Accelerating recurrent neural networks
through compiler techniques and quantization,” Workshop
on Systems for ML and Open Source Software, 2018.
[15] S. Rajbhandari et al., “Optimizing CNNs on multicores for
scalability, performance and goodput,” in ACM SIGPLAN
Notices, vol. 52, no. 4, 2017, pp. 267–280.
[16] J. Devlin, “Sharp models on dull hardware: Fast and
accurate neural machine translation decoding on the CPU,”
arXiv preprint arXiv:1705.01991, 2017.
[17] Y. Kim et al., “µLayer: Low Latency On-Device Infer-
ence Using Cooperative Single-Layer Acceleration and
Processor-Friendly Quantization,” in EuroSys Conference,
2019, p. 45.
[18] D. Li et al., “DeepRebirth: Accelerating deep neural net-
work execution on mobile devices,” in AAAI Conference on
Artificial Intelligence, 2018.
[19] M. Almeida et al., “EmBench: Quantifying Performance
Variations of Deep Neural Networks across Modern Com-
modity Devices,” International Workshop on Embedded and
Mobile Deep Learning, 2019.
[20] P. Meloni et al., “NEURAghe: Exploiting CPU-FPGA
Synergies for Efficient and Flexible CNN Inference Ac-
celeration on Zynq SoCs,” TRETS, vol. 11, no. 3, p. 18,
2018.
[21] V. Peluso et al., “Enabling energy-efficient unsupervised
monocular depth estimation on ARMv7-based platforms,”
in DATE, 2019, pp. 1703–1708.
[22] M. Xu et al., “Accelerating convolutional neural networks
for continuous mobile vision via cache reuse,” arXiv
preprint arXiv:1712.01670, 2017.
[23] N. D. Lane et al., “DeepX: A software accelerator for low-
power deep learning inference on mobile devices,” in IPSN,
2016, p. 23.
[24] M. Motamedi et al., “Machine intelligence on resource-
constrained IoT devices: The case of thread granularity
optimization for CNN inference,” TECS, vol. 16, no. 5s,
p. 151, 2017.
[25] Y. Wu et al., “Experimental Characterizations and Analysis
of Deep Learning Frameworks,” in Big Data. IEEE, 2018,
pp. 372–377.
[26] D. Budden et al., “Deep tensor convolution on multicores,”
in ICML, 2017, pp. 615–624.
[27] Y. E. Wang et al., “Benchmarking TPU, GPU, and
CPU Platforms for Deep Learning,” arXiv preprint
arXiv:1907.10701, 2019.
[28] A. Zlateski et al., “ZNN–A Fast and Scalable Algorithm
for Training 3D Convolutional Networks on Multi-core and
Many-Core Shared Memory Machines,” in IPDPS. IEEE,
2016, pp. 801–811.
[29] B. Liu et al., “Sparse convolutional neural networks,” in
CVPR, 2015, pp. 806–814.
[30] S. Sen et al., “SparCE: Sparsity Aware General-Purpose
Core Extensions to Accelerate Deep Neural Networks,”
TOCS, vol. 68, no. 6, pp. 912–925, 2018.
[31] S. Cao et al., “SeerNet: Predicting CNN Feature-Map
Sparsity Through Low-Bit Quantization,” in CVPR, 2019,
pp. 11 216–11 225.
[32] J. Park et al., “Faster CNNs with direct sparse convolutions
and guided pruning,” arXiv preprint arXiv:1608.01409,
2016.
[33] K.-Y. Peng et al., “Adaptive runtime exploiting sparsity in
tensor of deep learning neural network on heterogeneous
systems,” in SAMOS, 2017, pp. 105–112.
[34] B. Akin et al., “ZCOMP: Reducing DNN Cross-Layer
Memory Footprint Using Vector Extensions,” in MICRO,
2019, pp. 126–138.
[35] S. Mittal et al., “A Survey of CPU-GPU Heterogeneous
Computing Techniques,” ACM Computing Surveys, vol. 47,
no. 4, pp. 69:1–69:35, 2015.
[36] S. Mittal et al., “A Survey of Techniques for Modeling
and Improving Reliability of Computing Systems,” TPDS,
2015.
[37] P. Blacker et al., “Rapid Prototyping of Deep Learning
Models on Radiation Hardened CPUs,” in AHS, 2019, pp.
25–32.
[38] S. Mehta et al., “WearCore: A core for wearable work-
loads?” in PACT. IEEE, 2016, pp. 153–164.
[39] J. Hanhirova et al., “Latency and throughput characteriza-
tion of convolutional neural networks for mobile computer
vision,” in ACM Multimedia Systems Conference, 2018, pp.
204–215.
[40] S. Mittal, “A Survey of FPGA-based Accelerators for
Convolutional Neural Networks,” Neural Computing and
Applications, 2018.
[41] N. Rao, “Intel Excels in First MLPerf Inference Results,”
https://www.intel.ai/mlperf-nov2019/, 2019.
[42] Y. Ma et al., “Moving Deep Learning into Web Browser:
How Far Can We Go?” in The World Wide Web Conference,
2019, pp. 1234–1244.
[43] C. Zhang et al., “MArk: Exploiting Cloud Services for
Cost-Effective, SLO-Aware Machine Learning Inference
Serving,” in ATC, 2019.
[44] “AWS EC2 Pricing,” http://bitly.ws/8SyN.
[45] A. Zlateski et al., “The anatomy of efficient FFT and
winograd convolutions on modern CPUs,” in ICS, 2019,
pp. 414–424.
[46] D. Das et al., “Distributed deep learning using syn-
chronous stochastic gradient descent,” arXiv preprint
arXiv:1602.06709, 2016.
[47] A. Sarma et al., “CASH: Compiler Assisted Hardware
Design for Improving DRAM Energy Efficiency in CNN
Inference,” MEMSYS, 2019.
[48] A. Jain et al., “Architectural support for convolutional
neural networks on modern cpus,” in PACT, 2018, p. 16.
[49] Z. Chishti et al., “Memory system characterization of deep
learning workloads,” in MemSys, 2019, pp. 497–505.
[50] N. D. Lane et al., “An early resource characterization of
deep learning on wearables, smartphones and internet-of-
things devices,” in international workshop on internet of
things towards applications, 2015, pp. 7–12.
[51] K. Zou et al., “Learn-to-scale: Parallelizing deep learning
inference on chip multiprocessor architecture,” in DATE,
2019, pp. 1172–1177.
[52] J. Gu et al., “Implementation and evaluation of deep neural
networks (DNN) on mainstream heterogeneous systems,” in
Asia-Pacific Workshop on Systems, 2014, p. 12.
[53] D. Velasco-Montero et al., “On the Correlation of CNN
Performance and Hardware Metrics for Visual Inference
on a Low-Cost CPU-based Platform,” in IWSSIP, 2019, pp.
249–254.
[54] S.-J. Lee et al., “Efficient SIMD implementation for ac-
celerating convolutional neural network,” in International
Conference on Communication and Information Process-
ing, 2018, pp. 174–179.
[55] L. Lai et al., “CMSIS-NN: Efficient neural network kernels
for ARM Cortex-M CPUs,” arXiv preprint arXiv:1801.06601,
2018.
[56] T. Abtahi et al., “Accelerating convolutional neural network
with FFT on embedded hardware,” IEEE T VLSI SYST,
vol. 26, no. 9, pp. 1737–1749, 2018.
[57] C. F. Rodrigues et al., “Fine-Grained Energy and Perfor-
mance Profiling framework for Deep Convolutional Neural
Networks,” arXiv preprint arXiv:1803.11151, 2018.
[58] X. Dai et al., “ChamNet: Towards Efficient Network Design
Through Platform-Aware Model Adaptation,” in Confer-
ence on Computer Vision and Pattern Recognition (CVPR),
2019.
[59] S. Popovych et al., “PZnet: Efficient 3D ConvNet Inference
on Manycore CPUs,” in Science and Information Confer-
ence, 2019, pp. 369–383.
[60] M. Xu et al., “DeepWear: Adaptive Local Offloading for
On-Wearable Deep Learning,” IEEE Transactions on Mo-
bile Computing, 2019.
[61] S. Shams et al., “Evaluation of deep learning frameworks
over different HPC architectures,” in ICDCS, 2017, pp.
1389–1396.
[62] J. Hauswald et al., “Sirius: An open end-to-end voice and
vision personal assistant and its implications for future
warehouse scale computers,” in ACM SIGPLAN Notices,
vol. 50, no. 4, 2015, pp. 223–238.
[63] S. Shi et al., “Benchmarking state-of-the-art deep learning
software tools,” in CCBD, 2016, pp. 99–104.
[64] A. Venkat et al., “SWIRL: High-performance many-core
CPU code generation for deep neural networks,” IJHPCA,
p. 1094342019866247, 2019.
[65] S. Fan et al., “Parallel Computing in DNNs Using CPU
and MIC,” in ISPA/IUCC, 2017, pp. 646–652.
[66] B. Chen et al., “SLIDE: In Defense of Smart Algorithms
over Hardware Acceleration for Large-Scale Deep Learning
Systems,” arXiv preprint arXiv:1903.03129, 2019.
[67] E. Georganas et al., “High-performance deep learning via
a single building block,” arXiv preprint arXiv:1906.06440,
2019.
[68] G. Ramirez-Gargallo et al., “TensorFlow on state-of-the-art
HPC clusters: a machine learning use case,” 2019.
[69] Y. You et al., “Imagenet training in minutes,” in ICPP,
2018, p. 1.
[70] B. Jacob et al., “Quantization and training of neural net-
works for efficient integer-arithmetic-only inference,” in
CVPR, 2018, pp. 2704–2713.
[71] Z. Gong et al., “SparseTrain: Leveraging Dynamic Sparsity
in Training DNNs on General-Purpose SIMD Processors,”
arXiv preprint arXiv:1911.10175, 2019.
[72] A. Xing et al., “Speeding up deep neural networks for
speech recognition on ARM Cortex-A series processors,”
in ICNC, 2014, pp. 123–127.
[73] P. Roy et al., “NUMA-Caffe: NUMA-aware deep learning
neural networks,” TACO, vol. 15, no. 2, p. 24, 2018.
[74] M. G. Tallada, “Coarse grain parallelization of deep neural
networks,” in ACM SIGPLAN Notices, vol. 51, no. 8, 2016,
p. 1.
[75] A. Zlateski et al., “Compile-time optimized and statically
scheduled ND convnet primitives for multi-core and many-
core (Xeon Phi) CPUs,” in ICS, 2017, p. 8.
[76] N. Hasabnis, “Auto-tuning TensorFlow Threading Model
for CPU Backend,” in MLHPC. IEEE, 2018, pp. 14–25.
[77] Y. E. Wang et al., “Exploiting parallelism opportu-
nities with deep learning frameworks,” arXiv preprint
arXiv:1908.04705, 2019.
[78] S. Ji et al., “Parallelizing word2vec in shared and dis-
tributed memory,” IEEE Transactions on Parallel and Dis-
tributed Systems, 2019.
[79] R. Takeda et al., “Acoustic model training based on node-
wise weight boundary model increasing speed of discrete
neural networks,” in ASRU, 2015, pp. 52–58.
[80] H. Yin et al., “Hardware-Guided Symbiotic Training for
Compact, Accurate, yet Execution-Efficient LSTM,” arXiv
preprint arXiv:1901.10997, 2019.
[81] H. Lan et al., “FeatherCNN: Fast Inference Computation
with TensorGEMM on ARM Architectures,” IEEE T PAR-
ALL DISTR, 2019.
[82] K. Yanai et al., “Efficient mobile implementation of a CNN-
based object recognition system,” in ACM international
conference on Multimedia, 2016, pp. 362–366.
[83] A. Ignatov et al., “AI benchmark: Running deep neural
networks on android smartphones,” in ECCV, 2018, pp. 0–
0.
[84] M. Guignard et al., “Performance characterization of state-
of-the-art deep learning workloads on an IBM “Minsky”
platform,” in Proceedings of the 51st Hawaii International
Conference on System Sciences, 2018.
[85] M. Loukadakis et al., “Accelerating deep neural networks
on low power heterogeneous architectures,” 2018.
[86] S. Wang et al., “High-Throughput CNN Inference on Em-
bedded ARM big.LITTLE Multi-Core Processors,” arXiv
preprint arXiv:1903.05898, 2019.
[87] S. Rallapalli et al., “Are very deep neural networks feasible
on mobile devices,” IEEE Trans. Circ. Syst. Video Technol.,
2016.
[88] M. Rusci et al., “Memory-Driven Mixed Low Precision
Quantization For Enabling Deep Network Inference On
Microcontrollers,” arXiv preprint arXiv:1905.13082, 2019.
[89] D. Frajberg et al., “Accelerating deep learning inference
on mobile systems,” in International Conference on AI and
Mobile Services, 2019, pp. 118–134.
[90] B. Wu et al., “FBNet: Hardware-aware efficient convnet
design via differentiable neural architecture search,” in
Conference on Computer Vision and Pattern Recognition,
2019, pp. 10 734–10 742.
[91] L. L. Zhang et al., “Fast hardware-aware neural architecture
search,” in Conference on Computer Vision and Pattern
Recognition (CVPR) Workshops, 2020.
[92] A. Zlateski et al., “FFT Convolutions are Faster than
Winograd on Modern CPUs, Here is Why,” arXiv preprint
arXiv:1809.07851, 2018.
[93] J. J. K., “Benefits of Intel Optimized Caffe in comparison
with BVLC Caffe,” http://bitly.ws/8Szz, 2017.
[94] B. Li et al., “Efficient transformer-based large scale lan-
guage representations using hardware-friendly block struc-
tured pruning,” arXiv preprint arXiv:2009.08065, 2020.
[95] J. Yu et al., “Scalpel: Customizing DNN pruning to the
underlying hardware parallelism,” in ACM SIGARCH Com-
puter Architecture News, vol. 45, no. 2, 2017, pp. 548–560.
[96] S. Mittal et al., “A Survey of Techniques for Optimizing
Deep Learning on GPUs,” J SYST ARCHITECT, 2019.
[97] S. Mittal, “A Survey of Techniques for Designing and
Managing CPU Register File,” CONCURR COMP-PRACT
E, 2016.
[98] S. Mittal, “A Survey Of Techniques for Architecting and
Managing Asymmetric Multicore Processors,” ACM Com-
puting Surveys, vol. 48, no. 3, pp. 45:1–45:38, January
2016.
[99] M. Dukhan, “Acceleration package for neural networks on
multi-core CPUs,” http://bitly.ws/8SyP.
[100] Z. Lu et al., “Modeling the resource requirements of
convolutional neural networks on mobile devices,” in ACM
international conference on Multimedia, 2017, pp. 1663–
1671.
[101] S. Mittal, “A Survey Of Techniques for Approximate Com-
puting,” ACM Comput. Surv., vol. 48, no. 4, pp. 62:1–62:33,
2016.
[102] N. Stephens, “BFloat16 processing for Neural Networks on
Armv8-A,” http://bitly.ws/8Szv, 2019.
[103] https://intel.ly/3ihvw9W, 2020.
[104] D. Xu et al., “Accelerating generative neural networks
on unmodified deep learning processors—a software ap-
proach,” IEEE Transactions on Computers, vol. 69, no. 8,
pp. 1172–1184, 2020.
[105] S. Verdoolaege et al., “Polyhedral parallel code generation
for cuda,” ACM Transactions on Architecture and Code
Optimization (TACO), vol. 9, no. 4, pp. 1–23, 2013.
[106] J. Ragan-Kelley et al., “Halide: a language and compiler
for optimizing parallelism, locality, and recomputation in
image processing pipelines,” ACM SIGPLAN Notices, vol. 48,
no. 6, pp. 519–530, 2013.
[107] N. Vasilache et al., “Tensor comprehensions: Framework-
agnostic high-performance machine learning abstractions,”
arXiv preprint arXiv:1802.04730, 2018.