Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2017.DOI
FPGA-based Accelerators of Deep
Learning Networks for Learning and
Classification: A Review
AHMAD SHAWAHNA1, SADIQ M. SAIT1, 2, (Senior Member, IEEE), AND AIMAN EL-MALEH1,
(Member, IEEE)
1Department of Computer Engineering, King Fahd University of Petroleum & Minerals, Dhahran-31261, Saudi Arabia
2Center for Communications and IT Research, Research Institute, King Fahd University of Petroleum & Minerals, Dhahran-31261, Saudi Arabia
Corresponding author: Sadiq M. Sait (e-mail: sadiq@kfupm.edu.sa).
This work was supported by the King Fahd University of Petroleum & Minerals, Dhahran, Saudi Arabia.
ABSTRACT Due to recent advances in digital technologies, and availability of credible data, an area
of artificial intelligence, deep learning, has emerged, and has demonstrated its ability and effectiveness
in solving complex learning problems not possible before. In particular, convolution neural networks
(CNNs) have demonstrated their effectiveness in image detection and recognition applications. However,
they require intensive CPU operations and memory bandwidth that make general CPUs fail to achieve
desired performance levels. Consequently, hardware accelerators that use application specific integrated
circuits (ASICs), field programmable gate arrays (FPGAs), and graphic processing units (GPUs) have been
employed to improve the throughput of CNNs. More precisely, FPGAs have been recently adopted for
accelerating the implementation of deep learning networks due to their ability to maximize parallelism as
well as due to their energy efficiency. In this paper, we review recent existing techniques for accelerating
deep learning networks on FPGAs. We highlight the key features employed by the various techniques
for improving the acceleration performance. In addition, we provide recommendations for enhancing the
utilization of FPGAs for CNNs acceleration. The techniques investigated in this paper represent the recent
trends in FPGA-based accelerators of deep learning networks. Thus, this review is expected to direct the
future advances on efficient hardware accelerators and to be useful for deep learning researchers.
INDEX TERMS Adaptable Architectures, Convolutional Neural Networks (CNNs), Deep Learning,
Dynamic Reconfiguration, Energy-Efficient Architecture, Field Programmable Gate Arrays (FPGAs),
Hardware Accelerator, Machine Learning, Neural Networks, Optimization, Parallel Computer Architec-
ture, Reconfigurable Computing.
I. INTRODUCTION
In recent years, due to the availability of massive amounts
of credible data (Big Data: Text, Video, Audio, etc.),
and tremendous advances in the area of digital electronics
technologies that provide immense computing power, there
has been a revival in the area of artificial intelligence (AI),
particularly in the area of deep learning (DL) [1]–[3], a sub-
field of machine learning (ML).
The field of DL emerged in 2006 after a long pause in the
area of neural networks (NNs) research [4]. A key aspect in
DL is that the networks and/or their weights are not designed
by human beings. Instead, they are learned from data using
a general purpose learning procedure [5], [6].
While ML uses algorithms to parse and learn from data,
to make informed decisions, DL structures algorithms in
layers to create an artificial neural network (ANN) that
can learn, and similar to human intelligence, can make
accurate decisions on its own [7]. Therefore, instead of
designing algorithms by hand, systems can be built and
trained to implement concepts in a way similar to what
comes naturally to humans, and with accuracy sometimes
exceeding human-level performance [8], [9].
In DL, each layer is designed to detect features at different
levels. A layer transforms the representation at one level
(starting from input data which may be images, text, or
sound) to a representation at a higher, slightly more abstract
level [10]. For example, in image recognition, where input
initially comes in the form of pixels, the first layer detects
low level features such as edges and curves. The output
of the first layer becomes input to the second layer which
produces higher level features, for example semi-circles, and
squares [11]. The next layer assembles the output of the
previous layer to parts of familiar objects, and a subsequent
layer detects the objects. As we go through more layers, the
network yields an activation map that represents more and
more complex features. The deeper you go into the network,
the filters begin to be more responsive to a larger region of
the pixel space. Higher level layers amplify aspects of the
received inputs that are important for discrimination and
suppress irrelevant variations.
A. APPLICATIONS OF DEEP LEARNING NETWORKS
With the now widely used convolution neural networks
(CNNs) [12], [13] and deep neural networks (DNNs) [14],
[15], it is now possible to solve problems in domains
where knowledge is not easily expressed explicitly and
implicit information is stored in the raw data. Solutions to
multifarious problems in the domain of sciences, business,
etc., have been possible that were not conceivable for several
years, in spite of best attempts by the AI community.
This has been primarily possible due to the excellent abil-
ity of deep learning in discovering intricate structures in
high-dimensional data. Examples include character recog-
nition [16], gesture recognition [17], speech recognition
(e.g., in Google Now, Siri, or click-through prediction on
an advertisement) [18]–[20], document processing [21]–
[23], natural language processing [24], [25], video classi-
fication [26], image classification [27]–[32], face detection
and recognition [33], [34], robot navigation [35]–[37], real-
time multiple object tracking [38], financial forecasting [39],
and medical diagnosis systems [40]–[42], to name a few.
Other recent areas of applications include automated
driving (e.g., learning to detect stop signs, traffic lights,
pedestrians, etc.), aerospace and defense (e.g., identify ob-
jects from satellites and identify safe or unsafe zones),
medical research (e.g., in identification of cancer cells),
industrial automation (e.g., to improve worker safety by
detecting when people or objects are within an unsafe
distance of machines), and electronics (used in automated
hearing, speech translation, etc.) [9], [43]–[46].
B. EMERGENCE OF DEEP LEARNING NETWORKS
Convolutional neural networks are considered as one of
the most influential innovations in the field of computer
vision [47]. The success of deep learning networks grew
to prominence in 2012 when Krizhevsky et al. [28] uti-
lized CNNs to win the annual olympics of computer
vision, ImageNet large-scale vision recognition challenge
(ILSVRC) [30]. Using AlexNet model, they achieved an
astounding improvement as the image classification error
dropped from 26% (in 2011) to 15%. ImageNet is a stan-
dard benchmark dataset used to evaluate the performance
of object detection and image classification algorithms. It consists of millions of different images distributed over tens of thousands of object classes.

FIGURE 1. ImageNet Competition Results [50].
CNNs have achieved even better accuracy in classifica-
tion and various computer vision tasks. The classification
accuracy in ILSVRC improved to 88.8% [48], 93.3% [31],
and 96.4% [49] in the 2013, 2014, and 2015 competitions,
respectively. Fig. 1 shows the accuracy loss for the winners
of ImageNet competitions before and after the emergence
of deep learning algorithms.
Thereafter, large host companies started using CNNs at
the core of their services. Google, Microsoft, Facebook,
Amazon, Pinterest, and Instagram are currently using neural
networks for their photo search, Bing’s image feeds, auto-
matic tagging algorithms, product recommendations, home
feed personalization, and for their search infrastructure,
respectively [11]. However, the classic use-case of CNNs
is for image and speech processing [51].
A typical CNN is a multi-layered feed-forward ANN with
a pipeline-like architecture. Specifically, each layer performs
a well-known computation on the outputs of the previous
layer to generate the inputs for the next layer. In general,
CNNs have two types of inputs: the data to be tested or
classified (also named as feature maps), and the weights.
Images, audio files, and recorded videos are examples of the
input data to be classified using CNNs. On the other hand,
the network weights are the data generated from training
the CNN on a dataset containing similar inputs to the one
being tested.
C. HARDWARE ACCELERATION OF DEEP LEARNING
NETWORKS
To provide more accurate results as well as real-time object
recognition, for example in applications such as robots and
auto-piloted cars, the size of the convolution neural network
needs to be increased by adding more neural network
layers [28]. However, evolving more and newer types of NN layers results in more complex CNN structures as well as high-depth CNN models. Thus, billions of operations and
millions of parameters, as well as substantial computing re-
sources are required to train and evaluate the resultant large-
scale CNN [31], [52], [53]. Such requirements represent
a computational challenge for general purpose processors
(GPP). Consequently, hardware accelerators such as appli-
cation specific integrated circuit (ASIC), field programmable
gate array (FPGA), and graphic processing unit (GPU) have
been employed to improve the throughput of the CNN.
In practice, CNNs are trained off-line using the back-
propagation process [54]. Then, the off-line trained CNNs
are used to perform recognition tasks using the feed-forward
process [55]. Therefore, the speed of the feed-forward process is what matters.
GPUs are the most widely used hardware accelerators
for improving both training and classification processes in
CNNs [56]. This is due to their high memory bandwidth
and throughput as they are highly efficient in floating-point
matrix-based operations [57]–[59]. However, GPU acceler-
ators consume a large amount of power. Therefore, their use
in CNN-based applications implemented as a cloud service
on large servers or in battery operated devices becomes a
challenge. Furthermore, GPUs gain their performance from
their ability to process a large image batch in parallel. For
some applications like a video stream, input images should
be processed frame by frame as the latency of the result of
each frame is critical to the application’s performance. For
some tracking algorithms, the result of one frame affects the
process of the next frame [60]. Nurvitadhi et al. [61] recently
evaluated emerging DNN algorithms on latest generations
of GPUs (i.e., NVIDIA Titan X Pascal) and FPGAs (i.e.,
Intel Arria 10 GX 1150 and Intel Stratix 10 2800). The
experimental results show that current trends in deep neural
networks favor FPGA platforms as they offer higher power
efficiency (a.k.a., performance per Watt).
FPGA and ASIC hardware accelerators have relatively
limited memory, I/O bandwidths, and computing resources
compared with GPU-based accelerators. However, they can
achieve at least moderate performance with lower power
consumption [62]. The throughput of ASIC design can be
improved by customizing memory hierarchy and assigning
dedicated resources [63]. However, the development cycle,
cost, and flexibility are not satisfactory in ASIC-based
acceleration of deep learning networks [64], [65]. As an
alternative, FPGA-based accelerators are currently in use
to provide high throughput at a reasonable price with low
power consumption and reconfigurability [66], [67]. The
availability of high-level synthesis (HLS) tools, using C or
C++, from FPGA vendors lowers the programming hurdle
and shortens the development time of FPGA-based hardware
accelerators [68]–[70].
Convolutional neural networks have a very useful prop-
erty, that is, each feature map neuron shares its weights
with all other neurons [71]. The authors in [72], [73] proved
that the highest energy expense results from accessing the
off-chip DRAM memory for data movement rather than
computation. In other words, the energy cost of the increased
memory accesses and data movement due to the large
number of CNN operations often exceeds the energy cost
of computation [64], [74]. Thus, CNN accelerators need to
carefully consider this to achieve efficient architecture in
terms of time and power.
In this paper, we review the current status of using FPGAs
as accelerators for implementing deep learning networks.
We highlight the implementation challenges and design
directions used to tackle those challenges. We also provide
future recommendations to maximize the performance of
FPGAs as accelerators for deep learning networks and
simplify their use.
The remainder of the paper is organized as follows.
Section II provides background information about CNNs,
their key operations, and some well-known deep learning
networks. In addition, it introduces the basic structure of FP-
GAs and highlights their features enabling them to acceler-
ate computationally intensive applications. It also discusses
the implementation challenges of deep learning networks
on FPGAs and how these challenges can be overcome.
Section III reviews existing CNNs compression techniques
and presents the current status of accelerating deep learning
networks using ASIC-based and FPGA-based accelerators.
Section IV describes the use of metaheuristics in the de-
sign and optimization of CNNs implementation. Section
V summarizes existing design approaches for accelerating
deep learning networks and provides recommendations for
future directions that will simplify the use of FPGA-based
accelerators and enhance their performance. Finally, section
VI concludes the paper.
II. BACKGROUND AND TERMINOLOGY
This section gives an overview of the key operations and
terminology used in convolutional neural networks (CNNs)
and provides examples of well-known deep learning net-
works. In addition, it illustrates the basic structure of field
programmable gate arrays (FPGAs) and how deep learning
methods can benefit from the capabilities of FPGAs. The
last subsection highlights the challenges of implementing
deep learning networks on FPGAs.
A. CONVOLUTIONAL NEURAL NETWORKS (CNNS)
In this subsection, we describe the key operations and
terminology involved in the construction of CNNs including
convolution, activation functions, normalization, pooling,
and characteristics of fully connected layers.
1) Convolution (CONV)
A convolution operation can be thought of as the production
of a matrix smaller in size than the original image matrix,
representing pixels, by sliding a small window (called filter,
feature identifier, or kernel) of size k × k over the image
(called input feature map (FM)), to produce an output
feature neuron value [75]. The filter is an array of numbers
called weights or parameters. These weights are computed
during the training phase. As the filter slides over the
feature map, it multiplies the values in the filter with the
original pixel values, that is, it first performs element-wise
multiplication, and then sums the products, to produce a
single number. The inputs and outputs of the CONV layer
are a series of FM arrays.
This operation, starting from the top left corner of the
FM, is repeated by moving the window S strides at a
time, first in the right direction, until the end of the FM
is reached, and then proceeding downwards until the FM
is completely scanned and all the elements of the FM are
covered. The sliding of the filter window and performing
the operation is known by the verb convolving, hence the
noun convolution [11], [76]. Normally, the size of the kernel is very small, less than or equal to 11 × 11. Each output-
input FM pair has a set of weights equal to the kernel size
and each output FM is computed based on the sum of the
convolution operations performed on all input FMs. Note
that different CONV layers in the same CNN model vary
considerably in their sizes.
In summary, the convolution operation comprises four levels of loops: the output FMs loop (Loop-4), the loop
across the input FMs (Loop-3), the loop along the di-
mensions of a single input FM (scan operation, Loop-2),
and the kernel window size loop (multiply-and-accumulate
(MAC) operation, Loop-1). CONV layers are dominant in
CNN algorithms since they often constitute more than 90%
of the total CNN operations [28], [29], [49], [74], [77],
[78]. Therefore, many attempts have been made to speed up CONV operations using the loop unrolling technique [55], [79],
as will be discussed later. Loop unrolling maximizes the
parallelism of CONV MACs computation which requires a
special consideration of processing elements (PEs) and reg-
ister arrays architecture. Fig. 2 illustrates the loop unrolling
of the CONV loop levels.
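To make the four loop levels concrete, the following Python sketch (an illustrative model, not taken from any of the reviewed accelerators) computes a CONV layer with literally nested loops; the array shapes and the stride S are assumed for the example.

```python
import numpy as np

def conv_layer(in_fms, weights, stride):
    """Naive CONV layer. in_fms: (Nif, H, W); weights: (Nof, Nif, K, K)."""
    n_of, n_if, k, _ = weights.shape
    _, h, w = in_fms.shape
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    out_fms = np.zeros((n_of, out_h, out_w))
    for of in range(n_of):                    # Loop-4: across output FMs
        for i in range(n_if):                 # Loop-3: across input FMs
            for oy in range(out_h):           # Loop-2: scan a single input FM
                for ox in range(out_w):
                    acc = 0.0
                    for ky in range(k):       # Loop-1: kernel window (MAC)
                        for kx in range(k):
                            acc += (weights[of, i, ky, kx] *
                                    in_fms[i, oy * stride + ky, ox * stride + kx])
                    out_fms[of, oy, ox] += acc    # accumulate over all input FMs
    return out_fms

# Example: 3 input FMs of 8x8, 4 output FMs, 3x3 kernels, stride S = 1
out = conv_layer(np.random.rand(3, 8, 8), np.random.rand(4, 3, 3, 3), stride=1)
print(out.shape)    # (4, 6, 6)
```

Loop unrolling in hardware corresponds to replicating multipliers and adders so that several iterations of one or more of these loops (e.g., the Pkx × Pky MACs of Loop-1, or Pof output FMs of Loop-4) are executed in the same clock cycle.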
2) Activation Functions (AFs)
The activation function in neural networks is similar to the action potential in animal cells such as neurons. A neuron is said
to fire if it emits an action potential. A popularly used
activation function is the sigmoid function which can be
expressed as
f(x) = 1 / (1 + e^(-x))    (1)

where x represents the weighted sum of the neuron inputs
and if it is a sufficiently large positive number, the sigmoid
function approximates to unity. For sufficiently large nega-
tive values of x, the sigmoid function is close to 0. Another
popular activation function is
f(x) = tanh(x)    (2)
The above standard sigmoid and tanh non-linear functions
require long training time [28]. A recently proposed and
commonly used AF in CNNs is rectified linear unit (ReLU)
which is defined as
f(x) = max(x, 0)    (3)
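The three activation functions in Eqs. (1)-(3) map directly to the following one-line NumPy definitions (a reference sketch of the formulas only):

```python
import numpy as np

def sigmoid(x):          # Eq. (1): 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def tanh_af(x):          # Eq. (2): tanh(x)
    return np.tanh(x)

def relu(x):             # Eq. (3): max(x, 0)
    return np.maximum(x, 0.0)

x = np.array([-3.0, 0.0, 3.0])
print(sigmoid(x), tanh_af(x), relu(x))
```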
ReLU activation function is known to converge faster
in training, and has lower computational complexity [80], [81] than the standard sigmoid and tanh functions. In addition, it does not require input normalization to prevent it from saturating [28], [80], [82].

FIGURE 2. CONV Loops Unrolling [83]: (a) Unrolling Loop-1; (b) Unrolling Loop-2; (c) Unrolling Loop-3; (d) Unrolling Loop-4, where Pkx, Pky, Pix, Piy, Pif, and Pof are loop unrolling design variables for the kernel window width, kernel window height, input FM width, input FM height, number of input FMs, and the number of output FMs, respectively.
3) Normalization
In real life, a phenomenon called ‘lateral inhibition’ appears,
which refers to the capacity of an excited neuron to subdue
its neighbors, thereby creating a contrast in that area. In
CNNs, to accomplish this, local response normalization
(LRN) or simply normalization is used, particularly when
dealing with ReLU neurons, because they have unbounded
activation that needs normalization. It detects high frequency
features with a large response. If we normalize around the
local neighborhood of the excited neuron, it becomes even
more sensitive as compared to its neighbors. At the same
time, it will dampen the responses that are uniformly large
in any given local neighborhood. If all the values are large,
then normalizing those values will diminish all of them. So,
basically it performs some kind of inhibition and boosts the
neurons with relatively larger activations.
Normalization can be done within the same feature or
across neighboring features by a factor that depends on the
neighboring neurons. Expressions to compute the response
normalized activity can be found in [28], [80].
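As an illustration of cross-map normalization, the sketch below follows the local response normalization expression reported in [28]; the constants n, k, alpha, and beta are the values used there and are shown only as an example.

```python
import numpy as np

def lrn_across_maps(fms, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """fms: (N, H, W). Each activation is divided by a term that grows with the
    squared activations of its n neighboring feature maps (constants from [28])."""
    num_maps = fms.shape[0]
    half = n // 2
    out = np.empty_like(fms)
    for i in range(num_maps):
        lo, hi = max(0, i - half), min(num_maps, i + half + 1)
        denom = (k + alpha * np.sum(fms[lo:hi] ** 2, axis=0)) ** beta
        out[i] = fms[i] / denom       # uniformly large neighborhoods dampen the response
    return out

print(lrn_across_maps(np.random.rand(8, 4, 4)).shape)    # (8, 4, 4)
```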
4) Pooling
Pooling, also known as subsampling, is employed to pro-
gressively reduce the spatial size of the representation,
thereby reducing the amount of parameters and computation
in the network. Pooling layers are periodically inserted
in between successive convolutional layers. They operate
independently on every depth slice of the input and resize it
spatially using the MAX operation. The most common form
is a pooling layer with filters of size 2 × 2, where the MAX operation takes the maximum over 4 samples, thereby discarding 75 percent of the activations [84]. In
addition to the popular MAX pooling, the pooling units in some CNNs are also used to perform other functions, such as AVG and MIN operations [80].

FIGURE 3. AlexNet CNN Architecture [28].
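A minimal sketch of the 2 × 2 MAX pooling described above (non-overlapping windows with stride 2, which keeps 1 of every 4 activations):

```python
import numpy as np

def max_pool_2x2(fm):
    """fm: (H, W) with even H and W; returns (H/2, W/2), the max of each 2x2 window."""
    h, w = fm.shape
    return fm.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fm))    # [[ 5.  7.] [13. 15.]]
```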
5) Fully Connected Layer (FC)
A common form of a convolutional neural network archi-
tecture comprises stacks of a few convolutional and ReLU
layers, followed by layers for pooling, and this pattern is
repeated until the image has merged spatially to a small
size. This is followed by one or more fully connected layers,
also known as inner-product layers, whose neurons have full
connections to all activations in the previous layer, hence
the name. The last fully connected layer is the classification
layer and it holds the output such as the class scores [80].
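For reference, a fully connected (inner-product) layer is a matrix-vector product plus a bias; the sizes below are illustrative assumptions (the 4096-to-1000 shape mirrors the last AlexNet layer producing class scores).

```python
import numpy as np

def fc_layer(x, weights, bias):
    """x: (n_in,), weights: (n_out, n_in), bias: (n_out,) -> output activations (n_out,)."""
    return weights @ x + bias

x = np.random.rand(4096)                          # activations from the previous layer
scores = fc_layer(x, 0.01 * np.random.randn(1000, 4096), np.zeros(1000))
print(scores.shape)                               # (1000,) class scores
```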
B. EXAMPLES OF DEEP LEARNING NETWORKS
We list in this subsection some of the well-known deep
learning networks.
AlexNet (2012) is a convolutional neural network consisting of 5 convolutional layers, interspersed by 2 normalization layers, as well as 3 fully connected layers [28]. Each convolutional layer performs the activation function using ReLU. In addition, 3 pooling layers are employed with the first, second, and last convolutional layers. The architecture of AlexNet CNN is shown in Fig. 3. AlexNet won the 2012 ImageNet challenge by classifying 224 × 224 input color images to 1,000 different output classes.
VGG (2014) is a convolutional neural network model
similar to AlexNet in terms of the number of fully
connected layers. However, it consists of 5 groups of
convolutional layers [29], [81]. The exact number of
CONV layers in each group depends on the version
of the VGG, visual geometry group, model. Table 1
shows the number of CONV and FC layers for the
most commonly used VGG models.
ResNets (2016) are deep residual networks with ex-
tremely irregular and complex structures compared to
AlexNet and VGG CNN models [49], [85], [86]. This is
due to having more types of layers, where non-adjacent
layers incorporate shortcuts to compute the residual
functions, as well as having highly deep structures, that
is, between 50 and 1000 CONV layers. Unlike AlexNet
and VGG models where the layers are connected in
sequence, the interconnections in ResNet layers are in
the form of a directed acyclic graph (DAG). ResNet-
50 and ResNet-152 are widely used, especially for
image classification. ResNet-50/152 structure contains
53/155 CONV (most of them are followed by batch
normalization (BatchNorm), scale, and ReLU layers),
1/1 MAX pooling, 1/1 Average pooling, 1/1 FC, and,
16/50 element-wise (Eltwise) layers, respectively.
C. FIELD PROGRAMMABLE GATE ARRAYS (FPGAS)
FPGAs are off-the-shelf programmable devices that provide
a flexible platform for implementing custom hardware func-
tionality at a low development cost. They consist mainly of
a set of programmable logic cells, called configurable logic
blocks (CLBs), a programmable interconnection network,
and a set of programmable input and output cells around the
device [87]. In addition, they have a rich set of embedded
components such as digital signal processing (DSP) blocks
which are used to perform arithmetic-intensive operations
such as multiply-and-accumulate, block RAMs (BRAMs),
look-up tables (LUTs), flip-flops (FFs), clock management
unit, high speed I/O links, and others. Fig. 4 shows a basic
structure of an FPGA.
FPGAs are widely considered as accelerators for
computationally-intensive applications as they enable models with highly flexible fine-grained parallelism and associative operations such as broadcast and collective response [88].
TABLE 1. CNN Layers for VGG Models.
Layers VGG-11 VGG-16 VGG-19
CONV (Group 1) 1 2 2
CONV (Group 2) 1 2 2
CONV (Group 3) 2 3 4
CONV (Group 4) 2 3 4
CONV (Group 5) 2 3 4
CONV (Total) 8 13 16
FC 3 3 3
Total 11 16 19
FIGURE 4. FPGA Basic Structure [87].
In [89], [90], FPGA computing models used
for applications acceleration are presented, including data
streaming, associative computing, highly parallel memory
access, use of standard hardware structures such as first
in first out (FIFO) buffers, stacks and priority queues, and
functional parallelism.
FPGAs have the advantage of maximizing performance
per Watt of power consumption, reducing costs for large
scale operations [91]. This makes them an excellent choice
as accelerators for battery operated devices and in cloud
services on large servers. FPGAs have recently been widely
used for deep learning acceleration given the flexibility in
implementing architectures with large degree of parallelism
resulting in high execution speeds [92].
The adoption of software-level programming models such
as the open computing language (OpenCL) standard [93],
[94] in FPGA tools made them more attractive to use for
deep learning [95], [96]. In addition, the feed-forward nature
of deep learning algorithms makes FPGAs offer a clear
advantage as they can create customized hardware circuits
that are deeply pipelined and inherently multithreaded [91].
FPGAs also have the capability of partial dynamic config-
uration, which allows part of the FPGA to be reconfigured
while the rest is being used. This could be of potential
benefit to deep learning methods where the next layer could
be reconfigured while the current layer is being used.
D. CHALLENGES OF FPGA-BASED IMPLEMENTATION
OF DEEP LEARNING NETWORKS
Implementation of deep learning networks and, in particular,
CNNs on FPGAs has a number of challenges including
the requirement of a significant amount of storage, external
memory bandwidth, and computational resources on the
order of billions of operations per second [97]. For example,
AlexNet CNN has over 60 million model parameters which
need 250MB of memory for storing the weights based
on 32-bit floating-point representation, and requires around 1.5 billion operations for each input image [80].
Such a large amount of storage is not supported by
existing commercial FPGAs and hence the weights have
to be stored on external memory and transferred to the
FPGA during computation. Without careful implementation
of deep learning networks and maximizing resource sharing,
the implementation may not fit on FPGAs due to limited
logic resources.
The problem is exacerbated with more complex models such as the VGG CNN model, which has 16 layers. For example, the VGG-16 CNN model has 138 million weights and needs over 30 GOPS [98]. Although the current trend in implementing CNNs is toward compressing the entire CNN model while dramatically reducing the data bit-width [99],
it is expected that future CNN models will get more complex
with larger number of layers as the amount of training data
continues to grow and the problems to be solved get more
complex.
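The storage figures quoted above follow directly from the parameter counts and the chosen bit-width; the short sketch below reproduces that arithmetic for AlexNet (about 60 million weights) and VGG-16 (about 138 million weights) and shows how reduced precision shrinks the external memory footprint. The byte counts are illustrative approximations.

```python
def weight_storage_mb(num_weights, bits_per_weight):
    """Memory needed to hold the weights, in megabytes."""
    return num_weights * bits_per_weight / 8 / 1e6

for name, n in [("AlexNet", 60e6), ("VGG-16", 138e6)]:
    for bits in (32, 16, 8):
        print(f"{name:8s} {bits:2d}-bit: {weight_storage_mb(n, bits):7.1f} MB")
# AlexNet at 32 bits is roughly 240 MB, consistent with the ~250 MB cited above.
```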
In addition, different layers in CNNs have different
characteristics which result in different parallelism and
memory access requirements. Different layers in a CNN
network exhibit vastly different amounts of intra-output
and inter-output parallelism [100]. Intra-output parallelism
parallelizes the computation of a single output image since
it is the sum of n input-kernel convolutions. However,
inter-output parallelism is based on computing multiple
output FMs in parallel. Furthermore, convolutional layers
are computational-centric while fully connected layers are
memory centric [98]. For example, the number of operations in each group of convolutional layers in the VGG-16 model is on the order of 2 to 9 GOPS, while the number of weights is on the order of 0.04 to 7.08 million. However, the number of operations in the fully connected layers is on the order of 0.01 to 0.21 GOPS, while the number of weights is on the order of 4.10 to 102.76 million. Thus, the developed CNN
accelerator must be designed carefully to meet the varying
requirements of different layers and needs to be flexible to
maximize the performance for each CNN layer.
As technology advances, FPGAs continue to grow in size
and capabilities. It is crucial to have some mechanisms for
addressing the requirements for efficient implementations
of deep learning networks. Addressing hardware resource
limitations requires reuse of computational resources, and
storing of partial results in internal memories. Data transfer
and computational resource usage are significantly impacted
by the ordering of operations and selection of parallelism in
the implementation of CNNs on FPGAs. Careful scheduling
of operations can result in significant reduction in external
memory access and internal buffer sizes. External memory
bandwidth requirements can also be decreased by using
reduced precision for representing the weights with minimal
impact on solution quality, which also results in a better
energy efficiency. In addition, the number of external mem-
ory accesses can be reduced by utilizing on-chip memory
and exploiting data reuse. Furthermore, the large number of
weights in the fully connected layer can be reduced, based
on utilizing singular value decomposition (SVD) [101] with
a small impact on accuracy. In the next section, we will
review various design approaches used to cope with those
challenges for implementing deep learning networks.
III. ACCELERATION OF DEEP LEARNING NETWORKS:
CURRENT STATUS
In this section, we will start by covering convolutional
neural networks (CNNs) compression techniques as they
have a significant impact on the implementation complexity
of CNNs. CNNs compression techniques target the min-
imization of the number of operations and the memory
footprint with minimal impact on accuracy. Then, we discuss
hardware acceleration techniques for deep learning (DL)
algorithms and CNNs based on both application specific
integrated circuit (ASIC) and field programmable gate array
(FPGA) implementations. In general, hardware accelerators
focus on designing specific modules and architectures that
ensure data reuse, enhance data locality, and accelerate
convolutional (CONV) layer operations based on performing
needed operations in parallel.
A. CNNS COMPRESSION
In this subsection, we review techniques that target the com-
pression of CNNs which results in significantly reducing
their implementation complexity with minimal impact on
accuracy.
Denton et al. [102] proposed a technique to reduce the
memory footprint for the network weights in object recog-
nition systems. They used singular value decomposition
(SVD) [101] and filter clustering methods for this purpose.
The results for convolutional model of 15 layers in [48]
show that the proposed technique speeds up the operations
in convolutional layers by a factor of 2, compared to CPU
Eigen3-based library implementation [103]. In addition, it
successfully achieved 13× memory footprint reduction for
the fully connected layers while preserving the recognition
accuracy within 99%.
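To illustrate the idea behind SVD-based compression of fully connected weights (a hedged sketch of the general technique, not the authors' implementation): a weight matrix W of size m × n is replaced by a rank-r factorization, reducing storage from m·n values to roughly r·(m + n).

```python
import numpy as np

def svd_compress(weights, rank):
    """Return factors (A, B) such that A @ B approximates the weight matrix."""
    u, s, vt = np.linalg.svd(weights, full_matrices=False)
    a = u[:, :rank] * s[:rank]     # (m, r), singular values folded into the left factor
    b = vt[:rank, :]               # (r, n)
    return a, b

w = np.random.rand(1024, 4096)     # trained FC weights have much lower effective rank than random ones
a, b = svd_compress(w, rank=128)
ratio = w.size / (a.size + b.size)
err = np.linalg.norm(w - a @ b) / np.linalg.norm(w)
print(f"storage ratio: {ratio:.1f}x, relative approximation error: {err:.3f}")
```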
In another work, Han et al. [104] employed network
pruning techniques [105]–[107] to reduce the over-fitting
and complexity of neural network models. Their results
demonstrated that pruning redundant connections as well as
less influential connections achieved 9× and 13× compres-
sion for AlexNet and VGG-16 models, respectively, while
achieving zero accuracy loss for both.
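A minimal sketch of magnitude-based connection pruning (one common way of identifying the less influential connections; the actual pipeline in [104] also retrains the surviving weights, which is omitted here):

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (sparsity=0.9 keeps 10%)."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask    # the mask keeps pruned weights at zero during retraining

w = np.random.randn(256, 256)
pruned, mask = prune_by_magnitude(w, sparsity=0.9)
print(f"kept {mask.mean():.1%} of the connections")
```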
In a subsequent work, Han et al. [108] proposed a deep
compression technique for more reduction of the storage
requirements of CNNs through the enforcement of weights
sharing. Deep compression basically consists of pruning,
trained weights quantization, and Huffman coding pipeline
stages. The experimental results show that the proposed
compression technique successfully reduced the storage
requirement of AlexNet and VGG-16 CNN models by 35× and 49×, respectively, without affecting their accuracy. This also improved the power efficiency (a.k.a., performance per Watt) by 3× to 7×.
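The weight-sharing stage of deep compression can be sketched as clustering the surviving weights into a small codebook so that only short cluster indices (plus the codebook) are stored; the 16-cluster choice below (4-bit indices) is an illustrative assumption, and the Huffman-coding stage is omitted.

```python
import numpy as np

def share_weights(weights, num_clusters=16, iters=20):
    """Simple 1D k-means over weight values: returns the codebook and per-weight indices."""
    flat = weights.ravel()
    codebook = np.linspace(flat.min(), flat.max(), num_clusters)    # initial centroids
    for _ in range(iters):
        idx = np.argmin(np.abs(flat[:, None] - codebook[None, :]), axis=1)
        for c in range(num_clusters):
            if np.any(idx == c):
                codebook[c] = flat[idx == c].mean()
    return codebook, idx.reshape(weights.shape)

codebook, idx = share_weights(np.random.randn(64, 64).astype(np.float32))
# 4-bit indices plus a tiny codebook replace a 32-bit float per weight
print(f"per-weight storage drops from 32 to about 4 bits ({32 / 4:.0f}x for this layer)")
```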
B. ASIC-BASED ACCELERATORS
In this subsection, we present some recent work in the area
of hardware-based accelerators (ASICs).
An ASIC-based hardware accelerator referred to as Dian-
Nao [109] was designed for large-scale convolutional neural
networks and deep neural networks. DianNao accelerates
neural networks by minimizing memory transfers, which
opened a new paradigm for hardware accelerators. Since
the weights are repeatedly used in the computations of con-
volution layers, frequent memory access can significantly
degrade the overall performance. Therefore, the authors
exploited the locality properties of neural network layers
to design custom storage structures that take advantages
of these properties. In addition, they employed dedicated
buffers and tiling techniques to reduce the overall external
memory traffic through increasing data locality.
Chen et al. [109] also observed that using short fixed-
point representation of feature maps (FMs) and weights can
also significantly reduce computation resources and memory
footprint. They found that the area and power of a 32-
bit multiplier can be reduced by a factor of 0.164× and 0.136×, respectively, using 16-bit multipliers. Consequently,
DianNao has been implemented using 65nm fabrication
technology with 16-bit fixed-point arithmetic units, 6 bits of
which are used for the integer part and the remaining 10 for
the fractional part. The experimental results demonstrated
that DianNao has an average performance of 452 GOPS
with power consumption of 485 mW. The results depicted
that using 16-bit arithmetic units instead of 32-bit ones in-
troduced only 0.26% accuracy loss on MNIST dataset [110].
On the other hand, the scalability and efficiency of DianNao
accelerator are severely limited by the bandwidth constraints
of the memory system.
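The 16-bit fixed-point format described above (6 integer bits and 10 fractional bits) can be modeled as follows; the rounding and saturation behavior are assumptions for illustration, since the paper does not detail them.

```python
import numpy as np

def to_fixed(x, int_bits=6, frac_bits=10):
    """Quantize floats onto a signed fixed-point grid with the given integer/fractional bits."""
    scale = 2 ** frac_bits
    lo = -(2 ** (int_bits + frac_bits - 1))       # most negative 16-bit code
    hi = 2 ** (int_bits + frac_bits - 1) - 1      # most positive 16-bit code
    return np.clip(np.round(x * scale), lo, hi).astype(np.int16)

def to_float(codes, frac_bits=10):
    return codes.astype(np.float32) / (2 ** frac_bits)

x = np.array([0.12345, -3.5, 31.999, 40.0])       # 40.0 saturates: outside the Q6.10 range
print(to_float(to_fixed(x)))                      # step size is 2^-10, magnitude limited to ~32
```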
In a related research work, Chen et al. [111], [112] pro-
posed DaDianNao multi-chip supercomputer which offers
sufficient memory capacity suitable for on-chip storage of
all weights in CNNs. This system is mainly important for
today’s large-scale deployments of sophisticated industry
and consumers services. DaDianNao uses 16-bit fixed-point
numbers in the inference process like DianNao, but it is
implemented using 28nm technology. The results show that
DaDianNao outperforms the performance of a single GPU
architecture by up to 656.63× and reduces the average energy consumption by 184.05× with only 0.01% accuracy
error rate on MNIST dataset for a 64-chip system.
Another member of the DianNao family, called PuDian-
Nao [113], has been designed using TSMC 65nm process
to support multiple techniques and scenarios of machine
learning (ML). PuDianNao accelerates different ML tech-
niques through extracting their critical locality properties
and computational primitives with the use of on-chip storage
as well as 7 novel functional units. Experimental results
show that PuDianNao is 1.20× faster and 128.41× more energy-efficient than the NVIDIA K20M GPU
architecture. However, both of DaDianNao [111] and Pu-
DianNao architectures have not been optimized to be used for embedded applications.

FIGURE 5. Processing Element (PE) Architecture in: (a) FlexFlow, (b) 2D-Mapping [118].
To improve the scalability and energy efficiency of Di-
anNao design discussed in [109], ShiDianNao accelerator
was proposed [114]. ShiDianNao is designed especially
for real-time object recognition applications such as self-
driving cars, smartphones, and security using 65nm CMOS
technology. The proposed accelerator directly connects with
a CMOS/CCD sensor in the image processing chip. In
addition, all the weights of CNN layers are stored in SRAM
on-chip memory, as the target here is small CNN models.
ShiDianNao is embedded inside the processing chip to elim-
inate off-chip DRAM memory accesses and minimize data
movements between the SRAM holding the CNN model
and the individual processing elements from the sensor.
ShiDianNao has a power consumption of 320.10 mW with
a peak performance of 194 GOPS under 1 GHz working
frequency. Moreover, ShiDianNao has 1.87× speedup and is 60× more energy-efficient than DianNao [109].
However, DianNao [109], DaDianNao [111], [112], Pu-
DianNao [113], and ShiDianNao [114] are not implemented
using FPGA or any other reconfigurable hardware. There-
fore, they cannot be efficiently adapted to different appli-
cation demands (i.e., different neural network sizes). In
addition, ASIC designs have a long development cycle and
lack flexibility for handling varying DL network designs.
Finally, CNN accelerators, which store all weights on-chip
such as ShiDianNao [114], will not be able to support
realistic large-scale CNN models.
Similar approaches to the DianNao family of techniques
are presented in [115] with similar limitations. ISAAC [116]
and PRIME [117] have explored in-memory processing to
design an acceleration architecture for neural networks. The
proposed ISAAC architecture has achieved better improve-
ments of 14.8×, 5.5×, and 7.5× in throughput, energy, and
computational density, respectively, than the state-of-the-art
DaDianNao architecture.
In CNN models, fine-grained parallelism appears at fea-
ture map level, in the neuron level, and in the synapse level.
Lu et al. [118] reviewed current accelerators that exploit
the intrinsic parallelism and observed a mismatch between
the parallel types supported by the computing engine and
the dominant parallel types that appear in CNN workloads.
They identified that most of the previous techniques pro-
posed solutions that fall into one of the three representative
architectures: (i) Systolic, (ii) 2D-mapping, and (iii) Tiling.
Due to limitations of dataflow of each of the above three
architectures, most existing accelerators support only one
specific parallelism. Systolic architectures exploit synapse
parallelism (SP), 2D-mapping architectures exploit neuron
parallelism (NP), and tiling architectures exploit feature map
parallelism (FP). However, in a practical CNN, the dominant
parallel type depends on the number of input FMs, the
number of output FMs, the size of the output FMs, and
the size of the kernel.
With three components (feature map, neuron, synapse) that can each be either left serial or parallelized, we get 2^3 = 8 possible combinations. An example of a processing style could be SFSNMS, meaning single feature map, single neuron, and multiple synapses.
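The eight combinations can be enumerated directly; the naming below (an S or M prefix per component, as in SFSNMS) simply follows the convention used in the text.

```python
from itertools import product

components = ["F", "N", "S"]      # feature map, neuron, synapse
styles = ["".join(mode + comp for mode, comp in zip(modes, components))
          for modes in product("SM", repeat=3)]
print(styles)
# ['SFSNSS', 'SFSNMS', 'SFMNSS', 'SFMNMS', 'MFSNSS', 'MFSNMS', 'MFMNSS', 'MFMNMS']
```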
To address the above problem, and support all possi-
ble processing styles, Lu et al. [118] proposed a flexible
dataflow architecture, called FlexFlow, with minimal con-
trols. FlexFlow supports all types of data paths in each type
of parallelism in different layers efficiently.
As a first step, a modification to the processing element
(PE) micro-architecture, and the interconnections between
PEs, is proposed. PEs are arranged in rows where each
row can complete one convolution and serve one output
neuron. The adders in each PE row are connected to form
the adder tree. Fig. 5 illustrates the proposed PE in FlexFlow
and that in 2D-mapping architecture. By eliminating de-
pendency between adjacent PEs, the proposed convolutional
unit supports the comprehensive MFMNMS parallelisms. To
cater to different types of parallelisms, they also proposed
a hierarchical dataflow with high data “routability” and low
control overhead. The entire dataflow can be divided into
three sub-flows: (i) distribution to local storage in each
PE, (ii) fetching of data from local storage for operators
(multiplier and adder), and, (iii) dataflow from neuron and
kernel buffers to the distribution layer. They also presented
a method to determine parallelization type and degree (i.e.,
the unrolling parameters) for each CONV layer.
FlexFlow architecture was evaluated for computing re-
source utilization, performance, power, energy, and area.
Comparison was made with three typical architectures (i.e.,
systolic, 2D-mapping, and tiling) using six practical work-
loads, including AlexNet and VGG. They also examined
the scalability of FlexFlow in terms of resource utilization,
power, and area with growing scales of computing engine.
From experimental results, it was found that the computing resource utilization of FlexFlow was over 80% across all workloads, in contrast to the baselines that utilized less than 60% (most of them less than 40%). In terms
of performance, FlexFlow demonstrated over 420 GOPS
performance with 1 GHz working frequency. It also out-
performed others in terms of data reusability and power
efficiency.
C. FPGA-BASED ACCELERATORS
In this subsection, we will review recent techniques employ-
ing FPGAs for the acceleration of deep learning networks.
For each reviewed technique, we will highlight the key
features utilized to maximize performance and throughput
in the acceleration process.
FPGA implementations of CNNs appeared in the mid-
1990’s when Cloutier et al. [119] designed the virtual
image processor (VIP) on Altera EPF81500 FPGA. VIP
is a single-instruction stream multiple-data streams (SIMD)
multiprocessor architecture with a 2D torus connection
topology of processing elements (PEs). VIP improves the
performance through the use of low-accuracy arithmetic
to avoid implementing full-fledged multipliers. Fortunately,
recent digital signal processing (DSP)-oriented FPGAs in-
clude large numbers of multiply-and-accumulate (MAC)
units which allow for extremely fast and low power CNN
implementations.
Thereafter, FPGA implementations of deep learning net-
works have mainly focused on accelerating the computa-
tional engine through optimizing CONV layer operations.
Several studies in the literature [120]–[126] have reported
FPGA-based implementations of convolution operation.
Farabet et al. [127] presented an FPGA implementation of
CNN that uses one dedicated hardware convolver and a soft-
processor for data processing and controlling, respectively.
The proposed implementation is referred to as convolutional
network processor (CNP). CNP exploits the parallelism of
CONV layers to accelerate the computational engine of
CNNs while fully utilizing the large number of DSPs, the
MAC hardware units on FPGA. The proposed architecture
consists of Virtex4 SX35 FPGA platform and external mem-
ory. The authors designed a dedicated hardware interface
with the external memory to allow 8 simultaneous read/write
accesses transparently. In addition, they used first in first
out (FIFO) buffers between the FPGA and the external memory chip in both directions to guarantee the steadiness of dataflow.

FIGURE 6. 2D Convolution Module of 3 × 3 Kernel [127].
The vector arithmetic and logic unit in CNP implements
2D CONV, pooling, and non-linear activation function op-
erations of convolutional networks. The implementation of
2D CONV with kernel of size 3 (i.e., K = 3) is shown in Fig. 6, where x is the data from input feature map (FM), y is the partial result to be combined with the current result, z is the result to the output FM, Wij is the weight value in the convolution kernel, and W is the width of the input image. It can be seen that the proposed convolutional module accomplishes K^2 MAC operations simultaneously
in each clock cycle. CNP represents FMs and weights using
16-bit (Q8.8) fixed-point format. The proposed accelerator
has been implemented for a face detection system with
LeNet-5 architecture [128]. It utilized 90% and 28% of the
general logic and multipliers, respectively. In addition, CNP
consumed less than 15 Watts of power.
Sankaradas et al. [129] proposed a massively parallel
coprocessor to accelerate CNNs using Virtex5 LX330T
FPGA platform. The proposed accelerator mainly focused
on optimizing computation engine by employing the paral-
lelism within convolution kernel and FMs. The coprocessor
can be considered as parallel clusters of vector processing
elements (VPEs) where each cluster is designed using 2D
convolvers, adders, sub-samplers, and look-up tables. Each
VPE consists of multiplier-accumulator and programmable
register units to hold kernel weights and FM data. To
hold the massive intermediate data of CNNs, the authors
employed a dedicated off-chip memory (4 DDR2 memory
banks) with a large bandwidth on the coprocessor card.
Moreover, the proposed accelerator uses a low precision
data representation feature with memory packing to further
improve the memory bandwidth as well as the throughput.
20-bit and 16-bit fixed-point representations were utilized
for kernel weights and FMs, respectively.
FIGURE 7. MAPLE Processing Core Architecture [132].
The authors examined their architecture on CNN with
4 CONV layers and without any fully connected (FC)
layer for a face recognition application. The results show
that the proposed coprocessor is 6× faster than a software
implementation on a 2.2 GHz AMD Opteron processor
with less than 11 Watts of power dissipation. However, the
proposed accelerator cannot be used to accelerate full CNNs
as it uses few CONV layers without any FC layer. A full
CNN model consists of both CONV layers and FC layers.
Thus, an efficient CNN accelerator for real-life applications
is needed to consider both. Similar approaches to the work
of Sankaradas et al. [129] are presented in [130], [131] to
accelerate support vector machines (SVM).
MAPLE [132] is a programmable FPGA prototype sys-
tem presented to accelerate both learning and classification
tasks in applications with unstructured large amount of
data. The authors analyzed five workload domains to help
in designing MAPLE. These workloads are SVM [133],
supervised semantic indexing (SSI) [134], K-means [135],
generalized learning vector quantization (GLVQ) [136], and
CNNs [71]. They found that their computations can be
structured as parallel streams of vector or matrix operations.
Thus, they architected MAPLE as a 2D grid of vector pro-
cessing elements as shown in Fig. 7. To efficiently perform
matrix multiplication, they allocate a private local storage
to each PE which is used to store a column, or part of it,
from the multiplier matrix. In this way, matrix multiplication
is accomplished by streaming the multiplicand matrix rows
through the PEs where each PE performs a MAC operation.
The PEs are organized in clusters, where each group is
served by a separate memory bank of the banked off-chip
memories, which create independent streams for processor-
memory computation.
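A software model of this organization (a hedged sketch, not MAPLE's actual hardware): each PE privately stores one column of the multiplier matrix, and the rows of the multiplicand matrix are streamed past all PEs, each of which performs one MAC per streamed element.

```python
import numpy as np

def streamed_matmul(a, b):
    """Compute a @ b by streaming rows of a past PEs that each hold one column of b."""
    num_rows, depth = a.shape
    pe_columns = [b[:, j] for j in range(b.shape[1])]    # private local storage per PE
    out = np.zeros((num_rows, len(pe_columns)))
    for i in range(num_rows):                            # stream the multiplicand rows
        row = a[i]
        for j, col in enumerate(pe_columns):             # every PE sees the same stream
            acc = 0.0
            for k in range(depth):                       # one MAC per streamed element
                acc += row[k] * col[k]
            out[i, j] = acc
    return out

a, b = np.random.rand(4, 6), np.random.rand(6, 3)
print(np.allclose(streamed_matmul(a, b), a @ b))         # True
```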
Moreover, MAPLE uses on-chip smart memory blocks
to process the large intermediate data on-the-fly using in-
memory processing. Fig. 8 shows the architecture of the
smart memory block. To illustrate the idea of on-the-fly
in-memory processing, let us consider finding the maximum K elements.

FIGURE 8. MAPLE Smart Memory Block [132].

The filter compares the input data with the threshold value (VAL). If the input value is greater than
VAL, it updates the list by replacing VAL at address ADDR
with the input value. Then, the scanner (SCAN) searches for
the new minimum value in the list and updates the threshold
VAL and ADDR accordingly. It should be mentioned here
that the employment of in-memory processing reduced the
off-chip memory traffic by 1.64×, 25.7×, and 76× for SSI,
K-means, and CNN workloads, respectively. MAPLE pro-
totype has been implemented on Virtex5 SX240T platform
running at 125MHz. The experimental results for face and
digit recognition CNNs [137]–[139] show that MAPLE is
50% faster than a 1.3 GHz NVIDIA Tesla C870 GPU implementation.
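The filter-and-scan behavior of the smart memory block can be modeled as below (an illustrative sketch; VAL and ADDR mirror the threshold value and its address described above):

```python
def top_k_filter(stream, k):
    """Keep the maximum k elements of a data stream, updating the threshold on the fly."""
    buf = [float("-inf")] * k          # the list held inside the smart memory block
    val = float("-inf")                # VAL: current minimum among the kept elements
    addr = 0                           # ADDR: location of VAL in the list
    for x in stream:
        if x > val:                    # filter: compare the input with the threshold
            buf[addr] = x              # replace VAL with the new input value
            addr = min(range(k), key=lambda i: buf[i])    # scan for the new minimum
            val = buf[addr]            # update the threshold
    return buf

print(sorted(top_k_filter([7, 1, 9, 4, 8, 2, 6], k=3), reverse=True))    # [9, 8, 7]
```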
Chakradhar et al. [100] proposed a dynamically config-
urable CNN architecture on FPGA. The proposed system
consists of three main components; a coprocessor, a dy-
namically configurable CNN (DC-CNN) processing core,
and 3-bank memory subsystem. The coprocessor is designed
such that it automatically configures the software and the
hardware elements to fully exploit the parallelism at the
workload level. DC-CNN is responsible for executing CNN
applications and its architecture is shown in Fig. 9. It
consists of m computational elements (each with n 2D
convolvers as well as sub-sampling (S) and non-linearity (NL) pipelined units), m adders (each with n inputs), and input/output switches.

FIGURE 9. The Architecture of DC-CNN [100].

The internal structure of the switches vector encloses m × n selectors which are used to help
in exploring the entire design space and to provide the
configurability function across different layers of CNN
model. To determine the best feasible (m, n) combination for
each layer, the system analyzes the workload using integer
factorization techniques, since factorization is fast for
small numbers [140], [141]. Dynamic programming is also
used to quickly prune infeasible combinations.
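The (m, n) selection step can be pictured with the short Python sketch below, which enumerates factor pairs of a convolver budget by trial division and scores them with a placeholder cost; the budget and cost model are assumptions for illustration only, not those of [100].

```python
# Illustrative sketch of selecting an (m, n) split via integer factorization,
# in the spirit of DC-CNN's coprocessor: m computational elements, each with
# n 2D convolvers, under a fixed convolver budget. The cost model is a
# placeholder, not the one used in [100].

def factor_pairs(budget):
    """All (m, n) with m * n == budget, found by trial division."""
    return [(m, budget // m) for m in range(1, budget + 1) if budget % m == 0]

def pick_config(budget, out_maps, in_maps):
    # Toy cost: idle convolvers caused by mismatch with the layer shape.
    def waste(m, n):
        return (m * ((out_maps + m - 1) // m) - out_maps) + \
               (n * ((in_maps + n - 1) // n) - in_maps)
    return min(factor_pairs(budget), key=lambda p: waste(*p))

# Example: 20 convolvers, a layer with 16 output and 12 input feature maps.
print(pick_config(20, out_maps=16, in_maps=12))   # (4, 5) under this toy cost
```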
The authors compared the proposed DC-CNN architec-
ture, considering 20 2D convolvers as well as a mem-
ory subsystem of 128-bit port width, with a 1.35 GHz
NVIDIA’s GPU implementation. The results show that DC-
CNN achieved 4.0×, 4.4×, 5.4×, 6.0×, and 6.5× speedup
for face recognition [137], face detection [139], mobile
robot vision [127], video surveillance [100], and automotive
safety [100] workloads, respectively. It is worth mentioning
that DC-CNN is the first architecture that achieves a perfor-
mance suitable for real-time processing for video streaming
as it processes up to 30 frames per second. In addition, DC-
CNN is more energy-efficient than the GPU implementation
as it consumes 14 Watts, while more than 150 Watts are
consumed by the GPU. On the other hand, the authors
modeled their architecture on a CNN with only 3 CONV
layers and no FC layers, which limits its suitability for
many of today's real-life applications.
A second-generation of CNP [127] architecture has been
proposed in [142] by designing a stream processor sys-
tem. The proposed design replaces the dedicated hardware
convolver in CNP with multiple parallel vector processing
units, named as ALUs, laid out in a 2D grid. Each ALU
is composed of four local routers, one global router, and a
streaming operator. The local routers are used to stream data
to/from the neighbors. Streaming data to and from global
data line is done through the global router. The streaming
operators in the ALU are fully pipelined to produce a
result per clock cycle as described in [127] with the use of
Q8.8 coding to represent FMs and weights. The proposed
system also uses a multi-port direct memory access (DMA)
streaming engine to allow individual streams of data to
operate seamlessly within processing blocks. The results
show that the proposed stream processor system can run
small CNNs at up to 30 fps while consuming about 15 Watts.
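The Q8.8 coding used by CNP and this stream processor can be sketched in a few lines of Python; the helper names and the saturation behavior shown here are illustrative assumptions.

```python
# A small sketch of the Q8.8 fixed-point coding mentioned above: 8 integer
# bits and 8 fractional bits packed into a signed 16-bit word.

SCALE = 1 << 8  # 2^8 fractional resolution

def to_q88(x):
    """Quantize a float to Q8.8, saturating to the signed 16-bit range."""
    v = int(round(x * SCALE))
    return max(-(1 << 15), min((1 << 15) - 1, v))

def from_q88(q):
    return q / SCALE

def q88_mul(a, b):
    """Multiply two Q8.8 values; the wider product is rescaled back to Q8.8."""
    return (a * b) >> 8

w, x = to_q88(0.75), to_q88(-2.5)
print(from_q88(q88_mul(w, x)))   # -1.875
```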
An improved version of CNP architectures given in [127],
[142] was presented in [143] and referred to as neuFlow.
Particularly, neuFlow has replaced the 2D grid of ALUs
with a 2D grid of processing tiles (PTs). The proposed
architecture contains a 2D grid of PTs, a control unit,
and a smart DMA module, as shown in Fig. 10. Each
PT consists of local operators and a routing multiplexer
(MUX). The top three PTs have been implemented to
perform MAC operation. Thus, they can be used to perform
2D convolution, simple dot-products, and spatial pooling.
General-purpose operations, such as dividing and squaring,
have been implemented at the middle three PTs.
FIGURE 10. The Architecture of neuFlow [143].
Therefore,
the middle row of neuFlow can be used for normalization.
Finally, neuFlow’s bottom PTs row implements non-linear
operations. Moreover, each operator employed input and
output FIFOs to stall its pipeline when required. On the
other hand, PT’s MUX is used to connect its local operators
with the neighboring PT’s streaming operators and off-chip
memory instead of the used local routers and global router
discussed in [142].
NeuFlow uses a dataflow compiler, named luaFlow, to
translate a high-level flow-graph representation of CNNs
in Torch5 [144] into HDL scripts with different levels of
parallelism. In addition, luaFlow produces a binary code
configuration file and holds it in the embedded control
unit. Thereafter, the control unit configures the 2D grid
of PTs (connections and streaming operator) and the DMA
ports through run-time configuration buses. A smart memory
module has been designed to support multiple asynchronous
accesses of off-chip memory through its reconfigurable
ports. By targeting the larger Xilinx Virtex6 VLX240T
FPGA, neuFlow achieved 147 GOPS at 10 Watts for street
scene parsing CNN in [145] with the use of 16 bits to
represent FMs and weights.
Peemen et al. [146] utilized the flexible off-chip memory
hierarchy method to design a configurable memory-centric
accelerator template for a variety of models of CNNs. This
accelerator exploits data reuse in complex access patterns to
reduce off-chip memory communication, which minimizes
the bandwidth requirements. The memory-centric accelera-
tor maximizes the efficiency of on-chip memories for better
data locality using loop transformation (to optimize the
tiling parameters) and block RAM (BRAM)-based multi-
bank on-chip buffers [147]. At the same time, it minimizes
the size of FPGA on-chip memories to optimize energy
and area usage, which are key requirements for embedded
platforms.
The memory-centric accelerator uses a SIMD cluster
of MAC PEs with flexible reuse buffers to accelerate the
CONV layer. The acceleration template has been imple-
mented on Virtex6 FPGAs. In addition, a MicroBlaze pro-
cessor has been utilized to configure and communicate with
the accelerator via FIFO-based fast simplex link (FSL). The
proposed accelerator has been analyzed for a CNN vision
task of size 2.74 GMAC and the results show that the
memory-centric accelerator is 11× faster than a standard
implementation with similar FPGA resources.
Neural network next (nn-X) [148] is a real-time system-
on-chip (SoC) computing system for deep learning networks
on mobile devices. The architecture of nn-X consists of a
host processor, a co-processor, and external memory. The
co-processor accelerates the learning networks by paral-
lelizing their operations across arrays of configurable
processing elements referred to as collections. Each collec-
tion contains one convolution engine, one pooling module,
and one non-linear operator. The CONV engine accelerates
the CONV operation by fully pipelining the incoming data
with the use of cache memories. The collections are able
to communicate with one another using the collection route
component to achieve cascaded pipelining, which results in
reducing accesses to external memory. The data transfer
between the collections and the external memory is ac-
complished through the co-processor's full-duplex memory
router, which provides independent data streams. The nn-
X has been prototyped on Xilinx ZC706 which contains
Zynq XC7Z045, two ARM Cortex-A9 processors, and 1 GB
DDR3. Eight collections have been employed to achieve
large parallelism. The results for face recognition model
in [149] show that nn-X is 115× faster than the two
embedded ARM processors.
Zhang et al. [55] proposed a roofline-based model to
accelerate convolutional neural networks on FPGAs. The
roofline model is an intuitive visual performance model used
to relate the attainable performance to the peak performance
that can be provided by the hardware platform and the
off-chip memory traffic [150]. The focus in their work
is primarily on accelerating the convolutional layers as
they consume more than 90% of the computational time
during the prediction process [77]. In doing so, the authors
optimized both the computation operations and the memory
access operations in convolutional layers. They considered a
CNN application composed of five convolutional layers that
won the ImageNet competition in 2012 [28]. The proposed
accelerator uses polyhedral-based data dependence analy-
sis [151] to fully utilize all FPGA computational resources
through loop unrolling, loop pipelining, and loop tile size
enumeration. Note that loop unrolling maximizes the par-
allel computation of CONV MAC operations. On the other
hand, local memory promotion and loop transformation are
used to reduce redundant communication operations and to
maximize the data sharing/reuse, respectively.
Subsequently, the roofline performance model is used
to identify the optimal design from all possible solutions
in the design space. Specifically, the authors model all
possible legal designs delivered from the polyhedral analysis
in roofline to find the optimal unrolling factor ⟨Tm, Tn⟩ for
every convolutional layer, where Tm and Tn are the tile sizes
for the output FMs and input FMs, respectively.
FIGURE 11. Zhang et al. [55] Accelerator Architecture.
However,
designing a CNN accelerator with different unrolling factors
to each convolutional layer is challenging. Therefore, the
proposed architecture enumerates all possible valid designs
to find uniform cross-layer unrolling factors. Thereafter, the
hardware accelerator is implemented based on the cross-
layer optimal unrolling factors.
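As a rough illustration of this roofline-guided search, the Python sketch below scores candidate ⟨Tm, Tn⟩ pairs as the minimum of a computation roof and a bandwidth roof; the resource, clock, and bandwidth numbers are placeholders rather than the figures used in [55].

```python
# Hedged sketch of a roofline-style search for uniform unrolling factors
# <Tm, Tn>, in the spirit of [55]. The constants below are placeholder
# assumptions, not the actual VC707 figures.

PEAK_GFLOPS_PER_UNIT = 0.2   # per multiply-add unit at the design clock (assumed)
BANDWIDTH_GBPS = 4.5         # off-chip bandwidth (assumed)
MAX_UNITS = 448              # multiply-add budget (assumed)

def attainable(tm, tn, ctc_ratio):
    """Roofline: attainable perf = min(computation roof, CTC ratio * bandwidth)."""
    comp_roof = tm * tn * PEAK_GFLOPS_PER_UNIT
    return min(comp_roof, ctc_ratio * BANDWIDTH_GBPS)

def best_design(ctc_ratio):
    candidates = [(tm, tn) for tm in range(1, 65) for tn in range(1, 65)
                  if tm * tn <= MAX_UNITS]
    return max(candidates, key=lambda p: attainable(*p, ctc_ratio))

# ctc_ratio would itself depend on the tiling in the real model; here it is fixed.
print(best_design(ctc_ratio=10.0))
```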
The proposed accelerator, composed of a computational
engine and a memory sub-system, is depicted in Fig. 11.
The computation engine is designed as Tm duplicated tree-
shaped poly structures with Tn inputs from the input FMs,
Tn inputs from the weights, and one input from the bias.
On the other hand, the memory sub-system is implemented
as four sets of on-chip buffers; two sets to store the input
FMs and weights, each with Tn buffer banks, and two
buffer sets of Tm independent banks for storing the output
FMs. To overlap data transfer with computation, on-chip
buffers are operated in a ping-pong manner. In addition,
two independent channels are implemented for load and off-
load operations to increase the bandwidth utilization. More-
over, MicroBlaze processor is used to send configuration
parameters and commands for the accelerator over AXI4lite
bus. The CNN accelerator communicates with external data
transfer engines through FIFO interfaces, where the data
transfer engines are used to access DDR3 DRAM memory
through AXI4 bus.
The accelerator is designed using Vivado 2013.4 high
level synthesis tool and implemented on Xilinx VC707
FPGA board clocked at 100 MHz. The experimental results
depict that the proposed implementation achieves a peak
performance of 61.62 GFLOPS as well as a 17.42× speedup
over the software implementation on Intel Xeon CPU E5-
2430 at 2.20 GHz with 15 MB cache and 16 threads. In
addition to this, the results show that the proposed FPGA
architecture is 24.6× more energy-efficient than the software
implementation as the total power consumption is only 18.6
Watts. The proposed implementation has some limitations
such as designing the accelerator with new cross-layer
unrolling factors for different architectures of CNNs. Fur-
thermore, using the CNN accelerator with uniform unrolling
factors might be sub-optimal for some CONV layers, which
affects the overall performance.
FIGURE 12. Top-Level Architecture of Microsoft CNN Accelerator [152].
In 2014, Microsoft research team of Catapult project
integrated FPGA boards into data center applications to
successfully achieve 2× speedup for Bing Ranking (the
large-scale search engine) [67]. A year later, Ovtcharov et
al. [152] at Microsoft Research utilized Catapult hardware
infrastructure, a dual-socket Xeon server equipped with
Stratix-V GSMD5 FPGA, to design a specialized hardware
for accelerating the forward propagation of deep CNNs in
a power-constrained data center.
The top-level architecture of the proposed CNN accelera-
tor is shown in Fig. 12. Multi-banked input buffer and kernel
weight buffer are used to provide an efficient buffering
scheme of FMs and weights, respectively. To minimize
the off-chip memory traffic, a specialized network on-chip
has been designed to re-distribute the output FMs on the
multi-banked input buffer instead of transferring them to
the external memory. The 3D convolution operations (such
as the dot-product) and other CNN operations are indepen-
dently performed using spatially distributed scalable vectors
of PEs. The controller engine is responsible for streaming
and data delivery of multi-banked input buffer and kernel
weight buffer data to each of the PE vectors. In addition, it
supports configuring multiple CNN layers at run-time. The
results show that the proposed design is able to classify 134
images/sec, while consuming about 25 Watts, for AlexNet
model on ImageNet-1K dataset [28], which is 3× better
than the published throughput results for the Roofline-based
FPGA Accelerator [55]. The authors mentioned that using
top-of-the-line FPGAs such as Arria 10 GX 1150 improves
the throughput to around 233 images/sec.
Qiu et al. [98] proposed an FPGA design to accelerate
CNNs for a large-scale image classification challenge on
embedded systems. The focus was on accelerating both
CONV and FC layers, since they are considered as the
most computational-centric and the most memory-centric
operations in CNNs, respectively. The proposed accelerator
reduces the resource consumption using specific design
of convolver hardware module. In addition, the authors
applied singular value decomposition (SVD) to the weight
matrix of FC layer in order to reduce memory footprint at
this layer [101]. To further reduce memory footprint and
bandwidth requirement of CNN, they proposed a dynamic-
precision data quantization flow component. This compo-
nent is responsible for finding the optimal fractional length
for weights in each layer as well as the optimal fractional
length for FMs in adjacent layers, while achieving minimal
accuracy loss. Then, it converts the floating-point numbers
representing weights and FMs into fixed-point numbers.
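A minimal Python sketch of such a dynamic-precision quantization step is given below; the error metric (mean absolute error) and the exhaustive search over fractional lengths are simplifying assumptions, not the exact flow of [98].

```python
# A minimal sketch of choosing a per-layer fractional length for fixed-point
# weights, in the spirit of the dynamic-precision quantization flow described
# above. The search criterion here (mean absolute error) is an assumption.

import numpy as np

def quantize(values, frac_bits, total_bits=16):
    scale = 2.0 ** frac_bits
    qmin, qmax = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(values * scale), qmin, qmax)
    return q / scale

def best_frac_len(values, total_bits=16):
    """Pick the fractional length that minimizes the quantization error."""
    errors = {fl: np.abs(values - quantize(values, fl, total_bits)).mean()
              for fl in range(total_bits)}
    return min(errors, key=errors.get)

weights = np.random.randn(1000) * 0.05        # toy layer weights
print("chosen fractional length:", best_frac_len(weights))
```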
In addition, the authors proposed a data arrangement
scheme that maximizes the burst length of each transaction
to the external memory to accelerate CONV and FC layers,
as well as to avoid unnecessary access latency. Note that
maximizing the DRAM burst length raises the effective
DRAM bandwidth [55], [153].
The proposed architecture consists of a processing system
(CPU) and programmable logic (FPGA). CNN computations
are performed through special design of processing element
modules in FPGA. The main modules in the processing
element are convolver complex, max-pooling, non-linearity,
data shift, bias shift, and adder tree, as shown in Fig. 13.
The convolver complex is designed as a classical line
buffer [154], as shown in Fig. 14, to achieve convolution
operations as well as to compute FC layer multiplication of
matrix-vector. The pooling layer implemented in the max-
pooling module is used to output the maximum value in the
input data stream with a window of size 2. The activation
function of CNN is applied to the input data stream using the
non-linearity module. The adder tree accumulates the partial
sums generated from the convolvers. Finally, data shift
and bias shift modules are responsible for accomplishing
dynamic quantization.
The proposed embedded FPGA platform has been im-
plemented using VGG-16-SVD network with 16-bit fixed-
point numbers on Zynq XC7Z045 platform.
FIGURE 13. Processing Element Module of Qiu et al. [98] Embedded
Accelerator Architecture.
FIGURE 14. Convolver Complex Design of Qiu et al. [98] Embedded
Accelerator Architecture.
The results
demonstrate that applying SVD technique reduces memory
footprint of FC layer by 85.8% with a compression rate of
7.04% while introducing an accuracy loss of only 0.04%.
Finally, the overall performance of the proposed acceler-
ator reported is 136.97 GOPS under 150 MHz working
frequency with the top-5 accuracy of 86.66% and a total
power consumption of 9.63 Watts.
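The SVD-based compression of the FC weight matrix can be illustrated with the following Python sketch; the matrix size and rank are toy values chosen only to show how the parameter count shrinks.

```python
# Illustrative sketch of SVD-based compression of an FC weight matrix, as
# applied by Qiu et al. [98] to reduce the memory footprint. The rank and
# matrix size are toy values.

import numpy as np

def svd_compress(W, rank):
    """Approximate W (m x n) by U_r (m x r) times V_r (r x n)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]      # fold singular values into U
    V_r = Vt[:rank, :]
    return U_r, V_r

W = np.random.randn(1024, 1024).astype(np.float32)
U_r, V_r = svd_compress(W, rank=64)
original = W.size
compressed = U_r.size + V_r.size
# 2 * 1024 * 64 / 1024^2 = 0.125 of the original parameters
print("parameter ratio: %.3f" % (compressed / original))
```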
DeepBurning [155] is an FPGA-based neural network
(NN) design automation tool. It allows for building learning
accelerators for specific NN with optimized performance
and custom design parameters configuration using a pre-
constructed register transfer level (RTL) module library. The
RTL library holds the hardware descriptive scripts for NN
reconfigurable components as well as their configuration
scripts. In addition, it contains other RTL building blocks
for logical and arithmetic operations such as the connection
box (used to exchange data between NN layers as well as to
approximate the division operation) and approximate look-
up table (LUT) (used to simplify a function or operation to
allow it to be mapped into hardware).
In order to design an optimized hardware, DeepBurning
compresses the passed NN model to the greatest extent
using temporal and spatial folding, which also helps in
satisfying the resource constraints and minimizing the re-
quired hardware modules. DeepBurning not only generates
the hardware description for neural network scripts, but
also analyzes the complex access pattern and data local-
ity using an integrated compiler to generate a run-time
control flow which provides an energy-efficient and better
data reuse implementation. In addition, the DeepBurning
compiler investigates the accelerator on-chip memory size
and throughput to properly tile and partition the NN weights
and feature data layouts. Moreover, DeepBurning uses the
address flow component to automatically fetch and store
off-chip memory and on-chip memory data. The authors
compared the performance of DeepBurning with that in [55],
considering AlexNet CNN model, as they both operate
at 100 MHz. They considered a high-budget resource-
constrained DeepBurning configuration on a Zynq-7045 device. The results
show that DeepBurning is 1.13× slower but 1.45× more
energy-efficient.
An OpenCL-based optimization framework to accelerate
large-scale convolutional neural network models was pro-
posed by Suda et al. [80]. They found that the number of
performed CONV MAC operations in parallel (NCONV),
SIMD vectorization factor (SCONV), normalization layer
loop unrolling factor (NNORM), the number of parallel
pooling outputs in one cycle (NPOOL), and the number of
parallel FC MAC operations (NFC) are the key variables
that determine the parallelism of the design. Subsequently,
they analytically and empirically modeled the execution
time for each layer as a function of the above mentioned
variables. Then, genetic algorithm was used to explore the
design space for finding the optimal combination of the key
design variables considering the resources constraints.
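The flavor of this model-then-search step is illustrated by the Python sketch below, which uses a simple placeholder timing model and a brute-force search instead of the per-layer models and genetic algorithm employed in [80]; all constants are assumptions.

```python
# Sketch of the kind of analytical model and design space search described
# above. The timing model is a simple placeholder (the paper derives its own
# per-layer models and explores them with a genetic algorithm).

from itertools import product

DSP_BUDGET = 256   # assumed resource budget

def conv_time(macs, n_conv, s_conv):
    # time ~ total MACs / (parallel MACs * SIMD factor)
    return macs / float(n_conv * s_conv)

def fc_time(macs, n_fc):
    return macs / float(n_fc)

def total_time(cfg, conv_macs=1.2e9, fc_macs=6.0e7):
    n_conv, s_conv, n_fc = cfg
    return conv_time(conv_macs, n_conv, s_conv) + fc_time(fc_macs, n_fc)

def search():
    space = product([2, 4, 8, 16, 32], [2, 4, 8, 16], [2, 4, 8, 16, 32])
    feasible = [c for c in space if c[0] * c[1] + c[2] <= DSP_BUDGET]
    return min(feasible, key=total_time)

print(search())   # the chosen (NCONV, SCONV, NFC) under this toy model
```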
The authors implemented the scalable CONV block in
a similar fashion to that in [138] as a matrix multipli-
cation by flattening and on-the-fly rearrangement of the
feature data. The OpenCL software has been utilized in
their work due to its parallel programming model as well
as its ability to integrate the compiled RTL design with
external memory interfacing IPs [156], which uses memory
coalescing technique with complex load and store units. In
addition, it has optimized matrix multiplication and CPU-
FPGA communication libraries [157], [158].
The framework is used on both VGG-16 and AlexNet
CNN models which are implemented on P395-D8 [159]
and DE5-Net [160] FPGA boards with fixed-point opera-
tions according to their precision study. They compared the
proposed implementation with a 3.3 GHz Intel Core i5-4590 CPU
implementation that uses Caffe tool [58] with ATLAS [161]
optimized library for matrix/vector operations. The results
show that the OpenCL optimized framework on P395-
D8 achieved 5.5× (117.8 GOPS) and 9.5× (72.4 GOPS)
speedups for VGG-16 and AlexNet models, respectively.
On the other hand, DE5-Net FPGA achieved lower throughput
speedups than the P395-D8 (2.2× (47.5 GOPS) for VGG-16,
and 4.2× (31.8 GOPS) for AlexNet) as it has 7.67× fewer
DSPs than what is available on P395-D8.
Zhang et al. [153], [162] analyzed the transformation
of CONV and FC layers to regular matrix multiplication
presented in prior work [98]. For VGG-16 model, they found
that such transformation necessitates up to 25× duplication
of input FMs. To address this problem and improve the
bandwidth utilization, they designed a uniformed matrix
multiplication kernel that uses either input-major mapping
(IMM) or weight-major mapping (WMM) techniques while
computing FC layer. In IMM, the designed kernel batches
a group of different input FMs together, and then performs
the matrix multiplication. IMM technique improves the data
reuse of FC weights. On the other hand, the designed kernel
with WMM technique makes use of the fact that the FC
layer is communication-bound in which the weight matrix
is much larger than the input FM matrix. In particular,
it loads input FM matrix to a weight buffer and loads
weight matrix to input FM buffer. Subsequently, a regular
matrix multiplication is performed on these matrices. As a
result, WMM may allow for a higher data reuse than IMM,
especially for input FMs that can be reused multiple times
considering the limited hardware resources.
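The following Python sketch contrasts IMM and WMM at the level of the underlying matrix multiplication only: both mappings compute the same result, and the hardware difference lies in which operand is batched into which buffer; the shapes are toy values.

```python
# A small sketch contrasting input-major mapping (IMM) and weight-major
# mapping (WMM) for the FC layer, following the description above: both reduce
# to the same matrix multiplication, differing in which operand is treated as
# the batched/streamed one. Shapes are toy values.

import numpy as np

def fc_imm(inputs_batch, W):
    """IMM: batch input vectors together, then multiply by the weights."""
    # inputs_batch: (batch, in_dim), W: (in_dim, out_dim)
    return inputs_batch @ W

def fc_wmm(inputs_batch, W):
    """WMM: treat the (large) weight matrix as the left-hand operand instead."""
    # Same arithmetic, written with the weights on the left.
    return (W.T @ inputs_batch.T).T

x = np.random.randn(4, 512)       # small batch of input FMs
W = np.random.randn(512, 1000)    # FC weights (much larger than the inputs)
print(np.allclose(fc_imm(x, W), fc_wmm(x, W)))   # True
```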
For the above, the roofline model was applied to identify
the optimal mapping technique under different batch sizes
and data precisions. The results demonstrate that WMM is
better than IMM in term of data reuse and bandwidth uti-
lization, especially in small batch sizes which is required for
real-time inference. Hence, the same matrix multiplication
kernel is utilized for the computation of both CONV and
FC layers, but with the use of IMM in CONV layer and
WMM in FC layer. Based on this, the authors proposed
a software/hardware co-design library, which they named
Caffeine, to accelerate CNNs on FPGAs.
With an easy-to-use developed tool, Caffeine aids in
automatically choosing the best hardware parameters, using
the model files from Caffe and FPGA device specifications
obtained from the user. Caffeine FPGA engine uses a high-
level synthesis (HLS)-based systolic-like architecture to
implement matrix multiplication kernel. It allows changing
parameters such as number of PEs, precision, and FM size.
Caffeine further maximizes the FPGA computing capability
by optimizing multi-level data parallelism discussed in [55]
and pipeline parallelism using polyhedral-based optimiza-
tion framework given in [163]. Caffeine framework also
handles the weights and biases reorganization in off-chip
DRAM to maximize the underlying memory bandwidth
utilization. In addition, the double-buffering technique is
employed to prefetch the next data tile for each PE. Caffeine
has been evaluated by implementing AlexNet and VGG-
16 CNNs on Ultrascale KU060 (20nm and 200 MHz)
and on Virtex7 690T (28nm and 150 MHz) considering
different precisions. The VGG-16 implementation with 16-
bit fixed-point on Ultrascale KU060 and Virtex7 690T
provided 43.5× and 65× overall throughput enhancement,
respectively, compared to implementation on a two-socket
server, each socket with a 6-core Intel CPU (E5-2609 at 1.9 GHz).
A special case of dataflow, referred to as synchronous
dataflow (SDF) [164], is a paradigm of computation that
allows for representing a computing system as a stream-
ing problem. In this way, SDF model can represent the
hardware implementation of CNNs using linear algebra and
directed SDF graph (SDFG). Each node of SDFG represents
a hardware building block that can immediately start its
computation as soon as the data are available through its
input arcs. Such representation of CNN model offers a
fast design space exploration. Venieris and Bouganis [165]
employed SDF model to optimize the mapping of CNNs
onto FPGAs based on HLS.
In particular, the proposed fpgaConvNet framework in
[165] takes as input a high-level script programmed by DL
expert describing the CNN model, along with specifications
of the targeted FPGA platform.
FIGURE 15. SDF Graph Partitioning [165].
Thereafter, it parses the
input script through a developed domain-specific language
(DSL) processor to model the CNN in the form of a directed
acyclic graph (DAG) where each node corresponds to a
CNN layer. Then, the DAG-based CNN is transformed into
an SDFG representation and modeled as a topology matrix.
The topology matrix contains the number of incoming
parallel streams, the width of each data stream, and the
production or consumption rates at each node. In addition,
the DSL processor extracts information about the platform-
specific resource constraints.
Unlike other attempts, instead of exploring the design
space for the optimal parameters of loop unrolling and tiling,
fpgaConvNet explores the design space of the topology
matrix components while considering the resource con-
straints. In doing so, fpgaConvNet performs graph parti-
tioning, coarse-grained folding, and fine-grained folding.
The graph partitioning splits the original SDFG into sub-
graphs and each subgraph is then mapped to a distinct
bitstream as shown in Fig. 15. Note that the proposed multi-
bitstream architecture might have multiple CONV layer
processors (CLPs), as in the provided example. This way,
on-chip RAM is used for intermediate results and data reuse
within the subgraph, while access to off-chip memory is
minimized and limited to the input and output streams of
the subgraph. However, this scheme adds reconfiguration
penalty due to the need for reconfiguring the FPGA when
the data flows between adjacent subgraphs. To amortize
this overhead, several input data streams are processed in
a pipelined manner.
Thereafter, each bitstream architecture is optimized using
coarse-grained folding and fine-grained folding. In coarse-
grain folding, CONV, pooling, non-linear, and other major
operations of each layer are unrolled to provide the highest
possible throughput by having several parallel units of each
operation. The fine-grain folding controls the unrolling and
pipelining of the dot-product operations inside CONV and
average pooling units. Instead of fully unrolling the imple-
mentation of the dot-product, which produces one dot-product
per cycle at the cost of a large number of multipliers and
adders, fpgaConvNet uses a smaller number of MAC units
and schedules the execution of different operations using
time-multiplexing. A trade-off between the performance
and the required hardware resources can be achieved by
changing the unroll factor and the degree of multiplex-
ing. Therefore, fpgaConvNet employed simulated annealing
[166] to find the optimal partitioning points and folding
factors. Finally, fpgaConvNet uses optimal components to
derive the configuration of PEs and buffers, and generates
a synthesizable Vivado HLS hardware design.
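The role of simulated annealing in this step can be illustrated with the toy Python sketch below; the cost function is a stand-in for fpgaConvNet's actual latency and resource models, and the cooling schedule is an arbitrary choice.

```python
# A toy simulated annealing loop of the kind fpgaConvNet uses to pick folding
# factors, shown only to illustrate the search procedure. The cost function
# below is a stand-in, not fpgaConvNet's latency/resource model.

import math
import random

random.seed(0)

def cost(fold):
    # Stand-in objective: latency grows with folding, but folds below 4
    # are treated as infeasible (resource budget exceeded) and penalized.
    return fold * 10 if fold >= 4 else 1000

def anneal(start=32, steps=500, t0=50.0):
    current, best = start, start
    for i in range(steps):
        t = t0 * (1 - i / steps) + 1e-3            # cooling schedule
        candidate = max(1, current + random.choice([-1, 1]))
        delta = cost(candidate) - cost(current)
        if delta < 0 or random.random() < math.exp(-delta / t):
            current = candidate
        if cost(current) < cost(best):
            best = current
    return best

print(anneal())   # converges towards the smallest feasible fold (4)
```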
fpgaConvNet framework has been evaluated by mapping
LeNet-5 and scene labelling [167] small CNN models with
Q8.8 fixed-point representation onto a Zynq-7000 XC7Z020
FPGA platform working at 100 MHz. In mapping LeNet-5,
fpgaConvNet achieves up to 1.62× the performance density
of CNP [127]. Compared to Tegra K1 GPU implementation
of scene labelling CNN, fpgaConvNet surpasses Tegra K1’s
power efficiency by 1.05×.
Ma et al. [78] proposed a Python-based modularized RTL
compiler to accelerate CNNs by employing loop unrolling
optimization [55], [79] for CONV layer operations. A de-
tailed review article of this work has been recently published
and referred to as ALAMO [168]. The proposed compiler
integrates both the RTL finer level optimization and the
flexibility of HLS to generate efficient Verilog parameterized
RTL scripts for ASIC or FPGA platform under the available
number of parallel computing resources (i.e., the number
of multipliers (Nm)). If Nm is greater than the number of
input FMs (Nif), the proposed compiler fully unrolls Loop-
3 (Nif; refer to subsection II-A1 for more details) while it
partially unrolls Loop-4 (Nof) to exploit the data reuse of
shared features among Nm/Nif output FMs. Otherwise, it
partially unrolls Loop-3, which results in Nif/Nm repeated
slidings of the kernel window. On the other hand, Loop-2 (X × Y)
is serially computed after Loop-1 (K) to minimize the
number of partial sums.
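The unrolling rule described above can be summarized in a few lines of Python; the function below mirrors the textual description only, and is not the compiler's actual code.

```python
# Sketch of the multiplier-driven unrolling decision described above for the
# ALAMO compiler: given Nm multipliers, either Loop-3 (Nif) is fully unrolled
# and Loop-4 (Nof) partially unrolled, or Loop-3 itself is partially unrolled.

def unroll_plan(Nm, Nif, Nof):
    if Nm >= Nif:
        parallel_ofm = min(Nof, Nm // Nif)     # share input features across output FMs
        return {"loop3_unroll": Nif, "loop4_unroll": parallel_ofm,
                "kernel_window_passes": 1}
    else:
        passes = -(-Nif // Nm)                  # ceil(Nif / Nm) repeated window slidings
        return {"loop3_unroll": Nm, "loop4_unroll": 1,
                "kernel_window_passes": passes}

print(unroll_plan(Nm=256, Nif=96, Nof=256))   # Loop-3 fully unrolled, 2 output FMs in parallel
print(unroll_plan(Nm=64, Nif=96, Nof=256))    # Loop-3 partially unrolled over 2 passes
```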
The overall modules of the proposed CNN accelerator are
shown in Fig. 16. The controller is responsible for directing
and ensuring in-order computation of CNN modules for
each layer. The data routers oversee the selection of data
read and data write of two adjacent modules as well as the
assignment of buffer outputs to shared or pool multipliers
of the multiplier bank. The feature buffers hold the FMs
using on-chip RAMs. The weight buffers are used to ensure
the availability of CONV and FC layers’ weights before
their computation as well as to overlap the transfer of FC
layer weights with its computation. The CONV module
consists of control logic, groups of adder trees, and ReLU
components. The control logic component parametrizes the
loop unrolling factors based on the configuration of each
layer (Nif, Nof, X, Y, and K). The CONV module
contains Nm/Nif adders to sum Nif parallel multiplier
results and accumulate them. Moreover, the adder trees can
be shared by layers with identical Nif to act as one single
module. The ReLU component checks the input pixel sign
bit to either output zero or the data pixel itself. The POOL
module contains accumulators or comparators to perform
average or maximum operation, respectively. The NORM
module maintains the required components to perform the
operations of local response normalization such as square,
non-linear (using look-up table), and multiplication oper-
ations.
FIGURE 16. ALAMO Overall Acceleration Modules [78].
Finally, the FC module shares the multiplier bank
module with the CONV module to perform the matrix-
vector multiplication (MVM).
ALAMO architecture permits the output pixels to be only
stored in the feature buffers, which makes ALAMO suitable
for CNNs with only small intermediate data volumes. The
proposed RTL compiler has been tested by accelerating
two CNN models; AlexNet and NiN [169]. The generated
parameterized RTL scripts for AlexNet and NiN are synthe-
sized using Altera Quartus synthesis tool and implemented
on DE5-Net FPGA board. The experimental results for
AlexNet model are compared with the results for OpenCL-
based design [80] as both use the same FPGA board with
similar hardware resources for AlexNet. ALAMO achieved
1.9× and 1.3× improvements in throughput and power
consumption, respectively. Moreover, the overall throughput
of the NiN model is 1.03× better than that of AlexNet. This is
because NiN has more CONV layers and many of them
have the same Nif .
Liu et al. [170] proposed a parallel framework for FPGA-
based CNN accelerators that exploits four levels of par-
allelism; task level, layer level, loop level, and operator
level. Task-level parallelism involves executing multiple im-
age prediction tasks simultaneously. Layer-level parallelism
exploits pipelining across layers to enable parallel execution
of all layers with different images. Loop-level parallelism
utilizes loop unrolling in performing convolutions and this
can be achieved either through intra-output or inter-output
parallelism. Finally, operator-level parallelism is achieved
by parallelizing the k × k MAC operations needed for
the convolution operation in convolutional layers or the n MACs
needed for inner-product computation in fully connected
layers. Fig. 17 shows the parallel framework exploiting these
four levels of parallelism.
The authors have used 16-bit fixed-point format for rep-
resenting pixels in input feature maps and output feature
maps. However, they have used 32 bits for intermediate
results which get truncated to 16 bits. In addition, they
have used 8 bits for representing kernels and weights. They
have presented a systematic methodology for design space
exploration to find the optimal solution that maximizes
the throughput of an FPGA-based accelerator under given
FPGA constraints such as on-chip memory, computational
resources, external memory bandwidth, and clock frequency.
FIGURE 17. Parallel Framework Exploiting Four Levels of Parallelism [170].
The proposed technique has been evaluated by imple-
menting three CNN accelerators on the VC709 board for
LeNet, AlexNet, and VGG-S. It has achieved a throughput
of 424.7 GOPS, 445.6 GOPS, and 473.4 GOPS for LeNet,
AlexNet, and VGG-S accelerators, respectively. In addition,
the performance has been compared with MatConvNet tool
running the CNN models on Intel Core i7-4790K CPU (4.0
GHz) and NVIDIA GTX-770 GPU (1,536 CUDA cores, 2
GB GDDR5, 224.3 GB/s memory bandwidth). Compared
to the CPU implementations, the accelerators for LeNet,
AlexNet, and VGG-S achieved 14.84×, 6.96×, and 4.79×
speedups in performance, respectively, and 51.84×, 24.69×, and
16.46× improvements in power efficiency, respectively. Compared to the
GPU implementations, the accelerators achieved better per-
formance in the small-scale network LeNet (3.17×), com-
parable performance in the medium-scale network AlexNet
(0.96×), and worse performance in the large-scale network
VGG-S (0.56×). However, the accelerators achieved higher
power efficiency than the GPU implementations in all three
networks with 28.3× for LeNet, 8.7× for AlexNet, and
4.98× for VGG-S.
FP-DNN [171] is an end-to-end framework that auto-
matically generates optimized FPGA-based implementations
of deep neural networks (DNNs) using an RTL-HLS hy-
brid library. The FP-DNN compiler, programmed using C++ and
OpenCL, takes TensorFlow symbolic descriptions [172] of
DNNs, and then performs model inference through the use
of model mapper, software generator, and hardware gener-
ator modules. The model mapper extracts the topological
structure and layers configurations of DNN model from the
TensorFlow descriptions and generates an execution graph
for the target model. The execution graph shows layer-by-
layer operations and read/write data transactions.
FP-DNN compiler allocates off-chip DRAM data buffers
to store intermediate data, weights, and model parame-
ters and configurations. The model mapper maximizes the
storage resource reuse through minimizing the number of
required physical buffers. Specifically, it formulates the
data reuse problem as a graph coloring problem [173],
and then the left-edge algorithm is applied to generate
kernel configuration and kernel schedule. Subsequently, the
software generator uses the kernel schedule to generate a
host C++ program which initializes the model, manages
the data buffers, and schedules the kernel execution. On
the other hand, the hardware generator uses the kernel
configuration and the execution graph to generate the FPGA
hardware codes by instantiating the corresponding optimized
templates from an expandable RTL-HLS hybrid library.
Each template is comprised of Verilog-based computational
engine and OpenCL-based control logics engine.
The architecture of the proposed FPGA-based accelerator
consists of matrix multiplication and data arranger modules.
Matrix multiplication module is a hand-written Verilog code
that is designed and optimized based on the hardware
constraints of Altera Stratix-V GSMD5 FPGA. It applies
tiling and ping-pong double buffers techniques to improve
the throughput. On the other hand, data arranger is an
OpenCL-based module that is responsible for mapping the
computational part of a layer to matrix multiplication as well
as performing data communication with off-chip memory
and matrix multiplication module. Mapping DNNs compu-
tational operations to matrix multiplication has been widely
applied in prior studies [80], [132], [174]. FP-DNN maps
FC layer to matrix multiplication by batching input vectors
together. Before model deployment, FMs and weights are
rearranged in DRAM using the channel-major scheme to
optimize the communication between the accelerator and
off-chip DRAM. On the other hand, both floating-point
and fixed-point representations have been supported for
implementation, and they can be adjusted by the user.
The proposed RTL-HLS hybrid framework has been eval-
uated by accelerating VGG-19, LSTM-LM [175], ResNet-
152 DNNs on Stratix-V GSMD5 FPGA. Note that this is
the first work that implements ResNet-152 on FPGA. The
experimental results demonstrated that the speedups of FP-
DNN for 16-bit fixed-point implementations are about 1.9×
to 3.06× compared with a server that includes 2 processors,
each an 8-core Intel Xeon E5-2650v2 at 2.6 GHz.
In line with the current trends towards compressed neural
networks, with dramatically reduced weights and activations
bit-width using 1-bit or 2-bit quantization [176]–[180],
Umuroglu et al. [181] conducted a set of experiments to
estimate the trade-off between the network size and preci-
sion using the roofline model. They found that binarized
neural networks (BNNs) [180] require 2 to 11 times more
operations and parameters than an 8-bit fixed-point CNN
to achieve a comparable accuracy on MNIST [71] dataset.
However, the BNN is found to deliver 16× higher performance
than the fixed-point network.
Subsequently, the authors proposed a framework, referred
to as FINN [181], that maps a trained BNN onto FPGA.
FINN generates a synthesizable C++ network description
of a flexible heterogeneous streaming architecture. The
architecture consists of pipelined compute engines that
communicate via on-chip data streams. Each BNN layer has
been implemented using dedicated compute engines with
1-bit values for weights and FMs; +1 and -1 are used to
represent a set bit and unset bit, respectively.
The authors have optimized accumulation, batch normal-
ization (batchnorm), activation, and pooling operations of
BNNs. In particular, the accumulation of a binary dot-
product has been implemented as a counter of set bits
(popcount operation). The popcount-accumulate reduces the
number of required look-up tables (LUTs) and flip-flops
(FFs) by a half, compared to the implementation of signed-
accumulation. BNN batchnorm and activation operations
have been simplified and implemented together as unsigned
comparison with a threshold τk; +1 is produced when
the input value is greater than or equal to τk, and -1
otherwise. The value of τk is computed during run-
time. Such an implementation of batchnorm-activation op-
erations requires much smaller number of LUTs, without
the need for DSPs and FFs, compared to regular imple-
mentation of batchnorm-activation. Max-pooling, average-
pooling, and min-pooling have been effectively implemented
with Boolean OR-operator, Boolean majority function, and
Boolean AND-operator, respectively.
The accelerator architecture is composed of building
blocks from the FINN hardware library. The matrix-vector-
threshold unit (MVTU) is the core computational building
block as matrix-vector operations followed by thresholding
form the majority of BNN operations. The design of MVTU
consists of an input buffer, an array of P parallel PEs, each
with S SIMD lanes, and an output buffer. BNN weight
matrix is distributed across the PEs and stored locally in on-
chip memory. Subsequently, the input images are streamed
through the MVTU and multiplied with the weight matrix.
Particularly, the PE computes the dot-product between an
input vector and a row of the weight matrix, each S bits
wide, using an XNOR gate, as shown in Fig. 18. Then, it
compares the number of set bits to a threshold and produces
a 1-bit output value as previously discussed.
FIGURE 18. The Architecture of MVTU PE [181].
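The XNOR-and-popcount arithmetic of an MVTU PE can be reproduced in a few lines of Python using integers as bit-vectors, as sketched below; this mirrors the description above rather than FINN's HLS implementation.

```python
# Minimal sketch of the XNOR/popcount dot-product with thresholding performed
# by an MVTU PE, using Python integers as bit-vectors.

def binary_dot(a_bits, w_bits, width):
    """Dot product of two {-1,+1} vectors encoded as bits (1 -> +1, 0 -> -1)."""
    xnor = ~(a_bits ^ w_bits) & ((1 << width) - 1)   # 1 where the bits agree
    popcount = bin(xnor).count("1")
    return 2 * popcount - width                      # equivalent +1/-1 arithmetic

def mvtu_pe(a_bits, w_bits, width, threshold):
    """Binarized neuron: +1 if enough bits agree, -1 otherwise."""
    set_bits = bin(~(a_bits ^ w_bits) & ((1 << width) - 1)).count("1")
    return 1 if set_bits >= threshold else -1

a, w = 0b10110011, 0b10010111          # 8-bit input and weight rows
print(binary_dot(a, w, 8))             # equals the +1/-1 dot product (here: 4)
print(mvtu_pe(a, w, 8, threshold=5))   # +1, since 6 bits agree
```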
Umuroglu et al. [181] implemented the CONV layer using
a sliding window unit (SWU) and an MVTU, where convo-
lutional operation is transformed to matrix-multiplication of
image matrix and filter matrix. SWU generates the image
matrix to MVTU by moving the sliding window over the
input FMs, while the filter matrix is generated by packing
the weights from the convolution filters as shown in Fig. 19.
In order to meet the user throughput requirement, MVTU is
folded (time-multiplexed) by controlling the values of P and
S. Folding of the MVM decides the partitioning of the matrix across
PEs. Every row of the matrix tile is mapped to a distinct PE and
every column of the PE buffer is mapped to a distinct SIMD
lane. In this way, the required number of cycles to compute
one MVM (total fold) is obtained as (X × Y)/(P × S), where
X and Y are the dimensions of the matrix. The folding
factors of BNN layers have been determined such that every
BNN layer takes nearly the same number of cycles.
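The total fold expression above reduces to a one-line computation, sketched here with toy layer and folding sizes.

```python
# Sketch of the folding arithmetic above: the cycles needed for one
# matrix-vector multiply when an (X x Y) matrix is time-multiplexed over
# P PEs with S SIMD lanes each.

def total_fold(X, Y, P, S):
    assert X % P == 0 and Y % S == 0, "matrix must tile evenly over PEs/lanes"
    return (X * Y) // (P * S)

# Example: a 256 x 1024 binarized FC layer on 16 PEs with 32 SIMD lanes each.
print(total_fold(256, 1024, P=16, S=32))   # 512 cycles per MVM
```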
To evaluate FINN, the authors implemented CNV topol-
ogy on Xilinx Zynq-7000 board at 200 MHz to acceler-
ate BNNs inference on CIFAR-10 [182]. CNV contains
three repetitions of two 3 × 3 CONV layers and 2 × 2 max-
pooling layers. Its topology is inspired by VGG-16 and
BinaryNet [180]. Although CNV accepts images with 24
bits/pixel as an input and produces a 10-element vector of
16-bit values, 2 bits are used for representing intermediate
results while 1 bit is used for representing CONV and
FC weights. Experimental results demonstrated that the
proposed design provides high performance (2.5 TOPS)
while incurring low energy consumption (11.7 Watts). FINN
outperforms the design by Ovtcharov et al. [152] by over
13.8× in throughput.
FIGURE 19. Transforming CONV to Matrix-Multiplication [181], where ifm and
ofm are the input and output feature maps, respectively.
FIGURE 20. CONV Acceleration Architecture and Dataflow [83], where Pix = Pox = 3,
Piy = Poy = 3, and Pof = 3.
In [83], loop optimization techniques [55], [79] have been
employed in FPGA to design a customized CNN accelerator
through speeding up CONV layer operations. Firstly, an in-
depth analysis is provided to numerically characterize loop
unrolling, loop tiling, and loop interchange optimization
techniques. In doing so, 8 CONV dimension parameters
(N*), 8 loop unrolling design variables (P*), and 8 loop
tiling design variables (T*) have been used with the con-
straint that, for a specific loop level, 1 ≤ P* ≤ T* ≤ N*.
Note that unrolling Loop-1 and Loop-3 requires Pkx × Pky
and Pif multipliers, respectively, an adder tree with fan-in
of Pkx × Pky and Pif, respectively, and an accumulator.
On the other hand, unrolling Loop-2 requires Pix × Piy par-
allel MAC units to reuse the same weight Pix × Piy
times, while the input feature pixel can be reused Pof
times when unrolling Loop-4 with the use of Pof parallel
MAC units. Thus, Pkx × Pky × Pif × Pix × Piy × Pof
multipliers are required. Please refer to Fig. 2 for more
details on CONV loops levels and their parameters. In
loop tile optimization, the authors have numerically set
the lower bound on the required size of the input pixel
buffer, the weight buffer, and output pixel buffer that ensures
reading each input feature pixel and weight from the off-
chip memory only once. On the other hand, loop interchange
technique has a great impact on the times of memory access
as well as the number of partial sums since it determines
the order of computing CONV loops.
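The multiplier count implied by these unrolling variables, and a simplified buffer-size estimate, can be checked with the short Python sketch below; the buffer expression is an assumption for illustration and not the exact lower bound derived in [83].

```python
# Sketch of the resource bookkeeping implied by the unrolling analysis above:
# the multiplier count follows directly from the unrolling variables, while
# the buffer size shown is a simplified estimate (an assumption, not the
# exact expression derived in [83]).

def multipliers(Pkx, Pky, Pif, Pix, Piy, Pof):
    return Pkx * Pky * Pif * Pix * Piy * Pof

def weight_buffer_words(Nkx, Nky, Nif, Tof):
    # enough weights kept on-chip for the kernels of Tof output FMs
    return Nkx * Nky * Nif * Tof

# VGG-16-style uniform unrolling from the text: Pix = Pox = Piy = Poy = 14,
# Pof = 16, with Loop-1 and Loop-3 computed serially (Pkx = Pky = Pif = 1).
print(multipliers(1, 1, 1, 14, 14, 16))        # 3,136 MAC units
print(weight_buffer_words(3, 3, 64, Tof=16))   # toy weight-buffer estimate
```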
Secondly, the authors have provided a quantitative anal-
ysis of the design variables to minimize each of computing
latency, partial sum storage, on-chip buffer access, and off-
chip DRAM access. Subsequently, MATLAB scripts are
used to randomly sample a subset of the solution space
to find the optimal design configurations. This is due to
the large solution space, with more than 7.2 × 10^13 possible
configurations for the loop tiling variables of the output FM
width (Pox) and height (Poy) alone. According to the random
sampling results for VGG-16 CNN model on Arria 10 GX
1150 FPGA, uniform unrolling factors for CONV layers are
used with Pix = Pox = Piy = Poy = 14 and Pof = 16
for Loop-2 and Loop-4, respectively, to reuse input feature
pixels and weights. On the other hand, Loop-1 and Loop-3
are serially computed to prevent the movement of the partial
sums between the MAC units and consume them ASAP
since both Loop-1 and Loop-3 need to be finished in order
to obtain one final output pixel. More importantly, the order
of loops computation has been found to be as follows. Loop-
1 is computed first, then comes Loop-3, and finally Loop-2
and Loop-4 are computed in any order.
Finally, a customized convolution accelerator module
with efficient dataflow has been designed based on the
previous results and used for all VGG-16 CONV layers.
The CONV accelerator consists of 3,136 (Pix × Piy × Pof)
independent MAC units and 14 (Pof) input pixel buffers.
Fig. 20 shows an example of the designed CONV accel-
erator when Pix, Piy, and Pof are all equal to 3. The
input pixels are shifted after fetching them out of the input
pixel buffers. Subsequently, they can be reused among the
input register arrays. Then, the input pixels are fed into the
associated MAC units. The figure also shows that the input
pixels and weights are shared by Pof and Pix × Piy MAC
units, respectively.
The overall CNN acceleration system mainly consists
of two SDRAM banks that hold the input feature pixels
and weights, two modular Scatter-Gather DMA (mSGDMA)
engines to facilitate the simultaneous read/write from/to
the SDRAMs, and a controller to govern the sequential
computation of layers as well as the iterations of the four
CONV loops. On the other hand, dual weight buffers have
been used to increase the throughput of FC layer through
overlapping the inner-product computation with off-chip
communication. The acceleration system has been written as
parametrized Verilog scripts. The experimental results show
that the proposed accelerator has a throughput of 645.25
GOPS, which is more than a 3.2× enhancement compared to
prior VGG-16 FPGA-based implementations [80], [98].
Venieris and Bouganis [183] further extended fpga-
ConvNet framework [165] to allow for optimizing either
throughput or latency depending on the size of the workload.
For large workloads, weights reloading transformation has
been introduced to efficiently design latency-critical CNNs
on FPGA. In contrast with fpgaConvNet, where a distinct
architecture is designed for each subgraph, the weights
reloading transformation allows for generating a single
flexible architecture, named as the reference architecture and
derived using pattern matching, to execute the workloads
of all subgraphs by transitioning to different modes. Upon
the execution of a new subgraph, the subgraph’s weights
are read into the on-chip memory and the multiplexers are
configured to form the appropriate datapath. Fig. 21 demon-
strates how weights reloading is applied. The authors have
mentioned that the required time for transferring subgraph’s
weights is much smaller than the average time for full
FPGA reconfiguration, 272.7× less when loading 4.5 MB
of weights for a VGG-16 layer on Zynq XC7Z045.
In the situation discussed above, due to limited on-chip
memory capacity, it might not be possible to load all weights
required for a single CONV layer. To handle this, the
authors introduced an input FMs folding factor (fin) with
each CONV layer. A CONV layer (CONVi) is partitioned
into fin,i subgraphs in which each subgraph executes a
fraction of CONVi to produce a fraction of the output
FMs. The proposed latency-driven methodology has been
evaluated by implementing AlexNet and VGG-16 with 16-
bit fixed-point precision for both on Zynq XC7Z045 at 125
MHz. The experimental results showed 1.49× and 0.65× the
CONV throughput of DeepBurning [155] and of the embedded
FPGA accelerator in [98] for the AlexNet and VGG-16
implementations, respectively.
FIGURE 21. Weights Reloading [183].
FIGURE 22. Overall DLA Architecture [188].
Lavin and Gray [184] demonstrated that CNN algorithms
with small filters can be efficiently derived using Winograd
algorithm [185] and fast Fourier transform (FFT) algo-
rithm [186] due to their advantages in improving resource
efficiency and reducing arithmetic complexity. Winograd
computation involves a mix of element-wise (Eltwise) and
general-purpose matrix multiplication, where some of the
matrices need to be transformed. In particular, Winograd
algorithm exploits the structural similarity among n × n tiled
input FM pixels given a filter of size r × r to generate m × m
tiled pixels of the output FM, where m represents the stride
between Winograd tiles (m = n − r + 1), while minimizing
the number of required CONV multiplications from m^2 r^2
for the conventional CONV algorithm to n^2. In another work,
Zhang et al. [187] implemented FFT algorithm for CNN on
FPGA platform. However, their proposed implementation
shows little reduction of computation complexity with small
filters such as 3 × 3.
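To make the arithmetic saving concrete, the following Python sketch implements the 1D Winograd F(2, 3) transform and checks it against direct convolution; the 2D form used by the accelerators below nests the same idea.

```python
# A 1D Winograd F(2,3) sketch to make the arithmetic saving above concrete:
# 2 outputs of a 3-tap convolution are produced with 4 multiplications
# instead of 2 x 3 = 6 (the filter-side transforms are precomputed).

def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 convolution outputs."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct(d, g):
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]

d, g = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.25]
print(winograd_f23(d, g), direct(d, g))   # both give the same two outputs
```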
Aydonat et al. [188] presented a deep learning architec-
ture (DLA) based on OpenCL. Their proposed architecture
reduces the external memory bandwidth requirements by an
order-of-magnitude for both the convolutional and fully con-
nected layers. This is achieved by caching all intermediate
feature maps on-chip in stream buffers. For fully connected
layers, image batching is used where a batch of images
are processed together through the fully connected layers.
The approach utilizes the Winograd transformation to reduce
the multiply-accumulate operations, which could reduce the
number of needed operations by about 50%. In addition,
it uses half-precision (FP16) floating-point operations with
shared exponents, which significantly reduces the needed
computational resources.
The overall DLA architecture is shown in Fig. 22.
Each PE consists of dot-product units, accumulators, and
caches, for performing dot-products for convolution and
fully connected layers. Caches are used for storing filter
weights. To avoid idle computation cycles, double-buffering
is used such that filter weights for the next convolution
layer are prefetched onto the caches while filter weights are
loaded from the caches for a particular convolution layer.
Stream buffers store feature data and stream it to PEs. Each
stream buffer is double-buffered similar to filter caches.
Images are loaded from the DDR and are stored in stream
buffers before the first convolution layer starts execution.
During a convolution layer execution, while feature data
for a convolution layer is being streamed into the PEs,
the outputs of convolutions are simultaneously stored in
the buffers. The StreamBuffer unit applies the Winograd
transformations to features, and streams the transformed
features to the first PE which are forwarded through all
the PEs via the daisy-chained input connections between
them. The ReLU unit receives the outputs of the PEs via
daisy-chained output connections. Then, the normalization
unit receives the outputs of the ReLU unit and applies
the normalization formula across the feature maps. The
pooling unit receives the outputs of the normalization unit
and computes the maximum value in a window. The output
of the pooling unit is stored back in the stream buffer for
further processing, if more convolution layers are to follow.
Otherwise, the outputs of the pooling unit are stored in
external memory. For the fully connected layers, features
data are stored on PEs caches while filter weights are stored
in stream buffers. For the first fully connected layer, features
data are read back from external memory and loaded onto
the PE caches. The ReLU output is sent directly to DDR,
without applying normalization or pooling. The sequencer
generates the control signals to control the operation of
the various blocks in DLA according to the topology of
the executed CNN. Executing a different CNN requires just
changing the sequencer configuration.
The DLA has been evaluated by implementing AlexNet CNN on Intel's Arria 10 dev kit, which contains an A10-1150 device (20 nm), using a batch size of 96 for the fully connected layers. It achieved a performance of 1020 images/s. In addition, it achieved 8.4× more GFLOPS than the latest Ultrascale (KU, 20 nm) result reported in [162], which uses a batch size of 32 for the fully connected layers, and 19× more GFLOPS than the latest Stratix V result reported in [80]. Furthermore, it achieved an energy efficiency of 23 images/s/W, which is similar to that achieved by the best publicly known implementation of AlexNet on an NVIDIA Titan X GPU.
Unlike the DLA architecture [188], where a 1D Winograd algorithm was employed to reduce arithmetic complexity, Lu et al. [189] implemented a novel FPGA architecture with a two-dimensional Winograd algorithm [185] to accelerate the convolutional computation of CNNs. The overall architecture consists of a line buffer structure and a Winograd PE engine, as shown in Fig. 23. In particular, n + m input lines and m output lines of on-chip buffers are used to effectively reuse FM data among different tiles. While the Winograd PE engine reads the first n input lines to perform the Winograd computation, the next m input lines load pixels from off-chip memory using FIFOs to overlap the data transfer and computation.
FIGURE 23. Winograd-based CNN Accelerator [189], where m is the size of the input FM tile, n is the size of the output FM tile, M is the number of input channels, N is the number of output channels, W is the maximal width of all input FMs, and C is the width of the output FMs.
Thereafter, the input lines are rotated in a circular fashion to make the next n input lines ready. On the other hand, the Winograd PE engine is composed of 4 pipelined stages that perform transformation, element-wise matrix multiplication, an additional (inverse) transformation, and accumulation of output tiles, respectively.
A vector of PEs is employed to achieve parallelism through unrolling Loop-4 (Pof) and Loop-3 (Pif), similar to that in [55]. To implement the FC layer, the proposed accelerator uses the input line buffers to hold FC weights while input neurons are stored in the filter buffers. Then, the Winograd PE engine is reused to implement the FC operation but with the transformation stages bypassed. Moreover, a batch (Nbatch) of input FMs is assembled and processed together in order to improve the memory bandwidth utilization. An analytical model has been proposed for a fast design space exploration of the optimal design parameters (n, Pof, Pif, Nbatch) constrained by the FPGA configuration, with a 16-bit fixed-point representation for both FM data and filters.
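Such an analytical model typically drives a simple exhaustive sweep over the candidate parameters. The sketch below is our own illustration of this kind of search, not the authors' model: the cost and resource estimators are placeholders, and the search keeps the feasible (n, Pof, Pif, Nbatch) tuple with the lowest estimated latency.

#include <cstdio>
#include <initializer_list>
#include <limits>

// Toy stand-ins for the analytical model in [189]; for illustration only.
int    dsps_used(int pof, int pif)   { return pof * pif * 4; }
int    brams_used(int n, int nbatch) { return n * 8 + nbatch / 4; }
double estimated_latency(int n, int pof, int pif, int nbatch) {
    // More parallelism and larger tiles shorten the CONV part; batching
    // amortizes the FC weight traffic.
    return 1.0e9 / (double(n) * pof * pif) + 1.0e6 / nbatch;
}

struct Design { int n, pof, pif, nbatch; double latency; };

// Exhaustively sweep (n, Pof, Pif, Nbatch) and keep the feasible design
// point with the lowest estimated latency.
Design explore(int dsp_budget, int bram_budget) {
    Design best{0, 0, 0, 0, std::numeric_limits<double>::max()};
    for (int n : {4, 6, 7, 8})
        for (int pof = 2; pof <= 32; pof *= 2)
            for (int pif = 2; pif <= 32; pif *= 2)
                for (int nbatch : {16, 32, 64, 128}) {
                    if (dsps_used(pof, pif) > dsp_budget)    continue;
                    if (brams_used(n, nbatch) > bram_budget) continue;
                    double lat = estimated_latency(n, pof, pif, nbatch);
                    if (lat < best.latency) best = {n, pof, pif, nbatch, lat};
                }
    return best;
}

int main() {
    Design d = explore(2520, 912);   // e.g., ZCU102-like DSP/BRAM budgets
    std::printf("n=%d Pof=%d Pif=%d Nbatch=%d latency=%.3e\n",
                d.n, d.pof, d.pif, d.nbatch, d.latency);
    return 0;
}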
The proposed accelerator has been evaluated by imple-
menting AlexNet and VGG-16 on Xilinx ZCU102 FPGA.
AlexNet CONV layers have 3 different filter sizes. The conventional CONV algorithm has been applied to the first CONV layer, as it has a filter of size 11×11, while a uniform filter of size 3×3 for the Winograd algorithm has been used to implement the rest of the layers.
the rest of the layers. The design parameters are found to
be equal to (6, 4, 8, 128) and (6, 4, 16, 128) for AlexNet
and VGG-16, respectively. The experimental results demon-
strated that the proposed Winograd-based CNN accelerator
has an average performance of 854.6 GOPS and 2940.7
GOPS for AlexNet and VGG-16, respectively, with power
consumption of 23.6 Watts for both. The proposed accelera-
tor has also been evaluated on Xilinx ZC706 platform where
the design parameters are found to be as (6, 2, 8, 32) and
(7, 4, 4, 32) for AlexNet and VGG-16, respectively. The ex-
perimental results demonstrated that Winograd-based CNN
accelerator has an average performance of 201.4 GOPS and
679.6 GOPS for AlexNet and VGG-16, respectively, with
power consumption of 23.6 Watts for both. Compared to
the implementation of VGG-16 on NVIDIA Titan X with
the latest cuDNN 5.1, the Titan X gives better performance than the Xilinx ZC706, but the implementation on the Xilinx ZC706 achieves 1.73× higher energy efficiency.
FIGURE 24. Compute Unit with a 2D BRAM-to-PE Interconnection [190].
Zhang et al. [190] presented an OpenCL-based architec-
ture for accelerating CNNs on FPGA. They also proposed
an analytical performance model to identify the bottleneck
in OpenCL-based acceleration of the VGG-19 CNN model on modern FPGA platforms such as the Altera Arria 10 GX 1150. Based on roofline model analysis, it is shown that the
bandwidth requirement of VGG-19 workload is higher than
what is provided by the FPGA board. Thus, they identified
on-chip memory bandwidth as the key performance bottle-
neck. In addition, they observed that exploited data-level
parallelism in the existing Altera OpenCL library [191] leads
to wasteful replication of on-chip memory (BRAM). This is
due to connecting each PE with a dedicated BRAM port.
Therefore, a Verilog-based accelerator kernel has been designed and wrapped into an OpenCL IP in order to opti-
mally balance on-chip memory bandwidth with workload
computational throughput and off-chip memory accesses.
In particular, the proposed kernel consists of a compute
subsystem, a local memory subsystem, and a 2D dispatcher.
The compute subsystem is organized hierarchically into
compute units (CUs) and PEs. At PE level, the authors have
designed a 2D multi-cast interconnection between BRAMs (32-bit data width) and PEs to improve the efficiency of on-chip BRAM usage by sharing the data of one BRAM port with several PEs, as shown in Fig. 24. The CU has been designed as a 2D PE array of size 16 × 16 to match the computational bandwidth with the maximum streaming bandwidth (512-bit data bus) provided by off-chip memory.
FIGURE 25. 2D Dispatcher [190], where X0 is the column size of the kernel buffer as well as the row size of the input feature buffer, and X1 is the row size of the kernel buffer.
FIGURE 26. Line Buffer Design [190].
The 2D dispatcher divides the work items into work groups
each of size (X0,X1) as shown in Fig. 25. Thereafter, it
adaptively schedules the work items within each work group
to the CUs starting with the lowest dimension to balance
the memory bandwidth with capacity. The 2D dispatcher is
also responsible for host/device memory data transfers. In
addition, the authors have limited the maximum fan-out for
registers to 100 in order to guarantee a higher frequency.
The CONV layer has been implemented as a matrix
multiplication by flattening and rearranging the data using
line buffer [154], as shown in Fig. 26, in a similar fashion
to that in [80]. The line buffer converts continuous address
stream from external memory into a stream conducive for
CONV operation to substantially reduce the bandwidth
requirement of off-chip memory. To implement FC layer,
the proposed accelerator uses one column of PEs in the CU.
The proposed implementation has achieved 866 GOPS and
1790 GOPS with the use of 32-bit floating-point and 16-
bit fixed-point, respectively, under 370 MHz and 385 MHz
working frequencies, respectively.
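Conceptually, such a line buffer turns a streaming convolution into a matrix multiplication by presenting, at each output position, a flattened window of the input. The following software analogue (ours, not the Verilog kernel of [190]) shows the idea for a single channel with stride 1.

#include <cstdio>
#include <vector>

// Software analogue of a line buffer feeding a matrix-multiply CONV:
// each row of 'patches' is a flattened k x k window of the input,
// so the convolution reduces to patches * flattened_kernel.
std::vector<std::vector<float>> im2col(const std::vector<float>& in,
                                       int H, int W, int k) {
    std::vector<std::vector<float>> patches;
    for (int y = 0; y + k <= H; ++y)
        for (int x = 0; x + k <= W; ++x) {
            std::vector<float> p;
            for (int dy = 0; dy < k; ++dy)
                for (int dx = 0; dx < k; ++dx)
                    p.push_back(in[(y + dy) * W + (x + dx)]);
            patches.push_back(p);
        }
    return patches;
}

int main() {
    int H = 4, W = 4, k = 3;
    std::vector<float> img(H * W, 1.0f);
    std::vector<float> kernel(k * k, 0.5f);        // flattened 3x3 filter
    for (const auto& row : im2col(img, H, W, k)) {
        float acc = 0.0f;
        for (int i = 0; i < k * k; ++i) acc += row[i] * kernel[i];
        std::printf("%.1f ", acc);                 // each value is one output pixel
    }
    std::printf("\n");
    return 0;
}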
All previously discussed FPGA-based CNN accelerators,
except the ones discussed in [165], [170], have employed
a single CLP to maximize the aggregate throughput of
performed consecutive convolutional operations. However,
Shen et al. [192] noted that using a single globally-optimized
CLP design for the computation of CONV layers of radi-
cally different configurations and dimensions leads to sub-
optimal performance and insufficient utilization of FPGA
resources. Fig. 27a demonstrates the use of a single CLP to
iteratively process the L1, L2, and L3 CONV layers, where the dimensions of the hardware (CLP, CLP1, and CLP2) and the layers are represented by the size and shape of the boxes. It is clear that computing L1 and portions of L3 leaves
FPGA resources unutilized as their dimensions are smaller
than the dimension of the used CLP. Note that processing a CONV layer with a dimension bigger than the dimension of the CLP, such as L3, requires the repeated use of the CLP to process different portions of the layer.
FIGURE 27. Operation of CONV Layer Processors (CLPs) on a CNN with three CONV Layers [192].
The authors have also followed the methodology in [55] to derive an optimal single CLP, through finding the optimal unrolling factor ⟨Tm, Tn⟩, for implementing SqueezeNet [193] and AlexNet on a Virtex7 690T FPGA with single-precision floating-point and 16-bit fixed-point arithmetic units, respectively. They found that one quarter of the DSP slices of SqueezeNet's CLP remain unused. Even worse utilization has been observed for AlexNet. The optimal single CLP has not utilized, on average, more than one quarter of the arithmetic unit resources.
On the other hand, they also noted that using one CLP
for each stage of CONV layer in a fashion similar to
that in [194] is not efficient due to three reasons. First, it
reduces the on-chip BRAM buffer size of each CLP which
minimizes overall data locality. Second, such one-to-one
mapping of CONV layers and CLPs requires orchestrating
many off-chip memory accesses which incurs latency and
bandwidth overheads. Third, the overall control overhead
scales with the number of CLPs which leaves insufficient
resources for the computation of CNN.
To address the above inefficiencies, Shen et al. [192]
proposed a multi-CLP accelerator system for CNNs where
the available FPGA hardware resources are partitioned
across multiple smaller CLPs. Each CLP is tailored with a
dimension that closely matches the dimensions of a subset of
CONV layers. Thereafter, these specialized CLPs are used
to concurrently operate on a batch of images to achieve a
higher overall throughput, as shown in Fig. 27b, where the
same hardware in Fig. 27a is partitioned into two parallel
CLPs: CLP1 and CLP2.
Shen et al. [192] developed an optimization search algo-
rithm that uses dynamic programming to find optimal de-
signs. For given configurations of CNN model (i.e., CONV
layers descriptions) and resource constraints of the targeted
FPGA platform (i.e., number of DSP slices, BRAM-18Kb
units, and off-chip memory bandwidth), it derives the opti-
mal number of CLPs (along with their ⟨Tm, Tn⟩ dimensions)
as well as the optimal mapping between CONV layers and
CLPs that maximize the performance. The assignment of
CNN layers to CLPs is static, where each CNN layer is
mapped and bounded to a particular CLP. Subsequently,
CNN layers are pipelined to their CLP, as shown in Fig. 27b,
where L1 and L3 are pipelined to CLP1 while L2 is repeatedly processed on CLP2 with very little idle hardware, which improves the performance compared to the single CLP
approach. Moreover, the optimization algorithm also finds
the optimal partition of on-chip BRAM resources of each
CLP that minimizes the overall off-chip memory accesses.
Note that the optimal dimension of each CLP is found based
on the work in [55].
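As a rough software illustration of the partitioning idea, and not the authors' dynamic-programming formulation, the sketch below splits a set of CONV layers between two CLPs so that the slower CLP finishes as early as possible, given per-layer cycle estimates (the cycle counts are hypothetical).

#include <cstdint>
#include <cstdio>
#include <vector>

// Toy illustration of splitting CONV layers across two CLPs (not the
// dynamic-programming search of [192]): enumerate all assignments and
// keep the one whose slower CLP finishes earliest (minimum makespan).
struct Split { uint32_t mask; long long makespan; };

Split best_two_clp_split(const std::vector<long long>& layer_cycles) {
    const size_t L = layer_cycles.size();
    Split best{0, -1};
    for (uint32_t mask = 0; mask < (1u << L); ++mask) {
        long long clp1 = 0, clp2 = 0;
        for (size_t i = 0; i < L; ++i)
            (mask & (1u << i) ? clp1 : clp2) += layer_cycles[i];
        long long makespan = clp1 > clp2 ? clp1 : clp2;
        if (best.makespan < 0 || makespan < best.makespan)
            best = {mask, makespan};
    }
    return best;
}

int main() {
    // Hypothetical per-layer cycle counts for a 5-layer CNN.
    std::vector<long long> cycles = {120, 80, 300, 150, 60};
    Split s = best_two_clp_split(cycles);
    std::printf("assignment mask=0x%x, makespan=%lld cycles\n", s.mask, s.makespan);
    return 0;
}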
Subsequently, C++ (HLS) templates are parameterized
to design CLPs and to form a complete implementation
of CNN. A standard AXI crossbar is used to intercon-
nect the independent CLPs. The ping-pong double-buffering
technique is also used for input FMs, output FMs, and
weights to allow for transferring data while computation
is in progress. The experimental results of implementing
AlexNet with a single precision floating-point using multi-
CLP accelerator on Virtex7 485T and 690T FPGAs at 100
MHz demonstrate 1.31× and 1.54× higher throughput than
the state-of-the-art single CLP design in [55], respectively.
For the more recent SqueezeNet network, the proposed
multi-CLP accelerator results in speedups of 1.9× and 2.3×
on Virtex7 485T and 690T FPGAs at 170 MHz with 16-bit
fixed-point, respectively.
Wei et al. [195] presented a systolic architecture for
automatically implementing a given CNN on FPGA based
on OpenCL description, maximizing clock frequency and
resource utilization. The proposed systolic architecture is
shown in Fig. 28. Each PE shifts the data of the weights
(W) and inputs (IN) horizontally and vertically to the
neighboring PEs in each cycle. The 2D structure of PEs is
designed to match the FPGA 2D layout structure to reduce
routing complexity and achieve timing constraints.
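A cycle-level software model (ours, simplified) conveys the dataflow: in a generic output-stationary systolic array, PE(i, j) accumulates C[i][j], and the operand pair with index k reaches it at cycle i + j + k because of the horizontal and vertical shifts.

#include <cstdio>
#include <vector>

// Simplified cycle-level model of an output-stationary systolic array:
// PE(i,j) accumulates C[i][j]; the operand pair for index k reaches
// PE(i,j) at cycle t = i + j + k due to the horizontal/vertical shifts.
std::vector<std::vector<int>> systolic_matmul(const std::vector<std::vector<int>>& A,
                                              const std::vector<std::vector<int>>& B) {
    int M = A.size(), K = A[0].size(), N = B[0].size();
    std::vector<std::vector<int>> C(M, std::vector<int>(N, 0));
    for (int t = 0; t < M + N + K - 2; ++t)        // total pipeline cycles
        for (int i = 0; i < M; ++i)
            for (int j = 0; j < N; ++j) {
                int k = t - i - j;                 // operand index arriving this cycle
                if (k >= 0 && k < K)
                    C[i][j] += A[i][k] * B[k][j];
            }
    return C;
}

int main() {
    auto C = systolic_matmul({{1, 2}, {3, 4}}, {{5, 6}, {7, 8}});
    std::printf("%d %d\n%d %d\n", C[0][0], C[0][1], C[1][0], C[1][1]);  // 19 22 / 43 50
    return 0;
}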
FIGURE 28. Systolic Array Architecture for CNN [195].
The technique first finds a feasible mapping for the given
CNN to the systolic array to guarantee that proper data
is available at specific locations in the PE array at every
cycle. Then, the size of PE array (dimensions) is determined
which has an impact on the required number of DSPs,
the clock frequency, and the DSPs efficiency. Finally, the
data reuse strategy is determined by choosing proper tiling
sizes. The proposed technique has been evaluated using
AlexNet and VGG16 on Intel’s Arria 10 GT 1150 board.
The technique has explored the use of both 32-bit floating-
point and fixed-point using 8-bits for weights and 16-bits for
data. Evaluation results show that, for the VGG16 CNN, the
technique achieves up to 1,171 GOPS on Intel's Arria 10 device with a clock frequency of 231.85 MHz and an (8-16)-bit fixed-point representation.
In another recent research work, Ma et al. [196] general-
ized the previously proposed accelerator in [83] to efficiently
accelerate ResNet-50 and ResNet-152 on Arria 10 GX 1150
FPGA. In doing so, they designed flexible and scalable
CONV, ReLU, BatchNorm, scale, pooling, FC, and Eltwise
primitives. In addition, local control logic and registers have
been used with each primitive to control their computation
order and to hold their configurations, respectively. By doing
so, ResNets primitives can be efficiently reused for different
parameters of each layer.
For the ResNets scalable CONV primitive, there are four (kernel, stride) size configurations: (3×3, 1), (1×1, 1), (1×1, 2), and (7×7, 2). Therefore, a similar architecture and dataflow to that shown in Fig. 20 has been used for CONV but with the use of two sets of register arrays; with shifting between the registers (which is shown in Fig. 20, Set-1), and without shifting between the registers (Set-2). The CONV primitive with a 3×3 kernel and a stride of 1 uses the Set-1 register array, while Set-2 is used with the (1×1, 1), (1×1, 2), and (7×7, 2) configurations. In the CONV primitive with Set-2, the input pixels are fed from the input pixel buffers into the corresponding registers without shifting, and then to the MAC units. The skipped input pixels in the (1×1, 2) configuration are not stored to the input pixel buffers. On the other hand, the (7×7, 2) configuration of the kernel and stride sizes is handled in the same way as the (1×1, 1) case, but while transferring repeated input pixels into the input pixel buffers and rearranging their storage patterns. The CONV primitive also takes care of zero-paddings for the different (kernel, stride) size configurations.
The loop unrolling and tiling techniques in [83] have
also been employed to accelerate CONV primitive with
a uniform mapping of PEs to all ResNets CONV layers.
However, designing efficient CNN modules alone is not enough,
as the memory accesses and data movements between these
modules must also be minimized. Therefore, the authors
have designed a layer-by-layer computation flow. The global
control logic is responsible for governing the sequential op-
erations of primitives and their dataflow through predefined
and preloaded layered-based execution flowchart, as shown
in Fig. 29. In addition, it has been modeled to reconfigure
ResNet primitives according to the parameters of each layer
during runtime. For instance, it maps a particular number of
PEs to CONV layer based on loop unrolling parameters as
well as it controls the selection of register array type (Set-1
or Set-2) based on CONV (kernel, stride) parameters.
On the other hand, a custom DMA manager has been
designed to control the operations of DMA. Note that the
DMA is responsible for transferring the input FM pixels,
weights, and output FM pixels between off-chip memory
and on-chip buffers. Unlike ALAMO architecture [168]
where the output pixels are only stored in on-chip buffers,
this work as well as the work discussed in [83] store the
output pixels in off-chip memory with the use of loop
tiling technique in order to have a flexible architecture
that can process large-scale CNNs. The dual weight buffers
technique has not been used in this work due to the current
trend in CNNs where either the size of FC weights has
been significantly reduced (2 M in ResNet compared with
123.6 M in VGG) or the FC layers are completely removed
such as in NiN. The experimental results demonstrated that
the achieved throughput for ResNet-50 and ResNet-152 are
285.1 GOPS and 315.5 GOPS, respectively. Finally, the
authors mentioned that higher throughput can be achieved using batch computing [194].
FIGURE 29. Execution Flowchart of ResNets Layers [196].
FIGURE 30. DLAU Accelerator Architecture [197].
Wang et al. [197] proposed a scalable design on FPGA for
accelerating deep learning algorithms. In order to provide
a scalable architecture and support various deep learning
applications, the proposed architecture utilizes the tiling
technique in which the large-scale input data is partitioned
into small subsets. The size of the tile is configured to
leverage the trade-off between the hardware cost and the
speedup. Moreover, the authors explored hot spots profil-
ing to determine the computational parts that need to be
accelerated to improve the performance. The experimental
results illustrated that matrix multiplication and activation
functions are the key operations in deep learning algorithms
as they consume about 98.6% and 1.1% of the overall
execution time, respectively. Thus, the proposed accelerator
is responsible for speeding up both matrix multiplication
and activation function computations.
The main components of the proposed architecture are
the embedded processor, the DDR3 memory controller,
the DMA module, and the deep learning acceleration unit
(DLAU), as shown in Fig. 30. The embedded processor
utilizes the JTAG-UART to communicate with the accel-
eration unit [198]. The DLAU unit accesses the DDR3
memory to read the tiled input data and to write the
results back through the DMA module during the execu-
tion. The DLAU utilizes three fully pipelined processing
units to improve the throughput, while minimizing the
memory transfer operations. These units are tiled matrix
multiplication unit (TMMU), partial sum accumulation unit
(PSAU), and activation function acceleration unit (AFAU).
TMMU is responsible for multiplication and generating
the partial sums. To optimize the performance, TMMU
is structured as a pipelined binary adder tree. Moreover,
it uses two sets of registers alternately to overlap the
computation with the communication, one group is used
for the computation, while in parallel, the other group is
loaded with the next node data every clock cycle. On the
other hand, PSAU is responsible for accumulating the partial
sums generated from TMMU. Finally, AFAU implements
the sigmoid function using piecewise linear interpolation
to speedup the computation with negligible accuracy loss.
Since the processing units in DLAU might have inconsistent
throughput rates, each unit has input FIFO buffer and output
FIFO buffer to prevent data loss.
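One common way to realize the sigmoid with piecewise linear interpolation is to sample it at a small number of breakpoints and interpolate between them; the sketch below (ours, with an arbitrary segment count) illustrates the idea behind AFAU in software.

#include <cmath>
#include <cstdio>

// Piecewise-linear sigmoid: sample sigmoid() at uniformly spaced breakpoints
// over [-8, 8] and interpolate linearly in between. The segment count is an
// illustrative choice; a hardware AFAU would read slopes/intercepts from a
// small on-chip table instead of calling exp().
float sigmoid_pwl(float x) {
    const int   SEGMENTS = 64;
    const float LO = -8.0f, HI = 8.0f;
    if (x <= LO) return 0.0f;
    if (x >= HI) return 1.0f;
    float step = (HI - LO) / SEGMENTS;
    int   idx  = static_cast<int>((x - LO) / step);
    float x0   = LO + idx * step;
    float y0   = 1.0f / (1.0f + std::exp(-x0));           // breakpoint value idx
    float y1   = 1.0f / (1.0f + std::exp(-(x0 + step)));  // breakpoint value idx + 1
    return y0 + (y1 - y0) * (x - x0) / step;              // linear interpolation
}

int main() {
    const float xs[] = {-4.0f, -1.0f, 0.0f, 1.0f, 4.0f};
    for (float x : xs)
        std::printf("x=%5.1f  pwl=%.5f  exact=%.5f\n",
                    x, sigmoid_pwl(x), 1.0f / (1.0f + std::exp(-x)));
    return 0;
}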
The authors implemented the proposed architecture on
Xilinx Zynq Zedboard with ARM Cortex-A9 processors
clocked at 667 MHz. In addition, they used the MNIST
dataset as a benchmark considering the network size as
64×64, 128×128, and 256×256. The experimental results demonstrated that the speedup of the DLAU accelerator is up to 36.1× compared with the Intel Core2 processors at a 256×256 network size. In addition, the results depict that
the proposed architecture is quite energy-efficient as the total
power consumption was only 234 mW.
In [199], a generalized end-to-end acceleration system of
the previously proposed accelerators in [78], [83], [168],
[196] has been developed to support diverse CNN models.
In doing so, a user-friendly interface and an RTL-level
compiler have been proposed to automatically generate
customized FPGA designs. The authors have developed
an expandable optimized RTL-based library containing the
most commonly used CNN operations. These operations
have been coded in Verilog and designed based on the
quantitative analysis and optimization strategies discussed
in [83]. The compiler generates a DAG-based structure for
the used CNN model and then compiles it with RTL mod-
ules in the library. The proposed compiler allows the user
to input the high-level information of the used CNN model
(previously designed on Caffe framework [58]) as well as
the design variables (i.e., loop unrolling and loop tiling
variables) with the resource constraints of the targeted FPGA
platform. Such utility facilitates the exploration of the best
trade-off between the resource usage and the performance.
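The compiler interface can be pictured as a small declarative description of the network together with the design variables, which is checked against the platform's resource budget before RTL generation. The structures below are hypothetical and only illustrate the kind of information exchanged; they are not the authors' actual input format or resource model.

#include <cstdio>
#include <string>
#include <vector>

// Hypothetical description consumed by an RTL-generating compiler: the CNN
// layers (as exported from a Caffe model) plus the loop unrolling/tiling
// variables and the resource budget of the target FPGA. All names and the
// resource estimates are illustrative.
struct Layer      { std::string type; int kernel, stride, in_ch, out_ch; };
struct DesignVars { int Pix, Piy, Pof, tile_height; };
struct Budget     { int dsps, bram_kbits; };

bool fits(const DesignVars& v, const Budget& b) {
    int dsps_needed = v.Pix * v.Piy * v.Pof;          // one MAC per DSP (toy model)
    int bram_needed = v.tile_height * v.Pof * 16;     // toy on-chip buffer estimate (Kbits)
    return dsps_needed <= b.dsps && bram_needed <= b.bram_kbits;
}

int main() {
    std::vector<Layer> net = {{"CONV", 3, 1, 3, 64}, {"ReLU", 0, 0, 64, 64},
                              {"POOL", 2, 2, 64, 64}, {"FC",   0, 0, 0, 1000}};
    DesignVars vars{7, 7, 16, 14};
    Budget     board{1518, 54260};                    // e.g., Arria 10 GX 1150-like budget
    std::printf("%zu layers, design %s within the budget\n",
                net.size(), fits(vars, board) ? "fits" : "does not fit");
    return 0;
}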
Unlike the architecture in [168] where individual CONV
module is assigned to each CONV layer, the scalable RTL
computing module proposed in this work is reused by all
CNN layers of the same type for different CNNs as shown
in Fig. 31. Note that it is not necessary to have all these
modules in the architecture.
FIGURE 31. Overall Architecture and Dataflow [199].
FIGURE 32. CONV Reconfigurable Computing Module [199].
For instance, the RTL compiler will not compile or synthesize Eltwise and combined batch
normalization with scale (Bnorm) modules for VGG-16
model which greatly saves the hardware resources.
On the other hand, the authors categorized CNN layers
into key layers (i.e., CONV, pool, and FC layers) and
affiliated layers (e.g., ReLU, Bnorm, Eltwise, and all other
layers). They have also defined layer combos, where each
combo is composed of a key layer and several affiliated
layers. Layer combos are sequentially executed according
to their order in the DAG. Moreover, the layer combo is
also divided into several sequential tiles. The computation
of each combo tile starts by reading its input pixels from
off-chip DRAM and ends by writing back its output pixels
to off-chip DRAM. The global control logic, inter-tile
control logic, and intra-tile control logic are responsible
for governing the sequential operations of layer combos
and reconfiguring the modules, combo tiles, and tile layers
(key and affiliated layers), respectively, through predefined
flexible execution schedule similar to that in [196].
The authors have also employed special storage pattern of
both input pixels and weights on off-chip memory before the
acceleration process to maximize data reuse and minimize
data communication. The architecture of the CONV module
is designed based on the acceleration strategies in [83],
[196] but with a different organization of MAC units as
shown in Fig. 32. The MAC units of CONV module
have been organized into Piy × Pof independent MAC blocks, with each MAC block containing Pix MAC units to further minimize the buffer read operations and the partial sum movements. Moreover, such an organization makes it possible to handle varying (kernel, stride) size configurations through
generating different variants of CONV register arrays during
the compilation.
Experimental results demonstrated that the achieved
throughput on Intel Stratix V GXA7 for NiN, VGG-16,
ResNet-50, and ResNet-152 are 282.67 GOPS, 352.24
GOPS, 250.75 GOPS, and 278.67 GOPS, respectively. On
the other hand, the achieved throughput on Intel Arria
10 GX 1150 was 587.63 GOPS, 720.15 GOPS, 619.13
GOPS, and 710.30 GOPS for NiN, VGG-16, ResNet-50,
and ResNet-152, respectively. More than 2× throughput improvements have been achieved on the Intel Arria 10 GX 1150 since it has 1.8× and 5.9× more logic elements and
DSPs than the Intel Stratix V GXA7, respectively, which
allows for larger loop unrolling variables.
Recently, the programmable solutions group at Intel has
developed an FPGA software-programmable and run-time
reconfigurable overlay for deep learning inference [200].
The developed overlay is referred to as the deep learning
accelerator (DLA). For the hardware side of Intel’s DLA, the
team has partitioned the configurable parameters into run-
time and compile-time parameters. The run-time parameters
allow for easy and quick use of different neural network
frameworks, while the compile-time parameters provide a
tunable architecture for performance. Intel’s DLA uses a
lightweight very long instruction word (VLIW) network,
an 8-bit unidirectional ring network, to support the control
and reprogramming logic. Compared with typical overlays,
Intel’s DLA comes with only 1% overhead while other
typical overlays tend to always come with larger over-
heads [201]. The reprogramming of Intel’s DLA overlay
allows for consecutive runs of multiple NNs in a single
application run [202] without the need for reconfiguring and
recompiling the FPGA.
Fig. 33 shows that a 1D array of PEs is used to perform
convolution, multiplication, or any other matrix operations.
Each PE contains a double-buffered filter cache allowing
for pre-loading of next filters while computing. The stream
buffer employed the double-buffering mechanism as well
to store the inputs and the intermediate data on-chip. To
have flexible NN architecture, Intel’s DLA employs an Xbar
interconnect that connects all the core functions required.
Thus, deep learning functions can be easily added to the
overlay through the Xbar by picking them from a suite
of pre-optimized functions of the select frameworks that Intel's DLA uses.
FIGURE 33. Intel's DLA: Neural Network Inference Accelerator [200].
The width adaptation module has been
used to control the throughput of the function. In addition,
Intel’s DLA supports vectorization across the input width
(Q_VEC), input height (P_VEC), input depth (C_VEC),
output depth (K_VEC), filter width (S_VEC), and other
dimensions as depicted in Fig. 33. The authors mention
that vectorization depends on the layers’ dimensions of
the considered framework. However, they did not provide
a systematic way for finding the optimal balance for the
number of used PEs and the size of the caches. For efficient
use of resources, Intel’s DLA maps AVG pooling and FC
primitives to convolutions in order to avoid having under-
utilized (over time) dedicated auxiliary functions.
For the software side of Intel’s DLA, the proposed accel-
erator uses a graph compiler to map a NN architecture to
the overlay for maximizing the hardware efficiency through
slicing, allocation, and scheduling. In the slicing pass, the
graph compiler breaks down the architecture into subgraphs (chains of functions) in such a way that they fit within the
computing and storage resources of the overlay. A single
CONV layer followed by a pooling layer is an example of
CNN subgraph. The graph compiler optimizes the external
memory spill-points by group slicing technique. The group
slicing allows several sequential convolutions, for instance,
of a single slice to be computed before moving onto the
next slice while using the whole stream buffer. During
the allocation pass, the graph compiler optimizes the use
of custom-developed filter caches and the stream buffer by
managing the read and write from the stream buffer for each
slice. Moreover, it assigns an external memory address when
the stream buffer is not big enough to hold the slice data.
Finally, Intel’s DLA compiler schedules the execution of
subgraphs using cost-based (the ratio of the output size
to the effective input size) priority queue.
FIGURE 34. Design Flow from CNN Model to Hardware Acceleration [60].
FIGURE 35. Overall Architecture of Angel-Eye [60].
The authors
utilized the software-programmable and run-time recon-
figurable overlay to optimize the software and hardware
implementation of GoogleNet [31] and ResNet [49] CNNs.
The benchmark results on an Arria 10 GX 1150 FPGA
demonstrated that Intel’s DLA has a throughput of 900
fps on GoogLeNet. The team pointed out that multi-FPGA
deployment [203] might be used to further improve the
throughput of Intel’s DLA.
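The scheduling step can be viewed as a priority queue over ready subgraphs keyed by the reported cost metric, i.e., the ratio of the output size to the effective input size. The following sketch is a generic illustration of that idea, not Intel's compiler code; the ordering policy and the byte counts are our own choices.

#include <cstdio>
#include <queue>
#include <string>
#include <vector>

// Generic cost-driven scheduler: ready subgraphs are popped in order of
// priority = output size / effective input size, the metric reported for the
// DLA graph compiler. Whether the larger or smaller ratio goes first is a
// policy choice; here the larger ratio is scheduled first.
struct Subgraph {
    std::string name;
    double output_bytes, effective_input_bytes;
    double priority() const { return output_bytes / effective_input_bytes; }
};

struct ByPriority {
    bool operator()(const Subgraph& a, const Subgraph& b) const {
        return a.priority() < b.priority();   // max-heap on the ratio
    }
};

int main() {
    std::priority_queue<Subgraph, std::vector<Subgraph>, ByPriority> ready;
    ready.push({"conv1+pool1", 800e3, 600e3});
    ready.push({"conv2",       400e3, 800e3});
    ready.push({"conv3+pool3", 300e3, 200e3});
    while (!ready.empty()) {
        std::printf("schedule %s (priority %.2f)\n",
                    ready.top().name.c_str(), ready.top().priority());
        ready.pop();
    }
    return 0;
}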
Kaiyuan et al. [60] proposed a complete design flow,
referred to as Angel-Eye, for mapping CNNs onto FPGA.
It includes a data quantization strategy, a parameterizable
and run-time configurable hardware architecture to support
various CNNs and FPGA platforms, and a compiler to map
a given CNN onto the hardware architecture. It adopts
the approach of using a flexible hardware architecture and
maps different CNNs onto it by changing the software.
The proposed design flow from CNN model to hardware
acceleration is shown in Fig. 34.
Due to the large dynamic range of data across different
layers, the best radix point is found for each layer for a
given bit width. They demonstrated that their strategy can
simplify state-of-the-art CNNs to 8-bit fixed-point format
with negligible accuracy loss. Although 8 bits are used for representing data, 24 bits are used for representing intermediate data in layers, which are then aligned and quantized to 8 bits. Fig. 35 and Fig. 36 show the overall architecture
of Angel-Eye and the structure of a single PE, respectively.
The architecture is designed for supporting an instruction
interface that supports three types of instructions; LOAD,
SAVE, and CALC.
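The per-layer radix-point selection described above can be illustrated in a few lines: for a given bit width, try every radix position and keep the one that minimizes the quantization error over that layer's data. The code below is our own sketch of this strategy, not the Angel-Eye tool.

#include <cmath>
#include <cstdio>
#include <vector>

// Quantize x to 'bits'-bit signed fixed-point with 'frac' fractional bits.
float quantize(float x, int bits, int frac) {
    float scale = std::ldexp(1.0f, frac);                     // 2^frac
    long  qmax  = (1L << (bits - 1)) - 1, qmin = -(1L << (bits - 1));
    long  q     = std::lround(x * scale);
    if (q > qmax) q = qmax;
    if (q < qmin) q = qmin;
    return static_cast<float>(q) / scale;
}

// Pick the fractional bit count (radix point) that minimizes the squared
// quantization error over one layer's data, for a fixed total bit width.
int best_frac_bits(const std::vector<float>& data, int bits) {
    int best = 0; double best_err = -1.0;
    for (int frac = 0; frac < bits; ++frac) {
        double err = 0.0;
        for (float x : data) { float d = x - quantize(x, bits, frac); err += d * d; }
        if (best_err < 0.0 || err < best_err) { best_err = err; best = frac; }
    }
    return best;
}

int main() {
    std::vector<float> layer = {0.12f, -1.7f, 3.4f, 0.02f, -2.9f};
    std::printf("best radix point for 8-bit: %d fractional bits\n",
                best_frac_bits(layer, 8));
    return 0;
}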
FIGURE 36. Structure of a Single PE [60].
The overall architecture is divided into four main components: the PE array, on-chip buffer, external memory, and controller. The PE array implements the convolution operations
in CNN and supports kernel level parallelism, and input
and output channel parallelisms. It uses a 3×3 convolution kernel, as this size is the most popular in state-of-the-art CNN models. However, larger kernel sizes are supported based on the use of multiple 3×3 kernels. The use of on-chip buffers
allows data I/O and convolution calculation to be done
in parallel. The controller is responsible for receiving the
instructions and issuing them to the other components. The
compiler maps the CNN descriptor to the set of instructions
that will be executed by the hardware. It follows basic
scheduling rules to fully utilize data localization in CNN
and reduce data I/O.
The block partition step partitions the calculation of one
layer to fit each block into the hardware. The memory
mapping step allocates memory for communication between
the host CPU and the CNN accelerator. Based on the block
partition result, on-chip memory is allocated for the input
and output feature map blocks and for the convolution
kernels and bias values. The dependency check step checks
data dependency among instructions and sets appropriate
instruction flags to maximize parallelism between convolu-
tion calculation and data I/O. Based on experimental results,
it is shown that the 8-bit implementation of Angel-Eye
on the XC7Z020 achieves up to 16× higher energy efficiency than the NVIDIA TK1 and 10× higher than the NVIDIA TX1. In addition, the 16-bit implementation of Angel-Eye on the XC7Z045 is 6× faster and 5× higher in power efficiency than a peer FPGA implementation on the same platform [148].
In [83] and [199], a special register array architecture has
been designed to rearrange buffers data and direct them into
PEs for the purpose of implementing CONV module that
supports specific stride and zero-padding settings. Although
the designed CONV module is not generalized for any (ker-
nel, stride) size configurations, it is composed of complex
wire routing and control logic as shown in Fig. 20. To have
flexibility in directing the dataflow of CONV pixels, Ma
et al. [204] replaced the register array architecture in [199] with a data router as shown in Fig. 37.
FIGURE 37. CONV Acceleration Architecture and Dataflow using Data Router [204], where Pix = Pox and Piy = Poy.
The data router is a scalable set of data bus from buffer
to PE (BUF2PE). The BUF2PE data bus consists of simple
register arrays with FIFOs in between to form a line buffer
similar to that in [154]. The register array uses the FIFO to
pass its input pixels to the adjacent registers. Each BUF2PE
data bus has different data movements within its register
arrays to implement specific stride and kernel size settings.
Unlike the register array architecture in [83] where the west
zero-paddings are handled by changing the storage pattern
within the input pixel buffer, the BUF2PE handles such
kind of paddings by shifting the connection between the
register arrays and the input pixel buffers to simplify the
data transferring from off-chip memory to on-chip buffers.
However, there is still a need for adjusting the storage
pattern within the input buffers in order to handle other
zero-paddings.
The global control logic is responsible for selecting the
suitable BUF2PE data bus from the data router as well as
the suitable storage pattern within the input buffers based on
the (kernel, stride) size configuration of CONV layer. The
CONV module has also been optimized by reducing the
required number of parallel adders that add the partial sums
with biases as well as the number of parallel multipliers and
adders needed to perform the Bnorm operation by serializing the parallel outputs using multiplexers. In addition, 16-bit
fixed-point has been used to represent both weights and
pixels, while dynamically adjusting the decimal point in
different layers to fully utilize the existing data width [60].
The proposed compiler in [199] has been used to config-
ure the parameterized Verilog scripts of the overall CNN
acceleration system. Experimental results show throughput
degradation on both Intel Arria 10 GX 1150 and Intel Stratix
V GXA7 in comparison to the results in [199].
In Table 2 and Table 3, we summarize the reviewed
FPGA-based deep learning networks acceleration tech-
niques. For each technique, we list the year the technique
was introduced, the key features employed for acceleration,
the used deep learning model, the number of needed op-
erations per image, the FPGA platform used to implement
the technique, the precision used for the FMs and weights,
the clock frequency used, the design entry for describing
the modeled deep learning network, the type of LUT for
the used platform, the number of resources available by the
used platform in terms of BRAMs, LUTs, FFs, and DSPs,
the percentage of each resource utilization, the performance
in GOPS, the speedup in comparison to a given baseline
model, and finally the power efficiency (GOPS/W).
IV. METAHEURISTICS IN THE DESIGN OF
CONVOLUTIONAL NEURAL NETWORKS
Currently, convolutional neural network (CNN) structures
are designed based on human expertise. For a given applica-
tion, this consists of determining the number of convolution
layers, number of fully connected layers, sizes of feature
TABLE 2. Implementation and Performance Summary of FPGA-based Accelerators.
Columns: Technique | Year | Key Features | DL Model | Image Operations (GOP) | Platform | Precision | Frequency (MHz) | LUT Type | Design Entry | Resources (BRAMs/M20K, LUTs/ALMs, FFs, DSPs) | Resources Utilization (BRAMs/M20K, LUTs/ALMs, FFs, DSPs) | Performance (GOPS) | Speedup | Baseline | Power Efficiency (GOPS/W)
CNP [127] 2009
2D CONV modules,
Memory interface with
8 simultaneous accesses
LeNet-5 0.52 Virtex4
SX35
16-bit
fixed-point 200 4-input LUTs Lush 192 30,720 30,720 192 N/A 90% 90% 28% 5.25 N/A 0.35
CONV Coprocessor
Accelerator [129] 2009
Parallel clusters of
2D convolver units,
data quantization,
off-chip memory banks
4CONV layers 1.06 Virtex5
LX330T
16-bit
fixed-point 115 6-input LUTs C324 207,360 207,360 192 0.93% 17% 19.05% 55.73% 6.74 6ˆ2.2 GHz
AMD Opteron 0.61
MAPLE [132] 2010
In-memory processing,
banked off-chip memories,
2D array of VPEs
4CONV layers 1.06 Virtex5
SX240T fixed-point 125 6-input LUTs C++ 516 149,760 149,760 1,056 N/A 70.5ˆ1.3 GHz
C870 GPU N/A
DC-CNN [100] 2010
Integer factorization
to determine the best
config for each layer,
input and output switches
3CONV layers 0.52 Virtex5
SX240T
48-bit
fixed-point 120 6-input LUTs RTL 516 149,760 149,760 1,056 N/A 16 4.0ˆ- 6.5ˆ1.35 GHz
C870 GPU 1.14
NeuFlow [143] 2011
Multiple full-custom
processing tiles (PTs),
Pipelining, Fast streaming
memory interface
4CONV layers
in [145] N/A Virtex6
VLX240T
16-bit
fixed-point 200 6-input LUTs HDL 416 150,720 301,440 768 N/A 147 133.6ˆ2.66 GHz
Core 2 Duo CPU 14.7
Memory-Centric
Accelerator [146] 2013
Flexible off-chip memory
hierarchy, Data reuse,
Loop transformation
4CONV layers 5.48 Virtex6
VLX240T fixed-point 150 6-input LUTs C416 150,720 301,440 768 45.5% 1.1% N/A 6% 17 11ˆStandard Virtex6
Implementation N/A
nn-X [148] 2014
Cascaded pipelining,
Multiple stream
processing
2CONV layers 0.55 Zynq
XC7Z045
16-bit
fixed-point 142 4-input LUTs Lua 545 218,600 437,200 900 N/A 23.18 115ˆ
800 MHz
Embedded ARM
Cortex-A9 Processors
2.9
Roofline-based FPGA
Accelerator [55] 2015
Roofline-based model,
Loop transformation,
Loop tiling/unrolling
AlexNet 1.33 Virtex7
VX485T
32-bit
float-point 100 4-input LUTs C2,060 303,600 607,200 2,800 50% 61.3% 33.87% 80% 61.62 17.42ˆ2.2 GHz
Intel Xeon 3.31
Microsoft Specialized
CNN Accelerator [152] 2015
Multi-banked buffers,
Network on-chip for re-
distribution of output data,
Software configurable
AlexNet 1.33 Stratix-V
GSMD5
32-bit
float-point 100 4-input LUTs C2,014 172,600 690,000 1,590 N/A 178.63 3ˆRoofline-based FPGA
Accelerator [55] 7.15
Embedded FPGA
Accelerator [98] 2015
Data quantization
and arrangment,
SVD
VGG-16 30.76 Zynq
XC7Z045
16-bit
fixed-point 150 4-input LUTs RTL 545 218,600 437,200 900 86.7% 83.5% 29.2% 89.2% 136.97 1.4ˆ2.9 GHz
Intel Xeon 14.22
DeepBurning [155] 2016
NN model compression,
Compiler-based library,
Automatic partitioning/
tiling, Function Approx.
AlexNet 1.46 Zynq
XC7Z045
16-bit
fixed-point 100 4-input LUTs RTL 545 218,600 437,200 900 N/A 17.3% 7.65% 16% 73 15.88ˆ2.4 GHz
Intel Xeon 6.43
OpenCL-based FPGA
Accelerator [80] 2016
Design space exploration
for all CNN layers,
Genetic algorithm,
Altera OpenCL SDK
AlexNetpaq1.46 Stratix-Vp1q
GXA7 (8-16)-bit
fixed-point 120 7-input LUTs OpenCL
2,560 234,720 938,880 256 37% 47% N/A 100% 31.8pa1q4.2ˆ
3.3 GHz
Intel i5-4590
1.23
N/A 47.5pb1q2.2ˆN/A
VGG-16pbq30.9 Stratix-Vp2q
GSD8 2,567 262,400 1,050,000 1,963 52% 46% N/A 37% 72.4pa2q9.5ˆ3.79
N/A 117.8pb2q5.5ˆN/A
Caffeine [153], [162] 2016
Systolic array architecture,
Loop unrolling, Double
buffering, Pipelining,
Roofline model
AlexNetpaq1.46 Virtex7p1q
VX690T
16-bitpaq
fixed-point 150
6-input LUTs C++
2,940 433,200 866,400 3,600 42% 81% 36% 78% 354pb1aq9.7ˆTwo-socket server
each with a 6-core
Intel CPU
E5-2609 at 1.9GHz
13.62
36% 31% 11% 38% 266pb2aq7.3ˆ10.64
VGG-16pbq31.1 Xilinxp2q
KU060
8-bitpbq
fixed-point 200 2,160 331,680 663,360 2,760 36% 60% 20% 4% 1,171.7pb2bq29ˆ45.07
N/A 165pa2aq4.2ˆN/A
fpgaConvNet [165] 2016
Datapath optimization,
Synchronous dataflow,
Partitioning and folding,
Design space exploration
LeNet-5 0.0038 Zynq
XC7Z020
16-bit
fixed-point 100 4-input LUTs HLS 140 53,200 106,400 200
4.4% 18.2% 13% 1.2% 0.48 0.09ˆCNP [127] N/A
Scene
Labelling [167] 7.6528 6.5% 61% 36.6% 71.8% 12.73 0.17ˆTegra K1 GPU 7.27
TABLE 3. Implementation and Performance Summary of FPGA-based Accelerators.
Columns: Technique | Year | Key Features | DL Model | Image Operations (GOP) | Platform | Precision | Frequency (MHz) | LUT Type | Design Entry | Resources (BRAMs/M20K, LUTs/ALMs, FFs, DSPs) | Resources Utilization (BRAMs/M20K, LUTs/ALMs, FFs, DSPs) | Performance (GOPS) | Speedup | Baseline | Power Efficiency (GOPS/W)
Throughput-Optimized
FPGA Accelerator [170] 2017
Four-levels of parallelism,
Memory-cache mechanism,
Design space exploration
LeNet 0.04 Virtex7
VX690T
(8-16)-bit
fixed-point 100 6-input LUTs N/A 1,470 433,200 866,400 3,600
32.4% 53.8% 35.5% 80.8% 424.7 14.84ˆ4.0 GHz
Intel Core
i7-4790K CPU
16.85
AlexNet 1.46 69.5% 47.7% 37.3% 79.8% 445.6 6.96ˆ17.97
VGG-S 5.29 88.8% 51.7% 34.5% 81.9% 473.4 4.79ˆ18.49
FP-DNN [171] 2017
RTL-HLS hybrid compiler,
Hand-written matrix
multiply, Quantization,
Tiling and double buffers
VGG-19 39.26 Stratix-V
GSMD5
16-bit
fixed-point 150 4-input LUTs
C++
and
OpenCL
2,014 172,600 690,000 1,590
48% 27% N/A 66% 364.36 3.06ˆ2 Processors each
2.6 GHz
Intel Xeon
8-core E5-2650v2
14.57
ResNet-152 22.62 48% 27% N/A 66% 226.47 1.9ˆ9.06
FINN [181] 2017
Binarized CNN,
Pipelining, Automatic
partitioning and tiling,
Approximate arithmetic
CNV 0.113 Zynq
XC7Z045
(1-2)-bit
precision 200 4-input LUTs C++ 545 218,600 437,200 900 34.1% 21.2% N/A N/A 2,465.5 13.8ˆMicrosoft Specialized
CNN Accelerator [152] 210.7
Customized CONV
Loops Accelerator [83] 2017
Numerical analysis of
CONV loops opt.,
Solution space exploration,
Dataflow opt.
VGG-16 30.95 Arria 10
GX 1150
(8-16)-bit
float-point 150 8-input LUTs Verilog 2,713 427,200 1,708,800 1,518 70% 38% N/A 100% 645.25
3.2ˆEnergy-efficient
CNN [205] 30.44
5.5ˆOpenCL-based FPGA
Accelerator [80]pb2q
Latency-Driven Design
for FPGA-based
CNNs [183]
2017
Synchronous dataflow,
Weights reloading
and SDF transformations,
Design space exploration
AlexNet 1.3315 Zynq
XC7Z045
16-bit
fixed-point 125 4-input LUTs HLS 545 218,600 437,200 900 N/A
161.98 1.49ˆDeepBurning [155]
N/A
VGG-16 30.72 123.12 0.65ˆEmbedded FPGA
Accelerator [98]
DLA [188] 2017
Winograd transformations,
Double stream buffers,
PEs double cache buffers,
Daisy-chained PEs
AlexNet 1.46 Arria 10
GX 1150
Half-precision
FP16 with
shared
exponent
303 8-input LUTs OpenCL 2,713 427,200 1,708,800 1,518 92% 58% 40% 97% 1,382
8.4ˆCaffeine [162]pa2aq
30.71
19.1ˆOpenCL-based FPGA
Accelerator [80]pa2q
Winograd-based
CNN Accelerator [189] 2017
Winograd algorithm, Loop
unrolling, Double buffers,
Batching, Line buffers,
Design space exploration
AlexNetpaq1.33 Zynqp1q
ZCU102 16-bit
fixed-point
200 6-input LUTs
C
912 274,080 548,160 2,520
N/A
854.6pa1q11.8ˆ[80]pa2q36.2
2,940.7pb1q8.31ˆCaffeine [162]pb1aq124.6
VGG-16pbq30.76 Xilinxp2q
ZC706 167 4-input LUTs 545 218,600 437,200 900 201.4pa2q2.78ˆ[80]pa2q21.4
679.6pb2q0.13ˆTitan X (CuDNN 5.1) 72.3
OpenCL-based
Architecture for
Accelerating
CNNs [190]
2017
2D BRAM-to-PE
interconnection, 2D
dispatcher, Roofline
model, OpenCL
VGG-16 30.76 Arria 10
GX 1150
32-bit
float-point 370
8-input LUTs OpenCL
2,713 427,200 1,708,800 1,518 46.1% N/A N/A 87% 866 4.41ˆAltera OpenCL [191]
on Arria 10 Platform 20.75
16-bit
fixed-point 385 2,713 427,200 1,708,800 3,036 53.4% N/A N/A 90.8% 1,790 N/A 47.78
Multi-CLP
Accelerator for
CNNs [192]
2017
Multiple CONV
processors, Pipelining,
Dynamic programming,
Double-buffering
AlexNetpaq1.33 Virtex7p1q
VX485T
32-bit
float-point 100 4-input LUTs
C++
2,060 303,600 607,200 2,800 39.4% 58.3% 44.6% 87.3% 85.2pa1q1.31ˆSingle CLP
Design Based
in [55]
11.2
48.8% 84.7% 40.2% 88.3% 113.92pa2q1.54ˆ11.17
SqueezeNetpbq0.77 Virtex7p2q
VX690T
16-bit
fixed-point 170 6-input LUTs 2,940 433,200 866,400 3,600 N/A 708.3pb1q1.9ˆN/A
37.7% 30.9% 18.6% 97.1% 909.7pb2q2.3ˆ126.3
Automated Systolic
Array Architecture
for CNN [195]
2017
2D systolic array
architecture, Roofline
model, Automation flow
Design space exploration
AlexNetpaq1.4 Arria 10
GX 1150
32-bitp1q
float-point 239.62pa1q
8-input LUTs OpenCL 2,713 427,200 1,708,800 1,518
87% 82% N/A 85% 360.4pa1q
N/A
20.75
N/A
VGG-16pbq31.1 (8-16)-bitp2q
fixed-point
221.65pb1q90.5% 83% N/A 88.3% 460.5pb1q
231.85pb2q61.5% 73% N/A 98.8% 1,171.3pb2q
End-to-End Scalable
FPGA Accelerator [196] 2017
Flexible and scalable
ResNet modules, CONV
loops opt., Dataflow opt.,
Controlled execution
flow of ResNets layers
ResNet-50 7.74
Arria 10
GX 1150
16-bit
fixed-point 150 8-input LUTs Verilog 2,713 427,200 1,708,800 1,518
80% 30% N/A 69% 285.07
N/A N/A
ResNet-152 22.62 93% 33% N/A 69% 315.48
DLAU [197] 2017
Pipelined processing
units, Tiling,
FIFO buffers
DNN N/A Zynq
XC7Z020
48-bit
float-point 200 4-input LUTs N/A 280 53,200 106,400 220 12.5% 68.4% 26.6% 75.9% N/A 36.1ˆ2.3 GHz
Intel Core2 N/A
An Automatic RTL
Compiler for High-
Throughput Deep
CNNs [199]
2017
Library-based RTL
compiler, Flexible and
scalable CNN modules,
Layer combo computation,
CONV loops and dataflow
opt., Controlled execution
flow of CNN layers
NiNpaq2.2 Stratix-Vp1q
GXA7
16-bit
fixed-point
150 7-input LUTs
Verilog
2,560 234,720 938,880 256
59% 96% N/A 100% 282.67pa1qą6.4ˆDeepBurning [155]
N/A
86% 90% N/A 100% 352.24pb1qN/A
VGG-16pbq30.95 76% 74% N/A 100% 250.75pc1qN/A
93% 77% N/A 100% 278.67pd1q1.23ˆFP-DNN [171]
ResNet-50pcq7.74 Arria 10p2q
GX 1150 200 8-input LUTs 2,713 427,200 1,708,800 1,518
56% 37% N/A 100% 587.63pa2qN/A
82% 30% N/A 100% 720.15pb2q2.0ˆCaffeine [162]pb1aq
ResNet-152pdq22.62 71% 51% N/A 100% 619.13pc2qN/A
87% 54% N/A 100% 710.30pd2qN/A
ALAMO [78], [168] 2018
Modularized RTL
compiler, Loop
unrolling, Loop
tiling
AlexNet 1.46 Stratix-V
GXA7
(8-16)-bit
fixed-point 100 7-input LUTs Verilog 2,560 234,720 938,880 256
61% 52% N/A 100% 114.5 1.9ˆRoofline-based FPGA
Accelerator [55] 5.87
NiN 2.2 91% 48% N/A 100% 117.3 N/A 6.14
Angel-Eye [189] 2018
Quantization,
Parallel PEs,
Compiler,
On-chip buffers
VGG-16 30.76
Zynq
XC7Z020
8-bit
fixed-point 214
4-input LUTs N/A
140 53,200 106,400 220 61% 56% 33% 86.4% 84.3 3.8ˆ
nn-X [148]
24.1
Zynq
XC7Z045
16-bit
fixed-point 150 545 218,600 437,200 900 89% 84% 29% 87% 137 6ˆ14.2
Optimizing the CONV
Operation to Accelerate
DNNs on FPGA [204]
2018
Scalable set of data buses
(BUF2PE), Optimized
CONV module for
different (kernel, stride)
size configurations,
Flexible and scalable CNN
modules, CONV loops
and dataflow opt.
NiNpaq2.2 Stratix-Vp1q
GXA7
16-bit
fixed-point
150 7-input LUTs
Verilog
2,560 234,720 938,880 256
59% 97% N/A 100% 278.2pa1qN/A
N/A
86% 93% N/A 100% 348.8pb1qN/A
VGG-16pbq30.95 76% 75% N/A 100% 243.3pc1qN/A
93% 78% N/A 100% 276.6pd1q1.2ˆFP-DNN [171]
ResNet-50pcq7.74 Arria 10p2q
GX 1150 200 8-input LUTs 2,713 427,200 1,708,800 1,518
56% 38% N/A 100% 584.8pa2qN/A
82% 32% N/A 100% 715.9pb2qN/A
ResNet-152pdq22.62 71% 52% N/A 100% 611.4pc2qN/A
87% 55% N/A 100% 707.2pd2qN/A
maps in each layer, along with other operators. Recent
research has demonstrated that a large number of weights
in fully connected layers could be eliminated with minimal
impact on accuracy. In addition, although the suggested
CNN structures by experts perform well for various appli-
cations, the question arises whether the suggested structures
could be optimized for performance with minimal impact on
accuracy. Since the designed CNN has a significant impact
on the complexity of its implementation, we review in this
section some approaches attempting to optimize the design
of CNNs using metaheuristics.
NP-hard combinatorial optimization problems [206] ap-
pear in the design of CNNs. Some examples of areas include
design of CNN structures, selection of weights and bias
values to improve accuracy, and determination of optimal
values of variables to reduce run-time. Below, we briefly
touch upon some existing literature in these areas.
A. CNN STRUCTURE OPTIMIZATION
In the design of CNNs, the number of possible network
structures increases exponentially with the number of layers.
Xie and Yuille used a genetic algorithm in learning deep
network structures [207]. The objective was to find the
best CNN structure that would minimize the error rate.
The cost function was the CNN accuracy. They proposed
an elegant encoding of chromosome using a fixed length
binary string to represent each network structure. A CNN
string represents only the convolution layers.
In each generation, using standard genetic operations new
individuals are generated and weak ones eliminated. The
quality of an individual was assessed by its recognition accuracy, which is obtained via the time-consuming operation of training the network and evaluating it on a validation set.
Two small data sets were used (MNIST and CIFAR-10) to
run the genetic implementation via which they demonstrated
the discovery of new structures.
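The fixed-length binary encoding can be sketched as follows (a simplified version, not the exact encoding of [207]): each bit states whether a candidate connection between two ordered nodes inside a convolution stage is present, and standard genetic operators act directly on the bit string.

#include <bitset>
#include <cstdio>
#include <cstdlib>
#include <string>

// Simplified fixed-length binary chromosome for one convolution stage with
// NODES ordered nodes: bit (i, j) says whether node i feeds node j (i < j).
// This mirrors the spirit, not the exact scheme, of the encoding in [207].
constexpr int NODES = 5;
constexpr int BITS  = NODES * (NODES - 1) / 2;   // one bit per ordered pair

using Chromosome = std::bitset<BITS>;

int bit_index(int i, int j) {                    // requires i < j
    return j * (j - 1) / 2 + i;
}

Chromosome mutate(Chromosome c, double rate) {   // flip each bit with prob. 'rate'
    for (int b = 0; b < BITS; ++b)
        if (std::rand() / double(RAND_MAX) < rate) c.flip(b);
    return c;
}

int main() {
    Chromosome c(std::string("1011001010"));     // one candidate stage structure
    std::printf("connection 0 -> 3 present: %d\n", int(c[bit_index(0, 3)]));
    Chromosome child = mutate(c, 0.1);           // a simple genetic operator
    std::printf("mutated chromosome: %s\n", child.to_string().c_str());
    return 0;
}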
B. CNN WEIGHTS AND BIAS VALUES OPTIMIZATION
An attempt to train CNNs using metaheuristics (that is,
determine weights and bias values) is presented in [208].
The objective again was to improve accuracy and minimize
the estimated error. The authors experiment with three
metaheuristic algorithms, namely, simulated annealing, differential evolution, and harmony search. The algorithms compute the values of the weights and biases in the last layer. These values are used as the solution vector, denoted by x, which is to be optimized. The move comprises adding a small value to x to perturb the state. The cost function y
is modeled as
y = \frac{1}{2} \left( \frac{\sum_{i}^{N} (o - u)^2}{N} \right)^{0.5} \qquad (4)
where o is the expected output, u is the real output, and N is
the number of used samples. The stopping criterion is when
the iteration count is reached or when the cost function goes
below a pre-specified value.
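As an illustration of this procedure, the sketch below applies simulated annealing to a generic solution vector x with the cost of Eq. (4); the cooling schedule, perturbation magnitude, and toy data are our own arbitrary choices rather than the settings of [208].

#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Cost of Eq. (4): y = (1/2) * ( sum_i (o_i - u_i)^2 / N )^0.5,
// where o is the expected output and u the output produced by solution x.
double cost(const std::vector<double>& o, const std::vector<double>& u) {
    double s = 0.0;
    for (size_t i = 0; i < o.size(); ++i) s += (o[i] - u[i]) * (o[i] - u[i]);
    return 0.5 * std::sqrt(s / o.size());
}

int main() {
    std::mt19937 rng(42);
    std::normal_distribution<double>       perturb(0.0, 0.05);
    std::uniform_real_distribution<double> unif(0.0, 1.0);

    std::vector<double> o = {0.9, 0.1, 0.4};   // expected outputs
    std::vector<double> x = {0.0, 0.0, 0.0};   // solution vector (toy case: u = x)
    double T = 1.0, cur = cost(o, x);

    // Stop when the iteration budget is exhausted or the cost drops below
    // a pre-specified value, mirroring the stopping criterion in the text.
    for (int it = 0; it < 5000 && cur > 1e-4; ++it, T *= 0.999) {
        std::vector<double> cand = x;
        for (double& v : cand) v += perturb(rng);          // small move around x
        double c = cost(o, cand);
        if (c < cur || unif(rng) < std::exp((cur - c) / T)) { x = cand; cur = c; }
    }
    std::printf("final cost: %g\n", cur);
    return 0;
}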
C. CNN DESIGN VARIABLES OPTIMIZATION
Suda et al. [80] presented a systematic methodology for
design space exploration with the objective of maximizing
the throughput of an OpenCL-based FPGA accelerator for
a given CNN model (please see subsection III-C). FPGA
resource constraints such as on-chip memory, registers, com-
putational resources and external memory bandwidth are
considered. The optimization problem comprises finding the
best combination of the N_{CONV}, S_{CONV}, N_{NORM}, N_{POOL}, and N_{FC} variables, where
N_{CONV} is the size of the filter (or neuron or kernel);
S_{CONV} is the factor by which computational resources are vectorized to execute in a single-instruction stream multiple-data streams (SIMD) fashion;
N_{NORM} represents the number of normalization operations performed in a single cycle;
N_{POOL} is the number of parallel outputs of the pooling layer in a single cycle to achieve acceleration; and,
N_{FC} is the number of parallel multiply and accumulate (MAC) operations performed in a single work-item within the fully connected layer.
The objective function to be minimized is the run-time
(RT), and is given by

\sum_{i=0}^{TL} RT_i \left[ N_{CONV}, S_{CONV}, N_{NORM}, N_{POOL}, N_{FC} \right] \qquad (5)

subject to digital signal processing (DSP) slices, logic, and memory constraints, where TL represents the total number of CNN layers including the repeated layers. The convolution layer run-time (RT_{CONV}) is analytically modeled as a function of the design variables as

RT_{CONV_i} = \frac{\#\ of\ Convolution\ Ops_i}{N_{CONV} \times S_{CONV} \times Frequency} \qquad (6)

As for the other layers, that is, normalization, pooling, and fully connected, the following general model is proposed

RT_{Layer_i} = \frac{\#\ of\ Layer\ Ops_i}{Unroll\ factor \times Frequency} \qquad (7)
The above analytical models are later validated by per-
forming full synthesis at selective points and running them
on the FPGA accelerator.
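For concreteness, the run-time model of Eqs. (5)-(7) can be evaluated as follows for a candidate set of design variables; the per-layer operation counts and the clock frequency below are placeholders, not values from the paper.

#include <cstdio>
#include <string>
#include <vector>

// Evaluate the run-time model of Eqs. (5)-(7) for one candidate set of
// design variables. Operation counts and the frequency are placeholders.
struct DesignVars { double N_CONV, S_CONV, N_NORM, N_POOL, N_FC; };
struct LayerOps   { std::string type; double ops; };   // # of ops of layer i

double layer_runtime(const LayerOps& l, const DesignVars& v, double freq_hz) {
    double unroll = 1.0;
    if      (l.type == "CONV") unroll = v.N_CONV * v.S_CONV;   // Eq. (6)
    else if (l.type == "NORM") unroll = v.N_NORM;              // Eq. (7)
    else if (l.type == "POOL") unroll = v.N_POOL;
    else if (l.type == "FC")   unroll = v.N_FC;
    return l.ops / (unroll * freq_hz);
}

int main() {
    // Placeholder per-layer operation counts (not from any real network).
    std::vector<LayerOps> layers = {{"CONV", 2.1e8}, {"NORM", 1.0e6},
                                    {"POOL", 5.0e5}, {"CONV", 4.5e8}, {"FC", 7.0e7}};
    DesignVars v{64, 8, 4, 4, 16};
    double total = 0.0;                                // Eq. (5): sum over all layers
    for (const auto& l : layers) total += layer_runtime(l, v, 200e6);
    std::printf("estimated run-time: %.6f s\n", total);
    return 0;
}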
Clearly, in order to determine the best values of the
discussed design variables, exhaustive search, especially if
the number of variables and/or FPGA resources is large, is
infeasible. We have to resort to iterative non-deterministic
heuristics [206] such as simulated annealing, simulated
evolution, tabu search, genetic algorithm, particle swarm
optimization, cuckoo search, etc., or any of the modern
metaheuristics, to efficiently traverse the search space to find
acceptable solutions.
The proposed methodology employing a genetic algorithm was
demonstrated by optimizing the implementation of two representative
CNNs, AlexNet and VGG, on two Altera Stratix-V FPGA platforms, the
DE5-Net and P395-D8 boards, which have different hardware resources.
Peak performance is achieved on both boards, for the convolution
operations as well as for the entire CNN network.
One major issue related to the use of non-deterministic iterative
heuristics in the design of neural networks and CNNs is the large
amount of memory required to store the state of the solution and the
amount of time taken to determine the cost of the solution, be it
accuracy/error estimation, run-time, or any other objective.
Reasonable estimation techniques and analytical formulations are
required to efficiently traverse the design space in search of
efficient solutions.
V. SUMMARY AND RECOMMENDATIONS
In this section, we highlight the key features discussed in
the acceleration of convolutional neural networks (CNNs)
implemented on FPGAs, and provide recommendations to
enhance the effectiveness of employing FPGAs in the ac-
celeration of CNNs.
All reviewed techniques are centered around accelerating
the convolution (CONV) operation as it consumes around
90% of the computational time. This is achieved by utiliz-
ing parallel multiply-accumulate operations bounded by re-
source limitations. In addition, careful design of data access
patterns is targeted to minimize memory bandwidth requirements by
utilizing internal memory structures and maximizing data reuse. This
is crucial in the acceleration process due to the large amount of
data that needs to be accessed, including feature maps (FMs) and
weights. To minimize
the memory footprint and to achieve effective utilization
of resources, some techniques optimize the number of bits
used to represent the feature maps and weights with minimal
impact on accuracy. This is combined with the optimized
selection of the number of fraction bits used for each layer.
Other techniques optimize the number of weights used in the
fully connected (FC) layers as these layers are memory-intensive.
Coprocessors are also employed to automatically configure
both the software and the hardware elements to fully exploit
parallelism [100].
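As a hedged sketch of this bit-width optimization (the exact procedures differ across the reviewed works), the Python snippet below quantizes each layer's weights to 8-bit fixed point and selects, per layer, the number of fraction bits that minimizes the quantization error; the weight tensors and word length are assumed.

```python
import numpy as np

def quantize(values, total_bits, frac_bits):
    """Round to fixed point with the given word length and fraction bits."""
    scale = 2.0 ** frac_bits
    qmin, qmax = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(values * scale), qmin, qmax)
    return q / scale

def best_frac_bits(values, total_bits=8):
    """Pick the per-layer fraction length that minimizes mean-squared error."""
    errors = {f: np.mean((values - quantize(values, total_bits, f)) ** 2)
              for f in range(total_bits)}
    return min(errors, key=errors.get)

rng = np.random.default_rng(0)
layers = {"conv1": rng.normal(0, 0.5, 1000),    # toy weight tensors (assumed)
          "conv2": rng.normal(0, 0.05, 1000),
          "fc":    rng.normal(0, 0.01, 1000)}

for name, w in layers.items():
    f = best_frac_bits(w)
    print(f"{name}: 8-bit fixed point with {f} fraction bits")
```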
To optimize parallelization of convolution operations,
several approaches have been attempted. Workload analysis has been
employed to determine the computations that can be structured as
parallel streams [132]. The roofline model based
accelerator uses polyhedral-based data dependence analysis
to find the optimal unrolling factor for every convolutional
layer [150], and to fully utilize all FPGA computational
resources through loop pipelining. To optimize performance,
tiled matrix multiplication is structured as a pipelined binary
adder tree for performing multiplication and generating
partial sums [198]. An optimization framework has been proposed by
Suda et al., who identified the key design variables [80] and
optimized them to maximize parallelism.
To reduce computational complexity of CONV layers and
improve resource efficiency, a number of approaches such
as [184], [188], [189] utilized Winograd transformation in
performing CONV operations as this reduces the computa-
tional complexity by around 50%.
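To illustrate how the Winograd transformation reduces the multiplication count, the sketch below implements the standard 1-D minimal filtering algorithm F(2,3), which produces two convolution outputs with 4 multiplications instead of 6; the 2-D tile-based variants used in [184], [188], [189] follow the same pattern.

```python
import numpy as np

# Standard F(2,3) transform matrices of the Winograd minimal filtering algorithm.
BT = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=float)
G = np.array([[1,    0,   0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0,    0,   1]], dtype=float)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Two outputs of a 3-tap filter over a 4-element input tile, 4 multiplies."""
    U = G @ g                 # transformed filter (precomputable per layer)
    V = BT @ d                # transformed input tile
    M = U * V                 # element-wise: the only 4 multiplications
    return AT @ M

d = np.array([1.0, 2.0, 3.0, 4.0])    # input tile
g = np.array([0.5, 1.0, -1.0])        # 3-tap filter
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
print(winograd_f23(d, g), direct)     # both give the same two outputs
```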
To maximize throughput, several techniques such as
[165], [170], [192] have used multiple CONV layer pro-
cessors (CLPs) instead of using a single CLP that is opti-
mized for all CONV layers. The operation of the multiple CLPs is
pipelined to achieve layer-level parallelism, which maximizes
resource utilization and enhances performance compared to using a
single CLP.
Since the computational requirement of FC layers is significantly
less than that of CONV layers, a number of techniques such as
[153], [162], [188], [189] create batches by grouping different
input FMs and processing them together in the FC layers to improve
performance and maximize resource utilization.
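A minimal NumPy sketch of this batching idea: by stacking several flattened input FMs into one matrix, the memory-heavy FC weight matrix is fetched once and reused across the whole batch, turning many matrix-vector products into a single matrix-matrix product; the layer and batch sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
in_features, out_features, batch = 4096, 1000, 32    # assumed FC layer sizes

W = rng.normal(size=(out_features, in_features))     # FC weights (memory-heavy)
b = rng.normal(size=out_features)
inputs = rng.normal(size=(batch, in_features))       # batch of flattened FMs

# Unbatched: the weights are (conceptually) re-fetched for every input vector.
unbatched = np.stack([W @ x + b for x in inputs])

# Batched: one matrix-matrix multiply reuses W across the whole batch.
batched = inputs @ W.T + b

print(np.allclose(unbatched, batched))   # True: same results, better weight reuse
```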
Complex access patterns and data locality are exploited in the
DeepBurning tool [155] for better data reuse. In [197],
the authors explored hot spots profiling to determine the
computational parts that need to be accelerated to improve
the performance. Acceleration is accomplished by reducing
the memory bandwidth requirements. Techniques proposed
exploit data reuse to reduce off-chip memory communica-
tions. Loop transformations have also been used, reducing tiling
parameters to improve data locality and to reduce redundant
communication operations, so as to maximize data sharing/reuse.
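The following hardware-agnostic Python sketch gives a minimal picture of loop tiling: the computation is traversed tile by tile so that each small block of data, once loaded into what would be an on-chip buffer, is reused for all the work that needs it; the matrix and tile sizes are illustrative.

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """C = A @ B computed tile by tile; each tile of A and B is 'loaded' once
    per block of work, mimicking on-chip buffering and data reuse."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                a_tile = A[i0:i0 + tile, k0:k0 + tile]   # reused across the j-tile
                b_tile = B[k0:k0 + tile, j0:j0 + tile]   # reused across the i-tile
                C[i0:i0 + tile, j0:j0 + tile] += a_tile @ b_tile
    return C

rng = np.random.default_rng(0)
A, B = rng.normal(size=(96, 80)), rng.normal(size=(80, 64))
print(np.allclose(tiled_matmul(A, B), A @ B))   # True
```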
Efficient buffering, where the weight buffers are used to
ensure the availability of CONV and FC layers’ weights
before their computation, as well as to overlap the transfer
of FC layer weights with its computation, helps in improving
performance [78], [168]. In the Catapult project, FPGA boards were
integrated into data center applications and achieved speedup.
Microsoft Research's Catapult utilized a multi-banked input buffer
and a kernel weight buffer to provide an efficient buffering scheme
for feature maps and weights, respectively. To minimize the off-chip
memory traffic, a specialized network-on-chip was designed to
redistribute the output feature maps onto the multi-banked input
buffer instead of transferring them to the external memory [152].
To further reduce the memory footprint and bandwidth requirement,
optimal fractional lengths for the weights and feature maps in each
layer are used. Singular value decomposition (SVD) has also been
applied to the weight matrix of the FC layer in order to reduce the
memory footprint at this layer [98].
Tiling techniques have been proposed where large-scale
input data is partitioned into small subsets or tiles whose size
is configured to leverage the trade-off between the hardware
cost and the speedup [197].
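A minimal NumPy sketch of the SVD idea applied to an FC weight matrix: keeping only the top-$k$ singular values factors an $m \times n$ matrix into two small matrices with $k(m+n)$ parameters instead of $mn$, trading a small approximation error for a much smaller memory footprint; the matrix sizes, the kept rank, and the assumed low-rank structure of the toy weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 1024, 1024, 128           # FC weight shape and kept rank (assumed)

# Toy FC weights with an (assumed) approximately low-rank structure.
W = rng.normal(size=(m, 100)) @ rng.normal(size=(100, n)) / 100 \
    + 0.02 * rng.normal(size=(m, n))

U, s, Vt = np.linalg.svd(W, full_matrices=False)
W1 = U[:, :k] * s[:k]               # m x k factor
W2 = Vt[:k, :]                      # k x n factor

x = rng.normal(size=n)              # one flattened input FM
y_full = W @ x                      # original FC layer
y_svd = W1 @ (W2 @ x)               # two small products instead of one large one

params_before, params_after = m * n, k * (m + n)
rel_err = np.linalg.norm(y_full - y_svd) / np.linalg.norm(y_full)
print(f"parameters: {params_before} -> {params_after}, "
      f"relative output error: {rel_err:.3f}")
```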
Automation tools have been developed that automatically
build neural networks with optimized performance [155].
They employ a pre-constructed register transfer level (RTL)
module library that holds hardware modules (including logical and
arithmetic operations) and configuration scripts. DeepBurn-
ing, for example, generates the hardware description for
neural network scripts. Another modularized RTL compiler, ALAMO,
integrates both fine-grained RTL-level optimization and the
flexibility of high-level synthesis (HLS) to generate efficient,
parameterized Verilog RTL scripts for ASIC or FPGA platforms under
the available number of parallel computing resources (i.e., the
number of multipliers) [78], [168]. Acceleration is achieved by
employing the loop unrolling technique for CONV layer operations.
Some of the reviewed
techniques also help minimize the size of FPGA on-chip
memories to optimize energy and area usage [146], [147].
In Table 4 and Table 5, we list the optimization mech-
anisms utilized by each of the reviewed techniques to
maximize performance and throughput of FPGA-based deep
learning networks.
To enhance the utilization of FPGAs in CNN acceleration
and to maximize their effectiveness, we recommend the
development of a framework that includes a user-friendly
interface that allows the user to easily specify the CNN
model to be accelerated. This includes specifying the CNN
model parameters in terms of number of convolution layers
and their sizes, and number of fully connected layers along
with other intermediate operations. The specified CNN
model weights will be read from a file. In addition, the user
should have the option of specifying the FPGA platform
that will be used for implementing the CNN accelerator
and the maximum tolerable error, along with the selection
of a library from a set of applications to be used for model
optimization and evaluation. The framework then should
perform optimizations to find the minimum number of bits
that need to be used for representing the weights and feature
maps and the number of fraction bits to be used for each
layer. In addition, optimization of fully connected layers is
performed to minimize the memory requirements. All such
optimizations are carried out bounded by the maximum error
specified by the user for the specified application library.
The framework should be designed based on the devel-
opment of a scalable hardware architecture that works for
any given FPGA platform and achieves higher speedup with
the availability of higher resources. Based on the available
resources, specified by the FPGA platform, the tool will
perform optimizations to maximize parallelism and data
reuse, given the resource limitations. The tool will then
automatically generate the CNN model that will fit on the
given FPGA platform and will allow the user to evaluate
the performance based on the chosen application library.
This will allow the user to evaluate the performance gains
while evaluating different FPGA platforms with different
resources. The tool should have the option to generate per-
formance measures based on different performance metrics
as selected by the user such as number of frames processed
per second or number of operations performed per second.
In addition, the tool will report other design metrics such
as resource utilization, memory sizes and bandwidth, and
power dissipation.
Furthermore, it is desired to have the option for the user
to specify the desired performance for a given CNN model
and have the tool perform necessary analysis and evaluation
and recommend to the user candidate FPGA platforms for
achieving the desired performance levels. This will require
the development of reasonably accurate analytical models
that will estimate the needed resources for achieving the
desired performance. The user can then choose the recom-
mended FPGA platform and perform complete evaluation
to verify that the desired performance levels are met.
VI. CONCLUSION
In this paper, we reviewed recent developments in the area
of acceleration of deep learning networks and, in particular,
convolution neural networks (CNNs) on field programmable
gate arrays (FPGAs). The paper begins with a brief overview
of deep learning techniques highlighting their importance,
key operations, and applications. Special emphasis is given
to CNNs as they have wide applications in the area of
image detection and recognition and require both CPU
and memory intensive operations that can be effectively
accelerated utilizing FPGA inherent ability to maximize
parallelism of operations.
While the paper briefly touches upon the acceleration
techniques for deep learning algorithms and CNNs from
both software and hardware perspectives, the core of this
article has been the review of recent techniques employed
in the acceleration of CNNs on FPGAs. A thorough up-to-
date review is provided that illustrates the employment of
various possibilities and techniques such as exploitation of
parallelism utilizing loop tiling and loop unrolling, effective
use of internal memory to maximize data reuse, operation
pipelining, and effective use of data sizes to minimize memory
footprint and to optimize FPGA resource utilization.
The paper also presented the use of tools for generating
register transfer level (RTL) scripts that not only help in
automating the design process, but also help in exploring the
design space and suggesting efficient hardware. The paper
discusses the use of analytics such as: (i) work load analysis
in determining the computations that can be parallelized,
(ii) optimal loop unrolling factors, (iii) determining access
patterns to improve data locality, etc. In addition, a brief
review of the use of non-deterministic heuristics in solving
NP-hard combinatorial optimization problems in the design
and implementation of CNNs has been presented. Finally, the paper
summarizes the key features employed by the various FPGA-based CNN
acceleration techniques and provides recommendations for enhancing
the effectiveness of utilizing FPGAs in CNN acceleration.
ACKNOWLEDGMENT
The authors acknowledge King Fahd University of Petroleum
& Minerals, Dhahran, Saudi Arabia, for all support. We would also
like to acknowledge Dr. Blair P. Bremberg and Ms. Sumaiya Hussain
Sadiq for their help in the professional English editing of this
manuscript.
TABLE 4. Optimization Mechanisms Employed for FPGA-based Acceleration of Deep Learning Networks. Techniques covered: VIP [119], CNP [127], CONV Coprocessor Accelerator [129], MAPLE [132], DC-CNN [100], NeuFlow [143], Memory-Centric Accelerator [146], nn-X [148], Roofline-based FPGA Accelerator [55], Embedded FPGA Accelerator [98], DeepBurning [155], OpenCL-based FPGA Accelerator [80], Caffeine [153], [162], and fpgaConvNet [165]. For each technique, the table marks which of the following optimization mechanisms are employed: Loop Unrolling, Loop Tiling, Loop Interchange, Pipelining, Batching, Multi-CLPs, Fixed-Point Precision, Per-Layer Quantization, Singular Value Decomposition, Prefetching, Rearranging Memory Data, In-Memory Processing, Line Buffer, Double Buffering, Approximating Non-Linear AF, Eliminating FC Layer, Roofline Model, Polyhedral Optimization, Dynamic Programming, and Graph Partitioning.
TABLE 5. Optimization Mechanisms Employed for FPGA-based Acceleration of Deep Learning Networks. Techniques covered: ALAMO [78], [168], Throughput-Optimized FPGA Accelerator [170], FP-DNN [171], FINN [181], Customized CONV Loop Accelerator [83], Latency-Driven Design for FPGA-based CNNs [183], DLA [188], Winograd-based CNN Accelerator [189], OpenCL-based Architecture for Accelerating CNNs [190], Multi-CLP Accelerator for CNNs [192], Automated Systolic Array Architecture for CNN [195], End-to-End Scalable FPGA Accelerator [196], DLAU [197], An Automatic RTL Compiler for High-Throughput Deep CNNs [199], Intel's DLA [200], Angel-Eye [60], and Optimizing the CONV Operation to Accelerate DNNs on FPGA [204]. For each technique, the table marks which of the following optimization mechanisms are employed: Loop Unrolling, Loop Tiling, Loop Interchange, Pipelining, Input Batching, FC Layer Batching, Multi-CLPs, Binarized CNN, Fixed-Point Precision, Per-Layer Quantization, Prefetching, Rearranging Memory Data, Line Buffer, Double Buffering, Padding Optimizations, Winograd Algorithm, Approximating Non-Linear AF, Roofline Model, Polyhedral Optimization, Dynamic Programming, Graph Coloring, Graph Partitioning, and Pattern Matching.
REFERENCES
[1] Y. Bengio et al., “Learning deep architectures for ai, Foundations and
trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009. 1
[2] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neu-
ral networks, vol. 61, pp. 85–117, 2015. 1
[3] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep learning.
MIT Press Cambridge, 2016, vol. 1. 1
[4] L. Zhang, S. Wang, and B. Liu, “Deep learning for sentiment analysis: A
survey, Wiley Interdisciplinary Reviews: Data Mining and Knowledge
Discovery, p. e1253, 2018. 1
[5] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning represen-
tations by back-propagating errors,” Nature, vol. 323, no. 6088, p. 533,
1986. 1
[6] Rumelhart, David E and Hinton, Geoffrey E and Williams, Ronald J,
“Neurocomputing: Foundations of research,” ch. Learning Representa-
tions by Back-propagating Errors, pp. 696–699, 1988. 1
[7] M. A. Nielsen, Neural networks and deep learning. Determination Press,
USA, 2015, vol. 25. 1
[8] T. Weyand, I. Kostrikov, and J. Philbin, “Planet-photo geolocation with
convolutional neural networks, in European Conference on Computer
Vision. Springer, 2016, pp. 37–55. 1
[9] Mathworks. What Is Deep Learning? [Online]. Available: https://www.
mathworks.com/discovery/deep-learning.html/, 2018. 1, 2
[10] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521,
no. 7553, p. 436, 2015. 2
[11] A. Deshpande. A Beginner’s Guide To Understanding Convolutional
Neural Networks [Online]. Available: https://adeshpande3.github.io/A-
Beginner%27s-Guide-To-Understanding-Convolutional-Neural-
Networks/, 2018. 2, 4
[12] J. E. Dayhoff, Neural network architectures: an introduction. Van
Nostrand Reinhold New York, 1990. 2
[13] Y. LeCun and Y. Bengio, “Convolutional networks for images, speech,
and time series,” The handbook of brain theory and neural networks, vol.
3361, no. 10, p. 1995, 1995. 2
[14] J. Hauswald, Y. Kang, M. A. Laurenzano, Q. Chen, C. Li, T. Mudge,
R. G. Dreslinski, J. Mars, and L. Tang, “Djinn and tonic: DNN as a
service and its implications for future warehouse scale computers,” in
ACM SIGARCH Computer Architecture News, vol. 43, no. 3. ACM,
2015, pp. 27–40. 2
[15] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals,
R. Monga, and G. Toderici, “Beyond short snippets: Deep networks for
video classification,” in Proceedings of the IEEE conference on computer
vision and pattern recognition, 2015, pp. 4694–4702. 2
[16] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E.
Hubbard, and L. D. Jackel, “Handwritten digit recognition with a back-
propagation network,” in Advances in neural information processing
systems, 1990, pp. 396–404. 2
[17] P. Barros, S. Magg, C. Weber, and S. Wermter, A multichannel con-
volutional neural network for hand posture recognition, in International
Conference on Artificial Neural Networks. Springer, 2014, pp. 403–410.
2
[18] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep
recurrent neural networks,” in Acoustics, speech and signal processing
(icassp), 2013 ieee international conference on. IEEE, 2013, pp. 6645–
6649. 2
[19] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck, “Learning
deep structured semantic models for web search using clickthrough data,”
in Proceedings of the 22nd ACM international conference on Conference
on information & knowledge management. ACM, 2013, pp. 2333–2338.
2
[20] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu,
“Convolutional neural networks for speech recognition, IEEE/ACM
Transactions on audio, speech, and language processing, vol. 22, no. 10,
pp. 1533–1545, 2014. 2
[21] P. Y. Simard, D. Steinkraus, and J. C. Platt, “Best practices for convo-
lutional neural networks applied to visual document analysis,” in null.
IEEE, 2003, p. 958. 2
[22] S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent convolutional neural
networks for text classification.” in AAAI, vol. 333, 2015, pp. 2267–
2273. 2
[23] Y. Kim, “Convolutional neural networks for sentence classification,”
arXiv preprint arXiv:1408.5882, 2014. 2
[24] R. Collobert and J. Weston, “A unified architecture for natural language
processing: Deep neural networks with multitask learning,” in Proceed-
ings of the 25th international conference on Machine learning. ACM,
2008, pp. 160–167. 2
[25] R. Sarikaya, G. E. Hinton, and A. Deoras, “Application of deep belief
networks for natural language understanding,” IEEE/ACM Transactions
on Audio, Speech and Language Processing (TASLP), vol. 22, no. 4, pp.
778–784, 2014. 2
[26] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and
L. Fei-Fei, “Large-scale video classification with convolutional neural
networks,” in Proceedings of the IEEE conference on Computer Vision
and Pattern Recognition, 2014, pp. 1725–1732. 2
[27] J. Mutch and D. G. Lowe, “Multiclass object recognition with sparse,
localized features,” in Computer Vision and Pattern Recognition, 2006
IEEE Computer Society Conference on, vol. 1. IEEE, 2006, pp. 11–18.
2
[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks, in Advances in neural infor-
mation processing systems, 2012, pp. 1097–1105. 2, 4, 5, 12, 13
[29] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014. 2,
4, 5
[30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,
A. Karpathy, A. Khosla, and M. Bernstein, “Imagenet large scale visual
recognition challenge,” International Journal of Computer Vision, vol.
115, no. 3, pp. 211–252, 2015. 2
[31] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,
arXiv preprint arXiv:1409.4842, vol. 7, 2015. 2, 27
[32] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time
object detection with region proposal networks,” in Advances in neural
information processing systems, 2015, pp. 91–99. 2
[33] K. Korekado, T. Morie, O. Nomura, T. Nakano, M. Matsugu, and
A. Iwata, “An image filtering processor for face/object recognition us-
ing merged/mixed analog-digital architecture, in VLSI Circuits, 2005.
Digest of Technical Papers. 2005 Symposium on. IEEE, 2005, pp. 220–
223. 2
[34] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, “A convolutional neural
network cascade for face detection,” in Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, 2015, pp. 5325–5334.
2
[35] U. Muller, J. Ben, E. Cosatto, B. Flepp, and Y. L. Cun, “Off-road
obstacle avoidance through end-to-end learning, in Advances in neural
information processing systems, 2006, pp. 739–746. 2
[36] R. Hadsell, A. Erkan, P. Sermanet, J. Ben, K. Kavukcuoglu, U. Muller,
and Y. LeCun, “A multi-range vision strategy for autonomous offroad
navigation,” Proc. Robotics and Applications (RA’07), vol. 1, no. 7, 2007.
2
[37] P. Sermanet, R. Hadsell, M. Scoffier, M. Grimes, J. Ben, A. Erkan,
C. Crudele, U. Miller, and Y. LeCun, “A multirange architecture for
collision-free off-road robot navigation, Journal of Field Robotics,
vol. 26, no. 1, pp. 52–87, 2009. 2
[38] B. Blanco-Filgueira, D. García-Lesta, M. Fernández-Sanjurjo, V. M.
Brea, and P. López, “Deep learning-based multiple object visual tracking
on embedded system for iot and mobile edge computing applications,”
arXiv preprint arXiv:1808.01356, 2018. 2
[39] P. D. McNelis, Neural networks in finance: gaining predictive edge in the
market. Academic Press, 2005. 2
[40] P. J. Lisboa and E. C. Ifeachor, Artificial neural networks in biomedicine.
Springer Science & Business Media, 2000. 2
[41] P. W. Mirowski, Y. LeCun, D. Madhavan, and R. Kuzniecky, “Comparing
SVM and convolutional networks for epileptic seizure prediction from
intracranial EEG,” in Machine Learning for Signal Processing, 2008.
MLSP 2008. IEEE Workshop on. IEEE, 2008, pp. 244–249. 2
[42] G. E. Dahl, T. N. Sainath, and G. E. Hinton, “Improving deep neural
networks for LVCSR using rectified linear units and dropout,” in Acous-
tics, Speech and Signal Processing (ICASSP), 2013 IEEE International
Conference on. IEEE, 2013, pp. 8609–8613. 2
[43] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu,
U. Muller, and Y. LeCun, “Learning long-range vision for autonomous
off-road driving, Journal of Field Robotics, vol. 26, no. 2, pp. 120–144,
2009. 2
[44] L. Deng, D. Yu et al., “Deep learning: methods and applications,
Foundations and Trends® in Signal Processing, vol. 7, no. 3–4, pp. 197–
387, 2014. 2
[45] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hi-
erarchies for accurate object detection and semantic segmentation,” in
Proceedings of the IEEE conference on computer vision and pattern
recognition, 2014, pp. 580–587. 2
[46] B. Wu, F. N. Iandola, P. H. Jin, and K. Keutzer, “Squeezedet: Unified,
small, low power fully convolutional neural networks for real-time object
detection for autonomous driving.” in CVPR Workshops, 2017, pp. 446–
454. 2
[47] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers:
Surpassing human-level performance on imagenet classification, in Pro-
ceedings of the IEEE international conference on computer vision, 2015,
pp. 1026–1034. 2
[48] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolu-
tional networks,” in European conference on computer vision. Springer,
2014, pp. 818–833. 2, 7
[49] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778. 2, 4, 5, 27
[50] Image-Net. The ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) [Online]. Available: http://image-net.org/challenges/LSVRC/,
2018. 2
[51] A.-r. Mohamed, G. E. Dahl, G. Hinton et al., “Acoustic modeling using
deep belief networks,” IEEE Trans. Audio, Speech & Language Process-
ing, vol. 20, no. 1, pp. 14–22, 2012. 2
[52] O. Nomura and T. Morie, “Projection-field-type VLSI convolutional
neural networks using merged/mixed analog-digital approach, in Inter-
national Conference on Neural Information Processing. Springer, 2007,
pp. 1081–1090. 2
[53] T. M. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman, “Project
Adam: Building an efficient and scalable deep learning training system.”
in OSDI, vol. 14, 2014, pp. 571–582. 2
[54] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hub-
bard, and L. D. Jackel, “Backpropagation applied to handwritten zip code
recognition,” Neural computation, vol. 1, no. 4, pp. 541–551, 1989. 3
[55] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing
FPGA-based accelerator design for deep convolutional neural networks,
in Proceedings of the 2015 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays. ACM, 2015, pp. 161–170. 3, 4, 12,
13, 14, 15, 16, 19, 21, 23, 29, 30, 34
[56] A. Yazdanbakhsh, J. Park, H. Sharma, P. Lotfi-Kamran, and H. Es-
maeilzadeh, “Neural acceleration for gpu throughput processors,” in
Proceedings of the 48th International Symposium on Microarchitecture.
ACM, 2015, pp. 482–493. 3
[57] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly,
A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural
networks for acoustic modeling in speech recognition: The shared views
of four research groups,” IEEE Signal processing magazine, vol. 29,
no. 6, pp. 82–97, 2012. 3
[58] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for
fast feature embedding,” in Proceedings of the 22nd ACM international
conference on Multimedia. ACM, 2014, pp. 675–678. 3, 14, 25
[59] A. Vasudevan, A. Anderson, and D. Gregg, “Parallel multi channel
convolution using general matrix multiplication, in Application-specific
Systems, Architectures and Processors (ASAP), 2017 IEEE 28th Interna-
tional Conference on. IEEE, 2017, pp. 19–24. 3
[60] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang, and
H. Yang, “Angel-eye: A complete design flow for mapping cnn onto
embedded FPGA,” IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, vol. 37, no. 1, pp. 35–47, 2018. 3, 27,
28, 35
[61] E. Nurvitadhi, G. Venkatesh, J. Sim, D. Marr, R. Huang, J. Ong
Gee Hock, Y. T. Liew, K. Srivatsan, D. Moss, S. Subhaschandra et al.,
“Can FPGAs beat GPUs in accelerating next-generation deep neural
networks?” in Proceedings of the 2017 ACM/SIGDA International Sym-
posium on Field-Programmable Gate Arrays. ACM, 2017, pp. 5–14.
3
[62] J. Misra and I. Saha, “Artificial neural networks in hardware: A survey of
two decades of progress,” Neurocomputing, vol. 74, no. 1-3, pp. 239–255,
2010. 3
[63] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Neural acceler-
ation for general-purpose approximate programs,” in Proceedings of the
2012 45th Annual IEEE/ACM International Symposium on Microarchi-
tecture. IEEE Computer Society, 2012, pp. 449–460. 3
[64] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J.
Dally, “Eie: efficient inference engine on compressed deep neural net-
work,” in Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual
International Symposium on. IEEE, 2016, pp. 243–254. 3
[65] L. Du, Y. Du, Y. Li, J. Su, Y.-C. Kuan, C.-C. Liu, and M.-C. F. Chang, “A
reconfigurable streaming deep convolutional neural network accelerator
for internet of things,” IEEE Transactions on Circuits and Systems I:
Regular Papers, vol. 65, no. 1, pp. 198–208, 2018. 3
[66] W. Vanderbauwhede and K. Benkrid, High-performance computing using
FPGAs. Springer, 2013. 3
[67] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides,
J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray et al., “A
reconfigurable fabric for accelerating large-scale datacenter services,
ACM SIGARCH Computer Architecture News, vol. 42, no. 3, pp. 13–
24, 2014. 3, 13
[68] Y. Liang, K. Rupnow, Y. Li, D. Min, M. N. Do, and D. Chen, “High-level
synthesis: productivity, performance, and software constraints,” Journal
of Electrical and Computer Engineering, vol. 2012, p. 1, 2012. 3
[69] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang,
“High-level synthesis for fpgas: From prototyping to deployment, IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Sys-
tems, vol. 30, no. 4, pp. 473–491, 2011. 3
[70] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J. H. An-
derson, S. Brown, and T. Czajkowski, “Legup: high-level synthesis for
fpga-based processor/accelerator systems,” in Proceedings of the 19th
ACM/SIGDA international symposium on Field programmable gate ar-
rays. ACM, 2011, pp. 33–36. 3
[71] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning
applied to document recognition,” Proceedings of the IEEE, vol. 86,
no. 11, pp. 2278–2324, 1998. 3, 10, 18
[72] R. Hameed, W. Qadeer, M. Wachs, O. Azizi, A. Solomatnikov, B. C. Lee,
S. Richardson, C. Kozyrakis, and M. Horowitz, “Understanding sources
of inefficiency in general-purpose chips, in ACM SIGARCH Computer
Architecture News, vol. 38, no. 3. ACM, 2010, pp. 37–47. 3
[73] S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco,
“GPUs and the future of parallel computing,” IEEE Micro, vol. 31, no. 5,
pp. 7–17, 2011. 3
[74] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for
energy-efficient dataflow for convolutional neural networks,” in ACM
SIGARCH Computer Architecture News, vol. 44, no. 3. IEEE Press,
2016, pp. 367–379. 3, 4
[75] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, “Robust
object recognition with cortex-like mechanisms,” IEEE Transactions on
Pattern Analysis & Machine Intelligence, no. 3, pp. 411–426, 2007. 3
[76] P. Joshi. What Is Local Response Normalization In
Convolutional Neural Networks [Online]. Available:
https://prateekvjoshi.com/2016/04/05/what-is-local-response-
normalization-in-convolutional-neural-networks/, 2018. 4
[77] J. Cong and B. Xiao, “Minimizing computation in convolutional neural
networks,” in International conference on artificial neural networks.
Springer, 2014, pp. 281–290. 4, 12
[78] Y. Ma, N. Suda, Y. Cao, J.-s. Seo, and S. Vrudhula, “Scalable and mod-
ularized rtl compilation of convolutional neural networks onto FPGA,
in Field Programmable Logic and Applications (FPL), 2016 26th Inter-
national Conference on. IEEE, 2016, pp. 1–8. 4, 16, 25, 30, 32, 33,
35
[79] D. F. Bacon, S. L. Graham, and O. J. Sharp, “Compiler transformations
for high-performance computing,” ACM Computing Surveys (CSUR),
vol. 26, no. 4, pp. 345–420, 1994. 4, 16, 19
[80] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-s.
Seo, and Y. Cao, “Throughput-optimized opencl-based FPGA accelerator
for large-scale convolutional neural networks, in Proceedings of the
2016 ACM/SIGDA International Symposium on Field-Programmable
Gate Arrays. ACM, 2016, pp. 16–25. 4, 5, 6, 14, 16, 17, 20, 21, 22, 29,
30, 31, 32, 34
[81] M. Denil, B. Shakibi, L. Dinh, N. De Freitas et al., “Predicting parameters
in deep learning,” in Advances in neural information processing systems,
2013, pp. 2148–2156. 4, 5
[82] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltz-
mann machines,” in Proceedings of the 27th international conference on
machine learning (ICML-10), 2010, pp. 807–814. 4
[83] Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, “Optimizing loop operation and
dataflow in FPGA acceleration of deep convolutional neural networks,
in Proceedings of the 2017 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays. ACM, 2017, pp. 45–54. 4, 19, 24,
25, 26, 28, 30, 35
[84] A. Karpathy. Convolutional Neural Networks for Visual Recognition
[Online]. Available: http://cs231n.github.io/convolutional-networks/,
2018. 4
[85] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual
networks,” in European conference on computer vision. Springer, 2016,
pp. 630–645. 5
[86] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4,
inception-resnet and the impact of residual connections on learning.” in
AAAI, vol. 4, 2017, p. 12. 5
[87] J. Villasenor and W. H. Mangione-Smith, “Configurable computing,
Scientific American, vol. 276, no. 6, pp. 66–71, 1997. 5, 6
[88] S. D. Brown, R. J. Francis, J. Rose, and Z. G. Vranesic, Field-
programmable gate arrays. Springer Science & Business Media, 2012,
vol. 180. 6
[89] M. C. Herbordt, Y. Gu, T. VanCourt, J. Model, B. Sukhwani, and
M. Chiu, “Computing models for FPGA-based accelerators,” Computing
in science & engineering, vol. 10, no. 6, pp. 35–45, 2008. 6
[90] B. S. C. Varma, K. Paul, and M. Balakrishnan, Architecture exploration
of FPGA based accelerators for BioInformatics applications. Springer,
2016. 6
[91] G. Lacey, G. W. Taylor, and S. Areibi, “Deep learning on FPGAs: Past,
present, and future,” arXiv preprint arXiv:1602.04283, 2016. 6
[92] C. Farabet, Y. LeCun, K. Kavukcuoglu, E. Culurciello, B. Martini, P. Ak-
selrod, and S. Talay, “Large-scale FPGA-based convolutional networks,”
Scaling up Machine Learning: Parallel and Distributed Approaches, pp.
399–419, 2011. 6
[93] A. Munshi, “The opencl specification,” in Hot Chips 21 Symposium
(HCS), 2009 IEEE. IEEE, 2009, pp. 1–314. 6
[94] J. E. Stone, D. Gohara, and G. Shi, “OpenCL: A parallel programming
standard for heterogeneous computing systems,” Computing in science
& engineering, vol. 12, no. 3, pp. 66–73, 2010. 6
[95] A. R. Omondi and J. C. Rajapakse, FPGA implementations of neural
networks. Springer, 2006, vol. 365. 6
[96] H. M. Waidyasooriya, M. Hariyama, and K. Uchiyama, Design of FPGA-
Based Computing Systems with OpenCL. Springer, 2018. 6
[97] V. Sze, Y.-H. Chen, J. Emer, A. Suleiman, and Z. Zhang, “Hardware for
machine learning: Challenges and opportunities,” in Custom Integrated
Circuits Conference (CICC), 2018 IEEE. IEEE, 2018, pp. 1–8. 6
[98] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, and
S. Song, “Going deeper with embedded FPGA platform for convolutional
neural network,” in Proceedings of the 2016 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays. ACM, 2016, pp. 26–
35. 6, 13, 14, 20, 29, 30, 32, 34
[99] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao,
Y. Wang et al., “Ese: Efficient speech recognition engine with sparse
lstm on FPGA,” in Proceedings of the 2017 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays. ACM, 2017, pp. 75–
84. 6
[100] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi, “A dynam-
ically configurable coprocessor for convolutional neural networks, in
ACM SIGARCH Computer Architecture News, vol. 38, no. 3. ACM,
2010, pp. 247–257. 6, 10, 11, 29, 32, 34
[101] C. F. Van Loan, “Matrix computations (johns hopkins studies in mathe-
matical sciences),” 1996. 6, 7, 13
[102] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploit-
ing linear structure within convolutional networks for efficient evalua-
tion,” in Advances in neural information processing systems, 2014, pp.
1269–1277. 7
[103] G. Guennebaud, B. Jacob, M. Lenz et al., “Eigen v3, 2010,” URL
http://eigen. tuxfamily. org, 2015. 7
[104] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and con-
nections for efficient neural network, in Advances in neural information
processing systems, 2015, pp. 1135–1143. 7
[105] Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal brain damage,” in
Advances in neural information processing systems, 1990, pp. 598–605.
7
[106] S. J. Hanson and L. Y. Pratt, “Comparing biases for minimal network
construction with back-propagation,” in Advances in neural information
processing systems, 1989, pp. 177–185. 7
[107] B. Hassibi and D. G. Stork, “Second order derivatives for network
pruning: Optimal brain surgeon,” in Advances in neural information
processing systems, 1993, pp. 164–171. 7
[108] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep
neural networks with pruning, trained quantization and huffman coding,
arXiv preprint arXiv:1510.00149, 2015. 7
[109] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam,
“Diannao: A small-footprint high-throughput accelerator for ubiquitous
machine-learning,” ACM Sigplan Notices, vol. 49, no. 4, pp. 269–284,
2014. 7, 8
[110] Y. LeCun, “The mnist database of handwritten digits, http://yann. lecun.
com/exdb/mnist/, 1998. 7
[111] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen,
Z. Xu, N. Sun et al., “Dadiannao: A machine-learning supercomputer,
in Proceedings of the 47th Annual IEEE/ACM International Symposium
on Microarchitecture. IEEE Computer Society, 2014, pp. 609–622. 7, 8
[112] T. Luo, S. Liu, L. Li, Y. Wang, S. Zhang, T. Chen, Z. Xu, O. Temam,
and Y. Chen, “Dadiannao: A neural network supercomputer, IEEE
Transactions on Computers, vol. 66, no. 1, pp. 73–88, 2017. 7, 8
[113] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Teman, X. Feng, X. Zhou,
and Y. Chen, “Pudiannao: A polyvalent machine learning accelerator, in
ACM SIGARCH Computer Architecture News, vol. 43, no. 1. ACM,
2015, pp. 369–381. 7, 8
[114] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen,
and O. Temam, “Shidiannao: Shifting vision processing closer to the
sensor, in ACM SIGARCH Computer Architecture News, vol. 43, no. 3.
ACM, 2015, pp. 92–104. 8
[115] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processing of
deep neural networks: A tutorial and survey,” Proceedings of the IEEE,
vol. 105, no. 12, pp. 2295–2329, 2017. 8
[116] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Stra-
chan, M. Hu, R. S. Williams, and V. Srikumar, “Isaac: A convolutional
neural network accelerator with in-situ analog arithmetic in crossbars,”
ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 14–
26, 2016. 8
[117] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie,
“Prime: A novel processing-in-memory architecture for neural network
computation in reram-based main memory, in ACM SIGARCH Com-
puter Architecture News, vol. 44, no. 3. IEEE Press, 2016, pp. 27–39.
8
[118] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, “Flexflow: A flexible
dataflow accelerator architecture for convolutional neural networks, in
High Performance Computer Architecture (HPCA), 2017 IEEE Interna-
tional Symposium on. IEEE, 2017, pp. 553–564. 8
[119] J. Cloutier, E. Cosatto, S. Pigeon, F. R. Boyer, and P. Y. Simard, “Vip:
An FPGA-based processor for image processing and neural networks,”
in Microelectronics for Neural Networks, 1996., Proceedings of Fifth
International Conference on. IEEE, 1996, pp. 330–336. 9, 34
[120] D. F. Wolf, R. A. Romero, and E. Marques, “Using embedded proces-
sors in hardware models of artificial neural networks,” in V Simposio
Brasileiro de automação inteligente, Brasil, 2001. 9
[121] K. R. Nichols, M. A. Moussa, and S. M. Areibi, “Feasibility of floating-
point arithmetic in FPGA based artificial neural networks,” in In CAINE.
Citeseer, 2002. 9
[122] K. Benkrid and S. Belkacemi, “Design and implementation of a 2d
convolution core for video applications on FPGAs, in Digital and
Computational Video, 2002. DCV 2002. Proceedings. Third International
Workshop on. IEEE, 2002, pp. 85–92. 9
[123] F. Cardells-Tormo and P.-L. Molinet, “Area-efficient 2-d shift-variant
convolvers for FPGA-based digital image processing, in Signal Pro-
cessing Systems Design and Implementation, 2005. IEEE Workshop on.
IEEE, 2005, pp. 209–213. 9
[124] R. G. Gironés, R. C. Palero, J. C. Boluda, and A. S. Cortés, “FPGA
implementation of a pipelined on-line backpropagation,” Journal of VLSI
signal processing systems for signal, image and video technology, vol. 40,
no. 2, pp. 189–213, 2005. 9
[125] H. Zhang, M. Xia, and G. Hu, “A multiwindow partial buffering scheme
for FPGA-based 2-d convolvers, IEEE Transactions on Circuits and
Systems II: Express Briefs, vol. 54, no. 2, pp. 200–204, 2007. 9
[126] A. W. Savich, M. Moussa, and S. Areibi, “The impact of arithmetic
representation on implementing mlp-bp on FPGAs: A study, IEEE
transactions on neural networks, vol. 18, no. 1, pp. 240–252, 2007. 9
[127] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, “Cnp: An FPGA-based
processor for convolutional networks, in Field Programmable Logic and
Applications, 2009. FPL 2009. International Conference on. IEEE,
2009, pp. 32–37. 9, 11, 16, 29, 34
[128] Y. LeCun et al., “Lenet-5, convolutional neural networks, URL:
http://yann. lecun. com/exdb/lenet, p. 20, 2015. 9
[129] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic,
E. Cosatto, and H. P. Graf, “A massively parallel coprocessor for convo-
lutional neural networks,” in Application-specific Systems, Architectures
and Processors, 2009. ASAP 2009. 20th IEEE International Conference
on. IEEE, 2009, pp. 53–60. 9, 10, 29, 34
[130] H. P. Graf, S. Cadambi, V. Jakkula, M. Sankaradass, E. Cosatto,
S. Chakradhar, and I. Dourdanovic, A massively parallel digital learning
processor, in Advances in Neural Information Processing Systems,
2009, pp. 529–536. 10
[131] S. Cadambi, I. Durdanovic, V. Jakkula, M. Sankaradass, E. Cosatto,
S. Chakradhar, and H. P. Graf, “A massively parallel FPGA-based co-
processor for support vector machines,” in 2009 17th IEEE Symposium
on Field Programmable Custom Computing Machines. IEEE, 2009, pp.
115–122. 10
[132] S. Cadambi, A. Majumdar, M. Becchi, S. Chakradhar, and H. P. Graf,
“A programmable parallel accelerator for learning and classification, in
Proceedings of the 19th international conference on Parallel architectures
and compilation techniques. ACM, 2010, pp. 273–284. 10, 17, 29, 32,
34
[133] J. C. Platt, “12 fast training of support vector machines using sequential
minimal optimization,” Advances in kernel methods, pp. 185–208, 1999.
10
[134] B. Bai, J. Weston, D. Grangier, R. Collobert, K. Sadamasa, Y. Qi,
O. Chapelle, and K. Weinberger, “Learning to rank with (a lot of) word
features,” Information retrieval, vol. 13, no. 3, pp. 291–314, 2010. 10
[135] J. MacQueen et al., “Some methods for classification and analysis of mul-
tivariate observations, in Proceedings of the fifth Berkeley symposium
on mathematical statistics and probability, vol. 1, no. 14. Oakland, CA,
USA, 1967, pp. 281–297. 10
[136] A. Sato and K. Yamada, “Generalized learning vector quantization, in
Advances in neural information processing systems, 1996, pp. 423–429.
10
[137] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, “Face recognition:
A convolutional neural-network approach, IEEE transactions on neural
networks, vol. 8, no. 1, pp. 98–113, 1997. 10, 11
[138] K. Chellapilla, S. Puri, and P. Simard, “High performance convolutional
neural networks for document processing,” in Tenth International Work-
shop on Frontiers in Handwriting Recognition. Suvisoft, 2006. 10, 14
[139] F. Nasse, C. Thurau, and G. A. Fink, “Face detection using gpu-based
convolutional neural networks, in International Conference on Computer
Analysis of Images and Patterns. Springer, 2009, pp. 83–90. 10, 11
[140] J. D. Dixon, “Asymptotically fast factorization of integers, Mathematics
of computation, vol. 36, no. 153, pp. 255–260, 1981. 11
[141] P. L. Montgomery, “A survey of modern integer factorization algorithms,”
CWI quarterly, vol. 7, no. 4, pp. 337–365, 1994. 11
[142] C. Farabet, B. Martini, P. Akselrod, S. Talay, Y. LeCun, and E. Culur-
ciello, “Hardware accelerated convolutional neural networks for synthetic
vision systems,” in Circuits and Systems (ISCAS), Proceedings of 2010
IEEE International Symposium on. IEEE, 2010, pp. 257–260. 11
[143] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. Le-
Cun, “Neuflow: A runtime reconfigurable dataflow processor for vision,
in Computer Vision and Pattern Recognition Workshops (CVPRW), 2011
IEEE Computer Society Conference on. IEEE, 2011, pp. 109–116. 11,
29, 34
[144] R. Collobert, C. Farabet, K. Kavukcuoglu et al., “Torch, in Workshop on
Machine Learning Open Source Software, NIPS, vol. 76, 2008. 11
[145] D. Grangier, L. Bottou, and R. Collobert, “Deep convolutional networks
for scene parsing,” in ICML 2009 Deep Learning Workshop, vol. 3, no. 6.
Citeseer, 2009, p. 109. 11, 29
[146] M. Peemen, A. A. Setio, B. Mesman, and H. Corporaal, “Memory-centric
accelerator design for convolutional neural networks, in Computer De-
sign (ICCD), 2013 IEEE 31st International Conference on. IEEE, 2013,
pp. 13–19. 11, 29, 33, 34
[147] A. Beric, J. van Meerbergen, G. de Haan, and R. Sethuraman, “Memory-
centric video processing,” IEEE Transactions on Circuits and Systems for
Video Technology, vol. 18, no. 4, pp. 439–452, 2008. 11, 33
[148] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, “A 240
g-ops/s mobile coprocessor for deep neural networks,” in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition
Workshops, 2014, pp. 682–687. 12, 28, 29, 30, 34
[149] C. Farabet, C. Poulet, and Y. LeCun, An FPGA-based stream processor
for embedded real-time vision with convolutional networks, in Com-
puter Vision Workshops (ICCV Workshops), 2009 IEEE 12th Interna-
tional Conference on. IEEE, 2009, pp. 878–885. 12
[150] S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful
visual performance model for multicore architectures,” Communications
of the ACM, vol. 52, no. 4, pp. 65–76, 2009. 12, 32
[151] L.-N. Pouchet, P. Zhang, P. Sadayappan, and J. Cong, “Polyhedral-based
data reuse optimization for configurable computing,” in Proceedings of
the ACM/SIGDA international symposium on Field programmable gate
arrays. ACM, 2013, pp. 29–38. 12
[152] K. Ovtcharov, O. Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, and E. S.
Chung, “Accelerating deep convolutional neural networks using special-
ized hardware,” Microsoft Research Whitepaper, vol. 2, no. 11, 2015. 13,
19, 29, 30, 32
[153] C. Zhang, G. Sun, Z. Fang, P. Zhou, P. Pan, and J. Cong, “Caffeine:
Towards uniformed representation and acceleration for deep convolu-
tional neural networks,” IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, 2018. 13, 14, 29, 32, 34
[154] B. Bosi, G. Bois, and Y. Savaria, “Reconfigurable pipelined 2-d con-
volvers for fast digital signal processing, IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, vol. 7, no. 3, pp. 299–308, 1999.
13, 22, 28
[155] Y. Wang, J. Xu, Y. Han, H. Li, and X. Li, “Deepburning: automatic
generation of FPGA-based learning accelerators for the neural network
family, in Proceedings of the 53rd Annual Design Automation Confer-
ence. ACM, 2016, p. 110. 14, 20, 29, 30, 32, 34
[156] K. O. W. Group et al., “The opencl specification version 1.1, http://www.
khronos. org/registry/cl/specs/opencl-1.1. pdf, 2011. 14
[157] M. S. Abdelfattah, A. Hagiescu, and D. Singh, “Gzip on a chip: High
performance lossless data compression on FPGAs using opencl,” in
Proceedings of the International Workshop on OpenCL 2013 & 2014.
ACM, 2014, p. 4. 14
[158] Altera. OpenCL Design Examples [Online]. Available:
https://www.altera.com/support/support-resources/designexamples/
design-software/opencl.html/, 2018. 14
[159] Nallatech. P395-D8 OpenCL FPGA Accelerator Cards [Online]. Avail-
able: http://www.nallatech.com/wp-content/uploads/openclcardspb_v1_
51.pdf/, 2018. 14
[160] Altera. DE5-Net FPGA Kit User Manual [Online]. Available: ftp://ftp.
altera.com/up/pub/Altera_Material/Boards/DE5/DE5_User_/, 2018. 14
[161] R. C. Whaley and J. J. Dongarra, “Automatically tuned linear algebra
software,” in Supercomputing, 1998. SC98. IEEE/ACM Conference on.
IEEE, 1998, pp. 38–38. 14
[162] C. Zhang, Z. Fang, P. Zhou, P. Pan, and J. Cong, “Caffeine: Towards
uniformed representation and acceleration for deep convolutional neural
networks,” in Computer-Aided Design (ICCAD), 2016 IEEE/ACM In-
ternational Conference on. IEEE, 2016, pp. 1–8. 14, 21, 29, 30, 32,
34
[163] W. Zuo, Y. Liang, P. Li, K. Rupnow, D. Chen, and J. Cong, “Improving
high level synthesis optimization opportunity through polyhedral trans-
formations,” in Proceedings of the ACM/SIGDA international sympo-
sium on Field programmable gate arrays. ACM, 2013, pp. 9–18. 15
[164] E. A. Lee and D. G. Messerschmitt, “Synchronous data flow,” Proceed-
ings of the IEEE, vol. 75, no. 9, pp. 1235–1245, 1987. 15
[165] S. I. Venieris and C.-S. Bouganis, “FPGAconvnet: A framework for map-
ping convolutional neural networks on fpgas, in Field-Programmable
Custom Computing Machines (FCCM), 2016 IEEE 24th Annual Inter-
national Symposium on. IEEE, 2016, pp. 40–47. 15, 20, 22, 29, 32,
34
[166] C. R. Reeves, Modern heuristic techniques for combinatorial problems.
Advanced topics in computer science. Mc Graw-Hill, 1995. 16
[167] L. Cavigelli, M. Magno, and L. Benini, “Accelerating real-time embed-
ded scene labeling with convolutional networks, in Proceedings of the
52nd Annual Design Automation Conference. ACM, 2015, p. 108. 16,
29
AHMAD SHAWAHNA received the M.S. degree in computer engineering from King Fahd University of Petroleum & Minerals (KFUPM), Saudi Arabia, in 2016, and the B.Sc. degree in computer engineering from An-Najah National University, Palestine, in 2012. He is currently a Ph.D. student in the Department of Computer Engineering at KFUPM and works at the Center for Communications and IT Research (CCITR), KFUPM. His research interests include hardware accelerators, deep learning, CNNs, FPGAs, wireless security, network security, the Internet of Things (IoT), and cloud computing.
SADIQ M. SAIT (Senior Member, IEEE) obtained his Bachelor's degree in Electronics Engineering from Bangalore University in 1981, and his Master's and Ph.D. degrees in Electrical Engineering from KFUPM in 1983 and 1987, respectively. In 1981, he received the Best Electronic Engineer Award from the Indian Institute of Electrical Engineers, Bangalore (where he was born). He has authored over 200 research papers, contributed chapters to technical books, and lectured in over 25 countries, and he is the principal author of two books. He is currently Professor of Computer Engineering and the Director of the Center for Communications and IT Research at KFUPM.
AIMAN EL-MALEH is a Professor in the Computer Engineering Department at King Fahd University of Petroleum & Minerals (KFUPM). He holds a B.Sc. in Computer Engineering, with first honors, from KFUPM (1989), an M.A.Sc. in Electrical Engineering from the University of Victoria, Canada (1991), and a Ph.D. in Electrical Engineering, with dean's honor list, from McGill University, Canada (1995). He was a member of the scientific staff at Mentor Graphics Corp., a leader in design automation, from 1995 to 1998. Dr. El-Maleh received the Excellence in Teaching Award from KFUPM in 2001/2002, 2006/2007, and 2011/2012, the Excellence in Advising Award in 2013/2014 and 2017/2018, the Excellence in Research Award in 2010/2011 and 2015/2016, and the First Instructional Technology Award in 2009/2010. His research interests are in the synthesis, testing, and verification of digital systems, as well as defect and soft-error tolerant design, VLSI design, design automation, efficient FPGA implementations of deep learning algorithms, and data compression techniques. Dr. El-Maleh won the best paper award for the most outstanding contribution in the field of test at the 1995 European Design & Test Conference, and his paper at the 1995 Design Automation Conference was also nominated for a best paper award. He holds five US patents.