Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2017.DOI
FPGA-based Accelerators of Deep
Learning Networks for Learning and
Classification: A Review
AHMAD SHAWAHNA1, SADIQ M. SAIT1, 2, (Senior Member, IEEE), AND AIMAN EL-MALEH1,
(Member, IEEE)
1Department of Computer Engineering, King Fahd University of Petroleum & Minerals, Dhahran-31261, Saudi Arabia
2Center for Communications and IT Research, Research Institute, King Fahd University of Petroleum & Minerals, Dhahran-31261, Saudi Arabia
Corresponding author: Sadiq M. Sait (e-mail: sadiq@kfupm.edu.sa).
This work was supported by the King Fahd University of Petroleum & Minerals, Dhahran, Saudi Arabia.
ABSTRACT Due to recent advances in digital technologies, and availability of credible data, an area
of artificial intelligence, deep learning, has emerged, and has demonstrated its ability and effectiveness
in solving complex learning problems not possible before. In particular, convolution neural networks
(CNNs) have demonstrated their effectiveness in image detection and recognition applications. However,
they require intensive CPU operations and memory bandwidth that make general CPUs fail to achieve
desired performance levels. Consequently, hardware accelerators that use application specific integrated
circuits (ASICs), field programmable gate arrays (FPGAs), and graphic processing units (GPUs) have been
employed to improve the throughput of CNNs. More precisely, FPGAs have been recently adopted for
accelerating the implementation of deep learning networks due to their ability to maximize parallelism as
well as due to their energy efficiency. In this paper, we review recent existing techniques for accelerating
deep learning networks on FPGAs. We highlight the key features employed by the various techniques
for improving the acceleration performance. In addition, we provide recommendations for enhancing the
utilization of FPGAs for CNNs acceleration. The techniques investigated in this paper represent the recent
trends in FPGA-based accelerators of deep learning networks. Thus, this review is expected to direct the
future advances on efficient hardware accelerators and to be useful for deep learning researchers.
INDEX TERMS Adaptable Architectures, Convolutional Neural Networks (CNNs), Deep Learning,
Dynamic Reconfiguration, Energy-Efficient Architecture, Field Programmable Gate Arrays (FPGAs),
Hardware Accelerator, Machine Learning, Neural Networks, Optimization, Parallel Computer Architec-
ture, Reconfigurable Computing.
I. INTRODUCTION
In recent years, due to the availability of massive amounts
of credible data (Big Data: Text, Video, Audio, etc.),
and tremendous advances in the area of digital electronics
technologies that provide immense computing power, there
has been a revival in the area of artificial intelligence (AI),
particularly in the area of deep learning (DL) [1]–[3], a sub-
field of machine learning (ML).
The field of DL emerged in 2006 after a long pause in the
area of neural networks (NNs) research [4]. A key aspect in
DL is that the networks and/or their weights are not designed
by human beings. Instead, they are learned from data using
a general purpose learning procedure [5], [6].
While ML uses algorithms to parse and learn from data,
to make informed decisions, DL structures algorithms in
layers to create an artificial neural network (ANN) that
can learn, and similar to human intelligence, can make
accurate decisions on its own [7]. Therefore, instead of
designing algorithms by hand, systems can be built and
trained to implement concepts in a way similar to what
comes naturally to humans, and with accuracy sometimes
exceeding human-level performance [8], [9].
In DL, each layer is designed to detect features at different
levels. A layer transforms the representation at one level
(starting from input data which may be images, text, or
sound) to a representation at a higher, slightly more abstract
level [10]. For example, in image recognition, where input
initially comes in the form of pixels, the first layer detects
low level features such as edges and curves. The output
of the first layer becomes input to the second layer which
produces higher level features, for example semi-circles, and
squares [11]. The next layer assembles the output of the
previous layer to parts of familiar objects, and a subsequent
layer detects the objects. As we go through more layers, the
network yields an activation map that represents more and
more complex features. The deeper you go into the network,
the filters begin to be more responsive to a larger region of
the pixel space. Higher level layers amplify aspects of the
received inputs that are important for discrimination and
suppress irrelevant variations.
A. APPLICATIONS OF DEEP LEARNING NETWORKS
With the now widely used convolution neural networks
(CNNs) [12], [13] and deep neural networks (DNNs) [14],
[15], it is now possible to solve problems in domains
where knowledge is not easily expressed explicitly and
implicit information is stored in the raw data. Solutions to
multifarious problems in the domain of sciences, business,
etc., have been possible that were not conceivable for several
years, in spite of best attempts by the AI community.
This has been primarily possible due to the excellent abil-
ity of deep learning in discovering intricate structures in
high-dimensional data. Examples include character recog-
nition [16], gesture recognition [17], speech recognition
(e.g., in Google Now, Siri, or click-through prediction on
an advertisement) [18]–[20], document processing [21]–
[23], natural language processing [24], [25], video classi-
fication [26], image classification [27]–[32], face detection
and recognition [33], [34], robot navigation [35]–[37], real-
time multiple object tracking [38], financial forecasting [39],
and medical diagnosis systems [40]–[42], to name a few.
Other recent areas of applications include automated
driving (e.g., learning to detect stop signs, traffic lights,
pedestrians, etc.), aerospace and defense (e.g., identify ob-
jects from satellites and identify safe or unsafe zones),
medical research (e.g., in identification of cancer cells),
industrial automation (e.g., to improve worker safety by
detecting when people or objects are within an unsafe
distance of machines), and electronics (used in automated
hearing, speech translation, etc.) [9], [43]–[46].
B. EMERGENCE OF DEEP LEARNING NETWORKS
Convolutional neural networks are considered as one of
the most influential innovations in the field of computer
vision [47]. The success of deep learning networks grew
to prominence in 2012 when Krizhevsky et al. [28] uti-
lized CNNs to win the annual olympics of computer
vision, ImageNet large-scale vision recognition challenge
(ILSVRC) [30]. Using AlexNet model, they achieved an
astounding improvement as the image classification error
dropped from 26% (in 2011) to 15%. ImageNet is a stan-
dard benchmark dataset used to evaluate the performance
of object detection and image classification algorithms. It consists of millions of different images distributed over tens of thousands of object classes.

FIGURE 1. ImageNet Competition Results [50].
CNNs have achieved even better accuracy in classifica-
tion and various computer vision tasks. The classification
accuracy in ILSVRC improved to 88.8% [48], 93.3% [31],
and 96.4% [49] in the 2013, 2014, and 2015 competitions,
respectively. Fig. 1 shows the accuracy loss for the winners
of ImageNet competitions before and after the emergence
of deep learning algorithms.
Thereafter, large host companies started using CNNs at
the core of their services. Google, Microsoft, Facebook,
Amazon, Pinterest, and Instagram are currently using neural
networks for their photo search, Bing’s image feeds, auto-
matic tagging algorithms, product recommendations, home
feed personalization, and for their search infrastructure,
respectively [11]. However, the classic use-case of CNNs
is for image and speech processing [51].
A typical CNN is a multi-layered feed-forward ANN with
a pipeline-like architecture. Specifically, each layer performs
a well-known computation on the outputs of the previous
layer to generate the inputs for the next layer. In general,
CNNs have two types of inputs: the data to be tested or
classified (also named as feature maps), and the weights.
Images, audio files, and recorded videos are examples of the
input data to be classified using CNNs. On the other hand,
the network weights are the data generated from training
the CNN on a dataset containing similar inputs to the one
being tested.
C. HARDWARE ACCELERATION OF DEEP LEARNING
NETWORKS
To provide more accurate results as well as real-time object
recognition, for example in applications such as robots and
auto-piloted cars, the size of the convolution neural network
needs to be increased by adding more neural network
layers [28]. However, evolving more and newer types of NN layers results in more complex CNN structures as well as high-depth CNN models. Thus, billions of operations and
millions of parameters, as well as substantial computing re-
sources are required to train and evaluate the resultant large-
scale CNN [31], [52], [53]. Such requirements represent
a computational challenge for general purpose processors
(GPP). Consequently, hardware accelerators such as appli-
cation specific integrated circuit (ASIC), field programmable
gate array (FPGA), and graphic processing unit (GPU) have
been employed to improve the throughput of the CNN.
In practice, CNNs are trained off-line using the back-
propagation process [54]. Then, the off-line trained CNNs
are used to perform recognition tasks using the feed-forward
process [55]. Therefore, the speed of the feed-forward process is what matters.
GPUs are the most widely used hardware accelerators
for improving both training and classification processes in
CNNs [56]. This is due to their high memory bandwidth
and throughput as they are highly efficient in floating-point
matrix-based operations [57]–[59]. However, GPU acceler-
ators consume a large amount of power. Therefore, their use
in CNN-based applications implemented as a cloud service
on large servers or in battery operated devices becomes a
challenge. Furthermore, GPUs gain their performance from
their ability to process a large image batch in parallel. For
some applications like a video stream, input images should
be processed frame by frame as the latency of the result of
each frame is critical to the application’s performance. For
some tracking algorithms, the result of one frame affects the
process of the next frame [60]. Nurvitadhi et al. [61] recently
evaluated emerging DNN algorithms on latest generations
of GPUs (i.e., NVIDIA Titan X Pascal) and FPGAs (i.e.,
Intel Arria 10 GX 1150 and Intel Stratix 10 2800). The
experimental results show that current trends in deep neural
networks favor FPGA platforms as they offer higher power
efficiency (a.k.a., performance per Watt).
FPGA and ASIC hardware accelerators have relatively
limited memory, I/O bandwidths, and computing resources
compared with GPU-based accelerators. However, they can
achieve at least moderate performance with lower power
consumption [62]. The throughput of ASIC design can be
improved by customizing memory hierarchy and assigning
dedicated resources [63]. However, the development cycle,
cost, and flexibility are not satisfactory in ASIC-based
acceleration of deep learning networks [64], [65]. As an
alternative, FPGA-based accelerators are currently in use
to provide high throughput at a reasonable price with low
power consumption and reconfigurability [66], [67]. The
availability of high-level synthesis (HLS) tools, using C or
C++, from FPGA vendors lowers the programming hurdle
and shortens the development time of FPGA-based hardware
accelerators [68]–[70].
Convolutional neural networks have a very useful prop-
erty, that is, each feature map neuron shares its weights
with all other neurons [71]. The authors in [72], [73] proved
that the highest energy expense results from accessing the
off-chip DRAM memory for data movement rather than
computation. In other words, the energy cost of the increased
memory accesses and data movement due to the large
number of CNN operations often exceeds the energy cost
of computation [64], [74]. Thus, CNN accelerators need to
carefully consider this to achieve efficient architecture in
terms of time and power.
In this paper, we review the current status of using FPGAs
as accelerators for implementing deep learning networks.
We highlight the implementation challenges and design
directions used to tackle those challenges. We also provide
future recommendations to maximize the performance of
FPGAs as accelerators for deep learning networks and
simplify their use.
The remainder of the paper is organized as follows.
Section II provides background information about CNNs,
their key operations, and some well-known deep learning
networks. In addition, it introduces the basic structure of FP-
GAs and highlights their features enabling them to acceler-
ate computationally intensive applications. It also discusses
the implementation challenges of deep learning networks
on FPGAs and how these challenges can be overcome.
Section III reviews existing CNNs compression techniques
and presents the current status of accelerating deep learning
networks using ASIC-based and FPGA-based accelerators.
Section IV describes the use of metaheuristics in the de-
sign and optimization of CNNs implementation. Section
V summarizes existing design approaches for accelerating
deep learning networks and provides recommendations for
future directions that will simplify the use of FPGA-based
accelerators and enhance their performance. Finally, section
VI concludes the paper.
II. BACKGROUND AND TERMINOLOGY
This section gives an overview of the key operations and
terminology used in convolutional neural networks (CNNs)
and provides examples of well-known deep learning net-
works. In addition, it illustrates the basic structure of field
programmable gate arrays (FPGAs) and how deep learning
methods can benefit from the capabilities of FPGAs. The
last subsection highlights the challenges of implementing
deep learning networks on FPGAs.
A. CONVOLUTIONAL NEURAL NETWORKS (CNNS)
In this subsection, we describe the key operations and
terminology involved in the construction of CNNs including
convolution, activation functions, normalization, pooling,
and characteristics of fully connected layers.
1) Convolution (CONV)
A convolution operation can be thought of as the production
of a matrix smaller in size than the original image matrix,
representing pixels, by sliding a small window (called filter,
feature identifier, or kernel) of size k × k over the image
(called input feature map (FM)), to produce an output
feature neuron value [75]. The filter is an array of numbers
called weights or parameters. These weights are computed
during the training phase. As the filter slides over the
feature map, it multiplies the values in the filter with the
original pixel values, that is, it first performs element-wise
multiplication, and then sums the products, to produce a
single number. The inputs and outputs of the CONV layer
are a series of FM arrays.
This operation, starting from the top left corner of the
FM, is repeated by moving the window S strides at a
time, first in the right direction, until the end of the FM
is reached, and then proceeding downwards until the FM
is completely scanned and all the elements of the FM are
covered. The sliding of the filter window and performing
the operation is known by the verb convolving, hence the
noun convolution [11], [76]. Normally, the size of the kernel is very small, less than or equal to 11 × 11. Each output-
input FM pair has a set of weights equal to the kernel size
and each output FM is computed based on the sum of the
convolution operations performed on all input FMs. Note
that different CONV layers in the same CNN model vary
considerably in their sizes.
In summary, the convolution operation comprises four levels of loops: the output FMs loop (Loop-4), the loop
across the input FMs (Loop-3), the loop along the di-
mensions of a single input FM (scan operation, Loop-2),
and the kernel window size loop (multiply-and-accumulate
(MAC) operation, Loop-1). CONV layers are dominant in
CNN algorithms since they often constitute more than 90%
of the total CNN operations [28], [29], [49], [74], [77],
[78]. Therefore, many attempts have been made to speed up CONV operations using the loop unrolling technique [55], [79],
as will be discussed later. Loop unrolling maximizes the
parallelism of CONV MACs computation which requires a
special consideration of processing elements (PEs) and reg-
ister arrays architecture. Fig. 2 illustrates the loop unrolling
of the CONV loop levels.
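To make the four loop levels concrete, the following Python sketch (an illustrative model, not taken from any of the reviewed accelerators) computes a CONV layer with literally nested loops; the array shapes and the stride S are assumed for the example.

```python
import numpy as np

def conv_layer(in_fms, weights, stride):
    """Naive CONV layer. in_fms: (Nif, H, W); weights: (Nof, Nif, K, K)."""
    n_of, n_if, k, _ = weights.shape
    _, h, w = in_fms.shape
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    out_fms = np.zeros((n_of, out_h, out_w))
    for of in range(n_of):                    # Loop-4: across output FMs
        for i in range(n_if):                 # Loop-3: across input FMs
            for oy in range(out_h):           # Loop-2: scan a single input FM
                for ox in range(out_w):
                    acc = 0.0
                    for ky in range(k):       # Loop-1: kernel window (MAC)
                        for kx in range(k):
                            acc += (weights[of, i, ky, kx] *
                                    in_fms[i, oy * stride + ky, ox * stride + kx])
                    out_fms[of, oy, ox] += acc    # accumulate over all input FMs
    return out_fms

# Example: 3 input FMs of 8x8, 4 output FMs, 3x3 kernels, stride S = 1
out = conv_layer(np.random.rand(3, 8, 8), np.random.rand(4, 3, 3, 3), stride=1)
print(out.shape)    # (4, 6, 6)
```

Loop unrolling in hardware corresponds to replicating multipliers and adders so that several iterations of one or more of these loops (e.g., the Pkx × Pky MACs of Loop-1, or Pof output FMs of Loop-4) are executed in the same clock cycle.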
2) Activation Functions (AFs)
The activation function in neural networks is similar to the action potential in animal cells such as neurons. A neuron is said
to fire if it emits an action potential. A popularly used
activation function is the sigmoid function which can be
expressed as
f(x) = 1 / (1 + e^(-x))    (1)

where x represents the weighted sum of the neuron inputs
and if it is a sufficiently large positive number, the sigmoid
function approximates to unity. For sufficiently large nega-
tive values of x, the sigmoid function is close to 0. Another
popular activation function is
f(x) = tanh(x)    (2)
The above standard sigmoid and tanh non-linear functions
require long training time [28]. A recently proposed and
commonly used AF in CNNs is rectified linear unit (ReLU)
which is defined as
f(x) = max(x, 0)    (3)
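The three activation functions in Eqs. (1)-(3) map directly to the following one-line NumPy definitions (a reference sketch of the formulas only):

```python
import numpy as np

def sigmoid(x):          # Eq. (1): 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def tanh_af(x):          # Eq. (2): tanh(x)
    return np.tanh(x)

def relu(x):             # Eq. (3): max(x, 0)
    return np.maximum(x, 0.0)

x = np.array([-3.0, 0.0, 3.0])
print(sigmoid(x), tanh_af(x), relu(x))
```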
ReLU activation function is known to converge faster
in training, and has lower computational complexity [80], [81] than the standard sigmoid and tanh functions. In addition, it does not require input normalization to prevent it from saturating [28], [80], [82].

FIGURE 2. CONV Loops Unrolling [83]: (a) Unrolling Loop-1; (b) Unrolling Loop-2; (c) Unrolling Loop-3; (d) Unrolling Loop-4, where Pkx, Pky, Pix, Piy, Pif, and Pof are loop unrolling design variables for the kernel window width, kernel window height, input FM width, input FM height, number of input FMs, and the number of output FMs, respectively.
3) Normalization
In real life, a phenomenon called ‘lateral inhibition’ appears,
which refers to the capacity of an excited neuron to subdue
its neighbors, thereby creating a contrast in that area. In
CNNs, to accomplish this, local response normalization
(LRN) or simply normalization is used, particularly when
dealing with ReLU neurons, because they have unbounded
activation that needs normalization. It detects high frequency
features with a large response. If we normalize around the
local neighborhood of the excited neuron, it becomes even
more sensitive as compared to its neighbors. At the same
time, it will dampen the responses that are uniformly large
in any given local neighborhood. If all the values are large,
then normalizing those values will diminish all of them. So,
basically it performs some kind of inhibition and boosts the
neurons with relatively larger activations.
Normalization can be done within the same feature or
across neighboring features by a factor that depends on the
neighboring neurons. Expressions to compute the response
normalized activity can be found in [28], [80].
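As an illustration of cross-map normalization, the sketch below follows the local response normalization expression reported in [28]; the constants n, k, alpha, and beta are the values used there and are shown only as an example.

```python
import numpy as np

def lrn_across_maps(fms, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """fms: (N, H, W). Each activation is divided by a term that grows with the
    squared activations of its n neighboring feature maps (constants from [28])."""
    num_maps = fms.shape[0]
    half = n // 2
    out = np.empty_like(fms)
    for i in range(num_maps):
        lo, hi = max(0, i - half), min(num_maps, i + half + 1)
        denom = (k + alpha * np.sum(fms[lo:hi] ** 2, axis=0)) ** beta
        out[i] = fms[i] / denom       # uniformly large neighborhoods dampen the response
    return out

print(lrn_across_maps(np.random.rand(8, 4, 4)).shape)    # (8, 4, 4)
```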
4) Pooling
Pooling, also known as subsampling, is employed to pro-
gressively reduce the spatial size of the representation,
thereby reducing the amount of parameters and computation
in the network. Pooling layers are periodically inserted
in between successive convolutional layers. They operate
independently on every depth slice of the input and resize it
spatially using the MAX operation. The most common form
is a pooling layer with filters of size 2 × 2, where the MAX operation takes the maximum over 4 samples, thereby discarding 75 percent of the activations [84]. In
addition to the popular MAX pooling, the pooling units in some CNNs are also used to perform other functions, such as AVG and MIN operations [80].

FIGURE 3. AlexNet CNN Architecture [28].
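A minimal sketch of the 2 × 2 MAX pooling described above (non-overlapping windows with stride 2, which keeps 1 of every 4 activations):

```python
import numpy as np

def max_pool_2x2(fm):
    """fm: (H, W) with even H and W; returns (H/2, W/2), the max of each 2x2 window."""
    h, w = fm.shape
    return fm.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fm))    # [[ 5.  7.] [13. 15.]]
```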
5) Fully Connected Layer (FC)
A common form of a convolutional neural network archi-
tecture comprises stacks of a few convolutional and ReLU
layers, followed by layers for pooling, and this pattern is
repeated until the image has merged spatially to a small
size. This is followed by one or more fully connected layers,
also known as inner-product layers, whose neurons have full
connections to all activations in the previous layer, hence
the name. The last fully connected layer is the classification
layer and it holds the output such as the class scores [80].
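For reference, a fully connected (inner-product) layer is a matrix-vector product plus a bias; the sizes below are illustrative assumptions (the 4096-to-1000 shape mirrors the last AlexNet layer producing class scores).

```python
import numpy as np

def fc_layer(x, weights, bias):
    """x: (n_in,), weights: (n_out, n_in), bias: (n_out,) -> output activations (n_out,)."""
    return weights @ x + bias

x = np.random.rand(4096)                          # activations from the previous layer
scores = fc_layer(x, 0.01 * np.random.randn(1000, 4096), np.zeros(1000))
print(scores.shape)                               # (1000,) class scores
```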
B. EXAMPLES OF DEEP LEARNING NETWORKS
We list in this subsection some of the well-known deep
learning networks.
AlexNet (2012) is a convolutional neural network consisting of 5 convolutional layers, interspersed by 2 normalization layers, as well as 3 fully connected layers [28]. Each convolutional layer performs the activation function using ReLU. In addition, 3 pooling layers are employed with the first, second, and last convolutional layers. The architecture of AlexNet CNN is shown in Fig. 3. AlexNet won the 2012 ImageNet challenge by classifying 224 × 224 input color images to 1,000 different output classes.
VGG (2014) is a convolutional neural network model
similar to AlexNet in terms of the number of fully
connected layers. However, it consists of 5 groups of
convolutional layers [29], [81]. The exact number of
CONV layers in each group depends on the version
of the VGG, visual geometry group, model. Table 1
shows the number of CONV and FC layers for the
most commonly used VGG models.
ResNets (2016) are deep residual networks with ex-
tremely irregular and complex structures compared to
AlexNet and VGG CNN models [49], [85], [86]. This is
due to having more types of layers, where non-adjacent
layers incorporate shortcuts to compute the residual
functions, as well as having highly deep structures, that
is, between 50 and 1000 CONV layers. Unlike AlexNet
and VGG models where the layers are connected in
sequence, the interconnections in ResNet layers are in
the form of a directed acyclic graph (DAG). ResNet-
50 and ResNet-152 are widely used, especially for
image classification. ResNet-50/152 structure contains
53/155 CONV (most of them are followed by batch
normalization (BatchNorm), scale, and ReLU layers),
1/1 MAX pooling, 1/1 Average pooling, 1/1 FC, and,
16/50 element-wise (Eltwise) layers, respectively.
C. FIELD PROGRAMMABLE GATE ARRAYS (FPGAS)
FPGAs are off-the-shelf programmable devices that provide
a flexible platform for implementing custom hardware func-
tionality at a low development cost. They consist mainly of
a set of programmable logic cells, called configurable logic
blocks (CLBs), a programmable interconnection network,
and a set of programmable input and output cells around the
device [87]. In addition, they have a rich set of embedded
components such as digital signal processing (DSP) blocks
which are used to perform arithmetic-intensive operations
such as multiply-and-accumulate, block RAMs (BRAMs),
look-up tables (LUTs), flip-flops (FFs), clock management
unit, high speed I/O links, and others. Fig. 4 shows a basic
structure of an FPGA.
FPGAs are widely considered as accelerators for
computationally-intensive applications as they enable models with highly flexible fine-grained parallelism and associative operations such as broadcast and collective response [88].
TABLE 1. CNN Layers for VGG Models.
Layers VGG-11 VGG-16 VGG-19
CONV (Group 1) 1 2 2
CONV (Group 2) 1 2 2
CONV (Group 3) 2 3 4
CONV (Group 4) 2 3 4
CONV (Group 5) 2 3 4
CONV (Total) 8 13 16
FC 3 3 3
Total 11 16 19
FIGURE 4. FPGA Basic Structure [87].
In [89], [90], FPGA computing models used
for applications acceleration are presented, including data
streaming, associative computing, highly parallel memory
access, use of standard hardware structures such as first
in first out (FIFO) buffers, stacks and priority queues, and
functional parallelism.
FPGAs have the advantage of maximizing performance
per Watt of power consumption, reducing costs for large
scale operations [91]. This makes them an excellent choice
as accelerators for battery operated devices and in cloud
services on large servers. FPGAs have recently been widely
used for deep learning acceleration given the flexibility in
implementing architectures with large degree of parallelism
resulting in high execution speeds [92].
The adoption of software-level programming models such
as the open computing language (OpenCL) standard [93],
[94] in FPGA tools made them more attractive to use for
deep learning [95], [96]. In addition, the feed-forward nature
of deep learning algorithms makes FPGAs offer a clear
advantage as they can create customized hardware circuits
that are deeply pipelined and inherently multithreaded [91].
FPGAs also have the capability of partial dynamic config-
uration, which allows part of the FPGA to be reconfigured
while the rest is being used. This could be of potential
benefit to deep learning methods where the next layer could
be reconfigured while the current layer is being used.
D. CHALLENGES OF FPGA-BASED IMPLEMENTATION
OF DEEP LEARNING NETWORKS
Implementation of deep learning networks and, in particular,
CNNs on FPGAs has a number of challenges including
the requirement of a significant amount of storage, external
memory bandwidth, and computational resources on the
order of billions of operations per second [97]. For example,
AlexNet CNN has over 60 million model parameters which
need 250MB of memory for storing the weights based
on 32-bit floating-point representation, and requires around 1.5 billion operations for each input image [80].
Such a large amount of storage is not supported by
existing commercial FPGAs and hence the weights have
to be stored on external memory and transferred to the
FPGA during computation. Without careful implementation
of deep learning networks and maximizing resource sharing,
the implementation may not fit on FPGAs due to limited
logic resources.
The problem is exacerbated with more complex models such as the VGG CNN model, which has 16 layers. For example, the VGG-16 CNN model has 138 million weights and needs over 30 GOPS [98]. Although the current trend in implementing CNNs is toward compressing the entire CNN model while dramatically reducing the data bit-width [99],
it is expected that future CNN models will get more complex
with larger number of layers as the amount of training data
continues to grow and the problems to be solved get more
complex.
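The storage figures quoted above follow directly from the parameter counts and the chosen bit-width; the short sketch below reproduces that arithmetic for AlexNet (about 60 million weights) and VGG-16 (about 138 million weights) and shows how reduced precision shrinks the external memory footprint. The byte counts are illustrative approximations.

```python
def weight_storage_mb(num_weights, bits_per_weight):
    """Memory needed to hold the weights, in megabytes."""
    return num_weights * bits_per_weight / 8 / 1e6

for name, n in [("AlexNet", 60e6), ("VGG-16", 138e6)]:
    for bits in (32, 16, 8):
        print(f"{name:8s} {bits:2d}-bit: {weight_storage_mb(n, bits):7.1f} MB")
# AlexNet at 32 bits is roughly 240 MB, consistent with the ~250 MB cited above.
```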
In addition, different layers in CNNs have different
characteristics which result in different parallelism and
memory access requirements. Different layers in a CNN
network exhibit vastly different amounts of intra-output
and inter-output parallelism [100]. Intra-output parallelism
parallelizes the computation of a single output image since
it is the sum of n input-kernel convolutions. However,
inter-output parallelism is based on computing multiple
output FMs in parallel. Furthermore, convolutional layers
are computational-centric while fully connected layers are
memory centric [98]. For example, the number of operations in each group of convolutional layers in the VGG-16 model is on the order of 2 to 9 GOPS, while the number of weights is on the order of 0.04 to 7.08 million. However, the number of operations in the fully connected layers is on the order of 0.01 to 0.21 GOPS, while the number of weights is on the order of 4.10 to 102.76 million. Thus, the developed CNN
accelerator must be designed carefully to meet the varying
requirements of different layers and needs to be flexible to
maximize the performance for each CNN layer.
As technology advances, FPGAs continue to grow in size
and capabilities. It is crucial to have some mechanisms for
addressing the requirements for efficient implementations
of deep learning networks. Addressing hardware resource
limitations requires reuse of computational resources, and
storing of partial results in internal memories. Data transfer
and computational resource usage are significantly impacted
by the ordering of operations and selection of parallelism in
the implementation of CNNs on FPGAs. Careful scheduling
of operations can result in significant reduction in external
memory access and internal buffer sizes. External memory
bandwidth requirements can also be decreased by using
reduced precision for representing the weights with minimal
impact on solution quality, which also results in a better
energy efficiency. In addition, the number of external mem-
ory accesses can be reduced by utilizing on-chip memory
and exploiting data reuse. Furthermore, the large number of
weights in the fully connected layer can be reduced, based
on utilizing singular value decomposition (SVD) [101] with
a small impact on accuracy. In the next section, we will
review various design approaches used to cope with those
challenges for implementing deep learning networks.
III. ACCELERATION OF DEEP LEARNING NETWORKS:
CURRENT STATUS
In this section, we will start by covering convolutional
neural networks (CNNs) compression techniques as they
have a significant impact on the implementation complexity
of CNNs. CNNs compression techniques target the min-
imization of the number of operations and the memory
footprint with minimal impact on accuracy. Then, we discuss
hardware acceleration techniques for deep learning (DL)
algorithms and CNNs based on both application specific
integrated circuit (ASIC) and field programmable gate array
(FPGA) implementations. In general, hardware accelerators
focus on designing specific modules and architectures that
ensure data reuse, enhance data locality, and accelerate
convolutional (CONV) layer operations based on performing
needed operations in parallel.
A. CNNS COMPRESSION
In this subsection, we review techniques that target the com-
pression of CNNs which results in significantly reducing
their implementation complexity with minimal impact on
accuracy.
Denton et al. [102] proposed a technique to reduce the
memory footprint for the network weights in object recog-
nition systems. They used singular value decomposition
(SVD) [101] and filter clustering methods for this purpose.
The results for convolutional model of 15 layers in [48]
show that the proposed technique speeds up the operations
in convolutional layers by a factor of 2, compared to CPU
Eigen3-based library implementation [103]. In addition, it
successfully achieved 13× memory footprint reduction for
the fully connected layers while preserving the recognition
accuracy within 99%.
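To illustrate the idea behind SVD-based compression of fully connected weights (a hedged sketch of the general technique, not the authors' implementation): a weight matrix W of size m × n is replaced by a rank-r factorization, reducing storage from m·n values to roughly r·(m + n).

```python
import numpy as np

def svd_compress(weights, rank):
    """Return factors (A, B) such that A @ B approximates the weight matrix."""
    u, s, vt = np.linalg.svd(weights, full_matrices=False)
    a = u[:, :rank] * s[:rank]     # (m, r), singular values folded into the left factor
    b = vt[:rank, :]               # (r, n)
    return a, b

w = np.random.rand(1024, 4096)     # trained FC weights have much lower effective rank than random ones
a, b = svd_compress(w, rank=128)
ratio = w.size / (a.size + b.size)
err = np.linalg.norm(w - a @ b) / np.linalg.norm(w)
print(f"storage ratio: {ratio:.1f}x, relative approximation error: {err:.3f}")
```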
In another work, Han et al. [104] employed network
pruning techniques [105]–[107] to reduce the over-fitting
and complexity of neural network models. Their results
demonstrated that pruning redundant connections as well as
less influential connections achieved 9× and 13× compres-
sion for AlexNet and VGG-16 models, respectively, while
achieving zero accuracy loss for both.
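A minimal sketch of magnitude-based connection pruning (one common way of identifying the less influential connections; the actual pipeline in [104] also retrains the surviving weights, which is omitted here):

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (sparsity=0.9 keeps 10%)."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask    # the mask keeps pruned weights at zero during retraining

w = np.random.randn(256, 256)
pruned, mask = prune_by_magnitude(w, sparsity=0.9)
print(f"kept {mask.mean():.1%} of the connections")
```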
In a subsequent work, Han et al. [108] proposed a deep
compression technique for more reduction of the storage
requirements of CNNs through the enforcement of weights
sharing. Deep compression basically consists of pruning,
trained weights quantization, and Huffman coding pipeline
stages. The experimental results show that the proposed
compression technique successfully reduced the storage
requirement of AlexNet and VGG-16 CNN models by 35× and 49×, respectively, without affecting their accuracy. This also improved the power efficiency (a.k.a., performance per Watt) by 3× to 7×.
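The weight-sharing stage of deep compression can be sketched as clustering the surviving weights into a small codebook so that only short cluster indices (plus the codebook) are stored; the 16-cluster choice below (4-bit indices) is an illustrative assumption, and the Huffman-coding stage is omitted.

```python
import numpy as np

def share_weights(weights, num_clusters=16, iters=20):
    """Simple 1D k-means over weight values: returns the codebook and per-weight indices."""
    flat = weights.ravel()
    codebook = np.linspace(flat.min(), flat.max(), num_clusters)    # initial centroids
    for _ in range(iters):
        idx = np.argmin(np.abs(flat[:, None] - codebook[None, :]), axis=1)
        for c in range(num_clusters):
            if np.any(idx == c):
                codebook[c] = flat[idx == c].mean()
    return codebook, idx.reshape(weights.shape)

codebook, idx = share_weights(np.random.randn(64, 64).astype(np.float32))
# 4-bit indices plus a tiny codebook replace a 32-bit float per weight
print(f"per-weight storage drops from 32 to about 4 bits ({32 / 4:.0f}x for this layer)")
```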
B. ASIC-BASED ACCELERATORS
In this subsection, we present some recent work in the area
of hardware-based accelerators (ASICs).
An ASIC-based hardware accelerator referred to as Dian-
Nao [109] was designed for large-scale convolutional neural
networks and deep neural networks. DianNao accelerates
neural networks by minimizing memory transfers, which
opened a new paradigm for hardware accelerators. Since
the weights are repeatedly used in the computations of con-
volution layers, frequent memory access can significantly
degrade the overall performance. Therefore, the authors
exploited the locality properties of neural network layers
to design custom storage structures that take advantages
of these properties. In addition, they employed dedicated
buffers and tiling techniques to reduce the overall external
memory traffic through increasing data locality.
Chen et al. [109] also observed that using short fixed-
point representation of feature maps (FMs) and weights can
also significantly reduce computation resources and memory
footprint. They found that the area and power of a 32-
bit multiplier can be reduced by a factor of 0.164× and 0.136×, respectively, using 16-bit multipliers. Consequently,
DianNao has been implemented using 65nm fabrication
technology with 16-bit fixed-point arithmetic units, 6 bits of
which are used for the integer part and the remaining 10 for
the fractional part. The experimental results demonstrated
that DianNao has an average performance of 452 GOPS
with power consumption of 485 mW. The results depicted
that using 16-bit arithmetic units instead of 32-bit ones in-
troduced only 0.26% accuracy loss on MNIST dataset [110].
On the other hand, the scalability and efficiency of DianNao
accelerator are severely limited by the bandwidth constraints
of the memory system.
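The 16-bit fixed-point format described above (6 integer bits and 10 fractional bits) can be modeled as follows; the rounding and saturation behavior are assumptions for illustration, since the paper does not detail them.

```python
import numpy as np

def to_fixed(x, int_bits=6, frac_bits=10):
    """Quantize floats onto a signed fixed-point grid with the given integer/fractional bits."""
    scale = 2 ** frac_bits
    lo = -(2 ** (int_bits + frac_bits - 1))       # most negative 16-bit code
    hi = 2 ** (int_bits + frac_bits - 1) - 1      # most positive 16-bit code
    return np.clip(np.round(x * scale), lo, hi).astype(np.int16)

def to_float(codes, frac_bits=10):
    return codes.astype(np.float32) / (2 ** frac_bits)

x = np.array([0.12345, -3.5, 31.999, 40.0])       # 40.0 saturates: outside the Q6.10 range
print(to_float(to_fixed(x)))                      # step size is 2^-10, magnitude limited to ~32
```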
In a related research work, Chen et al. [111], [112] pro-
posed DaDianNao multi-chip supercomputer which offers
sufficient memory capacity suitable for on-chip storage of
all weights in CNNs. This system is mainly important for
today’s large-scale deployments of sophisticated industry
and consumers services. DaDianNao uses 16-bit fixed-point
numbers in the inference process like DianNao, but it is
implemented using 28nm technology. The results show that
DaDianNao outperforms the performance of a single GPU
architecture by up to 656.63× and reduces the average energy consumption by 184.05× with only 0.01% accuracy
error rate on MNIST dataset for a 64-chip system.
Another member of the DianNao family, called PuDian-
Nao [113], has been designed using TSMC 65nm process
to support multiple techniques and scenarios of machine
learning (ML). PuDianNao accelerates different ML tech-
niques through extracting their critical locality properties
and computational primitives with the use of on-chip storage
as well as 7 novel functional units. Experimental results
show that PuDianNao is 1.20× faster and 128.41× more energy-efficient than the NVIDIA K20M GPU
architecture. However, both of DaDianNao [111] and Pu-
DianNao architectures have not been optimized to be used for embedded applications.

FIGURE 5. Processing Element (PE) Architecture in: (a) FlexFlow, (b) 2D-Mapping [118].
To improve the scalability and energy efficiency of Di-
anNao design discussed in [109], ShiDianNao accelerator
was proposed [114]. ShiDianNao is designed especially
for real-time object recognition applications such as self-
driving cars, smartphones, and security using 65nm CMOS
technology. The proposed accelerator directly connects with
a CMOS/CCD sensor in the image processing chip. In
addition, all the weights of CNN layers are stored in SRAM
on-chip memory, as the target here is small CNN models.
ShiDianNao is embedded inside the processing chip to elim-
inate off-chip DRAM memory accesses and minimize data
movements between the SRAM holding the CNN model
and the individual processing elements from the sensor.
ShiDianNao has a power consumption of 320.10 mW with
a peak performance of 194 GOPS under 1 GHz working
frequency. Moreover, ShiDianNao has 1.87× speedup and is 60× more energy-efficient than DianNao [109].
However, DianNao [109], DaDianNao [111], [112], Pu-
DianNao [113], and ShiDianNao [114] are not implemented
using FPGA or any other reconfigurable hardware. There-
fore, they cannot be efficiently adapted to different appli-
cation demands (i.e., different neural network sizes). In
addition, ASIC designs have a long development cycle and
lack flexibility for handling varying DL network designs.
Finally, CNN accelerators, which store all weights on-chip
such as ShiDianNao [114], will not be able to support
realistic large-scale CNN models.
Similar approaches to the DianNao family of techniques
are presented in [115] with similar limitations. ISAAC [116]
and PRIME [117] have explored in-memory processing to
design an acceleration architecture for neural networks. The
proposed ISAAC architecture has achieved better improve-
ments of 14.8×, 5.5×, and 7.5× in throughput, energy, and
computational density, respectively, than the state-of-the-art
DaDianNao architecture.
In CNN models, fine-grained parallelism appears at fea-
ture map level, in the neuron level, and in the synapse level.
Lu et al. [118] reviewed current accelerators that exploit
the intrinsic parallelism and observed a mismatch between
the parallel types supported by the computing engine and
the dominant parallel types that appear in CNN workloads.
They identified that most of the previous techniques pro-
posed solutions that fall into one of the three representative
architectures: (i) Systolic, (ii) 2D-mapping, and (iii) Tiling.
Due to limitations of dataflow of each of the above three
architectures, most existing accelerators support only one
specific parallelism. Systolic architectures exploit synapse
parallelism (SP), 2D-mapping architectures exploit neuron
parallelism (NP), and tiling architectures exploit feature map
parallelism (FP). However, in a practical CNN, the dominant
parallel type depends on the number of input FMs, the
number of output FMs, the size of the output FMs, and
the size of the kernel.
With three components (feature map, neuron, synapse) that can each be either left serial or parallelized, we get 2^3 = 8 possible combinations. An example of a processing style could be SFSNMS, meaning single feature map, single neuron, and multiple synapses.
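The eight combinations can be enumerated directly; the naming below (an S or M prefix per component, as in SFSNMS) simply follows the convention used in the text.

```python
from itertools import product

components = ["F", "N", "S"]      # feature map, neuron, synapse
styles = ["".join(mode + comp for mode, comp in zip(modes, components))
          for modes in product("SM", repeat=3)]
print(styles)
# ['SFSNSS', 'SFSNMS', 'SFMNSS', 'SFMNMS', 'MFSNSS', 'MFSNMS', 'MFMNSS', 'MFMNMS']
```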
To address the above problem, and support all possi-
ble processing styles, Lu et al. [118] proposed a flexible
dataflow architecture, called FlexFlow, with minimal con-
trols. FlexFlow supports all types of data paths in each type
of parallelism in different layers efficiently.
As a first step, a modification to the processing element
(PE) micro-architecture, and the interconnections between
PEs, is proposed. PEs are arranged in rows where each
row can complete one convolution and serve one output
neuron. The adders in each PE row are connected to form
the adder tree. Fig. 5 illustrates the proposed PE in FlexFlow
and that in 2D-mapping architecture. By eliminating de-
pendency between adjacent PEs, the proposed convolutional
unit supports the comprehensive MFMNMS parallelisms. To
cater to different types of parallelisms, they also proposed
a hierarchical dataflow with high data “routability” and low
control overhead. The entire dataflow can be divided into
three sub-flows: (i) distribution to local storage in each
PE, (ii) fetching of data from local storage for operators
(multiplier and adder), and, (iii) dataflow from neuron and
kernel buffers to the distribution layer. They also presented
a method to determine parallelization type and degree (i.e.,
the unrolling parameters) for each CONV layer.
FlexFlow architecture was evaluated for computing re-
source utilization, performance, power, energy, and area.
Comparison was made with three typical architectures (i.e.,
systolic, 2D-mapping, and tiling) using six practical work-
loads, including AlexNet and VGG. They also examined
the scalability of FlexFlow in terms of resource utilization,
power, and area with growing scales of computing engine.
From experimental results, it was found that the computing resource utilization of FlexFlow was over 80% across all workloads, in contrast to the baselines that utilized less than 60% (most of them less than 40%). In terms
of performance, FlexFlow demonstrated over 420 GOPS
performance with 1 GHz working frequency. It also out-
performed others in terms of data reusability and power
efficiency.
C. FPGA-BASED ACCELERATORS
In this subsection, we will review recent techniques employ-
ing FPGAs for the acceleration of deep learning networks.
For each reviewed technique, we will highlight the key
features utilized to maximize performance and throughput
in the acceleration process.
FPGA implementations of CNNs appeared in the mid-
1990’s when Cloutier et al. [119] designed the virtual
image processor (VIP) on Altera EPF81500 FPGA. VIP
is a single-instruction stream multiple-data streams (SIMD)
multiprocessor architecture with a 2D torus connection
topology of processing elements (PEs). VIP improves the
performance through the use of low-accuracy arithmetic
to avoid implementing full-fledged multipliers. Fortunately,
recent digital signal processing (DSP)-oriented FPGAs in-
clude large numbers of multiply-and-accumulate (MAC)
units which allow for extremely fast and low power CNN
implementations.
Thereafter, FPGA implementations of deep learning net-
works have mainly focused on accelerating the computa-
tional engine through optimizing CONV layer operations.
Several studies in the literature [120]–[126] have reported
FPGA-based implementations of convolution operation.
Farabet et al. [127] presented an FPGA implementation of
CNN that uses one dedicated hardware convolver and a soft-
processor for data processing and controlling, respectively.
The proposed implementation is referred to as convolutional
network processor (CNP). CNP exploits the parallelism of
CONV layers to accelerate the computational engine of
CNNs while fully utilizing the large number of DSPs, the
MAC hardware units on FPGA. The proposed architecture
consists of Virtex4 SX35 FPGA platform and external mem-
ory. The authors designed a dedicated hardware interface
with the external memory to allow 8 simultaneous read/write
accesses transparently. In addition, they used first in first
out (FIFO) buffers between the FPGA and the external memory chip in both directions to guarantee the steadiness of dataflow.

FIGURE 6. 2D Convolution Module of 3 × 3 Kernel [127].
The vector arithmetic and logic unit in CNP implements
2D CONV, pooling, and non-linear activation function op-
erations of convolutional networks. The implementation of
2D CONV with kernel of size 3 (i.e., K = 3) is shown in Fig. 6, where x is the data from input feature map (FM), y is the partial result to be combined with the current result, z is the result to the output FM, Wij is the weight value in the convolution kernel, and W is the width of the input image. It can be seen that the proposed convolutional module accomplishes K^2 MAC operations simultaneously
in each clock cycle. CNP represents FMs and weights using
16-bit (Q8.8) fixed-point format. The proposed accelerator
has been implemented for a face detection system with
LeNet-5 architecture [128]. It utilized 90% and 28% of the
general logic and multipliers, respectively. In addition, CNP
consumed less than 15 Watts of power.
Sankaradas et al. [129] proposed a massively parallel
coprocessor to accelerate CNNs using Virtex5 LX330T
FPGA platform. The proposed accelerator mainly focused
on optimizing computation engine by employing the paral-
lelism within convolution kernel and FMs. The coprocessor
can be considered as parallel clusters of vector processing
elements (VPEs) where each cluster is designed using 2D
convolvers, adders, sub-samplers, and look-up tables. Each
VPE consists of multiplier-accumulator and programmable
register units to hold kernel weights and FM data. To
hold the massive intermediate data of CNNs, the authors
employed a dedicated off-chip memory (4 DDR2 memory
banks) with a large bandwidth on the coprocessor card.
Moreover, the proposed accelerator uses a low precision
data representation feature with memory packing to further
improve the memory bandwidth as well as the throughput.
20-bit and 16-bit fixed-point representations were utilized
for kernel weights and FMs, respectively.
FIGURE 7. MAPLE Processing Core Architecture [132].
The authors examined their architecture on CNN with
4 CONV layers and without any fully connected (FC)
layer for a face recognition application. The results show
that the proposed coprocessor is 6× faster than a software
implementation on a 2.2 GHz AMD Opteron processor
with less than 11 Watts of power dissipation. However, the
proposed accelerator cannot be used to accelerate full CNNs
as it uses few CONV layers without any FC layer. A full
CNN model consists of both CONV layers and FC layers.
Thus, an efficient CNN accelerator for real-life applications
is needed to consider both. Similar approaches to the work
of Sankaradas et al. [129] are presented in [130], [131] to
accelerate support vector machines (SVM).
MAPLE [132] is a programmable FPGA prototype sys-
tem presented to accelerate both learning and classification
tasks in applications with unstructured large amount of
data. The authors analyzed five workload domains to help
in designing MAPLE. These workloads are SVM [133],
supervised semantic indexing (SSI) [134], K-means [135],
generalized learning vector quantization (GLVQ) [136], and
CNNs [71]. They found that their computations can be
structured as parallel streams of vector or matrix operations.
Thus, they architected MAPLE as a 2D grid of vector pro-
cessing elements as shown in Fig. 7. To efficiently perform
matrix multiplication, they allocate a private local storage
to each PE which is used to store a column, or part of it,
from the multiplier matrix. In this way, matrix multiplication
is accomplished by streaming the multiplicand matrix rows
through the PEs where each PE performs a MAC operation.
The PEs are organized in clusters, where each group is
served by a separate memory bank of the banked off-chip
memories, which create independent streams for processor-
memory computation.
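A software model of this organization (a hedged sketch, not MAPLE's actual hardware): each PE privately stores one column of the multiplier matrix, and the rows of the multiplicand matrix are streamed past all PEs, each of which performs one MAC per streamed element.

```python
import numpy as np

def streamed_matmul(a, b):
    """Compute a @ b by streaming rows of a past PEs that each hold one column of b."""
    num_rows, depth = a.shape
    pe_columns = [b[:, j] for j in range(b.shape[1])]    # private local storage per PE
    out = np.zeros((num_rows, len(pe_columns)))
    for i in range(num_rows):                            # stream the multiplicand rows
        row = a[i]
        for j, col in enumerate(pe_columns):             # every PE sees the same stream
            acc = 0.0
            for k in range(depth):                       # one MAC per streamed element
                acc += row[k] * col[k]
            out[i, j] = acc
    return out

a, b = np.random.rand(4, 6), np.random.rand(6, 3)
print(np.allclose(streamed_matmul(a, b), a @ b))         # True
```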
Moreover, MAPLE uses on-chip smart memory blocks
to process the large intermediate data on-the-fly using in-
memory processing. Fig. 8 shows the architecture of the
smart memory block. To illustrate the idea of on-the-fly
in-memory processing, let us consider finding the maximum K elements.

FIGURE 8. MAPLE Smart Memory Block [132].

The filter compares the input data with the threshold value (VAL). If the input value is greater than
VAL, it updates the list by replacing VAL at address ADDR
with the input value. Then, the scanner (SCAN) searches for
the new minimum value in the list and updates the threshold
VAL and ADDR accordingly. It should be mentioned here
that the employment of in-memory processing reduced the
off-chip memory traffic by 1.64×, 25.7×, and 76× for SSI,
K-means, and CNN workloads, respectively. MAPLE pro-
totype has been implemented on Virtex5 SX240T platform
running at 125MHz. The experimental results for face and
digit recognition CNNs [137]–[139] show that MAPLE is
50% faster than a 1.3 GHz NVIDIA Tesla C870 GPU implementation.
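The filter-and-scan behavior of the smart memory block can be modeled as below (an illustrative sketch; VAL and ADDR mirror the threshold value and its address described above):

```python
def top_k_filter(stream, k):
    """Keep the maximum k elements of a data stream, updating the threshold on the fly."""
    buf = [float("-inf")] * k          # the list held inside the smart memory block
    val = float("-inf")                # VAL: current minimum among the kept elements
    addr = 0                           # ADDR: location of VAL in the list
    for x in stream:
        if x > val:                    # filter: compare the input with the threshold
            buf[addr] = x              # replace VAL with the new input value
            addr = min(range(k), key=lambda i: buf[i])    # scan for the new minimum
            val = buf[addr]            # update the threshold
    return buf

print(sorted(top_k_filter([7, 1, 9, 4, 8, 2, 6], k=3), reverse=True))    # [9, 8, 7]
```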
Chakradhar et al. [100] proposed a dynamically config-
urable CNN architecture on FPGA. The proposed system
consists of three main components; a coprocessor, a dy-
namically configurable CNN (DC-CNN) processing core,
and 3-bank memory subsystem. The coprocessor is designed
such that it automatically configures the software and the
hardware elements to fully exploit the parallelism at the
workload level. DC-CNN is responsible for executing CNN
applications and its architecture is shown in Fig. 9. It
consists of m computational elements (each with n 2D
convolvers as well as sub-sampling (S) and non-linearity (NL) pipelined units), m adders (each with n inputs), and input/output switches.

FIGURE 9. The Architecture of DC-CNN [100].

The internal structure of the switches vector encloses m × n selectors which are used to help
in exploring the entire design space and to provide the
configurability function across different layers of CNN
model. To determine the best feasible (m, n) combination for
each layer, the system analyzes the workload using integer
factorization techniques, since factorization is fast for
small numbers [140], [141]. Dynamic programming is also
used to quickly prune infeasible combinations.
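The (m, n) selection step can be pictured with the short Python sketch below, which enumerates factor pairs of a convolver budget by trial division and scores them with a placeholder cost; the budget and cost model are assumptions for illustration only, not those of [100].

```python
# Illustrative sketch of selecting an (m, n) split via integer factorization,
# in the spirit of DC-CNN's coprocessor: m computational elements, each with
# n 2D convolvers, under a fixed convolver budget. The cost model is a
# placeholder, not the one used in [100].

def factor_pairs(budget):
    """All (m, n) with m * n == budget, found by trial division."""
    return [(m, budget // m) for m in range(1, budget + 1) if budget % m == 0]

def pick_config(budget, out_maps, in_maps):
    # Toy cost: idle convolvers caused by mismatch with the layer shape.
    def waste(m, n):
        return (m * ((out_maps + m - 1) // m) - out_maps) + \
               (n * ((in_maps + n - 1) // n) - in_maps)
    return min(factor_pairs(budget), key=lambda p: waste(*p))

# Example: 20 convolvers, a layer with 16 output and 12 input feature maps.
print(pick_config(20, out_maps=16, in_maps=12))   # (4, 5) under this toy cost
```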
The authors compared the proposed DC-CNN architec-
ture, considering 20 2D convolvers as well as a mem-
ory subsystem of 128-bit port width, with a 1.35 GHz
NVIDIA’s GPU implementation. The results show that DC-
CNN achieved 4.0×, 4.4×, 5.4×, 6.0×, and 6.5× speedup
for face recognition [137], face detection [139], mobile
robot vision [127], video surveillance [100], and automotive
safety [100] workloads, respectively. It is worth mentioning
that DC-CNN is the first architecture that achieves a perfor-
mance suitable for real-time processing for video streaming
as it processes up to 30 frames per second. In addition, DC-
CNN is more energy-efficient than the GPU implementation
as it consumes 14 Watts, while more than 150 Watts are
consumed by the GPU. On the other hand, the authors
modeled their architecture on a CNN with only 3 CONV
layers and no FC layers, which limits its suitability for
many of today's real-life applications.
A second-generation of CNP [127] architecture has been
proposed in [142] by designing a stream processor sys-
tem. The proposed design replaces the dedicated hardware
convolver in CNP with multiple parallel vector processing
units, named as ALUs, laid out in a 2D grid. Each ALU
is composed of four local routers, one global router, and a
streaming operator. The local routers are used to stream data
to/from the neighbors. Streaming data to and from global
data line is done through the global router. The streaming
operators in the ALU are fully pipelined to produce a
result per clock cycle as described in [127] with the use of
Q8.8 coding to represent FMs and weights. The proposed
system also uses a multi-port direct memory access (DMA)
streaming engine to allow individual streams of data to
operate seamlessly within processing blocks. The results
show that the proposed stream processor system can run
small CNNs at up to 30 fps while consuming about 15 Watts.
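The Q8.8 coding used by CNP and this stream processor can be sketched in a few lines of Python; the helper names and the saturation behavior shown here are illustrative assumptions.

```python
# A small sketch of the Q8.8 fixed-point coding mentioned above: 8 integer
# bits and 8 fractional bits packed into a signed 16-bit word.

SCALE = 1 << 8  # 2^8 fractional resolution

def to_q88(x):
    """Quantize a float to Q8.8, saturating to the signed 16-bit range."""
    v = int(round(x * SCALE))
    return max(-(1 << 15), min((1 << 15) - 1, v))

def from_q88(q):
    return q / SCALE

def q88_mul(a, b):
    """Multiply two Q8.8 values; the wider product is rescaled back to Q8.8."""
    return (a * b) >> 8

w, x = to_q88(0.75), to_q88(-2.5)
print(from_q88(q88_mul(w, x)))   # -1.875
```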
An improved version of CNP architectures given in [127],
[142] was presented in [143] and referred to as neuFlow.
Particularly, neuFlow has replaced the 2D grid of ALUs
with a 2D grid of processing tiles (PTs). The proposed
architecture contains a 2D grid of PTs, a control unit,
and a smart DMA module, as shown in Fig. 10. Each
PT consists of local operators and a routing multiplexer
(MUX). The top three PTs have been implemented to
perform MAC operation. Thus, they can be used to perform
2D convolution, simple dot-products, and spatial pooling.
General-purpose operations, such as dividing and squaring,
have been implemented at the middle three PTs.
FIGURE 10. The Architecture of neuFlow [143].
Therefore,
the middle row of neuFlow can be used for normalization.
Finally, neuFlow’s bottom PTs row implements non-linear
operations. Moreover, each operator employed input and
output FIFOs to stall its pipeline when required. On the
other hand, PT’s MUX is used to connect its local operators
with the neighboring PT’s streaming operators and off-chip
memory instead of the used local routers and global router
discussed in [142].
NeuFlow uses a dataflow compiler, named luaFlow, to
translate a high-level flow-graph representation of CNNs
in Torch5 [144] into HDL scripts with different levels of
parallelism. In addition, luaFlow produces a binary code
configuration file and holds it in the embedded control
unit. Thereafter, the control unit configures the 2D grid
of PTs (connections and streaming operator) and the DMA
ports through run-time configuration buses. A smart memory
module has been designed to support multiple asynchronous
accesses of off-chip memory through its reconfigurable
ports. By targeting the larger Xilinx Virtex6 VLX240T
FPGA, neuFlow achieved 147 GOPS at 10 Watts for street
scene parsing CNN in [145] with the use of 16 bits to
represent FMs and weights.
Peemen et al. [146] utilized the flexible off-chip memory
hierarchy method to design a configurable memory-centric
accelerator template for a variety of models of CNNs. This
accelerator exploits data reuse in complex access patterns to
reduce off-chip memory communication, which minimizes
the bandwidth requirements. The memory-centric accelera-
tor maximizes the efficiency of on-chip memories for better
data locality using loop transformation (to optimize the
tiling parameters) and block RAM (BRAM)-based multi-
bank on-chip buffers [147]. At the same time, it minimizes
the size of FPGA on-chip memories to optimize energy
and area usage, which are key requirements for embedded
platforms.
The memory-centric accelerator uses a SIMD cluster
of MAC PEs with flexible reuse buffers to accelerate the
CONV layer. The acceleration template has been imple-
mented on Virtex6 FPGAs. In addition, a MicroBlaze pro-
cessor has been utilized to configure and communicate with
the accelerator via FIFO-based fast simplex link (FSL). The
proposed accelerator has been analyzed for a CNN vision
task of size 2.74 GMAC and the results show that the
memory-centric accelerator is 11× faster than a standard
implementation with similar FPGA resources.
Neural network next (nn-X) [148] is a real-time system-
on-chip (SoC) computing system for deep learning networks
on mobile devices. The architecture of nn-X consists of a
host processor, a co-processor, and external memory. The
co-processor accelerates the learning networks by paral-
lelizing their operations across arrays of configurable
processing elements referred to as collections. Each collec-
tion contains one convolution engine, one pooling module,
and one non-linear operator. The CONV engine accelerates
the CONV operation by fully pipelining the incoming data
with the use of cache memories. The collections are able
to communicate with one another using the collection route
component to achieve cascaded pipelining, which results in
reducing accesses to external memory. The data transfer
between the collections and the external memory is ac-
complished through the co-processor's full-duplex memory
router, which provides independent data streams. The nn-
X has been prototyped on Xilinx ZC706 which contains
Zynq XC7Z045, two ARM Cortex-A9 processors, and 1 GB
DDR3. Eight collections have been employed to achieve
large parallelism. The results for face recognition model
in [149] show that nn-X is 115× faster than the two
embedded ARM processors.
Zhang et al. [55] proposed a roofline-based model to
accelerate convolutional neural networks on FPGAs. The
roofline model is an intuitive visual performance model used
to relate the attainable performance to the peak performance
that can be provided by the hardware platform and the
off-chip memory traffic [150]. The focus in their work
is primarily on accelerating the convolutional layers as
they consume more than 90% of the computational time
during the prediction process [77]. In doing so, the authors
optimized both the computation operations and the memory
access operations in convolutional layers. They considered a
CNN application composed of five convolutional layers that
won the ImageNet competition in 2012 [28]. The proposed
accelerator uses polyhedral-based data dependence analy-
sis [151] to fully utilize all FPGA computational resources
through loop unrolling, loop pipelining, and loop tile size
enumeration. Note that loop unrolling maximizes the par-
allel computation of CONV MAC operations. On the other
hand, local memory promotion and loop transformation are
used to reduce redundant communication operations and to
maximize the data sharing/reuse, respectively.
Subsequently, the roofline performance model is used
to identify the optimal design from all possible solutions
in the design space. Specifically, the authors model all
possible legal designs delivered from the polyhedral analysis
in roofline to find the optimal unrolling factor ⟨Tm, Tn⟩ for
every convolutional layer, where Tm and Tn are the tile sizes
for the output FMs and input FMs, respectively.
FIGURE 11. Zhang et al. [55] Accelerator Architecture.
However,
designing a CNN accelerator with different unrolling factors
to each convolutional layer is challenging. Therefore, the
proposed architecture enumerates all possible valid designs
to find uniform cross-layer unrolling factors. Thereafter, the
hardware accelerator is implemented based on the cross-
layer optimal unrolling factors.
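As a rough illustration of this roofline-guided search, the Python sketch below scores candidate ⟨Tm, Tn⟩ pairs as the minimum of a computation roof and a bandwidth roof; the resource, clock, and bandwidth numbers are placeholders rather than the figures used in [55].

```python
# Hedged sketch of a roofline-style search for uniform unrolling factors
# <Tm, Tn>, in the spirit of [55]. The constants below are placeholder
# assumptions, not the actual VC707 figures.

PEAK_GFLOPS_PER_UNIT = 0.2   # per multiply-add unit at the design clock (assumed)
BANDWIDTH_GBPS = 4.5         # off-chip bandwidth (assumed)
MAX_UNITS = 448              # multiply-add budget (assumed)

def attainable(tm, tn, ctc_ratio):
    """Roofline: attainable perf = min(computation roof, CTC ratio * bandwidth)."""
    comp_roof = tm * tn * PEAK_GFLOPS_PER_UNIT
    return min(comp_roof, ctc_ratio * BANDWIDTH_GBPS)

def best_design(ctc_ratio):
    candidates = [(tm, tn) for tm in range(1, 65) for tn in range(1, 65)
                  if tm * tn <= MAX_UNITS]
    return max(candidates, key=lambda p: attainable(*p, ctc_ratio))

# ctc_ratio would itself depend on the tiling in the real model; here it is fixed.
print(best_design(ctc_ratio=10.0))
```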
The proposed accelerator, composed of a computational
engine and a memory sub-system, is depicted in Fig. 11.
The computation engine is designed as Tm duplicated tree-
shaped poly structures with Tn inputs from the input FMs,
Tn inputs from the weights, and one input from the bias.
On the other hand, the memory sub-system is implemented
as four sets of on-chip buffers; two sets to store the input
FMs and weights, each with Tn buffer banks, and two
buffer sets of Tm independent banks for storing the output
FMs. To overlap data transfer with computation, on-chip
buffers are operated in a ping-pong manner. In addition,
two independent channels are implemented for load and off-
load operations to increase the bandwidth utilization. More-
over, MicroBlaze processor is used to send configuration
parameters and commands for the accelerator over AXI4lite
bus. The CNN accelerator communicates with external data
transfer engines through FIFO interfaces, where the data
transfer engines are used to access DDR3 DRAM memory
through AXI4 bus.
The accelerator is designed using Vivado 2013.4 high
level synthesis tool and implemented on Xilinx VC707
FPGA board clocked at 100 MHz. The experimental results
depict that the proposed implementation achieves a peak
performance of 61.62 GFLOPS as well as a 17.42× speedup
over the software implementation on Intel Xeon CPU E5-
2430 at 2.20 GHz with 15 MB cache and 16 threads. In
addition to this, the results show that the proposed FPGA
architecture is 24.6× more energy-efficient than the software
implementation as the total power consumption is only 18.6
Watts. The proposed implementation has some limitations
such as designing the accelerator with new cross-layer
unrolling factors for different architectures of CNNs. Fur-
thermore, using the CNN accelerator with uniform unrolling
factors might be sub-optimal for some CONV layers, which
affects the overall performance.
FIGURE 12. Top-Level Architecture of Microsoft CNN Accelerator [152].
In 2014, Microsoft research team of Catapult project
integrated FPGA boards into data center applications to
successfully achieve 2× speedup for Bing Ranking (the
large-scale search engine) [67]. A year later, Ovtcharov et
al. [152] at Microsoft Research utilized Catapult hardware
infrastructure, a dual-socket Xeon server equipped with
Stratix-V GSMD5 FPGA, to design a specialized hardware
for accelerating the forward propagation of deep CNNs in
a power-constrained data center.
The top-level architecture of the proposed CNN accelera-
tor is shown in Fig. 12. Multi-banked input buffer and kernel
weight buffer are used to provide an efficient buffering
scheme of FMs and weights, respectively. To minimize
the off-chip memory traffic, a specialized network on-chip
has been designed to re-distribute the output FMs on the
multi-banked input buffer instead of transferring them to
the external memory. The 3D convolution operations (such
as the dot-product) and other CNN operations are indepen-
dently performed using spatially distributed scalable vectors
of PEs. The controller engine is responsible for streaming
and data delivery of multi-banked input buffer and kernel
weight buffer data to each of the PE vectors. In addition, it
supports configuring multiple CNN layers at run-time. The
results show that the proposed design is able to classify 134
images/sec, while consuming about 25 Watts, for AlexNet
model on ImageNet-1K dataset [28], which is 3× better
than the published throughput results for the Roofline-based
FPGA Accelerator [55]. The authors mentioned that using
top-of-the-line FPGAs such as Arria 10 GX 1150 improves
the throughput to around 233 images/sec.
Qiu et al. [98] proposed an FPGA design to accelerate
CNNs for a large-scale image classification challenge on
embedded systems. The focus was on accelerating both
CONV and FC layers, since they are considered as the
most computational-centric and the most memory-centric
operations in CNNs, respectively. The proposed accelerator
reduces the resource consumption using specific design
of convolver hardware module. In addition, the authors
applied singular value decomposition (SVD) to the weight
matrix of FC layer in order to reduce memory footprint at
this layer [101]. To further reduce memory footprint and
bandwidth requirement of CNN, they proposed a dynamic-
precision data quantization flow component. This compo-
nent is responsible for finding the optimal fractional length
for weights in each layer as well as the optimal fractional
length for FMs in adjacent layers, while achieving minimal
accuracy loss. Then, it converts the floating-point numbers
representing weights and FMs into fixed-point numbers.
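A minimal Python sketch of such a dynamic-precision quantization step is given below; the error metric (mean absolute error) and the exhaustive search over fractional lengths are simplifying assumptions, not the exact flow of [98].

```python
# A minimal sketch of choosing a per-layer fractional length for fixed-point
# weights, in the spirit of the dynamic-precision quantization flow described
# above. The search criterion here (mean absolute error) is an assumption.

import numpy as np

def quantize(values, frac_bits, total_bits=16):
    scale = 2.0 ** frac_bits
    qmin, qmax = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(values * scale), qmin, qmax)
    return q / scale

def best_frac_len(values, total_bits=16):
    """Pick the fractional length that minimizes the quantization error."""
    errors = {fl: np.abs(values - quantize(values, fl, total_bits)).mean()
              for fl in range(total_bits)}
    return min(errors, key=errors.get)

weights = np.random.randn(1000) * 0.05        # toy layer weights
print("chosen fractional length:", best_frac_len(weights))
```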
In addition, the authors proposed a data arrangement
scheme that maximizes the burst length of each transaction
to the external memory to accelerate CONV and FC layers,
as well as to avoid unnecessary access latency. Note that
maximizing the DRAM burst length raises the effective
DRAM bandwidth [55], [153].
The proposed architecture consists of a processing system
(CPU) and programmable logic (FPGA). CNN computations
are performed through special design of processing element
modules in FPGA. The main modules in the processing
element are convolver complex, max-pooling, non-linearity,
data shift, bias shift, and adder tree, as shown in Fig. 13.
The convolver complex is designed as a classical line
buffer [154], as shown in Fig. 14, to achieve convolution
operations as well as to compute FC layer multiplication of
matrix-vector. The pooling layer implemented in the max-
pooling module is used to output the maximum value in the
input data stream with a window of size 2. The activation
function of CNN is applied to the input data stream using the
non-linearity module. The adder tree accumulates the partial
sums generated from the convolvers. Finally, data shift
and bias shift modules are responsible for accomplishing
dynamic quantization.
The proposed embedded FPGA platform has been im-
plemented using VGG-16-SVD network with 16-bit fixed-
point numbers on Zynq XC7Z045 platform.
FIGURE 13. Processing Element Module of Qiu et al. [98] Embedded
Accelerator Architecture.
FIGURE 14. Convolver Complex Design of Qiu et al. [98] Embedded
Accelerator Architecture.
The results
demonstrate that applying SVD technique reduces memory
footprint of FC layer by 85.8% with a compression rate of
7.04% while introducing an accuracy loss of only 0.04%.
Finally, the overall performance of the proposed acceler-
ator reported is 136.97 GOPS under 150 MHz working
frequency with the top-5 accuracy of 86.66% and a total
power consumption of 9.63 Watts.
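The SVD-based compression of the FC weight matrix can be illustrated with the following Python sketch; the matrix size and rank are toy values chosen only to show how the parameter count shrinks.

```python
# Illustrative sketch of SVD-based compression of an FC weight matrix, as
# applied by Qiu et al. [98] to reduce the memory footprint. The rank and
# matrix size are toy values.

import numpy as np

def svd_compress(W, rank):
    """Approximate W (m x n) by U_r (m x r) times V_r (r x n)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]      # fold singular values into U
    V_r = Vt[:rank, :]
    return U_r, V_r

W = np.random.randn(1024, 1024).astype(np.float32)
U_r, V_r = svd_compress(W, rank=64)
original = W.size
compressed = U_r.size + V_r.size
# 2 * 1024 * 64 / 1024^2 = 0.125 of the original parameters
print("parameter ratio: %.3f" % (compressed / original))
```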
DeepBurning [155] is an FPGA-based neural network
(NN) design automation tool. It allows for building learning
accelerators for specific NN with optimized performance
and custom design parameters configuration using a pre-
constructed register transfer level (RTL) module library. The
RTL library holds the hardware descriptive scripts for NN
reconfigurable components as well as their configuration
scripts. In addition, it contains other RTL building blocks
for logical and arithmetic operations such as the connection
box (used to exchange data between NN layers as well as to
approximate the division operation) and approximate look-
up table (LUT) (used to simplify a function or operation to
allow it to be mapped into hardware).
In order to design an optimized hardware, DeepBurning
compresses the passed NN model to the greatest extent
using temporal and spatial folding, which also helps in
satisfying the resource constraints and minimizing the re-
quired hardware modules. DeepBurning not only generates
the hardware description for neural network scripts, but
also analyzes the complex access pattern and data local-
ity using an integrated compiler to generate a run-time
control flow which provides an energy-efficient and better
data reuse implementation. In addition, the DeepBurning
compiler investigates the accelerator on-chip memory size
and throughput to properly tile and partition the NN weights
and feature data layouts. Moreover, DeepBurning uses the
address flow component to automatically fetch and store
off-chip memory and on-chip memory data. The authors
compared the performance of DeepBurning with that in [55],
considering AlexNet CNN model, as they both operate
at 100 MHz. They considered a high-budget resource-
constrained DeepBurning configuration on a Zynq-7045 device. The results
show that DeepBurning is 1.13× slower but 1.45× more
energy-efficient.
An OpenCL-based optimization framework to accelerate
large-scale convolutional neural network models was pro-
posed by Suda et al. [80]. They found that the number of
performed CONV MAC operations in parallel (NCONV),
SIMD vectorization factor (SCONV), normalization layer
loop unrolling factor (NNORM), the number of parallel
pooling outputs in one cycle (NPOOL), and the number of
parallel FC MAC operations (NFC) are the key variables
that determine the parallelism of the design. Subsequently,
they analytically and empirically modeled the execution
time for each layer as a function of the above mentioned
variables. Then, genetic algorithm was used to explore the
design space for finding the optimal combination of the key
design variables considering the resources constraints.
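The flavor of this model-then-search step is illustrated by the Python sketch below, which uses a simple placeholder timing model and a brute-force search instead of the per-layer models and genetic algorithm employed in [80]; all constants are assumptions.

```python
# Sketch of the kind of analytical model and design space search described
# above. The timing model is a simple placeholder (the paper derives its own
# per-layer models and explores them with a genetic algorithm).

from itertools import product

DSP_BUDGET = 256   # assumed resource budget

def conv_time(macs, n_conv, s_conv):
    # time ~ total MACs / (parallel MACs * SIMD factor)
    return macs / float(n_conv * s_conv)

def fc_time(macs, n_fc):
    return macs / float(n_fc)

def total_time(cfg, conv_macs=1.2e9, fc_macs=6.0e7):
    n_conv, s_conv, n_fc = cfg
    return conv_time(conv_macs, n_conv, s_conv) + fc_time(fc_macs, n_fc)

def search():
    space = product([2, 4, 8, 16, 32], [2, 4, 8, 16], [2, 4, 8, 16, 32])
    feasible = [c for c in space if c[0] * c[1] + c[2] <= DSP_BUDGET]
    return min(feasible, key=total_time)

print(search())   # the chosen (NCONV, SCONV, NFC) under this toy model
```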
The authors implemented the scalable CONV block in
a similar fashion to that in [138] as a matrix multipli-
cation by flattening and on-the-fly rearrangement of the
feature data. The OpenCL software has been utilized in
their work due to its parallel programming model as well
as its ability to integrate the compiled RTL design with
external memory interfacing IPs [156], which uses memory
coalescing technique with complex load and store units. In
addition, it has optimized matrix multiplication and CPU-
FPGA communication libraries [157], [158].
The framework is used on both VGG-16 and AlexNet
CNN models which are implemented on P395-D8 [159]
and DE5-Net [160] FPGA boards with fixed-point opera-
tions according to their precision study. They compared the
proposed implementation with a 3.3 GHz Intel Core i5-4590 CPU
implementation that uses Caffe tool [58] with ATLAS [161]
optimized library for matrix/vector operations. The results
show that the OpenCL optimized framework on P395-
D8 achieved 5.5× (117.8 GOPS) and 9.5× (72.4 GOPS)
speedups for VGG-16 and AlexNet models, respectively.
On the other hand, DE5-Net FPGA achieved lower throughput
speedups than the P395-D8 (2.2× (47.5 GOPS) for VGG-16,
and 4.2× (31.8 GOPS) for AlexNet) as it has 7.67× fewer
DSPs than what is available on P395-D8.
Zhang et al. [153], [162] analyzed the transformation
of CONV and FC layers to regular matrix multiplication
presented in prior work [98]. For VGG-16 model, they found
that such transformation necessitates up to 25× duplication
of input FMs. To address this problem and improve the
bandwidth utilization, they designed a uniformed matrix
multiplication kernel that uses either input-major mapping
(IMM) or weight-major mapping (WMM) techniques while
computing FC layer. In IMM, the designed kernel batches
a group of different input FMs together, and then performs
the matrix multiplication. IMM technique improves the data
reuse of FC weights. On the other hand, the designed kernel
with WMM technique makes use of the fact that the FC
layer is communication-bound in which the weight matrix
is much larger than the input FM matrix. In particular,
it loads input FM matrix to a weight buffer and loads
weight matrix to input FM buffer. Subsequently, a regular
matrix multiplication is performed on these matrices. As a
result, WMM may allow for a higher data reuse than IMM,
especially for input FMs that can be reused multiple times
considering the limited hardware resources.
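The following Python sketch contrasts IMM and WMM at the level of the underlying matrix multiplication only: both mappings compute the same result, and the hardware difference lies in which operand is batched into which buffer; the shapes are toy values.

```python
# A small sketch contrasting input-major mapping (IMM) and weight-major
# mapping (WMM) for the FC layer, following the description above: both reduce
# to the same matrix multiplication, differing in which operand is treated as
# the batched/streamed one. Shapes are toy values.

import numpy as np

def fc_imm(inputs_batch, W):
    """IMM: batch input vectors together, then multiply by the weights."""
    # inputs_batch: (batch, in_dim), W: (in_dim, out_dim)
    return inputs_batch @ W

def fc_wmm(inputs_batch, W):
    """WMM: treat the (large) weight matrix as the left-hand operand instead."""
    # Same arithmetic, written with the weights on the left.
    return (W.T @ inputs_batch.T).T

x = np.random.randn(4, 512)       # small batch of input FMs
W = np.random.randn(512, 1000)    # FC weights (much larger than the inputs)
print(np.allclose(fc_imm(x, W), fc_wmm(x, W)))   # True
```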
For the above, the roofline model was applied to identify
the optimal mapping technique under different batch sizes
and data precisions. The results demonstrate that WMM is
better than IMM in term of data reuse and bandwidth uti-
lization, especially in small batch sizes which is required for
real-time inference. Hence, the same matrix multiplication
kernel is utilized for the computation of both CONV and
FC layers, but with the use of IMM in CONV layer and
WMM in FC layer. Based on this, the authors proposed
a software/hardware co-design library, which they named
Caffeine, to accelerate CNNs on FPGAs.
With an easy-to-use developed tool, Caffeine aids in
automatically choosing the best hardware parameters, using
the model files from Caffe and FPGA device specifications
obtained from the user. Caffeine FPGA engine uses a high-
level synthesis (HLS)-based systolic-like architecture to
implement matrix multiplication kernel. It allows changing
parameters such as number of PEs, precision, and FM size.
Caffeine further maximizes the FPGA computing capability
by optimizing multi-level data parallelism discussed in [55]
and pipeline parallelism using polyhedral-based optimiza-
tion framework given in [163]. Caffeine framework also
handles the weights and biases reorganization in off-chip
DRAM to maximize the underlying memory bandwidth
utilization. In addition, the double-buffering technique is
employed to prefetch the next data tile for each PE. Caffeine
has been evaluated by implementing AlexNet and VGG-
16 CNNs on Ultrascale KU060 (20nm and 200 MHz)
and on Virtex7 690T (28nm and 150 MHz) considering
different precisions. The VGG-16 implementation with 16-
bit fixed-point on Ultrascale KU060 and Virtex7 690T
provided 43.5× and 65× overall throughput enhancement,
respectively, compared to implementation on a two-socket
server, each socket with a 6-core Intel CPU (E5-2609 at 1.9 GHz).
A special case of dataflow, referred to as synchronous
dataflow (SDF) [164], is a paradigm of computation that
allows for representing a computing system as a stream-
ing problem. In this way, SDF model can represent the
hardware implementation of CNNs using linear algebra and
directed SDF graph (SDFG). Each node of SDFG represents
a hardware building block that can immediately start its
computation as soon as the data are available through its
input arcs. Such representation of CNN model offers a
fast design space exploration. Venieris and Bouganis [165]
employed SDF model to optimize the mapping of CNNs
onto FPGAs based on HLS.
In particular, the proposed fpgaConvNet framework in
[165] takes as input a high-level script programmed by DL
expert describing the CNN model, along with specifications
of the targeted FPGA platform.
FIGURE 15. SDF Graph Partitioning [165].
Thereafter, it parses the
input script through a developed domain-specific language
(DSL) processor to model the CNN in the form of a directed
acyclic graph (DAG) where each node corresponds to a
CNN layer. Then, the DAG-based CNN is transformed into
an SDFG representation and modeled as a topology matrix.
The topology matrix contains the number of incoming
parallel streams, the width of each data stream, and the
production or consumption rates at each node. In addition,
the DSL processor extracts information about the platform-
specific resource constraints.
Unlike other attempts, instead of exploring the design
space for the optimal parameters of loop unrolling and tiling,
fpgaConvNet explores the design space of the topology
matrix components while considering the resource con-
straints. In doing so, fpgaConvNet performs graph parti-
tioning, coarse-grained folding, and fine-grained folding.
The graph partitioning splits the original SDFG into sub-
graphs and each subgraph is then mapped to a distinct
bitstream as shown in Fig. 15. Note that the proposed multi-
bitstream architecture might have multiple CONV layer
processors (CLPs), as in the provided example. This way,
on-chip RAM is used for intermediate results and data reuse
within the subgraph, while access to off-chip memory is
minimized and limited to the input and output streams of
the subgraph. However, this scheme adds reconfiguration
penalty due to the need for reconfiguring the FPGA when
the data flows between adjacent subgraphs. To amortize
this overhead, several input data streams are processed in
a pipelined manner.
Thereafter, each bitstream architecture is optimized using
coarse-grained folding and fine-grained folding. In coarse-
grain folding, CONV, pooling, non-linear, and other major
operations of each layer are unrolled to provide the highest
possible throughput by having several parallel units of each
operation. The fine-grain folding controls the unrolling and
pipelining of the dot-product operations inside CONV and
average pooling units. Instead of fully unrolling the imple-
mentation of the dot-product, which produces one dot-product
per cycle at the cost of a large number of multipliers and
adders, fpgaConvNet uses a smaller number of MAC units
and schedules the execution of different operations using
time-multiplexing. A trade-off between the performance
and the required hardware resources can be achieved by
changing the unroll factor and the degree of multiplex-
ing. Therefore, fpgaConvNet employed simulated annealing
[166] to find the optimal partitioning points and folding
factors. Finally, fpgaConvNet uses optimal components to
derive the configuration of PEs and buffers, and generates
a synthesizable Vivado HLS hardware design.
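The role of simulated annealing in this step can be illustrated with the toy Python sketch below; the cost function is a stand-in for fpgaConvNet's actual latency and resource models, and the cooling schedule is an arbitrary choice.

```python
# A toy simulated annealing loop of the kind fpgaConvNet uses to pick folding
# factors, shown only to illustrate the search procedure. The cost function
# below is a stand-in, not fpgaConvNet's latency/resource model.

import math
import random

random.seed(0)

def cost(fold):
    # Stand-in objective: latency grows with folding, but folds below 4
    # are treated as infeasible (resource budget exceeded) and penalized.
    return fold * 10 if fold >= 4 else 1000

def anneal(start=32, steps=500, t0=50.0):
    current, best = start, start
    for i in range(steps):
        t = t0 * (1 - i / steps) + 1e-3            # cooling schedule
        candidate = max(1, current + random.choice([-1, 1]))
        delta = cost(candidate) - cost(current)
        if delta < 0 or random.random() < math.exp(-delta / t):
            current = candidate
        if cost(current) < cost(best):
            best = current
    return best

print(anneal())   # converges towards the smallest feasible fold (4)
```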
fpgaConvNet framework has been evaluated by mapping
LeNet-5 and scene labelling [167] small CNN models with
Q8.8 fixed-point representation onto a Zynq-7000 XC7Z020
FPGA platform working at 100 MHz. In mapping LeNet-5,
fpgaConvNet achieves up to 1.62× the performance density
of CNP [127]. Compared to Tegra K1 GPU implementation
of scene labelling CNN, fpgaConvNet surpasses Tegra K1’s
power efficiency by 1.05×.
Ma et al. [78] proposed a Python-based modularized RTL
compiler to accelerate CNNs by employing loop unrolling
optimization [55], [79] for CONV layer operations. A de-
tailed review article of this work has been recently published
and referred to as ALAMO [168]. The proposed compiler
integrates both the RTL finer level optimization and the
flexibility of HLS to generate efficient Verilog parameterized
RTL scripts for ASIC or FPGA platform under the available
number of parallel computing resources (i.e., the number
of multipliers (Nm)). If Nm is greater than the number of
input FMs (Nif), the proposed compiler fully unrolls Loop-
3 (Nif; refer to subsection II-A1 for more details) while it
partially unrolls Loop-4 (Nof) to exploit the data reuse of
shared features among Nm/Nif output FMs. Otherwise, it
partially unrolls Loop-3, which results in Nif/Nm repeated
slidings of the kernel window. On the other hand, Loop-2 (X × Y)
is serially computed after Loop-1 (K) to minimize the
number of partial sums.
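The unrolling rule described above can be summarized in a few lines of Python; the function below mirrors the textual description only, and is not the compiler's actual code.

```python
# Sketch of the multiplier-driven unrolling decision described above for the
# ALAMO compiler: given Nm multipliers, either Loop-3 (Nif) is fully unrolled
# and Loop-4 (Nof) partially unrolled, or Loop-3 itself is partially unrolled.

def unroll_plan(Nm, Nif, Nof):
    if Nm >= Nif:
        parallel_ofm = min(Nof, Nm // Nif)     # share input features across output FMs
        return {"loop3_unroll": Nif, "loop4_unroll": parallel_ofm,
                "kernel_window_passes": 1}
    else:
        passes = -(-Nif // Nm)                  # ceil(Nif / Nm) repeated window slidings
        return {"loop3_unroll": Nm, "loop4_unroll": 1,
                "kernel_window_passes": passes}

print(unroll_plan(Nm=256, Nif=96, Nof=256))   # Loop-3 fully unrolled, 2 output FMs in parallel
print(unroll_plan(Nm=64, Nif=96, Nof=256))    # Loop-3 partially unrolled over 2 passes
```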
The overall modules of the proposed CNN accelerator are
shown in Fig. 16. The controller is responsible for directing
and ensuring in-order computation of CNN modules for
each layer. The data routers oversee the selection of data
read and data write of two adjacent modules as well as the
assignment of buffer outputs to shared or pool multipliers
of the multiplier bank. The feature buffers hold the FMs
using on-chip RAMs. The weight buffers are used to ensure
the availability of CONV and FC layers’ weights before
their computation as well as to overlap the transfer of FC
layer weights with its computation. The CONV module
consists of control logic, groups of adder trees, and ReLU
components. The control logic component parametrizes the
loop unrolling factors based on the configuration of each
layer (Nif, Nof, X, Y, and K). The CONV module
contains Nm/Nif adders to sum Nif parallel multiplier
results and accumulate them. Moreover, the adder trees can
be shared by layers with identical Nif to act as one single
module. The ReLU component checks the input pixel sign
bit to either output zero or the data pixel itself. The POOL
module contains accumulators or comparators to perform
average or maximum operation, respectively. The NORM
module maintains the required components to perform the
operations of local response normalization such as square,
non-linear (using look-up table), and multiplication oper-
ations.
FIGURE 16. ALAMO Overall Acceleration Modules [78].
Finally, the FC module shares the multiplier bank
module with the CONV module to perform the matrix-
vector multiplication (MVM).
ALAMO architecture permits the output pixels to be only
stored in the feature buffers, which makes ALAMO suitable
for CNNs with only small intermediate data volumes. The
proposed RTL compiler has been tested by accelerating
two CNN models; AlexNet and NiN [169]. The generated
parameterized RTL scripts for AlexNet and NiN are synthe-
sized using Altera Quartus synthesis tool and implemented
on DE5-Net FPGA board. The experimental results for
AlexNet model are compared with the results for OpenCL-
based design [80] as both use the same FPGA board with
similar hardware resources for AlexNet. ALAMO achieved
1.9× and 1.3× improvements in throughput and power
consumption, respectively. Moreover, the overall throughput
of the NiN model is 1.03× better than that of AlexNet. This is
because NiN has more CONV layers and many of them
have the same Nif .
Liu et al. [170] proposed a parallel framework for FPGA-
based CNN accelerators that exploits four levels of par-
allelism; task level, layer level, loop level, and operator
level. Task-level parallelism involves executing multiple im-
age prediction tasks simultaneously. Layer-level parallelism
exploits pipelining across layers to enable parallel execution
of all layers with different images. Loop-level parallelism
utilizes loop unrolling in performing convolutions and this
can be achieved either through intra-output or inter-output
parallelism. Finally, operator-level parallelism is achieved
by parallelizing the k × k MAC operations needed for
the convolution operation in convolutional layers or the n MACs
needed for inner-product computation in fully connected
layers. Fig. 17 shows the parallel framework exploiting these
four levels of parallelism.
The authors have used 16-bit fixed-point format for rep-
resenting pixels in input feature maps and output feature
maps. However, they have used 32 bits for intermediate
results which get truncated to 16 bits. In addition, they
have used 8 bits for representing kernels and weights. They
have presented a systematic methodology for design space
exploration to find the optimal solution that maximizes
the throughput of an FPGA-based accelerator under given
FPGA constraints such as on-chip memory, computational
resources, external memory bandwidth, and clock frequency.
FIGURE 17. Parallel Framework Exploiting Four Levels of Parallelism [170].
The proposed technique has been evaluated by imple-
menting three CNN accelerators on the VC709 board for
LeNet, AlexNet, and VGG-S. It has achieved a throughput
of 424.7 GOPS, 445.6 GOPS, and 473.4 GOPS for LeNet,
AlexNet, and VGG-S accelerators, respectively. In addition,
the performance has been compared with MatConvNet tool
running the CNN models on Intel Core i7-4790K CPU (4.0
GHz) and NVIDIA GTX-770 GPU (1,536 CUDA cores, 2
GB GDDR5, 224.3 GB/s memory bandwidth). Compared
to the CPU implementations, the accelerators for LeNet,
AlexNet, and VGG-S achieved 14.84×, 6.96×, and 4.79×
speedups in performance, respectively, and 51.84×, 24.69×, and
16.46× improvements in power efficiency, respectively. Compared to the
GPU implementations, the accelerators achieved better per-
formance in the small-scale network LeNet (3.17×), com-
parable performance in the medium-scale network AlexNet
(0.96×), and worse performance in the large-scale network
VGG-S (0.56×). However, the accelerators achieved higher
power efficiency than the GPU implementations in all three
networks with 28.3× for LeNet, 8.7× for AlexNet, and
4.98× for VGG-S.
FP-DNN [171] is an end-to-end framework that auto-
matically generates optimized FPGA-based implementations
of deep neural networks (DNNs) using an RTL-HLS hy-
brid library. The FP-DNN compiler, programmed using C++ and
OpenCL, takes TensorFlow symbolic descriptions [172] of
DNNs, and then performs model inference through the use
of model mapper, software generator, and hardware gener-
ator modules. The model mapper extracts the topological
structure and layers configurations of DNN model from the
TensorFlow descriptions and generates an execution graph
for the target model. The execution graph shows layer-by-
layer operations and read/write data transactions.
FP-DNN compiler allocates off-chip DRAM data buffers
to store intermediate data, weights, and model parame-
ters and configurations. The model mapper maximizes the
storage resource reuse through minimizing the number of
required physical buffers. Specifically, it formulates the
data reuse problem as a graph coloring problem [173],
and then the left-edge algorithm is applied to generate
kernel configuration and kernel schedule. Subsequently, the
software generator uses the kernel schedule to generate a
host C++ program which initializes the model, manages
the data buffers, and schedules the kernel execution. On
the other hand, the hardware generator uses the kernel
configuration and the execution graph to generate the FPGA
hardware codes by instantiating the corresponding optimized
templates from an expandable RTL-HLS hybrid library.
Each template is comprised of Verilog-based computational
engine and OpenCL-based control logics engine.
The architecture of the proposed FPGA-based accelerator
consists of matrix multiplication and data arranger modules.
Matrix multiplication module is a hand-written Verilog code
that is designed and optimized based on the hardware
constraints of Altera Stratix-V GSMD5 FPGA. It applies
tiling and ping-pong double buffers techniques to improve
the throughput. On the other hand, data arranger is an
OpenCL-based module that is responsible for mapping the
computational part of a layer to matrix multiplication as well
as performing data communication with off-chip memory
and matrix multiplication module. Mapping DNNs compu-
tational operations to matrix multiplication has been widely
applied in prior studies [80], [132], [174]. FP-DNN maps
FC layer to matrix multiplication by batching input vectors
together. Before model deployment, FMs and weights are
rearranged in DRAM using the channel-major scheme to
optimize the communication between the accelerator and
off-chip DRAM. On the other hand, both floating-point
and fixed-point representations have been supported for
implementation, and they can be adjusted by the user.
The proposed RTL-HLS hybrid framework has been eval-
uated by accelerating VGG-19, LSTM-LM [175], ResNet-
152 DNNs on Stratix-V GSMD5 FPGA. Note that this is
the first work that implements ResNet-152 on FPGA. The
experimental results demonstrated that the speedups of FP-
DNN for 16-bit fixed-point implementations are about 1.9×
to 3.06× compared with a server that includes 2 processors,
each an 8-core Intel Xeon E5-2650v2 at 2.6 GHz.
In line with the current trends towards compressed neural
networks, with dramatically reduced weights and activations
bit-width using 1-bit or 2-bit quantization [176]–[180],
Umuroglu et al. [181] conducted a set of experiments to
estimate the trade-off between the network size and preci-
sion using the roofline model. They found that binarized
neural networks (BNNs) [180] require 2 to 11 times more
operations and parameters than an 8-bit fixed-point CNN
to achieve a comparable accuracy on MNIST [71] dataset.
However, the BNN is found to deliver 16× higher performance
than the fixed-point network.
Subsequently, the authors proposed a framework, referred
to as FINN [181], that maps a trained BNN onto FPGA.
FINN generates a synthesizable C++ network description
of a flexible heterogeneous streaming architecture. The
architecture consists of pipelined compute engines that
communicate via on-chip data streams. Each BNN layer has
been implemented using dedicated compute engines with
1-bit values for weights and FMs; +1 and -1 are used to
represent a set bit and unset bit, respectively.
The authors have optimized accumulation, batch normal-
ization (batchnorm), activation, and pooling operations of
BNNs. In particular, the accumulation of a binary dot-
product has been implemented as a counter of set bits
(popcount operation). The popcount-accumulate reduces the
number of required look-up tables (LUTs) and flip-flops
(FFs) by a half, compared to the implementation of signed-
accumulation. BNN batchnorm and activation operations
have been simplified and implemented together as unsigned
comparison with a threshold τk; +1 is produced when
the input value is greater than or equal to τk, and -1
otherwise. The value of τk is computed during run-
time. Such an implementation of batchnorm-activation op-
erations requires much smaller number of LUTs, without
the need for DSPs and FFs, compared to regular imple-
mentation of batchnorm-activation. Max-pooling, average-
pooling, and min-pooling have been effectively implemented
with Boolean OR-operator, Boolean majority function, and
Boolean AND-operator, respectively.
The accelerator architecture is composed of building
blocks from the FINN hardware library. The matrix-vector-
threshold unit (MVTU) is the core computational building
block as matrix-vector operations followed by thresholding
form the majority of BNN operations. The design of MVTU
consists of an input buffer, an array of P parallel PEs, each
with S SIMD lanes, and an output buffer. BNN weight
matrix is distributed across the PEs and stored locally in on-
chip memory. Subsequently, the input images are streamed
through the MVTU and multiplied with the weight matrix.
Particularly, the PE computes the dot-product between an
input vector and a row of the weight matrix, each S bits
wide, using an XNOR gate, as shown in Fig. 18. Then, it
compares the number of set bits to a threshold and produces
a 1-bit output value as previously discussed.
FIGURE 18. The Architecture of MVTU PE [181].
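The XNOR-and-popcount arithmetic of an MVTU PE can be reproduced in a few lines of Python using integers as bit-vectors, as sketched below; this mirrors the description above rather than FINN's HLS implementation.

```python
# Minimal sketch of the XNOR/popcount dot-product with thresholding performed
# by an MVTU PE, using Python integers as bit-vectors.

def binary_dot(a_bits, w_bits, width):
    """Dot product of two {-1,+1} vectors encoded as bits (1 -> +1, 0 -> -1)."""
    xnor = ~(a_bits ^ w_bits) & ((1 << width) - 1)   # 1 where the bits agree
    popcount = bin(xnor).count("1")
    return 2 * popcount - width                      # equivalent +1/-1 arithmetic

def mvtu_pe(a_bits, w_bits, width, threshold):
    """Binarized neuron: +1 if enough bits agree, -1 otherwise."""
    set_bits = bin(~(a_bits ^ w_bits) & ((1 << width) - 1)).count("1")
    return 1 if set_bits >= threshold else -1

a, w = 0b10110011, 0b10010111          # 8-bit input and weight rows
print(binary_dot(a, w, 8))             # equals the +1/-1 dot product (here: 4)
print(mvtu_pe(a, w, 8, threshold=5))   # +1, since 6 bits agree
```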
Umuroglu et al. [181] implemented the CONV layer using
a sliding window unit (SWU) and an MVTU, where convo-
lutional operation is transformed to matrix-multiplication of
image matrix and filter matrix. SWU generates the image
matrix to MVTU by moving the sliding window over the
input FMs, while the filter matrix is generated by packing
the weights from the convolution filters as shown in Fig. 19.
In order to meet the user throughput requirement, MVTU is
folded (time-multiplexed) by controlling the values of P and
S. Folding of the MVM decides the partitioning of the matrix across
PEs. Every row of the matrix tile is mapped to a distinct PE and
every column of the PE buffer is mapped to a distinct SIMD
lane. In this way, the required number of cycles to compute
one MVM (total fold) is obtained as (X × Y)/(P × S), where
X and Y are the dimensions of the matrix. The folding
factors of BNN layers have been determined such that every
BNN layer takes nearly the same number of cycles.
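The total fold expression above reduces to a one-line computation, sketched here with toy layer and folding sizes.

```python
# Sketch of the folding arithmetic above: the cycles needed for one
# matrix-vector multiply when an (X x Y) matrix is time-multiplexed over
# P PEs with S SIMD lanes each.

def total_fold(X, Y, P, S):
    assert X % P == 0 and Y % S == 0, "matrix must tile evenly over PEs/lanes"
    return (X * Y) // (P * S)

# Example: a 256 x 1024 binarized FC layer on 16 PEs with 32 SIMD lanes each.
print(total_fold(256, 1024, P=16, S=32))   # 512 cycles per MVM
```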
To evaluate FINN, the authors implemented CNV topol-
ogy on Xilinx Zynq-7000 board at 200 MHz to acceler-
ate BNNs inference on CIFAR-10 [182]. CNV contains
three repetitions of two 3 × 3 CONV layers and 2 × 2 max-
pooling layers. Its topology is inspired by VGG-16 and
BinaryNet [180]. Although CNV accepts images with 24
bits/pixel as an input and produces a 10-element vector of
16-bit values, 2 bits are used for representing intermediate
results while 1 bit is used for representing CONV and
FC weights. Experimental results demonstrated that the
proposed design provides high performance (2.5 TOPS)
while incurring low energy consumption (11.7 Watts). FINN
outperforms the design by Ovtcharov et al. [152] by over
13.8× in throughput.
FIGURE 19. Transforming CONV to Matrix-Multiplication [181], where ifm and
ofm are the input and output feature maps, respectively.
FIGURE 20. CONV Acceleration Architecture and Dataflow [83], where Pix = Pox = 3,
Piy = Poy = 3, and Pof = 3.
In [83], loop optimization techniques [55], [79] have been
employed in FPGA to design a customized CNN accelerator
through speeding up CONV layer operations. Firstly, an in-
depth analysis is provided to numerically characterize loop
unrolling, loop tiling, and loop interchange optimization
techniques. In doing so, 8 CONV dimension parameters
(N*), 8 loop unrolling design variables (P*), and 8 loop
tiling design variables (T*) have been used with the con-
straint that, for a specific loop level, 1 ≤ P* ≤ T* ≤ N*.
Note that unrolling Loop-1 and Loop-3 requires Pkx × Pky
and Pif multipliers, respectively, an adder tree with fan-in
of Pkx × Pky and Pif, respectively, and an accumulator.
On the other hand, unrolling Loop-2 requires Pix × Piy par-
allel MAC units to reuse the same weight Pix × Piy
times, while the input feature pixel can be reused Pof
times when unrolling Loop-4 with the use of Pof parallel
MAC units. Thus, Pkx × Pky × Pif × Pix × Piy × Pof
multipliers are required. Please refer to Fig. 2 for more
details on CONV loops levels and their parameters. In
loop tile optimization, the authors have numerically set
the lower bound on the required size of the input pixel
buffer, the weight buffer, and output pixel buffer that ensures
reading each input feature pixel and weight from the off-
chip memory only once. On the other hand, loop interchange
technique has a great impact on the times of memory access
as well as the number of partial sums since it determines
the order of computing CONV loops.
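The multiplier count implied by these unrolling variables, and a simplified buffer-size estimate, can be checked with the short Python sketch below; the buffer expression is an assumption for illustration and not the exact lower bound derived in [83].

```python
# Sketch of the resource bookkeeping implied by the unrolling analysis above:
# the multiplier count follows directly from the unrolling variables, while
# the buffer size shown is a simplified estimate (an assumption, not the
# exact expression derived in [83]).

def multipliers(Pkx, Pky, Pif, Pix, Piy, Pof):
    return Pkx * Pky * Pif * Pix * Piy * Pof

def weight_buffer_words(Nkx, Nky, Nif, Tof):
    # enough weights kept on-chip for the kernels of Tof output FMs
    return Nkx * Nky * Nif * Tof

# VGG-16-style uniform unrolling from the text: Pix = Pox = Piy = Poy = 14,
# Pof = 16, with Loop-1 and Loop-3 computed serially (Pkx = Pky = Pif = 1).
print(multipliers(1, 1, 1, 14, 14, 16))        # 3,136 MAC units
print(weight_buffer_words(3, 3, 64, Tof=16))   # toy weight-buffer estimate
```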
Secondly, the authors have provided a quantitative anal-
ysis of the design variables to minimize each of computing
latency, partial sum storage, on-chip buffer access, and off-
chip DRAM access. Subsequently, MATLAB scripts are
used to randomly sample a subset of the solution space
to find the optimal design configurations. This is due to
the large solution space, with more than 7.2 × 10^13 possible
configurations for the loop tiling variables of the output FM
width (Pox) and height (Poy) alone. According to the random
sampling results for VGG-16 CNN model on Arria 10 GX
1150 FPGA, uniform unrolling factors for CONV layers are
used with Pix = Pox = Piy = Poy = 14 and Pof = 16
for Loop-2 and Loop-4, respectively, to reuse input feature
pixels and weights. On the other hand, Loop-1 and Loop-3
are serially computed to prevent the movement of the partial
sums between the MAC units and consume them ASAP
since both Loop-1 and Loop-3 need to be finished in order
to obtain one final output pixel. More importantly, the order
of loops computation has been found to be as follows. Loop-
1 is computed first, then comes Loop-3, and finally Loop-2
and Loop-4 are computed in any order.
Finally, a customized convolution accelerator module
with efficient dataflow has been designed based on the
previous results and used for all VGG-16 CONV layers.
The CONV accelerator consists of 3,136 (Pix × Piy × Pof)
independent MAC units and 14 (Pof) input pixel buffers.
Fig. 20 shows an example of the designed CONV accel-
erator when Pix, Piy, and Pof are all equal to 3. The
input pixels are shifted after fetching them out of the input
pixel buffers. Subsequently, they can be reused among the
input register arrays. Then, the input pixels are fed into the
associated MAC units. The figure also shows that the input
pixels and weights are shared by Pof and Pix × Piy MAC
units, respectively.
The overall CNN acceleration system mainly consists
of two SDRAM banks that hold the input feature pixels
and weights, two modular Scatter-Gather DMA (mSGDMA)
engines to facilitate the simultaneous read/write from/to
the SDRAMs, and a controller to govern the sequential
computation of layers as well as the iterations of the four
CONV loops. On the other hand, dual weight buffers have
been used to increase the throughput of FC layer through
overlapping the inner-product computation with off-chip
communication. The acceleration system has been written as
parametrized Verilog scripts. The experimental results show
that the proposed accelerator has a throughput of 645.25
GOPS, which is more than a 3.2× enhancement compared to
prior VGG-16 FPGA-based implementations [80], [98].
Venieris and Bouganis [183] further extended fpga-
ConvNet framework [165] to allow for optimizing either
throughput or latency depending on the size of the workload.
For large workloads, weights reloading transformation has
been introduced to efficiently design latency-critical CNNs
on FPGA. In contrast with fpgaConvNet, where a distinct
architecture is designed for each subgraph, the weights
reloading transformation allows for generating a single
flexible architecture, named as the reference architecture and
derived using pattern matching, to execute the workloads
of all subgraphs by transitioning to different modes. Upon
the execution of a new subgraph, the subgraph’s weights
are read into the on-chip memory and the multiplexers are
configured to form the appropriate datapath. Fig. 21 demon-
strates how weights reloading is applied. The authors have
mentioned that the required time for transferring subgraph’s
weights is much smaller than the average time for full
FPGA reconfiguration, 272.7× less when loading 4.5 MB
of weights for a VGG-16 layer on Zynq XC7Z045.
In the situation discussed above, due to limited on-chip
memory capacity, it might not be possible to load all weights
required for a single CONV layer. To handle this, the
authors introduced an input FMs folding factor (fin) with
each CONV layer. A CONV layer (CONVi) is partitioned
into fin,i subgraphs in which each subgraph executes a
fraction of CONVi to produce a fraction of the output
FMs. The proposed latency-driven methodology has been
evaluated by implementing AlexNet and VGG-16 with 16-
bit fixed-point precision for both on Zynq XC7Z045 at 125
MHz. The experimental results showed 1.49× and 0.65× the
CONV throughput of DeepBurning [155] and of the embedded
FPGA accelerator in [98] for the AlexNet and VGG-16
implementations, respectively.
FIGURE 21. Weights Reloading [183].
FIGURE 22. Overall DLA Architecture [188].
Lavin and Gray [184] demonstrated that CNN algorithms
with small filters can be efficiently derived using Winograd
algorithm [185] and fast Fourier transform (FFT) algo-
rithm [186] due to their advantages in improving resource
efficiency and reducing arithmetic complexity. Winograd
computation involves a mix of element-wise (Eltwise) and
general-purpose matrix multiplication, where some of the
matrices need to be transformed. In particular, Winograd
algorithm exploits the structural similarity among n × n tiled
input FM pixels given a filter of size r × r to generate m × m
tiled pixels of the output FM, where m represents the stride
between Winograd tiles (m = n − r + 1), while minimizing
the number of required CONV multiplications from m^2 r^2
for the conventional CONV algorithm to n^2. In another work,
Zhang et al. [187] implemented FFT algorithm for CNN on
FPGA platform. However, their proposed implementation
shows little reduction of computation complexity with small
filters such as 3 × 3.
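To make the arithmetic saving concrete, the following Python sketch implements the 1D Winograd F(2, 3) transform and checks it against direct convolution; the 2D form used by the accelerators below nests the same idea.

```python
# A 1D Winograd F(2,3) sketch to make the arithmetic saving above concrete:
# 2 outputs of a 3-tap convolution are produced with 4 multiplications
# instead of 2 x 3 = 6 (the filter-side transforms are precomputed).

def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 convolution outputs."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct(d, g):
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]

d, g = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.25]
print(winograd_f23(d, g), direct(d, g))   # both give the same two outputs
```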
Aydonat et al. [188] presented a deep learning architec-
ture (DLA) based on OpenCL. Their proposed architecture
reduces the external memory bandwidth requirements by an
order-of-magnitude for both the convolutional and fully con-
nected layers. This is achieved by caching all intermediate
feature maps on-chip in stream buffers. For fully connected
layers, image batching is used where a batch of images
are processed together through the fully connected layers.
The approach utilizes the Winograd transformation to reduce
the multiply-accumulate operations, which could reduce the
number of needed operations by about 50%. In addition,
it uses half-precision (FP16) floating-point operations with
shared exponents, which significantly reduces the needed
computational resources.
The overall DLA architecture is shown in Fig. 22.
Each PE consists of dot-product units, accumulators, and
caches, for performing dot-products for convolution and
fully connected layers. Caches are used for storing filter
weights. To avoid idle computation cycles, double-buffering
is used such that filter weights for the next convolution
layer are prefetched onto the caches while filter weights are
loaded from the caches for a particular convolution layer.
Stream buffers store feature data and stream it to PEs. Each
stream buffer is double-buffered similar to filter caches.
Images are loaded from the DDR and are stored in stream
buffers before the first convolution layer starts execution.
During a convolution layer execution, while feature data
for a convolution layer is being streamed into the PEs,
the outputs of convolutions are simultaneously stored in
the buffers. The StreamBuffer unit applies the Winograd
transformations to features, and streams the transformed
features to the first PE which are forwarded through all
the PEs via the daisy-chained input connections between
them. The ReLU unit receives the outputs of the PEs via
daisy-chained output connections. Then, the normalization
unit receives the outputs of the ReLU unit and applies
the normalization formula across the feature maps. The
pooling unit receives the outputs of the normalization unit
and computes the maximum value in a window. The output
of the pooling unit is stored back in the stream buffer for
further processing, if more convolution layers are to follow.
Otherwise, the outputs of the pooling unit are stored in
external memory. For the fully connected layers, features
data are stored on PEs caches while filter weights are stored
in stream buffers. For the first fully connected layer, features
data are read back from external memory and loaded onto
the PE caches. The ReLU output is sent directly to DDR,
without applying normalization or pooling. The sequencer
generates the control signals to control the operation of
the various blocks in DLA according to the topology of
the executed CNN. Executing a different CNN requires just
changing the sequencer configuration.
The DLA has been evaluated by implementing AlexNet CNN on Intel's Arria 10 dev kit, which contains an A10-1150 device (20 nm), using a batch size of 96 for the fully connected layers. It achieved a performance of 1020 images/s. In addition, it achieved 8.4× more GFLOPS than the latest Ultrascale (KU, 20 nm) result reported in [162], which uses a batch size of 32 for the fully connected layers, and 19× more GFLOPS than the latest Stratix V result reported in [80]. Furthermore, it achieved an energy efficiency of 23 images/s/W, which is similar to that achieved by the best publicly known implementation of AlexNet on an NVIDIA Titan X GPU.
Unlike the DLA architecture [188], where a 1D Winograd algorithm was employed to reduce arithmetic complexity, Lu et al. [189] implemented a novel FPGA architecture with a two-dimensional Winograd algorithm [185] to accelerate the convolutional computation of CNNs. The overall architecture consists of a line buffer structure and a Winograd PE engine, as shown in Fig. 23. In particular, n + m input lines and m output lines of on-chip buffers are used to effectively reuse FM data among different tiles. While the Winograd PE engine reads the first n input lines to perform the Winograd computation, the next m input lines load pixels from off-chip memory using FIFOs to overlap the data transfer and computation.
FIGURE 23. Winograd-based CNN Accelerator [189], where m is the size of the input FM tile, n is the size of the output FM tile, M is the number of input channels, N is the number of output channels, W is the maximal width of all input FMs, and C is the width of the output FMs.
Thereafter, the input lines are rotated in a circular fashion to make the next n input lines ready. On the other hand, the Winograd PE engine is composed of 4 pipelined stages that perform transformation, element-wise matrix multiplication, an additional (inverse) transformation, and accumulation of output tiles, respectively.
A vector of PEs is employed to achieve parallelism through unrolling Loop-4 (Pof) and Loop-3 (Pif), similar to that in [55]. To implement the FC layer, the proposed accelerator uses the input line buffers to hold FC weights while input neurons are stored in the filter buffers. Then, the Winograd PE engine is reused to implement the FC operation but with the transformation stages bypassed. Moreover, a batch (Nbatch) of input FMs is assembled and processed together in order to improve the memory bandwidth utilization. An analytical model has been proposed for a fast design space exploration of the optimal design parameters (n, Pof, Pif, Nbatch) constrained by the FPGA configuration, with a 16-bit fixed-point representation for both FM data and filters.
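Such an analytical model typically drives a simple exhaustive sweep over the candidate parameters. The sketch below is our own illustration of this kind of search, not the authors' model: the cost and resource estimators are placeholders, and the search keeps the feasible (n, Pof, Pif, Nbatch) tuple with the lowest estimated latency.

#include <cstdio>
#include <initializer_list>
#include <limits>

// Toy stand-ins for the analytical model in [189]; for illustration only.
int    dsps_used(int pof, int pif)   { return pof * pif * 4; }
int    brams_used(int n, int nbatch) { return n * 8 + nbatch / 4; }
double estimated_latency(int n, int pof, int pif, int nbatch) {
    // More parallelism and larger tiles shorten the CONV part; batching
    // amortizes the FC weight traffic.
    return 1.0e9 / (double(n) * pof * pif) + 1.0e6 / nbatch;
}

struct Design { int n, pof, pif, nbatch; double latency; };

// Exhaustively sweep (n, Pof, Pif, Nbatch) and keep the feasible design
// point with the lowest estimated latency.
Design explore(int dsp_budget, int bram_budget) {
    Design best{0, 0, 0, 0, std::numeric_limits<double>::max()};
    for (int n : {4, 6, 7, 8})
        for (int pof = 2; pof <= 32; pof *= 2)
            for (int pif = 2; pif <= 32; pif *= 2)
                for (int nbatch : {16, 32, 64, 128}) {
                    if (dsps_used(pof, pif) > dsp_budget)    continue;
                    if (brams_used(n, nbatch) > bram_budget) continue;
                    double lat = estimated_latency(n, pof, pif, nbatch);
                    if (lat < best.latency) best = {n, pof, pif, nbatch, lat};
                }
    return best;
}

int main() {
    Design d = explore(2520, 912);   // e.g., ZCU102-like DSP/BRAM budgets
    std::printf("n=%d Pof=%d Pif=%d Nbatch=%d latency=%.3e\n",
                d.n, d.pof, d.pif, d.nbatch, d.latency);
    return 0;
}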
The proposed accelerator has been evaluated by imple-
menting AlexNet and VGG-16 on Xilinx ZCU102 FPGA.
AlexNet CONV layers have 3 different filter sizes. The conventional CONV algorithm has been applied to the first CONV layer, as it has a filter of size 11×11, while a uniform filter of size 3×3 for the Winograd algorithm has been used to implement the rest of the layers.
the rest of the layers. The design parameters are found to
be equal to (6, 4, 8, 128) and (6, 4, 16, 128) for AlexNet
and VGG-16, respectively. The experimental results demon-
strated that the proposed Winograd-based CNN accelerator
has an average performance of 854.6 GOPS and 2940.7
GOPS for AlexNet and VGG-16, respectively, with power
consumption of 23.6 Watts for both. The proposed accelera-
tor has also been evaluated on Xilinx ZC706 platform where
the design parameters are found to be as (6, 2, 8, 32) and
(7, 4, 4, 32) for AlexNet and VGG-16, respectively. The ex-
perimental results demonstrated that Winograd-based CNN
accelerator has an average performance of 201.4 GOPS and
679.6 GOPS for AlexNet and VGG-16, respectively, with
power consumption of 23.6 Watts for both. Compared to
the implementation of VGG-16 on NVIDIA Titan X with
the latest cuDNN 5.1, the Titan X gives better performance than the Xilinx ZC706, but the implementation on the Xilinx ZC706 achieves 1.73× higher energy efficiency.
FIGURE 24. Compute Unit with a 2D BRAM-to-PE Interconnection [190].
Zhang et al. [190] presented an OpenCL-based architec-
ture for accelerating CNNs on FPGA. They also proposed
an analytical performance model to identify the bottleneck
in OpenCL-based acceleration of the VGG-19 CNN model on modern FPGA platforms such as the Altera Arria 10 GX 1150. Based on roofline model analysis, it is shown that the
bandwidth requirement of VGG-19 workload is higher than
what is provided by the FPGA board. Thus, they identified
on-chip memory bandwidth as the key performance bottle-
neck. In addition, they observed that exploited data-level
parallelism in the existing Altera OpenCL library [191] leads
to wasteful replication of on-chip memory (BRAM). This is
due to connecting each PE with a dedicated BRAM port.
Therefore, a Verilog-based accelerator kernel has been designed and wrapped into an OpenCL IP in order to opti-
mally balance on-chip memory bandwidth with workload
computational throughput and off-chip memory accesses.
In particular, the proposed kernel consists of a compute
subsystem, a local memory subsystem, and a 2D dispatcher.
The compute subsystem is organized hierarchically into
compute units (CUs) and PEs. At PE level, the authors have
designed a 2D multi-cast interconnection between BRAMs (32-bit data width) and PEs to improve the efficiency of on-chip BRAM usage by sharing the data of one BRAM port with several PEs, as shown in Fig. 24. The CU has been designed as a 2D PE array of size 16 × 16 to match the computational bandwidth with the maximum streaming bandwidth (512-bit data bus) provided by off-chip memory.
FIGURE 25. 2D Dispatcher [190], where X0 is the column size of the kernel buffer as well as the row size of the input feature buffer, and X1 is the row size of the kernel buffer.
FIGURE 26. Line Buffer Design [190].
The 2D dispatcher divides the work items into work groups
each of size (X0,X1) as shown in Fig. 25. Thereafter, it
adaptively schedules the work items within each work group
to the CUs starting with the lowest dimension to balance
the memory bandwidth with capacity. The 2D dispatcher is
also responsible for host/device memory data transfers. In
addition, the authors have limited the maximum fan-out for
registers to 100 in order to guarantee a higher frequency.
The CONV layer has been implemented as a matrix
multiplication by flattening and rearranging the data using
line buffer [154], as shown in Fig. 26, in a similar fashion
to that in [80]. The line buffer converts continuous address
stream from external memory into a stream conducive for
CONV operation to substantially reduce the bandwidth
requirement of off-chip memory. To implement FC layer,
the proposed accelerator uses one column of PEs in the CU.
The proposed implementation has achieved 866 GOPS and
1790 GOPS with the use of 32-bit floating-point and 16-
bit fixed-point, respectively, under 370 MHz and 385 MHz
working frequencies, respectively.
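Conceptually, such a line buffer turns a streaming convolution into a matrix multiplication by presenting, at each output position, a flattened window of the input. The following software analogue (ours, not the Verilog kernel of [190]) shows the idea for a single channel with stride 1.

#include <cstdio>
#include <vector>

// Software analogue of a line buffer feeding a matrix-multiply CONV:
// each row of 'patches' is a flattened k x k window of the input,
// so the convolution reduces to patches * flattened_kernel.
std::vector<std::vector<float>> im2col(const std::vector<float>& in,
                                       int H, int W, int k) {
    std::vector<std::vector<float>> patches;
    for (int y = 0; y + k <= H; ++y)
        for (int x = 0; x + k <= W; ++x) {
            std::vector<float> p;
            for (int dy = 0; dy < k; ++dy)
                for (int dx = 0; dx < k; ++dx)
                    p.push_back(in[(y + dy) * W + (x + dx)]);
            patches.push_back(p);
        }
    return patches;
}

int main() {
    int H = 4, W = 4, k = 3;
    std::vector<float> img(H * W, 1.0f);
    std::vector<float> kernel(k * k, 0.5f);        // flattened 3x3 filter
    for (const auto& row : im2col(img, H, W, k)) {
        float acc = 0.0f;
        for (int i = 0; i < k * k; ++i) acc += row[i] * kernel[i];
        std::printf("%.1f ", acc);                 // each value is one output pixel
    }
    std::printf("\n");
    return 0;
}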
All previously discussed FPGA-based CNN accelerators,
except the ones discussed in [165], [170], have employed
a single CLP to maximize the aggregate throughput of
performed consecutive convolutional operations. However,
Shen et al. [192] noted that using a single globally-optimized
CLP design for the computation of CONV layers of radi-
cally different configurations and dimensions leads to sub-
optimal performance and insufficient utilization of FPGA
resources. Fig. 27a demonstrates the use of a single CLP to
iteratively process the L1, L2, and L3 CONV layers, where the dimensions of the hardware (CLP, CLP1, and CLP2) and the layers are represented by the size and shape of the boxes. It is clear that computing L1 and portions of L3 leaves
FPGA resources unutilized as their dimensions are smaller
than the dimension of the used CLP. Note that processing a CONV layer with a dimension bigger than the dimension of the CLP, such as L3, requires the repeated use of the CLP to process different portions of the layer.
FIGURE 27. Operation of CONV Layer Processors (CLPs) on a CNN with three CONV Layers [192].
The authors have also followed the methodology in [55] to derive an optimal single CLP, through finding the optimal unrolling factor ⟨Tm, Tn⟩, for implementing SqueezeNet [193] and AlexNet on a Virtex7 690T FPGA with single-precision floating-point and 16-bit fixed-point arithmetic units, respectively. They found that one quarter of the DSP slices of SqueezeNet's CLP remain unused. Even worse utilization has been observed for AlexNet. The optimal single CLP has not utilized, on average, more than one quarter of the arithmetic unit resources.
On the other hand, they also noted that using one CLP
for each stage of CONV layer in a fashion similar to
that in [194] is not efficient due to three reasons. First, it
reduces the on-chip BRAM buffer size of each CLP which
minimizes overall data locality. Second, such one-to-one
mapping of CONV layers and CLPs requires orchestrating
many off-chip memory accesses which incurs latency and
bandwidth overheads. Third, the overall control overhead
scales with the number of CLPs which leaves insufficient
resources for the computation of CNN.
To address the above inefficiencies, Shen et al. [192]
proposed a multi-CLP accelerator system for CNNs where
the available FPGA hardware resources are partitioned
across multiple smaller CLPs. Each CLP is tailored with a
dimension that closely matches the dimensions of a subset of
CONV layers. Thereafter, these specialized CLPs are used
to concurrently operate on a batch of images to achieve a
higher overall throughput, as shown in Fig. 27b, where the
same hardware in Fig. 27a is partitioned into two parallel
CLPs: CLP1 and CLP2.
Shen et al. [192] developed an optimization search algo-
rithm that uses dynamic programming to find optimal de-
signs. For given configurations of CNN model (i.e., CONV
layers descriptions) and resource constraints of the targeted
FPGA platform (i.e., number of DSP slices, BRAM-18Kb
units, and off-chip memory bandwidth), it derives the opti-
mal number of CLPs (along with their ⟨Tm, Tn⟩ dimensions)
as well as the optimal mapping between CONV layers and
CLPs that maximize the performance. The assignment of
CNN layers to CLPs is static, where each CNN layer is
mapped and bounded to a particular CLP. Subsequently,
CNN layers are pipelined to their CLP, as shown in Fig. 27b,
where L1 and L3 are pipelined to CLP1 while L2 is repeatedly processed on CLP2 with very little idle hardware, which improves the performance compared to the single CLP
approach. Moreover, the optimization algorithm also finds
the optimal partition of on-chip BRAM resources of each
CLP that minimizes the overall off-chip memory accesses.
Note that the optimal dimension of each CLP is found based
on the work in [55].
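As a rough software illustration of the partitioning idea, and not the authors' dynamic-programming formulation, the sketch below splits a set of CONV layers between two CLPs so that the slower CLP finishes as early as possible, given per-layer cycle estimates (the cycle counts are hypothetical).

#include <cstdint>
#include <cstdio>
#include <vector>

// Toy illustration of splitting CONV layers across two CLPs (not the
// dynamic-programming search of [192]): enumerate all assignments and
// keep the one whose slower CLP finishes earliest (minimum makespan).
struct Split { uint32_t mask; long long makespan; };

Split best_two_clp_split(const std::vector<long long>& layer_cycles) {
    const size_t L = layer_cycles.size();
    Split best{0, -1};
    for (uint32_t mask = 0; mask < (1u << L); ++mask) {
        long long clp1 = 0, clp2 = 0;
        for (size_t i = 0; i < L; ++i)
            (mask & (1u << i) ? clp1 : clp2) += layer_cycles[i];
        long long makespan = clp1 > clp2 ? clp1 : clp2;
        if (best.makespan < 0 || makespan < best.makespan)
            best = {mask, makespan};
    }
    return best;
}

int main() {
    // Hypothetical per-layer cycle counts for a 5-layer CNN.
    std::vector<long long> cycles = {120, 80, 300, 150, 60};
    Split s = best_two_clp_split(cycles);
    std::printf("assignment mask=0x%x, makespan=%lld cycles\n", s.mask, s.makespan);
    return 0;
}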
Subsequently, C++ (HLS) templates are parameterized
to design CLPs and to form a complete implementation
of CNN. A standard AXI crossbar is used to intercon-
nect the independent CLPs. The ping-pong double-buffering
technique is also used for input FMs, output FMs, and
weights to allow for transferring data while computation
is in progress. The experimental results of implementing
AlexNet with a single precision floating-point using multi-
CLP accelerator on Virtex7 485T and 690T FPGAs at 100
MHz demonstrate 1.31× and 1.54× higher throughput than
the state-of-the-art single CLP design in [55], respectively.
For the more recent SqueezeNet network, the proposed
multi-CLP accelerator results in speedups of 1.9× and 2.3×
on Virtex7 485T and 690T FPGAs at 170 MHz with 16-bit
fixed-point, respectively.
Wei et al. [195] presented a systolic architecture for
automatically implementing a given CNN on FPGA based
on OpenCL description, maximizing clock frequency and
resource utilization. The proposed systolic architecture is
shown in Fig. 28. Each PE shifts the data of the weights
(W) and inputs (IN) horizontally and vertically to the
neighboring PEs in each cycle. The 2D structure of PEs is
designed to match the FPGA 2D layout structure to reduce
routing complexity and achieve timing constraints.
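A cycle-level software model (ours, simplified) conveys the dataflow: in a generic output-stationary systolic array, PE(i, j) accumulates C[i][j], and the operand pair with index k reaches it at cycle i + j + k because of the horizontal and vertical shifts.

#include <cstdio>
#include <vector>

// Simplified cycle-level model of an output-stationary systolic array:
// PE(i,j) accumulates C[i][j]; the operand pair for index k reaches
// PE(i,j) at cycle t = i + j + k due to the horizontal/vertical shifts.
std::vector<std::vector<int>> systolic_matmul(const std::vector<std::vector<int>>& A,
                                              const std::vector<std::vector<int>>& B) {
    int M = A.size(), K = A[0].size(), N = B[0].size();
    std::vector<std::vector<int>> C(M, std::vector<int>(N, 0));
    for (int t = 0; t < M + N + K - 2; ++t)        // total pipeline cycles
        for (int i = 0; i < M; ++i)
            for (int j = 0; j < N; ++j) {
                int k = t - i - j;                 // operand index arriving this cycle
                if (k >= 0 && k < K)
                    C[i][j] += A[i][k] * B[k][j];
            }
    return C;
}

int main() {
    auto C = systolic_matmul({{1, 2}, {3, 4}}, {{5, 6}, {7, 8}});
    std::printf("%d %d\n%d %d\n", C[0][0], C[0][1], C[1][0], C[1][1]);  // 19 22 / 43 50
    return 0;
}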
FIGURE 28. Systolic Array Architecture for CNN [195].
The technique first finds a feasible mapping for the given
CNN to the systolic array to guarantee that proper data
is available at specific locations in the PE array at every
cycle. Then, the size of PE array (dimensions) is determined
which has an impact on the required number of DSPs,
the clock frequency, and the DSPs efficiency. Finally, the
data reuse strategy is determined by choosing proper tiling
sizes. The proposed technique has been evaluated using
AlexNet and VGG16 on Intel’s Arria 10 GT 1150 board.
The technique has explored the use of both 32-bit floating-
point and fixed-point using 8-bits for weights and 16-bits for
data. Evaluation results show that, for the VGG16 CNN, the
technique achieves up to 1,171 GOPS on Intel's Arria 10 device with a clock frequency of 231.85 MHz and an (8-16)-bit fixed-point representation.
In another recent research work, Ma et al. [196] general-
ized the previously proposed accelerator in [83] to efficiently
accelerate ResNet-50 and ResNet-152 on Arria 10 GX 1150
FPGA. In doing so, they designed flexible and scalable
CONV, ReLU, BatchNorm, scale, pooling, FC, and Eltwise
primitives. In addition, local control logic and registers have
been used with each primitive to control their computation
order and to hold their configurations, respectively. By doing
so, ResNets primitives can be efficiently reused for different
parameters of each layer.
For the ResNets scalable CONV primitive, there are four (kernel, stride) size configurations: (3×3, 1), (1×1, 1), (1×1, 2), and (7×7, 2). Therefore, a similar architecture and dataflow to that shown in Fig. 20 has been used for CONV but with the use of two sets of register arrays; with shifting between the registers (which is shown in Fig. 20, Set-1), and without shifting between the registers (Set-2). The CONV primitive with a 3×3 kernel and a stride of 1 uses the Set-1 register array, while Set-2 is used with the (1×1, 1), (1×1, 2), and (7×7, 2) configurations. In the CONV primitive with Set-2, the input pixels are fed from the input pixel buffers into the corresponding registers without shifting, and then to the MAC units. The skipped input pixels in the (1×1, 2) configuration are not stored to the input pixel buffers. On the other hand, the (7×7, 2) configuration of the kernel and stride sizes is handled in the same way as the (1×1, 1) case, but while transferring repeated input pixels into the input pixel buffers and rearranging their storage patterns. The CONV primitive also takes care of zero-paddings for the different (kernel, stride) size configurations.
The loop unrolling and tiling techniques in [83] have
also been employed to accelerate CONV primitive with
a uniform mapping of PEs to all ResNets CONV layers.
However, designing efficient CNN modules alone is not enough,
as the memory accesses and data movements between these
modules must also be minimized. Therefore, the authors
have designed a layer-by-layer computation flow. The global
control logic is responsible for governing the sequential op-
erations of primitives and their dataflow through predefined
and preloaded layered-based execution flowchart, as shown
in Fig. 29. In addition, it has been modeled to reconfigure
ResNet primitives according to the parameters of each layer
during runtime. For instance, it maps a particular number of
PEs to CONV layer based on loop unrolling parameters as
well as it controls the selection of register array type (Set-1
or Set-2) based on CONV (kernel, stride) parameters.
On the other hand, a custom DMA manager has been
designed to control the operations of DMA. Note that the
DMA is responsible for transferring the input FM pixels,
weights, and output FM pixels between off-chip memory
and on-chip buffers. Unlike ALAMO architecture [168]
where the output pixels are only stored in on-chip buffers,
this work as well as the work discussed in [83] store the
output pixels in off-chip memory with the use of loop
tiling technique in order to have a flexible architecture
that can process large-scale CNNs. The dual weight buffers
technique has not been used in this work due to the current
trend in CNNs where either the size of FC weights has
been significantly reduced (2 M in ResNet compared with
123.6 M in VGG) or the FC layers are completely removed
such as in NiN. The experimental results demonstrated that
the achieved throughput for ResNet-50 and ResNet-152 are
285.1 GOPS and 315.5 GOPS, respectively. Finally, the
authors mentioned that higher throughput can be achieved using batch computing [194].
FIGURE 29. Execution Flowchart of ResNets Layers [196].
FIGURE 30. DLAU Accelerator Architecture [197].
Wang et al. [197] proposed a scalable design on FPGA for
accelerating deep learning algorithms. In order to provide
a scalable architecture and support various deep learning
applications, the proposed architecture utilizes the tiling
technique in which the large-scale input data is partitioned
into small subsets. The size of the tile is configured to
leverage the trade-off between the hardware cost and the
speedup. Moreover, the authors explored hot spots profil-
ing to determine the computational parts that need to be
accelerated to improve the performance. The experimental
results illustrated that matrix multiplication and activation
functions are the key operations in deep learning algorithms
as they consume about 98.6% and 1.1% of the overall
execution time, respectively. Thus, the proposed accelerator
is responsible for speeding up both matrix multiplication
and activation function computations.
The main components of the proposed architecture are
the embedded processor, the DDR3 memory controller,
the DMA module, and the deep learning acceleration unit
(DLAU), as shown in Fig. 30. The embedded processor
utilizes the JTAG-UART to communicate with the accel-
eration unit [198]. The DLAU unit accesses the DDR3
memory to read the tiled input data and to write the
results back through the DMA module during the execu-
tion. The DLAU utilizes three fully pipelined processing
units to improve the throughput, while minimizing the
memory transfer operations. These units are tiled matrix
multiplication unit (TMMU), partial sum accumulation unit
(PSAU), and activation function acceleration unit (AFAU).
TMMU is responsible for multiplication and generating
the partial sums. To optimize the performance, TMMU
is structured as a pipelined binary adder tree. Moreover,
it uses two sets of registers alternately to overlap the
computation with the communication, one group is used
for the computation, while in parallel, the other group is
loaded with the next node data every clock cycle. On the
other hand, PSAU is responsible for accumulating the partial
sums generated from TMMU. Finally, AFAU implements
the sigmoid function using piecewise linear interpolation
to speedup the computation with negligible accuracy loss.
Since the processing units in DLAU might have inconsistent
throughput rates, each unit has input FIFO buffer and output
FIFO buffer to prevent data loss.
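One common way to realize the sigmoid with piecewise linear interpolation is to sample it at a small number of breakpoints and interpolate between them; the sketch below (ours, with an arbitrary segment count) illustrates the idea behind AFAU in software.

#include <cmath>
#include <cstdio>

// Piecewise-linear sigmoid: sample sigmoid() at uniformly spaced breakpoints
// over [-8, 8] and interpolate linearly in between. The segment count is an
// illustrative choice; a hardware AFAU would read slopes/intercepts from a
// small on-chip table instead of calling exp().
float sigmoid_pwl(float x) {
    const int   SEGMENTS = 64;
    const float LO = -8.0f, HI = 8.0f;
    if (x <= LO) return 0.0f;
    if (x >= HI) return 1.0f;
    float step = (HI - LO) / SEGMENTS;
    int   idx  = static_cast<int>((x - LO) / step);
    float x0   = LO + idx * step;
    float y0   = 1.0f / (1.0f + std::exp(-x0));           // breakpoint value idx
    float y1   = 1.0f / (1.0f + std::exp(-(x0 + step)));  // breakpoint value idx + 1
    return y0 + (y1 - y0) * (x - x0) / step;              // linear interpolation
}

int main() {
    const float xs[] = {-4.0f, -1.0f, 0.0f, 1.0f, 4.0f};
    for (float x : xs)
        std::printf("x=%5.1f  pwl=%.5f  exact=%.5f\n",
                    x, sigmoid_pwl(x), 1.0f / (1.0f + std::exp(-x)));
    return 0;
}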
The authors implemented the proposed architecture on
Xilinx Zynq Zedboard with ARM Cortex-A9 processors
clocked at 667 MHz. In addition, they used the MNIST
dataset as a benchmark considering the network size as
64×64, 128×128, and 256×256. The experimental results demonstrated that the speedup of the DLAU accelerator is up to 36.1× compared with the Intel Core2 processors at a 256×256 network size. In addition, the results depict that
the proposed architecture is quite energy-efficient as the total
power consumption was only 234 mW.
In [199], a generalized end-to-end acceleration system of
the previously proposed accelerators in [78], [83], [168],
[196] has been developed to support diverse CNN models.
In doing so, a user-friendly interface and an RTL-level
compiler have been proposed to automatically generate
customized FPGA designs. The authors have developed
an expandable optimized RTL-based library containing the
most commonly used CNN operations. These operations
have been coded in Verilog and designed based on the
quantitative analysis and optimization strategies discussed
in [83]. The compiler generates a DAG-based structure for
the used CNN model and then compiles it with RTL mod-
ules in the library. The proposed compiler allows the user
to input the high-level information of the used CNN model
(previously designed on Caffe framework [58]) as well as
the design variables (i.e., loop unrolling and loop tiling
variables) with the resource constraints of the targeted FPGA
platform. Such utility facilitates the exploration of the best
trade-off between the resource usage and the performance.
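The compiler interface can be pictured as a small declarative description of the network together with the design variables, which is checked against the platform's resource budget before RTL generation. The structures below are hypothetical and only illustrate the kind of information exchanged; they are not the authors' actual input format or resource model.

#include <cstdio>
#include <string>
#include <vector>

// Hypothetical description consumed by an RTL-generating compiler: the CNN
// layers (as exported from a Caffe model) plus the loop unrolling/tiling
// variables and the resource budget of the target FPGA. All names and the
// resource estimates are illustrative.
struct Layer      { std::string type; int kernel, stride, in_ch, out_ch; };
struct DesignVars { int Pix, Piy, Pof, tile_height; };
struct Budget     { int dsps, bram_kbits; };

bool fits(const DesignVars& v, const Budget& b) {
    int dsps_needed = v.Pix * v.Piy * v.Pof;          // one MAC per DSP (toy model)
    int bram_needed = v.tile_height * v.Pof * 16;     // toy on-chip buffer estimate (Kbits)
    return dsps_needed <= b.dsps && bram_needed <= b.bram_kbits;
}

int main() {
    std::vector<Layer> net = {{"CONV", 3, 1, 3, 64}, {"ReLU", 0, 0, 64, 64},
                              {"POOL", 2, 2, 64, 64}, {"FC",   0, 0, 0, 1000}};
    DesignVars vars{7, 7, 16, 14};
    Budget     board{1518, 54260};                    // e.g., Arria 10 GX 1150-like budget
    std::printf("%zu layers, design %s within the budget\n",
                net.size(), fits(vars, board) ? "fits" : "does not fit");
    return 0;
}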
Unlike the architecture in [168] where individual CONV
module is assigned to each CONV layer, the scalable RTL
computing module proposed in this work is reused by all
CNN layers of the same type for different CNNs as shown
in Fig. 31. Note that it is not necessary to have all these
modules in the architecture.
FIGURE 31. Overall Architecture and Dataflow [199].
FIGURE 32. CONV Reconfigurable Computing Module [199].
For instance, the RTL compiler will not compile or synthesize Eltwise and combined batch
normalization with scale (Bnorm) modules for VGG-16
model which greatly saves the hardware resources.
On the other hand, the authors categorized CNN layers
into key layers (i.e., CONV, pool, and FC layers) and
affiliated layers (e.g., ReLU, Bnorm, Eltwise, and all other
layers). They have also defined layer combos, where each
combo is composed of a key layer and several affiliated
layers. Layer combos are sequentially executed according
to their order in the DAG. Moreover, the layer combo is
also divided into several sequential tiles. The computation
of each combo tile starts by reading its input pixels from
off-chip DRAM and ends by writing back its output pixels
to off-chip DRAM. The global control logic, inter-tile
control logic, and intra-tile control logic are responsible
for governing the sequential operations of layer combos
and reconfiguring the modules, combo tiles, and tile layers
(key and affiliated layers), respectively, through predefined
flexible execution schedule similar to that in [196].
The authors have also employed special storage pattern of
both input pixels and weights on off-chip memory before the
acceleration process to maximize data reuse and minimize
data communication. The architecture of the CONV module
is designed based on the acceleration strategies in [83],
[196] but with a different organization of MAC units as
shown in Fig. 32. The MAC units of CONV module
have been organized into Piy × Pof independent MAC blocks, with each MAC block containing Pix MAC units to further minimize the buffer read operations and the partial sum movements. Moreover, such an organization makes it possible to handle varying (kernel, stride) size configurations through
generating different variants of CONV register arrays during
the compilation.
Experimental results demonstrated that the achieved
throughput on Intel Stratix V GXA7 for NiN, VGG-16,
ResNet-50, and ResNet-152 are 282.67 GOPS, 352.24
GOPS, 250.75 GOPS, and 278.67 GOPS, respectively. On
the other hand, the achieved throughput on Intel Arria
10 GX 1150 was 587.63 GOPS, 720.15 GOPS, 619.13
GOPS, and 710.30 GOPS for NiN, VGG-16, ResNet-50,
and ResNet-152, respectively. More than 2× throughput improvements have been achieved on the Intel Arria 10 GX 1150 since it has 1.8× and 5.9× more logic elements and
DSPs than the Intel Stratix V GXA7, respectively, which
allows for larger loop unrolling variables.
Recently, the programmable solutions group at Intel has
developed an FPGA software-programmable and run-time
reconfigurable overlay for deep learning inference [200].
The developed overlay is referred to as the deep learning
accelerator (DLA). For the hardware side of Intel’s DLA, the
team has partitioned the configurable parameters into run-
time and compile-time parameters. The run-time parameters
allow for easy and quick use of different neural network
frameworks, while the compile-time parameters provide a
tunable architecture for performance. Intel’s DLA uses a
lightweight very long instruction word (VLIW) network,
an 8-bit unidirectional ring network, to support the control
and reprogramming logic. Compared with typical overlays,
Intel’s DLA comes with only 1% overhead while other
typical overlays tend to always come with larger over-
heads [201]. The reprogramming of Intel’s DLA overlay
allows for consecutive runs of multiple NNs in a single
application run [202] without the need for reconfiguring and
recompiling the FPGA.
Fig. 33 shows that a 1D array of PEs is used to perform
convolution, multiplication, or any other matrix operations.
Each PE contains a double-buffered filter cache allowing
for pre-loading of next filters while computing. The stream
buffer employed the double-buffering mechanism as well
to store the inputs and the intermediate data on-chip. To
have flexible NN architecture, Intel’s DLA employs an Xbar
interconnect that connects all the core functions required.
Thus, deep learning functions can be easily added to the
overlay through the Xbar by picking them from a suite
of pre-optimized functions of the select frameworks that Intel's DLA uses.
FIGURE 33. Intel's DLA: Neural Network Inference Accelerator [200].
The width adaptation module has been
used to control the throughput of the function. In addition,
Intel’s DLA supports vectorization across the input width
(Q_VEC), input height (P_VEC), input depth (C_VEC),
output depth (K_VEC), filter width (S_VEC), and other
dimensions as depicted in Fig. 33. The authors mention
that vectorization depends on the layers’ dimensions of
the considered framework. However, they did not provide
a systematic way for finding the optimal balance for the
number of used PEs and the size of the caches. For efficient
use of resources, Intel’s DLA maps AVG pooling and FC
primitives to convolutions in order to avoid having under-
utilized (over time) dedicated auxiliary functions.
For the software side of Intel’s DLA, the proposed accel-
erator uses a graph compiler to map a NN architecture to
the overlay for maximizing the hardware efficiency through
slicing, allocation, and scheduling. In the slicing pass, the
graph compiler breaks down the architecture into subgraphs (chains of functions) in such a way that they fit within the
computing and storage resources of the overlay. A single
CONV layer followed by a pooling layer is an example of
CNN subgraph. The graph compiler optimizes the external
memory spill-points by group slicing technique. The group
slicing allows several sequential convolutions, for instance,
of a single slice to be computed before moving onto the
next slice while using the whole stream buffer. During
the allocation pass, the graph compiler optimizes the use
of custom-developed filter caches and the stream buffer by
managing the read and write from the stream buffer for each
slice. Moreover, it assigns an external memory address when
the stream buffer is not big enough to hold the slice data.
Finally, Intel’s DLA compiler schedules the execution of
subgraphs using cost-based (the ratio of the output size
to the effective input size) priority queue.
FIGURE 34. Design Flow from CNN Model to Hardware Acceleration [60].
FIGURE 35. Overall Architecture of Angel-Eye [60].
The authors
utilized the software-programmable and run-time recon-
figurable overlay to optimize the software and hardware
implementation of GoogleNet [31] and ResNet [49] CNNs.
The benchmark results on an Arria 10 GX 1150 FPGA
demonstrated that Intel’s DLA has a throughput of 900
fps on GoogLeNet. The team pointed out that multi-FPGA
deployment [203] might be used to further improve the
throughput of Intel’s DLA.
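The scheduling step can be viewed as a priority queue over ready subgraphs keyed by the reported cost metric, i.e., the ratio of the output size to the effective input size. The following sketch is a generic illustration of that idea, not Intel's compiler code; the ordering policy and the byte counts are our own choices.

#include <cstdio>
#include <queue>
#include <string>
#include <vector>

// Generic cost-driven scheduler: ready subgraphs are popped in order of
// priority = output size / effective input size, the metric reported for the
// DLA graph compiler. Whether the larger or smaller ratio goes first is a
// policy choice; here the larger ratio is scheduled first.
struct Subgraph {
    std::string name;
    double output_bytes, effective_input_bytes;
    double priority() const { return output_bytes / effective_input_bytes; }
};

struct ByPriority {
    bool operator()(const Subgraph& a, const Subgraph& b) const {
        return a.priority() < b.priority();   // max-heap on the ratio
    }
};

int main() {
    std::priority_queue<Subgraph, std::vector<Subgraph>, ByPriority> ready;
    ready.push({"conv1+pool1", 800e3, 600e3});
    ready.push({"conv2",       400e3, 800e3});
    ready.push({"conv3+pool3", 300e3, 200e3});
    while (!ready.empty()) {
        std::printf("schedule %s (priority %.2f)\n",
                    ready.top().name.c_str(), ready.top().priority());
        ready.pop();
    }
    return 0;
}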
Kaiyuan et al. [60] proposed a complete design flow,
referred to as Angel-Eye, for mapping CNNs onto FPGA.
It includes a data quantization strategy, a parameterizable
and run-time configurable hardware architecture to support
various CNNs and FPGA platforms, and a compiler to map
a given CNN onto the hardware architecture. It adopts
the approach of using a flexible hardware architecture and
maps different CNNs onto it by changing the software.
The proposed design flow from CNN model to hardware
acceleration is shown in Fig. 34.
Due to the large dynamic range of data across different
layers, the best radix point is found for each layer for a
given bit width. They demonstrated that their strategy can
simplify state-of-the-art CNNs to 8-bit fixed-point format
with negligible accuracy loss. Although 8 bits are used for representing data, 24 bits are used for representing intermediate data in layers, which are then aligned and quantized to 8 bits. Fig. 35 and Fig. 36 show the overall architecture
of Angel-Eye and the structure of a single PE, respectively.
The architecture is designed for supporting an instruction
interface that supports three types of instructions; LOAD,
SAVE, and CALC.
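The per-layer radix-point selection described above can be illustrated in a few lines: for a given bit width, try every radix position and keep the one that minimizes the quantization error over that layer's data. The code below is our own sketch of this strategy, not the Angel-Eye tool.

#include <cmath>
#include <cstdio>
#include <vector>

// Quantize x to 'bits'-bit signed fixed-point with 'frac' fractional bits.
float quantize(float x, int bits, int frac) {
    float scale = std::ldexp(1.0f, frac);                     // 2^frac
    long  qmax  = (1L << (bits - 1)) - 1, qmin = -(1L << (bits - 1));
    long  q     = std::lround(x * scale);
    if (q > qmax) q = qmax;
    if (q < qmin) q = qmin;
    return static_cast<float>(q) / scale;
}

// Pick the fractional bit count (radix point) that minimizes the squared
// quantization error over one layer's data, for a fixed total bit width.
int best_frac_bits(const std::vector<float>& data, int bits) {
    int best = 0; double best_err = -1.0;
    for (int frac = 0; frac < bits; ++frac) {
        double err = 0.0;
        for (float x : data) { float d = x - quantize(x, bits, frac); err += d * d; }
        if (best_err < 0.0 || err < best_err) { best_err = err; best = frac; }
    }
    return best;
}

int main() {
    std::vector<float> layer = {0.12f, -1.7f, 3.4f, 0.02f, -2.9f};
    std::printf("best radix point for 8-bit: %d fractional bits\n",
                best_frac_bits(layer, 8));
    return 0;
}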
FIGURE 36. Structure of a Single PE [60].
The overall architecture is divided into four main components: the PE array, on-chip buffer, external memory, and controller. The PE array implements the convolution operations
in CNN and supports kernel level parallelism, and input
and output channel parallelisms. It uses a 3×3 convolution kernel, as this size is the most popular in state-of-the-art CNN models. However, larger kernel sizes are supported based on the use of multiple 3×3 kernels. The use of on-chip buffers
allows data I/O and convolution calculation to be done
in parallel. The controller is responsible for receiving the
instructions and issuing them to the other components. The
compiler maps the CNN descriptor to the set of instructions
that will be executed by the hardware. It follows basic
scheduling rules to fully utilize data localization in CNN
and reduce data I/O.
The block partition step partitions the calculation of one
layer to fit each block into the hardware. The memory
mapping step allocates memory for communication between
the host CPU and the CNN accelerator. Based on the block
partition result, on-chip memory is allocated for the input
and output feature map blocks and for the convolution
kernels and bias values. The dependency check step checks
data dependency among instructions and sets appropriate
instruction flags to maximize parallelism between convolu-
tion calculation and data I/O. Based on experimental results,
it is shown that the 8-bit implementation of Angel-Eye
on the XC7Z020 achieves up to 16× higher energy efficiency than the NVIDIA TK1 and 10× higher than the NVIDIA TX1. In addition, the 16-bit implementation of Angel-Eye on the XC7Z045 is 6× faster and 5× higher in power efficiency than a peer FPGA implementation on the same platform [148].
In [83] and [199], a special register array architecture has
been designed to rearrange buffers data and direct them into
PEs for the purpose of implementing CONV module that
supports specific stride and zero-padding settings. Although
the designed CONV module is not generalized for any (ker-
nel, stride) size configurations, it is composed of complex
wire routing and control logic as shown in Fig. 20. To have
flexibility in directing the dataflow of CONV pixels, Ma
et al. [204] replaced the register array architecture in [199] with a data router as shown in Fig. 37.
FIGURE 37. CONV Acceleration Architecture and Dataflow using Data Router [204], where Pix = Pox and Piy = Poy.
The data router is a scalable set of data bus from buffer
to PE (BUF2PE). The BUF2PE data bus consists of simple
register arrays with FIFOs in between to form a line buffer
similar to that in [154]. The register array uses the FIFO to
pass its input pixels to the adjacent registers. Each BUF2PE
data bus has different data movements within its register
arrays to implement specific stride and kernel size settings.
Unlike the register array architecture in [83] where the west
zero-paddings are handled by changing the storage pattern
within the input pixel buffer, the BUF2PE handles such
kind of paddings by shifting the connection between the
register arrays and the input pixel buffers to simplify the
data transferring from off-chip memory to on-chip buffers.
However, there is still a need for adjusting the storage
pattern within the input buffers in order to handle other
zero-paddings.
The global control logic is responsible for selecting the
suitable BUF2PE data bus from the data router as well as
the suitable storage pattern within the input buffers based on
the (kernel, stride) size configuration of CONV layer. The
CONV module has also been optimized by reducing the
required number of parallel adders that add the partial sums
with biases as well as the number of parallel multipliers and
adders needed to perform the Bnorm operation by serializing the parallel outputs using multiplexers. In addition, 16-bit
fixed-point has been used to represent both weights and
pixels, while dynamically adjusting the decimal point in
different layers to fully utilize the existing data width [60].
The proposed compiler in [199] has been used to config-
ure the parameterized Verilog scripts of the overall CNN
acceleration system. Experimental results show throughput
degradation on both Intel Arria 10 GX 1150 and Intel Stratix
V GXA7 in comparison to the results in [199].
In Table 2 and Table 3, we summarize the reviewed
FPGA-based deep learning networks acceleration tech-
niques. For each technique, we list the year the technique
was introduced, the key features employed for acceleration,
the used deep learning model, the number of needed op-
erations per image, the FPGA platform used to implement
the technique, the precision used for the FMs and weights,
the clock frequency used, the design entry for describing
the modeled deep learning network, the type of LUT for
the used platform, the number of resources available by the
used platform in terms of BRAMs, LUTs, FFs, and DSPs,
the percentage of each resource utilization, the performance
in GOPS, the speedup in comparison to a given baseline
model, and finally the power efficiency (GOPS/W).
IV. METAHEURISTICS IN THE DESIGN OF
CONVOLUTIONAL NEURAL NETWORKS
Currently, convolutional neural network (CNN) structures
are designed based on human expertise. For a given applica-
tion, this consists of determining the number of convolution
layers, number of fully connected layers, sizes of feature
TABLE 2. Implementation and Performance Summary of FPGA-based Accelerators.
Columns: Technique | Year | Key Features | DL Model | Image Operations (GOP) | Platform | Precision | Frequency (MHz) | LUT Type | Design Entry | Resources (BRAMs/M20K, LUTs/ALMs, FFs, DSPs) | Resources Utilization (BRAMs/M20K, LUTs/ALMs, FFs, DSPs) | Performance (GOPS) | Speedup | Baseline | Power Efficiency (GOPS/W)
CNP [127] 2009
2D CONV modules,
Memory interface with
8 simultaneous accesses
LeNet-5 0.52 Virtex4
SX35
16-bit
fixed-point 200 4-input LUTs Lush 192 30,720 30,720 192 N/A 90% 90% 28% 5.25 N/A 0.35
CONV Coprocessor
Accelerator [129] 2009
Parallel clusters of
2D convolver units,
data quantization,
off-chip memory banks
4CONV layers 1.06 Virtex5
LX330T
16-bit
fixed-point 115 6-input LUTs C324 207,360 207,360 192 0.93% 17% 19.05% 55.73% 6.74 6ˆ2.2 GHz
AMD Opteron 0.61
MAPLE [132] 2010
In-memory processing,
banked off-chip memories,
2D array of VPEs
4CONV layers 1.06 Virtex5
SX240T fixed-point 125 6-input LUTs C++ 516 149,760 149,760 1,056 N/A 70.5ˆ1.3 GHz
C870 GPU N/A
DC-CNN [100] 2010
Integer factorization
to determine the best
config for each layer,
input and output switches
3CONV layers 0.52 Virtex5
SX240T
48-bit
fixed-point 120 6-input LUTs RTL 516 149,760 149,760 1,056 N/A 16 4.0ˆ- 6.5ˆ1.35 GHz
C870 GPU 1.14
NeuFlow [143] 2011
Multiple full-custom
processing tiles (PTs),
Pipelining, Fast streaming
memory interface
4CONV layers
in [145] N/A Virtex6
VLX240T
16-bit
fixed-point 200 6-input LUTs HDL 416 150,720 301,440 768 N/A 147 133.6ˆ2.66 GHz
Core 2 Duo CPU 14.7
Memory-Centric
Accelerator [146] 2013
Flexible off-chip memory
hierarchy, Data reuse,
Loop transformation
4CONV layers 5.48 Virtex6
VLX240T fixed-point 150 6-input LUTs C416 150,720 301,440 768 45.5% 1.1% N/A 6% 17 11ˆStandard Virtex6
Implementation N/A
nn-X [148] 2014
Cascaded pipelining,
Multiple stream
processing
2CONV layers 0.55 Zynq
XC7Z045
16-bit
fixed-point 142 4-input LUTs Lua 545 218,600 437,200 900 N/A 23.18 115ˆ
800 MHz
Embedded ARM
Cortex-A9 Processors
2.9
Roofline-based FPGA
Accelerator [55] 2015
Roofline-based model,
Loop transformation,
Loop tiling/unrolling
AlexNet 1.33 Virtex7
VX485T
32-bit
float-point 100 4-input LUTs C2,060 303,600 607,200 2,800 50% 61.3% 33.87% 80% 61.62 17.42ˆ2.2 GHz
Intel Xeon 3.31
Microsoft Specialized
CNN Accelerator [152] 2015
Multi-banked buffers,
Network on-chip for re-
distribution of output data,
Software configurable
AlexNet 1.33 Stratix-V
GSMD5
32-bit
float-point 100 4-input LUTs C2,014 172,600 690,000 1,590 N/A 178.63 3ˆRoofline-based FPGA
Accelerator [55] 7.15
Embedded FPGA
Accelerator [98] 2015
Data quantization
and arrangment,
SVD
VGG-16 30.76 Zynq
XC7Z045
16-bit
fixed-point 150 4-input LUTs RTL 545 218,600 437,200 900 86.7% 83.5% 29.2% 89.2% 136.97 1.4ˆ2.9 GHz
Intel Xeon 14.22
DeepBurning [155] 2016
NN model compression,
Compiler-based library,
Automatic partitioning/
tiling, Function Approx.
AlexNet 1.46 Zynq
XC7Z045
16-bit
fixed-point 100 4-input LUTs RTL 545 218,600 437,200 900 N/A 17.3% 7.65% 16% 73 15.88ˆ2.4 GHz
Intel Xeon 6.43
OpenCL-based FPGA
Accelerator [80] 2016
Design space exploration
for all CNN layers,
Genetic algorithm,
Altera OpenCL SDK
AlexNetpaq1.46 Stratix-Vp1q
GXA7 (8-16)-bit
fixed-point 120 7-input LUTs OpenCL
2,560 234,720 938,880 256 37% 47% N/A 100% 31.8pa1q4.2ˆ
3.3 GHz
Intel i5-4590
1.23
N/A 47.5pb1q2.2ˆN/A
VGG-16pbq30.9 Stratix-Vp2q
GSD8 2,567 262,400 1,050,000 1,963 52% 46% N/A 37% 72.4pa2q9.5ˆ3.79
N/A 117.8pb2q5.5ˆN/A
Caffeine [153], [162] 2016
Systolic array architecture,
Loop unrolling, Double
buffering, Pipelining,
Roofline model
AlexNetpaq1.46 Virtex7p1q
VX690T
16-bitpaq
fixed-point 150
6-input LUTs C++
2,940 433,200 866,400 3,600 42% 81% 36% 78% 354pb1aq9.7ˆTwo-socket server
each with a 6-core
Intel CPU
E5-2609 at 1.9GHz
13.62
36% 31% 11% 38% 266pb2aq7.3ˆ10.64
VGG-16pbq31.1 Xilinxp2q
KU060
8-bitpbq
fixed-point 200 2,160 331,680 663,360 2,760 36% 60% 20% 4% 1,171.7pb2bq29ˆ45.07
N/A 165pa2aq4.2ˆN/A
fpgaConvNet [165] 2016
Datapath optimization,
Synchronous dataflow,
Partitioning and folding,
Design space exploration
LeNet-5 0.0038 Zynq
XC7Z020
16-bit
fixed-point 100 4-input LUTs HLS 140 53,200 106,400 200
4.4% 18.2% 13% 1.2% 0.48 0.09ˆCNP [127] N/A
Scene
Labelling [167] 7.6528 6.5% 61% 36.6% 71.8% 12.73 0.17ˆTegra K1 GPU 7.27
TABLE 3. Implementation and Performance Summary of FPGA-based Accelerators.
Columns: Technique | Year | Key Features | DL Model | Image Operations (GOP) | Platform | Precision | Frequency (MHz) | LUT Type | Design Entry | Resources (BRAMs/M20K, LUTs/ALMs, FFs, DSPs) | Resources Utilization (BRAMs/M20K, LUTs/ALMs, FFs, DSPs) | Performance (GOPS) | Speedup | Baseline | Power Efficiency (GOPS/W)
Throughput-Optimized
FPGA Accelerator [170] 2017
Four-levels of parallelism,
Memory-cache mechanism,
Design space exploration
LeNet 0.04 Virtex7
VX690T
(8-16)-bit
fixed-point 100 6-input LUTs N/A 1,470 433,200 866,400 3,600
32.4% 53.8% 35.5% 80.8% 424.7 14.84ˆ4.0 GHz
Intel Core
i7-4790K CPU
16.85
AlexNet 1.46 69.5% 47.7% 37.3% 79.8% 445.6 6.96ˆ17.97
VGG-S 5.29 88.8% 51.7% 34.5% 81.9% 473.4 4.79ˆ18.49
FP-DNN [171] 2017
RTL-HLS hybrid compiler,
Hand-written matrix
multiply, Quantization,
Tiling and double buffers
VGG-19 39.26 Stratix-V
GSMD5
16-bit
fixed-point 150 4-input LUTs
C++
and
OpenCL
2,014 172,600 690,000 1,590
48% 27% N/A 66% 364.36 3.06ˆ2 Processors each
2.6 GHz
Intel Xeon
8-core E5-2650v2
14.57
ResNet-152 22.62 48% 27% N/A 66% 226.47 1.9ˆ9.06
FINN [181] 2017
Binarized CNN,
Pipelining, Automatic
partitioning and tiling,
Approximate arithmetic
CNV 0.113 Zynq
XC7Z045
(1-2)-bit
precision 200 4-input LUTs C++ 545 218,600 437,200 900 34.1% 21.2% N/A N/A 2,465.5 13.8ˆMicrosoft Specialized
CNN Accelerator [152] 210.7
Customized CONV
Loops Accelerator [83] 2017
Numerical analysis of
CONV loops opt.,
Solution space exploration,
Dataflow opt.
VGG-16 30.95 Arria 10
GX 1150
(8-16)-bit
float-point 150 8-input LUTs Verilog 2,713 427,200 1,708,800 1,518 70% 38% N/A 100% 645.25
3.2ˆEnergy-efficient
CNN [205] 30.44
5.5ˆOpenCL-based FPGA
Accelerator [80]pb2q
Latency-Driven Design
for FPGA-based
CNNs [183]
2017
Synchronous dataflow,
Weights reloading
and SDF transformations,
Design space exploration
AlexNet 1.3315 Zynq
XC7Z045
16-bit
fixed-point 125 4-input LUTs HLS 545 218,600 437,200 900 N/A
161.98 1.49ˆDeepBurning [155]
N/A
VGG-16 30.72 123.12 0.65ˆEmbedded FPGA
Accelerator [98]
DLA [188] 2017
Winograd transformations,
Double stream buffers,
PEs double cache buffers,
Daisy-chained PEs
AlexNet 1.46 Arria 10
GX 1150
Half-precision
FP16 with
shared
exponent
303 8-input LUTs OpenCL 2,713 427,200 1,708,800 1,518 92% 58% 40% 97% 1,382
8.4ˆCaffeine [162]pa2aq
30.71
19.1ˆOpenCL-based FPGA
Accelerator [80]pa2q
Winograd-based
CNN Accelerator [189] 2017
Winograd algorithm, Loop
unrolling, Double buffers,
Batching, Line buffers,
Design space exploration
AlexNetpaq1.33 Zynqp1q
ZCU102 16-bit
fixed-point
200 6-input LUTs
C
912 274,080 548,160 2,520
N/A
854.6pa1q11.8ˆ[80]pa2q36.2
2,940.7pb1q8.31ˆCaffeine [162]pb1aq124.6
VGG-16pbq30.76 Xilinxp2q
ZC706 167 4-input LUTs 545 218,600 437,200 900 201.4pa2q2.78ˆ[80]pa2q21.4
679.6pb2q0.13ˆTitan X (CuDNN 5.1) 72.3
OpenCL-based
Architecture for
Accelerating
CNNs [190]
2017
2D BRAM-to-PE
interconnection, 2D
dispatcher, Roofline
model, OpenCL
VGG-16 30.76 Arria 10
GX 1150
32-bit
float-point 370
8-input LUTs OpenCL
2,713 427,200 1,708,800 1,518 46.1% N/A N/A 87% 866 4.41ˆAltera OpenCL [191]
on Arria 10 Platform 20.75
16-bit
fixed-point 385 2,713 427,200 1,708,800 3,036 53.4% N/A N/A 90.8% 1,790 N/A 47.78
Multi-CLP
Accelerator for
CNNs [192]
2017
Multiple CONV
processors, Pipelining,
Dynamic programming,
Double-buffering
AlexNetpaq1.33 Virtex7p1q
VX485T
32-bit
float-point 100 4-input LUTs
C++
2,060 303,600 607,200 2,800 39.4% 58.3% 44.6% 87.3% 85.2pa1q1.31ˆSingle CLP
Design Based
in [55]
11.2
48.8% 84.7% 40.2% 88.3% 113.92pa2q1.54ˆ11.17
SqueezeNetpbq0.77 Virtex7p2q
VX690T
16-bit
fixed-point 170 6-input LUTs 2,940 433,200 866,400 3,600 N/A 708.3pb1q1.9ˆN/A
37.7% 30.9% 18.6% 97.1% 909.7pb2q2.3ˆ126.3
Automated Systolic
Array Architecture
for CNN [195]
2017
2D systolic array
architecture, Roofline
model, Automation flow
Design space exploration
AlexNetpaq1.4 Arria 10
GX 1150
32-bitp1q
float-point 239.62pa1q
8-input LUTs OpenCL 2,713 427,200 1,708,800 1,518
87% 82% N/A 85% 360.4pa1q
N/A
20.75
N/A
VGG-16pbq31.1 (8-16)-bitp2q
fixed-point
221.65pb1q90.5% 83% N/A 88.3% 460.5pb1q
231.85pb2q61.5% 73% N/A 98.8% 1,171.3pb2q
End-to-End Scalable
FPGA Accelerator [196] 2017
Flexible and scalable
ResNet modules, CONV
loops opt., Dataflow opt.,
Controlled execution
flow of ResNets layers
ResNet-50 7.74
Arria 10
GX 1150
16-bit
fixed-point 150 8-input LUTs Verilog 2,713 427,200 1,708,800 1,518
80% 30% N/A 69% 285.07
N/A N/A
ResNet-152 22.62 93% 33% N/A 69% 315.48
DLAU [197] 2017
Pipelined processing
units, Tiling,
FIFO buffers
DNN N/A Zynq
XC7Z020
48-bit
float-point 200 4-input LUTs N/A 280 53,200 106,400 220 12.5% 68.4% 26.6% 75.9% N/A 36.1ˆ2.3 GHz
Intel Core2 N/A
An Automatic RTL
Compiler for High-
Throughput Deep
CNNs [199]
2017
Library-based RTL
compiler, Flexible and
scalable CNN modules,
Layer combo computation,
CONV loops and dataflow
opt., Controlled execution
flow of CNN layers
NiNpaq2.2 Stratix-Vp1q
GXA7
16-bit
fixed-point
150 7-input LUTs
Verilog
2,560 234,720 938,880 256
59% 96% N/A 100% 282.67pa1qą6.4ˆDeepBurning [155]
N/A
86% 90% N/A 100% 352.24pb1qN/A
VGG-16pbq30.95 76% 74% N/A 100% 250.75pc1qN/A
93% 77% N/A 100% 278.67pd1q1.23ˆFP-DNN [171]
ResNet-50pcq7.74 Arria 10p2q
GX 1150 200 8-input LUTs 2,713 427,200 1,708,800 1,518
56% 37% N/A 100% 587.63pa2qN/A
82% 30% N/A 100% 720.15pb2q2.0ˆCaffeine [162]pb1aq
ResNet-152pdq22.62 71% 51% N/A 100% 619.13pc2qN/A
87% 54% N/A 100% 710.30pd2qN/A
ALAMO [78], [168] 2018
Modularized RTL
compiler, Loop
unrolling, Loop
tiling
AlexNet 1.46 Stratix-V
GXA7
(8-16)-bit
fixed-point 100 7-input LUTs Verilog 2,560 234,720 938,880 256
61% 52% N/A 100% 114.5 1.9ˆRoofline-based FPGA
Accelerator [55] 5.87
NiN 2.2 91% 48% N/A 100% 117.3 N/A 6.14
Angel-Eye [189] 2018
Quantization,
Parallel PEs,
Compiler,
On-chip buffers
VGG-16 30.76
Zynq
XC7Z020
8-bit
fixed-point 214
4-input LUTs N/A
140 53,200 106,400 220 61% 56% 33% 86.4% 84.3 3.8ˆ
nn-X [148]
24.1
Zynq
XC7Z045
16-bit
fixed-point 150 545 218,600 437,200 900 89% 84% 29% 87% 137 6ˆ14.2
Optimizing the CONV
Operation to Accelerate
DNNs on FPGA [204]
2018
Scalable set of data buses
(BUF2PE), Optimized
CONV module for
different (kernel, stride)
size configurations,
Flexible and scalable CNN
modules, CONV loops
and dataflow opt.
NiNpaq2.2 Stratix-Vp1q
GXA7
16-bit
fixed-point
150 7-input LUTs
Verilog
2,560 234,720 938,880 256
59% 97% N/A 100% 278.2pa1qN/A
N/A
86% 93% N/A 100% 348.8pb1qN/A
VGG-16pbq30.95 76% 75% N/A 100% 243.3pc1qN/A
93% 78% N/A 100% 276.6pd1q1.2ˆFP-DNN [171]
ResNet-50pcq7.74 Arria 10p2q
GX 1150 200 8-input LUTs 2,713 427,200 1,708,800 1,518
56% 38% N/A 100% 584.8pa2qN/A
82% 32% N/A 100% 715.9pb2qN/A
ResNet-152pdq22.62 71% 52% N/A 100% 611.4pc2qN/A
87% 55% N/A 100% 707.2pd2qN/A
maps in each layer, along with other operators. Recent
research has demonstrated that a large number of weights
in fully connected layers could be eliminated with minimal
impact on accuracy. In addition, although the suggested
CNN structures by experts perform well for various appli-
cations, the question arises whether the suggested structures
could be optimized for performance with minimal impact on
accuracy. Since the designed CNN has a significant impact
on the complexity of its implementation, we review in this
section some approaches attempting to optimize the design
of CNNs using metaheuristics.
NP-hard combinatorial optimization problems [206] ap-
pear in the design of CNNs. Some examples of areas include
design of CNN structures, selection of weights and bias
values to improve accuracy, and determination of optimal
values of variables to reduce run-time. Below, we briefly
touch upon some existing literature in these areas.
A. CNN STRUCTURE OPTIMIZATION
In the design of CNNs, the number of possible network
structures increases exponentially with the number of layers.
Xie and Yuille used a genetic algorithm in learning deep
network structures [207]. The objective was to find the
best CNN structure that would minimize the error rate.
The cost function was the CNN accuracy. They proposed
an elegant encoding of chromosome using a fixed length
binary string to represent each network structure. A CNN
string represents only the convolution layers.
In each generation, using standard genetic operations new
individuals are generated and weak ones eliminated. The
quality of an individual was assessed by its recognition accuracy, which is obtained via the time-consuming operation of training the network and evaluating it on a validation set.
Two small data sets were used (MNIST and CIFAR-10) to
run the genetic implementation via which they demonstrated
the discovery of new structures.
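The fixed-length binary encoding can be sketched as follows (a simplified version, not the exact encoding of [207]): each bit states whether a candidate connection between two ordered nodes inside a convolution stage is present, and standard genetic operators act directly on the bit string.

#include <bitset>
#include <cstdio>
#include <cstdlib>
#include <string>

// Simplified fixed-length binary chromosome for one convolution stage with
// NODES ordered nodes: bit (i, j) says whether node i feeds node j (i < j).
// This mirrors the spirit, not the exact scheme, of the encoding in [207].
constexpr int NODES = 5;
constexpr int BITS  = NODES * (NODES - 1) / 2;   // one bit per ordered pair

using Chromosome = std::bitset<BITS>;

int bit_index(int i, int j) {                    // requires i < j
    return j * (j - 1) / 2 + i;
}

Chromosome mutate(Chromosome c, double rate) {   // flip each bit with prob. 'rate'
    for (int b = 0; b < BITS; ++b)
        if (std::rand() / double(RAND_MAX) < rate) c.flip(b);
    return c;
}

int main() {
    Chromosome c(std::string("1011001010"));     // one candidate stage structure
    std::printf("connection 0 -> 3 present: %d\n", int(c[bit_index(0, 3)]));
    Chromosome child = mutate(c, 0.1);           // a simple genetic operator
    std::printf("mutated chromosome: %s\n", child.to_string().c_str());
    return 0;
}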
B. CNN WEIGHTS AND BIAS VALUES OPTIMIZATION
An attempt to train CNNs using metaheuristics (that is,
determine weights and bias values) is presented in [208].
The objective again was to improve accuracy and minimize
the estimated error. The authors experiment with three
metaheuristic algorithms, namely, simulated annealing, differential evolution, and harmony search. The algorithms compute the values of the weights and biases in the last layer. These values are used as the solution vector, denoted by x, which is to be optimized. The move comprises adding a small value to x to perturb the state. The cost function y
is modeled as
y = \frac{1}{2} \left( \frac{\sum_{i}^{N} (o - u)^2}{N} \right)^{0.5} \qquad (4)
where o is the expected output, u is the real output, and N is
the number of used samples. The stopping criterion is when
the iteration count is reached or when the cost function goes
below a pre-specified value.
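As an illustration of this procedure, the sketch below applies simulated annealing to a generic solution vector x with the cost of Eq. (4); the cooling schedule, perturbation magnitude, and toy data are our own arbitrary choices rather than the settings of [208].

#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Cost of Eq. (4): y = (1/2) * ( sum_i (o_i - u_i)^2 / N )^0.5,
// where o is the expected output and u the output produced by solution x.
double cost(const std::vector<double>& o, const std::vector<double>& u) {
    double s = 0.0;
    for (size_t i = 0; i < o.size(); ++i) s += (o[i] - u[i]) * (o[i] - u[i]);
    return 0.5 * std::sqrt(s / o.size());
}

int main() {
    std::mt19937 rng(42);
    std::normal_distribution<double>       perturb(0.0, 0.05);
    std::uniform_real_distribution<double> unif(0.0, 1.0);

    std::vector<double> o = {0.9, 0.1, 0.4};   // expected outputs
    std::vector<double> x = {0.0, 0.0, 0.0};   // solution vector (toy case: u = x)
    double T = 1.0, cur = cost(o, x);

    // Stop when the iteration budget is exhausted or the cost drops below
    // a pre-specified value, mirroring the stopping criterion in the text.
    for (int it = 0; it < 5000 && cur > 1e-4; ++it, T *= 0.999) {
        std::vector<double> cand = x;
        for (double& v : cand) v += perturb(rng);          // small move around x
        double c = cost(o, cand);
        if (c < cur || unif(rng) < std::exp((cur - c) / T)) { x = cand; cur = c; }
    }
    std::printf("final cost: %g\n", cur);
    return 0;
}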
C. CNN DESIGN VARIABLES OPTIMIZATION
Suda et al. [80] presented a systematic methodology for
design space exploration with the objective of maximizing
the throughput of an OpenCL-based FPGA accelerator for
a given CNN model (please see subsection III-C). FPGA
resource constraints such as on-chip memory, registers, com-
putational resources and external memory bandwidth are
considered. The optimization problem comprises finding the
best combination of the N_{CONV}, S_{CONV}, N_{NORM}, N_{POOL}, and N_{FC} variables, where
N_{CONV} is the size of the filter (or neuron or kernel);
S_{CONV} is the factor by which computational resources are vectorized to execute in a single-instruction stream multiple-data streams (SIMD) fashion;
N_{NORM} represents the number of normalization operations performed in a single cycle;
N_{POOL} is the number of parallel outputs of the pooling layer in a single cycle to achieve acceleration; and,
N_{FC} is the number of parallel multiply and accumulate (MAC) operations performed in a single work-item within the fully connected layer.
The objective function to be minimized is the run-time
(RT), and is given by

\sum_{i=0}^{TL} RT_i \left[ N_{CONV}, S_{CONV}, N_{NORM}, N_{POOL}, N_{FC} \right] \qquad (5)

subject to digital signal processing (DSP) slices, logic, and memory constraints, where TL represents the total number of CNN layers including the repeated layers. The convolution layer run-time (RT_{CONV}) is analytically modeled as a function of the design variables as

RT_{CONV_i} = \frac{\#\ of\ Convolution\ Ops_i}{N_{CONV} \times S_{CONV} \times Frequency} \qquad (6)

As for the other layers, that is, normalization, pooling, and fully connected, the following general model is proposed

RT_{Layer_i} = \frac{\#\ of\ Layer\ Ops_i}{Unroll\ factor \times Frequency} \qquad (7)
The above analytical models are later validated by per-
forming full synthesis at selective points and running them
on the FPGA accelerator.
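For concreteness, the run-time model of Eqs. (5)-(7) can be evaluated as follows for a candidate set of design variables; the per-layer operation counts and the clock frequency below are placeholders, not values from the paper.

#include <cstdio>
#include <string>
#include <vector>

// Evaluate the run-time model of Eqs. (5)-(7) for one candidate set of
// design variables. Operation counts and the frequency are placeholders.
struct DesignVars { double N_CONV, S_CONV, N_NORM, N_POOL, N_FC; };
struct LayerOps   { std::string type; double ops; };   // # of ops of layer i

double layer_runtime(const LayerOps& l, const DesignVars& v, double freq_hz) {
    double unroll = 1.0;
    if      (l.type == "CONV") unroll = v.N_CONV * v.S_CONV;   // Eq. (6)
    else if (l.type == "NORM") unroll = v.N_NORM;              // Eq. (7)
    else if (l.type == "POOL") unroll = v.N_POOL;
    else if (l.type == "FC")   unroll = v.N_FC;
    return l.ops / (unroll * freq_hz);
}

int main() {
    // Placeholder per-layer operation counts (not from any real network).
    std::vector<LayerOps> layers = {{"CONV", 2.1e8}, {"NORM", 1.0e6},
                                    {"POOL", 5.0e5}, {"CONV", 4.5e8}, {"FC", 7.0e7}};
    DesignVars v{64, 8, 4, 4, 16};
    double total = 0.0;                                // Eq. (5): sum over all layers
    for (const auto& l : layers) total += layer_runtime(l, v, 200e6);
    std::printf("estimated run-time: %.6f s\n", total);
    return 0;
}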
Clearly, in order to determine the best values of the
discussed design variables, exhaustive search, especially if
the number of variables and/or FPGA resources is large, is
infeasible. We have to resort to iterative non-deterministic
heuristics [206] such as simulated annealing, simulated
evolution, tabu search, genetic algorithm, particle swarm
optimization, cuckoo search, etc., or any of the modern
metaheuristics, to efficiently traverse the search space to find
acceptable solutions.
The proposed methodology employing a genetic algorithm was
demonstrated by optimizing the implementation of two representative
CNNs, AlexNet and VGG, on two Altera Stratix-V FPGA platforms, the
DE5-Net and P395-D8 boards, which have different hardware resources.
Peak performance is achieved on both boards, for the convolution
operations as well as for the entire CNN network.
One major issue related to the use of non-deterministic iterative
heuristics in the design of neural networks and CNNs is the large
amount of memory required to store the state of the solution and the
amount of time taken to determine the cost of the solution, be it
accuracy/error estimation, run-time, or any other objective.
Reasonable estimation techniques and analytical formulations are
required to efficiently traverse the design space in search of
efficient solutions.
V. SUMMARY AND RECOMMENDATIONS
In this section, we highlight the key features discussed in
the acceleration of convolutional neural networks (CNNs)
implemented on FPGAs, and provide recommendations to
enhance the effectiveness of employing FPGAs in the ac-
celeration of CNNs.
All reviewed techniques are centered around accelerating
the convolution (CONV) operation as it consumes around
90% of the computational time. This is achieved by utiliz-
ing parallel multiply-accumulate operations bounded by re-
source limitations. In addition, careful design of data access
patterns is targeted to minimize memory bandwidth requirements by
utilizing internal memory structures and maximizing data reuse. This
is crucial in the acceleration process due to the large amount of
data that needs to be accessed, including feature maps (FMs) and
weights. To minimize
the memory footprint and to achieve effective utilization
of resources, some techniques optimize the number of bits
used to represent the feature maps and weights with minimal
impact on accuracy. This is combined with the optimized
selection of the number of fraction bits used for each layer.
Other techniques optimize the number of weights used in the
fully connected (FC) layers as these layers are memory-intensive.
Coprocessors are also employed to automatically configure
both the software and the hardware elements to fully exploit
parallelism [100].
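As a hedged sketch of this bit-width optimization (the exact procedures differ across the reviewed works), the Python snippet below quantizes each layer's weights to 8-bit fixed point and selects, per layer, the number of fraction bits that minimizes the quantization error; the weight tensors and word length are assumed.

```python
import numpy as np

def quantize(values, total_bits, frac_bits):
    """Round to fixed point with the given word length and fraction bits."""
    scale = 2.0 ** frac_bits
    qmin, qmax = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(values * scale), qmin, qmax)
    return q / scale

def best_frac_bits(values, total_bits=8):
    """Pick the per-layer fraction length that minimizes mean-squared error."""
    errors = {f: np.mean((values - quantize(values, total_bits, f)) ** 2)
              for f in range(total_bits)}
    return min(errors, key=errors.get)

rng = np.random.default_rng(0)
layers = {"conv1": rng.normal(0, 0.5, 1000),    # toy weight tensors (assumed)
          "conv2": rng.normal(0, 0.05, 1000),
          "fc":    rng.normal(0, 0.01, 1000)}

for name, w in layers.items():
    f = best_frac_bits(w)
    print(f"{name}: 8-bit fixed point with {f} fraction bits")
```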
To optimize parallelization of convolution operations,
several approaches have been attempted. Workload analysis has been
employed to determine the computations that can be structured as
parallel streams [132]. The roofline model based
accelerator uses polyhedral-based data dependence analysis
to find the optimal unrolling factor for every convolutional
layer [150], and to fully utilize all FPGA computational
resources through loop pipelining. To optimize performance,
tiled matrix multiplication is structured as a pipelined binary
adder tree for performing multiplication and generating
partial sums [198]. An optimization framework has been proposed by
Suda et al., who identified the key design variables [80] and
optimized them to maximize parallelism.
To reduce computational complexity of CONV layers and
improve resource efficiency, a number of approaches such
as [184], [188], [189] utilized Winograd transformation in
performing CONV operations as this reduces the computa-
tional complexity by around 50%.
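To illustrate how the Winograd transformation reduces the multiplication count, the sketch below implements the standard 1-D minimal filtering algorithm F(2,3), which produces two convolution outputs with 4 multiplications instead of 6; the 2-D tile-based variants used in [184], [188], [189] follow the same pattern.

```python
import numpy as np

# Standard F(2,3) transform matrices of the Winograd minimal filtering algorithm.
BT = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=float)
G = np.array([[1,    0,   0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0,    0,   1]], dtype=float)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Two outputs of a 3-tap filter over a 4-element input tile, 4 multiplies."""
    U = G @ g                 # transformed filter (precomputable per layer)
    V = BT @ d                # transformed input tile
    M = U * V                 # element-wise: the only 4 multiplications
    return AT @ M

d = np.array([1.0, 2.0, 3.0, 4.0])    # input tile
g = np.array([0.5, 1.0, -1.0])        # 3-tap filter
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
print(winograd_f23(d, g), direct)     # both give the same two outputs
```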
To maximize throughput, several techniques such as
[165], [170], [192] have used multiple CONV layer pro-
cessors (CLPs) instead of using a single CLP that is opti-
mized for all CONV layers. The operation of the multiple CLPs is
pipelined to achieve layer-level parallelism, which maximizes
resource utilization and enhances performance compared to using a
single CLP.
Since the computational requirement of FC layers is significantly
less than that of CONV layers, a number of techniques such as
[153], [162], [188], [189] create batches by grouping different
input FMs and processing them together in the FC layers to improve
performance and maximize resource utilization.
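A minimal NumPy sketch of this batching idea: by stacking several flattened input FMs into one matrix, the memory-heavy FC weight matrix is fetched once and reused across the whole batch, turning many matrix-vector products into a single matrix-matrix product; the layer and batch sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
in_features, out_features, batch = 4096, 1000, 32    # assumed FC layer sizes

W = rng.normal(size=(out_features, in_features))     # FC weights (memory-heavy)
b = rng.normal(size=out_features)
inputs = rng.normal(size=(batch, in_features))       # batch of flattened FMs

# Unbatched: the weights are (conceptually) re-fetched for every input vector.
unbatched = np.stack([W @ x + b for x in inputs])

# Batched: one matrix-matrix multiply reuses W across the whole batch.
batched = inputs @ W.T + b

print(np.allclose(unbatched, batched))   # True: same results, better weight reuse
```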
Complex access patterns and data locality are exploited in the
DeepBurning tool [155] for better data reuse. In [197],
the authors explored hot spots profiling to determine the
computational parts that need to be accelerated to improve
the performance. Acceleration is accomplished by reducing
the memory bandwidth requirements. Techniques proposed
exploit data reuse to reduce off-chip memory communica-
tions. Loop transformations have also been used, reducing tiling
parameters to improve data locality and to reduce redundant
communication operations, so as to maximize data sharing/reuse.
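The following hardware-agnostic Python sketch gives a minimal picture of loop tiling: the computation is traversed tile by tile so that each small block of data, once loaded into what would be an on-chip buffer, is reused for all the work that needs it; the matrix and tile sizes are illustrative.

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """C = A @ B computed tile by tile; each tile of A and B is 'loaded' once
    per block of work, mimicking on-chip buffering and data reuse."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                a_tile = A[i0:i0 + tile, k0:k0 + tile]   # reused across the j-tile
                b_tile = B[k0:k0 + tile, j0:j0 + tile]   # reused across the i-tile
                C[i0:i0 + tile, j0:j0 + tile] += a_tile @ b_tile
    return C

rng = np.random.default_rng(0)
A, B = rng.normal(size=(96, 80)), rng.normal(size=(80, 64))
print(np.allclose(tiled_matmul(A, B), A @ B))   # True
```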
Efficient buffering, where the weight buffers are used to
ensure the availability of CONV and FC layers’ weights
before their computation, as well as to overlap the transfer
of FC layer weights with its computation, helps in improving
performance [78], [168]. In the Catapult project, FPGA boards were
integrated into data center applications and achieved speedup.
Microsoft Research's Catapult utilized a multi-banked input buffer
and a kernel weight buffer to provide an efficient buffering scheme
for feature maps and weights, respectively. To minimize the off-chip
memory traffic, a specialized network-on-chip was designed to
redistribute the output feature maps onto the multi-banked input
buffer instead of transferring them to the external memory [152].
To further reduce the memory footprint and bandwidth requirement,
optimal fractional lengths for the weights and feature maps in each
layer are used. Singular value decomposition (SVD) has also been
applied to the weight matrix of the FC layer in order to reduce the
memory footprint at this layer [98].
Tiling techniques have been proposed where large-scale
input data is partitioned into small subsets or tiles whose size
is configured to leverage the trade-off between the hardware
cost and the speedup [197].
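A minimal NumPy sketch of the SVD idea applied to an FC weight matrix: keeping only the top-$k$ singular values factors an $m \times n$ matrix into two small matrices with $k(m+n)$ parameters instead of $mn$, trading a small approximation error for a much smaller memory footprint; the matrix sizes, the kept rank, and the assumed low-rank structure of the toy weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 1024, 1024, 128           # FC weight shape and kept rank (assumed)

# Toy FC weights with an (assumed) approximately low-rank structure.
W = rng.normal(size=(m, 100)) @ rng.normal(size=(100, n)) / 100 \
    + 0.02 * rng.normal(size=(m, n))

U, s, Vt = np.linalg.svd(W, full_matrices=False)
W1 = U[:, :k] * s[:k]               # m x k factor
W2 = Vt[:k, :]                      # k x n factor

x = rng.normal(size=n)              # one flattened input FM
y_full = W @ x                      # original FC layer
y_svd = W1 @ (W2 @ x)               # two small products instead of one large one

params_before, params_after = m * n, k * (m + n)
rel_err = np.linalg.norm(y_full - y_svd) / np.linalg.norm(y_full)
print(f"parameters: {params_before} -> {params_after}, "
      f"relative output error: {rel_err:.3f}")
```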
Automation tools have been developed that automatically
build neural networks with optimized performance [155].
They employ a pre-constructed register transfer level (RTL)
module library that holds hardware modules (including logical and
arithmetic operations) and configuration scripts. DeepBurn-
ing, for example, generates the hardware description for
neural network scripts. Another modularized RTL compiler, ALAMO,
integrates both fine-grained RTL-level optimization and the
flexibility of high-level synthesis (HLS) to generate efficient,
parameterized Verilog RTL scripts for ASIC or FPGA platforms under
the available number of parallel computing resources (i.e., the
number of multipliers) [78], [168]. Acceleration is achieved by
employing the loop unrolling technique for CONV layer operations.
Some of the reviewed
techniques also help minimize the size of FPGA on-chip
memories to optimize energy and area usage [146], [147].
In Table 4 and Table 5, we list the optimization mech-
anisms utilized by each of the reviewed techniques to
maximize performance and throughput of FPGA-based deep
learning networks.
To enhance the utilization of FPGAs in CNN acceleration
and to maximize their effectiveness, we recommend the
development of a framework that includes a user-friendly
interface that allows the user to easily specify the CNN
model to be accelerated. This includes specifying the CNN
model parameters in terms of number of convolution layers
and their sizes, and number of fully connected layers along
with other intermediate operations. The specified CNN
model weights will be read from a file. In addition, the user
should have the option of specifying the FPGA platform
that will be used for implementing the CNN accelerator
and the maximum tolerable error, along with the selection
of a library from a set of applications to be used for model
optimization and evaluation. The framework then should
perform optimizations to find the minimum number of bits
that need to be used for representing the weights and feature
maps and the number of fraction bits to be used for each
layer. In addition, optimization of fully connected layers is
performed to minimize the memory requirements. All such
optimizations are carried out bounded by the maximum error
specified by the user for the specified application library.
The framework should be designed based on the devel-
opment of a scalable hardware architecture that works for
any given FPGA platform and achieves higher speedup with
the availability of higher resources. Based on the available
resources, specified by the FPGA platform, the tool will
perform optimizations to maximize parallelism and data
reuse, given the resource limitations. The tool will then
automatically generate the CNN model that will fit on the
given FPGA platform and will allow the user to evaluate
the performance based on the chosen application library.
This will allow the user to evaluate the performance gains
while evaluating different FPGA platforms with different
resources. The tool should have the option to generate per-
formance measures based on different performance metrics
as selected by the user such as number of frames processed
per second or number of operations performed per second.
In addition, the tool will report other design metrics such
as resource utilization, memory sizes and bandwidth, and
power dissipation.
Furthermore, it is desired to have the option for the user
to specify the desired performance for a given CNN model
and have the tool perform necessary analysis and evaluation
and recommend to the user candidate FPGA platforms for
achieving the desired performance levels. This will require
the development of reasonably accurate analytical models
that will estimate the needed resources for achieving the
desired performance. The user can then choose the recom-
mended FPGA platform and perform complete evaluation
to verify that the desired performance levels are met.
VI. CONCLUSION
In this paper, we reviewed recent developments in the area
of acceleration of deep learning networks and, in particular,
convolution neural networks (CNNs) on field programmable
gate arrays (FPGAs). The paper begins with a brief overview
of deep learning techniques highlighting their importance,
key operations, and applications. Special emphasis is given
to CNNs as they have wide applications in the area of
image detection and recognition and require both CPU
and memory intensive operations that can be effectively
accelerated utilizing FPGA inherent ability to maximize
parallelism of operations.
While the paper briefly touches upon the acceleration
techniques for deep learning algorithms and CNNs from
both software and hardware perspectives, the core of this
article has been the review of recent techniques employed
in the acceleration of CNNs on FPGAs. A thorough up-to-
date review is provided that illustrates the employment of
various possibilities and techniques such as exploitation of
parallelism utilizing loop tiling and loop unrolling, effective
use of internal memory to maximize data reuse, operation
pipelining, and effective use of data sizes to minimize memory
footprint and to optimize FPGA resource utilization.
The paper also presented the use of tools for generating
register transfer level (RTL) scripts that not only help in
automating the design process, but also help in exploring the
design space and suggesting efficient hardware. The paper
discusses the use of analytics such as: (i) work load analysis
in determining the computations that can be parallelized,
(ii) optimal loop unrolling factors, (iii) determining access
patterns to improve data locality, etc. In addition, a brief
review of the use of non-deterministic heuristics in solving
NP-hard combinatorial optimization problems in the design
and implementation of CNNs has been presented. Finally, the paper
summarizes the key features employed by the various FPGA-based CNN
acceleration techniques and provides recommendations for enhancing
the effectiveness of utilizing FPGAs in CNN acceleration.
ACKNOWLEDGMENT
The authors acknowledge King Fahd University of Petroleum
& Minerals, Dhahran, Saudi Arabia, for all support. We would also
like to acknowledge Dr. Blair P. Bremberg and Ms. Sumaiya Hussain
Sadiq for their help in the professional English editing of this
manuscript.
TABLE 4. Optimization Mechanisms Employed for FPGA-based Acceleration of Deep Learning Networks. Techniques covered: VIP [119], CNP [127], CONV Coprocessor Accelerator [129], MAPLE [132], DC-CNN [100], NeuFlow [143], Memory-Centric Accelerator [146], nn-X [148], Roofline-based FPGA Accelerator [55], Embedded FPGA Accelerator [98], DeepBurning [155], OpenCL-based FPGA Accelerator [80], Caffeine [153], [162], and fpgaConvNet [165]. For each technique, the table marks which of the following optimization mechanisms are employed: Loop Unrolling, Loop Tiling, Loop Interchange, Pipelining, Batching, Multi-CLPs, Fixed-Point Precision, Per-Layer Quantization, Singular Value Decomposition, Prefetching, Rearranging Memory Data, In-Memory Processing, Line Buffer, Double Buffering, Approximating Non-Linear AF, Eliminating FC Layer, Roofline Model, Polyhedral Optimization, Dynamic Programming, and Graph Partitioning.
TABLE 5. Optimization Mechanisms Employed for FPGA-based Acceleration of Deep Learning Networks. Techniques covered: ALAMO [78], [168], Throughput-Optimized FPGA Accelerator [170], FP-DNN [171], FINN [181], Customized CONV Loop Accelerator [83], Latency-Driven Design for FPGA-based CNNs [183], DLA [188], Winograd-based CNN Accelerator [189], OpenCL-based Architecture for Accelerating CNNs [190], Multi-CLP Accelerator for CNNs [192], Automated Systolic Array Architecture for CNN [195], End-to-End Scalable FPGA Accelerator [196], DLAU [197], An Automatic RTL Compiler for High-Throughput Deep CNNs [199], Intel's DLA [200], Angel-Eye [60], and Optimizing the CONV Operation to Accelerate DNNs on FPGA [204]. For each technique, the table marks which of the following optimization mechanisms are employed: Loop Unrolling, Loop Tiling, Loop Interchange, Pipelining, Input Batching, FC Layer Batching, Multi-CLPs, Binarized CNN, Fixed-Point Precision, Per-Layer Quantization, Prefetching, Rearranging Memory Data, Line Buffer, Double Buffering, Padding Optimizations, Winograd Algorithm, Approximating Non-Linear AF, Roofline Model, Polyhedral Optimization, Dynamic Programming, Graph Coloring, Graph Partitioning, and Pattern Matching.
REFERENCES
[1] Y. Bengio et al., “Learning deep architectures for ai, Foundations and
trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009. 1
[2] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neu-
ral networks, vol. 61, pp. 85–117, 2015. 1
[3] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep learning.
MIT Press Cambridge, 2016, vol. 1. 1
[4] L. Zhang, S. Wang, and B. Liu, “Deep learning for sentiment analysis: A
survey, Wiley Interdisciplinary Reviews: Data Mining and Knowledge
Discovery, p. e1253, 2018. 1
[5] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning represen-
tations by back-propagating errors,” Nature, vol. 323, no. 6088, p. 533,
1986. 1
[6] Rumelhart, David E and Hinton, Geoffrey E and Williams, Ronald J,
“Neurocomputing: Foundations of research,” ch. Learning Representa-
tions by Back-propagating Errors, pp. 696–699, 1988. 1
[7] M. A. Nielsen, Neural networks and deep learning. Determination Press,
USA, 2015, vol. 25. 1
[8] T. Weyand, I. Kostrikov, and J. Philbin, “Planet-photo geolocation with
convolutional neural networks, in European Conference on Computer
Vision. Springer, 2016, pp. 37–55. 1
[9] Mathworks. What Is Deep Learning? [Online]. Available: https://www.
mathworks.com/discovery/deep-learning.html/, 2018. 1, 2
[10] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521,
no. 7553, p. 436, 2015. 2
[11] A. Deshpande. A Beginner’s Guide To Understanding Convolutional
Neural Networks [Online]. Available: https://adeshpande3.github.io/A-
Beginner%27s-Guide-To-Understanding-Convolutional-Neural-
Networks/, 2018. 2, 4
[12] J. E. Dayhoff, Neural network architectures: an introduction. Van
Nostrand Reinhold New York, 1990. 2
[13] Y. LeCun and Y. Bengio, “Convolutional networks for images, speech,
and time series,” The handbook of brain theory and neural networks, vol.
3361, no. 10, p. 1995, 1995. 2
[14] J. Hauswald, Y. Kang, M. A. Laurenzano, Q. Chen, C. Li, T. Mudge,
R. G. Dreslinski, J. Mars, and L. Tang, “Djinn and tonic: DNN as a
service and its implications for future warehouse scale computers,” in
ACM SIGARCH Computer Architecture News, vol. 43, no. 3. ACM,
2015, pp. 27–40. 2
[15] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals,
R. Monga, and G. Toderici, “Beyond short snippets: Deep networks for
video classification,” in Proceedings of the IEEE conference on computer
vision and pattern recognition, 2015, pp. 4694–4702. 2
[16] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E.
Hubbard, and L. D. Jackel, “Handwritten digit recognition with a back-
propagation network,” in Advances in neural information processing
systems, 1990, pp. 396–404. 2
[17] P. Barros, S. Magg, C. Weber, and S. Wermter, A multichannel con-
volutional neural network for hand posture recognition, in International
Conference on Artificial Neural Networks. Springer, 2014, pp. 403–410.
2
[18] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep
recurrent neural networks,” in Acoustics, speech and signal processing
(icassp), 2013 ieee international conference on. IEEE, 2013, pp. 6645–
6649. 2
[19] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck, “Learning
deep structured semantic models for web search using clickthrough data,”
in Proceedings of the 22nd ACM international conference on Conference
on information & knowledge management. ACM, 2013, pp. 2333–2338.
2
[20] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu,
“Convolutional neural networks for speech recognition, IEEE/ACM
Transactions on audio, speech, and language processing, vol. 22, no. 10,
pp. 1533–1545, 2014. 2
[21] P. Y. Simard, D. Steinkraus, and J. C. Platt, “Best practices for convo-
lutional neural networks applied to visual document analysis,” in null.
IEEE, 2003, p. 958. 2
[22] S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent convolutional neural
networks for text classification.” in AAAI, vol. 333, 2015, pp. 2267–
2273. 2
[23] Y. Kim, “Convolutional neural networks for sentence classification,”
arXiv preprint arXiv:1408.5882, 2014. 2
[24] R. Collobert and J. Weston, “A unified architecture for natural language
processing: Deep neural networks with multitask learning,” in Proceed-
ings of the 25th international conference on Machine learning. ACM,
2008, pp. 160–167. 2
[25] R. Sarikaya, G. E. Hinton, and A. Deoras, “Application of deep belief
networks for natural language understanding,” IEEE/ACM Transactions
on Audio, Speech and Language Processing (TASLP), vol. 22, no. 4, pp.
778–784, 2014. 2
[26] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and
L. Fei-Fei, “Large-scale video classification with convolutional neural
networks,” in Proceedings of the IEEE conference on Computer Vision
and Pattern Recognition, 2014, pp. 1725–1732. 2
[27] J. Mutch and D. G. Lowe, “Multiclass object recognition with sparse,
localized features,” in Computer Vision and Pattern Recognition, 2006
IEEE Computer Society Conference on, vol. 1. IEEE, 2006, pp. 11–18.
2
[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks, in Advances in neural infor-
mation processing systems, 2012, pp. 1097–1105. 2, 4, 5, 12, 13
[29] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014. 2,
4, 5
[30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,
A. Karpathy, A. Khosla, and M. Bernstein, “Imagenet large scale visual
recognition challenge,” International Journal of Computer Vision, vol.
115, no. 3, pp. 211–252, 2015. 2
[31] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,
arXiv preprint arXiv:1409.4842, vol. 7, 2015. 2, 27
[32] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time
object detection with region proposal networks,” in Advances in neural
information processing systems, 2015, pp. 91–99. 2
[33] K. Korekado, T. Morie, O. Nomura, T. Nakano, M. Matsugu, and
A. Iwata, “An image filtering processor for face/object recognition us-
ing merged/mixed analog-digital architecture, in VLSI Circuits, 2005.
Digest of Technical Papers. 2005 Symposium on. IEEE, 2005, pp. 220–
223. 2
[34] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, “A convolutional neural
network cascade for face detection,” in Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, 2015, pp. 5325–5334.
2
[35] U. Muller, J. Ben, E. Cosatto, B. Flepp, and Y. L. Cun, “Off-road
obstacle avoidance through end-to-end learning, in Advances in neural
information processing systems, 2006, pp. 739–746. 2
[36] R. Hadsell, A. Erkan, P. Sermanet, J. Ben, K. Kavukcuoglu, U. Muller,
and Y. LeCun, “A multi-range vision strategy for autonomous offroad
navigation,” Proc. Robotics and Applications (RA’07), vol. 1, no. 7, 2007.
2
[37] P. Sermanet, R. Hadsell, M. Scoffier, M. Grimes, J. Ben, A. Erkan,
C. Crudele, U. Miller, and Y. LeCun, “A multirange architecture for
collision-free off-road robot navigation, Journal of Field Robotics,
vol. 26, no. 1, pp. 52–87, 2009. 2
[38] B. Blanco-Filgueira, D. García-Lesta, M. Fernández-Sanjurjo, V. M.
Brea, and P. López, “Deep learning-based multiple object visual tracking
on embedded system for iot and mobile edge computing applications,”
arXiv preprint arXiv:1808.01356, 2018. 2
[39] P. D. McNelis, Neural networks in finance: gaining predictive edge in the
market. Academic Press, 2005. 2
[40] P. J. Lisboa and E. C. Ifeachor, Artificial neural networks in biomedicine.
Springer Science & Business Media, 2000. 2
[41] P. W. Mirowski, Y. LeCun, D. Madhavan, and R. Kuzniecky, “Comparing
SVM and convolutional networks for epileptic seizure prediction from
intracranial EEG,” in Machine Learning for Signal Processing, 2008.
MLSP 2008. IEEE Workshop on. IEEE, 2008, pp. 244–249. 2
[42] G. E. Dahl, T. N. Sainath, and G. E. Hinton, “Improving deep neural
networks for LVCSR using rectified linear units and dropout,” in Acous-
tics, Speech and Signal Processing (ICASSP), 2013 IEEE International
Conference on. IEEE, 2013, pp. 8609–8613. 2
[43] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu,
U. Muller, and Y. LeCun, “Learning long-range vision for autonomous
off-road driving, Journal of Field Robotics, vol. 26, no. 2, pp. 120–144,
2009. 2
[44] L. Deng, D. Yu et al., “Deep learning: methods and applications,
Foundations and Trends® in Signal Processing, vol. 7, no. 3–4, pp. 197–
387, 2014. 2
[45] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hi-
erarchies for accurate object detection and semantic segmentation,” in
Proceedings of the IEEE conference on computer vision and pattern
recognition, 2014, pp. 580–587. 2
[46] B. Wu, F. N. Iandola, P. H. Jin, and K. Keutzer, “Squeezedet: Unified,
small, low power fully convolutional neural networks for real-time object
detection for autonomous driving.” in CVPR Workshops, 2017, pp. 446–
454. 2
[47] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers:
Surpassing human-level performance on imagenet classification, in Pro-
ceedings of the IEEE international conference on computer vision, 2015,
pp. 1026–1034. 2
[48] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolu-
tional networks,” in European conference on computer vision. Springer,
2014, pp. 818–833. 2, 7
[49] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778. 2, 4, 5, 27
[50] Image-Net. The ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) [Online]. Available: http://image-net.org/challenges/LSVRC/,
2018. 2
[51] A.-r. Mohamed, G. E. Dahl, G. Hinton et al., “Acoustic modeling using
deep belief networks,” IEEE Trans. Audio, Speech & Language Process-
ing, vol. 20, no. 1, pp. 14–22, 2012. 2
[52] O. Nomura and T. Morie, “Projection-field-type VLSI convolutional
neural networks using merged/mixed analog-digital approach, in Inter-
national Conference on Neural Information Processing. Springer, 2007,
pp. 1081–1090. 2
[53] T. M. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman, “Project
Adam: Building an efficient and scalable deep learning training system.”
in OSDI, vol. 14, 2014, pp. 571–582. 2
[54] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hub-
bard, and L. D. Jackel, “Backpropagation applied to handwritten zip code
recognition,” Neural computation, vol. 1, no. 4, pp. 541–551, 1989. 3
[55] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing
FPGA-based accelerator design for deep convolutional neural networks,
in Proceedings of the 2015 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays. ACM, 2015, pp. 161–170. 3, 4, 12,
13, 14, 15, 16, 19, 21, 23, 29, 30, 34
[56] A. Yazdanbakhsh, J. Park, H. Sharma, P. Lotfi-Kamran, and H. Es-
maeilzadeh, “Neural acceleration for gpu throughput processors,” in
Proceedings of the 48th International Symposium on Microarchitecture.
ACM, 2015, pp. 482–493. 3
[57] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly,
A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural
networks for acoustic modeling in speech recognition: The shared views
of four research groups,” IEEE Signal processing magazine, vol. 29,
no. 6, pp. 82–97, 2012. 3
[58] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for
fast feature embedding,” in Proceedings of the 22nd ACM international
conference on Multimedia. ACM, 2014, pp. 675–678. 3, 14, 25
[59] A. Vasudevan, A. Anderson, and D. Gregg, “Parallel multi channel
convolution using general matrix multiplication, in Application-specific
Systems, Architectures and Processors (ASAP), 2017 IEEE 28th Interna-
tional Conference on. IEEE, 2017, pp. 19–24. 3
[60] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang, and
H. Yang, “Angel-eye: A complete design flow for mapping cnn onto
embedded FPGA,” IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, vol. 37, no. 1, pp. 35–47, 2018. 3, 27,
28, 35
[61] E. Nurvitadhi, G. Venkatesh, J. Sim, D. Marr, R. Huang, J. Ong
Gee Hock, Y. T. Liew, K. Srivatsan, D. Moss, S. Subhaschandra et al.,
“Can FPGAs beat GPUs in accelerating next-generation deep neural
networks?” in Proceedings of the 2017 ACM/SIGDA International Sym-
posium on Field-Programmable Gate Arrays. ACM, 2017, pp. 5–14.
3
[62] J. Misra and I. Saha, “Artificial neural networks in hardware: A survey of
two decades of progress,” Neurocomputing, vol. 74, no. 1-3, pp. 239–255,
2010. 3
[63] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Neural acceler-
ation for general-purpose approximate programs,” in Proceedings of the
2012 45th Annual IEEE/ACM International Symposium on Microarchi-
tecture. IEEE Computer Society, 2012, pp. 449–460. 3
[64] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J.
Dally, “Eie: efficient inference engine on compressed deep neural net-
work,” in Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual
International Symposium on. IEEE, 2016, pp. 243–254. 3
[65] L. Du, Y. Du, Y. Li, J. Su, Y.-C. Kuan, C.-C. Liu, and M.-C. F. Chang, “A
reconfigurable streaming deep convolutional neural network accelerator
for internet of things,” IEEE Transactions on Circuits and Systems I:
Regular Papers, vol. 65, no. 1, pp. 198–208, 2018. 3
[66] W. Vanderbauwhede and K. Benkrid, High-performance computing using
FPGAs. Springer, 2013. 3
[67] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides,
J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray et al., “A
reconfigurable fabric for accelerating large-scale datacenter services,
ACM SIGARCH Computer Architecture News, vol. 42, no. 3, pp. 13–
24, 2014. 3, 13
[68] Y. Liang, K. Rupnow, Y. Li, D. Min, M. N. Do, and D. Chen, “High-level
synthesis: productivity, performance, and software constraints,” Journal
of Electrical and Computer Engineering, vol. 2012, p. 1, 2012. 3
[69] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang,
“High-level synthesis for fpgas: From prototyping to deployment, IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Sys-
tems, vol. 30, no. 4, pp. 473–491, 2011. 3
[70] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J. H. An-
derson, S. Brown, and T. Czajkowski, “Legup: high-level synthesis for
fpga-based processor/accelerator systems,” in Proceedings of the 19th
ACM/SIGDA international symposium on Field programmable gate ar-
rays. ACM, 2011, pp. 33–36. 3
[71] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning
applied to document recognition,” Proceedings of the IEEE, vol. 86,
no. 11, pp. 2278–2324, 1998. 3, 10, 18
[72] R. Hameed, W. Qadeer, M. Wachs, O. Azizi, A. Solomatnikov, B. C. Lee,
S. Richardson, C. Kozyrakis, and M. Horowitz, “Understanding sources
of inefficiency in general-purpose chips, in ACM SIGARCH Computer
Architecture News, vol. 38, no. 3. ACM, 2010, pp. 37–47. 3
[73] S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco,
“GPUs and the future of parallel computing,” IEEE Micro, vol. 31, no. 5,
pp. 7–17, 2011. 3
[74] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for
energy-efficient dataflow for convolutional neural networks,” in ACM
SIGARCH Computer Architecture News, vol. 44, no. 3. IEEE Press,
2016, pp. 367–379. 3, 4
[75] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, “Robust
object recognition with cortex-like mechanisms,” IEEE Transactions on
Pattern Analysis & Machine Intelligence, no. 3, pp. 411–426, 2007. 3
[76] P. Joshi. What Is Local Response Normalization In
Convolutional Neural Networks [Online]. Available:
https://prateekvjoshi.com/2016/04/05/what-is-local-response-
normalization-in-convolutional-neural-networks/, 2018. 4
[77] J. Cong and B. Xiao, “Minimizing computation in convolutional neural
networks,” in International conference on artificial neural networks.
Springer, 2014, pp. 281–290. 4, 12
[78] Y. Ma, N. Suda, Y. Cao, J.-s. Seo, and S. Vrudhula, “Scalable and mod-
ularized rtl compilation of convolutional neural networks onto FPGA,
in Field Programmable Logic and Applications (FPL), 2016 26th Inter-
national Conference on. IEEE, 2016, pp. 1–8. 4, 16, 25, 30, 32, 33,
35
[79] D. F. Bacon, S. L. Graham, and O. J. Sharp, “Compiler transformations
for high-performance computing,” ACM Computing Surveys (CSUR),
vol. 26, no. 4, pp. 345–420, 1994. 4, 16, 19
[80] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-s.
Seo, and Y. Cao, “Throughput-optimized opencl-based FPGA accelerator
for large-scale convolutional neural networks, in Proceedings of the
2016 ACM/SIGDA International Symposium on Field-Programmable
Gate Arrays. ACM, 2016, pp. 16–25. 4, 5, 6, 14, 16, 17, 20, 21, 22, 29,
30, 31, 32, 34
[81] M. Denil, B. Shakibi, L. Dinh, N. De Freitas et al., “Predicting parameters
in deep learning,” in Advances in neural information processing systems,
2013, pp. 2148–2156. 4, 5
[82] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltz-
mann machines,” in Proceedings of the 27th international conference on
machine learning (ICML-10), 2010, pp. 807–814. 4
[83] Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, “Optimizing loop operation and
dataflow in FPGA acceleration of deep convolutional neural networks,
in Proceedings of the 2017 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays. ACM, 2017, pp. 45–54. 4, 19, 24,
25, 26, 28, 30, 35
[84] A. Karpathy. Convolutional Neural Networks for Visual Recognition
[Online]. Available: http://cs231n.github.io/convolutional-networks/,
2018. 4
[85] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual
networks,” in European conference on computer vision. Springer, 2016,
pp. 630–645. 5
[86] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4,
inception-resnet and the impact of residual connections on learning.” in
AAAI, vol. 4, 2017, p. 12. 5
[87] J. Villasenor and W. H. Mangione-Smith, “Configurable computing,
Scientific American, vol. 276, no. 6, pp. 66–71, 1997. 5, 6
[88] S. D. Brown, R. J. Francis, J. Rose, and Z. G. Vranesic, Field-
programmable gate arrays. Springer Science & Business Media, 2012,
vol. 180. 6
[89] M. C. Herbordt, Y. Gu, T. VanCourt, J. Model, B. Sukhwani, and
M. Chiu, “Computing models for FPGA-based accelerators,” Computing
in science & engineering, vol. 10, no. 6, pp. 35–45, 2008. 6
[90] B. S. C. Varma, K. Paul, and M. Balakrishnan, Architecture exploration
of FPGA based accelerators for BioInformatics applications. Springer,
2016. 6
[91] G. Lacey, G. W. Taylor, and S. Areibi, “Deep learning on FPGAs: Past,
present, and future,” arXiv preprint arXiv:1602.04283, 2016. 6
[92] C. Farabet, Y. LeCun, K. Kavukcuoglu, E. Culurciello, B. Martini, P. Ak-
selrod, and S. Talay, “Large-scale FPGA-based convolutional networks,”
Scaling up Machine Learning: Parallel and Distributed Approaches, pp.
399–419, 2011. 6
[93] A. Munshi, “The opencl specification,” in Hot Chips 21 Symposium
(HCS), 2009 IEEE. IEEE, 2009, pp. 1–314. 6
[94] J. E. Stone, D. Gohara, and G. Shi, “OpenCL: A parallel programming
standard for heterogeneous computing systems,” Computing in science
& engineering, vol. 12, no. 3, pp. 66–73, 2010. 6
[95] A. R. Omondi and J. C. Rajapakse, FPGA implementations of neural
networks. Springer, 2006, vol. 365. 6
[96] H. M. Waidyasooriya, M. Hariyama, and K. Uchiyama, Design of FPGA-
Based Computing Systems with OpenCL. Springer, 2018. 6
[97] V. Sze, Y.-H. Chen, J. Emer, A. Suleiman, and Z. Zhang, “Hardware for
machine learning: Challenges and opportunities,” in Custom Integrated
Circuits Conference (CICC), 2018 IEEE. IEEE, 2018, pp. 1–8. 6
[98] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, and
S. Song, “Going deeper with embedded FPGA platform for convolutional
neural network,” in Proceedings of the 2016 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays. ACM, 2016, pp. 26–
35. 6, 13, 14, 20, 29, 30, 32, 34
[99] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao,
Y. Wang et al., “Ese: Efficient speech recognition engine with sparse
lstm on FPGA,” in Proceedings of the 2017 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays. ACM, 2017, pp. 75–
84. 6
[100] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi, “A dynam-
ically configurable coprocessor for convolutional neural networks, in
ACM SIGARCH Computer Architecture News, vol. 38, no. 3. ACM,
2010, pp. 247–257. 6, 10, 11, 29, 32, 34
[101] C. F. Van Loan, “Matrix computations (johns hopkins studies in mathe-
matical sciences),” 1996. 6, 7, 13
[102] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploit-
ing linear structure within convolutional networks for efficient evalua-
tion,” in Advances in neural information processing systems, 2014, pp.
1269–1277. 7
[103] G. Guennebaud, B. Jacob, M. Lenz et al., “Eigen v3, 2010,” URL
http://eigen. tuxfamily. org, 2015. 7
[104] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and con-
nections for efficient neural network, in Advances in neural information
processing systems, 2015, pp. 1135–1143. 7
[105] Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal brain damage,” in
Advances in neural information processing systems, 1990, pp. 598–605.
7
[106] S. J. Hanson and L. Y. Pratt, “Comparing biases for minimal network
construction with back-propagation,” in Advances in neural information
processing systems, 1989, pp. 177–185. 7
[107] B. Hassibi and D. G. Stork, “Second order derivatives for network
pruning: Optimal brain surgeon,” in Advances in neural information
processing systems, 1993, pp. 164–171. 7
[108] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep
neural networks with pruning, trained quantization and huffman coding,
arXiv preprint arXiv:1510.00149, 2015. 7
[109] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam,
“Diannao: A small-footprint high-throughput accelerator for ubiquitous
machine-learning,” ACM Sigplan Notices, vol. 49, no. 4, pp. 269–284,
2014. 7, 8
[110] Y. LeCun, “The mnist database of handwritten digits, http://yann. lecun.
com/exdb/mnist/, 1998. 7
[111] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen,
Z. Xu, N. Sun et al., “Dadiannao: A machine-learning supercomputer,
in Proceedings of the 47th Annual IEEE/ACM International Symposium
on Microarchitecture. IEEE Computer Society, 2014, pp. 609–622. 7, 8
[112] T. Luo, S. Liu, L. Li, Y. Wang, S. Zhang, T. Chen, Z. Xu, O. Temam,
and Y. Chen, “Dadiannao: A neural network supercomputer, IEEE
Transactions on Computers, vol. 66, no. 1, pp. 73–88, 2017. 7, 8
[113] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Teman, X. Feng, X. Zhou,
and Y. Chen, “Pudiannao: A polyvalent machine learning accelerator, in
ACM SIGARCH Computer Architecture News, vol. 43, no. 1. ACM,
2015, pp. 369–381. 7, 8
[114] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen,
and O. Temam, “Shidiannao: Shifting vision processing closer to the
sensor, in ACM SIGARCH Computer Architecture News, vol. 43, no. 3.
ACM, 2015, pp. 92–104. 8
[115] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processing of
deep neural networks: A tutorial and survey,” Proceedings of the IEEE,
vol. 105, no. 12, pp. 2295–2329, 2017. 8
[116] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Stra-
chan, M. Hu, R. S. Williams, and V. Srikumar, “Isaac: A convolutional
neural network accelerator with in-situ analog arithmetic in crossbars,”
ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 14–
26, 2016. 8
[117] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie,
“Prime: A novel processing-in-memory architecture for neural network
computation in reram-based main memory, in ACM SIGARCH Com-
puter Architecture News, vol. 44, no. 3. IEEE Press, 2016, pp. 27–39.
8
[118] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, “Flexflow: A flexible
dataflow accelerator architecture for convolutional neural networks, in
High Performance Computer Architecture (HPCA), 2017 IEEE Interna-
tional Symposium on. IEEE, 2017, pp. 553–564. 8
[119] J. Cloutier, E. Cosatto, S. Pigeon, F. R. Boyer, and P. Y. Simard, “Vip:
An FPGA-based processor for image processing and neural networks,”
in Microelectronics for Neural Networks, 1996., Proceedings of Fifth
International Conference on. IEEE, 1996, pp. 330–336. 9, 34
[120] D. F. Wolf, R. A. Romero, and E. Marques, “Using embedded proces-
sors in hardware models of artificial neural networks,” in V Simposio
Brasileiro de automação inteligente, Brasil, 2001. 9
[121] K. R. Nichols, M. A. Moussa, and S. M. Areibi, “Feasibility of floating-
point arithmetic in FPGA based artificial neural networks,” in In CAINE.
Citeseer, 2002. 9
[122] K. Benkrid and S. Belkacemi, “Design and implementation of a 2d
convolution core for video applications on FPGAs, in Digital and
Computational Video, 2002. DCV 2002. Proceedings. Third International
Workshop on. IEEE, 2002, pp. 85–92. 9
[123] F. Cardells-Tormo and P.-L. Molinet, “Area-efficient 2-d shift-variant
convolvers for FPGA-based digital image processing, in Signal Pro-
cessing Systems Design and Implementation, 2005. IEEE Workshop on.
IEEE, 2005, pp. 209–213. 9
[124] R. G. Gironés, R. C. Palero, J. C. Boluda, and A. S. Cortés, “FPGA
implementation of a pipelined on-line backpropagation,” Journal of VLSI
signal processing systems for signal, image and video technology, vol. 40,
no. 2, pp. 189–213, 2005. 9
[125] H. Zhang, M. Xia, and G. Hu, “A multiwindow partial buffering scheme
for FPGA-based 2-d convolvers, IEEE Transactions on Circuits and
Systems II: Express Briefs, vol. 54, no. 2, pp. 200–204, 2007. 9
[126] A. W. Savich, M. Moussa, and S. Areibi, “The impact of arithmetic
representation on implementing mlp-bp on FPGAs: A study, IEEE
transactions on neural networks, vol. 18, no. 1, pp. 240–252, 2007. 9
[127] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, “Cnp: An FPGA-based
processor for convolutional networks, in Field Programmable Logic and
Applications, 2009. FPL 2009. International Conference on. IEEE,
2009, pp. 32–37. 9, 11, 16, 29, 34
[128] Y. LeCun et al., “Lenet-5, convolutional neural networks, URL:
http://yann. lecun. com/exdb/lenet, p. 20, 2015. 9
[129] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic,
E. Cosatto, and H. P. Graf, “A massively parallel coprocessor for convo-
lutional neural networks,” in Application-specific Systems, Architectures
and Processors, 2009. ASAP 2009. 20th IEEE International Conference
on. IEEE, 2009, pp. 53–60. 9, 10, 29, 34
[130] H. P. Graf, S. Cadambi, V. Jakkula, M. Sankaradass, E. Cosatto,
S. Chakradhar, and I. Dourdanovic, A massively parallel digital learning
processor, in Advances in Neural Information Processing Systems,
2009, pp. 529–536. 10
[131] S. Cadambi, I. Durdanovic, V. Jakkula, M. Sankaradass, E. Cosatto,
S. Chakradhar, and H. P. Graf, “A massively parallel FPGA-based co-
processor for support vector machines,” in 2009 17th IEEE Symposium
on Field Programmable Custom Computing Machines. IEEE, 2009, pp.
115–122. 10
[132] S. Cadambi, A. Majumdar, M. Becchi, S. Chakradhar, and H. P. Graf,
“A programmable parallel accelerator for learning and classification, in
Proceedings of the 19th international conference on Parallel architectures
and compilation techniques. ACM, 2010, pp. 273–284. 10, 17, 29, 32,
34
[133] J. C. Platt, “12 fast training of support vector machines using sequential
minimal optimization,” Advances in kernel methods, pp. 185–208, 1999.
10
[134] B. Bai, J. Weston, D. Grangier, R. Collobert, K. Sadamasa, Y. Qi,
O. Chapelle, and K. Weinberger, “Learning to rank with (a lot of) word
features,” Information retrieval, vol. 13, no. 3, pp. 291–314, 2010. 10
[135] J. MacQueen et al., “Some methods for classification and analysis of mul-
tivariate observations, in Proceedings of the fifth Berkeley symposium
on mathematical statistics and probability, vol. 1, no. 14. Oakland, CA,
USA, 1967, pp. 281–297. 10
[136] A. Sato and K. Yamada, “Generalized learning vector quantization, in
Advances in neural information processing systems, 1996, pp. 423–429.
10
[137] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, “Face recognition:
A convolutional neural-network approach, IEEE transactions on neural
networks, vol. 8, no. 1, pp. 98–113, 1997. 10, 11
[138] K. Chellapilla, S. Puri, and P. Simard, “High performance convolutional
neural networks for document processing,” in Tenth International Work-
shop on Frontiers in Handwriting Recognition. Suvisoft, 2006. 10, 14
[139] F. Nasse, C. Thurau, and G. A. Fink, “Face detection using gpu-based
convolutional neural networks, in International Conference on Computer
Analysis of Images and Patterns. Springer, 2009, pp. 83–90. 10, 11
[140] J. D. Dixon, “Asymptotically fast factorization of integers, Mathematics
of computation, vol. 36, no. 153, pp. 255–260, 1981. 11
[141] P. L. Montgomery, “A survey of modern integer factorization algorithms,”
CWI quarterly, vol. 7, no. 4, pp. 337–365, 1994. 11
[142] C. Farabet, B. Martini, P. Akselrod, S. Talay, Y. LeCun, and E. Culur-
ciello, “Hardware accelerated convolutional neural networks for synthetic
vision systems,” in Circuits and Systems (ISCAS), Proceedings of 2010
IEEE International Symposium on. IEEE, 2010, pp. 257–260. 11
[143] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. Le-
Cun, “Neuflow: A runtime reconfigurable dataflow processor for vision,
in Computer Vision and Pattern Recognition Workshops (CVPRW), 2011
IEEE Computer Society Conference on. IEEE, 2011, pp. 109–116. 11,
29, 34
[144] R. Collobert, C. Farabet, K. Kavukcuoglu et al., “Torch, in Workshop on
Machine Learning Open Source Software, NIPS, vol. 76, 2008. 11
[145] D. Grangier, L. Bottou, and R. Collobert, “Deep convolutional networks
for scene parsing,” in ICML 2009 Deep Learning Workshop, vol. 3, no. 6.
Citeseer, 2009, p. 109. 11, 29
[146] M. Peemen, A. A. Setio, B. Mesman, and H. Corporaal, “Memory-centric
accelerator design for convolutional neural networks, in Computer De-
sign (ICCD), 2013 IEEE 31st International Conference on. IEEE, 2013,
pp. 13–19. 11, 29, 33, 34
[147] A. Beric, J. van Meerbergen, G. de Haan, and R. Sethuraman, “Memory-
centric video processing,” IEEE Transactions on Circuits and Systems for
Video Technology, vol. 18, no. 4, pp. 439–452, 2008. 11, 33
[148] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, “A 240
g-ops/s mobile coprocessor for deep neural networks,” in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition
Workshops, 2014, pp. 682–687. 12, 28, 29, 30, 34
[149] C. Farabet, C. Poulet, and Y. LeCun, An FPGA-based stream processor
for embedded real-time vision with convolutional networks, in Com-
puter Vision Workshops (ICCV Workshops), 2009 IEEE 12th Interna-
tional Conference on. IEEE, 2009, pp. 878–885. 12
[150] S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful
visual performance model for multicore architectures,” Communications
of the ACM, vol. 52, no. 4, pp. 65–76, 2009. 12, 32
[151] L.-N. Pouchet, P. Zhang, P. Sadayappan, and J. Cong, “Polyhedral-based
data reuse optimization for configurable computing,” in Proceedings of
the ACM/SIGDA international symposium on Field programmable gate
arrays. ACM, 2013, pp. 29–38. 12
[152] K. Ovtcharov, O. Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, and E. S.
Chung, “Accelerating deep convolutional neural networks using special-
ized hardware,” Microsoft Research Whitepaper, vol. 2, no. 11, 2015. 13,
19, 29, 30, 32
[153] C. Zhang, G. Sun, Z. Fang, P. Zhou, P. Pan, and J. Cong, “Caffeine:
Towards uniformed representation and acceleration for deep convolu-
tional neural networks,” IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, 2018. 13, 14, 29, 32, 34
[154] B. Bosi, G. Bois, and Y. Savaria, “Reconfigurable pipelined 2-d con-
volvers for fast digital signal processing, IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, vol. 7, no. 3, pp. 299–308, 1999.
13, 22, 28
[155] Y. Wang, J. Xu, Y. Han, H. Li, and X. Li, “Deepburning: automatic
generation of FPGA-based learning accelerators for the neural network
family, in Proceedings of the 53rd Annual Design Automation Confer-
ence. ACM, 2016, p. 110. 14, 20, 29, 30, 32, 34
[156] K. O. W. Group et al., “The opencl specification version 1.1, http://www.
khronos. org/registry/cl/specs/opencl-1.1. pdf, 2011. 14
[157] M. S. Abdelfattah, A. Hagiescu, and D. Singh, “Gzip on a chip: High
performance lossless data compression on FPGAs using opencl,” in
Proceedings of the International Workshop on OpenCL 2013 & 2014.
ACM, 2014, p. 4. 14
[158] Altera. OpenCL Design Examples [Online]. Available:
https://www.altera.com/support/support-resources/designexamples/
design-software/opencl.html/, 2018. 14
[159] Nallatech. P395-D8 OpenCL FPGA Accelerator Cards [Online]. Avail-
able: http://www.nallatech.com/wp-content/uploads/openclcardspb_v1_
51.pdf/, 2018. 14
[160] Altera. DE5-Net FPGA Kit User Manual [Online]. Available: ftp://ftp.
altera.com/up/pub/Altera_Material/Boards/DE5/DE5_User_/, 2018. 14
[161] R. C. Whaley and J. J. Dongarra, “Automatically tuned linear algebra
software,” in Supercomputing, 1998. SC98. IEEE/ACM Conference on.
IEEE, 1998, pp. 38–38. 14
[162] C. Zhang, Z. Fang, P. Zhou, P. Pan, and J. Cong, “Caffeine: Towards
uniformed representation and acceleration for deep convolutional neural
networks,” in Computer-Aided Design (ICCAD), 2016 IEEE/ACM In-
ternational Conference on. IEEE, 2016, pp. 1–8. 14, 21, 29, 30, 32,
34
[163] W. Zuo, Y. Liang, P. Li, K. Rupnow, D. Chen, and J. Cong, “Improving
high level synthesis optimization opportunity through polyhedral trans-
formations,” in Proceedings of the ACM/SIGDA international sympo-
sium on Field programmable gate arrays. ACM, 2013, pp. 9–18. 15
[164] E. A. Lee and D. G. Messerschmitt, “Synchronous data flow,” Proceed-
ings of the IEEE, vol. 75, no. 9, pp. 1235–1245, 1987. 15
[165] S. I. Venieris and C.-S. Bouganis, “FPGAconvnet: A framework for map-
ping convolutional neural networks on fpgas, in Field-Programmable
Custom Computing Machines (FCCM), 2016 IEEE 24th Annual Inter-
national Symposium on. IEEE, 2016, pp. 40–47. 15, 20, 22, 29, 32,
34
[166] C. R. Reeves, Modern heuristic techniques for combinatorial problems.
Advanced topics in computer science. Mc Graw-Hill, 1995. 16
[167] L. Cavigelli, M. Magno, and L. Benini, “Accelerating real-time embed-
ded scene labeling with convolutional networks, in Proceedings of the
52nd Annual Design Automation Conference. ACM, 2015, p. 108. 16,
29
AHMAD SHAWAHNA received the M.S. degree in computer engineering from King Fahd University of Petroleum & Minerals (KFUPM), Saudi Arabia, in 2016, and the B.Sc. degree in computer engineering from An-Najah National University, Palestine, in 2012. He is currently a Ph.D. student in the Department of Computer Engineering at KFUPM and works at the Center for Communications and IT Research (CCITR), KFUPM. His research interests include hardware accelerators, deep learning, CNNs, FPGAs, wireless security, network security, the Internet of Things (IoT), and cloud computing.
SADIQ M. SAIT (Senior Member, IEEE) obtained his Bachelor's degree in Electronics Engineering from Bangalore University in 1981, and his Master's and Ph.D. degrees in Electrical Engineering from KFUPM in 1983 and 1987, respectively. In 1981, he received the Best Electronic Engineer Award from the Indian Institute of Electrical Engineers, Bangalore (where he was born). He has authored over 200 research papers, contributed chapters to technical books, and lectured in over 25 countries, and he is the principal author of two books. He is currently Professor of Computer Engineering and the Director of the Center for Communications and IT Research at KFUPM.
AIMAN EL-MALEH is a Professor in the Computer Engineering Department at King Fahd University of Petroleum & Minerals (KFUPM). He holds a B.Sc. in Computer Engineering, with first honors, from KFUPM (1989), an M.A.Sc. in Electrical Engineering from the University of Victoria, Canada (1991), and a Ph.D. in Electrical Engineering, with dean's honor list, from McGill University, Canada (1995). He was a member of the scientific staff at Mentor Graphics Corp., a leader in design automation, from 1995 to 1998. Dr. El-Maleh received the Excellence in Teaching Award from KFUPM in 2001/2002, 2006/2007, and 2011/2012, the Excellence in Advising Award in 2013/2014 and 2017/2018, the Excellence in Research Award in 2010/2011 and 2015/2016, and the First Instructional Technology Award in 2009/2010. His research interests are in the synthesis, testing, and verification of digital systems, as well as defect and soft-error tolerant design, VLSI design, design automation, efficient FPGA implementations of deep learning algorithms, and data compression techniques. Dr. El-Maleh won the best paper award for the most outstanding contribution in the field of test at the 1995 European Design & Test Conference, and his paper at the 1995 Design Automation Conference was also nominated for a best paper award. He holds five US patents.