Efficient Object Detection Using Embedded Binarized Neural Networks

June 2018
Journal of Signal Processing Systems 90(11):1-14

DOI:10.1007/s11265-017-1255-5

Authors:

Jaeha Kung

Daegu Gyeongbuk Institute of Science and Technology

David C. Zhang

SRI International, Princeton, New Jersey, United States

G. S. van der Wal

SRI International

Submitted to Journal of Signal Processing Systems, Special Issue on Embedded Computer Vision, August 12, 2016

Abstract—Memory performance is a key bottleneck for

deep learning systems. Binarization of both activations and

weights is one promising approach that can best scale to

realize the highest energy efficient system using the lowest

possible precision. In this paper, we utilize and analyze the

binarized neural network in doing human detection on

infrared images. Our results show comparable algorithmic

performance of binarized versus 32bit floating-point

networks, with the added benefit of greatly simplified

computation and reduced memory overhead. In addition,

we present a system architecture designed specifically for

computation using binary representation that achieves at

least a 4× speedup and improves energy efficiency by three

orders of magnitude over a GPU.

Index Terms—Deep learning, embedded computer vision,

binary neural network, low-power object detection

I. INTRODUCTION

Large-scale deep neural networks (DNN) have been

successfully used in a number of tasks from image recognition

to natural language processing [1-4]. It is well understood that

to capture the wide spectrum of low, mid and high-level

representations of complex patterns, networks with many

layers, nodes, and with high local and global connectivity are

needed [5]. The success of recent deep learning algorithms

comes in part from the ability to train much larger DNN on

much larger datasets than was previously possible.

The networks to solve complex and real-world problems are

getting deeper and larger. This trend inevitably makes both

training and inference computationally challenging and

memory-intensive. As such, there has been a growing line of

research focused on minimizing the memory utilization and the

energy consumption from the hardware perspective [6-9].

These approaches limit the bit-precision using fixed-point

instructions, utilize approximate hardware, or perform

data compression after training.

There has been another set of approaches that targets a

specific bit-precision during training, e.g. training with 8bit DNN

weights instead of 32bit. Results have shown that these

approaches can arrive at DNN weight parameters of lower

precision with negligible loss in performance [10-13]. In

addition, incorporating lower bit-precision during training helps

reduce training time, memory footprint, and bitwise

computational requirements.

In this paper, we use a binarized neural network (BNN) as the

algorithmic approach for our embedded DNN processor

because BNN offers the largest savings in compute and memory

resources (i.e. binary is the lowest possible precision). For an in-depth

performance comparison, we analyze our BNN algorithm and

processor design for an image recognition application that

detects humans in the infrared (IR) spectrum. We analyze the

algorithmic performance against 32bit DNNs with various

model parameters such as number of kernels and synaptic

connections.

We also designed a system architecture suited for BNN (as an

ASIC or on an FPGA). We leverage local memories to

store all BNN weights in order to speed up overall processing.

We then estimate the computing performance of our BNN

architecture and compare it to that of a GPU. Major computing

blocks in the BNN system are designed and synthesized to

estimate the power consumption of those functional blocks.

Our main contributions can be summarized as follows:

• We demonstrate learning of visual features agnostic to thermal variations to identify human subjects in infrared images by using BNN.

• We provide a detailed characterization of the impact of binarization on object detection.

• We show that BNN performance is close to that of the 32bit network. The binary weights in hyperspace are not a linear scaling of those obtained from the 32bit network; nevertheless, they are another valid representation for the trained NN.

• We analyze and design the system architecture for BNN, with performance and energy efficiency gains.

We anticipate that the results shown in this paper confirm the

suitability of BNNs for embedded platforms. We provide

demonstrable results on a complex application such as IR

human detection. We anticipate the use of BNNs for a variety

of embedded vision platforms, from smart cameras to

smartphone apps.

This paper is organized as follows: in Section II, we present

previous research work on low-precision neural networks

including BNN. In Section III, we explain our dataset captured

from different infrared cameras. In Section IV, we analyze

simulation results of BNN on our IR dataset in

comparison with 32bit simulation results. In Section V, we explore

the BNN system architecture and present performance/energy

estimates. Finally, we present our conclusions and discuss our

future work in Section VI.

II. PREVIOUS WORK: LOW-PRECISION NNS

Over the last decade, DNN parameter counts have continued to grow dramatically. LeNet5, a Convolutional Neural Network (CNN) presented in 1998 [14], uses ~60K parameters to classify handwritten digits. The AlexNet CNN [1] uses 61M parameters to win the ImageNet image classification competition in 2012. More recently, new DNNs such as DeepFace use 120M parameters for human face identification [15], and there are even networks with 10B parameters [16]. The current trend in achieving a higher level of algorithmic performance is to increase the number of DNN layers to capture high-level semantic reasoning, and as such, memory bottlenecks will continue to be the central computing issue [17].

Jaeha Kung†, David Zhang*, Gooitzen van der Wal*, Sek Chai*, Saibal Mukhopadhyay†

Georgia Institute of Technology, Atlanta, GA†

SRI International, Princeton, NJ*

Needless to say, the computation power and run-time

memory footprint required for these modern DNNs in inference

mode exceed the power and memory size budgets of a typical

mobile device or embedded platform. To tackle these

problems, different approaches including limiting bit-precision

[6,7], utilizing approximate hardware [8] or compressing data

[9] were proposed for inference mode. The efficient

manipulation of data precisions in CPUs to speed up the DNN

computation (also, reduce memory footprint by using 8bit) was

presented in [6]. In [7], a greedy algorithm was presented to

automatically set a reduced bit-precision for each synapse by

looking at its error sensitivity. The utilization of approximate

hardware with the fixed-point computation further reduces

energy consumption [7,8]. In [9], a compression method in

reducing the memory requirement was also explored.

To limit the bit-precision more aggressively in an effort to reduce memory, power usage and training time, approaches that target very low precision already during training were presented [10-13]. Han et al. use a 4bit code book for weight parameters and update centroids during training in [10]; data pruning and data compression were performed to fit parameters into local memory. More aggressively, [11,12] proposed binary weights and activations in training mode (and also in inference) and obtained good results on well-known benchmarks. Both approaches keep binary weights and activations for all computations, except that 32bit weight parameters are used during the weight update. As an extension to this approach, 1bit weights, 2bit activations and 4bit gradients are presented to improve performance on a complex image recognition dataset [13].

All these methods target the same goals: i) reduce the memory requirement and ii) lower the energy consumption (either by power reduction or speed improvement). As limiting bit-precision prior to training successfully reduces bit-precision to binary, we select BNN [11] as our test prototype to realize a low-power embedded vision system. BNN has a simpler binarization than its counterparts [12,13] and shows good accuracy on simple image classification tasks. An overview of the BNN algorithm is shown in Figure 1. The important points to note are that the weight update is done on real values (e.g. 32bit floating-point) and that the input is 8bit only during inference. In the following section, we describe the dataset we use to verify the applicability of BNN to real embedded applications.
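As a rough illustration of the two points noted above, the following Python sketch keeps real-valued weights for the update and uses only their signs in the forward pass. It is a minimal sketch, not the training procedure of [11]: the function names, the learning rate, the gradient values and the clipping of the real-valued weights to [-1, 1] are assumptions made for illustration.

```python
def binarize(values):
    """Deterministic sign binarization: map each real value to +1 or -1."""
    return [1 if v >= 0 else -1 for v in values]


def update_real_weights(real_weights, grads, lr=0.1):
    """Gradient step on the real-valued weights.

    Clipping to [-1, 1] is an assumption of this sketch, not taken from the text.
    """
    return [max(-1.0, min(1.0, w - lr * g)) for w, g in zip(real_weights, grads)]


real_w = [0.30, -0.05, 0.80]      # real-valued weights kept for the update
x = [1, -1, 1]                    # binary activations (+1 / -1)

w_b = binarize(real_w)            # binary weights used in the forward pass
y = sum(a * b for a, b in zip(x, w_b))   # binary dot product

grads = [1.0, -1.0, 2.0]          # hypothetical gradients, for illustration only
real_w = update_real_weights(real_w, grads)
w_b_next = binarize(real_w)       # the second weight has crossed zero and flips sign
```

The point of the example is that small real-valued updates accumulate in `real_w`, while the forward pass only ever sees the binarized copy.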

III. INFRARED DATASET

A. IR Data Collection

Traditional IR detection systems rely on manual coding of

visual features, resulting in missed detections and high false

alarm rates (FAR). Deep learning algorithms have been used

for object classification with great success, but current

approaches are focused on images in the visible spectrum, with

high resolution and contrast. Commercial deep learning

systems are cloud-based and geared towards performance

results at the cost of high power consumption and are less

concerned with real-time operation. IR cameras operate day

and night, but they pose additional challenges for deep learning

algorithms due to the fact that images have no color cues and

there are thermal variations for the same object. Small objects

in different poses are difficult to identify because they “look

like blobs” with little texture discrimination.

With BNNs, we take advantage of the lack of color data to train a network specifically on spatial features. We have previously shown that we can train DNNs separately on spatial and texture data, and then combine the results [18]. The resulting DNN can be trained faster using a higher learning rate, and also achieves higher detection performance because we are able to direct the learning on spatial features and texture separately. For this effort, we treat our BNN training in a semi-supervised manner, specifically for spatial features (e.g. edges, shapes, etc.), and we treat textural data as a component where precision can be lossy.

Figure 1. Overview of binarized neural network (BNN): The training of BNN (top) and the inference mode in BNN (bottom).

We collected a large sample of IR images of human activity,

both indoors and outdoors, on stationary and moving camera

platforms, at different times of the day to capture different

thermal background temperatures. The objects range from

medium to long range (1-100m) such that distant objects can be

as small as 20x20 pixels. We further incorporated thermal

images of pedestrians (in lab, school lounge, and marathon)

from [19] into our training set. Based on our experience, we

anticipate the need for a minimum of 100 frames for each

person in any particular pose. This is approximately five

seconds with a 20 frame-per-second camera.

B. Data Augmentation

Deep learning requires large training data sets, and we

address the challenges in data annotation with our data

ingestion software that labels image regions automatically. Our

data collection task is to ensure that we collect enough data

samples with sufficient variations in body poses and thermal

contrast. We then prepare the training set by selecting the

proper regions in the image with human subjects. We do this

labeling automatically using a combination of hot-spot and

motion tracking, similar to work published previously [20].

Negative samples are selected in the non-human area per frame,

based on object proposals suggested by Selective Search [21].

To add further resilience to our training set, we

augment each image in the training set with variations that

are specific to thermal imagery (e.g. contrast change, resolution

blur, Gaussian noise and geometric distortions). In the context

of this paper, we call them augment noise or simply noise. This

is in addition to spatial augmentation (e.g. rotation, skew, shift)

that is commonly applied to training data in visible spectrum.

We have found, early on, that augmentation specific to IR

images is a key step in training in this spectral range so that our

DNN can be robust to realistic conditions (e.g. adverse weather,

ambient temperature, sensor noise and object distance/size). In

total, we have a sample set of 255k positive and 658k negative

samples.

IV. ALGORITHMIC EVALUATION OF BNN

A. Training

In addition to our IR dataset, we also used the MNIST dataset as a reference benchmark [14]. Although reported results suggest deeper and larger networks for more complex vision tasks [13,17], the IR dataset has only two classes (i.e. human and non-human), so we started with a shallow network like LeNet5. LeNet5 is regarded as the baseline network; it has 2 convolutional layers (with max pooling) and 3 fully-connected layers. In order to see the impact of different network configurations, such as the size of convolution kernels, the number of kernels, the size of the FC layer, and the number of convolutional layers, we studied four different variants of LeNet5, each trained on the MNIST and IR datasets. The performance of these configurations is reported in the next subsection.

Training of each network is done over 500 epochs with 50 mini-batches per epoch. The loss curves of both BNN and the 32bit network are compared on the MNIST and IR datasets (Figure 3). For MNIST, 40,000 training samples (input size: 28×28) are used to train the network [14]. For the IR dataset, 20,000 training samples (input size: 128×128) with data augmentation as explained in Section III.B are used. 30% of the samples are fed into the network without any noise, while the remaining 70% are pre-processed with a randomized augment noise before being fed to the NN for training.
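The 30%/70% split between clean and noise-augmented samples can be sketched as follows. This is only an illustrative sketch: the helper names, the contrast and noise ranges, and the choice of just two of the augmentations from Section III.B (contrast change and Gaussian noise) are assumptions, not the authors' data ingestion software.

```python
import random


def change_contrast(img, gain):
    """Scale 8bit pixel values about mid-gray (128) and clamp to [0, 255]."""
    return [[min(255, max(0, round(128 + gain * (p - 128)))) for p in row]
            for row in img]


def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise to every pixel and clamp to [0, 255]."""
    return [[min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
            for row in img]


def maybe_augment(img, rng, noise_fraction=0.7):
    """Return (image, was_augmented): about 30% of samples stay clean."""
    if rng.random() >= noise_fraction:
        return img, False
    gain = rng.uniform(0.5, 1.5)     # hypothetical contrast range
    sigma = rng.uniform(1.0, 10.0)   # hypothetical noise level
    return add_gaussian_noise(change_contrast(img, gain), sigma, rng), True


rng = random.Random(0)               # fixed seed so the sketch is repeatable
sample = [[0, 128], [200, 255]]      # tiny stand-in for a 128x128 IR patch
augmented, was_augmented = maybe_augment(sample, rng)
```

Drawing the augmentation parameters per sample, rather than fixing them, is what makes the augment noise "randomized" in the sense used above.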

The convergence speed is important in terms of training time (equivalently, energy consumption). With augment noise in the IR dataset, both the 32bit network and BNN see different variations in the input data distribution and thus take longer to converge. For both the MNIST and IR datasets, the final loss obtained with the 32bit network was lower than that of BNN.

Figure 2. IR dataset captured from multiple infrared cameras followed by data augmentation to handle noise from cameras.

Figure 3. The loss curves during the training with BNN and 32bit on MNIST (left) and IR (right) datasets.

Previously, there was little analysis of the main differences in learned kernels between BNN and the 32bit network. The kernels in the first convolutional layer of LeNet5 (for both BNN and 32bit) are shown in Figure 4. The binary kernels let BNN capture edge or shape information. The learned kernels in the 32bit network identify texture as well as edge/shape information. As a result, BNN mostly relies on the edges of input images. The 32bit network, however, captures the human subject together with some background information. As we will discuss in Section IV.C, the edge-oriented BNN nevertheless has a benefit over the 32bit network for some input types.

B. Accuracy Evaluation

To evaluate the accuracy of BNN, we compare it with a 32bit network on classification of both the MNIST and IR datasets. Table I summarizes the specifications of the networks that we configured for testing. LeNet5 is the baseline (v1). Regarding the variants of LeNet5, network v2 has larger convolutional kernels, v3 has larger fully-connected layers, v4 has more kernels and v5 has one more convolutional layer. The number of parameters of each network on the IR dataset is also reported; it is the same for both BNN and 32bit. The reported memory size on the IR dataset is for BNN only (for 32bit, we can simply multiply the number by 32). The number of test samples is 10,000 for MNIST and 4,000 for the IR dataset.

Figure 5 shows the simulation results, in terms of recognition accuracy, on both the MNIST and IR datasets. As explained in Section III.B, noise due to various factors may deteriorate the resolution of video frames to be processed from mobile or embedded cameras. Thus, we tested two different cases for the IR dataset: one without noise on the test set and the other with noise (to verify the noise resilience of our trained model). For all cases, BNN shows high recognition accuracy, close to that of the 32bit network (for MNIST, 32bit always performs slightly better). An interesting observation on the noisy IR dataset is that BNN performs slightly better than the 32bit network in most configurations. This indicates that the 32bit network is more prone to certain noise in input images. A possible justification is addressed in the following subsection.

In terms of network specifications, having larger kernels in BNN on the MNIST dataset reduces the recognition accuracy by 3.1%. However, a larger FC layer or more kernels improves the accuracy by 1.3%. Since the input size is too small for MNIST, network v5 is not tested. On the IR dataset, all variant networks show almost the same recognition accuracy (difference < 1%). For BNN, network v4 performed the best (99.30%). For 32bit, network v5 performed the best (99.13%). Even with noise in the input images, the recognition accuracy is reasonably high (> 96%). As there is little variation in recognition accuracy among the 5 networks for the IR dataset, without loss of generality, we simply select LeNet5 (v1) as our reference network for the following analyses.

C. Analysis of Binarization on Object Detection

We compare the histograms of the two networks, the 32bit network and BNN, to verify that they statistically behave the same. The upper figures in Figure 6 show histograms of output scores on the entire test set (4,000 samples) and on samples that each network fails to recognize correctly. BNN and the 32bit network have different mean values and deviations, but they show similar distributions (bimodal distribution). Each peak represents human subject or background (non-human). We can interpret this as BNN being trained to behave similarly to the network with 32bit precision.

Figure 4. Learned kernels and output features of BNN and 32bit network at the first convolutional layer.

Figure 5. Recognition accuracy on different datasets using both BNN and 32bit network: (a) MNIST, (b) IR dataset without noise and (c) IR dataset with noise.

Table I. Summary of simulated networks and required number of parameters and storage size for IR dataset

Network    LeNet5 (v1)   v2          v3          v4          v5
Conv       6: 5×5        6: 11×11    6: 5×5      16: 5×5     6: 5×5
Pooling    2×2           2×2         2×2         2×2         2×2
Conv       16: 5×5       16: 7×7     16: 5×5     32: 5×5     16: 5×5
Pooling    2×2           2×2         2×2         2×2         2×2
Conv       -             -           -           -           12: 3×3
Pooling    -             -           -           -           2×2
F5(F7)     120           120         3600        120         120
F6(F8)                               1024
# Param    1.6M          1.3M        52M         3.3M        0.26M
Memory     243KB         201KB       6.56MB      478KB       72.8KB

For samples that BNN or the 32bit network fails to correctly

recognize, the output scores lie around zero for both networks.

Among those false alarms, we have checked the sign of output

scores on samples which failed on both networks. The result

shows that both BNN and 32bit network had the same output

signs on 95% of common false alarms, implying that BNN

behaves very similarly to 32bit network. Some examples of

those common false alarms are shown in Figure 6 with possible

reasons of the recognition failure.

Then, the question arises why the 32bit network recognizes some samples correctly while BNN fails, and vice versa. Figure 7 shows some examples which BNN fails to recognize but 32bit succeeds. We looked into the six features after the first convolutional layer of an input sample on both BNN and the 32bit network. Again, BNN relies solely on edge (shape) information. Thus, if the input image has too little contrast to recognize a subject, as in the example shown in Figure 7, the features may fail to capture any useful information. To overcome this problem in BNN, contrast enhancement such as histogram equalization or LACE can be applied to IR images before they are fed into BNN. After such pre-processing, BNN successfully identifies the human subject for the previously failed example. The corresponding features of BNN are shown in Figure 7 (lower right), which capture the shape of a human subject.

We also analyze samples where the 32bit network fails to recognize, but BNN succeeds. A subset of those test samples is shown in Figure 8. By looking at the feature maps of one of the samples, we can tell that the 32bit network captures more background information than BNN. This may help collect more features, but when there is a noisy background (or a background with complex textures), it is difficult for the 32bit network to recognize the subject of interest in IR images. To solve this problem, a tighter search box during the selective search may help improve the recognition for the 32bit network.

Figure 6. The histogram of output values (scores) and examples of the common false alarms on identifying human subjects in IR images for both BNN and 32bit network.

Figure 7. The false alarms by testing with BNN (upper left images) and a possible solution to one of the false alarms.

Figure 8. The false alarms with 32bit network (left) and the reasoning behind the failure with possible solution (right).

In summary, we have shown that BNN is trained to perform

in a similar way as 32bit network. This is verified with the high

recognition accuracy on two datasets and by comparing the

actual distribution of output scores. In the following section, we

will present the hardware architecture suited for BNN and

estimate system performance in terms of throughput and energy

consumption.

V. BNN SYSTEM ARCHITECTURE

A. Evaluation on Memory Requirement

The most critical design consideration of neural networks in

hardware is the memory space. As the number of parameters is

ever increasing [1,15,16], the external storage (i.e. DRAM) will

be accessed frequently. The energy consumption of memory

access for 32bit DRAM, however, is 128× higher than that of

32bit SRAM [22]. Thus, storing parameters in local memory

significantly reduces the energy consumption of the system.

This is even more important in embedded platforms.

In Table II, the (internal) memory size of commercial FPGAs is summarized. This helps us understand the maximum size of a neural network that each FPGA can fit into its local memory. For example, AlexNet [1] has 61M parameters, which require ~240MB using full precision. It does not fit in the largest FPGA from the table (Xilinx VirtexU 13P). Only a neural network that is ~4× smaller than AlexNet is estimated to fit into this FPGA. On the other hand, the VirtexU 13P can fit a binarized neural network with ~0.45B parameters for inference mode (a 7.5× larger network than AlexNet).

For IR dataset, a small and shallow network (e.g. LeNet5) is

good enough to achieve high recognition accuracy. LeNet5

with 128×128 inputs has 1.6M parameters occupying about

6.4MB of memory, using full precision. A medium size FPGA such

as KintexU-15P can be used to fit all parameters and has the

package size of 35mm×35mm. A small Zynq-7020 chip has a

footprint of 13mm×13mm (with 0.62MB of local memory). To

fit the network into a smaller form factor for embedded

platforms, LeNet5 on IR images can be trained with binary

parameters, which require ~0.24MB and can fit into this small

FPGA.
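The sizing arithmetic in this subsection can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch: it assumes 1MB = 10^6 bytes and counts weights only, and the helper name `size_mb` is introduced here for illustration.

```python
def size_mb(num_params, bits_per_param):
    """Storage in MB (taken as 10^6 bytes) for a parameter count and precision."""
    return num_params * bits_per_param / 8 / 1e6


alexnet_params = 61e6        # AlexNet, from the text
lenet5_params = 1.6e6        # LeNet5 with 128x128 inputs, from the text
virtexu_13p_mb = 56.8        # largest internal memory listed in Table II

alexnet_fp32_mb = size_mb(alexnet_params, 32)    # roughly 244 MB
lenet5_fp32_mb = size_mb(lenet5_params, 32)      # roughly 6.4 MB
lenet5_binary_mb = size_mb(lenet5_params, 1)     # roughly 0.2 MB (weights only)

# Largest binary network that fits the VirtexU 13P, and its size relative to AlexNet
max_binary_params = virtexu_13p_mb * 1e6 * 8     # roughly 0.45e9 parameters
ratio_vs_alexnet = max_binary_params / alexnet_params
shrink_needed_fp32 = alexnet_fp32_mb / virtexu_13p_mb
```

The binary LeNet5 weights alone come to about 0.2MB; the ~0.24MB quoted above corresponds to the 242.67KB total that Section V.B gives for weights plus activations.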

In the following discussion, LeNet5 is selected as the

reference network on IR dataset. Zynq-7000 series is assumed

as a target embedded platform in describing BNN system

architecture in the following subsections. Please note that the

BNN system architecture is designed for NN inferencing only.

B. BNN System Architecture

BNN, like any other low-precision neural network, requires a dedicated system architecture to fully realize its performance gains. In this section, we present the overall architecture and its processing elements (PE) suited for the BNN algorithm. The proposed BNN architecture is general in terms of its application or mapped network topology. The architecture remains the same but with a different size of internal memory depending on the size of the network. To help explain the overall architecture, we set the application to IR detection with LeNet5.

First, we have to determine the number of internal memory blocks, called Block RAM (BRAM) in FPGA terminology. When deciding the number of BRAMs, we consider the total number of parameters in the network and the maximum PE utilization. We made several assumptions on the external memory specification with DDR3: a 64bit-wide bus (word), an operating frequency of 800MHz and 75% efficiency. These assumptions lead to a transfer rate of 19,200MB/s (this affects the DRAM to BRAM data prefetching time). For LeNet5, a total data size of 242.67KB is required for both activations (39.23KB) and weights (203.44KB) on the IR dataset. In the Zynq-7000 series, the size of each BRAM is 36Kb (512×64bit), thus 51 BRAMs are needed to store all weight parameters locally. To organize BRAMs in a 2-dimensional fabric, we assume 54 BRAMs (6×9) for parameters in LeNet5. Since the storage requirement for activations is far smaller, 10 BRAMs are enough to cover the whole activation data. To fully utilize PEs, however, 27 BRAMs are distributed over the system so that two PEs share one activation BRAM. By operating the PE fabric at half the clock frequency of the BRAMs, the proposed architecture gives us the maximum throughput (Figure 9).
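The BRAM counts quoted above can be checked with a short calculation. The sketch below assumes that each BRAM contributes 512×64 data bits and that 1KB = 1024 bytes; the helper name `brams_needed` is introduced here for illustration.

```python
import math

BRAM_DATA_BITS = 512 * 64     # 512 words of 64 bit per BRAM, as given in the text
BITS_PER_KB = 1024 * 8        # assumes 1KB = 1024 bytes


def brams_needed(size_kb):
    """Smallest number of BRAMs that can hold size_kb of data."""
    return math.ceil(size_kb * BITS_PER_KB / BRAM_DATA_BITS)


weight_brams = brams_needed(203.44)       # weights of LeNet5 on the IR dataset
activation_brams = brams_needed(39.23)    # activations of LeNet5 on the IR dataset

rows, cols = 6, 9                         # 2-dimensional fabric from the text
fabric_weight_brams = rows * cols         # rounded up to fill the 6x9 fabric
shared_activation_brams = rows * cols // 2   # two PEs share one activation BRAM
```

Under these assumptions the calculation gives 51 weight BRAMs and 10 activation BRAMs, and the 6×9 organization gives the 54 and 27 used in the architecture.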

Given the overall system architecture, we now present the design of processing elements optimized for the BNN algorithm. Figure 10 shows a block diagram of a PE with its main functional blocks: (i) XNOR Dot Product (XDP) module and (ii) Batch Normalization (BN) module. As the word size in DRAM is assumed to be 64bit, the XDP module is designed to operate on 64bit data to compute 64 multiply-and-accumulate operations in parallel at each PE. This greatly improves the system performance for the BNN algorithm (refer to Section V.D). A simple example of the dot product operation is also given in Figure 10. Note that Boolean value 0 represents -1 in BNN data. The XNOR between 'n' activations and weights simplifies multiplication. Then, the result is followed by a count and subtract (CnS) module to obtain the dot product output.

Figure 9. The overall system architecture for BNN. The shared BRAMs (shaded) are for weights in convolutional layer and for activations in fully-connected layer.

Table II. Internal memory size in different FPGA platforms

Xilinx Board       Internal memory (MB)
Zynq-7100          3.3
Virtex-7 X980T     6.6
KintexU 15P        8.8
ZynqU 19EG         8.8
KintexU 115        9.5
VirtexU 190        16.6
VirtexU 13P        56.8
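The XNOR followed by count and subtract described for the XDP module can be written in a few lines of software. This is a sketch of the arithmetic only, not of the hardware: the function names are introduced here, and words are modeled as Python integers in which bit value 0 stands for -1 and bit value 1 for +1.

```python
WORD_BITS = 64
MASK = (1 << WORD_BITS) - 1


def xnor_dot(activations, weights, n=WORD_BITS):
    """Dot product of two n-element +1/-1 vectors packed as bits."""
    mask = (1 << n) - 1
    xnor = ~(activations ^ weights) & mask   # bit is 1 where the two signs agree
    matches = bin(xnor).count("1")           # count: elementwise products equal to +1
    return matches - (n - matches)           # subtract: products equal to -1


def reference_dot(activations, weights, n=WORD_BITS):
    """Same result computed by decoding each bit to +1/-1 and multiplying."""
    def decode(word, i):
        return 1 if (word >> i) & 1 else -1
    return sum(decode(activations, i) * decode(weights, i) for i in range(n))


# 4-element example: two positions agree and two disagree, so the result is 0
example = xnor_dot(0b1011, 0b1101, n=4)
```

Because each XNOR bit already encodes one product, the multiply-and-accumulate over 64 elements reduces to one bitwise operation plus a population count.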

C. BNN Computation Mapping on PEs

FC layer: The computation in the fully-connected (FC) and convolutional (Conv) layers requires different data management. First, we present the detailed PE manipulation for the FC layer. Most intuitively, we can map all computations for one output node to each PE. Each PE handles 64 input nodes at a time, thus the PE iteratively computes the state of the given output node if the dimension of the input state vector is larger than 64. This computation mode is named iterative mode (left diagram in Figure 11(a)). This mode is inefficient if the number of output nodes is small, or equivalently if the #input/#output ratio is high. For instance, if the number of output nodes is only 10, the remaining 44 PEs are idle, significantly reducing the utilization.

To solve this issue, another mode for FC layer is designed.

The additional mode, named partial mode, merely requires an

adder tree for each column of the PE fabric. In this mode, each

column handles the computations for one output node. Thus,

nine output nodes are handled in parallel (which is 54 nodes in

the iterative mode). As we will discuss in the next subsection,

the partial mode improves throughput when #input/#output

ratio is high. The other thing to note is that BN+NF module is in

idle mode except for the last row of PE fabric. Since batch

normalization and nonlinear activation happen after the state of

an output node is fully computed, only one BN+NF module is

needed per PE column. Even for the same FC layer, depending

on the layer specification the instruction changes in our BNN

system architecture.
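To make the trade-off between the two FC modes concrete, the following sketch counts XDP steps under a deliberately simple model. The model is an assumption: one 64bit XDP operation per PE per step, a 6×9 PE fabric, and no cost for the adder tree, BN+NF or memory access. The function names and the 1024-input example are hypothetical.

```python
import math

PE_ROWS, PE_COLS, WORD = 6, 9, 64     # 54 PEs, 64 inputs per PE per step


def iterative_steps(n_in, n_out):
    """Iterative mode: one output node per PE, inputs consumed 64 at a time."""
    passes = math.ceil(n_out / (PE_ROWS * PE_COLS))
    return passes * math.ceil(n_in / WORD)


def partial_steps(n_in, n_out):
    """Partial mode: one output node per PE column, inputs split across its rows."""
    passes = math.ceil(n_out / PE_COLS)
    return passes * math.ceil(n_in / (WORD * PE_ROWS))


# The 10-output case from the text, with a hypothetical 1024-element input vector
idle_pes_iterative = PE_ROWS * PE_COLS - 10
steps_iterative = iterative_steps(1024, 10)
steps_partial = partial_steps(1024, 10)
```

For this case the model gives 44 idle PEs and 16 steps in iterative mode against 6 steps in partial mode; with many output nodes and short input vectors the comparison can go the other way.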

Conv Layer: The Conv layer behaves completely differently from the FC layer. The number of weight parameters is much smaller than in the FC layer, thus weights are stored in a local buffer at each PE. Making this buffer larger increases the capacity to handle larger kernels. With an 11bit buffer, kernels up to size 11×11 are covered (one row of the kernel is stored). The input data consists of multiple feature maps, where each map is 2-dimensional data. Input feature maps are distributed to PE columns. Then, pixel data in each (input) feature map is distributed to PE rows. For convolution, two consecutive words (64 pixels/word) in the same row are loaded into each PE. One word is the active word on which convolution is performed: darker gray in Figure 11(b). The other word is the supplement word, which is shifted in one bit at a time to compute the partial sum (row) in a convolution. Note that all 64 weight bits in the XDP module are the same. For a 5×5 kernel, the shift is done 4 times to compute the row partial sum of 64 output nodes. This is repeated 4 times row-wise (the next row of an input feature is loaded) to complete the convolution with the 5×5 kernel.
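The row partial sum with an active word and a supplement word can be sketched in software as follows. The bit ordering is an assumption of this sketch (bit i of the active word is pixel i of the row, and the supplement word holds the next 64 pixels); the function name and the example words are hypothetical.

```python
WORD = 64
MASK = (1 << WORD) - 1


def row_partial_sums(active, supplement, kernel_row):
    """Partial sums of one kernel row for 64 neighboring output nodes.

    kernel_row is a list of weight bits (1 stands for +1, 0 for -1).
    """
    sums = [0] * WORD
    window = active
    for k, w_bit in enumerate(kernel_row):
        if k > 0:
            # Shift the window by one pixel, pulling in one bit of the supplement word
            incoming = (supplement >> (k - 1)) & 1
            window = ((window >> 1) | (incoming << (WORD - 1))) & MASK
        weights = MASK if w_bit else 0            # the same weight bit on all 64 lanes
        xnor = ~(window ^ weights) & MASK
        for i in range(WORD):
            sums[i] += 1 if (xnor >> i) & 1 else -1   # iterative count and subtract
    return sums


# One row of a 5x5 kernel: 4 shifts are needed after the first tap
kernel_row = [1, 0, 1, 1, 0]
sums = row_partial_sums(0x0123456789ABCDEF, 0xFEDCBA9876543210, kernel_row)
```

Repeating this for each kernel row, with the next row of the input feature loaded each time, and adding the per-row results gives the full convolution output for the 64 nodes.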

The important point is that the XDP module outputs 64 8bit data, whereas the output is a single 8bit datum in the FC layer mapping. Also, in each XDP module, the CnS operation is done iteratively on each output (simply adding 1 or subtracting 1 depending on the XNOR output). After the computations for all kernel values are finished, the 64 outputs are independently added column-wise (64 8bit adders are required; the overhead is discussed in Section V.D). These results are passed through the pooling module, which reduces dimensionality. For 2×2 pooling, the module needs to wait for the next 64 output data to produce 32 8bit data (from 128 8bit data). This computation is followed by the BN+NF module to get a 1bit result for each of the 32 pixels of an output feature. This procedure is repeated until all pixels in the output features are computed.
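The pooling and binarization steps above can be illustrated with a short sketch. It assumes max pooling and a standard batch-normalization formula followed by a sign function; the text does not state either detail here, and the names `pool_2x2` and `bn_sign` are hypothetical.

```python
def pool_2x2(row_a, row_b):
    """2x2 pooling over two consecutive rows of 64 outputs -> 32 outputs.

    Max pooling is an assumption; the pooling type is not given in the text.
    """
    assert len(row_a) == len(row_b) and len(row_a) % 2 == 0
    return [
        max(row_a[i], row_a[i + 1], row_b[i], row_b[i + 1])
        for i in range(0, len(row_a), 2)
    ]

def bn_sign(values, mean, var, gamma, beta, eps=1e-5):
    """Batch normalization followed by a sign nonlinearity (1bit output).

    Returns bit 1 for a non-negative normalized value and bit 0 otherwise
    (assumed encoding).
    """
    bits = []
    for v in values:
        y = gamma * (v - mean) / (var + eps) ** 0.5 + beta
        bits.append(1 if y >= 0 else 0)
    return bits
```

Two rows of 64 accumulated outputs (128 values) go in, and 32 binary pixels of the output feature come out, which is the data flow the paragraph describes.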

Figure 10. A block diagram of a processing element with main modules: (i) XDP module and (ii) BN module.

Figure 11. A computation mapping on each PE for fully-connected layer (a) or convolutional layer (b).

Input Layer: Different from other layers, the input layer accepts 8bit data as an input. This makes the computation mapping in the first layer slightly different from the other layers.

In the memory, eight 8bit inputs are stored in one 64bit word. This requires the PE to handle eight 8bit data using the XDP module instead of 64 1bit data. The weight parameters are still kept at 1bit. The output of the XDP module is then handled differently from the mapping shown in Figure 11.

In the FC layer, block CnS modules are used to perform addition on the XNOR results at the same bit position. Then, Shift-and-Add (SnA) is performed to compute the dot product with the 8bit input data (Figure 12(a)). The other modules are kept the same. In the Conv layer, SnA is performed on the 8 output chunks of the iterative CnS module connected to each XNOR output (Figure 12(b)). The XDP module outputs eight 8bit data, so only part of the column-wise adder, pooling and BN modules is utilized for the input layer. In both cases, the only additional hardware overhead is the SnA modules and their management. Also, since a PE computes the dot product on eight inputs at a time, the throughput is reduced by a factor of 8 during input layer processing.
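The shift-and-add idea for the input layer can be sketched as a bit-plane decomposition. The example below assumes unsigned 8bit inputs and weights in {+1, -1}, and it uses a plain signed sum per bit position instead of the XNOR/CnS encoding of the hardware; `bitplane_dot` is a hypothetical name.

```python
def bitplane_dot(inputs, weights):
    """Dot product of unsigned 8bit inputs with +1/-1 weights.

    Partial sums are first formed per bit position (the role of the
    block CnS modules), then combined by shift-and-add (SnA).
    """
    total = 0
    for bit in range(8):
        # Sum the weights of all inputs whose bit at this position is set.
        plane_sum = sum(w for x, w in zip(inputs, weights) if (x >> bit) & 1)
        # Shift by the bit position and accumulate.
        total += plane_sum << bit
    return total
```

The result equals the ordinary dot product, so the 1bit weight datapath can be reused for 8bit inputs at the cost of processing eight inputs per word instead of 64.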

D. Throughput and Energy Analysis

With the proposed BNN architecture described in Section V.C, we can estimate the throughput assuming proper pipelining in the PEs. Before estimating performance, we need to set the operating frequency of the local memory and the PEs. The BRAM in the Zynq-7000 series can be operated at 400MHz, and we assume an operating frequency of 200MHz for the PEs. The operating frequency of the PE module is verified by synthesizing the design in 28nm CMOS technology [23].

For FC layer, there are two different modes depending on the

network topology. In iterative mode, the throughput in terms of

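clock cycles is given by:

$$\text{cycles}_{\text{iterative}} = \left\lceil \frac{\#\text{input}}{\text{word size}} \right\rceil \times \left\lceil \frac{\#\text{output}}{\#\text{PEs}} \right\rceil \tag{1}$$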

where #input (or #output) is the number of nodes in current

layer (or next layer), word size is 64 and #PEs is 54. In partial

mode, the throughput changes to:

 





(2)

where  (or  is the number of rows (or

columns) of a PE array. To verify that the iterative (partial)

mode is beneficial when #input/#output ratio is low (high), the

throughputs of two different cases are compared in Table III.
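The comparison between the two FC modes can be explored with a small cycle-count sketch. Everything in it is an assumption made for illustration: the 6×9 PE array shape (inferred from 54 PEs and nine parallel outputs in partial mode), the iterative-mode form of one cycle per 64bit input word per batch of 54 outputs, and the partial-mode form, which adds one cycle per output batch for the column-wise accumulation. These forms were chosen because they reproduce the two values that survive in Table III (160 and 228); they are not taken from the paper's equations.

```python
WORD_SIZE = 64        # bits per memory word (from the text)
N_ROWS, N_COLS = 6, 9  # assumed PE array shape, 6 * 9 = 54 PEs

def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def iterative_cycles(n_in, n_out):
    # Assumed: each PE owns one output node and consumes one word per cycle.
    return ceil_div(n_in, WORD_SIZE) * ceil_div(n_out, N_ROWS * N_COLS)

def partial_cycles(n_in, n_out):
    # Assumed: inputs are split across PE rows, one output per PE column,
    # plus one extra cycle per output batch to add the row partial sums.
    return (ceil_div(n_in, WORD_SIZE * N_ROWS) + 1) * ceil_div(n_out, N_COLS)
```

Under these assumptions, the (1024, 512) case gives 160 cycles in iterative mode and 228 in partial mode. The (1024, 10) case comes out at 16 and 8 cycles respectively, in line with the claim that partial mode helps when the ratio is high; those two numbers are computed by the sketch and are not taken from Table III.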

For Conv layer, the throughput becomes:

 



 

 



 

(3)

where , is the dimension of input features and

 (or  is the number of input

features (or output features). For instance, the convolution with a 5×5 kernel on six 120×120 input features to obtain 16 output features requires 11,700 cycles. This is significantly more than the cycles required for the FC layer examples (refer to Table III).

Table III. Throughput comparison between iterative and partial

mode for FC layer processing

Mode

(#inputs, #outputs)

(1024, 512)

(1024, 10)

Iterative

160

Partial

228

To estimate the throughput of a neural network for IR detection, the number of cycles for LeNet5 is estimated. The layer-wise estimated performance of the BNN hardware is shown in Figure 13. As a result, it takes 0.19s to process 4,000 test images. For the same number of test images, the GPU took 0.83s for inference. Thus, we get a 4.37× improvement in computation speed compared to the GPU node. The GPU model is a Tesla K20, which has 2496 processors running at 706MHz. Given that our BNN architecture has only 54 PEs, the 4.37× improvement is appreciable. The speed-up can be higher if a higher power budget allows more PEs in the target embedded system. The throughput analysis on LeNet5 implies that the Conv layer is more computation intensive than the FC layer. However, the FC layer demands more memory space, as it normally has a significantly higher parameter count. The delay incurred by input layer processing significantly increases the computation time.

Figure 12. A computation mapping for input layer processing for fully-connected layer (a) or convolutional layer (b).


Since our target is energy-limited embedded platforms, energy efficiency is as important as throughput. In order to estimate the power consumption of each functional block, we designed and synthesized the major blocks in 28nm CMOS technology. The power consumption of those blocks is summarized in Table IV. The XDP module is designed to support every computation mapping by having different modes. In the proposed BNN architecture, we need 50 XDP modules (19.04mW), 82 BN modules (9.52mW) and 64 8bit adders (4.47mW). Overall, the design consumes 33.03mW and takes up an area of 0.26mm². Thus, the energy consumption of the BNN system for processing the 4,000 test images is 6.28mJ, while the GPU consumes 38.18J (the reported power was 46W). In terms of energy efficiency, the proposed BNN architecture improves on the GPU by three orders of magnitude.
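The speed-up and energy figures quoted above follow from simple arithmetic, reproduced here as a check. The variable names are illustrative; all input values are the ones stated in the text.

```python
# Values stated in the text.
bnn_power_w = 33.03e-3   # total BNN power, 33.03 mW
bnn_time_s = 0.19        # time to process 4,000 test images on the BNN
gpu_power_w = 46.0       # reported GPU power
gpu_time_s = 0.83        # time to process the same images on the GPU

# Energy = power x time.
bnn_energy_mj = bnn_power_w * bnn_time_s * 1e3   # in millijoules
gpu_energy_j = gpu_power_w * gpu_time_s          # in joules

speedup = gpu_time_s / bnn_time_s
energy_ratio = gpu_energy_j / (bnn_energy_mj * 1e-3)

print(f"BNN energy: {bnn_energy_mj:.2f} mJ")
print(f"GPU energy: {gpu_energy_j:.2f} J")
print(f"Speed-up: {speedup:.2f}x, energy ratio: {energy_ratio:.0f}x")
```

The energy ratio lands between 10^3 and 10^4, which is what "three orders of magnitude" refers to.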

Table IV. Power consumption of functional blocks in BNN hardware synthesized in 28nm technology node

28nm           XDP Module    BN Module    8bit Adder
Power (µW)     380.84        116.09       69.87
Area (µm²)     2910.40       983.76       507.65
Timing (ns)    3.09          2.26         1.61

VI. CONCLUSIONS

In this paper, we have shown that a binarized neural network (BNN) can effectively detect human subjects in infrared images. This is achieved by proper data augmentation, adding noise to the dataset to deal with the variable distance of a subject and thermal variations. As a result, the recognition accuracy was over 98% without any noise and over 96% even with noise in the IR images. As we target embedded platforms, we presented an architecture suited for BNN and designed the main functional blocks to estimate throughput as well as power consumption. The proposed BNN system outperformed a GPU with 46× as many processors by 4.37× in terms of throughput. Since our BNN system has far fewer processing elements, the energy efficiency improves by three orders of magnitude. Our study shows that a binarized NN targeting object detection can fit into a small FPGA such as the Zynq-7100. The reduced memory usage and simplified computation suggest that DNNs can run on mobile devices with real-time performance. Future work is to integrate the entire system on small FPGA boards to run embedded applications.

VII. ACKNOWLEDGMENTS

This research is partially funded under NSF #1526399, the

Defense Advanced Research Projects Agency (DARPA) and

the Air Force Research Laboratory (AFRL). The views,

opinions and/or findings expressed are those of the authors and

should not be interpreted as representing the official views or

policies of the Department of Defense or the U.S. Government.

REFERENCES

[1] A. Krizhevsky et al., “ImageNet classification with deep convolutional neural networks,” in Proc. Neural Information Processing Systems (NIPS), pp. 1097–1105, 2012.

[2] P. Sermanet, K. Kavukcuoglu, S. Chintala and Y. Lecun,

“Pedestrian detection with unsupervised multi-stage feature

learning,” in Proc. Comput. Vision Pattern Recog. (CVPR), pp.

3626-3633. IEEE, 2013.

[3] I. Sutskever, J. Martens and G. Hinton, “Generating text with

recurrent neural networks,” in Proc. Int. Conf. Machine Learning

(ICML), pp. 1017-1024, 2011.

[4] R. Collobert et al., “Natural language processing (almost) from

scratch”, J. Machine Learning Research, 12:2493–2537, 2011.

[5] Y. LeCun, Y. Bengio and G. Hinton, “Deep learning,” Nature,

521:436-444, 2015.

[6] V. Vanhoucke and A. Senior, “Improving the speed of neural

networks on CPUs,” in Proc. Deep Learn. Unsupervised Feature

Learn. Workshop, NIPS, 2011.

[7] J. Kung, D. Kim and S. Mukhopadhyay, “A power-aware digital

feedforward neural network platform with backpropagation

driven approximate synapses,” in Int. Symp. Low Power Electron.

Design (ISLPED), pp. 85-90, 2015.

[8] S. S. Sarwar, S. Venkataramani, A. Raghunathan and K. Roy,

“Multiplier-less artificial neurons exploiting error resiliency for

energy-efficient neural computing,” in Proc. Design, Automat.

Test in Europe (DATE), pp. 145-150, 2016.

[9] Y. Gong, L. Liu, M. Yang and L. Bourdev, “Compressing deep

convolutional networks using vector quantization,” arXiv

preprint arXiv:1412.6115, 2014.

[10] S. Han et al., “EIE: efficient inference engine on compressed deep neural network,” arXiv preprint arXiv:1602.01528, 2016.

[11] M. Courbariaux et al., “Binarized neural networks: training neural

networks with weights and activations constrained to +1 or -1,”

arXiv preprint arXiv:1602.02830, 2016.

[12] M. Rastegari, V. Ordonez, J. Redmon and A. Farhadi,

“XNOR-Net: ImageNet classification using binary convolutional

neural networks,” arXiv preprint arXiv:1603.05279, 2016.

[13] S. Zhou et al., “DoReFa-Net: training low bitwidth convolutional

neural networks with low bitwidth gradients,” arXiv preprint

arXiv:1606.06160, 2016.

Figure 13. The layer-wise throughput estimation of our proposed BNN hardware with 54 PEs when running LeNet5.


[14] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, “Gradient-based

learning applied to document recognition”, Proc. IEEE,

86(11):2278–2324, 1998.

[15] Y. Taigman, M. Yang, M. Ranzato and L. Wolf, “DeepFace:

closing the gap to human-level performance in face verification,”

in Proc. Comput. Vision Pattern Recog. (CVPR), pp. 1701-1708.

IEEE, 2014.

[16] A. Coates et al., “Deep learning with COTS HPC systems,” in

Proc. Int. Conf. Machine Learning (ICML), pp. 1337-1345, 2013.

[17] K. He, X. Zhang, S. Ren and J. Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.

[18] S. Chai et al., “Low precision neural network using subband

decomposition,” in Cognitive Architecture (CogArch), 2016.

[19] Z. Wu, N. Fuller, D. Theriault and M. Betke, "A thermal infrared

video benchmark for visual analysis", in Proc. IEEE Workshop

on Perception Beyond the Visible Spectrum (PBVS), 2014.

[20] D. Zhang et al., “Unsupervised underwater fish detection fusing

flow and objectiveness,” in Proc. Winter Appl. Comput. Vision

Workshops (WACVW), pp. 1-7, 2016.

[21] K. E. A. van de Sande, J. R. R. Uijlings, T. Gevers and A. W. M.

Smeulders, “Segmentation as selective search for object

recognition” in Proc. Int. Conf. Comput. Vision (ICCV), 2011.

[22] M. Horowitz. Energy table for 45nm process, Stanford VLSI wiki.

Available: https://sites.google.com/site/seecproject/energy-table

[23] “Synopsys 32/28nm generic library,” https://www.synopsys.com/
