
Efficient Object Detection Using Embedded Binarized Neural Networks

Authors:
  • Jaeha Kung (Georgia Institute of Technology, Atlanta, New Jersey→GA? — see below), David Zhang, Gooitzen van der Wal, Sek Chai, Saibal Mukhopadhyay
  • Georgia Institute of Technology, Atlanta, GA; SRI International, Princeton, NJ

Submitted to Journal of Signal Processing Systems, Special Issue on Embedded Computer Vision, August 12, 2016
Abstract—Memory performance is a key bottleneck for deep learning systems. Binarization of both activations and weights is one promising approach that can best scale to realize the most energy-efficient system using the lowest possible precision. In this paper, we utilize and analyze a binarized neural network for human detection on infrared images. Our results show comparable algorithmic performance of binarized versus 32bit floating-point networks, with the added benefit of greatly simplified computation and reduced memory overhead. In addition, we present a system architecture designed specifically for computation using binary representation that achieves at least a 4× speedup and an energy improvement of three orders of magnitude over a GPU.
Index Terms—Deep learning, embedded computer vision, binary neural network, low-power object detection
I. INTRODUCTION
Large-scale deep neural networks (DNN) have been
successfully used in a number of tasks from image recognition
to natural language processing [1-4]. It is well understood that
to capture the wide spectrum of low, mid and high-level
representations of complex patterns, networks with many
layers, nodes, and with high local and global connectivity are
needed [5]. The success of recent deep learning algorithms
comes in part from the ability to train much larger DNN on
much larger datasets than was previously possible.
Networks designed to solve complex, real-world problems are getting deeper and larger. This trend inevitably makes both training and inference computationally challenging and memory-intensive. As such, there has been a growing line of research focused on minimizing memory utilization and energy consumption from the hardware perspective [6-9]. These approaches leverage fixed-point instructions and reduced bit-precision, utilize approximate hardware, or perform data compression after training.
There has been another set of approaches that targets a
specific bit-precision during training, e.g. training at 8-bit DNN
weights instead of 32-bit. Results have shown that these
approaches can arrive at DNN weight parameters of lower
precision with negligible loss in performance [10-13]. In
addition, incorporating lower bit-precision during training helps reduce training time, with a lower memory footprint and lower bitwise computational requirements.
In this paper, we use a binarized neural network (BNN) as our algorithmic approach for our embedded DNN processor because BNN offers the greatest savings in compute and memory resources (i.e., binary is the lowest possible precision). For an in-depth
performance comparison, we analyze our BNN algorithm and
processor design for an image recognition application that
detects humans in the infrared (IR) spectrum. We analyze the
algorithmic performance against 32bit DNNs with various
model parameters such as number of kernels and synaptic
connections.
We also designed a system architecture suited for BNN (as
ASIC or on FPGA). We leverage the use of local memories to
store all BNN weights in order to speed up overall processing.
We then estimate the computing performance of our BNN
architecture and compare to that of GPU. Major computing
blocks in the BNN system are designed and synthesized to
estimate power consumptions of those functional blocks.
Our main contributions can be summarized as follows:
• We demonstrate learning of visual features agnostic to thermal variations to identify human subjects in infrared images by using BNN.
• We provide a detailed characterization of the impact of binarization on object detection.
• We show that BNN performance is close to that of the 32bit network. The binary weights are not a linear scaling of those obtained from the 32bit network; nevertheless, they are another valid representation of the trained NN.
• We analyze and design a system architecture for BNN, with performance and energy efficiency gains.
We anticipate that the results shown in this paper confirm the
suitability of BNNs for embedded platforms. We provide
demonstrable results on a complex application such as IR
human detection. We anticipate the use of BNNs for a variety
of embedded vision platforms, from smart cameras to
smartphone apps.
This paper is organized as follows: in Section 2, we present
previous research work on low-precision neural networks
including BNN. In Section 3, we explain our dataset captured
from different infrared cameras. In Section 4, we analyze
simulation results of BNN on our IR dataset with the
comparison to 32bit simulation results. In Section 5, we explore
the BNN system architecture and present performance/energy
estimates. Finally, we present our conclusions and discuss our
future work in Section 6.
II. PREVIOUS WORK: LOW-PRECISION NNS
Over the last decade, DNN parameter sizes continue to grow
dramatically. LeNet5, a Convolutional Neural Network (CNN)
presented in 1998 [14], uses ~60K parameters to classify
handwritten digits. AlexNet CNN [1] uses 61M parameters to
win the ImageNet image classification competition in 2012.
More recently, newer DNNs such as DeepFace use 120M parameters for human face identification [15], and there are
even networks with 10B parameters [16]. The current trend in
achieving higher level of algorithmic performance is to
increase the number of DNN layers to capture high-level
semantic reasoning, and as such, memory bottlenecks will
continue to be the central computing issue [17].
Needless to say, the computational power and run-time memory footprint required by these modern DNNs in inference mode exceed the power and memory budgets of a typical mobile device or embedded platform. To tackle these
problems, different approaches including limiting bit-precision
[6,7], utilizing approximate hardware [8] or compressing data
[9] were proposed for inference mode. The efficient
manipulation of data precisions in CPUs to speed up the DNN
computation (also, reduce memory footprint by using 8bit) was
presented in [6]. In [7], a greedy algorithm was presented to
automatically set a reduced bit-precision for each synapse by
looking at its error sensitivity. The utilization of approximate
hardware with the fixed-point computation further reduces
energy consumption [7,8]. In [9], a compression method in
reducing the memory requirement was also explored.
To aggressively limit the bit-precision as an effort to reduce
memory, power usage and training time, targeting very low
precision even from training was presented [10-13]. Han et al.
uses a 4bit code book for weight parameters and updates
centroids during the training in [10]; data pruning and data
compression were performed to fit parameters into local
memory. More aggressively, [11,12] proposed binary
weights and activations in training mode (also, inference) and
obtained good results over the well-known benchmarks. Both
approaches keep binary weights and activations for all
computations except using 32bit weight parameters during
weight update. As an extension of this approach, 1bit weights, 2bit activations, and 4bit gradients are used to improve performance on a complex image recognition dataset [13].
All these methods target the same goals: i) reduce memory
requirement and ii) lower energy consumption (either by power
reduction or speed improvement). As constraining the bit-precision during training can successfully reduce it all the way down to binary, we select BNN [11] as our test prototype to realize a low-power embedded vision system. BNN has a simpler binarization scheme than its counterparts [12,13] and shows good accuracy on simple image classification tasks. An overview of the BNN algorithm is shown in Figure 1. The important points to note are that the weight update is done with real values (e.g., 32bit floating-point) and that only the input is 8bit during inference. In the following section,
we will describe our dataset to verify the applicability of BNN
on real embedded applications.
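The scheme in Figure 1 can be illustrated with a minimal NumPy sketch (ours, for illustration only, not the authors' implementation): binary values are produced with a sign function in the forward pass, while a real-valued copy of the weights is kept, updated, and clipped, as in [11]. The layer shape and learning rate below are arbitrary.

```python
import numpy as np

def binarize(x):
    # deterministic binarization: sign(x), with sign(0) mapped to +1
    return np.where(x >= 0, 1.0, -1.0)

# real-valued "master" weights kept only for the weight update
w_real = 0.01 * np.random.randn(128, 64)

def forward(a_prev, w_real):
    wb = binarize(w_real)        # 1-bit weights used for computation
    z = a_prev @ wb              # in hardware this product becomes XNOR + popcount
    return binarize(z)           # 1-bit activations passed to the next layer

def weight_update(w_real, grad, lr=1e-2):
    # gradients (e.g., from a straight-through estimator) are applied to the
    # real-valued weights, which are then clipped to [-1, 1]
    w_real -= lr * grad
    np.clip(w_real, -1.0, 1.0, out=w_real)
    return w_real
```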
III. INFRARED DATASET
A. IR Data Collection
Traditional IR detection systems rely on manual coding of
visual features, resulting in missed detections and high false
alarm rates (FAR). Deep learning algorithms have been used
for object classification with great success, but current
approaches are focused on images in the visible spectrum, with
high resolution and contrast. Commercial deep learning
systems are cloud-based and geared towards performance
results at the cost of high power consumption and are less
concerned with real-time operation. IR cameras operate day
and night, but they pose additional challenges for deep learning
algorithms due to the fact that images have no color cues and
there are thermal variations for the same object. Small objects
in different poses are difficult to identify because they “look
like blobs” with little texture discrimination.
With BNNs, we take advantage of the lack of color data to
train a network specifically on spatial features. We have
previously shown that we can train DNN separately on spatial
and texture data, and then combine the results [18]. The
resulting DNN can be trained faster using higher learning rate,
Figure 1. Overview of binarized neural network (BNN): The training of BNN (top) and the inference mode in BNN (bottom).
and also achieve higher detection performance because we are
able to direct the learning on spatial feature and texture
separately. For this effort, we treat our BNN training in a
semi-supervised manner, specifically for spatial features (e.g.
edges, shapes, etc.), and we treat textural data as a component where precision can be lossy.
We collected a large sample of IR images of human activity,
both indoors and outdoors, on stationary and moving camera
platforms, at different times of the day to capture different
thermal background temperatures. The objects range from
medium to long range (1-100m) such that distant objects can be
as small as 20x20 pixels. We further incorporated thermal
images of pedestrians (in lab, school lounge, and marathon)
from [19] into our training set. Based on our experience, we
anticipate the need for a minimum of 100 frames for each
person in any particular pose. This is approximately five
seconds with a 20 frame-per-second camera.
B. Data Augmentation
Deep learning requires large training data sets, and we
address the challenges in data annotation with our data
ingestion software that labels image regions automatically. Our
data collection task is to ensure that we collect enough data
samples with sufficient variations in body poses and thermal
contrast. We then prepare the training set by selecting the
proper regions in the image with human subjects. We do this
labeling automatically using a combination of hot-spot and
motion tracking, similar to work published previously [20].
Negative samples are selected in the non-human area per frame,
based on object proposals suggested by Selective Search [21].
Furthermore, to add resilience to our training set, we
augment each image in the training set with all variations that
are specific to thermal imagery (e.g. contrast change, resolution
blur, Gaussian noise and geometric distortions). In the context
of the paper, we call them augment noise or simply noise. This
is in addition to spatial augmentation (e.g. rotation, skew, shift)
that is commonly applied to training data in visible spectrum.
We have found, early on, that augmentation specific to IR
images is a key step in training in this spectral range so that our
DNN can be robust to realistic conditions (e.g. adverse weather,
ambient temperature, sensor noise and object distance/size). In
total, we have a sample set of 255k positive and 658k negative
samples.
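The exact augmentation parameters are not specified here, so the following NumPy sketch is only an illustration of the four kinds of augment noise listed above (contrast change, resolution blur, Gaussian noise, and a simple geometric distortion); the value ranges are hypothetical.

```python
import numpy as np

def augment_ir(img, rng):
    """img: 2D float32 IR frame scaled to [0, 1]; rng: np.random.Generator."""
    out = img.astype(np.float32).copy()
    # contrast change around the frame mean
    gain = rng.uniform(0.7, 1.3)
    out = np.clip((out - out.mean()) * gain + out.mean(), 0.0, 1.0)
    # resolution blur: separable box filter of random width
    k = int(rng.choice([1, 3, 5]))
    if k > 1:
        kern = np.ones(k, dtype=np.float32) / k
        out = np.apply_along_axis(np.convolve, 1, out, kern, mode="same")
        out = np.apply_along_axis(np.convolve, 0, out, kern, mode="same")
    # additive Gaussian sensor noise
    out = np.clip(out + rng.normal(0.0, 0.02, out.shape), 0.0, 1.0)
    # simple geometric distortion: small random shift
    dy, dx = rng.integers(-3, 4, size=2)
    out = np.roll(out, (int(dy), int(dx)), axis=(0, 1))
    return out.astype(np.float32)

# example: rng = np.random.default_rng(0); noisy = augment_ir(frame, rng)
```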
IV. ALGORITHMIC EVALUATION OF BNN
A. Training
In addition to our IR dataset, we also used MNIST dataset as
a reference benchmark [14]. Although more complex vision tasks call for deeper and larger networks [13,17], the IR dataset has only two classes (i.e., human and non-human), so we started with a shallow network like LeNet5.
LeNet5 is regarded as a baseline network, which has 2
convolutional layers (with max pooling) and 3 fully-connected
layers. In order to see the impact of different network
configurations, such as the size of convolution kernels, the
number of kernels, the size of FC layer, and the number of
convolutional layers, we studied four different variants of
LeNet5, trained on the MNIST and IR datasets respectively. The performance of these configurations is summarized in the next subsection.
Training of each network is done over 500 epochs with 50
mini-batches per epoch. The loss curves of both BNN and 32bit
network are compared on MNIST and IR datasets (Figure 3).
For MNIST, 40,000 training samples (input size: 28×28) are
used to train the network [14]. For IR dataset, 20,000 training
samples (input size: 128×128) with data augmentation as
explained in Section III.B are used. 30% of the samples are fed
into the network without any noise, while the remaining 70% are pre-processed with randomized augment noise before being fed to the NN for training.
The convergence speed is important in terms of training time
(equivalently, energy consumption). With augment noise in IR
Figure 2. IR dataset captured from multiple infrared cameras followed by data augmentation to handle noise from cameras.
Figure 3. The loss curves during the training with BNN and
32bit on MNIST (left) and IR (right) datasets.
dataset, both the 32bit network and BNN see greater variation in the input data distribution and thus take a longer time to converge. For both MNIST and IR datasets, the
final loss obtained by using 32bit network was lower than that
of BNN.
Previously, there has been little analysis of how the learned kernels of a BNN differ from those of a 32bit network. The kernels in the first convolutional layer of LeNet5
(for both BNN and 32bit) are shown in Figure 4. The binary
kernels let BNN capture edge or shape information. The
learned kernels in 32bit network identify texture as well as
edge/shape information. As a result, BNN mostly relies on the edges of input images. The 32bit network, however, captures the human subject along with some background information. As we will
discuss in Section IV.C, however, the edge-oriented BNN has a
benefit over the 32bit network for some input types.
B. Accuracy Evaluation
To evaluate the accuracy of BNN, we compare it against a 32bit network on classification over both the MNIST and IR datasets. Table I summarizes the specifications of the networks that we configured for testing. LeNet5 is the baseline (v1).
Regarding the variants of LeNet5, network v2 has larger
convolutional kernels, v3 has larger fully-connected layers, v4
has more kernels, and v5 has one additional convolutional layer. The number of parameters of each network on the IR dataset is also reported, which is the same for both BNN and 32bit. The
reported memory size on IR dataset is only for BNN (for 32bit,
we can simply multiply the number by 32). The number of test
samples is 10,000 for MNIST and 4,000 for IR dataset.
Figure 5 shows the simulation results, in terms of recognition
accuracy, on both MNIST and IR dataset. As explained in
Section III.B, the noise due to various factors may deteriorate
the resolution of video frames to be processed from mobile or
embedded cameras. Thus, we tested two different cases for IR
dataset: one without noise on test set and the other with noise
(to verify the noise resilience of our trained model). For all cases, BNN shows high recognition accuracy, close to that of the 32bit network (for MNIST, 32bit always performs slightly better). Interestingly, on the noisy IR test set, BNN performs slightly better than the 32bit network in most configurations. This indicates that the 32bit network is more sensitive to certain noise in input images. A possible justification is addressed in the following subsection.
In terms of network specifications, having larger kernels in
BNN on MNIST dataset reduces the recognition accuracy by
3.1%. However, a larger FC layer or more kernels improves the accuracy by 1.3%. Since the MNIST input size is too small, network v5 is not tested on MNIST. On the IR dataset, all variant networks show almost the same recognition accuracy (difference < 1%). For
BNN, network v4 performed the best (99.30%). For 32bit,
network v5 performed the best (99.13%). Even with the noise
in input images, the recognition accuracy is reasonably high (>
96%). As there is little variation in recognition accuracy among
5 networks for IR dataset, without loss of generality, we simply
select LeNet5 (v1) as our reference network for the following
analyses.
C. Analysis of Binarization on Object Detection
We compare the histograms of the two networks, the 32bit network and BNN, to verify that they statistically behave the same.
The upper figures in Figure 6 show histograms of output scores
on entire test set (4,000 samples) and on samples that each
network fails to recognize correctly. BNN and 32bit network
Figure 5. Recognition accuracy on different datasets using both BNN and 32bit network: (a) MNIST, (b) IR dataset without
noise and (c) IR dataset with noise.
Table I. Summary of simulated networks and required number of parameters and storage size for IR dataset

Network  | LeNet5 (v1) | v2       | v3      | v4      | v5
C1       | 6: 5×5      | 6: 11×11 | 6: 5×5  | 16: 5×5 | 6: 5×5
S2       | 2×2         | 2×2      | 2×2     | 2×2     | 2×2
C3       | 16: 5×5     | 16: 7×7  | 16: 5×5 | 32: 5×5 | 16: 5×5
S4       | 2×2         | 2×2      | 2×2     | 2×2     | 2×2
C5       | -           | -        | -       | -       | 12: 3×3
S6       | -           | -        | -       | -       | 2×2
F5 (F7)  | 120         | 120      | 3600    | 120     | 120
F6 (F8)  | 84          | 84       | 1024    | 84      | 84
# Param  | 1.6M        | 1.3M     | 52M     | 3.3M    | 0.26M
Memory   | 243KB       | 201KB    | 6.56MB  | 478KB   | 72.8KB
have different mean values and deviation, but they show similar
distributions (bimodal distribution). Each peak represents
human subject or background (non-human). We can interpret
this as BNN is trained to behave similarly to the network with
32bit precision.
For samples that BNN or 32bit network fails to correctly
recognize, the output scores lie around zero for both networks.
Among those false alarms, we have checked the sign of output
scores on samples which failed on both networks. The result
shows that both BNN and 32bit network had the same output
signs on 95% of common false alarms, implying that BNN
behaves very similarly to 32bit network. Some examples of
those common false alarms are shown in Figure 6 with possible
reasons of the recognition failure.
Then, the question arises as to why the 32bit network recognizes some samples correctly while BNN fails, and vice versa.
Figure 7 shows some examples on which BNN fails to
recognize but 32bit succeeds. We looked into six features after
the first convolutional layer of an input sample on both BNN
and 32bit network. Again, BNN relies solely on edge (shape)
information. Thus, if the input image lacks the contrast needed to recognize a subject, as in the example shown in Figure 7, the features may fail to capture any useful information. To overcome this problem in BNN, contrast enhancement such as histogram equalization or LACE can be applied to IR images before they are fed into BNN. After such pre-processing, BNN successfully
identifies the human subject for the previously failed example.
The corresponding features of BNN are shown in Figure 7
(lower right), which learns the shape of a human subject.
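As a sketch of this kind of pre-processing (plain global histogram equalization, not necessarily the exact routine used here), an 8bit IR frame can be equalized as follows:

```python
import numpy as np

def equalize_hist(img_u8):
    """Global histogram equalization of a 2D uint8 IR frame."""
    hist = np.bincount(img_u8.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]                    # first non-zero CDF value
    denom = max(img_u8.size - cdf_min, 1)
    lut = np.round((cdf - cdf_min) / denom * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)  # per-intensity mapping
    return lut[img_u8]                           # apply the lookup table
```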
We also analyze samples where the 32bit network fails to
recognize, but BNN succeeds. A subset of those test samples
are shown in Figure 8. By looking at feature maps of one of the
samples, we can tell that 32bit network captures more
background information than BNN. This may help collect more features, but when there is a noisy background (or a background with complex textures), the 32bit network has difficulty recognizing
Figure 6. The histogram of output values (scores) and examples of the common false alarms on identifying human subjects in
IR images for both BNN and 32bit network.
Figure 7. False alarms when testing with BNN (upper left images) and a possible solution to one of the false alarms.
Figure 8. The false alarms with 32bit network (left) and the
reasoning behind the failure with possible solution (right).
the subject of interest in IR images. To solve this problem, a tighter search box during selective search may help improve the recognition for the 32bit network.
In summary, we have shown that BNN is trained to perform
in a similar way to the 32bit network. This is verified by the high
recognition accuracy on two datasets and by comparing the
actual distribution of output scores. In the following section, we
will present the hardware architecture suited for BNN and
estimate system performance in terms of throughput and energy
consumption.
V. BNN SYSTEM ARCHITECTURE
A. Evaluation on Memory Requirement
The most critical design consideration of neural networks in
hardware is the memory space. As the number of parameters is
ever increasing [1,15,16], the external storage (i.e. DRAM) will
be accessed frequently. The energy consumption of memory
access for 32bit DRAM, however, is 128× higher than that of
32bit SRAM [22]. Thus, storing parameters in local memory
significantly reduces the energy consumption of the system.
This is even more important for embedded platforms.
In Table II, the (internal) memory size of commercial FPGAs
is summarized. This helps us understand the maximum size
of a neural network that each FPGA can fit into its local
memory. For example, AlexNet [1] has 61M parameters, which require ~240MB of storage. It does not fit in the largest
FPGA from the table (Xilinx FPGA-VirtexU 13P). Only a
neural network that is ~4× smaller than AlexNet is estimated to fit into this FPGA. On the other hand, the VirtexU 13P can fit a binarized neural network with ~0.45B parameters for inference (a ~7.5× larger network than AlexNet).
For IR dataset, a small and shallow network (e.g. LeNet5) is
good enough to achieve high recognition accuracy. LeNet5
with 128×128 inputs has 1.6M parameters, occupying about 6.4MB of memory at full precision. A medium size FPGA such
as KintexU-15P can be used to fit all parameters and has the
package size of 35mm×35mm. A small Zynq-7020 chip has a
footprint of 13mm×13mm (with 0.62MB of local memory). To
fit the network into a smaller form factor for embedded
platforms, LeNet5 on IR images can be trained with binary
parameters, which require ~0.24MB and can fit into this small
FPGA.
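These sizing claims can be checked with a rough back-of-the-envelope script (parameter storage only, activations ignored, capacities taken from Table II and the text); the set of networks and FPGAs below is just the examples discussed above.

```python
def model_mb(n_params, bits_per_param):
    # parameter storage only, in megabytes (1 MB = 1e6 bytes)
    return n_params * bits_per_param / 8 / 1e6

networks = {"AlexNet": 61e6, "LeNet5 (128x128 IR input)": 1.6e6}
fpga_mb = {"Zynq-7020": 0.62, "KintexU 15P": 8.8, "VirtexU 13P": 56.8}  # Table II / text

for name, n in networks.items():
    for bits in (32, 1):
        size = model_mb(n, bits)
        fits = [f for f, cap in fpga_mb.items() if size <= cap]
        print(f"{name:>26s} @ {bits:2d}bit: {size:7.2f} MB, fits: {fits if fits else 'none'}")
```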
In the following discussion, LeNet5 is selected as the
reference network on the IR dataset. The Zynq-7000 series is assumed as the target embedded platform in describing the BNN system
architecture in the following subsections. Please note that the
BNN system architecture is designed for NN inferencing only.
B. BNN System Architecture
The BNN, or any other low-precision neural network, requires a dedicated system architecture to fully realize its performance gains. In this section, we present the overall architecture and its processing
elements (PE) suited for BNN algorithm. The proposed BNN
architecture is general in terms of its application or mapped
network topology. The architecture remains the same but with
different size of the internal memory depending on the size of
the network. To aid understanding of the overall architecture, we set the application to IR detection with LeNet5.
First, we have to determine the number of internal memory
blocks, called Block RAM (BRAM) in FPGA terminology.
When deciding the number of BRAMs, we consider the total
number of parameters in the network and the maximum PE
utilization. We made several assumptions about the external DDR3 memory: a 64bit-wide bus (word), an operating frequency of 800MHz, and 75% efficiency. These assumptions
lead to the transfer rate of 19,200MB/s (this affects DRAM to
BRAM data prefetching time). For LeNet5, total data size of
242.67KB is required for both activations (39.23KB) and
weights (203.44KB) on IR dataset. In Zynq-7000 series, the
size of each BRAM is 36Kb (512×64bit) thus 51 BRAMs are
needed to store all weight parameters locally. To organize
BRAMs in a 2-dimensional fabric, we assume 54 BRAMs (6×9)
for parameters in LeNet5. Since the storage requirement for
activations is far less, 10 BRAMs are enough to cover the
whole activation data. To fully utilize PEs, however, 27
BRAMs are distributed over the system so that two PEs share
one activation BRAM. By operating the PE fabric at half the BRAM clock frequency, the proposed architecture achieves the maximum throughput (Figure 9).
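The BRAM budget above follows from a short calculation, assuming 512×64bit (4KB) of usable storage per BRAM as stated; the rounding to a 6×9 fabric and the 27 shared activation BRAMs are the design choices described in the text.

```python
import math

BRAM_BYTES = 512 * 64 // 8                   # 4 KB usable per BRAM (512 x 64bit)
weights_kb, activations_kb = 203.44, 39.23   # LeNet5 on the IR dataset

weight_brams = math.ceil(weights_kb * 1024 / BRAM_BYTES)          # -> 51
weight_brams_fabric = 6 * 9                                       # rounded up to a 6x9 fabric
activation_brams = math.ceil(activations_kb * 1024 / BRAM_BYTES)  # -> 10
activation_brams_shared = 27    # one activation BRAM shared by every two of the 54 PEs

print(weight_brams, weight_brams_fabric, activation_brams, activation_brams_shared)
```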
Given the overall system architecture, we now present the
design of processing elements optimized for BNN algorithm.
Figure 10 shows a block diagram of a PE with main functional
blocks: (i) XNOR Dot Product (XDP) module and (ii) Batch
Normalization (BN) module. As the word size in DRAM is
assumed as 64bit, XDP module is designed to operate on 64bit
data to compute 64 multiply-and-accumulate in parallel at each
PE. This greatly improves the system performance for BNN
algorithm (refer to Section V.D). Also, a simple example of dot
product operation is explained in Figure 10. Note that Boolean
value 0 represents -1 in BNN data. The XNOR between n
activations and weights simplifies multiplication. Then, the
Figure 9. The overall system architecture for BNN. The shared BRAMs (shaded) are for weights in the convolutional layers and for activations in the fully-connected layers.
Table II. Internal memory size in different FPGA platforms

Xilinx Board    | MB
Zynq-7100       | 3.3
Virtex-7 X980T  | 6.6
KintexU 15P     | 8.8
ZynqU 19EG      | 8.8
KintexU 115     | 9.5
VirtexU 190     | 16.6
VirtexU 13P     | 56.8
result is passed through a count-and-subtract (CnS) module to obtain the dot product output.
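The XDP operation can be modeled in a few lines of software (an illustrative model, not the RTL), treating Python integers as 64bit packed words and using the convention from Figure 10 that bit value 0 encodes -1 and bit value 1 encodes +1.

```python
import numpy as np

def xdp(act_word, w_word, n=64):
    """XNOR dot product on two n-bit packed words (bit 0 -> -1, bit 1 -> +1)."""
    xnor = ~(act_word ^ w_word) & ((1 << n) - 1)   # 1 wherever activation and weight agree
    matches = bin(xnor).count("1")                 # "count" stage (popcount)
    return 2 * matches - n                         # "subtract" stage: matches - mismatches

# sanity check against an ordinary +/-1 dot product
rng = np.random.default_rng(0)
a_bits = rng.integers(0, 2, 64)
w_bits = rng.integers(0, 2, 64)
pack = lambda bits: int("".join(str(int(b)) for b in bits), 2)
reference = int(np.dot(2 * a_bits - 1, 2 * w_bits - 1))
assert xdp(pack(a_bits), pack(w_bits)) == reference
```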
C. BNN Computation Mapping on PEs
FC layer: The computation in fully-connected (FC) and
convolutional (Conv) layer requires different data management.
First, we present the detailed PE manipulation for the FC layer.
Most intuitively, we can map all computations for one output
node to each PE. Each PE handles 64 input nodes at a time, thus
the PE iteratively computes the state of the given output node if
the dimension of input state vector is larger than 64. This
computation mode is named iterative mode (left diagram in
Figure 11(a)). This mode is inefficient if the number of output
nodes is small, or equivalently if the #input/#output ratio is high. For instance, if the number of output nodes is only 10, the remaining 44 PEs are idle, significantly reducing utilization.
To solve this issue, another mode for FC layer is designed.
The additional mode, named partial mode, merely requires an
adder tree for each column of the PE fabric. In this mode, each
column handles the computations for one output node. Thus,
nine output nodes are handled in parallel (which is 54 nodes in
the iterative mode). As we will discuss in the next subsection,
the partial mode improves throughput when #input/#output
ratio is low. Another point to note is that the BN+NF module is idle except in the last row of the PE fabric. Since batch
normalization and nonlinear activation happen after the state of
an output node is fully computed, only one BN+NF module is
needed per PE column. Even for the same FC layer, the instructions in our BNN system architecture change depending on the layer specification.
Conv Layer: The Conv layer behaves completely differently from the FC layer. The number of weight parameters is much smaller than in the FC layer, so weights are stored in a local buffer at each PE. Making this buffer larger increases the capacity to
handle larger kernels. With an 11bit buffer, kernels up to
size 11×11 are covered (a row of the kernel is stored). The input
data is multiple feature maps where each map is 2-dimensional
data. Input feature maps are distributed to PE columns. Then,
pixel data in each (input) feature map is distributed to PE rows.
For convolution, two consecutive words (64pixels/word) in the
same row are loaded into each PE. One word is the active word
at which convolution is performed: darker gray in Figure 11(b).
The other word is a supplementary word, shifted in one bit at a time to compute the partial (row) sum of a convolution. Note that all 64 weight bits in the XDP module are the same. For a 5×5 kernel, the shift is done 4 times to compute the row partial sum of 64 output nodes. This is repeated 4 times row-wise (the next row of an input feature is loaded) to complete the convolution with the 5×5 kernel.
An important fact is that the XDP module outputs 64 8bit values, whereas in the FC layer mapping the output is a single 8bit value. Also, in each XDP module, the CnS operation is done iteratively on each output (simply adding 1 or subtracting 1 depending on the XNOR output). After all kernel values are computed, the 64 outputs are independently added column-wise (64 8bit adders are required; the overhead is discussed in Section V.D). These results are passed through a pooling module that reduces dimensionality. For 2×2 pooling, it needs to wait for the next 64 outputs to produce 32 8bit values (from 128 8bit values). This computation is followed by the BN+NF module to obtain a 1bit result for each of the 32 pixels of an output feature. This procedure is repeated until all pixels in the output features are computed.
Input Layer: Unlike the other layers, the input layer accepts 8bit data as input. This makes the computation
Figure 11. A computation mapping on each PE for fully-connected layer (a) or convolutional layer (b).
Figure 10. A block diagram of a processing element with
main modules: (i) XDP module and (ii) BN module.
mapping in the first layer slightly different from the other layers.
In the memory, eight chunks of 8bit inputs are stored in one
64bit word. This requires the PE to handle eight 8bit values using the XDP module instead of 64 1bit values. Still, weight parameters are kept at 1bit. The output of the XDP module is then handled
differently from the mapping shown in Figure 11.
In the FC layer, a block of CnS modules is used to perform addition on the XNOR results at the same bit-position. Then, Shift-and-Add (SnA) is performed to compute the dot product result with the 8bit input data (Figure 12(a)).
the same. In Conv layer, SnA is performed on 8 output chunks
of iterative CnS module connected to each XNOR output
(Figure 12(b)). The XDP module outputs eight 8bit data, thus
only some parts of column-wise adder, pooling and BN module
are utilized for the input layer. For both layer cases, the only additional hardware overhead is the SnA modules and their proper management. Also, since the PE computes the dot product on only eight inputs at a time, the throughput is reduced by 8× during input layer processing.
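A functional sketch of this bit-serial evaluation (a software model of the idea, not the hardware datapath): the 8bit inputs are processed one bit-plane at a time, each bit-plane is combined with the ±1 weights (which the hardware does with XNOR and CnS logic), and the partial sums are merged by Shift-and-Add.

```python
import numpy as np

def input_layer_dot(x_u8, w_pm1):
    """Dot product of uint8 inputs with +/-1 weights, evaluated bit-serially."""
    acc = 0
    for b in range(8):                                   # one pass per bit-plane
        bit_plane = (x_u8.astype(np.int64) >> b) & 1     # 0/1 bits of every input pixel
        partial = int(np.sum(bit_plane * w_pm1))         # hardware: XNOR + count-and-subtract
        acc += partial << b                              # Shift-and-Add (SnA) stage
    return acc

# equals the ordinary dot product
x = np.array([37, 200, 5, 118], dtype=np.uint8)
w = np.array([1, -1, 1, -1])
assert input_layer_dot(x, w) == int(np.dot(x.astype(int), w))
```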
D. Throughput and Energy Analysis
With the proposed BNN architecture described in Section
V.C, we can estimate the throughput with the proper pipelining
in PE. Before estimating performance, we need to set operating
frequency of local memory and PEs. The BRAM in Zynq-7000
series can be operated at 400MHz and we assume the operating
frequency of PE at 200MHz. The operating frequency of PE
module is verified by synthesizing the design in 28nm CMOS
technology [23].
For FC layer, there are two different modes depending on the
network topology. In iterative mode, the throughput in terms of clock cycles is given by:

cycles_iterative = ⌈#input / word size⌉ × ⌈#output / #PEs⌉    (1)
where #input (or #output) is the number of nodes in current
layer (or next layer), word size is 64 and #PEs is 54. In partial
mode, the throughput changes to:

cycles_partial = ⌈(#input + #output) / (word size × #rows_PE)⌉ × ⌈#output / #cols_PE⌉    (2)
where #rows_PE (or #cols_PE) is the number of rows (or columns) of the PE array. To verify that the iterative (partial) mode is beneficial when the #input/#output ratio is low (high), the throughputs of the two modes for two different cases are compared in Table III.
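Assuming the reconstruction of Eq. (1) above, the iterative-mode entries of Table III follow directly; a quick check of that assumption:

```python
import math

def fc_cycles_iterative(n_in, n_out, word_size=64, n_pes=54):
    # Eq. (1): passes over the input word-by-word, times batches of output nodes
    return math.ceil(n_in / word_size) * math.ceil(n_out / n_pes)

assert fc_cycles_iterative(1024, 512) == 160   # Table III, iterative mode
assert fc_cycles_iterative(1024, 10) == 16
```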
For the Conv layer, the cycle count depends on the dimensions of the input features (N_row × N_col), the kernel size, the word size, the PE array dimensions, and the numbers of input (#IF) and output (#OF) feature maps. For instance, the convolution with a 5×5 kernel on six 120×120 input features to obtain 16 output features requires 11,700 cycles. This is significantly longer than the cycle counts required for the FC layer examples (refer to Table III).
Table III. Throughput comparison between iterative and partial mode for FC layer processing

Mode \ (#inputs, #outputs) | (1024, 512) | (1024, 10)
Iterative                  | 160         | 16
Partial                    | 228         | 6
To estimate the throughput of a neural network for IR
detection, the number of cycles for LeNet5 is estimated. The
layer-wise estimated performance of BNN hardware is shown
in Figure 13. As a result, it takes 0.19s to process 4,000 test
images. In processing the same number of test images, GPU
took 0.83s for the inference. Thus, we get 4.37× improvement
in computation speed compared to GPU node. The GPU model
is Tesla K20 which has 2496 processors running at 706MHz.
Considering that our BNN architecture has only 54 PEs, the 4.37× improvement is appreciable. The speed-up can be higher if a larger power budget allows more PEs in the target embedded system. The throughput analysis on LeNet5 implies that the Conv
layer is more computation-intensive than the FC layer. However, the FC layer demands more memory space, as it normally has
significantly higher parameter counts. The delay incurred for
input layer processing significantly increases the computation
time.
Figure 12. A computation mapping for input layer processing for fully-connected layer (a) or convolutional layer (b).
Since our targets are energy-limited embedded platforms, energy efficiency is a key factor alongside throughput. In order to estimate the power consumption of each
functional block, we designed and synthesized major blocks in
28nm CMOS technology. The power consumption of those
blocks is summarized in Table IV. The XDP module is
designed to function on any computation mapping by having
different modes. In the proposed BNN architecture, we need 50
XDP modules (19.04mW), 62 BN modules (9.52mW) and 64
8bit adders (4.47mW). Overall, it consumes 33.03mW and
takes up an area of 0.26mm2. Thus, the energy consumption of the BNN system is 6.28mJ, while the GPU consumes 38.18J (its reported power was 46W). In terms of energy efficiency, the proposed BNN architecture improves by three orders of magnitude.
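The speedup and energy comparison follow directly from the figures above; a short script makes the arithmetic explicit (all numbers are the ones reported in this section).

```python
bnn_time_s, gpu_time_s = 0.19, 0.83        # time to process the 4,000 test images
bnn_power_w, gpu_power_w = 33.03e-3, 46.0  # synthesized BNN blocks vs. reported GPU power

speedup = gpu_time_s / bnn_time_s          # ~4.37x
bnn_energy_j = bnn_power_w * bnn_time_s    # ~6.28 mJ
gpu_energy_j = gpu_power_w * gpu_time_s    # ~38.18 J

print(f"speedup: {speedup:.2f}x, energy ratio: {gpu_energy_j / bnn_energy_j:,.0f}x")
```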
Table IV. Power consumption of functional blocks in BNN hardware synthesized in 28nm technology node

28nm         | XDP Module | BN Module | 8bit Adder
Power (µW)   | 380.84     | 116.09    | 69.87
Area (µm²)   | 2910.40    | 983.76    | 507.65
Timing (ns)  | 3.09       | 2.26      | 1.61
VI. CONCLUSIONS
In this paper, we have shown that binarized neural network
(BNN) can effectively detect human subjects from infrared
images. This is achieved by proper data augmentation, adding noise to the dataset to deal with variable subject distances and thermal variations. As a result, the recognition accuracy was
over 98% without any noise and over 96% even with noise in
IR images. As we target embedded platforms, we presented an architecture suited for BNN and designed the main functional blocks to estimate throughput as well as power consumption.
The proposed BNN system outperformed a GPU with 46× as many processors by 4.37× in terms of throughput. Since our BNN system has far fewer processing elements, its energy efficiency improves by three orders of magnitude. Our study shows that a binarized NN targeting object detection can fit into a small FPGA such as the Zynq-7100. The reduced memory usage and simplified computation suggest that DNNs can run on mobile devices with real-time performance. Future work will integrate the entire system on small FPGA boards to run embedded applications.
VII. ACKNOWLEDGMENTS
This research is partially funded under NSF #1526399, the
Defense Advanced Research Projects Agency (DARPA) and
the Air Force Research Laboratory (AFRL). The views,
opinions and/or findings expressed are those of the authors and
should not be interpreted as representing the official views or
policies of the Department of Defense or the U.S. Government.
REFERENCES
[1] A. Krizhevsky et al., “ImageNet classification with deep convolutional neural networks,” in Proc. Neural Information Processing Systems (NIPS), pp. 1097-1105, 2012.
[2] P. Sermanet, K. Kavukcuoglu, S. Chintala and Y. Lecun,
“Pedestrian detection with unsupervised multi-stage feature
learning,” in Proc. Comput. Vision Pattern Recog. (CVPR), pp.
3626-3633. IEEE, 2013.
[3] I. Sutskever, J. Martens and G. Hinton, “Generating text with
recurrent neural networks,” in Proc. Int. Conf. Machine Learning
(ICML), pp. 1017-1024, 2011.
[4] R. Collobert et al., “Natural language processing (almost) from scratch,” J. Machine Learning Research, 12:2493-2537, 2011.
[5] Y. LeCun, Y. Bengio and G. Hinton, “Deep learning,” Nature,
521:436-444, 2015.
[6] V. Vanhoucke and A. Senior, “Improving the speed of neural
networks on CPUs,” in Proc. Deep Learn. Unsupervised Feature
Learn. Workshop, NIPS, 2011.
[7] J. Kung, D. Kim and S. Mukhopadhyay, “A power-aware digital
feedforward neural network platform with backpropagation
driven approximate synapses,” in Int. Symp. Low Power Electron.
Design (ISLPED), pp. 85-90, 2015.
[8] S. S. Sarwar, S. Venkataramani, A. Raghunathan and K. Roy,
“Multiplier-less artificial neurons exploiting error resiliency for
energy-efficient neural computing,” in Proc. Design, Automat.
Test in Europe (DATE), pp. 145-150, 2016.
[9] Y. Gong, L. Liu, M. Yang and L. Bourdev, “Compressing deep convolutional networks using vector quantization,” arXiv preprint arXiv:1412.6115, 2014.
[10] S. Han et al., “EIE: efficient inference engine on compressed deep neural network,” arXiv preprint arXiv:1602.01528, 2016.
[11] M. Courbariaux et al., “Binarized neural networks: training neural
networks with weights and activations constrained to +1 or -1,”
arXiv preprint arXiv:1602.02830, 2016.
[12] M. Rastegari, V. Ordonez, J. Redmon and A. Farhadi,
“XNOR-Net: ImageNet classification using binary convolutional
neural networks,” arXiv preprint arXiv:1603.05279, 2016.
[13] S. Zhou et al., “DoReFa-Net: training low bitwidth convolutional
neural networks with low bitwidth gradients,” arXiv preprint
arXiv:1606.06160, 2016.
Figure 13. The layer-wise throughput estimation of our proposed BNN hardware with 54 PEs when running LeNet5.
[14] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, 86(11):2278-2324, 1998.
[15] Y. Taigman, M. Yang, M. Ranzato and L. Wolf, “DeepFace:
closing the gap to human-level performance in face verification,”
in Proc. Comput. Vision Pattern Recog. (CVPR), pp. 1701-1708.
IEEE, 2014.
[16] A. Coates et al., “Deep learning with COTS HPC systems,” in
Proc. Int. Conf. Machine Learning (ICML), pp. 1337-1345, 2013.
[17] K. He, X. Zhang, S. Ren and J. Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.
[18] S. Chai et al., “Low precision neural network using subband
decomposition,” in Cognitive Architecture (CogArch), 2016.
[19] Z. Wu, N. Fuller, D. Theriault and M. Betke, "A thermal infrared
video benchmark for visual analysis", in Proc. IEEE Workshop
on Perception Beyond the Visible Spectrum (PBVS), 2014.
[20] D. Zhang et al., “Unsupervised underwater fish detection fusing
flow and objectiveness,” in Proc. Winter Appl. Comput. Vision
Workshops (WACVW), pp. 1-7, 2016.
[21] K. E. A. van de Sande, J. R. R. Uijlings, T. Gevers and A. W. M. Smeulders, “Segmentation as selective search for object recognition,” in Proc. Int. Conf. Comput. Vision (ICCV), 2011.
[22] M. Horowitz. Energy table for 45nm process, Stanford VLSI wiki.
Available: https://sites.google.com/site/seecproject/energy-table
[23] “Synopsys 32/28nm generic library,” https://www.synopsys.com/
... Based on our study of the literature, there have been a number of BNN-related reviews, ranging from conceptual summaries [8,[23][24][25] to technical summaries related to various domains, such as agriculture [26], medicine [27][28][29], large-scale image retrieval [30], human action recognition [31,32], etc. This means significant research has been devoted to BNNs, applying them in many areas, including object recognition [31,[33][34][35], semantic segmentation [36,37] and point-cloud tasks [38]. ...
... Based on our study of the literature, there have been a number of BNN-related reviews, ranging from conceptual summaries [8,[23][24][25] to technical summaries related to various domains, such as agriculture [26], medicine [27][28][29], large-scale image retrieval [30], human action recognition [31,32], etc. This means significant research has been devoted to BNNs, applying them in many areas, including object recognition [31,[33][34][35], semantic segmentation [36,37] and point-cloud tasks [38]. Compared to other BNN-related reviews and earlier works, this review contributes the following: ...
... In [31], the authors apply binary neural networks (BNNs) to the field of object detection, especially human detection in infrared images. The results show that the performance of their binary neural network is comparable to that of a 32-bit floating-point network while greatly saving computational resources, reducing memory consumption and boosting computational speed by four times. ...
Article
Full-text available
Binary neural networks (BNNs) are variations of artificial/deep neural network (ANN/DNN) architectures that constrain the real values of weights to the binary set of numbers {−1,1}. By using binary values, BNNs can convert matrix multiplications into bitwise operations, which accelerates both training and inference and reduces hardware complexity and model sizes for implementation. Compared to traditional deep learning architectures, BNNs are a good choice for implementation in resource-constrained devices like FPGAs and ASICs. However, BNNs have the disadvantage of reduced performance and accuracy because of the tradeoff due to binarization. Over the years, this has attracted the attention of the research community to overcome the performance gap of BNNs, and several architectures have been proposed. In this paper, we provide a comprehensive review of BNNs for implementation in FPGA hardware. The survey covers different aspects, such as BNN architectures and variants, design and tool flows for FPGAs, and various applications for BNNs. The final part of the paper gives some benchmark works and design tools for implementing BNNs in FPGAs based on established datasets used by the research community.
... Singh et al. [20] HM PVOT Virtex 6 ML605 1 Weißbrich et al. [41] FM FE Virtex 7 VC707 1 Kung et al. [54] DLM OD Xilinx ML-507 1 Pandey et al. [31] MLM PVOT Xilinx VC709 1 El-Shafie et al. [26] DLM PVOT Xilinx Zedboard 3 ...
Conference Paper
Full-text available
Visual object tracking systems have evolved from conventional image processing algorithms to technologies based on deep learning. Furthermore, FPGA technologies have allowed the implementation of complex and flexible portable systems. To present a comprehensive review of portable visual object tracking systems, this study proposes five categories of design methodologies for FPGA-based portable visual object tracking systems, along with their respective qualitative analysis and discussion. The proposed categorization includes the following algorithms: Region matching, Feature matching, Machine learning, and Deep learning. These design methodologies are contrasted against its FPGA-based hardware implementations.
... Moreover, because of the characteristics of binary values, matrix multiplication can be replaced by XNOR-popcount operations, thus conserving a substantial amount of computational resources. Over the past few years, many researchers have demonstrated the advantages of BNN for implementation on resources-limited hardware and applied it across various domains with exceptionally high real-time requirements [28][29][30][31]. While BNNs significantly reduce hardware strain, they inevitably introduce some issues for performance of neural networks, notably information loss resulting from binarization. ...
Article
Full-text available
Super-resolution systems refer to computer-based systems designed to enhance the quality of images or video by producing high-resolution renditions from low-resolution counterparts using computational algorithms and technologies. Various methods and techniques have been used in development of super-resolution systems. The development of Convolution Neural Networks (CNNs) and the Deep Learning (DL) methods have outperformed traditional methods. However, as models become increasingly deeper with wider receptive fields, the number of parameters significantly increases. While this often results in better performance, it renders these models impractical for real-life scenarios such as smartphones or other mobile systems. Currently, most proposed methods with higher perceptual quality demand a substantial amount of time to process a single image, even on powerful hardware like NVIDIA GPUs. Such computationally expensive models are not cost-effective for real-world application scenarios. Optimization is needed to reduce the computational costs and memory requirements to enhance their suitability for less powerful hardware configurations. In this work, we propose an efficient binary neural network architecture, ResBinESPCN, designed for image super-resolution. In our design, we improved the energy efficiency of the architecture through algorithmic and hardware-level optimizations. These optimizations not only enhance computational efficiency and reduce memory consumption but also achieve effective image super-resolution in resource-constrained environments. Our experimental validation highlights the effectiveness of this network structure and includes ablation studies on models with varying data bit widths. Hardware analysis substantiates the efficiency and real-time capabilities of this model. Additionally, deploying the model on FPGA using FINN demonstrates its low hardware resource usage and low power consumption.
... In the existing work, classification tasks based on binary neural networks are performed for infrared images on GPUs and are at least four times faster and have three orders of magnitude less energy consumption [33]. The binary weight and binary input data network (BWBDN) [34] achieved 62 times the acceleration and saved 32 times the storage space. ...
Article
Full-text available
Plant disease control has long been a critical issue in agricultural production and relies heavily on the identification of plant diseases, but traditional disease identification requires extensive experience. Most of the existing deep learning-based plant disease classification methods run on high-performance devices to meet the requirements for classification accuracy. However, agricultural applications have strict cost control and cannot be widely promoted. This paper presents a novel method for plant disease classification using a binary neural network with dual attention (DABNN), which can save computational resources and accelerate by using binary neural networks, and introduces a dual-attention mechanism to improve the accuracy of classification. To evaluate the effectiveness of our proposed approach, we conduct experiments on the PlantVillage dataset, which includes a range of diseases. The F1score and Accuracy of our method reach 99.39% and 99.4%, respectively. Meanwhile, compared to AlexNet and VGG16, the Computationalcomplexity of our method is reduced by 72.3% and 98.7%, respectively. The Paramssize of our algorithm is 5.4% of AlexNet and 2.3% of VGG16. The experimental results show that DABNN can identify various diseases effectively and accurately.
... They showed that the distance between the camera and the object as well as its pose can diminish model performance. The authors in [23] aimed to reduce the computational time needed to detect people by using a Binarized Neural Network (BNN) on a selfcollected thermal dataset. Strategies presented in recent years include image processing and human visual system (HVS) based techniques; see [24,25], respectively. ...
... One important family of deep neural networks is the class of Binarized Neural Networks (BNNs) [10]. Since these networks are memory efficient and computationally efficient, as their parameters and activations are predominantly binary, BNNs are useful in resource-constrained environments, like embedded devices or mobile phones [15,17]. Moreover, BNNs allow a compact representation in Boolean logic, enabling verification approaches based on SAT or SMT solving, or 0-1 Integer Linear Programming [1,14,19]. ...
... Object detection gathers two tasks; the first task is localizing one or more objects in an image and drawing a box around them, then, the turn of the second task comes to classify the objects in the image [252]. There is some BNN research works on object detection, like in [195], the authors presented human detection on infrared images. S. Sun et al. [241] suggested a fast object detection algorithm by combining the proposals prediction and object classification processes. ...
Article
Full-text available
This paper presents an extensive literature review on Binary Neural Network (BNN). BNN utilizes binary weights and activation function parameters to substitute the full-precision values. In digital implementations, BNN replaces the complex calculations of Convolutional Neural Networks (CNNs) with simple bitwise operations. BNN optimizes large computation and memory storage requirements, which leads to less area and power consumption compared to full-precision models. Although there are many advantages of BNN, the binarization process has a significant impact on the performance and accuracy of the generated models. To reflect the state-of-the-art in BNN and explore how to develop and improve BNN-based models, we conduct a systematic literature review on BNN with data extracted from 239 research studies. Our review discusses various BNN architectures and the optimization approaches developed to improve their performance. There are three main research directions in BNN: accuracy optimization, compression optimization, and acceleration optimization. The accuracy optimization approaches include quantization error reduction, special regularization, gradient error minimization, and network structure. The compression optimization approaches combine fractional BNN and pruning. The acceleration optimization approaches comprise computing in-memory, FPGA-based implementations, and ASIC-based implementations. At the end of our review, we present a comprehensive analysis of BNN applications and their evaluation metrics. Also, we shed some light on the most common BNN challenges and the future research trends of BNN.
Article
Neuromorphic and in-memory computing architectures using emerging nonvolatile memories (e-NVMs) have emerged as promising solutions for area-and energy-efficient deep neural network (DNN) accele-rators. However, the inherent nonideal behavior of e-NVMs such as limited tuning precision (for multibit synapses), nonlinearity, temporal (cycle-to-cycle), and spatial (device-to-device) variability significantly degrades the performance of DNN accelerators. Recently, binary neural networks (BNNs), with 1-bit weights and activations, have been shown to offer an alternative relaxed approach for training and inference with high accuracy. However, the limited endurance and stuck-at faults of e-NVMs such as resistive (R)RAMs, charge trap memory, and so on limit the efficient implementation of BNN accelerators. Considering the ultrahigh endurance, ultralow switching energy, and CMOS-compatibility of the ferroelectric (Fe)FETs, it becomes imperative to explore their potential for BNN accelerators. To this end, in this work, we present a novel approach for the implementation of BNN inference accelerators utilizing an array of dual-port ferroelectric FETs (FeFETs) as current sinks. The dual-port FeFETs not only decouple the read and write paths (leading to reduced read disturbances) but also exhibit high reliability and voltage compatibility with existing peripheral circuit design since the write voltages are low. Furthermore, we utilize a comparator column approach that requires only half area when compared to other differential weights-based BNN accelerators. Our comprehensive analysis utilizing an experimentally calibrated compact model for dual-port FeFETs indicates that the proposed vector-matrix-multiplication (VMM) implementation exhibits an energy efficiency of 789.89 TOPS/W with a throughput of 0.06 TeraOp/s while achieving an accuracy of 96.39% and 80.8% for image classification task on the MNIST and Fashion MNIST datasets after ex-situ training.
Preprint
Full-text available
Complete references to the CONACIC 2023 conference paper called “A Survey on FPGA-Based Design Methodologies for Visual Object Tracking.”
Article
Low-precision quantized neural networks (QNNs) reduce the required memory space, bandwidth, and computational power, and hence are suitable for deployment in applications such as IoT edge devices. Mixed-precision QNNs, where weights commonly have lower precision than activations or different precision is used for different layers, can limit the accuracy loss caused by low-bit quantization, while still benefiting from reduced memory footprint and faster execution. Previous multiple-precision functional units supporting 8-bit, 4-bit, and 2-bit SIMD instructions have limitations, such as large area overhead, under-utilization of multipliers, and wasted memory space for low and mixed bit-width operations. This paper introduces BISDU, a bit-serial dot-product unit to support and accelerate execution of mixed-precision low-bit QNNs on resource-constrained microcontrollers. BISDU is a multiplier-less dot-product unit, with frugal hardware requirements (a population count unit and 2:1 multiplexers). The proposed bit-serial dot-product unit leverages the conventional logical operations of a microcontroller to perform multiplications, which enables efficient software implementations of binary ( Xnor ), ternary ( Xor ), and mixed-precision [W × A] ( And ) dot-product operations. The experimental results show that BISDU achieves competitive performance compared to two state-of-the-art units, XpulpNN and Dustin, when executing low-bit-width CNNs. We demonstrate the advantage that bit-serial execution provides by enabling trading accuracy against weight footprint and execution time. BISDU increases the area of the ALU by 68% and the ALU power consumption by 42% compared to a baseline 32-bit RISC-V (RV32IC) microcontroller core. In comparison, XpulpNN and Dustin increase the area by 6.9 × and 11.1 × and the power consumption by 3.8 × and 5.97 ×, respectively. The bit-serial state-of-the-art, based on a conventional popcount instruction, increases the area by 42% and power by 32%, with BISDU providing a 37% speedup over it.
Article
Full-text available
Large-scale deep neural networks (DNN) have been successfully used in a number of tasks from image recognition to natural language processing. They are trained using large training sets on large models, making them computationally and memory intensive. As such, there is much research interest in faster training and test time. In this paper, we present a unique approach using lower precision weights for a more efficient and faster training phase. We separate imagery into different frequency bands (e.g. with different information content) such that the neural net can better learn using fewer bits. We present this approach as a complement to existing methods such as pruning network connections and encoding learned weights. We show results where this approach supports more stable learning with a 2-4X reduction in precision and a 17X reduction in DNN parameters.
Article
Full-text available
We propose DoReFa-Net, a method to train convolutional neural networks that have low bitwidth weights and activations using low bitwidth parameter gradients. In particular, during the backward pass, parameter gradients are stochastically quantized to low bitwidth numbers before being propagated to convolutional layers. As convolutions during forward/backward passes can now operate on low bitwidth weights and activations/gradients respectively, DoReFa-Net can use bit convolution kernels to accelerate both training and inference. Moreover, as bit convolutions can be efficiently implemented on CPU, FPGA, ASIC and GPU, DoReFa-Net opens the way to accelerating the training of low-bitwidth neural networks on such hardware. Our experiments on the SVHN and ImageNet datasets prove that DoReFa-Net can achieve prediction accuracy comparable to its 32-bit counterparts. For example, a DoReFa-Net derived from AlexNet that has 1-bit weights and 2-bit activations can be trained from scratch using 4-bit gradients to get 47% top-1 accuracy on the ImageNet validation set. The DoReFa-Net AlexNet model is released publicly.
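For reference, DoReFa-Net's k-bit quantizer for a value already mapped into [0, 1] is quantize_k(x) = round((2^k − 1)·x) / (2^k − 1); a minimal sketch follows (the paper's tanh-based weight normalization and its straight-through/stochastic gradient handling are omitted):

```c
/* k-bit uniform quantizer from the DoReFa-Net formulation, for x in [0, 1]. */
#include <math.h>

static float quantize_k(float x, int k)
{
    float levels = (float)((1 << k) - 1);   /* 2^k - 1 quantization steps */
    return roundf(levels * x) / levels;
}
```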
Technical Report
Deep convolutional neural networks (CNNs) have become the most promising method for object recognition, repeatedly demonstrating record-breaking results for image classification and object detection in recent years. However, a very deep CNN generally involves many layers with millions of parameters, making the storage of the network model extremely large. This prohibits the usage of deep CNNs on resource-limited hardware, especially cell phones or other embedded devices. In this paper, we tackle this model storage issue by investigating information-theoretical vector quantization methods for compressing the parameters of CNNs. In particular, we have found that, for compressing the most storage-demanding densely connected layers, vector quantization methods have a clear gain over existing matrix factorization methods. Simply applying k-means clustering to the weights or conducting product quantization can lead to a very good balance between model size and recognition accuracy. For the 1000-category classification task in the ImageNet challenge, we are able to achieve 16-24 times compression of the network with only 1% loss of classification accuracy using a state-of-the-art CNN.
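As an illustration of the weight-sharing idea (scalar k-means rather than the paper's product quantization), the sketch below clusters a layer's weights into a small codebook and stores only 8-bit centroid indices; initialization and convergence checks are deliberately naive:

```c
/* Naive 1-D k-means for weight sharing: replaces each weight by the index of
 * its nearest centroid. Assumes n_centers <= 256 so indices fit in a byte. */
#include <stdint.h>
#include <math.h>

static void kmeans_1d(const float *w, int n, float *centers, uint8_t *idx,
                      int n_centers, int iters)
{
    for (int c = 0; c < n_centers; c++)               /* crude sample-based init */
        centers[c] = w[(int)((long)c * n / n_centers)];
    for (int it = 0; it < iters; it++) {
        for (int i = 0; i < n; i++) {                  /* assignment step */
            int best = 0;
            for (int c = 1; c < n_centers; c++)
                if (fabsf(w[i] - centers[c]) < fabsf(w[i] - centers[best]))
                    best = c;
            idx[i] = (uint8_t)best;
        }
        for (int c = 0; c < n_centers; c++) {          /* update step */
            double sum = 0.0; int cnt = 0;
            for (int i = 0; i < n; i++)
                if (idx[i] == c) { sum += w[i]; cnt++; }
            if (cnt) centers[c] = (float)(sum / cnt);
        }
    }
}
```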
Conference Paper
State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power. Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy-efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE a 120x energy saving; exploiting sparsity saves 10x; weight sharing gives 8x; skipping zero activations from ReLU saves another 3x. Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS/s working directly on a compressed network, corresponding to 3 TOPS/s on an uncompressed network, and processes FC layers of AlexNet at 1.88x10^4 frames/sec with a power dissipation of only 600 mW. It is 24,000x and 3,400x more energy efficient than a CPU and a GPU, respectively. Compared with DaDianNao, EIE has 2.9x, 19x, and 3x better throughput, energy efficiency, and area efficiency.
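The core computation EIE accelerates is a sparse matrix-vector product over a weight-shared, compressed layer; a plain-C sketch of that access pattern (field names and the 8-bit codebook index width here are illustrative, not EIE's exact encoding) is:

```c
/* Compressed sparse matrix-vector product: CSC storage, codebook-indexed
 * weights, and zero activations skipped. Assumes y is pre-zeroed. */
#include <stdint.h>

typedef struct {
    const uint8_t  *w_idx;    /* codebook index of each nonzero weight   */
    const uint16_t *row;      /* output row of each nonzero              */
    const uint32_t *col_ptr;  /* start offset of each column's nonzeros  */
    const float    *codebook; /* shared weight values                    */
} sparse_layer;

static void spmv(const sparse_layer *L, const float *x, float *y, int n_cols)
{
    for (int j = 0; j < n_cols; j++) {
        if (x[j] == 0.0f) continue;                    /* skip ReLU zeros */
        for (uint32_t k = L->col_ptr[j]; k < L->col_ptr[j + 1]; k++)
            y[L->row[k]] += L->codebook[L->w_idx[k]] * x[j];
    }
}
```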
Article
We propose two efficient approximations to standard convolutional neural networks: Binary-Weight-Networks and XNOR-Networks. In Binary-Weight-Networks, the filters are approximated with binary values resulting in 32x memory saving. In XNOR-Networks, both the filters and the input to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations. This results in 58x faster convolutional operations and 32x memory savings. XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real-time. Our binary networks are simple, accurate, efficient, and work on challenging visual tasks. We evaluate our approach on the ImageNet classification task. The classification accuracy with a Binary-Weight-Network version of AlexNet is only 2.9% less than the full-precision AlexNet (in top-1 measure). We compare our method with recent network binarization methods, BinaryConnect and BinaryNets, and outperform these methods by large margins on ImageNet, more than 16% in top-1 accuracy.
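The Binary-Weight-Network approximation replaces each real-valued filter W with alpha·sign(W), where alpha is the mean absolute value of W; a small sketch of that binarization step (training and the XNOR-Network activation binarization are out of scope here) is:

```c
/* Binarize one filter: b[i] = sign(w[i]) in {-1, +1}; returns alpha = mean|w|. */
#include <stdint.h>
#include <math.h>

static float binarize_filter(const float *w, int8_t *b, int n)
{
    float alpha = 0.0f;
    for (int i = 0; i < n; i++) {
        alpha += fabsf(w[i]);
        b[i] = (w[i] >= 0.0f) ? 1 : -1;
    }
    return alpha / (float)n;
}
```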
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Conference Paper
Recent advances in deep learning have made the use of large, deep neural networks with tens of millions of parameters suitable for a number of applications that require real-time processing. The sheer size of these networks can represent a challenging computational burden, even for modern CPUs. For this reason, GPUs are routinely used instead to train and run such networks. This paper is a tutorial for students and researchers on some of the techniques that can be used to reduce this computational cost considerably on modern x86 CPUs. We emphasize data layout, batching of the computation, the use of SSE2 instructions, and particularly leverage SSSE3 and SSE4 fixed-point instructions which provide a 3X improvement over an optimized floating-point baseline. We use speech recognition as an example task, and show that a real-time hybrid hidden Markov model / neural network (HMM/NN) large vocabulary system can be built with a 10X speedup over an unoptimized baseline and a 4X speedup over an aggressively optimized floating-point baseline at no cost in accuracy. The techniques described extend readily to neural network training and provide an effective alternative to the use of specialized hardware.
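The fixed-point kernels described above ultimately reduce to integer dot products with a final rescale; a scalar sketch of that pattern (without the SSSE3/SSE4 intrinsics, data layout, or batching the paper emphasizes) is:

```c
/* 8-bit fixed-point dot product with 32-bit accumulation; w_scale and a_scale
 * undo the per-tensor quantization scales applied when converting from float. */
#include <stdint.h>

static float fixed_dot(const int8_t *w, const uint8_t *a, int n,
                       float w_scale, float a_scale)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)w[i] * (int32_t)a[i];
    return (float)acc * w_scale * a_scale;
}
```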
Conference Paper
We propose two efficient approximations to standard convolutional neural networks: Binary-Weight-Networks and XNOR-Networks. In Binary-Weight-Networks, the filters are approximated with binary values resulting in 32× memory saving. In XNOR-Networks, both the filters and the input to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations. This results in 58× faster convolutional operations (in terms of number of the high precision operations) and 32× memory savings. XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real-time. Our binary networks are simple, accurate, efficient, and work on challenging visual tasks. We evaluate our approach on the ImageNet classification task. The classification accuracy with a Binary-Weight-Network version of AlexNet is the same as the full-precision AlexNet. We compare our method with recent network binarization methods, BinaryConnect and BinaryNets, and outperform these methods by large margins on ImageNet, more than 16% in top-1 accuracy. Our code is available at: http://allenai.org/plato/xnornet.