Submitted to Journal of Signal Processing Systems, Special Issue on Embedded Computer Vision, August 12, 2016
Abstract—Memory performance is a key bottleneck for
deep learning systems. Binarization of both activations and
weights is a promising approach that scales to the lowest
possible precision and therefore to the most energy-efficient
systems. In this paper, we apply a binarized neural network
to human detection in infrared images and analyze its
behavior. Our results show that the binarized network
achieves algorithmic performance comparable to that of 32bit
floating-point networks, with the added benefit of greatly
simplified computation and reduced memory overhead. We also
present a system architecture designed specifically for
computation on binary representations that achieves at least
a 4× speedup and improves energy efficiency by three orders
of magnitude over a GPU.
Index Terms—Deep learning, embedded computer vision,
binary neural network, low-power object detection
I. INTRODUCTION
Large-scale deep neural networks (DNN) have been
successfully used in a number of tasks from image recognition
to natural language processing [1-4]. It is well understood that
to capture the wide spectrum of low, mid and high-level
representations of complex patterns, networks with many
layers, nodes, and with high local and global connectivity are
needed [5]. The success of recent deep learning algorithms
comes in part from the ability to train much larger DNN on
much larger datasets than was previously possible.
The networks used to solve complex, real-world problems are
getting deeper and larger. This trend inevitably makes both
training and inference computationally challenging and
memory-intensive. As such, there has been a growing line of
research focused on minimizing memory utilization and
energy consumption from the hardware perspective [6-9].
These approaches use fixed-point instructions with a reduced
level of bit-precision, utilize approximate hardware, or perform
data compression after training.
Another set of approaches targets a specific bit-precision
during training, e.g. training DNN weights at 8-bit instead of
32-bit. Results have shown that these approaches can arrive at
DNN weight parameters of lower precision with negligible loss
in performance [10-13]. In addition, incorporating lower
bit-precision during training helps reduce training time,
memory footprint and bitwise computational requirements.
In this paper, we use the binarized neural network (BNN) as
the algorithmic approach for our embedded DNN processor
because BNN offers the largest savings in compute and memory
resources (binary being the lowest possible precision). For an
in-depth performance comparison, we analyze our BNN algorithm
and processor design on an image recognition application that
detects humans in the infrared (IR) spectrum. We compare the
algorithmic performance against 32bit DNNs for various
model parameters such as the number of kernels and synaptic
connections.
We also designed a system architecture suited for BNN (as an
ASIC or on an FPGA). We use local memories to store all BNN
weights in order to speed up overall processing. We then
estimate the computing performance of our BNN architecture and
compare it to that of a GPU. The major computing blocks in the
BNN system are designed and synthesized to estimate the power
consumption of those functional blocks.
Our main contributions can be summarized as follows:
• We demonstrate learning of visual features agnostic to
  thermal variations to identify human subjects in infrared
  images by using BNN.
• We provide a detailed characterization of the impact of
  binarization on object detection.
• We show that BNN performance is close to that of the 32bit
  network. Binary weights in hyperspace are not a linearly
  scaled version of those obtained from the 32bit network;
  nevertheless, they are another valid representation for the
  trained NN.
• We analyze and design the system architecture for BNN,
  with performance and energy efficiency gains.
We believe that the results shown in this paper confirm the
suitability of BNNs for embedded platforms. We provide
demonstrable results on a complex application, IR human
detection, and we anticipate the use of BNNs in a variety
of embedded vision platforms, from smart cameras to
smartphone apps.
This paper is organized as follows: in Section II, we present
previous research work on low-precision neural networks
including BNN. In Section III, we describe our dataset captured
from different infrared cameras. In Section IV, we analyze
simulation results of BNN on our IR dataset and compare them
to 32bit simulation results. In Section V, we explore
the BNN system architecture and present performance/energy
estimates. Finally, we present our conclusions and discuss our
future work in Section VI.
Efficient Object Detection Using Embedded Binarized Neural
Networks
Jaeha Kung†, David Zhang*, Gooitzen van der Wal*, Sek Chai*, Saibal Mukhopadhyay†
Georgia Institute of Technology, Atlanta, GA†
SRI International, Princeton, NJ*
II. PREVIOUS WORK: LOW-PRECISION NNS
Over the last decade, DNN parameter sizes have continued to
grow dramatically. LeNet5, a Convolutional Neural Network (CNN)
presented in 1998 [14], uses ~60K parameters to classify
handwritten digits. AlexNet CNN [1] used 61M parameters to
win the ImageNet image classification competition in 2012.
More recently, new DNNs such as DeepFace use 120M
parameters for human face identification [15], and there are
even networks with 10B parameters [16]. The current trend in
achieving a higher level of algorithmic performance is to
increase the number of DNN layers to capture high-level
semantic reasoning; as such, memory bottlenecks will
continue to be the central computing issue [17].
The computation power and run-time memory footprint required
by these modern DNNs in inference mode exceed the power and
memory budgets of a typical mobile device or embedded
platform. To tackle these problems, different approaches,
including limiting bit-precision [6,7], utilizing approximate
hardware [8] and compressing data [9], were proposed for
inference mode. Efficient manipulation of data precision on
CPUs to speed up DNN computation (and to reduce memory
footprint by using 8bit) was presented in [6]. In [7], a greedy
algorithm was presented to automatically set a reduced
bit-precision for each synapse according to its error
sensitivity. The use of approximate hardware with fixed-point
computation further reduces energy consumption [7,8]. In [9], a
compression method for reducing the memory requirement was
also explored.
To limit the bit-precision more aggressively, in an effort to
reduce memory, power usage and training time, several works
target very low precision already during training [10-13].
Han et al. use a 4bit code book for weight parameters and
update centroids during training [10]; data pruning and data
compression are performed to fit the parameters into local
memory. More aggressively, [11,12] proposed binary weights
and activations in training mode (and inference) and
obtained good results on well-known benchmarks. Both
approaches keep binary weights and activations for all
computations but use 32bit weight parameters during the
weight update. As an extension of this approach, 1bit weights,
2bit activations and 4bit gradients are used to improve
performance on a complex image recognition dataset [13].
All these methods target the same goals: i) reduce the memory
requirement and ii) lower energy consumption (either by power
reduction or speed improvement). Since limiting bit-precision
prior to training successfully reduces bit-precision down to
binary, we select BNN [11] as our test prototype to realize a
low-power embedded vision system. BNN has a simpler
binarization than its counterparts [12,13] and shows good
accuracy on simple image classification tasks. An overview of
the BNN algorithm is shown in Figure 1. The important points to
note are that the weight update is performed on real values
(e.g. 32bit floating-point) and that the input is 8bit only
during inference. In the following section, we describe the
dataset we use to verify the applicability of BNN to real
embedded applications.
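To make the split between binary forward computation and real-valued weight update concrete, the following is a minimal NumPy sketch of one training step for a single binarized linear layer. The sign-based binarization, the squared-error loss, the straight-through gradient and the clipping of the real-valued weights to [-1, 1] are assumptions chosen for illustration; they are not taken from this paper, and the function names are hypothetical.

```python
import numpy as np

def binarize(x):
    """Deterministic sign binarization: values >= 0 map to +1, others to -1."""
    return np.where(x >= 0, 1.0, -1.0)

def train_step(w_real, x_bin, target, lr=0.01):
    """One SGD step for a single binarized linear layer (illustrative only)."""
    w_bin = binarize(w_real)           # forward pass uses binary weights
    y = x_bin @ w_bin                  # pre-activation of the layer
    grad_y = y - target                # gradient of 0.5 * ||y - target||^2
    grad_w = np.outer(x_bin, grad_y)   # gradient w.r.t. the binary weights
    # Assumption: straight-through update, i.e. the gradient computed for the
    # binary weights is applied to the real-valued copy, which is then clipped.
    return np.clip(w_real - lr * grad_w, -1.0, 1.0)

rng = np.random.default_rng(0)
w = rng.uniform(-1.0, 1.0, size=(8, 3))    # real-valued weights kept for updates
x = binarize(rng.standard_normal(8))       # binary activations from a previous layer
w = train_step(w, x, target=np.zeros(3))
```

Only `binarize(w)` would need to be stored for inference; the real-valued copy exists solely to accumulate small updates during training.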
III. INFRARED DATASET
A. IR Data Collection
Traditional IR detection systems rely on manual coding of
visual features, resulting in missed detections and high false
alarm rates (FAR). Deep learning algorithms have been used
for object classification with great success, but current
approaches are focused on images in the visible spectrum, with
high resolution and contrast. Commercial deep learning
systems are cloud-based and geared towards performance
results at the cost of high power consumption and are less
concerned with real-time operation. IR cameras operate day
and night, but they pose additional challenges for deep learning
algorithms due to the fact that images have no color cues and
there are thermal variations for the same object. Small objects
in different poses are difficult to identify because they “look
like blobs” with little texture discrimination.
With BNNs, we take advantage of the lack of color data to
train a network specifically on spatial features. We have
previously shown that we can train DNNs separately on spatial
and texture data, and then combine the results [18]. The
resulting DNN can be trained faster using a higher learning
rate, and also achieves higher detection performance because
we are able to direct the learning on spatial features and
texture separately. For this effort, we treat our BNN training
in a semi-supervised manner, specifically for spatial features
(e.g. edges, shapes, etc.), and we treat textural data as a
component where precision can be lossy.
Figure 1. Overview of binarized neural network (BNN): The training of BNN (top) and the inference mode in BNN (bottom).
We collected a large sample of IR images of human activity,
both indoors and outdoors, on stationary and moving camera
platforms, at different times of the day to capture different
thermal background temperatures. The objects range from
medium to long range (1-100m) such that distant objects can be
as small as 20×20 pixels. We further incorporated thermal
images of pedestrians (in lab, school lounge, and marathon)
from [19] into our training set. Based on our experience, we
anticipate the need for a minimum of 100 frames for each
person in any particular pose. This is approximately five
seconds with a 20 frame-per-second camera.
B. Data Augmentation
Deep learning requires large training data sets, and we
address the challenges in data annotation with our data
ingestion software that labels image regions automatically. Our
data collection task is to ensure that we collect enough data
samples with sufficient variations in body poses and thermal
contrast. We then prepare the training set by selecting the
proper regions in the image with human subjects. We do this
labeling automatically using a combination of hot-spot and
motion tracking, similar to work published previously [20].
Negative samples are selected in the non-human area per frame,
based on object proposals suggested by Selective Search [21].
To add further resilience to our training set, we
augment each image in the training set with variations that
are specific to thermal imagery (e.g. contrast change, resolution
blur, Gaussian noise and geometric distortions). In the context
of this paper, we call them augment noise or simply noise. This
is in addition to the spatial augmentation (e.g. rotation, skew,
shift) that is commonly applied to training data in the visible
spectrum.
We have found, early on, that augmentation specific to IR
images is a key step in training in this spectral range so that our
DNN can be robust to realistic conditions (e.g. adverse weather,
ambient temperature, sensor noise and object distance/size). In
total, we have a sample set of 255k positive and 658k negative
samples.
IV. ALGORITHMIC EVALUATION OF BNN
A. Training
In addition to our IR dataset, we also used the MNIST dataset
as a reference benchmark [14]. Although prior results suggest
that deeper and larger networks are needed for more complex
vision tasks [13,17], the IR dataset has only two classes (i.e.
human and non-human), so we started with a shallow network,
LeNet5. LeNet5 serves as our baseline network; it has 2
convolutional layers (with max pooling) and 3 fully-connected
layers. To see the impact of different network
configurations, such as the size of convolution kernels, the
number of kernels, the size of the FC layers, and the number of
convolutional layers, we studied four variants of
LeNet5, each trained on the MNIST and IR datasets. The
performance of these configurations is reported in the next
subsection.
Each network is trained over 500 epochs with 50
mini-batches per epoch. The loss curves of BNN and the 32bit
network are compared on the MNIST and IR datasets (Figure 3).
For MNIST, 40,000 training samples (input size: 28×28) are
used to train the network [14]. For the IR dataset, 20,000
training samples (input size: 128×128) with data augmentation as
explained in Section III.B are used: 30% of the samples are fed
into the network without any noise, while the remaining 70%
are pre-processed with a randomized augment noise
before being fed to the NN for training.
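The clean/noisy split described above can be sketched as follows. This is only an illustration under stated assumptions: the three distortions, their parameter ranges, and the per-sample random choice (which yields about 30% clean samples rather than exactly 30%) are hypothetical and are not the authors' augmentation pipeline.

```python
import numpy as np

def augment_noise(img, rng):
    """Apply one randomly chosen thermal-style distortion to a float image in [0, 1]."""
    choice = rng.integers(3)
    if choice == 0:                        # contrast change around the image mean
        gain = rng.uniform(0.5, 1.5)       # assumed range
        out = (img - img.mean()) * gain + img.mean()
    elif choice == 1:                      # resolution blur via a 3x3 box filter
        padded = np.pad(img, 1, mode="edge")
        h, w = img.shape
        out = sum(padded[dy:dy + h, dx:dx + w]
                  for dy in range(3) for dx in range(3)) / 9.0
    else:                                  # additive Gaussian sensor noise
        out = img + rng.normal(0.0, 0.05, img.shape)   # assumed sigma
    return np.clip(out, 0.0, 1.0)

def prepare_batch(images, rng, clean_fraction=0.3):
    """Leave roughly 30% of the samples clean and add augment noise to the rest."""
    return np.stack([img if rng.random() < clean_fraction else augment_noise(img, rng)
                     for img in images])

rng = np.random.default_rng(0)
batch = prepare_batch(rng.random((4, 128, 128)), rng)   # four 128x128 IR crops
```

Geometric distortions and the spatial augmentation (rotation, skew, shift) are omitted here to keep the sketch short.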
The convergence speed is important in terms of training time
(equivalently, energy consumption). With augment noise in the IR
dataset, both the 32bit network and BNN see different variations
in the input data distribution and thus take longer to converge.
For both the MNIST and IR datasets, the final loss obtained by
the 32bit network was lower than that of BNN.
Figure 2. IR dataset captured from multiple infrared cameras followed by data augmentation to handle noise from cameras.
Figure 3. The loss curves during the training with BNN and 32bit on MNIST (left) and IR (right) datasets.
Previously, there has been little analysis of how the learned
kernels differ between BNN and the 32bit
network. The kernels in the first convolutional layer of LeNet5
(for both BNN and 32bit) are shown in Figure 4. The binary
kernels let BNN capture edge or shape information. The
learned kernels in the 32bit network identify texture as well as
edge/shape information. As a result, BNN mostly relies on edges
of input images, whereas the 32bit network captures the
human subject together with some background information. As we
will discuss in Section IV.C, the edge-oriented BNN has a
benefit over the 32bit network for some input types.
B. Accuracy Evaluation
To evaluate the accuracy of BNN, we compare it with a 32bit
network on classification of both the MNIST and IR
datasets. Table I summarizes the specifications of the networks
that we tested. LeNet5 is the baseline (v1).
Among the variants of LeNet5, network v2 has larger
convolutional kernels, v3 has larger fully-connected layers, v4
has more kernels and v5 has one more convolutional
layer. The number of parameters of each network on the IR
dataset is also reported; it is the same for both BNN and 32bit.
The reported memory size on the IR dataset is for BNN only (for
32bit, we can simply multiply the number by 32). The number of
test samples is 10,000 for MNIST and 4,000 for the IR dataset.
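The parameter and memory figures reported for the baseline can be checked with a short calculation. The sketch below makes several assumptions that are not stated explicitly in the text: valid (unpadded) convolutions, 2×2 pooling with floor division, full connectivity between C1 and C3 feature maps, no bias terms, a two-node output layer, 1 bit per weight and per hidden activation, and 1KB = 1000 bytes. Under these assumptions the result matches the 1.6M parameters in Table I and the 203.44KB (weights) and 39.23KB (activations) quoted in Section V.B.

```python
def lenet5_ir_storage(in_size=128, c1=6, c3=16, k=5, fc=(120, 84), classes=2):
    """Count weights (biases ignored) and activation bits for LeNet5 on an 8bit IR input."""
    s1 = in_size - k + 1          # C1 output side (valid convolution)
    p1 = s1 // 2                  # S2: 2x2 pooling
    s2 = p1 - k + 1               # C3 output side
    p2 = s2 // 2                  # S4: 2x2 pooling
    flat = c3 * p2 * p2           # inputs to the first fully-connected layer
    weights = (c1 * k * k + c3 * c1 * k * k
               + flat * fc[0] + fc[0] * fc[1] + fc[1] * classes)
    act_bits = (8 * in_size * in_size          # 8bit input image
                + c1 * s1 * s1 + c1 * p1 * p1  # binary C1 and S2 maps
                + c3 * s2 * s2 + flat          # binary C3 and S4 maps
                + fc[0] + fc[1] + classes)     # binary FC activations
    return weights, act_bits

weights, act_bits = lenet5_ir_storage()
print(weights)                        # 1627518 weights, i.e. about 1.6M
print(round(weights / 8 / 1e3, 2))    # 203.44 (KB at 1 bit per weight)
print(round(act_bits / 8 / 1e3, 2))   # 39.23 (KB of activations)
```

Calling `lenet5_ir_storage(fc=(3600, 1024))` or `lenet5_ir_storage(c1=16, c3=32)` gives about 52M and 3.3M weights, in line with the v3 and v4 entries of Table I.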
Figure 5 shows the simulation results, in terms of recognition
accuracy, on both the MNIST and IR datasets. As explained in
Section III.B, noise from various sources may degrade
the quality of video frames captured by mobile or
embedded cameras. Thus, we tested two cases for the IR
dataset: one without noise on the test set and the other with
noise (to verify the noise resilience of our trained model). In
all cases, BNN shows high recognition accuracy, close to that of
the 32bit network (for MNIST, 32bit always performs
slightly better). Interestingly, on the noisy IR dataset, BNN
performs slightly better than the 32bit
network in most configurations. This indicates that the 32bit
network is more sensitive to certain noise in input images. A
possible explanation is given in the following
subsection.
In terms of network specifications, having larger kernels in
BNN on the MNIST dataset reduces the recognition accuracy by
3.1%. However, a larger FC layer or more kernels improves the
accuracy by 1.3%. Network v5 is not tested on MNIST because the
input size is too small. On the IR dataset, all variant networks
show almost the same recognition accuracy (difference < 1%). For
BNN, network v4 performed the best (99.30%). For 32bit,
network v5 performed the best (99.13%). Even with noise
in the input images, the recognition accuracy is reasonably high
(> 96%). As there is little variation in recognition accuracy
among the 5 networks for the IR dataset, we simply
select LeNet5 (v1) as our reference network for the following
analyses.
C. Analysis of Binarization on Object Detection
We compare the histograms of the two networks, the 32bit
network and BNN, to verify that they behave statistically the
same. The upper plots in Figure 6 show histograms of output
scores on the entire test set (4,000 samples) and on samples that
each network fails to recognize correctly. BNN and the 32bit
network have different mean values and deviations, but they show
similar (bimodal) distributions. Each peak represents
human subject or background (non-human). We can interpret
this as BNN being trained to behave similarly to the network
with 32bit precision.
Figure 5. Recognition accuracy on different datasets using both BNN and 32bit network: (a) MNIST, (b) IR dataset without noise and (c) IR dataset with noise.
Figure 4. Learned kernels and output features of BNN and 32bit network at the first convolutional layer.
Table I. Summary of simulated networks and required number of parameters and storage size for IR dataset
Network  C1        S2   C3        S4   C5        S6   F5(F7)  F6(F8)  # Param  Memory
LeNet5   6: 5×5    2×2  16: 5×5   2×2  -         -    120     84      1.6M     243KB
v2       6: 11×11  2×2  16: 7×7   2×2  -         -    120     84      1.3M     201KB
v3       6: 5×5    2×2  16: 5×5   2×2  -         -    3600    1024    52M      6.56MB
v4       16: 5×5   2×2  32: 5×5   2×2  -         -    120     84      3.3M     478KB
v5       6: 5×5    2×2  16: 5×5   2×2  12: 3×3   2×2  120     84      0.26M    72.8KB
For samples that BNN or the 32bit network fails to recognize
correctly, the output scores lie around zero for both networks.
Among those false alarms, we checked the sign of the output
scores on samples that failed on both networks. The result
shows that BNN and the 32bit network had the same output
sign on 95% of the common false alarms, implying that BNN
behaves very similarly to the 32bit network. Some examples of
those common false alarms are shown in Figure 6 with possible
reasons for the recognition failure.
The question then arises why the 32bit network recognizes some
samples correctly while BNN fails, and vice versa.
Figure 7 shows some examples that BNN fails to
recognize but 32bit recognizes. We looked into the six feature
maps after the first convolutional layer for an input sample on
both BNN and the 32bit network. Again, BNN relies solely on edge
(shape) information. Thus, if the input image has too little
contrast to distinguish the subject, as in the example shown in
Figure 7, the features may fail to capture any useful
information. To overcome this problem in BNN, contrast
enhancement such as histogram equalization or LACE can be
applied to IR images before they are fed into BNN. After such
pre-processing, BNN successfully identifies the human subject in
the previously failed example. The corresponding features of
BNN are shown in Figure 7 (lower right), which capture the shape
of a human subject.
We also analyze samples that the 32bit network fails to
recognize but BNN recognizes. A subset of those test samples
is shown in Figure 8. By looking at the feature maps of one of
the samples, we can tell that the 32bit network captures more
background information than BNN. This may help collect more
features, but when the background is noisy (or has complex
textures), it is difficult for the 32bit network to recognize
the subject of interest in IR images. To solve this problem, a
tighter search box during the selective search may help
improve the recognition for the 32bit network.
Figure 6. The histogram of output values (scores) and examples of the common false alarms on identifying human subjects in IR images for both BNN and 32bit network.
Figure 7. The false alarms by testing with BNN (upper left images) and a possible solution to one of the false alarms.
Figure 8. The false alarms with 32bit network (left) and the reasoning behind the failure with possible solution (right).
In summary, we have shown that BNN is trained to perform
in a similar way as 32bit network. This is verified with the high
recognition accuracy on two datasets and by comparing the
actual distribution of output scores. In the following section, we
will present the hardware architecture suited for BNN and
estimate system performance in terms of throughput and energy
consumption.
V. BNN SYSTEM ARCHITECTURE
A. Evaluation on Memory Requirement
The most critical design consideration for neural networks in
hardware is the memory space. As the number of parameters is
ever increasing [1,15,16], the external storage (i.e. DRAM) will
be accessed frequently. The energy consumption of a 32bit DRAM
access, however, is 128× higher than that of a 32bit SRAM
access [22]. Thus, storing parameters in local memory
significantly reduces the energy consumption of the system.
This is even more important on embedded platforms.
Table II summarizes the internal memory size of commercial
FPGAs. This helps us understand the maximum size
of a neural network that each FPGA can fit into its local
memory. For example, AlexNet [1] has 61M parameters, which
require ~240MB at 32bit precision. It does not fit in the largest
FPGA in the table (Xilinx VirtexU 13P). Only a
neural network that is ~4× smaller than AlexNet is
estimated to fit into this FPGA. On the other hand, the VirtexU
13P can fit a binarized neural network with ~0.45B parameters
for inference mode (a 7.5× larger network than AlexNet).
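The capacity estimates above follow directly from the memory sizes in Table II. The sketch below recomputes them, assuming 1MB = 10^6 bytes and counting weights only (no activations or other buffers); the helper name `max_parameters` is hypothetical.

```python
FPGA_MEMORY_MB = {          # internal memory sizes listed in Table II
    "Zynq-7100": 3.3, "Virtex-7 X980T": 6.6, "KintexU 15P": 8.8,
    "ZynqU 19EG": 8.8, "KintexU 115": 9.5, "VirtexU 190": 16.6,
    "VirtexU 13P": 56.8,
}

def max_parameters(memory_mb, bits_per_param):
    """Largest parameter count whose weights alone fit in the given memory."""
    return int(memory_mb * 1e6 * 8 // bits_per_param)

ALEXNET_PARAMS = 61e6
for name, mb in FPGA_MEMORY_MB.items():
    full = max_parameters(mb, 32)      # 32bit weights
    binary = max_parameters(mb, 1)     # binarized weights
    print(f"{name:15s} 32bit: {full / 1e6:6.1f}M  binary: {binary / 1e6:7.1f}M")
```

For the VirtexU 13P this gives about 14.2M parameters at 32bit (roughly a quarter of AlexNet) and about 454M binary parameters (roughly 7.5 times AlexNet).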
For the IR dataset, a small and shallow network (e.g. LeNet5) is
good enough to achieve high recognition accuracy. LeNet5
with 128×128 inputs has 1.6M parameters, occupying about
6.4MB of memory at full precision. A medium-size FPGA such
as the KintexU 15P, with a package size of 35mm×35mm, can fit
all parameters. A small Zynq-7020 chip has a
footprint of 13mm×13mm (with 0.62MB of local memory). To
fit the network into a smaller form factor for embedded
platforms, LeNet5 on IR images can be trained with binary
parameters, which require ~0.24MB and can fit into this small
FPGA.
In the following discussion, LeNet5 is selected as the
reference network on IR dataset. Zynq-7000 series is assumed
as a target embedded platform in describing BNN system
architecture in the following subsections. Please note that the
BNN system architecture is designed for NN inferencing only.
B. BNN System Architecture
BNN, like any other low-precision neural network, requires a
dedicated system architecture to fully realize its performance
gains. In this section, we present the overall architecture and
its processing elements (PE) suited for the BNN algorithm. The
proposed BNN architecture is general with respect to the
application and the mapped network topology. The architecture
remains the same, but the size of the internal memory depends on
the size of the network. To help explain the overall
architecture, we set the application to IR detection with
LeNet5.
First, we have to determine the number of internal memory
blocks, called Block RAM (BRAM) in FPGA terminology.
When deciding the number of BRAMs, we consider the total
number of parameters in the network and the maximum PE
utilization. We made several assumptions about the external
DDR3 memory: a 64bit-wide bus (word), an operating
frequency of 800MHz and 75% efficiency. These assumptions
lead to a transfer rate of 19,200MB/s (this affects the DRAM to
BRAM data prefetching time). For LeNet5 on the IR dataset, a
total data size of 242.67KB is required for activations
(39.23KB) and weights (203.44KB). In the Zynq-7000 series, the
size of each BRAM is 36Kb (512×64bit); thus, 51 BRAMs are
needed to store all weight parameters locally. To organize
BRAMs in a 2-dimensional fabric, we assume 54 BRAMs (6×9)
for parameters in LeNet5. Since the storage requirement for
activations is far smaller, 10 BRAMs are enough to hold all
activation data. To fully utilize PEs, however, 27
BRAMs are distributed over the system so that two PEs share
one activation BRAM. By operating the PE fabric at half the
clock frequency of the BRAM, the proposed architecture achieves
the maximum throughput (Figure 9).
Figure 9. The overall system architecture for BNN. The shared BRAMs (shaded) are for weights in convolutional layer and for activations in fully-connected layer.
Table II. Internal memory size in different FPGA platforms
Xilinx Board     MB
Zynq-7100        3.3
Virtex-7 X980T   6.6
KintexU 15P      8.8
ZynqU 19EG       8.8
KintexU 115      9.5
VirtexU 190      16.6
VirtexU 13P      56.8
Given the overall system architecture, we now present the
design of the processing elements optimized for the BNN
algorithm. Figure 10 shows a block diagram of a PE with its main
functional blocks: (i) the XNOR Dot Product (XDP) module and
(ii) the Batch Normalization (BN) module. As the word size in
DRAM is assumed to be 64bit, the XDP module is designed to
operate on 64bit data and computes 64 multiply-and-accumulate
operations in parallel at each PE. This greatly improves the
system performance for the BNN algorithm (refer to Section V.D).
A simple example of the dot product operation is also given in
Figure 10. Note that the Boolean value 0 represents -1 in BNN
data. An XNOR between ‘n’ activations and weights replaces the
multiplication. The XNOR result is then passed to a
count-and-subtract (CnS) module to obtain the dot product
output.
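The XNOR and count-and-subtract steps can be illustrated in software. The sketch below packs ±1 vectors into integers (bit 1 for +1, bit 0 for -1, as in the text) and computes their dot product with one XNOR and one population count. It is a functional model only, not a description of the hardware; the bit ordering and the helper `pack` are assumptions.

```python
def xnor_dot(a_bits, w_bits, n=64):
    """Dot product of two {-1,+1} vectors packed as n-bit integers (bit 1 = +1, bit 0 = -1)."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ w_bits) & mask     # XNOR: 1 wherever the signs match
    matches = bin(agree).count("1")       # count the matching positions
    return 2 * matches - n                # subtract: (+1) * matches + (-1) * (n - matches)

def pack(values):
    """Pack a list of +1/-1 values into an integer, first element in the lowest bit."""
    return sum(1 << i for i, v in enumerate(values) if v == 1)

a = [1, -1, 1, 1]
w = [1, 1, -1, 1]
print(xnor_dot(pack(a), pack(w), n=4))    # 0, same as sum(x * y for x, y in zip(a, w))
```

With `n=64`, one call corresponds to the 64 multiply-and-accumulate operations that a PE performs on a single word.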
C. BNN Computation Mapping on PEs
FC layer: The computations in the fully-connected (FC) and
convolutional (Conv) layers require different data management.
First, we present the detailed PE mapping for the FC layer.
Most intuitively, we can map all computations for one output
node to one PE. Each PE handles 64 input nodes at a time; thus,
the PE computes the state of the given output node iteratively
if the dimension of the input state vector is larger than 64.
This computation mode is named iterative mode (left diagram in
Figure 11(a)). This mode is inefficient if the number of output
nodes is small, or equivalently if the #input/#output ratio is
high. For instance, if the number of output nodes is only 10,
the remaining 44 PEs are idle, significantly reducing the
utilization.
To solve this issue, another mode for the FC layer is designed.
The additional mode, named partial mode, merely requires an
adder tree for each column of the PE fabric. In this mode, each
column handles the computations for one output node. Thus,
nine output nodes are handled in parallel (compared to 54 nodes
in the iterative mode). As we will discuss in the next
subsection, the partial mode improves throughput when the
#input/#output ratio is high. Also note that the BN+NF module is
idle except in the last row of the PE fabric. Since batch
normalization and nonlinear activation happen after the state of
an output node is fully computed, only one BN+NF module is
needed per PE column. Thus, even for the same type of layer (FC),
the instruction issued in our BNN system architecture changes
depending on the layer specification.
Conv Layer: The Conv layer behaves completely differently
from the FC layer. The number of weight parameters is much
smaller than in the FC layer; thus, weights are stored in a
local buffer at each PE. Making this buffer larger increases the
capacity to handle larger kernels. With an 11bit buffer, kernels
up to size 11×11 are covered (one row of the kernel is stored).
The input data consists of multiple feature maps, where each map
is 2-dimensional data. Input feature maps are distributed to PE
columns. Then, pixel data in each (input) feature map is
distributed to PE rows. For convolution, two consecutive words
(64 pixels/word) in the same row are loaded into each PE. One
word is the active word on which convolution is performed
(darker gray in Figure 11(b)). The other word is the supplement
word, which is shifted in one bit at a time to compute the
partial sum of a row in a convolution. Note that all 64
weight bits in the XDP module are the same. For a 5×5 kernel,
the shift is done 4 times to compute the row partial sums of 64
output nodes. This is repeated 4 times row-wise (the next
row of an input feature is loaded each time) to complete the
convolution with the 5×5 kernel.
An important point is that the XDP module outputs 64 8bit
values in this mapping, whereas it outputs one 8bit value in the
FC layer mapping. Also, in each XDP module, the CnS operation is
done iteratively on each output (simply adding 1 or subtracting
1 depending on the XNOR output). After the computations for all
kernel values are completed, the 64 outputs are independently
added column-wise (64 8bit adders are required; the overhead is
discussed in Section V.D). These results are passed through the
pooling module, which reduces dimensionality. For 2×2 pooling,
the module needs to wait for the next 64 outputs to produce 32
8bit values (from 128 8bit values). This computation is followed
by the BN+NF module to get a 1bit result for 32 pixels of an
output feature. This procedure is repeated until all pixels in
the output features are computed.
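The shift-based row computation can be modelled in a few lines. The sketch below computes the partial sums of one kernel row for n output positions from an active word and a supplement word, using a replicated weight bit, an XNOR, and an iterative add-or-subtract of 1 per output. The bit ordering (bit i holds pixel i), the correlation-style indexing without kernel flipping, and the function name are assumptions made for illustration.

```python
def row_partial_sums(active, supplement, kernel_row, n=64):
    """Partial sums of one kernel row over n output positions.

    active and supplement are two consecutive n-bit words of a binary
    feature-map row (bit 1 = +1, bit 0 = -1); kernel_row is a list of 0/1 bits.
    """
    mask = (1 << n) - 1
    window = active | (supplement << n)       # 2n pixels; shifts pull in supplement bits
    sums = [0] * n
    for w in kernel_row:
        weights = mask if w else 0            # the same weight bit replicated n times
        agree = ~((window & mask) ^ weights) & mask   # XNOR per output position
        for i in range(n):                    # iterative count-and-subtract: +1 or -1
            sums[i] += 1 if (agree >> i) & 1 else -1
        window >>= 1                          # shift in one bit of the supplement word
    return sums

# Output i equals the sum over j of pixel[i + j] * kernel_row[j] in +1/-1 arithmetic.
print(row_partial_sums(0b10110101, 0b00001111, [1, 0, 1, 1, 0], n=8))
```

Summing the results of five such calls, one per kernel row, gives the full 5×5 result for the same n output positions.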
Figure 11. A computation mapping on each PE for fully-connected layer (a) or convolutional layer (b).
Figure 10. A block diagram of a processing element with main modules: (i) XDP module and (ii) BN module.
Input Layer: Unlike the other layers, the input layer
accepts 8bit data as input. This makes the computation
mapping in the first layer slightly different from that of the
other layers. In memory, eight 8bit inputs are stored in one
64bit word. This requires the PE to handle eight 8bit values in
the XDP module instead of 64 1bit values. The weight parameters
are still kept at 1bit. The output of the XDP module is then
handled differently from the mapping shown in Figure 11.
In the FC layer, block CnS modules are used to perform addition
on the XNOR results at the same bit-position. Then, Shift-and-Add
(SnA) is performed to compute the dot product result with 8bit
input data (Figure 12(a)). The other modules are kept
the same. In the Conv layer, SnA is performed on the 8 output
chunks of the iterative CnS module connected to each XNOR output
(Figure 12(b)). The XDP module outputs eight 8bit values; thus,
only parts of the column-wise adder, pooling and BN modules
are utilized for the input layer. In both cases, the only
additional hardware overhead is the SnA modules and their
management. Also, since a PE computes the dot product on
eight inputs at a time, the throughput is reduced by a factor of
8 during input layer processing.
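One way to read the combination of per-bit-position CnS and Shift-and-Add is as a bit-plane decomposition of the 8bit inputs. The sketch below is a functional model under that assumption: each bit plane is XNOR-ed with the binary weights, counted and subtracted, and the eight results are weighted by powers of two. With each bit interpreted as ±1, the total equals the dot product of the weights with the remapped input 2p - 255; this remapping is a consequence of the assumed interpretation and is not stated in the paper.

```python
def input_layer_dot(pixels, w_bits):
    """Dot product of 8bit pixels with binary weights via bit planes and shift-and-add.

    pixels: list of ints in 0..255; w_bits: list of 0/1 weight bits (1 = +1, 0 = -1).
    """
    n = len(pixels)
    total = 0
    for b in range(8):                                   # one bit position at a time
        plane = [(p >> b) & 1 for p in pixels]
        matches = sum(1 for x, w in zip(plane, w_bits) if x == w)   # XNOR, then count
        total += (2 * matches - n) << b                  # subtract, then shift-and-add
    return total

pixels = [0, 255, 17, 200]
w_bits = [1, 0, 1, 1]
reference = sum((1 if w else -1) * (2 * p - 255) for p, w in zip(pixels, w_bits))
print(input_layer_dot(pixels, w_bits) == reference)
```

Because eight bit planes are processed for every group of inputs, the model also reflects why the input layer runs at one eighth of the throughput of the binary layers.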
D. Throughput and Energy Analysis
With the proposed BNN architecture described in Section
V.C, we can estimate the throughput under proper pipelining
in the PEs. Before estimating performance, we need to set the
operating frequencies of the local memory and PEs. The BRAM in
the Zynq-7000 series can be operated at 400MHz, and we assume an
operating frequency of 200MHz for the PEs. The operating
frequency of the PE module is verified by synthesizing the
design in 28nm CMOS technology [23].
For FC layer, there are two different modes depending on the
network topology. In iterative mode, the throughput in terms of
clock cycles is given by:
(1)
where #input (or #output) is the number of nodes in current
layer (or next layer), word size is 64 and #PEs is 54. In partial
mode, the throughput changes to:
(2)
where the two remaining parameters are the number of rows and the number of
columns of the PE array. To verify that the iterative (partial)
mode is beneficial when the #input/#output ratio is low (high), the
throughputs of two different cases are compared in Table III.
For Conv layer, the throughput becomes:
(3)
where the parameters are the dimensions of the input features and
the number of input
features (or output features). For instance, a convolution with a
5×5 kernel on six 120×120 input features to obtain 16 output
features requires 11,700 cycles. This is significantly longer
than the cycles required for the FC layer examples (refer to Table
III).
Table III. Throughput comparison (in clock cycles) between iterative and partial
mode for FC layer processing
Mode        (#inputs, #outputs)
            (1024, 512)    (1024, 10)
Iterative   160            16
Partial     228            6
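The iterative-mode entries of Table III can be reproduced with a simple cycle count. The formula below is an assumption inferred from the stated parameters (word size 64, 54 PEs) and from the two iterative entries of the table; it is not quoted from Equation (1). Partial mode is not modeled here, and the function name is hypothetical.

```python
WORD_SIZE = 64   # bits per memory word, as stated in the text
NUM_PES = 54     # number of processing elements, as stated in the text


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def iterative_cycles(n_inputs, n_outputs, word_size=WORD_SIZE, num_pes=NUM_PES):
    """Assumed model: one cycle per input word, repeated for each group of PEs' outputs."""
    input_words = ceil_div(n_inputs, word_size)    # words needed to stream all inputs
    output_passes = ceil_div(n_outputs, num_pes)   # passes needed to cover all outputs
    return input_words * output_passes


cycles_512 = iterative_cycles(1024, 512)
cycles_10 = iterative_cycles(1024, 10)
```

Under this model, a layer with few outputs leaves most PEs idle in iterative mode, which is consistent with partial mode being preferable when the #input/#output ratio is high.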
To estimate the throughput of a neural network for IR
detection, the number of cycles for LeNet5 is estimated. The
layer-wise estimated performance of the BNN hardware is shown
in Figure 13. As a result, it takes 0.19s to process 4,000 test
images. Processing the same number of test images on a GPU
took 0.83s for inference. Thus, we get a 4.37× improvement
in computation speed compared to the GPU node. The GPU model
is a Tesla K20, which has 2496 processors running at 706MHz.
Given that our BNN architecture has only 54 PEs, the 4.37× improvement
is appreciable. The speed-up can be higher if a higher power
budget allows more PEs in the target embedded
system. The throughput analysis on LeNet5 implies that the Conv
layer is more computation intensive than the FC layer. However,
the FC layer demands more memory space as it normally has a
significantly higher parameter count. The delay incurred by
input layer processing significantly increases the computation
time.
Figure 12. A computation mapping for input layer processing for fully-connected layer (a) or convolutional layer (b).
Since our target platforms are energy-limited embedded
platforms, energy efficiency is as important as
throughput. In order to estimate the power consumption of each
functional block, we designed and synthesized the major blocks in
28nm CMOS technology. The power consumption of those
blocks is summarized in Table IV. The XDP module is
designed to function on any computation mapping by having
different modes. In the proposed BNN architecture, we need 50
XDP modules (19.04mW), 82 BN modules (9.52mW) and 64
8bit adders (4.47mW). Overall, it consumes 33.03mW and
takes up an area of 0.26mm². Thus, the energy consumption of the
BNN system becomes 6.28mJ while the GPU consumes 38.18J
(reported power was 46W). In terms of energy efficiency,
the proposed BNN architecture improves by three orders of
magnitude.
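The energy figures follow from multiplying power by the processing time for the 4,000 test images. The short calculation below reproduces them from the numbers stated in the text; the block powers are taken as stated rather than recomputed from module counts, and the variable names are illustrative.

```python
# Stated block powers in mW (XDP modules, BN modules, 8bit adders).
BNN_POWER_W = (19.04 + 9.52 + 4.47) * 1e-3
BNN_TIME_S = 0.19    # time for 4,000 test images on the BNN hardware
GPU_POWER_W = 46.0   # reported GPU power
GPU_TIME_S = 0.83    # time for 4,000 test images on the GPU


def energy_joules(power_w, time_s):
    """Energy as power multiplied by time."""
    return power_w * time_s


bnn_energy = energy_joules(BNN_POWER_W, BNN_TIME_S)
gpu_energy = energy_joules(GPU_POWER_W, GPU_TIME_S)
speedup = GPU_TIME_S / BNN_TIME_S
energy_ratio = gpu_energy / bnn_energy

print(f"BNN energy: {bnn_energy * 1e3:.2f} mJ")
print(f"GPU energy: {gpu_energy:.2f} J")
print(f"Speed-up: {speedup:.2f}x, energy ratio: {energy_ratio:.0f}x")
```

The resulting energy ratio lies between 10^3 and 10^4, which matches the stated improvement of three orders of magnitude.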
Table IV. Power consumption of functional blocks in BNN
hardware synthesized in 28nm technology node
28nm          XDP Module   BN Module   8bit Adder
Power (µW)    380.84       116.09      69.87
Area (µm²)    2910.40      983.76      507.65
Timing (ns)   3.09         2.26        1.61
VI. CONCLUSIONS
In this paper, we have shown that a binarized neural network
(BNN) can effectively detect human subjects in infrared
images. This is achieved by proper data augmentation, adding
noise to the dataset to deal with the variable distance of a subject and
thermal variations. As a result, the recognition accuracy was
over 98% without any noise and over 96% even with noise in
IR images. As we target embedded platforms, we presented an
architecture suited for BNN and designed the main functional
blocks to estimate throughput as well as power consumption.
The proposed BNN system outperformed a GPU with 46× as many
processors by 4.37× in terms of throughput. Since our BNN
system has far fewer processing elements, the energy
efficiency improves by three orders of magnitude. Our study
shows that a binarized NN targeting object detection can fit
into a small FPGA such as the Zynq-7100. The reduced memory
usage and simplified computation suggest that DNNs can run on mobile
devices with real-time performance. Future work is
to integrate the entire system on small FPGA boards to run
embedded applications.
VII. ACKNOWLEDGMENTS
This research is partially funded under NSF #1526399, the
Defense Advanced Research Projects Agency (DARPA) and
the Air Force Research Laboratory (AFRL). The views,
opinions and/or findings expressed are those of the authors and
should not be interpreted as representing the official views or
policies of the Department of Defense or the U.S. Government.
REFERENCES
[1] A. Krizhevsky et al., “ImageNet classification with deep
convolutional neural networks,” in Proc. Neural Information
Processing Systems (NIPS), pp. 1097–1105, 2012.
[2] P. Sermanet, K. Kavukcuoglu, S. Chintala and Y. LeCun,
“Pedestrian detection with unsupervised multi-stage feature
learning,” in Proc. Comput. Vision Pattern Recog. (CVPR), pp.
3626–3633, IEEE, 2013.
[3] I. Sutskever, J. Martens and G. Hinton, “Generating text with
recurrent neural networks,” in Proc. Int. Conf. Machine Learning
(ICML), pp. 1017–1024, 2011.
[4] R. Collobert et al., “Natural language processing (almost) from
scratch,” J. Machine Learning Research, 12:2493–2537, 2011.
[5] Y. LeCun, Y. Bengio and G. Hinton, “Deep learning,” Nature,
521:436–444, 2015.
[6] V. Vanhoucke and A. Senior, “Improving the speed of neural
networks on CPUs,” in Proc. Deep Learn. Unsupervised Feature
Learn. Workshop, NIPS, 2011.
[7] J. Kung, D. Kim and S. Mukhopadhyay, “A power-aware digital
feedforward neural network platform with backpropagation
driven approximate synapses,” in Int. Symp. Low Power Electron.
Design (ISLPED), pp. 85–90, 2015.
[8] S. S. Sarwar, S. Venkataramani, A. Raghunathan and K. Roy,
“Multiplier-less artificial neurons exploiting error resiliency for
energy-efficient neural computing,” in Proc. Design, Automat.
Test in Europe (DATE), pp. 145–150, 2016.
[9] Y. Gong, L. Liu, M. Yang and L. Bourdev, “Compressing deep
convolutional networks using vector quantization,” arXiv
preprint arXiv:1412.6115, 2014.
[10] S. Han et al., “EIE: efficient inference engine on compressed deep
neural network,” arXiv preprint arXiv:1602.01528, 2016.
[11] M. Courbariaux et al., “Binarized neural networks: training neural
networks with weights and activations constrained to +1 or -1,”
arXiv preprint arXiv:1602.02830, 2016.
[12] M. Rastegari, V. Ordonez, J. Redmon and A. Farhadi,
“XNOR-Net: ImageNet classification using binary convolutional
neural networks,” arXiv preprint arXiv:1603.05279, 2016.
[13] S. Zhou et al., “DoReFa-Net: training low bitwidth convolutional
neural networks with low bitwidth gradients,” arXiv preprint
arXiv:1606.06160, 2016.
Figure 13. The layer-wise throughput estimation of our proposed BNN hardware with 54 PEs when running LeNet5.
[14] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, “Gradient-based
learning applied to document recognition,” Proc. IEEE,
86(11):2278–2324, 1998.
[15] Y. Taigman, M. Yang, M. Ranzato and L. Wolf, “DeepFace:
closing the gap to human-level performance in face verification,”
in Proc. Comput. Vision Pattern Recog. (CVPR), pp. 1701–1708,
IEEE, 2014.
[16] A. Coates et al., “Deep learning with COTS HPC systems,” in
Proc. Int. Conf. Machine Learning (ICML), pp. 1337–1345, 2013.
[17] K. He, X. Zhang, S. Ren and J. Sun, “Deep residual learning for
image recognition,” arXiv preprint arXiv:1512.03385, 2015.
[18] S. Chai et al., “Low precision neural network using subband
decomposition,” in Cognitive Architecture (CogArch), 2016.
[19] Z. Wu, N. Fuller, D. Theriault and M. Betke, “A thermal infrared
video benchmark for visual analysis,” in Proc. IEEE Workshop
on Perception Beyond the Visible Spectrum (PBVS), 2014.
[20] D. Zhang et al., “Unsupervised underwater fish detection fusing
flow and objectiveness,” in Proc. Winter Appl. Comput. Vision
Workshops (WACVW), pp. 1–7, 2016.
[21] K. E. A. van de Sande, J. R. R. Uijlings, T. Gevers and A. W. M.
Smeulders, “Segmentation as selective search for object
recognition,” in Proc. Int. Conf. Comput. Vision (ICCV), 2011.
[22] M. Horowitz, “Energy table for 45nm process,” Stanford VLSI wiki.
Available: https://sites.google.com/site/seecproject/energy-table
[23] “Synopsys 32/28nm generic library,” https://www.synopsys.com/