Improving the Accuracy of Spiking Neural
Networks for Radar Gesture Recognition
Through Preprocessing
Ali Safa, Graduate Student Member, IEEE, Federico Corradi, Member, IEEE, Lars Keuninckx,
Ilja Ocket, Member, IEEE, André Bourdoux, Senior Member, IEEE,
Francky Catthoor, Fellow, IEEE, and Georges G. E. Gielen, Fellow, IEEE
Abstract— Event-based neural networks are currently being
explored as efficient solutions for performing AI tasks at the
extreme edge. To fully exploit their potential, event-based neural
networks coupled to adequate preprocessing must be investi-
gated. Within this context, we demonstrate a 4-b-weight spiking
neural network (SNN) for radar gesture recognition, achieving
a state-of-the-art 93% accuracy within only four processing
time steps while using only one convolutional layer and two
fully connected layers. This solution consumes very little energy
and area if implemented in event-based hardware, which makes
it suited for embedded extreme-edge applications. In addition,
we demonstrate the importance of signal preprocessing for
achieving this high recognition accuracy in SNNs compared to
deep neural networks (DNNs) with the same network topology
and training strategy. We show that efficient preprocessing prior
to the neural network is drastically more important for SNNs
compared to DNNs. We also demonstrate, for the first time,
that the preprocessing parameters can affect SNNs and DNNs in
antagonistic ways, prohibiting the generalization of conclusions
drawn from DNN design to SNNs. We demonstrate our findings
by comparing the gesture recognition accuracy achieved with our
SNN to a DNN with the same architecture and similar training.
Unlike previously proposed neural networks for radar processing,
this work enables ultralow-power radar-based gesture recognition
for extreme-edge devices.
Index Terms—Energy- and area-efficient networks, gesture
recognition, preprocessing impact on accuracy, radar processing,
spiking neural networks (SNNs).
Manuscript received December 18, 2020; revised June 11, 2021 and August 26, 2021; accepted August 31, 2021. This work was supported in part by the Flanders Artificial Intelligence (AI) Research Program. (Corresponding author: Ali Safa.)
Ali Safa, Ilja Ocket, Francky Catthoor, and Georges G. E. Gielen are with imec, 3001 Leuven, Belgium, and also with the Department of Electrical Engineering, Katholieke Universiteit (KU) Leuven, 3001 Leuven, Belgium (e-mail: ali.safa@imec.be; ilja.ocket@imec.be; francky.catthoor@imec.be; georges.gielen@kuleuven.be).
Federico Corradi is with Stichting imec, 5656 AE Eindhoven, The Netherlands (e-mail: federico.corradi@imec.nl).
Lars Keuninckx and André Bourdoux are with imec, 3001 Leuven, Belgium (e-mail: lars.keuninckx@imec.be; andre.bourdoux@imec.be).
Digital Object Identifier 10.1109/TNNLS.2021.3109958

I. INTRODUCTION

In recent years, event-based neural networks (as opposed to frame-based ones) have gained considerable interest
due to their low energy consumption, low inference latency, and their suitability for implementation in massively parallel, non-von Neumann architectures where computation is performed close to the memory [1], alleviating the memory bottleneck issue. Event-based neural networks [also called spiking neural
networks (SNNs)] are, thus, well suited for edge-AI applica-
tions where low energy consumption, compact die area, and
low latency are key requirements.
Following the popularity of the backpropagation of error
algorithm and supervised learning, multiple methods have
been proposed to solve the issue of gradient computation due
to the discontinuous Dirac-pulse activation of SNNs [2], [3].
Currently, the use of a surrogate gradient coupled with back-
propagation through time (BPTT) has gained popularity due
to its remarkable efficiency [4].
In the past decade, radar sensing via neural networks has
gained huge interest because of the advantages that radar sen-
sors present over vision-based technologies. Unlike cameras,
radars preserve privacy, are independent of lighting condi-
tions, and are able to sense even when occluded [34]. Deep
neural networks (DNNs) have successfully been used for radar
processing in a wide range of applications, from indoor person
identification [31] and heart-rate estimation [32] to target
classification [33] and gesture recognition [9]. However, those
DNN-based solutions are ill-suited for embedded AI applica-
tions at the extreme edge (such as in IoT devices), where the
current trend is to embed neural network accelerators within
the sensor chip itself and where the very tight energy budgets
generally cannot cope with the latency requirements and
computing power needed to run such deep learning solutions,
notwithstanding the remarkable recent efforts of many teams in
DNN acceleration [38], [39]. In an effort to bring radar-based
sensing to the ultralow-power edge devices, we investigate in
this article the use of SNNs for radar sensing and gesture
recognition and explore the design challenges that must be
faced to make them effective compared to conventional DNNs.
Indeed, a competitive SNN accuracy compared to the
DNN counterpart has mainly been demonstrated on standard
benchmarks, such as MNIST and CIFAR10 [5], and it is
unclear how SNNs compare to DNNs when used for other
applications, such as our radar gesture recognition task. More
specifically, we show in this article that SNN processing of the
sensory data requires some level of preprocessing, but there
has been little discussion so far about the impact of such data
preprocessing (such as denoising, sparse coding, or dimen-
sionality reduction) on the SNN performance compared to
any preprocessing impact on the DNN performance. DNNs
have shown remarkable success when applied to unprocessed
datasets corrupted by nonstationary noise [6], with exper-
imental evidence suggesting that learning with noisy data
might even be beneficial [7]. It is unclear if and how those
observations generalize to SNNs.
Within this context, we demonstrate the importance of
adding preprocessing to a resource-constrained SNN for
the task of gesture recognition in an extreme-edge device
using a custom 8-GHz ultralow-power frequency-modulated
continuous-wave (FMCW) SISO radar [8], together with pre-
senting a custom preprocessing queue performing sparse cod-
ing and dimensionality reduction. The resulting architecture
achieves state-of-the-art performance compared to conven-
tional solutions.
The key contributions of this work are given as follows.
1) We demonstrate the importance of radar data preprocess-
ing for SNNs compared to its importance for DNNs in
a supervised learning setting.
2) We demonstrate that conclusions drawn from DNN
design cannot be extended to the SNN counterpart, even
if a similar training methodology is used.
3) We introduce a well-suited method for preprocessing
and encoding radar intermediate frequency (IF) signals
into the event domain, extensively comparing different
approaches.
4) We propose a low-resource SNN architecture with only
one convolutional layer and only two fully connected
hidden layers, using only 4-b weights, to achieve 93%
of recognition accuracy within four time steps, which
establishes a high energy efficiency, a low die area usage,
and a low latency when implemented in hardware for
edge devices.
This article is organized as follows. Related works are dis-
cussed in Section II. Our radar data acquisition is introduced
in Section III. Our preprocessing approaches are detailed in
Section IV. Our neural network is introduced in Section V.
Results and their discussion are presented in Section VI.
Conclusions are provided in Section VII.
II. RELATED WORKS
Radar gesture recognition is actively being explored using
conventional DNN solutions [9], [11], [12], [18]. Using
Google’s Soli sensor [10], 11-class gesture recognition with
87% accuracy has been demonstrated [9]. The authors use a
network of rectified linear units (ReLU) neurons composed of
multiple convolutional and max-pooling layers, followed by
multiple fully connected layers and ending with a recurrent
layer using long short-term memory blocks. As input to the
network, they perform range-Doppler (RD) processing on the
radar signals and feed their magnitude as images. Compared
to our work, they are able to recognize more gesture classes at
the expense of much higher network complexity (three 3 ×3
convolutional layers with 32, 64, and 128 filters, respectively;
two fully connected layers both with 512 neurons; an LSTM
layer of size 512; and a fully connected SoftMax layer
of 11 outputs). In addition, the authors point out that their
remarkable ability to recognize 11 classes is also due to their
use of a 60-GHz radar [9], [10], providing close to eight times
finer grain Doppler resolution compared to our 8-GHz radar
chip (as Doppler resolution is inversely proportional to the
radar frequency [20]). On the other hand, their 60-GHz radar
consumes around 300 mW of power [10] compared to 680 μW
for our 8-GHz radar [8], which makes our sensor much
better suited for ultralow-power applications, such as IoT edge
devices. Also, they do not quantize the weights nor the input
to make their network compatible with applications requiring
ultralow power consumption. Therefore, even though better
suited for solutions that can make use of GPU compute power,
their solution is unsuited for ultralow-power edge devices with
limited energy available.
As mentioned above, a key observation made by
Wang et al. [9] and Lien et al. [10] is the need for fine-grain
Doppler resolution in order to achieve high-accuracy gesture
recognition with many classes. This is a key limitation of
our 8-GHz sensor as we trade off frequency for ultralow
power consumption and die area utilization. Integrating an
SNN with such a small-area (1.8 mm² [8] versus 144 mm² for Google's Soli [10]), low-power 8-GHz radar chip is,
therefore, a well-suited choice, as SNN architectures can
lead to substantially higher energy- and area-efficient imple-
mentations compared to current DNN accelerators [1], [26],
[27], even when the network architecture is shallow (see [28]
for a discussion about the accuracy–area–energy tradeoff).
However, using SNNs comes at the cost of a potentially
severe accuracy loss compared to using DNNs (see Table III).
This motivates the exploration of preprocessing strategies for
the SNNs, as provided in this article. Indeed, it is due to a
well-suited preprocessing strategy that we are able to achieve
a high-accuracy, five-class gesture recognition with our simple
8-GHz radar and small 4-b-weight SNN in this work.
In order to bring radar gesture recognition closer to the
edge, Sun et al. [11] proposed a 12-class gesture recognition
system using a 60-GHz FMCW radar with an Nvidia Jetson
Nano board running a network of ReLU neurons with four
convolutional layers and three fully connected layers achieving
94% accuracy. Compared to our work with the SISO setting,
their radar system [11] uses multiple receive antennas to
provide azimuth and elevation information, at the expense of
increased power consumption in the radar front-end due to the
use of multiple receive antennas. As input to their network,
the authors first compute the magnitude of RD maps for each
antenna and sum the resulting maps for each antenna. Then,
they extract the 25 largest bins and compute the azimuth and
elevation angles for those bins. Finally, they aggregate the
range, Doppler, and angular information as a feature cube
that is fed to the neural network. Compared to [11], our
preprocessing chain is much simpler as we have an SISO
setting (angle processing is impossible since only one receive
antenna is available). While the system of [11] is well suited
for GPU-enabled edge computing platforms, it is much less
suited for extreme-edge applications requiring an ultralow
power consumption. Sun et al. [11] also do not perform aggressive quantization on the inputs and network weights, in contrast to our SNN processing queue.
Other DNN solutions have reported similar results
by extracting micro-Doppler features of the gestures
(joint time–frequency representations) instead of RD map
processing [12]. In this work, both micro-Doppler and RD
approaches will be compared in order to illustrate the
importance of choosing the right preprocessing approach for
resource-constrained SNNs.
Compared to our work, the DNN solutions described above
are all frame-based and, therefore, cannot be deployed on the
growing number of event-based SNN processors well suited
for ultralow-power AI applications. Moradi et al. [1] proposed
a mixed-signal SNN processor implementing an asynchronous
and event-based packet routing methodology in order to min-
imize latency and memory requirements. Davies et al. [26]
proposed a fully digital asynchronous and event-based SNN
accelerator, whereas Frenkel et al. [27] proposed a fully
synchronous, yet event-based digital architecture.
Although not fully comparable to our work, most
SNN-based gesture recognition systems use dynamic vision
sensor (DVS) cameras with pixels that directly output spikes
when their change in luminosity crosses a certain thresh-
old [2], [13]. Compared to standard cameras that output
image frames at a certain frame rate, DVS cameras output
events asynchronously, which makes them a natural choice
for SNN-based processing. Compared to our small 8-GHz
radar chip, DVS cameras provide signals of significantly better
fidelity (images versus sparse number of radar returns) at
the expense of being bulkier, power-hungry (23 mW for an
iniVation DVS128 camera versus 680 μW for our 8-GHz
radar), and more expensive (in the $2000 range compared to $10 for ubiquitous radar chips). Even then, the performance
of our work compares favorably with results in the DVS ges-
ture recognition literature. Using the DVSGesture dataset [14],
the reported accuracies using various machine learning models
and SNNs fall in the 91%–96% range for ten-class recognition
[2], [14]–[17]. Recently, Massa et al. [13] proposed an 11-class
gesture recognition system achieving 89.64% of accuracy by
first training a DNN composed of four convolutional layers
and a fully connected SoftMax layer and then converting this
trained DNN to an SNN through the conversion of ReLU
activations into spike rates (as opposed to our work where
we train the SNN directly in the spiking domain). More
closely related to our work, Maro et al. [15] have reported
an accuracy of 93% (similar to ours) using the NavGesture
dataset containing five similar gestures compared to our work.
It is worth noting that, currently, the application field of SNNs
is still limited compared to DNNs, and therefore, it is worth
investigating the application of SNNs to sensory data other
than the traditional DVS cameras used in most SNN works.
Closely related to our work, eight-class human action recog-
nition with CW radar and SNN processing has been pro-
posed in [19], achieving an accuracy of 85%. Although their
preprocessing shares commonalities with ours, they do not
perform sparse coding nor dimensionality reduction after radar
processing, which helps to increase the SNN performance
in terms of energy–area efficiency and accuracy, as reducing
the dimensionality of radar signals leads to overall smaller
network architectures as fewer neurons would be required.
In addition, dimensionality reduction can be used to reject out-
of-band noise, and sparse coding can increase the separability
of the input signals, which can lead to a higher accuracy [24].
Banerjee et al. [19] use an SNN topology composed of multi-
ple convolutional spiking layers and winner-take-all classifica-
tion, coupled to the spike-timing-dependent plasticity (STDP)
learning rule. Compared to the topology in [19], our pro-
posed network architecture is simpler. Banerjee et al. [19] make use of leaky integrate-and-fire neurons, while we use non-leaky integrate-and-fire (I&F) neurons, which are simpler to implement in hardware.
Furthermore, Banerjee et al. [19] do not quantize the weights,
while we apply 4-b quantization to the weights in our net-
work. As we are aiming for keeping the neural network con-
strained in size, the architecture of [19] is not appropriate for
our goal.
With this article, our aim is to illustrate the importance
and the lessons learned on the impact of using preprocessing
for SNNs compared to its impact on DNNs. To the best of
our knowledge, an extensive discussion about the impact of
preprocessing on the accuracy of resource-constrained SNNs
compared to nonconstrained DNNs with the same topology,
both trained via backpropagation, has not yet been reported.
Such a discussion is important within the radar context of
this article as state-of-the-art radar gesture recognition DNNs
targeting the edge computing domain [11] are principally
being deployed in embedded platforms, such as the Nvidia
Jetson board, providing support for high bit-width integer and
floating-point computation [11]. This will be discussed and
illustrated in the example case of radar gesture recognition in
Section VI.
III. RADAR DATA ACQUISITION SETUP
This section gives a brief overview of FMCW radar theory
and the radar dataset that we used for the gesture recognition
demonstrations.
A. FMCW Radar Theory
Using SISO FMCW radars, a chirp signal $p_t(t)$ with slope $\alpha$ and carrier frequency $f_c$ is emitted at the transmit antenna. This can be modeled as follows [21]:
$$p_t(t) = \sin\left(2\pi f_c t + \pi \alpha t^2\right). \quad (1)$$
Bouncing off an object located at a distance $d$ from the radar, the signal $p_r(t)$ at the receive antenna can be modeled as a delayed and attenuated version of the transmitted signal (assuming a single point target)
$$p_r(t) = \xi\, p_t(t - T_d) \quad (2)$$
where $\xi$ denotes the attenuation coefficient. The time-of-flight delay $T_d$ depends on the round-trip distance $2d$ from the radar to the target
$$T_d = \frac{2d}{c} \quad (3)$$
where $c$ is the speed of light. By mixing the received signal $p_r(t)$ with a replica of $p_t(t)$ at the in-phase receiver of the radar chip [8], and by low-pass filtering the resulting signal such that the frequency component $f_c$ is removed, the following IF signal is obtained:
$$r(t) = -\frac{\xi}{2}\cos\left(2\pi\alpha T_d t + 2\pi f_c T_d - \pi\alpha T_d^2\right). \quad (4)$$
As the frequency of the baseband signal is proportional to $T_d$ and, therefore, $d$, its spectrum shows peaks at certain frequencies corresponding to the distance between the radar and the surrounding objects (the received signal being, in fact, a sum of many signals of the form given by (4) when multiple targets are present). After the analog-to-digital conversion by the in-phase ADC, the discrete-time signal $r[i]$ is found as
$$r[i] = r(t = i T_f). \quad (5)$$
The choice of the ADC sampling period $T_f$ impacts the maximum range coverage $d_{\max}$ of the radar [21]
$$d_{\max} = \frac{c}{2\alpha T_f}. \quad (6)$$
The slope $\alpha$ is defined as follows:
$$\alpha = \frac{B}{T_c} \quad (7)$$
where $T_c$ is the chirp duration and $B$ is the bandwidth of the chirp, which impacts the range resolution [21]
$$d_{\text{res}} = \frac{c}{2B}. \quad (8)$$
For chirp $n$, the range profile $R_n[k]$ is defined as the DFT of $r[i]$. Peaks in the magnitude of $R_n[k]$ indicate the presence of a target, while the phase evolution of $R_n[k]$ for a target at range bin $k$ between successive chirps represents the micromotions of the target [20] (a small radial displacement $\Delta d$ induces a phase shift $\Delta\varphi = (4\pi/\lambda)\Delta d$). The spectrum of the micromotions is called the Doppler profile and is defined as the DFT of $R_n[k]$ along $n$ [20].
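For concreteness, the following minimal numpy sketch simulates the IF signal of (4) for a single point target and recovers its range through the range-profile DFT of (9); the chirp bandwidth and scaling used here are illustrative assumptions, not the exact parameters of the radar in [8]:

```python
import numpy as np

c = 3e8                     # speed of light [m/s]
B, Tc = 250e6, 41e-6        # assumed chirp bandwidth and duration
alpha = B / Tc              # chirp slope, (7)
Tf = 80e-9                  # ADC sampling period
L = 512                     # ADC samples per chirp
fc = 8e9                    # carrier frequency
d = 2.0                     # true target distance [m]

Td = 2 * d / c              # time-of-flight delay, (3)
t = np.arange(L) * Tf       # discrete sampling instants, (5)
r = -0.5 * np.cos(2 * np.pi * alpha * Td * t
                  + 2 * np.pi * fc * Td
                  - np.pi * alpha * Td**2)   # IF signal, (4) with xi = 1

R = np.fft.fft(np.blackman(L) * r) / L       # range profile, (9)
k_peak = np.argmax(np.abs(R[: L // 2]))      # strongest range bin
d_max = c / (2 * alpha * Tf)                  # maximum range, (6)
print("estimated range: %.2f m" % (k_peak / L * d_max))
```

The recovered range (about 1.8 m here) is within one range-resolution cell $d_{\text{res}} = c/(2B)$ of the true 2-m distance, as expected from (8).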
B. Gesture Dataset
We have used a custom ultralow-power 8-GHz SISO
radar [8] to acquire a five-class gesture dataset. Table I shows
the dataset content used for the demonstration experiments
reported later on. The gestures were recorded at a distance
of 2 m from the antennas (RX and TX) and were obtained
by swinging the right or left arm in the vertical direction
(gesture one-vertical), by swinging the right or left arm in
the horizontal direction (gesture two-horizontal), by waving
with the right or left hand while keeping the palm facing out
(gesture three-hello), by moving the hand with the palm facing
out toward and away from the radar (gesture four-toward),
and, finally, by recording background activity in which none
of the above gestures appears in a static background (gesture
five-background). It should be noted that such a dataset
is particularly well-suited for comparing the importance of
preprocessing in SNNs and DNNs as radar signals represent
the environment with a lower fidelity compared to images [36],
TABLE I
DATASET CONTENT USED IN THE DEMONSTRATION EXPERIMENTS
Fig. 1. Gesture acquisition setup.
making them more sensitive to proper preprocessing and
feature extraction.
The radar parameters were set as follows: the number of ADC samples per chirp is 512, the number of chirps per frame is 192, and the time between chirps is $T_s = 1.2$ ms, while the chirp duration $T_c$ is 41 μs. Therefore, the total duration for a frame capture is 238 ms and $T_f = 80$ ns. Fig. 1 shows
the gesture acquisition setup with the antennas and the radar
read-out boards.
IV. RADAR PREPROCESSING APPROACHES
This section presents the two widely used radar preprocess-
ing approaches that will then be compared in Section VI,
followed by a description of our proposed dimensionality
reduction and sparse coding techniques.
A. μDoppler Signature

We acquire the μDoppler signature plot [20] of each gesture acquisition in the dataset by computing the 1-D vector $R_n[k]$ for $n = 1, \ldots, N_{\text{tot}}$ ($N_{\text{tot}}$ is the total number of chirps), where $k$ is the range bin in which the gestures are performed (which is known a priori since the distance between the human target and the radar is fixed) and $R_n[k]$ is acquired by DFT as follows:
$$R_n[k] = \frac{1}{L}\sum_{i=0}^{L-1} w[i]\, r_n[i]\, e^{-j\frac{2\pi k i}{L}} \quad (9)$$
where $L = 512$ is the number of ADC samples of $r_n$ (the received IF signal for chirp $n$) and $w$ denotes the Blackman window that we used [35].
Then, we apply the short-time Fourier transform (STFT) on $R_n[k]$ as follows [19]:
$$S[m, f] = \sum_{n=-\infty}^{\infty} R_n[k]\, g_s[n - mR]\, e^{-j2\pi f n} \quad (10)$$
where $g_s$ is the Hanning window of length $s$ and $R$ is the hop size ($s = 192$ and $R = 8$ in this work). Finally, the μDoppler signature plot is found by taking the magnitude of $S[m, f]$ [12]. The size of the $S[m, f]$ matrix for each acquisition of Table I is $(N_T \times s)$, with $s = 192$ frequency bins and the number of time bins $N_T$ given by [19]
$$N_T = \frac{N_{\text{frames}} N_{\text{chirps}} - N_{\text{overlap}}}{s - N_{\text{overlap}}} \quad (11)$$
where $N_{\text{overlap}}$ is the number of overlapping bins between subsequent analysis windows and is equal to $s - R$.
We also investigate a variation of this method, which we call ΔμDoppler, by applying the STFT on the difference sequence $\tilde{R}_n[k] = R_n[k] - R_{n-1}[k]$ to remove the strong dc component during each analysis window [31].
In order to obtain example patches to feed to our neural network, we cut $|S[m, f]|$ along dimension $m$ into patches of 48 time samples, resulting in $\lfloor N_T/48 \rfloor$ examples for each acquisition. Finally, we construct a balanced dataset with a total of 1695 examples by randomly selecting 339 examples per acquisition class. The choice of 339 examples comes from the fact that the background acquisition (which is the one containing the smallest number of frames) gives 351 example patches according to (11), but we discard the first and last six example patches to remove startup and ending artifacts.
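A compact sketch of this μDoppler pipeline, under the parameter values given above ($s = 192$, $R = 8$, 48-sample patches), could look as follows; the function name and array layout are our own:

```python
import numpy as np

def micro_doppler_patches(R_k, s=192, R_hop=8, patch_len=48, delta=False):
    """Sketch of (10)-(11): STFT of the slow-time sequence R_n[k] at the
    known range bin, magnitude, then cutting into patches of 48 time bins.
    R_k is a 1-D complex array of R_n[k] over chirps n."""
    if delta:                           # DeltaMuDoppler variant:
        R_k = R_k[1:] - R_k[:-1]        # difference removes the strong dc
    g = np.hanning(s)                   # analysis window g_s
    starts = range(0, len(R_k) - s + 1, R_hop)
    stft = np.stack([np.fft.fft(g * R_k[m:m + s]) for m in starts])
    sig = np.abs(stft)                  # muDoppler signature |S[m, f]|
    n_patches = sig.shape[0] // patch_len
    return [sig[p * patch_len:(p + 1) * patch_len] for p in range(n_patches)]
```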
B. RD Maps

We acquire the RD maps [9] for each gesture acquisition in the dataset by first acquiring [using (9)] $R_n[k]$, $n = 1, \ldots, N_{\text{win}}$, as a matrix of size $(L \times N_{\text{win}})$, where $N_{\text{win}}$ is the number of chirps in each RD map (set to 192 in this work). Then, the RD map is found as follows [21]:
$$D[m, k] = \frac{1}{N_{\text{win}}}\sum_{n=0}^{N_{\text{win}}-1} w[n]\, R_n[k]\, e^{-j\frac{2\pi m n}{N_{\text{win}}}}. \quad (12)$$
This operation is repeated for each successive $N_{\text{win}}$ chirps, and $\lfloor N_{\text{tot}}/N_{\text{win}} \rfloor$ RD maps are acquired during gesture acquisition, with $N_{\text{tot}}$ being the total number of chirps for a particular gesture recording. Only the first 50 range bins of each RD map are kept in order to reject far-away clutter.
In order to remove dc reflectors and exploit the correlation between successive RD maps, we also investigate a custom variation of this method, which we call ΔRD, by aggregating the difference between successive RD maps as follows (where $b$ denotes the RD map index):
$$\tilde{D}_b[m, k] = \max\left(D_b[m, k] - D_{b-1}[m, k],\, 0\right). \quad (13)$$
Similar to the first preprocessing method, we construct a balanced dataset with a total of 1695 examples, randomly selecting 339 examples per class to enable a fair comparison between the two preprocessing approaches.
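The RD-map computation of (12) and the ΔRD aggregation of (13) can be sketched as follows (array layout and helper name are assumptions; a real implementation would typically also fftshift the Doppler axis):

```python
import numpy as np

def rd_maps(r_adc, N_win=192, n_range_bins=50, delta=False):
    """Sketch of (12)-(13). r_adc has shape (N_tot, L): one row of ADC
    samples per chirp."""
    N_tot, L = r_adc.shape
    R = np.fft.fft(np.blackman(L) * r_adc, axis=1) / L       # range FFT, (9)
    R = R[:, :n_range_bins]                                   # reject far clutter
    maps = []
    for b in range(N_tot // N_win):
        block = R[b * N_win:(b + 1) * N_win]
        w = np.blackman(N_win)[:, None]
        D = np.abs(np.fft.fft(w * block, axis=0)) / N_win     # Doppler FFT, (12)
        maps.append(D)
    if delta:                                                 # DeltaRD variant, (13)
        maps = [np.maximum(maps[b] - maps[b - 1], 0.0)
                for b in range(1, len(maps))]
    return maps
```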
Fig. 2. Example patch for the “vertical” gesture, acquired using μDoppler
preprocessing with Doppler band-limiting, max48, and complete map normal-
ization (pixel values between 0 and 1).
Fig. 3. Example patch for the “vertical” gesture, acquired using RD
preprocessing with Doppler band-limiting, max48, and complete map nor-
malization (pixel values between 0 and 1).
C. Enhancements

For both preprocessing methods described above, and for their respective Δ-variations, we explore the effect of a representative subset of possible enhancements [31].
1) Dimensionality reduction through Doppler spectrum band-limiting.
2) Sparse coding through soft thresholding.
3) Image normalization.
a) In order to reject out-of-band noise, we explore
the effect of band-limiting the Doppler spectra by
keeping only a reduced portion of the Doppler axis
between the normalized frequencies [−0.26,0.26]
(frequency band found by visually identifying the
maximal significant extent of the Doppler spectra
in the dataset).
b) In order to remove in-band noise, we use a fast
approximation to Lasso coding [22] by soft thresh-
olding each Doppler spectrum, for each time bin in
the case of the preprocessing method A or for each range bin in the case of the preprocessing method B. For soft thresholding, we use the max_k operator,
Fig. 4. SNN architecture. Each pixel of the input radar maps is first encoded into a spike train of length Tinf , either through RB or TTFS coding. Each
spiking map slice corresponding to each time step is then fed one by one to the network and the IF neurons change state according to their self-recurrence
(as denoted by the black recurrence arrows). Output spikes emitted by the σ3layer are accumulated during the Tinf time steps, and the final accumulation
vector Ais transformed through SoftMax to class probabilities.
which keeps the $k$ largest values and replaces the others with 0 (we do not subtract the threshold from the values to be kept). To choose $k$, we consider that more than half of the Doppler samples within the normalized frequencies $[-0.26, 0.26]$ are significant, which leads to the choice of $k = \lfloor 192 \times (0.26 - (-0.26))/2 \rfloor - 1 = 48$ throughout this work (see the sketch after this list).
c) Before encoding the preprocessed radar signals
into event streams to be fed to the SNN, we nor-
malize the pixel values between 0 and 1, either by
normalizing each Doppler spectrum individually or
by normalizing the complete maps (which yields fewer large pixel values than the normalization of each individual Doppler spectrum).
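A minimal sketch of enhancements a)–c), assuming Doppler bins ordered from −0.5 to 0.5 normalized frequency, is given below; note that the thresholding keeps ties, so it retains at least $k$ values per spectrum:

```python
import numpy as np

def enhance(doppler_map, f_lo=-0.26, f_hi=0.26, k=48, per_spectrum_norm=False):
    """Sketch of enhancements a)-c): Doppler band-limiting, max_k soft
    thresholding, and normalization. doppler_map: (time_or_range, 192)."""
    n_bins = doppler_map.shape[1]
    lo = int(round((f_lo + 0.5) * n_bins))           # a) band-limiting:
    hi = int(round((f_hi + 0.5) * n_bins))           # 192 bins -> 100 bins
    x = doppler_map[:, lo:hi]
    thresh = -np.sort(-x, axis=1)[:, k - 1:k]        # k-th largest per spectrum
    x = np.where(x >= thresh, x, 0.0)                # b) max_k, kept values untouched
    if per_spectrum_norm:                            # c) normalization
        x = x / np.maximum(x.max(axis=1, keepdims=True), 1e-12)
    else:
        x = x / max(x.max(), 1e-12)                  # whole-map normalization
    return x
```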
For the sake of illustration, Figs. 2 and 3 show exam-
ple patches of the “vertical” gesture, for each preprocessing
method, to be fed to the neural network. As seen in Fig. 2,
μDoppler processing leads to a periodic-like pattern in
time, which captures the Doppler information of the range
bin in which the gestures are performed. On the other hand,
RD (see Fig. 3) provides no time information but rather
captures the Doppler information carried by any wave that
was reflected from the gestures in the environment (hence the
range dimension).
In Section VI, we will explore the effect of the above preprocessing methods A and B, their respective Δ-variations, and the enhancements a), b), and c) on the accuracy of our
low-complexity SNN compared to the accuracy of a DNN with
the same topology.
V. NEURAL NETWORK DESIGN AND TRAINING APPROACH

We use two neural networks within this work: an SNN with I&F neurons [see (14), where $V_k$ is the neural membrane potential at time step $k$, $J_{\text{in}}$ is the neuron input, and $S$ is the spiking output] and the corresponding DNN with ReLU neurons. Fig. 4 shows the topology used for both networks. The input size to the neural network varies depending on the preprocessing combination used, as shown in Table II:
$$V_{k+1} = V_k + J_{\text{in}}, \quad S = 0, \qquad \text{if } V_k < 1$$
$$V_{k+1} = 0, \quad S = 1, \qquad \text{if } V_k \geq 1. \qquad (14)$$
TABLE II
NETWORK INPUT SIZES. THE SIZE IS DIFFERENT FOR EACH PREPROCESSING VARIANT AS EACH PREPROCESSING PRODUCES RESULTS IN DIFFERENT DOMAINS: DOPPLER FREQUENCY AGAINST THE NUMBER OF SLIDING WINDOW HOPS THROUGH TIME FOR μDOPPLER AND DOPPLER FREQUENCY AGAINST RANGE FOR RD. BAND-LIMITING ALSO AFFECTS THE SIZE AS IT CROPS THE ORIGINAL DIMENSION OF THE DOPPLER AXIS FROM 192 DOPPLER BINS TO 100 IN OUR CASE (REJECTING HIGHER-FREQUENCY BINS)
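The I&F dynamics of (14) reduce to a few lines; the following vectorized sketch applies one time step to a whole layer:

```python
import numpy as np

def if_step(V, J_in, threshold=1.0):
    """One time step of the non-leaky I&F neuron of (14), vectorized over
    a layer. V: membrane potentials; J_in: summed synaptic input."""
    S = (V >= threshold).astype(V.dtype)   # spike where threshold is crossed
    V = np.where(S > 0, 0.0, V + J_in)     # reset on spike, else integrate
    return V, S
```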
A. Architectural Choices
The architectural choices made during the design of the
neural network (see Figs. 4 and 5) are motivated as follows.
First, we decided to use a convolutional layer at the network
input to better capture spatial patterns in the input maps and
avoid overfitting through weight sharing. Then, we opted for
a max-pooling layer instead of average pooling, as we wanted
to keep the spiking nature of the data to be given as input to
the pooling layer. Indeed, pooling is performed on the output
of the spiking neurons in layer σ1. With average pooling,
the resulting map would not contain any spikes anymore,
as the output values are between 0 and 1. On the other
hand, max pooling preserves the spiking nature of the data
(output values are either equal to 0 or 1), which is beneficial
for hardware implementation, as it is equivalent to a simple
OR gate passing the spikes to the next layer. It is important
to note that max pooling in SNNs has been shown to be
rather complex to deploy for those SNNs that result from
continuous-valued DNN to SNN conversion (see [40], where
an existing DNN is converted into an SNN through rate-based
(RB) coding). This is not a problem in our case as we train
our SNNs (containing a standard max-pooling layer) directly
in the spiking domain. Finally, limiting the size of the network
to only one convolutional layer and only two fully connected
layers is motivated by hardware considerations (a smaller
network consumes less energy and is easier to implement
in hardware). We have decided to use IF neurons over more
complex models as the IF neuron is the simplest to implement
in hardware, again reducing the chip area overhead.
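Because the data entering the pooling layer are binary, 2 × 2 max pooling is exactly a logical OR over each window, as the following sketch illustrates:

```python
import numpy as np

def spiking_maxpool2x2(spikes):
    """2x2 max pooling on a binary spike map: on 0/1 inputs, max equals a
    logical OR over each window, which is why a pool of OR gates (or I&F
    neurons with unit weights and threshold) suffices in hardware."""
    h, w = spikes.shape[0] // 2 * 2, spikes.shape[1] // 2 * 2
    blocks = spikes[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))   # equivalently: OR over each 2x2 window
```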
Fig. 5. Corresponding DNN architecture. The network retains a similar topology to Fig. 4 but with IF neurons replaced by ReLU and the output of the σ2
layer directly fed to the SoftMax nonlinearity, as conversion from the event domain to the frame domain by accumulation is not needed.
For the SNN, the conversion from event to frame domain
is done by counting spikes during the inference time Tinf
for each of the five output neurons in the σ3layer. Then,
the five-element vector containing the spike counts is trans-
formed via SoftMax to class probabilities. For the DNN,
the SoftMax layer is directly connected to the weights $W^2_{ij}$.
In the SNN case, each pixel of the radar input maps must
first be encoded into event trains of length Tinf (number of
time steps for inference) to make the input compatible with
the spiking, event-based nature of the SNN. In Section VI,
we will explore the effect of two spike encoding approaches:
RB and time-to-first-spike (TTFS) coding [43], as well as the
effect of the inference time Tinf .
RB and TTFS encoding are done as follows. After normalizing the pixel values between 0 and 1, a value $v$ is coded into a periodic event train with event rate $\lfloor v T_{\text{inf}} \rfloor / T_{\text{inf}}$ for RB, while $v$ is coded into an event train containing one spike located at index $T_{\text{inf}} - \lfloor v T_{\text{inf}} \rfloor$ for TTFS. In both cases, if $\lfloor v T_{\text{inf}} \rfloor = 0$, no spike is generated.
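A sketch of both encoders for a single normalized pixel value is given below; the even spacing of the RB spikes is our assumption (any placement with $\lfloor v T_{\text{inf}} \rfloor$ spikes yields the stated rate):

```python
import numpy as np

def encode_pixel(v, T_inf, scheme="RB"):
    """Spike-train encoding of a normalized pixel v in [0, 1] over T_inf
    time steps, following the RB / TTFS description above."""
    train = np.zeros(T_inf, dtype=np.uint8)
    n = int(np.floor(v * T_inf))
    if n == 0:                       # no spike generated in either scheme
        return train
    if scheme == "RB":               # periodic train: n spikes in T_inf steps
        idx = np.round(np.arange(n) * T_inf / n).astype(int)
        train[np.clip(idx, 0, T_inf - 1)] = 1
    else:                            # TTFS: one spike at index T_inf - n
        train[T_inf - n] = 1
    return train
```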
B. Training Approach

To train the SNN, we use the PyTorch framework [42] by defining a custom IF neuron model behaving as (14). As the derivative of a spike as a function of the neuron membrane potential, $\sigma'(V)$, is ill-defined, we approximate it using a Gaussian function (15) as surrogate derivative [4] to enable backpropagation:
$$\sigma'(V) \approx \frac{1}{\sqrt{2\pi}}\, e^{-2V^2}. \quad (15)$$
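In PyTorch, this surrogate can be realized with a custom autograd function: a hard threshold in the forward pass and the Gaussian of (15) in the backward pass. Centering the Gaussian on the firing threshold is our reading of (15):

```python
import math
import torch

class IFSpike(torch.autograd.Function):
    """Spike nonlinearity: Heaviside step in the forward pass, Gaussian
    surrogate derivative (15) in the backward pass."""

    @staticmethod
    def forward(ctx, v, threshold=1.0):
        ctx.save_for_backward(v - threshold)     # distance to threshold
        return (v >= threshold).float()          # binary spike output

    @staticmethod
    def backward(ctx, grad_output):
        (u,) = ctx.saved_tensors
        surrogate = torch.exp(-2.0 * u ** 2) / math.sqrt(2.0 * math.pi)  # (15)
        return grad_output * surrogate, None     # no gradient for threshold

spike = IFSpike.apply   # usage: s = spike(v_membrane)
```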
Even if the network itself is not recurrent, backpropagation
should be carried through time (BPTT) as each spiking neuron
can be seen as a recurrent unit by itself [4]. Indeed, an input
event to a spiking neuron at time nstill affects its membrane
potential at future times as evidenced by (14). In the general
case of BPTT, the total loss function Ltot can be written as
the sum of the losses for each time step as follows [30]:
Ltot =
Tinf1
n=0
L[n].(16)
Then, the derivative of $L_{\text{tot}}$ with respect to weight $W^l_{ij}$ in layer $l$ is given as follows:
$$\frac{\partial L_{\text{tot}}}{\partial W^l_{ij}} = \sum_{m=0}^{T_{\text{inf}}-1} \frac{\partial V^{l+1}_i[m]}{\partial W^l_{ij}} \sum_{n=m}^{T_{\text{inf}}-1} \frac{\partial L[n]}{\partial V^{l+1}_i[m]} \quad (17)$$
where $V^{l+1}_i[m]$ is the membrane potential of neuron $i$ in layer $l+1$ at time step $m$. Expanding (17) shows the effect that a spike at time step $n$ has on all future evaluations of the loss:
$$\frac{\partial L_{\text{tot}}}{\partial W^l_{ij}} = \frac{\partial V^{l+1}_i[0]}{\partial W^l_{ij}} \sum_{n=0}^{T_{\text{inf}}-1} \frac{\partial L[n]}{\partial V^{l+1}_i[0]} + \frac{\partial V^{l+1}_i[1]}{\partial W^l_{ij}} \sum_{n=1}^{T_{\text{inf}}-1} \frac{\partial L[n]}{\partial V^{l+1}_i[1]} + \cdots + \frac{\partial V^{l+1}_i[T_{\text{inf}}-1]}{\partial W^l_{ij}}\, \frac{\partial L[T_{\text{inf}}-1]}{\partial V^{l+1}_i[T_{\text{inf}}-1]}. \quad (18)$$
Equation (18) shows how BPTT applies to SNNs: the membrane potential at time 0 resulting from an input spike at time 0 affects the loss function evaluation from time step 0 onward; the membrane potential at time 1 resulting from input spikes at times 0 and 1 affects the loss function evaluation from time step 1 onward; and so on until the final inference time step $T_{\text{inf}}-1$. For our gesture classification task, the loss function only takes a nonzero value at the last time step as we accumulate all the output spikes after layer $\sigma_3$ before applying SoftMax (a so-called many-to-one correspondence). In that case, (18) simplifies to the following:
$$\frac{\partial L_{\text{tot}}}{\partial W^l_{ij}} = \sum_{m=0}^{T_{\text{inf}}-1} \frac{\partial V^{l+1}_i[m]}{\partial W^l_{ij}}\, \frac{\partial L[T_{\text{inf}}-1]}{\partial V^{l+1}_i[m]}. \quad (19)$$
Our feedforward SNN can, thus, be seen as a recurrent
network where the recurrence is due to the state of the
membrane potentials being passed to the next time step.
This enables the use of BPTT just like in regular RNNs.
Algorithms 1 and 2 summarize the forward and backward
pass during SNN training and inference. The 3-D inputs I[n]
(resulting from the spike train encoding of each pixel in the
2-D radar maps) are sliced out as 2-D binary maps at each
spike time step and fed one by one to the input of the network.
For training both the SNN and the DNN, we initialize the weights of our network using the uniform Glorot method [see (20), proposed in [29], where $U$ denotes the uniform distribution and $n_j$ is the number of neurons in layer $j$], with the biases set to zero, helping the variance at the input of the layers to be equal to the variance at the output in the forward pass and vice versa for the backward pass. Unlike the usual random initialization, where statistics are not layer-dependent, this method helps avoid the vanishing and exploding gradient
TABLE III
IMPACT OF RADAR PREPROCESSING ON THE SNN ACCURACY COMPARED TO DNN ACCURACY. FOR THE SNN, THE $N_q = 5$ LAST EPOCHS ARE TRAINED WITH QUANTIZED WEIGHTS. RB CODING IS USED THROUGHOUT THE TABLE. THE RESULTS CLEARLY MOTIVATE THE NEED FOR USING A WELL-SUITED PREPROCESSING IN THE SNN CASE
Algorithm 1 SNN Forward Pass
Input: I[n], n = 0, …, T_inf − 1: radar map spike slices; y: one-hot label
Output: ŷ, L
State: V_i^l: all membrane potentials; A: (5 × 1) spike accumulator
Initialization: V_i^l ← 0 ∀ i, l; A ← 0
1: for n = 0 to T_inf − 1 do
2:   σ_3 ← Network(I[n], V_i^l)
3:   A ← A + σ_3
4: end for
5: ŷ ← SoftMax(A)
6: L ← CrossEntropy(ŷ, y)
7: return ŷ, L

Algorithm 2 SNN Backward Pass (BPTT)
Input: ŷ: predicted class probabilities; y: one-hot label
1: L ← CrossEntropy(ŷ, y)
2: for n = T_inf − 1 down to 0 do
3:   ∇W_ij^l ← ∇W_ij^l + (∂V_i^{l+1}[n] / ∂W_ij^l) · (∂L[T_inf − 1] / ∂V_i^{l+1}[n])   [per (19)]
4: end for
problems during training [29]:
$$W_j \sim U\left[-\sqrt{\frac{6}{n_j + n_{j+1}}},\ \sqrt{\frac{6}{n_j + n_{j+1}}}\right]. \quad (20)$$
We use the Adam optimizer [23] with the learning rate $\eta = 10^{-3}$, the gradient moving average coefficient $\beta_1 = 0.9$, and the gradient-square moving average coefficient $\beta_2 = 0.999$. Training is done on an NVIDIA GeForce RTX 2080, and the batch size is set to 128.
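A minimal PyTorch sketch of the many-to-one training loop of Algorithms 1 and 2 follows; `model` is a hypothetical stateful module holding all membrane potentials, and `reset_state` is an assumed helper:

```python
import torch
import torch.nn.functional as F

def snn_train_step(model, I, y, T_inf=4):
    """Sketch of Algorithms 1 and 2. I has shape (T_inf, C, H, W): one
    binary spike slice per time step; y is an integer class label."""
    model.reset_state()                                # V <- 0 for all neurons
    A = torch.zeros(5)                                 # (5 x 1) spike accumulator
    for n in range(T_inf):                             # feed slices one by one
        A = A + model(I[n].unsqueeze(0)).squeeze(0)    # sigma_3 output spikes
    y_hat = F.softmax(A, dim=0)
    # cross_entropy applies log-softmax internally, i.e., SoftMax + CE on A
    loss = F.cross_entropy(A.unsqueeze(0), torch.tensor([y]))
    loss.backward()                                    # BPTT (Algorithm 2)
    return y_hat, loss
```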
In the SNN case only, the training is done in two phases. First, the network is trained with full-bit weights for $N_{\text{full}}$ epochs. Then, training continues with 4-b quantized weights in the forward pass and full-bit weights in the backward pass for $N_q$ epochs (the total number of epochs is, therefore, $N_{\text{full}} + N_q$). The choice of the number of epochs depends on the radar preprocessing configuration and will be detailed in Section VI.
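The forward-pass quantization can be sketched with a straight-through estimator; the symmetric per-tensor scaling used below is one possible choice, stated here as an assumption:

```python
import torch

def quantize_4b(w):
    """4-b weight quantization with a straight-through estimator: quantized
    values in the forward pass, full-precision gradients in the backward
    pass (the second training phase described above)."""
    scale = w.detach().abs().max().clamp(min=1e-8) / 7.0   # 16 signed levels
    w_q = torch.clamp(torch.round(w / scale), -8, 7) * scale
    return w + (w_q - w).detach()     # forward: w_q; backward: identity to w
```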
TABLE IV
IMPACT OF EVENT-DOMAIN ENCODING ON THE SNN ACCURACY. PREPROCESSING IS DONE WITH μDOPPLER WITH BAND-LIMITING, max48 CODING (EXCEPTION MADE FOR ENCODING 8), AND NORMALIZATION ON THE WHOLE MAP
In Section VI, we will use the SNN and DNN networks
presented above, alongside the radar preprocessing variations
introduced in Section IV, to extensively compare the impact
of preprocessing and event encoding on the accuracy of the
4-b-weight SNN and its full-bit-weight DNN counterpart.
VI. EXPERIMENTAL RESULTS
First, we will explore the effect of the different
preprocessing variations described in Section IV on the
accuracy of the SNN compared to the DNN. The results are
reported in Table III, where preprocessing techniques 1)–6) use
the μDoppler approach, while preprocessing techniques 7)–13)
use the RD approach. In Table III, all SNN results have been
generated using RB coding. The number of epochs needed to
have effective training (without falling into overfitting) varies
depending on the preprocessing used. For the SNN, the last
$N_q = 5$ epochs of each training are done with 4-b quantized
weights in the forward pass. We use sixfold cross-validation
to assess the accuracy performance of each setup (282
examples in the test set and 1413 examples in the training
set for each validation pass). Then, we will take the best
preprocessing setup (in this case, preprocessing technique 6)
from Table III) and vary the event-domain encoding (as
described in Section V). Here, the goal is to optimize the
accuracy of the network without making the inference time
Tinf significantly longer. Table IV shows the accuracy results
for each event-encoding variant.
A. Discussion on the Impact of Preprocessing
Our observations on the impact of preprocessing are given
in the following. Observations 1) and 2) highlight differ-
ences between the SNN and DNN behavior, while observa-
tions 3) and 4) compare our different radar preprocessing
approaches.
1) Table III highlights one of the major observations of this
work: The preprocessing parameters drastically affect
the SNN accuracy, while their effect is significantly more
limited for the DNN accuracy even though the same net-
work topology and the same learning approach are used.
A possible explanation is the following: in our setting,
the differences between the SNN and the DNN are the
neuron model used and the nature of the input (spiking
or not). The output of IF neurons is of binary nature
(a spike), whereas the output of an ReLU neuron is
continuous. The ReLU neuron has more expressiveness
compared to the IF neuron [25]. In addition, the ReLU
neuron does not suffer from the vanishing gradient
problem [24], while the IF neuron does [see (15)],
which may lead to more effective training for the DNN.
Furthermore, encoding input maps into spike trains can
be seen as a quantization in time [37], which could
potentially discard useful information, requiring better
preprocessing to compensate for this information loss.
During our experiments, we have noticed that those
observations hold even when no weight quantization is
performed.
The effect of each preprocessing parameter can be explained as follows. The Δ-variant acts as a first-order high-pass filter for both the μDoppler and RD preprocessing, only keeping the more useful ac information, but it also amplifies the noise in higher frequency bands. This filtering provides an accuracy gain when RD preprocessing is used (entry 7 in Table III), but it does not significantly affect the accuracy for the case of μDoppler preprocessing (entry 2 in Table III) because of the larger noise amplification in the μDoppler case
compared to the RD case. This larger amplification can
be understood as follows. The RD maps are obtained using (12) as the absolute value of RD-processed radar ADC data. Noise in radar data is considered to be additive white Gaussian noise (AWGN) and remains AWGN after RD processing, as Fourier transforms are linear operations, with zero mean and standard deviation $\sigma_w$. Consequently, this noise distribution becomes Rayleigh-distributed with standard deviation $\left((4-\pi)/2\right)^{1/2}\sigma_w$ after taking the absolute value [21]. Then, it can be shown that the resulting standard deviation of the difference between two Rayleigh-distributed noise samples is given by $\sigma_{\Delta RD} = \sigma_w (4-\pi)^{1/2}$ [21]. For ΔμDoppler, on the other hand, differentiation is done on the range-processed data (9) directly, where the noise was assumed AWGN. In that case, the resulting standard deviation after differentiation is given by $\sigma_{\Delta\mu D} = \sigma_w \sqrt{2}$, which is clearly larger than $\sigma_{\Delta RD}$.
Then, band-limiting discards the higher frequency bins
predominantly containing noise. This provides a large
enhancement of the accuracy for the case of μDoppler
preprocessing (entry 4 in Table III). On the other hand,
band-limiting does not seem to affect the accuracy for
the RD preprocessing case (entry 10 in Table III).
Finally, the application of max48 provides a smaller
yet significant accuracy boost for both preprocessing
methods, as it only keeps the 48 largest elements of
each Doppler spectrum untouched and pads the rest to
zero. Thus, this sparse coding step retains the support of
each Doppler spectrum while discarding potential noise
bins. Interestingly, using μDoppler preprocessing results in a wider accuracy range compared to RD. Indeed, the discussion above motivates the fact that μDoppler and its Δ-variant introduce a larger noise contribution,
which explains its lower accuracy compared to RD when
band-limiting and max48 are not applied. But compared
to RD, applying band-limiting and max48 seems to
capture better features in the case of μDoppler, which
explains its larger final accuracy (entry 6 in Table III).
2) The second major observation found in Table III is the
following: Changing the preprocessing parameters can
have antagonistic effects on the SNN and DNN accuracy
even though the same network topology and the same
learning approach are used. This can be seen by com-
paring preprocessing techniques 7) and 8) in Table III,
where using the Δ-variant enhances the SNN accuracy
while degrading the DNN accuracy significantly. This
effect can be explained as follows. One of the advantages
of using DNNs over SNNs is the limited need for
preprocessing due to the analog nature of DNN neurons
over the spiking nature of SNN neurons. The various
preprocessing strategies of Table III sparsify the original
radar maps, which helps the SNN accuracy by increasing
the separability between classes, but inevitably results in
a loss of information, which can degrade the DNN accu-
racy. This demonstrates that conclusions drawn from
DNN design cannot generally be applied to SNN design
even though the same network topology and the same
learning approach are used.
3) Comparing preprocessing techniques 1)–4) in Table III,
we see that, for μDoppler, it is not the sole effect of the Δ-variant or the band-limiting that affects the SNN accuracy, but rather a combination of both. This can be explained as follows: using the Δ-variant alone helps remove the strong dc and near-dc components in each STFT window such that mostly useful ac information is processed. However, the Δ-variant heavily amplifies
the higher-frequency noise components, which degrades
learning. Band-limiting helps by rejecting out-of-band
noise and keeping only the signal components, leading
to an efficient feature extraction, which helps SNN
learning.
4) Globally, we see in Table III that the μDoppler approach
clearly achieves much higher accuracy than the RD
approach with our SNN settings, with the best accuracy
achieved for the μDoppler preprocessing with band-
limiting, max48 coding (keeping the 48 largest entries
in each Doppler spectrum), and normalization on the
complete map (preprocessing technique 6). This is spe-
cific to our dataset and can be explained as follows.
As the gestures are executed at a nearly constant distance
from the radar, the range information in RD maps does
not significantly relate to the nature of gestures and is
not that useful. Training the SNN to find the relevant
features then becomes more challenging as less useful
information is available compared to the μDoppler pre-
processing. Yet, the SNN accuracy is 89% compared to
93% for the DNN. It should be noted that, during our
experiments, we have also investigated the use of a one-hidden-layer fully connected SNN with 50 IF neurons
in the hidden layer and the same output layer as the
SNN shown in Fig. 4. We generally observed the same
recognition accuracy trends presented above compared
to an equivalent one-hidden-layer fully connected ReLU
DNN, although achieving a drastically lower accuracy
because of the oversimplistic network architecture. Next,
we will take our best preprocessing solution (entry 6 in
Table III) and vary the event-encoding parameters to
optimize the SNN network accuracy.
B. Varying Event Encoding to Optimize the SNN Accuracy
By varying the event-encoder type (RB or TTFS) and Tinf ,
the best result is obtained with preprocessing technique 6) of
Table III and encoding method 6) of Table IV, achieving
93% of recognition accuracy. Table IV shows that TTFS gives
on-par or better results compared to RB for the same Tinf
(even though the advantage is limited). This limited boost
in accuracy can be explained as follows. TTFS-encoding the
input generates fewer spikes than RB-encoding, which makes
the network activity sparser. Network sparsity may lead to
better information disentangling and linear separability of
the input representations in the neural network, as discussed
in [24]. For completeness, we come back to the effect of
preprocessing by removing the max48 coding out of our best
preprocessing and encoding queue (entry 8 of Table IV).
Removing max48 has a significantly larger negative effect
compared to the event-encoding parameters (it degrades the
accuracy by 6% compared to entry 6), which again emphasizes
the importance of preprocessing on the SNN performance.
Using the preprocessing technique 6) of Table III and the
encoding 6) of Table IV, our SNN achieves an accuracy
on-par with the DNN while being significantly cheaper
to implement in hardware. First, the SNN requires only
add operations, as the outputs of the IF neurons are binary, whereas the DNN requires multiply and add operations, demanding more die area [40]. Second, the max-pooling
layer in the SNN can be implemented as a pool of simple
OR gates, whereas the max-pooling layer in the DNN must
be implemented as a search operation, as the values at
the input of the pooling layer are not binary in this latter
case. Finally, an important limitation of current computing
architectures is the energy cost of memory transfer due
to the memory bottleneck problem [41]. As our SNN
architecture is event-based, it can be deployed on the growing
number of massively parallel and energy-efficient SNN
processor architectures solving the memory bottleneck
problem and designed specifically for extreme-edge
applications [1], [26], [27].
C. Energy Consumption and Latency Estimation
As discussed in Section II, most SNN-based gesture recog-
nition systems use DVS cameras with a typical power con-
sumption of around 100 mW (such as, for instance, the one
used in [13]). In our system, we use an 8-GHz SISO radar
consuming only 680 μW at the expense of the number
of gesture classes that can be distinguished because of the
lower fidelity nature of our radar signals compared to DVS
cameras and higher frequency radars, such as Google’s Soli
(300 mW) [10].
The different preprocessing approaches presented above
introduce different latency and energy consumption overheads.
In both the μDoppler and RD cases, two Fourier-based trans-
forms are executed, which can efficiently be implemented with
ubiquitous FFT accelerators (integrated into most commercial
radar chips). Using a 256-point FFT only requires 2 × 1668 instruction cycles (56 μs at 60 MHz) and 2 × 407.2 nJ of energy in typical microcontrollers [44]. Similarly, the other
preprocessing steps are cheap to implement as they do not
iterate over the input data. Rather, the bottleneck to the pre-
processing latency depends on the number of radar chirps that
must be collected during each acquisition. For μDoppler-based
preprocessing, a radar acquisition time of 476 ms can be
derived using (11) and the radar parameters specified in
Section III-B. For RD preprocessing, an acquisition time of
238 ms is needed, while, for ΔRD, an acquisition time of
476 ms, similar to the μDoppler case, is needed as two
successive RD maps are used. The SNN energy, latency, and
area overheads depend on the accelerator chip onto which
the architecture is mapped but also on the total number of
neurons, the number of time steps needed for inference, and
the number of events per time step. With the input size of our
best preprocessing technique (100 × 48), the convolutional layer contains 94 × 42 × 6 = 23 688 neurons, and the fully connected layers contain 120 + 5 = 125 neurons, for a total of 23 813 neurons. The max-pooling layer must also
be mapped, and as discussed in Section V-A, our max-pooling
layer behaves as a pool of OR gates with four inputs, providing
downsampling of 2 on the convolution result. Such a layer can be mapped as a layer of IF neurons with synaptic weights and threshold tuned such that, if any of the four input synapses of the neuron receives a spike, then the neuron will spike (e.g., by setting the four synaptic weights and the threshold to unity). This leads to 47 × 21 × 6 = 5922 neurons. Therefore, a total of 29 735 neurons is used.
Even though emerging mixed-signal SNN chips promise
to be even more energy-efficient than digital ones [1], using
the popular Intel Loihi chip [26] as our reference design for
energy consumption and inference latency (as done by most
SNN works in the literature), and using the Nengo-Loihi tool
in Python, we have estimated an energy cost per inference
of only 7.85 μJ (ignoring the cost of transferring data on
and off the device) and a maximum inference time of around
4 ms (which could be reduced when deploying the model
into Loihi through further optimization). Therefore, the total
latency is around 480 ms, which is similar to most radar-based
gesture recognition systems proposed in literature [9]–[12],
[18]. This value is feasible for a real-time system given the
typical time scale of a hand gesture (some state-of-the-art
systems, such as [11], exhibit a total latency of more than
1 s from radar data acquisition to inference while still being useful in practice). This leads to an SNN power consumption of 7.85 μJ / 480 ms = 16.35 μW. Compared to the DVS-based gesture recognition systems proposed in [14] (which reports power estimates on the IBM TrueNorth neuromorphic chip), our system consumes around three orders of magnitude less power (16.35 μW for our system versus 88.5 mW for theirs)
due to the significantly larger network used in their work with
16 convolutional layers versus only one convolutional layer in
our case. As explained earlier, this is at the expense of the
number of classes that can be recognized (five for our system
versus 11 for theirs), with a comparable accuracy (93% for our
system versus 91.77% for theirs). Compared to state-of-the-art
radar-based gesture recognition systems [9], [11], our solution
trades off the number of gesture classes that can be recognized
for higher energy and die area efficiency at the sensor and
neural network level (e.g., 11 classes for [9] and 12 classes
for [11] at the expense of 60-GHz, multiantenna radar sensors
consuming more than 300 mW versus our 8-GHz, SISO radar
consuming merely 680 μW) while still achieving a high
recognition accuracy and fast inference time without making
use of a desktop GPU as in [9] or of an Nvidia Jetson platform
as in [11] (with an inference time of 25.84 ms against 4 ms in
our case), both being too bulky and power-hungry for small-
sized, ultralow-power IoT applications.
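For reference, the latency and power figures quoted above follow from straightforward accounting; the sketch below simply reproduces that arithmetic with the values stated in the text:

```python
# Latency and power accounting for the quoted figures (illustrative).
acquisition_s = 0.476             # radar acquisition time (two RD maps)
inference_s = 0.004               # worst-case Loihi inference time
energy_per_inference_j = 7.85e-6  # estimated with the Nengo-Loihi tools

total_latency_s = acquisition_s + inference_s        # ~0.48 s
avg_snn_power_w = energy_per_inference_j / total_latency_s

print(f"total latency: {total_latency_s * 1e3:.0f} ms")      # ~480 ms
print(f"average SNN power: {avg_snn_power_w * 1e6:.2f} uW")  # ~16.35 uW
```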
D. Applicability of the Proposed SNN on a Different Dataset
Finally, it is important to verify the applicability of our pro-
posed 4-b-weight SNN model on a different dataset than the
8-GHz one used in Section VI-A. To the best of our knowledge, most of the state-of-the-art radar gesture recognition datasets, such as [9], do not provide raw ADC data but rather radar data that have already been preprocessed. Therefore, we train
our SNN model (see Fig. 4) on the dataset proposed in [9],
featuring 11 gesture classes. The dataset in [9] contains a total
of 5500 sequences of RD maps (see Section IV-B), acquired
using a 60-GHz FMCW radar.
Fig. 6. Confusion matrix for the 60-GHz radar gesture recognition dataset proposed in [9]. The recognition accuracy of the 4-b-weight SNN is 94.6% ± 2%.

We denote by $RD_i[t,k,m]$ the $i$th sequence of RD maps, where $t$ is the frame index, $k$ is the range bin, and $m$ is the Doppler bin. The number of frames $M_i$ in each sequence $i$ can vary but is always larger than $M_{\min} = 28$ throughout the dataset. Before feeding the data to our SNN, accumulation and subsampling are performed along $t$ on each $RD_i[t,k,m]$ as follows:
$$\overline{RD}_i[l,k,m] = \sum_{t=l}^{\,l + M_i/M_{\min}} RD_i[t,k,m] \tag{21}$$
where $l$ is the frame index after subsampling. Finally, spiking input sequences are obtained by thresholding $\overline{RD}_i[l,k,m]$ against an empirically chosen threshold of $10^{2}$. Values larger than the threshold are assigned to 1, while smaller values are assigned to 0.
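For illustration, this accumulation, subsampling, and thresholding step can be sketched in NumPy as follows (our literal reading of (21); the array layout and the example sequence length are assumptions):

```python
import numpy as np

M_MIN = 28       # minimum sequence length M_min (from the text)
THRESHOLD = 1e2  # empirical binarization threshold (from the text)

def rd_sequence_to_spikes(rd_seq, m_min=M_MIN, threshold=THRESHOLD):
    """Accumulate and subsample a sequence of RD maps along the frame
    axis following our reading of (21), then threshold the result into
    binary spiking inputs. rd_seq has shape (M_i, n_range, n_doppler)."""
    m_i = rd_seq.shape[0]
    win = m_i // m_min  # accumulation window M_i // M_min
    # One output frame per subsampled index l, summing t = l ... l + win.
    acc = np.stack([rd_seq[l:l + win + 1].sum(axis=0) for l in range(m_min)])
    return (acc > threshold).astype(np.uint8)

# Example: a 35-frame sequence of 32 x 32 RD maps -> 28 binary frames.
spikes = rd_sequence_to_spikes(np.abs(np.random.randn(35, 32, 32)) * 50)
print(spikes.shape)  # (28, 32, 32)
```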
Training is conducted using the same learning parameters as in Section V-B, by iterating through 16 cycles of 20 epochs, with the first $N_{full} = 19$ epochs of each cycle trained with full-precision weights and the last epoch of each cycle trained with 4-b weights in the forward pass ($N_q = 1$). Similar to Section VI-A, the model performance is assessed using sixfold cross-validation. Fig. 6 shows the obtained confusion matrix, with a gesture recognition accuracy of 94.6% ± 2%.
This demonstrates the applicability of our proposed 4-b-weight
SNN model to a second radar gesture dataset.
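The cyclic mixed-precision schedule described above can be sketched in PyTorch as follows (a simplified illustration under our assumptions: `model`, `loader`, and `loss_fn` stand for the user's network, data, and loss, and `quantize_4b` is a plain uniform symmetric quantizer, not necessarily the exact quantization used in this work):

```python
import torch

N_CYCLES, N_EPOCHS, N_FULL = 16, 20, 19  # schedule from the text

def quantize_4b(w):
    """Uniform symmetric 4-b weight quantization (illustrative)."""
    scale = w.abs().max() / 7 + 1e-12  # signed 4-b range: -8 ... 7
    return torch.clamp(torch.round(w / scale), -8, 7) * scale

def train_cyclic(model, loader, loss_fn, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for cycle in range(N_CYCLES):
        for epoch in range(N_EPOCHS):
            quantized = epoch >= N_FULL  # last N_q = 1 epoch of each cycle
            for x, y in loader:
                if quantized:
                    # Forward pass with 4-b weights; keep full-precision
                    # copies so the gradient step updates those instead.
                    full = [p.data.clone() for p in model.parameters()]
                    for p in model.parameters():
                        p.data = quantize_4b(p.data)
                loss = loss_fn(model(x), y)
                loss.backward()  # gradients w.r.t. the forward pass used
                if quantized:
                    for p, w in zip(model.parameters(), full):
                        p.data = w  # restore before the optimizer step
                opt.step()
                opt.zero_grad()
```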
VII. CONCLUSION
This article has shown the impact of various preprocessing
and event-encoding techniques on the accuracy of a 4-b-weight
SNN within the context of radar gesture recognition for
extreme-edge applications with limited power and die area.
In particular, it has been shown that, while using the same
network topology and learning approach, preprocessing drastically affects the SNN accuracy, while its effect on the DNN accuracy is significantly more limited. In addition, it has been
demonstrated that conclusions drawn from DNN design cannot
directly be generalized to SNN design as the preprocessing
parameters can have antagonistic effects on the SNN and DNN
accuracy, even if the same topology and learning approaches
are used. Also, the impact of event encoding on the network accuracy has been explored, although its influence is smaller than that of preprocessing. Then, after an extensive comparison between different approaches, a well-suited radar preprocessing and event-encoding queue has been developed for use with a small, 4-b-weight SNN, achieving 93% recognition accuracy within only four time steps. It must
be noted that the preprocessing and event-encoding strategy
developed in this article could also be extended to audio sig-
nals as future work. Finally, the applicability of the proposed
4-b-weight SNN approach has been demonstrated on a second
radar dataset, reaching 94.6% accuracy. Unlike previous
works, this low-complexity architecture enables high-accuracy
and low-latency inference while consuming very little energy
and area when implemented in event-based hardware, making
it extremely useful for embedded edge-AI applications.
REFERENCES
[1] S. Moradi, N. Qiao, F. Stefanini, and G. Indiveri, “A scalable multicore architecture with heterogeneous memory structures for dynamic neuromorphic asynchronous processors (DYNAPs),” IEEE Trans. Biomed. Circuits Syst., vol. 12, no. 1, pp. 106–122, Feb. 2018, doi: 10.1109/TBCAS.2017.2759700.
[2] S. B. Shrestha and G. Orchard, “SLAYER: Spike layer error reassign-
ment in time,” in Proc. Neural Inf. Process. Syst., Montreal, QC, Canada,
Dec. 2018, pp. 1–10.
[3] F. Zenke and S. Ganguli, “SuperSpike: Supervised learning in mul-
tilayer spiking neural networks,” Neural Comput., vol. 30, no. 6,
pp. 1514–1541, Jun. 2018.
[4] E. O. Neftci, H. Mostafa, and F. Zenke, “Surrogate gradient learning
in spiking neural networks: Bringing the power of gradient-based
optimization to spiking neural networks,” IEEE Signal Process. Mag.,
vol. 36, no. 6, pp. 51–63, Nov. 2019, doi: 10.1109/MSP.2019.2931595.
[5] E. Hunsberger and C. Eliasmith, “Spiking deep networks with LIF
neurons,” CoRR, vol. abs/1510.08829, Oct. 2015.
[6] Q. Bolsee and A. Munteanu, “CNN-based denoising of time-of-flight depth images,” in Proc. 25th IEEE Int. Conf. Image Process. (ICIP), Athens, Greece, Oct. 2018, pp. 510–514, doi: 10.1109/ICIP.2018.8451610.
[7] M. Nazaré, “Deep convolutional neural networks and noisy images,” in
Progress in Pattern Recognition, Image Analysis, Computer Vision, and
Applications. Cham, Switzerland: Springer, 2018, pp. 416–424.
[8] Y.-H. Liu et al., “A 680 μW burst-chirp UWB radar transceiver for
vital signs and occupancy sensing up to 15 m distance,” in IEEE Int.
Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco,
CA, USA, Feb. 2019, pp. 166–168, doi: 10.1109/ISSCC.2019.8662536.
[9] S. Wang, J. Song, J. Lien, I. Poupyrev, and O. Hilliges, “Interacting with
soli: Exploring fine-grained dynamic gesture recognition in the radio-
frequency spectrum,” in Proc. 29th Annu. Symp. User Interface Softw.
Technol., Oct. 2016, pp. 851–860.
[10] J. Lien et al., “Soli: Ubiquitous gesture sensing with millimeter wave radar,” ACM Trans. Graph., vol. 35, no. 4, pp. 1–4, 2016.
[11] Y. Sun, T. Fei, X. Li, A. Warnecke, E. Warsitz, and N. Pohl, “Real-time
radar-based gesture detection and recognition built in an edge-computing
platform,” IEEE Sensors J., vol. 20, no. 18, pp. 10706–10716, Sep. 2020,
doi: 10.1109/JSEN.2020.2994292.
[12] X. Zhang, Q. Wu, and D. Zhao, “Dynamic hand gesture recognition
using FMCW radar sensor for driving assistance,” in Proc. 10th Int.
Conf. Wireless Commun. Signal Process. (WCSP), Hangzhou, China,
Oct. 2018, pp. 1–6, doi: 10.1109/WCSP.2018.8555642.
[13] R. Massa, A. Marchisio, M. Martina, and M. Shafique, “An efficient
spiking neural network for recognizing gestures with a DVS camera on
the Loihi neuromorphic processor,” 2021, arXiv:2006.09985. [Online].
Available: https://arxiv.org/abs/2006.09985
[14] A. Amir et al., “A low power, fully event-based gesture recognition
system,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR),
Honolulu, HI, USA, Jul. 2017, pp. 7243–7252.
[15] J.-M. Maro, S.-H. Ieng, and R. Benosman, “Event-based gesture recog-
nition with dynamic background suppression using smartphone compu-
tational capabilities,” Frontiers Neurosci., vol. 14, p. 275, Apr. 2020.
[16] J. Kaiser, H. Mostafa, and E. Neftci, “Synaptic plasticity dynamics
for deep continuous local learning (DECOLLE),” Frontiers Neurosci.,
vol. 14, p. 424, May 2020.
[17] R. Ghosh, A. Gupta, A. Nakagawa, A. Soares, and N. Thakor, “Spatiotemporal filtering for event-based action recognition,” 2019, arXiv:1903.07067. [Online]. Available: http://arxiv.org/abs/1903.07067
[18] J. S. Suh, S. Ryu, B. Han, J. Choi, J.-H. Kim, and S. Hong, “24 GHz
FMCW radar system for real-time hand gesture recognition using
LSTM,” in Proc. Asia–Pacific Microw. Conf. (APMC), Kyoto, Japan,
Nov. 2018, pp. 860–862, doi: 10.23919/APMC.2018.8617375.
[19] D. Banerjee et al., “Application of spiking neural networks for
action recognition from radar data,” in Proc. Int. Joint Conf.
Neural Netw. (IJCNN), Glasgow, U.K., Jul. 2020, pp. 1–10, doi:
10.1109/IJCNN48605.2020.9206853.
[20] V. C. Chen, Radar Micro-Doppler Signatures: Processing and Applications. London, U.K.: Inst. Eng. Technol., 2014.
[21] M. A. Richards, Fundamentals of Radar Signal Processing. New York, NY, USA: McGraw-Hill, 2005.
[22] H. Xu, Z. Wang, H. Yang, D. Liu, and J. Liu, “Learning simple thresholded features with sparse support recovery,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 4, pp. 970–982, Apr. 2020, doi: 10.1109/TCSVT.2019.2901713.
[23] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
in Proc. Int. Conf. Learn. Represent., 2014, pp. 1–15.
[24] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” J. Mach. Learn. Res., 2010.
[25] X. Pan and V. Srikumar, “Expressiveness of rectifier networks,” CoRR, vol. abs/1511.05678, Nov. 2015.
[26] M. Davies et al., “Loihi: A neuromorphic manycore processor with on-
chip learning,” IEEE Micro, vol. 38, no. 1, pp. 82–99, Jan./Feb. 2018,
doi: 10.1109/MM.2018.112130359.
[27] C. Frenkel, M. Lefebvre, J. Legat, and D. Bol, “A 0.086-mm² 12.7-pJ/SOP 64k-synapse 256-neuron online-learning digital spiking neuromorphic processor in 28-nm CMOS,” IEEE Trans. Biomed. Circuits Syst., vol. 13, no. 1, pp. 145–158, Feb. 2019, doi: 10.1109/TBCAS.2018.2880425.
[28] C. Frenkel, J.-D. Legat, and D. Bol, “A 28-nm convolutional neu-
romorphic processor enabling online learning with spike-based reti-
nas,” 2020, arXiv:2005.06318. [Online]. Available: http://arxiv.org/abs/
2005.06318
[29] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” J. Mach. Learn. Res.-Track, vol. 9, pp. 249–256, 2010.
[30] P. J. Werbos, “Backpropagation through time: What it does and how
to do it,” Proc. IEEE, vol. 78, no. 10, pp. 1550–1560, Oct. 1990, doi:
10.1109/5.58337.
[31] B. Vandersmissen et al., “Indoor person identification using a low-
power FMCW radar,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 7,
pp. 3941–3952, Jul. 2018, doi: 10.1109/TGRS.2018.2816812.
[32] S. Wu et al., “Person-specific heart rate estimation with ultra-wideband
radar using convolutional neural networks,” IEEE Access,vol.7,
pp. 168484–168494, 2019, doi: 10.1109/ACCESS.2019.2954294.
[33] W. Kim, H. Cho, J. Kim, B. Kim, and S. Lee, “YOLO-based
simultaneous target detection and classification in automotive FMCW
radar systems,” Sensors, vol. 20, no. 10, p. 2897, May 2020, doi:
10.3390/s20102897.
[34] V. M. Lubecke, O. Boric-Lubecke, A. Host-Madsen, and A. E. Fathy,
“Through-the-wall radar life detection and monitoring,” in IEEE MTT-S
Int. Microw. Symp. Dig., Honolulu, HI, USA, Jun. 2007, pp. 769–772,
doi: 10.1109/MWSYM.2007.380053.
[35] F. J. Harris, “On the use of windows for harmonic analysis with the
discrete Fourier transform,” Proc. IEEE, vol. 66, no. 1, pp. 51–83,
Jan. 1978, doi: 10.1109/PROC.1978.10837.
[36] D. Feng et al., “Deep multi-modal object detection and semantic seg-
mentation for autonomous driving: Datasets, methods, and challenges,”
CoRR, vol. abs/1902.07830, Feb. 2019.
[37] E. Doutsi, L. Fillatre, M. Antonini, and P. Tsakalides, “Dynamic image
quantization using leaky integrate-and-fire neurons,” IEEE Trans. Image
Process., vol. 30, pp. 4305–4315, 2021.
[38] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, “14.5
Envision: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-
accuracy-frequency-scalable convolutional neural network processor in
28 nm FDSOI,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig.
Tech. Papers, San Francisco, CA, USA, Feb. 2017, pp. 246–247, doi:
10.1109/ISSCC.2017.7870353.
[39] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-
efficient reconfigurable accelerator for deep convolutional neural net-
works,” IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138,
Jan. 2017, doi: 10.1109/JSSC.2016.2616357.
[40] B. Rueckauer, I.-A. Lungu, Y. Hu, M. Pfeiffer, and S.-C. Liu, “Con-
version of continuous-valued deep networks to efficient event-driven
networks for image classification,” Frontiers Neurosci., vol. 11, p. 682,
Dec. 2017.
[41] M. Horowitz, “1.1 computing’s energy problem (and what we can
do about it),” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig.
Tech. Papers, San Francisco, CA, USA, Feb. 2014, pp. 10–14, doi:
10.1109/ISSCC.2014.6757323.
[42] A. Paszke et al., “Automatic differentiation in PyTorch,” Tech. Rep., 2017.
[43] R. Brette, “Philosophy of the spike: Rate-based vs. spike-based theories
of the brain,” Frontiers Syst. Neurosci., vol. 9, p. 151, Nov. 2015.
[44] M. McKeown, “FFT implementation on the TMS320VC5505,
TMS320C5505, and TMS320C5515 DSPs,” Texas Instrum., Dallas, TX,
USA, Appl. Rep., 2013.
Ali Safa (Graduate Student Member, IEEE) received
the M.Sc. degree (summa cum laude) in electrical
engineering from the Université Libre de Bruxelles,
Brussels, Belgium. He is currently pursuing the
Ph.D. degree in AI-driven processing and sensor
fusion for extreme edge applications with imec,
Leuven, Belgium, and Katholieke Universiteit Leu-
ven (KU Leuven), Leuven.
He joined imec and KU Leuven in 2020.
Federico Corradi (Member, IEEE) received the
B.Sc. degree in physics from Università degli studi
di Parma, Parma, Italy, the M.Sc. degree (cum laude)
in physics from La Sapienza University, Rome, Italy,
the Ph.D. degree in natural sciences from the Uni-
versity of Zurich, Zürich, Switzerland, and the Ph.D.
degree in neuroscience from the Ph.D. International
Program, Neuroscience Center Zurich, Zürich.
He is currently a Senior Research and Development Scientist at imec, Eindhoven, The Netherlands.
His research activities are at the interface between
neuroscience and neuromorphic engineering. His research focuses on neuro-
morphic computing technologies for IoT applications. His contributions
in the field include the development of neuromorphic circuits and systems,
and their application in biomedical signal processing.
Lars Keuninckx received the M.Eng. degree in
telecommunications from Hogeschool Gent, Ghent,
Belgium, in 1996, and the bachelor’s degree
in physics and the Ph.D. degree in engineering
from Vrije Universiteit Brussel, Brussels, Belgium,
in 2009 and 2016, respectively.
After his M.Sc. degree, he worked in industry
for several years, designing electronics for automo-
tive, industrial, and medical applications. He joined
imec, Leuven, Belgium, in 2019, where he is
involved in the design of neuromorphic systems. His
research interests include the applications of complex dynamics and reservoir
computing.
Ilja Ocket (Member, IEEE) received the M.Sc.
and Ph.D. degrees in electrical engineering from
Katholieke Universiteit (KU) Leuven, Leuven,
Belgium, in 1998 and 2009, respectively.
He currently serves as the Program Manager for
neuromorphic sensor fusion at the IoT Department,
imec, Leuven. His research interests include all
aspects of heterogeneous integration of highly minia-
turized millimeter-wave systems, spanning design,
technology, and metrology. He is also involved in
research on using broadband impedance sensing and
dielectrophoretic actuation for lab-on-chip applications.
André Bourdoux (Senior Member, IEEE) received
the M.Sc. degree in electrical engineering from
the Université Catholique de Louvain-la-Neuve,
Ottignies-Louvain-la-Neuve, Belgium, in 1982.
In 1998, he joined imec, Leuven, Belgium, where
he is currently a Principal Member of Technical Staff
with the Internet-of-Things Research Group. He is
a system-level and signal processing expert for the
mm-wave wireless communications and radar teams.
He has more than 15 years of research experience in
radar systems and 15 years of research experience
in broadband wireless communications. He holds several patents in these
fields. He has authored or coauthored more than 160 publications in books
and peer-reviewed journals and conferences. His research interests include
advanced signal processing, and machine learning for wireless physical layer
and high-resolution 3-D/4-D radars.
Francky Catthoor (Fellow, IEEE) received
the Ph.D. degree in electrical engineering from
Katholieke Universiteit (KU) Leuven, Leuven,
Belgium, in 1987.
From 1987 to 2000, he was the head of several
research domains in the area of synthesis techniques
and architectural methodologies. Since 2000, he has
been strongly involved in other activities at imec,
Leuven, including coexploration of application,
computer architecture and deep submicrometer
technology aspects, biomedical systems and IoT sensor nodes, and photovoltaic modules combined with renewable
energy systems. He is also a Senior Fellow of imec. He is also a part-time
Full Professor with the Department of Electrical Engineering (ESAT),
KU Leuven.
Dr. Catthoor has been an associate editor of several IEEE and ACM
journals.
Georges G. E. Gielen (Fellow, IEEE) received the
M.Sc. and Ph.D. degrees in electrical engineering
from Katholieke Universiteit Leuven (KU Leuven),
Leuven, Belgium, in 1986 and 1990, respectively.
He is currently a Full Professor with the MICAS
Research Division, Department of Electrical Engi-
neering (ESAT), KU Leuven. Since 2020, he has
been the Chair of the Department of Electrical Engi-
neering. His research interests include the design
of analog and mixed-signal integrated circuits, and
especially in analog and mixed-signal CAD tools
and design automation. He is a frequently invited speaker/lecturer and coor-
dinator/partner of several (industrial) research projects in this area, including
several European projects. He has (co)authored ten books and more than
600 articles in edited books, international journals, and conference proceed-
ings. He is a 1997 Laureate of the Belgian Royal Academy of Sciences,
Literature, and Arts in the discipline of engineering.
Dr. Gielen is an Elected Member of the Academia Europaea. He received the
IEEE CAS Mac Van Valkenburg Award in 2015 and the IEEE CAS Charles
Desoer Award in 2020.