Improving the Accuracy of Spiking Neural
Networks for Radar Gesture Recognition
Through Preprocessing
Ali Safa, Graduate Student Member, IEEE, Federico Corradi, Member, IEEE, Lars Keuninckx,
Ilja Ocket, Member, IEEE, André Bourdoux, Senior Member, IEEE,
Francky Catthoor, Fellow, IEEE, and Georges G. E. Gielen, Fellow, IEEE
Abstract— Event-based neural networks are currently being
explored as efficient solutions for performing AI tasks at the
extreme edge. To fully exploit their potential, event-based neural
networks coupled to adequate preprocessing must be investi-
gated. Within this context, we demonstrate a 4-b-weight spiking
neural network (SNN) for radar gesture recognition, achieving
a state-of-the-art 93% accuracy within only four processing
time steps while using only one convolutional layer and two
fully connected layers. This solution consumes very little energy
and area if implemented in event-based hardware, which makes
it suited for embedded extreme-edge applications. In addition,
we demonstrate the importance of signal preprocessing for
achieving this high recognition accuracy in SNNs compared to
deep neural networks (DNNs) with the same network topology
and training strategy. We show that efficient preprocessing prior
to the neural network is drastically more important for SNNs
compared to DNNs. We also demonstrate, for the first time,
that the preprocessing parameters can affect SNNs and DNNs in
antagonistic ways, prohibiting the generalization of conclusions
drawn from DNN design to SNNs. We demonstrate our findings
by comparing the gesture recognition accuracy achieved with our
SNN to a DNN with the same architecture and similar training.
Unlike previously proposed neural networks for radar processing,
this work enables ultralow-power radar-based gesture recognition
for extreme-edge devices.
Index Terms—Energy- and area-efficient networks, gesture
recognition, preprocessing impact on accuracy, radar processing,
spiking neural networks (SNNs).
Manuscript received December 18, 2020; revised June 11, 2021 and August 26, 2021; accepted August 31, 2021. This work was supported in part by the Flanders Artificial Intelligence (AI) Research Program. (Corresponding author: Ali Safa.)
Ali Safa, Ilja Ocket, Francky Catthoor, and Georges G. E. Gielen are with imec, 3001 Leuven, Belgium, and also with the Department of Electrical Engineering, Katholieke Universiteit (KU) Leuven, 3001 Leuven, Belgium (e-mail: ali.safa@imec.be; ilja.ocket@imec.be; francky.catthoor@imec.be; georges.gielen@kuleuven.be).
Federico Corradi is with Stichting imec, 5656 AE Eindhoven, The Netherlands (e-mail: federico.corradi@imec.nl).
Lars Keuninckx and André Bourdoux are with imec, 3001 Leuven, Belgium (e-mail: lars.keuninckx@imec.be; andre.bourdoux@imec.be).
Digital Object Identifier 10.1109/TNNLS.2021.3109958

I. INTRODUCTION

In recent years, event-based neural networks (as opposed to frame-based ones) have gained considerable interest
due to their low energy consumption, low inference latency, and their suitability for implementation in massively parallel, non-von Neumann architectures where computation is performed close to the memory [1], alleviating the memory bottleneck issue. Event-based neural networks [also called spiking neural
networks (SNNs)] are, thus, well suited for edge-AI applica-
tions where low energy consumption, compact die area, and
low latency are key requirements.
Following the popularity of the backpropagation of error
algorithm and supervised learning, multiple methods have
been proposed to solve the issue of gradient computation due
to the discontinuous Dirac-pulse activation of SNNs [2], [3].
Currently, the use of a surrogate gradient coupled with back-
propagation through time (BPTT) has gained popularity due
to its remarkable efficiency [4].
In the past decade, radar sensing via neural networks has
gained huge interest because of the advantages that radar sen-
sors present over vision-based technologies. Unlike cameras,
radars preserve privacy, are independent of lighting condi-
tions, and are able to sense even when occluded [34]. Deep
neural networks (DNNs) have successfully been used for radar
processing in a wide range of applications, from indoor person
identification [31] and heart-rate estimation [32] to target
classification [33] and gesture recognition [9]. However, those
DNN-based solutions are ill-suited for embedded AI applica-
tions at the extreme edge (such as in IoT devices), where the
current trend is to embed neural network accelerators within
the sensor chip itself and where the very tight energy budgets
generally cannot cope with the latency requirements and
computing power needed to run such deep learning solutions,
notwithstanding the remarkable recent efforts of many teams in
DNN acceleration [38], [39]. In an effort to bring radar-based
sensing to the ultralow-power edge devices, we investigate in
this article the use of SNNs for radar sensing and gesture
recognition and explore the design challenges that must be
faced to make them effective compared to conventional DNNs.
Indeed, a competitive SNN accuracy compared to the
DNN counterpart has mainly been demonstrated on standard
benchmarks, such as MNIST and CIFAR10 [5], and it is
unclear how SNNs compare to DNNs when used for other
applications, such as our radar gesture recognition task. More
specifically, we show in this article that SNN processing of the
sensory data requires some level of preprocessing, but there
has been little discussion so far about the impact of such data
preprocessing (such as denoising, sparse coding, or dimen-
sionality reduction) on the SNN performance compared to
any preprocessing impact on the DNN performance. DNNs
have shown remarkable success when applied to unprocessed
datasets corrupted by nonstationary noise [6], with exper-
imental evidence suggesting that learning with noisy data
might even be beneficial [7]. It is unclear if and how those
observations generalize to SNNs.
Within this context, we demonstrate the importance of
adding preprocessing to a resource-constrained SNN for
the task of gesture recognition in an extreme-edge device
using a custom 8-GHz ultralow-power frequency-modulated
continuous-wave (FMCW) SISO radar [8], together with pre-
senting a custom preprocessing queue performing sparse cod-
ing and dimensionality reduction. The resulting architecture
achieves state-of-the-art performance compared to conven-
tional solutions.
The key contributions of this work are given as follows.
1) We demonstrate the importance of radar data preprocess-
ing for SNNs compared to its importance for DNNs in
a supervised learning setting.
2) We demonstrate that conclusions drawn from DNN
design cannot be extended to the SNN counterpart, even
if a similar training methodology is used.
3) We introduce a well-suited method for preprocessing
and encoding radar intermediate frequency (IF) signals
into the event domain, extensively comparing different
approaches.
4) We propose a low-resource SNN architecture with only
one convolutional layer and only two fully connected
hidden layers, using only 4-b weights, to achieve 93%
of recognition accuracy within four time steps, which
establishes a high energy efficiency, a low die area usage,
and a low latency when implemented in hardware for
edge devices.
This article is organized as follows. Related works are dis-
cussed in Section II. Our radar data acquisition is introduced
in Section III. Our preprocessing approaches are detailed in
Section IV. Our neural network is introduced in Section V.
Results and their discussion are presented in Section VI.
Conclusions are provided in Section VII.
II. RELATED WORKS
Radar gesture recognition is actively being explored using
conventional DNN solutions [9], [11], [12], [18]. Using
Google’s Soli sensor [10], 11-class gesture recognition with
87% accuracy has been demonstrated [9]. The authors use a
network of rectified linear units (ReLU) neurons composed of
multiple convolutional and max-pooling layers, followed by
multiple fully connected layers and ending with a recurrent
layer using long short-term memory blocks. As input to the
network, they perform range-Doppler (RD) processing on the
radar signals and feed their magnitude as images. Compared
to our work, they are able to recognize more gesture classes at
the expense of much higher network complexity (three 3 ×3
convolutional layers with 32, 64, and 128 filters, respectively;
two fully connected layers both with 512 neurons; an LSTM
layer of size 512; and a fully connected SoftMax layer
of 11 outputs). In addition, the authors point out that their
remarkable ability to recognize 11 classes is also due to their
use of a 60-GHz radar [9], [10], providing close to eight times
finer grain Doppler resolution compared to our 8-GHz radar
chip (as Doppler resolution is inversely proportional to the
radar frequency [20]). On the other hand, their 60-GHz radar
consumes around 300 mW of power [10] compared to 680 μW
for our 8-GHz radar [8], which makes our sensor much
better suited for ultralow-power applications, such as IoT edge
devices. Also, they do not quantize the weights nor the input
to make their network compatible with applications requiring
ultralow power consumption. Therefore, even though better
suited for solutions that can make use of GPU compute power,
their solution is unsuited for ultralow-power edge devices with
limited energy available.
As mentioned above, a key observation made by
Wang et al. [9] and Lien et al. [10] is the need for fine-grain
Doppler resolution in order to achieve high-accuracy gesture
recognition with many classes. This is a key limitation of
our 8-GHz sensor as we trade off frequency for ultralow
power consumption and die area utilization. Integrating an
SNN with such a small-area (1.8 mm² [8] versus 144 mm² for Google's Soli [10]), low-power 8-GHz radar chip is,
therefore, a well-suited choice, as SNN architectures can
lead to substantially higher energy- and area-efficient imple-
mentations compared to current DNN accelerators [1], [26],
[27], even when the network architecture is shallow (see [28]
for a discussion about the accuracy–area–energy tradeoff).
However, using SNNs comes at the cost of a potentially
severe accuracy loss compared to using DNNs (see Table III).
This motivates the exploration of preprocessing strategies for
the SNNs, as provided in this article. Indeed, it is due to a
well-suited preprocessing strategy that we are able to achieve
a high-accuracy, five-class gesture recognition with our simple
8-GHz radar and small 4-b-weight SNN in this work.
In order to bring radar gesture recognition closer to the
edge, Sun et al. [11] proposed a 12-class gesture recognition
system using a 60-GHz FMCW radar with an Nvidia Jetson
Nano board running a network of ReLU neurons with four
convolutional layers and three fully connected layers achieving
94% accuracy. Compared to our work with the SISO setting,
their radar system [11] uses multiple receive antennas to
provide azimuth and elevation information, at the expense of
increased power consumption in the radar front-end due to the
use of multiple receive antennas. As input to their network,
the authors first compute the magnitude of RD maps for each
antenna and sum the resulting maps for each antenna. Then,
they extract the 25 largest bins and compute the azimuth and
elevation angles for those bins. Finally, they aggregate the
range, Doppler, and angular information as a feature cube
that is fed to the neural network. Compared to [11], our
preprocessing chain is much simpler as we have an SISO
setting (angle processing is impossible since only one receive
antenna is available). While the system of [11] is well suited
for GPU-enabled edge computing platforms, it is much less
suited for extreme-edge applications requiring an ultralow
power consumption. Sun et al. [11] also do not perform aggressive quantization on the inputs and network weights, in contrast to our SNN processing queue.
Other DNN solutions have reported similar results
by extracting micro-Doppler features of the gestures
(joint time–frequency representations) instead of RD map
processing [12]. In this work, both micro-Doppler and RD
approaches will be compared in order to illustrate the
importance of choosing the right preprocessing approach for
resource-constrained SNNs.
Compared to our work, the DNN solutions described above
are all frame-based and, therefore, cannot be deployed on the
growing number of event-based SNN processors well suited
for ultralow-power AI applications. Moradi et al. [1] proposed
a mixed-signal SNN processor implementing an asynchronous
and event-based packet routing methodology in order to min-
imize latency and memory requirements. Davies et al. [26]
proposed a fully digital asynchronous and event-based SNN
accelerator, whereas Frenkel et al. [27] proposed a fully
synchronous, yet event-based digital architecture.
Although not fully comparable to our work, most
SNN-based gesture recognition systems use dynamic vision
sensor (DVS) cameras with pixels that directly output spikes
when their change in luminosity crosses a certain thresh-
old [2], [13]. Compared to standard cameras that output
image frames at a certain frame rate, DVS cameras output
events asynchronously, which makes them a natural choice
for SNN-based processing. Compared to our small 8-GHz
radar chip, DVS cameras provide signals of significantly better
fidelity (images versus sparse number of radar returns) at
the expense of being bulkier, power-hungry (23 mW for an
iniVation DVS128 camera versus 680 μW for our 8-GHz
radar), and more expensive (in the $2000 range compared to $10 for ubiquitous radar chips). Even then, the performance
of our work compares favorably with results in the DVS ges-
ture recognition literature. Using the DVSGesture dataset [14],
the reported accuracies using various machine learning models
and SNNs fall in the 91%–96% range for ten-class recognition
[2], [14]–[17]. Recently, Massa et al. [13] proposed an 11-class
gesture recognition system achieving 89.64% of accuracy by
first training a DNN composed of four convolutional layers
and a fully connected SoftMax layer and then converting this
trained DNN to an SNN through the conversion of ReLU
activations into spike rates (as opposed to our work where
we train the SNN directly in the spiking domain). More
closely related to our work, Maro et al. [15] have reported
an accuracy of 93% (similar to ours) using the NavGesture
dataset containing five similar gestures compared to our work.
It is worth noting that, currently, the application field of SNNs
is still limited compared to DNNs, and therefore, it is worth
investigating the application of SNNs to sensory data other
than the traditional DVS cameras used in most SNN works.
Closely related to our work, eight-class human action recog-
nition with CW radar and SNN processing has been pro-
posed in [19], achieving an accuracy of 85%. Although their
preprocessing shares commonalities with ours, they do not
perform sparse coding nor dimensionality reduction after radar
processing, which helps to increase the SNN performance
in terms of energy–area efficiency and accuracy, as reducing
the dimensionality of radar signals leads to overall smaller
network architectures as fewer neurons would be required.
In addition, dimensionality reduction can be used to reject out-
of-band noise, and sparse coding can increase the separability
of the input signals, which can lead to a higher accuracy [24].
Banerjee et al. [19] use an SNN topology composed of multi-
ple convolutional spiking layers and winner-take-all classifica-
tion, coupled to the spike-timing-dependent plasticity (STDP)
learning rule. Compared to the topology in [19], our pro-
posed network architecture is simpler. Banerjee et al. [19] make use of leaky integrate-and-fire neurons, while we use non-leaky integrate-and-fire (I&F) neurons, which are simpler to implement in hardware.
Furthermore, Banerjee et al. [19] do not quantize the weights,
while we apply 4-b quantization to the weights in our net-
work. As we are aiming for keeping the neural network con-
strained in size, the architecture of [19] is not appropriate for
our goal.
With this article, our aim is to illustrate the importance
and the lessons learned on the impact of using preprocessing
for SNNs compared to its impact on DNNs. To the best of
our knowledge, an extensive discussion about the impact of
preprocessing on the accuracy of resource-constrained SNNs
compared to nonconstrained DNNs with the same topology,
both trained via backpropagation, has not yet been reported.
Such a discussion is important within the radar context of
this article as state-of-the-art radar gesture recognition DNNs
targeting the edge computing domain [11] are principally
being deployed in embedded platforms, such as the Nvidia
Jetson board, providing support for high bit-width integer and
floating-point computation [11]. This will be discussed and
illustrated in the example case of radar gesture recognition in
Section VI.
III. RADAR DATA ACQUISITION SETUP
This section gives a brief overview of FMCW radar theory
and the radar dataset that we used for the gesture recognition
demonstrations.
A. FMCW Radar Theory
Using SISO FMCW radars, a chirp signal $p_t(t)$ with slope $\alpha$ and carrier frequency $f_c$ is emitted at the transmit antenna. This can be modeled as follows [21]:
$$p_t(t) = \sin\left(2\pi f_c t + \pi \alpha t^2\right). \quad (1)$$
Bouncing off an object located at a distance $d$ from the radar, the signal $p_r(t)$ at the receive antenna can be modeled as a delayed and attenuated version of the transmitted signal (assuming a single point target)
$$p_r(t) = \xi\, p_t(t - T_d) \quad (2)$$
where $\xi$ denotes the attenuation coefficient. The time-of-flight delay $T_d$ depends on the round-trip distance $2d$ from the radar to the target
$$T_d = \frac{2d}{c} \quad (3)$$
where $c$ is the speed of light. By mixing the received signal $p_r(t)$ with a replica of $p_t(t)$ at the in-phase receiver of the radar chip [8], and by low-pass filtering the resulting signal such that the frequency component $f_c$ is removed, the following IF signal is obtained:
$$r(t) = -\frac{\xi}{2}\cos\left(2\pi\alpha T_d t + 2\pi f_c T_d - \pi\alpha T_d^2\right). \quad (4)$$
As the frequency of the baseband signal is proportional to $T_d$ and, therefore, $d$, its spectrum shows peaks at certain frequencies corresponding to the distance between the radar and the surrounding objects (the received signal being, in fact, a sum of many signals of the form given by (4) when multiple targets are present). After the analog-to-digital conversion by the in-phase ADC, the discrete-time signal $r[i]$ is found as
$$r[i] = r(t = i T_f). \quad (5)$$
The choice of the ADC sampling period $T_f$ impacts the maximum range coverage $d_{\max}$ of the radar [21]
$$d_{\max} = \frac{c}{2\alpha T_f}. \quad (6)$$
The slope $\alpha$ is defined as follows:
$$\alpha = \frac{B}{T_c} \quad (7)$$
where $T_c$ is the chirp duration and $B$ is the bandwidth of the chirp, which impacts the range resolution [21]
$$d_{\text{res}} = \frac{c}{2B}. \quad (8)$$
For chirp $n$, the range profile $R_n[k]$ is defined as the DFT of $r[i]$. Peaks in the magnitude of $R_n[k]$ indicate the presence of a target, while the phase evolution of $R_n[k]$ for a target at range bin $k$ between successive chirps represents the micromotions of the target [20] (a small radial displacement $\Delta d$ induces a phase shift $\Delta\varphi = (4\pi/\lambda)\Delta d$). The spectrum of the micromotions is called the Doppler profile and is defined as the DFT of $R_n[k]$ along $n$ [20].
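For concreteness, the following minimal numpy sketch simulates the IF signal of (4) for a single point target and recovers its range through the range-profile DFT of (9); the chirp bandwidth and scaling used here are illustrative assumptions, not the exact parameters of the radar in [8]:

```python
import numpy as np

c = 3e8                     # speed of light [m/s]
B, Tc = 250e6, 41e-6        # assumed chirp bandwidth and duration
alpha = B / Tc              # chirp slope, (7)
Tf = 80e-9                  # ADC sampling period
L = 512                     # ADC samples per chirp
fc = 8e9                    # carrier frequency
d = 2.0                     # true target distance [m]

Td = 2 * d / c              # time-of-flight delay, (3)
t = np.arange(L) * Tf       # discrete sampling instants, (5)
r = -0.5 * np.cos(2 * np.pi * alpha * Td * t
                  + 2 * np.pi * fc * Td
                  - np.pi * alpha * Td**2)   # IF signal, (4) with xi = 1

R = np.fft.fft(np.blackman(L) * r) / L       # range profile, (9)
k_peak = np.argmax(np.abs(R[: L // 2]))      # strongest range bin
d_max = c / (2 * alpha * Tf)                  # maximum range, (6)
print("estimated range: %.2f m" % (k_peak / L * d_max))
```

The recovered range (about 1.8 m here) is within one range-resolution cell $d_{\text{res}} = c/(2B)$ of the true 2-m distance, as expected from (8).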
B. Gesture Dataset
We have used a custom ultralow-power 8-GHz SISO
radar [8] to acquire a five-class gesture dataset. Table I shows
the dataset content used for the demonstration experiments
reported later on. The gestures were recorded at a distance
of 2 m from the antennas (RX and TX) and were obtained
by swinging the right or left arm in the vertical direction
(gesture one-vertical), by swinging the right or left arm in
the horizontal direction (gesture two-horizontal), by waving
with the right or left hand while keeping the palm facing out
(gesture three-hello), by moving the hand with the palm facing
out toward and away from the radar (gesture four-toward),
and, finally, by recording background activity in which none
of the above gestures appears in a static background (gesture
five-background). It should be noted that such a dataset
is particularly well-suited for comparing the importance of
preprocessing in SNNs and DNNs as radar signals represent
the environment with a lower fidelity compared to images [36],
TABLE I
DATASET CONTENT USED IN THE DEMONSTRATION EXPERIMENTS
Fig. 1. Gesture acquisition setup.
making them more sensitive to proper preprocessing and
feature extraction.
The radar parameters were set as follows: the number of ADC samples per chirp is 512, the number of chirps per frame is 192, and the time between chirps is $T_s = 1.2$ ms, while the chirp duration $T_c$ is 41 μs. Therefore, the total duration for a frame capture is 238 ms and $T_f = 80$ ns. Fig. 1 shows
the gesture acquisition setup with the antennas and the radar
read-out boards.
IV. RADAR PREPROCESSING APPROACHES
This section presents the two widely used radar preprocess-
ing approaches that will then be compared in Section VI,
followed by a description of our proposed dimensionality
reduction and sparse coding techniques.
A. μDoppler Signature

We acquire the μDoppler signature plot [20] of each gesture acquisition in the dataset by computing the 1-D vector $R_n[k]$ for $n = 1, \ldots, N_{\text{tot}}$ ($N_{\text{tot}}$ is the total number of chirps), where $k$ is the range bin in which the gestures are performed (which is known a priori since the distance between the human target and the radar is fixed) and $R_n[k]$ is acquired by DFT as follows:
$$R_n[k] = \frac{1}{L}\sum_{i=0}^{L-1} w[i]\, r_n[i]\, e^{-j\frac{2\pi k i}{L}} \quad (9)$$
where $L = 512$ is the number of ADC samples of $r_n$ (the received IF signal for chirp $n$) and $w$ denotes the Blackman window that we used [35].
Then, we apply the short-time Fourier transform (STFT) on $R_n[k]$ as follows [19]:
$$S[m, f] = \sum_{n=-\infty}^{\infty} R_n[k]\, g_s[n - mR]\, e^{-j2\pi f n} \quad (10)$$
where $g_s$ is the Hanning window of length $s$ and $R$ is the hop size ($s = 192$ and $R = 8$ in this work). Finally, the μDoppler signature plot is found by taking the magnitude of $S[m, f]$ [12]. The size of the $S[m, f]$ matrix for each acquisition of Table I is $(N_T \times s)$, with $s = 192$ frequency bins and the number of time bins $N_T$ given by [19]
$$N_T = \frac{N_{\text{frames}} N_{\text{chirps}} - N_{\text{overlap}}}{s - N_{\text{overlap}}} \quad (11)$$
where $N_{\text{overlap}}$ is the number of overlapping bins between subsequent analysis windows and is equal to $s - R$.
We also investigate a variation of this method, which we call ΔμDoppler, by applying the STFT on the difference sequence $\tilde{R}_n[k] = R_n[k] - R_{n-1}[k]$ to remove the strong dc component during each analysis window [31].
In order to obtain example patches to feed to our neural network, we cut $|S[m, f]|$ along dimension $m$ into patches of 48 time samples, resulting in $\lfloor N_T/48 \rfloor$ examples for each acquisition. Finally, we construct a balanced dataset with a total of 1695 examples by randomly selecting 339 examples per acquisition class. The choice of 339 examples comes from the fact that the background acquisition (which is the one containing the smallest number of frames) gives 351 example patches according to (11), but we discard the first and last six example patches to remove startup and ending artifacts.
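A compact sketch of this μDoppler pipeline, under the parameter values given above ($s = 192$, $R = 8$, 48-sample patches), could look as follows; the function name and array layout are our own:

```python
import numpy as np

def micro_doppler_patches(R_k, s=192, R_hop=8, patch_len=48, delta=False):
    """Sketch of (10)-(11): STFT of the slow-time sequence R_n[k] at the
    known range bin, magnitude, then cutting into patches of 48 time bins.
    R_k is a 1-D complex array of R_n[k] over chirps n."""
    if delta:                           # DeltaMuDoppler variant:
        R_k = R_k[1:] - R_k[:-1]        # difference removes the strong dc
    g = np.hanning(s)                   # analysis window g_s
    starts = range(0, len(R_k) - s + 1, R_hop)
    stft = np.stack([np.fft.fft(g * R_k[m:m + s]) for m in starts])
    sig = np.abs(stft)                  # muDoppler signature |S[m, f]|
    n_patches = sig.shape[0] // patch_len
    return [sig[p * patch_len:(p + 1) * patch_len] for p in range(n_patches)]
```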
B. RD Maps

We acquire the RD maps [9] for each gesture acquisition in the dataset by first acquiring [using (9)] $R_n[k]$, $n = 1, \ldots, N_{\text{win}}$, as a matrix of size $(L \times N_{\text{win}})$, where $N_{\text{win}}$ is the number of chirps in each RD map (set to 192 in this work). Then, the RD map is found as follows [21]:
$$D[m, k] = \frac{1}{N_{\text{win}}}\sum_{n=0}^{N_{\text{win}}-1} w[n]\, R_n[k]\, e^{-j\frac{2\pi m n}{N_{\text{win}}}}. \quad (12)$$
This operation is repeated for each successive $N_{\text{win}}$ chirps, and $\lfloor N_{\text{tot}}/N_{\text{win}} \rfloor$ RD maps are acquired during gesture acquisition, with $N_{\text{tot}}$ being the total number of chirps for a particular gesture recording. Only the first 50 range bins of each RD map are kept in order to reject far-away clutter.
In order to remove dc reflectors and exploit the correlation between successive RD maps, we also investigate a custom variation of this method, which we call ΔRD, by aggregating the difference between successive RD maps as follows (where $b$ denotes the RD map index):
$$\tilde{D}_b[m, k] = \max\left(D_b[m, k] - D_{b-1}[m, k],\, 0\right). \quad (13)$$
Similar to the first preprocessing method, we construct a balanced dataset with a total of 1695 examples, randomly selecting 339 examples per class to enable a fair comparison between the two preprocessing approaches.
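The RD-map computation of (12) and the ΔRD aggregation of (13) can be sketched as follows (array layout and helper name are assumptions; a real implementation would typically also fftshift the Doppler axis):

```python
import numpy as np

def rd_maps(r_adc, N_win=192, n_range_bins=50, delta=False):
    """Sketch of (12)-(13). r_adc has shape (N_tot, L): one row of ADC
    samples per chirp."""
    N_tot, L = r_adc.shape
    R = np.fft.fft(np.blackman(L) * r_adc, axis=1) / L       # range FFT, (9)
    R = R[:, :n_range_bins]                                   # reject far clutter
    maps = []
    for b in range(N_tot // N_win):
        block = R[b * N_win:(b + 1) * N_win]
        w = np.blackman(N_win)[:, None]
        D = np.abs(np.fft.fft(w * block, axis=0)) / N_win     # Doppler FFT, (12)
        maps.append(D)
    if delta:                                                 # DeltaRD variant, (13)
        maps = [np.maximum(maps[b] - maps[b - 1], 0.0)
                for b in range(1, len(maps))]
    return maps
```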
Fig. 2. Example patch for the “vertical” gesture, acquired using μDoppler
preprocessing with Doppler band-limiting, max48, and complete map normal-
ization (pixel values between 0 and 1).
Fig. 3. Example patch for the “vertical” gesture, acquired using RD
preprocessing with Doppler band-limiting, max48, and complete map nor-
malization (pixel values between 0 and 1).
C. Enhancements

For both preprocessing methods described above, and for their respective Δ-variations, we explore the effect of a representative subset of possible enhancements [31].
1) Dimensionality reduction through Doppler spectrum band-limiting.
2) Sparse coding through soft thresholding.
3) Image normalization.
a) In order to reject out-of-band noise, we explore
the effect of band-limiting the Doppler spectra by
keeping only a reduced portion of the Doppler axis
between the normalized frequencies [−0.26,0.26]
(frequency band found by visually identifying the
maximal significant extent of the Doppler spectra
in the dataset).
b) In order to remove in-band noise, we use a fast
approximation to Lasso coding [22] by soft thresh-
olding each Doppler spectrum, for each time bin in
the case of the preprocessing method A or for each range bin in the case of the preprocessing method B. For soft thresholding, we use the max_k operator,
Fig. 4. SNN architecture. Each pixel of the input radar maps is first encoded into a spike train of length Tinf , either through RB or TTFS coding. Each
spiking map slice corresponding to each time step is then fed one by one to the network and the IF neurons change state according to their self-recurrence
(as denoted by the black recurrence arrows). Output spikes emitted by the σ3layer are accumulated during the Tinf time steps, and the final accumulation
vector Ais transformed through SoftMax to class probabilities.
which keeps the $k$ largest values and replaces the others with 0 (we do not subtract the threshold from the values to be kept). To choose $k$, we consider that more than half of the Doppler samples within the normalized frequencies $[-0.26, 0.26]$ are significant, which leads to the choice of $k = \lfloor 192 \times (0.26 - (-0.26))/2 \rfloor - 1 = 48$ throughout this work (see the sketch after this list).
c) Before encoding the preprocessed radar signals
into event streams to be fed to the SNN, we nor-
malize the pixel values between 0 and 1, either by
normalizing each Doppler spectrum individually or
by normalizing the complete maps (which yields fewer large pixel values than the normalization of each individual Doppler spectrum).
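A minimal sketch of enhancements a)–c), assuming Doppler bins ordered from −0.5 to 0.5 normalized frequency, is given below; note that the thresholding keeps ties, so it retains at least $k$ values per spectrum:

```python
import numpy as np

def enhance(doppler_map, f_lo=-0.26, f_hi=0.26, k=48, per_spectrum_norm=False):
    """Sketch of enhancements a)-c): Doppler band-limiting, max_k soft
    thresholding, and normalization. doppler_map: (time_or_range, 192)."""
    n_bins = doppler_map.shape[1]
    lo = int(round((f_lo + 0.5) * n_bins))           # a) band-limiting:
    hi = int(round((f_hi + 0.5) * n_bins))           # 192 bins -> 100 bins
    x = doppler_map[:, lo:hi]
    thresh = -np.sort(-x, axis=1)[:, k - 1:k]        # k-th largest per spectrum
    x = np.where(x >= thresh, x, 0.0)                # b) max_k, kept values untouched
    if per_spectrum_norm:                            # c) normalization
        x = x / np.maximum(x.max(axis=1, keepdims=True), 1e-12)
    else:
        x = x / max(x.max(), 1e-12)                  # whole-map normalization
    return x
```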
For the sake of illustration, Figs. 2 and 3 show exam-
ple patches of the “vertical” gesture, for each preprocessing
method, to be fed to the neural network. As seen in Fig. 2,
μDoppler processing leads to a periodic-like pattern in
time, which captures the Doppler information of the range
bin in which the gestures are performed. On the other hand,
RD (see Fig. 3) provides no time information but rather
captures the Doppler information carried by any wave that
was reflected from the gestures in the environment (hence the
range dimension).
In Section VI, we will explore the effect of the above preprocessing methods A and B, their respective Δ-variations, and the enhancements a), b), and c) on the accuracy of our
low-complexity SNN compared to the accuracy of a DNN with
the same topology.
V. NEURAL NETWORK DESIGN AND TRAINING APPROACH

We use two neural networks within this work: an SNN with I&F neurons [see (14), where $V_k$ is the neural membrane potential at time step $k$, $J_{\text{in}}$ is the neuron input, and $S$ is the spiking output] and the corresponding DNN with ReLU neurons. Fig. 4 shows the topology used for both networks. The input size to the neural network varies depending on the preprocessing combination used, as shown in Table II:
$$V_{k+1} = V_k + J_{\text{in}}, \quad S = 0, \qquad \text{if } V_k < 1$$
$$V_{k+1} = 0, \quad S = 1, \qquad \text{if } V_k \geq 1. \qquad (14)$$
TABLE II
NETWORK INPUT SIZES. THE SIZE IS DIFFERENT FOR EACH PREPROCESSING VARIANT AS EACH PREPROCESSING PRODUCES RESULTS IN DIFFERENT DOMAINS: DOPPLER FREQUENCY AGAINST THE NUMBER OF SLIDING WINDOW HOPS THROUGH TIME FOR μDOPPLER AND DOPPLER FREQUENCY AGAINST RANGE FOR RD. BAND-LIMITING ALSO AFFECTS THE SIZE AS IT CROPS THE ORIGINAL DIMENSION OF THE DOPPLER AXIS FROM 192 DOPPLER BINS TO 100 IN OUR CASE (REJECTING HIGHER-FREQUENCY BINS)
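The I&F dynamics of (14) reduce to a few lines; the following vectorized sketch applies one time step to a whole layer:

```python
import numpy as np

def if_step(V, J_in, threshold=1.0):
    """One time step of the non-leaky I&F neuron of (14), vectorized over
    a layer. V: membrane potentials; J_in: summed synaptic input."""
    S = (V >= threshold).astype(V.dtype)   # spike where threshold is crossed
    V = np.where(S > 0, 0.0, V + J_in)     # reset on spike, else integrate
    return V, S
```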
A. Architectural Choices
The architectural choices made during the design of the
neural network (see Figs. 4 and 5) are motivated as follows.
First, we decided to use a convolutional layer at the network
input to better capture spatial patterns in the input maps and
avoid overfitting through weight sharing. Then, we opted for
a max-pooling layer instead of average pooling, as we wanted
to keep the spiking nature of the data to be given as input to
the pooling layer. Indeed, pooling is performed on the output
of the spiking neurons in layer σ1. With average pooling,
the resulting map would not contain any spikes anymore,
as the output values are between 0 and 1. On the other
hand, max pooling preserves the spiking nature of the data
(output values are either equal to 0 or 1), which is beneficial
for hardware implementation, as it is equivalent to a simple
OR gate passing the spikes to the next layer. It is important
to note that max pooling in SNNs has been shown to be
rather complex to deploy for those SNNs that result from
continuous-valued DNN to SNN conversion (see [40], where
an existing DNN is converted into an SNN through rate-based
(RB) coding). This is not a problem in our case as we train
our SNNs (containing a standard max-pooling layer) directly
in the spiking domain. Finally, limiting the size of the network
to only one convolutional layer and only two fully connected
layers is motivated by hardware considerations (a smaller
network consumes less energy and is easier to implement
in hardware). We have decided to use IF neurons over more
complex models as the IF neuron is the simplest to implement
in hardware, again reducing the chip area overhead.
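Because the data entering the pooling layer are binary, 2 × 2 max pooling is exactly a logical OR over each window, as the following sketch illustrates:

```python
import numpy as np

def spiking_maxpool2x2(spikes):
    """2x2 max pooling on a binary spike map: on 0/1 inputs, max equals a
    logical OR over each window, which is why a pool of OR gates (or I&F
    neurons with unit weights and threshold) suffices in hardware."""
    h, w = spikes.shape[0] // 2 * 2, spikes.shape[1] // 2 * 2
    blocks = spikes[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))   # equivalently: OR over each 2x2 window
```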
Fig. 5. Corresponding DNN architecture. The network retains a similar topology to Fig. 4 but with IF neurons replaced by ReLU and the output of the σ2
layer directly fed to the SoftMax nonlinearity, as conversion from the event domain to the frame domain by accumulation is not needed.
For the SNN, the conversion from event to frame domain
is done by counting spikes during the inference time Tinf
for each of the five output neurons in the σ3layer. Then,
the five-element vector containing the spike counts is trans-
formed via SoftMax to class probabilities. For the DNN,
the SoftMax layer is directly connected to the weights $W^2_{ij}$.
In the SNN case, each pixel of the radar input maps must
first be encoded into event trains of length Tinf (number of
time steps for inference) to make the input compatible with
the spiking, event-based nature of the SNN. In Section VI,
we will explore the effect of two spike encoding approaches:
RB and time-to-first-spike (TTFS) coding [43], as well as the
effect of the inference time Tinf .
RB and TTFS encoding are done as follows. After normalizing the pixel values between 0 and 1, a value $v$ is coded into a periodic event train with event rate $\lfloor v T_{\text{inf}} \rfloor / T_{\text{inf}}$ for RB, while $v$ is coded into an event train containing one spike located at index $T_{\text{inf}} - \lfloor v T_{\text{inf}} \rfloor$ for TTFS. In both cases, if $\lfloor v T_{\text{inf}} \rfloor = 0$, no spike is generated.
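A sketch of both encoders for a single normalized pixel value is given below; the even spacing of the RB spikes is our assumption (any placement with $\lfloor v T_{\text{inf}} \rfloor$ spikes yields the stated rate):

```python
import numpy as np

def encode_pixel(v, T_inf, scheme="RB"):
    """Spike-train encoding of a normalized pixel v in [0, 1] over T_inf
    time steps, following the RB / TTFS description above."""
    train = np.zeros(T_inf, dtype=np.uint8)
    n = int(np.floor(v * T_inf))
    if n == 0:                       # no spike generated in either scheme
        return train
    if scheme == "RB":               # periodic train: n spikes in T_inf steps
        idx = np.round(np.arange(n) * T_inf / n).astype(int)
        train[np.clip(idx, 0, T_inf - 1)] = 1
    else:                            # TTFS: one spike at index T_inf - n
        train[T_inf - n] = 1
    return train
```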
B. Training Approach

To train the SNN, we use the PyTorch framework [42] by defining a custom IF neuron model behaving as (14). As the derivative of a spike as a function of the neuron membrane potential, $\sigma'(V)$, is ill-defined, we approximate it using a Gaussian function (15) as surrogate derivative [4] to enable backpropagation:
$$\sigma'(V) \approx \frac{1}{\sqrt{2\pi}}\, e^{-2V^2}. \quad (15)$$
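In PyTorch, this surrogate can be realized with a custom autograd function: a hard threshold in the forward pass and the Gaussian of (15) in the backward pass. Centering the Gaussian on the firing threshold is our reading of (15):

```python
import math
import torch

class IFSpike(torch.autograd.Function):
    """Spike nonlinearity: Heaviside step in the forward pass, Gaussian
    surrogate derivative (15) in the backward pass."""

    @staticmethod
    def forward(ctx, v, threshold=1.0):
        ctx.save_for_backward(v - threshold)     # distance to threshold
        return (v >= threshold).float()          # binary spike output

    @staticmethod
    def backward(ctx, grad_output):
        (u,) = ctx.saved_tensors
        surrogate = torch.exp(-2.0 * u ** 2) / math.sqrt(2.0 * math.pi)  # (15)
        return grad_output * surrogate, None     # no gradient for threshold

spike = IFSpike.apply   # usage: s = spike(v_membrane)
```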
Even if the network itself is not recurrent, backpropagation
should be carried through time (BPTT) as each spiking neuron
can be seen as a recurrent unit by itself [4]. Indeed, an input
event to a spiking neuron at time nstill affects its membrane
potential at future times as evidenced by (14). In the general
case of BPTT, the total loss function Ltot can be written as
the sum of the losses for each time step as follows [30]:
Ltot =
Tinf1
n=0
L[n].(16)
Then, the derivative of $L_{\text{tot}}$ with respect to weight $W^l_{ij}$ in layer $l$ is given as follows:
$$\frac{\partial L_{\text{tot}}}{\partial W^l_{ij}} = \sum_{m=0}^{T_{\text{inf}}-1} \frac{\partial V^{l+1}_i[m]}{\partial W^l_{ij}} \sum_{n=m}^{T_{\text{inf}}-1} \frac{\partial L[n]}{\partial V^{l+1}_i[m]} \quad (17)$$
where $V^{l+1}_i[m]$ is the membrane potential of neuron $i$ in layer $l+1$ at time step $m$. Expanding (17) shows the effect that a spike at time step $n$ has on all future evaluations of the loss:
$$\frac{\partial L_{\text{tot}}}{\partial W^l_{ij}} = \frac{\partial V^{l+1}_i[0]}{\partial W^l_{ij}} \sum_{n=0}^{T_{\text{inf}}-1} \frac{\partial L[n]}{\partial V^{l+1}_i[0]} + \frac{\partial V^{l+1}_i[1]}{\partial W^l_{ij}} \sum_{n=1}^{T_{\text{inf}}-1} \frac{\partial L[n]}{\partial V^{l+1}_i[1]} + \cdots + \frac{\partial V^{l+1}_i[T_{\text{inf}}-1]}{\partial W^l_{ij}}\, \frac{\partial L[T_{\text{inf}}-1]}{\partial V^{l+1}_i[T_{\text{inf}}-1]}. \quad (18)$$
Equation (18) shows how BPTT applies to SNNs: the membrane potential at time 0 resulting from an input spike at time 0 affects the loss function evaluation from time step 0 onward; the membrane potential at time 1 resulting from input spikes at times 0 and 1 affects the loss function evaluation from time step 1 onward; and so on until the final inference time step $T_{\text{inf}}-1$. For our gesture classification task, the loss function only takes a nonzero value at the last time step as we accumulate all the output spikes after layer $\sigma_3$ before applying SoftMax (a so-called many-to-one correspondence). In that case, (18) simplifies to the following:
$$\frac{\partial L_{\text{tot}}}{\partial W^l_{ij}} = \sum_{m=0}^{T_{\text{inf}}-1} \frac{\partial V^{l+1}_i[m]}{\partial W^l_{ij}}\, \frac{\partial L[T_{\text{inf}}-1]}{\partial V^{l+1}_i[m]}. \quad (19)$$
Our feedforward SNN can, thus, be seen as a recurrent
network where the recurrence is due to the state of the
membrane potentials being passed to the next time step.
This enables the use of BPTT just like in regular RNNs.
Algorithms 1 and 2 summarize the forward and backward
pass during SNN training and inference. The 3-D inputs I[n]
(resulting from the spike train encoding of each pixel in the
2-D radar maps) are sliced out as 2-D binary maps at each
spike time step and fed one by one to the input of the network.
For training both the SNN and the DNN, we initialize the weights of our network using the uniform Glorot method [see (20), proposed in [29], where $U$ denotes the uniform distribution and $n_j$ is the number of neurons in layer $j$], with the biases set to zero, helping the variance at the input of the layers to be equal to the variance at the output in the forward pass and vice versa for the backward pass. Unlike the usual random initialization, where statistics are not layer-dependent, this method helps avoid the vanishing and exploding gradient
TABLE III
IMPACT OF RADAR PREPROCESSING ON THE SNN ACCURACY COMPARED TO DNN ACCURACY. FOR THE SNN, THE $N_q = 5$ LAST EPOCHS ARE TRAINED WITH QUANTIZED WEIGHTS. RB CODING IS USED THROUGHOUT THE TABLE. THE RESULTS CLEARLY MOTIVATE THE NEED FOR USING A WELL-SUITED PREPROCESSING IN THE SNN CASE
Algorithm 1 SNN Forward Pass
Input: I[n], n = 0, …, T_inf − 1: radar map spike slices; y: one-hot label
Output: ŷ, L
State: V_i^l: all membrane potentials; A: (5 × 1) spike accumulator
Initialization: V_i^l ← 0 ∀ i, l; A ← 0
1: for n = 0 to T_inf − 1 do
2:   σ_3 ← Network(I[n], V_i^l)
3:   A ← A + σ_3
4: end for
5: ŷ ← SoftMax(A)
6: L ← CrossEntropy(ŷ, y)
7: return ŷ, L

Algorithm 2 SNN Backward Pass (BPTT)
Input: ŷ: predicted class probabilities; y: one-hot label
1: L ← CrossEntropy(ŷ, y)
2: for n = T_inf − 1 down to 0 do
3:   ∇W_ij^l ← ∇W_ij^l + (∂V_i^{l+1}[n] / ∂W_ij^l) · (∂L[T_inf − 1] / ∂V_i^{l+1}[n])   [per (19)]
4: end for
problems during training [29]:
$$W_j \sim U\left[-\sqrt{\frac{6}{n_j + n_{j+1}}},\ \sqrt{\frac{6}{n_j + n_{j+1}}}\right]. \quad (20)$$
We use the Adam optimizer [23] with the learning rate $\eta = 10^{-3}$, the gradient moving average coefficient $\beta_1 = 0.9$, and the gradient-square moving average coefficient $\beta_2 = 0.999$. Training is done on an NVIDIA GeForce RTX 2080, and the batch size is set to 128.
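A minimal PyTorch sketch of the many-to-one training loop of Algorithms 1 and 2 follows; `model` is a hypothetical stateful module holding all membrane potentials, and `reset_state` is an assumed helper:

```python
import torch
import torch.nn.functional as F

def snn_train_step(model, I, y, T_inf=4):
    """Sketch of Algorithms 1 and 2. I has shape (T_inf, C, H, W): one
    binary spike slice per time step; y is an integer class label."""
    model.reset_state()                                # V <- 0 for all neurons
    A = torch.zeros(5)                                 # (5 x 1) spike accumulator
    for n in range(T_inf):                             # feed slices one by one
        A = A + model(I[n].unsqueeze(0)).squeeze(0)    # sigma_3 output spikes
    y_hat = F.softmax(A, dim=0)
    # cross_entropy applies log-softmax internally, i.e., SoftMax + CE on A
    loss = F.cross_entropy(A.unsqueeze(0), torch.tensor([y]))
    loss.backward()                                    # BPTT (Algorithm 2)
    return y_hat, loss
```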
In the SNN case only, the training is done in two phases. First, the network is trained with full-bit weights for $N_{\text{full}}$ epochs. Then, training continues with 4-b quantized weights in the forward pass and full-bit weights in the backward pass for $N_q$ epochs (the total number of epochs is, therefore, $N_{\text{full}} + N_q$). The choice of the number of epochs depends on the radar preprocessing configuration and will be detailed in Section VI.
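The forward-pass quantization can be sketched with a straight-through estimator; the symmetric per-tensor scaling used below is one possible choice, stated here as an assumption:

```python
import torch

def quantize_4b(w):
    """4-b weight quantization with a straight-through estimator: quantized
    values in the forward pass, full-precision gradients in the backward
    pass (the second training phase described above)."""
    scale = w.detach().abs().max().clamp(min=1e-8) / 7.0   # 16 signed levels
    w_q = torch.clamp(torch.round(w / scale), -8, 7) * scale
    return w + (w_q - w).detach()     # forward: w_q; backward: identity to w
```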
TABLE IV
IMPACT OF EVENT-DOMAIN ENCODING ON THE SNN ACCURACY. PREPROCESSING IS DONE WITH μDOPPLER WITH BAND-LIMITING, max48 CODING (EXCEPTION MADE FOR ENCODING 8), AND NORMALIZATION ON THE WHOLE MAP
In Section VI, we will use the SNN and DNN networks
presented above, alongside the radar preprocessing variations
introduced in Section IV, to extensively compare the impact
of preprocessing and event encoding on the accuracy of the
4-b-weight SNN and its full-bit-weight DNN counterpart.
VI. EXPERIMENTAL RESULTS
First, we will explore the effect of the different
preprocessing variations described in Section IV on the
accuracy of the SNN compared to the DNN. The results are
reported in Table III, where preprocessing techniques 1)–6) use
the μDoppler approach, while preprocessing techniques 7)–13)
use the RD approach. In Table III, all SNN results have been
generated using RB coding. The number of epochs needed to
have effective training (without falling into overfitting) varies
depending on the preprocessing used. For the SNN, the last
$N_q = 5$ epochs of each training are done with 4-b quantized
weights in the forward pass. We use sixfold cross-validation
to assess the accuracy performance of each setup (282
examples in the test set and 1413 examples in the training
set for each validation pass). Then, we will take the best
preprocessing setup (in this case, preprocessing technique 6)
from Table III) and vary the event-domain encoding (as
described in Section V). Here, the goal is to optimize the
accuracy of the network without making the inference time
Tinf significantly longer. Table IV shows the accuracy results
for each event-encoding variant.
A. Discussion on the Impact of Preprocessing
Our observations on the impact of preprocessing are given
in the following. Observations 1) and 2) highlight differ-
ences between the SNN and DNN behavior, while observa-
tions 3) and 4) compare our different radar preprocessing
approaches.
1) Table III highlights one of the major observations of this
work: The preprocessing parameters drastically affect
the SNN accuracy, while their effect is significantly more
limited for the DNN accuracy even though the same net-
work topology and the same learning approach are used.
A possible explanation is the following: in our setting,
the differences between the SNN and the DNN are the
neuron model used and the nature of the input (spiking
or not). The output of IF neurons is of binary nature
(a spike), whereas the output of an ReLU neuron is
continuous. The ReLU neuron has more expressiveness
compared to the IF neuron [25]. In addition, the ReLU
neuron does not suffer from the vanishing gradient
problem [24], while the IF neuron does [see (15)],
which may lead to more effective training for the DNN.
Furthermore, encoding input maps into spike trains can
be seen as a quantization in time [37], which could
potentially discard useful information, requiring better
preprocessing to compensate for this information loss.
During our experiments, we have noticed that those
observations hold even when no weight quantization is
performed.
The effect of each preprocessing parameter can be explained as follows. The Δ-variant acts as a first-order high-pass filter for both the μDoppler and RD preprocessing, only keeping the more useful ac information, but it also amplifies the noise in higher frequency bands. This filtering provides an accuracy gain when RD preprocessing is used (entry 7 in Table III), but it does not significantly affect the accuracy for the case of μDoppler preprocessing (entry 2 in Table III) because of the larger noise amplification in the μDoppler case
compared to the RD case. This larger amplification can
be understood as follows. The RD maps are obtained using (12) as the absolute value of RD-processed radar ADC data. Noise in radar data is considered to be additive white Gaussian noise (AWGN) and remains AWGN after RD processing, as Fourier transforms are linear operations, with zero mean and standard deviation $\sigma_w$. Consequently, this noise distribution becomes Rayleigh-distributed with standard deviation $\left((4-\pi)/2\right)^{1/2}\sigma_w$ after taking the absolute value [21]. Then, it can be shown that the resulting standard deviation of the difference between two Rayleigh-distributed noise samples is given by $\sigma_{\Delta RD} = \sigma_w (4-\pi)^{1/2}$ [21]. For ΔμDoppler, on the other hand, differentiation is done on the range-processed data (9) directly, where the noise was assumed AWGN. In that case, the resulting standard deviation after differentiation is given by $\sigma_{\Delta\mu D} = \sigma_w \sqrt{2}$, which is clearly larger than $\sigma_{\Delta RD}$.
Then, band-limiting discards the higher frequency bins
predominantly containing noise. This provides a large
enhancement of the accuracy for the case of μDoppler
preprocessing (entry 4 in Table III). On the other hand,
band-limiting does not seem to affect the accuracy for
the RD preprocessing case (entry 10 in Table III).
Finally, the application of max48 provides a smaller
yet significant accuracy boost for both preprocessing
methods, as it only keeps the 48 largest elements of
each Doppler spectrum untouched and pads the rest to
zero. Thus, this sparse coding step retains the support of
each Doppler spectrum while discarding potential noise
bins. Interestingly, using μDoppler preprocessing results in a wider accuracy range compared to RD. Indeed, the discussion above motivates the fact that μDoppler and its Δ-variant introduce a larger noise contribution,
which explains its lower accuracy compared to RD when
band-limiting and max48 are not applied. But compared
to RD, applying band-limiting and max48 seems to
capture better features in the case of μDoppler, which
explains its larger final accuracy (entry 6 in Table III).
2) The second major observation found in Table III is the
following: Changing the preprocessing parameters can
have antagonistic effects on the SNN and DNN accuracy
even though the same network topology and the same
learning approach are used. This can be seen by com-
paring preprocessing techniques 7) and 8) in Table III,
where using the Δ-variant enhances the SNN accuracy
while degrading the DNN accuracy significantly. This
effect can be explained as follows. One of the advantages
of using DNNs over SNNs is the limited need for
preprocessing due to the analog nature of DNN neurons
over the spiking nature of SNN neurons. The various
preprocessing strategies of Table III sparsify the original
radar maps, which helps the SNN accuracy by increasing
the separability between classes, but inevitably results in
a loss of information, which can degrade the DNN accu-
racy. This demonstrates that conclusions drawn from
DNN design cannot generally be applied to SNN design
even though the same network topology and the same
learning approach are used.
3) Comparing preprocessing techniques 1)–4) in Table III,
we see that, for μDoppler, it is not the sole effect of the Δ-variant or the band-limiting that affects the SNN accuracy, but rather a combination of both. This can be explained as follows: using the Δ-variant alone helps remove the strong dc and near-dc components in each STFT window such that mostly useful ac information is processed. However, the Δ-variant heavily amplifies
the higher-frequency noise components, which degrades
learning. Band-limiting helps by rejecting out-of-band
noise and keeping only the signal components, leading
to an efficient feature extraction, which helps SNN
learning.
4) Globally, we see in Table III that the μDoppler approach
clearly achieves much higher accuracy than the RD
approach with our SNN settings, with the best accuracy
achieved for the μDoppler preprocessing with band-
limiting, max48 coding (keeping the 48 largest entries
in each Doppler spectrum), and normalization on the
complete map (preprocessing technique 6). This is spe-
cific to our dataset and can be explained as follows.
As the gestures are executed at a nearly constant distance
from the radar, the range information in RD maps does
not significantly relate to the nature of gestures and is
not that useful. Training the SNN to find the relevant
features then becomes more challenging as less useful
information is available compared to the μDoppler pre-
processing. Yet, the SNN accuracy is 89% compared to
93% for the DNN. It should be noted that, during our
experiments, we have also investigated the use of a one-hidden-layer fully connected SNN with 50 IF neurons
in the hidden layer and the same output layer as the
SNN shown in Fig. 4. We generally observed the same
recognition accuracy trends presented above compared
to an equivalent one-hidden-layer fully connected ReLU
DNN, although achieving a drastically lower accuracy
because of the oversimplistic network architecture. Next,
we will take our best preprocessing solution (entry 6 in
Table III) and vary the event-encoding parameters to
optimize the SNN network accuracy.
B. Varying Event Encoding to Optimize the SNN Accuracy
By varying the event-encoder type (RB or TTFS) and Tinf ,
the best result is obtained with preprocessing technique 6) of
Table III and encoding method 6) of Table IV, achieving
93% of recognition accuracy. Table IV shows that TTFS gives
on-par or better results compared to RB for the same Tinf
(even though the advantage is limited). This limited boost
in accuracy can be explained as follows. TTFS-encoding the
input generates fewer spikes than RB-encoding, which makes
the network activity sparser. Network sparsity may lead to
better information disentangling and linear separability of
the input representations in the neural network, as discussed
in [24]. For completeness, we come back to the effect of
preprocessing by removing the max48 coding out of our best
preprocessing and encoding queue (entry 8 of Table IV).
Removing max48 has a significantly larger negative effect
compared to the event-encoding parameters (it degrades the
accuracy by 6% compared to entry 6), which again emphasizes
the importance of preprocessing on the SNN performance.
Using the preprocessing technique 6) of Table III and the
encoding 6) of Table IV, our SNN achieves an accuracy
on-par with the DNN while being significantly cheaper
to implement in hardware. First, the SNN requires only
add operations, as the outputs of the IF neurons are binary, whereas the DNN requires multiply and add operations, demanding more die area [40]. Second, the max-pooling
layer in the SNN can be implemented as a pool of simple
OR gates, whereas the max-pooling layer in the DNN must
be implemented as a search operation, as the values at
the input of the pooling layer are not binary in this latter
case. Finally, an important limitation of current computing
architectures is the energy cost of memory transfer due
to the memory bottleneck problem [41]. As our SNN
architecture is event-based, it can be deployed on the growing
number of massively parallel and energy-efficient SNN
processor architectures solving the memory bottleneck
problem and designed specifically for extreme-edge
applications [1], [26], [27].
C. Energy Consumption and Latency Estimation
As discussed in Section II, most SNN-based gesture recog-
nition systems use DVS cameras with a typical power con-
sumption of around 100 mW (such as, for instance, the one
used in [13]). In our system, we use an 8-GHz SISO radar
consuming only 680 μW at the expense of the number
of gesture classes that can be distinguished because of the
lower fidelity nature of our radar signals compared to DVS
cameras and higher frequency radars, such as Google’s Soli
(300 mW) [10].
The different preprocessing approaches presented above
introduce different latency and energy consumption overheads.
In both the μDoppler and RD cases, two Fourier-based trans-
forms are executed, which can efficiently be implemented with
ubiquitous FFT accelerators (integrated into most commercial
radar chips). Using a 256-point FFT only requires 2 × 1668 instruction cycles (56 μs at 60 MHz) and 2 × 407.2 nJ of energy in typical microcontrollers [44]. Similarly, the other
preprocessing steps are cheap to implement as they do not
iterate over the input data. Rather, the bottleneck to the pre-
processing latency depends on the number of radar chirps that
must be collected during each acquisition. For μDoppler-based
preprocessing, a radar acquisition time of 476 ms can be
derived using (11) and the radar parameters specified in
Section III-B. For RD preprocessing, an acquisition time of
238 ms is needed, while, for ΔRD, an acquisition time of
476 ms, similar to the μDoppler case, is needed as two
successive RD maps are used. The SNN energy, latency, and
area overheads depend on the accelerator chip onto which
the architecture is mapped but also on the total number of
neurons, the number of time steps needed for inference, and
the number of events per time step. With the input size of our
best preprocessing technique (100 × 48), the convolutional layer contains 94 × 42 × 6 = 23 688 neurons, and the fully connected layers contain 120 + 5 = 125 neurons, for a total of 23 813 neurons. The max-pooling layer must also
be mapped, and as discussed in Section V-A, our max-pooling
layer behaves as a pool of OR gates with four inputs, providing
downsampling of 2 on the convolution result. Such a layer can be mapped as a layer of IF neurons with synaptic weights and threshold tuned such that, if any of the four input synapses of the neuron receives a spike, then the neuron will spike (e.g., by setting the four synaptic weights and the threshold to unity). This leads to 47 × 21 × 6 = 5922 neurons. Therefore, a total of 29 735 neurons is used.
Even though emerging mixed-signal SNN chips promise
to be even more energy-efficient than digital ones [1], using
the popular Intel Loihi chip [26] as our reference design for
energy consumption and inference latency (as done by most
SNN works in the literature), and using the Nengo-Loihi tool
in Python, we have estimated an energy cost per inference
of only 7.85 μJ (ignoring the cost of transferring data on
and off the device) and a maximum inference time of around
4 ms (which could be reduced when deploying the model
into Loihi through further optimization). Therefore, the total
latency is around 480 ms, which is similar to most radar-based
gesture recognition systems proposed in literature [9]–[12],
[18]. This value is feasible for a real-time system given the
typical time scale of a hand gesture (some state-of-the-art
systems, such as [11], exhibit a total latency of more than
1 s from radar data acquisition to inference while still being useful in practice). This leads to an SNN power consumption of 7.85 μJ / 480 ms = 16.35 μW. Compared to the DVS-based gesture recognition systems proposed in [14] (which reports power estimates on the IBM TrueNorth neuromorphic chip), our system consumes around three orders of magnitude less power (16.35 μW for our system versus 88.5 mW for theirs)
due to the significantly larger network used in their work with
16 convolutional layers versus only one convolutional layer in
our case. As explained earlier, this is at the expense of the
number of classes that can be recognized (five for our system
versus 11 for theirs), with a comparable accuracy (93% for our
system versus 91.77% for theirs). Compared to state-of-the-art
radar-based gesture recognition systems [9], [11], our solution
trades off the number of gesture classes that can be recognized
for higher energy and die area efficiency at the sensor and
neural network level (e.g., 11 classes for [9] and 12 classes
for [11] at the expense of 60-GHz, multiantenna radar sensors
consuming more than 300 mW versus our 8-GHz, SISO radar
consuming merely 680 μW) while still achieving a high
recognition accuracy and fast inference time without making
use of a desktop GPU as in [9] or of an Nvidia Jetson platform
as in [11] (with an inference time of 25.84 ms against 4 ms in
our case), both being too bulky and power-hungry for small-
sized, ultralow-power IoT applications.
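For reference, the latency and power figures quoted above follow from straightforward accounting; the sketch below simply reproduces that arithmetic with the values stated in the text:

```python
# Latency and power accounting for the quoted figures (illustrative).
acquisition_s = 0.476             # radar acquisition time (two RD maps)
inference_s = 0.004               # worst-case Loihi inference time
energy_per_inference_j = 7.85e-6  # estimated with the Nengo-Loihi tools

total_latency_s = acquisition_s + inference_s        # ~0.48 s
avg_snn_power_w = energy_per_inference_j / total_latency_s

print(f"total latency: {total_latency_s * 1e3:.0f} ms")      # ~480 ms
print(f"average SNN power: {avg_snn_power_w * 1e6:.2f} uW")  # ~16.35 uW
```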
D. Applicability of the Proposed SNN on a Different Dataset
Finally, it is important to verify the applicability of our pro-
posed 4-b-weight SNN model on a different dataset than the
8-GHz one used in Section VI-A. To the best of our knowledge, most of the state-of-the-art radar gesture recognition datasets, such as [9], do not provide raw ADC data but rather radar data that have already been preprocessed. Therefore, we train
our SNN model (see Fig. 4) on the dataset proposed in [9],
featuring 11 gesture classes. The dataset in [9] contains a total
of 5500 sequences of RD maps (see Section IV-B), acquired
using a 60-GHz FMCW radar.
Fig. 6. Confusion matrix for the 60-GHz radar gesture recognition dataset proposed in [9]. The recognition accuracy of the 4-b-weight SNN is 94.6% ± 2%.

We denote by $RD_i[t,k,m]$ the $i$th sequence of RD maps, where $t$ is the frame index, $k$ is the range bin, and $m$ is the Doppler bin. The number of frames $M_i$ in each sequence $i$ can vary but is always larger than $M_{\min} = 28$ throughout the dataset. Before feeding the data to our SNN, accumulation and subsampling are performed along $t$ on each $RD_i[t,k,m]$ as follows:
$$\overline{RD}_i[l,k,m] = \sum_{t=l}^{\,l + M_i/M_{\min}} RD_i[t,k,m] \tag{21}$$
where $l$ is the frame index after subsampling. Finally, spiking input sequences are obtained by thresholding $\overline{RD}_i[l,k,m]$ against an empirically chosen threshold of $10^{2}$. Values larger than the threshold are assigned to 1, while smaller values are assigned to 0.
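For illustration, this accumulation, subsampling, and thresholding step can be sketched in NumPy as follows (our literal reading of (21); the array layout and the example sequence length are assumptions):

```python
import numpy as np

M_MIN = 28       # minimum sequence length M_min (from the text)
THRESHOLD = 1e2  # empirical binarization threshold (from the text)

def rd_sequence_to_spikes(rd_seq, m_min=M_MIN, threshold=THRESHOLD):
    """Accumulate and subsample a sequence of RD maps along the frame
    axis following our reading of (21), then threshold the result into
    binary spiking inputs. rd_seq has shape (M_i, n_range, n_doppler)."""
    m_i = rd_seq.shape[0]
    win = m_i // m_min  # accumulation window M_i // M_min
    # One output frame per subsampled index l, summing t = l ... l + win.
    acc = np.stack([rd_seq[l:l + win + 1].sum(axis=0) for l in range(m_min)])
    return (acc > threshold).astype(np.uint8)

# Example: a 35-frame sequence of 32 x 32 RD maps -> 28 binary frames.
spikes = rd_sequence_to_spikes(np.abs(np.random.randn(35, 32, 32)) * 50)
print(spikes.shape)  # (28, 32, 32)
```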
Training is conducted using the same learning parameters as in Section V-B, by iterating through 16 cycles of 20 epochs, with the first $N_{full} = 19$ epochs of each cycle trained with full-precision weights and the last epoch of each cycle trained with 4-b weights in the forward pass ($N_q = 1$). Similar to Section VI-A, the model performance is assessed using sixfold cross-validation. Fig. 6 shows the obtained confusion matrix, with a gesture recognition accuracy of 94.6% ± 2%.
This demonstrates the applicability of our proposed 4-b-weight
SNN model to a second radar gesture dataset.
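The cyclic mixed-precision schedule described above can be sketched in PyTorch as follows (a simplified illustration under our assumptions: `model`, `loader`, and `loss_fn` stand for the user's network, data, and loss, and `quantize_4b` is a plain uniform symmetric quantizer, not necessarily the exact quantization used in this work):

```python
import torch

N_CYCLES, N_EPOCHS, N_FULL = 16, 20, 19  # schedule from the text

def quantize_4b(w):
    """Uniform symmetric 4-b weight quantization (illustrative)."""
    scale = w.abs().max() / 7 + 1e-12  # signed 4-b range: -8 ... 7
    return torch.clamp(torch.round(w / scale), -8, 7) * scale

def train_cyclic(model, loader, loss_fn, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for cycle in range(N_CYCLES):
        for epoch in range(N_EPOCHS):
            quantized = epoch >= N_FULL  # last N_q = 1 epoch of each cycle
            for x, y in loader:
                if quantized:
                    # Forward pass with 4-b weights; keep full-precision
                    # copies so the gradient step updates those instead.
                    full = [p.data.clone() for p in model.parameters()]
                    for p in model.parameters():
                        p.data = quantize_4b(p.data)
                loss = loss_fn(model(x), y)
                loss.backward()  # gradients w.r.t. the forward pass used
                if quantized:
                    for p, w in zip(model.parameters(), full):
                        p.data = w  # restore before the optimizer step
                opt.step()
                opt.zero_grad()
```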
VII. CONCLUSION
This article has shown the impact of various preprocessing
and event-encoding techniques on the accuracy of a 4-b-weight
SNN within the context of radar gesture recognition for
extreme-edge applications with limited power and die area.
In particular, it has been shown that, while using the same
network topology and learning approach, preprocessing drastically affects the SNN accuracy, while its effect on the DNN accuracy is significantly more limited. In addition, it has been
demonstrated that conclusions drawn from DNN design cannot
directly be generalized to SNN design as the preprocessing
parameters can have antagonistic effects on the SNN and DNN
accuracy, even if the same topology and learning approaches
are used. Also, the impact of event encoding on the network accuracy has been explored, although its influence is smaller than that of preprocessing. Then, after an extensive comparison between different approaches, a well-suited radar preprocessing and event-encoding queue has been developed for use with a small, 4-b-weight SNN, achieving 93% recognition accuracy within only four time steps. It must
be noted that the preprocessing and event-encoding strategy
developed in this article could also be extended to audio sig-
nals as future work. Finally, the applicability of the proposed
4-b-weight SNN approach has been demonstrated on a second
radar dataset, reaching 94.6% accuracy. Unlike previous
works, this low-complexity architecture enables high-accuracy
and low-latency inference while consuming very little energy
and area when implemented in event-based hardware, making
it extremely useful for embedded edge-AI applications.
REFERENCES
[1] S. Moradi, N. Qiao, F. Stefanini, and G. Indiveri, “A scalable multicore architecture with heterogeneous memory structures for dynamic neuromorphic asynchronous processors (DYNAPs),” IEEE Trans. Biomed. Circuits Syst., vol. 12, no. 1, pp. 106–122, Feb. 2018, doi: 10.1109/TBCAS.2017.2759700.
[2] S. B. Shrestha and G. Orchard, “SLAYER: Spike layer error reassign-
ment in time,” in Proc. Neural Inf. Process. Syst., Montreal, QC, Canada,
Dec. 2018, pp. 1–10.
[3] F. Zenke and S. Ganguli, “SuperSpike: Supervised learning in mul-
tilayer spiking neural networks,” Neural Comput., vol. 30, no. 6,
pp. 1514–1541, Jun. 2018.
[4] E. O. Neftci, H. Mostafa, and F. Zenke, “Surrogate gradient learning
in spiking neural networks: Bringing the power of gradient-based
optimization to spiking neural networks,” IEEE Signal Process. Mag.,
vol. 36, no. 6, pp. 51–63, Nov. 2019, doi: 10.1109/MSP.2019.2931595.
[5] E. Hunsberger and C. Eliasmith, “Spiking deep networks with LIF
neurons,” CoRR, vol. abs/1510.08829, Oct. 2015.
[6] Q. Bolsee and A. Munteanu, “CNN-based denoising of time-of-flight depth images,” in Proc. 25th IEEE Int. Conf. Image Process. (ICIP), Athens, Greece, Oct. 2018, pp. 510–514, doi: 10.1109/ICIP.2018.8451610.
[7] M. Nazaré, “Deep convolutional neural networks and noisy images,” in
Progress in Pattern Recognition, Image Analysis, Computer Vision, and
Applications. Cham, Switzerland: Springer, 2018, pp. 416–424.
[8] Y.-H. Liu et al., “A 680 μW burst-chirp UWB radar transceiver for
vital signs and occupancy sensing up to 15 m distance,” in IEEE Int.
Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco,
CA, USA, Feb. 2019, pp. 166–168, doi: 10.1109/ISSCC.2019.8662536.
[9] S. Wang, J. Song, J. Lien, I. Poupyrev, and O. Hilliges, “Interacting with
soli: Exploring fine-grained dynamic gesture recognition in the radio-
frequency spectrum,” in Proc. 29th Annu. Symp. User Interface Softw.
Technol., Oct. 2016, pp. 851–860.
[10] J. Lien et al., “Soli: Ubiquitous gesture sensing with millimeter wave radar,” ACM Trans. Graph., vol. 35, no. 4, pp. 1–4, 2016.
[11] Y. Sun, T. Fei, X. Li, A. Warnecke, E. Warsitz, and N. Pohl, “Real-time
radar-based gesture detection and recognition built in an edge-computing
platform,” IEEE Sensors J., vol. 20, no. 18, pp. 10706–10716, Sep. 2020,
doi: 10.1109/JSEN.2020.2994292.
[12] X. Zhang, Q. Wu, and D. Zhao, “Dynamic hand gesture recognition
using FMCW radar sensor for driving assistance,” in Proc. 10th Int.
Conf. Wireless Commun. Signal Process. (WCSP), Hangzhou, China,
Oct. 2018, pp. 1–6, doi: 10.1109/WCSP.2018.8555642.
[13] R. Massa, A. Marchisio, M. Martina, and M. Shafique, “An efficient
spiking neural network for recognizing gestures with a DVS camera on
the Loihi neuromorphic processor,” 2021, arXiv:2006.09985. [Online].
Available: https://arxiv.org/abs/2006.09985
[14] A. Amir et al., “A low power, fully event-based gesture recognition
system,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR),
Honolulu, HI, USA, Jul. 2017, pp. 7243–7252.
[15] J.-M. Maro, S.-H. Ieng, and R. Benosman, “Event-based gesture recog-
nition with dynamic background suppression using smartphone compu-
tational capabilities,” Frontiers Neurosci., vol. 14, p. 275, Apr. 2020.
[16] J. Kaiser, H. Mostafa, and E. Neftci, “Synaptic plasticity dynamics
for deep continuous local learning (DECOLLE),” Frontiers Neurosci.,
vol. 14, p. 424, May 2020.
[17] R. Ghosh, A. Gupta, A. Nakagawa, A. Soares, and N. Thakor, “Spatiotemporal filtering for event-based action recognition,” 2019, arXiv:1903.07067. [Online]. Available: http://arxiv.org/abs/1903.07067
[18] J. S. Suh, S. Ryu, B. Han, J. Choi, J.-H. Kim, and S. Hong, “24 GHz
FMCW radar system for real-time hand gesture recognition using
LSTM,” in Proc. Asia–Pacific Microw. Conf. (APMC), Kyoto, Japan,
Nov. 2018, pp. 860–862, doi: 10.23919/APMC.2018.8617375.
[19] D. Banerjee et al., “Application of spiking neural networks for
action recognition from radar data,” in Proc. Int. Joint Conf.
Neural Netw. (IJCNN), Glasgow, U.K., Jul. 2020, pp. 1–10, doi:
10.1109/IJCNN48605.2020.9206853.
[20] V. C. Chen, Radar Micro-Doppler Signatures: Processing and Applications. London, U.K.: Inst. Eng. Technol., 2014.
[21] M. A. Richards, Fundamentals of Radar Signal Processing. New York, NY, USA: McGraw-Hill, 2005.
[22] H. Xu, Z. Wang, H. Yang, D. Liu, and J. Liu, “Learning simple thresholded features with sparse support recovery,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 4, pp. 970–982, Apr. 2020, doi: 10.1109/TCSVT.2019.2901713.
[23] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
in Proc. Int. Conf. Learn. Represent., 2014, pp. 1–15.
[24] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” J. Mach. Learn. Res., 2010.
[25] X. Pan and V. Srikumar, “Expressiveness of rectifier networks,” CoRR, vol. abs/1511.05678, Nov. 2015.
[26] M. Davies et al., “Loihi: A neuromorphic manycore processor with on-
chip learning,” IEEE Micro, vol. 38, no. 1, pp. 82–99, Jan./Feb. 2018,
doi: 10.1109/MM.2018.112130359.
[27] C. Frenkel, M. Lefebvre, J. Legat, and D. Bol, “A 0.086-mm² 12.7-pJ/SOP 64k-synapse 256-neuron online-learning digital spiking neuromorphic processor in 28-nm CMOS,” IEEE Trans. Biomed. Circuits Syst., vol. 13, no. 1, pp. 145–158, Feb. 2019, doi: 10.1109/TBCAS.2018.2880425.
[28] C. Frenkel, J.-D. Legat, and D. Bol, “A 28-nm convolutional neu-
romorphic processor enabling online learning with spike-based reti-
nas,” 2020, arXiv:2005.06318. [Online]. Available: http://arxiv.org/abs/
2005.06318
[29] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” J. Mach. Learn. Res.-Track, vol. 9, pp. 249–256, 2010.
[30] P. J. Werbos, “Backpropagation through time: What it does and how
to do it,” Proc. IEEE, vol. 78, no. 10, pp. 1550–1560, Oct. 1990, doi:
10.1109/5.58337.
[31] B. Vandersmissen et al., “Indoor person identification using a low-
power FMCW radar,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 7,
pp. 3941–3952, Jul. 2018, doi: 10.1109/TGRS.2018.2816812.
[32] S. Wu et al., “Person-specific heart rate estimation with ultra-wideband
radar using convolutional neural networks,” IEEE Access,vol.7,
pp. 168484–168494, 2019, doi: 10.1109/ACCESS.2019.2954294.
[33] W. Kim, H. Cho, J. Kim, B. Kim, and S. Lee, “YOLO-based
simultaneous target detection and classification in automotive FMCW
radar systems,” Sensors, vol. 20, no. 10, p. 2897, May 2020, doi:
10.3390/s20102897.
[34] V. M. Lubecke, O. Boric-Lubecke, A. Host-Madsen, and A. E. Fathy,
“Through-the-wall radar life detection and monitoring,” in IEEE MTT-S
Int. Microw. Symp. Dig., Honolulu, HI, USA, Jun. 2007, pp. 769–772,
doi: 10.1109/MWSYM.2007.380053.
[35] F. J. Harris, “On the use of windows for harmonic analysis with the
discrete Fourier transform,” Proc. IEEE, vol. 66, no. 1, pp. 51–83,
Jan. 1978, doi: 10.1109/PROC.1978.10837.
[36] D. Feng et al., “Deep multi-modal object detection and semantic seg-
mentation for autonomous driving: Datasets, methods, and challenges,”
CoRR, vol. abs/1902.07830, Feb. 2019.
[37] E. Doutsi, L. Fillatre, M. Antonini, and P. Tsakalides, “Dynamic image
quantization using leaky integrate-and-fire neurons,” IEEE Trans. Image
Process., vol. 30, pp. 4305–4315, 2021.
[38] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, “14.5
Envision: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-
accuracy-frequency-scalable convolutional neural network processor in
28 nm FDSOI,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig.
Tech. Papers, San Francisco, CA, USA, Feb. 2017, pp. 246–247, doi:
10.1109/ISSCC.2017.7870353.
[39] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-
efficient reconfigurable accelerator for deep convolutional neural net-
works,” IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138,
Jan. 2017, doi: 10.1109/JSSC.2016.2616357.
[40] B. Rueckauer, I.-A. Lungu, Y. Hu, M. Pfeiffer, and S.-C. Liu, “Con-
version of continuous-valued deep networks to efficient event-driven
networks for image classification,” Frontiers Neurosci., vol. 11, p. 682,
Dec. 2017.
[41] M. Horowitz, “1.1 computing’s energy problem (and what we can
do about it),” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig.
Tech. Papers, San Francisco, CA, USA, Feb. 2014, pp. 10–14, doi:
10.1109/ISSCC.2014.6757323.
[42] A. Paszke et al., “Automatic differentiation in PyTorch,” Tech. Rep., 2017.
[43] R. Brette, “Philosophy of the spike: Rate-based vs. spike-based theories
of the brain,” Frontiers Syst. Neurosci., vol. 9, p. 151, Nov. 2015.
[44] M. McKeown, “FFT implementation on the TMS320VC5505,
TMS320C5505, and TMS320C5515 DSPs,” Texas Instrum., Dallas, TX,
USA, Appl. Rep., 2013.
Ali Safa (Graduate Student Member, IEEE) received
the M.Sc. degree (summa cum laude) in electrical
engineering from the Université Libre de Bruxelles,
Brussels, Belgium. He is currently pursuing the
Ph.D. degree in AI-driven processing and sensor
fusion for extreme edge applications with imec,
Leuven, Belgium, and Katholieke Universiteit Leu-
ven (KU Leuven), Leuven.
He joined imec and KU Leuven in 2020.
Federico Corradi (Member, IEEE) received the
B.Sc. degree in physics from Università degli studi
di Parma, Parma, Italy, the M.Sc. degree (cum laude)
in physics from La Sapienza University, Rome, Italy,
the Ph.D. degree in natural sciences from the Uni-
versity of Zurich, Zürich, Switzerland, and the Ph.D.
degree in neuroscience from the Ph.D. International
Program, Neuroscience Center Zurich, Zürich.
He is currently a Senior Research and Development Scientist at imec, Eindhoven, The Netherlands.
His research activities are at the interface between
neuroscience and neuromorphic engineering. His research focuses on neuro-
morphic computing technologies for IoT applications. His contributions
in the field include the development of neuromorphic circuits and systems,
and their application in biomedical signal processing.
Lars Keuninckx received the M.Eng. degree in
telecommunications from Hogeschool Gent, Ghent,
Belgium, in 1996, and the bachelor’s degree
in physics and the Ph.D. degree in engineering
from Vrije Universiteit Brussel, Brussels, Belgium,
in 2009 and 2016, respectively.
After his M.Sc. degree, he worked in industry
for several years, designing electronics for automo-
tive, industrial, and medical applications. He joined
imec, Leuven, Belgium, in 2019, where he is
involved in the design of neuromorphic systems. His
research interests include the applications of complex dynamics and reservoir
computing.
Ilja Ocket (Member, IEEE) received the M.Sc.
and Ph.D. degrees in electrical engineering from
Katholieke Universiteit (KU) Leuven, Leuven,
Belgium, in 1998 and 2009, respectively.
He currently serves as the Program Manager for
neuromorphic sensor fusion at the IoT Department,
imec, Leuven. His research interests include all
aspects of heterogeneous integration of highly minia-
turized millimeter-wave systems, spanning design,
technology, and metrology. He is also involved in
research on using broadband impedance sensing and
dielectrophoretic actuation for lab-on-chip applications.
André Bourdoux (Senior Member, IEEE) received
the M.Sc. degree in electrical engineering from
the Université Catholique de Louvain-la-Neuve,
Ottignies-Louvain-la-Neuve, Belgium, in 1982.
In 1998, he joined imec, Leuven, Belgium, where
he is currently a Principal Member of Technical Staff
with the Internet-of-Things Research Group. He is
a system-level and signal processing expert for the
mm-wave wireless communications and radar teams.
He has more than 15 years of research experience in
radar systems and 15 years of research experience
in broadband wireless communications. He holds several patents in these
fields. He has authored or coauthored more than 160 publications in books
and peer-reviewed journals and conferences. His research interests include
advanced signal processing, and machine learning for wireless physical layer
and high-resolution 3-D/4-D radars.
Francky Catthoor (Fellow, IEEE) received
the Ph.D. degree in electrical engineering from
Katholieke Universiteit (KU) Leuven, Leuven,
Belgium, in 1987.
From 1987 to 2000, he was the head of several
research domains in the area of synthesis techniques
and architectural methodologies. Since 2000, he has
been strongly involved in other activities at imec,
Leuven, including coexploration of application,
computer architecture and deep submicrometer
technology aspects, biomedical systems and IoT sensor nodes, and photovoltaic modules combined with renewable
energy systems. He is also a Senior Fellow of imec. He is also a part-time
Full Professor with the Department of Electrical Engineering (ESAT),
KU Leuven.
Dr. Catthoor has been an associate editor of several IEEE and ACM
journals.
Georges G. E. Gielen (Fellow, IEEE) received the
M.Sc. and Ph.D. degrees in electrical engineering
from Katholieke Universiteit Leuven (KU Leuven),
Leuven, Belgium, in 1986 and 1990, respectively.
He is currently a Full Professor with the MICAS
Research Division, Department of Electrical Engi-
neering (ESAT), KU Leuven. Since 2020, he has
been the Chair of the Department of Electrical Engi-
neering. His research interests include the design
of analog and mixed-signal integrated circuits, and
especially in analog and mixed-signal CAD tools
and design automation. He is a frequently invited speaker/lecturer and coor-
dinator/partner of several (industrial) research projects in this area, including
several European projects. He has (co)authored ten books and more than
600 articles in edited books, international journals, and conference proceed-
ings. He is a 1997 Laureate of the Belgian Royal Academy of Sciences,
Literature, and Arts in the discipline of engineering.
Dr. Gielen is an Elected Member of the Academia Europaea. He received the
IEEE CAS Mac Van Valkenburg Award in 2015 and the IEEE CAS Charles
Desoer Award in 2020.