Published in: IEEE Transactions on Biomedical Circuits and Systems, 14(6):1138-1159, 2020.
DOI: https://doi.org/10.1109/tbcas.2020.3036081
Preprint: arXiv:2007.05657v1 [cs.AR], 11 Jul 2020.
Accepted version archived at the Zurich Open Repository and Archive (ZORA), University of Zurich: https://doi.org/10.5167/uzh-200402
SUBMITTED TO IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS 1
Hardware Implementation of Deep Network
Accelerators Towards Healthcare and Biomedical
Applications
Mostafa Rahimi Azghadi, Senior Member, IEEE, Corey Lammie, Student Member, IEEE,
Jason K. Eshraghian, Member, IEEE, Melika Payvand, Member, IEEE, Elisa Donati, Member, IEEE,
Bernabé Linares-Barranco, Fellow, IEEE and Giacomo Indiveri, Senior Member, IEEE
Abstract—With the advent of dedicated Deep Learning (DL)
accelerators and neuromorphic processors, new opportunities are
emerging for applying deep and Spiking Neural Network (SNN)
algorithms to healthcare and biomedical applications at the edge.
This can facilitate the advancement of the medical Internet of
Things (IoT) systems and Point of Care (PoC) devices. In this
paper, we provide a tutorial describing how various technologies
ranging from emerging memristive devices, to established Field
Programmable Gate Arrays (FPGAs), and mature Complemen-
tary Metal Oxide Semiconductor (CMOS) technology can be used
to develop efficient DL accelerators to solve a wide variety of
diagnostic, pattern recognition, and signal processing problems in
healthcare. Furthermore, we explore how spiking neuromorphic
processors can complement their DL counterparts for processing
biomedical signals. After providing the required background,
we unify the sparsely distributed research on neural network
and neuromorphic hardware implementations as applied to the
healthcare domain. In addition, we benchmark various hardware
platforms by performing a biomedical electromyography (EMG)
signal processing task and drawing comparisons among them
in terms of inference delay and energy. Finally, we provide our
analysis of the field and share a perspective on the advantages,
disadvantages, challenges, and opportunities that different ac-
celerators and neuromorphic processors introduce to healthcare
and biomedical domains. This paper can serve a large audience,
ranging from nanoelectronics researchers, to biomedical and
healthcare practitioners in grasping the fundamental interplay
between hardware, algorithms, and clinical adoption of these
tools, as we shed light on the future of deep networks and
spiking neuromorphic processing systems as proponents for
driving biomedical circuits and systems forward.
Index Terms—Spiking Neural Networks, Deep Neural Net-
works, Neuromorphic Hardware, CMOS, Memristor, FPGA,
RRAM, Healthcare, Medical IoT, Point-of-Care
I. INTRODUCTION
Health and well-being are, undoubtedly, among the most fundamental concerns of human beings. This is evidenced by the sheer size and rapid growth of the global healthcare industry, which is projected to reach over 10
M. Rahimi Azghadi and Corey Lammie are with the College of Science
and Engineering, James Cook University, QLD 4811, Australia. e-mail:
mostafa.rahimiazghadi@jcu.edu.au
J. K. Eshraghian is with the Department of Electrical Engineering and
Computer Science, The University of Michigan, Ann Arbor, MI 48109-2122,
USA.
M. Payvand, E. Donati, and G. Indiveri are with the Institute of Neuroin-
formatics, University and ETH Zurich, Switzerland.
B. Linares-Barranco is with the Instituto de Microelectrónica de Sevilla IMSE-CNM (CSIC and Universidad de Sevilla), Sevilla, Spain.
trillion dollars by 2022 [1]. One of the most promising
technologies to advance this fast-growing industry is Artificial
Intelligence (AI) [2] and its implementation with Deep Learn-
ing (DL). DL has shown success in various domains and
as its reliability improves, it has pervaded various facets
of healthcare from monitoring [3], [4], to prediction [5],
diagnosis [6], treatment [7], and prognosis [8], as visualized
in Fig. 1(a). The figure shows that data collected from the patient (illustrated here as a biomedical signal, though it may be any one or a combination of other data types, such as bio-samples, medical images, temperature, or movement) can be processed by a smart DL system that monitors the patient for anomalies and/or predicts diseases. The prediction can
inform diagnosis, which itself can benefit from DL algorithms.
In addition, DL systems can be used to recommend treatment
options and prognosis, which further affect monitoring and
prediction in a closed-loop scenario. In every step of this loop,
there is a need for a DL training and inference procedure,
which requires significant computational resources.
The capacity of AI to meet or exceed the performance of
human experts in medical-data analysis [9], [10], [11] can,
in part, be attributed to the continued improvement of high-
performance computing platforms such as Graphics Processing
Units (GPUs) [12] and customized Machine Learning (ML)
hardware [13]. These can now process and learn from a large
amount of multi-modal heterogeneous general and medical
data [14]. This was not readily achievable a decade ago.
Although the DL field has been growing at an astonishing
rate in terms of software, algorithms, and architecture develop-
ments, its hardware accelerator development currently largely
relies on advances by a handful of giant technology companies,
most notably Nvidia and its GPUs [15], [16] and Google and
its Tensor Processing Units (TPUs) [13], in addition to new
startups and research groups developing Application Specific
Integrated Circuits (ASICs) for DL training and acceleration.
Similarly, while there are significant advances in tailoring deep
network models and algorithms for various healthcare and
biomedical applications [17], most medical deep networks are
currently trained and run on GPUs or in data centers [12], [18].
This mostly requires the use of cloud-based DL processors
which rely on costly and power-demanding data centers, as
opposed to the effective deployment of DL at the edge on an
increasing number of healthcare and medical IoT systems [19]
and PoC devices [20], as illustrated in Fig. 1(b). These devices
Fig. 1. A depiction of (a) the usage of DL in a smart healthcare setting, which typically involves monitoring, prediction, diagnosis, treatment, and prognosis.
The various parts of the DL-based healthcare system can run on (b) the three levels of the IoT, i.e. edge devices, edge nodes, and the cloud. However, for
healthcare IoT and PoC processing, edge learning and inference are preferred.
and systems should be as low-cost, compact, low-power, and rapid (high-throughput) as possible, to facilitate applications at the edge and make smart health monitoring
technology more viable and affordable for integration into
human life [21]. Furthermore, edge learning and/or inference
can enable systems which are mostly independent of the cloud.
This feature is critical for highly sensitive medical data and
offline operation, which are much desired in healthcare and
biomedical settings.
To facilitate at-edge processing, specialized embedded DL
accelerators such as Nvidia Jetson and Xavier series [22], as
well as Movidius Neural Compute Stick [23], [24] have been
produced. These devices and systems have been shown to be
quite suitable for healthcare edge or near-edge inference. More
recently, examples of specialized embedded hardware systems
for medical tasks, such as the Nvidia Clara Embedded, have
been proposed. This is a computing platform for edge-enabled
AI on the Internet of Medical Things (IoMT). However, as
these embedded devices are still relatively power-hungry and costly, they are not ideal learning/inference engines for ambient-assisted healthcare IoT applications and PoC systems.
Thus, there is a need for innovative systems that can satisfy the stringent requirements of healthcare edge devices, making them widely available, beneficial, and affordable to the community at large.
To that end, in this paper we focus on the use of three different hardware technologies to develop dedicated deep network accelerators, which will be discussed from a biomedical and healthcare application point of view, even though they could also be used for general-purpose smart edge IoT devices. The three
technologies that we cover here are CMOS, memristors, and
Field Programmable Gate Arrays (FPGAs). It is worth noting
that, while our focus is mainly around efficient inference
engines at the biomedical application edge, the techniques and
hardware advantages discussed here may be also useful for
more efficient offline deep network learning, or online on-
chip learning. Herein, the term DL ‘accelerator’ refers to a device that is able to perform DL inference and, potentially, training.
To provide a self-contained tutorial on the implementation
of DL accelerators, we first deliver a brief introduction to
the fundamentals of artificial and spiking neural networks and
their various architectures. Next, we shed light on why deep
networks are power- and resource-hungry and need specific
hardware platforms to enable them for edge processing. After
that, we discuss recent hardware advances which have led to
improvements in training and inference efficiency. These im-
provements ultimately guide us to more viable edge inference
engine options.
When discussing the three target hardware technologies, we
show that the field of hardware implementation for customized
healthcare and biomedical DL accelerators is very sparse.
After reviewing the literature on these DL accelerators, we
provide a guided analysis to quantify the performance of
various algorithms on different types of DL processors. The
results allow us to draw a perspective on the potential future
of spike-based neuromorphic processors in the biomedical
signal processing domain. Based on our analysis and perspec-
tive, we conjecture that for at-edge processing, neuromorphic
computing technologies and their underlying Spiking Neural
Networks (SNNs) could complement DL inference engines,
either through signaling anomalies in the data or acting as
‘intelligent always-on watchdogs’ which continuously monitor
the data being recorded, but only activate further processing
stages if and when necessary.
Although there are previous reviews on general AI-based al-
gorithm design and hardware for biomedical applications [25],
to the best of our knowledge, this is the first work where a
comprehensive tutorial and review is proposed that focuses on
customized DL accelerators for biomedical applications. Our
contributions that differentiate our work from the available
literature can be summarized as follows:
[Figure 2: schematic depictions of a Multilayer Perceptron (MLP/Dense/Fully Connected), a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), and a Long Short-Term Memory (LSTM) network, with a legend distinguishing input, hidden, output, recurrent, and memory cells, and convolution and pooling operations.]
Fig. 2. Popular ANN structures. MLP/Dense/Fully Connected networks are typically well-suited for cross-sectional quantitative data, whereas RNN and LSTM networks are typically well-suited for sequential data. CNNs are equipped for both types.
• There is no previous comprehensive paper that focuses only on DL accelerators and shows how they can be used for medical and healthcare applications.
• Our paper is the first to discuss the use of three different emerging and established hardware technologies for facilitating DL acceleration, with a focus on biomedical applications.
• We provide tutorial sections on how one may implement a typical biomedical task on FPGAs or simulate it for deployment on memristive crossbars.
• Our paper is the first to discuss how event-based neuromorphic processors can complement DL accelerators for biomedical signal processing.
• We provide open-source code and data to enable the reproduction of our results.
We believe these features make our paper a useful contribution
to the wider biomedical circuits and systems society with an
interest in utilizing mature and emerging technologies and
techniques for enabling DL training and inference on the edge
of healthcare systems.
The remainder of the paper is organized as follows. In
Section II, we define the technical terminology that is used
throughout this paper and cover the working principles of
artificial and spiking neural networks. We also introduce a
biomedical signal processing task for hand-gesture classifica-
tion, which is used for benchmarking the different technolo-
gies and algorithms discussed in this paper. In Section III,
we step through the design, simulation, and implementation
of Deep Neural Networks (DNNs) using different hardware
technologies. We show sample cases of how they have been
deployed in healthcare settings. Furthermore, we demonstrate
the steps and techniques required to simulate and implement
hardware for the benchmark hand-gesture classification task
using memristive crossbars and FPGAs.
In Section IV, we provide our perspective on the challenges
and opportunities of both DNNs and SNNs for biomedical ap-
plications and shed light on the future of spiking neuromorphic
hardware technologies in the biomedical domain. Section V
presents concluding remarks and discussions.
II. DEEP ARTIFICIAL AND SPIKING NEURAL NETWORKS
A. Nomenclature of Neural Network Architectures
Although most DNNs reported in the literature are ANNs, the term DNN usually refers to any network with more than one hidden layer, independently of whether the architecture is fully connected, convolutional, recurrent, an ANN or an SNN, or of any other structure. For example, the most widely used DNN type, i.e. the CNN, can be physically implemented as an ANN or an SNN, and in both cases it would be ‘deep’. However, in this paper, whenever we use the term ‘deep’, DL, or deep network, we refer to Deep Artificial Neural Networks. For Deep Spiking Neural Networks, we simply use the term SNN.
B. Deep Artificial Neural Networks
Traditional ANNs and their learning strategies that were
first developed several decades ago [26] have, in the past
several years, demonstrated unprecedented performance in a
plethora of challenging tasks which are typically associated
with human cognition. These have been applied to medical
image diagnosis [27] and medical text processing [28], using
DNNs.
Fig. 2 illustrates a simplified overview of the structure of
some of the most widely-used DNNs. The most conventional
form of these architectures is the Multi-Layer Perceptron
(MLP). Increasing the number of hidden layers of perceptron
cells is widely regarded to improve hierarchical feature ex-
traction which is exploited in various biomedical tasks, such
as seizure detection from electroencephalography (EEG) [29].
CNNs introduce convolutional layers, which use spatial filters
to learn various parts of the feature space. CNNs also have
pooling layers that are placed after convolutional layers to
down-sample their outputs to reduce the search space size for
subsequent convolutional layers. CNNs have been widely used
in medical and healthcare applications, as they are very well-
suited for spatially structured data. Their use in medical image
analysis [30] will form a major part of our discussions in
this paper. RNNs represent another powerful DL architecture
type that has been recently used both individually [31], and
in combination with CNNs [32] in biomedical applications.
RNNs introduce recurrent cells with a feedback loop, and
are especially useful for processing sequential data such as
temporal signals and time-series data, e.g. electrocardiography
(ECG) [32], and medical text [33]. The feedback loop in
recurrent cells gives them a memory of previous steps and
builds a dynamic awareness of changes in the input. The most
well-known type of RNNs are LSTMs which are designed
to mine patterns in data sequences using their short memory
of distant events stored in their memory cells. LSTMs have
been widely used for processing biomedical signals such as
ECG [31], [34]. Although there are many other varieties of
DNN architectures, we will focus on these most commonly
used types.
1) Automatic hierarchical feature extraction: The above-mentioned DNNs learn intricate data features and representa-
tions through multiple neural computational layers across var-
ious levels of abstraction [35]. The fundamental advantage of
DNNs is that they mine the input data features automatically,
without the need for human knowledge in their supervised
learning loop. This essential feature helps deep networks learn
complex features by combining a hierarchy of simpler features
learned in their hidden layers [35].
2) Learning algorithms: Learning features from data in
a DNN, e.g. the networks shown in Fig. 2, is typically
achieved by minimizing a loss function. In most cases, the
loss is defined as maximum likelihood using the cross-entropy
between training data and the learned model distribution. The
loss function minimization happens through optimizing the
network parameters (weights and biases). This optimization
process minimizes the loss function from the final network
layer backward through all the network layers and is therefore,
called backpropagation. A typical optimization algorithm that
is widely used in DNNs is Stochastic Gradient Descent (SGD)
or its several variants [35].
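As an illustration of the procedure just described, the following minimal sketch (our own toy example, not code from the paper) minimizes a cross-entropy loss with gradient descent for a single-layer softmax classifier; the gradient flows backward from the output exactly as in backpropagation, only with no hidden layers:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))          # 8 samples, 4 features
y = rng.integers(0, 3, size=8)       # 3 class labels
W = np.zeros((4, 3))                 # weights
b = np.zeros(3)                      # biases

def cross_entropy(W, b):
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(len(y)), y]).mean(), p

lr = 0.1
for _ in range(200):                 # gradient descent loop (full batch, for brevity)
    loss, p = cross_entropy(W, b)
    p[np.arange(len(y)), y] -= 1.0   # dLoss/dlogits for cross-entropy + softmax
    p /= len(y)
    W -= lr * (X.T @ p)              # gradient propagated back to the weights
    b -= lr * p.sum(axis=0)

final_loss, _ = cross_entropy(W, b)  # lower than the initial loss ln(3)
```

Mini-batch SGD replaces the full-batch gradient above with the gradient of a random subset of samples at each step; the update rule is otherwise identical.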
3) Backpropagation in DNNs is computationally expensive:
Despite the continual improvement of hardware platforms for
running DNNs, training and running these networks remains
a highly power consuming and computationally formidable
task. The catalyst for the intensive computational requirement,
which results in high power consumption, is the feed-forward
error backpropagation algorithm, which depends on thou-
sands of epochs of computationally intensive Vector Matrix
Multiplication (VMM) operations [26]. These operations, if
performed on a conventional von Neumann architecture which
has separate memory and processing units, will have a time
and power complexity of order O(N²) for multiplying a vector of length N by a matrix of dimensions N×N.
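To make this O(N²) cost concrete, the toy sketch below (illustrative only) performs a vector-matrix multiply while explicitly counting its multiply-accumulate operations:

```python
# On a von Neumann machine, a naive vector-matrix multiply performs
# N*N multiply-accumulate (MAC) operations, i.e. O(N^2) work, and each
# weight W[i][j] must be fetched from memory exactly once.
def vmm(x, W):
    """Return y = x @ W together with the number of MACs performed."""
    n = len(x)
    y = [0.0] * n
    macs = 0
    for j in range(n):
        acc = 0.0
        for i in range(n):
            acc += x[i] * W[i][j]    # one multiply-accumulate
            macs += 1
        y[j] = acc
    return y, macs

x = [1.0, 2.0, 3.0]
W = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # identity matrix, so y == x
y, macs = vmm(x, W)
# macs == 9, i.e. N^2 for N = 3
```

Accelerators attack exactly these two terms: the N² MACs (through parallelism) and the N² weight fetches (through reduced memory access).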
In addition, an artificial neuron in DNNs calculates a sum-
of-products of its input-weight matrix pairs. For instance, a
CNN spatially structures the sum-of-products calculation into
a VMM operation. In digital logic, an adder tree can be
used to accumulate a large number of values. This, however,
becomes problematic in DNNs when one considers the sheer
number of elements that must be summed together, as each
addition requires one cycle. Table I depicts some popular
CNN architectures, accompanied with the total number of
weights, and multiply-and-accumulate (MAC) operations that
must be computed for a single image (656×468 for OpenPose,
224×224 for the rest).
TABLE I
NUMBER OF WEIGHTS AND MAC OPERATIONS IN VARIOUS CNN ARCHITECTURES, FOR A SINGLE IMAGE AND FOR VIDEO PROCESSING AT 25 FRAMES PER SECOND.

Network architecture | Weights | MACs (per image) | MACs @ 25 FPS
AlexNet              | 61 M    | 725 M            | 18 B
ResNet-18            | 11 M    | 1.8 B            | 45 B
ResNet-50            | 23 M    | 3.5 B            | 88 B
VGG-19               | 144 M   | 22 B             | 550 B
OpenPose             | 46 M    | 180 B            | 4500 B
MobileNet            | 4.2 M   | 529 M            | 13 B
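The last column of Table I follows directly from the third: the MACs at 25 FPS are simply the per-image MACs scaled by the frame rate. A quick arithmetic check of the table rows:

```python
# Per-image MAC counts from Table I, in units of 10^6 MACs (M).
macs_per_image_M = {
    "AlexNet":   725,
    "ResNet-18": 1_800,
    "ResNet-50": 3_500,
    "VGG-19":    22_000,
    "OpenPose":  180_000,
    "MobileNet": 529,
}
# Scale by 25 frames per second and convert M -> B (10^9 MACs).
macs_at_25fps_B = {k: v * 25 / 1_000 for k, v in macs_per_image_M.items()}
# e.g. AlexNet: 725 M * 25 = 18.125 B, reported as 18 B in Table I
```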
This table highlights two key facts. Firstly, MACs are the
dominant operation of DNNs. Therefore, hardware implemen-
tations of DNNs should strive to parallelize a large number
of MACs to perform effectively. Secondly, there are many
predetermined weights that must be called from memory. Re-
ducing the energy and time consumed by reading weights from
memory provides another opportunity to improve efficiency.
Consequently, significant research is being conducted to achieve massive parallelism and to reduce memory access in DNN accelerators, using different hardware technologies
and platforms as depicted in Fig. 3. Although these goals
are towards general DL applications, they can significantly
facilitate fast and low-power smart PoC devices [20] and
healthcare IoT systems.
In addition to conventional DL accelerators, there have been
significant research efforts to utilize biologically plausible
SNNs for learning and cognition [36]. Spiking neuromorphic
processors have also been used for biomedical signal process-
ing [37], [38], [39]. Below, we provide a brief introduction to
SNNs, which will be discussed as a method complementary
to DL accelerators for efficient biomedical signal processing
later in this paper. We will also compare SNNs and DNNs on an electromyography (EMG) processing task.
C. Spiking Neural Networks
SNNs are neural networks that typically use Integrate-
and-Fire neurons to dynamically process temporally varying
signals (see Fig. 4(j)). By integrating multiple spikes over time,
it is possible to reconstruct an analog value that represents
the mean firing rate of the neuron. The mean firing rate is
equivalent to the value of the activation function of ANNs. So
in the mean firing rate limit, there is an equivalence between
ANNs and SNNs. By using spikes as all-or-none digital
events (Fig. 4(i)), SNNs enable the reliable transmission of
signals across long distances in electronic systems. In addition,
by introducing the temporal dimension, these networks can
efficiently encode and process sequential data and temporally
changing inputs. SNNs can be efficiently interfaced with event-
based sensors since they only process events as they are
generated. An example of such sensors is the Dynamic Vision
Sensor (DVS), which is an event-based camera shown in
Fig. 4(h). The DVS consists of a logarithmic photo-detector
stage followed by an operational transconductance amplifier
with a capacitive-divider gain stage, and two comparators.
Fig. 3. Typical hardware technologies for DNN acceleration. In this paper we
cover the top two layers of the pyramid, which include specialized hardware
technologies for high-performance training and inference of DNNs. While the
apex is labelled RRAM, this is intended to broadly cover all programmable
non-volatile resistive switching memories e.g. CBRAM, MRAM, PCM, etc.
The ON/OFF spikes are generated every time the difference
between the current and previous value of the input exceeds a
pre-defined threshold. The sign of the difference determines whether the spike is produced in the ON or the OFF channel. This is different from conventional cameras (Fig. 4(f)), which produce
image frames (Fig. 4(g)). Intuitively, it makes sense to use
asynchronous event-based sensor data in asynchronous SNNs,
and synchronously generated frames (i.e., all pixels are given
at a regular clock interval) through synchronous ANNs. But
it is worth noting that conventional frames can be encoded as
asynchronous spikes with frequencies that vary based on pixel
intensity, and event streams can be integrated over time into
synchronously generated time-surfaces [40], [41]. Event-based
sensors have been used to process biomedical signals [37], [42]
(Fig. 4(a)), which can be encoded to spike trains (Fig. 4(b))
to be processed by SNNs or be digitally sampled (Fig. 4(c))
for use in DNNs for learning and inference (Fig. 4(d)).
D. Benchmarking on a Biomedical Signal Processing Task
In Section III we will present a use-case of bio-signal
processing where FPGA and memristive DNN accelerators
are implemented and simulated. These are later compared to
equivalent existing implementations¹ using DNN accelerators
and neuromorphic processors from [39]. To perform com-
parisons, we use the same hand-gesture recognition task as
in [39].
Hand-gesture recognition is an important task in medical
settings such as prosthetic control, which can be performed us-
ing EMG biomedical signals, hand-gesture images, or a com-
bination of both. Here, the adopted hand-gesture dataset [39]
is a collection of 5 hand gestures recorded with two sensor
modalities: muscle activity from a Myo armband that senses
¹https://github.com/Enny1991/dvs_emg_fusion/blob/master/full_baseline.py
EMG electrical activity in forearm muscles, and a visual input
in the form of DVS events. Moreover, the dataset provides
accompanying video captured from a traditional frame-based
camera, i.e., images from an Active Pixel Sensor (APS) to feed
DNNs. Recordings were collected from 21 subjects including
12 males and 9 females between the ages of 25 and 35, and were
taken over three separate sessions.
For each implementation, we compare the mean and stan-
dard deviation of the accuracy obtained over a 3-fold cross
validation, where each fold encapsulates all recordings from
a given session. Additionally, for all implementations, we
compare the energy and time required to perform inference
on a single input, as well as the Energy-Delay Product (EDP),
which is the average energy consumption multiplied by the
average inference time.
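The EDP metric used in these comparisons is simply the product of the two averages. The sketch below uses hypothetical numbers (not measurements from this paper) to illustrate why a slower but more energy-frugal platform can still achieve a lower EDP:

```python
# Energy-Delay Product: average inference energy multiplied by average
# inference latency. Lower is better. All figures below are hypothetical.
def energy_delay_product(energy_uj, delay_ms):
    """Return the EDP in microjoule-milliseconds."""
    return energy_uj * delay_ms

platforms = {                    # (avg energy [uJ], avg delay [ms]) per inference
    "embedded_accelerator": (9000.0, 3.0),
    "neuromorphic_chip":    (900.0, 5.0),
}
edp = {k: energy_delay_product(e, d) for k, (e, d) in platforms.items()}
# The neuromorphic chip is slower (5 ms vs 3 ms) but 10x more energy-
# frugal, so its EDP (900 * 5 = 4500) beats the accelerator's (27000).
```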
III. DNN ACCELERATORS TOWARDS HEALTHCARE AND BIOMEDICAL APPLICATIONS
In this Section, we cover the use of CMOS and memris-
tors in DL acceleration. We discuss how they use different
strategies to achieve two of the key DNN acceleration goals,
namely MAC parallelism and reduced memory access. We also
discuss and review FPGAs as an alternative reconfigurable
DNN accelerator platform, which has shown great promise
in the healthcare and biomedical domains.
A. CMOS DNN accelerators
General edge-AI CMOS accelerator chips can be used for
DNN-enabled healthcare IoT and PoC systems. Therefore,
within this subsection, we first review a number of these chips
and provide examples of potential healthcare applications they
can accelerate. We will also explore some common approaches
to CMOS-driven acceleration of AI algorithms using massive
MAC parallelism and reduced memory access, which are
useful for both edge-AI devices and offline data center scale
acceleration. We then delve deeper into one of the more
renowned approaches, namely, the use of systolic arrays, and
show how a large accelerator developed using systolic arrays has been used to train and accelerate a breast cancer detection model [9].
1) Edge-AI DNN accelerators suitable for biomedical ap-
plications: The research and market for ASICs, which focus
on a new generation of microprocessor chips dedicated entirely
to machine learning and DNNs, have rapidly expanded in
recent years. Table II shows a number of these CMOS-
driven chips, which are intended for portable applications.
There are many other examples of AI accelerator chips (for
a comprehensive survey see [44]), but here we have selected several prominent examples, which are designed specifically for DL using
DNNs, RNNs, or both. We have also included a few general
purpose AI accelerators from Google [45], Intel [46], and
Huawei [47].
Although developed for general DNNs, the accelerators
shown in Table II can efficiently realize portable smart DL-
based healthcare IoT and PoC systems for processing image-
based (medical imaging) or dynamic sequential medical data
types (such as EEG and ECG). For instance, the table shows
[Figure 4: panels (a)-(j) illustrating the sensing-to-processing pathways detailed in the caption, including a conventional camera pipeline (lens, image sensor, ISP, analog-to-digital conversion) producing synchronous frames, and a DVS pixel (photoreceptor, bipolar cell, ganglion cells) producing asynchronous ON/OFF spikes.]
Fig. 4. DNNs and SNN neuromorphic processors adopt different operation models. In DNNs, inputs are processed in batches that propagate serially; consequently, DNNs require clocks for process synchronization. SNNs are asynchronous and process temporally encoded inputs independently. Time-series signals, such as the EMG signal presented in (a), can be either (b) temporally encoded using spike-train encoding schemes such as [37] before being fed into (j) neuromorphic processors, or (c) digitally sampled and concatenated into batches to be fed into (d) DNNs. Similarly, images captured through (e) lenses can be (i) temporally encoded into spike trains using (h) DVSs [43], or (f) digitally encoded using conventional cameras to build (g) image frames.
a few exemplar healthcare and biomedical applications, selected based on the demonstrated capacity of these accelerators to run (or train [48]) various well-known CNN architectures such as VGG, ResNet, MobileNet, AlexNet, and Inception; RNNs such as LSTMs; or combined CNN-RNNs. It is worth noting that most of the available accelerators are intended for CNN inference, while only some [49], [50], [51] also include recurrent connections for RNN acceleration.
The table shows that the total power per chip is typically in the range of hundreds of mW, with a few exceptions consuming around 10 W [46], [47]. Such low power consumption is required to avoid large heat sinks and to satisfy portable battery constraints. The table also shows the computing capability per unit time (column 'Computational Power (GOP/s)'). Independent of power consumption, this column reveals the computational performance and, consequently, the size of network one can compute per unit time. Several of these chips can run large and deep CNNs such as VGG and ResNet, which enables them to perform complex processing tasks within a constrained edge power budget.
For instance, it has previously been shown in [53] that a VGG CNN (compatible with Cambricon-X [52]) can successfully analyze ECoG signals. Considering the power efficiency of Cambricon-X, it could therefore be used to implement a portable automatic ECoG analyzer for PoC diagnosis of various cardiovascular diseases [66]. Similarly, Eyeriss [54] can run VGG-16, which has been shown to be effective in diagnosing thyroid cancer [55]. In addition, Eyeriss can run AlexNet for several different medical imaging applications [30], so it could serve as a mobile diagnostic tool integrated into, or complementing, medical imaging systems at the PoC. Origami [56] is another CNN accelerator chip that can support CNN-based healthcare applications. For instance, [57] proposes CNN-based ECG analysis for heart monitoring, for which Origami could be used to develop a smart healthcare IoT edge device. Similarly, the CNN processor proposed in [58] is shown to be able to run AlexNet, and could be deployed in a PoC ultrasound image processing system [59].
Envision [60] is another accelerator capable of running large-scale CNNs. It can be used as an edge inference engine running a multi-layer CNN for EEG/ECoG feature extraction for epilepsy diagnosis [61]. The neural processor of [62] is a further CNN accelerator, shown to run the Inception V3 CNN, which can be used for skin cancer detection at the edge [11]. LNPU [48] is the only accelerator shown in Table II that, unlike the others, can perform both learning and inference of deep networks such as AlexNet and VGG-16, for applications including on-edge medical imaging [30] and cancer diagnosis [55].

TABLE II
A NUMBER OF RECENT EDGE-AI CMOS CHIPS SUITABLE FOR PORTABLE HEALTHCARE AND BIOMEDICAL APPLICATIONS.

CMOS Chip | Core size (mm2) | Technology (nm) | Computational Power (GOP/s) | Power (mW) | Power Efficiency (TOPS/W) | Potential Mobile and Edge-based Healthcare and Medical Applications
Cambricon-X [52] | 6.38 | 65 | 544 | 954 | 0.5 | ECoG analysis using a sparse VGG [53] for PoC diagnosis of cardiovascular diseases
Eyeriss [54] | 12.25 | 65 | 17-42 | 278 | 0.06-0.15 | Mobile image-based cancer diagnosis using VGG-16 [55]; mobile diagnosis tool based on AlexNet for radiology, cardiology, and gastroenterology imaging [30]
Origami [56] | 3.09 | 65 | 196 | 654 | 0.8 | Smart healthcare IoT edge device for heart health monitoring using CNN-based ECG analysis [57]
ConvNet processor [58] | 2.4 | 40 | 102 | 25-287 | 0.3-2.7 | PoC ultrasound processing using AlexNet [59]
Envision [60] | 1.87 | 28 | 76-408 | 7.5-300 | 0.8-10 | Multi-layer CNN for EEG/ECoG feature extraction for epileptogenicity assessment and epilepsy diagnosis at the edge [61]
Neural processor [62] | 5.5 | 8 | 1900-7000 | 39-1500 | 4.5-11.5 | On-edge classification of skin cancer using the Inception V3 CNN [11]
LNPU [48] | 16 | 65 | 600 | 43-367 | 25 | On-edge learning/inference using VGG-16 for cancer diagnosis [55]; on-edge AlexNet learning/inference for radiology, cardiology, and gastroenterology imaging diagnosis [30]
DNPU [49] | 16 | 65 | 300-1200 | 35-279 | 2.1-8.1 | Parallel and cascaded RNN-CNN for ECG analysis for BCI [32]
Thinker [50] | 14.44 | 65 | 371 | 293 | 1-5 | PoC conversion of respiratory organ motion ultrasound into MRI using a long-term recurrent CNN [63]
UNPU [51] | 16 | 65 | 346-7372 | 3.2-297 | 3.08-50.6 | Intelligent pre-diagnosis medical support/consultation using a CNN-RNN [33]
Google Edge TPU [45] | 25 | - | 4000 | 2000 | 2 | Low-cost and easy-to-access skin cancer detection using the MobileNet V1 CNN [24]; on-edge health monitoring for fall detection using LSTMs [64]
Intel Nervana NNP-I 1000 (Spring Hill) [46] | - | 10 | 48000 | 10000 | 4.8 | Diagnosis using chest X-ray classification on the ResNet CNN family [65]
Huawei Ascend 310 [47] | - | 12 | 16000 | 8000 | 2 | Cardiovascular monitoring for arrhythmia diagnosis from ECG using an LSTM [31]; health monitoring by heart rate variability analysis of ECG using a bidirectional LSTM [34]
Unlike the chips discussed above, which can only run CNNs, DNPU [49], Thinker [50], and UNPU [51] are capable of accelerating both CNNs and RNNs. This feature makes them suitable for a wider variety of edge-based biomedical applications, such as ECG analysis for BCI using a cascaded RNN-CNN [32], PoC MRI construction from motion ultrasound using a long-term recurrent CNN [63], or intelligent medical consultation using a CNN-RNN [33].
Table II also lists three general-purpose AI accelerator chips, which can be deployed for low-cost and easy-to-access skin cancer detection using the MobileNet V1 CNN [24], on-edge health monitoring for fall detection using LSTMs [64], chest X-ray analysis using a ResNet CNN [65], cardiovascular arrhythmia detection from ECG using an LSTM [31], or heart rate variability analysis from ECG signals through a bidirectional LSTM [34], to name a few.
2) Common approaches to CMOS-driven DL acceleration:
Accelerators will typically target either data center use or
embedded ‘edge-AI’ acceleration. Edge chips, such as those
discussed above, must operate under restrictive power budgets
(e.g., within thermal limits of 5 W) to cope with portable
battery constraints. While the scale of tasks, input dimension
capacity, and clock speeds will differ between edge-AI and
modular data center racks, both will adopt similar principles
in the tasks they seek to optimize.
Most accelerator chips, including those in Table II, use similar optimization strategies, involving reduced-precision arithmetic [48], [51], [58], [60] to improve computational throughput. This is typically combined with architectural-level enhancements [49], [50], [52], [54], [62] that reduce data movement (using in- or near-memory computing), increase parallelism, or both.
Sequential and combinational logic research is largely mature, so outside of emerging memory technologies, the dominant hardware benefits come from optimizing data flow and architecture. An early example is the neuFlow system-on-chip (SoC) processor, which relies on a grid of processing tiles, each made up of a bank of processing operators and a multiplexer-based on-chip router [67]. The processing operator can serially perform primitive computations (MUL, DIV, ADD, SUB, MAX) or a parallelized 1D/2D convolution. The router configures data movement between tiles to support streaming data-flow graphs.
Since the development of neuFlow, over 100 startups and companies have developed, or are developing, machine learning accelerators. The Neural Processing Unit (NPU) [68] generalizes the neuFlow approach by employing eight processing engines, each of which computes a neuron response: multiplication, accumulation, and activation. If a program can be partitioned such that a segment of it is expressible as MACs, that segment can be offloaded to and computed on the NPU. This made it possible to go beyond MLP neural networks: the NPU was also demonstrated performing Sobel edge detection and fast Fourier transforms.
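A minimal sketch of one such neuron-response evaluation follows; the sigmoid activation and the function name are our illustrative choices, not details from [68]:

```python
import math

def neuron_response(inputs, weights, bias=0.0):
    """One processing-engine evaluation as described above: multiply
    each input by its weight, accumulate, then apply an activation
    (a sigmoid here, chosen purely for illustration)."""
    acc = bias
    for x, w in zip(inputs, weights):
        acc += x * w                         # the MAC loop
    return 1.0 / (1.0 + math.exp(-acc))      # activation stage

# Eight such engines operating in parallel yield eight neuron outputs
# per step; any program segment expressible as MACs can be offloaded.
print(neuron_response([1.0, -2.0, 0.5], [0.3, 0.1, 0.8]))
```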
NVIDIA coupled their expertise in developing GPUs with dedicated machine learning cores, namely tensor cores, which are aimed at delivering superior performance over regular Compute Unified Device Architecture (CUDA) cores [16]. Tensor cores target mixed-precision computing, with the NVIDIA Tesla V100 GPU combining 672 tensor cores on a single unit. By merging the parallelism of GPUs with the application-specific nature of tensor cores, these GPUs are capable of energy-efficient general compute workloads, as well as 12 trillion floating-point operations per second (TFLOPS) of matrix arithmetic.
Although plenty of other notable architectures exist (see Table II), a pattern emerges: most specialized processors rely on a series of sub-processing elements that each contribute to increasing the throughput of a larger processor. While there are plenty of ways to achieve MAC parallelism, one of the most renowned techniques is the systolic array, utilized by Groq [69] and Google, amongst numerous other chip developers. This is not a new concept: systolic architectures were first proposed back in the late 1970s [70], [71], and have become widely popularized since powering the hardware DeepMind used for the AlphaGo system that defeated Lee Sedol, one of the world's strongest Go players, in March 2016. Google also uses systolic arrays to accelerate MACs in their TPUs, just one of many CMOS ASICs used in DNN processing [13]. Here, we explain what systolic arrays are and how they can be used to decrease memory access frequency and increase MAC parallelism, towards efficient ANN accelerators.
3) Systolic arrays for DNN acceleration: In general-purpose computing, there is no way of knowing what the next instruction will be. The result of every operation must be stored in memory while awaiting further instructions from the processor. Energy is consumed in reading from and writing to memory, and time is wasted shuttling information over a limited-bandwidth bus to and from the processor (Fig. 5(a)). Neural networks, on the other hand, are deterministic: once the network has been trained, every operation the input data will be subjected to is pre-determined. This allows a single element of information, such as one pixel of an image, to have many operations applied to it before it is stored back in memory. Systolic arrays loosely draw inspiration from the cardiovascular system, where blood is pumped through various subsystems before returning to the heart; indeed, the word systolic is derived from the cardiac cycle. Similarly, in systolic processing, data flows through many processing elements before it returns to memory (Fig. 5(b)).
The appeal of systolic arrays is that they come in many forms, designed for different tasks using repeatable, modular blocks. As a simple case study of how systolic arrays parallelize operations with infrequent memory write cycles, consider the 3×3 matrix multiplication in Fig. 6. Here, each processing element is designed to multiply its two inputs together and accumulate the product with all future products. The input data is a matrix of values x_{m,n} and a weight matrix of values w_{m,n}; multiplying these two matrices together is an efficient way to compute a sum-of-products, i.e., a MAC operation.
As depicted in Fig. 6, input data is carefully orchestrated in time such that it naturally flows in rhythm with the incoming weight data. At T=1, x_{0,0} is multiplied with w_{0,0}. At T=2, x_{0,0} flows to the right and is multiplied by the next weight in sequence, w_{0,1}, while w_{0,0} flows down and is multiplied by the next input, x_{1,0}. Another input-weight pair enters the array: x_{0,1} and w_{1,0} are multiplied together in the top-left processing element and summed with the result of the previous time-step. This process repeats until all inputs have traversed to the right of the array and all weights have traversed to the bottom, giving the result shown at T=7. This equivalently performs matrix multiplication, the dominant operation in a DNN. Every element of the matrix can be computed in this way, without having to store any intermediate results in main memory.
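This schedule can be checked in software. The following minimal simulation is our illustrative sketch, not a hardware implementation: it models an output-stationary systolic array in which skewed input and weight streams meet at each processing element.

```python
import numpy as np

def systolic_matmul(X, W):
    """Simulate an output-stationary systolic array computing X @ W.

    Each processing element (PE) at position (i, j) multiplies the input
    arriving from its left with the weight arriving from above and adds
    the product to its local accumulator. Operands are skewed in time so
    that x[i, k] and w[k, j] meet at PE (i, j) at step t = i + j + k.
    """
    n = X.shape[0]
    acc = np.zeros((n, n))            # one accumulator per PE
    for t in range(3 * n - 2):        # 3n - 2 = 7 active steps for n = 3
        for i in range(n):
            for j in range(n):
                k = t - i - j         # operand pair reaching PE (i, j) now
                if 0 <= k < n:
                    acc[i, j] += X[i, k] * W[k, j]
    return acc                        # equals X @ W; no intermediate
                                      # results are written to main memory

X = np.arange(9, dtype=float).reshape(3, 3)
W = 2.0 * np.eye(3)
assert np.allclose(systolic_matmul(X, W), X @ W)
```

For a 3×3 multiplication the loop runs for seven active time-steps, consistent with the walkthrough above, and every output element accumulates locally without round trips to main memory.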
4) CMOS-based systolic arrays used in biomedical applications: Google's TPU utilizes a 128×128 systolic array, which enables 180 TFLOPS, while TPU v3 reaches up to 100 peta-FLOPS (PFLOPS). The modularity of systolic arrays makes them easily scalable to a large number of interconnected TPUs, a necessary feature for data center use. Even with a relatively slow clock (e.g., 700 MHz for TPU v1), systolic arrays are highly parallel, meaning numerous matrices are processed simultaneously. TPUs were used in the seminal work of Ref. [9], where an ensemble of three DNNs was used to surpass radiologist performance in breast cancer detection. This network was trained on a set of over 100,000 images, many at 4K resolution, and required the development of substantial infrastructure to make training such a system possible. The results demonstrated a significant reduction in both false positives and false negatives. Notably,
the system was able to generalize from being trained on UK-based data sets to competitive performance on USA-based images.

Fig. 5. (a) Conventional CPUs rely on a shared bus to transfer data to and from memory, resulting in a data-transmission bottleneck. (b) Systolic arrays pass data through multiple processing elements before storing results in memory.

Fig. 6. Mapping a 3×3 matrix multiplication onto a 2D systolic array. The figure shows the movement of input and weight data over time, from time-step T=0 through to T=7. The final result shows how all elements of a matrix are computed in parallel. This differs from pipelining in that individual processing elements perform entire operations, can be multi-directional, can operate at different speeds, and can execute kernels with their own local memory; pipelining instead executes pieces of an overall instruction in multiple pipelined stages.
Overall, systolic arrays make efficient use of limited memory bandwidth. While the connection from processor to memory is a bottleneck, the interconnections between processing elements can be very fast. The drawback is that if the required computation cannot be mapped onto the processing elements' function, such as a MAC, it cannot be implemented.
B. Memristive DNNs
To achieve the two aforementioned key DNN acceleration goals, i.e., massive MAC parallelism and reduced memory access, many studies have leveraged memristors [72], [73], [74], [75] as weight elements in their DNN and SNN [76], [77] architectures. Memristors are often referred to as the fourth fundamental circuit element, and can adapt their resistance (conductance) in response to the applied current or voltage. This resembles the adaptation of neural synapses to their surrounding activity during learning, a feature integral to the brain's in-memory processing ability that is missing in today's general-purpose computers. Such in-situ processing can be utilized to perform parallel MAC operations inside memory, significantly improving DNN learning and inference. This is achieved by developing memristive crossbar neuromorphic architectures, which are projected to achieve an approximately 2500-fold reduction in power and a 25-fold speed-up compared to state-of-the-art specialized hardware such as GPUs [72].
1) Memristive crossbars for parallel MAC and VMM operations: A memristive crossbar, which can be fabricated using a variety of device technologies [77], [78], can perform analog MAC operations in a single time-step (see Fig. 7(a)). This reduces the time complexity to its minimum, O(1), by carrying out multiplication at the location of memory, in a non-von Neumann structure. Using this well-known approach, VMM can be parallelized as demonstrated in Fig. 7(b), where a vector of M values represented as voltage signals ([V_1..V_M]) is applied to the rows of the crossbar, while the matrix (of size M×N), whose elements are represented as conductances (resistances), is stored in the memristive devices at each crossing point. Taking advantage of Ohm's law (I = V.G), the current summed in each crossbar column represents one element of the resulting multiplication vector of size N.

Fig. 7. Memristive crossbars can parallelize (a) analog MAC and (b) VMM operations. Here, V represents the input vector, while conductances in the crossbar represent the matrix.
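The read-out described above can be sketched numerically; the values below are arbitrary examples and the variable names are ours:

```python
import numpy as np

# Ideal electrical behavior of the crossbar in Fig. 7(b): row voltages V
# (length M) drive a conductance matrix G (M x N); Ohm's law gives each
# device current I = V * g, and each column wire sums its device currents,
# so all M*N multiplications and N summations occur in a single read step.
M, N = 4, 3
rng = np.random.default_rng(0)
V = rng.uniform(0.0, 0.3, size=M)           # input vector as row voltages (V)
G = rng.uniform(1e-6, 1e-4, size=(M, N))    # device conductances (S)

I = V @ G   # column currents (A): the full VMM result of size N

# The same column current written as the explicit Ohm's-law sum:
assert np.isclose(I[0], sum(V[m] * G[m, 0] for m in range(M)))
```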
2) Mapping memristive crossbars to DNN layers: Although implementing fully-connected DNN layers is straightforward (weights are mapped to the memristors at the crossbar cross-points and inputs are represented by input voltages), implementing a complex CNN requires mapping techniques that convert convolution operations to MAC operations. A popular approach is an unrolling (unfolding) operation that transforms the convolution of input feature maps with convolutional filters into MAC operations. We have developed a software platform named MemTorch [79], introduced in subsequent sections, that performs this mapping, along with a number of other operations, for converting DNNs to Memristive DNNs (MDNNs). The mapping process implemented in MemTorch is illustrated in the left panel of Fig. 8: the normal input feature maps and convolution filters (gray shaded area) are unfolded and reshaped (cyan shaded area) to be compatible with memristive crossbar parallel VMM operation. It is worth noting that the convolutional filters applied to the input feature maps have a direct relationship with the required crossbar sizes, and the resulting hardware size, or required processing time, depends on the size of the input feature maps [80].
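A minimal single-channel sketch of that unrolling (an im2col-style lowering; the function name and test values are illustrative) is:

```python
import numpy as np

def unfold_conv2d(x, kernel, stride=1):
    """Lower a single-channel 2D convolution ('valid' padding) to one
    matrix product, in the spirit of the unrolling described above.

    Each kernel-sized input patch becomes one row of the unfolded
    matrix; the flattened kernel is the vector it multiplies, so the
    whole convolution reduces to MAC operations a crossbar can perform.
    """
    kh, kw = kernel.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    patches = np.stack([
        x[i * stride:i * stride + kh, j * stride:j * stride + kw].ravel()
        for i in range(oh) for j in range(ow)
    ])                                  # shape: (oh * ow, kh * kw)
    return (patches @ kernel.ravel()).reshape(oh, ow)

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.array([[1.0, 0.0], [0.0, -1.0]])

# Cross-check against a direct sliding-window computation.
direct = np.array([[np.sum(x[i:i + 2, j:j + 2] * k) for j in range(3)]
                   for i in range(3)])
assert np.allclose(unfold_conv2d(x, k), direct)
```

Note how the unfolded matrix grows with the number of patch positions, reflecting the text's point that input feature map size drives the required crossbar size or processing time.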
3) Peripheral circuitry for memristive DNNs: In addition to the memristive devices used as programmable elements in MDNN architectures, various peripheral circuits are required to perform feed-forward error-backpropagation learning in MDNNs [74]. This extra circuitry may include: (i) a conversion circuit to translate the input feature maps to input voltages, which, for programming memristive devices, is usually a set of Pulse Width Modulator (PWM) circuits; (ii) current integrators or sense amplifiers, which pass the current read from every column of the memristive crossbar to (iii) Analog-to-Digital Converters (ADCs), which pass the converted values to (iv) an activation function circuit for forward propagation and, for backward error propagation, (v) an activation function derivative circuit. Other circuits required in the error backpropagation path include (vi) generators converting backpropagated values to PWM voltages, (vii) backpropagation current integrators, and (viii) backpropagation-path ADCs. In addition, an update module that updates network weights based on an algorithm such as SGD is required, which is usually implemented in software. After the update, the new weight values should be written to the memristive crossbar, which itself requires Bit-Line (BL) and Word-Line (WL) switch matrices to address the memristors, as well as a circuit to program the memristive weights. There are different approaches to implementing this circuit, such as that proposed in [81], while others may use ex-situ training in software, where the new weight values are calculated in software and transferred to the physical memristors through peripheral circuitry [74].
Fig. 8. Conversion process of a DNN trained in PyTorch and mapped to a Memristive DNN using MemTorch [79], to parallelize VMMs using 1T1R memristive crossbars while taking into account memristor variability, including a finite number of conductance states and non-ideal RON and ROFF distributions.
4) Memristive device nonidealities: Although ideal memristive crossbars are projected to remarkably accelerate DNN learning and inference and drastically reduce their power consumption [72], [73], device imperfections observed in experimentally fabricated memristors impose significant performance degradation when crossbar sizes are scaled up for deployment in real-world DNN architectures, such as those required for the healthcare and biomedical applications discussed in subsection III-A. These imperfections include nonlinear, asymmetric, and stochastic conductance (weight) updates; temporal and spatial device variations; limited device yield; and limited on/off ratios [72]. To minimize the impact of these
imperfections, specific peripheral circuitry and system-level
mitigation techniques have been used [82]. However, these
techniques add significant computation time and complexity to
the system. It is, therefore, essential to take the effect of these
nonidealities into consideration before utilizing memristive
DNNs for any healthcare and medical applications, where
accuracy is critical. In addition, there is a need for a unified
tool that reliably simulates the conversion of a pre-trained
DNN to a MDNN, while critically considering experimentally
modeled device imperfections [79].
5) Conversion of a DNN to an MDNN while considering memristor nonidealities: Due to the significant time and energy required to train large DNNs for challenging cognitive tasks, such as biomedical and healthcare data processing [9], [83], training is usually undertaken in data centers [9], [13]. The pretrained DNN can then be transferred onto memristive crossbars. Several frameworks and tools exist to simulate and facilitate this transition [84]. In a recent study, we developed a comprehensive tool named MemTorch: an open-source, general, high-level simulation platform that can fully integrate any behavioral or experimental memristive device model into crossbar architectures to design MDNNs [79].
Here, we utilize the benchmark biomedical signal process-
ing task explained in subsection II-D to demonstrate how
pretrained DNNs can be converted to equivalent MDNNs, and
how non-ideal memristive devices can be simulated within
MDNNs prior to hardware realization. The conversion process,
which can be generalized to other biomedical models using
MemTorch, is depicted in Fig. 8.
The targeted MDNNs are constructed by converting linear and convolutional layers from PyTorch pre-trained DNNs to memristive-equivalent layers employing 1-Transistor-1-Resistor (1T1R) crossbars. A double-column scheme is used to represent network weights within the memristive crossbars, and the converted MDNN models are tuned using linear regression, as described in [79]. The complete, detailed process and the source code of the network conversion for the experiments shown in this subsection are provided in a publicly accessible complementary Jupyter Notebook2.
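One common form of such a double-column scheme maps a signed weight to the difference of two conductances. The sketch below is our illustration, not MemTorch's internal code, and assumes device bounds of R_ON = 100 Ohm and R_OFF = 2.5 kOhm:

```python
import numpy as np

# Illustrative double-column weight scheme: each signed weight w is stored
# as a pair of conductances on adjacent columns and read out as a scaled
# difference of the two column currents. Bounds assume R_ON = 100 Ohm and
# R_OFF = 2500 Ohm devices (helper names and scaling are our choices).
g_min, g_max = 1.0 / 2500.0, 1.0 / 100.0

def weight_to_conductance_pair(w, w_max):
    """Map w in [-w_max, w_max] to (g_pos, g_neg) within device limits."""
    span = g_max - g_min
    g_pos = g_min + max(w, 0.0) / w_max * span
    g_neg = g_min + max(-w, 0.0) / w_max * span
    return g_pos, g_neg

def conductance_pair_to_weight(g_pos, g_neg, w_max):
    """Recover the signed weight from the differential column pair."""
    return (g_pos - g_neg) * w_max / (g_max - g_min)

g_pos, g_neg = weight_to_conductance_pair(-0.42, w_max=1.0)
assert np.isclose(conductance_pair_to_weight(g_pos, g_neg, w_max=1.0), -0.42)
```

The differential read-out also cancels the constant g_min offset shared by both columns, which is one reason this representation is widely used for signed weights.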
During the conversion, any memristor model can be used. For the benchmark task, a reference VTEAM model [85] is instantiated using parameters from Pt/Hf/Ti Resistive Random Access Memory (RRAM) devices [86] to model all memristive devices within converted linear and convolutional layers. As already mentioned, memristive devices have inevitable variability, which should be taken into account when implementing an MDNN for learning and/or inference. Fig. 8 also depicts visualizations of two non-ideal device characteristics: the finite number of conductance states and device-to-device variability. Using MemTorch [79], not only can we convert any DNN to an equivalent MDNN utilizing any memristive device model, we can also comprehensively investigate the effect of various device non-idealities and variations on the performance of a prospective MDNN before it is physically realized in hardware.
To demonstrate an example that includes variability in our MDNN simulations, device-to-device variability is introduced by sampling ROFF for each device from a normal distribution with mean 2.5 kΩ and standard deviation 2σ, and RON for each device from a normal distribution with mean 100 Ω and standard deviation σ.
2https://github.com/coreylammie/TBCAS-Towards-Healthcare-and-Biomedical-Applications/blob/master/MemTorch.ipynb
In Fig. 9, for the converted memristive MLP and CNN that process APS hand-gesture inputs, we gradually increase σ from 0 to 500 and compare the mean test set accuracy across the three folds. As can be observed, with increasing device-to-device variability, i.e., variability of RON and ROFF, performance degradation grows across all networks. For all simulations, RON and ROFF are bounded to be positive.
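That sampling procedure can be sketched as follows (our illustration; resistances in Ohms, with clipping standing in for the positivity bound):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_devices(shape, sigma):
    """Sample per-device resistances as in the experiment above:
    R_OFF ~ N(2500, 2*sigma) Ohm and R_ON ~ N(100, sigma) Ohm, both
    clipped so they remain positive."""
    r_off = np.clip(rng.normal(2500.0, 2.0 * sigma, shape), 1e-3, None)
    r_on = np.clip(rng.normal(100.0, sigma, shape), 1e-3, None)
    return r_on, r_off

# As sigma grows, the R_ON and R_OFF distributions broaden and begin to
# overlap, shrinking (or even inverting) the usable conductance range of
# a device; this is one mechanism behind the accuracy drop in Fig. 9.
for sigma in (0.0, 100.0, 500.0):
    r_on, r_off = sample_devices((128, 128), sigma)
    overlap = float(np.mean(r_off <= r_on))   # fraction of unusable devices
    print(f"sigma = {sigma:5.1f} -> R_OFF <= R_ON for {overlap:.1%} of devices")
```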
6) Memristive DNNs towards biomedical applications: Although some previous small-scale MDNNs have been simulated for biomedical tasks such as cardiac arrhythmia classification [87], or implemented on a physical programmable memristive array for breast cancer diagnosis [88], there currently exists no significant MDNN, even at the simulation level, that has realized a large-scale biomedical processing task.
Similar to the recent advances in CMOS-driven DNN accelerator chips discussed in subsection III-A, there have been promising partial [73] or full [74] hardware realizations of MDNNs, which are shown to achieve significant energy savings compared to state-of-the-art GPUs. However, unlike their CMOS counterparts, these implementations have only been able to perform simple tasks such as MNIST and CIFAR classification. They are, therefore, not yet suitable for implementing the large-scale CNNs and RNNs which, as shown in subsection III-A, are required for biomedical and healthcare tasks dealing with image [30] or temporal [31] data types.
In addition, following optimization strategies similar to those used in CMOS accelerators, [89] has investigated, in simulation, the use of quantized and binarized MDNNs and their error tolerance in a biomedical ECG processing task, showing their potential to achieve significant energy savings compared to full-precision MDNNs. However, due to the many intricacies of the design process, and considering the aforementioned peripheral circuitry that may offset the benefits gained by using MDNNs, a full hardware design is required before the actual energy savings of such binarized MDNNs can be verified.
Fig. 9. Simulation results investigating the performance of MDNNs for hand gesture classification adopting non-ideal Pt/Hf/Ti RRAM devices. Device-to-device variability is simulated using MemTorch [79].
C. FPGA DNNs
FPGAs are fairly low-cost reconfigurable hardware platforms that can be used in almost any hardware prototyping and implementation task, significantly shortening the time-to-market of an electronic product. They also provide parallel computation, which is essential when simultaneous data processing is required, such as processing multiple ECG channels in parallel. Furthermore, there exists a variety of High-Level Synthesis (HLS) tools and techniques [90], [91] that facilitate FPGA prototyping without the need to directly develop time-consuming low-level Hardware Description Language (HDL) code [92]. These tools allow engineers to describe their targeted hardware in a high-level programming language such as C and synthesize it to the Register Transfer Level (RTL). The tools then offload the computation-critical RTL to run as kernels on parallel processing platforms such as FPGAs [93].
1) Accelerating DNNs on FPGAs: FPGAs have previously been used to realize mostly inference [91], [94], [95] and, in some cases, training of DNNs with reduced-precision data [96] or other hardware-friendly approaches [97]. For a comprehensive review of previous FPGA-based DNN accelerators, we refer the reader to [91].
Here, we demonstrate an exemplar process of accelerating the DNNs used for the benchmark biomedical signal processing task explained in subsection II-D. For our acceleration, we use fixed-point parameter representations on a Starter Platform for OpenVINO Toolkit FPGA using OpenCL. OpenCL [90] is an HLS framework for writing programs that execute across heterogeneous platforms. It specifies programming languages (based on C99 and C++11) for programming the compute devices, and Application Programming Interfaces (APIs) to control and execute the developed kernels on those devices. Depending on the available computational resources, an accelerator can pipeline and execute all work items in parallel or sequentially.
Fig. 10 depicts the compilation flow we adopted. The
trained DNN PyTorch model is first converted to .prototxt
and .caffemodel files using Caffe. All weights and biases are
then converted to a fixed point representation using MAT-
LAB’s Fixed-point toolbox using word length and fractional
bit lengths defined in [98], prior to being exported as a
single binary .dat file for integration with PipeCNN, which is
used to generate the necessary RTL libraries, and to perform
compilation of the host executable and the FPGA bit-stream.
We used Intel’s FPGA SDK for OpenCL 19.1, and provide
all files used during the compilation shown in Fig. 10 in a
publicly accessible complementary GitHub repository3.
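The quantization stage of this flow maps each weight and bias to a signed fixed-point value defined by a word length and a fractional bit length. A minimal Python sketch of that rounding-and-saturation step is given below; the helper name and the example Q2.6 format are illustrative only, since the actual flow uses MATLAB's Fixed-point toolbox with the bit widths defined in [98].

```python
def to_fixed_point(w, word_len=8, frac_len=6):
    """Quantize a float to signed fixed-point with `word_len` total bits,
    `frac_len` of which are fractional, saturating on overflow."""
    scale = 1 << frac_len                            # 2**frac_len
    q_min = -(1 << (word_len - 1))                   # most negative integer code
    q_max = (1 << (word_len - 1)) - 1                # most positive integer code
    code = max(q_min, min(q_max, round(w * scale)))  # round, then saturate
    return code / scale                              # dequantized real value

# Quantize a few example weights to a Q2.6 format (8-bit word, 6 fractional bits).
quantized = [to_fixed_point(w) for w in (0.7312, -1.05, 3.2)]
# 3.2 overflows the 8-bit range and saturates at 127/64 = 1.984375
```

In the actual flow, the resulting integer codes (rather than the dequantized values) are what get packed into the single binary .dat file consumed by PipeCNN.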
2) FPGA-based DNNs for biomedical applications: De-
spite the many FPGA-based DNN accelerators available [91],
only a few have been developed specifically for biomedical
applications such as ECG anomaly detection [99], or real-time
mass-spectrometry data analysis for cancer detection [100],
where the authors show that application-specific parameter
quantization and customized network design can result in
significant inference speed-up compared to both CPU and
3https://github.com/coreylammie/TBCAS-Towards-Healthcare- and-
Biomedical-Applications/blob/master/FPGA/
SUBMITTED TO IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS 13
[Fig. 10 here] Fig. 10. Compilation flow used to deploy an EMG classification CNN to an OpenVINO FPGA adopting fixed-point number representations using OpenCL. (Stages: PyTorch model (network architecture and exported state dictionary) → Caffe model (.prototxt and .caffemodel files) → quantization (all weights and biases converted to a fixed-point representation using MATLAB's Fixed-point toolbox) → binary file → PipeCNN → RTL libraries describing the digital hardware that realises fixed-point inference → host executable and FPGA bit-stream.)
GPU. In addition, the authors in [101] have developed an
FPGA-based BCI in which an MLP is used for reconstructing
ECoG signals. In [102], the authors implemented an EEG
processing and neurofeedback prototype on a low-power,
low-cost FPGA and then scaled it to a high-end UltraScale
Virtex VU9P, achieving 215 and 8 times higher power
efficiency than a CPU and a GPU, respectively. For the
EEG processing, they developed an LSTM inference engine.
It is projected that, by leveraging specific algorithmic de-
sign and hardware-software co-design techniques, FPGAs can
provide more than 10 times better energy-delay efficiency than
state-of-the-art GPUs for accelerating DL [91]. This is significant
for realizing portable and reliable healthcare applications.
However, FPGA design is not as straightforward as the high-level
design flows available for DL accelerators; it requires skilled
engineers and stronger tools, such as those offered by GPU
manufacturers.
In the next section, we provide our analysis and perspective
on the use of the three hardware technologies discussed in this
section for DL-based biomedical and healthcare applications.
We also discuss how SNN-based neuromorphic processors can
benefit edge-processing for biomedical applications.
IV. ANALYSIS AND PERSPECTIVE
The use of ANNs trained with the backpropagation learning
algorithm in the domain of healthcare and for biomedical
applications such as cancer diagnosis [108] or ECG monitoring [109]
dates back to the early 1990s. These networks were
typically small-scale and ran on standard workstations. As
they were neither deep nor heavily parameterized, they did
not demand high-performance accelerators. However,
with the resurgence of CNNs in the early 2010s followed
by the rapid spread of DNNs and large data-sets, came
the need for high-speed specialized processors. This need
resulted in repurposing GPUs and actively researching other
hardware and design technologies including ASIC CMOS
chips (see Table II) and platforms [13], memristive crossbars
and in-memory computing [73], [74], [80], and FPGA-based
designs for DNN training [96], [97] and inference [94].
Despite notable progress in deploying non-GPU platforms
for DL acceleration, similar to other data processing tasks,
biomedical and healthcare tasks have mainly relied on standard
technologies and GPUs. Currently, depending on the size
of the required DNN, its number of parameters, as well as
the available training dataset size, biomedical DL tasks are
usually “trained” on high-performance workstations with one
or more GPUs [12], [18], on customized proprietary processors
such as Google TPU [9], or on various Infrastructure-as-
a-Service (IaaS) provider platforms, including Nvidia GPU
cloud, Google Cloud, and Amazon Web Services, among
others. This is mostly due to (i) the convenience these plat-
forms provide using high-level languages such as Python; (ii)
the availability of wide-spread and open-source DL libraries
such as TensorFlow and PyTorch; and (iii) strong community
and/or provider support in utilizing GPUs and IaaS for training
various DNN algorithms and applications.
However, “inference” can benefit from further research and
development on emerging and mature hardware and design
technologies such as those discussed in this paper, to open
up new opportunities for deploying healthcare devices closer
to the edge, paving the way for low-power and low-cost DL
accelerators for PoC devices and healthcare IoT. Despite this
fact, hardware implementations of biomedical and healthcare
inference engines are very sparse. Table III lists a summary of
the available hardware implementations and hardware-based
simulations of DNNs used for healthcare and biomedical
signal processing applications, using the three hardware tech-
nologies covered herein. In addition, the table shows existing
biomedical signal processing tasks implemented on generic
low-power spiking neuromorphic processors.
A. CMOS technology has been the main player for DL infer-
ence in the biomedical domain
Like general-purpose GPUs, all other current non-GPU DL
inference engines are implemented in CMOS. It is therefore
likely that most future edge-based biomedical platforms will
rely on these inference platforms. In Table II, we listed a number of these
accelerators that are mainly developed for low-power mobile
applications. We also mentioned a set of potential healthcare
and biomedical tasks that can be realized using them. However,
before the deployment of any edge-based DL accelerators
for biomedical and healthcare tasks, some challenges need
to be overcome. A non-exhaustive list of these obstacles
includes: (i) the power and resource constraints of available
mobile platforms, which despite significant improvements are
still not suitable for complex medical tasks; (ii) the need to
verify that a DL system can generalize beyond the distribution
it is trained and tested on; (iii) bias that is inherent to
datasets which may have adverse impacts on classification
across different populations; (iv) confusion surrounding the
TABLE III
EXISTING HARDWARE IMPLEMENTATIONS AND HARDWARE-BASED SIMULATIONS OF DNN ACCELERATORS USED FOR HEALTHCARE AND BIOMEDICAL
APPLICATIONS, AND GENERIC SNN NEUROMORPHIC PROCESSORS UTILIZED FOR BIOMEDICAL SIGNAL PROCESSING. SIMULATION-BASED

| Biomedical or Healthcare Task | DNN/SNN Architecture | Hardware |
|---|---|---|
| Image-based breast cancer diagnosis [9] | Ensemble of CNNs | CMOS (Google TPU) |
| Energy-efficient multi-class ECG classification [38] | Spiking RNN | CMOS |
| EMG signal processing [39] | Spiking CNN/MLP | CMOS |
| ECG signal processing [103] | Spiking RNN | CMOS |
| EMG signal processing [104] | Spiking RNN | CMOS |
| EMG signal processing [105] | Feed-forward SNN | CMOS |
| EMG and EEG signal processing [106] | Recurrent 3D SNN | CMOS |
| EEG and LFP signal processing [107] | TrueNorth-compatible CNN | CMOS |
| ECG processing for cardiac arrhythmia classification [87] | MLP | Memristors |
| Breast cancer diagnosis [88] | MLP | Programmable Memristor-CMOS system |
| ECG signal processing [89] | Binarized CNN | Memristors |
| ECG arrhythmia detection for heart monitoring [99] | MLP | FPGA |
| Mass-spectrometry for real-time cancer detection [100] | MLP | FPGA |
| ECoG signal processing for BCI [101] | MLP | FPGA |
| EEG processing for energy-efficient neurofeedback devices [102] | LSTM | FPGA |
liability of AI algorithms in high-risk environments [110];
and (v) the lack of a streamlined workflow between medical
practitioners and DL. While the latter challenges are matters of
legality and policy, the former issues highlight the fundamental
need to understand where dataset bias comes from, and how
to improve our understanding of why neural networks learn
the features they do, such that they may generalize across
populations in a manner that is safe for receivers of medical
care.
In addition, to make any accelerator usable for general
as well as more complex biomedical applications, the field
requires strong hardware-software co-design to build hardware
that can be readily programmed for biomedical
tasks. One successful example of a solid hardware-software
co-design for a DL-customized CMOS platform (shown in
Table III) is the Google TPU [13], which while generic, has
been used along with complex tailored software for human-
surpassing medical imaging tasks [9]. Google has used a
similar CMOS TPU technology to design inference engines [45],
which are very promising as edge hardware for enabling
mobile healthcare applications. The main reason for this
promise is the availability of solid software platforms (such as
TensorFlow Lite) and the community support for the Google
TPU.
Overall, DL accelerators have advanced greatly in the past
several years and are now permeating various aspects of our
lives, from self-driving cars to smart personal assistants. After
overcoming a number of obstacles such as those mentioned
above, we may also be able to widely integrate these DL
accelerators in healthcare and biomedical applications.
However, for some medical applications, such as monitoring,
which requires always-on processing, we still need
systems with orders of magnitude better power efficiency, so
they can run on a simple button battery for a long time. To
achieve such systems, one possible approach is to process data
only when available and make our processing asynchronous.
A promising method to achieve such goals is the use of brain-
inspired SNN-based neuromorphic processors.
B. Towards edge processing for biomedical applications with
neuromorphic processors
Although most of the efforts presented in this work focused
on DNN accelerators, there are also notable efforts in the
domain of SNN processors that offer complementary advan-
tages, such as the potential to reduce the power consumption
by multiple orders of magnitude, and to process the data in
real time. These so-called neuromorphic processors are ideal
for end-to-end processing scenarios for example in wearable
devices, where the streaming input needs to be monitored in
continuous time in an always-on manner.
There are already some works in the direction of processing
biomedical signals that explore both mixed analog-digital and
digital neuromorphic platforms, showing promising results for
always-on embedded biomedical systems. Table IV shows
a summary of today's large-scale neuromorphic processors
used for biomedical signal processing. The first chip presented
in this table is DYNAP-SE [111], a multi-core mixed-signal
neuromorphic implementation with analog neural dynamics
circuits and event-based asynchronous routing and commu-
nication circuits. The DYNAP-SE chip has been used to
implement four of the seven SNN processing systems listed
in Table III. These SNNs are used for the classification or
detection of EMG [104], [105] and ECG [103], [38]. The
DYNAP-SE was also used to build a spiking perceptron as part
of a design to classify and detect High-Frequency Oscillations
(HFO) in human intracranial EEG [42].
In [38], [103], [104], a spiking RNN is used to integrate
the ECG/EMG patterns temporally and separate them so that
they become classifiable with a linear read-out. A Support
Vector Machine (SVM) and a linear least-squares approximation
are used in the read-out layers of [103] and [38], reaching
overall anomaly-detection accuracies of 91% and 95%,
respectively. In [104], the timing and dynamic features of
the spiking RNN on EMG recordings were investigated for
classifying different hand gestures. In [105], the performance
of a feedforward SNN and a hardware-friendly spiking learn-
ing algorithm for hand gesture recognition using superficial
TABLE IV
NEUROMORPHIC PLATFORMS USED FOR BIOMEDICAL SIGNAL PROCESSING

| Neuromorphic Chip | DYNAP-SE | SpiNNaker | TrueNorth | Loihi | ODIN |
|---|---|---|---|---|---|
| CMOS Technology | 180 nm | ARM968, 130 nm | 28 nm | 14 nm FinFET | 28 nm FDSOI |
| Implementation | Mixed-signal | Digital | Digital ASIC | Digital ASIC | Digital ASIC |
| Neurons per core | 256 | 1000 (1M cores) | 256 | Max 1k | 256 |
| Synapses per core | 16k | 1M | 64k | 114k-1M | 64k |
| Energy per SOP | 17 pJ @ 1.8 V | Peak power 1 W per chip | 26 pJ @ 0.775 V | 23.6 pJ @ 0.75 V | 12.7 pJ @ 0.55 V |
| Size | 38.5 mm² | 102 mm² | - | 60 mm² | 0.086 mm² |
| Biosignal processing application | EMG [105], ECG [103], HFO [42] | EMG and EEG [106] | EEG and LFP [107] | EMG [39] | EMG [39] |
EMG was investigated and compared to traditional machine
learning approaches such as SVM. Results show that applying
an SVM to the spiking output of the hidden layer achieved a
classification rate of 84%, while the spiking learning method
achieved 74% with a power consumption of about 0.05 mW.
This consumption was compared to state-of-the-art embedded
systems, showing that the proposed spiking network is two
orders of magnitude more power efficient [112], [113].
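The recurrent-reservoir-plus-linear-readout scheme used in these works can be sketched in a few lines of Python. All dynamics, dimensions, and weight statistics below are illustrative stand-ins, not the parameters of the cited mixed-signal implementations: a fixed random recurrent network of leaky integrate-and-fire neurons projects the input spike train into a high-dimensional space, and only the readout is trained.

```python
import random

def lif_reservoir(spike_train, n=50, tau=0.9, thresh=1.0, seed=0):
    """Project a binary input spike train through a fixed random recurrent
    network of leaky integrate-and-fire neurons; return per-neuron mean
    firing rates, usable as features for a linear readout."""
    rng = random.Random(seed)
    w_in = [rng.uniform(-1, 1) for _ in range(n)]                  # input weights
    w_rec = [[rng.gauss(0, 0.1) for _ in range(n)] for _ in range(n)]  # recurrent weights
    v = [0.0] * n            # membrane potentials
    counts = [0.0] * n       # spike counts per neuron
    spikes = [0.0] * n       # spikes emitted at the previous time step
    for x in spike_train:    # one input spike (0/1) per time step
        new_spikes = [0.0] * n
        for i in range(n):
            rec = sum(w_rec[i][j] * spikes[j] for j in range(n))
            v[i] = tau * v[i] + w_in[i] * x + rec  # leaky integration
            if v[i] >= thresh:                     # fire and reset
                new_spikes[i] = 1.0
                v[i] = 0.0
        spikes = new_spikes
        for i in range(n):
            counts[i] += spikes[i]
    return [c / len(spike_train) for c in counts]  # mean-rate feature vector

features = lif_reservoir([1, 0, 1, 1, 0] * 10)
```

A linear readout, such as the least-squares fit or SVM used in [103], [38], is then trained on these rate vectors to separate the classes; the recurrent weights themselves are never updated.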
The other neuromorphic platforms listed in Table IV include
digital architectures such as SpiNNaker [114], TrueNorth [115]
and Loihi [116]. SpiNNaker has been used for EMG and
EEG processing, and the obtained results show better classi-
fication accuracy than traditional machine learning
methods [106].
for decoding EEG and LFP using CNNs. The network was
first developed in Caffe and the result was then used as a
basis for building a TrueNorth-compatible neural network. The
TrueNorth-compatible network achieved the highest classifi-
cation accuracy, around 76%. Recently, the benchmark hand-gesture
classification introduced in subsection II-D, was processed and
compared on two other digital neuromorphic platforms, i.e.
Loihi and ODIN/MorphIC [117], [118]. A spiking CNN was
implemented on Loihi and a spiking MLP was implemented
on ODIN/MorphIC [39]. The results achieved using these
networks are presented in Table V.
On-chip adaptation and learning mechanisms, such as those
present in some of the neuromorphic devices listed in Table IV,
could be a game changer for personalized medicine, where the
system can adapt to each patient's unique bio-signature and/or
its drift over time. However, the challenge of implementing effi-
cient on-chip online learning in these types of neuromorphic
architectures has not yet been solved. This challenge hinges on
two main factors: locality of the weight update and weight
storage.
Locality: Hardware constraints dictate that the information
required to update the weights of any on-chip network must
be locally available at the synapse; otherwise, most of the
silicon area would be consumed by the wires needed to route
the update information. As Hebbian learning satisfies this
requirement, most of the available on-chip learning algorithms
implement it in the form of unsupervised or semi-supervised
learning [117], [119]. However, local Hebbian-based
algorithms are limited to learning static patterns or to
very shallow networks [120]. There are also some efforts in
the direction of on-chip gradient-descent-based methods,
which implement error-based learning algorithms that
minimize the least mean square of a neural network cost
function. For example, the spike-based delta rule, which
underlies the backpropagation algorithm used in the vast
majority of current multi-layer neural networks, is the most
common weight update for single-layer networks. Single-layer
mixed-signal neuromorphic circuit implementations of the delta
rule have already been designed [121] and employed for EMG
classification [105]. Expanding this to multi-layer networks
involves non-local weight updates, which limits on-chip
implementation. Making the backpropagation algorithm local
is a topic of ongoing research [122], [123], [124].
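The locality of the delta rule can be made concrete with a short sketch. This is illustrative Python, not the mixed-signal circuit of [121]: the point is that each synapse's update depends only on its own presynaptic spike and a single shared error scalar, so no per-synapse routing of global gradient information is required.

```python
def delta_rule_step(w, pre_spikes, target, lr=0.1, thresh=1.0):
    """One local weight update for a single spiking output neuron.
    The error (target - output) and the presynaptic activity are both
    available at the synapse, so the update needs no global routing."""
    drive = sum(wi * si for wi, si in zip(w, pre_spikes))  # weighted input
    out = 1.0 if drive >= thresh else 0.0                  # spike / no spike
    err = target - out                                     # shared error signal
    # Each synapse updates from purely local quantities: err and its own input.
    return [wi + lr * err * si for wi, si in zip(w, pre_spikes)]

# Teach a 3-input neuron to fire for the input pattern [1, 1, 0].
w = [0.2, 0.2, 0.2]
for _ in range(20):
    w = delta_rule_step(w, [1, 1, 0], target=1.0)
# The weights of the active inputs grow until the neuron fires,
# then the error vanishes and the weights stop changing.
```

Extending this rule to hidden layers is exactly where locality breaks down, since the hidden-layer error would have to be routed back from downstream neurons.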
Weight storage: The holy grail of weight storage for online
on-chip learning is a non-volatile memory whose state can
be changed linearly in an analog fashion. Non-volatile
memristive devices hold great potential for this. Consequently,
there is a large body of literature on combining the maturity
of CMOS technology with the potential of emerging memories
to get the best of both worlds.
The integration of CMOS technology with that of the
emerging devices has been demonstrated for non-volatile fil-
amentary switches [125] already at a commercial level [126].
There have also been some efforts in combining CMOS and
memristor technologies to design supervised local error-based
learning circuits using only one network layer by exploiting
the properties of memristive devices [121], [127], [128].
Apart from the above-mentioned benefits in utilizing mem-
ristive devices for online learning in SNN-based neuromorphic
chips, as discussed in subsection III-B, memristive devices
have also shown interesting features to improve the power
consumption and delay of conventional DNNs. However, as
shown in Table III, memristor-based DNNs are very sparse in
the biomedical domain, and existing works are largely based
only on simulation.
C. Why is the use of MDNNs very limited in the biomedical
domain?
Currently, there are very few hardware implementations of
biomedical MDNNs that make use of general programmable
memristive-CMOS systems, and only one has been programmed
to construct an MLP for cancer diagnosis. We could find only
two other memristive designs in the literature for biomedical
applications (shown in Table III), but these are only simulations
of memristive crossbars. This sparsity persists despite the
significant advantages that memristors provide for MAC
parallelization and the in-memory computing paradigm, while
remaining compatible with CMOS technology. These features
make memristors ideal candidates for DL accelerators in
general, and for portable and edge-based healthcare applications
in particular, given the stringent device-size and power-consumption
requirements of such applications. Before memristive devices
can be used in the biomedical domain, though, several of their
shortcomings, such as limited endurance, device mismatch, and
analog noise accumulation, must be overcome first. This
demands further research in the materials,
overcome first. This demands further research in the materials,
as well as the circuit and system design side of this emerging
technology, while at the same time developing facilitator open-
source software [79] to support MDNNs. Furthermore, inves-
tigating the same techniques utilized in developing CMOS-
based DL accelerators such as limited precision data repre-
sentation [80], [89] and approximate computing schemes can
lead to advances in developing MDNNs and facilitate their use
in biomedical domains.
D. Why and when to use FPGA for biomedical DNNs?
Table III shows that FPGAs are a fairly popular hardware
technology for implementing simple DL networks such as
MLPs and, in one case, a complex LSTM. The table also shows
that FPGAs are mainly used for signal processing tasks and
have not been widely used to run complex DL architectures
such as CNNs, mainly because they have limited
on-chip memory and low bandwidth compared to GPUs.
However, they present notable benefits in terms of significantly
shorter development time compared to ASICs, and much lower
power consumption than typical GPUs. Besides, significant
power and latency improvement can be gained by customizing
the implementation of various components of a DNN on an
FPGA, compared to running it on a general-purpose CPU or
GPU [100], [102]. For instance, in [102], EEG signals are
processed on FPGAs using two customized hardware blocks
for (i) parallelizing MAC operations and (ii) efficient recurrent
state updates, both of which are key elements of LSTMs. This
has resulted in almost an order of magnitude improvement in
power efficiency compared to GPUs. This efficiency is critical
in many edge-computing applications including DNN-based point-of-care
biomedical devices [20] and healthcare IoT [19], [57].
Another benefit of FPGAs is that a customized efficient
FPGA design can be directly synthesized into an ASIC using
a nanometer-node CMOS technology to achieve even more
benefits. For instance, [102] has shown a nearly 100-fold
energy efficiency improvement when synthesized as an ASIC
in a 15-nm CMOS technology, compared to its FPGA counterpart.
Although low-power consumption and affordable cost are
two key factors for almost any edge-computing or near-sensor
device, these are even more important for biomedical devices
such as wearables, health-monitoring systems, and PoC de-
vices. Therefore, FPGAs present an appealing solution, where
their limitations can be addressed for a customized DNN using
specific design methods such as approximate computing [97]
and limited-precision data [94], [96], depending on the cost,
required power consumption, and the acceptable accuracy of
the biomedical device.
E. Benchmarking EMG processing across multiple DNN and
SNN hardware platforms
In Table V, we compare our FPGA and memristive im-
plementations to other DNN accelerators and neuromorphic
processors from [39]. Input and hidden layers are sequenced
with the ReLU activation function, and output layers are
fed through Softmax activation functions to determine class
probabilities. Dropout layers are used in all networks to avoid
over-fitting. The DNN architectures are given in the table
caption. Further implementation details can be found in [39].
The platforms used for each system in Table V are as
follows: ODIN+MorphIC [117], [118] and Loihi [116] neu-
romorphic platforms were used for spiking implementations;
NVIDIA Jetson Nano was used for all embedded GPU im-
plementations; OpenVINO Toolkit FPGA was used for all
FPGA implementations, and MemTorch [79] was used for
converting the MLP and CNN networks to their corresponding
MDNNs to determine the test set accuracies of all memristive
implementations.
From Table V, it can be observed that transitioning from
generalized architectures to application-specific processors
enables more optimized processing of a subset of given tasks.
Moving up the specificity hierarchy from GPUs to FPGAs to
memristive networks yields orders of magnitude of improvement
in both MLP and CNN processing, but naturally at the expense
of a generalizable range of tasks. While GPUs are relatively
efficient at training networks (compared to CPUs), the
impressive metrics presented by memristors (RRAM in these
simulations) are coupled with limited endurance. This is not
an issue for read-only tasks such as inference, but training is
thwarted by the thousands of epochs of weight updates, which
limits the broad use of RRAMs in training. Rather, further
exploration of alternative resistive technologies such as
Magnetoresistive Random Access Memory (MRAM) could prove
beneficial for tasks that demand high endurance.
After determining the test set accuracy of each MDNN using
MemTorch [79], we determined the energy required to perform
inference on a single input, the inference time, and the Energy-
Delay Product (EDP) using a similar approach to [129], for a
tiled memristor architecture. All presumptions made in our
calculations are listed below. Parameters are adopted from
those given in a 1T1R 65 nm technology, where the maximum
current during inference is 3 µA per cell with a read voltage of
0.3 V. Each cell is capable of storing 8 bits with a resistance
ratio of 100, and mapping signed weights is achieved using
a dual column representation. All convolutions are performed
by unrolling the kernels and performing MVMs, and the fully
connected layers have the fan-in weights for a single neuron
assigned to one column. Each crossbar has an aspect ratio
of 256×64 to enable more analog operations per ADC when
compared to a 128×128 array. Where there is insufficient
space to map weights to a single array, they are distributed
TABLE V
COMPARISON OF CONVENTIONAL DNNS IMPLEMENTED ON VARIOUS HARDWARE PLATFORMS WITH SPIKING DNN NEUROMORPHIC SYSTEMS ON THE
BENCHMARK BIOMEDICAL SIGNAL PROCESSING TASK OF HAND GESTURE RECOGNITION FOR BOTH SINGLE SENSOR AND SENSOR FUSION, AS
EXPLAINED IN SUBSECTION II-D. THE RESULTS OF THE ACCURACY ARE REPORTED WITH MEAN AND STANDARD DEVIATION OBTAINED OVER A 3-FOLD
CROSS VALIDATION. LOIHI, EMBEDDED GPU, AND ODIN+MORPHIC IMPLEMENTATION RESULTS ARE FROM [39]. THE DNN ARCHITECTURES
ADOPTED ARE AS FOLLOWS: 8C3-2P-16C3-2P-32C3-512-5 CNN; 16-128-128-5 MLP; 16-230-5 MLP; 4×400-210-5 MLP. EMG AND
APS/DVS NETWORKS ARE FUSED USING A 5-NEURON DENSE LAYER.

| Platform | Modality | Accuracy (%) | Energy (uJ) | Inference time (ms) | EDP (uJ·s) |
|---|---|---|---|---|---|
| Loihi (Spiking) | EMG (MLP) | 55.7 ± 2.7 | 173.2 ± 21.2 | 5.89 ± 0.18 | 1.0 ± 0.1 |
| | DVS (CNN) | 92.1 ± 1.2 | 815.3 ± 115.9 | 6.64 ± 0.14 | 5.4 ± 0.8 |
| | EMG+DVS (CNN) | 96.0 ± 0.4 | 1104.5 ± 58.8 | 7.75 ± 0.07 | 8.6 ± 0.5 |
| Embedded GPU | EMG (MLP) | 68.1 ± 2.8 | (25.5 ± 8.4)×10³ | 3.8 ± 0.1 | 97.3 ± 4.4 |
| | APS (CNN) | 92.4 ± 1.6 | (31.7 ± 7.4)×10³ | 5.9 ± 0.1 | 186.9 ± 3.9 |
| | EMG+APS (CNN) | 95.4 ± 1.7 | (32.1 ± 7.9)×10³ | 6.9 ± 0.05 | 221.1 ± 4.1 |
| FPGA | EMG (MLP) | 67.2 ± 2.3 | (17.6 ± 1.1)×10³ | 4.2 ± 0.1 | 74.1 ± 1.2 |
| | APS (CNN) | 96.7 ± 3.0 | (24.0 ± 1.2)×10³ | 5.4 ± 0.2 | 130.8 ± 1.4 |
| | EMG+APS (CNN) | 94.8 ± 2.0 | (31.2 ± 3.0)×10³ | 6.3 ± 0.1 | 196.3 ± 3.1 |
| Memristive | EMG (MLP) | 64.6 ± 2.2 | 0.038 | 6.0×10⁻⁴ | 2.38×10⁻⁸ |
| | APS (CNN) | 96.2 ± 3.3 | 4.83 | 1.0×10⁻³ | 4.83×10⁻⁶ |
| | EMG+APS (CNN) | 94.8 ± 2.0 | 4.90 | 1.2×10⁻³ | 5.88×10⁻⁶ |
| ODIN+MorphIC (Spiking) | EMG (MLP) | 53.6 ± 1.4 | 7.42 ± 0.11 | 23.5 ± 0.35 | 0.17 ± 0.01 |
| | DVS (MLP) | 85.1 ± 4.1 | 57.2 ± 6.8 | 17.3 ± 2.0 | 1.00 ± 0.24 |
| | EMG+DVS (MLP) | 89.4 ± 3.0 | 37.4 ± 4.2 | 19.5 ± 0.3 | 0.42 ± 0.08 |
| Embedded GPU | EMG (MLP) | 67.2 ± 3.6 | (23.9 ± 5.6)×10³ | 2.8 ± 0.08 | 67.2 ± 2.9 |
| | APS (MLP) | 84.2 ± 4.3 | (30.2 ± 7.5)×10³ | 6.9 ± 0.1 | 211.3 ± 6.1 |
| | EMG+APS (MLP) | 88.1 ± 4.1 | (32.0 ± 8.9)×10³ | 7.9 ± 0.05 | 253 ± 3.9 |
| FPGA | EMG (MLP) | 63.8 ± 1.4 | (13.9 ± 1.8)×10³ | 3.5 ± 0.1 | 48.9 ± 1.9 |
| | APS (MLP) | 82.9 ± 8.4 | (23.1 ± 2.6)×10³ | 5.7 ± 0.2 | 131.4 ± 2.8 |
| | EMG+APS (MLP) | 83.4 ± 2.8 | (31.1 ± 1.4)×10³ | 7.3 ± 0.2 | 228.2 ± 1.6 |
| Memristive | EMG (MLP) | 63.8 ± 1.4 | 0.026 | 4.0×10⁻⁴ | 1.04×10⁻⁸ |
| | APS (MLP) | 82.4 ± 8.5 | 0.18 | 4.0×10⁻⁴ | 7.2×10⁻⁸ |
| | EMG+APS (MLP) | 83.4 ± 2.8 | 0.33 | 6.0×10⁻⁴ | 1.98×10⁻⁷ |
across multiple arrays, with their results to be added digitally.
Throughput can be improved at the expense of additional
arrays for convolutional layers, by duplicating kernels such
that multiple inputs can be processed in parallel. The number
of tiles used for each network is assumed to be the exact
number required to balance the processing time of each layer.
The power consumption of each current-mode 8-bit ADC is
estimated to be 2×10⁻⁴ W with an operating frequency of
40 MHz (5 MHz for bit-serial operation). The ADC latency
is presumed to dominate digital addition of partial products
from various tiles. The dynamic range of each ADC has been
adapted to the maximum possible range for each column, and
each ADC occupies a pair of columns.
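These presumptions reduce the per-inference metrics to a simple accounting of crossbar read energy and ADC energy. The sketch below illustrates the shape of that accounting for a single 256×64 tile; the cell current (3 µA), read voltage (0.3 V), ADC power (2×10⁻⁴ W), and ADC frequency (40 MHz) come from the text, while the 25 ns cell read window and the ADC-dominated latency model are simplifying assumptions, not the full tile-level model used for Table V.

```python
def crossbar_inference_metrics(n_cells, n_adc_reads, i_cell=3e-6, v_read=0.3,
                               t_read=25e-9, p_adc=2e-4, f_adc=40e6):
    """Rough per-inference energy (J), latency (s), and EDP (J*s) for one
    memristor crossbar tile: analog cell read energy plus ADC energy,
    with latency assumed to be dominated by the ADC conversions."""
    e_cells = n_cells * i_cell * v_read * t_read  # worst-case analog MAC energy
    t_adc = n_adc_reads / f_adc                   # ADC-dominated latency
    e_adc = p_adc * t_adc                         # ADC energy over that window
    energy = e_cells + e_adc
    return energy, t_adc, energy * t_adc          # EDP = energy * delay

# One read of a full 256x64 tile, with 64 column conversions by one ADC.
energy, delay, edp = crossbar_inference_metrics(n_cells=256 * 64, n_adc_reads=64)
```

Even in this simplified form, the model reproduces the qualitative trend in Table V: sub-microjoule energies and sub-microsecond-scale latencies per tile read, with the ADC contribution comparable to the analog array itself.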
The above presumptions lead to pre-silicon results that
are extremely promising for memristor arrays, as shown in
Table V. But it should be clear that these calculations were
performed for network-specific architectures, rather than a
more general application-specific use-case. That is, we assume
the chip has been designed for a given neural network model.
The other comparison benchmarks are far more generalizable,
in that they are suited to not only handle most network
topologies, but are also well-suited for training. The substantial
improvement of inference time over other methods is a result
of duplicate weights being mapped to enable higher parallelism,
which is tolerable for small architectures, but leads
to prohibitively large ADC power consumption for computer
vision tasks which rely on deep networks and millions of pa-
rameters, such as VGG-16. The use of memristors as synapses
in spike-based implementations may be more appropriate, so
as to reduce the ADC overhead by replacing multi-bit ADCs
with current sense amplifiers instead, and reducing the reliance
on analog current summation along resistive and capacitive
bit-lines.
Spike-based hardware shows approximately two orders of
magnitude improvement in EDP in Table V when compared
to its GPU and FPGA counterparts, which highlights the
prospective use of such architectures in always-on
monitoring. This is necessary for enhancing the prospect of
ambient-assisted living, which would allow medical resources
to be freed up for tasks that are not suited for automation. In
general, one would expect that data should be processed in its
natural form. For example, 2D CNNs do not discard the
spatial relations between pixels in an image. Graph networks
are optimized for connectionist data, such as the structure
of proteins. By extension, the discrete events generated by
electrical impulses such as in EMGs, EEGs and ECGs may
also be optimized for SNNs. Of course, this discounts any
subthreshold firing patterns of measured neuron populations.
But one possible explanation for the suitability of spiking
hardware for biological processes stems from the natural
timing of neuronal action potentials. Individual neurons will
typically not fire in excess of 100 Hz, and the average heart
rate (and correspondingly, ECG spiking rate) will not exceed 3
Hz. There is a clear mismatch between the clock rate of
non-spiking neural network hardware, which tends to be at
least in the MHz range, and spike-driven processes. This introduces a
significant amount of wastage in processing data when there
is no new information to process (e.g., in between heartbeats,
action potentials, or neural activity).
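The scale of this mismatch can be made concrete with a back-of-the-envelope calculation using the event rates stated above; the 1 MHz clock and the one-useful-cycle-per-event assumption are illustrative simplifications.

```python
def idle_cycle_fraction(clock_hz, event_hz):
    """Fraction of clock cycles during which no new event arrives,
    assuming (for illustration) one useful cycle per event."""
    return 1.0 - event_hz / clock_hz

# A 1 MHz clocked accelerator monitoring a ~3 Hz ECG stream
ecg_idle = idle_cycle_fraction(1e6, 3.0)      # ~99.9997% of cycles idle
# ...versus a ~100 Hz neural spike train
spike_idle = idle_cycle_fraction(1e6, 100.0)  # ~99.99% of cycles idle
```

Event-driven hardware sidesteps this waste by consuming energy only on the arrival of a spike, rather than on every clock edge.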
Nonetheless, it is clear that accuracy is compromised when
relying on EMG signals alone, based on the approximately
10% decrease of classification accuracy on the Loihi chip and
ODIN+MorphIC, compared to their GPU/FPGA counterparts.
This could be a result of spike-based training algorithms
lagging behind in maturity compared to conventional neural
network methods, or it could be an indication that critical
information is being discarded when neglecting the subthresh-
old signals generated by populations of neurons. But when
EMG and DVS data are combined, the multi-sensory fusion
of spiking signals positively reinforces both modalities,
yielding an approximately 4% accuracy improvement, whereas
combining non-spiking, mismatched data representations leads
to marginal improvements, and even a destructive effect (e.g.,
non-spiking CNN implementation on FPGA and memristive
arrays). This may be a result of EMG and APS data taking on
completely different structures. This is a possible indication
that feature extraction from merging the same structural form
of data (i.e., as spikes) proves to be more beneficial than com-
bining a pair of networks with two completely different modes
of data (i.e., EMG signals with pixel-driven images). This
allows us to draw an important hypothesis: neural networks
can benefit from a consistent representation of data generated
by various sensory mechanisms. This is supported by biology,
where sensory information is typically conveyed by graded
potentials or spiking action potentials.
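The hypothesis of a consistent data representation can be illustrated with a simple encoder. The sketch below uses delta modulation, one common scheme for converting sampled amplitudes (e.g., EMG) into the same ON/OFF event format that a DVS camera produces natively; it is an assumed, generic encoding, not necessarily the preprocessing used in the benchmarks discussed above:

```python
def delta_encode(samples, threshold):
    """Convert a sampled signal into (sample_index, polarity) spike events.

    An event is emitted each time the signal moves by `threshold` relative
    to the reference level of the previous event, mirroring how a DVS pixel
    emits ON/OFF events on log-intensity changes."""
    events = []
    ref = samples[0]
    for i, x in enumerate(samples[1:], start=1):
        while x - ref >= threshold:      # rising amplitude: ON spikes
            ref += threshold
            events.append((i, +1))
        while ref - x >= threshold:      # falling amplitude: OFF spikes
            ref -= threshold
            events.append((i, -1))
    return events

# A toy EMG-like burst: three ON events on the rise, three OFF on the decay.
print(delta_encode([0.0, 0.5, 2.5, 3.0, 1.0, 0.0], threshold=1.0))
# → [(2, 1), (2, 1), (3, 1), (4, -1), (4, -1), (5, -1)]
```

Once both modalities are expressed as events, a single spiking network can fuse them without reconciling mismatched representations.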
V. CONCLUSION
The use of DL in biomedical signal processing and health-
care promises significant utility for medical practitioners and
their patients. DNNs can be used to improve the quality of life
for chronically ill patients by enabling ambient monitoring for
abnormalities, and correspondingly can reduce the burden on
medical resources. Proper use can lead to reduced workloads
for medical practitioners who may divert their attention to
time-critical tasks that require a standard beyond what neural
networks can achieve at this point in time.
We have stepped through the use of various DL accelera-
tors on a disparate range of medical tasks, and shown how
SNNs may complement DNNs where hardware efficiency is
the primary bottleneck for widespread integration. We have
provided a balanced view of how memristors may lead to
optimal hardware processing of both DNNs and SNNs, and
have highlighted the challenges that must be overcome before
they can be adopted at scale. While the focus of this
tutorial and review is on hardware implementation of various
DL algorithms, the reader should be mindful that progress
in hardware is a necessary, but insufficient, condition for
successful integration of medical-AI.
Adopting medical-AI tools is clearly a challenge that de-
mands the collaborative attention of healthcare providers, hard-
ware and software engineers, data scientists, policy-makers,
cognitive neuroscientists, device engineers and materials sci-
entists, amongst other specializations. A unified approach
to developing better hardware can have pervasive impacts
upon the healthcare industry, and realize significant payoff by
improving the accessibility and outcomes of healthcare.
ACKNOWLEDGMENT
This paper is supported in part by the European Union's
Horizon 2020 ERC project NeuroAgents (Grant No. 724295).
REFERENCES
[1] T. Arevalo, “The State of Health Care Industry- Statistics & Facts,”
PolicyAdvice, Tech. Rep., Apr. 2020.
[2] G. Rong, A. Mendez, E. B. Assi, B. Zhao, and M. Sawan, “Artificial
Intelligence in Healthcare: Review and Prediction Case Studies,”
Engineering, vol. 6, no. 3, pp. 291–301, 2020.
[3] V. Jindal, “Integrating Mobile and Cloud for PPG Signal Selection to
Monitor Heart Rate during Intensive Physical Exercise,” in Proceedings
of the International Conference on Mobile Software Engineering and
Systems (MOBILESOFT), Austin, TX., May 2016, pp. 36–37.
[4] P. Sundaravadivel, K. Kesavan, L. Kesavan, S. P. Mohanty, and
E. Kougianos, “Smart-Log: A Deep-Learning Based Automated Nutri-
tion Monitoring System in the IoT,” IEEE Transactions on Consumer
Electronics, vol. 64, no. 3, pp. 390–398, 2018.
[5] B. Shi, L. J. Grimm, M. A. Mazurowski, J. A. Baker, J. R. Marks,
L. M. King, C. C. Maley, E. S. Hwang, and J. Y. Lo, “Prediction of
occult invasive disease in ductal carcinoma in situ using deep learning
features,” Journal of the American College of Radiology, vol. 15, no. 3,
pp. 527–534, 2018.
[6] X. Liu, L. Faes, A. U. Kale, S. K. Wagner, D. J. Fu, A. Bruynseels,
T. Mahendiran, G. Moraes, M. Shamdas, C. Kern et al., “A comparison
of deep learning performance against health-care professionals in
detecting diseases from medical imaging: a systematic review and
meta-analysis,” The Lancet Digital Health, vol. 1, no. 6, pp. e271–
e297, 2019.
[7] F. Liu, P. Yadav, A. M. Baschnagel, and A. B. McMillan, “MR-
based Treatment Planning in Radiation Therapy Using a Deep Learning
Approach,” Journal of Applied Clinical Medical Physics, vol. 20, no. 3,
pp. 105–114, 2019.
[8] W. Zhu, L. Xie, J. Han, and X. Guo, “The Application of Deep
Learning in Cancer Prognosis Prediction,” Cancers, vol. 12, no. 3, p.
603, 2020.
[9] S. M. McKinney, M. Sieniek, V. Godbole, J. Godwin, N. Antropova,
H. Ashrafian, T. Back, M. Chesus, G. C. Corrado, A. Darzi et al.,
“International evaluation of an AI system for breast cancer screening,”
Nature, vol. 577, no. 7788, pp. 89–94, 2020.
[10] A. Y. Hannun, P. Rajpurkar, M. Haghpanahi, G. H. Tison, C. Bourn,
M. P. Turakhia, and A. Y. Ng, “Cardiologist-level arrhythmia detection
and classification in ambulatory electrocardiograms using a deep neural
network,” Nature Medicine, vol. 25, no. 1, p. 65, 2019.
[11] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau,
and S. Thrun, “Dermatologist-level classification of skin cancer with
deep neural networks,” Nature, vol. 542, no. 7639, pp. 115–118, 2017.
[12] T. Kalaiselvi, P. Sriramakrishnan, and K. Somasundaram, “Survey of
using GPU CUDA programming model in medical image analysis,”
Informatics in Medicine Unlocked, vol. 9, pp. 133–144, 2017.
[13] N. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa,
S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin,
C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb,
T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R.
Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey,
A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar,
S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke,
A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Na-
garajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick,
N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani,
C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing,
M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan,
R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-Datacenter
Performance Analysis of a Tensor Processing Unit,” in Proceedings
of the International Symposium on Computer Architecture (ISCA),
Toronto, Canada., Jun. 2017.
[14] A. Esteva, A. Robicquet, B. Ramsundar, V. Kuleshov, M. DePristo,
K. Chou, C. Cui, G. Corrado, S. Thrun, and J. Dean, “A guide to deep
learning in healthcare,” Nature Medicine, vol. 25, no. 1, pp. 24–29,
2019.
[15] N. G. Peter et al., “NVIDIA Fermi: The First Complete GPU Com-
puting Architecture,” A White Paper of NVIDIA, 2009.
[16] Z. Jia, M. Maggioni, B. Staiger, and D. P. Scarpazza, “Dissecting
the NVIDIA Volta GPU Architecture via Microbenchmarking,” arXiv
preprint arXiv:1804.06826, 2018.
[17] R. Zemouri, N. Zerhouni, and D. Racoceanu, “Deep Learning in the
Biomedical Applications: Recent and Future Status,” Applied Sciences,
vol. 9, no. 8, p. 1526, 2019.
[18] E. Smistad, T. L. Falch, M. Bozorgi, A. C. Elster, and F. Lindseth,
“Medical Image Segmentation on GPUs–A Comprehensive Review,”
Medical Image Analysis, vol. 20, no. 1, pp. 1–18, 2015.
[19] B. Farahani, F. Firouzi, and K. Chakrabarty, “Healthcare IoT,” in
Intelligent Internet of Things. Springer, 2020, pp. 515–545.
[20] Q. Xie, K. Faust, R. Van Ommeren, A. Sheikh, U. Djuric, and P. Dia-
mandis, “Deep Learning for Image Analysis: Personalizing Medicine
Closer to the Point of Care,” Critical Reviews in Clinical Laboratory
Sciences, vol. 56, no. 1, pp. 61–73, 2019.
[21] M. Hartmann, U. S. Hashmi, and A. Imran, “Edge computing in
smart health care systems: Review, challenges, and research directions,”
Transactions on Emerging Telecommunications Technologies, p. e3710,
2019.
[22] I. Azimi, A. Anzanpour, A. M. Rahmani, T. Pahikkala, M. Levorato,
P. Liljeberg, and N. Dutt, “Hich: Hierarchical Fog-Assisted Computing
Architecture for Healthcare IoT,” ACM Transactions on Embedded
Computing Systems (TECS), vol. 16, no. 5s, pp. 1–20, 2017.
[23] K. Sethi, V. Parmar, and M. Suri, “Low-Power Hardware-Based Deep-
Learning Diagnostics Support Case Study,” in Proceedings of the IEEE
Biomedical Circuits and Systems Conference (BioCAS), Cleveland,
OH., Oct. 2018.
[24] P. Sahu, D. Yu, and H. Qin, “Apply lightweight deep learning on
internet of things for low-cost and easy-to-access skin cancer detec-
tion,” in Proceedings of Medical Imaging 2018: Imaging Informatics
for Healthcare, Research, and Applications, vol. 10579, Houston, TX,
Feb. 2018, p. 1057912.
[25] Y. Wei, J. Zhou, Y. Wang, Y. Liu, Q. Liu, J. Luo, C. Wang, F. Ren, and
L. Huang, “A Review of Algorithm & Hardware Design for AI-Based
Biomedical Applications,” IEEE Transactions on Biomedical Circuits
and Systems, vol. 14, no. 2, pp. 145–163, 2020.
[26] D. E. Rumelhart, G. Hinton, and R. J. Williams, “Learning represen-
tations by back-propagating errors,” Nature, vol. 323, no. 6088, pp.
533–538, 1986.
[27] T. C. Hollon, B. Pandian, A. R. Adapa, E. Urias, A. V. Save, S. S. S.
Khalsa, D. G. Eichberg, R. S. D’Amico, Z. U. Farooq, S. Lewis et al.,
“Near real-time intraoperative brain tumor diagnosis using stimulated
Raman histology and deep neural networks,” Nature Medicine, pp. 1–7,
2020.
[28] B. Shickel, P. J. Tighe, A. Bihorac, and P. Rashidi, “Deep EHR: A
Survey of Recent Advances in Deep Learning Techniques for Elec-
tronic Health Record (EHR) Analysis,” IEEE Journal of Biomedical
and Health Informatics, vol. 22, no. 5, pp. 1589–1604, 2017.
[29] M. A. Sayeed, S. P. Mohanty, E. Kougianos, and H. P. Zaveri,
“Neuro-Detect: A Machine Learning-Based Fast and Accurate Seizure
Detection System in the IoMT,” IEEE Transactions on Consumer
Electronics, vol. 65, no. 3, pp. 359–368, 2019.
[30] N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall,
M. B. Gotway, and J. Liang, “Convolutional Neural Networks for Med-
ical Image Analysis: Full Training or Fine Tuning?” IEEE Transactions
on Medical Imaging, vol. 35, no. 5, pp. 1299–1312, 2016.
[31] J. Gao, H. Zhang, P. Lu, and Z. Wang, “An Effective LSTM Recurrent
Network to Detect Arrhythmia on Imbalanced ECG Dataset,” Journal
of Healthcare Engineering, vol. 2019, 2019.
[32] D. Zhang, L. Yao, X. Zhang, S. Wang, W. Chen, R. Boots, and B. Bena-
tallah, “Cascade and Parallel Convolutional Recurrent Neural Networks
on EEG-based Intention Recognition for Brain Computer Interface,” in
Proceedings of the AAAI Conference on Artificial Intelligence, New
Orleans, LA., Feb. 2018.
[33] X. Zhou, Y. Li, and W. Liang, “CNN-RNN Based Intelligent Rec-
ommendation for Online Medical Pre-Diagnosis Support,” IEEE/ACM
Transactions on Computational Biology and Bioinformatics, 2020.
[34] J. Laitala, M. Jiang, E. Syrjälä, E. K. Naeini, A. Airola, A. M. Rahmani,
N. D. Dutt, and P. Liljeberg, “Robust ECG R-peak detection using
LSTM,” in Proceedings of the ACM Symposium on Applied Computing
(SAC), Brno, Czech Republic., Mar. 2020, pp. 1104–1111.
[35] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT
press, 2016.
[36] G. Indiveri and S.-C. Liu, “Memory and Information Processing in
Neuromorphic Systems,” Proceedings of the IEEE, vol. 103, no. 8, pp.
1379–1397, 2015.
[37] F. Corradi and G. Indiveri, “A Neuromorphic Event-Based Neural
Recording System for Smart Brain-Machine-Interfaces,” IEEE Trans-
actions on Biomedical Circuits and Systems, vol. 9, no. 5, pp. 699–709,
2015.
[38] F. Corradi, S. Pande, J. Stuijt, N. Qiao, S. Schaafsma, G. Indiveri,
and F. Catthoor, “ECG-based Heartbeat Classification in Neuromorphic
Hardware,” in Proceedings of the International Joint Conference on
Neural Networks (IJCNN), Budapest, Hungary., Jul. 2019.
[39] E. Ceolini, C. Frenkel, S. B. Shrestha, G. Taverni, L. Khacef, M. Pay-
vand, and E. Donati, “Hand-gesture recognition based on EMG and
event-based camera sensor fusion: a benchmark in neuromorphic
computing,” Frontiers in Neuroscience, no. 520438, p. 36, 2020.
[40] J. K. Eshraghian, K. Cho, C. Zheng, M. Nam, H. H.-C. Iu, W. Lei, and
K. Eshraghian, “Neuromorphic vision hybrid rram-cmos architecture,”
IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
vol. 26, no. 12, pp. 2816–2829, 2018.
[41] X. Lagorce, G. Orchard, F. Galluppi, B. E. Shi, and R. B. Benosman,
“Hots: a hierarchy of event-based time-surfaces for pattern recogni-
tion,” IEEE transactions on pattern analysis and machine intelligence,
vol. 39, no. 7, pp. 1346–1359, 2016.
[42] M. Sharifshazileh, K. Burelo, T. Fedele, J. Sarnthein, and G. Indiveri,
“A Neuromorphic Device for Detecting High-Frequency Oscillations in
Human iEEG,” in Proceedings of the IEEE International Conference on
Electronics, Circuits and Systems (ICECS), Genova, Italy., Nov. 2019,
pp. 69–72.
[43] P. Lichtsteiner, C. Posch, and T. Delbruck, “A 128×128 120 dB 15
μs Latency Asynchronous Temporal Contrast Vision Sensor,” IEEE
Journal of Solid-state Circuits, vol. 43, no. 2, pp. 566–576, 2008.
[44] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kep-
ner, “Survey and Benchmarking of Machine Learning Accelerators,”
arXiv preprint arXiv:1908.11348, 2019.
[45] “Edge TPU,” https://coral.ai/docs/edgetpu/faq/.
[46] J. Hruska, “Intel Nervana Inference and Training AI Cards,”
https://www.extremetech.com/computing/296990-intel-nervana-nnp-i-
nnp-t-a-training-inference.
[47] P. Kennedy, “Huawei Ascend 310,”
https://www.servethehome.com/huawei-ascend-910-provides-a-nvidia-
ai-training-alternative/.
[48] J. Lee, J. Lee, D. Han, J. Lee, G. Park, and H.-J. Yoo, “LNPU:
A 25.3 TFLOPS/W Sparse Deep-Neural-Network Learning Processor
with Fine-Grained Mixed Precision of FP8-FP16,” in Proceedings of
the IEEE International Solid-State Circuits Conference (ISSCC), San
Francisco, CA., Feb. 2019, pp. 142–144.
[49] D. Shin, J. Lee, J. Lee, J. Lee, and H.-J. Yoo, “DNPU: An Energy-
Efficient Deep-Learning Processor with Heterogeneous Multi-Core
Architecture,” IEEE Micro, vol. 38, no. 5, pp. 85–93, 2018.
[50] S. Yin, P. Ouyang, S. Tang, F. Tu, X. Li, S. Zheng, T. Lu, J. Gu,
L. Liu, and S. Wei, “A high energy efficient reconfigurable hybrid
neural network processor for deep learning applications,” IEEE Journal
of Solid-State Circuits, vol. 53, no. 4, pp. 968–982, 2017.
[51] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo, “UNPU: An
Energy-Efficient Deep Neural Network Accelerator With Fully Variable
Weight Bit Precision,” IEEE Journal of Solid-State Circuits, vol. 54,
no. 1, pp. 173–185, 2018.
[52] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and
Y. Chen, “Cambricon-X: An accelerator for sparse neural networks,” in
Proceedings of the IEEE/ACM International Symposium on Microar-
chitecture (MICRO), Taipei, Taiwan., Oct. 2016.
[53] J. Zhang, S. Gajjala, P. Agrawal, G. H. Tison, L. A. Hallock,
L. Beussink-Nelson, E. Fan, M. A. Aras, C. Jordan, K. E. Fleischmann
et al., “A Computer Vision Pipeline for Automated Determination of
Cardiac Structure and Function and Detection of Disease by Two-
Dimensional Echocardiography,” arXiv preprint arXiv:1706.07342,
2017.
[54] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An Energy-
Efficient Reconfigurable Accelerator for Deep Convolutional Neural
Networks,” IEEE Journal of Solid-state Circuits, vol. 52, no. 1, pp.
127–138, 2016.
[55] Q. Guan, Y. Wang, B. Ping, D. Li, J. Du, Y. Qin, H. Lu, X. Wan,
and J. Xiang, “Deep Convolutional Neural Network VGG-16 Model
for Differential Diagnosing of Papillary Thyroid Carcinomas in Cyto-
SUBMITTED TO IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS 20
logical Images: A Pilot Study,” Journal of Cancer, vol. 10, no. 20, p.
4876, 2019.
[56] L. Cavigelli and L. Benini, “Origami: A 803-GOp/s/W Convolutional
Network Accelerator,” IEEE Transactions on Circuits and Systems for
Video Technology, vol. 27, no. 11, pp. 2461–2475, 2016.
[57] I. Azimi, J. Takalo-Mattila, A. Anzanpour, A. M. Rahmani, J.-P.
Soininen, and P. Liljeberg, “Empowering Healthcare IoT Systems
with Hierarchical Edge-Based Deep Learning,” in Proceedings of the
International Conference on Connected Health: Applications, Systems
and Engineering Technologies (CHASE), Washington, DC., Sep. 2018,
pp. 63–68.
[58] B. Moons and M. Verhelst, “An Energy-Efficient Precision-Scalable
ConvNet Processor in 40-nm CMOS,” IEEE Journal of Solid-state
Circuits, vol. 52, no. 4, pp. 903–914, 2016.
[59] M. Blaivas and L. Blaivas, “Are All Deep Learning Architectures
Alike for Point-of-Care Ultrasound?: Evidence From a Cardiac Image
Classification Model Suggests Otherwise,” Journal of Ultrasound in
Medicine, 2019.
[60] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, “Envision:
A 0.26-to-10TOPS/W subword-parallel dynamic-voltage-accuracy-
frequency-scalable Convolutional Neural Network processor in 28nm
FDSOI,” in Proceedings of the IEEE International Solid-State Circuits
Conference (ISSCC), San Francisco, CA., Feb. 2017, pp. 246–247.
[61] M.-P. Hosseini, T. X. Tran, D. Pompili, K. Elisevich, and H. Soltanian-
Zadeh, “Deep Learning with Edge Computing for Localization of
Epileptogenicity Using Multimodal rs-fMRI and EEG Big Data,”
in Proceedings of the IEEE International Conference on Autonomic
Computing (ICAC), Columbus, OH., Jul. 2017, pp. 83–92.
[62] J. Song, Y. Cho, J.-S. Park, J.-W. Jang, S. Lee, J.-H. Song, J.-G. Lee,
and I. Kang, “An 11.5TOPS/W 1024-MAC Butterfly Structure Dual-
Core Sparsity-Aware Neural Processing Unit in 8nm Flagship Mobile
SoC,” in Proceedings of the IEEE International Solid-State Circuits
Conference (ISSCC), San Francisco, CA., Feb. 2019, pp. 130–132.
[63] F. Preiswerk, C.-C. Cheng, J. Luo, and B. Madore, “Synthesizing
Dynamic MRI Using Long-Term Recurrent Convolutional Networks,”
in Proceedings of the International Workshop on Machine Learning in
Medical Imaging (MLMI). Granada, Spain.: Springer, Sep. 2018, pp.
89–97.
[64] J. P. Queralta, T. N. Gia, H. Tenhunen, and T. Westerlund, “Edge-
AI in LoRa-based Health Monitoring: Fall Detection System with Fog
Computing and LSTM Recurrent Neural Networks,” in Proceedings
of the International Conference on Telecommunications and Signal
Processing (TSP), 2019, pp. 601–604.
[65] I. M. Baltruschat, H. Nickisch, M. Grass, T. Knopp, and A. Saalbach,
“Comparison of Deep Learning Approaches for Multi-Label Chest X-
Ray Classification,” Scientific Reports, vol. 9, no. 1, pp. 1–10, 2019.
[66] G. Zamzmi, L.-Y. Hsu, W. Li, V. Sachdev, and S. Antani, “Harnessing
Machine Intelligence in Automatic Echocardiogram Analysis: Current
Status, Limitations, and Future Directions,” IEEE Reviews in Biomed-
ical Engineering, 2020.
[67] P.-H. Pham, D. Jelaca, C. Farabet, B. Martini, Y. LeCun, and E. Cu-
lurciello, “NeuFlow: Dataflow vision processing system-on-a-chip,” in
Proceedings of the IEEE International Midwest Symposium on Circuits
and Systems (MWSCAS), Fort Collins, CO., Aug. 2012, pp. 1044–1047.
[68] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides,
J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray et al., “A
reconfigurable fabric for accelerating large-scale datacenter services,”
in Proceedings of the ACM/IEEE International Symposium on Com-
puter Architecture (ISCA), Minneapolis, MN., Jun. 2014, pp. 13–24.
[69] D. Abts, J. Ross, J. Sparling, M. Wong-VanHaren, M. Baker,
T. Hawkins, A. Bell, J. Thompson, T. Kahsai, G. Kimmell et al.,
“Think Fast: A Tensor Streaming Processor (TSP) for Accelerating
Deep Learning Workloads,” in Proceedings of the ACM/IEEE Interna-
tional Symposium on Computer Architecture (ISCA), Valencia, Spain.,
May 2020.
[70] H. Kung and C. E. Leiserson, “Systolic Arrays (for VLSI),” in
Proceedings of Sparse Matrix, vol. 1. Society for industrial and applied
mathematics, 1979, pp. 256–282.
[71] H.-T. Kung, “Why systolic architectures?” Computer, no. 1, pp. 37–46,
1982.
[72] G. Burr, P. Narayanan, R. Shelby, S. Sidler, I. Boybat, C. di Nolfo,
and Y. Leblebici, “Large-scale neural networks implemented with
non-volatile memory as the synaptic weight element: Comparative
performance analysis (accuracy, speed, and power),” in Proceedings of
the IEEE International Electron Devices Meeting (IEDM), Washington,
DC., Dec. 2015.
[73] S. Ambrogio, P. Narayanan, H. Tsai, R. M. Shelby, I. Boybat, C. Nolfo,
S. Sidler, M. Giordano, M. Bodini, N. C. Farinha et al., “Equivalent-
accuracy accelerated neural-network training using analogue memory,”
Nature, vol. 558, no. 7708, p. 60, 2018.
[74] P. Yao, H. Wu, B. Gao, J. Tang, Q. Zhang, W. Zhang, J. J. Yang, and
H. Qian, “Fully hardware-implemented memristor convolutional neural
network,” Nature, vol. 577, no. 7792, pp. 641–646, 2020.
[75] J. K. Eshraghian, S.-M. Kang, S. Baek, G. Orchard, H. H.-C. Iu, and
W. Lei, “Analog weights in reram dnn accelerators,” in 2019 IEEE
International Conference on Artificial Intelligence Circuits and Systems
(AICAS). IEEE, 2019, pp. 267–271.
[76] M. R. Azghadi, B. Linares-Barranco, D. Abbott, and P. H. Leong, “A
Hybrid CMOS-Memristor Neuromorphic Synapse,” IEEE Transactions
on Biomedical Circuits and Systems, vol. 11, no. 2, pp. 434–445, 2017.
[77] M. Rahimi Azghadi, Y.-C. Chen, J. K. Eshraghian, J. Chen, C.-Y.
Lin, A. Amirsoleimani, A. Mehonic, A. J. Kenyon, B. Fowler, J. C.
Lee et al., “Complementary metal-oxide semiconductor and memristive
hardware for neuromorphic computing,” Advanced Intelligent Systems,
vol. 2, no. 5, p. 1900189, 2020.
[78] Q. Xia and J. J. Yang, “Memristive crossbar arrays for brain-inspired
computing,” Nature Materials, vol. 18, no. 4, pp. 309–323, 2019.
[79] C. Lammie, W. Xiang, B. Linares-Barranco, and M. R. Azghadi,
“MemTorch: An Open-source Simulation Framework for Memristive
Deep Learning Systems,” arXiv preprint arXiv:2004.10971, 2020.
[80] C. Lammie, O. Krestinskaya, A. James, and M. R. Azghadi, “Variation-
aware Binarized Memristive Networks,” in Proceedings of the Inter-
national Conference on Electronics, Circuits and Systems (ICECS),
Genova, Italy., Nov. 2019, pp. 490–493.
[81] O. Krestinskaya, K. N. Salama, and A. P. James, “Learning in Mem-
ristive Neural Network Architectures Using Analog Backpropagation
Circuits,” IEEE Transactions on Circuits and Systems I: Regular
Papers, vol. 66, no. 2, pp. 719–732, 2018.
[82] S. Yu, P.-Y. Chen, Y. Cao, L. Xia, Y. Wang, and H. Wu, “Scaling-up
resistive synaptic arrays for neuro-inspired architecture: Challenges and
prospect,” in Proceedings of the IEEE International Electron Devices
Meeting (IEDM), Washington, DC., Dec. 2015.
[83] N. Bien, P. Rajpurkar, R. L. Ball, J. Irvin, A. Park, E. Jones, M. Bereket,
B. N. Patel, K. W. Yeom, K. Shpanskaya et al., “Deep-learning-
assisted diagnosis for knee magnetic resonance imaging: development
and retrospective validation of MRNet,” PLoS Medicine, vol. 15, no. 11,
p. e1002699, 2018.
[84] A. Ankit, I. E. Hajj, S. R. Chalamalasetti, G. Ndu, M. Foltin,
R. S. Williams, P. Faraboschi, W. Hwu, J. P. Strachan, K. Roy,
and D. S. Milojicic, “PUMA: A Programmable Ultra-efficient
Memristor-based Accelerator for Machine Learning Inference,” CoRR,
vol. abs/1901.10351, 2019. [Online]. Available: http://arxiv.org/abs/
1901.10351
[85] S. Kvatinsky, M. Ramadan, E. G. Friedman, and A. Kolodny,
“VTEAM: A General Model for Voltage-controlled Memristors,” IEEE
Transactions on Circuits and Systems II: Express Briefs, vol. 62, no. 8,
pp. 786–790, 2015.
[86] E. Yalon, A. Gavrilov, S. Cohen, D. Mistele, B. Meyler, J. Salzman,
and D. Ritter, “Resistive Switching in HfO2 Probed by a Metal–
Insulator–Semiconductor Bipolar Transistor,” IEEE Electron Device
Letters, vol. 33, no. 1, pp. 11–13, 2012.
[87] A. M. Hassan, A. F. Khalaf, K. S. Sayed, H. H. Li, and Y. Chen,
“Real-Time Cardiac Arrhythmia Classification Using Memristor Neu-
romorphic Computing System,” in Proceedings of the International
Conference of the IEEE Engineering in Medicine and Biology Society
(EMBC), Honolulu, HI, Jul. 2018, pp. 2567–2570.
[88] F. Cai, J. M. Correll, S. H. Lee, Y. Lim, V. Bothra, Z. Zhang, M. P.
Flynn, and W. D. Lu, “A fully integrated reprogrammable memristor–
CMOS system for efficient multiply–accumulate operations,” Nature
Electronics, vol. 2, no. 7, pp. 290–299, 2019.
[89] T. Hirtzlin, M. Bocquet, B. Penkovsky, J.-O. Klein, E. Nowak,
E. Vianello, J.-M. Portal, and D. Querlioz, “Digital Biologically Plau-
sible Implementation of Binarized Neural Networks With Differential
Hafnium Oxide Resistive Memory Arrays,” Frontiers in Neuroscience,
vol. 13, 2019.
[90] J. E. Stone, D. Gohara, and G. Shi, “OpenCL: A parallel programming
standard for heterogeneous computing systems,” Computing in Science
& Engineering, vol. 12, no. 3, pp. 66–73, 2010.
[91] K. Guo, S. Zeng, J. Yu, Y. Wang, and H. Yang, “A Survey of FPGA-
Based Neural Network Inference Accelerator,” ACM Transactions on
Reconfigurable Technology and Systems (TRETS), vol. 12, no. 1, pp.
1–26, 2019.
[92] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang,
“High-Level Synthesis for FPGAs: From Prototyping to Deployment,”
IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, vol. 30, no. 4, pp. 473–491, 2011.
[93] C. Lammie, W. Xiang, and M. R. Azghadi, “Accelerating Deterministic
and Stochastic Binarized Neural Networks on FPGAS using OpenCL,”
in Proceedings of the IEEE International Midwest Symposium on
Circuits and Systems (MWSCAS), Dallas, TX., Aug. 2019, pp. 626–
629.
[94] C. Lammie, A. Olsen, T. Carrick, and M. R. Azghadi, “Low-Power and
High-Speed Deep FPGA Inference Engines for Weed Classification at
the Edge,” IEEE Access, vol. 7, pp. 51171–51184, 2019.
[95] M. Carreras, G. Deriu, L. Raffo, L. Benini, and P. Meloni, “Optimizing
Temporal Convolutional Network inference on FPGA-based accelera-
tors,” arXiv preprint arXiv:2005.03775, 2020.
[96] C. Lammie, W. Xiang, and M. R. Azghadi, “Training Progres-
sively Binarizing Deep Networks Using FPGAs,” arXiv preprint
arXiv:2001.02390, 2020.
[97] C. Lammie and M. R. Azghadi, “Stochastic Computing for Low-Power
and High-Speed Deep Learning on FPGA,” in Proceedings of the IEEE
International Symposium on Circuits and Systems (ISCAS), Sapporo,
Japan., May 2019.
[98] D. Wang, K. Xu, and D. Jiang, “PipeCNN: An OpenCL-based open-
source FPGA accelerator for convolution neural networks,” in Pro-
ceedings of the International Conference on Field Programmable
Technology (ICFPT), Melbourne, Australia., Dec. 2017, pp. 279–282.
[99] M. Wess, P. S. Manoj, and A. Jantsch, “Neural network based ECG
anomaly detection on FPGA and trade-off analysis,” in Proceedings of
the IEEE International Symposium on Circuits and Systems (ISCAS),
Baltimore, MD., May 2017.
[100] A. Sanaullah, C. Yang, Y. Alexeev, K. Yoshii, and M. C. Herbordt,
“Real-time data analysis for medical diagnosis using FPGA-accelerated
neural networks,” BMC Bioinformatics, vol. 19, no. 18, p. 490, 2018.
[101] R. R. Shrivastwa, V. Pudi, and A. Chattopadhyay, “An FPGA-Based
Brain Computer Interfacing Using Compressive Sensing and Machine
Learning,” in Proceedings of the IEEE Computer Society Annual
Symposium on VLSI (ISVLSI), Hong Kong, China., Jul. 2018, pp. 726–
731.
[102] Z. Chen, A. Howe, H. T. Blair, and J. Cong, “CLINK: Compact LSTM
Inference Kernel for Energy Efficient Neurofeedback Devices,” in
Proceedings of the International Symposium on Low Power Electronics
and Design (ISLPED), Bellevue, WA., Jul. 2018.
[103] F. C. Bauer, D. R. Muir, and G. Indiveri, “Real-time ultra-low power
ECG anomaly detection using an event-driven neuromorphic proces-
sor,” IEEE Transactions on Biomedical Circuits and Systems, 2019.
[104] E. Donati, M. Payvand, N. Risi, R. Krause, K. Burelo, G. Indiveri,
T. Dalgaty, and E. Vianello, “Processing EMG signals using reservoir
computing on an event-based neuromorphic system,” in Proceedings
of the IEEE Biomedical Circuits and Systems Conference (BioCAS),
Cleveland, Ohio., Oct. 2018.
[105] E. Donati, M. Payvand, N. Risi, R. Krause, and G. Indiveri, “Discrim-
ination of EMG Signals Using a Neuromorphic Implementation of a
Spiking Neural Network,” IEEE Transactions on Biomedical Circuits
and Systems, vol. 13, no. 5, pp. 795–803, 2019.
[106] J. Behrenbeck, Z. Tayeb, C. Bhiri, C. Richter, O. Rhodes, N. Kasabov,
J. I. Espinosa-Ramos, S. Furber, G. Cheng, and J. Conradt, “Classifi-
cation and regression of spatio-temporal signals using NeuCube and its
realization on SpiNNaker neuromorphic hardware,” Journal of Neural
Engineering, vol. 16, no. 2, p. 026014, 2019.
[107] E. Nurse, B. S. Mashford, A. J. Yepes, I. Kiral-Kornek, S. Harrer, and
D. R. Freestone, “Decoding EEG and LFP Signals using Deep Learn-
ing: Heading TrueNorth,” in Proceedings of the ACM International
Conference on Computing Frontiers (CF), Como, Italy., May 2016,
pp. 259–266.
[108] L. Ohno-Machado and D. Bialek, “Diagnosing breast cancer from fnas:
variable relevance in neural network and logistic regression models,”
Studies in Health Technology and Informatics, vol. 52, pp. 537–540,
1998.
[109] Y. Ku, W. Tompkins, and Q. Xue, “Artificial neural network for
ECG arrhythmia monitoring,” in Proceedings of the International Joint
Conference on Neural Networks (IJCNN), vol. 2. Baltimore, MD.:
IEEE, Jun. 1992, pp. 987–992.
[110] J. K. Eshraghian, “Human ownership of artificial creativity,” Nature
Machine Intelligence, pp. 157–160, 2020.
[111] S. Moradi, N. Qiao, F. Stefanini, and G. Indiveri, “A Scalable Multicore
Architecture With Heterogeneous Memory Structures for Dynamic
Neuromorphic Asynchronous Processors (DYNAPs),” IEEE Transac-
tions on Biomedical Circuits and Systems, vol. 12, no. 1, pp. 106–122,
Feb. 2018.
[112] S. Benatti, F. Casamassima, B. Milosevic, E. Farella, P. Schönle,
S. Fateh, T. Burger, Q. Huang, and L. Benini, “A Versatile Embed-
ded Platform for EMG Acquisition and Gesture Recognition,” IEEE
Transactions on Biomedical Circuits and Systems, vol. 9, no. 5, pp.
620–630, 2015.
[113] F. Montagna, A. Rahimi, S. Benatti, D. Rossi, and L. Benini,
“PULP-HD: Accelerating Brain-inspired High-dimensional Comput-
ing on a Parallel Ultra-low Power Platform,” in Proceedings of the
ACM/ESDA/IEEE Design Automation Conference (DAC), San Fran-
cisco, CA., Jun. 2018.
[114] S. B. Furber, D. R. Lester, L. A. Plana, J. D. Garside, E. Painkras,
S. Temple, and A. D. Brown, “Overview of the SpiNNaker System
Architecture,” IEEE Transactions on Computers, vol. 62, no. 12, pp.
2454–2467, 2013.
[115] P. Merolla and K. Boahen, “A Recurrent Model of Orientation Maps
with Simple and Complex Cells,” in Proceedings of Advances in Neural
Information Processing Systems 17 (NIPS), Vancouver, Canada., Dec.
2004, pp. 1995–2002.
[116] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. Choday,
G. Dimou, P. Joshi, N. Imam, S. Jain et al., “Loihi: A Neuromorphic
Manycore Processor with On-Chip Learning,” IEEE Micro, vol. 38,
no. 1, pp. 82–99, 2018.
[117] C. Frenkel, M. Lefebvre, J.-D. Legat, and D. Bol, “A 0.086-mm² 12.7-
pJ/SOP 64k-Synapse 256-Neuron Online-Learning Digital Spiking
Neuromorphic Processor in 28-nm CMOS,” IEEE Transactions on
Biomedical Circuits and Systems, vol. 13, no. 1, pp. 145–158, 2019.
[118] C. Frenkel, J.-D. Legat, and D. Bol, “MorphIC: A 65-nm 738k-
Synapse/mm² Quad-Core Binary-Weight Digital Neuromorphic Pro-
cessor With Stochastic Spike-Driven Online Learning,” IEEE Transac-
tions on Biomedical Circuits and Systems, vol. 13, no. 5, pp. 999–1010,
2019.
[119] N. Qiao, H. Mostafa, F. Corradi, M. Osswald, F. Stefanini, D. Sum-
islawska, and G. Indiveri, “A reconfigurable on-line learning spiking
neuromorphic processor comprising 256 neurons and 128K synapses,”
Frontiers in Neuroscience, vol. 9, p. 141, 2015.
[120] M. R. Azghadi, S. Moradi, D. B. Fasnacht, M. S. Ozdas, and G. Indi-
veri, “Programmable spike-timing-dependent plasticity learning circuits
in neuromorphic VLSI architectures,” ACM Journal on Emerging Tech-
nologies in Computing Systems (JETC), vol. 12, no. 2, p. art. no. 17,
2015.
[121] M. Payvand and G. Indiveri, “Spike-based Plasticity Circuits for
Always-on On-line Learning in Neuromorphic Systems,” in Proceed-
ings of the IEEE International Symposium on Circuits and Systems
(ISCAS), Sapporo, Japan., May 2019.
[122] J. Kaiser, H. Mostafa, and E. Neftci, “Synaptic Plasticity Dynamics
for Deep Continuous Local Learning (DECOLLE),” Frontiers in Neu-
roscience, vol. 14, p. 424, 2020.
[123] G. Bellec, F. Scherr, E. Hajek, D. Salaj, A. Subramoney, R. Legenstein,
and W. Maass, “Eligibility traces provide a data-inspired alternative to
backpropagation through time,” Vancouver, Canada., Dec. 2019.
[124] J. Sacramento, R. P. Costa, Y. Bengio, and W. Senn, “Dendritic
cortical microcircuits approximate the backpropagation algorithm,”
in Proceedings of the Conference on Neural Information Processing
Systems (NIPS), Montreal, Canada., Dec. 2018, pp. 8721–8732.
[125] A. Valentian, F. Rummens, E. Vianello et al., “Fully Integrated Spiking
Neural Network with Analog Neurons and RRAM Synapses,” in Pro-
ceedings of the IEEE International Electron Devices Meeting (IEDM),
San Francisco, CA., Dec. 2019, pp. 14.13.1–14.13.4.
[126] Y. Hayakawa, A. Himeno, R. Yasuhara, W. Boullart et al., “Highly reliable TaOx ReRAM with centralized filament for 28-nm embedded application,” in Proceedings of the Symposium on VLSI Technology, 2015, pp. T14–T15.
[127] T. Dalgaty, M. Payvand, F. Moro, D. R. Ly, F. Pebay-Peyroula, J. Casas,
G. Indiveri, and E. Vianello, “Hybrid neuromorphic circuits exploiting
non-conventional properties of RRAM for massively parallel local
plasticity mechanisms,” APL Materials, vol. 7, no. 8, p. 081125, 2019.
[128] M. Payvand, Y. Demirag, T. Dalgaty, E. Vianello, and G. Indiveri, “Analog weight updates with compliance current modulation of binary ReRAMs for on-chip learning,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), 2020.
[129] Q. Wang, X. Wang, S. H. Lee, F.-H. Meng, and W. D. Lu, “A Deep Neural Network Accelerator Based on Tiled RRAM Architecture,” in Proceedings of the IEEE International Electron Devices Meeting (IEDM), San Francisco, CA., Dec. 2019, pp. 14–4.
... Such workloads can be accelerated for low latency and low energy when processed on neuromorphic hardware that uses asynchronous, fine-grained processing to efficiently handle spiking signals and parallel operations [8]. Much like in the brain, spikes are thought to encode information over time, and have been shown to improve the energy efficiency of sequence-based computer vision tasks by several orders of magnitude across a variety of workloads [9,10,11]. ...
Preprint
Full-text available
Autonomous driving demands an integrated approach that encompasses perception, prediction, and planning, all while operating under strict energy constraints to enhance scalability and environmental sustainability. We present Spiking Autonomous Driving (SAD), the first unified Spiking Neural Network (SNN) to address the energy challenges faced by autonomous driving systems through its event-driven and energy-efficient nature. SAD is trained end-to-end and consists of three main modules: perception, which processes inputs from multi-view cameras to construct a spatiotemporal bird's eye view; prediction, which utilizes a novel dual-pathway with spiking neurons to forecast future states; and planning, which generates safe trajectories considering predicted occupancy, traffic rules, and ride comfort. Evaluated on the nuScenes dataset, SAD achieves competitive performance in perception, prediction, and planning tasks, while drawing upon the energy efficiency of SNNs. This work highlights the potential of neuromorphic computing to be applied to energy-efficient autonomous driving, a critical step toward sustainable and safety-critical automotive technology. Our code is available at https://github.com/ridgerchu/SAD.
... To improve energy efficiency, we can leverage various energy-efficient hardware accelerators. For example, DNNs could be implemented on edge hardware accelerators (Aimar et al., 2018a; Deng et al., 2020; Gao et al., 2020; Kim et al., 2022; Lee et al., 2018; Liu et al., 2022), while convolutional neural networks (CNNs), in particular, could be converted to rate-based spiking neural networks (SNNs) via an SNN toolbox (Rueckauer et al., 2017) and then deployed on spiking hardware accelerators (Azghadi et al., 2020; Basu et al., 2022; Chien et al., 2018; Davies et al., 2018). In addition to the use of hardware accelerators, we can also use asynchronous readout electronics to process the continuous recordings from the patch (Cuenca-Michans et al., 2023) and then deploy DNNs that can process these event-driven data and fuse them with recordings from other sensor modalities (Neil & Liu, 2016). ...
Thesis
Full-text available
Sweat biomarkers offer valuable insights into the health conditions of individuals. Despite the recent advances in wearable technologies that enable real-time monitoring of sweat biomarkers, their potential to infer health conditions remains largely unexplored. This thesis, conducted as part of the WeCare project, leverages machine learning (ML) models including deep neural networks (DNNs), for real-time predictive health monitoring using these sweat biomarkers. Our research primarily focuses on predicting physiological states such as hydration status and core body temperature during exercise. One version of the wearable sweat patch developed by our WeCare partner at the Instituto de Microelectrónica de Barcelona (IMB-CNM) uses ion-sensitive field-effect transistors (ISFETs). While these sensors are sensitive, lightweight, and cost-effective, they are prone to sensor drift. Previous work shows that DNNs are promising for predicting ionic concentration from ISFET sensor readings with the presence of sensor drift. However, training DNNs requires large labeled datasets that are difficult to collect. To address this, we first construct a physical model for ISFET sensors that simulates sensor readings and takes into account sensor drift. We then train an end-to-end prediction neural network as a sensor calibration tool on these simulated readings. Our prediction network outperforms two manual calibration methods in predicting sodium concentration from uncalibrated real-world sodium ISFET readings, suggesting its promise for future calibration of wearable patches using ISFETs. Next, we carry out a study aimed at designing personalized hydration strategies based on noninvasive biomarkers. We examine the feasibility of using ML models to predict the hydration status of an athlete using physiological and sweat biomarker recordings collected from a subject during a set of indoor cycling sessions supervised by the Lausanne University Hospital (CHUV). 
Because the wearable sweat patches were still under development at that stage, absorbent patches were used for sweat sample collection. We also compared the performance of nonlinear ML models with the linear model on this predictive task. This investigation provides insights for future hydration status predictions using ML models on sweat biomarker data collected from wearable sweat patches. Finally, following the available printed sensor patch developed by the Soft Transducers Lab at École Polytechnique Fédérale de Lausanne (EPFL-LMTS), we determine the prediction accuracy of core body temperature during exercise using real-time sweat biomarkers measured with this wearable prototype and with the addition of other biomarker data collected with commercial devices. All experimental sessions were conducted at CHUV. Our results indicate that DNNs can accurately and continuously predict core body temperature solely from sweat biomarker data, specifically sweat sodium and potassium concentrations collected from the wearable patch. Moreover, our analysis of the collected sweat biomarker data shows that they can be used to predict future core body temperature values. Our findings highlight the potential of integrating advanced predictive models with wearable sweat patches for real-time and accurate prediction of physiological states.
... Neuromorphic computing systems often incorporate specialized hardware, such as neuromorphic chips or memristive devices, to enable the efficient execution of brain-inspired learning algorithms [117,118]. These systems have the potential to drastically improve the performance of machine learning applications, particularly in edge computing and real-time processing scenarios. ...
Article
Full-text available
Artificial neural networks (ANNs) have emerged as an essential tool in machine learning, achieving remarkable success across diverse domains, including image and speech generation, game playing, and robotics. However, there exist fundamental differences between ANNs’ operating mechanisms and those of the biological brain, particularly concerning learning processes. This paper presents a comprehensive review of current brain-inspired learning representations in artificial neural networks. We investigate the integration of more biologically plausible mechanisms, such as synaptic plasticity, to improve these networks’ capabilities. Moreover, we delve into the potential advantages and challenges accompanying this approach. In this review, we pinpoint promising avenues for future research in this rapidly advancing field, which could bring us closer to understanding the essence of intelligence.
... This leads to sparse activations, which can be efficiently exploited by neuromorphic hardware, such as Loihi [54], SpiNNaker [55] and TrueNorth [56]. It has been shown that these specialized hardware chips can reduce the energy consumption of neural network-based processes by factors of up to ×1000 [54,57–60]. Apart from their energy efficiency in prediction, recent attempts to increase the training efficiency of SNN can be found in [61–64]. ...
Article
Full-text available
Spiking neural networks (SNN), also often referred to as the third generation of neural networks, carry the potential for a massive reduction in memory and energy consumption over traditional, second-generation neural networks. Inspired by the undisputed efficiency of the human brain, they introduce temporal and neuronal sparsity, which can be exploited by next-generation neuromorphic hardware. Energy efficiency plays a crucial role in many engineering applications, for instance, in structural health monitoring. Machine learning in engineering contexts, especially in data-driven mechanics, focuses on regression. While regression with SNN has already been discussed in a variety of publications, in this contribution, we provide a novel formulation for its accuracy and energy efficiency. In particular, a network topology for decoding binary spike trains to real numbers is introduced, using the membrane potential of spiking neurons. Several different spiking neural architectures, ranging from simple spiking feed-forward to complex spiking long short-term memory neural networks, are derived. Since the proposed architectures do not contain any dense layers, they exploit the full potential of SNN in terms of energy efficiency. At the same time, the accuracy of the proposed SNN architectures is demonstrated by numerical examples, namely different material models. Linear and nonlinear, as well as history-dependent material models, are examined. While this contribution focuses on mechanical examples, the interested reader may regress any custom function by adapting the published source code.
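The membrane-potential decoding idea described above — reading a real number out of a binary spike train through the membrane potential of a spiking neuron — can be sketched with a toy model. This is an illustrative sketch only: the non-spiking leaky integrator and all parameter values below are assumptions, not the authors' published architecture.

```python
import numpy as np

def membrane_readout(spike_train, tau=10.0, dt=1.0):
    """Decode a binary spike train to a real number via the membrane
    potential of a leaky integrator with no firing threshold."""
    decay = np.exp(-dt / tau)  # exponential leak per time step
    v = 0.0
    for s in spike_train:
        v = decay * v + s      # leak, then integrate the incoming spike
    return v

# A denser spike train drives the membrane to a higher final value,
# yielding a graded real-valued output from purely binary events.
sparse = membrane_readout(np.array([1, 0, 0, 0, 1, 0, 0, 0]))
dense = membrane_readout(np.array([1, 1, 1, 1, 1, 1, 1, 1]))
```

Because the readout is a weighted sum of past spikes, spike density (and timing) maps monotonically onto the decoded value, which is what makes regression with binary spike trains possible.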
... In [34–36], the high-frequency (HF) component represents cardiac parasympathetic nerve activity during rest, and the low-frequency (LF) component represents sympathetic nerve activity during stress. Thus, the LF component and the low-frequency-to-high-frequency ratio (LF/HF) are expected to be higher during stress conditions, and the HF component to be lower. ...
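The LF, HF, and LF/HF quantities mentioned in the excerpt above are computed from the power spectrum of an evenly resampled RR-interval series. A minimal sketch using a plain FFT periodogram follows; the band limits follow standard HRV conventions, while the synthetic signal and the 4 Hz resampling rate are assumptions for illustration.

```python
import numpy as np

def lf_hf_ratio(rr_resampled, fs=4.0):
    """LF/HF ratio from an evenly resampled (fs Hz) RR-interval series.
    LF band: 0.04-0.15 Hz; HF band: 0.15-0.40 Hz."""
    x = rr_resampled - np.mean(rr_resampled)      # remove DC component
    psd = np.abs(np.fft.rfft(x)) ** 2 / len(x)    # simple periodogram
    f = np.fft.rfftfreq(len(x), d=1.0 / fs)
    lf = psd[(f >= 0.04) & (f < 0.15)].sum()      # band power in LF
    hf = psd[(f >= 0.15) & (f < 0.40)].sum()      # band power in HF
    return lf / hf

# Synthetic 5-minute RR series at 4 Hz: a dominant 0.1 Hz (LF)
# oscillation plus a weaker 0.3 Hz (HF) one -> ratio well above 1,
# as expected for a stress-like sympathetic-dominant recording.
t = np.arange(0, 300, 0.25)
rr = 0.8 + 0.05 * np.sin(2 * np.pi * 0.1 * t) + 0.01 * np.sin(2 * np.pi * 0.3 * t)
ratio = lf_hf_ratio(rr)
```

With these amplitudes the ratio works out to roughly (0.05/0.01)² = 25, illustrating how an LF-dominated spectrum yields a high LF/HF value.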
Preprint
Stress is a psychological condition arising from the body's response to a challenging situation. If a person is exposed to various forms of stress over prolonged periods, their physical and mental health can be negatively affected, leading to chronic health problems. It is important to detect stress in its initial stages to prevent psychological and physical stress-related issues, so there must be alternative and effective solutions for spontaneous stress monitoring. Wearable sensors are one of the most prominent such solutions, given their capacity to collect data continuously in real time and their non-intrusive nature; they can continuously monitor vital signs, e.g., heart rate and activity. Yet most existing works have focused on data acquired in controlled settings. To this end, our study proposes a machine learning-based approach for detecting the onset of stress in a free-living environment using wearable sensors. The authors utilized the SWEET dataset, collected from 240 subjects via electrocardiography (ECG), skin temperature (ST), and skin conductance (SC). Four machine learning models were trained and tested on this dataset across four data scenarios: K-Nearest Neighbors (KNN), Support Vector Classification (SVC), Decision Tree (DT), and Random Forest (RF). The KNN model had the highest accuracy, 98%, while the other models also performed satisfactorily.
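The nearest-neighbour classification used in the study above can be sketched in a few lines. The toy features (heart rate and skin conductance) and labels below are synthetic assumptions for illustration only; they are not the SWEET data, and this is not the study's pipeline.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Minimal k-nearest-neighbour classifier using Euclidean distance
    and a majority vote over the k closest training samples."""
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)      # distance to all samples
        nearest = y_train[np.argsort(d)[:k]]         # labels of k nearest
        preds.append(np.bincount(nearest).argmax())  # majority vote
    return np.array(preds)

# Toy features: [heart rate (bpm), skin conductance (uS)];
# label 1 = stressed, 0 = relaxed (synthetic clusters).
rng = np.random.default_rng(0)
relaxed = rng.normal([65, 2.0], [5, 0.3], size=(50, 2))
stressed = rng.normal([95, 6.0], [5, 0.3], size=(50, 2))
X = np.vstack([relaxed, stressed])
y = np.array([0] * 50 + [1] * 50)
pred = knn_predict(X, y, np.array([[60, 1.8], [100, 6.5]]))
```

In practice the features would be normalized first, since KNN distances are sensitive to the very different scales of heart rate and conductance.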
... The chips use Very Large Scale Integration (VLSI) to realize these units for computation, while algorithms are implemented on neuromorphic chips through layers of logic gates. This helps SNNs fulfill their potential and reduce energy consumption by a factor of up to 10³ [77,80,81]. ...
Article
Full-text available
The present study aims to develop a sustainable framework employing brain-inspired neural networks for solving boundary value problems in Engineering Mechanics. Spiking neural networks, known as the third generation of artificial neural networks, are proposed for physics-based artificial intelligence. Accompanied by a new pseudo-explicit integration scheme based on spiking recurrent neural networks, leading to a spike-based pseudo-explicit integration scheme, the underlying differential equations are solved with a physics-informed strategy. We additionally propose a third-generation spike-based Legendre Memory Unit that handles long sequences. These third-generation networks can be implemented on coming-of-age neuromorphic hardware, resulting in lower energy and memory consumption. The proposed framework, although implicit, is viewed as a pseudo-explicit scheme since it requires few or no online training steps to achieve a converged solution, even for unseen loading sequences. The proposed framework is deployed in a Finite Element solver for plate structures undergoing cyclic loading, and a Xylo-Av2 SynSense neuromorphic chip is used to assess its energy performance. An acceleration of more than 40% compared to classical Finite Element Method simulations and the capability of online training are observed. We also see a reduction in energy consumption of up to three orders of magnitude.
Preprint
Full-text available
Memristive neuromorphic systems are designed to emulate human perception and cognition, where the memristor states represent essential historical information to perform both low-level and high-level tasks. However, current systems face challenges with the separation of state modulation and acquisition, leading to undesired time delays that impact real-time performance. To overcome this issue, we introduce a dual-function circuit that concurrently modulates and acquires memristor state information. This is achieved through two key features: 1) a feedback operational amplifier (op-amp) based circuit that ensures precise voltage application on the memristor while converting the passing current into a voltage signal; 2) a division calculation circuit that acquires state information from the modulation voltage and the converted voltage, improving stability by leveraging the intrinsic threshold characteristics of memristors. This circuit has been evaluated in a memristor-based nociceptor and a memristor crossbar, demonstrating exceptional performance. For instance, it achieves mean absolute acquisition errors below 1 Ω during the modulation process in the nociceptor application. These results demonstrate that the proposed circuit can operate at different scales, holding the potential to enhance a wide range of neuromorphic applications.
Article
As societies age, the issue of falls has become increasingly critical for the health and safety of the elderly. Fall detection in the elderly has traditionally relied on supervised learning methods, which require data on falls that is difficult to obtain in real situations. Additionally, the complexity of integrating deep learning models into wearable devices for real-time fall detection has been challenging due to limited computational resources. In this paper, we propose a novel fall detection method using unsupervised learning based on a denoising long short-term memory (LSTM)-based convolutional variational autoencoder (CVAE) model to solve the problem of the lack of fall data. By utilizing the proposed data debugging and hierarchical data balancing techniques, the proposed method achieves an F1 score of 1.0 while reducing the parameter count by 25.6 times compared to the state-of-the-art unsupervised deep learning method. The resulting model occupies only 157.65 KB of memory, making it highly suitable for integration into wearable devices.
Conference Paper
Full-text available
Many edge computing and IoT applications require adaptive and on-line learning architectures for fast and low-power processing of locally sensed signals. A promising class of architectures to solve this problem is that of in-memory computing ones, based on event-based hybrid memristive-CMOS devices. In this work, we present an example of such systems that supports always-on on-line learning. To overcome the problems of variability and limited resolution of ReRAM memristive devices used to store synaptic weights, we propose to use only their High Conductive State (HCS) and control their desired conductance by modulating their programming Compliance Current (ICC). We describe the spike-based learning CMOS circuits that are used to modulate the synaptic weights and demonstrate the relationship between the synaptic weight, the device conductance, and the ICC used to set its weight, with experimental measurements from a 4-kb array of HfO₂-based devices. To validate the approach and the circuits presented, we present circuit simulation results for a standard CMOS 180 nm process and system-level behavioral simulations for classifying handwritten digits from the MNIST data-set with classification accuracy of 92.68% on the test set.
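The ICC-to-weight relationship described above can be captured by a simple behavioral model: within the High Conductive State, a larger programming compliance current yields a higher device conductance, and hence a stronger synapse. The linear mapping and all numeric ranges below are illustrative assumptions, not measured device data.

```python
import numpy as np

def program_conductance(i_cc, g_min=50e-6, g_max=500e-6,
                        i_min=10e-6, i_max=100e-6):
    """Behavioral sketch: HCS conductance grows monotonically with the
    programming compliance current I_CC, clipped to the device range.
    Units: conductances in siemens, currents in amperes (assumed)."""
    i = np.clip(i_cc, i_min, i_max)
    return g_min + (g_max - g_min) * (i - i_min) / (i_max - i_min)

# A larger compliance current programs a more conductive (stronger) synapse.
g_lo = program_conductance(20e-6)   # weak synapse
g_hi = program_conductance(80e-6)   # strong synapse
```

A monotonic mapping like this is all a learning circuit needs: to strengthen a synapse it re-programs the device with a higher ICC, sidestepping the limited resolution of direct conductance tuning.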
Article
Full-text available
Bowel sounds (BSs), typically generated by intestinal peristalses, are a significant physiological indicator of the digestive system's health condition. In this study, a wearable BS monitoring system is presented for long-term BS monitoring. The system features a wearable BS sensor that can record BSs for days and transmit them wirelessly in real time. With the system, BS data from a total of 20 subjects were collected in a hospital environment. Afterward, CNNs are introduced for BS segment recognition. Specifically, this study proposes a novel CNN design method that makes it possible to transfer popular CNN modules from image recognition into the BS segmentation domain. Experimental results show that, in holdout evaluation with corrected labels, the designed CNN model achieves a moderate accuracy of 91.8% and the highest sensitivity of 97.0% compared with similar works. In cross-validation with noisy labels, the designed CNN delivers the best generalizability.
Article
Full-text available
Recent review papers have investigated seizure prediction, creating the possibility of preempting epileptic seizures. Correct seizure prediction can significantly improve the standard of living for the majority of epileptic patients, as the unpredictability of seizures is a major concern for them. Today, the development of algorithms, particularly in the field of machine learning, enables reliable and accurate seizure prediction using desktop computers. However, despite extensive research effort being devoted to developing seizure detection integrated circuits (ICs), dedicated seizure prediction ICs have not been developed yet. We believe that interdisciplinary study of system architecture, analog and digital ICs, and machine learning algorithms can promote the translation of scientific theory to a more realistic intelligent, integrated, and low-power system that can truly improve the standard of living for epileptic patients. This review explores topics ranging from signal acquisition analog circuits to classification algorithms and dedicated digital signal processing circuits for detection and prediction purposes, to provide a comprehensive and useful guideline for the construction, implementation and optimization of wearable and integrated smart seizure prediction systems.
Article
Brain-Computer Interface (BCI) is a system empowering humans to communicate with or control the outside world with exclusively brain intentions. Electroencephalography (EEG) based BCIs are promising solutions due to their convenient and portable instruments. Despite the extensive research on EEG in recent years, it is still challenging to interpret EEG signals effectively due to the massive noise in EEG signals (e.g., low signal-to-noise ratio and incomplete EEG signals) and difficulties in capturing the inconspicuous relationships between EEG signals and certain brain activities. Most existing works either consider EEG only as chain-like sequences, neglecting complex dependencies between adjacent signals, or require pre-processing such as transforming EEG waves into images. In this paper, we introduce both cascade and parallel convolutional recurrent neural network models for precisely identifying human intended movements and instructions, effectively learning the compositional spatio-temporal representations of raw EEG streams. Extensive experiments on a large-scale movement intention EEG dataset (108 subjects, 3,145,160 EEG records) have demonstrated that both models achieve high accuracy near 98.3% and outperform a set of baseline methods and most recent deep learning based EEG recognition models, yielding a significant accuracy increase of 18% in the cross-subject validation scenario. The developed models are further evaluated with a real-world BCI and achieve a recognition accuracy of 93% over five instruction intentions. This suggests the proposed models are able to generalize over different kinds of intentions and BCI systems.
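The cascade architecture described above — convolutional feature extraction over EEG channels followed by a recurrent layer over time — can be illustrated with a NumPy toy. The random weights and tiny dimensions are assumptions for illustration; the actual models are far larger and trained end-to-end.

```python
import numpy as np

rng = np.random.default_rng(1)

def conv1d(x, w):
    """Valid-mode 1-D convolution over the time axis of a
    (channels x time) array, followed by a ReLU nonlinearity."""
    c_out, c_in, k = w.shape
    t_out = x.shape[1] - k + 1
    y = np.zeros((c_out, t_out))
    for o in range(c_out):
        for t in range(t_out):
            y[o, t] = np.sum(w[o] * x[:, t:t + k])
    return np.maximum(y, 0.0)

def rnn_last_state(x, w_x, w_h):
    """Simple tanh RNN over time; returns the final hidden state,
    which a linear layer would map to intention classes."""
    h = np.zeros(w_h.shape[0])
    for t in range(x.shape[1]):
        h = np.tanh(w_x @ x[:, t] + w_h @ h)
    return h

# Cascade: conv extracts per-window spatial features across channels,
# then the RNN models their temporal evolution.
eeg = rng.standard_normal((8, 32))                     # 8 channels, 32 samples
feats = conv1d(eeg, rng.standard_normal((4, 8, 5)) * 0.1)
h = rnn_last_state(feats, rng.standard_normal((6, 4)) * 0.1,
                   rng.standard_normal((6, 6)) * 0.1)
```

The parallel variant mentioned in the abstract would instead run the convolutional and recurrent branches on the raw input side by side and concatenate their outputs.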
Article
Memristive devices have shown great promise to facilitate the acceleration and improve the power efficiency of Deep Learning (DL) systems. Crossbar architectures constructed using these Resistive Random-Access Memory (RRAM) devices can be used to efficiently implement various in-memory computing operations, such as Multiply-Accumulate (MAC) and unrolled convolutions, which are used extensively in Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs). However, memristive devices face concerns of aging and non-idealities, which limit the accuracy, reliability, and robustness of Memristive Deep Learning Systems (MDLSs) and should be considered prior to circuit-level realization. This Original Software Publication (OSP) presents MemTorch, an open-source framework for customized large-scale memristive Deep Learning (DL) simulations, with a refined focus on the co-simulation of device non-idealities. MemTorch also facilitates co-modelling of key crossbar peripheral circuitry. MemTorch adopts a modernized software engineering methodology and integrates directly with the well-known PyTorch Machine Learning (ML) library.
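The in-memory MAC operation that such crossbars implement, and the device non-idealities that frameworks like MemTorch co-simulate, can be sketched in a few lines. The log-normal variability model and all values below are illustrative assumptions; this is not MemTorch's API.

```python
import numpy as np

rng = np.random.default_rng(42)

def crossbar_mac(v_in, g_ideal, sigma=0.05):
    """In-memory multiply-accumulate on an RRAM crossbar: column
    currents I = V G follow from Ohm's law and Kirchhoff's current
    law. Log-normal device-to-device variability perturbs the
    programmed conductances away from their ideal values."""
    g_actual = g_ideal * rng.lognormal(mean=0.0, sigma=sigma,
                                       size=g_ideal.shape)
    return v_in @ g_actual

v = np.array([0.1, 0.2, 0.3])       # input voltages on the rows
g = np.array([[1.0, 2.0],           # 3x2 conductance matrix (mS, assumed)
              [0.5, 1.0],
              [2.0, 0.5]])
i_out = crossbar_mac(v, g)          # column currents approximate v @ g
```

Because each column current is a single analog summation, the MAC completes in one read step regardless of the number of rows — the source of the crossbar's efficiency, at the cost of the variability modeled here.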
Article
Photoplethysmographic (PPG) measurements from ambulatory subjects may suffer from unreliability due to body movements and from missing data segments due to loosening of the sensor. This paper describes an on-device reliability assessment of PPG measurements using a stacked denoising autoencoder (SDAE) and a multilayer perceptron neural network (MLPNN). The missing segments were predicted by a personalized convolutional neural network (CNN) and long short-term memory (LSTM) model using a short history of the same-channel data. 40 sets of volunteers' data, consisting of an equal share of healthy and cardiovascular subjects, were used for validation and testing. The PPG reliability assessment model (PRAM) achieved over 95% accuracy in correctly identifying acceptable PPG beats out of a total of 5000, using expert-annotated data. Disagreement with the experts' annotation was nearly 3.5%. The missing segment prediction model (MSPM) achieved a root mean square error (RMSE) of 0.22 and a mean absolute error (MAE) of 0.11 for 40-beat missing-data prediction using only a four-beat history from the same-channel PPG. The two models were integrated in a standalone device based on a quad-core ARM Cortex-A53 at 1.2 GHz with 1 GB RAM, with a 130 MB memory requirement and latency of ~0.35 s per beat prediction with a 30 s frame. The present method also shows improved performance over published works on PPG quality assessment and missing data prediction using two public PhysioNet datasets, CinC and MIMIC-II.
Article
Convolutional Neural Networks (CNNs) are extensively used in a wide range of applications, commonly including computer vision tasks like image and video classification, recognition, and segmentation. Recent research results demonstrate that multi-layer (deep) networks involving mono-dimensional convolutions and dilation can be effectively used in time series and sequence classification and segmentation, as well as in tasks involving sequence modeling. These structures, commonly referred to as Temporal Convolutional Networks (TCNs), represent an extremely promising alternative to recurrent architectures, commonly used across a broad range of sequence modeling tasks [1]. While FPGA-based inference accelerators for classic CNNs are widespread, the literature lacks a quantitative evaluation of their usability for inference on TCN models. In this paper we present such an evaluation, considering a CNN accelerator with specific features supporting TCN kernels as a reference and a set of state-of-the-art TCNs as a benchmark. Experimental results show that, during TCN execution, operational intensity can be critical for the overall performance. We propose a convolution scheduling based on batch processing that can boost efficiency up to 96% of theoretical peak performance. Overall, we achieve up to 111.8 GOPS and a power efficiency of 33.8 GOPS/W on an Ultrascale+ ZU3EG (up to 10× speedup and 3× power efficiency improvement with respect to a pure software implementation).
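The dilated mono-dimensional convolutions underlying TCNs can be sketched with a minimal causal implementation. The two-tap kernel and dilation factor below are chosen purely for illustration; real TCNs stack many such layers with residual connections.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """Causal dilated 1-D convolution: y[t] depends only on
    x[t], x[t-d], x[t-2d], ... (inputs before t=0 are zero-padded)."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # left-pad so output is causal
    return np.array([sum(w[j] * xp[t + pad - j * dilation] for j in range(k))
                     for t in range(len(x))])

# With kernel [1, 1] and dilation 2, each output is y[t] = x[t] + x[t-2].
x = np.arange(8, dtype=float)
y = causal_dilated_conv(x, np.array([1.0, 1.0]), dilation=2)
```

Stacking layers with dilations 1, 2, 4, 8, ... grows the receptive field exponentially with depth at a constant kernel size, which is what lets TCNs model long sequences without recurrence.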