Published in: IEEE Transactions on Biomedical Circuits and Systems, 14(6):1138-1159, 2020.
DOI: https://doi.org/10.1109/tbcas.2020.3036081
Preprint: arXiv:2007.05657v1 [cs.AR], 11 Jul 2020.
Accepted version archived at the Zurich Open Repository and Archive (ZORA), University of Zurich: https://doi.org/10.5167/uzh-200402
SUBMITTED TO IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS 1
Hardware Implementation of Deep Network
Accelerators Towards Healthcare and Biomedical
Applications
Mostafa Rahimi Azghadi, Senior Member, IEEE, Corey Lammie, Student Member, IEEE,
Jason K. Eshraghian, Member, IEEE, Melika Payvand, Member, IEEE, Elisa Donati, Member, IEEE,
Bernabé Linares-Barranco, Fellow, IEEE and Giacomo Indiveri, Senior Member, IEEE
Abstract—With the advent of dedicated Deep Learning (DL)
accelerators and neuromorphic processors, new opportunities are
emerging for applying deep and Spiking Neural Network (SNN)
algorithms to healthcare and biomedical applications at the edge.
This can facilitate the advancement of the medical Internet of
Things (IoT) systems and Point of Care (PoC) devices. In this
paper, we provide a tutorial describing how various technologies
ranging from emerging memristive devices, to established Field
Programmable Gate Arrays (FPGAs), and mature Complemen-
tary Metal Oxide Semiconductor (CMOS) technology can be used
to develop efficient DL accelerators to solve a wide variety of
diagnostic, pattern recognition, and signal processing problems in
healthcare. Furthermore, we explore how spiking neuromorphic
processors can complement their DL counterparts for processing
biomedical signals. After providing the required background,
we unify the sparsely distributed research on neural network
and neuromorphic hardware implementations as applied to the
healthcare domain. In addition, we benchmark various hardware
platforms by performing a biomedical electromyography (EMG)
signal processing task and drawing comparisons among them
in terms of inference delay and energy. Finally, we provide our
analysis of the field and share a perspective on the advantages,
disadvantages, challenges, and opportunities that different ac-
celerators and neuromorphic processors introduce to healthcare
and biomedical domains. This paper can serve a large audience,
ranging from nanoelectronics researchers, to biomedical and
healthcare practitioners in grasping the fundamental interplay
between hardware, algorithms, and clinical adoption of these
tools, as we shed light on the future of deep networks and
spiking neuromorphic processing systems as proponents for
driving biomedical circuits and systems forward.
Index Terms—Spiking Neural Networks, Deep Neural Net-
works, Neuromorphic Hardware, CMOS, Memristor, FPGA,
RRAM, Healthcare, Medical IoT, Point-of-Care
I. INTRODUCTION
Health and well-being are, undoubtedly, among the most fundamental concerns of human beings. This is evidenced by the sheer size and rapid growth of the global healthcare industry, which is projected to reach over 10
M. Rahimi Azghadi and Corey Lammie are with the College of Science
and Engineering, James Cook University, QLD 4811, Australia. e-mail:
mostafa.rahimiazghadi@jcu.edu.au
J. K. Eshraghian is with the Department of Electrical Engineering and
Computer Science, The University of Michigan, Ann Arbor, MI 48109-2122,
USA.
M. Payvand, E. Donati, and G. Indiveri are with the Institute of Neuroin-
formatics, University and ETH Zurich, Switzerland.
B. Linares-Barranco is with the Instituto de Microelectrónica de Sevilla IMSE-CNM (CSIC and Universidad de Sevilla), Sevilla, Spain.
trillion dollars by 2022 [1]. One of the most promising
technologies to advance this fast-growing industry is Artificial
Intelligence (AI) [2] and its implementation with Deep Learn-
ing (DL). DL has shown success in various domains and
as its reliability improves, it has pervaded various facets
of healthcare from monitoring [3], [4], to prediction [5],
diagnosis [6], treatment [7], and prognosis [8], as visualized
in Fig. 1(a). The figure shows that data collected from the patient (illustrated here as a biomedical signal, though it may be any one or a combination of other data types, such as bio-samples, medical images, temperature, or movement) can be processed by a smart DL system that monitors the patient for anomalies and/or predicts diseases. The prediction can
inform diagnosis, which itself can benefit from DL algorithms.
In addition, DL systems can be used to recommend treatment
options and prognosis, which further affect monitoring and
prediction in a closed-loop scenario. In every step of this loop,
there is a need for a DL training and inference procedure,
which requires significant computational resources.
The capacity of AI to meet or exceed the performance of
human experts in medical-data analysis [9], [10], [11] can,
in part, be attributed to the continued improvement of high-
performance computing platforms such as Graphics Processing
Units (GPUs) [12] and customized Machine Learning (ML)
hardware [13]. These can now process and learn from a large
amount of multi-modal heterogeneous general and medical
data [14]. This was not readily achievable a decade ago.
Although the DL field has been growing at an astonishing
rate in terms of software, algorithms, and architecture develop-
ments, its hardware accelerator development currently largely
relies on advances by a handful of giant technology companies,
most notably Nvidia and its GPUs [15], [16] and Google and
its Tensor Processing Units (TPUs) [13], in addition to new
startups and research groups developing Application Specific
Integrated Circuits (ASICs) for DL training and acceleration.
Similarly, while there are significant advances in tailoring deep
network models and algorithms for various healthcare and
biomedical applications [17], most medical deep networks are
currently trained and run on GPUs or in data centers [12], [18].
This mostly requires the use of cloud-based DL processors
which rely on costly and power-demanding data centers, as
opposed to the effective deployment of DL at the edge on an
increasing number of healthcare and medical IoT systems [19]
and PoC devices [20], as illustrated in Fig. 1(b). These devices
Fig. 1. A depiction of (a) the usage of DL in a smart healthcare setting, which typically involves monitoring, prediction, diagnosis, treatment, and prognosis.
The various parts of the DL-based healthcare system can run on (b) the three levels of the IoT, i.e. edge devices, edge nodes, and the cloud. However, for
healthcare IoT and PoC processing, edge learning and inference are preferred.
and systems should be as low-cost, compact, low-power, and rapid (high-throughput) as possible, to facilitate applications at the edge and make smart health monitoring
technology more viable and affordable for integration into
human life [21]. Furthermore, edge learning and/or inference
can enable systems which are mostly independent of the cloud.
This feature is critical for highly sensitive medical data and
offline operation, which are much desired in healthcare and
biomedical settings.
To facilitate at-edge processing, specialized embedded DL
accelerators such as Nvidia Jetson and Xavier series [22], as
well as Movidius Neural Compute Stick [23], [24] have been
produced. These devices and systems have been shown to be
quite suitable for healthcare edge or near-edge inference. More
recently, examples of specialized embedded hardware systems
for medical tasks, such as the Nvidia Clara Embedded, have
been proposed. This is a computing platform for edge-enabled
AI on the Internet of Medical Things (IoMT). However, as
these embedded devices are still relatively power-hungry and costly, they are not ideal learning/inference engines for ambient-assisted healthcare IoT applications and PoC systems.
Thus, there is a need for innovative systems that can satisfy the stringent requirements of healthcare edge devices, making them widely available, beneficial, and affordable to the community at large.
To that end, in this paper we focus on the use of three different hardware technologies to develop dedicated deep network accelerators, which will be discussed from a biomedical and healthcare application point of view, even though they could also be used for general-purpose smart edge IoT devices. The three
technologies that we cover here are CMOS, memristors, and
Field Programmable Gate Arrays (FPGAs). It is worth noting
that, while our focus is mainly around efficient inference
engines at the biomedical application edge, the techniques and
hardware advantages discussed here may be also useful for
more efficient offline deep network learning, or online on-
chip learning. Herein, the term DL ‘accelerator’ refers to a device that is able to perform DL inference and, potentially, training.
To provide a self-contained tutorial on the implementation
of DL accelerators, we first deliver a brief introduction to
the fundamentals of artificial and spiking neural networks and
their various architectures. Next, we shed light on why deep
networks are power- and resource-hungry and need specific
hardware platforms to enable them for edge processing. After
that, we discuss recent hardware advances which have led to
improvements in training and inference efficiency. These im-
provements ultimately guide us to more viable edge inference
engine options.
When discussing the three target hardware technologies, we
show that the field of hardware implementation for customized
healthcare and biomedical DL accelerators is very sparse.
After reviewing the literature on these DL accelerators, we
provide a guided analysis to quantify the performance of
various algorithms on different types of DL processors. The
results allow us to draw a perspective on the potential future
of spike-based neuromorphic processors in the biomedical
signal processing domain. Based on our analysis and perspec-
tive, we conjecture that for at-edge processing, neuromorphic
computing technologies and their underlying Spiking Neural
Networks (SNNs) could complement DL inference engines,
either through signaling anomalies in the data or acting as
‘intelligent always-on watchdogs’ which continuously monitor
the data being recorded, but only activate further processing
stages if and when necessary.
Although there are previous reviews on general AI-based al-
gorithm design and hardware for biomedical applications [25],
to the best of our knowledge, this is the first work where a
comprehensive tutorial and review is proposed that focuses on
customized DL accelerators for biomedical applications. Our
contributions that differentiate our work from the available
literature can be summarized as follows:
[Figure 2: schematic depictions of a Multilayer Perceptron (MLP/Dense/Fully Connected), a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), and a Long Short-Term Memory (LSTM) network, with a legend distinguishing input, hidden, output, recurrent, and memory cells, and convolution and pooling operations.]
Fig. 2. Popular ANN structures. MLP/Dense/Fully Connected networks are typically well-suited for cross-sectional quantitative data, whereas RNN and LSTM networks are typically well-suited for sequential data. CNNs are equipped for both types.
• There is no previous comprehensive paper that focuses only on DL accelerators and shows how they can be used for medical and healthcare applications.
• Our paper is the first to discuss the use of three different emerging and established hardware technologies for facilitating DL acceleration, with a focus on biomedical applications.
• We provide tutorial sections on how one may implement a typical biomedical task on FPGAs or simulate it for deployment on memristive crossbars.
• Our paper is the first to discuss how event-based neuromorphic processors can complement DL accelerators for biomedical signal processing.
• We provide open-source code and data to enable the reproduction of our results.
We believe these features make our paper a useful contribution
to the wider biomedical circuits and systems society with an
interest in utilizing mature and emerging technologies and
techniques for enabling DL training and inference on the edge
of healthcare systems.
The remainder of the paper is organized as follows. In
Section II, we define the technical terminology that is used
throughout this paper and cover the working principles of
artificial and spiking neural networks. We also introduce a
biomedical signal processing task for hand-gesture classifica-
tion, which is used for benchmarking the different technolo-
gies and algorithms discussed in this paper. In Section III,
we step through the design, simulation, and implementation
of Deep Neural Networks (DNNs) using different hardware
technologies. We show sample cases of how they have been
deployed in healthcare settings. Furthermore, we demonstrate
the steps and techniques required to simulate and implement
hardware for the benchmark hand-gesture classification task
using memristive crossbars and FPGAs.
In Section IV, we provide our perspective on the challenges
and opportunities of both DNNs and SNNs for biomedical ap-
plications and shed light on the future of spiking neuromorphic
hardware technologies in the biomedical domain. Section V
presents concluding remarks and discussions.
II. DEEP ARTIFICIAL AND SPIKING NEURAL NETWORKS
A. Nomenclature of Neural Network Architectures
Although most DNNs reported in the literature are ANNs, the term DNN usually refers to any network with more than one hidden layer, independently of whether the architecture is fully connected, convolutional, recurrent, an ANN or an SNN, or of any other structure. For example, the most widely used DNN type, i.e. the CNN, can be physically implemented as an ANN or an SNN, and in both cases it would be ‘deep’. However, in this paper, whenever we use the term ‘deep’, DL, or deep network, we refer to Deep Artificial Neural Networks. For Deep Spiking Neural Networks, we simply use the term SNN.
B. Deep Artificial Neural Networks
Traditional ANNs and their learning strategies that were
first developed several decades ago [26] have, in the past
several years, demonstrated unprecedented performance in a
plethora of challenging tasks which are typically associated
with human cognition. These have been applied to medical
image diagnosis [27] and medical text processing [28], using
DNNs.
Fig. 2 illustrates a simplified overview of the structure of
some of the most widely-used DNNs. The most conventional
form of these architectures is the Multi-Layer Perceptron
(MLP). Increasing the number of hidden layers of perceptron
cells is widely regarded to improve hierarchical feature ex-
traction which is exploited in various biomedical tasks, such
as seizure detection from electroencephalography (EEG) [29].
CNNs introduce convolutional layers, which use spatial filters
to learn various parts of the feature space. CNNs also have
pooling layers that are placed after convolutional layers to
down-sample their outputs to reduce the search space size for
subsequent convolutional layers. CNNs have been widely used
in medical and healthcare applications, as they are very well-
suited for spatially structured data. Their use in medical image
analysis [30] will form a major part of our discussions in
this paper. RNNs represent another powerful DL architecture
type that has been recently used both individually [31], and
in combination with CNNs [32] in biomedical applications.
RNNs introduce recurrent cells with a feedback loop, and
are especially useful for processing sequential data such as
temporal signals and time-series data, e.g. electrocardiography
(ECG) [32], and medical text [33]. The feedback loop in
recurrent cells gives them a memory of previous steps and
builds a dynamic awareness of changes in the input. The most
well-known type of RNNs are LSTMs which are designed
to mine patterns in data sequences using their short memory
of distant events stored in their memory cells. LSTMs have
been widely used for processing biomedical signals such as
ECG [31], [34]. Although there are many other varieties of
DNN architectures, we will focus on these most commonly
used types.
1) Automatic hierarchical feature extraction: The above-mentioned DNNs learn intricate data features and representa-
tions through multiple neural computational layers across var-
ious levels of abstraction [35]. The fundamental advantage of
DNNs is that they mine the input data features automatically,
without the need for human knowledge in their supervised
learning loop. This essential feature helps deep networks learn
complex features by combining a hierarchy of simpler features
learned in their hidden layers [35].
2) Learning algorithms: Learning features from data in
a DNN, e.g. the networks shown in Fig. 2, is typically
achieved by minimizing a loss function. In most cases, the
loss is defined as maximum likelihood using the cross-entropy
between training data and the learned model distribution. The
loss function minimization happens through optimizing the
network parameters (weights and biases). This optimization
process minimizes the loss function from the final network
layer backward through all the network layers and is therefore,
called backpropagation. A typical optimization algorithm that
is widely used in DNNs is Stochastic Gradient Descent (SGD)
or its several variants [35].
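As an illustration of the procedure just described, the following minimal sketch (our own toy example, not code from the paper) minimizes a cross-entropy loss with gradient descent for a single-layer softmax classifier; the gradient flows backward from the output exactly as in backpropagation, only with no hidden layers:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))          # 8 samples, 4 features
y = rng.integers(0, 3, size=8)       # 3 class labels
W = np.zeros((4, 3))                 # weights
b = np.zeros(3)                      # biases

def cross_entropy(W, b):
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(len(y)), y]).mean(), p

lr = 0.1
for _ in range(200):                 # gradient descent loop (full batch, for brevity)
    loss, p = cross_entropy(W, b)
    p[np.arange(len(y)), y] -= 1.0   # dLoss/dlogits for cross-entropy + softmax
    p /= len(y)
    W -= lr * (X.T @ p)              # gradient propagated back to the weights
    b -= lr * p.sum(axis=0)

final_loss, _ = cross_entropy(W, b)  # lower than the initial loss ln(3)
```

Mini-batch SGD replaces the full-batch gradient above with the gradient of a random subset of samples at each step; the update rule is otherwise identical.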
3) Backpropagation in DNNs is computationally expensive:
Despite the continual improvement of hardware platforms for
running DNNs, training and running these networks remains
a highly power consuming and computationally formidable
task. The catalyst for the intensive computational requirement,
which results in high power consumption, is the feed-forward
error backpropagation algorithm, which depends on thou-
sands of epochs of computationally intensive Vector Matrix
Multiplication (VMM) operations [26]. These operations, if
performed on a conventional von Neumann architecture which
has separate memory and processing units, will have a time
and power complexity of order O(N²) for multiplying a vector of length N by a matrix of dimensions N×N.
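To make this O(N²) cost concrete, the toy sketch below (illustrative only) performs a vector-matrix multiply while explicitly counting its multiply-accumulate operations:

```python
# On a von Neumann machine, a naive vector-matrix multiply performs
# N*N multiply-accumulate (MAC) operations, i.e. O(N^2) work, and each
# weight W[i][j] must be fetched from memory exactly once.
def vmm(x, W):
    """Return y = x @ W together with the number of MACs performed."""
    n = len(x)
    y = [0.0] * n
    macs = 0
    for j in range(n):
        acc = 0.0
        for i in range(n):
            acc += x[i] * W[i][j]    # one multiply-accumulate
            macs += 1
        y[j] = acc
    return y, macs

x = [1.0, 2.0, 3.0]
W = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # identity matrix, so y == x
y, macs = vmm(x, W)
# macs == 9, i.e. N^2 for N = 3
```

Accelerators attack exactly these two terms: the N² MACs (through parallelism) and the N² weight fetches (through reduced memory access).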
In addition, an artificial neuron in DNNs calculates a sum-
of-products of its input-weight matrix pairs. For instance, a
CNN spatially structures the sum-of-products calculation into
a VMM operation. In digital logic, an adder tree can be
used to accumulate a large number of values. This, however,
becomes problematic in DNNs when one considers the sheer
number of elements that must be summed together, as each
addition requires one cycle. Table I depicts some popular
CNN architectures, accompanied with the total number of
weights, and multiply-and-accumulate (MAC) operations that
must be computed for a single image (656×468 for OpenPose,
224×224 for the rest).
TABLE I
NUMBER OF WEIGHTS AND MAC OPERATIONS IN VARIOUS CNN ARCHITECTURES, FOR A SINGLE IMAGE AND FOR VIDEO PROCESSING AT 25 FRAMES PER SECOND.

Network architecture | Weights | MACs (per image) | MACs @ 25 FPS
AlexNet              | 61 M    | 725 M            | 18 B
ResNet-18            | 11 M    | 1.8 B            | 45 B
ResNet-50            | 23 M    | 3.5 B            | 88 B
VGG-19               | 144 M   | 22 B             | 550 B
OpenPose             | 46 M    | 180 B            | 4500 B
MobileNet            | 4.2 M   | 529 M            | 13 B
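The last column of Table I follows directly from the third: the MACs at 25 FPS are simply the per-image MACs scaled by the frame rate. A quick arithmetic check of the table rows:

```python
# Per-image MAC counts from Table I, in units of 10^6 MACs (M).
macs_per_image_M = {
    "AlexNet":   725,
    "ResNet-18": 1_800,
    "ResNet-50": 3_500,
    "VGG-19":    22_000,
    "OpenPose":  180_000,
    "MobileNet": 529,
}
# Scale by 25 frames per second and convert M -> B (10^9 MACs).
macs_at_25fps_B = {k: v * 25 / 1_000 for k, v in macs_per_image_M.items()}
# e.g. AlexNet: 725 M * 25 = 18.125 B, reported as 18 B in Table I
```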
This table highlights two key facts. Firstly, MACs are the
dominant operation of DNNs. Therefore, hardware implemen-
tations of DNNs should strive to parallelize a large number
of MACs to perform effectively. Secondly, there are many
predetermined weights that must be called from memory. Re-
ducing the energy and time consumed by reading weights from
memory provides another opportunity to improve efficiency.
Consequently, significant research is being conducted to achieve massive parallelism and to reduce memory access in DNN accelerators, using different hardware technologies
and platforms as depicted in Fig. 3. Although these goals
are towards general DL applications, they can significantly
facilitate fast and low-power smart PoC devices [20] and
healthcare IoT systems.
In addition to conventional DL accelerators, there have been
significant research efforts to utilize biologically plausible
SNNs for learning and cognition [36]. Spiking neuromorphic
processors have also been used for biomedical signal process-
ing [37], [38], [39]. Below, we provide a brief introduction to
SNNs, which will be discussed as a method complementary
to DL accelerators for efficient biomedical signal processing
later in this paper. We will also compare SNNs and DNNs on an electromyography (EMG) processing task.
C. Spiking Neural Networks
SNNs are neural networks that typically use Integrate-
and-Fire neurons to dynamically process temporally varying
signals (see Fig. 4(j)). By integrating multiple spikes over time,
it is possible to reconstruct an analog value that represents
the mean firing rate of the neuron. The mean firing rate is
equivalent to the value of the activation function of ANNs. So
in the mean firing rate limit, there is an equivalence between
ANNs and SNNs. By using spikes as all-or-none digital
events (Fig. 4(i)), SNNs enable the reliable transmission of
signals across long distances in electronic systems. In addition,
by introducing the temporal dimension, these networks can
efficiently encode and process sequential data and temporally
changing inputs. SNNs can be efficiently interfaced with event-
based sensors since they only process events as they are
generated. An example of such sensors is the Dynamic Vision
Sensor (DVS), which is an event-based camera shown in
Fig. 4(h). The DVS consists of a logarithmic photo-detector
stage followed by an operational transconductance amplifier
with a capacitive-divider gain stage, and two comparators.
Fig. 3. Typical hardware technologies for DNN acceleration. In this paper we
cover the top two layers of the pyramid, which include specialized hardware
technologies for high-performance training and inference of DNNs. While the
apex is labelled RRAM, this is intended to broadly cover all programmable
non-volatile resistive switching memories e.g. CBRAM, MRAM, PCM, etc.
The ON/OFF spikes are generated every time the difference
between the current and previous value of the input exceeds a
pre-defined threshold. The sign of the difference determines whether the spike is produced in the ON or the OFF channel. This is different from conventional cameras (Fig. 4(f)), which produce
image frames (Fig. 4(g)). Intuitively, it makes sense to use
asynchronous event-based sensor data in asynchronous SNNs,
and synchronously generated frames (i.e., all pixels are given
at a regular clock interval) through synchronous ANNs. But
it is worth noting that conventional frames can be encoded as
asynchronous spikes with frequencies that vary based on pixel
intensity, and event streams can be integrated over time into
synchronously generated time-surfaces [40], [41]. Event-based
sensors have been used to process biomedical signals [37], [42]
(Fig. 4(a)), which can be encoded to spike trains (Fig. 4(b))
to be processed by SNNs or be digitally sampled (Fig. 4(c))
for use in DNNs for learning and inference (Fig. 4(d)).
D. Benchmarking on a Biomedical Signal Processing Task
In Section III we will present a use-case of bio-signal
processing where FPGA and memristive DNN accelerators
are implemented and simulated. These are later compared to
equivalent existing implementations¹ using DNN accelerators
and neuromorphic processors from [39]. To perform com-
parisons, we use the same hand-gesture recognition task as
in [39].
Hand-gesture recognition is an important task in medical
settings such as prosthetic control, which can be performed us-
ing EMG biomedical signals, hand-gesture images, or a com-
bination of both. Here, the adopted hand-gesture dataset [39]
is a collection of 5 hand gestures recorded with two sensor
modalities: muscle activity from a Myo armband that senses
¹https://github.com/Enny1991/dvs_emg_fusion/blob/master/full_baseline.py
EMG electrical activity in forearm muscles, and a visual input
in the form of DVS events. Moreover, the dataset provides
accompanying video captured from a traditional frame-based
camera, i.e., images from an Active Pixel Sensor (APS) to feed
DNNs. Recordings were collected from 21 subjects including
12 males and 9 females between the ages of 25 and 35, and were
taken over three separate sessions.
For each implementation, we compare the mean and stan-
dard deviation of the accuracy obtained over a 3-fold cross
validation, where each fold encapsulates all recordings from
a given session. Additionally, for all implementations, we
compare the energy and time required to perform inference
on a single input, as well as the Energy-Delay Product (EDP),
which is the average energy consumption multiplied by the
average inference time.
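The EDP metric used in these comparisons is simply the product of the two averages. The sketch below uses hypothetical numbers (not measurements from this paper) to illustrate why a slower but more energy-frugal platform can still achieve a lower EDP:

```python
# Energy-Delay Product: average inference energy multiplied by average
# inference latency. Lower is better. All figures below are hypothetical.
def energy_delay_product(energy_uj, delay_ms):
    """Return the EDP in microjoule-milliseconds."""
    return energy_uj * delay_ms

platforms = {                    # (avg energy [uJ], avg delay [ms]) per inference
    "embedded_accelerator": (9000.0, 3.0),
    "neuromorphic_chip":    (900.0, 5.0),
}
edp = {k: energy_delay_product(e, d) for k, (e, d) in platforms.items()}
# The neuromorphic chip is slower (5 ms vs 3 ms) but 10x more energy-
# frugal, so its EDP (900 * 5 = 4500) beats the accelerator's (27000).
```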
III. DNN ACCELERATORS TOWARDS HEALTHCARE AND BIOMEDICAL APPLICATIONS
In this Section, we cover the use of CMOS and memris-
tors in DL acceleration. We discuss how they use different
strategies to achieve two of the key DNN acceleration goals,
namely MAC parallelism and reduced memory access. We also
discuss and review FPGAs as an alternative reconfigurable
DNN accelerator platform, which has shown great promise
in the healthcare and biomedical domains.
A. CMOS DNN accelerators
General edge-AI CMOS accelerator chips can be used for
DNN-enabled healthcare IoT and PoC systems. Therefore,
within this subsection, we first review a number of these chips
and provide examples of potential healthcare applications they
can accelerate. We will also explore some common approaches
to CMOS-driven acceleration of AI algorithms using massive
MAC parallelism and reduced memory access, which are
useful for both edge-AI devices and offline data center scale
acceleration. We then delve deeper into one of the more
renowned approaches, namely, the use of systolic arrays, and
show how a large accelerator developed using systolic arrays has been used to train and accelerate a breast cancer detection model [9].
1) Edge-AI DNN accelerators suitable for biomedical ap-
plications: The research and market for ASICs, which focus
on a new generation of microprocessor chips dedicated entirely
to machine learning and DNNs, have rapidly expanded in
recent years. Table II shows a number of these CMOS-
driven chips, which are intended for portable applications.
There are many other examples of AI accelerator chips (for
a comprehensive survey see [44]), but here we have selected several prominent examples, which are designed specifically for DL using
DNNs, RNNs, or both. We have also included a few general
purpose AI accelerators from Google [45], Intel [46], and
Huawei [47].
Although developed for general DNNs, the accelerators
shown in Table II can efficiently realize portable smart DL-
based healthcare IoT and PoC systems for processing image-
based (medical imaging) or dynamic sequential medical data
types (such as EEG and ECG). For instance, the table shows
[Figure 4: panels (a)-(j) illustrating the sensing-to-processing pathways detailed in the caption, including a conventional camera pipeline (lens, image sensor, ISP, analog-to-digital conversion) producing synchronous frames, and a DVS pixel (photoreceptor, bipolar cell, ganglion cells) producing asynchronous ON/OFF spikes.]
Fig. 4. DNNs and SNN neuromorphic processors adopt different operation models. In DNNs, inputs are processed in batches that propagate serially; consequently, DNNs require clocks for process synchronization. SNNs are asynchronous and process temporally encoded inputs independently. Time-series signals, such as the EMG signal presented in (a), can be either (b) temporally encoded using spike-train encoding schemes such as [37] before being fed into (j) neuromorphic processors, or (c) digitally sampled and concatenated into batches to be fed into (d) DNNs. Similarly, images captured through (e) lenses can be (i) temporally encoded into spike trains using (h) DVSs [43], or (f) digitally encoded using conventional cameras to build (g) image frames.
a few exemplar healthcare and biomedical applications, selected based on the demonstrated capacity of these accelerators to run (or train [48]) various well-known CNN architectures such as VGG, ResNet, MobileNet, AlexNet, and Inception; RNNs such as LSTMs; or combined CNN-RNNs. It is worth noting that most of the available accelerators are intended for CNN inference, while only some [49], [50], [51] also include recurrent connections for RNN acceleration.
The table shows that the total power per chip is typically in the range of hundreds of mW, with a few exceptions consuming around 10 W [46], [47]. Such low power consumption is required to avoid large heat sinks and to satisfy portable battery constraints. The table also shows the computing capability per unit time (column 'Computational Power (GOP/s)'). Independent of power consumption, this column reveals the computational performance and, consequently, the size of network one can compute per unit time. Several of these chips can run large and deep CNNs such as VGG and ResNet, which enables them to perform complex processing tasks within a constrained edge power budget.
For instance, it has previously been shown in [53] that a VGG CNN (compatible with Cambricon-X [52]) can successfully analyze ECoG signals. Considering the power efficiency of Cambricon-X, it could therefore be used to implement a portable automatic ECoG analyzer for PoC diagnosis of various cardiovascular diseases [66]. Similarly, Eyeriss [54] can run VGG-16, which has been shown to be effective in diagnosing thyroid cancer [55]. In addition, Eyeriss can run AlexNet for several different medical imaging applications [30], so it could serve as a mobile diagnostic tool integrated into, or complementing, medical imaging systems at the PoC. Origami [56] is another CNN accelerator chip that can support CNN-based healthcare applications. For instance, [57] proposes CNN-based ECG analysis for heart monitoring, for which Origami could be used to develop a smart healthcare IoT edge device. Similarly, the CNN processor proposed in [58] is shown to be able to run AlexNet, and could be deployed in a PoC ultrasound image processing system [59].
Envision [60] is another accelerator capable of running large-scale CNNs. It can be used as an edge inference engine running a multi-layer CNN for EEG/ECoG feature extraction for epilepsy diagnosis [61]. The neural processor of [62] is a further CNN accelerator, shown to run the Inception V3 CNN, which can be used for skin cancer detection at the edge [11]. LNPU [48] is the only accelerator shown in Table II that, unlike the others, can perform both learning and inference of deep networks such as AlexNet and VGG-16, for applications including on-edge medical imaging [30] and cancer diagnosis [55].

TABLE II
A NUMBER OF RECENT EDGE-AI CMOS CHIPS SUITABLE FOR PORTABLE HEALTHCARE AND BIOMEDICAL APPLICATIONS.

CMOS Chip | Core size (mm2) | Technology (nm) | Computational Power (GOP/s) | Power (mW) | Power Efficiency (TOPS/W) | Potential Mobile and Edge-based Healthcare and Medical Applications
Cambricon-X [52] | 6.38 | 65 | 544 | 954 | 0.5 | ECoG analysis using a sparse VGG [53] for PoC diagnosis of cardiovascular diseases
Eyeriss [54] | 12.25 | 65 | 17-42 | 278 | 0.06-0.15 | Mobile image-based cancer diagnosis using VGG-16 [55]; mobile diagnosis tool based on AlexNet for radiology, cardiology, and gastroenterology imaging [30]
Origami [56] | 3.09 | 65 | 196 | 654 | 0.8 | Smart healthcare IoT edge device for heart health monitoring using CNN-based ECG analysis [57]
ConvNet processor [58] | 2.4 | 40 | 102 | 25-287 | 0.3-2.7 | PoC ultrasound processing using AlexNet [59]
Envision [60] | 1.87 | 28 | 76-408 | 7.5-300 | 0.8-10 | Multi-layer CNN for EEG/ECoG feature extraction for epileptogenicity assessment and epilepsy diagnosis at the edge [61]
Neural processor [62] | 5.5 | 8 | 1900-7000 | 39-1500 | 4.5-11.5 | On-edge classification of skin cancer using the Inception V3 CNN [11]
LNPU [48] | 16 | 65 | 600 | 43-367 | 25 | On-edge learning/inference using VGG-16 for cancer diagnosis [55]; on-edge AlexNet learning/inference for radiology, cardiology, and gastroenterology imaging diagnosis [30]
DNPU [49] | 16 | 65 | 300-1200 | 35-279 | 2.1-8.1 | Parallel and cascaded RNN-CNN for ECG analysis for BCI [32]
Thinker [50] | 14.44 | 65 | 371 | 293 | 1-5 | PoC conversion of respiratory organ motion ultrasound into MRI using a long-term recurrent CNN [63]
UNPU [51] | 16 | 65 | 346-7372 | 3.2-297 | 3.08-50.6 | Intelligent pre-diagnosis medical support/consultation using a CNN-RNN [33]
Google Edge TPU [45] | 25 | - | 4000 | 2000 | 2 | Low-cost and easy-to-access skin cancer detection using the MobileNet V1 CNN [24]; on-edge health monitoring for fall detection using LSTMs [64]
Intel Nervana NNP-I 1000 (Spring Hill) [46] | - | 10 | 48000 | 10000 | 4.8 | Diagnosis using chest X-ray classification on the ResNet CNN family [65]
Huawei Ascend 310 [47] | - | 12 | 16000 | 8000 | 2 | Cardiovascular monitoring for arrhythmia diagnosis from ECG using an LSTM [31]; health monitoring by heart rate variability analysis of ECG using a bidirectional LSTM [34]
Unlike the chips discussed above, which can only run CNNs, DNPU [49], Thinker [50], and UNPU [51] are capable of accelerating both CNNs and RNNs. This feature makes them suitable for a wider variety of edge-based biomedical applications, such as ECG analysis for BCI using a cascaded RNN-CNN [32], PoC MRI construction from motion ultrasound using a long-term recurrent CNN [63], or intelligent medical consultation using a CNN-RNN [33].
Table II also lists three general-purpose AI accelerator chips, which can be deployed for low-cost and easy-to-access skin cancer detection using the MobileNet V1 CNN [24], on-edge health monitoring for fall detection using LSTMs [64], chest X-ray analysis using a ResNet CNN [65], cardiovascular arrhythmia detection from ECG using an LSTM [31], or heart rate variability analysis from ECG signals through a bidirectional LSTM [34], to name a few.
2) Common approaches to CMOS-driven DL acceleration:
Accelerators will typically target either data center use or
embedded ‘edge-AI’ acceleration. Edge chips, such as those
discussed above, must operate under restrictive power budgets
(e.g., within thermal limits of 5 W) to cope with portable
battery constraints. While the scale of tasks, input dimension
capacity, and clock speeds will differ between edge-AI and
modular data center racks, both will adopt similar principles
in the tasks they seek to optimize.
Most accelerator chips, including those in Table II, use similar optimization strategies, involving reduced-precision arithmetic [48], [51], [58], [60] to improve computational throughput. This is typically combined with architectural-level enhancements [49], [50], [52], [54], [62] that reduce data movement (using in- or near-memory computing), increase parallelism, or both.
Sequential and combinational logic research is largely mature, so outside of emerging memory technologies, the dominant hardware benefits come from optimizing data flow and architecture. An early example is the neuFlow system-on-chip (SoC) processor, which relies on a grid of processing tiles, each made up of a bank of processing operators and a multiplexer-based on-chip router [67]. The processing operator can serially perform primitive computations (MUL, DIV, ADD, SUB, MAX) or a parallelized 1D/2D convolution. The router configures data movement between tiles to support streaming data-flow graphs.
Since the development of neuFlow, over 100 startups and companies have developed, or are developing, machine learning accelerators. The Neural Processing Unit (NPU) [68] generalizes the neuFlow approach by employing eight processing engines, each of which computes a neuron response: multiplication, accumulation, and activation. If a program can be partitioned such that a segment of it is expressible as MACs, that segment can be offloaded to and computed on the NPU. This made it possible to go beyond MLP neural networks: the NPU was also demonstrated performing Sobel edge detection and fast Fourier transforms.
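A minimal sketch of one such neuron-response evaluation follows; the sigmoid activation and the function name are our illustrative choices, not details from [68]:

```python
import math

def neuron_response(inputs, weights, bias=0.0):
    """One processing-engine evaluation as described above: multiply
    each input by its weight, accumulate, then apply an activation
    (a sigmoid here, chosen purely for illustration)."""
    acc = bias
    for x, w in zip(inputs, weights):
        acc += x * w                         # the MAC loop
    return 1.0 / (1.0 + math.exp(-acc))      # activation stage

# Eight such engines operating in parallel yield eight neuron outputs
# per step; any program segment expressible as MACs can be offloaded.
print(neuron_response([1.0, -2.0, 0.5], [0.3, 0.1, 0.8]))
```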
NVIDIA coupled their expertise in developing GPUs with dedicated machine learning cores, namely tensor cores, which are aimed at delivering superior performance over regular Compute Unified Device Architecture (CUDA) cores [16]. Tensor cores target mixed-precision computing, with the NVIDIA Tesla V100 GPU combining 672 tensor cores on a single unit. By merging the parallelism of GPUs with the application-specific nature of tensor cores, these GPUs are capable of energy-efficient general compute workloads, as well as 12 trillion floating-point operations per second (TFLOPS) of matrix arithmetic.
Although plenty of other notable architectures exist (see Table II), a pattern emerges: most specialized processors rely on a series of sub-processing elements that each contribute to increasing the throughput of a larger processor. While there are plenty of ways to achieve MAC parallelism, one of the most renowned techniques is the systolic array, utilized by Groq [69] and Google, amongst numerous other chip developers. This is not a new concept: systolic architectures were first proposed back in the late 1970s [70], [71], and have become widely popularized since powering the hardware DeepMind used for the AlphaGo system that defeated Lee Sedol, one of the world's strongest Go players, in March 2016. Google also uses systolic arrays to accelerate MACs in their TPUs, just one of many CMOS ASICs used in DNN processing [13]. Here, we explain what systolic arrays are and how they can be used to decrease memory access frequency and increase MAC parallelism, towards efficient ANN accelerators.
3) Systolic arrays for DNN acceleration: In general-purpose computing, there is no way of knowing what the next instruction will be. The result of every operation must be stored in memory while awaiting further instructions from the processor. Energy is consumed in reading from and writing to memory, and time is wasted shuttling information over a limited-bandwidth bus to and from the processor (Fig. 5(a)). Neural networks, on the other hand, are deterministic: once the network has been trained, every operation the input data will be subjected to is pre-determined. This allows a single element of information, such as one pixel of an image, to have many operations applied to it before it is stored back in memory. Systolic arrays loosely draw inspiration from the cardiovascular system, where blood is pumped through various subsystems before returning to the heart; indeed, the word systolic is derived from the cardiac cycle. Similarly, in systolic processing, data flows through many processing elements before it returns to memory (Fig. 5(b)).
The appeal of systolic arrays is that they come in many forms, designed for different tasks using repeatable, modular blocks. As a simple case study of how systolic arrays parallelize operations with infrequent memory write cycles, consider the 3×3 matrix multiplication in Fig. 6. Here, each processing element is designed to multiply its two inputs together and accumulate the product with all future products. The input data is a matrix of values x_{m,n} and a weight matrix of values w_{m,n}; multiplying these two matrices together is an efficient way to compute a sum-of-products, i.e., a MAC operation.
As depicted in Fig. 6, input data is carefully orchestrated in time such that it naturally flows in rhythm with the incoming weight data. At T=1, x_{0,0} is multiplied with w_{0,0}. At T=2, x_{0,0} flows to the right and is multiplied by the next weight in sequence, w_{0,1}, while w_{0,0} flows down and is multiplied by the next input, x_{1,0}. Another input-weight pair enters the array: x_{0,1} and w_{1,0} are multiplied together in the top-left processing element and summed with the result of the previous time-step. This process repeats until all inputs have traversed to the right of the array and all weights have traversed to the bottom, giving the result shown at T=7. This equivalently performs matrix multiplication, the dominant operation in a DNN. Every element of the matrix can be computed in this way, without having to store any intermediate results in main memory.
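This schedule can be checked in software. The following minimal simulation is our illustrative sketch, not a hardware implementation: it models an output-stationary systolic array in which skewed input and weight streams meet at each processing element.

```python
import numpy as np

def systolic_matmul(X, W):
    """Simulate an output-stationary systolic array computing X @ W.

    Each processing element (PE) at position (i, j) multiplies the input
    arriving from its left with the weight arriving from above and adds
    the product to its local accumulator. Operands are skewed in time so
    that x[i, k] and w[k, j] meet at PE (i, j) at step t = i + j + k.
    """
    n = X.shape[0]
    acc = np.zeros((n, n))            # one accumulator per PE
    for t in range(3 * n - 2):        # 3n - 2 = 7 active steps for n = 3
        for i in range(n):
            for j in range(n):
                k = t - i - j         # operand pair reaching PE (i, j) now
                if 0 <= k < n:
                    acc[i, j] += X[i, k] * W[k, j]
    return acc                        # equals X @ W; no intermediate
                                      # results are written to main memory

X = np.arange(9, dtype=float).reshape(3, 3)
W = 2.0 * np.eye(3)
assert np.allclose(systolic_matmul(X, W), X @ W)
```

For a 3×3 multiplication the loop runs for seven active time-steps, consistent with the walkthrough above, and every output element accumulates locally without round trips to main memory.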
4) CMOS-based systolic arrays used in biomedical applications: Google's TPU utilizes a 128×128 systolic array, which enables 180 TFLOPS, while TPU v3 reaches up to 100 peta-FLOPS (PFLOPS). The modularity of systolic arrays makes them easily scalable to a large number of interconnected TPUs, a necessary feature for data center use. Even with a relatively slow clock (e.g., 700 MHz for TPU v1), systolic arrays are highly parallel, meaning numerous matrices are processed simultaneously. TPUs were used in the seminal work of Ref. [9], where an ensemble of three DNNs was used to surpass radiologist performance in breast cancer detection. This network was trained on a set of over 100,000 images, many at 4K resolution, and required the development of substantial infrastructure to make training such a system possible. The results demonstrated a significant reduction in both false positives and false negatives. Notably,
the system was able to generalize from being trained on UK-based data sets to competitive performance on USA-based images.

Fig. 5. (a) Conventional CPUs rely on a shared bus to transfer data to and from memory, resulting in a data-transmission bottleneck. (b) Systolic arrays pass data through multiple processing elements before storing results in memory.

Fig. 6. Mapping a 3×3 matrix multiplication onto a 2D systolic array. The figure shows the movement of input and weight data over time, from time-step T=0 through to T=7. The final result shows how all elements of a matrix are computed in parallel. This differs from pipelining in that individual processing elements perform entire operations, can be multi-directional, can operate at different speeds, and can execute kernels with their own local memory; pipelining instead executes pieces of an overall instruction in multiple pipelined stages.
Overall, systolic arrays make efficient use of limited memory bandwidth. While the connection from processor to memory is a bottleneck, the interconnections between processing elements can be very fast. The drawback is that if the required computation cannot be mapped onto the processing elements' function, such as a MAC, it cannot be implemented.
B. Memristive DNNs
To achieve the two aforementioned key DNN acceleration goals, i.e., massive MAC parallelism and reduced memory access, many studies have leveraged memristors [72], [73], [74], [75] as weight elements in their DNN and SNN [76], [77] architectures. Memristors are often referred to as the fourth fundamental circuit element, and can adapt their resistance (conductance) in response to the applied current or voltage. This resembles the adaptation of neural synapses to their surrounding activity during learning, a feature integral to the brain's in-memory processing ability that is missing in today's general-purpose computers. Such in-situ processing can be utilized to perform parallel MAC operations inside memory, significantly improving DNN learning and inference. This is achieved by developing memristive crossbar neuromorphic architectures, which are projected to achieve an approximately 2500-fold reduction in power and a 25-fold speed-up compared to state-of-the-art specialized hardware such as GPUs [72].
1) Memristive crossbars for parallel MAC and VMM operations: A memristive crossbar, which can be fabricated using a variety of device technologies [77], [78], can perform analog MAC operations in a single time-step (see Fig. 7(a)). This reduces the time complexity to its minimum, O(1), by carrying out multiplication at the location of memory, in a non-von Neumann structure. Using this well-known approach, VMM can be parallelized as demonstrated in Fig. 7(b), where a vector of M values represented as voltage signals ([V_1..V_M]) is applied to the rows of the crossbar, while the matrix (of size M×N), whose elements are represented as conductances (resistances), is stored in the memristive devices at each crossing point. Taking advantage of Ohm's law (I = V.G), the current summed in each crossbar column represents one element of the resulting multiplication vector of size N.

Fig. 7. Memristive crossbars can parallelize (a) analog MAC and (b) VMM operations. Here, V represents the input vector, while conductances in the crossbar represent the matrix.
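The read-out described above can be sketched numerically; the values below are arbitrary examples and the variable names are ours:

```python
import numpy as np

# Ideal electrical behavior of the crossbar in Fig. 7(b): row voltages V
# (length M) drive a conductance matrix G (M x N); Ohm's law gives each
# device current I = V * g, and each column wire sums its device currents,
# so all M*N multiplications and N summations occur in a single read step.
M, N = 4, 3
rng = np.random.default_rng(0)
V = rng.uniform(0.0, 0.3, size=M)           # input vector as row voltages (V)
G = rng.uniform(1e-6, 1e-4, size=(M, N))    # device conductances (S)

I = V @ G   # column currents (A): the full VMM result of size N

# The same column current written as the explicit Ohm's-law sum:
assert np.isclose(I[0], sum(V[m] * G[m, 0] for m in range(M)))
```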
2) Mapping memristive crossbars to DNN layers: Although implementing fully-connected DNN layers is straightforward (weights are mapped to the memristors at the crossbar cross-points and inputs are represented by input voltages), implementing a complex CNN requires mapping techniques that convert convolution operations to MAC operations. A popular approach is an unrolling (unfolding) operation that transforms the convolution of input feature maps with convolutional filters into MAC operations. We have developed a software platform named MemTorch [79], introduced in subsequent sections, that performs this mapping, along with a number of other operations, for converting DNNs to Memristive DNNs (MDNNs). The mapping process implemented in MemTorch is illustrated in the left panel of Fig. 8: the normal input feature maps and convolution filters (gray shaded area) are unfolded and reshaped (cyan shaded area) to be compatible with memristive crossbar parallel VMM operation. It is worth noting that the convolutional filters applied to the input feature maps have a direct relationship with the required crossbar sizes, and the resulting hardware size, or required processing time, depends on the size of the input feature maps [80].
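A minimal single-channel sketch of that unrolling (an im2col-style lowering; the function name and test values are illustrative) is:

```python
import numpy as np

def unfold_conv2d(x, kernel, stride=1):
    """Lower a single-channel 2D convolution ('valid' padding) to one
    matrix product, in the spirit of the unrolling described above.

    Each kernel-sized input patch becomes one row of the unfolded
    matrix; the flattened kernel is the vector it multiplies, so the
    whole convolution reduces to MAC operations a crossbar can perform.
    """
    kh, kw = kernel.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    patches = np.stack([
        x[i * stride:i * stride + kh, j * stride:j * stride + kw].ravel()
        for i in range(oh) for j in range(ow)
    ])                                  # shape: (oh * ow, kh * kw)
    return (patches @ kernel.ravel()).reshape(oh, ow)

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.array([[1.0, 0.0], [0.0, -1.0]])

# Cross-check against a direct sliding-window computation.
direct = np.array([[np.sum(x[i:i + 2, j:j + 2] * k) for j in range(3)]
                   for i in range(3)])
assert np.allclose(unfold_conv2d(x, k), direct)
```

Note how the unfolded matrix grows with the number of patch positions, reflecting the text's point that input feature map size drives the required crossbar size or processing time.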
3) Peripheral circuitry for memristive DNNs: In addition to the memristive devices used as programmable elements in MDNN architectures, various peripheral circuits are required to perform feed-forward error-backpropagation learning in MDNNs [74]. This extra circuitry may include: (i) a conversion circuit to translate the input feature maps to input voltages, which, for programming memristive devices, is usually a set of Pulse Width Modulator (PWM) circuits; (ii) current integrators or sense amplifiers, which pass the current read from every column of the memristive crossbar to (iii) Analog-to-Digital Converters (ADCs), which pass the converted values to (iv) an activation function circuit for forward propagation and, for backward error propagation, (v) an activation function derivative circuit. Other circuits required in the error backpropagation path include (vi) generators converting backpropagated values to PWM voltages, (vii) backpropagation current integrators, and (viii) backpropagation-path ADCs. In addition, an update module that updates network weights based on an algorithm such as SGD is required, which is usually implemented in software. After the update, the new weight values should be written to the memristive crossbar, which itself requires Bit-Line (BL) and Word-Line (WL) switch matrices to address the memristors, as well as a circuit to program the memristive weights. There are different approaches to implementing this circuit, such as that proposed in [81], while others may use ex-situ training in software, where the new weight values are calculated in software and transferred to the physical memristors through peripheral circuitry [74].
Fig. 8. Conversion process of a DNN trained in PyTorch and mapped to a Memristive DNN using MemTorch [79], to parallelize VMMs using 1T1R memristive crossbars while taking into account memristor variability, including a finite number of conductance states and non-ideal RON and ROFF distributions.
4) Memristive device nonidealities: Although ideal memristive crossbars are projected to remarkably accelerate DNN learning and inference and drastically reduce their power consumption [72], [73], device imperfections observed in experimentally fabricated memristors impose significant performance degradation when crossbar sizes are scaled up for deployment in real-world DNN architectures, such as those required for the healthcare and biomedical applications discussed in subsection III-A. These imperfections include nonlinear, asymmetric, and stochastic conductance (weight) updates; temporal and spatial device variations; limited device yield; and limited on/off ratios [72]. To minimize the impact of these
imperfections, specific peripheral circuitry and system-level
mitigation techniques have been used [82]. However, these
techniques add significant computation time and complexity to
the system. It is, therefore, essential to take the effect of these
nonidealities into consideration before utilizing memristive
DNNs for any healthcare and medical applications, where
accuracy is critical. In addition, there is a need for a unified
tool that reliably simulates the conversion of a pre-trained
DNN to a MDNN, while critically considering experimentally
modeled device imperfections [79].
5) Conversion of a DNN to an MDNN while considering memristor nonidealities: Due to the significant time and energy required to train large DNNs for challenging cognitive tasks, such as biomedical and healthcare data processing [9], [83], training is usually undertaken in data centers [9], [13]. The pretrained DNN can then be transferred onto memristive crossbars. Several frameworks and tools exist to simulate and facilitate this transition [84]. In a recent study, we developed a comprehensive tool named MemTorch: an open-source, general, high-level simulation platform that can fully integrate any behavioral or experimental memristive device model into crossbar architectures to design MDNNs [79].
Here, we utilize the benchmark biomedical signal process-
ing task explained in subsection II-D to demonstrate how
pretrained DNNs can be converted to equivalent MDNNs, and
how non-ideal memristive devices can be simulated within
MDNNs prior to hardware realization. The conversion process,
which can be generalized to other biomedical models using
MemTorch, is depicted in Fig. 8.
The targeted MDNNs are constructed by converting linear and convolutional layers from PyTorch pre-trained DNNs to memristive-equivalent layers employing 1-Transistor-1-Resistor (1T1R) crossbars. A double-column scheme is used to represent network weights within the memristive crossbars, and the converted MDNN models are tuned using linear regression, as described in [79]. The complete, detailed process and the source code of the network conversion for the experiments shown in this subsection are provided in a publicly accessible complementary Jupyter Notebook2.
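One common form of such a double-column scheme maps a signed weight to the difference of two conductances. The sketch below is our illustration, not MemTorch's internal code, and assumes device bounds of R_ON = 100 Ohm and R_OFF = 2.5 kOhm:

```python
import numpy as np

# Illustrative double-column weight scheme: each signed weight w is stored
# as a pair of conductances on adjacent columns and read out as a scaled
# difference of the two column currents. Bounds assume R_ON = 100 Ohm and
# R_OFF = 2500 Ohm devices (helper names and scaling are our choices).
g_min, g_max = 1.0 / 2500.0, 1.0 / 100.0

def weight_to_conductance_pair(w, w_max):
    """Map w in [-w_max, w_max] to (g_pos, g_neg) within device limits."""
    span = g_max - g_min
    g_pos = g_min + max(w, 0.0) / w_max * span
    g_neg = g_min + max(-w, 0.0) / w_max * span
    return g_pos, g_neg

def conductance_pair_to_weight(g_pos, g_neg, w_max):
    """Recover the signed weight from the differential column pair."""
    return (g_pos - g_neg) * w_max / (g_max - g_min)

g_pos, g_neg = weight_to_conductance_pair(-0.42, w_max=1.0)
assert np.isclose(conductance_pair_to_weight(g_pos, g_neg, w_max=1.0), -0.42)
```

The differential read-out also cancels the constant g_min offset shared by both columns, which is one reason this representation is widely used for signed weights.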
During the conversion, any memristor model can be used. For the benchmark task, a reference VTEAM model [85] is instantiated using parameters from Pt/Hf/Ti Resistive Random Access Memory (RRAM) devices [86] to model all memristive devices within converted linear and convolutional layers. As already mentioned, memristive devices have inevitable variability, which should be taken into account when implementing an MDNN for learning and/or inference. Fig. 8 also depicts visualizations of two non-ideal device characteristics: the finite number of conductance states and device-to-device variability. Using MemTorch [79], not only can we convert any DNN to an equivalent MDNN utilizing any memristive device model, we can also comprehensively investigate the effect of various device non-idealities and variations on the performance of a prospective MDNN before it is physically realized in hardware.
To demonstrate an example that includes variability in our MDNN simulations, device-to-device variability is introduced by sampling ROFF for each device from a normal distribution with mean 2.5 kΩ and standard deviation 2σ, and RON for each device from a normal distribution with mean 100 Ω and standard deviation σ.
2https://github.com/coreylammie/TBCAS-Towards-Healthcare-and-Biomedical-Applications/blob/master/MemTorch.ipynb
In Fig. 9, for the converted memristive MLP and CNN that process APS hand-gesture inputs, we gradually increase σ from 0 to 500 and compare the mean test set accuracy across the three folds. As can be observed, with increasing device-to-device variability, i.e., variability of RON and ROFF, performance degradation grows across all networks. For all simulations, RON and ROFF are bounded to be positive.
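That sampling procedure can be sketched as follows (our illustration; resistances in Ohms, with clipping standing in for the positivity bound):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_devices(shape, sigma):
    """Sample per-device resistances as in the experiment above:
    R_OFF ~ N(2500, 2*sigma) Ohm and R_ON ~ N(100, sigma) Ohm, both
    clipped so they remain positive."""
    r_off = np.clip(rng.normal(2500.0, 2.0 * sigma, shape), 1e-3, None)
    r_on = np.clip(rng.normal(100.0, sigma, shape), 1e-3, None)
    return r_on, r_off

# As sigma grows, the R_ON and R_OFF distributions broaden and begin to
# overlap, shrinking (or even inverting) the usable conductance range of
# a device; this is one mechanism behind the accuracy drop in Fig. 9.
for sigma in (0.0, 100.0, 500.0):
    r_on, r_off = sample_devices((128, 128), sigma)
    overlap = float(np.mean(r_off <= r_on))   # fraction of unusable devices
    print(f"sigma = {sigma:5.1f} -> R_OFF <= R_ON for {overlap:.1%} of devices")
```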
6) Memristive DNNs towards biomedical applications: Although some previous small-scale MDNNs have been simulated for biomedical tasks such as cardiac arrhythmia classification [87], or implemented on a physical programmable memristive array for breast cancer diagnosis [88], there currently exists no significant MDNN, even at the simulation level, that has realized a large-scale biomedical processing task.
Similar to the recent advances in CMOS-driven DNN accelerator chips discussed in subsection III-A, there have been promising partial [73] or full [74] hardware realizations of MDNNs, which are shown to achieve significant energy savings compared to state-of-the-art GPUs. However, unlike their CMOS counterparts, these implementations have only been able to perform simple tasks such as MNIST and CIFAR classification. They are, therefore, not yet suitable for implementing the large-scale CNNs and RNNs which, as shown in subsection III-A, are required for biomedical and healthcare tasks dealing with image [30] or temporal [31] data types.
In addition, following optimization strategies similar to those used in CMOS accelerators, [89] has investigated, in simulation, the use of quantized and binarized MDNNs and their error tolerance in a biomedical ECG processing task, showing their potential to achieve significant energy savings compared to full-precision MDNNs. However, due to the many intricacies of the design process, and considering the aforementioned peripheral circuitry that may offset the benefits gained by using MDNNs, a full hardware design is required before the actual energy savings of such binarized MDNNs can be verified.
Fig. 9. Simulation results investigating the performance of MDNNs for hand gesture classification adopting non-ideal Pt/Hf/Ti RRAM devices. Device-to-device variability is simulated using MemTorch [79].
C. FPGA DNNs
FPGAs are fairly low-cost reconfigurable hardware platforms that can be used in almost any hardware prototyping and implementation task, significantly shortening the time-to-market of an electronic product. They also provide parallel computation, which is essential when simultaneous data processing is required, such as processing multiple ECG channels in parallel. Furthermore, there exists a variety of High-Level Synthesis (HLS) tools and techniques [90], [91] that facilitate FPGA prototyping without the need to directly develop time-consuming low-level Hardware Description Language (HDL) code [92]. These tools allow engineers to describe their targeted hardware in a high-level programming language such as C and synthesize it to the Register Transfer Level (RTL). The tools then offload the computation-critical RTL to run as kernels on parallel processing platforms such as FPGAs [93].
1) Accelerating DNNs on FPGAs: FPGAs have previously been used to realize mostly inference [91], [94], [95] and, in some cases, training of DNNs with reduced-precision data [96] or other hardware-friendly approaches [97]. For a comprehensive review of previous FPGA-based DNN accelerators, we refer the reader to [91].
Here, we demonstrate an exemplar process of accelerating the DNNs used for the benchmark biomedical signal processing task explained in subsection II-D. For our acceleration, we use fixed-point parameter representations on a Starter Platform for OpenVINO Toolkit FPGA using OpenCL. OpenCL [90] is an HLS framework for writing programs that execute across heterogeneous platforms. It specifies programming languages (based on C99 and C++11) for programming the compute devices, and Application Programming Interfaces (APIs) to control and execute the developed kernels on those devices. Depending on the available computational resources, an accelerator can pipeline and execute all work items in parallel or sequentially.
Fig. 10 depicts the compilation flow we adopted. The
trained DNN PyTorch model is first converted to .prototxt
and .caffemodel files using Caffe. All weights and biases are
then converted to a fixed point representation using MAT-
LAB’s Fixed-point toolbox using word length and fractional
bit lengths defined in [98], prior to being exported as a
single binary .dat file for integration with PipeCNN, which is
used to generate the necessary RTL libraries, and to perform
compilation of the host executable and the FPGA bit-stream.
We used Intel’s FPGA SDK for OpenCL 19.1, and provide
all files used during the compilation shown in Fig. 10 in a
publicly accessible complementary GitHub repository3.
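The quantization stage of this flow maps each weight and bias to a signed fixed-point value defined by a word length and a fractional bit length. A minimal Python sketch of that rounding-and-saturation step is given below; the helper name and the example Q2.6 format are illustrative only, since the actual flow uses MATLAB's Fixed-point toolbox with the bit widths defined in [98].

```python
def to_fixed_point(w, word_len=8, frac_len=6):
    """Quantize a float to signed fixed-point with `word_len` total bits,
    `frac_len` of which are fractional, saturating on overflow."""
    scale = 1 << frac_len                            # 2**frac_len
    q_min = -(1 << (word_len - 1))                   # most negative integer code
    q_max = (1 << (word_len - 1)) - 1                # most positive integer code
    code = max(q_min, min(q_max, round(w * scale)))  # round, then saturate
    return code / scale                              # dequantized real value

# Quantize a few example weights to a Q2.6 format (8-bit word, 6 fractional bits).
quantized = [to_fixed_point(w) for w in (0.7312, -1.05, 3.2)]
# 3.2 overflows the 8-bit range and saturates at 127/64 = 1.984375
```

In the actual flow, the resulting integer codes (rather than the dequantized values) are what get packed into the single binary .dat file consumed by PipeCNN.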
2) FPGA-based DNNs for biomedical applications: De-
spite the many FPGA-based DNN accelerators available [91],
only a few have been developed specifically for biomedical
applications such as ECG anomaly detection [99], or real-time
mass-spectrometry data analysis for cancer detection [100],
where the authors show that application-specific parameter
quantization and customized network design can result in
significant inference speed-up compared to both CPU and
3https://github.com/coreylammie/TBCAS-Towards-Healthcare- and-
Biomedical-Applications/blob/master/FPGA/
SUBMITTED TO IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS 13
[Fig. 10 here] Fig. 10. Compilation flow used to deploy an EMG classification CNN to an OpenVINO FPGA adopting fixed-point number representations using OpenCL. (Stages: PyTorch model (network architecture and exported state dictionary) → Caffe model (.prototxt and .caffemodel files) → quantization (all weights and biases converted to a fixed-point representation using MATLAB's Fixed-point toolbox) → binary file → PipeCNN → RTL libraries describing the digital hardware that realises fixed-point inference → host executable and FPGA bit-stream.)
GPU. In addition, the authors in [101] have developed an
FPGA-based BCI in which an MLP is used for reconstructing
ECoG signals. In [102], the authors implemented an EEG
processing and neurofeedback prototype on a low-power,
low-cost FPGA and then scaled it to a high-end UltraScale
Virtex VU9P, achieving 215 and 8 times higher power
efficiency than a CPU and a GPU, respectively. For the
EEG processing, they developed an LSTM inference engine.
It is projected that, by leveraging specific algorithmic de-
sign and hardware-software co-design techniques, FPGAs can
provide more than 10 times better energy-delay efficiency than
state-of-the-art GPUs for accelerating DL [91]. This is significant
for realizing portable and reliable healthcare applications.
However, FPGA design is not as straightforward as the high-level
design flows available for DL accelerators; it requires skilled
engineers and stronger tools, such as those offered by GPU
manufacturers.
In the next section, we provide our analysis and perspective
on the use of the three hardware technologies discussed in this
section for DL-based biomedical and healthcare applications.
We also discuss how SNN-based neuromorphic processors can
benefit edge-processing for biomedical applications.
IV. ANALYSIS AND PERSPECTIVE
The use of ANNs trained with the backpropagation learning
algorithm in the domain of healthcare and for biomedical
applications such as cancer diagnosis [108] or ECG monitoring [109]
dates back to the early 1990s. These networks were
typically small-scale and ran on standard workstations. As
they were neither deep nor heavily parameterized, they did
not demand high-performance accelerators. However,
with the resurgence of CNNs in the early 2010s followed
by the rapid spread of DNNs and large data-sets, came
the need for high-speed specialized processors. This need
resulted in repurposing GPUs and actively researching other
hardware and design technologies including ASIC CMOS
chips (see Table II) and platforms [13], memristive crossbars
and in-memory computing [73], [74], [80], and FPGA-based
designs for DNN training [96], [97] and inference [94].
Despite notable progress in deploying non-GPU platforms
for DL acceleration, similar to other data processing tasks,
biomedical and healthcare tasks have mainly relied on standard
technologies and GPUs. Currently, depending on the size
of the required DNN, its number of parameters, as well as
the available training dataset size, biomedical DL tasks are
usually “trained” on high-performance workstations with one
or more GPUs [12], [18], on customized proprietary processors
such as Google TPU [9], or on various Infrastructure-as-
a-Service (IaaS) provider platforms, including Nvidia GPU
cloud, Google Cloud, and Amazon Web Services, among
others. This is mostly due to (i) the convenience these plat-
forms provide using high-level languages such as Python; (ii)
the availability of wide-spread and open-source DL libraries
such as TensorFlow and PyTorch; and (iii) strong community
and/or provider support in utilizing GPUs and IaaS for training
various DNN algorithms and applications.
However, “inference” can benefit from further research and
development on emerging and mature hardware and design
technologies such as those discussed in this paper, to open
up new opportunities for deploying healthcare devices closer
to the edge, paving the way for low-power and low-cost DL
accelerators for PoC devices and healthcare IoT. Despite this
fact, hardware implementations of biomedical and healthcare
inference engines are very sparse. Table III lists a summary of
the available hardware implementations and hardware-based
simulations of DNNs used for healthcare and biomedical
signal processing applications, using the three hardware tech-
nologies covered herein. In addition, the table shows existing
biomedical signal processing tasks implemented on generic
low-power spiking neuromorphic processors.
A. CMOS technology has been the main player for DL infer-
ence in the biomedical domain
Like general-purpose GPUs, all other current non-GPU DL
inference engines are implemented in CMOS. It is therefore
likely that most future edge-based biomedical platforms will
rely on these inference platforms. In Table II, we listed a number of these
accelerators that are mainly developed for low-power mobile
applications. We also mentioned a set of potential healthcare
and biomedical tasks that can be realized using them. However,
before the deployment of any edge-based DL accelerators
for biomedical and healthcare tasks, some challenges need
to be overcome. A non-exhaustive list of these obstacles
includes: (i) the power and resource constraints of available
mobile platforms, which despite significant improvements are
still not suitable for complex medical tasks; (ii) the need to
verify that a DL system can generalize beyond the distribution
it is trained and tested on; (iii) bias that is inherent to
datasets which may have adverse impacts on classification
across different populations; (iv) confusion surrounding the
TABLE III
EXISTING HARDWARE IMPLEMENTATIONS AND HARDWARE-BASED SIMULATIONS OF DNN ACCELERATORS USED FOR HEALTHCARE AND BIOMEDICAL
APPLICATIONS, AND GENERIC SNN NEUROMORPHIC PROCESSORS UTILIZED FOR BIOMEDICAL SIGNAL PROCESSING. SIMULATION-BASED

| Biomedical or Healthcare Task | DNN/SNN Architecture | Hardware |
|---|---|---|
| Image-based breast cancer diagnosis [9] | Ensemble of CNNs | CMOS (Google TPU) |
| Energy-efficient multi-class ECG classification [38] | Spiking RNN | CMOS |
| EMG signal processing [39] | Spiking CNN/MLP | CMOS |
| ECG signal processing [103] | Spiking RNN | CMOS |
| EMG signal processing [104] | Spiking RNN | CMOS |
| EMG signal processing [105] | Feed-forward SNN | CMOS |
| EMG and EEG signal processing [106] | Recurrent 3D SNN | CMOS |
| EEG and LFP signal processing [107] | TrueNorth-compatible CNN | CMOS |
| ECG processing for cardiac arrhythmia classification [87] | MLP | Memristors |
| Breast cancer diagnosis [88] | MLP | Programmable Memristor-CMOS system |
| ECG signal processing [89] | Binarized CNN | Memristors |
| ECG arrhythmia detection for heart monitoring [99] | MLP | FPGA |
| Mass-spectrometry for real-time cancer detection [100] | MLP | FPGA |
| ECoG signal processing for BCI [101] | MLP | FPGA |
| EEG processing for energy-efficient neurofeedback devices [102] | LSTM | FPGA |
liability of AI algorithms in high-risk environments [110];
and (v) the lack of a streamlined workflow between medical
practitioners and DL. While the latter challenges are matters of
legality and policy, the former issues highlight the fundamental
need to understand where dataset bias comes from, and how
to improve our understanding of why neural networks learn
the features they do, such that they may generalize across
populations in a manner that is safe for receivers of medical
care.
In addition, to make any accelerator usable for general
as well as more complex biomedical applications, the field
requires strong hardware-software co-design to build hardware
that can be readily programmed for biomedical
tasks. One successful example of a solid hardware-software
co-design for a DL-customized CMOS platform (shown in
Table III) is the Google TPU [13], which while generic, has
been used along with complex tailored software for human-
surpassing medical imaging tasks [9]. Google has used a
similar CMOS TPU technology to design inference engines [45],
which are very promising as edge hardware for enabling
mobile healthcare applications. The main reason for this
promise is the availability of solid software platforms (such as
TensorFlow Lite) and the community support for the Google
TPU.
Overall, DL accelerators have advanced greatly in the past
several years and are now permeating various aspects of our
lives, from self-driving cars to smart personal assistants. After
overcoming a number of obstacles such as those mentioned
above, we may also be able to widely integrate these DL
accelerators in healthcare and biomedical applications.
However, for some medical applications, such as monitoring,
which requires always-on processing, we still need
systems with orders of magnitude better power efficiency, so
they can run on a simple button battery for a long time. To
achieve such systems, one possible approach is to process data
only when available and make our processing asynchronous.
A promising method to achieve such goals is the use of brain-
inspired SNN-based neuromorphic processors.
B. Towards edge processing for biomedical applications with
neuromorphic processors
Although most of the efforts presented in this work focused
on DNN accelerators, there are also notable efforts in the
domain of SNN processors that offer complementary advan-
tages, such as the potential to reduce the power consumption
by multiple orders of magnitude, and to process the data in
real time. These so-called neuromorphic processors are ideal
for end-to-end processing scenarios for example in wearable
devices, where the streaming input needs to be monitored in
continuous time in an always-on manner.
There are already some works in the direction of processing
biomedical signals that explore both mixed analog-digital and
digital neuromorphic platforms, showing promising results for
always-on embedded biomedical systems. Table IV shows
a summary of today's large-scale neuromorphic processors
used for biomedical signal processing. The first chip presented
in this table is DYNAP-SE [111], a multi-core mixed-signal
neuromorphic implementation with analog neural dynamics
circuits and event-based asynchronous routing and commu-
nication circuits. The DYNAP-SE chip has been used to
implement four of the seven SNN processing systems listed
in Table III. These SNNs are used for the classification or
detection of EMG [104], [105] and ECG [103], [38]. The
DYNAP-SE was also used to build a spiking perceptron as part
of a design to classify and detect High-Frequency Oscillations
(HFO) in human intracranial EEG [42].
In [38], [103], [104], a spiking RNN is used to integrate
the ECG/EMG patterns temporally and separate them so that
they become classifiable with a linear read-out. A Support
Vector Machine (SVM) and a linear least-squares approximation
are used in the read-out layers of [103] and [38], reaching
overall anomaly-detection accuracies of 91% and 95%,
respectively. In [104], the timing and dynamic features of
the spiking RNN on EMG recordings were investigated for
classifying different hand gestures. In [105], the performance
of a feedforward SNN and a hardware-friendly spiking learn-
ing algorithm for hand gesture recognition using superficial
TABLE IV
NEUROMORPHIC PLATFORMS USED FOR BIOMEDICAL SIGNAL PROCESSING

| Neuromorphic Chip | DYNAP-SE | SpiNNaker | TrueNorth | Loihi | ODIN |
|---|---|---|---|---|---|
| CMOS Technology | 180 nm | ARM968, 130 nm | 28 nm | 14 nm FinFET | 28 nm FDSOI |
| Implementation | Mixed-signal | Digital | Digital ASIC | Digital ASIC | Digital ASIC |
| Neurons per core | 256 | 1000 (1M cores) | 256 | Max 1k | 256 |
| Synapses per core | 16k | 1M | 64k | 114k-1M | 64k |
| Energy per SOP | 17 pJ @ 1.8 V | Peak power 1 W per chip | 26 pJ @ 0.775 V | 23.6 pJ @ 0.75 V | 12.7 pJ @ 0.55 V |
| Size | 38.5 mm² | 102 mm² | - | 60 mm² | 0.086 mm² |
| Biosignal processing application | EMG [105], ECG [103], HFO [42] | EMG and EEG [106] | EEG and LFP [107] | EMG [39] | EMG [39] |
EMG was investigated and compared to traditional machine
learning approaches such as SVM. Results show that applying
an SVM to the spiking output of the hidden layer achieved a
classification rate of 84%, while the spiking learning method
achieved 74% with a power consumption of about 0.05 mW.
This consumption was compared to state-of-the-art embedded
systems, showing that the proposed spiking network is two
orders of magnitude more power efficient [112], [113].
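The recurrent-reservoir-plus-linear-readout scheme used in these works can be sketched in a few lines of Python. All dynamics, dimensions, and weight statistics below are illustrative stand-ins, not the parameters of the cited mixed-signal implementations: a fixed random recurrent network of leaky integrate-and-fire neurons projects the input spike train into a high-dimensional space, and only the readout is trained.

```python
import random

def lif_reservoir(spike_train, n=50, tau=0.9, thresh=1.0, seed=0):
    """Project a binary input spike train through a fixed random recurrent
    network of leaky integrate-and-fire neurons; return per-neuron mean
    firing rates, usable as features for a linear readout."""
    rng = random.Random(seed)
    w_in = [rng.uniform(-1, 1) for _ in range(n)]                  # input weights
    w_rec = [[rng.gauss(0, 0.1) for _ in range(n)] for _ in range(n)]  # recurrent weights
    v = [0.0] * n            # membrane potentials
    counts = [0.0] * n       # spike counts per neuron
    spikes = [0.0] * n       # spikes emitted at the previous time step
    for x in spike_train:    # one input spike (0/1) per time step
        new_spikes = [0.0] * n
        for i in range(n):
            rec = sum(w_rec[i][j] * spikes[j] for j in range(n))
            v[i] = tau * v[i] + w_in[i] * x + rec  # leaky integration
            if v[i] >= thresh:                     # fire and reset
                new_spikes[i] = 1.0
                v[i] = 0.0
        spikes = new_spikes
        for i in range(n):
            counts[i] += spikes[i]
    return [c / len(spike_train) for c in counts]  # mean-rate feature vector

features = lif_reservoir([1, 0, 1, 1, 0] * 10)
```

A linear readout, such as the least-squares fit or SVM used in [103], [38], is then trained on these rate vectors to separate the classes; the recurrent weights themselves are never updated.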
The other neuromorphic platforms listed in Table IV include
digital architectures such as SpiNNaker [114], TrueNorth [115]
and Loihi [116]. SpiNNaker has been used for EMG and
EEG processing, and the obtained results show better classi-
fication accuracy than traditional machine learning
methods [106].
for decoding EEG and LFP using CNNs. The network was
first developed in Caffe and the result was then used as a
basis for building a TrueNorth-compatible neural network. The
TrueNorth-compatible network achieved the highest classifi-
cation accuracy, around 76%. Recently, the benchmark hand-gesture
classification introduced in subsection II-D, was processed and
compared on two other digital neuromorphic platforms, i.e.
Loihi and ODIN/MorphIC [117], [118]. A spiking CNN was
implemented on Loihi and a spiking MLP was implemented
on ODIN/MorphIC [39]. The results achieved using these
networks are presented in Table V.
On-chip adaptation and learning mechanisms, such as those
present in some of the neuromorphic devices listed in Table IV,
could be a game changer for personalized medicine, where the
system can adapt to each patient's unique bio-signature and/or
its drift over time. However, the challenge of implementing effi-
cient on-chip online learning in these types of neuromorphic
architectures has not yet been solved. This challenge hinges on
two main factors: locality of the weight update and weight
storage.
Locality: Hardware constraints dictate that the information
required to update the weights of any on-chip network must
be locally available at the synapse; otherwise, most of the
silicon area would be consumed by the wires needed to route
the update information. As Hebbian learning satisfies this
requirement, most of the available on-chip learning algorithms
implement it in the form of unsupervised or semi-supervised
learning [117], [119]. However, local Hebbian-based
algorithms are limited to learning static patterns or to
very shallow networks [120]. There are also some efforts in
the direction of on-chip gradient-descent-based methods,
which implement error-based learning algorithms that
minimize the least mean square of a neural network cost
function. For example, the spike-based delta rule, which
underlies the backpropagation algorithm used in the vast
majority of current multi-layer neural networks, is the most
common weight update for single-layer networks. Single-layer
mixed-signal neuromorphic circuit implementations of the delta
rule have already been designed [121] and employed for EMG
classification [105]. Expanding this to multi-layer networks
involves non-local weight updates, which limits on-chip
implementation. Making the backpropagation algorithm local
is a topic of ongoing research [122], [123], [124].
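The locality of the delta rule can be made concrete with a short sketch. This is illustrative Python, not the mixed-signal circuit of [121]: the point is that each synapse's update depends only on its own presynaptic spike and a single shared error scalar, so no per-synapse routing of global gradient information is required.

```python
def delta_rule_step(w, pre_spikes, target, lr=0.1, thresh=1.0):
    """One local weight update for a single spiking output neuron.
    The error (target - output) and the presynaptic activity are both
    available at the synapse, so the update needs no global routing."""
    drive = sum(wi * si for wi, si in zip(w, pre_spikes))  # weighted input
    out = 1.0 if drive >= thresh else 0.0                  # spike / no spike
    err = target - out                                     # shared error signal
    # Each synapse updates from purely local quantities: err and its own input.
    return [wi + lr * err * si for wi, si in zip(w, pre_spikes)]

# Teach a 3-input neuron to fire for the input pattern [1, 1, 0].
w = [0.2, 0.2, 0.2]
for _ in range(20):
    w = delta_rule_step(w, [1, 1, 0], target=1.0)
# The weights of the active inputs grow until the neuron fires,
# then the error vanishes and the weights stop changing.
```

Extending this rule to hidden layers is exactly where locality breaks down, since the hidden-layer error would have to be routed back from downstream neurons.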
Weight storage: The holy grail of weight storage for online
on-chip learning is a non-volatile memory whose state can
be changed linearly in an analog fashion. Non-volatile
memristive devices hold great potential for this. Consequently,
there is a large body of literature on combining the maturity
of CMOS technology with the potential of emerging memories
to get the best of both worlds.
The integration of CMOS technology with that of the
emerging devices has been demonstrated for non-volatile fil-
amentary switches [125] already at a commercial level [126].
There have also been some efforts in combining CMOS and
memristor technologies to design supervised local error-based
learning circuits using only one network layer by exploiting
the properties of memristive devices [121], [127], [128].
Apart from the above-mentioned benefits in utilizing mem-
ristive devices for online learning in SNN-based neuromorphic
chips, as discussed in subsection III-B, memristive devices
have also shown interesting features to improve the power
consumption and delay of conventional DNNs. However, as
shown in Table III, memristor-based DNNs are very sparse in
the biomedical domain, and existing works are largely based
only on simulation.
C. Why is the use of MDNNs very limited in the biomedical
domain?
Currently, there are very few hardware implementations of
biomedical MDNNs that make use of general programmable
memristive-CMOS systems, and only one has been programmed
to construct an MLP for cancer diagnosis. We could find only
two other memristive designs in the literature for biomedical
applications (shown in Table III), but these are only simulations
of memristive crossbars. This sparsity persists despite the
significant advantages that memristors provide for MAC
parallelization and the in-memory computing paradigm, while
remaining compatible with CMOS technology. These features
make memristors ideal candidates for DL accelerators in
general, and for portable and edge-based healthcare applications
in particular, given the stringent device-size and power-consumption
requirements of such applications. Before memristive devices
can be used in the biomedical domain, though, several of their
shortcomings, such as limited endurance, device mismatch, and
analog noise accumulation, must be overcome first. This
demands further research in the materials,
overcome first. This demands further research in the materials,
as well as the circuit and system design side of this emerging
technology, while at the same time developing facilitator open-
source software [79] to support MDNNs. Furthermore, inves-
tigating the same techniques utilized in developing CMOS-
based DL accelerators such as limited precision data repre-
sentation [80], [89] and approximate computing schemes can
lead to advances in developing MDNNs and facilitate their use
in biomedical domains.
D. Why and when to use FPGA for biomedical DNNs?
Table III shows that FPGAs are a fairly popular hardware
technology for implementing simple DL networks such as
MLPs and, in one case, a complex LSTM. The table also shows
that FPGAs are mainly used for signal processing tasks and
have not been widely used to run complex DL architectures
such as CNNs, mainly because they have limited
on-chip memory and low bandwidth compared to GPUs.
However, they present notable benefits in terms of significantly
shorter development time compared to ASICs, and much lower
power consumption than typical GPUs. Besides, significant
power and latency improvement can be gained by customizing
the implementation of various components of a DNN on an
FPGA, compared to running it on a general-purpose CPU or
GPU [100], [102]. For instance, in [102], EEG signals are
processed on FPGAs using two customized hardware blocks
for (i) parallelizing MAC operations and (ii) efficient recurrent
state updates, both of which are key elements of LSTMs. This
has resulted in almost an order of magnitude improvement in
power efficiency compared to GPUs. This efficiency is critical
in many edge-computing applications including DNN-based point-of-care
biomedical devices [20] and healthcare IoT [19], [57].
Another benefit of FPGAs is that a customized efficient
FPGA design can be directly synthesized into an ASIC using
a nanometer-node CMOS technology to achieve even more
benefits. For instance, [102] has shown a nearly 100-fold
energy efficiency improvement when synthesized as an ASIC
in a 15-nm CMOS technology, compared to its FPGA counterpart.
Although low-power consumption and affordable cost are
two key factors for almost any edge-computing or near-sensor
device, these are even more important for biomedical devices
such as wearables, health-monitoring systems, and PoC de-
vices. Therefore, FPGAs present an appealing solution, where
their limitations can be addressed for a customized DNN using
specific design methods such as approximate computing [97]
and limited-precision data [94], [96], depending on the cost,
required power consumption, and the acceptable accuracy of
the biomedical device.
E. Benchmarking EMG processing across multiple DNN and
SNN hardware platforms
In Table V, we compare our FPGA and memristive im-
plementations to other DNN accelerators and neuromorphic
processors from [39]. Input and hidden layers are sequenced
with the ReLU activation function, and output layers are
fed through Softmax activation functions to determine class
probabilities. Dropout layers are used in all networks to avoid
over-fitting. The DNN architectures are given in the table
caption. Further implementation details can be found in [39].
The platforms used for each system in Table V are as
follows: ODIN+MorphIC [117], [118] and Loihi [116] neu-
romorphic platforms were used for spiking implementations;
NVIDIA Jetson Nano was used for all embedded GPU im-
plementations; OpenVINO Toolkit FPGA was used for all
FPGA implementations, and MemTorch [79] was used for
converting the MLP and CNN networks to their corresponding
MDNNs to determine the test set accuracies of all memristive
implementations.
From Table V, it can be observed that transitioning from
generalized architectures to application-specific processors
enables more optimized processing of a subset of given tasks.
Moving up the specificity hierarchy from GPUs to FPGAs to
memristive networks yields orders of magnitude of improvement
in both MLP and CNN processing, but naturally at the expense
of a generalizable range of tasks. While GPUs are relatively
efficient at training networks (compared to CPUs), the
impressive metrics presented by memristors (RRAM in these
simulations) are coupled with limited endurance. This is not
an issue for read-only tasks such as inference, but training is
thwarted by the thousands of epochs of weight updates, which
limits the broad use of RRAMs in training. Rather, further
exploration of alternative resistive technologies such as
Magnetoresistive Random Access Memory (MRAM) could prove
beneficial for tasks that demand high endurance.
After determining the test set accuracy of each MDNN using
MemTorch [79], we determined the energy required to perform
inference on a single input, the inference time, and the Energy-
Delay Product (EDP) using a similar approach to [129], for a
tiled memristor architecture. All presumptions made in our
calculations are listed below. Parameters are adopted from
those given in a 1T1R 65 nm technology, where the maximum
current during inference is 3 µA per cell with a read voltage of
0.3 V. Each cell is capable of storing 8 bits with a resistance
ratio of 100, and mapping signed weights is achieved using
a dual column representation. All convolutions are performed
by unrolling the kernels and performing MVMs, and the fully
connected layers have the fan-in weights for a single neuron
assigned to one column. Each crossbar has an aspect ratio
of 256×64 to enable more analog operations per ADC when
compared to a 128×128 array. Where there is insufficient
space to map weights to a single array, they are distributed
TABLE V
COMPARISON OF CONVENTIONAL DNNS IMPLEMENTED ON VARIOUS HARDWARE PLATFORMS WITH SPIKING DNN NEUROMORPHIC SYSTEMS ON THE
BENCHMARK BIOMEDICAL SIGNAL PROCESSING TASK OF HAND GESTURE RECOGNITION FOR BOTH SINGLE SENSOR AND SENSOR FUSION, AS
EXPLAINED IN SUBSECTION II-D. THE RESULTS OF THE ACCURACY ARE REPORTED WITH MEAN AND STANDARD DEVIATION OBTAINED OVER A 3-FOLD
CROSS VALIDATION. LOIHI, EMBEDDED GPU, AND ODIN+MORPHIC IMPLEMENTATION RESULTS ARE FROM [39]. THE DNN ARCHITECTURES
ADOPTED ARE AS FOLLOWS: 8C3-2P-16C3-2P-32C3-512-5 CNN; 16-128-128-5 MLP; 16-230-5 MLP; 4×400-210-5 MLP. EMG AND
APS/DVS NETWORKS ARE FUSED USING A 5-NEURON DENSE LAYER.

| Platform | Modality | Accuracy (%) | Energy (uJ) | Inference time (ms) | EDP (uJ·s) |
|---|---|---|---|---|---|
| Loihi (Spiking) | EMG (MLP) | 55.7 ± 2.7 | 173.2 ± 21.2 | 5.89 ± 0.18 | 1.0 ± 0.1 |
| | DVS (CNN) | 92.1 ± 1.2 | 815.3 ± 115.9 | 6.64 ± 0.14 | 5.4 ± 0.8 |
| | EMG+DVS (CNN) | 96.0 ± 0.4 | 1104.5 ± 58.8 | 7.75 ± 0.07 | 8.6 ± 0.5 |
| Embedded GPU | EMG (MLP) | 68.1 ± 2.8 | (25.5 ± 8.4)×10³ | 3.8 ± 0.1 | 97.3 ± 4.4 |
| | APS (CNN) | 92.4 ± 1.6 | (31.7 ± 7.4)×10³ | 5.9 ± 0.1 | 186.9 ± 3.9 |
| | EMG+APS (CNN) | 95.4 ± 1.7 | (32.1 ± 7.9)×10³ | 6.9 ± 0.05 | 221.1 ± 4.1 |
| FPGA | EMG (MLP) | 67.2 ± 2.3 | (17.6 ± 1.1)×10³ | 4.2 ± 0.1 | 74.1 ± 1.2 |
| | APS (CNN) | 96.7 ± 3.0 | (24.0 ± 1.2)×10³ | 5.4 ± 0.2 | 130.8 ± 1.4 |
| | EMG+APS (CNN) | 94.8 ± 2.0 | (31.2 ± 3.0)×10³ | 6.3 ± 0.1 | 196.3 ± 3.1 |
| Memristive | EMG (MLP) | 64.6 ± 2.2 | 0.038 | 6.0×10⁻⁴ | 2.38×10⁻⁸ |
| | APS (CNN) | 96.2 ± 3.3 | 4.83 | 1.0×10⁻³ | 4.83×10⁻⁶ |
| | EMG+APS (CNN) | 94.8 ± 2.0 | 4.90 | 1.2×10⁻³ | 5.88×10⁻⁶ |
| ODIN+MorphIC (Spiking) | EMG (MLP) | 53.6 ± 1.4 | 7.42 ± 0.11 | 23.5 ± 0.35 | 0.17 ± 0.01 |
| | DVS (MLP) | 85.1 ± 4.1 | 57.2 ± 6.8 | 17.3 ± 2.0 | 1.00 ± 0.24 |
| | EMG+DVS (MLP) | 89.4 ± 3.0 | 37.4 ± 4.2 | 19.5 ± 0.3 | 0.42 ± 0.08 |
| Embedded GPU | EMG (MLP) | 67.2 ± 3.6 | (23.9 ± 5.6)×10³ | 2.8 ± 0.08 | 67.2 ± 2.9 |
| | APS (MLP) | 84.2 ± 4.3 | (30.2 ± 7.5)×10³ | 6.9 ± 0.1 | 211.3 ± 6.1 |
| | EMG+APS (MLP) | 88.1 ± 4.1 | (32.0 ± 8.9)×10³ | 7.9 ± 0.05 | 253 ± 3.9 |
| FPGA | EMG (MLP) | 63.8 ± 1.4 | (13.9 ± 1.8)×10³ | 3.5 ± 0.1 | 48.9 ± 1.9 |
| | APS (MLP) | 82.9 ± 8.4 | (23.1 ± 2.6)×10³ | 5.7 ± 0.2 | 131.4 ± 2.8 |
| | EMG+APS (MLP) | 83.4 ± 2.8 | (31.1 ± 1.4)×10³ | 7.3 ± 0.2 | 228.2 ± 1.6 |
| Memristive | EMG (MLP) | 63.8 ± 1.4 | 0.026 | 4.0×10⁻⁴ | 1.04×10⁻⁸ |
| | APS (MLP) | 82.4 ± 8.5 | 0.18 | 4.0×10⁻⁴ | 7.2×10⁻⁸ |
| | EMG+APS (MLP) | 83.4 ± 2.8 | 0.33 | 6.0×10⁻⁴ | 1.98×10⁻⁷ |
across multiple arrays, with their results to be added digitally.
Throughput can be improved at the expense of additional
arrays for convolutional layers, by duplicating kernels such
that multiple inputs can be processed in parallel. The number
of tiles used for each network is assumed to be the exact
number required to balance the processing time of each layer.
The power consumption of each current-mode 8-bit ADC is
estimated to be 2×10⁻⁴ W with an operating frequency of
40 MHz (5 MHz for bit-serial operation). The ADC latency
is presumed to dominate digital addition of partial products
from various tiles. The dynamic range of each ADC has been
adapted to the maximum possible range for each column, and
each ADC occupies a pair of columns.
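These presumptions reduce the per-inference metrics to a simple accounting of crossbar read energy and ADC energy. The sketch below illustrates the shape of that accounting for a single 256×64 tile; the cell current (3 µA), read voltage (0.3 V), ADC power (2×10⁻⁴ W), and ADC frequency (40 MHz) come from the text, while the 25 ns cell read window and the ADC-dominated latency model are simplifying assumptions, not the full tile-level model used for Table V.

```python
def crossbar_inference_metrics(n_cells, n_adc_reads, i_cell=3e-6, v_read=0.3,
                               t_read=25e-9, p_adc=2e-4, f_adc=40e6):
    """Rough per-inference energy (J), latency (s), and EDP (J*s) for one
    memristor crossbar tile: analog cell read energy plus ADC energy,
    with latency assumed to be dominated by the ADC conversions."""
    e_cells = n_cells * i_cell * v_read * t_read  # worst-case analog MAC energy
    t_adc = n_adc_reads / f_adc                   # ADC-dominated latency
    e_adc = p_adc * t_adc                         # ADC energy over that window
    energy = e_cells + e_adc
    return energy, t_adc, energy * t_adc          # EDP = energy * delay

# One read of a full 256x64 tile, with 64 column conversions by one ADC.
energy, delay, edp = crossbar_inference_metrics(n_cells=256 * 64, n_adc_reads=64)
```

Even in this simplified form, the model reproduces the qualitative trend in Table V: sub-microjoule energies and sub-microsecond-scale latencies per tile read, with the ADC contribution comparable to the analog array itself.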
The above presumptions lead to pre-silicon results that
are extremely promising for memristor arrays, as shown in
Table V. But it should be clear that these calculations were
performed for network-specific architectures, rather than a
more general application-specific use-case. That is, we assume
the chip has been designed for a given neural network model.
The other comparison benchmarks are far more generalizable,
in that they are suited to not only handle most network
topologies, but are also well-suited for training. The substantial
improvement of inference time over other methods is a result
of duplicate weights being mapped to enable higher parallelism,
which is tolerable for small architectures, but leads
to prohibitively large ADC power consumption for computer
vision tasks which rely on deep networks and millions of pa-
rameters, such as VGG-16. The use of memristors as synapses
in spike-based implementations may be more appropriate, so
as to reduce the ADC overhead by replacing multi-bit ADCs
with current sense amplifiers instead, and reducing the reliance
on analog current summation along resistive and capacitive
bit-lines.
Spike-based hardware shows approximately two orders of
magnitude improvement in EDP in Table V when compared
to its GPU and FPGA counterparts, which highlights the
prospective use of such architectures in always-on
monitoring. This is necessary for enhancing the prospect of
ambient-assisted living, which would allow medical resources
to be freed up for tasks that are not suited for automation. In
general, one would expect that data should be processed in its
natural form. For example, 2D CNNs do not discard the
spatial relations between pixels in an image. Graph networks
are optimized for connectionist data, such as the structure
of proteins. By extension, the discrete events generated by
electrical impulses such as in EMGs, EEGs and ECGs may
also be optimized for SNNs. Of course, this discounts any
subthreshold firing patterns of measured neuron populations.
But one possible explanation for the suitability of spiking
hardware for biological processes stems from the natural
timing of neuronal action potentials. Individual neurons will
typically not fire in excess of 100 Hz, and the average heart
rate (and correspondingly, ECG spiking rate) will not exceed 3
Hz. There is a clear mismatch between the clock rate of
non-spiking neural network hardware, which tends to be at
least in the MHz range, and spike-driven processes. This introduces a
significant amount of wastage in processing data when there
is no new information to process (e.g., in between heartbeats,
action potentials, or neural activity).
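The scale of this mismatch can be made concrete with a back-of-the-envelope calculation using the event rates stated above; the 1 MHz clock and the one-useful-cycle-per-event assumption are illustrative simplifications.

```python
def idle_cycle_fraction(clock_hz, event_hz):
    """Fraction of clock cycles during which no new event arrives,
    assuming (for illustration) one useful cycle per event."""
    return 1.0 - event_hz / clock_hz

# A 1 MHz clocked accelerator monitoring a ~3 Hz ECG stream
ecg_idle = idle_cycle_fraction(1e6, 3.0)      # ~99.9997% of cycles idle
# ...versus a ~100 Hz neural spike train
spike_idle = idle_cycle_fraction(1e6, 100.0)  # ~99.99% of cycles idle
```

Event-driven hardware sidesteps this waste by consuming energy only on the arrival of a spike, rather than on every clock edge.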
Nonetheless, it is clear that accuracy is compromised when
relying on EMG signals alone, based on the approximately
10% decrease of classification accuracy on the Loihi chip and
ODIN+MorphIC, compared to their GPU/FPGA counterparts.
This could be a result of spike-based training algorithms
lagging behind in maturity compared to conventional neural
network methods, or it could be an indication that critical
information is being discarded when neglecting the subthresh-
old signals generated by populations of neurons. But when
EMG and DVS data are combined, the multi-sensory fusion
of spiking signals positively reinforces both modalities,
yielding an approximately 4% accuracy improvement, whereas
combining non-spiking, mismatched data representations leads
to marginal improvements, and even a destructive effect (e.g.,
non-spiking CNN implementation on FPGA and memristive
arrays). This may be a result of EMG and APS data taking on
completely different structures. This is a possible indication
that feature extraction from merging the same structural form
of data (i.e., as spikes) proves to be more beneficial than com-
bining a pair of networks with two completely different modes
of data (i.e., EMG signals with pixel-driven images). This
allows us to draw an important hypothesis: neural networks
can benefit from a consistent representation of data generated
by various sensory mechanisms. This is supported by biology,
where sensory information is typically conveyed by graded
potentials or spiking action potentials.
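The hypothesis of a consistent data representation can be illustrated with a simple encoder. The sketch below uses delta modulation, one common scheme for converting sampled amplitudes (e.g., EMG) into the same ON/OFF event format that a DVS camera produces natively; it is an assumed, generic encoding, not necessarily the preprocessing used in the benchmarks discussed above:

```python
def delta_encode(samples, threshold):
    """Convert a sampled signal into (sample_index, polarity) spike events.

    An event is emitted each time the signal moves by `threshold` relative
    to the reference level of the previous event, mirroring how a DVS pixel
    emits ON/OFF events on log-intensity changes."""
    events = []
    ref = samples[0]
    for i, x in enumerate(samples[1:], start=1):
        while x - ref >= threshold:      # rising amplitude: ON spikes
            ref += threshold
            events.append((i, +1))
        while ref - x >= threshold:      # falling amplitude: OFF spikes
            ref -= threshold
            events.append((i, -1))
    return events

# A toy EMG-like burst: three ON events on the rise, three OFF on the decay.
print(delta_encode([0.0, 0.5, 2.5, 3.0, 1.0, 0.0], threshold=1.0))
# → [(2, 1), (2, 1), (3, 1), (4, -1), (4, -1), (5, -1)]
```

Once both modalities are expressed as events, a single spiking network can fuse them without reconciling mismatched representations.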
V. CONCLUSION
The use of DL in biomedical signal processing and health-
care promises significant utility for medical practitioners and
their patients. DNNs can be used to improve the quality of life
for chronically ill patients by enabling ambient monitoring for
abnormalities, and correspondingly can reduce the burden on
medical resources. Proper use can lead to reduced workloads
for medical practitioners who may divert their attention to
time-critical tasks that require a standard beyond what neural
networks can achieve at this point in time.
We have stepped through the use of various DL accelera-
tors on a disparate range of medical tasks, and shown how
SNNs may complement DNNs where hardware efficiency is
the primary bottleneck for widespread integration. We have
provided a balanced view of how memristors may lead to
optimal hardware processing of both DNNs and SNNs, and
have highlighted the challenges that must be overcome before
they can be adopted at scale. While the focus of this
tutorial and review is on hardware implementation of various
DL algorithms, the reader should be mindful that progress
in hardware is a necessary, but insufficient, condition for
successful integration of medical-AI.
Adopting medical-AI tools is clearly a challenge that de-
mands the collaborative attention of healthcare providers, hard-
ware and software engineers, data scientists, policy-makers,
cognitive neuroscientists, device engineers and materials sci-
entists, amongst other specializations. A unified approach
to developing better hardware can have pervasive impacts
upon the healthcare industry, and realize significant payoff by
improving the accessibility and outcomes of healthcare.
ACKNOWLEDGMENT
This paper is supported in part by the European Union's
Horizon 2020 ERC project NeuroAgents (Grant No. 724295).
REFERENCES
[1] T. Arevalo, “The State of Health Care Industry- Statistics & Facts,”
PolicyAdvice, Tech. Rep., Apr. 2020.
[2] G. Rong, A. Mendez, E. B. Assi, B. Zhao, and M. Sawan, “Artificial
Intelligence in Healthcare: Review and Prediction Case Studies,”
Engineering, vol. 6, no. 3, pp. 291–301, 2020.
[3] V. Jindal, “Integrating Mobile and Cloud for PPG Signal Selection to
Monitor Heart Rate during Intensive Physical Exercise,” in Proceedings
of the International Conference on Mobile Software Engineering and
Systems (MOBILESOFT), Austin, TX., May 2016, pp. 36–37.
[4] P. Sundaravadivel, K. Kesavan, L. Kesavan, S. P. Mohanty, and
E. Kougianos, “Smart-Log: A Deep-Learning Based Automated Nutri-
tion Monitoring System in the IoT,” IEEE Transactions on Consumer
Electronics, vol. 64, no. 3, pp. 390–398, 2018.
[5] B. Shi, L. J. Grimm, M. A. Mazurowski, J. A. Baker, J. R. Marks,
L. M. King, C. C. Maley, E. S. Hwang, and J. Y. Lo, “Prediction of
occult invasive disease in ductal carcinoma in situ using deep learning
features,” Journal of the American College of Radiology, vol. 15, no. 3,
pp. 527–534, 2018.
[6] X. Liu, L. Faes, A. U. Kale, S. K. Wagner, D. J. Fu, A. Bruynseels,
T. Mahendiran, G. Moraes, M. Shamdas, C. Kern et al., “A comparison
of deep learning performance against health-care professionals in
detecting diseases from medical imaging: a systematic review and
meta-analysis,” The Lancet Digital Health, vol. 1, no. 6, pp. e271–
e297, 2019.
[7] F. Liu, P. Yadav, A. M. Baschnagel, and A. B. McMillan, “MR-
based Treatment Planning in Radiation Therapy Using a Deep Learning
Approach,” Journal of Applied Clinical Medical Physics, vol. 20, no. 3,
pp. 105–114, 2019.
[8] W. Zhu, L. Xie, J. Han, and X. Guo, “The Application of Deep
Learning in Cancer Prognosis Prediction,” Cancers, vol. 12, no. 3, p.
603, 2020.
[9] S. M. McKinney, M. Sieniek, V. Godbole, J. Godwin, N. Antropova,
H. Ashrafian, T. Back, M. Chesus, G. C. Corrado, A. Darzi et al.,
“International evaluation of an AI system for breast cancer screening,”
Nature, vol. 577, no. 7788, pp. 89–94, 2020.
[10] A. Y. Hannun, P. Rajpurkar, M. Haghpanahi, G. H. Tison, C. Bourn,
M. P. Turakhia, and A. Y. Ng, “Cardiologist-level arrhythmia detection
and classification in ambulatory electrocardiograms using a deep neural
network,” Nature Medicine, vol. 25, no. 1, p. 65, 2019.
[11] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau,
and S. Thrun, “Dermatologist-level classification of skin cancer with
deep neural networks,” Nature, vol. 542, no. 7639, pp. 115–118, 2017.
[12] T. Kalaiselvi, P. Sriramakrishnan, and K. Somasundaram, “Survey of
using GPU CUDA programming model in medical image analysis,”
Informatics in Medicine Unlocked, vol. 9, pp. 133–144, 2017.
[13] N. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa,
S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin,
C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb,
T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R.
Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey,
A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar,
S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke,
A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Na-
garajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick,
N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani,
C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing,
M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan,
R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-Datacenter
Performance Analysis of a Tensor Processing Unit,” in Proceedings
of the International Symposium on Computer Architecture (ISCA),
Toronto, Canada., Jun. 2017.
[14] A. Esteva, A. Robicquet, B. Ramsundar, V. Kuleshov, M. DePristo,
K. Chou, C. Cui, G. Corrado, S. Thrun, and J. Dean, “A guide to deep
learning in healthcare,” Nature Medicine, vol. 25, no. 1, pp. 24–29,
2019.
[15] N. G. Peter et al., “NVIDIA Fermi: The First Complete GPU Com-
puting Architecture,” A White Paper of NVIDIA, 2009.
[16] Z. Jia, M. Maggioni, B. Staiger, and D. P. Scarpazza, “Dissecting
the NVIDIA Volta GPU Architecture via Microbenchmarking,” arXiv
preprint arXiv:1804.06826, 2018.
[17] R. Zemouri, N. Zerhouni, and D. Racoceanu, “Deep Learning in the
Biomedical Applications: Recent and Future Status,” Applied Sciences,
vol. 9, no. 8, p. 1526, 2019.
[18] E. Smistad, T. L. Falch, M. Bozorgi, A. C. Elster, and F. Lindseth,
“Medical Image Segmentation on GPUs–A Comprehensive Review,”
Medical Image Analysis, vol. 20, no. 1, pp. 1–18, 2015.
[19] B. Farahani, F. Firouzi, and K. Chakrabarty, “Healthcare IoT,” in
Intelligent Internet of Things. Springer, 2020, pp. 515–545.
[20] Q. Xie, K. Faust, R. Van Ommeren, A. Sheikh, U. Djuric, and P. Dia-
mandis, “Deep Learning for Image Analysis: Personalizing Medicine
Closer to the Point of Care,” Critical Reviews in Clinical Laboratory
Sciences, vol. 56, no. 1, pp. 61–73, 2019.
[21] M. Hartmann, U. S. Hashmi, and A. Imran, “Edge computing in
smart health care systems: Review, challenges, and research directions,”
Transactions on Emerging Telecommunications Technologies, p. e3710,
2019.
[22] I. Azimi, A. Anzanpour, A. M. Rahmani, T. Pahikkala, M. Levorato,
P. Liljeberg, and N. Dutt, “Hich: Hierarchical Fog-Assisted Computing
Architecture for Healthcare IoT,” ACM Transactions on Embedded
Computing Systems (TECS), vol. 16, no. 5s, pp. 1–20, 2017.
[23] K. Sethi, V. Parmar, and M. Suri, “Low-Power Hardware-Based Deep-
Learning Diagnostics Support Case Study,” in Proceedings of the IEEE
Biomedical Circuits and Systems Conference (BioCAS), Cleveland,
OH., Oct. 2018.
[24] P. Sahu, D. Yu, and H. Qin, “Apply lightweight deep learning on
internet of things for low-cost and easy-to-access skin cancer detec-
tion,” in Proceedings of Medical Imaging 2018: Imaging Informatics
for Healthcare, Research, and Applications, vol. 10579, Houston, TX,
Feb. 2018, p. 1057912.
[25] Y. Wei, J. Zhou, Y. Wang, Y. Liu, Q. Liu, J. Luo, C. Wang, F. Ren, and
L. Huang, “A Review of Algorithm & Hardware Design for AI-Based
Biomedical Applications,” IEEE Transactions on Biomedical Circuits
and Systems, vol. 14, no. 2, pp. 145–163, 2020.
[26] D. E. Rumelhart, G. Hinton, and R. J. Williams, “Learning represen-
tations by back-propagating errors,” Nature, vol. 323, no. 6088, pp.
533–538, 1986.
[27] T. C. Hollon, B. Pandian, A. R. Adapa, E. Urias, A. V. Save, S. S. S.
Khalsa, D. G. Eichberg, R. S. D’Amico, Z. U. Farooq, S. Lewis et al.,
“Near real-time intraoperative brain tumor diagnosis using stimulated
Raman histology and deep neural networks,” Nature Medicine, pp. 1–7,
2020.
[28] B. Shickel, P. J. Tighe, A. Bihorac, and P. Rashidi, “Deep EHR: A
Survey of Recent Advances in Deep Learning Techniques for Elec-
tronic Health Record (EHR) Analysis,” IEEE Journal of Biomedical
and Health Informatics, vol. 22, no. 5, pp. 1589–1604, 2017.
[29] M. A. Sayeed, S. P. Mohanty, E. Kougianos, and H. P. Zaveri,
“Neuro-Detect: A Machine Learning-Based Fast and Accurate Seizure
Detection System in the IoMT,” IEEE Transactions on Consumer
Electronics, vol. 65, no. 3, pp. 359–368, 2019.
[30] N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall,
M. B. Gotway, and J. Liang, “Convolutional Neural Networks for Med-
ical Image Analysis: Full Training or Fine Tuning?” IEEE Transactions
on Medical Imaging, vol. 35, no. 5, pp. 1299–1312, 2016.
[31] J. Gao, H. Zhang, P. Lu, and Z. Wang, “An Effective LSTM Recurrent
Network to Detect Arrhythmia on Imbalanced ECG Dataset,” Journal
of Healthcare Engineering, vol. 2019, 2019.
[32] D. Zhang, L. Yao, X. Zhang, S. Wang, W. Chen, R. Boots, and B. Bena-
tallah, “Cascade and Parallel Convolutional Recurrent Neural Networks
on EEG-based Intention Recognition for Brain Computer Interface,” in
Proceedings of the AAAI Conference on Artificial Intelligence, New
Orleans, LA., Feb. 2018.
[33] X. Zhou, Y. Li, and W. Liang, “CNN-RNN Based Intelligent Rec-
ommendation for Online Medical Pre-Diagnosis Support,” IEEE/ACM
Transactions on Computational Biology and Bioinformatics, 2020.
[34] J. Laitala, M. Jiang, E. Syrjälä, E. K. Naeini, A. Airola, A. M. Rahmani,
N. D. Dutt, and P. Liljeberg, “Robust ECG R-peak detection using
LSTM,” in Proceedings of the ACM Symposium on Applied Computing
(SAC), Brno, Czech Republic., Mar. 2020, pp. 1104–1111.
[35] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT
press, 2016.
[36] G. Indiveri and S.-C. Liu, “Memory and Information Processing in
Neuromorphic Systems,” Proceedings of the IEEE, vol. 103, no. 8, pp.
1379–1397, 2015.
[37] F. Corradi and G. Indiveri, “A Neuromorphic Event-Based Neural
Recording System for Smart Brain-Machine-Interfaces,” IEEE Trans-
actions on Biomedical Circuits and Systems, vol. 9, no. 5, pp. 699–709,
2015.
[38] F. Corradi, S. Pande, J. Stuijt, N. Qiao, S. Schaafsma, G. Indiveri,
and F. Catthoor, “ECG-based Heartbeat Classification in Neuromorphic
Hardware,” in Proceedings of the International Joint Conference on
Neural Networks (IJCNN), Budapest, Hungary., Jul. 2019.
[39] E. Ceolini, C. Frenkel, S. B. Shrestha, G. Taverni, L. Khacef, M. Pay-
vand, and E. Donati, “Hand-gesture recognition based on EMG and
event-based camera sensor fusion: a benchmark in neuromorphic
computing,” Frontiers in Neuroscience, no. 520438, p. 36, 2020.
[40] J. K. Eshraghian, K. Cho, C. Zheng, M. Nam, H. H.-C. Iu, W. Lei, and
K. Eshraghian, “Neuromorphic vision hybrid rram-cmos architecture,”
IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
vol. 26, no. 12, pp. 2816–2829, 2018.
[41] X. Lagorce, G. Orchard, F. Galluppi, B. E. Shi, and R. B. Benosman,
“Hots: a hierarchy of event-based time-surfaces for pattern recogni-
tion,” IEEE transactions on pattern analysis and machine intelligence,
vol. 39, no. 7, pp. 1346–1359, 2016.
[42] M. Sharifshazileh, K. Burelo, T. Fedele, J. Sarnthein, and G. Indiveri,
“A Neuromorphic Device for Detecting High-Frequency Oscillations in
Human iEEG,” in Proceedings of the IEEE International Conference on
Electronics, Circuits and Systems (ICECS), Genova, Italy., Nov. 2019,
pp. 69–72.
[43] P. Lichtsteiner, C. Posch, and T. Delbruck, “A 128×128 120 dB 15
μs Latency Asynchronous Temporal Contrast Vision Sensor,” IEEE
Journal of Solid-state Circuits, vol. 43, no. 2, pp. 566–576, 2008.
[44] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kep-
ner, “Survey and Benchmarking of Machine Learning Accelerators,”
arXiv preprint arXiv:1908.11348, 2019.
[45] “Edge TPU,” https://coral.ai/docs/edgetpu/faq/.
[46] J. Hruska, “Intel Nervana Inference and Training AI Cards,”
https://www.extremetech.com/computing/296990-intel-nervana-nnp-i-
nnp-t-a-training-inference.
[47] P. Kennedy, “Huawei Ascend 310,”
https://www.servethehome.com/huawei-ascend-910-provides-a-nvidia-
ai-training-alternative/.
[48] J. Lee, J. Lee, D. Han, J. Lee, G. Park, and H.-J. Yoo, “LNPU:
A 25.3 TFLOPS/W Sparse Deep-Neural-Network Learning Processor
with Fine-Grained Mixed Precision of FP8-FP16,” in Proceedings of
the IEEE International Solid-State Circuits Conference (ISSCC), San
Francisco, CA., Feb. 2019, pp. 142–144.
[49] D. Shin, J. Lee, J. Lee, J. Lee, and H.-J. Yoo, “DNPU: An Energy-
Efficient Deep-Learning Processor with Heterogeneous Multi-Core
Architecture,” IEEE Micro, vol. 38, no. 5, pp. 85–93, 2018.
[50] S. Yin, P. Ouyang, S. Tang, F. Tu, X. Li, S. Zheng, T. Lu, J. Gu,
L. Liu, and S. Wei, “A high energy efficient reconfigurable hybrid
neural network processor for deep learning applications,” IEEE Journal
of Solid-State Circuits, vol. 53, no. 4, pp. 968–982, 2017.
[51] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo, “UNPU: An
Energy-Efficient Deep Neural Network Accelerator With Fully Variable
Weight Bit Precision,” IEEE Journal of Solid-State Circuits, vol. 54,
no. 1, pp. 173–185, 2018.
[52] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and
Y. Chen, “Cambricon-X: An accelerator for sparse neural networks,” in
Proceedings of the IEEE/ACM International Symposium on Microar-
chitecture (MICRO), Taipei, Taiwan., Oct. 2016.
[53] J. Zhang, S. Gajjala, P. Agrawal, G. H. Tison, L. A. Hallock,
L. Beussink-Nelson, E. Fan, M. A. Aras, C. Jordan, K. E. Fleischmann
et al., “A Computer Vision Pipeline for Automated Determination of
Cardiac Structure and Function and Detection of Disease by Two-
Dimensional Echocardiography,” arXiv preprint arXiv:1706.07342,
2017.
[54] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An Energy-
Efficient Reconfigurable Accelerator for Deep Convolutional Neural
Networks,” IEEE Journal of Solid-state Circuits, vol. 52, no. 1, pp.
127–138, 2016.
[55] Q. Guan, Y. Wang, B. Ping, D. Li, J. Du, Y. Qin, H. Lu, X. Wan,
and J. Xiang, “Deep Convolutional Neural Network VGG-16 Model
for Differential Diagnosing of Papillary Thyroid Carcinomas in Cyto-
SUBMITTED TO IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS 20
logical Images: A Pilot Study,” Journal of Cancer, vol. 10, no. 20, p.
4876, 2019.
[56] L. Cavigelli and L. Benini, “Origami: A 803-GOp/s/W Convolutional
Network Accelerator,” IEEE Transactions on Circuits and Systems for
Video Technology, vol. 27, no. 11, pp. 2461–2475, 2016.
[57] I. Azimi, J. Takalo-Mattila, A. Anzanpour, A. M. Rahmani, J.-P.
Soininen, and P. Liljeberg, “Empowering Healthcare IoT Systems
with Hierarchical Edge-Based Deep Learning,” in Proceedings of the
International Conference on Connected Health: Applications, Systems
and Engineering Technologies (CHASE), Washington, DC., Sep. 2018,
pp. 63–68.
[58] B. Moons and M. Verhelst, “An Energy-Efficient Precision-Scalable
ConvNet Processor in 40-nm CMOS,” IEEE Journal of Solid-state
Circuits, vol. 52, no. 4, pp. 903–914, 2016.
[59] M. Blaivas and L. Blaivas, “Are All Deep Learning Architectures
Alike for Point-of-Care Ultrasound?: Evidence From a Cardiac Image
Classification Model Suggests Otherwise,” Journal of Ultrasound in
Medicine, 2019.
[60] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, “Envision:
A 0.26-to-10TOPS/W subword-parallel dynamic-voltage-accuracy-
frequency-scalable Convolutional Neural Network processor in 28nm
FDSOI,” in Proceedings of the IEEE International Solid-State Circuits
Conference (ISSCC), San Francisco, CA., Feb. 2017, pp. 246–247.
[61] M.-P. Hosseini, T. X. Tran, D. Pompili, K. Elisevich, and H. Soltanian-
Zadeh, “Deep Learning with Edge Computing for Localization of
Epileptogenicity Using Multimodal rs-fMRI and EEG Big Data,”
in Proceedings of the IEEE International Conference on Autonomic
Computing (ICAC), Columbus, OH., Jul. 2017, pp. 83–92.
[62] J. Song, Y. Cho, J.-S. Park, J.-W. Jang, S. Lee, J.-H. Song, J.-G. Lee,
and I. Kang, “An 11.5TOPS/W 1024-MAC Butterfly Structure Dual-
Core Sparsity-Aware Neural Processing Unit in 8nm Flagship Mobile
SoC,” in Proceedings of the IEEE International Solid-State Circuits
Conference (ISSCC), San Francisco, CA., Feb. 2019, pp. 130–132.
[63] F. Preiswerk, C.-C. Cheng, J. Luo, and B. Madore, “Synthesizing
Dynamic MRI Using Long-Term Recurrent Convolutional Networks,”
in Proceedings of the International Workshop on Machine Learning in
Medical Imaging (MLMI). Granada, Spain.: Springer, Sep. 2018, pp.
89–97.
[64] J. P. Queralta, T. N. Gia, H. Tenhunen, and T. Westerlund, “Edge-
AI in LoRa-based Health Monitoring: Fall Detection System with Fog
Computing and LSTM Recurrent Neural Networks,” in Proceedings
of the International Conference on Telecommunications and Signal
Processing (TSP), 2019, pp. 601–604.
[65] I. M. Baltruschat, H. Nickisch, M. Grass, T. Knopp, and A. Saalbach,
“Comparison of Deep Learning Approaches for Multi-Label Chest X-
Ray Classification,” Scientific Reports, vol. 9, no. 1, pp. 1–10, 2019.
[66] G. Zamzmi, L.-Y. Hsu, W. Li, V. Sachdev, and S. Antani, “Harnessing
Machine Intelligence in Automatic Echocardiogram Analysis: Current
Status, Limitations, and Future Directions,” IEEE Reviews in Biomed-
ical Engineering, 2020.
[67] P.-H. Pham, D. Jelaca, C. Farabet, B. Martini, Y. LeCun, and E. Cu-
lurciello, “NeuFlow: Dataflow vision processing system-on-a-chip,” in
Proceedings of the IEEE International Midwest Symposium on Circuits
and Systems (MWSCAS), Fort Collins, CO., Aug. 2012, pp. 1044–1047.
[68] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides,
J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray et al., “A
reconfigurable fabric for accelerating large-scale datacenter services,”
in Proceedings of the ACM/IEEE International Symposium on Com-
puter Architecture (ISCA), Minneapolis, MN., Jun. 2014, pp. 13–24.
[69] D. Abts, J. Ross, J. Sparling, M. Wong-VanHaren, M. Baker,
T. Hawkins, A. Bell, J. Thompson, T. Kahsai, G. Kimmell et al.,
“Think Fast: A Tensor Streaming Processor (TSP) for Accelerating
Deep Learning Workloads,” in Proceedings of the ACM/IEEE Interna-
tional Symposium on Computer Architecture (ISCA), Valencia, Spain.,
May 2020.
[70] H. Kung and C. E. Leiserson, “Systolic Arrays (for VLSI),” in
Proceedings of Sparse Matrix, vol. 1. Society for industrial and applied
mathematics, 1979, pp. 256–282.
[71] H.-T. Kung, “Why systolic architectures?” Computer, no. 1, pp. 37–46,
1982.
[72] G. Burr, P. Narayanan, R. Shelby, S. Sidler, I. Boybat, C. di Nolfo,
and Y. Leblebici, “Large-scale neural networks implemented with
non-volatile memory as the synaptic weight element: Comparative
performance analysis (accuracy, speed, and power),” in Proceedings of
the IEEE International Electron Devices Meeting (IEDM), Washington,
DC., Dec. 2015.
[73] S. Ambrogio, P. Narayanan, H. Tsai, R. M. Shelby, I. Boybat, C. Nolfo,
S. Sidler, M. Giordano, M. Bodini, N. C. Farinha et al., “Equivalent-
accuracy accelerated neural-network training using analogue memory,”
Nature, vol. 558, no. 7708, p. 60, 2018.
[74] P. Yao, H. Wu, B. Gao, J. Tang, Q. Zhang, W. Zhang, J. J. Yang, and
H. Qian, “Fully hardware-implemented memristor convolutional neural
network,” Nature, vol. 577, no. 7792, pp. 641–646, 2020.
[75] J. K. Eshraghian, S.-M. Kang, S. Baek, G. Orchard, H. H.-C. Iu, and
W. Lei, “Analog weights in reram dnn accelerators,” in 2019 IEEE
International Conference on Artificial Intelligence Circuits and Systems
(AICAS). IEEE, 2019, pp. 267–271.
[76] M. R. Azghadi, B. Linares-Barranco, D. Abbott, and P. H. Leong, “A
Hybrid CMOS-Memristor Neuromorphic Synapse,” IEEE Transactions
on Biomedical Circuits and Systems, vol. 11, no. 2, pp. 434–445, 2017.
[77] M. Rahimi Azghadi, Y.-C. Chen, J. K. Eshraghian, J. Chen, C.-Y.
Lin, A. Amirsoleimani, A. Mehonic, A. J. Kenyon, B. Fowler, J. C.
Lee et al., “Complementary metal-oxide semiconductor and memristive
hardware for neuromorphic computing,” Advanced Intelligent Systems,
vol. 2, no. 5, p. 1900189, 2020.
[78] Q. Xia and J. J. Yang, “Memristive crossbar arrays for brain-inspired
computing,” Nature Materials, vol. 18, no. 4, pp. 309–323, 2019.
[79] C. Lammie, W. Xiang, B. Linares-Barranco, and M. R. Azghadi,
“MemTorch: An Open-source Simulation Framework for Memristive
Deep Learning Systems,” arXiv preprint arXiv:2004.10971, 2020.
[80] C. Lammie, O. Krestinskaya, A. James, and M. R. Azghadi, “Variation-
aware Binarized Memristive Networks,” in Proceedings of the Inter-
national Conference on Electronics, Circuits and Systems (ICECS),
Genova, Italy., Nov. 2019, pp. 490–493.
[81] O. Krestinskaya, K. N. Salama, and A. P. James, “Learning in Mem-
ristive Neural Network Architectures Using Analog Backpropagation
Circuits,” IEEE Transactions on Circuits and Systems I: Regular
Papers, vol. 66, no. 2, pp. 719–732, 2018.
[82] S. Yu, P.-Y. Chen, Y. Cao, L. Xia, Y. Wang, and H. Wu, “Scaling-up
resistive synaptic arrays for neuro-inspired architecture: Challenges and
prospect,” in Proceedings of the IEEE International Electron Devices
Meeting (IEDM), Washington, DC., Dec. 2015.
[83] N. Bien, P. Rajpurkar, R. L. Ball, J. Irvin, A. Park, E. Jones, M. Bereket,
B. N. Patel, K. W. Yeom, K. Shpanskaya et al., “Deep-learning-
assisted diagnosis for knee magnetic resonance imaging: development
and retrospective validation of MRNet,” PLoS Medicine, vol. 15, no. 11,
p. e1002699, 2018.
[84] A. Ankit, I. E. Hajj, S. R. Chalamalasetti, G. Ndu, M. Foltin,
R. S. Williams, P. Faraboschi, W. Hwu, J. P. Strachan, K. Roy,
and D. S. Milojicic, “PUMA: A Programmable Ultra-efficient
Memristor-based Accelerator for Machine Learning Inference,” CoRR,
vol. abs/1901.10351, 2019. [Online]. Available: http://arxiv.org/abs/
1901.10351
[85] S. Kvatinsky, M. Ramadan, E. G. Friedman, and A. Kolodny,
“VTEAM: A General Model for Voltage-controlled Memristors,” IEEE
Transactions on Circuits and Systems II: Express Briefs, vol. 62, no. 8,
pp. 786–790, 2015.
[86] E. Yalon, A. Gavrilov, S. Cohen, D. Mistele, B. Meyler, J. Salzman,
and D. Ritter, “Resistive Switching in HfO2 Probed by a Metal–
Insulator–Semiconductor Bipolar Transistor,” IEEE Electron Device
Letters, vol. 33, no. 1, pp. 11–13, 2012.
[87] A. M. Hassan, A. F. Khalaf, K. S. Sayed, H. H. Li, and Y. Chen,
“Real-Time Cardiac Arrhythmia Classification Using Memristor Neu-
romorphic Computing System,” in Proceedings of the International
Conference of the IEEE Engineering in Medicine and Biology Society
(EMBC), Honolulu, HI, Jul. 2018, pp. 2567–2570.
[88] F. Cai, J. M. Correll, S. H. Lee, Y. Lim, V. Bothra, Z. Zhang, M. P.
Flynn, and W. D. Lu, “A fully integrated reprogrammable memristor–
CMOS system for efficient multiply–accumulate operations,” Nature
Electronics, vol. 2, no. 7, pp. 290–299, 2019.
[89] T. Hirtzlin, M. Bocquet, B. Penkovsky, J.-O. Klein, E. Nowak,
E. Vianello, J.-M. Portal, and D. Querlioz, “Digital Biologically Plau-
sible Implementation of Binarized Neural Networks With Differential
Hafnium Oxide Resistive Memory Arrays,” Frontiers in Neuroscience,
vol. 13, 2019.
[90] J. E. Stone, D. Gohara, and G. Shi, “OpenCL: A parallel programming
standard for heterogeneous computing systems,” Computing in Science
& Engineering, vol. 12, no. 3, pp. 66–73, 2010.
[91] K. Guo, S. Zeng, J. Yu, Y. Wang, and H. Yang, “A Survey of FPGA-
Based Neural Network Inference Accelerator,” ACM Transactions on
Reconfigurable Technology and Systems (TRETS), vol. 12, no. 1, pp.
1–26, 2019.
[92] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang,
“High-Level Synthesis for FPGAs: From Prototyping to Deployment,”
IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, vol. 30, no. 4, pp. 473–491, 2011.
[93] C. Lammie, W. Xiang, and M. R. Azghadi, “Accelerating Deterministic
and Stochastic Binarized Neural Networks on FPGAS using OpenCL,”
in Proceedings of the IEEE International Midwest Symposium on
Circuits and Systems (MWSCAS), Dallas, TX., Aug. 2019, pp. 626–
629.
[94] C. Lammie, A. Olsen, T. Carrick, and M. R. Azghadi, “Low-Power and
High-Speed Deep FPGA Inference Engines for Weed Classification at
the Edge,” IEEE Access, vol. 7, pp. 51171–51184, 2019.
[95] M. Carreras, G. Deriu, L. Raffo, L. Benini, and P. Meloni, “Optimizing
Temporal Convolutional Network inference on FPGA-based accelera-
tors,” arXiv preprint arXiv:2005.03775, 2020.
[96] C. Lammie, W. Xiang, and M. R. Azghadi, “Training Progres-
sively Binarizing Deep Networks Using FPGAs,” arXiv preprint
arXiv:2001.02390, 2020.
[97] C. Lammie and M. R. Azghadi, “Stochastic Computing for Low-Power
and High-Speed Deep Learning on FPGA,” in Proceedings of the IEEE
International Symposium on Circuits and Systems (ISCAS), Sapporo,
Japan., May 2019.
[98] D. Wang, K. Xu, and D. Jiang, “PipeCNN: An OpenCL-based open-
source FPGA accelerator for convolution neural networks,” in Pro-
ceedings of the International Conference on Field Programmable
Technology (ICFPT), Melbourne, Australia., Dec. 2017, pp. 279–282.
[99] M. Wess, P. S. Manoj, and A. Jantsch, “Neural network based ECG
anomaly detection on FPGA and trade-off analysis,” in Proceedings of
the IEEE International Symposium on Circuits and Systems (ISCAS),
Baltimore, MD., May 2017.
[100] A. Sanaullah, C. Yang, Y. Alexeev, K. Yoshii, and M. C. Herbordt,
“Real-time data analysis for medical diagnosis using FPGA-accelerated
neural networks,” BMC Bioinformatics, vol. 19, no. 18, p. 490, 2018.
[101] R. R. Shrivastwa, V. Pudi, and A. Chattopadhyay, “An FPGA-Based
Brain Computer Interfacing Using Compressive Sensing and Machine
Learning,” in Proceedings of the IEEE Computer Society Annual
Symposium on VLSI (ISVLSI), Hong Kong, China., Jul. 2018, pp. 726–
731.
[102] Z. Chen, A. Howe, H. T. Blair, and J. Cong, “CLINK: Compact LSTM
Inference Kernel for Energy Efficient Neurofeedback Devices,” in
Proceedings of the International Symposium on Low Power Electronics
and Design (ISLPED), Bellevue, WA., Jul. 2018.
[103] F. C. Bauer, D. R. Muir, and G. Indiveri, “Real-time ultra-low power
ECG anomaly detection using an event-driven neuromorphic proces-
sor,” IEEE Transactions on Biomedical Circuits and Systems, 2019.
[104] E. Donati, M. Payvand, N. Risi, R. Krause, K. Burelo, G. Indiveri,
T. Dalgaty, and E. Vianello, “Processing EMG signals using reservoir
computing on an event-based neuromorphic system,” in Proceedings
of the IEEE Biomedical Circuits and Systems Conference (BioCAS),
Cleveland, Ohio., Oct. 2018.
[105] E. Donati, M. Payvand, N. Risi, R. Krause, and G. Indiveri, “Discrim-
ination of EMG Signals Using a Neuromorphic Implementation of a
Spiking Neural Network,” IEEE Transactions on Biomedical Circuits
and Systems, vol. 13, no. 5, pp. 795–803, 2019.
[106] J. Behrenbeck, Z. Tayeb, C. Bhiri, C. Richter, O. Rhodes, N. Kasabov,
J. I. Espinosa-Ramos, S. Furber, G. Cheng, and J. Conradt, “Classifi-
cation and regression of spatio-temporal signals using NeuCube and its
realization on SpiNNaker neuromorphic hardware,” Journal of Neural
Engineering, vol. 16, no. 2, p. 026014, 2019.
[107] E. Nurse, B. S. Mashford, A. J. Yepes, I. Kiral-Kornek, S. Harrer, and
D. R. Freestone, “Decoding EEG and LFP Signals using Deep Learn-
ing: Heading TrueNorth,” in Proceedings of the ACM International
Conference on Computing Frontiers (CF), Como, Italy., May 2016,
pp. 259–266.
[108] L. Ohno-Machado and D. Bialek, “Diagnosing breast cancer from fnas:
variable relevance in neural network and logistic regression models,”
Studies in Health Technology and Informatics, vol. 52, pp. 537–540,
1998.
[109] Y. Ku, W. Tompkins, and Q. Xue, “Artificial neural network for
ECG arrhythmia monitoring,” in Proceedings of the International Joint
Conference on Neural Networks (IJCNN), vol. 2. Baltimore, MD.:
IEEE, Jun. 1992, pp. 987–992.
[110] J. K. Eshraghian, “Human ownership of artificial creativity,” Nature
Machine Intelligence, pp. 157–160, 2020.
[111] S. Moradi, N. Qiao, F. Stefanini, and G. Indiveri, “A Scalable Multicore
Architecture With Heterogeneous Memory Structures for Dynamic
Neuromorphic Asynchronous Processors (DYNAPs),” IEEE Transac-
tions on Biomedical Circuits and Systems, vol. 12, no. 1, pp. 106–122,
Feb. 2018.
[112] S. Benatti, F. Casamassima, B. Milosevic, E. Farella, P. Schönle,
S. Fateh, T. Burger, Q. Huang, and L. Benini, “A Versatile Embed-
ded Platform for EMG Acquisition and Gesture Recognition,” IEEE
Transactions on Biomedical Circuits and Systems, vol. 9, no. 5, pp.
620–630, 2015.
[113] F. Montagna, A. Rahimi, S. Benatti, D. Rossi, and L. Benini,
“PULP-HD: Accelerating Brain-inspired High-dimensional Comput-
ing on a Parallel Ultra-low Power Platform,” in Proceedings of the
ACM/ESDA/IEEE Design Automation Conference (DAC), San Fran-
cisco, CA., Jun. 2018.
[114] S. B. Furber, D. R. Lester, L. A. Plana, J. D. Garside, E. Painkras,
S. Temple, and A. D. Brown, “Overview of the SpiNNaker System
Architecture,” IEEE Transactions on Computers, vol. 62, no. 12, pp.
2454–2467, 2013.
[115] P. Merolla and K. Boahen, “A Recurrent Model of Orientation Maps
with Simple and Complex Cells,” in Proceedings of Advances in Neural
Information Processing Systems 17 (NIPS), Vancouver, Canada., Dec.
2004, pp. 1995–2002.
[116] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. Choday,
G. Dimou, P. Joshi, N. Imam, S. Jain et al., “Loihi: A Neuromorphic
Manycore Processor with On-Chip Learning,” IEEE Micro, vol. 38,
no. 1, pp. 82–99, 2018.
[117] C. Frenkel, M. Lefebvre, J.-D. Legat, and D. Bol, “A 0.086-mm² 12.7-
pJ/SOP 64k-Synapse 256-Neuron Online-Learning Digital Spiking
Neuromorphic Processor in 28-nm CMOS,” IEEE Transactions on
Biomedical Circuits and Systems, vol. 13, no. 1, pp. 145–158, 2019.
[118] C. Frenkel, J.-D. Legat, and D. Bol, “MorphIC: A 65-nm 738k-
Synapse/mm² Quad-Core Binary-Weight Digital Neuromorphic Pro-
cessor With Stochastic Spike-Driven Online Learning,” IEEE Transac-
tions on Biomedical Circuits and Systems, vol. 13, no. 5, pp. 999–1010,
2019.
[119] N. Qiao, H. Mostafa, F. Corradi, M. Osswald, F. Stefanini, D. Sum-
islawska, and G. Indiveri, “A reconfigurable on-line learning spiking
neuromorphic processor comprising 256 neurons and 128K synapses,”
Frontiers in Neuroscience, vol. 9, p. 141, 2015.
[120] M. R. Azghadi, S. Moradi, D. B. Fasnacht, M. S. Ozdas, and G. Indi-
veri, “Programmable spike-timing-dependent plasticity learning circuits
in neuromorphic VLSI architectures,” ACM Journal on Emerging Tech-
nologies in Computing Systems (JETC), vol. 12, no. 2, p. art. no. 17,
2015.
[121] M. Payvand and G. Indiveri, “Spike-based Plasticity Circuits for
Always-on On-line Learning in Neuromorphic Systems,” in Proceed-
ings of the IEEE International Symposium on Circuits and Systems
(ISCAS), Sapporo, Japan., May 2019.
[122] J. Kaiser, H. Mostafa, and E. Neftci, “Synaptic Plasticity Dynamics
for Deep Continuous Local Learning (DECOLLE),” Frontiers in Neu-
roscience, vol. 14, p. 424, 2020.
[123] G. Bellec, F. Scherr, E. Hajek, D. Salaj, A. Subramoney, R. Legenstein,
and W. Maass, “Eligibility traces provide a data-inspired alternative to
backpropagation through time,” Vancouver, Canada., Dec. 2019.
[124] J. Sacramento, R. P. Costa, Y. Bengio, and W. Senn, “Dendritic
cortical microcircuits approximate the backpropagation algorithm,”
in Proceedings of the Conference on Neural Information Processing
Systems (NIPS), Montreal, Canada., Dec. 2018, pp. 8721–8732.
[125] A. Valentian, F. Rummens, E. Vianello et al., “Fully Integrated Spiking
Neural Network with Analog Neurons and RRAM Synapses,” in Pro-
ceedings of the IEEE International Electron Devices Meeting (IEDM),
San Francisco, CA., Dec. 2019, pp. 14.13.1–14.13.4.
[126] Y. Hayakawa, A. Himeno, R. Yasuhara, W. Boullart et al., “Highly reliable TaOx ReRAM with centralized filament for 28-nm embedded application,” in Proceedings of the Symposium on VLSI Technology, 2015, pp. T14–T15.
[127] T. Dalgaty, M. Payvand, F. Moro, D. R. Ly, F. Pebay-Peyroula, J. Casas,
G. Indiveri, and E. Vianello, “Hybrid neuromorphic circuits exploiting
non-conventional properties of RRAM for massively parallel local
plasticity mechanisms,” APL Materials, vol. 7, no. 8, p. 081125, 2019.
[128] M. Payvand, Y. Demirag, T. Dalgaty, E. Vianello, and G. Indiveri, “Analog weight updates with compliance current modulation of binary ReRAMs for on-chip learning,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), 2020.
[129] Q. Wang, X. Wang, S. H. Lee, F.-H. Meng, and W. D. Lu, “A Deep Neural Network Accelerator Based on Tiled RRAM Architecture,” in Proceedings of the IEEE International Electron Devices Meeting (IEDM), San Francisco, CA., Dec. 2019, pp. 14–4.
... Such workloads can be accelerated for low latency and low energy when processed on neuromorphic hardware that uses asynchronous, fine-grained processing to efficiently handle spiking signals and parallel operations [8]. Much like in the brain, spikes are thought to encode information over time, and have been shown to improve the energy efficiency of sequence-based computer vision tasks by several orders of magnitude across a variety of workloads [9,10,11]. ...
Preprint
Full-text available
Autonomous driving demands an integrated approach that encompasses perception, prediction, and planning, all while operating under strict energy constraints to enhance scalability and environmental sustainability. We present Spiking Autonomous Driving (SAD), the first unified Spiking Neural Network (SNN) to address the energy challenges faced by autonomous driving systems through its event-driven and energy-efficient nature. SAD is trained end-to-end and consists of three main modules: perception, which processes inputs from multi-view cameras to construct a spatiotemporal bird's eye view; prediction, which utilizes a novel dual-pathway with spiking neurons to forecast future states; and planning, which generates safe trajectories considering predicted occupancy, traffic rules, and ride comfort. Evaluated on the nuScenes dataset, SAD achieves competitive performance in perception, prediction, and planning tasks, while drawing upon the energy efficiency of SNNs. This work highlights the potential of neuromorphic computing to be applied to energy-efficient autonomous driving, a critical step toward sustainable and safety-critical automotive technology. Our code is available at https://github.com/ridgerchu/SAD.
... To improve energy efficiency, we can leverage various energy-efficient hardware accelerators. For example, DNNs could be implemented on edge hardware accelerators (Aimar et al., 2018a; Deng et al., 2020; Gao et al., 2020; Kim et al., 2022; Lee et al., 2018; Liu et al., 2022), while convolutional neural networks (CNNs), in particular, could be converted to rate-based spiking neural networks (SNNs) via an SNN toolbox (Rueckauer et al., 2017) and then deployed on spiking hardware accelerators (Azghadi et al., 2020; Basu et al., 2022; Chien et al., 2018; Davies et al., 2018). In addition to the use of hardware accelerators, we can also use asynchronous readout electronics to process the continuous recordings from the patch (Cuenca-Michans et al., 2023) and then deploy DNNs that can process these event-driven data and fuse them with recordings from other sensor modalities (Neil & Liu, 2016). ...
Thesis
Full-text available
Sweat biomarkers offer valuable insights into the health conditions of individuals. Despite the recent advances in wearable technologies that enable real-time monitoring of sweat biomarkers, their potential to infer health conditions remains largely unexplored. This thesis, conducted as part of the WeCare project, leverages machine learning (ML) models including deep neural networks (DNNs), for real-time predictive health monitoring using these sweat biomarkers. Our research primarily focuses on predicting physiological states such as hydration status and core body temperature during exercise. One version of the wearable sweat patch developed by our WeCare partner at the Instituto de Microelectrónica de Barcelona (IMB-CNM) uses ion-sensitive field-effect transistors (ISFETs). While these sensors are sensitive, lightweight, and cost-effective, they are prone to sensor drift. Previous work shows that DNNs are promising for predicting ionic concentration from ISFET sensor readings with the presence of sensor drift. However, training DNNs requires large labeled datasets that are difficult to collect. To address this, we first construct a physical model for ISFET sensors that simulates sensor readings and takes into account sensor drift. We then train an end-to-end prediction neural network as a sensor calibration tool on these simulated readings. Our prediction network outperforms two manual calibration methods in predicting sodium concentration from uncalibrated real-world sodium ISFET readings, suggesting its promise for future calibration of wearable patches using ISFETs. Next, we carry out a study aimed at designing personalized hydration strategies based on noninvasive biomarkers. We examine the feasibility of using ML models to predict the hydration status of an athlete using physiological and sweat biomarker recordings collected from a subject during a set of indoor cycling sessions supervised by the Lausanne University Hospital (CHUV). 
Because the wearable sweat patches were still under development at that stage, absorbent patches were used for sweat sample collection. We also compared the performance of nonlinear ML models with the linear model on this predictive task. This investigation provides insights for future hydration status predictions using ML models on sweat biomarker data collected from wearable sweat patches. Finally, following the available printed sensor patch developed by the Soft Transducers Lab at École Polytechnique Fédérale de Lausanne (EPFL-LMTS), we determine the prediction accuracy of core body temperature during exercise using real-time sweat biomarkers measured with this wearable prototype and with the addition of other biomarker data collected with commercial devices. All experimental sessions were conducted at CHUV. Our results indicate that DNNs can accurately and continuously predict core body temperature solely from sweat biomarker data, specifically sweat sodium and potassium concentrations collected from the wearable patch. Moreover, our analysis of the collected sweat biomarker data shows that they can be used to predict future core body temperature values. Our findings highlight the potential of integrating advanced predictive models with wearable sweat patches for real-time and accurate prediction of physiological states.
... Neuromorphic computing systems often incorporate specialized hardware, such as neuromorphic chips or memristive devices, to enable the efficient execution of brain-inspired learning algorithms [117,118]. These systems have the potential to drastically improve the performance of machine learning applications, particularly in edge computing and real-time processing scenarios. ...
Article
Full-text available
Artificial neural networks (ANNs) have emerged as an essential tool in machine learning, achieving remarkable success across diverse domains, including image and speech generation, game playing, and robotics. However, there exist fundamental differences between ANNs’ operating mechanisms and those of the biological brain, particularly concerning learning processes. This paper presents a comprehensive review of current brain-inspired learning representations in artificial neural networks. We investigate the integration of more biologically plausible mechanisms, such as synaptic plasticity, to improve these networks’ capabilities. Moreover, we delve into the potential advantages and challenges accompanying this approach. In this review, we pinpoint promising avenues for future research in this rapidly advancing field, which could bring us closer to understanding the essence of intelligence.
... This leads to sparse activations, which can be efficiently exploited by neuromorphic hardware, such as Loihi [54], SpiNNaker [55] and TrueNorth [56]. It has been shown that these specialized hardware chips can reduce the energy consumption of neural network-based processes by factors of up to ×1000 [54,57–60]. Apart from their energy efficiency in prediction, recent attempts to increase the training efficiency of SNN can be found in [61–64]. ...
Article
Full-text available
Spiking neural networks (SNN), also often referred to as the third generation of neural networks, carry the potential for a massive reduction in memory and energy consumption over traditional, second-generation neural networks. Inspired by the undisputed efficiency of the human brain, they introduce temporal and neuronal sparsity, which can be exploited by next-generation neuromorphic hardware. Energy efficiency plays a crucial role in many engineering applications, for instance, in structural health monitoring. Machine learning in engineering contexts, especially in data-driven mechanics, focuses on regression. While regression with SNN has already been discussed in a variety of publications, in this contribution, we provide a novel formulation for its accuracy and energy efficiency. In particular, a network topology for decoding binary spike trains to real numbers is introduced, using the membrane potential of spiking neurons. Several different spiking neural architectures, ranging from simple spiking feed-forward to complex spiking long short-term memory neural networks, are derived. Since the proposed architectures do not contain any dense layers, they exploit the full potential of SNN in terms of energy efficiency. At the same time, the accuracy of the proposed SNN architectures is demonstrated by numerical examples, namely different material models. Linear and nonlinear, as well as history-dependent material models, are examined. While this contribution focuses on mechanical examples, the interested reader may regress any custom function by adapting the published source code.
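The membrane-potential decoding idea described above — reading a real number out of a binary spike train through the membrane potential of a spiking neuron — can be sketched with a toy model. This is an illustrative sketch only: the non-spiking leaky integrator and all parameter values below are assumptions, not the authors' published architecture.

```python
import numpy as np

def membrane_readout(spike_train, tau=10.0, dt=1.0):
    """Decode a binary spike train to a real number via the membrane
    potential of a leaky integrator with no firing threshold."""
    decay = np.exp(-dt / tau)  # exponential leak per time step
    v = 0.0
    for s in spike_train:
        v = decay * v + s      # leak, then integrate the incoming spike
    return v

# A denser spike train drives the membrane to a higher final value,
# yielding a graded real-valued output from purely binary events.
sparse = membrane_readout(np.array([1, 0, 0, 0, 1, 0, 0, 0]))
dense = membrane_readout(np.array([1, 1, 1, 1, 1, 1, 1, 1]))
```

Because the readout is a weighted sum of past spikes, spike density (and timing) maps monotonically onto the decoded value, which is what makes regression with binary spike trains possible.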
... In [34–36], the high-frequency (HF) component represents cardiac parasympathetic nerve activity during rest, and the low-frequency (LF) component represents sympathetic nerve activity during stress. Thus, the LF component and the low-frequency-to-high-frequency ratio (LF/HF) are expected to be higher during stress conditions, and the HF component to be lower. ...
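The LF, HF, and LF/HF quantities mentioned in the excerpt above are computed from the power spectrum of an evenly resampled RR-interval series. A minimal sketch using a plain FFT periodogram follows; the band limits follow standard HRV conventions, while the synthetic signal and the 4 Hz resampling rate are assumptions for illustration.

```python
import numpy as np

def lf_hf_ratio(rr_resampled, fs=4.0):
    """LF/HF ratio from an evenly resampled (fs Hz) RR-interval series.
    LF band: 0.04-0.15 Hz; HF band: 0.15-0.40 Hz."""
    x = rr_resampled - np.mean(rr_resampled)      # remove DC component
    psd = np.abs(np.fft.rfft(x)) ** 2 / len(x)    # simple periodogram
    f = np.fft.rfftfreq(len(x), d=1.0 / fs)
    lf = psd[(f >= 0.04) & (f < 0.15)].sum()      # band power in LF
    hf = psd[(f >= 0.15) & (f < 0.40)].sum()      # band power in HF
    return lf / hf

# Synthetic 5-minute RR series at 4 Hz: a dominant 0.1 Hz (LF)
# oscillation plus a weaker 0.3 Hz (HF) one -> ratio well above 1,
# as expected for a stress-like sympathetic-dominant recording.
t = np.arange(0, 300, 0.25)
rr = 0.8 + 0.05 * np.sin(2 * np.pi * 0.1 * t) + 0.01 * np.sin(2 * np.pi * 0.3 * t)
ratio = lf_hf_ratio(rr)
```

With these amplitudes the ratio works out to roughly (0.05/0.01)² = 25, illustrating how an LF-dominated spectrum yields a high LF/HF value.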
Preprint
Stress is a psychological condition arising from the body's response to a challenging situation. If a person is exposed to various forms of stress over prolonged periods, their physical and mental health can be negatively affected, leading to chronic health problems. It is important to detect stress in its initial stages to prevent psychological and physical stress-related issues, so there must be alternative and effective solutions for spontaneous stress monitoring. Wearable sensors are one of the most prominent such solutions, given their capacity to collect data continuously in real time and their non-intrusive nature; they can continuously monitor vital signs, e.g., heart rate and activity. Yet most existing works have focused on data acquired in controlled settings. To this end, our study proposes a machine learning-based approach for detecting the onset of stress in a free-living environment using wearable sensors. The authors utilized the SWEET dataset, collected from 240 subjects via electrocardiography (ECG), skin temperature (ST), and skin conductance (SC). Four machine learning models were trained and tested on this dataset across four data scenarios: K-Nearest Neighbors (KNN), Support Vector Classification (SVC), Decision Tree (DT), and Random Forest (RF). The KNN model had the highest accuracy, 98%, while the other models also performed satisfactorily.
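The nearest-neighbour classification used in the study above can be sketched in a few lines. The toy features (heart rate and skin conductance) and labels below are synthetic assumptions for illustration only; they are not the SWEET data, and this is not the study's pipeline.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Minimal k-nearest-neighbour classifier using Euclidean distance
    and a majority vote over the k closest training samples."""
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)      # distance to all samples
        nearest = y_train[np.argsort(d)[:k]]         # labels of k nearest
        preds.append(np.bincount(nearest).argmax())  # majority vote
    return np.array(preds)

# Toy features: [heart rate (bpm), skin conductance (uS)];
# label 1 = stressed, 0 = relaxed (synthetic clusters).
rng = np.random.default_rng(0)
relaxed = rng.normal([65, 2.0], [5, 0.3], size=(50, 2))
stressed = rng.normal([95, 6.0], [5, 0.3], size=(50, 2))
X = np.vstack([relaxed, stressed])
y = np.array([0] * 50 + [1] * 50)
pred = knn_predict(X, y, np.array([[60, 1.8], [100, 6.5]]))
```

In practice the features would be normalized first, since KNN distances are sensitive to the very different scales of heart rate and conductance.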
... The chips use Very Large Scale Integration (VLSI) to realize these units for computation, while algorithms are implemented on neuromorphic chips through layers of logic gates. This helps SNNs fulfill their potential and reduce energy consumption by a factor of up to 10³ [77,80,81]. ...
Article
Full-text available
The present study aims to develop a sustainable framework employing brain-inspired neural networks for solving boundary value problems in Engineering Mechanics. Spiking neural networks, known as the third generation of artificial neural networks, are proposed for physics-based artificial intelligence. Accompanied by a new pseudo-explicit integration scheme based on spiking recurrent neural networks, leading to a spike-based pseudo-explicit integration scheme, the underlying differential equations are solved with a physics-informed strategy. We additionally propose a third-generation spike-based Legendre Memory Unit that handles long sequences. These third-generation networks can be implemented on coming-of-age neuromorphic hardware, resulting in lower energy and memory consumption. The proposed framework, although implicit, is viewed as a pseudo-explicit scheme since it requires few or no online training steps to achieve a converged solution, even for unseen loading sequences. The proposed framework is deployed in a Finite Element solver for plate structures undergoing cyclic loading, and a Xylo-Av2 SynSense neuromorphic chip is used to assess its energy performance. An acceleration of more than 40% compared to classical Finite Element Method simulations and the capability of online training are observed. We also see a reduction in energy consumption of up to three orders of magnitude.
Preprint
Full-text available
Memristive neuromorphic systems are designed to emulate human perception and cognition, where the memristor states represent essential historical information to perform both low-level and high-level tasks. However, current systems face challenges with the separation of state modulation and acquisition, leading to undesired time delays that impact real-time performance. To overcome this issue, we introduce a dual-function circuit that concurrently modulates and acquires memristor state information. This is achieved through two key features: 1) a feedback operational amplifier (op-amp) based circuit that ensures precise voltage application on the memristor while converting the passing current into a voltage signal; 2) a division calculation circuit that acquires state information from the modulation voltage and the converted voltage, improving stability by leveraging the intrinsic threshold characteristics of memristors. This circuit has been evaluated in a memristor-based nociceptor and a memristor crossbar, demonstrating exceptional performance. For instance, it achieves mean absolute acquisition errors below 1 Ω during the modulation process in the nociceptor application. These results demonstrate that the proposed circuit can operate at different scales, holding the potential to enhance a wide range of neuromorphic applications.
Article
As societies age, the issue of falls has become increasingly critical for the health and safety of the elderly. Fall detection in the elderly has traditionally relied on supervised learning methods, which require data on falls that is difficult to obtain in real situations. Additionally, the complexity of integrating deep learning models into wearable devices for real-time fall detection has been challenging due to limited computational resources. In this paper, we propose a novel fall detection method using unsupervised learning based on a denoising long short-term memory (LSTM)-based convolutional variational autoencoder (CVAE) model to solve the problem of the lack of fall data. By utilizing the proposed data debugging and hierarchical data balancing techniques, the proposed method achieves an F1 score of 1.0 while reducing the parameter count by 25.6 times compared to the state-of-the-art unsupervised deep learning method. The resulting model occupies only 157.65 KB of memory, making it highly suitable for integration into wearable devices.
Conference Paper
Full-text available
Many edge computing and IoT applications require adaptive and on-line learning architectures for fast and low-power processing of locally sensed signals. A promising class of architectures to solve this problem is that of in-memory computing ones, based on event-based hybrid memristive-CMOS devices. In this work, we present an example of such systems that supports always-on on-line learning. To overcome the problems of variability and limited resolution of ReRAM memristive devices used to store synaptic weights, we propose to use only their High Conductive State (HCS) and control their desired conductance by modulating their programming Compliance Current (ICC). We describe the spike-based learning CMOS circuits that are used to modulate the synaptic weights and demonstrate the relationship between the synaptic weight, the device conductance, and the ICC used to set its weight, with experimental measurements from a 4-kb array of HfO₂-based devices. To validate the approach and the circuits presented, we present circuit simulation results for a standard CMOS 180 nm process and system-level behavioral simulations for classifying handwritten digits from the MNIST data-set with classification accuracy of 92.68% on the test set.
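The ICC-to-weight relationship described above can be captured by a simple behavioral model: within the High Conductive State, a larger programming compliance current yields a higher device conductance, and hence a stronger synapse. The linear mapping and all numeric ranges below are illustrative assumptions, not measured device data.

```python
import numpy as np

def program_conductance(i_cc, g_min=50e-6, g_max=500e-6,
                        i_min=10e-6, i_max=100e-6):
    """Behavioral sketch: HCS conductance grows monotonically with the
    programming compliance current I_CC, clipped to the device range.
    Units: conductances in siemens, currents in amperes (assumed)."""
    i = np.clip(i_cc, i_min, i_max)
    return g_min + (g_max - g_min) * (i - i_min) / (i_max - i_min)

# A larger compliance current programs a more conductive (stronger) synapse.
g_lo = program_conductance(20e-6)   # weak synapse
g_hi = program_conductance(80e-6)   # strong synapse
```

A monotonic mapping like this is all a learning circuit needs: to strengthen a synapse it re-programs the device with a higher ICC, sidestepping the limited resolution of direct conductance tuning.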
Article
Full-text available
Bowel sounds (BSs), typically generated by intestinal peristalses, are a significant physiological indicator of the digestive system's health condition. In this study, a wearable BS monitoring system is presented for long-term BS monitoring. The system features a wearable BS sensor that can record BSs for days and transmit them wirelessly in real time. With the system, BS data from a total of 20 subjects were collected in a hospital environment. Afterward, CNNs are introduced for BS segment recognition. Specifically, this study proposes a novel CNN design method that makes it possible to transfer popular CNN modules from image recognition into the BS segmentation domain. Experimental results show that, in holdout evaluation with corrected labels, the designed CNN model achieves a moderate accuracy of 91.8% and the highest sensitivity of 97.0% compared with similar works. In cross-validation with noisy labels, the designed CNN delivers the best generalizability.
Article
Full-text available
Recent review papers have investigated seizure prediction, creating the possibility of preempting epileptic seizures. Correct seizure prediction can significantly improve the standard of living for the majority of epileptic patients, as the unpredictability of seizures is a major concern for them. Today, the development of algorithms, particularly in the field of machine learning, enables reliable and accurate seizure prediction using desktop computers. However, despite extensive research effort being devoted to developing seizure detection integrated circuits (ICs), dedicated seizure prediction ICs have not been developed yet. We believe that interdisciplinary study of system architecture, analog and digital ICs, and machine learning algorithms can promote the translation of scientific theory to a more realistic intelligent, integrated, and low-power system that can truly improve the standard of living for epileptic patients. This review explores topics ranging from signal acquisition analog circuits to classification algorithms and dedicated digital signal processing circuits for detection and prediction purposes, to provide a comprehensive and useful guideline for the construction, implementation and optimization of wearable and integrated smart seizure prediction systems.
Article
Brain-Computer Interface (BCI) is a system empowering humans to communicate with or control the outside world with exclusively brain intentions. Electroencephalography (EEG) based BCIs are promising solutions due to their convenient and portable instruments. Despite the extensive research on EEG in recent years, it is still challenging to interpret EEG signals effectively due to the massive noise in EEG signals (e.g., low signal-to-noise ratio and incomplete EEG signals) and difficulties in capturing the inconspicuous relationships between EEG signals and certain brain activities. Most existing works either consider EEG only as chain-like sequences, neglecting complex dependencies between adjacent signals, or require pre-processing such as transforming EEG waves into images. In this paper, we introduce both cascade and parallel convolutional recurrent neural network models for precisely identifying human intended movements and instructions, effectively learning the compositional spatio-temporal representations of raw EEG streams. Extensive experiments on a large-scale movement intention EEG dataset (108 subjects, 3,145,160 EEG records) have demonstrated that both models achieve high accuracy near 98.3% and outperform a set of baseline methods and most recent deep learning based EEG recognition models, yielding a significant accuracy increase of 18% in the cross-subject validation scenario. The developed models are further evaluated with a real-world BCI and achieve a recognition accuracy of 93% over five instruction intentions. This suggests the proposed models are able to generalize over different kinds of intentions and BCI systems.
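The cascade architecture described above — convolutional feature extraction over EEG channels followed by a recurrent layer over time — can be illustrated with a NumPy toy. The random weights and tiny dimensions are assumptions for illustration; the actual models are far larger and trained end-to-end.

```python
import numpy as np

rng = np.random.default_rng(1)

def conv1d(x, w):
    """Valid-mode 1-D convolution over the time axis of a
    (channels x time) array, followed by a ReLU nonlinearity."""
    c_out, c_in, k = w.shape
    t_out = x.shape[1] - k + 1
    y = np.zeros((c_out, t_out))
    for o in range(c_out):
        for t in range(t_out):
            y[o, t] = np.sum(w[o] * x[:, t:t + k])
    return np.maximum(y, 0.0)

def rnn_last_state(x, w_x, w_h):
    """Simple tanh RNN over time; returns the final hidden state,
    which a linear layer would map to intention classes."""
    h = np.zeros(w_h.shape[0])
    for t in range(x.shape[1]):
        h = np.tanh(w_x @ x[:, t] + w_h @ h)
    return h

# Cascade: conv extracts per-window spatial features across channels,
# then the RNN models their temporal evolution.
eeg = rng.standard_normal((8, 32))                     # 8 channels, 32 samples
feats = conv1d(eeg, rng.standard_normal((4, 8, 5)) * 0.1)
h = rnn_last_state(feats, rng.standard_normal((6, 4)) * 0.1,
                   rng.standard_normal((6, 6)) * 0.1)
```

The parallel variant mentioned in the abstract would instead run the convolutional and recurrent branches on the raw input side by side and concatenate their outputs.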
Article
Memristive devices have shown great promise to facilitate the acceleration and improve the power efficiency of Deep Learning (DL) systems. Crossbar architectures constructed using these Resistive Random-Access Memory (RRAM) devices can be used to efficiently implement various in-memory computing operations, such as Multiply-Accumulate (MAC) and unrolled convolutions, which are used extensively in Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs). However, memristive devices face concerns of aging and non-idealities, which limit the accuracy, reliability, and robustness of Memristive Deep Learning Systems (MDLSs) and should be considered prior to circuit-level realization. This Original Software Publication (OSP) presents MemTorch, an open-source framework for customized large-scale memristive Deep Learning (DL) simulations, with a refined focus on the co-simulation of device non-idealities. MemTorch also facilitates co-modelling of key crossbar peripheral circuitry. MemTorch adopts a modernized software engineering methodology and integrates directly with the well-known PyTorch Machine Learning (ML) library.
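The in-memory MAC operation that such crossbars implement, and the device non-idealities that frameworks like MemTorch co-simulate, can be sketched in a few lines. The log-normal variability model and all values below are illustrative assumptions; this is not MemTorch's API.

```python
import numpy as np

rng = np.random.default_rng(42)

def crossbar_mac(v_in, g_ideal, sigma=0.05):
    """In-memory multiply-accumulate on an RRAM crossbar: column
    currents I = V G follow from Ohm's law and Kirchhoff's current
    law. Log-normal device-to-device variability perturbs the
    programmed conductances away from their ideal values."""
    g_actual = g_ideal * rng.lognormal(mean=0.0, sigma=sigma,
                                       size=g_ideal.shape)
    return v_in @ g_actual

v = np.array([0.1, 0.2, 0.3])       # input voltages on the rows
g = np.array([[1.0, 2.0],           # 3x2 conductance matrix (mS, assumed)
              [0.5, 1.0],
              [2.0, 0.5]])
i_out = crossbar_mac(v, g)          # column currents approximate v @ g
```

Because each column current is a single analog summation, the MAC completes in one read step regardless of the number of rows — the source of the crossbar's efficiency, at the cost of the variability modeled here.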
Article
Photoplethysmographic (PPG) measurements from ambulatory subjects may suffer from unreliability due to body movements and from missing data segments due to loosening of the sensor. This paper describes an on-device reliability assessment of PPG measurements using a stacked denoising autoencoder (SDAE) and a multilayer perceptron neural network (MLPNN). The missing segments were predicted by a personalized convolutional neural network (CNN) and long short-term memory (LSTM) model using a short history of the same-channel data. 40 sets of volunteers' data, consisting of an equal share of healthy and cardiovascular subjects, were used for validation and testing. The PPG reliability assessment model (PRAM) achieved over 95% accuracy in correctly identifying acceptable PPG beats out of a total of 5000, using expert-annotated data. Disagreement with the experts' annotation was nearly 3.5%. The missing segment prediction model (MSPM) achieved a root mean square error (RMSE) of 0.22 and a mean absolute error (MAE) of 0.11 for 40-beat missing-data prediction using only a four-beat history from the same-channel PPG. The two models were integrated in a standalone device based on a quad-core ARM Cortex-A53 at 1.2 GHz with 1 GB RAM, with a 130 MB memory requirement and latency of ~0.35 s per beat prediction with a 30 s frame. The present method also shows improved performance over published works on PPG quality assessment and missing data prediction using two public PhysioNet datasets, CinC and MIMIC-II.
Article
Convolutional Neural Networks (CNNs) are extensively used in a wide range of applications, commonly including computer vision tasks like image and video classification, recognition, and segmentation. Recent research results demonstrate that multi-layer (deep) networks involving mono-dimensional convolutions and dilation can be effectively used in time series and sequence classification and segmentation, as well as in tasks involving sequence modeling. These structures, commonly referred to as Temporal Convolutional Networks (TCNs), represent an extremely promising alternative to recurrent architectures, commonly used across a broad range of sequence modeling tasks [1]. While FPGA-based inference accelerators for classic CNNs are widespread, the literature lacks a quantitative evaluation of their usability for inference on TCN models. In this paper we present such an evaluation, considering a CNN accelerator with specific features supporting TCN kernels as a reference and a set of state-of-the-art TCNs as a benchmark. Experimental results show that, during TCN execution, operational intensity can be critical for the overall performance. We propose a convolution scheduling based on batch processing that can boost efficiency up to 96% of theoretical peak performance. Overall, we achieve up to 111.8 GOPS and a power efficiency of 33.8 GOPS/W on an Ultrascale+ ZU3EG (up to 10× speedup and 3× power efficiency improvement with respect to a pure software implementation).
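The dilated mono-dimensional convolutions underlying TCNs can be sketched with a minimal causal implementation. The two-tap kernel and dilation factor below are chosen purely for illustration; real TCNs stack many such layers with residual connections.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """Causal dilated 1-D convolution: y[t] depends only on
    x[t], x[t-d], x[t-2d], ... (inputs before t=0 are zero-padded)."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # left-pad so output is causal
    return np.array([sum(w[j] * xp[t + pad - j * dilation] for j in range(k))
                     for t in range(len(x))])

# With kernel [1, 1] and dilation 2, each output is y[t] = x[t] + x[t-2].
x = np.arange(8, dtype=float)
y = causal_dilated_conv(x, np.array([1.0, 1.0]), dilation=2)
```

Stacking layers with dilations 1, 2, 4, 8, ... grows the receptive field exponentially with depth at a constant kernel size, which is what lets TCNs model long sequences without recurrence.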