Memristor-Based Multilayer Neural Networks With
Online Gradient Descent Training
Daniel Soudry, Dotan Di Castro, Asaf Gal, Avinoam Kolodny, and Shahar Kvatinsky
Abstract—Learning in multilayer neural networks (MNNs)
relies on continuous updating of large matrices of synaptic
weights by local rules. Such locality can be exploited for massive
parallelism when implementing MNNs in hardware. However,
these update rules require a multiply and accumulate operation
for each synaptic weight, which is challenging to implement
compactly using CMOS. In this paper, a method for performing
these update operations simultaneously (incremental outer prod-
ucts) using memristor-based arrays is proposed. The method is
based on the fact that, approximately, given a voltage pulse, the
conductivity of a memristor will increment proportionally to the
pulse duration multiplied by the pulse magnitude if the increment
is sufficiently small. The proposed method uses a synaptic circuit
composed of a small number of components per synapse: one
memristor and two CMOS transistors. This circuit is expected
to consume between 2% and 8% of the area and static power
of previous CMOS-only hardware alternatives. Such a circuit
can compactly implement hardware MNNs trainable by scalable
algorithms based on online gradient descent (e.g., backpropaga-
tion). The utility and robustness of the proposed memristor-based
circuit are demonstrated on standard supervised learning tasks.
Index Terms—Backpropagation, hardware, memristive
systems, memristor, multilayer neural networks (MNNs),
stochastic gradient descent, synapse.
I. INTRODUCTION
MULTILAYER neural networks (MNNs) have been
recently incorporated into numerous commercial prod-
ucts and services such as mobile devices and cloud computing.
For realistic large scale learning tasks, MNNs can perform
impressively well and produce state-of-the-art results when
Manuscript received June 16, 2014; revised December 5, 2014; accepted
December 11, 2014. This work was supported in part by the Gruss Lipper
Charitable Foundation, in part by the Intel Collaborative Research Institute for
Computational Intelligence, in part by the Hasso Plattner Institute, Potsdam,
Germany, and in part by the Andrew and Erna Finci Viterbi Fellowship
Program.
D. Soudry is with the Center for Theoretical Neuroscience, and the
Grossman Center for the Statistics of Mind, Department of Statistics,
Columbia University, New York, NY 10027 USA (e-mail: daniel.soudry@
gmail.com).
D. Di Castro is with Yahoo! Labs, Haifa 31905, Israel (e-mail:
dotan.dicastro@gmail.com).
A. Gal is with the Department of Electrical Engineering, Biological
Networks Research Laboratories, Technion-Israel Institute of Technology,
Haifa 32000, Israel (e-mail: asafg1@gmail.com).
A. Kolodny is with the Department of Electrical Engineering,
Technion-Israel Institute of Technology, Haifa 32000, Israel (e-mail:
kolodny@ee.technion.ac.il).
S. Kvatinsky is with the Department of Computer Science, Stanford
University, Stanford, CA 94305 USA (e-mail: skva@tx.technion.ac.il).
This paper has supplemental material available online at
http://ieeexplore.ieee.org (File size: 9 MB).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNNLS.2014.2383395
massive computational power is available [1]–[3] (see [4] for
related press). However, such computational intensity limits
their usability due to area and power requirements. New dedicated hardware design approaches must therefore be developed to overcome these limitations. It was recently suggested
that such new types of specialized hardware are essential for
real progress toward building intelligent machines [5].
MNNs utilize large matrices of values termed synaptic
weights. These matrices are continuously updated during the
operation of the system, and are constantly being used to
interpret new data. The power of MNNs mainly stems from
the learning rules used for updating the weights. These rules
are usually local, in the sense that they depend only on
information available at the site of the synapse. A canonical
example for a local learning rule is the backpropagation
algorithm, an efficient implementation of online gradient
descent, which is commonly used to train MNNs [6]. The
locality of backpropagation stems from the chain rule used
to calculate the gradients. Similar locality appears in many
other learning rules used to train neural networks and various
machine learning (ML) algorithms.
Implementing ML algorithms such as backpropagation
on conventional general-purpose digital hardware
(i.e., von Neumann architecture) is highly inefficient.
A primary reason for this is the physical separation between
the memory arrays used to store the values of the synaptic
weights and the arithmetic module used to compute the update
rules. General-purpose architecture actually eliminates the
advantage of these learning rules—their locality. This locality
allows highly efficient parallel computation, as demonstrated
by biological brains.
To overcome the inefficiency of general-purpose hardware,
numerous dedicated hardware designs, based on
CMOS technology, have been proposed in the past two
decades [7, and references therein]. These designs perform
online learning tasks in MNNs using massively parallel
synaptic arrays, where each synapse stores a synaptic weight
and updates it locally. However, so far, these designs are not
commonly used for practical large-scale applications, and it is
not clear whether they could be scaled up, since each synapse
requires too much power and area (Section VII). This issue
of scalability possibly casts doubt on the entire field [8].
Recently, it has been suggested [9]–[10] that scalable
hardware implementations of neural networks may become
possible if a novel device, the memristor [11]–[14], is
used. A memristor is a resistor with a varying history-
dependent resistance. It is a passive analog device with
activation-dependent dynamics, which makes it ideal for
registering and updating of synaptic weights. Furthermore,
its relatively small size enables integration of memory
with the computing circuit [15] and allows a compact and
efficient architecture for learning algorithms, as well as other
neural-network-related applications ([16]–[18], and references
therein), which will not be discussed here.
Most previous implementations of learning rules using
memristor arrays have been limited to spiking neurons and foc-
used on spike-timing-dependent plasticity (STDP) [19]–[26].
Applications of STDP are usually aimed at explaining biological neuroscience results. At this point, however, it is not clear how useful STDP is algorithmically. For example, the
convergence of STDP-based learning is not guaranteed for
general inputs [27]. Other learning systems [28]–[30] that
rely on memristor arrays are limited to single-layer neural
networks (SNNs) with a few inputs per neuron. To the best of
the authors’ knowledge, the learning rules used in the above
memristor-based designs are not yet competitive and have not
been used for large-scale problems, which is precisely where
dedicated hardware is required.
A recent memristor array design [31] implemented the
scalable perceptron algorithm [32]. This algorithm can be
potentially used to train SNNs with binary outputs on very
large datasets. However, training general MNNs (which are
much more powerful than SNNs [33]) cannot be performed
using the perceptron algorithm. Scalable training of MNNs
requires the backpropagation algorithm. This is a special form of online gradient descent, which is generally very effective in large-scale problems [34]. Importantly, it can achieve state-of-the-art results on large datasets when executed with massive
computational power [3], [35].
To date, no circuit has been suggested to utilize memristors
for implementing such scalable online learning in MNNs.
Interestingly, it was recently shown [36], [37] that memristors
could be used to implement MNNs trained by backpropagation
in a chip-in-a-loop setting, where the weight update calculation
is performed using a host computer. However, it remained
an open question whether the learning itself, which is a
major computational bottleneck, could also be done online
(i.e., completely implemented using hardware) with efficient
massively parallel memristor arrays. The main challenge for
general synaptic array circuit design arises from the nature of
learning rules such as backpropagation: practically, all of them
contain a multiplicative term [38], which is hard to implement
in compact and scalable hardware.
In this paper, a novel and general scheme to design hardware
for online gradient descent learning rules is presented. The
proposed scheme uses a memristor as a memory element to
store the weight and temporal encoding as a mechanism to
perform a multiplication operation. The proposed design uses
a single memristor and two CMOS transistors per synapse and,
therefore, requires 2%–8% of the area and static power of the
previously proposed CMOS-only circuits.
Using this proposed scheme, for the first time, it is possible
to implement a memristor-based hardware MNN capable of
online learning (using the scalable backpropagation algorithm).
The functionality of such a hardware MNN circuit utilizing
the memristive synapse array is demonstrated numerically on
standard supervised learning tasks. On all datasets, the circuit
performs as well as the software algorithm. Introducing noise
levels of about 10% and parameter variability of about 30%
only mildly reduced the performance of the circuit, due to the
inherent robustness of online gradient descent (Fig. 5). The
proposed design may therefore allow the use of specialized
hardware for MNNs, as well as other ML algorithms, rather
than the currently used general-purpose architecture.
The remainder of this paper is organized as follows.
In Section II, a basic background on memristors and online
gradient descent learning is given. In Section III, the proposed
circuit is described for efficient implementation of SNNs in
hardware. In Section IV, a modification of the proposed circuit
is used to implement a general MNN. In Section V, the
sources of noise and variation in the circuit are estimated.
In Section VI, the circuit operation and learning capabilities
are evaluated numerically, demonstrating that it can be used
to implement MNNs trainable with online gradient descent.
In Section VII, the novelty of the proposed circuit is discussed,
as well as possible generalizations, and in Section VIII,
this paper is summarized. The supplementary material [39]
includes detailed circuit schematics, code, and an appendix
with additional technicalities.
II. PRELIMINARIES
For convenience, basic background information on
memristors and online gradient descent learning is given in this
section. For simplicity, the second part is focused on a simple
example of the adaline algorithm—a linear SNN trained using
mean square error (MSE).
A. Memristor
The memristor was originally proposed [11], [12] as
the missing fourth fundamental passive circuit element.
Memristors are basically resistors with varying resistance, where the resistance changes according to the time integral of the current through the device or, alternatively, the integrated voltage upon the device. In the classical representation, the conductance of a memristor G depends directly on the integral over time of the voltage upon the device, sometimes referred to as flux. Formally, a memristor obeys the following:

$$i(t) = G(s(t))\,v(t) \qquad (1)$$
$$\dot{s}(t) = v(t). \qquad (2)$$
A generalization of the memristor model (1)-(2), which is called a memristive system, was proposed in [40]. In memristive devices, s is a general state variable, rather than an integral of the voltage. Such memristive models, which are more commonly used to model actual physical devices [14], [41], [42], are discussed in [Appendix A, 39].
For the sake of generality and simplicity, in the following sections, it is assumed that the variations in the value of s(t) are restricted to be small, so that G(s(t)) can be linearized around some point s*, and the conductivity of the memristor is given, to first order, by

$$G(s(t)) \approx \bar{g} + \hat{g}\,s(t) \qquad (3)$$
where $\hat{g} = [dG(s)/ds]_{s=s^*}$ and $\bar{g} = G(s^*) - \hat{g}s^*$. Such a linearization is formally justified if sufficiently small inputs are used, so that s does not stray far from the fixed point (i.e., $2\hat{g}/[d^2G(s)/ds^2]_{s=s^*} \gg |s(t) - s^*|$, so second-order contributions are negligible). The operation in such a small-signal region is demonstrated numerically in [Appendix D, 39] for a family of physical memristive device technologies. The only (rather mild) assumption is that G(s) is differentiable near s*.
Note that despite this linearization, the memristor is still a nonlinear component, since [from (1) and (3)] $i(t) \approx \bar{g}v(t) + \hat{g}s(t)v(t)$. Importantly, this nonlinear product s(t)v(t) underlies the key role of the memristor in the proposed design, where an input signal v(t) is multiplied by an adjustable internal value s(t). Thus, the memristor enables an efficient implementation of trainable MNNs in hardware, as explained below.
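As a rough illustration of this small-signal behavior (a minimal sketch of the linearized model (1)-(3) only, not of any particular physical device), the following snippet integrates the state variable and evaluates the current; all parameter values are arbitrary placeholders chosen to keep s(t) small.

```python
import numpy as np

# Minimal sketch of the linearized memristor model (1)-(3).
# g_bar, g_hat, and the drive amplitude are arbitrary placeholder values.
g_bar, g_hat = 1e-4, 1e-3      # G(s) ~ g_bar + g_hat * s  (eq. 3)
dt, T = 1e-6, 1e-2             # time step and total simulated time [s]
t = np.arange(0.0, T, dt)
v = 1e-3 * np.sin(2 * np.pi * 1e3 * t)   # small input voltage, keeps |s| small

s = np.zeros_like(t)           # state variable, ds/dt = v  (eq. 2)
i = np.zeros_like(t)
for k in range(1, len(t)):
    s[k] = s[k - 1] + v[k - 1] * dt          # integrate eq. (2)
    i[k] = (g_bar + g_hat * s[k]) * v[k]     # current from eqs. (1) and (3)

# The term g_hat * s * v is the product of a stored state and the input,
# which is the multiplication the synapse design exploits.
print("max |s| =", np.abs(s).max(), " max |i| =", np.abs(i).max())
```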
B. Online Gradient Descent Learning
The field of ML is dedicated to the construction and study of systems that can learn from data. For example, consider the following supervised learning task. Assume a learning system that operates on K discrete presentations of inputs (trials), indexed by k = 1, 2, ..., K. For brevity, the indexing of the iteration number is sometimes suppressed, where it is clear from the context. On each trial k, the system receives empirical data, a pair of two real column vectors of sizes M and N: a pattern $x^{(k)} \in \mathbb{R}^M$ and a desired label $d^{(k)} \in \mathbb{R}^N$, with all pairs sharing the same desired relation $d^{(k)} = f(x^{(k)})$. Note that two distinct patterns can have the same label. The objective of the system is to estimate (learn) the function f(·) using the empirical data.
As a simple example, suppose W is a tunable N×M matrix of parameters, and consider the estimator

$$r^{(k)} = W^{(k)} x^{(k)} \qquad (4)$$

or

$$r^{(k)}_n = \sum_m W^{(k)}_{nm} x^{(k)}_m \qquad (5)$$

which is an SNN. The result of the estimator r = Wx should aim to predict the right desired labels d = f(x) for new, unseen patterns x. To solve this problem, W is tuned to minimize some measure of error between the estimated and desired labels, over a K_0-long subset of the empirical data, called the training set (for which k = 1, ..., K_0). For example, if we define the error vector

$$y^{(k)} \triangleq d^{(k)} - r^{(k)} \qquad (6)$$

then a common measure is the MSE

$$\mathrm{MSE} \triangleq \sum_{k=1}^{K_0} \| y^{(k)} \|^2. \qquad (7)$$

Other error measures can also be used. The performance of the resulting estimator is then tested over a different subset, called the test set (k = K_0 + 1, ..., K).
As explained in the introduction, a reasonable iterative algorithm for minimizing objective (7) (i.e., updating W, where initially W is arbitrarily chosen) is the following online gradient descent (also called stochastic gradient descent) iteration:

$$W^{(k+1)} = W^{(k)} - \frac{1}{2}\eta \nabla_{W^{(k)}} \| y^{(k)} \|^2 \qquad (8)$$

where the 1/2 coefficient is written for mathematical convenience, η is the learning rate, a (usually small) positive constant, and at each iteration k, a single empirical sample $x^{(k)}$ is chosen randomly and presented at the input of the system. Using the chain rule with (4) and (6), we have $\nabla_{W^{(k)}} \| y^{(k)} \|^2 = -2 (d^{(k)} - W^{(k)} x^{(k)}) (x^{(k)})^{\top}$. Therefore, defining $\Delta W^{(k)} \triangleq W^{(k+1)} - W^{(k)}$ and $(\cdot)^{\top}$ to be the transpose operation, we obtain the outer product

$$\Delta W^{(k)} = \eta\, y^{(k)} (x^{(k)})^{\top} \qquad (9)$$

or

$$W^{(k+1)}_{nm} = W^{(k)}_{nm} + \eta\, x^{(k)}_m y^{(k)}_n. \qquad (10)$$
Specifically, this update rule is called the adaline
algorithm [43], used in adaptive signal processing and
control [44]. The parameters of more complicated estimators
can also be similarly tuned (trained), using online gradient
descent or similar methods. Specifically, MNNs (Section IV) are commonly trained using backpropagation, which is an efficient form of online gradient descent [6]. Importantly, note that the update rule in (10) is local, i.e., the change in the synaptic weight $W^{(k)}_{nm}$ depends only on the related components of the input ($x^{(k)}_m$) and error ($y^{(k)}_n$). This local
update, which ubiquitously appears in neural network training
(e.g., backpropagation and the perceptron learning rule [32])
and other ML algorithms [45]–[47], enables a massively
parallel hardware design, as explained in Section III.
Such massively parallel designs are needed, since for large N and M, learning systems usually become computationally prohibitive in both time and memory space. For example, in the simple adaline algorithm, the main computational burden in each iteration comes from (4) and (9), where the number of operations (addition and multiplication) is of order O(M·N). Commonly, these steps have become the main computational bottleneck in executing MNNs (and related ML algorithms) in software. Other algorithmic steps, such as (6) here, include either O(M) or O(N) operations and, therefore, have a negligible computational complexity, no more than O(M+N).
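To make the adaline iteration (4), (6), and (9) concrete, a minimal software sketch is given below; the sizes, learning rate, and synthetic data are illustrative assumptions, not values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K0, eta = 8, 3, 2000, 0.01         # illustrative sizes and learning rate

W_true = rng.standard_normal((N, M))      # unknown target mapping f(x) = W_true x
W = np.zeros((N, M))                      # tunable parameter matrix, eq. (4)

for k in range(K0):
    x = rng.uniform(-1.0, 1.0, size=M)    # pattern x^(k)
    d = W_true @ x                        # desired label d^(k) = f(x^(k))
    r = W @ x                             # estimator output, eq. (4)
    y = d - r                             # error vector, eq. (6)
    W += eta * np.outer(y, x)             # local outer-product update, eq. (9)/(10)

print("parameter mean squared error:", np.mean((W - W_true) ** 2))
```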
III. CIRCUIT DESIGN
Next, dedicated analog hardware for implementing online
gradient descent learning is described. For simplicity, this
section is focused on simple linear SNNs trained using adaline,
as described in Section II. Later, in Section IV, the circuit
is modified to implement general MNNs, trained using back-
propagation. The derivations are rigorously done using a single
controlled approximation (14) and no heuristics (i.e., unproven
methods which can damage performance), thus creating a
precisely defined mapping between a mathematical learning
system and a hardware learning system. Readers who do not wish to go through the detailed derivations can get the general idea from the following overview (Section III-A) together with Figs. 1–4.

Fig. 1. Simple adaline learning task, with the proposed synaptic grid circuit executing (4) and (9), which are the main computational bottlenecks in the algorithm.

Fig. 2. Synaptic grid (N×M) circuit architecture scheme. Every (n,m) node in the grid is a memristor-based synapse that receives voltage input from the shared u_m, ū_m, and e_n lines and outputs a current I_nm on the o_n lines. These output lines receive the total current arriving from all the synapses on the nth row and are grounded.
A. Circuit Overview
To implement these learning systems, a grid of artificial
synapses is constructed, where each synapse stores a single
synaptic weight W_nm. The grid is a large N×M array of
synapses, where the synapses operate simultaneously, each
performing a simple local operation. This synaptic grid circuit
carries the main computational load in ML algorithms by
implementing the two computational bottlenecks, (5) and (10),
in a massively parallel way. The matrix×vector product in (5)
is done using a resistive grid (of memristors), implementing
multiplication through Ohm’s law and addition through current
summation. The vector×vector outer product in (10) is done
using the fact that given a voltage pulse, the conductivity
of a memristor will increment proportionally to the pulse
duration multiplied by the pulse magnitude. Using this method,
multiplication requires only two transistors per synapse. Thus,
together with auxiliary circuits that handle a negligible amount
of O(M+N) additional operations, these arrays can be used
to construct efficient learning systems. These systems perform massive parallelization of the bottleneck O(N·M) operations (4) and (9) over many computational units—the synapses.

Fig. 3. Memristor-based synapse. (a) Schematic of a single memristive synapse (without the n and m indices). The synapse receives input voltages u and ū = −u, an enable signal e, and outputs a current I. (b) Read and write protocols—incoming signals in a single synapse and the increments in the synaptic weight s, as determined by (27). T = T_wr + T_rd.
Similarly to the adaline algorithm described in Section II,
the circuit operates on discrete presentations of inputs (trials)
(Fig. 1). On each trial k, the circuit receives an input vector $x^{(k)} \in [-A, A]^M$ and an error vector $y^{(k)} \in [-A, A]^N$ (where A is a bound on both the input and error) and produces a result output vector $r^{(k)} \in \mathbb{R}^N$, which depends on the input by (4), where the matrix $W^{(k)} \in \mathbb{R}^{N \times M}$, called the synaptic weight matrix, is stored in the system. In addition, on each step, the circuit updates $W^{(k)}$ according to (9). This circuit can be used to implement ML algorithms. For example, as shown in Fig. 1, the simple adaline algorithm can be implemented using the circuit, with training enabled on k = 1, ..., K_0. The
implementation of the backpropagation algorithm is shown
in Fig. 4 and explained in Section IV.
B. Circuit Architecture
1) Synaptic Grid: The synaptic grid system described
in Fig. 1 is implemented by the circuit shown in Fig. 2,
where the components of all vectors are shown as individual
signals. Each gray cell in Fig. 2 is a synaptic circuit (artificial
synapse) using a memristor [described in Fig. 3(a)]. The
synapses are arranged in a 2-D N×M grid array as shown in Fig. 2, where each synapse is indexed by (n,m), with m ∈ {1,...,M} and n ∈ {1,...,N}. Each (n,m) synapse receives two inputs u_m and ū_m, an enable signal e_n, and produces an output current I_nm. Each column of synapses in the array (the mth column) shares two vertical input lines u_m and ū_m, both connected to a column input interface. The voltage signals u_m and ū_m (for all m) are generated by the column input interfaces from the components of the input signal x, upon presentation.
Fig. 4. (a) Structure of an MNN with the neurons in each layer (circles) and the synaptic connectivity weights (arrows). (b) Implementation of an MNN with online backpropagation training with the MSE measure (7), using a slightly modified version of the original circuit (Fig. 2). Each circuit performs (23) and (28) (as in Fig. 1), together with the additional operation (31). The function boxes denote the operation of either σ(·) (the neuronal activation function) or σ′(·) (its derivative) on the input, and the × box denotes a component-wise product. The "don't care" symbol denotes an unused output. For detailed circuit schematics, see [39].
Each row of synapses in the array (the nth row) shares the horizontal enable line e_n and output line o_n, where e_n is connected to a row input interface and o_n is connected to a row output interface. The voltage (pulse) signal on the enable line e_n (for all n) is generated by the row input interfaces from the error signal y, upon presentation. The row output interfaces keep the o_n lines grounded (at zero voltage) and convert the total current from all the synapses in the row going to the ground, Σ_m I_nm, to the output signal r_n.
2) Artificial Synapse: The proposed memristive synapse
is composed of a single memristor, connected to a shared
terminal of two MOSFET transistors (p-type and n-type), as
shown schematically in Fig. 3(a) (without the n, m indices). These terminals act as drain or source, interchangeably, depending on the input, similarly to the CMOS transistors in transmission gates. Recall that the memristor dynamics are given by (1)-(3), with s(t) being the state variable of the memristor and $G(s(t)) \approx \bar{g} + \hat{g}s(t)$ its conductivity.
In addition, the current of the n-type transistor in the linear
region is, ideally
$$I = K\left[(V_{GS} - V_T)V_{DS} - \frac{1}{2}V_{DS}^2\right] \qquad (11)$$

where $V_{GS}$ is the gate-source voltage, $V_T$ is the threshold voltage, $V_{DS}$ is the drain-source voltage, and K is the conduction parameter of the transistors. When $V_{GS} < V_T$, the current is cut off (I = 0). Similarly, the current of the p-type transistor in the linear region is

$$I = -K\left[(V_{GS} + V_T)V_{DS} - \frac{1}{2}V_{DS}^2\right] \qquad (12)$$

where, for simplicity, we assumed that the parameters K and $V_T$ are equal for both transistors. Note that for notational simplicity, the parameter $V_T$ in (12) has a different sign than in the usual definition. When $V_{GS} > -V_T$, the current is cut off (I = 0).
The synapse receives three voltage input signals: u and ū = −u are connected, respectively, to a terminal of the n-type and p-type transistors, and an enable signal e is connected to the gate of both transistors. The enable signal can have a value of 0, V_DD, or −V_DD (with V_DD > V_T) and has a pulse shape of varying duration, as explained below. The output of the synapse is a current I to the grounded line o. The magnitude of the input signal u(t) and the circuit parameters are set so they fulfill the following.
1) We assume (somewhat unusually) that

$$-V_T < u(t) < V_T. \qquad (13)$$

2) We assume that [recall (1)]

$$K(V_{DD} - 2V_T) \gg G(s(t)). \qquad (14)$$
From the first assumption (13), the following conditions hold.
1) If e=0 (i.e., the gate is grounded), both transistors are
nonconducting (in the cutoff region). In this case, I=0
in the output, the voltage across the memristor is zero,
and the state variable does not change.
2) If e=VDD, the n-type transistor is conducting
in the linear region while the p-type transistor is
nonconducting.
3) If e=−VDD, the p-type transistor is conducting
in the linear region while the n-type transistor is
nonconducting.
To satisfy (13), a proper value of u(t) is chosen; see (15). From the second assumption (14), if e = ±V_DD, then, when in the linear region, both transistors have relatively high conductivity compared with the conductivity of the memristor. Therefore, in that case, the voltage on the memristor is approximately ±u. Note that (14) is a reasonable assumption, as shown in [48]. If it does not hold (e.g., if the memristor conductivity is very high), one can instead use an alternative design, as
described in [Appendix B, 39].
C. Circuit Operation
The operation of the circuit in each trial (a single presen-
tation of a specific input) is composed of two phases. First,
in the computing phase (read), the output current from all the
synapses is summed and adjusted to produce an arithmetic
operation r = Wx from (4). Second, in the updating phase (write), the synaptic weights are incremented according to the update rule $\Delta W = \eta y x^{\top}$ from (9). In the proposed design, for each synapse, the synaptic weight W_nm is stored using s_nm, the memristor state variable of the (n,m) synapse. The parallel read and write operations are achieved by applying simultaneous voltage signals on the inputs u_m and enable signals e_n (for all n, m). The signals and their effect on the state
variable are shown in Fig. 3(b).
1) Computation Phase (Read): During each read phase, a vector x is given and encoded in u and ū component-wise by the column input interfaces for a duration of T_rd, for all m: u_m(t) = a x_m = −ū_m(t), where a is a positive constant converting x_m, a unitless number, to voltage. Recall that A is the maximal value of |x_m|, so

$$aA < V_T \qquad (15)$$

as required in (13). In addition, the row input interfaces produce a voltage signal on the e_n lines, for all n

$$e_n(t) = \begin{cases} V_{DD}, & \text{if } 0 \le t < 0.5\,T_{rd} \\ -V_{DD}, & \text{if } 0.5\,T_{rd} \le t \le T_{rd}. \end{cases} \qquad (16)$$

From (2), the total change in the internal state variable is, therefore, for all n, m

$$\Delta s_{nm} = \int_0^{0.5 T_{rd}} (a x_m)\, dt + \int_{0.5 T_{rd}}^{T_{rd}} (-a x_m)\, dt = 0. \qquad (17)$$
The zero net change in the value of s_nm between the times 0 and T_rd implements a nondestructive read, as is common in many memory devices. To minimize inaccuracies, the row output interface samples the output current at time 0+ (immediately after time zero). This is done before the conductance of the memristor has changed significantly from its value before the read phase. Using (3), the output current of the synapse to the o_n line at that time is thus

$$I_{nm} = a(\bar{g} + \hat{g} s_{nm}) x_m. \qquad (18)$$

Therefore, the total current in each output line o_n is equal to the sum of the individual currents produced by the synapses driving that line

$$o_n = \sum_m I_{nm} = a \sum_m (\bar{g} + \hat{g} s_{nm}) x_m. \qquad (19)$$

The row output interface measures the output current o_n and outputs

$$r_n = c(o_n - o_{ref}) \qquad (20)$$

where c is a constant converting the current units of o_n to a unitless number r_n and

$$o_{ref} = a \bar{g} \sum_m x_m. \qquad (21)$$

Defining

$$W_{nm} = a c \hat{g} s_{nm} \qquad (22)$$

we obtain

$$r = W x \qquad (23)$$

as desired.
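As a numerical sanity check of the read-phase relations (18)-(23), the following sketch (with illustrative, non-physical constants) verifies that the reference subtraction in (20)-(21) leaves exactly r = Wx with W = ac ĝ s.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 3, 4
a, c, g_bar, g_hat = 0.1, 1e4, 1e-4, 1e-3   # illustrative constants, not from Table I

s = rng.uniform(0.0, 1.0, size=(N, M))       # memristor state variables s_nm
x = rng.uniform(-1.0, 1.0, size=M)           # input vector

I = a * (g_bar + g_hat * s) * x              # per-synapse read currents, eq. (18)
o = I.sum(axis=1)                            # summed line currents, eq. (19)
o_ref = a * g_bar * x.sum()                  # reference current, eq. (21)
r = c * (o - o_ref)                          # row output interface, eq. (20)

W = a * c * g_hat * s                        # synaptic weight definition, eq. (22)
assert np.allclose(r, W @ x)                 # eq. (23): r = W x
print(r)
```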
2) Update Phase (Write): During each write phase, of duration T_wr, u and ū maintain their values from the read phase, while the signal e changes. In this phase, the row input interfaces encode e component-wise, for all n

$$e_n(t) = \begin{cases} \operatorname{sign}(y_n)\, V_{DD}, & \text{if } 0 \le t - T_{rd} \le b|y_n| \\ 0, & \text{if } b|y_n| < t - T_{rd} < T_{wr}. \end{cases} \qquad (24)$$

The interpretation of (24) is that e_n is a pulse with magnitude V_DD, the same sign as y_n, and a duration b|y_n| (where b is a constant converting y_n, a unitless number, to time units). Recall that A is the maximal value of |y_n|, so we require that

$$T_{wr} > bA. \qquad (25)$$

The total change in the internal state is therefore

$$\Delta s_{nm} = \int_{T_{rd}}^{T_{rd} + b|y_n|} \big(a\, \operatorname{sign}(y_n)\, x_m\big)\, dt \qquad (26)$$
$$= a b\, x_m y_n. \qquad (27)$$

Using (22), the desired update rule for the synaptic weights is, therefore, obtained as

$$\Delta W = \eta\, y x^{\top} \qquad (28)$$

where $\eta = a^2 b c \hat{g}$.
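A minimal sketch of the write phase's time×voltage encoding is given below (again with illustrative constants); it numerically integrates the memristor voltage over the pulse defined in (24) and checks that the state increment matches ab·x_m·y_n from (27).

```python
import numpy as np

a, b, dt = 0.1, 1e-4, 1e-8          # illustrative constants and integration step
x_m, y_n = 0.7, -0.3                # one input component and one error component
T_wr = 2 * b                        # write window, satisfies T_wr > b*A (eq. 25)

# During the write window, the memristor sees sign(y_n)*a*x_m for b*|y_n| seconds
# and 0 V afterwards (eq. 24); its state integrates this voltage (eq. 2).
t = np.arange(0.0, T_wr, dt)
v_mem = np.where(t <= b * abs(y_n), np.sign(y_n) * a * x_m, 0.0)
delta_s = np.trapz(v_mem, dx=dt)    # integral of the voltage, eq. (26)

print(delta_s, a * b * x_m * y_n)   # both should be ~ a*b*x_m*y_n, eq. (27)
```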
IV. MULTILAYER NEURAL NETWORK CIRCUIT
So far, the circuit operation was exemplified on an SNN with
the simple adaline algorithm. In this section, it is explained
how, with a few adjustments, the proposed circuit can be used
to implement backpropagation on a general MNN.
Recall the context of the supervised learning setup detailed in Section II, where $x \in \mathbb{R}^M$ is a given pattern and $d \in \mathbb{R}^N$ is a desired label. Consider a double-layer MNN estimator of d, of the form

$$r = W_2\, \sigma(W_1 x)$$

with $W_1 \in \mathbb{R}^{H \times M}$ (H is some integer) and $W_2 \in \mathbb{R}^{N \times H}$ being the two parameter matrices, and where σ is some nonlinear (usually sigmoid) function operating component-wise [i.e., $(\sigma(x))_i = \sigma(x_i)$]. Such a double-layer MNN, with H hidden neurons, can approximate any target function with arbitrary precision [33]. Denote by

$$r_1 = W_1 x \in \mathbb{R}^H; \quad r_2 = r = W_2\, \sigma(r_1) \in \mathbb{R}^N \qquad (29)$$
the output of each layer. Suppose we again use MSE to quantify the error between d and r. In the backpropagation algorithm, each update of the parameter matrices is given by an online gradient descent step, which can be directly derived from [6, eq. (8)] [note the similarity with (9)]

$$\Delta W_1 = \eta\, y_1 x_1^{\top}; \quad \Delta W_2 = \eta\, y_2 x_2^{\top} \qquad (30)$$

with $x_2 \triangleq \sigma(r_1)$, $y_2 \triangleq d - r_2$, $x_1 \triangleq x$, and $y_1 \triangleq (W_2^{\top} y_2) \times \sigma'(r_1)$, where here $(a \times b)_i \triangleq a_i b_i$, a component-wise product, and $(\sigma'(x))_i = d\sigma(x_i)/dx_i$. Implementing such an algorithm requires a minor modification of the proposed circuit, that is, it should have an additional inverted output $W^{\top} y$. Once this modification is made to the circuit, by cascading such circuits, it is straightforward to implement the backpropagation algorithm for two layers or more, as shown in Fig. 4. For
each layer, a synaptic grid circuit stores the corresponding
weight matrix, performs a matrix-vector product as in (29),
and updates the weight matrix as in (30). This last operation
requires the additional output
$$\delta \triangleq W^{\top} y \qquad (31)$$

in each synaptic grid circuit (except the first). We assume that the synaptic circuit has dimensions M×N (with a slight abuse of notation, since M and N should be different for each layer). The additional output (31) can be generated by the circuit in an additional read phase of duration T_rd, between the original read and write phases, in which the original roles of the input and output lines are inverted. In this phase, the n-type MOS (nMOS) transistor is ON, i.e., for all n, e_n = V_DD, and the former output o_n lines are given the following voltage signal (again, used for a nondestructive read):

$$o_n(t) = \begin{cases} a y_n, & \text{if } T_{rd} \le t < 1.5\,T_{rd} \\ -a y_n, & \text{if } 1.5\,T_{rd} \le t \le 2\,T_{rd}. \end{cases} \qquad (32)$$

The current I_nm now flows to the (original input) u_m terminal, which is grounded and shared for all n. The sum of all the currents is measured at time $T_{rd}^{+}$ by the column interface (before it goes into ground)

$$u_m = \sum_n I_{nm} = a \sum_n (\bar{g} + \hat{g} s_{nm}) y_n. \qquad (33)$$

The total current on u_m at time T_rd is the output

$$\delta_m = c(u_m - u_{ref}) \qquad (34)$$

where $u_{ref} = a \bar{g} \sum_n y_n$. Thus, from (22)

$$\delta_m = \sum_n W_{nm} y_n \qquad (35)$$

as required.
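For reference, a minimal software sketch of the two-layer updates (29)-(30), which the cascaded circuits of Fig. 4 are meant to realize, is given below; the layer sizes, learning rate, activation choice, and target mapping are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
M, H, N, eta = 4, 8, 3, 0.05                  # illustrative layer sizes and learning rate

sigma = np.tanh                               # an example sigmoid-like activation
dsigma = lambda z: 1.0 - np.tanh(z) ** 2      # its derivative, sigma'(z)

W1 = 0.1 * rng.standard_normal((H, M))
W2 = 0.1 * rng.standard_normal((N, H))

for k in range(5000):
    x = rng.uniform(-1.0, 1.0, size=M)
    d = np.sin(np.sum(x)) * np.ones(N)        # some arbitrary target mapping

    r1 = W1 @ x                               # first-layer pre-activation, eq. (29)
    r2 = W2 @ sigma(r1)                       # network output, eq. (29)

    x1, x2 = x, sigma(r1)                     # layer inputs
    y2 = d - r2                               # output-layer error
    y1 = (W2.T @ y2) * dsigma(r1)             # back-propagated error, uses delta = W^T y (eq. 31)

    W1 += eta * np.outer(y1, x1)              # eq. (30), realized by one synaptic grid
    W2 += eta * np.outer(y2, x2)              # eq. (30), realized by the second grid

print("final squared error:", float(y2 @ y2))
```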
Finally, note that it is straightforward to use a different error
function instead of the MSE. For example, in an MNN, recall that in the last layer

$$y = d - r = -\frac{1}{2} \nabla_r \|d - r\|^2.$$

If a different error function E(r,d) is required, one should simply replace this with

$$y \triangleq -\alpha \nabla_r E(r,d) \qquad (36)$$

where α is some constant that can be used to tune the learning rate. For example, one could instead use a cross entropy error function

$$E(r,d) = -\sum_i d_i \ln r_i \qquad (37)$$

which is more effective for classification tasks, in which d is
a binary vector [49]. To implement this change in the MNN
circuit (Fig. 4), the subtractor should be simply replaced with
some other module.
V. SOURCES OF NOISE AND VARIABILITY
Usually, analog computation suffers from reduced robust-
ness to noise as compared with digital computation [50].
ML algorithms are, however, inherently robust to noise, which
is a key element in the set of problems they are designed
to solve. For example, gradient descent is quite robust to
perturbations, as intuitively demonstrated in Fig. 5. This
suggests that the effects of intrinsic noise on the perfor-
mance of the analog circuit are relatively small. These effects
largely depend on the specific circuit implementation (e.g., the
CMOS process). Particularly, memristor technology is not
mature yet and memristors have not been fully characterized.
Therefore, to check the robustness of the circuit, a crude estimation of the magnitude of noise and variability is used in this section. This estimation is based on known sources of noise and variability, which are less dependent on the specific implementation. In Section VI, this alleged robustness of the circuit is evaluated by simulating the circuit in the presence of these noise and variation sources.
Fig. 5. Robustness of gradient descent. Consider a simple 2×1 SNN (inset) with a single input x and desired output d (repeatedly presented). As demonstrated schematically in this figure, training the SNN using gradient descent on the MSE $E(w_1, w_2) = (w_1 x_1 + w_2 x_2 - d)^2$ will tend to decrease this error until it converges to some fixed point (generally, a local minimum of the MSE). Schematically, the gradient direction (red arrows) can be arbitrarily varied within a relatively wide range (green triangles) on each iteration, and this will not prevent the convergence.
A. Noise
When one of the transistors is enabled (e(t) = ±V_DD), the current is affected by intrinsic thermal noise sources in the transistors and memristor of each synapse. This noise affects the operation of the circuit during the write and read phases. Current fluctuations on a device due to thermal origin can be approximated by a white noise signal I(t) with zero mean ($\langle I(t)\rangle = 0$) and autocorrelation $\langle I(t) I(t')\rangle = \sigma^2 \delta(t - t')$, where δ(·) is Dirac's delta function and $\sigma^2 = 2 k \tilde{T} g$ (where k is Boltzmann's constant, $\tilde{T}$ is the temperature, and g is the conductance of the device). For 65 nm transistors (parameters taken from IBM's 10LPe/10RFe process [51]), the characteristic conductivity is $g_1 \approx 10^{-4}\ \Omega^{-1}$. Therefore, for $I_1(t)$, the thermal current source of the transistors at room temperature, we have $\sigma_1^2 \approx 10^{-24}\ \mathrm{A}^2\,\mathrm{s}$. Assume that the memristor characteristic conductivity is $g_2 = \epsilon g_1$, so for the thermal current source of the memristor, $\sigma_2^2 = \epsilon \sigma_1^2$. Note that from (14), we have $\epsilon \ll 1$, and the resistance of the transistor is much smaller than that of the memristor. The total voltage on the memristor is thus

$$V_M(t) = \frac{g_1}{g_1 + g_2} u(t) + (g_1 + g_2)^{-1} (I_1(t) - I_2(t)) = \frac{1}{1 + \epsilon} u(t) + \xi(t)$$

where $\xi(t) = \big(g_1^{-1}/(1+\epsilon)\big)(I_1(t) - I_2(t))$. Since different thermal noise sources are uncorrelated, we have $\langle I_1(t) I_2(t')\rangle = 0$, and so $\langle \xi(t) \xi(t')\rangle = \sigma_\xi^2 \delta(t - t')$ with

$$\sigma_\xi^2 = \frac{1}{g_1^2 (1+\epsilon)^2}\left(\langle I_1^2(t)\rangle + \langle I_2^2(t)\rangle\right) = \frac{\sigma_1^2 + \sigma_2^2}{g_1^2 (1+\epsilon)^2} \qquad (38)$$
$$\approx g_1^{-2} \sigma_1^2 = 2 k \tilde{T}\, g_1^{-1} \approx 10^{-16}\ \mathrm{V}^2\,\mathrm{s}.$$
Fig. 6. Noise model for an artificial synapse. (a) During the operation, only one transistor is conducting (assume it is the n-type transistor). (b) Thermal noise in a small-signal model: the transistor is converted to a resistor (g_1) in parallel with a current source (I_1), the memristor is converted to a resistor (g_2) in parallel with a current source (I_2), and the effects of the sources are summed linearly.
The equivalent circuit, including the sources of noise, is shown in Fig. 6. Assuming the circuit minimal trial duration is T = 10 ns, the root MSE due to thermal noise is bounded above by

$$E_T \triangleq \left\langle \left(\frac{1}{T}\int_0^T \xi(t)\, dt\right)^{2} \right\rangle^{1/2} = T^{-1/2} \sigma_\xi \approx 10^{-4}\ \mathrm{V}.$$

Noise in the inputs u, ū, and e also exists. According to [52], the relative noise in the power supply of the u/ū inputs is approximately 10% in the worst case. Applying u = ax effectively gives an input of $ax + E_T + E_u$, where $|E_u| \le 0.1 a |x|$. The absolute noise level in the duration of e should be smaller than $T_{clk}^{min} \approx 2 \cdot 10^{-10}$ s, assuming a digital implementation of pulsewidth modulation with $T_{clk}^{min}$ being the shortest clock cycle currently available. On every write cycle, $e = \pm V_{DD}$ is, therefore, applied for a duration of $b|y| + E_e$ (instead of b|y|), where $|E_e| < T_{clk}^{min}$.
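The thermal noise figures quoted above follow from a few lines of arithmetic; the sketch below simply evaluates σ_1² = 2kT̃g_1, σ_ξ² ≈ σ_1²/g_1², and E_T = T^(-1/2)σ_ξ for the stated g_1 = 10^-4 Ω^-1 and T = 10 ns (room temperature assumed to be 300 K).

```python
import math

k = 1.380649e-23        # Boltzmann's constant [J/K]
T_room = 300.0          # assumed room temperature [K]
g1 = 1e-4               # transistor characteristic conductance [1/Ohm]
T_trial = 10e-9         # minimal trial duration [s]

sigma1_sq = 2 * k * T_room * g1          # thermal current autocorrelation, ~1e-24 A^2 s
sigma_xi_sq = sigma1_sq / g1**2          # voltage noise on the memristor, eq. (38), ~1e-16 V^2 s
E_T = math.sqrt(sigma_xi_sq / T_trial)   # averaged over one trial, ~1e-4 V

print(sigma1_sq, sigma_xi_sq, E_T)
```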
B. Parameter Variability
A common estimation of the variability in memristor parameters is a coefficient of variation (CV = standard deviation/mean) of a few percent [53]. In this paper, the circuit is also evaluated with considerably larger variability (CV ≈ 30%), in addition to the noise sources described in Section V-A. The variability in the parameters of the memristors is modeled by sampling each memristor conductance parameter ĝ independently from a uniform distribution between 0.5 and 1.5 times the original (nonrandom) value of ĝ. When running the algorithm in software, these variations are equivalent to corresponding changes in the synaptic weights W or the learning rate η in the algorithm. Note that the variability in the transistor parameters is not considered, since it can affect the circuit operation only if (13) or (14) is invalidated. This can happen, however, only if the values of K or V_T vary by orders of magnitude, which is unlikely.
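A minimal sketch of this variability model is shown below (the nominal ĝ is an arbitrary placeholder); note that a uniform distribution over [0.5, 1.5] times the nominal value indeed has a coefficient of variation of about 1/√12 ≈ 29%, consistent with the roughly 30% figure above.

```python
import numpy as np

rng = np.random.default_rng(3)
g_hat_nominal = 1e-3                     # illustrative nominal value of g_hat
N, M = 10, 10

# Each memristor's g_hat is drawn uniformly in [0.5, 1.5] x nominal value.
g_hat = g_hat_nominal * rng.uniform(0.5, 1.5, size=(N, M))

cv = g_hat.std() / g_hat.mean()          # coefficient of variation
print(cv)                                # ~1/sqrt(12) ~ 0.29, roughly the 30% quoted above
```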
VI. CIRCUIT EVALUATION
In this section, the proposed synaptic grid circuit is
implemented in a simulated physical model (Section VI-A).
Fig. 7. Synaptic 2×2 grid circuit simulation, during ten operation cycles. Top: circuit inputs (x_1, x_2) and result outputs (r_1, r_2). Middle and bottom: voltage (solid black) and conductance (dashed red) change for each memristor in the grid. Note that the conductance of the (n,m) memristor indeed changes proportionally to x_m y_n, following (27). Simulation was done using SimElectronics, with the inputs as in (39) and (40) and the circuit parameters as in Table I. Similar results were obtained using SPICE with a linear ion drift memristor model and a CMOS 0.18 μm process, as shown in [Appendix D, 39].
First, the basic functionality of a toy example, a 2×2 circuit, is demonstrated (Section VI-B). Then (Section VI-C), using standard supervised learning datasets, the implementation of SNNs and MNNs using this synaptic grid circuit is demonstrated, including its robustness to noise and variation.
A. Software Implementation
Recall that the synaptic grid circuit (implementing the boxes in Figs. 1 and 4) operates in discrete time trials, and at each trial receives two vector inputs x and y, updates an internally stored synaptic matrix W (according to $\Delta W = \eta y x^{\top}$), and outputs the vector r = Wx (optionally, it also outputs the vector $\delta = W^{\top} y$).
The physical model of the circuit is implemented both in
SPICE and SimElectronics [54]. Both are software tools that
enable physical circuit simulation of memristors and MOSFET
devices. The SPICE model is described in [Appendix D, 39].
Next, the SimElectronics synaptic grid circuit model is
described. Exactly the same model (at different grid sizes) is used for all numerical evaluations in Sections VI-B and VI-C. The circuit parameters appear in Table I, with the parameters of the (ideal) transistors kept at their defaults. Note that V_T of the pMOS is defined here with an opposite sign to the usual definition. The memristor model is implemented using (1)-(3) with parameters taken from the experimental data [55,
Fig. 2]. As shown schematically in Fig. 2, the circuit imple-
mentation consists of a synapse-grid and the interface blocks.
The interface units were implemented using a few standard
CMOS and logic components, as can be seen in the detailed
schematics (available in [39] as an HTML file, which may
be opened using Firefox). The circuit operated synchronously
using global control signals supplied externally. The circuit inputs x and y and output r were kept constant in each trial using sample-and-hold units.

TABLE I
CIRCUIT PARAMETERS
B. Basic Functionality
First, the basic operation of the proposed synaptic grid
circuit is examined numerically in a toy example.
A small 2×2 synaptic grid circuit is simulated for a time 10T (10 read-write cycles) with a simple piecewise constant input

$$x = (x_1, x_2) = (0.8, 0.4) \cdot \operatorname{sign}(t - 5T) \qquad (39)$$

and a constant input

$$y = (y_1, y_2) = (0.2, 0.1). \qquad (40)$$
In Fig. 7, the resulting circuit outputs are shown, together with
the memristors’ voltages and conductances. The correct basic
operation of the circuit is verified.
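As a rough cross-check (the specific observations from Fig. 7 are enumerated after this sketch), the following behavioral model abstracts away the device physics and applies only the idealized relations (22), (23), and (27) to the same inputs; the constants a, b, c, and ĝ are arbitrary illustrative values, not those of Table I.

```python
import numpy as np

a, b, c, g_hat = 0.1, 1e-4, 1e4, 1e-3      # illustrative constants, not Table I values
T, cycles = 1.0, 10                         # abstract trial duration and cycle count

s = np.zeros((2, 2))                        # memristor state variables s_nm
y = np.array([0.2, 0.1])                    # constant error input, eq. (40)

for k in range(cycles):
    t = (k + 0.5) * T                       # sample the input mid-cycle
    x = np.array([0.8, 0.4]) * np.sign(t - 5 * T)   # piecewise constant input, eq. (39)
    W = a * c * g_hat * s                   # weights stored in the state, eq. (22)
    r = W @ x                               # read-phase output, eq. (23)
    s += a * b * np.outer(y, x)             # write-phase increment, eq. (27)
    print(k, r)
```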
1) In the first read cycle, the memristors are used to generate the output (23). The voltage trace on the (n,m) memristor is a ±a x_m bipolar pulse, as expected from (16). This results in a nondestructive read (a zero net change in conductance), as expected from (17).
2) In the second read cycle, the memristors are used to generate the output (35). The voltage trace on the (n,m) memristor is a ±a y_n bipolar pulse [as expected from (32)], again resulting in a nondestructive read.
3) In the write cycle, the stored weights are incremented according to (28). As expected from (24), the (n,m) memristor is subjected to a voltage pulse of amplitude sign(y_n) a x_m with duration b|y_n|. Furthermore, there is a ĝ a b x_m y_n increment in memristor conductance, following (27).
4) The output of the circuit is $r_n = \sum_m a c \hat{g} s_{nm} x_m$, following (5) and (18)-(23).

TABLE II
LEARNING PARAMETERS FOR EACH DATASET
C. Learning Performance
The synaptic grid circuit model is used to implement an SNN and an MNN, trainable by the online gradient descent algorithm.
To demonstrate that the algorithm is indeed implemented
correctly by the proposed circuit, the circuit performance has
been compared to an algorithmic implementation of the MNNs
in (MATLAB) software. Two standard tasks are used:
1) the Wisconsin Breast Cancer diagnosis task [56]
(linearly separable);
2) the Iris classification task [56] (not linearly separable).
The first task was evaluated using an SNN circuit, similarly
to Fig. 1. Note this task has only a single output (yes/no),
so the synaptic grid (Fig. 2) has only a single row. The
second task is evaluated on a two-layer MNN circuit, similarly
to Fig. 4. The learning parameters are described in Table II.
Fig. 8 shows the training error (performance during training
phase) of the following:
1) the algorithm;
2) the proposed circuit (implementing the algorithm);
3) the circuit, with about 10% noise and 30% variability
(Section V).
In addition, Table III shows the test error—the error estimated on the test set after training was over (the standard deviation (Std) was calculated as $(P_e(1 - P_e)/N_{test})^{1/2}$, with $N_{test}$ being the number of samples in the test set and $P_e$ being the test error). On each of the two tasks, the training performance (i.e., on the training set) of the circuit and the algorithm similarly improves, finally reaching a similar test error. These results indicate that the proposed circuit design can precisely implement MNNs trainable with online gradient descent, as expected from the derivations in Section III. Moreover, the circuit exhibits considerable robustness, as its performance is only mildly affected by significant levels of noise and variation.

Fig. 8. Circuit evaluation—training error for two datasets. (a) Task 1—Wisconsin breast cancer diagnosis. (b) Task 2—Iris classification.

TABLE III
CIRCUIT EVALUATION—TEST ERROR (MEAN ± STD) FOR (a) ALGORITHM, (b) CIRCUIT, AND (c) CIRCUIT WITH NOISE AND VARIABILITY
Implementation Details: The SNN and the MNN were
implemented using the (SimElectronics) synaptic grid circuit
model, which was described in Section VI-A. The detailed
schematics of the two-layer MNN model are again given
in [39] (as an HTML file, which may be opened using Firefox),
and also in the code. All simulations were done using a desktop computer with a Core i7 930 processor, a Windows 8.1 operating system, and a MATLAB 2013b environment. Each circuit simulation takes about one day to complete (note that the software implementation of the circuit does not exploit the parallel operation of the circuit hardware). For both tasks, we used the standard training and testing procedures [6]. For each sample k from the dataset, the attributes are converted to an M-long vector $x^{(k)}$, and each label is converted to an N-long binary vector $d^{(k)}$. The training set is repeatedly presented, with samples in random order. The inputs' mean was subtracted, and they were normalized as recommended in [6], with additional rescaling done in dataset 1 to ensure that the inputs fall within the circuit specification, i.e., meet the requirement specified by (13). The performance was averaged
over 10 repetitions (of training and testing) and the training
error was also averaged over the previous 20 samples.
The initial weights were sampled independently from a sym-
metric uniform distribution. For both tasks, in the output layer,
a softmax activation function was used

$$(\sigma(x))_i = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}$$

together with a cross entropy error (37), since this combination is known to improve performance [57]. In task 2, in the two-layer MNN, the neuronal activation functions in the first (hidden) layer were set as

$$\sigma(x_i) = 1.7159 \tanh\left(\frac{2 x_i}{3}\right)$$
as recommended in [6]. Other parameters are given
in Tables I and II. In addition to the inputs from the previous
layer, the commonly used bias input is implemented in the
standard way (i.e., at each circuit, the neuronal input x is extended with an additional component equal to a constant one).
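For completeness, the output-layer and hidden-layer nonlinearities described above are easy to state in code; the sketch below (plain NumPy, not the paper's MATLAB/SimElectronics model) also notes the standard identity that, with softmax outputs and the cross entropy error (37), the gradient with respect to the softmax pre-activations reduces to r − d.

```python
import numpy as np

def softmax(z):
    # Output-layer activation used for both tasks
    e = np.exp(z - z.max())           # subtract max for numerical stability
    return e / e.sum()

def hidden_act(z):
    # Hidden-layer activation used in task 2: sigma(x) = 1.7159 * tanh(2x/3)
    return 1.7159 * np.tanh(2.0 * z / 3.0)

def cross_entropy(r, d):
    # Eq. (37): E(r, d) = -sum_i d_i * ln(r_i)
    return -np.sum(d * np.log(r))

z = np.array([0.2, -1.0, 0.5])        # example pre-activations
d = np.array([0.0, 0.0, 1.0])         # one-hot desired label
r = softmax(z)

# Standard identity: the gradient of (37) with respect to the softmax
# pre-activations z is r - d, i.e., an error signal proportional to d - r.
grad_z = r - d
print(cross_entropy(r, d), grad_z)
```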
VII. DISCUSSION
As explained in Sections II and IV, two major computational bottlenecks of MNNs and many ML algorithms are given by a matrix×vector product operation (4) and a vector×vector outer product operation (9). Both are of order O(M·N), where M and N are the sizes of the input and output vectors. In this paper, the proposed circuit is designed specifically to deal with these bottlenecks using memristor arrays. This design has a relatively small number of components in each array element—one memristor and two transistors. The physical grid-like structure of the arrays implements the matrix×vector product operation (4) using analog summation of currents, while the (nonlinear) memristor dynamics enable us to perform the vector×vector outer product operation (10), using a time×voltage encoding paradigm. The idea of using a resistive grid to perform a matrix×vector product operation is not new (e.g., [58]). The main novelty of this paper is the use of memristors together with time×voltage encoding, which allows us to perform a mathematically accurate vector×vector outer product operation in the learning rule using a small number of components.
A. Previous CMOS-Based Designs
As mentioned in the introduction, CMOS hardware designs
that specifically implement online learning algorithms remain
an unfulfilled promise at this point.
The main incentive for existing hardware solutions is the
inherent inefficiency in implementing these algorithms in
software running on general-purpose hardware (e.g., CPUs,
digital signal processors, and GPUs). However, squeezing
the required circuit for both the computation and the update
phases (two configurable multipliers, for the matrix ×vector
product and vector ×vector outer product, and a memory
element to store the synaptic weight) into an array cell has
proven to be a hard task, using currently available CMOS
technology. Off-chip or chip-in-the-loop design architectures
[67, Table I] have been suggested in many cases as a way
around this design barrier. These designs, however, generally
deal with the computational bottleneck of the matrix×vector product operation in the computation phase, rather than the computational bottleneck of the vector×vector outer product operation in the update phase. In addition, these solutions are only useful in cases where the training is not continuous and is done in a predeployment phase or in special reconfiguration phases during the operation. Other designs implement nonstandard (e.g., perturbation algorithms [68]) or specifically tailored learning algorithms (e.g., modified backpropagation for spiking neurons [69]). However, it remains to be seen whether such algorithms are indeed scalable.
Hardware designs of artificial synaptic arrays that are capable of implementing common (scalable) online gradient-descent-based learning are listed in Table IV. For large arrays (i.e., large M and N), the effective transistor count per synapse (where resistors and capacitors were also counted as transistors) is approximately proportional to the required area and average static power usage of the circuit.

TABLE IV
HARDWARE DESIGNS OF ARTIFICIAL SYNAPSES IMPLEMENTING SCALABLE ONLINE LEARNING ALGORITHMS
The smallest synaptic circuit [59] includes two transistors
similarly to our design, but requires the (rather unusual) use of
UV illumination during its operation and has the disadvantage
of having volatile weights decaying within minutes. The
next device [60] includes six transistors per synapse, but the
update rule can only increase the synaptic weights, which
makes the device unusable for practical purposes. The next
device [62], [70] suggested a grid design using CMOS Gilbert multipliers, resulting in 39 transistors per synapse. In a similar design [63], 52 transistors are used. Both of these devices use capacitive elements for analog memory and suffer from the limitation of having volatile weights, which vanish after training has stopped. Therefore, they require constant retraining (practically acting as a refresh). Such retraining is also required at each startup; alternatively, the weights must be read out into an auxiliary memory—a solution that requires a mechanism for reading out the synaptic weights. The larger design in [64] (92 transistors) also has weight decay, but with a slow, hours-long timescale. The device in [65] (83 transistors and an unspecified weight unit that stores the weights) is not reported to have weight decay, apparently because digital storage is used. This is also true for [66] (150 transistors).
B. Memristor-Based Designs: Expected
Benefits and Technical Issues
The proposed memristor-based design should resolve the
main obstacles of the above CMOS-based designs, and provide
a compact nonvolatile [71] circuit. Area and power consump-
tion are expected to be reduced by a factor of 13–50, in
comparison with standard CMOS technology, if a memristor
is counted as an additional transistor (although it is actually
more compact). Perhaps the most convincing evidence for the limitations of the CMOS-based designs is the fact that although most of these designs are two decades old, they have not been incorporated into commercial products. It is only fair to mention at this point that while our design is purely theoretical and based on speculative technology, the designs reviewed above are based on mature technology and have overcome obstacles all the way to manufacturing.
However, a physical implementation of a memristor-based
neural network should be feasible, as was demonstrated in a
recent work [31]—where a different memristor-based design
was manufactured and tested.
The hardware design in [31] of an SNN with binary outputs demonstrated successful online training using the popular perceptron algorithm. The currently proposed design allows more flexibility, since it can be used to train general MNNs (considered to be much more powerful than SNNs [33]) using the scalable online gradient descent algorithm (backpropagation). In addition, the currently proposed design can also be used to implement the perceptron algorithm (used in [31]), since it is very similar to the adaline algorithm (Fig. 1). Due to
this similarity, both designs should encounter similar technical
issues in a concrete implementation.
Encouragingly, the memristor-based neural network circuit
in [31] is able to achieve good performance, despite the
following issues: 1) noisy memristor dynamics affect the
accuracy of the weight update; 2) variations in memristor para-
meters generate similar variations in the learning rates (here η);
and 3) the nonlinearity of the conductivity [here G(s(t))]
can have a saturating effect on the weights (resulting in
bounded weights). Overcoming issues 1) and 2) suggests that
training MNNs should be relatively robust to noise and vari-
ations, as argued here (Fig. 5) and demonstrated numerically
(Fig. 8 and Table III). Overcoming issue 3) suggests that the
saturating effect of the nonlinearity G(s(t)) on the weights is
not catastrophic. This could be related again to the robustness
of gradient descent (Fig. 5). Moreover, bounding the weight magnitudes can be desirable, and various regularization methods
are commonly used to achieve this effect in MNNs. More
specifically, a saturating nonlinearity on the weights can even
improve performance [72]. Other types of nonlinearity may
also be beneficial [73].
C. Circuit Modifications and Generalizations
The specific parameters that were used for the circuit evalua-
tion (Table I) are not strictly necessary for the proper execution
of the proposed design and are only used to demonstrate its
applicability. For example, it is straightforward to show that
K, V_T, and ḡ have little effect [as long as (13) and (14) hold], and different values of ĝ can be adjusted for by rescaling the
constant c, which appears in (20) and (34). This is important
since the feasible range of parameters for the memristive
devices is still not well characterized and it seems to be quite
broad. For example, the values of the memristive timescales
range from picoseconds [74] to milliseconds [55]. Here, the
parameters were taken from a millisecond-timescale memristor
[55]. The actual speed of the proposed circuit would critically
depend on the timescales of commercially available memristor
devices.
Additional important modifications of the proposed circuit
are straightforward. In [39, Appendix A], it is explained how to modify the circuit to work with more realistic memristive devices [40], [14], instead of the classical memristor model [11], given a few conditions. In [39, Appendix C], it is shown that
it is possible to reduce the transistor count from two to one,
at the price of doubling the duration of the write phase. Other
useful modifications of the circuit are also possible. For example, the input x may be allowed to receive different values during
the read and write operations. In addition, it is straightforward
to replace the simple outer product update rule in (10) by more
general update rules of the form
$$W_{nm}^{(k+1)} = W_{nm}^{(k)} + \eta \sum_{i,j} f_i\big(y_n^{(k)}\big)\, g_j\big(x_m^{(k)}\big)$$
where f_i and g_j are some functions. Finally, it is possible to adaptively modify the learning rate η during training [e.g., by modifying α in (36)].
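The NumPy sketch below (illustrative only) shows how such a generalized update could be evaluated in software, together with one plausible decaying learning-rate schedule; the schedule form and the constants eta0 and alpha are our assumptions and are not taken from (36).

```python
import numpy as np

def generalized_outer_update(W, y, x, fs, gs, eta):
    """W[n, m] += eta * sum_{i,j} fs[i](y[n]) * gs[j](x[m])."""
    F = np.stack([f(y) for f in fs])          # shape (len(fs), N)
    G = np.stack([g(x) for g in gs])          # shape (len(gs), M)
    return W + eta * np.einsum('in,jm->nm', F, G)

def eta_schedule(k, eta0=0.1, alpha=1e-3):
    """One possible decaying learning-rate schedule (hypothetical constants)."""
    return eta0 / (1.0 + alpha * k)

# With a single identity function for f and g, the plain outer-product
# rule (10) is recovered.
fs = [lambda y: y]
gs = [lambda x: x]

W = np.zeros((4, 3))
y = np.random.randn(4)   # stands for the y^(k) vector in the update rule above
x = np.random.randn(3)
for k in range(100):
    W = generalized_outer_update(W, y, x, fs, gs, eta_schedule(k))
```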
VIII. CONCLUSION
A novel method for implementing scalable online gradient descent learning in multilayer neural networks through local update rules is proposed, based on the emerging memristor technology. The method relies on an artificial
synapse using one memristor to store the synaptic weight and
two CMOS transistors to control the circuit. The proposed synapse structure achieves accuracy similar to that of an equivalent software implementation, while exhibiting high robustness and immunity to noise and parameter variability.
Such circuits may be used to implement large-scale online
learning in multilayer neural networks, as well as other learn-
ing systems. The circuit is estimated to be significantly smaller
than existing CMOS-only designs, opening the opportunity for
massive parallelism with millions of adaptive synapses on a
single integrated circuit, operating with low static power and
good robustness. Such brain-like properties may give a signifi-
cant boost to the field of neural networks and learning systems.
ACKNOWLEDGMENT
The authors would like to thank E. Friedman, I. Hubara,
R. Meir, and U. Weiser for their support and helpful comments,
and E. Rosenthal and S. Greshnikov for their contribution to
the SPICE simulations.
REFERENCES
[1] Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado,
J. Dean, and A. Y. Ng, “Building high-level features using large scale
unsupervised learning,” in Proc. ICML, Edinburgh, Scotland, Jun. 2012,
pp. 81–88.
[2] D. Ciresan, U. Meier, and J. Schmidhuber, “Multi-column deep neural
networks for image classification,” in Proc. CVPR, Providence, RI, USA,
Jun. 2012, pp. 3642–3649.
[3] J. Dean et al., “Large scale distributed deep networks,” in Proc. NIPS,
Lake Tahoe, NV, USA, Dec. 2012, pp. 1–9.
[4] R. D. Hof, “Deep learning,” MIT Technol. Rev., Apr. 2013.
[5] D. Hernandez, “Now you can build Google’s $1M artificial brain on the
cheap,” Wired, Jun. 2013, pp. 9–48.
[6] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, “Efficient
backprop,” in Neural Networks: Tricks of the Trade, G. Montavon,
G. B. Orr, and K.-R. Müller, Eds., 2nd ed. Heidelberg, Germany:
Springer-Verlag, 2012.
[7] J. Misra and I. Saha, “Artificial neural networks in hardware: A sur-
vey of two decades of progress,” Neurocomputing, vol. 74, nos. 1–3,
pp. 239–255, Dec. 2010.
[8] A. R. Omondi, “Neurocomputers: A dead end?” Int. J. Neural Syst.,
vol. 10, no. 6, pp. 475–481, 2000.
[9] M. Versace and B. Chandler, “The brain of a new machine,” IEEE
Spectr., vol. 47, no. 12, pp. 30–37, Dec. 2010.
[10] M. M. Waldrop, “Neuroelectronics: Smart connections,” Nature,
vol. 503, no. 7474, pp. 22–24, Nov. 2013.
[11] L. O. Chua, “Memristor-the missing circuit element,” IEEE Trans.
Circuit Theory, vol. 18, no. 5, pp. 507–519, Sep. 1971.
[12] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams, “The
missing memristor found,” Nature, vol. 453, no. 7191, pp. 80–83,
Mar. 2008.
[13] F. Corinto and A. Ascoli, “A boundary condition-based approach to the
modeling of memristor nanostructures,” IEEE Trans. Circuits Syst. I,
Reg. Papers, vol. 59, no. 11, pp. 2713–2726, Nov. 2012.
[14] S. Kvatinsky, E. G. Friedman, A. Kolodny, and U. C. Weiser, “TEAM:
ThrEshold adaptive memristor model,” IEEE Trans. Circuits Syst. I, Reg.
Papers, vol. 60, no. 1, pp. 211–221, Jan. 2013.
[15] S. Kvatinsky, Y. H. Nacson, Y. Etsion, E. G. Friedman, A. Kolodny, and
U. C. Weiser, “Memristor-based multithreading,” Comput. Archit. Lett.,
vol. 13, no. 1, pp. 41–44, Jul. 2014.
[16] Z. Guo, J. Wang, and Z. Yan, “Passivity and passification of memristor-
based recurrent neural networks with time-varying delays,” IEEE
Trans. Neural Netw. Learn. Syst., vol. 25, no. 11, pp. 2099–2109,
Nov. 2014.
[17] L. Wang, Y. Shen, Q. Yin, and G. Zhang, “Adaptive synchronization
of memristor-based neural networks with time-varying delays,” IEEE
Trans. Neural Netw. Learn. Syst., to be published.
[18] G. Zhang and Y. Shen, “Exponential synchronization of delayed
memristor-based chaotic neural networks via periodically intermittent
control,” Neural Netw., vol. 55, pp. 1–10, Jul. 2014.
[19] D. Querlioz, O. Bichler, and C. Gamrat, “Simulation of a memristor-
based spiking neural network immune to device variations,” in Proc. Int.
Joint Conf. Neural Netw. (IJCNN), San Jose, CA, USA, Jul./Aug. 2011,
pp. 1775–1781.
[20] C. Zamarreño-Ramos, L. A. Camuñas-Mesa, J. A. Pérez-Carrasco,
T. Masquelier, T. Serrano-Gotarredona, and B. Linares-Barranco,
“On spike-timing-dependent-plasticity, memristive devices, and build-
ing a self-learning visual cortex,” Frontiers Neurosci., vol. 5, p. 26,
Jan. 2011.
[21] O. Kavehei et al., “Memristor-based synaptic networks and logical
operations using in-situ computing,” in Proc. 7th Int. Conf. Intell.
Sensors, Sensor Netw. Inf. Process., Adelaide, SA, Australia, Dec. 2011,
pp. 137–142.
[22] A. Nere, U. Olcese, D. Balduzzi, and G. Tononi, “A neuromor-
phic architecture for object recognition and motion anticipation using
burst-STDP,” PloS One, vol. 7, no. 5, p. e36958, Jan. 2012.
[23] Y. Kim, Y. Zhang, and P. Li, “A digital neuromorphic VLSI architecture
with memristor crossbar synaptic array for machine learning,” in Proc.
IEEE Int. SOC Conf. (SOCC), Niagara Falls, NY, USA, Sep. 2012,
pp. 328–333.
[24] W. Chan and J. Lohn, “Spike timing dependent plasticity with memris-
tive synapse in neuromorphic systems,” in Proc. Int. Joint Conf. Neural
Netw. (IJCNN), Brisbane, QLD, Australia, Jun. 2012, pp. 1–6.
[25] T. Serrano-Gotarredona, T. Masquelier, T. Prodromakis, G. Indiveri, and
B. Linares-Barranco, “STDP and STDP variations with memristors for
spiking neuromorphic learning systems,” Frontiers Neurosci., vol. 7, p. 2,
Jan. 2013.
[26] D. Querlioz, O. Bichler, P. Dollfus, and C. Gamrat, “Immunity to device
variations in a spiking neural network with memristive nanodevices,”
IEEE Trans. Nanotechnol., vol. 12, no. 3, pp. 288–295, May 2013.
[27] R. Legenstein, C. Naeger, and W. Maass, “What can a neuron learn with
spike-timing-dependent plasticity?” Neural Comput., vol. 17, no. 11,
pp. 2337–2382, Mar. 2005.
[28] D. Chabi, W. Zhao, D. Querlioz, and J.-O. Klein, “Robust neural logic
block (NLB) based on memristor crossbar array,” in Proc. IEEE/ACM
Int. Symp. IEEE Nanosc. Archit. (NANOARCH), San Diego, CA, USA,
Jun. 2011, pp. 137–143.
[29] H. Manem, J. Rajendran, and G. S. Rose, “Stochastic gradient descent
inspired training technique for a CMOS/nano memristive trainable
threshold gate array,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 59,
no. 5, pp. 1051–1060, May 2012.
[30] M. Soltiz, D. Kudithipudi, C. Merkel, G. S. Rose, and R. E.
Pino, “Memristor-based neural logic blocks for nonlinearly separable
functions,” IEEE Trans. Comput., vol. 62, no. 8, pp. 1597–1606,
Aug. 2013.
[31] F. Alibart, E. Zamanidoost, and D. B. Strukov, “Pattern classification by
memristive crossbar circuits using ex situ and in situ training,” Nature
Commun., vol. 4, p. 2072, May 2013.
[32] F. Rosenblatt, “The perceptron: A probabilistic model for information
storage and organization in the brain,” Psychol. Rev., vol. 65, no. 6,
pp. 386–408, Nov. 1958.
[33] G. Cybenko, “Approximation by superpositions of a sigmoidal function,”
Math. Control, Signals, Syst., vol. 2, no. 4, pp. 303–314, 1989.
[34] L. Bottou and O. Bousquet, “The tradeoffs of large scale learning,”
in Optimization for Machine Learning, S. Sra, S. Nowozin, and
S. J. Wright, Eds. Cambridge, MA, USA: MIT Press, 2011, p. 351.
[35] D. Cireşan and U. Meier, “Deep, big, simple neural nets for handwritten
digit recognition,” Neural Comput., vol. 22, no. 12, pp. 3207–3220,
Nov. 2010.
[36] S. P. Adhikari, C. Yang, H. Kim, and L. O. Chua, “Memristor bridge
synapse-based neural network and its learning,” IEEE Trans. Neural
Netw. Learn. Syst., vol. 23, no. 9, pp. 1426–1435, Sep. 2012.
[37] R. Hasan and T. M. Taha, “Enabling back propagation training of
memristor crossbar neuromorphic processors,” in Proc. Int. Joint Conf.
Neural Netw. (IJCNN), Beijing, China, Jul. 2014, pp. 21–28.
[38] Z. Vasilkoski et al., “Review of stability properties of neural plasticity
rules for implementation on memristive neuromorphic hardware,” in
Proc. Int. Joint Conf. Neural Netw., San Jose, CA, USA, Jul./Aug. 2011,
pp. 2563–2569.
[39] The Supplementary Material. [Online]. Available:
http://ieeexplore.ieee.org.
[40] L. O. Chua and S. M. Kang, “Memristive devices and systems,” Proc.
IEEE, vol. 64, no. 2, pp. 209–223, Feb. 1976.
[41] M. D. Pickett et al., “Switching dynamics in titanium dioxide memristive
devices,” J. Appl. Phys., vol. 106, no. 7, p. 074508, 2009.
[42] J. Strachan et al., “State dynamics and modeling of tantalum oxide
memristors,” IEEE Trans. Electron Devices, vol. 60, no. 7,
pp. 2194–2202, Jul. 2013.
[43] B. Widrow and M. E. Hoff, “Adaptive switching circuits,” Stanford
Electron. Labs, Stanford Univ., Stanford, CA, USA, Tech. Rep., 1960.
[44] B. Widrow and S. D. Stearns, Adaptive Signal Processing.
Englewood Cliffs, NJ, USA: Prentice-Hall, 1985.
[45] E. Oja, “Simplified neuron model as a principal component analyzer,”
J. Math. Biol., vol. 15, no. 3, pp. 267–273, Nov. 1982.
[46] C. M. Bishop, Pattern Recognition and Machine Learning. Singapore:
Springer-Verlag, 2006.
[47] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter, “Pegasos: Primal
estimated sub-gradient solver for SVM,” Math. Program., vol. 127, no. 1,
pp. 3–30, Oct. 2010.
[48] S. Kvatinsky, N. Wald, E. Satat, E. G. Friedman, A. Kolodny, and
U. C. Weiser, “Memristor-based material implication (IMPLY) logic:
Design principles and methodologies,” IEEE Trans. Very Large Scale
Integr. (VLSI) Syst., vol. 22, no. 10, pp. 2054–2066, Oct. 2014.
[49] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford, U.K.:
Oxford Univ. Press, 1995.
[50] R. Sarpeshkar, “Analog versus digital: Extrapolating from electronics
to neurobiology,” Neural Comput., vol. 10, no. 7, pp. 1601–1638,
Oct. 1998.
[51] The Mosis Service. [Online]. Available: http://www.mosis.com, accessed
Nov. 21, 2012.
[52] G. Huang, D. C. Sekar, A. Naeemi, K. Shakeri, and J. D. Meindl,
“Compact physical models for power supply noise and chip/package
co-design of gigascale integration,” in Proc. IEEE 57th Electron.
Compon. Technol. Conf. (ECTC), Sparks, NV, USA, May/Jun. 2007,
pp. 1659–1666.
[53] M. Hu, H. Li, Y. Chen, X. Wang, and R. E. Pino, “Geometry variations
analysis of TiO2 thin-film and spintronic memristors,” in Proc. 16th
Asia South Pacific Design Autom. Conf., Yokohama, Japan, Jan. 2011,
pp. 25–30.
[54] SimElectronics. [Online]. Available: http://www.mathworks.com/
products/simelectronics/, accessed Nov. 21, 2012.
[55] T. Chang, S.-H. Jo, K.-H. Kim, P. Sheridan, S. Gaba, and W. Lu,
“Synaptic behaviors and modeling of a metal oxide memristive device,”
Appl. Phys. A, vol. 102, no. 4, pp. 857–863, Feb. 2011.
[56] K. Bache and M. Lichman. (2013). UCI Machine Learning Repository.
[Online]. Available: http://archive.ics.uci.edu/ml
[57] P. Y. Simard, D. Steinkraus, and J. C. Platt, “Best practices for convo-
lutional neural networks applied to visual document analysis,” in Proc.
7th Int. Conf. Document Anal. Recognit., vol. 1. Edinburgh, Scotland,
Aug. 2003, pp. 958–963.
[58] D. B. Strukov and K. K. Likharev, “Reconfigurable nano-crossbar archi-
tectures,” in Nanoelectronics and Information Technology, R. Waser, Ed.
New York, NY, USA: Wiley, 2012, pp. 543–562.
[59] G. Cauwenberghs, C. F. Neugebauer, and A. Yariv, “Analysis and veri-
fication of an analog VLSI incremental outer-product learning system,”
IEEE Trans. Neural Netw., vol. 3, no. 3, pp. 488–497, May 1992.
[60] H. C. Card, C. R. Schneider, and W. R. Moore, “Hebbian plasticity in
MOS synapses,” IEE Proc. F, Radar Signal Process., vol. 138, no. 1,
pp. 13–16, Feb. 1991.
[61] C. Schneider and H. Card, “Analogue CMOS Hebbian synapses,”
Electron. Lett., vol. 27, no. 9, pp. 785–786, Apr. 1991.
[62] H. C. Card, C. R. Schneider, and R. S. Schneider, “Learning capacitive
weights in analog CMOS neural networks,” J. VLSI Signal Process. Syst.
Signal, Image Video Technol., vol. 8, no. 3, pp. 209–225, Oct. 1994.
[63] M. Valle, D. D. Caviglia, and G. M. Bisio, “An experimental analog
VLSI neural network with on-chip back-propagation learning,” Analog
Integr. Circuits Signal Process., vol. 9, no. 3, pp. 231–245, Apr. 1996.
[64] T. Morie and Y. Amemiya, “An all-analog expandable neural net-
work LSI with on-chip backpropagation learning,” IEEE J. Solid-State
Circuits, vol. 29, no. 9, pp. 1086–1093, Sep. 1994.
[65] C. Lu, B.-X. Shi, and L. Chen, “An on-chip BP learning neural network
with ideal neuron characteristics and learning rate adaptation,” Analog
Integr. Circuits Signal Process., vol. 31, no. 1, pp. 55–62, Apr. 2002.
[66] T. Shima, T. Kimura, Y. Kamatani, T. Itakura, Y. Fujita, and T. Iida,
“Neuro chips with on-chip back-propagation and/or Hebbian learning,”
IEEE J. Solid-State Circuits, vol. 27, no. 12, pp. 1868–1876, Dec. 1992.
[67] C. S. Lindsey and T. Lindblad, “Survey of neural network hardware,”
Proc. SPIE, vol. 2492, pp. 1194–1205, Apr. 1995.
[68] G. Cauwenberghs, “A learning analog neural network chip with
continuous-time recurrent dynamics,” in Proc. NIPS, Golden, CO, USA,
Nov. 1994, pp. 858–865.
[69] H. Eguchi, T. Furuta, H. Horiguchi, S. Oteki, and T. Kitaguchi, “Neural
network LSI chip with on-chip learning,” in Proc. Int. Joint Conf. Neural
Netw. (IJCNN), Seattle, WA, USA, Jul. 1991, pp. 453–456.
[70] C. Schneider and H. Card, “CMOS implementation of analog
Hebbian synaptic learning circuits,” in Proc. Int. Joint Conf. Neural
Netw. (IJCNN), vol. 1, Seattle, WA, USA, Jul. 1991,
pp. 437–442.
[71] F. Alibart, L. Gao, B. D. Hoskins, and D. B. Strukov, “High precision
tuning of state for memristive devices by adaptable variation-tolerant
algorithm,” Nanotechnology, vol. 23, no. 7, p. 075201, Feb. 2012.
[72] D. Soudry, I. Hubara, and R. Meir, “Expectation backpropagation:
Parameter-free training of multilayer neural networks with continuous
or discrete weights,” in Proc. NIPS, Montreal, QC, Canada, Dec. 2014,
pp. 963–971.
[73] M. Milev and M. Hristov, “Analog implementation of ANN with inherent
quadratic nonlinearity of the synapses,” IEEE Trans. Neural Netw.,
vol. 14, no. 5, pp. 1187–1200, Sep. 2003.
[74] A. C. Torrezan, J. P. Strachan, G. Medeiros-Ribeiro, and R. S. Williams,
“Sub-nanosecond switching of a tantalum oxide memristor,”
Nanotechnology, vol. 22, no. 48, p. 485203, Dec. 2011.
Daniel Soudry received the B.Sc. degree in electri-
cal engineering and physics and the Ph.D. degree
in electrical engineering from the Technion-Israel
Institute of Technology, Haifa, Israel, in 2008 and
2013, respectively.
He is currently a Gruss Lipper Post-Doctoral
Fellow with the Department of Statistics, Center of
Theoretical Neuroscience, and the Grossman Center
for the Statistics of Mind, Columbia University,
New York, NY, USA. His current research interests
include modeling the nervous system and its com-
ponents, Bayesian methods for neural data analysis and inference in neural
networks, and hardware implementation of neural systems.
Dotan Di Castro received the B.Sc., M.Sc., and
Ph.D. degrees from the Technion-Israel Institute of
Technology, Haifa, Israel, in 2003, 2006, and 2010,
respectively.
He was with IBM Research Labs, Haifa, from
2000 to 2004, and was involved in several startup
companies from 2009 to 2013. He is currently
with Yahoo! Labs, Haifa, where he is investigating
information processing in very large scale systems.
His current research interests include machine learn-
ing (in particular, reinforcement learning), computer
vision, and large-scale hierarchical learning systems.
Asaf Gal received the B.Sc. degree in physics and
electrical engineering from the Technion-Israel Insti-
tute of Technology, Haifa, Israel, in 2004, and the
Ph.D. degree in computational neuroscience from the
Hebrew University of Jerusalem, Jerusalem, Israel,
in 2013.
He is currently a Clore Post-Doctoral Fellow with
the Department of Physics of Complex Systems,
Weizmann Institute of Science, Rehovot 76100,
Israel. His current research interests include theo-
retically oriented study of biological systems, bio-
physics, and the application of complex systems science to experiments in
biology.
Avinoam Kolodny received the Ph.D. degree in
microelectronics from the Technion-Israel Institute
of Technology (Technion), Haifa, Israel, in 1980.
He joined Intel Corporation, Santa Clara, CA,
USA, where he was involved in research and devel-
opment in the areas of device physics, very large
scale integration (VLSI) circuits, electronic design
automation, and organizational development. He has
been a member of the Faculty of Electrical Engineer-
ing with Technion since 2000. His current research
interests include interconnects in VLSI systems, at
both physical and architectural levels.
Shahar Kvatinsky received the B.Sc. degree in
computer engineering and applied physics and the
M.B.A. degree from the Hebrew University of
Jerusalem, Jerusalem, Israel, in 2009 and 2010,
respectively, and the Ph.D. degree in electrical engi-
neering from the Technion-Israel Institute of Tech-
nology, Haifa, Israel, in 2014.
He was with Intel Corporation, Santa Clara, CA,
USA, as a Circuit Designer, from 2006 to 2009. He
is currently a Post-Doctoral Research Fellow with
Stanford University, Stanford, CA, USA. His current
research interests include circuits and architectures with emerging memory
technologies and design of energy efficient architectures.
In this paper, adaptive synchronization of memristor-based neural networks (MNNs) with time-varying delays is investigated. The dynamical analysis here employs results from the theory of differential equations with discontinuous right-hand sides as introduced by Filippov. Sufficient conditions for the global synchronization of MNNs are established with a general adaptive controller. The update gain of the controller can be adjusted to control the synchronization speed. The obtained results complement and improve the previously known results. Finally, numerical simulations are carried out to demonstrate the effectiveness of the obtained results.