Memristor-Based Multilayer Neural Networks With
Online Gradient Descent Training
Daniel Soudry, Dotan Di Castro, Asaf Gal, Avinoam Kolodny, and Shahar Kvatinsky
Abstract—Learning in multilayer neural networks (MNNs)
relies on continuous updating of large matrices of synaptic
weights by local rules. Such locality can be exploited for massive
parallelism when implementing MNNs in hardware. However,
these update rules require a multiply and accumulate operation
for each synaptic weight, which is challenging to implement
compactly using CMOS. In this paper, a method for performing
these update operations simultaneously (incremental outer prod-
ucts) using memristor-based arrays is proposed. The method is
based on the fact that, approximately, given a voltage pulse, the
conductivity of a memristor will increment proportionally to the
pulse duration multiplied by the pulse magnitude if the increment
is sufficiently small. The proposed method uses a synaptic circuit
composed of a small number of components per synapse: one
memristor and two CMOS transistors. This circuit is expected
to consume between 2% and 8% of the area and static power
of previous CMOS-only hardware alternatives. Such a circuit
can compactly implement hardware MNNs trainable by scalable
algorithms based on online gradient descent (e.g., backpropaga-
tion). The utility and robustness of the proposed memristor-based
circuit are demonstrated on standard supervised learning tasks.
Index Terms—Backpropagation, hardware, memristive
systems, memristor, multilayer neural networks (MNNs),
stochastic gradient descent, synapse.
I. INTRODUCTION
MULTILAYER neural networks (MNNs) have been
recently incorporated into numerous commercial prod-
ucts and services such as mobile devices and cloud computing.
For realistic large scale learning tasks, MNNs can perform
impressively well and produce state-of-the-art results when
Manuscript received June 16, 2014; revised December 5, 2014; accepted
December 11, 2014. This work was supported in part by the Gruss Lipper
Charitable Foundation, in part by the Intel Collaborative Research Institute for
Computational Intelligence, in part by the Hasso Plattner Institute, Potsdam,
Germany, and in part by the Andrew and Erna Finci Viterbi Fellowship
Program.
D. Soudry is with the Center for Theoretical Neuroscience, and the
Grossman Center for the Statistics of Mind, Department of Statistics,
Columbia University, New York, NY 10027 USA (e-mail: daniel.soudry@
gmail.com).
D. Di Castro is with Yahoo! Labs, Haifa 31905, Israel (e-mail:
dotan.dicastro@gmail.com).
A. Gal is with the Department of Electrical Engineering, Biological
Networks Research Laboratories, Technion-Israel Institute of Technology,
Haifa 32000, Israel (e-mail: asafg1@gmail.com).
A. Kolodny is with the Department of Electrical Engineering,
Technion-Israel Institute of Technology, Haifa 32000, Israel (e-mail:
kolodny@ee.technion.ac.il).
S. Kvatinsky is with the Department of Computer Science, Stanford
University, Stanford, CA 94305 USA (e-mail: skva@tx.technion.ac.il).
This paper has supplemental material available online at
http://ieeexplore.ieee.org (File size: 9 MB).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNNLS.2014.2383395
massive computational power is available [1]–[3] (see [4] for
related press). However, such computational intensity limits
their usability due to area and power requirements. New dedicated hardware design approaches must therefore be developed to overcome these limitations. It was recently suggested
that such new types of specialized hardware are essential for
real progress toward building intelligent machines [5].
MNNs utilize large matrices of values termed synaptic
weights. These matrices are continuously updated during the
operation of the system, and are constantly being used to
interpret new data. The power of MNNs mainly stems from
the learning rules used for updating the weights. These rules
are usually local, in the sense that they depend only on
information available at the site of the synapse. A canonical
example for a local learning rule is the backpropagation
algorithm, an efficient implementation of online gradient
descent, which is commonly used to train MNNs [6]. The
locality of backpropagation stems from the chain rule used
to calculate the gradients. Similar locality appears in many
other learning rules used to train neural networks and various
machine learning (ML) algorithms.
Implementing ML algorithms such as backpropagation
on conventional general-purpose digital hardware
(i.e., von Neumann architecture) is highly inefficient.
A primary reason for this is the physical separation between
the memory arrays used to store the values of the synaptic
weights and the arithmetic module used to compute the update
rules. General-purpose architecture actually eliminates the
advantage of these learning rules—their locality. This locality
allows highly efficient parallel computation, as demonstrated
by biological brains.
To overcome the inefficiency of general-purpose hardware,
numerous dedicated hardware designs, based on
CMOS technology, have been proposed in the past two
decades [7, and references therein]. These designs perform
online learning tasks in MNNs using massively parallel
synaptic arrays, where each synapse stores a synaptic weight
and updates it locally. However, so far, these designs are not
commonly used for practical large-scale applications, and it is
not clear whether they could be scaled up, since each synapse
requires too much power and area (Section VII). This issue
of scalability possibly casts doubt on the entire field [8].
Recently, it has been suggested [9]–[10] that scalable
hardware implementations of neural networks may become
possible if a novel device, the memristor [11]–[14], is
used. A memristor is a resistor with a varying history-
dependent resistance. It is a passive analog device with
activation-dependent dynamics, which makes it ideal for
registering and updating of synaptic weights. Furthermore,
its relatively small size enables integration of memory
with the computing circuit [15] and allows a compact and
efficient architecture for learning algorithms, as well as other
neural-network-related applications ([16]–[18], and references
therein), which will not be discussed here.
Most previous implementations of learning rules using
memristor arrays have been limited to spiking neurons and foc-
used on spike-timing-dependent plasticity (STDP) [19]–[26].
Applications of STDP are usually aimed at explaining biological neuroscience results. At this point, however, it is not clear how useful STDP is algorithmically. For example, the
convergence of STDP-based learning is not guaranteed for
general inputs [27]. Other learning systems [28]–[30] that
rely on memristor arrays are limited to single-layer neural
networks (SNNs) with a few inputs per neuron. To the best of
the authors’ knowledge, the learning rules used in the above
memristor-based designs are not yet competitive and have not
been used for large-scale problems, which is precisely where
dedicated hardware is required.
A recent memristor array design [31] implemented the
scalable perceptron algorithm [32]. This algorithm can be
potentially used to train SNNs with binary outputs on very
large datasets. However, training general MNNs (which are
much more powerful than SNNs [33]) cannot be performed
using the perceptron algorithm. Scalable training of MNNs
requires the backpropagation algorithm. This is a special form of online gradient descent, which is generally very effective in large-scale problems [34]. Importantly, it can achieve state-of-the-art results on large datasets when executed with massive
computational power [3], [35].
To date, no circuit has been suggested to utilize memristors
for implementing such scalable online learning in MNNs.
Interestingly, it was recently shown [36], [37] that memristors
could be used to implement MNNs trained by backpropagation
in a chip-in-a-loop setting, where the weight update calculation
is performed using a host computer. However, it remained
an open question whether the learning itself, which is a
major computational bottleneck, could also be done online
(i.e., completely implemented using hardware) with efficient
massively parallel memristor arrays. The main challenge for
general synaptic array circuit design arises from the nature of
learning rules such as backpropagation: practically, all of them
contain a multiplicative term [38], which is hard to implement
in compact and scalable hardware.
In this paper, a novel and general scheme to design hardware
for online gradient descent learning rules is presented. The
proposed scheme uses a memristor as a memory element to
store the weight and temporal encoding as a mechanism to
perform a multiplication operation. The proposed design uses
a single memristor and two CMOS transistors per synapse and,
therefore, requires 2%–8% of the area and static power of the
previously proposed CMOS-only circuits.
Using this proposed scheme, for the first time, it is possible
to implement a memristor-based hardware MNN capable of
online learning (using the scalable backpropagation algorithm).
The functionality of such a hardware MNN circuit utilizing
the memristive synapse array is demonstrated numerically on
standard supervised learning tasks. On all datasets, the circuit
performs as well as the software algorithm. Introducing noise
levels of about 10% and parameter variability of about 30%
only mildly reduced the performance of the circuit, due to the
inherent robustness of online gradient descent (Fig. 5). The
proposed design may therefore allow the use of specialized
hardware for MNNs, as well as other ML algorithms, rather
than the currently used general-purpose architecture.
The remainder of this paper is organized as follows.
In Section II, a basic background on memristors and online
gradient descent learning is given. In Section III, the proposed
circuit is described for efficient implementation of SNNs in
hardware. In Section IV, a modification of the proposed circuit
is used to implement a general MNN. In Section V, the
sources of noise and variation in the circuit are estimated.
In Section VI, the circuit operation and learning capabilities
are evaluated numerically, demonstrating that it can be used
to implement MNNs trainable with online gradient descent.
In Section VII, the novelty of the proposed circuit is discussed,
as well as possible generalizations, and in Section VIII,
this paper is summarized. The supplementary material [39]
includes detailed circuit schematics, code, and an appendix
with additional technicalities.
II. PRELIMINARIES
For convenience, basic background information on
memristors and online gradient descent learning is given in this
section. For simplicity, the second part is focused on a simple
example of the adaline algorithm—a linear SNN trained using
mean square error (MSE).
A. Memristor
The memristor was originally proposed [11], [12] as
the missing fourth fundamental passive circuit element.
Memristors are basically resistors with varying resistance, where the resistance changes according to the time integral of the current through the device or, alternatively, the integrated voltage upon the device. In the classical representation, the conductance of a memristor G depends directly on the integral over time of the voltage upon the device, sometimes referred to as flux. Formally, a memristor obeys the following:

$$i(t) = G(s(t))\,v(t) \qquad (1)$$
$$\dot{s}(t) = v(t). \qquad (2)$$
A generalization of the memristor model (1)-(2), which is called a memristive system, was proposed in [40]. In memristive devices, s is a general state variable, rather than an integral of the voltage. Such memristive models, which are more commonly used to model actual physical devices [14], [41], [42], are discussed in [Appendix A, 39].
For the sake of generality and simplicity, in the following sections, it is assumed that the variations in the value of s(t) are restricted to be small, so that G(s(t)) can be linearized around some point s*, and the conductivity of the memristor is given, to first order, by

$$G(s(t)) \approx \bar{g} + \hat{g}\,s(t) \qquad (3)$$
where $\hat{g} = [dG(s)/ds]_{s=s^*}$ and $\bar{g} = G(s^*) - \hat{g}s^*$. Such a linearization is formally justified if sufficiently small inputs are used, so that s does not stray far from the fixed point (i.e., $2\hat{g}/[d^2G(s)/ds^2]_{s=s^*} \gg |s(t) - s^*|$, so second-order contributions are negligible). The operation in such a small-signal region is demonstrated numerically in [Appendix D, 39] for a family of physical memristive device technologies. The only (rather mild) assumption is that G(s) is differentiable near s*.
Note that despite this linearization, the memristor is still a nonlinear component, since [from (1) and (3)] $i(t) \approx \bar{g}v(t) + \hat{g}s(t)v(t)$. Importantly, this nonlinear product s(t)v(t) underlies the key role of the memristor in the proposed design, where an input signal v(t) is multiplied by an adjustable internal value s(t). Thus, the memristor enables an efficient implementation of trainable MNNs in hardware, as explained below.
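As a rough illustration of this small-signal behavior (a minimal sketch of the linearized model (1)-(3) only, not of any particular physical device), the following snippet integrates the state variable and evaluates the current; all parameter values are arbitrary placeholders chosen to keep s(t) small.

```python
import numpy as np

# Minimal sketch of the linearized memristor model (1)-(3).
# g_bar, g_hat, and the drive amplitude are arbitrary placeholder values.
g_bar, g_hat = 1e-4, 1e-3      # G(s) ~ g_bar + g_hat * s  (eq. 3)
dt, T = 1e-6, 1e-2             # time step and total simulated time [s]
t = np.arange(0.0, T, dt)
v = 1e-3 * np.sin(2 * np.pi * 1e3 * t)   # small input voltage, keeps |s| small

s = np.zeros_like(t)           # state variable, ds/dt = v  (eq. 2)
i = np.zeros_like(t)
for k in range(1, len(t)):
    s[k] = s[k - 1] + v[k - 1] * dt          # integrate eq. (2)
    i[k] = (g_bar + g_hat * s[k]) * v[k]     # current from eqs. (1) and (3)

# The term g_hat * s * v is the product of a stored state and the input,
# which is the multiplication the synapse design exploits.
print("max |s| =", np.abs(s).max(), " max |i| =", np.abs(i).max())
```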
B. Online Gradient Descent Learning
The field of ML is dedicated to the construction and study of systems that can learn from data. For example, consider the following supervised learning task. Assume a learning system that operates on K discrete presentations of inputs (trials), indexed by k = 1, 2, ..., K. For brevity, the indexing of the iteration number is sometimes suppressed, where it is clear from the context. On each trial k, the system receives empirical data, a pair of two real column vectors of sizes M and N: a pattern $x^{(k)} \in \mathbb{R}^M$ and a desired label $d^{(k)} \in \mathbb{R}^N$, with all pairs sharing the same desired relation $d^{(k)} = f(x^{(k)})$. Note that two distinct patterns can have the same label. The objective of the system is to estimate (learn) the function f(·) using the empirical data.
As a simple example, suppose W is a tunable N×M matrix of parameters, and consider the estimator

$$r^{(k)} = W^{(k)} x^{(k)} \qquad (4)$$

or

$$r^{(k)}_n = \sum_m W^{(k)}_{nm} x^{(k)}_m \qquad (5)$$

which is an SNN. The result of the estimator r = Wx should aim to predict the right desired labels d = f(x) for new, unseen patterns x. To solve this problem, W is tuned to minimize some measure of error between the estimated and desired labels, over a K_0-long subset of the empirical data, called the training set (for which k = 1, ..., K_0). For example, if we define the error vector

$$y^{(k)} \triangleq d^{(k)} - r^{(k)} \qquad (6)$$

then a common measure is the MSE

$$\mathrm{MSE} \triangleq \sum_{k=1}^{K_0} \| y^{(k)} \|^2. \qquad (7)$$

Other error measures can also be used. The performance of the resulting estimator is then tested over a different subset, called the test set (k = K_0 + 1, ..., K).
As explained in the introduction, a reasonable iterative algorithm for minimizing objective (7) (i.e., updating W, where initially W is arbitrarily chosen) is the following online gradient descent (also called stochastic gradient descent) iteration:

$$W^{(k+1)} = W^{(k)} - \frac{1}{2}\eta \nabla_{W^{(k)}} \| y^{(k)} \|^2 \qquad (8)$$

where the 1/2 coefficient is written for mathematical convenience, η is the learning rate, a (usually small) positive constant, and at each iteration k, a single empirical sample $x^{(k)}$ is chosen randomly and presented at the input of the system. Using the chain rule with (4) and (6), we have $\nabla_{W^{(k)}} \| y^{(k)} \|^2 = -2 (d^{(k)} - W^{(k)} x^{(k)}) (x^{(k)})^{\top}$. Therefore, defining $\Delta W^{(k)} \triangleq W^{(k+1)} - W^{(k)}$ and $(\cdot)^{\top}$ to be the transpose operation, we obtain the outer product

$$\Delta W^{(k)} = \eta\, y^{(k)} (x^{(k)})^{\top} \qquad (9)$$

or

$$W^{(k+1)}_{nm} = W^{(k)}_{nm} + \eta\, x^{(k)}_m y^{(k)}_n. \qquad (10)$$
Specifically, this update rule is called the adaline
algorithm [43], used in adaptive signal processing and
control [44]. The parameters of more complicated estimators
can also be similarly tuned (trained), using online gradient
descent or similar methods. Specifically, MNNs (Section IV) are commonly trained using backpropagation, which is an efficient form of online gradient descent [6]. Importantly, note that the update rule in (10) is local, i.e., the change in the synaptic weight $W^{(k)}_{nm}$ depends only on the related components of the input ($x^{(k)}_m$) and error ($y^{(k)}_n$). This local
update, which ubiquitously appears in neural network training
(e.g., backpropagation and the perceptron learning rule [32])
and other ML algorithms [45]–[47], enables a massively
parallel hardware design, as explained in Section III.
Such massively parallel designs are needed, since for large N and M, learning systems usually become computationally prohibitive in both time and memory space. For example, in the simple adaline algorithm, the main computational burden in each iteration comes from (4) and (9), where the number of operations (addition and multiplication) is of order O(M·N). Commonly, these steps have become the main computational bottleneck in executing MNNs (and related ML algorithms) in software. Other algorithmic steps, such as (6) here, include either O(M) or O(N) operations and, therefore, have a negligible computational complexity, no more than O(M+N).
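To make the adaline iteration (4), (6), and (9) concrete, a minimal software sketch is given below; the sizes, learning rate, and synthetic data are illustrative assumptions, not values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K0, eta = 8, 3, 2000, 0.01         # illustrative sizes and learning rate

W_true = rng.standard_normal((N, M))      # unknown target mapping f(x) = W_true x
W = np.zeros((N, M))                      # tunable parameter matrix, eq. (4)

for k in range(K0):
    x = rng.uniform(-1.0, 1.0, size=M)    # pattern x^(k)
    d = W_true @ x                        # desired label d^(k) = f(x^(k))
    r = W @ x                             # estimator output, eq. (4)
    y = d - r                             # error vector, eq. (6)
    W += eta * np.outer(y, x)             # local outer-product update, eq. (9)/(10)

print("parameter mean squared error:", np.mean((W - W_true) ** 2))
```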
III. CIRCUIT DESIGN
Next, dedicated analog hardware for implementing online
gradient descent learning is described. For simplicity, this
section is focused on simple linear SNNs trained using adaline,
as described in Section II. Later, in Section IV, the circuit
is modified to implement general MNNs, trained using back-
propagation. The derivations are rigorously done using a single
controlled approximation (14) and no heuristics (i.e., unproven
methods which can damage performance), thus creating a
precisely defined mapping between a mathematical learning
system and a hardware learning system. Readers who do not wish to go through the detailed derivations can get the general idea from the following overview (Section III-A) together with Figs. 1–4.

Fig. 1. Simple adaline learning task, with the proposed synaptic grid circuit executing (4) and (9), which are the main computational bottlenecks in the algorithm.

Fig. 2. Synaptic grid (N×M) circuit architecture scheme. Every (n,m) node in the grid is a memristor-based synapse that receives voltage input from the shared u_m, ū_m, and e_n lines and outputs a current I_nm on the o_n lines. These output lines receive the total current arriving from all the synapses on the nth row and are grounded.
A. Circuit Overview
To implement these learning systems, a grid of artificial
synapses is constructed, where each synapse stores a single
synaptic weight W_nm. The grid is a large N×M array of
synapses, where the synapses operate simultaneously, each
performing a simple local operation. This synaptic grid circuit
carries the main computational load in ML algorithms by
implementing the two computational bottlenecks, (5) and (10),
in a massively parallel way. The matrix×vector product in (5)
is done using a resistive grid (of memristors), implementing
multiplication through Ohm’s law and addition through current
summation. The vector×vector outer product in (10) is done
using the fact that given a voltage pulse, the conductivity
of a memristor will increment proportionally to the pulse
duration multiplied by the pulse magnitude. Using this method,
multiplication requires only two transistors per synapse. Thus,
together with auxiliary circuits that handle a negligible amount
of O(M+N) additional operations, these arrays can be used
to construct efficient learning systems. These systems perform massive parallelization of the bottleneck O(N·M) operations (4) and (9) over many computational units—the synapses.

Fig. 3. Memristor-based synapse. (a) Schematic of a single memristive synapse (without the n and m indices). The synapse receives input voltages u and ū = −u, an enable signal e, and outputs a current I. (b) Read and write protocols—incoming signals in a single synapse and the increments in the synaptic weight s, as determined by (27). T = T_wr + T_rd.
Similarly to the adaline algorithm described in Section II,
the circuit operates on discrete presentations of inputs (trials)
(Fig. 1). On each trial k, the circuit receives an input vector $x^{(k)} \in [-A, A]^M$ and an error vector $y^{(k)} \in [-A, A]^N$ (where A is a bound on both the input and error) and produces a result output vector $r^{(k)} \in \mathbb{R}^N$, which depends on the input by (4), where the matrix $W^{(k)} \in \mathbb{R}^{N \times M}$, called the synaptic weight matrix, is stored in the system. In addition, on each step, the circuit updates $W^{(k)}$ according to (9). This circuit can be used to implement ML algorithms. For example, as shown in Fig. 1, the simple adaline algorithm can be implemented using the circuit, with training enabled on k = 1, ..., K_0. The
implementation of the backpropagation algorithm is shown
in Fig. 4 and explained in Section IV.
B. Circuit Architecture
1) Synaptic Grid: The synaptic grid system described
in Fig. 1 is implemented by the circuit shown in Fig. 2,
where the components of all vectors are shown as individual
signals. Each gray cell in Fig. 2 is a synaptic circuit (artificial
synapse) using a memristor [described in Fig. 3(a)]. The
synapses are arranged in a 2-D N×M grid array as shown in Fig. 2, where each synapse is indexed by (n,m), with m ∈ {1,...,M} and n ∈ {1,...,N}. Each (n,m) synapse receives two inputs u_m and ū_m, an enable signal e_n, and produces an output current I_nm. Each column of synapses in the array (the mth column) shares two vertical input lines u_m and ū_m, both connected to a column input interface. The voltage signals u_m and ū_m (for all m) are generated by the column input interfaces from the components of the input signal x, upon presentation.
Fig. 4. (a) Structure of an MNN with the neurons in each layer (circles) and the synaptic connectivity weights (arrows). (b) Implementation of an MNN with online backpropagation training with the MSE measure (7), using a slightly modified version of the original circuit (Fig. 2). Each circuit performs (23) and (28) (as in Fig. 1), together with the additional operation (31). The function boxes denote the operation of either σ(·) (the neuronal activation function) or σ′(·) (its derivative) on the input, and the × box denotes a component-wise product. The "don't care" symbol denotes an unused output. For detailed circuit schematics, see [39].
Each row of synapses in the array (the nth row) shares the horizontal enable line e_n and output line o_n, where e_n is connected to a row input interface and o_n is connected to a row output interface. The voltage (pulse) signal on the enable line e_n (for all n) is generated by the row input interfaces from the error signal y, upon presentation. The row output interfaces keep the o_n lines grounded (at zero voltage) and convert the total current from all the synapses in the row going to the ground, Σ_m I_nm, to the output signal r_n.
2) Artificial Synapse: The proposed memristive synapse
is composed of a single memristor, connected to a shared
terminal of two MOSFET transistors (p-type and n-type), as
shown schematically in Fig. 3(a) (without the n, m indices). These terminals act as drain or source, interchangeably, depending on the input, similarly to the CMOS transistors in transmission gates. Recall that the memristor dynamics are given by (1)-(3), with s(t) being the state variable of the memristor and $G(s(t)) \approx \bar{g} + \hat{g}s(t)$ its conductivity.
In addition, the current of the n-type transistor in the linear
region is, ideally
$$I = K\left[(V_{GS} - V_T)V_{DS} - \frac{1}{2}V_{DS}^2\right] \qquad (11)$$

where $V_{GS}$ is the gate-source voltage, $V_T$ is the threshold voltage, $V_{DS}$ is the drain-source voltage, and K is the conduction parameter of the transistors. When $V_{GS} < V_T$, the current is cut off (I = 0). Similarly, the current of the p-type transistor in the linear region is

$$I = -K\left[(V_{GS} + V_T)V_{DS} - \frac{1}{2}V_{DS}^2\right] \qquad (12)$$

where, for simplicity, we assumed that the parameters K and $V_T$ are equal for both transistors. Note that for notational simplicity, the parameter $V_T$ in (12) has a different sign than in the usual definition. When $V_{GS} > -V_T$, the current is cut off (I = 0).
The synapse receives three voltage input signals: u and ū = −u are connected, respectively, to a terminal of the n-type and p-type transistors, and an enable signal e is connected to the gate of both transistors. The enable signal can have a value of 0, V_DD, or −V_DD (with V_DD > V_T) and has a pulse shape of varying duration, as explained below. The output of the synapse is a current I to the grounded line o. The magnitude of the input signal u(t) and the circuit parameters are set so they fulfill the following.
1) We assume (somewhat unusually) that

$$-V_T < u(t) < V_T. \qquad (13)$$

2) We assume that [recall (1)]

$$K(V_{DD} - 2V_T) \gg G(s(t)). \qquad (14)$$
From the first assumption (13), the following conditions hold.
1) If e=0 (i.e., the gate is grounded), both transistors are
nonconducting (in the cutoff region). In this case, I=0
in the output, the voltage across the memristor is zero,
and the state variable does not change.
2) If e=VDD, the n-type transistor is conducting
in the linear region while the p-type transistor is
nonconducting.
3) If e=−VDD, the p-type transistor is conducting
in the linear region while the n-type transistor is
nonconducting.
To satisfy (13), a proper value of u(t) is chosen; see (15). From the second assumption (14), if e = ±V_DD, then, when in the linear region, both transistors have relatively high conductivity compared with the conductivity of the memristor. Therefore, in that case, the voltage on the memristor is approximately ±u. Note that (14) is a reasonable assumption, as shown in [48]. If it does not hold (e.g., if the memristor conductivity is very high), one can instead use an alternative design, as
described in [Appendix B, 39].
C. Circuit Operation
The operation of the circuit in each trial (a single presen-
tation of a specific input) is composed of two phases. First,
in the computing phase (read), the output current from all the
synapses is summed and adjusted to produce an arithmetic
operation r = Wx from (4). Second, in the updating phase (write), the synaptic weights are incremented according to the update rule $\Delta W = \eta y x^{\top}$ from (9). In the proposed design, for each synapse, the synaptic weight W_nm is stored using s_nm, the memristor state variable of the (n,m) synapse. The parallel read and write operations are achieved by applying simultaneous voltage signals on the inputs u_m and enable signals e_n (for all n, m). The signals and their effect on the state
variable are shown in Fig. 3(b).
1) Computation Phase (Read): During each read phase, a vector x is given and encoded in u and ū component-wise by the column input interfaces for a duration of T_rd, for all m: u_m(t) = a x_m = −ū_m(t), where a is a positive constant converting x_m, a unitless number, to voltage. Recall that A is the maximal value of |x_m|, so

$$aA < V_T \qquad (15)$$

as required in (13). In addition, the row input interfaces produce a voltage signal on the e_n lines, for all n

$$e_n(t) = \begin{cases} V_{DD}, & \text{if } 0 \le t < 0.5\,T_{rd} \\ -V_{DD}, & \text{if } 0.5\,T_{rd} \le t \le T_{rd}. \end{cases} \qquad (16)$$

From (2), the total change in the internal state variable is, therefore, for all n, m

$$\Delta s_{nm} = \int_0^{0.5 T_{rd}} (a x_m)\, dt + \int_{0.5 T_{rd}}^{T_{rd}} (-a x_m)\, dt = 0. \qquad (17)$$
The zero net change in the value of s_nm between the times 0 and T_rd implements a nondestructive read, as is common in many memory devices. To minimize inaccuracies, the row output interface samples the output current at time 0+ (immediately after time zero). This is done before the conductance of the memristor has changed significantly from its value before the read phase. Using (3), the output current of the synapse to the o_n line at that time is thus

$$I_{nm} = a(\bar{g} + \hat{g} s_{nm}) x_m. \qquad (18)$$

Therefore, the total current in each output line o_n is equal to the sum of the individual currents produced by the synapses driving that line

$$o_n = \sum_m I_{nm} = a \sum_m (\bar{g} + \hat{g} s_{nm}) x_m. \qquad (19)$$

The row output interface measures the output current o_n and outputs

$$r_n = c(o_n - o_{ref}) \qquad (20)$$

where c is a constant converting the current units of o_n to a unitless number r_n and

$$o_{ref} = a \bar{g} \sum_m x_m. \qquad (21)$$

Defining

$$W_{nm} = a c \hat{g} s_{nm} \qquad (22)$$

we obtain

$$r = W x \qquad (23)$$

as desired.
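As a numerical sanity check of the read-phase relations (18)-(23), the following sketch (with illustrative, non-physical constants) verifies that the reference subtraction in (20)-(21) leaves exactly r = Wx with W = ac ĝ s.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 3, 4
a, c, g_bar, g_hat = 0.1, 1e4, 1e-4, 1e-3   # illustrative constants, not from Table I

s = rng.uniform(0.0, 1.0, size=(N, M))       # memristor state variables s_nm
x = rng.uniform(-1.0, 1.0, size=M)           # input vector

I = a * (g_bar + g_hat * s) * x              # per-synapse read currents, eq. (18)
o = I.sum(axis=1)                            # summed line currents, eq. (19)
o_ref = a * g_bar * x.sum()                  # reference current, eq. (21)
r = c * (o - o_ref)                          # row output interface, eq. (20)

W = a * c * g_hat * s                        # synaptic weight definition, eq. (22)
assert np.allclose(r, W @ x)                 # eq. (23): r = W x
print(r)
```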
2) Update Phase (Write): During each write phase, of duration T_wr, u and ū maintain their values from the read phase, while the signal e changes. In this phase, the row input interfaces encode e component-wise, for all n

$$e_n(t) = \begin{cases} \operatorname{sign}(y_n)\, V_{DD}, & \text{if } 0 \le t - T_{rd} \le b|y_n| \\ 0, & \text{if } b|y_n| < t - T_{rd} < T_{wr}. \end{cases} \qquad (24)$$

The interpretation of (24) is that e_n is a pulse with magnitude V_DD, the same sign as y_n, and a duration b|y_n| (where b is a constant converting y_n, a unitless number, to time units). Recall that A is the maximal value of |y_n|, so we require that

$$T_{wr} > bA. \qquad (25)$$

The total change in the internal state is therefore

$$\Delta s_{nm} = \int_{T_{rd}}^{T_{rd} + b|y_n|} \big(a\, \operatorname{sign}(y_n)\, x_m\big)\, dt \qquad (26)$$
$$= a b\, x_m y_n. \qquad (27)$$

Using (22), the desired update rule for the synaptic weights is, therefore, obtained as

$$\Delta W = \eta\, y x^{\top} \qquad (28)$$

where $\eta = a^2 b c \hat{g}$.
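A minimal sketch of the write phase's time×voltage encoding is given below (again with illustrative constants); it numerically integrates the memristor voltage over the pulse defined in (24) and checks that the state increment matches ab·x_m·y_n from (27).

```python
import numpy as np

a, b, dt = 0.1, 1e-4, 1e-8          # illustrative constants and integration step
x_m, y_n = 0.7, -0.3                # one input component and one error component
T_wr = 2 * b                        # write window, satisfies T_wr > b*A (eq. 25)

# During the write window, the memristor sees sign(y_n)*a*x_m for b*|y_n| seconds
# and 0 V afterwards (eq. 24); its state integrates this voltage (eq. 2).
t = np.arange(0.0, T_wr, dt)
v_mem = np.where(t <= b * abs(y_n), np.sign(y_n) * a * x_m, 0.0)
delta_s = np.trapz(v_mem, dx=dt)    # integral of the voltage, eq. (26)

print(delta_s, a * b * x_m * y_n)   # both should be ~ a*b*x_m*y_n, eq. (27)
```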
IV. MULTILAYER NEURAL NETWORK CIRCUIT
So far, the circuit operation was exemplified on an SNN with
the simple adaline algorithm. In this section, it is explained
how, with a few adjustments, the proposed circuit can be used
to implement backpropagation on a general MNN.
Recall the context of the supervised learning setup detailed in Section II, where $x \in \mathbb{R}^M$ is a given pattern and $d \in \mathbb{R}^N$ is a desired label. Consider a double-layer MNN estimator of d, of the form

$$r = W_2\, \sigma(W_1 x)$$

with $W_1 \in \mathbb{R}^{H \times M}$ (H is some integer) and $W_2 \in \mathbb{R}^{N \times H}$ being the two parameter matrices, and where σ is some nonlinear (usually sigmoid) function operating component-wise [i.e., $(\sigma(x))_i = \sigma(x_i)$]. Such a double-layer MNN, with H hidden neurons, can approximate any target function with arbitrary precision [33]. Denote by

$$r_1 = W_1 x \in \mathbb{R}^H; \quad r_2 = r = W_2\, \sigma(r_1) \in \mathbb{R}^N \qquad (29)$$
the output of each layer. Suppose we again use MSE to quantify the error between d and r. In the backpropagation algorithm, each update of the parameter matrices is given by an online gradient descent step, which can be directly derived from [6, eq. (8)] [note the similarity with (9)]

$$\Delta W_1 = \eta\, y_1 x_1^{\top}; \quad \Delta W_2 = \eta\, y_2 x_2^{\top} \qquad (30)$$

with $x_2 \triangleq \sigma(r_1)$, $y_2 \triangleq d - r_2$, $x_1 \triangleq x$, and $y_1 \triangleq (W_2^{\top} y_2) \times \sigma'(r_1)$, where here $(a \times b)_i \triangleq a_i b_i$, a component-wise product, and $(\sigma'(x))_i = d\sigma(x_i)/dx_i$. Implementing such an algorithm requires a minor modification of the proposed circuit, that is, it should have an additional inverted output $W^{\top} y$. Once this modification is made to the circuit, by cascading such circuits, it is straightforward to implement the backpropagation algorithm for two layers or more, as shown in Fig. 4. For
each layer, a synaptic grid circuit stores the corresponding
weight matrix, performs a matrix-vector product as in (29),
and updates the weight matrix as in (30). This last operation
requires the additional output
$$\delta \triangleq W^{\top} y \qquad (31)$$

in each synaptic grid circuit (except the first). We assume that the synaptic circuit has dimensions M×N (with a slight abuse of notation, since M and N should be different for each layer). The additional output (31) can be generated by the circuit in an additional read phase of duration T_rd, between the original read and write phases, in which the original roles of the input and output lines are inverted. In this phase, the n-type MOS (nMOS) transistor is ON, i.e., for all n, e_n = V_DD, and the former output o_n lines are given the following voltage signal (again, used for a nondestructive read):

$$o_n(t) = \begin{cases} a y_n, & \text{if } T_{rd} \le t < 1.5\,T_{rd} \\ -a y_n, & \text{if } 1.5\,T_{rd} \le t \le 2\,T_{rd}. \end{cases} \qquad (32)$$

The current I_nm now flows to the (original input) u_m terminal, which is grounded and shared for all n. The sum of all the currents is measured at time $T_{rd}^{+}$ by the column interface (before it goes into ground)

$$u_m = \sum_n I_{nm} = a \sum_n (\bar{g} + \hat{g} s_{nm}) y_n. \qquad (33)$$

The total current on u_m at time T_rd is the output

$$\delta_m = c(u_m - u_{ref}) \qquad (34)$$

where $u_{ref} = a \bar{g} \sum_n y_n$. Thus, from (22)

$$\delta_m = \sum_n W_{nm} y_n \qquad (35)$$

as required.
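For reference, a minimal software sketch of the two-layer updates (29)-(30), which the cascaded circuits of Fig. 4 are meant to realize, is given below; the layer sizes, learning rate, activation choice, and target mapping are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
M, H, N, eta = 4, 8, 3, 0.05                  # illustrative layer sizes and learning rate

sigma = np.tanh                               # an example sigmoid-like activation
dsigma = lambda z: 1.0 - np.tanh(z) ** 2      # its derivative, sigma'(z)

W1 = 0.1 * rng.standard_normal((H, M))
W2 = 0.1 * rng.standard_normal((N, H))

for k in range(5000):
    x = rng.uniform(-1.0, 1.0, size=M)
    d = np.sin(np.sum(x)) * np.ones(N)        # some arbitrary target mapping

    r1 = W1 @ x                               # first-layer pre-activation, eq. (29)
    r2 = W2 @ sigma(r1)                       # network output, eq. (29)

    x1, x2 = x, sigma(r1)                     # layer inputs
    y2 = d - r2                               # output-layer error
    y1 = (W2.T @ y2) * dsigma(r1)             # back-propagated error, uses delta = W^T y (eq. 31)

    W1 += eta * np.outer(y1, x1)              # eq. (30), realized by one synaptic grid
    W2 += eta * np.outer(y2, x2)              # eq. (30), realized by the second grid

print("final squared error:", float(y2 @ y2))
```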
Finally, note that it is straightforward to use a different error
function instead of the MSE. For example, in an MNN, recall that in the last layer

$$y = d - r = -\frac{1}{2} \nabla_r \|d - r\|^2.$$

If a different error function E(r,d) is required, one should simply replace this with

$$y \triangleq -\alpha \nabla_r E(r,d) \qquad (36)$$

where α is some constant that can be used to tune the learning rate. For example, one could instead use a cross entropy error function

$$E(r,d) = -\sum_i d_i \ln r_i \qquad (37)$$

which is more effective for classification tasks, in which d is
a binary vector [49]. To implement this change in the MNN
circuit (Fig. 4), the subtractor should be simply replaced with
some other module.
V. SOURCES OF NOISE AND VARIABILITY
Usually, analog computation suffers from reduced robust-
ness to noise as compared with digital computation [50].
ML algorithms are, however, inherently robust to noise, which
is a key element in the set of problems they are designed
to solve. For example, gradient descent is quite robust to
perturbations, as intuitively demonstrated in Fig. 5. This
suggests that the effects of intrinsic noise on the perfor-
mance of the analog circuit are relatively small. These effects
largely depend on the specific circuit implementation (e.g., the
CMOS process). Particularly, memristor technology is not
mature yet and memristors have not been fully characterized.
Therefore, to check the robustness of the circuit, a crude estimation of the magnitude of noise and variability is used in this section. This estimation is based on known sources of noise and variability, which are less dependent on the specific implementation. In Section VI, this alleged robustness of the circuit is evaluated by simulating the circuit in the presence of these noise and variation sources.
Fig. 5. Robustness of gradient descent. Consider a simple 2×1 SNN (inset) with a single input x and desired output d (repeatedly presented). As demonstrated schematically in this figure, training the SNN using gradient descent on the MSE $E(w_1, w_2) = (w_1 x_1 + w_2 x_2 - d)^2$ will tend to decrease this error until it converges to some fixed point (generally, a local minimum of the MSE). Schematically, the gradient direction (red arrows) can be arbitrarily varied within a relatively wide range (green triangles) on each iteration, and this will not prevent the convergence.
A. Noise
When one of the transistors is enabled (e(t) = ±V_DD), the current is affected by intrinsic thermal noise sources in the transistors and memristor of each synapse. This noise affects the operation of the circuit during the write and read phases. Current fluctuations on a device due to thermal origin can be approximated by a white noise signal I(t) with zero mean ($\langle I(t)\rangle = 0$) and autocorrelation $\langle I(t) I(t')\rangle = \sigma^2 \delta(t - t')$, where δ(·) is Dirac's delta function and $\sigma^2 = 2 k \tilde{T} g$ (where k is Boltzmann's constant, $\tilde{T}$ is the temperature, and g is the conductance of the device). For 65 nm transistors (parameters taken from IBM's 10LPe/10RFe process [51]), the characteristic conductivity is $g_1 \approx 10^{-4}\ \Omega^{-1}$. Therefore, for $I_1(t)$, the thermal current source of the transistors at room temperature, we have $\sigma_1^2 \approx 10^{-24}\ \mathrm{A}^2\,\mathrm{s}$. Assume that the memristor characteristic conductivity is $g_2 = \epsilon g_1$, so for the thermal current source of the memristor, $\sigma_2^2 = \epsilon \sigma_1^2$. Note that from (14), we have $\epsilon \ll 1$, and the resistance of the transistor is much smaller than that of the memristor. The total voltage on the memristor is thus

$$V_M(t) = \frac{g_1}{g_1 + g_2} u(t) + (g_1 + g_2)^{-1} (I_1(t) - I_2(t)) = \frac{1}{1 + \epsilon} u(t) + \xi(t)$$

where $\xi(t) = \big(g_1^{-1}/(1+\epsilon)\big)(I_1(t) - I_2(t))$. Since different thermal noise sources are uncorrelated, we have $\langle I_1(t) I_2(t')\rangle = 0$, and so $\langle \xi(t) \xi(t')\rangle = \sigma_\xi^2 \delta(t - t')$ with

$$\sigma_\xi^2 = \frac{1}{g_1^2 (1+\epsilon)^2}\left(\langle I_1^2(t)\rangle + \langle I_2^2(t)\rangle\right) = \frac{\sigma_1^2 + \sigma_2^2}{g_1^2 (1+\epsilon)^2} \qquad (38)$$
$$\approx g_1^{-2} \sigma_1^2 = 2 k \tilde{T}\, g_1^{-1} \approx 10^{-16}\ \mathrm{V}^2\,\mathrm{s}.$$
Fig. 6. Noise model for an artificial synapse. (a) During the operation, only one transistor is conducting (assume it is the n-type transistor). (b) Thermal noise in a small-signal model: the transistor is converted to a resistor (g_1) in parallel with a current source (I_1), the memristor is converted to a resistor (g_2) in parallel with a current source (I_2), and the effects of the sources are summed linearly.
The equivalent circuit, including the sources of noise, is shown in Fig. 6. Assuming the circuit minimal trial duration is T = 10 ns, the root MSE due to thermal noise is bounded above by

$$E_T \triangleq \left\langle \left(\frac{1}{T}\int_0^T \xi(t)\, dt\right)^{2} \right\rangle^{1/2} = T^{-1/2} \sigma_\xi \approx 10^{-4}\ \mathrm{V}.$$

Noise in the inputs u, ū, and e also exists. According to [52], the relative noise in the power supply of the u/ū inputs is approximately 10% in the worst case. Applying u = ax effectively gives an input of $ax + E_T + E_u$, where $|E_u| \le 0.1 a |x|$. The absolute noise level in the duration of e should be smaller than $T_{clk}^{min} \approx 2 \cdot 10^{-10}$ s, assuming a digital implementation of pulsewidth modulation with $T_{clk}^{min}$ being the shortest clock cycle currently available. On every write cycle, $e = \pm V_{DD}$ is, therefore, applied for a duration of $b|y| + E_e$ (instead of b|y|), where $|E_e| < T_{clk}^{min}$.
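The thermal noise figures quoted above follow from a few lines of arithmetic; the sketch below simply evaluates σ_1² = 2kT̃g_1, σ_ξ² ≈ σ_1²/g_1², and E_T = T^(-1/2)σ_ξ for the stated g_1 = 10^-4 Ω^-1 and T = 10 ns (room temperature assumed to be 300 K).

```python
import math

k = 1.380649e-23        # Boltzmann's constant [J/K]
T_room = 300.0          # assumed room temperature [K]
g1 = 1e-4               # transistor characteristic conductance [1/Ohm]
T_trial = 10e-9         # minimal trial duration [s]

sigma1_sq = 2 * k * T_room * g1          # thermal current autocorrelation, ~1e-24 A^2 s
sigma_xi_sq = sigma1_sq / g1**2          # voltage noise on the memristor, eq. (38), ~1e-16 V^2 s
E_T = math.sqrt(sigma_xi_sq / T_trial)   # averaged over one trial, ~1e-4 V

print(sigma1_sq, sigma_xi_sq, E_T)
```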
B. Parameter Variability
A common estimation of the variability in memristor parameters is a coefficient of variation (CV = standard deviation/mean) of a few percent [53]. In this paper, the circuit is also evaluated with considerably larger variability (CV ≈ 30%), in addition to the noise sources described in Section V-A. The variability in the parameters of the memristors is modeled by sampling each memristor conductance parameter ĝ independently from a uniform distribution between 0.5 and 1.5 times the original (nonrandom) value of ĝ. When running the algorithm in software, these variations are equivalent to corresponding changes in the synaptic weights W or the learning rate η in the algorithm. Note that the variability in the transistor parameters is not considered, since it can affect the circuit operation only if (13) or (14) is invalidated. This can happen, however, only if the values of K or V_T vary by orders of magnitude, which is unlikely.
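A minimal sketch of this variability model is shown below (the nominal ĝ is an arbitrary placeholder); note that a uniform distribution over [0.5, 1.5] times the nominal value indeed has a coefficient of variation of about 1/√12 ≈ 29%, consistent with the roughly 30% figure above.

```python
import numpy as np

rng = np.random.default_rng(3)
g_hat_nominal = 1e-3                     # illustrative nominal value of g_hat
N, M = 10, 10

# Each memristor's g_hat is drawn uniformly in [0.5, 1.5] x nominal value.
g_hat = g_hat_nominal * rng.uniform(0.5, 1.5, size=(N, M))

cv = g_hat.std() / g_hat.mean()          # coefficient of variation
print(cv)                                # ~1/sqrt(12) ~ 0.29, roughly the 30% quoted above
```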
VI. CIRCUIT EVALUATION
In this section, the proposed synaptic grid circuit is
implemented in a simulated physical model (Section VI-A).
Fig. 7. Synaptic 2×2 grid circuit simulation, during ten operation cycles. Top: circuit inputs (x_1, x_2) and result outputs (r_1, r_2). Middle and bottom: voltage (solid black) and conductance (dashed red) change for each memristor in the grid. Note that the conductance of the (n,m) memristor indeed changes proportionally to x_m y_n, following (27). Simulation was done using SimElectronics, with the inputs as in (39) and (40) and the circuit parameters as in Table I. Similar results were obtained using SPICE with a linear ion drift memristor model and a CMOS 0.18 μm process, as shown in [Appendix D, 39].
First, the basic functionality of a toy example, a 2×2 circuit, is demonstrated (Section VI-B). Then (Section VI-C), using standard supervised learning datasets, the implementation of SNNs and MNNs using this synaptic grid circuit is demonstrated, including its robustness to noise and variation.
A. Software Implementation
Recall that the synaptic grid circuit (implementing the boxes in Figs. 1 and 4) operates in discrete time trials, and at each trial receives two vector inputs x and y, updates an internally stored synaptic matrix W (according to $\Delta W = \eta y x^{\top}$), and outputs the vector r = Wx (optionally, it also outputs the vector $\delta = W^{\top} y$).
The physical model of the circuit is implemented both in
SPICE and SimElectronics [54]. Both are software tools that
enable physical circuit simulation of memristors and MOSFET
devices. The SPICE model is described in [Appendix D, 39].
Next, the SimElectronics synaptic grid circuit model is
described. Exactly the same model (at different grid sizes) is used for all numerical evaluations in Sections VI-B and VI-C. The circuit parameters appear in Table I, with the parameters of the (ideal) transistors kept at their defaults. Note that V_T of the pMOS is defined here with an opposite sign to the usual definition. The memristor model is implemented using (1)-(3) with parameters taken from the experimental data [55,
Fig. 2]. As shown schematically in Fig. 2, the circuit imple-
mentation consists of a synapse-grid and the interface blocks.
The interface units were implemented using a few standard
CMOS and logic components, as can be seen in the detailed
schematics (available in [39] as an HTML file, which may
be opened using Firefox). The circuit operated synchronously
using global control signals supplied externally. The circuit inputs x and y and output r were kept constant in each trial using sample-and-hold units.

TABLE I
CIRCUIT PARAMETERS
B. Basic Functionality
First, the basic operation of the proposed synaptic grid
circuit is examined numerically in a toy example.
A small 2×2 synaptic grid circuit is simulated for a time 10T (10 read-write cycles) with a simple piecewise constant input

$$x = (x_1, x_2) = (0.8, 0.4) \cdot \operatorname{sign}(t - 5T) \qquad (39)$$

and a constant input

$$y = (y_1, y_2) = (0.2, 0.1). \qquad (40)$$
In Fig. 7, the resulting circuit outputs are shown, together with
the memristors’ voltages and conductances. The correct basic
operation of the circuit is verified.
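As a rough cross-check (the specific observations from Fig. 7 are enumerated after this sketch), the following behavioral model abstracts away the device physics and applies only the idealized relations (22), (23), and (27) to the same inputs; the constants a, b, c, and ĝ are arbitrary illustrative values, not those of Table I.

```python
import numpy as np

a, b, c, g_hat = 0.1, 1e-4, 1e4, 1e-3      # illustrative constants, not Table I values
T, cycles = 1.0, 10                         # abstract trial duration and cycle count

s = np.zeros((2, 2))                        # memristor state variables s_nm
y = np.array([0.2, 0.1])                    # constant error input, eq. (40)

for k in range(cycles):
    t = (k + 0.5) * T                       # sample the input mid-cycle
    x = np.array([0.8, 0.4]) * np.sign(t - 5 * T)   # piecewise constant input, eq. (39)
    W = a * c * g_hat * s                   # weights stored in the state, eq. (22)
    r = W @ x                               # read-phase output, eq. (23)
    s += a * b * np.outer(y, x)             # write-phase increment, eq. (27)
    print(k, r)
```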
1) In the first read cycle, the memristors are used to generate the output (23). The voltage trace on the (n,m) memristor is a ±a x_m bipolar pulse, as expected from (16). This results in a nondestructive read (a zero net change in conductance), as expected from (17).
2) In the second read cycle, the memristors are used to generate the output (35). The voltage trace on the (n,m) memristor is a ±a y_n bipolar pulse [as expected from (32)], again resulting in a nondestructive read.
3) In the write cycle, the stored weights are incremented according to (28). As expected from (24), the (n,m) memristor is subjected to a voltage pulse of amplitude sign(y_n) a x_m with duration b|y_n|. Furthermore, there is a ĝ a b x_m y_n increment in memristor conductance, following (27).
4) The output of the circuit is $r_n = \sum_m a c \hat{g} s_{nm} x_m$, following (5) and (18)-(23).

TABLE II
LEARNING PARAMETERS FOR EACH DATASET
C. Learning Performance
The synaptic grid circuit model is used to implement an SNN and an MNN, trainable by the online gradient descent algorithm.
To demonstrate that the algorithm is indeed implemented
correctly by the proposed circuit, the circuit performance has
been compared to an algorithmic implementation of the MNNs
in (MATLAB) software. Two standard tasks are used:
1) the Wisconsin Breast Cancer diagnosis task [56]
(linearly separable);
2) the Iris classification task [56] (not linearly separable).
The first task was evaluated using an SNN circuit, similarly
to Fig. 1. Note this task has only a single output (yes/no),
so the synaptic grid (Fig. 2) has only a single row. The
second task is evaluated on a two-layer MNN circuit, similarly
to Fig. 4. The learning parameters are described in Table II.
Fig. 8 shows the training error (performance during training
phase) of the following:
1) the algorithm;
2) the proposed circuit (implementing the algorithm);
3) the circuit, with about 10% noise and 30% variability
(Section V).
In addition, Table III shows the test error—the error estimated on the test set after training was over (the standard deviation (Std) was calculated as $(P_e(1 - P_e)/N_{test})^{1/2}$, with $N_{test}$ being the number of samples in the test set and $P_e$ being the test error). On each of the two tasks, the training performance (i.e., on the training set) of the circuit and the algorithm similarly improves, finally reaching a similar test error. These results indicate that the proposed circuit design can precisely implement MNNs trainable with online gradient descent, as expected from the derivations in Section III. Moreover, the circuit exhibits considerable robustness, as its performance is only mildly affected by significant levels of noise and variation.

Fig. 8. Circuit evaluation—training error for two datasets. (a) Task 1—Wisconsin breast cancer diagnosis. (b) Task 2—Iris classification.

TABLE III
CIRCUIT EVALUATION—TEST ERROR (MEAN ± STD) FOR (a) ALGORITHM, (b) CIRCUIT, AND (c) CIRCUIT WITH NOISE AND VARIABILITY
Implementation Details: The SNN and the MNN were
implemented using the (SimElectronics) synaptic grid circuit
model, which was described in Section VI-A. The detailed
schematics of the two-layer MNN model are again given
in [39] (as an HTML file, which may be opened using Firefox),
and also in the code. All simulations were done using a desktop computer with a Core i7 930 processor, a Windows 8.1 operating system, and a MATLAB 2013b environment. Each circuit simulation takes about one day to complete (note that the software implementation of the circuit does not exploit the parallel operation of the circuit hardware). For both tasks, we used the standard training and testing procedures [6]. For each sample k from the dataset, the attributes are converted to an M-long vector $x^{(k)}$, and each label is converted to an N-long binary vector $d^{(k)}$. The training set is repeatedly presented, with samples in random order. The inputs' mean was subtracted, and they were normalized as recommended in [6], with additional rescaling done in dataset 1 to ensure that the inputs fall within the circuit specification, i.e., meet the requirement specified by (13). The performance was averaged
over 10 repetitions (of training and testing) and the training
error was also averaged over the previous 20 samples.
The initial weights were sampled independently from a sym-
metric uniform distribution. For both tasks, in the output layer,
a softmax activation function was used

$$(\sigma(x))_i = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}$$

together with a cross entropy error (37), since this combination is known to improve performance [57]. In task 2, in the two-layer MNN, the neuronal activation functions in the first (hidden) layer were set as

$$\sigma(x_i) = 1.7159 \tanh\left(\frac{2 x_i}{3}\right)$$
as recommended in [6]. Other parameters are given
in Tables I and II. In addition to the inputs from the previous
layer, the commonly used bias input is implemented in the
standard way (i.e., at each circuit, the neuronal input x is extended with an additional component equal to a constant one).
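For completeness, the output-layer and hidden-layer nonlinearities described above are easy to state in code; the sketch below (plain NumPy, not the paper's MATLAB/SimElectronics model) also notes the standard identity that, with softmax outputs and the cross entropy error (37), the gradient with respect to the softmax pre-activations reduces to r − d.

```python
import numpy as np

def softmax(z):
    # Output-layer activation used for both tasks
    e = np.exp(z - z.max())           # subtract max for numerical stability
    return e / e.sum()

def hidden_act(z):
    # Hidden-layer activation used in task 2: sigma(x) = 1.7159 * tanh(2x/3)
    return 1.7159 * np.tanh(2.0 * z / 3.0)

def cross_entropy(r, d):
    # Eq. (37): E(r, d) = -sum_i d_i * ln(r_i)
    return -np.sum(d * np.log(r))

z = np.array([0.2, -1.0, 0.5])        # example pre-activations
d = np.array([0.0, 0.0, 1.0])         # one-hot desired label
r = softmax(z)

# Standard identity: the gradient of (37) with respect to the softmax
# pre-activations z is r - d, i.e., an error signal proportional to d - r.
grad_z = r - d
print(cross_entropy(r, d), grad_z)
```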
VII. DISCUSSION
As explained in Sections II and IV, two major computational bottlenecks of MNNs and many ML algorithms are given by a matrix×vector product operation (4) and a vector×vector outer product operation (9). Both are of order O(M·N), where M and N are the sizes of the input and output vectors. In this paper, the proposed circuit is designed specifically to deal with these bottlenecks using memristor arrays. This design has a relatively small number of components in each array element—one memristor and two transistors. The physical grid-like structure of the arrays implements the matrix×vector product operation (4) using analog summation of currents, while the (nonlinear) memristor dynamics enable us to perform the vector×vector outer product operation (10), using a time×voltage encoding paradigm. The idea of using a resistive grid to perform a matrix×vector product operation is not new (e.g., [58]). The main novelty of this paper is the use of memristors together with time×voltage encoding, which allows us to perform a mathematically accurate vector×vector outer product operation in the learning rule using a small number of components.
A. Previous CMOS-Based Designs
As mentioned in the introduction, CMOS hardware designs
that specifically implement online learning algorithms remain
an unfulfilled promise at this point.
The main incentive for existing hardware solutions is the
inherent inefficiency in implementing these algorithms in
software running on general-purpose hardware (e.g., CPUs,
digital signal processors, and GPUs). However, squeezing
the required circuit for both the computation and the update
phases (two configurable multipliers, for the matrix ×vector
product and vector ×vector outer product, and a memory
element to store the synaptic weight) into an array cell has
proven to be a hard task, using currently available CMOS
technology. Off-chip or chip-in-the-loop design architectures
[67, Table I] have been suggested in many cases as a way
around this design barrier. These designs, however, generally
deal with the computational bottleneck of the matrix×vector product operation in the computation phase, rather than the computational bottleneck of the vector×vector outer product operation in the update phase. In addition, these solutions are only useful in cases where the training is not continuous and is done in a predeployment phase or in special reconfiguration phases during the operation. Other designs implement nonstandard (e.g., perturbation algorithms [68]) or specifically tailored learning algorithms (e.g., modified backpropagation for spiking neurons [69]). However, it remains to be seen whether such algorithms are indeed scalable.
Hardware designs of artificial synaptic arrays that are capable of implementing common (scalable) online gradient-descent-based learning are listed in Table IV. For large arrays (i.e., large M and N), the effective transistor count per synapse (where resistors and capacitors were also counted as transistors) is approximately proportional to the required area and average static power usage of the circuit.

TABLE IV
HARDWARE DESIGNS OF ARTIFICIAL SYNAPSES IMPLEMENTING SCALABLE ONLINE LEARNING ALGORITHMS
The smallest synaptic circuit [59] includes two transistors
similarly to our design, but requires the (rather unusual) use of
UV illumination during its operation and has the disadvantage
of having volatile weights decaying within minutes. The
next device [60] includes six transistors per synapse, but the
update rule can only increase the synaptic weights, which
makes the device unusable for practical purposes. The next
device [62], [70] suggested a grid design using CMOS Gilbert multipliers, resulting in 39 transistors per synapse. In a similar design [63], 52 transistors are used. Both of these devices use capacitive elements for analog memory and suffer from the limitation of having volatile weights, which vanish after training has stopped. Therefore, they require constant retraining (practically acting as a refresh). Such retraining is also required at each startup; alternatively, the weights must be read out into an auxiliary memory—a solution that requires a mechanism for reading out the synaptic weights. The larger design in [64] (92 transistors) also has weight decay, but with a slow, hours-long timescale. The device in [65] (83 transistors and an unspecified weight unit that stores the weights) is not reported to have weight decay, apparently because digital storage is used. This is also true for [66] (150 transistors).
B. Memristor-Based Designs: Expected
Benefits and Technical Issues
The proposed memristor-based design should resolve the
main obstacles of the above CMOS-based designs, and provide
a compact nonvolatile [71] circuit. Area and power consump-
tion are expected to be reduced by a factor of 13–50, in
comparison with standard CMOS technology, if a memristor
is counted as an additional transistor (although it is actually
more compact). Perhaps the most convincing evidence for the limitations of the CMOS-based designs is the fact that although most of these designs are two decades old, they have not been incorporated into commercial products. It is only fair to mention at this point that while our design is purely theoretical and based on speculative technology, the designs reviewed above are based on mature technology and have overcome obstacles all the way to manufacturing.
However, a physical implementation of a memristor-based
neural network should be feasible, as was demonstrated in a
recent work [31]—where a different memristor-based design
was manufactured and tested.
The hardware design in [31] of an SNN with binary outputs demonstrated successful online training using the popular perceptron algorithm. The currently proposed design allows more flexibility, since it can be used to train general MNNs (considered to be much more powerful than SNNs [33]) using the scalable online gradient descent algorithm (backpropagation). In addition, the currently proposed design can also be used to implement the perceptron algorithm (used in [31]), since it is very similar to the adaline algorithm (Fig. 1). Due to
this similarity, both designs should encounter similar technical
issues in a concrete implementation.
Encouragingly, the memristor-based neural network circuit
in [31] is able to achieve good performance, despite the
following issues: 1) noisy memristor dynamics affect the
accuracy of the weight update; 2) variations in memristor para-
meters generate similar variations in the learning rates (here η);
and 3) the nonlinearity of the conductivity [here G(s(t))]
can have a saturating effect on the weights (resulting in
bounded weights). Overcoming issues 1) and 2) suggests that
training MNNs should be relatively robust to noise and vari-
ations, as argued here (Fig. 5) and demonstrated numerically
(Fig. 8 and Table III). Overcoming issue 3) suggests that the
saturating effect of the nonlinearity G(s(t)) on the weights is
not catastrophic. This could be related again to the robustness
of gradient descent (Fig. 5). Moreover, bounding the weight magnitudes can be desirable, and various regularization methods
are commonly used to achieve this effect in MNNs. More
specifically, a saturating nonlinearity on the weights can even
improve performance [72]. Other types of nonlinearity may
also be beneficial [73].
C. Circuit Modifications and Generalizations
The specific parameters that were used for the circuit evalua-
tion (Table I) are not strictly necessary for the proper execution
of the proposed design and are only used to demonstrate its
applicability. For example, it is straightforward to show that
K, V_T, and ḡ have little effect [as long as (13) and (14) hold], and different values of ĝ can be adjusted for by rescaling the
constant c, which appears in (20) and (34). This is important
since the feasible range of parameters for the memristive
devices is still not well characterized and it seems to be quite
broad. For example, the values of the memristive timescales
range from picoseconds [74] to milliseconds [55]. Here, the
parameters were taken from a millisecond-timescale memristor
[55]. The actual speed of the proposed circuit would critically
depend on the timescales of commercially available memristor
devices.
Additional important modifications of the proposed circuit
are straightforward. In [39, Appendix A], it is explained how to modify the circuit to work with more realistic memristive devices [40], [14], instead of the classical memristor model [11], given a few conditions. In [39, Appendix C], it is shown that
it is possible to reduce the transistor count from two to one,
at the price of doubling the duration of the write phase. Other
useful modifications of the circuit are also possible. For example, the input x may be allowed to receive different values during
the read and write operations. In addition, it is straightforward
to replace the simple outer product update rule in (10) by more
general update rules of the form
$$W_{nm}^{(k+1)} = W_{nm}^{(k)} + \eta \sum_{i,j} f_i\big(y_n^{(k)}\big)\, g_j\big(x_m^{(k)}\big)$$
where f_i and g_j are some functions. Finally, it is possible to adaptively modify the learning rate η during training [e.g., by modifying α in (36)].
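The NumPy sketch below (illustrative only) shows how such a generalized update could be evaluated in software, together with one plausible decaying learning-rate schedule; the schedule form and the constants eta0 and alpha are our assumptions and are not taken from (36).

```python
import numpy as np

def generalized_outer_update(W, y, x, fs, gs, eta):
    """W[n, m] += eta * sum_{i,j} fs[i](y[n]) * gs[j](x[m])."""
    F = np.stack([f(y) for f in fs])          # shape (len(fs), N)
    G = np.stack([g(x) for g in gs])          # shape (len(gs), M)
    return W + eta * np.einsum('in,jm->nm', F, G)

def eta_schedule(k, eta0=0.1, alpha=1e-3):
    """One possible decaying learning-rate schedule (hypothetical constants)."""
    return eta0 / (1.0 + alpha * k)

# With a single identity function for f and g, the plain outer-product
# rule (10) is recovered.
fs = [lambda y: y]
gs = [lambda x: x]

W = np.zeros((4, 3))
y = np.random.randn(4)   # stands for the y^(k) vector in the update rule above
x = np.random.randn(3)
for k in range(100):
    W = generalized_outer_update(W, y, x, fs, gs, eta_schedule(k))
```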
VIII. CONCLUSION
A novel method for implementing scalable online gradient descent learning in multilayer neural networks through local update rules is proposed, based on the emerging memristor technology. The method relies on an artificial
synapse using one memristor to store the synaptic weight and
two CMOS transistors to control the circuit. The proposed synapse structure achieves accuracy similar to that of an equivalent software implementation, while exhibiting high robustness and immunity to noise and parameter variability.
Such circuits may be used to implement large-scale online
learning in multilayer neural networks, as well as other learn-
ing systems. The circuit is estimated to be significantly smaller
than existing CMOS-only designs, opening the opportunity for
massive parallelism with millions of adaptive synapses on a
single integrated circuit, operating with low static power and
good robustness. Such brain-like properties may give a signifi-
cant boost to the field of neural networks and learning systems.
ACKNOWLEDGMENT
The authors would like to thank E. Friedman, I. Hubara,
R. Meir, and U. Weiser for their support and helpful comments,
and E. Rosenthal and S. Greshnikov for their contribution to
the SPICE simulations.
REFERENCES
[1] Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado,
J. Dean, and A. Y. Ng, “Building high-level features using large scale
unsupervised learning,” in Proc. ICML, Edinburgh, Scotland, Jun. 2012,
pp. 81–88.
[2] D. Ciresan, U. Meier, and J. Schmidhuber, “Multi-column deep neural
networks for image classification,” in Proc. CVPR, Providence, RI, USA,
Jun. 2012, pp. 3642–3649.
[3] J. Dean et al., “Large scale distributed deep networks,” in Proc. NIPS,
Lake Tahoe, NV, USA, Dec. 2012, pp. 1–9.
[4] R. D. Hof, “Deep learning,” MIT Technol. Rev., Apr. 2013.
[5] D. Hernandez, “Now you can build Google’s $1M artificial brain on the
cheap,” Wired, Jun. 2013, pp. 9–48.
[6] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, “Efficient
backprop,” in Neural Networks: Tricks of the Trade, G. Montavon,
G. B. Orr, and K.-R. Müller, Eds., 2nd ed. Heidelberg, Germany:
Springer-Verlag, 2012.
[7] J. Misra and I. Saha, “Artificial neural networks in hardware: A sur-
vey of two decades of progress,” Neurocomputing, vol. 74, nos. 1–3,
pp. 239–255, Dec. 2010.
[8] A. R. Omondi, “Neurocomputers: A dead end?” Int. J. Neural Syst.,
vol. 10, no. 6, pp. 475–481, 2000.
[9] M. Versace and B. Chandler, “The brain of a new machine,” IEEE
Spectr., vol. 47, no. 12, pp. 30–37, Dec. 2010.
[10] M. M. Waldrop, “Neuroelectronics: Smart connections,” Nature,
vol. 503, no. 7474, pp. 22–24, Nov. 2013.
[11] L. O. Chua, “Memristor-the missing circuit element,” IEEE Trans.
Circuit Theory, vol. 18, no. 5, pp. 507–519, Sep. 1971.
[12] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams, “The
missing memristor found,” Nature, vol. 453, no. 7191, pp. 80–83,
Mar. 2008.
[13] F. Corinto and A. Ascoli, “A boundary condition-based approach to the
modeling of memristor nanostructures,” IEEE Trans. Circuits Syst. I,
Reg. Papers, vol. 59, no. 11, pp. 2713–2726, Nov. 2012.
[14] S. Kvatinsky, E. G. Friedman, A. Kolodny, and U. C. Weiser, “TEAM:
ThrEshold adaptive memristor model,” IEEE Trans. Circuits Syst. I, Reg.
Papers, vol. 60, no. 1, pp. 211–221, Jan. 2013.
[15] S. Kvatinsky, Y. H. Nacson, Y. Etsion, E. G. Friedman, A. Kolodny, and
U. C. Weiser, “Memristor-based multithreading,” Comput. Archit. Lett.,
vol. 13, no. 1, pp. 41–44, Jul. 2014.
[16] Z. Guo, J. Wang, and Z. Yan, “Passivity and passification of memristor-
based recurrent neural networks with time-varying delays,” IEEE
Trans. Neural Netw. Learn. Syst., vol. 25, no. 11, pp. 2099–2109,
Nov. 2014.
[17] L. Wang, Y. Shen, Q. Yin, and G. Zhang, “Adaptive synchronization
of memristor-based neural networks with time-varying delays,” IEEE
Trans. Neural Netw. Learn. Syst., to be published.
[18] G. Zhang and Y. Shen, “Exponential synchronization of delayed
memristor-based chaotic neural networks via periodically intermittent
control,” Neural Netw., vol. 55, pp. 1–10, Jul. 2014.
[19] D. Querlioz, O. Bichler, and C. Gamrat, “Simulation of a memristor-
based spiking neural network immune to device variations,” in Proc. Int.
Joint Conf. Neural Netw. (IJCNN), San Jose, CA, USA, Jul./Aug. 2011,
pp. 1775–1781.
[20] C. Zamarreño-Ramos, L. A. Camuñas-Mesa, J. A. Pérez-Carrasco,
T. Masquelier, T. Serrano-Gotarredona, and B. Linares-Barranco,
“On spike-timing-dependent-plasticity, memristive devices, and build-
ing a self-learning visual cortex,” Frontiers Neurosci., vol. 5, p. 26,
Jan. 2011.
[21] O. Kavehei et al., “Memristor-based synaptic networks and logical
operations using in-situ computing,” in Proc. 7th Int. Conf. Intell.
Sensors, Sensor Netw. Inf. Process., Adelaide, SA, Australia, Dec. 2011,
pp. 137–142.
[22] A. Nere, U. Olcese, D. Balduzzi, and G. Tononi, “A neuromor-
phic architecture for object recognition and motion anticipation using
burst-STDP,” PloS One, vol. 7, no. 5, p. e36958, Jan. 2012.
[23] Y. Kim, Y. Zhang, and P. Li, “A digital neuromorphic VLSI architecture
with memristor crossbar synaptic array for machine learning,” in Proc.
IEEE Int. SOC Conf. (SOCC), Niagara Falls, NY, USA, Sep. 2012,
pp. 328–333.
[24] W. Chan and J. Lohn, “Spike timing dependent plasticity with memris-
tive synapse in neuromorphic systems,” in Proc. Int. Joint Conf. Neural
Netw. (IJCNN), Brisbane, QLD, Australia, Jun. 2012, pp. 1–6.
[25] T. Serrano-Gotarredona, T. Masquelier, T. Prodromakis, G. Indiveri, and
B. Linares-Barranco, “STDP and STDP variations with memristors for
spiking neuromorphic learning systems,” Frontiers Neurosci., vol. 7, p. 2,
Jan. 2013.
[26] D. Querlioz, O. Bichler, P. Dollfus, and C. Gamrat, “Immunity to device
variations in a spiking neural network with memristive nanodevices,”
IEEE Trans. Nanotechnol., vol. 12, no. 3, pp. 288–295, May 2013.
[27] R. Legenstein, C. Naeger, and W. Maass, “What can a neuron learn with
spike-timing-dependent plasticity?” Neural Comput., vol. 17, no. 11,
pp. 2337–2382, Mar. 2005.
[28] D. Chabi, W. Zhao, D. Querlioz, and J.-O. Klein, “Robust neural logic
block (NLB) based on memristor crossbar array,” in Proc. IEEE/ACM
Int. Symp. IEEE Nanosc. Archit. (NANOARCH), San Diego, CA, USA,
Jun. 2011, pp. 137–143.
[29] H. Manem, J. Rajendran, and G. S. Rose, “Stochastic gradient descent
inspired training technique for a CMOS/nano memristive trainable
threshold gate array,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 59,
no. 5, pp. 1051–1060, May 2012.
[30] M. Soltiz, D. Kudithipudi, C. Merkel, G. S. Rose, and R. E.
Pino, “Memristor-based neural logic blocks for nonlinearly separable
functions,” IEEE Trans. Comput., vol. 62, no. 8, pp. 1597–1606,
Aug. 2013.
[31] F. Alibart, E. Zamanidoost, and D. B. Strukov, “Pattern classification by
memristive crossbar circuits using ex situ and in situ training,” Nature
Commun., vol. 4, p. 2072, May 2013.
[32] F. Rosenblatt, “The perceptron: A probabilistic model for information
storage and organization in the brain,” Psychol. Rev., vol. 65, no. 6,
pp. 386–408, Nov. 1958.
[33] G. Cybenko, “Approximation by superpositions of a sigmoidal function,”
Math. Control, Signals, Syst., vol. 2, no. 4, pp. 303–314, 1989.
[34] L. Bottou and O. Bousquet, “The tradeoffs of large scale learning,”
in Optimization for Machine Learning, S. Sra, S. Nowozin, and
S. J. Wright, Eds. Cambridge, MA, USA: MIT Press, 2011, p. 351.
[35] D. Cireşan and U. Meier, “Deep, big, simple neural nets for handwritten
digit recognition,” Neural Comput., vol. 22, no. 12, pp. 3207–3220,
Nov. 2010.
[36] S. P. Adhikari, C. Yang, H. Kim, and L. O. Chua, “Memristor bridge
synapse-based neural network and its learning,” IEEE Trans. Neural
Netw. Learn. Syst., vol. 23, no. 9, pp. 1426–1435, Sep. 2012.
[37] R. Hasan and T. M. Taha, “Enabling back propagation training of
memristor crossbar neuromorphic processors,” in Proc. Int. Joint Conf.
Neural Netw. (IJCNN), Beijing, China, Jul. 2014, pp. 21–28.
[38] Z. Vasilkoski et al., “Review of stability properties of neural plasticity
rules for implementation on memristive neuromorphic hardware,” in
Proc. Int. Joint Conf. Neural Netw., San Jose, CA, USA, Jul./Aug. 2011,
pp. 2563–2569.
[39] The Supplementary Material. [Online]. Available:
http://ieeexplore.ieee.org.
[40] L. O. Chua and S. M. Kang, “Memristive devices and systems,” Proc.
IEEE, vol. 64, no. 2, pp. 209–223, Feb. 1976.
[41] M. D. Pickett et al., “Switching dynamics in titanium dioxide memristive
devices,” J. Appl. Phys., vol. 106, no. 7, p. 074508, 2009.
[42] J. Strachan et al., “State dynamics and modeling of tantalum oxide
memristors,” IEEE Trans. Electron Devices, vol. 60, no. 7,
pp. 2194–2202, Jul. 2013.
[43] B. Widrow and M. E. Hoff, “Adaptive switching circuits,” Stanford
Electron. Labs, Stanford Univ., Stanford, CA, USA, Tech. Rep., 1960.
[44] B. Widrow and S. D. Stearns, Adaptive Signal Processing.
Englewood Cliffs, NJ, USA: Prentice-Hall, 1985.
[45] E. Oja, “Simplified neuron model as a principal component analyzer,”
J. Math. Biol., vol. 15, no. 3, pp. 267–273, Nov. 1982.
[46] C. M. Bishop, Pattern Recognition and Machine Learning. Singapore:
Springer-Verlag, 2006.
[47] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter, “Pegasos: Primal
estimated sub-gradient solver for SVM,” Math. Program., vol. 127, no. 1,
pp. 3–30, Oct. 2010.
[48] S. Kvatinsky, N. Wald, E. Satat, E. G. Friedman, A. Kolodny, and
U. C. Weiser, “Memristor-based material implication (IMPLY) logic:
Design principles and methodologies,” IEEE Trans. Very Large Scale
Integr. (VLSI) Syst., vol. 22, no. 10, pp. 2054–2066, Oct. 2014.
[49] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford, U.K.:
Oxford Univ. Press, 1995.
[50] R. Sarpeshkar, “Analog versus digital: Extrapolating from electronics
to neurobiology,” Neural Comput., vol. 10, no. 7, pp. 1601–1638,
Oct. 1998.
[51] The Mosis Service. [Online]. Available: http://www.mosis.com, accessed
Nov. 21, 2012.
[52] G. Huang, D. C. Sekar, A. Naeemi, K. Shakeri, and J. D. Meindl,
“Compact physical models for power supply noise and chip/package
co-design of gigascale integration,” in Proc. IEEE 57th Electron.
Compon. Technol. Conf. (ECTC), Sparks, NV, USA, May/Jun. 2007,
pp. 1659–1666.
[53] M. Hu, H. Li, Y. Chen, X. Wang, and R. E. Pino, “Geometry variations
analysis of TiO2 thin-film and spintronic memristors,” in Proc. 16th
Asia South Pacific Design Autom. Conf., Yokohama, Japan, Jan. 2011,
pp. 25–30.
[54] SimElectronics. [Online]. Available: http://www.mathworks.com/
products/simelectronics/, accessed Nov. 21, 2012.
[55] T. Chang, S.-H. Jo, K.-H. Kim, P. Sheridan, S. Gaba, and W. Lu,
“Synaptic behaviors and modeling of a metal oxide memristive device,”
Appl. Phys. A, vol. 102, no. 4, pp. 857–863, Feb. 2011.
[56] K. Bache and M. Lichman. (2013). UCI Machine Learning Repository.
[Online]. Available: http://archive.ics.uci.edu/ml
[57] P. Y. Simard, D. Steinkraus, and J. C. Platt, “Best practices for convo-
lutional neural networks applied to visual document analysis,” in Proc.
7th Int. Conf. Document Anal. Recognit., vol. 1. Edinburgh, Scotland,
Aug. 2003, pp. 958–963.
[58] D. B. Strukov and K. K. Likharev, “Reconfigurable nano-crossbar archi-
tectures,” in Nanoelectronics and Information Technology, R. Waser, Ed.
New York, NY, USA: Wiley, 2012, pp. 543–562.
[59] G. Cauwenberghs, C. F. Neugebauer, and A. Yariv, “Analysis and veri-
fication of an analog VLSI incremental outer-product learning system,”
IEEE Trans. Neural Netw., vol. 3, no. 3, pp. 488–497, May 1992.
[60] H. C. Card, C. R. Schneider, and W. R. Moore, “Hebbian plasticity in
MOS synapses,” IEE Proc. F, Radar Signal Process., vol. 138, no. 1,
pp. 13–16, Feb. 1991.
[61] C. Schneider and H. Card, “Analogue CMOS Hebbian synapses,”
Electron. Lett., vol. 27, no. 9, pp. 785–786, Apr. 1991.
[62] H. C. Card, C. R. Schneider, and R. S. Schneider, “Learning capacitive
weights in analog CMOS neural networks,” J. VLSI Signal Process. Syst.
Signal, Image Video Technol., vol. 8, no. 3, pp. 209–225, Oct. 1994.
[63] M. Valle, D. D. Caviglia, and G. M. Bisio, “An experimental analog
VLSI neural network with on-chip back-propagation learning,” Analog
Integr. Circuits Signal Process., vol. 9, no. 3, pp. 231–245, Apr. 1996.
[64] T. Morie and Y. Amemiya, “An all-analog expandable neural net-
work LSI with on-chip backpropagation learning,” IEEE J. Solid-State
Circuits, vol. 29, no. 9, pp. 1086–1093, Sep. 1994.
[65] C. Lu, B.-X. Shi, and L. Chen, “An on-chip BP learning neural network
with ideal neuron characteristics and learning rate adaptation,” Analog
Integr. Circuits Signal Process., vol. 31, no. 1, pp. 55–62, Apr. 2002.
[66] T. Shima, T. Kimura, Y. Kamatani, T. Itakura, Y. Fujita, and T. Iida,
“Neuro chips with on-chip back-propagation and/or Hebbian learning,”
IEEE J. Solid-State Circuits, vol. 27, no. 12, pp. 1868–1876, Dec. 1992.
[67] C. S. Lindsey and T. Lindblad, “Survey of neural network hardware,”
Proc. SPIE, vol. 2492, pp. 1194–1205, Apr. 1995.
[68] G. Cauwenberghs, “A learning analog neural network chip with
continuous-time recurrent dynamics,” in Proc. NIPS, Golden, CO, USA,
Nov. 1994, pp. 858–865.
[69] H. Eguchi, T. Furuta, H. Horiguchi, S. Oteki, and T. Kitaguchi, “Neural
network LSI chip with on-chip learning,” in Proc. Int. Joint Conf. Neural
Netw. (IJCNN), Seattle, WA, USA, Jul. 1991, pp. 453–456.
[70] C. Schneider and H. Card, “CMOS implementation of analog
Hebbian synaptic learning circuits,” in Proc. Int. Joint Conf. Neural
Netw. (IJCNN), vol. 1, Seattle, WA, USA, Jul. 1991,
pp. 437–442.
[71] F. Alibart, L. Gao, B. D. Hoskins, and D. B. Strukov, “High precision
tuning of state for memristive devices by adaptable variation-tolerant
algorithm,” Nanotechnology, vol. 23, no. 7, p. 075201, Feb. 2012.
[72] D. Soudry, I. Hubara, and R. Meir, “Expectation backpropagation:
Parameter-free training of multilayer neural networks with continuous
or discrete weights,” in Proc. NIPS, Montreal, QC, Canada, Dec. 2014,
pp. 963–971.
[73] M. Milev and M. Hristov, “Analog implementation of ANN with inherent
quadratic nonlinearity of the synapses,” IEEE Trans. Neural Netw.,
vol. 14, no. 5, pp. 1187–1200, Sep. 2003.
[74] A. C. Torrezan, J. P. Strachan, G. Medeiros-Ribeiro, and R. S. Williams,
“Sub-nanosecond switching of a tantalum oxide memristor,”
Nanotechnology, vol. 22, no. 48, p. 485203, Dec. 2011.
Daniel Soudry received the B.Sc. degree in electri-
cal engineering and physics and the Ph.D. degree
in electrical engineering from the Technion-Israel
Institute of Technology, Haifa, Israel, in 2008 and
2013, respectively.
He is currently a Gruss Lipper Post-Doctoral
Fellow with the Department of Statistics, Center of
Theoretical Neuroscience, and the Grossman Center
for the Statistics of Mind, Columbia University,
New York, NY, USA. His current research interests
include modeling the nervous system and its com-
ponents, Bayesian methods for neural data analysis and inference in neural
networks, and hardware implementation of neural systems.
Dotan Di Castro received the B.Sc., M.Sc., and
Ph.D. degrees from the Technion-Israel Institute of
Technology, Haifa, Israel, in 2003, 2006, and 2010,
respectively.
He was with IBM Research Labs, Haifa, from
2000 to 2004, and was involved in several startup
companies from 2009 to 2013. He is currently
with Yahoo! Labs, Haifa, where he is investigating
information processing in very large scale systems.
His current research interests include machine learn-
ing (in particular, reinforcement learning), computer
vision, and large-scale hierarchical learning systems.
Asaf Gal received the B.Sc. degree in physics and
electrical engineering from the Technion-Israel Insti-
tute of Technology, Haifa, Israel, in 2004, and the
Ph.D. degree in computational neuroscience from the
Hebrew University of Jerusalem, Jerusalem, Israel,
in 2013.
He is currently a Clore Post-Doctoral Fellow with
the Department of Physics of Complex Systems,
Weizmann Institute of Science, Rehovot 76100,
Israel. His current research interests include theo-
retically oriented study of biological systems, bio-
physics, and the application of complex systems science to experiments in
biology.
Avinoam Kolodny received the Ph.D. degree in
microelectronics from the Technion-Israel Institute
of Technology (Technion), Haifa, Israel, in 1980.
He joined Intel Corporation, Santa Clara, CA,
USA, where he was involved in research and devel-
opment in the areas of device physics, very large
scale integration (VLSI) circuits, electronic design
automation, and organizational development. He has
been a member of the Faculty of Electrical Engineer-
ing with Technion since 2000. His current research
interests include interconnects in VLSI systems, at
both physical and architectural levels.
Shahar Kvatinsky received the B.Sc. degree in
computer engineering and applied physics and the
M.B.A. degree from the Hebrew University of
Jerusalem, Jerusalem, Israel, in 2009 and 2010,
respectively, and the Ph.D. degree in electrical engi-
neering from the Technion-Israel Institute of Tech-
nology, Haifa, Israel, in 2014.
He was with Intel Corporation, Santa Clara, CA,
USA, as a Circuit Designer, from 2006 to 2009. He
is currently a Post-Doctoral Research Fellow with
Stanford University, Stanford, CA, USA. His current
research interests include circuits and architectures with emerging memory
technologies and design of energy efficient architectures.
In this paper, adaptive synchronization of memristor-based neural networks (MNNs) with time-varying delays is investigated. The dynamical analysis here employs results from the theory of differential equations with discontinuous right-hand sides as introduced by Filippov. Sufficient conditions for the global synchronization of MNNs are established with a general adaptive controller. The update gain of the controller can be adjusted to control the synchronization speed. The obtained results complement and improve the previously known results. Finally, numerical simulations are carried out to demonstrate the effectiveness of the obtained results.