LONG SHORT-TERM MEMORY
Neural Computation 9(8):1735-1780, 1997

Sepp Hochreiter
Fakultät für Informatik
Technische Universität München
80290 München, Germany
hochreit@informatik.tu-muenchen.de
http://www7.informatik.tu-muenchen.de/~hochreit

Jürgen Schmidhuber
IDSIA
Corso Elvezia 36
6900 Lugano, Switzerland
juergen@idsia.ch
http://www.idsia.ch/~juergen
Abstract
Learning to store information over extended time intervals via recurrent backpropagation takes a very long time, mostly due to insufficient, decaying error back flow. We briefly review Hochreiter's 1991 analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called "Long Short-Term Memory" (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through "constant error carrousels" within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with RTRL, BPTT, Recurrent Cascade-Correlation, Elman nets, and Neural Sequence Chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long time lag tasks that have never been solved by previous recurrent network algorithms.
1 INTRODUCTION
Recurrent networks can in principle use their feedback connections to store representations of recent input events in form of activations ("short-term memory", as opposed to "long-term memory" embodied by slowly changing weights). This is potentially significant for many applications, including speech processing, non-Markovian control, and music composition (e.g., Mozer 1992). The most widely used algorithms for learning what to put in short-term memory, however, take too much time or do not work well at all, especially when minimal time lags between inputs and corresponding teacher signals are long. Although theoretically fascinating, existing methods do not provide clear practical advantages over, say, backprop in feedforward nets with limited time windows. This paper will review an analysis of the problem and suggest a remedy.
The problem. With conventional "Back-Propagation Through Time" (BPTT, e.g., Williams and Zipser 1992, Werbos 1988) or "Real-Time Recurrent Learning" (RTRL, e.g., Robinson and Fallside 1987), error signals "flowing backwards in time" tend to either (1) blow up or (2) vanish: the temporal evolution of the backpropagated error exponentially depends on the size of the weights (Hochreiter 1991). Case (1) may lead to oscillating weights, while in case (2) learning to bridge long time lags takes a prohibitive amount of time, or does not work at all (see Section 3).
The remedy. This paper presents "Long Short-Term Memory" (LSTM), a novel recurrent network architecture in conjunction with an appropriate gradient-based learning algorithm. LSTM is designed to overcome these error back-flow problems. It can learn to bridge time intervals in excess of 1000 steps even in case of noisy, incompressible input sequences, without loss of short time lag capabilities. This is achieved by an efficient, gradient-based algorithm for an architecture enforcing constant (thus neither exploding nor vanishing) error flow through internal states of special units (provided the gradient computation is truncated at certain architecture-specific points -- this does not affect long-term error flow though).
Outline of paper. Section 2 will briefly review previous work. Section 3 begins with an outline of the detailed analysis of vanishing errors due to Hochreiter (1991). It will then introduce a naive approach to constant error backprop for didactic purposes, and highlight its problems concerning information storage and retrieval. These problems will lead to the LSTM architecture as described in Section 4. Section 5 will present numerous experiments and comparisons with competing methods. LSTM outperforms them, and also learns to solve complex, artificial tasks no other recurrent net algorithm has solved. Section 6 will discuss LSTM's limitations and advantages. The appendix contains a detailed description of the algorithm (A.1), and explicit error flow formulae (A.2).
2 PREVIOUS WORK
This section will focus on recurrent nets with time-varying inputs (as opposed to nets with stationary inputs and fixpoint-based gradient calculations, e.g., Almeida 1987, Pineda 1987).

Gradient-descent variants. The approaches of Elman (1988), Fahlman (1991), Williams (1989), Schmidhuber (1992a), Pearlmutter (1989), and many of the related algorithms in Pearlmutter's comprehensive overview (1995) suffer from the same problems as BPTT and RTRL (see Sections 1 and 3).
Time-delays. Other methods that seem practical for short time lags only are Time-Delay Neural Networks (Lang et al. 1990) and Plate's method (Plate 1993), which updates unit activations based on a weighted sum of old activations (see also de Vries and Principe 1991). Lin et al. (1995) propose variants of time-delay networks called NARX networks.
Time constants. To deal with long time lags, Mozer (1992) uses time constants influencing changes of unit activations (de Vries and Principe's above-mentioned approach (1991) may in fact be viewed as a mixture of TDNN and time constants). For long time lags, however, the time constants need external fine tuning (Mozer 1992). Sun et al.'s alternative approach (1993) updates the activation of a recurrent unit by adding the old activation and the (scaled) current net input. The net input, however, tends to perturb the stored information, which makes long-term storage impractical.
Ring's approach. Ring (1993) also proposed a method for bridging long time lags. Whenever a unit in his network receives conflicting error signals, he adds a higher order unit influencing appropriate connections. Although his approach can sometimes be extremely fast, to bridge a time lag involving 100 steps may require the addition of 100 units. Also, Ring's net does not generalize to unseen lag durations.
Bengio et al.'s approaches. Bengio et al. (1994) investigate methods such as simulated annealing, multi-grid random search, time-weighted pseudo-Newton optimization, and discrete error propagation. Their "latch" and "2-sequence" problems are very similar to problem 3a with minimal time lag 100 (see Experiment 3). Bengio and Frasconi (1994) also propose an EM approach for propagating targets. With n so-called "state networks", at a given time, their system can be in one of only n different states. See also beginning of Section 5. But to solve continuous problems such as the "adding problem" (Section 5.4), their system would require an unacceptable number of states (i.e., state networks).
Kalman lters.
Puskorius and Feldkamp (1994) use Kalman lter techniques to improve
recurrent net performance. Since they use \a derivative discount factor imposed to decay expo-
nentially the eects of past dynamic derivatives," there is no reason to believe that their Kalman
Filter Trained Recurrent Networks will b e useful for very long minimal time lags.
Second order nets. We will see that LSTM uses multiplicative units (MUs) to protect error flow from unwanted perturbations. It is not the first recurrent net method using MUs though. For instance, Watrous and Kuhn (1992) use MUs in second order nets. Some differences to LSTM are: (1) Watrous and Kuhn's architecture does not enforce constant error flow and is not designed to solve long time lag problems. (2) It has fully connected second-order sigma-pi units, while the LSTM architecture's MUs are used only to gate access to constant error flow. (3) Watrous and Kuhn's algorithm costs O(W^2) operations per time step, ours only O(W), where W is the number of weights. See also Miller and Giles (1993) for additional work on MUs.
Simple weight guessing. To avoid long time lag problems of gradient-based approaches we may simply randomly initialize all network weights until the resulting net happens to classify all training sequences correctly. In fact, recently we discovered (Schmidhuber and Hochreiter 1996, Hochreiter and Schmidhuber 1996, 1997) that simple weight guessing solves many of the problems in (Bengio 1994, Bengio and Frasconi 1994, Miller and Giles 1993, Lin et al. 1995) faster than the algorithms proposed therein. This does not mean that weight guessing is a good algorithm. It just means that the problems are very simple. More realistic tasks require either many free parameters (e.g., input weights) or high weight precision (e.g., for continuous-valued parameters), such that guessing becomes completely infeasible.
Adaptive sequence chunkers. Schmidhuber's hierarchical chunker systems (1992b, 1993) do have a capability to bridge arbitrary time lags, but only if there is local predictability across the subsequences causing the time lags (see also Mozer 1992). For instance, in his postdoctoral thesis (1993), Schmidhuber uses hierarchical recurrent nets to rapidly solve certain grammar learning tasks involving minimal time lags in excess of 1000 steps. The performance of chunker systems, however, deteriorates as the noise level increases and the input sequences become less compressible. LSTM does not suffer from this problem.
3 CONSTANT ERROR BACKPROP
3.1 EXPONENTIALLY DECAYING ERROR
Conventional BPTT (e.g., Williams and Zipser 1992). Output unit k's target at time t is denoted by d_k(t). Using mean squared error, k's error signal is

    \vartheta_k(t) = f'_k(net_k(t)) \, (d_k(t) - y^k(t)),

where

    y^i(t) = f_i(net_i(t))

is the activation of a non-input unit i with differentiable activation function f_i,

    net_i(t) = \sum_j w_{ij} \, y^j(t-1)

is unit i's current net input, and w_{ij} is the weight on the connection from unit j to i. Some non-output unit j's backpropagated error signal is

    \vartheta_j(t) = f'_j(net_j(t)) \sum_i w_{ij} \, \vartheta_i(t+1).

The corresponding contribution to w_{jl}'s total weight update is \alpha \, \vartheta_j(t) \, y^l(t-1), where \alpha is the learning rate, and l stands for an arbitrary unit connected to unit j.
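For concreteness, the recursion can be sketched in a few lines of NumPy. This is our illustrative sketch, not code from the paper; the array layout (W[i, j] holding w_ij, logistic activations) is an assumption.

    import numpy as np

    def bptt_error_signals(W, y, targets, output_units, alpha=0.1):
        # W[i, j] holds w_ij (weight from unit j to unit i); y[t, i] is the
        # logistic activation of unit i at step t; targets[t, k] holds d_k(t)
        # for output units k (entries for other units are ignored).
        T, n = y.shape
        fprime = y * (1.0 - y)          # f'(net) for the logistic sigmoid
        theta = np.zeros((T, n))        # error signals vartheta_j(t)
        dW = np.zeros_like(W)           # accumulated weight update contributions
        for t in range(T - 1, -1, -1):
            err = np.zeros(n)
            err[output_units] = (targets[t] - y[t])[output_units]    # d_k(t) - y^k(t)
            back = W.T @ theta[t + 1] if t + 1 < T else np.zeros(n)  # sum_i w_ij vartheta_i(t+1)
            theta[t] = fprime[t] * (err + back)
            if t > 0:
                dW += alpha * np.outer(theta[t], y[t - 1])  # alpha vartheta_j(t) y^l(t-1)
        return theta, dW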
Outline of Hochreiter's analysis (1991, pages 19-21). Suppose we have a fully connected net whose non-input unit indices range from 1 to n. Let us focus on local error flow from unit u to unit v (later we will see that the analysis immediately extends to global error flow). The error occurring at an arbitrary unit u at time step t is propagated "back into time" for q time steps, to an arbitrary unit v. This will scale the error by the following factor:

    \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} =
      \begin{cases}
        f'_v(net_v(t-1)) \, w_{uv} & q = 1 \\
        f'_v(net_v(t-q)) \sum_{l=1}^{n} \frac{\partial \vartheta_l(t-q+1)}{\partial \vartheta_u(t)} \, w_{lv} & q > 1
      \end{cases}     (1)
With l_q = v and l_0 = u, we obtain:

    \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} =
      \sum_{l_1=1}^{n} \cdots \sum_{l_{q-1}=1}^{n} \; \prod_{m=1}^{q} f'_{l_m}(net_{l_m}(t-m)) \, w_{l_m l_{m-1}}     (2)
(proof by induction). The sum of the n^{q-1} terms \prod_{m=1}^{q} f'_{l_m}(net_{l_m}(t-m)) \, w_{l_m l_{m-1}} determines the total error back flow (note that since the summation terms may have different signs, increasing the number of units n does not necessarily increase error flow).
Intuitive explanation of equation (2). If

    |f'_{l_m}(net_{l_m}(t-m)) \, w_{l_m l_{m-1}}| > 1.0

for all m (as can happen, e.g., with linear f_{l_m}), then the largest product increases exponentially with q. That is, the error blows up, and conflicting error signals arriving at unit v can lead to oscillating weights and unstable learning (for error blow-ups or bifurcations see also Pineda 1988, Baldi and Pineda 1991, Doya 1992). On the other hand, if

    |f'_{l_m}(net_{l_m}(t-m)) \, w_{l_m l_{m-1}}| < 1.0

for all m, then the largest product decreases exponentially with q. That is, the error vanishes, and nothing can be learned in acceptable time.
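A quick numerical illustration (ours, not the paper's): raising a single factor |f'(net) w| to the power q shows how sharply the two regimes diverge.

    for factor in (1.1, 0.9):       # |f'(net) w| slightly above / below 1.0
        print([factor ** q for q in (10, 50, 100)])
    # factor 1.1: [2.59, 117.39, 13780.61]   -> error blows up
    # factor 0.9: [0.35, 0.0052, 0.0000266]  -> error vanishes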
If f_{l_m} is the logistic sigmoid function, then the maximal value of f'_{l_m} is 0.25. If y^{l_{m-1}} is constant and not equal to zero, then |f'_{l_m}(net_{l_m}) \, w_{l_m l_{m-1}}| takes on maximal values where

    w_{l_m l_{m-1}} = \frac{1}{y^{l_{m-1}}} \coth\left(\frac{1}{2} net_{l_m}\right),

goes to zero for |w_{l_m l_{m-1}}| \to \infty, and is less than 1.0 for |w_{l_m l_{m-1}}| < 4.0 (e.g., if the absolute maximal weight value w_{max} is smaller than 4.0). Hence with conventional logistic sigmoid activation functions, the error flow tends to vanish as long as the weights have absolute values below 4.0, especially in the beginning of the training phase. In general the use of larger initial weights will not help though -- as seen above, for |w_{l_m l_{m-1}}| \to \infty the relevant derivative goes to zero "faster" than the absolute weight can grow (also, some weights will have to change their signs by crossing zero). Likewise, increasing the learning rate does not help either -- it will not change the ratio of long-range error flow and short-range error flow. BPTT is too sensitive to recent distractions. (A very similar, more recent analysis was presented by Bengio et al. 1994.)
Global error ow.
The local error ow analysis above immediately shows that global error
ow vanishes, too. To see this, compute
X
u
:
u
output unit
@#
v
(
t
q
)
@#
u
(
t
)
:
Weak upper bound for scaling factor. The following, slightly extended vanishing error analysis also takes n, the number of units, into account. For q > 1, formula (2) can be rewritten as

    (W_u^T)^T F'(t-1) \prod_{m=2}^{q-1} \left( W F'(t-m) \right) W_v \, f'_v(net_v(t-q)),

where the weight matrix W is defined by [W]_{ij} := w_{ij}, v's outgoing weight vector W_v is defined by [W_v]_i := [W]_{iv} = w_{iv}, u's incoming weight vector W_u^T is defined by [W_u^T]_i := [W]_{ui} = w_{ui}, and for m = 1, ..., q, F'(t-m) is the diagonal matrix of first order derivatives defined as: [F'(t-m)]_{ij} := 0 if i \neq j, and [F'(t-m)]_{ij} := f'_i(net_i(t-m)) otherwise. Here T is the transposition operator, [A]_{ij} is the element in the i-th column and j-th row of matrix A, and [x]_i is the i-th component of vector x.

Using a matrix norm \| \cdot \|_A compatible with vector norm \| \cdot \|_x, we define

    f'_{max} := \max_{m=1,\ldots,q} \{ \| F'(t-m) \|_A \} .

For \max_{i=1,\ldots,n} \{ |x_i| \} \leq \| x \|_x we get

    |x^T y| \leq n \, \| x \|_x \, \| y \|_x .

Since

    |f'_v(net_v(t-q))| \leq \| F'(t-q) \|_A \leq f'_{max},

we obtain the following inequality:

    \left| \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} \right|
      \leq n \, (f'_{max})^q \, \| W_v \|_x \, \| W_u^T \|_x \, \| W \|_A^{q-2}
      \leq n \, \left( f'_{max} \, \| W \|_A \right)^q .

This inequality results from

    \| W_v \|_x = \| W e_v \|_x \leq \| W \|_A \, \| e_v \|_x \leq \| W \|_A

and

    \| W_u^T \|_x = \| e_u W \|_x \leq \| W \|_A \, \| e_u \|_x \leq \| W \|_A,

where e_k is the unit vector whose components are 0 except for the k-th component, which is 1. Note that this is a weak, extreme case upper bound -- it will be reached only if all \| F'(t-m) \|_A take on maximal values, and if the contributions of all paths across which error flows back from unit u to unit v have the same sign. Large \| W \|_A, however, typically result in small values of \| F'(t-m) \|_A, as confirmed by experiments (see, e.g., Hochreiter 1991).
For example, with norms

    \| W \|_A := \max_r \sum_s |w_{rs}|   and   \| x \|_x := \max_r |x_r|,

we have f'_{max} = 0.25 for the logistic sigmoid. We observe that if

    |w_{ij}| \leq w_{max} < \frac{4.0}{n} \quad \forall \, i, j,

then \| W \|_A \leq n w_{max} < 4.0 will result in exponential decay -- by setting \tau := \frac{n w_{max}}{4.0} < 1.0, we obtain

    \left| \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} \right| \leq n \, (\tau)^q .

We refer to Hochreiter's 1991 thesis for additional results.
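Plugging hypothetical numbers into this bound (our example, not the paper's): with n = 10 units, w_max = 0.3, and the logistic sigmoid's f'_max = 0.25, we get tau = 0.75, and the bound decays to essentially zero within a hundred steps.

    n, w_max = 10, 0.3             # hypothetical values with n * w_max < 4.0
    tau = n * w_max / 4.0          # tau = 0.75 < 1.0
    for q in (10, 50, 100):
        print(q, n * tau ** q)     # 0.56, 5.7e-06, 3.2e-12: exponential decay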
3.2 CONSTANT ERROR FLOW: NAIVE APPROACH
A single unit. To avoid vanishing error signals, how can we achieve constant error flow through a single unit j with a single connection to itself? According to the rules above, at time t, j's local error back flow is

    \vartheta_j(t) = f'_j(net_j(t)) \, \vartheta_j(t+1) \, w_{jj} .

To enforce constant error flow through j, we require

    f'_j(net_j(t)) \, w_{jj} = 1.0 .

Note the similarity to Mozer's fixed time constant system (1992) -- a time constant of 1.0 is appropriate for potentially infinite time lags (footnote 1).
The constant error carrousel. Integrating the differential equation above, we obtain

    f_j(net_j(t)) = \frac{net_j(t)}{w_{jj}}

for arbitrary net_j(t). This means: f_j has to be linear, and unit j's activation has to remain constant:

    y^j(t+1) = f_j(net_j(t+1)) = f_j(w_{jj} \, y^j(t)) = y^j(t) .

Footnote 1: We do not use the expression "time constant" in the differential sense, as, e.g., Pearlmutter (1995).
In the experiments, this will be ensured by using the identity function f_j: f_j(x) = x, \forall x, and by setting w_{jj} = 1.0. We refer to this as the constant error carrousel (CEC). CEC will be LSTM's central feature (see Section 4).
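A minimal sketch of the CEC in isolation (our code; the function and variable names are ours):

    def cec_step(y_prev, injected=0.0, w_self=1.0):
        # f_j is the identity and w_jj = 1.0, so the activation is preserved
        return w_self * y_prev + injected

    y = 0.7                        # value stored at some earlier time step
    for _ in range(1000):
        y = cec_step(y)            # 1000 steps later the activation is still 0.7
    # The backward error is rescaled by f'_j(net_j) * w_jj = 1.0 at every step,
    # so it neither explodes nor vanishes across the 1000-step gap.
    print(y)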
Of course unit j will not only be connected to itself but also to other units. This invokes two obvious, related problems (also inherent in all other gradient-based approaches):

1. Input weight conflict: for simplicity, let us focus on a single additional input weight w_{ji}. Assume that the total error can be reduced by switching on unit j in response to a certain input, and keeping it active for a long time (until it helps to compute a desired output). Provided i is non-zero, since the same incoming weight has to be used for both storing certain inputs and ignoring others, w_{ji} will often receive conflicting weight update signals during this time (recall that j is linear): these signals will attempt to make w_{ji} participate in (1) storing the input (by switching on j) and (2) protecting the input (by preventing j from being switched off by irrelevant later inputs). This conflict makes learning difficult, and calls for a more context-sensitive mechanism for controlling "write operations" through input weights.

2. Output weight conflict: assume j is switched on and currently stores some previous input. For simplicity, let us focus on a single additional outgoing weight w_{kj}. The same w_{kj} has to be used for both retrieving j's content at certain times and preventing j from disturbing k at other times. As long as unit j is non-zero, w_{kj} will attract conflicting weight update signals generated during sequence processing: these signals will attempt to make w_{kj} participate in (1) accessing the information stored in j and -- at different times -- (2) protecting unit k from being perturbed by j. For instance, with many tasks there are certain "short time lag errors" that can be reduced in early training stages. However, at later training stages j may suddenly start to cause avoidable errors in situations that already seemed under control by attempting to participate in reducing more difficult "long time lag errors". Again, this conflict makes learning difficult, and calls for a more context-sensitive mechanism for controlling "read operations" through output weights.

Of course, input and output weight conflicts are not specific for long time lags, but occur for short time lags as well. Their effects, however, become particularly pronounced in the long time lag case: as the time lag increases, (1) stored information must be protected against perturbation for longer and longer periods, and -- especially in advanced stages of learning -- (2) more and more already correct outputs also require protection against perturbation.

Due to the problems above the naive approach does not work well except in case of certain simple problems involving local input/output representations and non-repeating input patterns (see Hochreiter 1991 and Silva et al. 1996). The next section shows how to do it right.
4 LONG SHORT-TERM MEMORY
Memory cells and gate units. To construct an architecture that allows for constant error flow through special, self-connected units without the disadvantages of the naive approach, we extend the constant error carrousel CEC embodied by the self-connected, linear unit j from Section 3.2 by introducing additional features. A multiplicative input gate unit is introduced to protect the memory contents stored in j from perturbation by irrelevant inputs. Likewise, a multiplicative output gate unit is introduced which protects other units from perturbation by currently irrelevant memory contents stored in j.

The resulting, more complex unit is called a memory cell (see Figure 1). The j-th memory cell is denoted c_j. Each memory cell is built around a central linear unit with a fixed self-connection (the CEC). In addition to net_{c_j}, c_j gets input from a multiplicative unit out_j (the "output gate"), and from another multiplicative unit in_j (the "input gate"). in_j's activation at time t is denoted by y^{in_j}(t), out_j's by y^{out_j}(t). We have

    y^{out_j}(t) = f_{out_j}(net_{out_j}(t));   y^{in_j}(t) = f_{in_j}(net_{in_j}(t));

where

    net_{out_j}(t) = \sum_u w_{out_j u} \, y^u(t-1),   and   net_{in_j}(t) = \sum_u w_{in_j u} \, y^u(t-1).

We also have

    net_{c_j}(t) = \sum_u w_{c_j u} \, y^u(t-1).

The summation indices u may stand for input units, gate units, memory cells, or even conventional hidden units if there are any (see also paragraph on "network topology" below). All these different types of units may convey useful information about the current state of the net. For instance, an input gate (output gate) may use inputs from other memory cells to decide whether to store (access) certain information in its memory cell. There even may be recurrent self-connections like w_{c_j c_j}. It is up to the user to define the network topology. See Figure 2 for an example.
At time t, c_j's output y^{c_j}(t) is computed as

    y^{c_j}(t) = y^{out_j}(t) \, h(s_{c_j}(t)),

where the "internal state" s_{c_j}(t) is

    s_{c_j}(0) = 0;   s_{c_j}(t) = s_{c_j}(t-1) + y^{in_j}(t) \, g(net_{c_j}(t))   for t > 0.

The differentiable function g squashes net_{c_j}; the differentiable function h scales memory cell outputs computed from the internal state s_{c_j}.
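These equations translate directly into code. The following NumPy sketch of one forward step for a single memory cell is ours (the names squash, w_in, w_out, w_c are our own); the output ranges chosen for g and h follow the choices used later in the experiments ([-2, 2] and [-1, 1]).

    import numpy as np

    def squash(x, lo, hi):
        # logistic sigmoid rescaled to the range [lo, hi]
        return lo + (hi - lo) / (1.0 + np.exp(-x))

    def memory_cell_step(s_prev, x, w_in, w_out, w_c):
        # x holds the activations y^u(t-1) of all source units u
        y_in = squash(w_in @ x, 0.0, 1.0)      # input gate  y^{in_j}(t)
        y_out = squash(w_out @ x, 0.0, 1.0)    # output gate y^{out_j}(t)
        g_val = squash(w_c @ x, -2.0, 2.0)     # squashed cell input g(net_{c_j}(t))
        s = s_prev + y_in * g_val              # CEC update of the internal state
        y_c = y_out * squash(s, -1.0, 1.0)     # cell output y^{c_j}(t) = y^{out_j}(t) h(s)
        return s, y_c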
Figure 1: Architecture of memory cell c_j (the box) and its gate units in_j, out_j. The self-recurrent connection (with weight 1.0) indicates feedback with a delay of 1 time step. It builds the basis of the "constant error carrousel" CEC. The gate units open and close access to CEC. See text and appendix A.1 for details.
Why gate units? To avoid input weight conflicts, in_j controls the error flow to memory cell c_j's input connections w_{c_j i}. To circumvent c_j's output weight conflicts, out_j controls the error flow from unit j's output connections. In other words, the net can use in_j to decide when to keep or override information in memory cell c_j, and out_j to decide when to access memory cell c_j and when to prevent other units from being perturbed by c_j (see Figure 1).

Error signals trapped within a memory cell's CEC cannot change -- but different error signals flowing into the cell (at different times) via its output gate may get superimposed. The output gate will have to learn which errors to trap in its CEC, by appropriately scaling them. The input gate will have to learn when to release errors, again by appropriately scaling them. Essentially, the multiplicative gate units open and close access to constant error flow through CEC.

Distributed output representations typically do require output gates. Not always are both gate types necessary, though -- one may be sufficient. For instance, in Experiments 2a and 2b in Section 5, it will be possible to use input gates only. In fact, output gates are not required in case of local output encoding -- preventing memory cells from perturbing already learned outputs can be done by simply setting the corresponding weights to zero. Even in this case, however, output gates can be beneficial: they prevent the net's attempts at storing long time lag memories (which are usually hard to learn) from perturbing activations representing easily learnable short time lag memories. (This will prove quite useful in Experiment 1, for instance.)
Network topology. We use networks with one input layer, one hidden layer, and one output layer. The (fully) self-connected hidden layer contains memory cells and corresponding gate units (for convenience, we refer to both memory cells and gate units as being located in the hidden layer). The hidden layer may also contain "conventional" hidden units providing inputs to gate units and memory cells. All units (except for gate units) in all layers have directed connections (serve as inputs) to all units in the layer above (or to all higher layers -- Experiments 2a and 2b).

Memory cell blocks. S memory cells sharing the same input gate and the same output gate form a structure called a "memory cell block of size S". Memory cell blocks facilitate information storage -- as with conventional neural nets, it is not so easy to code a distributed input within a single cell. Since each memory cell block has as many gate units as a single memory cell (namely two), the block architecture can be even slightly more efficient (see paragraph "computational complexity"). A memory cell block of size 1 is just a simple memory cell. In the experiments (Section 5), we will use memory cell blocks of various sizes.
Learning. We use a variant of RTRL (e.g., Robinson and Fallside 1987) which properly takes into account the altered, multiplicative dynamics caused by input and output gates. However, to ensure non-decaying error backprop through internal states of memory cells, as with truncated BPTT (e.g., Williams and Peng 1990), errors arriving at "memory cell net inputs" (for cell c_j, this includes net_{c_j}, net_{in_j}, net_{out_j}) do not get propagated back further in time (although they do serve to change the incoming weights). Only within memory cells (footnote 2), errors are propagated back through previous internal states s_{c_j}. To visualize this: once an error signal arrives at a memory cell output, it gets scaled by output gate activation and h'. Then it is within the memory cell's CEC, where it can flow back indefinitely without ever being scaled. Only when it leaves the memory cell through the input gate and g, it is scaled once more by input gate activation and g'. It then serves to change the incoming weights before it is truncated (see appendix for explicit formulae).
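The exact formulae are in Appendix A.1; as a rough sketch (our simplification, with our own names), the per-weight bookkeeping for a cell input weight w_{c_j u} behaves like an eligibility trace that is carried forward in time and combined with whatever (truncated) error reaches the internal state:

    def cell_input_trace(dS_prev, g_prime, y_in, x_u):
        # dS approximates ds_{c_j}/dw_{c_j u}, updated forward in time:
        # dS(t) = dS(t-1) + g'(net_{c_j}(t)) * y^{in_j}(t) * y^u(t-1)
        return dS_prev + g_prime * y_in * x_u

    def weight_update(alpha, e_s, dS):
        # combine the error e_s arriving at the internal state with the trace
        return alpha * e_s * dS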
Computational complexity. As with Mozer's focused recurrent backprop algorithm (Mozer 1989), only the derivatives \partial s_{c_j} / \partial w_{il} need to be stored and updated. Hence the LSTM algorithm is very efficient, with an excellent update complexity of O(W), where W is the number of weights (see details in appendix A.1). Hence, LSTM and BPTT for fully recurrent nets have the same update complexity per time step (while RTRL's is much worse). Unlike full BPTT, however, LSTM is local in space and time (footnote 3): there is no need to store activation values observed during sequence processing in a stack with potentially unlimited size.
Abuse problem and solutions. In the beginning of the learning phase, error reduction may be possible without storing information over time. The network will thus tend to abuse memory cells, e.g., as bias cells (i.e., it might make their activations constant and use the outgoing connections as adaptive thresholds for other units). The potential difficulty is: it may take a long time to release abused memory cells and make them available for further learning. A similar "abuse problem" appears if two memory cells store the same (redundant) information. There are at least two solutions to the abuse problem: (1) Sequential network construction (e.g., Fahlman 1991): a memory cell and the corresponding gate units are added to the network whenever the error stops decreasing (see Experiment 2 in Section 5). (2) Output gate bias: each output gate gets a negative initial bias, to push initial memory cell activations towards zero. Memory cells with more negative bias automatically get "allocated" later (see Experiments 1, 3, 4, 5, 6 in Section 5).

Footnote 2: For intra-cellular backprop in a quite different context see also Doya and Yoshizawa (1989).

Footnote 3: Following Schmidhuber (1989), we say that a recurrent net algorithm is local in space if the update complexity per time step and weight does not depend on network size. We say that a method is local in time if its storage requirements do not depend on input sequence length. For instance, RTRL is local in time but not in space. BPTT is local in space but not in time.

Figure 2: Example of a net with 8 input units, 4 output units, and 2 memory cell blocks of size 2. in_1 marks the input gate, out_1 marks the output gate, and cell_1/block_1 marks the first memory cell of block 1. cell_1/block_1's architecture is identical to the one in Figure 1, with gate units in_1 and out_1 (note that by rotating Figure 1 by 90 degrees anti-clockwise, it will match with the corresponding parts of Figure 2). The example assumes dense connectivity: each gate unit and each memory cell sees all non-output units. For simplicity, however, outgoing weights of only one type of unit are shown for each layer. With the efficient, truncated update rule, error flows only through connections to output units, and through fixed self-connections within cell blocks (not shown here -- see Figure 1). Error flow is truncated once it "wants" to leave memory cells or gate units. Therefore, no connection shown above serves to propagate error back to the unit from which the connection originates (except for connections to output units), although the connections themselves are modifiable. That's why the truncated LSTM algorithm is so efficient, despite its ability to bridge very long time lags. See text and appendix A.1 for details. Figure 2 actually shows the architecture used for Experiment 6a -- only the bias of the non-input units is omitted.
Internal state drift and remedies. If memory cell c_j's inputs are mostly positive or mostly negative, then its internal state s_j will tend to drift away over time. This is potentially dangerous, for the h'(s_j) will then adopt very small values, and the gradient will vanish. One way to circumvent this problem is to choose an appropriate function h. But h(x) = x, for instance, has the disadvantage of unrestricted memory cell output range. Our simple but effective way of solving drift problems at the beginning of learning is to initially bias the input gate in_j towards zero. Although there is a tradeoff between the magnitudes of h'(s_j) on the one hand and of y^{in_j} and f'_{in_j} on the other, the potential negative effect of input gate bias is negligible compared to the one of the drifting effect. With logistic sigmoid activation functions, there appears to be no need for fine-tuning the initial bias, as confirmed by Experiments 4 and 5 in Section 5.4.
5 EXPERIMENTS
Introduction. Which tasks are appropriate to demonstrate the quality of a novel long time lag algorithm? First of all, minimal time lags between relevant input signals and corresponding teacher signals must be long for all training sequences. In fact, many previous recurrent net algorithms sometimes manage to generalize from very short training sequences to very long test sequences. See, e.g., Pollack (1991). But a real long time lag problem does not have any short time lag exemplars in the training set. For instance, Elman's training procedure, BPTT, offline RTRL, online RTRL, etc., fail miserably on real long time lag problems. See, e.g., Hochreiter (1991) and Mozer (1992). A second important requirement is that the tasks should be complex enough such that they cannot be solved quickly by simple-minded strategies such as random weight guessing.

Guessing can outperform many long time lag algorithms. Recently we discovered (Schmidhuber and Hochreiter 1996, Hochreiter and Schmidhuber 1996, 1997) that many long time lag tasks used in previous work can be solved more quickly by simple random weight guessing than by the proposed algorithms. For instance, guessing solved a variant of Bengio and Frasconi's "parity problem" (1994) much faster (footnote 4) than the seven methods tested by Bengio et al. (1994) and Bengio and Frasconi (1994). Similarly for some of Miller and Giles' problems (1993). Of course, this does not mean that guessing is a good algorithm. It just means that some previously used problems are not extremely appropriate to demonstrate the quality of previously proposed algorithms.
What's common to Experiments 1-6. All our experiments (except for Experiment 1) involve long minimal time lags -- there are no short time lag training exemplars facilitating learning. Solutions to most of our tasks are sparse in weight space. They require either many parameters/inputs or high weight precision, such that random weight guessing becomes infeasible.

We always use on-line learning (as opposed to batch learning), and logistic sigmoids as activation functions. For Experiments 1 and 2, initial weights are chosen in the range [-0.2, 0.2], for the other experiments in [-0.1, 0.1]. Training sequences are generated randomly according to the various task descriptions. In slight deviation from the notation in Appendix A1, each discrete time step of each input sequence involves three processing steps: (1) use current input to set the input units. (2) Compute activations of hidden units (including input gates, output gates, memory cells). (3) Compute output unit activations. Except for Experiments 1, 2a, and 2b, sequence elements are randomly generated on-line, and error signals are generated only at sequence ends. Net activations are reset after each processed input sequence.

For comparisons with recurrent nets taught by gradient descent, we give results only for RTRL, except for comparison 2a, which also includes BPTT. Note, however, that untruncated BPTT (see, e.g., Williams and Peng 1990) computes exactly the same gradient as offline RTRL. With long time lag problems, offline RTRL (or BPTT) and the online version of RTRL (no activation resets, online weight changes) lead to almost identical, negative results (as confirmed by additional simulations in Hochreiter 1991; see also Mozer 1992). This is because offline RTRL, online RTRL, and full BPTT all suffer badly from exponential error decay.

Our LSTM architectures are selected quite arbitrarily. If nothing is known about the complexity of a given problem, a more systematic approach would be: start with a very small net consisting of one memory cell. If this does not work, try two cells, etc. Alternatively, use sequential network construction (e.g., Fahlman 1991).
Outline of experiments.

Experiment 1 focuses on a standard benchmark test for recurrent nets: the embedded Reber grammar. Since it allows for training sequences with short time lags, it is not a long time lag problem. We include it because (1) it provides a nice example where LSTM's output gates are truly beneficial, and (2) it is a popular benchmark for recurrent nets that has been used by many authors -- we want to include at least one experiment where conventional BPTT and RTRL do not fail completely (LSTM, however, clearly outperforms them). The embedded Reber grammar's minimal time lags represent a border case in the sense that it is still possible to learn to bridge them with conventional algorithms. Only slightly longer minimal time lags would make this almost impossible. The more interesting tasks in our paper, however, are those that RTRL, BPTT, etc. cannot solve at all.

Experiment 2 focuses on noise-free and noisy sequences involving numerous input symbols distracting from the few important ones. The most difficult task (Task 2c) involves hundreds of distractor symbols at random positions, and minimal time lags of 1000 steps. LSTM solves it, while BPTT and RTRL already fail in case of 10-step minimal time lags (see also, e.g., Hochreiter 1991 and Mozer 1992). For this reason RTRL and BPTT are omitted in the remaining, more complex experiments, all of which involve much longer time lags.

Experiment 3 addresses long time lag problems with noise and signal on the same input line. Experiments 3a/3b focus on Bengio et al.'s 1994 "2-sequence problem". Because this problem actually can be solved quickly by random weight guessing, we also include a far more difficult 2-sequence problem (3c) which requires learning real-valued, conditional expectations of noisy targets, given the inputs.

Experiments 4 and 5 involve distributed, continuous-valued input representations and require learning to store precise, real values for very long time periods. Relevant input signals can occur at quite different positions in input sequences. Again minimal time lags involve hundreds of steps. Similar tasks have never been solved by other recurrent net algorithms.

Experiment 6 involves tasks of a different complex type that also has not been solved by other recurrent net algorithms. Again, relevant input signals can occur at quite different positions in input sequences. The experiment shows that LSTM can extract information conveyed by the temporal order of widely separated inputs.

Subsection 5.7 will provide a detailed summary of experimental conditions in two tables for reference.

Footnote 4: It should be mentioned, however, that different input representations and different types of noise may lead to worse guessing performance (Yoshua Bengio, personal communication, 1996).
5.1 EXPERIMENT 1: EMBEDDED REBER GRAMMAR
Task. Our first task is to learn the "embedded Reber grammar", e.g., Smith and Zipser (1989), Cleeremans et al. (1989), and Fahlman (1991). Since it allows for training sequences with short time lags (of as few as 9 steps), it is not a long time lag problem. We include it for two reasons: (1) it is a popular recurrent net benchmark used by many authors -- we wanted to have at least one experiment where RTRL and BPTT do not fail completely, and (2) it shows nicely how output gates can be beneficial.
Figure 3: Transition diagram for the Reber grammar.

Figure 4: Transition diagram for the embedded Reber grammar. Each box represents a copy of the Reber grammar (see Figure 3).
Starting at the leftmost node of the directed graph in Figure 4, symbol strings are generated sequentially (beginning with the empty string) by following edges -- and appending the associated symbols to the current string -- until the rightmost node is reached. Edges are chosen randomly if there is a choice (probability: 0.5). The net's task is to read strings, one symbol at a time, and to permanently predict the next symbol (error signals occur at every time step). To correctly predict the symbol before last, the net has to remember the second symbol.
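For readers who want to reproduce the data, here is a string generator. The transition table assumes the standard textbook form of the Reber grammar (our reconstruction; see Figure 3 for the authoritative diagram), and the code itself is our sketch.

    import random

    # Standard Reber grammar: state -> list of (symbol, next state) choices
    REBER = {
        0: [('T', 1), ('P', 2)],
        1: [('S', 1), ('X', 3)],
        2: [('T', 2), ('V', 4)],
        3: [('X', 2), ('S', 5)],
        4: [('P', 3), ('V', 5)],
    }

    def reber_string():
        s, state = 'B', 0
        while state != 5:
            symbol, state = random.choice(REBER[state])  # each edge with probability 0.5
            s += symbol
        return s + 'E'

    def embedded_reber_string():
        # B, then T or P, then a full Reber string, then the same T or P, then E;
        # predicting the second-to-last symbol requires remembering the second one
        second = random.choice('TP')
        return 'B' + second + reber_string() + second + 'E'

Under this reconstruction the shortest embedded strings have 9 symbols, consistent with the minimal time lag of 9 steps mentioned above.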
Comparison. We compare LSTM to "Elman nets trained by Elman's training procedure" (ELM) (results taken from Cleeremans et al. 1989), Fahlman's "Recurrent Cascade-Correlation" (RCC) (results taken from Fahlman 1991), and RTRL (results taken from Smith and Zipser (1989), where only the few successful trials are listed). It should be mentioned that Smith and Zipser actually make the task easier by increasing the probability of short time lag exemplars. We didn't do this for LSTM.

Training/Testing. We use a local input/output representation (7 input units, 7 output units). Following Fahlman, we use 256 training strings and 256 separate test strings. The training set is generated randomly; training exemplars are picked randomly from the training set. Test sequences are generated randomly, too, but sequences already used in the training set are not used for testing. After string presentation, all activations are reinitialized with zeros. A trial is considered successful if all string symbols of all sequences in both test set and training set are predicted correctly -- that is, if the output unit(s) corresponding to the possible next symbol(s) is(are) always the most active ones.
Architectures. Architectures for RTRL, ELM, RCC are reported in the references listed above. For LSTM, we use 3 (4) memory cell blocks. Each block has 2 (1) memory cells. The output layer's only incoming connections originate at memory cells. Each memory cell and each gate unit receives incoming connections from all memory cells and gate units (the hidden layer is fully connected -- less connectivity may work as well). The input layer has forward connections to all units in the hidden layer. The gate units are biased. These architecture parameters make it easy to store at least 3 input signals (architectures 3-2 and 4-1 are employed to obtain comparable numbers of weights for both architectures: 264 for 4-1 and 276 for 3-2). Other parameters may be appropriate as well, however. All sigmoid functions are logistic with output range [0, 1], except for h, whose range is [-1, 1], and g, whose range is [-2, 2]. All weights are initialized in [-0.2, 0.2], except for the output gate biases, which are initialized to -1, -2, and -3, respectively (see abuse problem, solution (2) of Section 4). We tried learning rates of 0.1, 0.2, and 0.5.
Results. We use 3 different, randomly generated pairs of training and test sets. With each such pair we run 10 trials with different initial weights. See Table 1 for results (mean of 30 trials). Unlike the other methods, LSTM always learns to solve the task. Even when we ignore the unsuccessful trials of the other approaches, LSTM learns much faster.

    method  hidden units      # weights  learning rate  % of success       success after
    RTRL    3                 170        0.05           "some fraction"    173,000
    RTRL    12                494        0.1            "some fraction"    25,000
    ELM     15                435                       0                  >200,000
    RCC     7-9               119-198                   50                 182,000
    LSTM    4 blocks, size 1  264        0.1            100                39,740
    LSTM    3 blocks, size 2  276        0.1            100                21,730
    LSTM    3 blocks, size 2  276        0.2            97                 14,060
    LSTM    4 blocks, size 1  264        0.5            97                 9,500
    LSTM    3 blocks, size 2  276        0.5            100                8,440

Table 1: EXPERIMENT 1: Embedded Reber grammar: percentage of successful trials and number of sequence presentations until success for RTRL (results taken from Smith and Zipser 1989), "Elman net trained by Elman's procedure" (results taken from Cleeremans et al. 1989), "Recurrent Cascade-Correlation" (results taken from Fahlman 1991) and our new approach (LSTM). Weight numbers in the first 4 rows are estimates -- the corresponding papers do not provide all the technical details. Only LSTM almost always learns to solve the task (only two failures out of 150 trials). Even when we ignore the unsuccessful trials of the other approaches, LSTM learns much faster (the number of required training examples in the bottom row varies between 3,800 and 24,100).

Importance of output gates. The experiment provides a nice example where the output gate is truly beneficial. Learning to store the first T or P should not perturb activations representing the more easily learnable transitions of the original Reber grammar. This is the job of the output gates. Without output gates, we did not achieve fast learning.
5.2 EXPERIMENT 2: NOISE-FREE AND NOISY SEQUENCES
Task 2a: noise-free sequences with long time lags. There are p+1 possible input symbols denoted a_1, ..., a_{p-1}, a_p = x, a_{p+1} = y. a_i is "locally" represented by the (p+1)-dimensional vector whose i-th component is 1 (all other components are 0). A net with p+1 input units and p+1 output units sequentially observes input symbol sequences, one at a time, permanently trying to predict the next symbol -- error signals occur at every single time step. To emphasize the "long time lag problem", we use a training set consisting of only two very similar sequences: (y, a_1, a_2, ..., a_{p-1}, y) and (x, a_1, a_2, ..., a_{p-1}, x). Each is selected with probability 0.5. To predict the final element, the net has to learn to store a representation of the first element for p time steps.
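A generator for this training set is a few lines of Python (our sketch; the index convention -- symbols a_1..a_{p-1} as indices 0..p-2, x as p-1, y as p -- is ours):

    import random

    def task_2a_sequence(p):
        # returns one of the two training sequences as a list of symbol indices
        x, y = p - 1, p
        first = random.choice([x, y])                  # each with probability 0.5
        return [first] + list(range(p - 1)) + [first]  # (x, a_1..a_{p-1}, x) or (y, ..., y)

    def one_hot(i, p):
        # local representation: (p+1)-dimensional vector with a single 1
        v = [0.0] * (p + 1)
        v[i] = 1.0
        return v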
We compare "Real-Time Recurrent Learning" for fully recurrent nets (RTRL), "Back-Propagation Through Time" (BPTT), the sometimes very successful 2-net "Neural Sequence Chunker" (CH, Schmidhuber 1992b), and our new method (LSTM). In all cases, weights are initialized in [-0.2, 0.2]. Due to limited computation time, training is stopped after 5 million sequence presentations. A successful run is one that fulfills the following criterion: after training, during 10,000 successive, randomly chosen input sequences, the maximal absolute error of all output units is always below 0.25.
Architectures. RTRL: one self-recurrent hidden unit, p+1 non-recurrent output units. Each layer has connections from all layers below. All units use the logistic sigmoid activation function in [0, 1].

BPTT: same architecture as the one trained by RTRL.

CH: both net architectures like RTRL's, but one has an additional output for predicting the hidden unit of the other one (see Schmidhuber 1992b for details).

LSTM: like with RTRL, but the hidden unit is replaced by a memory cell and an input gate (no output gate required). g is the logistic sigmoid, and h is the identity function h: h(x) = x, \forall x. Memory cell and input gate are added once the error has stopped decreasing (see abuse problem: solution (1) in Section 4).
Results. Using RTRL and a short 4 time step delay (p = 4), 7/9 of all trials were successful. No trial was successful with p = 10. With long time lags, only the neural sequence chunker and LSTM achieved successful trials, while BPTT and RTRL failed. With p = 100, the 2-net sequence chunker solved the task in only 1/3 of all trials. LSTM, however, always learned to solve the task. Comparing successful trials only, LSTM learned much faster. See Table 2 for details. It should be mentioned, however, that a hierarchical chunker can also always quickly solve this task (Schmidhuber 1992c, 1993).

    Method  Delay p  Learning rate  # weights  % Successful trials  Success after
    RTRL    4        1.0            36         78                   1,043,000
    RTRL    4        4.0            36         56                   892,000
    RTRL    4        10.0           36         22                   254,000
    RTRL    10       1.0-10.0       144        0                    >5,000,000
    RTRL    100      1.0-10.0       10404      0                    >5,000,000
    BPTT    100      1.0-10.0       10404      0                    >5,000,000
    CH      100      1.0            10506      33                   32,400
    LSTM    100      1.0            10504      100                  5,040

Table 2: Task 2a: Percentage of successful trials and number of training sequences until success, for "Real-Time Recurrent Learning" (RTRL), "Back-Propagation Through Time" (BPTT), neural sequence chunking (CH), and the new method (LSTM). Table entries refer to means of 18 trials. With 100 time step delays, only CH and LSTM achieve successful trials. Even when we ignore the unsuccessful trials of the other approaches, LSTM learns much faster.
Task 2b: no local regularities. With the task above, the chunker sometimes learns to correctly predict the final element, but only because of predictable local regularities in the input stream that allow for compressing the sequence. In an additional, more difficult task (involving many more different possible sequences), we remove compressibility by replacing the deterministic subsequence (a_1, a_2, ..., a_{p-1}) by a random subsequence (of length p-1) over the alphabet a_1, a_2, ..., a_{p-1}. We obtain 2 classes (two sets of sequences)

    \{ (y, a_{i_1}, a_{i_2}, \ldots, a_{i_{p-1}}, y) \mid 1 \leq i_1, i_2, \ldots, i_{p-1} \leq p-1 \}

and

    \{ (x, a_{i_1}, a_{i_2}, \ldots, a_{i_{p-1}}, x) \mid 1 \leq i_1, i_2, \ldots, i_{p-1} \leq p-1 \}.

Again, every next sequence element has to be predicted. The only totally predictable targets, however, are x and y, which occur at sequence ends. Training exemplars are chosen randomly from the 2 classes. Architectures and parameters are the same as in Experiment 2a. A successful run is one that fulfills the following criterion: after training, during 10,000 successive, randomly chosen input sequences, the maximal absolute error of all output units is below 0.25 at sequence end.
Results. As expected, the chunker failed to solve this task (so did BPTT and RTRL, of course). LSTM, however, was always successful. On average (mean of 18 trials), success for p = 100 was achieved after 5,680 sequence presentations. This demonstrates that LSTM does not require sequence regularities to work well.
Task 2c: very long time lags -- no local regularities. This is the most difficult task in this subsection. To our knowledge no other recurrent net algorithm can solve it. Now there are p+4 possible input symbols denoted a_1, ..., a_{p-1}, a_p, a_{p+1} = e, a_{p+2} = b, a_{p+3} = x, a_{p+4} = y. a_1, ..., a_p are also called "distractor symbols". Again, a_i is locally represented by the (p+4)-dimensional vector whose i-th component is 1 (all other components are 0). A net with p+4 input units and 2 output units sequentially observes input symbol sequences, one at a time. Training sequences are randomly chosen from the union of two very similar subsets of sequences:

    \{ (b, y, a_{i_1}, a_{i_2}, \ldots, a_{i_{q+k}}, e, y) \mid 1 \leq i_1, i_2, \ldots, i_{q+k} \leq p \}

and

    \{ (b, x, a_{i_1}, a_{i_2}, \ldots, a_{i_{q+k}}, e, x) \mid 1 \leq i_1, i_2, \ldots, i_{q+k} \leq p \}.

To produce a training sequence, we (1) randomly generate a sequence prefix of length q+2, (2) randomly generate a sequence suffix of additional elements (\neq b, e, x, y) with probability 9/10 or, alternatively, an e with probability 1/10. In the latter case, we (3) conclude the sequence with x or y, depending on the second element. For a given k, this leads to a uniform distribution on the possible sequences with length q+k+4. The minimal sequence length is q+4; the expected length is

    4 + \sum_{k=0}^{\infty} \frac{1}{10} \left( \frac{9}{10} \right)^k (q+k) = q + 14.
The expected number of occurrences of element a_i, 1 \leq i \leq p, in a sequence is (q+10)/p \approx q/p. The goal is to predict the last symbol, which always occurs after the "trigger symbol" e. Error signals are generated only at sequence ends. To predict the final element, the net has to learn to store a representation of the second element for at least q+1 time steps (until it sees the trigger symbol e). Success is defined as "prediction error (for final sequence element) of both output units always below 0.2, for 10,000 successive, randomly chosen input sequences".
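The generation recipe above can be written out as follows (our sketch, with our own index convention: distractors a_1..a_p as indices 0..p-1, then e, b, x, y). The while loop realizes the 9/10 versus 1/10 suffix rule.

    import random

    def task_2c_sequence(p, q):
        e, b, x, y = p, p + 1, p + 2, p + 3
        second = random.choice([x, y])                               # class symbol
        seq = [b, second] + [random.randrange(p) for _ in range(q)]  # prefix of length q+2
        while random.random() < 0.9:                                 # distractor w.p. 9/10
            seq.append(random.randrange(p))
        return seq + [e, second]        # trigger symbol e, then the symbol to predict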
Architecture/Learning. The net has p+4 input units and 2 output units. Weights are initialized in [-0.2, 0.2]. To avoid too much learning time variance due to different weight initializations, the hidden layer gets two memory cells (two cell blocks of size 1 -- although one would be sufficient). There are no other hidden units. The output layer receives connections only from memory cells. Memory cells and gate units receive connections from input units, memory cells and gate units (i.e., the hidden layer is fully connected). No bias weights are used. h and g are logistic sigmoids with output ranges [-1, 1] and [-2, 2], respectively. The learning rate is 0.01.
    q (time lag - 1)  p (# random inputs)  q/p  # weights  Success after
    50                50                   1    364        30,000
    100               100                  1    664        31,000
    200               200                  1    1264       33,000
    500               500                  1    3064       38,000
    1,000             1,000                1    6064       49,000
    1,000             500                  2    3064       49,000
    1,000             200                  5    1264       75,000
    1,000             100                  10   664        135,000
    1,000             50                   20   364        203,000

Table 3: Task 2c: LSTM with very long minimal time lags q+1 and a lot of noise. p is the number of available distractor symbols (p+4 is the number of input units). q/p is the expected number of occurrences of a given distractor symbol in a sequence. The rightmost column lists the number of training sequences required by LSTM (BPTT, RTRL and the other competitors have no chance of solving this task). If we let the number of distractor symbols (and weights) increase in proportion to the time lag, learning time increases very slowly. The lower block illustrates the expected slow-down due to increased frequency of distractor symbols.
Note that the minimal time lag is q+1 -- the net never sees short training sequences facilitating the classification of long test sequences.
Results. 20 trials were made for all tested pairs (p, q). Table 3 lists the mean of the number of training sequences required by LSTM to achieve success (BPTT and RTRL have no chance of solving non-trivial tasks with minimal time lags of 1000 steps).
Scaling. Table 3 shows that if we let the number of input symbols (and weights) increase in proportion to the time lag, learning time increases very slowly. This is another remarkable property of LSTM not shared by any other method we are aware of. Indeed, RTRL and BPTT are far from scaling reasonably -- instead, they appear to scale exponentially, and appear quite useless when the time lags exceed as few as 10 steps.

Distractor influence. In Table 3, the column headed by q/p gives the expected frequency of distractor symbols. Increasing this frequency decreases learning speed, an effect due to weight oscillations caused by frequently observed input symbols.
5.3 EXPERIMENT 3: NOISE AND SIGNAL ON SAME CHANNEL
This experiment serves to illustrate that LSTM does not encounter fundamental problems if noise and signal are mixed on the same input line. We initially focus on Bengio et al.'s simple 1994 "2-sequence problem"; in Experiment 3c we will then pose a more challenging 2-sequence problem.

Task 3a ("2-sequence problem"). The task is to observe and then classify input sequences. There are two classes, each occurring with probability 0.5. There is only one input line. Only the first N real-valued sequence elements convey relevant information about the class. Sequence elements at positions t > N are generated by a Gaussian with mean zero and variance 0.2. Case N = 1: the first sequence element is 1.0 for class 1, and -1.0 for class 2. Case N = 3: the first three elements are 1.0 for class 1 and -1.0 for class 2. The target at the sequence end is 1.0 for class 1 and 0.0 for class 2. Correct classification is defined as "absolute output error at sequence end below 0.2". Given a constant T, the sequence length is randomly selected between T and T + T/10 (a difference to Bengio et al.'s problem is that they also permit shorter sequences of length T/2).
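A generator for this task might look as follows (our sketch; note that random.gauss takes a standard deviation, hence sqrt(0.2) for variance 0.2):

    import random

    def task_3a_sequence(T, N=1):
        # returns (sequence, target); the class label is drawn with probability 0.5
        cls = random.choice([1, 2])
        length = random.randint(T, T + T // 10)
        head = [1.0 if cls == 1 else -1.0] * N                     # informative elements
        tail = [random.gauss(0.0, 0.2 ** 0.5) for _ in range(length - N)]
        target = 1.0 if cls == 1 else 0.0                          # target at sequence end
        return head + tail, target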
Guessing. Bengio et al. (1994) and Bengio and Frasconi (1994) tested 7 different methods on the 2-sequence problem. We discovered, however, that random weight guessing easily outperforms them all, because the problem is so simple (footnote 5). See Schmidhuber and Hochreiter (1996) and Hochreiter and Schmidhuber (1996, 1997) for additional results in this vein.

    T     N  stop: ST1  stop: ST2  # weights  ST2: fraction misclassified
    100   3  27,380     39,850     102        0.000195
    100   1  58,370     64,330     102        0.000117
    1000  3  446,850    452,460    102        0.000078

Table 4: Task 3a: Bengio et al.'s 2-sequence problem. T is minimal sequence length. N is the number of information-conveying elements at sequence begin. The column headed by ST1 (ST2) gives the number of sequence presentations required to achieve stopping criterion ST1 (ST2). The rightmost column lists the fraction of misclassified post-training sequences (with absolute error > 0.2) from a test set consisting of 2560 sequences (tested after ST2 was achieved). All values are means of 10 trials. We discovered, however, that this problem is so simple that random weight guessing solves it faster than LSTM and any other method for which there are published results.

Footnote 5: It should be mentioned, however, that different input representations and different types of noise may lead to worse guessing performance (Yoshua Bengio, personal communication, 1996).
LSTM architecture.
We use a 3-layer net with 1 input unit, 1 output unit, and 3 cell blo cks
of size 1. The output layer receives connections only from memory cells. Memory cells and gate
units receive inputs from input units, memory cells and gate units, and have bias weights. Gate
units and output unit are logistic sigmoid in [0
;
1],
h
in [
1
;
1], and
g
in [
2
;
2].
Training/Testing. All weights (except the bias weights to gate units) are randomly initialized
in the range [-0.1, 0.1]. The first input gate bias is initialized with -1.0, the second with -3.0,
and the third with -5.0. The first output gate bias is initialized with -2.0, the second with -4.0,
and the third with -6.0. The precise initialization values hardly matter though, as confirmed by
additional experiments. The learning rate is 1.0. All activations are reset to zero at the beginning
of a new sequence.
We stop training (and judge the task as being solved) according to the following criteria: ST1:
none of 256 sequences from a randomly chosen test set is misclassified. ST2: ST1 is satisfied, and
mean absolute test set error is below 0.01. In case of ST2, an additional test set consisting of 2560
randomly chosen sequences is used to determine the fraction of misclassified sequences.
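The two criteria are straightforward to express in code; a small sketch (names are ours):

```python
def st1(n_misclassified):
    """ST1: none of the 256 test sequences is misclassified."""
    return n_misclassified == 0

def st2(n_misclassified, mean_abs_error):
    """ST2: ST1 holds and mean absolute test set error is below 0.01."""
    return st1(n_misclassified) and mean_abs_error < 0.01
```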
Results. See Table 4. The results are means of 10 trials with different weight initializations
in the range [-0.1, 0.1]. LSTM is able to solve this problem, though not nearly as fast as random
weight guessing (see paragraph "Guessing" above). Clearly, this trivial problem does not provide a
very good testbed for comparing the performance of various non-trivial algorithms. Still, it demonstrates
that LSTM does not encounter fundamental problems when faced with signal and noise on the
same channel.
Task 3b. Architecture, parameters, etc. like in Task 3a, but now with Gaussian noise (mean
0 and variance 0.2) added to the information-conveying elements (t <= N). We stop training
(and judge the task as being solved) according to the following, slightly redefined criteria: ST1:
less than 6 out of 256 sequences from a randomly chosen test set are misclassified. ST2: ST1 is
satisfied, and mean absolute test set error is below 0.04. In case of ST2, an additional test set
consisting of 2560 randomly chosen sequences is used to determine the fraction of misclassified
sequences.
Results. See Table 5. The results represent means of 10 trials with different weight initializations.
LSTM easily solves the problem.
T     N   stop: ST1   stop: ST2   # weights   ST2: fraction misclassified
100   3   41,740      43,250      102         0.00828
100   1   74,950      78,430      102         0.01500
1000  1   481,060     485,080     102         0.01207

Table 5: Task 3b: modified 2-sequence problem. Same as in Table 4, but now the information-conveying
elements are also perturbed by noise.

T     N   stop      # weights   fraction misclassified   av. difference to mean
100   3   269,650   102         0.00558                  0.014
100   1   565,640   102         0.00441                  0.012

Table 6: Task 3c: modified, more challenging 2-sequence problem. Same as in Table 4, but with
noisy real-valued targets. The system has to learn the conditional expectations of the targets given
the inputs. The rightmost column provides the average difference between network output and
expected target. Unlike 3a and 3b, this task cannot be solved quickly by random weight guessing.

Task 3c. Architecture, parameters, etc. like in Task 3a, but with a few essential changes that
make the task non-trivial: the targets are 0.2 and 0.8 for class 1 and class 2, respectively, and
there is Gaussian noise on the targets (mean 0 and variance 0.1; st.dev. 0.32). To minimize mean
squared error, the system has to learn the conditional expectations of the targets given the inputs.
Misclassification is defined as "absolute difference between output and noise-free target (0.2 for
class 1 and 0.8 for class 2) > 0.1". The network output is considered acceptable if the mean
absolute difference between noise-free target and output is below 0.015. Since this requires high
weight precision, Task 3c (unlike 3a and 3b) cannot be solved quickly by random guessing.
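The claim about conditional expectations can be made explicit with a standard decomposition (our addition, not in the original text): for input $x$ and noisy target $t$,

$$E\big[(t - y)^2 \mid x\big] = \mathrm{Var}(t \mid x) + \big(E[t \mid x] - y\big)^2,$$

so the output minimizing mean squared error is exactly $y = E[t \mid x]$; for Task 3c this is 0.2 or 0.8, depending on the class.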
Training/Testing. The learning rate is 0.1. We stop training according to the following
criterion: none of 256 sequences from a randomly chosen test set is misclassified, and mean
absolute difference between noise-free target and output is below 0.015. An additional test set
consisting of 2560 randomly chosen sequences is used to determine the fraction of misclassified
sequences.
Results. See Table 6. The results represent means of 10 trials with different weight initializations.
Despite the noisy targets, LSTM still can solve the problem by learning the expected
target values.
5.4 EXPERIMENT 4: ADDING PROBLEM
The difficult task in this section is of a type that has never been solved by other recurrent net
algorithms. It shows that LSTM can solve long time lag problems involving distributed, continuous-valued
representations.
Task. Each element of each input sequence is a pair of components. The first component
is a real value randomly chosen from the interval [-1, 1]; the second is either 1.0, 0.0, or -1.0,
and is used as a marker: at the end of each sequence, the task is to output the sum of the first
components of those pairs that are marked by second components equal to 1.0. Sequences have
random lengths between the minimal sequence length T and T + T/10. In a given sequence exactly
two pairs are marked as follows: we first randomly select and mark one of the first ten pairs
(whose first component we call X1). Then we randomly select and mark one of the first T/2 - 1
still unmarked pairs (whose first component we call X2). The second components of all remaining
pairs are zero except for the first and final pair, whose second components are -1. (In the rare case
where the first pair of the sequence gets marked, we set X1 to zero.) An error signal is generated
only at the sequence end: the target is 0.5 + (X1 + X2)/4.0 (the sum X1 + X2 scaled to the
interval [0, 1]). A sequence is processed correctly if the absolute error at the sequence end is below 0.04.
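A minimal sketch of the corresponding sequence generator (our own illustration; helper names are assumptions, and the rare first-pair case is handled as described in the text for the pair selected among the first ten):

```python
import random

def adding_problem_example(T):
    """Sketch of one Adding Problem sequence of (value, marker) pairs."""
    length = random.randint(T, T + T // 10)
    values = [random.uniform(-1.0, 1.0) for _ in range(length)]
    markers = [0.0] * length
    markers[0] = markers[-1] = -1.0                 # first and final pair carry -1
    i = random.randrange(10)                        # mark one of the first ten pairs
    j = random.choice([p for p in range(T // 2 - 1) if p != i])
    markers[i] = markers[j] = 1.0                   # the two marked pairs
    x1 = 0.0 if i == 0 else values[i]               # rare case: first pair marked -> X1 := 0
    x2 = values[j]
    target = 0.5 + (x1 + x2) / 4.0                  # sum scaled to [0, 1]
    return list(zip(values, markers)), target
```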
Architecture. We use a 3-layer net with 2 input units, 1 output unit, and 2 cell blocks of size
2. The output layer receives connections only from memory cells. Memory cells and gate units
receive inputs from memory cells and gate units (i.e., the hidden layer is fully connected; less
connectivity may work as well). The input layer has forward connections to all units in the hidden
layer. All non-input units have bias weights. These architecture parameters make it easy to store
at least 2 input signals (a cell block size of 1 works well, too). All activation functions are logistic
with output range [0, 1], except for h, whose range is [-1, 1], and g, whose range is [-2, 2].

T     minimal lag   # weights   # wrong predictions   Success after
100   50            93          1 out of 2560         74,000
500   250           93          0 out of 2560         209,000
1000  500           93          1 out of 2560         853,000

Table 7: EXPERIMENT 4: Results for the Adding Problem. T is the minimal sequence length,
T/2 the minimal time lag. "# wrong predictions" is the number of incorrectly processed sequences
(error > 0.04) from a test set containing 2560 sequences. The rightmost column gives the number
of training sequences required to achieve the stopping criterion. All values are means of 10 trials.
For T = 1000 the number of required training examples varies between 370,000 and 2,020,000,
exceeding 700,000 in only 3 cases.
State drift versus initial bias. Note that the task requires storing the precise values of
real numbers for long durations; the system must learn to protect memory cell contents against
even minor internal state drift (see Section 4). To study the significance of the drift problem,
we make the task even more difficult by biasing all non-input units, thus artificially inducing
internal state drift. All weights (including the bias weights) are randomly initialized in the range
[-0.1, 0.1]. Following Section 4's remedy for state drifts, the first input gate bias is initialized with
-3.0, the second with -6.0 (though the precise values hardly matter, as confirmed by additional
experiments).
Training/Testing.
The learning rate is 0.5. Training is stopped once the average training
error is below 0.01, and the 2000 most recent sequences were processed correctly.
Results. With a test set consisting of 2560 randomly chosen sequences, the average test set
error was always below 0.01, and there were never more than 3 incorrectly processed sequences.
Table 7 shows details.
The experiment demonstrates: (1) LSTM is able to work well with distributed representations.
(2) LSTM is able to learn to perform calculations involving continuous values. (3) Since the system
manages to store continuous values without deterioration for minimal delays of T/2 time steps, there
is no significant, harmful internal state drift.
5.5 EXPERIMENT 5: MULTIPLICATION PROBLEM
One may argue that LSTM is a bit biased towards tasks such as the Adding Problem from the
previous subsection. Solutions to the Adding Problem may exploit the CEC's built-in integration
capabilities. Although this CEC property may be viewed as a feature rather than a disadvantage
(integration seems to be a natural subtask of many tasks occurring in the real world), the question
arises whether LSTM can also solve tasks with inherently non-integrative solutions. To test this,
we change the problem by requiring the final target to equal the product (instead of the sum) of
earlier marked inputs.
Task. Like the task in Section 5.4, except that the first component of each pair is a real value
randomly chosen from the interval [0, 1]. In the rare case where the first pair of the input sequence
gets marked, we set X1 to 1.0. The target at sequence end is the product X1 * X2.
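The generator differs from the Adding Problem sketch above only in the value interval, the rare-case default, and the final target; for completeness (again our own illustration):

```python
import random

def multiplication_problem_example(T):
    """Sketch of one Multiplication Problem sequence of (value, marker) pairs."""
    length = random.randint(T, T + T // 10)
    values = [random.uniform(0.0, 1.0) for _ in range(length)]  # values now in [0, 1]
    markers = [0.0] * length
    markers[0] = markers[-1] = -1.0
    i = random.randrange(10)
    j = random.choice([p for p in range(T // 2 - 1) if p != i])
    markers[i] = markers[j] = 1.0
    x1 = 1.0 if i == 0 else values[i]            # rare case: first pair marked -> X1 := 1.0
    target = x1 * values[j]                      # product instead of sum
    return list(zip(values, markers)), target
```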
Architecture. Like in Section 5.4. All weights (including the bias weights) are randomly
initialized in the range [-0.1, 0.1].
Training/Testing. The learning rate is 0.1. We test performance twice: as soon as less
than n_seq of the 2000 most recent training sequences lead to absolute errors exceeding 0.04, where
n_seq = 140, and n_seq = 13. Why these values? n_seq = 140 is sufficient to learn storage of the
relevant inputs. It is not enough though to fine-tune the precise final outputs. n_seq = 13, however,
leads to quite satisfactory results.

T    minimal lag   # weights   n_seq   # wrong predictions   MSE      Success after
100  50            93          140     139 out of 2560       0.0223   482,000
100  50            93          13      14 out of 2560        0.0139   1,273,000

Table 8: EXPERIMENT 5: Results for the Multiplication Problem. T is the minimal sequence
length, T/2 the minimal time lag. We test on a test set containing 2560 sequences as soon as less
than n_seq of the 2000 most recent training sequences lead to error > 0.04. "# wrong predictions"
is the number of test sequences with error > 0.04. MSE is the mean squared error on the test
set. The rightmost column lists numbers of training sequences required to achieve the stopping
criterion. All values are means of 10 trials.
Results. For n_seq = 140 (n_seq = 13), with a test set consisting of 2560 randomly chosen
sequences, the average test set error was always below 0.026 (0.013), and there were never more
than 170 (15) incorrectly processed sequences. Table 8 shows details. (A net with additional
standard hidden units or with a hidden layer above the memory cells may learn the fine-tuning
part more quickly.)
The experiment demonstrates: LSTM can solve tasks involving both continuous-valued representations
and non-integrative information processing.
5.6 EXPERIMENT 6: TEMPORAL ORDER
In this subsection, LSTM solves other difficult (but artificial) tasks that have never been solved by
previous recurrent net algorithms. The experiment shows that LSTM is able to extract information
conveyed by the temporal order of widely separated inputs.
Task 6a: two relevant, widely separated symbols. The goal is to classify sequences.
Elements and targets are represented locally (input vectors with only one non-zero bit). The
sequence starts with an E, ends with a B (the "trigger symbol"), and otherwise consists of randomly
chosen symbols from the set {a, b, c, d} except for two elements at positions t1 and t2 that are either
X or Y. The sequence length is randomly chosen between 100 and 110, t1 is randomly chosen
between 10 and 20, and t2 is randomly chosen between 50 and 60. There are 4 sequence classes
Q, R, S, U which depend on the temporal order of X and Y. The rules are:
X, X → Q; X, Y → R; Y, X → S; Y, Y → U.
Task 6b: three relevant, widely separated symbols. Again, the goal is to classify
sequences. Elements/targets are represented locally. The sequence starts with an E, ends with
a B (the "trigger symbol"), and otherwise consists of randomly chosen symbols from the set
{a, b, c, d} except for three elements at positions t1, t2 and t3 that are either X or Y. The sequence
length is randomly chosen between 100 and 110, t1 is randomly chosen between 10 and 20, t2 is
randomly chosen between 33 and 43, and t3 is randomly chosen between 66 and 76. There are 8
sequence classes Q, R, S, U, V, A, B, C which depend on the temporal order of the Xs and Ys. The
rules are: X, X, X → Q; X, X, Y → R; X, Y, X → S; X, Y, Y → U; Y, X, X → V; Y, X, Y → A;
Y, Y, X → B; Y, Y, Y → C.
There are as many output units as there are classes. Each class is locally represented by a
binary target vector with one non-zero component. With both tasks, error signals occur only at
the end of a sequence. The sequence is classified correctly if the final absolute error of all output
units is below 0.3.
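A minimal sketch of a generator for both tasks (our own illustration; the class is returned as the tuple of relevant symbols rather than as the locally coded target vector):

```python
import random

def temporal_order_example(task="6a"):
    """Sketch of one Task 6a/6b sequence; the class is the temporal
    order of the X/Y symbols at the relevant positions."""
    length = random.randint(100, 110)
    seq = [random.choice("abcd") for _ in range(length)]
    seq[0], seq[-1] = "E", "B"                   # start symbol and trigger symbol
    if task == "6a":
        positions = [random.randint(10, 20), random.randint(50, 60)]
    else:                                        # task 6b
        positions = [random.randint(10, 20), random.randint(33, 43),
                     random.randint(66, 76)]
    relevant = [random.choice("XY") for _ in positions]
    for pos, sym in zip(positions, relevant):
        seq[pos] = sym
    return seq, tuple(relevant)                  # e.g. ('X', 'Y') -> class R
```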
Architecture. We use a 3-layer net with 8 input units, 2 (3) cell blocks of size 2, and 4
(8) output units for Task 6a (6b). Again all non-input units have bias weights, and the output
layer receives connections from memory cells only. Memory cells and gate units receive inputs
from input units, memory cells and gate units (i.e., the hidden layer is fully connected; less
connectivity may work as well). The architecture parameters for Task 6a (6b) make it easy to
store at least 2 (3) input signals. All activation functions are logistic with output range [0, 1],
except for h, whose range is [-1, 1], and g, whose range is [-2, 2].
Training/Testing. The learning rate is 0.5 (0.1) for Experiment 6a (6b). Training is stopped
once the average training error falls below 0.1 and the 2000 most recent sequences were classified
correctly. All weights are initialized in the range [-0.1, 0.1]. The first input gate bias is initialized
with -2.0, the second with -4.0, and (for Experiment 6b) the third with -6.0 (again, we confirmed
by additional experiments that the precise values hardly matter).
Results. With a test set consisting of 2560 randomly chosen sequences, the average test set
error was always below 0.1, and there were never more than 3 incorrectly classified sequences.
Table 9 shows details.
The experiment shows that LSTM is able to extract information conveyed by the temporal
order of widely separated inputs. In Task 6a, for instance, the delays between first and second
relevant input and between second relevant input and sequence end are at least 30 time steps.

task      # weights   # wrong predictions   Success after
Task 6a   156         1 out of 2560         31,390
Task 6b   308         2 out of 2560         571,100

Table 9: EXPERIMENT 6: Results for the Temporal Order Problem. "# wrong predictions" is
the number of incorrectly classified sequences (error > 0.3 for at least one output unit) from a
test set containing 2560 sequences. The rightmost column gives the number of training sequences
required to achieve the stopping criterion. The results for Task 6a are means of 20 trials; those
for Task 6b of 10 trials.
Typical solutions. In Experiment 6a, how does LSTM distinguish between temporal orders
(X, Y) and (Y, X)? One of many possible solutions is to store the first X or Y in cell block 1, and
the second X/Y in cell block 2. Before the first X/Y occurs, block 1 can see that it is still empty
by means of its recurrent connections. After the first X/Y, block 1 can close its input gate. Once
block 1 is filled and closed, this fact will become visible to block 2 (recall that all gate units and
all memory cells receive connections from all non-output units).
Typical solutions, however, require only one memory cell block. The block stores the first X
or Y; once the second X/Y occurs, it changes its state depending on the first stored symbol.
Solution type 1 exploits the connection between memory cell output and input gate unit: the
following events cause different input gate activations: "X occurs in conjunction with a filled
block"; "X occurs in conjunction with an empty block". Solution type 2 is based on a strong
positive connection between memory cell output and memory cell input. The previous occurrence
of X (Y) is represented by a positive (negative) internal state. Once the input gate opens for the
second time, so does the output gate, and the memory cell output is fed back to its own input.
This causes (X, Y) to be represented by a positive internal state, because X contributes to the
new internal state twice (via current internal state and cell output feedback). Similarly, (Y, X)
gets represented by a negative internal state.
5.7 SUMMARY OF EXPERIMENTAL CONDITIONS
The two tables in this subsection provide an overview of the most important LSTM parameters
and architectural details for Experiments 1-6. The conditions of the simple experiments 2a and
2b differ slightly from those of the other, more systematic experiments, due to historical reasons.
1     2     3     4  5  6     7    8      9  10           11        12    13  14  15
Task  p     lag   b  s  in    out  w      c  ogb          igb       bias  h   g   alpha
1-1   9     9     4  1  7     7    264    F  -1,-2,-3,-4  r         ga    h1  g2  0.1
1-2   9     9     3  2  7     7    276    F  -1,-2,-3     r         ga    h1  g2  0.1
1-3   9     9     3  2  7     7    276    F  -1,-2,-3     r         ga    h1  g2  0.2
1-4   9     9     4  1  7     7    264    F  -1,-2,-3,-4  r         ga    h1  g2  0.5
1-5   9     9     3  2  7     7    276    F  -1,-2,-3     r         ga    h1  g2  0.5
2a    100   100   1  1  101   101  10504  B  no og        none      none  id  g1  1.0
2b    100   100   1  1  101   101  10504  B  no og        none      none  id  g1  1.0
2c-1  50    50    2  1  54    2    364    F  none         none      none  h1  g2  0.01
2c-2  100   100   2  1  104   2    664    F  none         none      none  h1  g2  0.01
2c-3  200   200   2  1  204   2    1264   F  none         none      none  h1  g2  0.01
2c-4  500   500   2  1  504   2    3064   F  none         none      none  h1  g2  0.01
2c-5  1000  1000  2  1  1004  2    6064   F  none         none      none  h1  g2  0.01
2c-6  1000  1000  2  1  504   2    3064   F  none         none      none  h1  g2  0.01
2c-7  1000  1000  2  1  204   2    1264   F  none         none      none  h1  g2  0.01
2c-8  1000  1000  2  1  104   2    664    F  none         none      none  h1  g2  0.01
2c-9  1000  1000  2  1  54    2    364    F  none         none      none  h1  g2  0.01
3a    100   100   3  1  1     1    102    F  -2,-4,-6     -1,-3,-5  b1    h1  g2  1.0
3b    100   100   3  1  1     1    102    F  -2,-4,-6     -1,-3,-5  b1    h1  g2  1.0
3c    100   100   3  1  1     1    102    F  -2,-4,-6     -1,-3,-5  b1    h1  g2  0.1
4-1   100   50    2  2  2     1    93     F  r            -3,-6     all   h1  g2  0.5
4-2   500   250   2  2  2     1    93     F  r            -3,-6     all   h1  g2  0.5
4-3   1000  500   2  2  2     1    93     F  r            -3,-6     all   h1  g2  0.5
5     100   50    2  2  2     1    93     F  r            r         all   h1  g2  0.1
6a    100   40    2  2  8     4    156    F  r            -2,-4     all   h1  g2  0.5
6b    100   24    3  2  8     8    308    F  r            -2,-4,-6  all   h1  g2  0.1
Table 10: Summary of experimental conditions for LSTM, Part I. 1st column: task number. 2nd
column: minimal sequence length p. 3rd column: minimal number of steps between most recent
relevant input information and teacher signal. 4th column: number of cell blocks b. 5th column:
block size s. 6th column: number of input units in. 7th column: number of output units out. 8th
column: number of weights w. 9th column: c describes connectivity: "F" means "output layer
receives connections from memory cells; memory cells and gate units receive connections from
input units, memory cells and gate units"; "B" means "each layer receives connections from all
layers below". 10th column: initial output gate bias ogb, where "r" stands for "randomly chosen
from the interval [-0.1, 0.1]" and "no og" means "no output gate used". 11th column: initial input
gate bias igb (see 10th column). 12th column: which units have bias weights? "b1" stands for "all
hidden units", "ga" for "only gate units", and "all" for "all non-input units". 13th column: the
function h, where "id" is the identity function, "h1" is the logistic sigmoid in [-1, 1] of equation (4).
14th column: the function g, where "g1" is the logistic sigmoid in [0, 1], "g2" the sigmoid in [-2, 2]
of equation (5). 15th column: learning rate alpha.
1     2       3             4              5                                 6
Task  select  interval      test set size  stopping criterion                success
1     t1      [-0.2, 0.2]   256            training & test correctly pred.   see text
2a    t1      [-0.2, 0.2]   no test set    after 5 million exemplars         ABS(0.25)
2b    t2      [-0.2, 0.2]   10000          after 5 million exemplars         ABS(0.25)
2c    t2      [-0.2, 0.2]   10000          after 5 million exemplars         ABS(0.2)
3a    t3      [-0.1, 0.1]   2560           ST1 and ST2 (see text)            ABS(0.2)
3b    t3      [-0.1, 0.1]   2560           ST1 and ST2 (see text)            ABS(0.2)
3c    t3      [-0.1, 0.1]   2560           ST1 and ST2 (see text)            see text
4     t3      [-0.1, 0.1]   2560           ST3(0.01)                         ABS(0.04)
5     t3      [-0.1, 0.1]   2560           see text                          ABS(0.04)
6a    t3      [-0.1, 0.1]   2560           ST3(0.1)                          ABS(0.3)
6b    t3      [-0.1, 0.1]   2560           ST3(0.1)                          ABS(0.3)
Table 11: Summary of experimental conditions for LSTM, Part II. 1st column: task number.
2nd column: training exemplar selection, where "t1" stands for "randomly chosen from training
set", "t2" for "randomly chosen from 2 classes", and "t3" for "randomly generated on-line". 3rd
column: weight initialization interval. 4th column: test set size. 5th column: stopping criterion
for training, where "ST3(ε)" stands for "average training error below ε and the 2000 most recent
sequences were processed correctly". 6th column: success (correct classification) criterion, where
"ABS(ε)" stands for "absolute error of all output units at sequence end is below ε".
6 DISCUSSION
Limitations of LSTM.
- The particularly efficient truncated backprop version of the LSTM algorithm will not easily
solve problems similar to "strongly delayed XOR problems", where the goal is to compute the
XOR of two widely separated inputs that previously occurred somewhere in a noisy sequence.
The reason is that storing only one of the inputs will not help to reduce the expected error;
the task is non-decomposable in the sense that it is impossible to incrementally reduce
the error by first solving an easier subgoal.
  In theory, this limitation can be circumvented by using the full gradient (perhaps with
additional conventional hidden units receiving input from the memory cells). But we do not
recommend computing the full gradient for the following reasons: (1) It increases computational
complexity. (2) Constant error flow through CECs can be shown only for truncated
LSTM. (3) We actually did conduct a few experiments with non-truncated LSTM. There
was no significant difference to truncated LSTM, exactly because outside the CECs error
flow tends to vanish quickly. For the same reason full BPTT does not outperform truncated
BPTT.
- Each memory cell block needs two additional units (input and output gate). In comparison
to standard recurrent nets, however, this does not increase the number of weights by more
than a factor of 9: each conventional hidden unit is replaced by at most 3 units in the
LSTM architecture, increasing the number of weights by a factor of 3^2 in the fully connected
case. Note, however, that our experiments use quite comparable weight numbers for the
architectures of LSTM and competing approaches.
- Generally speaking, due to its constant error flow through CECs within memory cells, LSTM
runs into problems similar to those of feedforward nets seeing the entire input string at once.
For instance, there are tasks that can be quickly solved by random weight guessing but not by
the truncated LSTM algorithm with small weight initializations, such as the 500-step parity
problem (see introduction to Section 5). Here, LSTM's problems are similar to those of
a feedforward net with 500 inputs, trying to solve 500-bit parity. Indeed LSTM typically
behaves much like a feedforward net trained by backprop that sees the entire input. But
that is also precisely why it so clearly outperforms previous approaches on many non-trivial
tasks with significant search spaces.
- LSTM does not have any problems with the notion of "recency" that go beyond those of
other approaches. All gradient-based approaches, however, suffer from a practical inability to
precisely count discrete time steps. If it makes a difference whether a certain signal occurred
99 or 100 steps ago, then an additional counting mechanism seems necessary. Easier tasks,
however, such as one that only requires distinguishing between, say, 3 and 11 steps,
do not pose any problems to LSTM. For instance, by generating an appropriate negative
connection between memory cell output and input, LSTM can give more weight to recent
inputs and learn decays where necessary.
Advantages of LSTM.
- The constant error backpropagation within memory cells results in LSTM's ability to bridge
very long time lags in case of problems similar to those discussed above.
- For long time lag problems such as those discussed in this paper, LSTM can handle noise,
distributed representations, and continuous values. In contrast to finite state automata or
hidden Markov models, LSTM does not require an a priori choice of a finite number of states.
In principle it can deal with unlimited state numbers.
- For problems discussed in this paper LSTM generalizes well, even if the positions of widely
separated, relevant inputs in the input sequence do not matter. Unlike previous approaches,
ours quickly learns to distinguish between two or more widely separated occurrences of a
particular element in an input sequence, without depending on appropriate short time lag
training exemplars.
- There appears to be no need for parameter fine tuning. LSTM works well over a broad range
of parameters such as learning rate, input gate bias and output gate bias. For instance, to
some readers the learning rates used in our experiments may seem large. However, a large
learning rate pushes the output gates towards zero, thus automatically countermanding its
own negative effects.
- The LSTM algorithm's update complexity per weight and time step is essentially that of
BPTT, namely O(1). This is excellent in comparison to other approaches such as RTRL.
Unlike full BPTT, however, LSTM is local in both space and time.
7 CONCLUSION
Each memory cell's internal architecture guarantees constant error flow within its constant error
carrousel CEC, provided that truncated backprop cuts off error flow trying to leak out of memory
cells. This represents the basis for bridging very long time lags. Two gate units learn to open and
close access to error flow within each memory cell's CEC. The multiplicative input gate affords
protection of the CEC from perturbation by irrelevant inputs. Likewise, the multiplicative output
gate protects other units from perturbation by currently irrelevant memory contents.
Future work. To find out about LSTM's practical limitations we intend to apply it to real
world data. Application areas will include (1) time series prediction, (2) music composition, and
(3) speech processing. It will also be interesting to augment sequence chunkers (Schmidhuber
1992b, 1993) by LSTM to combine the advantages of both.
8 ACKNOWLEDGMENTS
Thanks to Mike Mozer, Wilfried Brauer, Nic Schraudolph, and several anonymous referees for
valuable comments and suggestions that helped to improve a previous version of this paper (Hochreiter
and Schmidhuber 1995). This work was supported by DFG grant SCHM 942/3-1 from "Deutsche
Forschungsgemeinschaft".
APPENDIX
A.1 ALGORITHM DETAILS
In what follows, the index $k$ ranges over output units, $i$ ranges over hidden units, $c_j$ stands for
the $j$-th memory cell block, $c_j^v$ denotes the $v$-th unit of memory cell block $c_j$, $u, l, m$ stand for
arbitrary units, and $t$ ranges over all time steps of a given input sequence.
The gate unit logistic sigmoid (with range $[0,1]$) used in the experiments is
$$f(x) = \frac{1}{1 + e^{-x}}. \tag{3}$$
The function $h$ (with range $[-1,1]$) used in the experiments is
$$h(x) = \frac{2}{1 + e^{-x}} - 1. \tag{4}$$
The function $g$ (with range $[-2,2]$) used in the experiments is
$$g(x) = \frac{4}{1 + e^{-x}} - 2. \tag{5}$$
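In code, equations (3)-(5) read as follows (a direct transcription):

```python
import math

def f(x):
    """Gate/output unit logistic sigmoid, range [0, 1] (eq. 3)."""
    return 1.0 / (1.0 + math.exp(-x))

def h(x):
    """Memory cell output squashing function, range [-1, 1] (eq. 4)."""
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

def g(x):
    """Memory cell input squashing function, range [-2, 2] (eq. 5)."""
    return 4.0 / (1.0 + math.exp(-x)) - 2.0
```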
Forward pass. The net input and the activation of hidden unit $i$ are
$$net_i(t) = \sum_u w_{iu}\, y^u(t-1), \qquad y^i(t) = f_i(net_i(t)). \tag{6}$$
The net input and the activation of $in_j$ are
$$net_{in_j}(t) = \sum_u w_{in_j u}\, y^u(t-1), \qquad y^{in_j}(t) = f_{in_j}(net_{in_j}(t)). \tag{7}$$
The net input and the activation of $out_j$ are
$$net_{out_j}(t) = \sum_u w_{out_j u}\, y^u(t-1), \qquad y^{out_j}(t) = f_{out_j}(net_{out_j}(t)). \tag{8}$$
The net input $net_{c_j^v}$, the internal state $s_{c_j^v}$, and the output activation $y^{c_j^v}$ of the $v$-th memory
cell of memory cell block $c_j$ are:
$$net_{c_j^v}(t) = \sum_u w_{c_j^v u}\, y^u(t-1), \tag{9}$$
$$s_{c_j^v}(t) = s_{c_j^v}(t-1) + y^{in_j}(t)\, g\!\left(net_{c_j^v}(t)\right),$$
$$y^{c_j^v}(t) = y^{out_j}(t)\, h\!\left(s_{c_j^v}(t)\right).$$
The net input and the activation of output unit $k$ are
$$net_k(t) = \sum_{u:\ u \text{ not a gate}} w_{ku}\, y^u(t-1), \qquad y^k(t) = f_k(net_k(t)).$$
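To make the data flow concrete, a sketch of one forward step for a single memory cell (block size 1), following equations (7)-(9); the dictionary-based weight representation is our own simplification, and f, g, h are as defined above:

```python
def lstm_cell_forward_step(w_in, w_out, w_c, y_prev, s_prev):
    """One forward step of a single LSTM memory cell.
    y_prev maps source unit names to activations at time t-1;
    s_prev is the internal state at t-1."""
    net_in  = sum(w_in[u]  * y_prev[u] for u in y_prev)   # eq. (7): input gate net input
    net_out = sum(w_out[u] * y_prev[u] for u in y_prev)   # eq. (8): output gate net input
    net_c   = sum(w_c[u]   * y_prev[u] for u in y_prev)   # eq. (9): cell net input
    y_in, y_out = f(net_in), f(net_out)                   # gate activations
    s = s_prev + y_in * g(net_c)                          # additive CEC state update
    y_c = y_out * h(s)                                    # gated cell output
    return y_c, s
```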
The backward pass to be described later is based on the following truncated backprop formulae.
Approximate derivatives for truncated backprop. The truncated version (see Section 4)
only approximates the partial derivatives, which is reflected by the "$\approx_{tr}$" signs in the notation
below. It truncates error flow once it leaves memory cells or gate units. Truncation ensures that
there are no loops across which an error that left some memory cell through its input or input
gate can reenter the cell through its output or output gate. This in turn ensures constant error
flow through the memory cell's CEC.
In the truncated backprop version, the following derivatives are replaced by zero:
$$\frac{\partial net_{in_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0 \quad \forall u, \qquad
\frac{\partial net_{out_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0 \quad \forall u, \qquad
\frac{\partial net_{c_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0 \quad \forall u.$$
Therefore we get
$$\frac{\partial y^{in_j}(t)}{\partial y^u(t-1)} = f'_{in_j}(net_{in_j}(t))\, \frac{\partial net_{in_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0 \quad \forall u,$$
$$\frac{\partial y^{out_j}(t)}{\partial y^u(t-1)} = f'_{out_j}(net_{out_j}(t))\, \frac{\partial net_{out_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0 \quad \forall u,$$
and
$$\frac{\partial y^{c_j}(t)}{\partial y^u(t-1)} =
\frac{\partial y^{c_j}(t)}{\partial net_{out_j}(t)} \frac{\partial net_{out_j}(t)}{\partial y^u(t-1)}
+ \frac{\partial y^{c_j}(t)}{\partial net_{in_j}(t)} \frac{\partial net_{in_j}(t)}{\partial y^u(t-1)}
+ \frac{\partial y^{c_j}(t)}{\partial net_{c_j}(t)} \frac{\partial net_{c_j}(t)}{\partial y^u(t-1)}
\approx_{tr} 0 \quad \forall u.$$
This implies for all $w_{lm}$ not on connections to $c_j^v, in_j, out_j$ (that is, $l \notin \{c_j^v, in_j, out_j\}$):
$$\frac{\partial y^{c_j^v}(t)}{\partial w_{lm}} = \sum_u \frac{\partial y^{c_j^v}(t)}{\partial y^u(t-1)} \frac{\partial y^u(t-1)}{\partial w_{lm}} \approx_{tr} 0.$$
The truncated derivatives of output unit $k$ are:
$$\frac{\partial y^k(t)}{\partial w_{lm}} = f'_k(net_k(t)) \left( \sum_{u:\ u \text{ not a gate}} w_{ku} \frac{\partial y^u(t-1)}{\partial w_{lm}} + \delta_{kl}\, y^m(t-1) \right) \approx_{tr} \tag{10}$$
$$f'_k(net_k(t)) \left( \sum_j \sum_{v=1}^{S_j} \delta_{c_j^v l}\, w_{k c_j^v} \frac{\partial y^{c_j^v}(t-1)}{\partial w_{lm}}
+ \sum_j \left( \delta_{in_j l} + \delta_{out_j l} \right) \sum_{v=1}^{S_j} w_{k c_j^v} \frac{\partial y^{c_j^v}(t-1)}{\partial w_{lm}}
+ \sum_{i:\ i \text{ hidden unit}} w_{ki} \frac{\partial y^i(t-1)}{\partial w_{lm}} + \delta_{kl}\, y^m(t-1) \right)$$
$$= f'_k(net_k(t)) \begin{cases}
y^m(t-1) & l = k \\[2pt]
w_{k c_j^v} \frac{\partial y^{c_j^v}(t-1)}{\partial w_{lm}} & l = c_j^v \\[2pt]
\sum_{v=1}^{S_j} w_{k c_j^v} \frac{\partial y^{c_j^v}(t-1)}{\partial w_{lm}} & l = in_j \text{ or } l = out_j \\[2pt]
\sum_{i:\ i \text{ hidden unit}} w_{ki} \frac{\partial y^i(t-1)}{\partial w_{lm}} & l \text{ otherwise,}
\end{cases}$$
where $\delta$ is the Kronecker delta ($\delta_{ab} = 1$ if $a = b$ and 0 otherwise), and $S_j$ is the size of memory
cell block $c_j$. The truncated derivatives of a hidden unit $i$ that is not part of a memory cell are:
$$\frac{\partial y^i(t)}{\partial w_{lm}} = f'_i(net_i(t))\, \frac{\partial net_i(t)}{\partial w_{lm}} \approx_{tr} \delta_{li}\, f'_i(net_i(t))\, y^m(t-1). \tag{11}$$
(Note: here it would be possible to use the full gradient without affecting constant error flow
through internal states of memory cells.)
Cell block $c_j$'s truncated derivatives are:
$$\frac{\partial y^{in_j}(t)}{\partial w_{lm}} = f'_{in_j}(net_{in_j}(t))\, \frac{\partial net_{in_j}(t)}{\partial w_{lm}} \approx_{tr} \delta_{in_j l}\, f'_{in_j}(net_{in_j}(t))\, y^m(t-1). \tag{12}$$
$$\frac{\partial y^{out_j}(t)}{\partial w_{lm}} = f'_{out_j}(net_{out_j}(t))\, \frac{\partial net_{out_j}(t)}{\partial w_{lm}} \approx_{tr} \delta_{out_j l}\, f'_{out_j}(net_{out_j}(t))\, y^m(t-1). \tag{13}$$
$$\frac{\partial s_{c_j^v}(t)}{\partial w_{lm}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{lm}}
+ \frac{\partial y^{in_j}(t)}{\partial w_{lm}}\, g\!\left(net_{c_j^v}(t)\right)
+ y^{in_j}(t)\, g'\!\left(net_{c_j^v}(t)\right) \frac{\partial net_{c_j^v}(t)}{\partial w_{lm}} \approx_{tr} \tag{14}$$
$$\left( \delta_{in_j l} + \delta_{c_j^v l} \right) \frac{\partial s_{c_j^v}(t-1)}{\partial w_{lm}}
+ \delta_{in_j l}\, \frac{\partial y^{in_j}(t)}{\partial w_{lm}}\, g\!\left(net_{c_j^v}(t)\right)
+ \delta_{c_j^v l}\, y^{in_j}(t)\, g'\!\left(net_{c_j^v}(t)\right) \frac{\partial net_{c_j^v}(t)}{\partial w_{lm}}$$
$$= \left( \delta_{in_j l} + \delta_{c_j^v l} \right) \frac{\partial s_{c_j^v}(t-1)}{\partial w_{lm}}
+ \delta_{in_j l}\, f'_{in_j}(net_{in_j}(t))\, g\!\left(net_{c_j^v}(t)\right) y^m(t-1)
+ \delta_{c_j^v l}\, y^{in_j}(t)\, g'\!\left(net_{c_j^v}(t)\right) y^m(t-1).$$
$$\frac{\partial y^{c_j^v}(t)}{\partial w_{lm}} = \frac{\partial y^{out_j}(t)}{\partial w_{lm}}\, h(s_{c_j^v}(t))
+ h'(s_{c_j^v}(t))\, \frac{\partial s_{c_j^v}(t)}{\partial w_{lm}}\, y^{out_j}(t) \approx_{tr} \tag{15}$$
$$\delta_{out_j l}\, \frac{\partial y^{out_j}(t)}{\partial w_{lm}}\, h(s_{c_j^v}(t))
+ \left( \delta_{in_j l} + \delta_{c_j^v l} \right) h'(s_{c_j^v}(t))\, \frac{\partial s_{c_j^v}(t)}{\partial w_{lm}}\, y^{out_j}(t).$$
To efficiently update the system at time $t$, the only (truncated) derivatives that need to be stored
at time $t-1$ are $\frac{\partial s_{c_j^v}(t-1)}{\partial w_{lm}}$, where $l = c_j^v$ or $l = in_j$.
Backward pass. We will describe the backward pass only for the particularly efficient "truncated
gradient version" of the LSTM algorithm. For simplicity we will use equal signs even where
approximations are made according to the truncated backprop equations above.
The squared error at time $t$ is given by
$$E(t) = \sum_{k:\ k \text{ output unit}} \left( t^k(t) - y^k(t) \right)^2, \tag{16}$$
where $t^k(t)$ is output unit $k$'s target at time $t$.
Time $t$'s contribution to $w_{lm}$'s gradient-based update with learning rate $\alpha$ is
$$\Delta w_{lm}(t) = -\alpha\, \frac{\partial E(t)}{\partial w_{lm}}. \tag{17}$$
We dene some unit
l
's error at time step
t
by
e
l
(
t
) :=
@E
(
t
)
@net
l
(
t
)
. (18)
Using (almost) standard backprop, we first compute updates for weights to output units ($l = k$),
weights to hidden units ($l = i$) and weights to output gates ($l = out_j$). We obtain (compare
formulae (10), (11), (13)):
$$l = k \text{ (output)}: \quad e_k(t) = f'_k(net_k(t)) \left( t^k(t) - y^k(t) \right), \tag{19}$$
$$l = i \text{ (hidden)}: \quad e_i(t) = f'_i(net_i(t)) \sum_{k:\ k \text{ output unit}} w_{ki}\, e_k(t), \tag{20}$$
$$l = out_j \text{ (output gates)}: \quad
e_{out_j}(t) = f'_{out_j}(net_{out_j}(t)) \left( \sum_{v=1}^{S_j} h(s_{c_j^v}(t)) \sum_{k:\ k \text{ output unit}} w_{k c_j^v}\, e_k(t) \right). \tag{21}$$
For all possible $l$, time $t$'s contribution to $w_{lm}$'s update is
$$\Delta w_{lm}(t) = \alpha\, e_l(t)\, y^m(t-1). \tag{22}$$
The remaining updates for weights to input gates ($l = in_j$) and to cell units ($l = c_j^v$) are less
conventional. We define some internal state $s_{c_j^v}$'s error:
$$e_{s_{c_j^v}}(t) := -\frac{\partial E(t)}{\partial s_{c_j^v}(t)}
= f_{out_j}(net_{out_j}(t))\, h'(s_{c_j^v}(t)) \sum_{k:\ k \text{ output unit}} w_{k c_j^v}\, e_k(t). \tag{23}$$
We obtain for $l = in_j$ or $l = c_j^v,\ v = 1, \ldots, S_j$:
$$-\frac{\partial E(t)}{\partial w_{lm}} = \sum_{v=1}^{S_j} e_{s_{c_j^v}}(t)\, \frac{\partial s_{c_j^v}(t)}{\partial w_{lm}}. \tag{24}$$
The derivatives of the internal states with respect to weights, and the corresponding weight
updates, are as follows (compare expression (14)):
$$l = in_j \text{ (input gates)}: \tag{25}$$
$$\frac{\partial s_{c_j^v}(t)}{\partial w_{in_j m}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{in_j m}}
+ g\!\left(net_{c_j^v}(t)\right) f'_{in_j}(net_{in_j}(t))\, y^m(t-1);$$
therefore time $t$'s contribution to $w_{in_j m}$'s update is (compare expression (10)):
$$\Delta w_{in_j m}(t) = \alpha \sum_{v=1}^{S_j} e_{s_{c_j^v}}(t)\, \frac{\partial s_{c_j^v}(t)}{\partial w_{in_j m}}. \tag{26}$$
Similarly we get (compare expression (14)):
$$l = c_j^v \text{ (memory cells)}: \tag{27}$$
$$\frac{\partial s_{c_j^v}(t)}{\partial w_{c_j^v m}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{c_j^v m}}
+ g'\!\left(net_{c_j^v}(t)\right) f_{in_j}(net_{in_j}(t))\, y^m(t-1);$$
therefore time $t$'s contribution to $w_{c_j^v m}$'s update is (compare expression (10)):
$$\Delta w_{c_j^v m}(t) = \alpha\, e_{s_{c_j^v}}(t)\, \frac{\partial s_{c_j^v}(t)}{\partial w_{c_j^v m}}. \tag{28}$$
All we need to implement for the backward pass are equations (19), (20), (21), (22), (23), (25),
(26), (27), (28). Each weight's total update is the sum of the contributions of all time steps.
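A compact sketch of these updates for a single memory cell (block size 1) feeding a single output unit; the trace dictionaries dS_in and dS_c store the derivatives from (25) and (27), and all names are our own simplifications, not the original implementation:

```python
def lstm_cell_backward_step(e_k, w_kc, net_in, net_out, net_c, s,
                            y_prev, dS_in, dS_c, alpha,
                            f, f_prime, g, g_prime, h, h_prime):
    """Truncated backward pass for one cell and one output unit,
    following equations (21)-(28). Returns this step's weight updates."""
    updates = {}
    e_out = f_prime(net_out) * h(s) * w_kc * e_k    # eq. (21): output gate error
    e_s = f(net_out) * h_prime(s) * w_kc * e_k      # eq. (23): internal state error
    for m, y_m in y_prev.items():
        updates[("out", m)] = alpha * e_out * y_m   # eq. (22): output gate weights
        # eq. (25): input gate trace, then eq. (26): input gate update
        dS_in[m] = dS_in.get(m, 0.0) + g(net_c) * f_prime(net_in) * y_m
        updates[("in", m)] = alpha * e_s * dS_in[m]
        # eq. (27): cell trace, then eq. (28): cell weight update
        dS_c[m] = dS_c.get(m, 0.0) + g_prime(net_c) * f(net_in) * y_m
        updates[("cell", m)] = alpha * e_s * dS_c[m]
    return updates
```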
Computational complexity. LSTM's update complexity per time step is
$$O(KH + KCS + HI + CSI) = O(W), \tag{29}$$
where $K$ is the number of output units, $C$ is the number of memory cell blocks, $S > 0$ is the size
of the memory cell blocks, $H$ is the number of hidden units, $I$ is the (maximal) number of units
forward-connected to memory cells, gate units and hidden units, and
$$W = KH + KCS + CSI + 2CI + HI = O(KH + KCS + CSI + HI)$$
is the number of weights. Expression (29) is obtained by considering all computations of the
backward pass: equation (19) needs $K$ steps; (20) needs $KH$ steps; (21) needs $KSC$ steps; (22)
needs $K(H + C)$ steps for output units, $HI$ steps for hidden units, $CI$ steps for output gates;
(23) needs $KCS$ steps; (25) needs $CSI$ steps; (26) needs $CSI$ steps; (27) needs $CSI$ steps;
(28) needs $CSI$ steps. The total is $K + 2KH + KC + 2KSC + HI + CI + 4CSI$ steps, or
$O(KH + KSC + HI + CSI)$ steps. We conclude: the LSTM algorithm's update complexity per time
step is just like BPTT's for a fully recurrent net.
At a given time step, only the $2CSI$ most recent $\frac{\partial s_{c_j^v}}{\partial w_{lm}}$ values from equations (25) and (27)
need to be stored. Hence LSTM's storage complexity also is $O(W)$; it does not depend on the
input sequence length.
A.2 ERROR FLOW
We compute how much an error signal is scaled while flowing back through a memory cell for $q$ time
steps. As a by-product, this analysis reconfirms that the error flow within a memory cell's CEC is
indeed constant, provided that truncated backprop cuts off error flow trying to leave memory cells
(see also Section 3.2). The analysis also highlights a potential for undesirable long-term drifts of
$s_{c_j}$ (see (2) below), as well as the beneficial, countermanding influence of negatively biased input
gates (see (3) below).
Using the truncated backprop learning rule, we obtain
$$\frac{\partial s_{c_j}(t-k)}{\partial s_{c_j}(t-k-1)} =
1 + \frac{\partial y^{in_j}(t-k)}{\partial s_{c_j}(t-k-1)}\, g\!\left(net_{c_j}(t-k)\right)
+ y^{in_j}(t-k)\, g'\!\left(net_{c_j}(t-k)\right) \frac{\partial net_{c_j}(t-k)}{\partial s_{c_j}(t-k-1)} \tag{30}$$
$$= 1 + \sum_u \frac{\partial y^{in_j}(t-k)}{\partial y^u(t-k-1)} \frac{\partial y^u(t-k-1)}{\partial s_{c_j}(t-k-1)}\, g\!\left(net_{c_j}(t-k)\right)
+ y^{in_j}(t-k)\, g'\!\left(net_{c_j}(t-k)\right) \sum_u \frac{\partial net_{c_j}(t-k)}{\partial y^u(t-k-1)} \frac{\partial y^u(t-k-1)}{\partial s_{c_j}(t-k-1)}
\approx_{tr} 1.$$
The $\approx_{tr}$ sign indicates equality due to the fact that truncated backprop replaces the following
derivatives by zero: $\frac{\partial y^{in_j}(t-k)}{\partial y^u(t-k-1)}\ \forall u$ and $\frac{\partial net_{c_j}(t-k)}{\partial y^u(t-k-1)}\ \forall u$.
In what follows, an error $\vartheta_j(t)$ starts flowing back at $c_j$'s output. We redefine
$$\vartheta_j(t) := \sum_i w_{i c_j}\, \vartheta_i(t+1). \tag{31}$$
Following the definitions/conventions of Section 3.1, we compute error flow for the truncated
backprop learning rule. The error occurring at the output gate is
$$\vartheta_{out_j}(t) \approx_{tr} \frac{\partial y^{out_j}(t)}{\partial net_{out_j}(t)} \frac{\partial y^{c_j}(t)}{\partial y^{out_j}(t)}\, \vartheta_j(t). \tag{32}$$
The error occurring at the internal state is
$$\vartheta_{s_{c_j}}(t) = \frac{\partial s_{c_j}(t+1)}{\partial s_{c_j}(t)}\, \vartheta_{s_{c_j}}(t+1)
+ \frac{\partial y^{c_j}(t)}{\partial s_{c_j}(t)}\, \vartheta_j(t). \tag{33}$$
Since we use truncated backprop we have $\vartheta_j(t) = \sum_{i:\ i \text{ no gate and no memory cell}} w_{i c_j}\, \vartheta_i(t+1)$;
therefore we get
$$\frac{\partial \vartheta_j(t)}{\partial \vartheta_{s_{c_j}}(t+1)} = \sum_i w_{i c_j}\, \frac{\partial \vartheta_i(t+1)}{\partial \vartheta_{s_{c_j}}(t+1)} \approx_{tr} 0. \tag{34}$$
The previous equations (33) and (34) imply constant error flow through internal states of
memory cells:
$$\frac{\partial \vartheta_{s_{c_j}}(t)}{\partial \vartheta_{s_{c_j}}(t+1)} = \frac{\partial s_{c_j}(t+1)}{\partial s_{c_j}(t)} \approx_{tr} 1. \tag{35}$$
The error occurring at the memory cell input is
$$\vartheta_{c_j}(t) = \frac{\partial g(net_{c_j}(t))}{\partial net_{c_j}(t)} \frac{\partial s_{c_j}(t)}{\partial g(net_{c_j}(t))}\, \vartheta_{s_{c_j}}(t). \tag{36}$$
The error occurring at the input gate is
$$\vartheta_{in_j}(t) \approx_{tr} \frac{\partial y^{in_j}(t)}{\partial net_{in_j}(t)} \frac{\partial s_{c_j}(t)}{\partial y^{in_j}(t)}\, \vartheta_{s_{c_j}}(t). \tag{37}$$
No external error ow.
Errors are propagated back from units
l
to unit
v
along outgoing
connections with weights
w
lv
. This \external error" (note that for conventional units there is
nothing but external error) at time
t
is
#
e
v
(
t
) =
@y
v
(
t
)
@net
v
(
t
)
X
l
@net
l
(
t
+ 1)
@y
v
(
t
)
#
l
(
t
+ 1) . (38)
We obtain
@#
e
v
(
t
1)
@#
j
(
t
)
= (39)
@y
v
(
t
1)
@net
v
(
t
1)
@#
out
j
(
t
)
@#
j
(
t
)
@net
out
j
(
t
)
@y
v
(
t
1)
+
@#
in
j
(
t
)
@#
j
(
t
)
@net
in
j
(
t
)
@y
v
(
t
1)
+
@#
c
j
(
t
)
@#
j
(
t
)
@net
c
j
(
t
)
@y
v
(
t
1)
tr
0 .
We observe: the error
#
j
arriving at the memory cell output is
not
backpropagated to units
v
via
external connections to
in
j
; out
j
; c
j
.
Error ow within memory cells.
We now focus on the error back ow within a memory
cell's CEC. This is actually the only type of error ow that can bridge several time steps. Supp ose
error
#
j
(
t
) arrives at
c
j
's output at time
t
and is propagated back for
q
steps until it reaches
in
j
or the memory cell input
g
(
net
c
j
). It is scaled by a factor of
@#
v
(
t
q
)
@#
j
(
t
)
, where
v
=
in
j
; c
j
. We rst
compute
@#
s
c
j
(
t
q
)
@#
j
(
t
)
tr
8
<
:
@y
c
j
(
t
)
@s
c
j
(
t
)
q
= 0
@s
c
j
(
t
q
+1)
@s
c
j
(
t
q
)
@#
s
c
j
(
t
q
+1)
@#
j
(
t
)
q >
0
. (40)
Expanding equation (40), we obtain
@#
v
(
t
q
)
@#
j
(
t
)
tr
@#
v
(
t
q
)
@#
s
c
j
(
t
q
)
@#
s
c
j
(
t
q
)
@#
j
(
t
)
tr
(41)
@#
v
(
t
q
)
@#
s
c
j
(
t
q
)
1
Y
m
=
q
@s
c
j
(
t
m
+ 1)
@s
c
j
(
t
m
)
!
@y
c
j
(
t
)
@s
c
j
(
t
)
tr
y
out
j
(
t
)
h
0
(
s
c
j
(
t
))
g
0
(
net
c
j
(
t
q
)
y
in
j
(
t
q
)
v
=
c
j
g
(
net
c
j
(
t
q
)
f
0
in
j
(
net
in
j
(
t
q
))
v
=
in
j
.
Consider the factors in the previous equation's last expression. Obviously, error ow is scaled
only at times
t
(when it enters the cell) and
t
q
(when it leaves the cell), but not in between
(constant error ow through the CEC). We observe:
29
(1) The output gate's eect is:
y
out
j
(
t
) scales down those errors that can be reduced early
during training without using the memory cell. Likewise, it scales down those errors resulting
from using (activating/deactivating) the memory cell at later training stages | without the output
gate, the memory cell might for instance suddenly start causing avoidable errors in situations that
already seemed under control (because it was easy to reduce the corresponding errors without
memory cells). See \output weight conict" and \abuse problem" in Sections 3/4.
(2) If there are large positive or negative
s
c
j
(
t
) values (because
s
c
j
has drifted since time step
t
q
), then
h
0
(
s
c
j
(
t
)) may be small (assuming that
h
is a logistic sigmoid). See Section 4. Drifts
of the memory cell's internal state
s
c
j
can be countermanded by negatively biasing the input gate
in
j
(see Section 4 and next point). Recall from Section 4 that the precise bias value does not
matter much.
(3)
y
in
j
(
t
q
) and
f
0
in
j
(
net
in
j
(
t
q
)) are small if the input gate is negatively biased (assume
f
in
j
is a logistic sigmoid). However, the potential signicance of this is negligible compared to the
potential signicance of drifts of the internal state
s
c
j
.
Some of the factors above may scale down LSTM's overall error ow, but not in a manner
that depends on the length of the time lag. The ow will still be much more eective than an
exponentially (of order
q
) decaying ow without memory cells.
References
Almeida, L. B. (1987). A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In IEEE 1st International Conference on Neural Networks, San Diego, volume 2, pages 609-618.
Baldi, P. and Pineda, F. (1991). Contrastive learning and neural oscillator. Neural Computation, 3:526-545.
Bengio, Y. and Frasconi, P. (1994). Credit assignment through time: Alternatives to backpropagation. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6, pages 75-82. San Mateo, CA: Morgan Kaufmann.
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166.
Cleeremans, A., Servan-Schreiber, D., and McClelland, J. L. (1989). Finite-state automata and simple recurrent networks. Neural Computation, 1:372-381.
de Vries, B. and Principe, J. C. (1991). A theory for neural networks with time delays. In Lippmann, R. P., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 3, pages 162-168. San Mateo, CA: Morgan Kaufmann.
Doya, K. (1992). Bifurcations in the learning of recurrent neural networks. In Proceedings of 1992 IEEE International Symposium on Circuits and Systems, pages 2777-2780.
Doya, K. and Yoshizawa, S. (1989). Adaptive neural oscillator using continuous-time back-propagation learning. Neural Networks, 2:375-385.
Elman, J. L. (1988). Finding structure in time. Technical Report CRL Technical Report 8801, Center for Research in Language, University of California, San Diego.
Fahlman, S. E. (1991). The recurrent cascade-correlation learning algorithm. In Lippmann, R. P., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 3, pages 190-196. San Mateo, CA: Morgan Kaufmann.
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München. See www7.informatik.tu-muenchen.de/~hochreit.
Hochreiter, S. and Schmidhuber, J. (1995). Long short-term memory. Technical Report FKI-207-95, Fakultät für Informatik, Technische Universität München.
Hochreiter, S. and Schmidhuber, J. (1996). Bridging long time lags by weight guessing and "Long Short-Term Memory". In Silva, F. L., Principe, J. C., and Almeida, L. B., editors, Spatiotemporal models in biological and artificial systems, pages 65-72. IOS Press, Amsterdam, Netherlands. Series: Frontiers in Artificial Intelligence and Applications, Volume 37.
Hochreiter, S. and Schmidhuber, J. (1997). LSTM can solve hard long time lag problems. In Advances in Neural Information Processing Systems 9. MIT Press, Cambridge MA. Presented at NIPS 96.
Lang, K., Waibel, A., and Hinton, G. E. (1990). A time-delay neural network architecture for isolated word recognition. Neural Networks, 3:23-43.
Miller, C. B. and Giles, C. L. (1993). Experimental comparison of the effect of order in recurrent neural networks. International Journal of Pattern Recognition and Artificial Intelligence, 7(4):849-872.
Mozer, M. C. (1989). A focused back-propagation algorithm for temporal sequence recognition. Complex Systems, 3:349-381.
Mozer, M. C. (1992). Induction of multiscale temporal structure. In Lippman, D. S., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 4, pages 275-282. San Mateo, CA: Morgan Kaufmann.
Pearlmutter, B. A. (1989). Learning state space trajectories in recurrent neural networks. Neural Computation, 1(2):263-269.
Pearlmutter, B. A. (1995). Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural Networks, 6(5):1212-1228.
Pineda, F. J. (1987). Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 19(59):2229-2232.
Pineda, F. J. (1988). Dynamics and architecture for neural computation. Journal of Complexity, 4:216-245.
Plate, T. A. (1993). Holographic recurrent networks. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 34-41. San Mateo, CA: Morgan Kaufmann.
Pollack, J. B. (1991). Language induction by phase transition in dynamical recognizers. In Lippmann, R. P., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 3, pages 619-626. San Mateo, CA: Morgan Kaufmann.
Puskorius, G. V. and Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks, 5(2):279-297.
Ring, M. B. (1993). Learning sequential tasks by incrementally adding higher orders. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 115-122. Morgan Kaufmann.
Robinson, A. J. and Fallside, F. (1987). The utility driven dynamic error propagation network. Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department.
Schmidhuber, J. (1989). The Neural Bucket Brigade: A local learning algorithm for dynamic feedforward and recurrent networks. Connection Science, 1(4):403-412.
Schmidhuber, J. (1992a). A fixed size storage O(n^3) time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4(2):243-248.
Schmidhuber, J. (1992b). Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242.
Schmidhuber, J. (1992c). Learning unambiguous reduced sequence descriptions. In Moody, J. E., Hanson, S. J., and Lippman, R. P., editors, Advances in Neural Information Processing Systems 4, pages 291-298. San Mateo, CA: Morgan Kaufmann.
Schmidhuber, J. (1993). Netzwerkarchitekturen, Zielfunktionen und Kettenregel. Habilitationsschrift, Institut für Informatik, Technische Universität München.
Schmidhuber, J. and Hochreiter, S. (1996). Guessing can outperform many long time lag algorithms. Technical Report IDSIA-19-96, IDSIA.
Silva, G. X., Amaral, J. D., Langlois, T., and Almeida, L. B. (1996). Faster training of recurrent networks. In Silva, F. L., Principe, J. C., and Almeida, L. B., editors, Spatiotemporal models in biological and artificial systems, pages 168-175. IOS Press, Amsterdam, Netherlands. Series: Frontiers in Artificial Intelligence and Applications, Volume 37.
Smith, A. W. and Zipser, D. (1989). Learning sequential structures with the real-time recurrent learning algorithm. International Journal of Neural Systems, 1(2):125-131.
Sun, G., Chen, H., and Lee, Y. (1993). Time warping invariant neural networks. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 180-187. San Mateo, CA: Morgan Kaufmann.
Watrous, R. L. and Kuhn, G. M. (1992). Induction of finite-state languages using second-order recurrent networks. Neural Computation, 4:406-414.
Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1.
Williams, R. J. (1989). Complexity of exact gradient computation algorithms for recurrent neural networks. Technical Report NU-CCS-89-27, Boston: Northeastern University, College of Computer Science.
Williams, R. J. and Peng, J. (1990). An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 4:491-501.
Williams, R. J. and Zipser, D. (1992). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum.
... LSTM models may be used in news sentiment analysis to help make decisions, guide marketing campaigns, and provide a more comprehensive picture of public opinion. [10] V. LSTM-BASED SENTIMENT ANALYSIS MODELS WITH OTHER APPROACHES ...
Conference Paper
Full-text available
Abstract – The study done on the use of the Hierarchical Long Short-Term Memory (LSTM) model for news sentiment analysis is succinctly summarised in this abstract. The purpose of the study is to find out how well LSTM captures the subtleties of sentiment seen in news stories. There is a wealth of textual data available that may be analysed to understand public opinion thanks to the growth of social media and online news sources. The intricacy of sentiment represented in news items is frequently too complicated for traditional sentiment analysis techniques to fully capture. An appropriate choice for sentiment analysis is the LSTM model, a kind of recurrent neural network that has demonstrated promise in recognising sequential input and capturing long-term relationships. An extensive dataset of news stories from reliable sources was gathered, and human annotators assigned sentiment labels to each one in order to assess the LSTM model's performance. Preprocessing the data involved translating the text into a numerical representation and eliminating stopwords in order to get it ready for LSTM training. To increase the LSTM model's effectiveness in sentiment analysis, more study may look at adding attention mechanisms or transfer learning strategies. In summary, the present study showcases the efficacy of the LSTM model in assessing sentiment inside news stories and highlights its potential for wider use in comprehending public opinion. Keywords – Long Short-Term Memory, Sentiments Analysis, Recurrent Neural Network, News Analysis, Public Opinion, Text Censor
... As a variant of Recurrent Neural Networks (RNN), LSTM specializes in processing and predicting sequence data, overcoming issues like vanishing or exploding gradients in traditional RNNs through gating mechanisms [14]. These mechanisms include input, forget, and output gates, controlling information influx, memory retention, and output determination. ...
Article
Full-text available
This study endeavors to forecast the stock prices of the leading U.S. technology entities - Google, Microsoft, Amazon, Meta, and Apple - through the application of diverse machine learning models, complemented by the traditional Fama-French three-factor model. The employed models encompass Convolutional Neural Networks (CNN), Long Short-Term Memory Networks (LSTM), Support Vector Machines (SVM), and Decision Tree models. Initially, historical stock price data is utilized to train these machine learning models, enabling the identification of potential price trends. Subsequently, the integration of the Fama-French three-factor model enhances the analysis by scrutinizing the impacts of market risk, company size, and book-to-market value on stock prices. The outcomes illuminate both the effectiveness and limitations of various models in stock price prediction, highlighting the advantages of machine learning methodologies over traditional financial theories. This research provides financial market analysts and investors with a fresh perspective on the amalgamation of machine learning and traditional financial theories for enhanced stock price prediction.
... The LSTM layer is used in this work to be able to read and process data in the form of vibration signals related to bearing faults (L Hochreiter and Schmidhuber [12]). To exploit these signals with the LSTM network, we can use vibration analysis methods such as: the time method, the frequency method (spectral analysis) and also the time-frequency method. ...
Article
Full-text available
The ability to accurately detect and predict faults in automotive bearings is essential for diagnostic applications in the maintenance process. Although previous methods can accurately identify the various faults on bearings, they mostly produce erroneous results in the presence of certain mechanical factors when classifying the data. We propose a new diagnostic framework based on one-dimensional convolutional neural network (CONV1D) modelling and improved long short-term memory (LSTM), together with confusion matrices to evaluate data classification using the Deep Learning algorithm. Our framework classifies the data by taking into account the mechanical factors of the bearings (sudden load, rotation speed, operating temperature, etc.). Our results improve the training accuracy of the model to over 96.6%, with a percentage error of 23.29% for 50 iterations (repetitions). This percentage of training accuracy could be closer to 100% and that of the error margins to 0% if we increase the number of iterations. These results underline the promise of our method across our model and indicate how future expansion of the model by combining three methods can lead to further improvements in training accuracy with fewer errors and fewer iterations.
... Generative Adversarial Networks (GAN) have two different deep neural networks that are trained adversarial (i.e., a generator and a discriminator) (Goodfellow et al., 2014). GAN can be implemented using various types of neural network architectures, such as fully connected networks; Convolutional neural networks (CNN) (e.g., Lecun et al., 1998); Long-short-term memory (LSTM) (e.g., Hochreiter & Schmidhuber, 1997). During training, GAN alternates between the training of the discriminator and the generator, using loss functions such as Wasserstein loss and its penalty versions (Gulrajani et al., 2017). ...
Article
Full-text available
Modeling inclined fine‐scale mud drapes inside point bars, deposited on accretion surfaces during stages of low energy or slack water, is critical to modeling fluid flow in complex sedimentary environments (e.g., fluvial and turbidity flows). These features have been modeled using deterministic or geostatistical modeling tools (e.g., object‐, event‐, and pixel‐based). However, this is a non‐trivial task due to the need to preserve geological realism (e.g., connectivity within sedimentary features and facies hierarchy), while being able to condition the generated models to point data (e.g., well data). Generative Adversarial Networks (GAN) have been successfully applied to reproduce several large‐scale scenarios (e.g., braided rivers and carbonate reservoirs), yet their potential for capturing small‐scale and hierarchical features remains largely unexplored. Here, we propose a geo‐modeling workflow for fast modeling of small‐scale conditional mud drapes based on ALLUVSIM and GANSim. Initially, improved ALLUVSIM produces realistic unconditional models of mud drapes along accretionary surfaces, serving as GAN training data. GANSim is then employed to achieve conditioning to well data and probability maps derived from geophysical modeling. Finally, temporal pressure data observed in wells are further conditioned via a Markov chain Monte Carlo sampling method. The proposed geo‐modeling workflow is validated in a two‐dimensional synthetic example as the pre‐trained generator extracts mud‐drapes‐features and generates multiple facies realizations conditioned to diverse information. A field application example in a modern meandering river verifies the effectiveness and practicability of the proposed workflow in real case application examples. The application examples illustrate the potential of the proposed method to predict mud drapes inside point bar reservoirs.
Article
Numerous recent papers (including many NIPS papers) focus on standard recurrent nets' inability to deal with long time lags between relevant input signals and teacher signals. Rather sophisticated alternative methods were proposed. We first show: problems used to promote certain algorithms in numerous previous papers can be solved more quickly by random weight guessing than by the proposed algorithms. This does not mean that guessing is a good algorithm. It just casts doubt on whether the other algorithms are, or whether the chosen problems are meaningful. We then use long short-term memory (LSTM), our own recent algorithm, to solve hard problems that can neither be quickly solved by random weight guessing nor by any other recurrent net algorithm we are aware of. 1 Introduction / Outline Many recent papers focus on standard recurrent nets' inability to deal with long time lags between relevant signals. See, e.g., Bengio et al., El Hihi and Bengio, and others [3, 1, 6, 15]. Rather s...
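The baseline criticized in this abstract is easy to state in code. A minimal sketch of random weight guessing follows, assuming a task-specific error function net_error and a sampling interval; both are illustrative assumptions, not the paper's exact setup.

    # Hedged sketch of the "random weight guessing" baseline: repeatedly sample
    # a full weight vector at random and keep the first one that solves the task.
    # net_error() (returning the task error for a weight vector) and the
    # sampling interval [-100, 100] are illustrative assumptions.
    import numpy as np

    def guess_weights(net_error, n_weights, tol=0.0, max_trials=1_000_000):
        rng = np.random.default_rng(0)
        for trial in range(max_trials):
            w = rng.uniform(-100.0, 100.0, size=n_weights)
            if net_error(w) <= tol:
                return w, trial + 1   # solution and number of guesses needed
        return None, max_trials

The point of the comparison is that if such blind search beats a proposed learning algorithm on a benchmark, the benchmark says little about the algorithm.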
Article
Numerous recent papers focus on standard recurrent nets' problems with long time lags between relevant signals. Some propose rather sophisticated, alternative methods. We show: many problems used to test previous methods can be solved more quickly by random weight guessing.
Article
We explore a network architecture introduced by Elman (1988) for predicting successive elements of a sequence. The network uses the pattern of activation over a set of hidden units from time-step t−1, together with element t, to predict element t + 1. When the network is trained with strings from a particular finite-state grammar, it can learn to be a perfect finite-state recognizer for the grammar. When the network has a minimal number of hidden units, patterns on the hidden units come to correspond to the nodes of the grammar, although this correspondence is not necessary for the network to act as a perfect finite-state recognizer. We explore the conditions under which the network can carry information about distant sequential contingencies across intervening elements. Such information is maintained with relative ease if it is relevant at each intermediate step; it tends to be lost when intervening elements do not depend on it. At first glance this may suggest that such networks are not relevant to natural language, in which dependencies may span indefinite distances. However, embeddings in natural language are not completely independent of earlier information. The final simulation shows that long distance sequential contingencies can be encoded by the network even if only subtle statistical properties of embedded strings depend on the early information.
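A minimal numpy sketch of the forward pass of such a simple recurrent (Elman) network, in which the hidden state from step t-1 is combined with the element at step t to predict element t+1. The dimensions, tanh hidden units, and softmax readout are assumptions, and training is omitted.

    # Hedged sketch of an Elman-style simple recurrent network for next-symbol
    # prediction: the previous hidden state (the "context units") is combined
    # with the current one-hot symbol to predict the next symbol.
    import numpy as np

    def srn_predict(seq_onehot, W_in, W_rec, W_out):
        h = np.zeros(W_rec.shape[0])
        preds = []
        for x in seq_onehot:                    # x: one-hot symbol at step t
            h = np.tanh(W_in @ x + W_rec @ h)   # hidden state carries context
            logits = W_out @ h
            p = np.exp(logits - logits.max())
            preds.append(p / p.sum())           # distribution over symbol t+1
        return preds

Trained on strings from a finite-state grammar, the hidden-state trajectories of such a net can come to track the grammar's nodes, as the abstract describes.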
Technical Report
Error propagation networks are able to learn a variety of tasks in which a static input pattern is mapped onto a static output pattern. This paper presents a generalisation of these nets to deal with time varying, or dynamic, patterns. Three possible architectures are explored which deal with learning sequences of known finite length and sequences of unknown and possibly infinite length. Several examples are given and an application to speech coding is discussed. A further development of dynamic nets is made which allows them to be trained by a signal which expresses the correctness of the output of the net, the utility signal. One possible architecture for such a utility driven dynamic net is given and a simple example is presented. Utility driven dynamic nets are potentially able to calculate and maximise any function of the input and output data streams, within the considered context. This is a very powerful property, and an appendix presents a comparison of the information processing in utility driven dynamic nets and that in the human brain.
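The first setting described here, learning sequences of known finite length by propagating errors through the unrolled net, corresponds to what is now called backpropagation through time. A hedged numpy sketch of that gradient computation for a vanilla recurrent net under a squared-error target; all dimensions are assumptions.

    # Hedged sketch of error propagation through an unrolled dynamic net:
    # backpropagation through time for a vanilla tanh recurrent net with a
    # linear readout and squared-error targets. Shapes are assumptions.
    import numpy as np

    def bptt_grads(xs, ys, W_in, W_rec, W_out):
        T, h = len(xs), np.zeros(W_rec.shape[0])
        hs, zs = [h], []
        for x in xs:                                # forward pass through time
            h = np.tanh(W_in @ x + W_rec @ h)
            hs.append(h); zs.append(W_out @ h)
        gW_in = np.zeros_like(W_in); gW_rec = np.zeros_like(W_rec)
        gW_out = np.zeros_like(W_out); dh_next = np.zeros_like(h)
        for t in reversed(range(T)):                # backward pass through time
            dz = zs[t] - ys[t]                      # squared-error gradient
            gW_out += np.outer(dz, hs[t + 1])
            dh = W_out.T @ dz + dh_next
            dpre = dh * (1.0 - hs[t + 1] ** 2)      # tanh derivative
            gW_in += np.outer(dpre, xs[t])
            gW_rec += np.outer(dpre, hs[t])
            dh_next = W_rec.T @ dpre
        return gW_in, gW_rec, gW_out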
Article
Recurrent connections in neural networks potentially allow information about events occurring in the past to be preserved and used in current computations. How effectively this potential is realized depends on the power of the learning algorithm used. As an example of a task requiring recurrency, Servan-Schreiber, Cleeremans, and McClelland [1] have applied a simple recurrent learning algorithm to the task of recognizing finite-state grammars of increasing difficulty. These nets showed considerable power and were able to learn fairly complex grammars by emulating the state machines that produced them. However, there was a limit to the difficulty of the grammars that could be learned. We have applied a more powerful recurrent learning procedure, called real-time recurrent learning [2, 6] (RTRL), to some of the same problems studied by Servan-Schreiber, Cleeremans, and McClelland. The RTRL algorithm solved more difficult forms of the task than the simple recurrent networks. The internal representations developed by RTRL networks revealed that they learn a rich set of internal states that represent more about the past than is required by the underlying grammar. The dynamics of the networks are determined by the state structure and are not chaotic.
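The defining property of RTRL is that it carries the sensitivities of unit activations to every weight forward in time, so the gradient is available at each step without unrolling the net. A hedged numpy sketch of one RTRL step for a fully recurrent net with logistic units; the shapes and the nonlinearity are illustrative assumptions.

    # Hedged sketch of one RTRL step. p[k, i, j] tracks d y_k / d w_ij and is
    # updated forward in time. For n recurrent units and m = n + n_inputs
    # total source units: y has shape (n,), p has shape (n, n, m), W (n, m).
    import numpy as np

    def rtrl_step(y, x, p, W):
        z = np.concatenate([y, x])                  # recurrent units, then inputs
        y_new = 1.0 / (1.0 + np.exp(-(W @ z)))      # logistic activation
        fprime = y_new * (1.0 - y_new)
        n = len(y)
        p_new = np.zeros_like(p)
        for k in range(n):
            # sum_l w_kl * p[l, i, j], plus the direct term delta_ki * z_j
            rec = np.tensordot(W[k, :n], p, axes=(0, 0))
            rec[k, :] += z
            p_new[k] = fprime[k] * rec
        return y_new, p_new

    # Given an error e = y_new - target, the weight gradient at this step is
    #   dE/dW = sum_k e[k] * p_new[k], an (n, m) array.

The per-step cost of maintaining p is what makes RTRL expensive compared with methods that are local in space and time.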
Article
An adaptive neural network with asymmetric connections is introduced. This network is related to the Hopfield network with graded neurons and uses a recurrent generalization of the δ rule of Rumelhart, Hinton, and Williams to modify adaptively the synaptic weights. The new network bears a resemblance to the master/slave network of Lapedes and Farber but it is architecturally simpler.