Long-term time series prediction with the NARX network: An empirical evaluation



José Maria P. Júnior and Guilherme A. Barreto
Department of Teleinformatics Engineering
Federal University of Ceará, Av. Mister Hull, S/N
CP 6005, CEP 60455-760, Fortaleza-CE, Brazil
1 Introduction
Artificial neural networks (ANNs) have been successfully applied to a number
of time series prediction and modeling tasks, including financial time series
prediction [12], river flow forecasting [3], biomedical time series modeling [11],
communication network traffic prediction [6,13,2], chaotic time series predic-
tion [42], among several others (see [34], for a recent survey). In particular,
when the time series is noisy and the underlying dynamical system is non-
linear, ANN models frequently outperform standard linear techniques, such
as the well-known Box-Jenkins models [7]. In such cases, the inherent nonlin-
earity of ANN models and a higher robustness to noise seem to explain their
better prediction performance.
In one-step-ahead prediction tasks, ANN models are required to estimate the
next sample value of a time series, without feeding it back to the model’s
input regressor. In other words, the input regressor contains only actual sample
points of the time series. If the user is interested in a longer prediction horizon,
a procedure known as multi-step-ahead or long-term prediction, the model’s
output should be fed back to the input regressor for a fixed but finite number of
time steps [39]. In this case, the components of the input regressor, previously
composed of actual sample points of the time series, are gradually replaced by
previous predicted values.
If the prediction horizon tends to infinity, from some time in the future the
input regressor will start to be composed only of estimated values of the time
series. In this case, the multi-step-ahead prediction task becomes a dynamic
modeling task, in which the ANN model acts as an autonomous system, trying
to recursively emulate the dynamic behavior of the system that generated
the nonlinear time series [17,18]. Multi-step-ahead prediction and dynamic
modeling are considerably harder than one-step-ahead prediction, and it is
believed that ANN models, recurrent neural architectures in particular, play
an important role in these tasks [36].
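The recursive procedure described above can be sketched in a few lines. This is only an illustrative sketch: the function name `recursive_forecast`, its arguments and the toy predictor are assumptions introduced here, with the toy predictor standing in for a trained network. The regressor is assumed to hold consecutive past samples, newest first.

```python
from collections import deque
from typing import Callable, List, Sequence


def recursive_forecast(
    predict_one_step: Callable[[Sequence[float]], float],
    history: Sequence[float],
    order: int,
    horizon: int,
) -> List[float]:
    """Iterate a one-step predictor `horizon` times, feeding outputs back."""
    if order < 1:
        raise ValueError("order must be at least 1")
    if len(history) < order:
        raise ValueError("history must contain at least `order` samples")
    # The regressor holds the most recent `order` values, newest first.
    regressor = deque(reversed(history[-order:]), maxlen=order)
    predictions = []
    for _ in range(horizon):
        x_hat = predict_one_step(list(regressor))
        predictions.append(x_hat)
        # The prediction displaces the oldest entry, so after `order` steps
        # the regressor contains estimated values only.
        regressor.appendleft(x_hat)
    return predictions


# Toy stand-in for a trained network: x(n+1) = 2*x(n) - x(n-1).
def toy_model(regressor: Sequence[float]) -> float:
    return 2.0 * regressor[0] - regressor[1]


forecast = recursive_forecast(toy_model, [1.0, 2.0, 3.0], order=2, horizon=4)
print(forecast)
```

With a horizon larger than `order`, the loop runs as an autonomous system in the sense used above, since no measured sample remains in the regressor.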
Simple recurrent networks (SRNs) comprise a class of recurrent neural models
that are essentially feedforward in the signal-flow structure, but also contain a
small number of local and/or global feedback loops in their architectures. Even
though feedforward MLP-like networks can be easily adapted to process time
series through an input tapped delay line, giving rise to the well-known Time
Delay Neural Network (TDNN) [36], they can also be easily converted to SRNs
by feeding back the neuronal outputs of the hidden or output layers, giving
rise to Elman and Jordan networks, respectively [23]. It is worth pointing
out that, when applied to long-term prediction, a feedforward TDNN model
will eventually behave as a kind of SRN architecture, since a global loop is
needed to feed back the current estimated value into the input regressor.
The aforementioned recurrent architectures are usually trained by means of
temporal gradient-based variants of the backpropagation algorithm [35]. How-
ever, learning to perform tasks in which the temporal dependencies present in
the input/output signals span long time intervals can be quite difficult using
gradient-based learning algorithms [4]. In [27], the authors report that learn-
ing such long-term temporal dependencies with gradient-descent techniques is
more effective in a class of SRN models called Nonlinear Autoregressive with
eXogenous input (NARX) [28] than in simple MLP-based recurrent models.
This occurs in part because the NARX model’s input vector is cleverly built
through two tapped-delay lines: one sliding over the input signal and
another sliding over the network’s output.
Despite the aforementioned advantages of the NARX network, its feasibility as
a nonlinear tool for univariate time series modeling and prediction has not been
fully explored yet. For example, in [29], the NARX model is indeed reduced to
the TDNN model in order to be applied to time series prediction. Bearing this
under-utilization of the NARX network in mind, we propose a simple strategy
based on Takens’ embedding theorem that allows the original architecture of
the NARX network to be easily and efficiently applied to long-term prediction
of univariate nonlinear time series.
Potential fields of application of our approach are communication network
traffic characterization [45,14,16] and chaotic time series prediction [22], since
it has been shown that these kinds of data present long-range dependence
due to their self-similar nature. Thus, for the sake of illustration, we evaluate
the proposed approach using two real-world data sets obtained from these
domains, namely the well-known chaotic laser time series and a variable bit
rate (VBR) video traffic time series.
The remainder of the paper is organized as follows. In Section 2, we describe
the NARX network model and its main characteristics. In Section 3 we intro-
duce the basics of the nonlinear time series prediction problem and present our
approach. The simulations and discussion of results are presented in Section 4.
The paper is concluded in Section 5.
2 The NARX Network
The Nonlinear Autoregressive model with Exogenous inputs (NARX) [26,30,33]
is an important class of discrete-time nonlinear systems that can be mathe-
matically represented as
$$y(n+1) = f\left[\,y(n), \ldots, y(n-d_y+1);\; u(n-k), u(n-k-1), \ldots, u(n-d_u-k+1)\,\right], \tag{1}$$
where $u(n) \in \mathbb{R}$ and $y(n) \in \mathbb{R}$ denote, respectively, the input and output of
the model at discrete time step $n$, while $d_u \geq 1$ and $d_y \geq 1$, $d_u \leq d_y$, are
the input-memory and output-memory orders, respectively. The parameter $k$
($k \geq 0$) is a delay term, known as the process dead-time.
Without loss of generality, we always assume $k = 0$ in this paper, thus
obtaining the following NARX model:
$$y(n+1) = f\left[\,y(n), \ldots, y(n-d_y+1);\; u(n), u(n-1), \ldots, u(n-d_u+1)\,\right], \tag{2}$$
which may be written in vector form as
$$y(n+1) = f[\mathbf{y}(n); \mathbf{u}(n)], \tag{3}$$
where the vectors $\mathbf{y}(n)$ and $\mathbf{u}(n)$ denote the output and input regressors,
respectively.
The nonlinear mapping $f(\cdot)$ is generally unknown and can be approximated,
for example, by a standard multilayer Perceptron (MLP) network. The resulting
connectionist architecture is then called a NARX network [10,32], a
powerful class of dynamical models which has been shown to be computationally
equivalent to Turing machines [38]. Figure 1 shows the topology of a
two-hidden-layer NARX network.
Fig. 1. NARX network with $d_u$ delayed inputs and $d_y$ delayed outputs ($z^{-1}$ = unit
time delay).
Training of the NARX network can be carried out in one of two modes:
Series-Parallel (SP) Mode - In this case, the output’s regressor is formed
only by actual values of the system’s output:
$$\hat{y}(n+1) = \hat{f}[\mathbf{y}_{sp}(n); \mathbf{u}(n)] = \hat{f}\left[\,y(n), \ldots, y(n-d_y+1);\; u(n), u(n-1), \ldots, u(n-d_u+1)\,\right], \tag{4}$$
where the hat symbol (^) is used to denote estimated values (or functions).
Parallel (P) Mode - In this case, estimated outputs are fed back and
included in the output’s regressor (see footnote 1):
$$\hat{y}(n+1) = \hat{f}[\mathbf{y}_{p}(n); \mathbf{u}(n)] = \hat{f}\left[\,\hat{y}(n), \ldots, \hat{y}(n-d_y+1);\; u(n), u(n-1), \ldots, u(n-d_u+1)\,\right]. \tag{5}$$
Footnote 1: The NARX model in P-mode is also known as the Output-Error Model [30].
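The difference between the two modes lies only in where the output regressor takes its values from. The sketch below illustrates this under stated assumptions: `run_narx`, `lagged` and the callable signature `f_hat(y_regressor, u_regressor)` are hypothetical names introduced for illustration, the P-mode buffer is seeded with actual outputs, and the toy model is deliberately imperfect so that the two modes produce different sequences.

```python
from typing import Callable, List, Sequence

# Hypothetical signature for the approximated map: f_hat(y_regressor, u_regressor).
Model = Callable[[List[float], List[float]], float]


def lagged(values: Sequence[float], n: int, order: int) -> List[float]:
    """Return [v(n), v(n-1), ..., v(n-order+1)]."""
    return [values[n - i] for i in range(order)]


def run_narx(f_hat: Model, u: Sequence[float], y: Sequence[float],
             d_u: int, d_y: int, mode: str) -> List[float]:
    """Return y_hat(n+1) for n = start, ..., len(u)-1 in 'sp' or 'p' mode."""
    if mode not in ("sp", "p"):
        raise ValueError("mode must be 'sp' or 'p'")
    start = max(d_u, d_y) - 1          # first n with complete regressors
    fed_back = list(y[: start + 1])    # assumption: seed P-mode with actual outputs
    outputs = []
    for n in range(start, len(u)):
        # SP-mode reads actual outputs; P-mode reads its own past estimates.
        source = y if mode == "sp" else fed_back
        y_hat = f_hat(lagged(source, n, d_y), lagged(u, n, d_u))
        outputs.append(y_hat)
        fed_back.append(y_hat)         # the estimate of y(n+1) sits at index n+1
    return outputs


# Deliberately imperfect toy model, so that the two modes diverge.
def toy(y_reg: List[float], u_reg: List[float]) -> float:
    return y_reg[0] + u_reg[0]


u = [1.0, 1.0, 1.0, 1.0]
y = [0.0, 1.0, 1.5, 1.75]              # generated by y(n+1) = 0.5*y(n) + u(n)
sp_out = run_narx(toy, u, y, d_u=1, d_y=1, mode="sp")
p_out = run_narx(toy, u, y, d_u=1, d_y=1, mode="p")
print(sp_out, p_out)
```

In this toy run the SP-mode errors stay bounded because each step restarts from a measured output, whereas the P-mode errors accumulate through the feedback path.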
As a tool for nonlinear system identification, the NARX network has been suc-
cessfully applied to a number of real-world input-output modeling problems,
such as heat exchangers, waste water treatment plants, catalytic reforming
systems in a petroleum refinery and nonlinear time series prediction (see [29]
and references therein).
As mentioned in the introduction, the particular topic of this paper is the
issue of nonlinear univariate time series prediction with the NARX network.
In this type of application, the output-memory order is usually set to $d_y = 0$,
thus reducing the NARX network to the TDNN architecture [29], i.e.
$$y(n+1) = f[\mathbf{u}(n)] = f[u(n), u(n-1), \ldots, u(n-d_u+1)], \tag{6}$$
where $\mathbf{u}(n) \in \mathbb{R}^{d_u}$ is the input regressor. This simplified formulation of the
NARX network eliminates a considerable portion of its representational capa-
bilities as a dynamic network; that is, all the dynamic information that could
be learned from the past memories of the output (feedback) path is discarded.
For many practical applications, however, such as self-similar traffic model-
ing [16], the network must be able to robustly store information for a long
period of time in the presence of noise. In gradient-based training algorithms,
the fraction of the gradient due to information $n$ time steps in the past
approaches zero as $n$ becomes large. This effect is called the vanishing
gradient problem and has been pointed out as the main cause of the poor performance
of standard dynamical ANN models when dealing with long-range dependencies.
The original formulation of the NARX network does not circumvent the prob-
lem of vanishing gradient, but it has been demonstrated that it often per-
forms much better than standard dynamical ANNs in such a class of problems,
achieving much faster convergence and better generalization performance [28].
As pointed out in [27], an intuitive explanation for this improvement in per-
formance is that the output memories of a NARX neural network are repre-
sented as jump-ahead connections in the time-unfolded network that is often
encountered in learning algorithms such as the backpropagation through time
(BPTT). Such jump-ahead connections provide shorter paths for propagating
gradient information, reducing the sensitivity of the network to long-term
dependencies.
Hence, if the output memory is discarded, as shown in Equation (6), per-
formance improvement may no longer be observed. Bearing this in mind as a
motivation, we propose a simple strategy to allow the computational resources
of the NARX network to be fully explored in nonlinear time series prediction tasks.
3 Nonlinear Time Series Prediction with NARX Network
In this section we provide a short introduction to the theory of embedding and
state-space reconstruction. The interested reader is referred to [1] for further
details.
The state of a deterministic dynamical system is the information necessary to
determine the evolution of the system in time. In discrete time, this evolution
can be described by the following system of difference equations:
$$\mathbf{x}(n+1) = \mathbf{F}[\mathbf{x}(n)], \tag{7}$$
where $\mathbf{x}(n) \in \mathbb{R}^{d}$ is the state of the system at time step $n$, and $\mathbf{F}[\cdot]$ is a
nonlinear vector-valued function. A time series is a time-ordered set of
measurements $\{x(n)\}$, $n = 1, \ldots, N$, of a scalar quantity observed at the output of the
system. This observable quantity is defined in terms of the state $\mathbf{x}(n)$ of the
underlying system as follows:
$$x(n) = h[\mathbf{x}(n)] + \varepsilon(n), \tag{8}$$
where $h(\cdot)$ is a nonlinear scalar-valued function and $\varepsilon$ is a random variable which
accounts for modeling uncertainties and/or measurement noise. It is commonly
assumed that $\varepsilon(n)$ is drawn from a Gaussian white noise process. It can be
inferred immediately from Equation (8) that the observations $\{x(n)\}$ can be
seen as a projection of the multivariate state space of the system onto a
one-dimensional space. Equations (7) and (8) together describe the state-space
behavior of the dynamical system.
In order to perform prediction, one needs to reconstruct (estimate) as well
as possible the state space of the system using the information provided by
$\{x(n)\}$ only. In [40], Takens has shown that, under very general conditions, the
state of a deterministic dynamic system can be accurately reconstructed by a
time window of finite length sliding over the observed time series as follows:
$$\mathbf{x}_1(n) \triangleq [x(n), x(n-\tau), \ldots, x(n-(d_E-1)\tau)], \tag{9}$$
where $x(n)$ is the sample value of the time series at time $n$, $d_E$ is the embedding
dimension and $\tau$ is the embedding delay. Equation (9) implements the delay
embedding theorem [22]. According to this theorem, a collection of time-lagged
values in a $d_E$-dimensional vector space should provide sufficient information
to reconstruct the states of an observable dynamical system. By doing this,
we are indeed trying to unfold the projection back to a multivariate state
space whose topological properties are equivalent to those of the state space
that actually generated the observable time series, provided the embedding
dimension $d_E$ is large enough.
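The sliding window of Equation (9) is straightforward to build. A minimal sketch follows, in which the function name `delay_embed` is an assumption made for illustration; each returned row is one reconstructed state, ordered newest sample first as in Equation (9).

```python
from typing import List, Sequence


def delay_embed(x: Sequence[float], d_e: int, tau: int) -> List[List[float]]:
    """Return the rows [x(n), x(n-tau), ..., x(n-(d_e-1)*tau)] for every valid n."""
    if d_e < 1 or tau < 1:
        raise ValueError("d_e and tau must be positive integers")
    first = (d_e - 1) * tau            # earliest n with a complete window
    if first >= len(x):
        raise ValueError("series too short for this d_e and tau")
    return [[x[n - i * tau] for i in range(d_e)] for n in range(first, len(x))]


series = [float(v) for v in range(10)]
states = delay_embed(series, d_e=3, tau=2)
print(len(states), states[0], states[-1])
```

Note that the first $(d_E - 1)\tau$ samples yield no state vector, so a series of length $N$ produces $N - (d_E - 1)\tau$ reconstructed states.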
The embedding theorem also provides a theoretical framework for nonlinear
time series prediction, where the predictive relationship between the current
state $\mathbf{x}_1(n)$ and the next value of the time series is given by the following
equation:
$$x(n+1) = g[\mathbf{x}_1(n)]. \tag{10}$$
Once the embedding dimension $d_E$ and delay $\tau$ are chosen, one remaining
task is to approximate the mapping function $g(\cdot)$. It has been shown that a
feedforward neural network with enough neurons is capable of approximating
any nonlinear function to an arbitrary degree of accuracy. Thus, it can provide
a good approximation to the function $g(\cdot)$ by implementing the following
mapping:
$$\hat{x}(n+1) = \hat{g}[\mathbf{x}_1(n)], \tag{11}$$
where $\hat{x}(n+1)$ is an estimate of $x(n+1)$ and $\hat{g}(\cdot)$ is the corresponding
approximation of $g(\cdot)$. The estimation error, $e(n+1) = x(n+1) - \hat{x}(n+1)$, is
commonly used to evaluate the quality of the approximation.
Setting $\mathbf{u}(n) = \mathbf{x}_1(n)$ and $y(n+1) = x(n+1)$ in Equation (6) leads to an
intuitive interpretation of the nonlinear state-space reconstruction
procedure as equivalent to the time series prediction problem, whose goal
is to compute an estimate of $x(n+1)$. Thus, the only thing we have to do is
to train a TDNN model [36]. Once training is completed, the TDNN can be
used for predicting the next samples of the time series.
Despite the correctness of the TDNN approach, recall that it is derived from a
simplified version of the NARX network by eliminating the output memory. In
order to use the full computational abilities of the NARX network for nonlinear
time series prediction, we propose novel definitions for its input and output
regressors. Firstly, the input signal regressor, denoted by $\mathbf{u}(n)$, is defined by
the delay embedding coordinates of Equation (9):
$$\mathbf{u}(n) = \mathbf{x}_1(n) = [x(n), x(n-\tau), \ldots, x(n-(d_E-1)\tau)], \tag{12}$$
where we set $d_u = d_E$. In words, the input signal regressor $\mathbf{u}(n)$ is composed
of $d_E$ actual values of the observed time series, separated from each other by
$\tau$ time steps.
Secondly, since the NARX network can be trained in two different modes, the
output signal regressor $\mathbf{y}(n)$ can be written accordingly as:
$$\mathbf{y}_{sp}(n) = [x(n), \ldots, x(n-d_y+1)], \tag{13}$$
$$\mathbf{y}_{p}(n) = [\hat{x}(n), \ldots, \hat{x}(n-d_y+1)]. \tag{14}$$
Note that the output regressor for the SP-mode shown in Equation (13) contains
$d_y$ past values of the actual time series, while the output regressor for the
P-mode shown in Equation (14) contains $d_y$ past values of the estimated time
series. For a suitably trained network, no matter under which training mode,
these outputs are estimates of previous values of $x(n+1)$. Henceforth, NARX
networks trained using the regression pairs $\{\mathbf{y}_{sp}(n), \mathbf{x}_1(n)\}$ and $\{\mathbf{y}_{p}(n), \mathbf{x}_1(n)\}$
are denoted by NARX-SP and NARX-P networks, respectively. These NARX
networks implement the following predictive mappings, which are illustrated in
Figures 2 and 3:
$$\hat{x}(n+1) = \hat{f}[\mathbf{y}_{sp}(n), \mathbf{u}(n)] = \hat{f}\left[\,x(n), \ldots, x(n-d_y+1);\; x(n), x(n-\tau), \ldots, x(n-(d_E-1)\tau)\,\right], \tag{15}$$
$$\hat{x}(n+1) = \hat{f}[\mathbf{y}_{p}(n), \mathbf{u}(n)] = \hat{f}\left[\,\hat{x}(n), \ldots, \hat{x}(n-d_y+1);\; x(n), x(n-\tau), \ldots, x(n-(d_E-1)\tau)\,\right], \tag{16}$$
where the nonlinear function $\hat{f}(\cdot)$ can be readily implemented through an MLP
trained with the plain backpropagation algorithm.
Fig. 2. Architecture of the NARX network during training in the SP-mode ($z^{-\tau}$ =
$\tau$ unit time delays).
It is worth noting that Figures 2 and 3 correspond to the different ways the
NARX network can be trained; that is, in SP-mode or in P-mode, respectively.
During the testing phase, however, since long-term predictions are required,
the predicted values should be fed back to both the input regressor $\mathbf{u}(n)$ and
the output regressor $\mathbf{y}_{sp}(n)$ (or $\mathbf{y}_{p}(n)$), simultaneously. Thus, the resulting
predictive model has two feedback loops, one for the input regressor and another
for the output regressor, as illustrated in Figure 4.
Fig. 3. Architecture of the NARX network during training in the P-mode ($z^{-\tau}$ = $\tau$
unit time delays).
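The two feedback loops of the testing phase can be illustrated with a short sketch. It rests on several assumptions not stated in the text: the name `narx_recursive_forecast` and the callable signature `f_hat(y_regressor, u_regressor)` are hypothetical, both regressors are initialised from the last measured samples, and a single buffer is used for both loops because, in the univariate case, both regressors read the same series (measured first, predicted afterwards).

```python
from typing import Callable, List, Sequence


def narx_recursive_forecast(
    f_hat: Callable[[List[float], List[float]], float],
    history: Sequence[float],
    d_e: int,
    tau: int,
    d_y: int,
    horizon: int,
) -> List[float]:
    """Recursive prediction with predictions fed back into both regressors."""
    needed = max((d_e - 1) * tau + 1, d_y)
    if len(history) < needed:
        raise ValueError("history too short for the requested regressors")
    series = list(history)             # measured samples, extended by estimates
    for _ in range(horizon):
        n = len(series) - 1
        u_reg = [series[n - i * tau] for i in range(d_e)]   # spaced by tau, Eq. (12)
        y_reg = [series[n - i] for i in range(d_y)]         # consecutive samples
        series.append(f_hat(y_reg, u_reg))
    return series[len(history):]


# Toy stand-in for a trained network (not a real NARX model).
def toy(y_reg: List[float], u_reg: List[float]) -> float:
    return y_reg[0] + (u_reg[0] - u_reg[1])


forecast = narx_recursive_forecast(toy, [0.0, 1.0, 2.0, 3.0, 4.0],
                                   d_e=2, tau=2, d_y=1, horizon=3)
print(forecast)
```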
Thus, unlike the TDNN-based approach for the nonlinear time series predic-
tion problem, the proposed approach makes full use of the output feedback
loop. Equations (12) and (13) are valid only for one-step-ahead prediction
tasks. Again, if one is interested in multi-step-ahead or recursive prediction
tasks, the estimates $\hat{x}$ should also be inserted into both regressors in a recursive
fashion.
One may argue that, in addition to the parameters $d_E$ and $\tau$, the proposed
approach introduces one more to be determined, namely, $d_y$. However, this
parameter can be eliminated if we recall that, as pointed out in [18], the delay
embedding of Equation (9) has an alternative form given by:
$$\mathbf{x}_2(n) \triangleq [x(n), x(n-1), \ldots, x(n-m+1)], \tag{17}$$
where $m$ is an integer defined as $m \geq \tau \cdot d_E$. By comparing Equations (13)
and (17), we find that a suitable choice is given by $d_y \geq \tau \cdot d_E$, which
also satisfies the necessary condition $d_y > d_u$. However, we have found by
experimentation that a value chosen from the interval $d_E < d_y \leq \tau \cdot d_E$ is
sufficient for achieving a prediction performance better than those achieved
by conventional neural based time series predictors, such as the TDNN and
Elman architectures.
Fig. 4. Common architecture for the NARX-P and NARX-SP networks during the
testing (recursive prediction) phase.
Finally, the proposed approach is summarized as follows. A NARX network
is defined so that its input regressor $\mathbf{u}(n)$ contains samples of the measured
variable $x(n)$ separated by $\tau$ ($\tau > 0$) time steps from each other, while the
output regressor $\mathbf{y}(n)$ contains actual or estimated values of the same variable,
but sampled at consecutive time steps. As training proceeds, these estimates
should become more and more similar to the actual values of the time series,
indicating convergence of the training process. Thus, it is interesting to note
that the input regressor supplies medium- to long-term information about the
dynamical behavior of the time series, since the delay $\tau$ is usually larger than
unity, while the output regressor, once the network has converged, supplies
short-term information about the same time series.
4 Simulations and Discussion
In this paper, our aim is to evaluate, in qualitative and quantitative terms,
the predictive ability of the NARX-P and NARX-SP networks using two real-
world data sets, namely the chaotic laser and the VBR video traffic time series.
For the sake of completeness, a performance comparison with the TDNN and
Elman recurrent networks is also carried out.
It is worth emphasizing that our goal in the experiments is to evaluate whether the
output regressor $\mathbf{y}_{sp}$ (or $\mathbf{y}_{p}$) in the input layer of the NARX network improves
its prediction performance. Thus, to facilitate the performance comparison,
all the networks we simulate have two hidden layers and one output neuron.
All neurons in both hidden layers and the output neuron use hyperbolic tangent
activation functions. The standard backpropagation algorithm is used to
train the networks with learning rate equal to 0.001 (selected heuristically).
No momentum term is used. As for the Elman network, only the
neuronal outputs of the first hidden layer are fed back to the input layer.
The number of neurons, $N_{h,1}$ and $N_{h,2}$, in the first and second hidden layers,
respectively, are equal for all simulated networks. These values are chosen
according to the following heuristic rules [31]:
$$N_{h,1} = 2 d_E + 1 \quad \text{and} \quad N_{h,2} = \sqrt{N_{h,1}}, \tag{18}$$
where $N_{h,2}$ is rounded up towards the next integer number. The first rule
is motivated by Kolmogorov’s theorem on function approximation [19]. The
second rule simply states that the number of neurons in the second hidden
layer is the square root of the product of the dimension of the first hidden layer
and the dimension of the output layer. Finally, we set $d_y = 2 \tau d_E$, where $\tau$ is
selected as the value occurring at the first minimum of the mutual information
function of the time series [15].
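The total number $M$ of adjustable parameters (weights and thresholds) for
each of the simulated networks is given by:
$$\begin{aligned}
M &= (d_E + 1) \cdot N_{h,1} + (N_{h,1} + 2) \cdot N_{h,2} + 1 && \text{(TDNN)} \\
M &= (N_{h,1} + d_E + 1) \cdot N_{h,1} + (N_{h,1} + 2) \cdot N_{h,2} + 1 && \text{(ELMAN)} \\
M &= (d_E + d_y + 1) \cdot N_{h,1} + (N_{h,1} + 2) \cdot N_{h,2} + 1 && \text{(NARX)}
\end{aligned} \tag{19}$$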
Once a given network has been trained, it is required to provide estimates of
the future sample values of a given time series for a certain prediction horizon
$N$. The predictions are executed in a recursive fashion until the desired prediction
horizon is reached, i.e., during $N$ time steps the predicted values are fed back
in order to take part in the composition of the regressors. The networks are
evaluated in terms of the Normalized Mean Squared Error (NMSE),
$$\mathrm{NMSE}(N) = \frac{1}{N \hat{\sigma}_x^2} \sum_{n=1}^{N} \left( x(n+1) - \hat{x}(n+1) \right)^2, \tag{20}$$
where $x(n+1)$ is the actual value of the time series, $\hat{x}(n+1)$ is the predicted
value, $N$ is the prediction horizon (i.e., how many steps into the future a given
network has to predict), and $\hat{\sigma}_x^2$ is the sample variance of the actual time series.
The NMSE values are averaged over 10 training/testing runs.
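The NMSE of Equation (20) can be computed as sketched below. The function names are hypothetical, and two choices are assumptions rather than details given in the text: the variance defaults to the unbiased sample variance of the supplied actual values, and the curve over the horizon is obtained by evaluating the same formula on growing prefixes.

```python
from statistics import variance
from typing import List, Optional, Sequence


def nmse(actual: Sequence[float], predicted: Sequence[float],
         series_variance: Optional[float] = None) -> float:
    """Normalized mean squared error over len(actual) prediction steps."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if series_variance is None:
        # Assumption: unbiased sample variance of the supplied actual values.
        series_variance = variance(actual)
    squared_error = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return squared_error / (len(actual) * series_variance)


def nmse_curve(actual: Sequence[float], predicted: Sequence[float]) -> List[float]:
    """NMSE(N) for N = 1, ..., len(actual), using one fixed variance estimate."""
    var = variance(actual)
    return [nmse(actual[:k], predicted[:k], var) for k in range(1, len(actual) + 1)]


actual = [1.0, 2.0, 3.0, 4.0]
mean_predictor = [2.5, 2.5, 2.5, 2.5]   # always predicts the series mean
curve = nmse_curve(actual, mean_predictor)
print(nmse(actual, mean_predictor), curve)
```

A predictor that always outputs the series mean gives an NMSE close to one, which is the usual reference point for reading such curves.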
Chaotic laser time series - The first data sequence to be used to evaluate
the NARX-P and NARX-SP models is the chaotic laser time series [42]. This
time series comprises measurements of the intensity pulsations of a single-mode
far-infrared NH$_3$ laser in a chaotic state [21]. It was made available
worldwide during a time series prediction competition organized by the Santa
Fe Institute and, since then, has been used in benchmarking studies.
The laser time series has 1500 points which have been rescaled to the range
$[-1, 1]$. The rescaled time series was further split into two sets for the purpose
of performing 1-fold cross-validation (a single hold-out split), so that the
first 1000 samples were used for training and the remaining 500 samples for
testing. The embedding dimension was estimated as $d_E = 7$ by applying Cao’s
method [8], which is a variant of the well-known false nearest neighbors method
(see footnote 2). The embedding delay was estimated as $\tau = 2$. For the chosen
parameters, the total number of modifiable weights and biases for the three
simulated neural architectures is the following: $M = 189$ (TDNN), $M = 414$
(Elman) and $M = 609$ (NARX).
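As an arithmetic check, the counts quoted above follow from the heuristic rules of Equation (18), the choice $d_y = 2\tau d_E$, and the parameter formulas given for each architecture. The function names below are hypothetical and serve only this illustration.

```python
import math
from typing import Dict, Tuple


def hidden_sizes(d_e: int) -> Tuple[int, int]:
    """Hidden-layer sizes from Eq. (18); the second one is rounded up."""
    n_h1 = 2 * d_e + 1
    n_h2 = math.ceil(math.sqrt(n_h1))
    return n_h1, n_h2


def parameter_counts(d_e: int, tau: int) -> Dict[str, int]:
    """Number of weights and thresholds for each simulated architecture."""
    n_h1, n_h2 = hidden_sizes(d_e)
    d_y = 2 * tau * d_e                      # output-memory order used in the paper
    upper_layers = (n_h1 + 2) * n_h2 + 1     # second hidden layer plus output neuron
    return {
        "TDNN": (d_e + 1) * n_h1 + upper_layers,
        "Elman": (n_h1 + d_e + 1) * n_h1 + upper_layers,
        "NARX": (d_e + d_y + 1) * n_h1 + upper_layers,
    }


counts = parameter_counts(d_e=7, tau=2)      # laser series settings
print(hidden_sizes(7), counts)
```

For $d_E = 7$ and $\tau = 2$ this gives $N_{h,1} = 15$, $N_{h,2} = 4$ and $d_y = 28$, which reproduces the three values of $M$ stated in the text.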
The results are shown in Figures 5(a), 5(b) and 5(c), for the NARX-SP, Elman
and TDNN networks, respectively (see footnote 3). A visual inspection clearly shows that
the NARX-SP model performed better than the other two architectures. It is
important to point out that a critical situation occurs around time step 60,
where the laser intensity collapses suddenly from its highest value to its lowest
one and then starts recovering gradually. The NARX-SP model
is able to emulate the laser dynamics very closely. The Elman network
did well until the critical point. From this point onwards, it was unable to
emulate the laser dynamics faithfully, i.e., the predicted laser intensities have
much lower amplitudes than the actual ones. The TDNN network had a very
poor predictive performance. From a dynamical point of view, the output of
the TDNN seems to be stuck in a limit cycle, since it only oscillates endlessly.
Fig. 5. Results for the laser series: (a) NARX-SP, (b) Elman, (c) TDNN.
Footnote 2: A recent technique for the estimation of $d_E$ can be found in [25].
It is worth mentioning that the previous results do not mean that the TDNN
and Elman networks cannot learn the dynamics of the chaotic laser. Indeed, it
was shown to be possible in [18] using sophisticated training algorithms, such
as backpropagation through time (BPTT) [43] or real-time recurrent learning
(RTRL) [44]. As for the TDNN network, our results confirm the
observations reported by Eric Wan [41, p. 62] in his PhD thesis. There, he states
that the standard MLP, using the input regressor $\mathbf{x}_1(n)$ only and trained with
the instantaneous gradient descent rule, has been unable to accurately predict
the laser time series. In his own words, “the downward intensity collapse went
completely undetected”, as in our case.
Footnote 3: The results for the NARX-P network are not shown since they are
equivalent to those shown for the NARX-SP network.
In sum, our results show that under the same conditions, i.e. with the same
number of hidden neurons, using the standard gradient-based backpropagation
algorithm, a short time series for training, and the same number of training
epochs, the NARX-SP network performs better than the TDNN and Elman
networks. It seems that the presence of the output regressor $\mathbf{y}_{sp}$ indeed
improves the predictive power of the NARX network.
For the sake of comparison, under similar training and network evaluation
methodologies, the FIR-MLP model proposed by Eric Wan [41] achieved very
good long-term prediction results on the laser time series, which are equivalent
to those obtained by the NARX-SP network. However, the FIR-MLP required
M= 1105 adjustable parameters to achieve such a good performance, while
the NARX-SP model required roughly half the number of parameters (i.e.
M= 609).
The long-term predictive performances of all networks can be assessed in more
quantitative terms by means of NMSE curves. Figure 6(a) shows the evolution
of NMSE as a function of the prediction horizon $N$. It is worth emphasizing two
types of behavior in this figure. Below the critical time step (i.e. $N < 60$), the
NMSE values reported are approximately the same, with a small advantage to
the Elman network. This means that, as long as the critical point is not reached, all
networks predict the time series well. For $N > 60$, the NARX-P and NARX-SP
models reveal their superior performance. Figure 6(b) shows the evolution
of the variance of the predicted values with $N$. Note that the highest values of
the variance occur around $N = 60$. Before this point the NARX-SP network
provides the smallest variances among all models. For $N > 60$, the variances
obtained for the Elman network are of the same order of magnitude as those
generated by the NARX-SP network; however, the latter provides much more
accurate estimates than the former, as shown in Figure 6(a).
Fig. 6. (a) Multi-step-ahead NMSE values and (b) the variances of the predicted
values for the laser time series.
A useful way to qualitatively evaluate the performance of the NARX-SP net-
work for the laser series is through recurrence plots [9]. These diagrams de-
scribe how a reconstructed state-space trajectory recurs or repeats itself, being
useful for characterizing a system as random, periodic or chaotic. For example,
random signals tend to occupy the whole area of the plot, indicating
that no value tends to repeat itself. Any structured information embedded
in a periodic or chaotic signal is reflected in a certain visual pattern in the
recurrence plot.
Recurrence plots are built by calculating the distance between two points in
the state-space at times $i$ (horizontal axis) and $j$ (vertical axis):
$$\delta_{ij} = \| \mathbf{D}(i) - \mathbf{D}(j) \|, \tag{21}$$
where $\| \cdot \|$ is the Euclidean norm. The state vectors
$\mathbf{D}(n) = [\hat{x}(n), \hat{x}(n-\tau), \ldots, \hat{x}(n-(d_E-1)\tau)]$ are built using the points of the predicted time series.
Then, a dot is placed at the coordinate $(i, j)$ if $\delta_{ij} < r$. In this paper, we set
$r = 0.4$ and the prediction horizon to $N = 200$.
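The construction of Equation (21) can be sketched as a Boolean matrix whose true entries mark the dots of the plot. The function name `recurrence_matrix` and the short toy series are assumptions made for illustration; only the threshold $r = 0.4$ is taken from the text.

```python
import math
from typing import List, Sequence


def recurrence_matrix(series: Sequence[float], d_e: int, tau: int,
                      r: float) -> List[List[bool]]:
    """Entry [i][j] is True when the delay vectors D(i) and D(j) are closer than r."""
    first = (d_e - 1) * tau
    if first >= len(series):
        raise ValueError("series too short for this d_e and tau")
    # Delay vectors D(n) = [x(n), x(n-tau), ..., x(n-(d_e-1)*tau)].
    states = [[series[n - i * tau] for i in range(d_e)]
              for n in range(first, len(series))]
    # math.dist computes the Euclidean distance (Python 3.8+).
    return [[math.dist(a, b) < r for b in states] for a in states]


# Toy periodic series with period 3: state 0 recurs at state 3.
toy_series = [0.0, 0.1, 1.0, 0.0, 0.1, 1.0]
plot = recurrence_matrix(toy_series, d_e=2, tau=1, r=0.4)
for row in plot:
    print("".join("#" if dot else "." for dot in row))
```

For a periodic signal such as this toy series the dots line up along diagonals parallel to the main one, which is the kind of visual pattern the text refers to.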
The results are shown in Figure 7. It can be easily seen that the recurrence
plots shown in Figures 7(a) and 7(b) are the most similar to one another,
indicating that the NARX-SP network reproduced the original state-space
trajectory more faithfully than the other networks.
VBR video traffic time series - Due to the widespread use of the Internet
and other packet/cell switching broad-band networks, variable bit rate (VBR)
video traffic will certainly be a major part of the traffic produced by multimedia
sources. Hence, many researchers have focused on VBR video traffic prediction
to devise network management strategies that satisfy QoS requirements.
From the point of view of modeling, a particularly challenging issue in network
traffic prediction comes from the important discovery of self-similarity and
long-range dependence (LRD) in broad-band network traffic [24]. Researchers
have also observed that VBR video traffic typically exhibits burstiness over
multiple time scales (see [5,20], for example).
In this section, we evaluate the predictive abilities of the NARX-P and NARX-SP
networks using a VBR video traffic time series (trace) extracted from Jurassic
Park, as described in [37]. This video traffic trace was encoded at the University
of Würzburg with MPEG-I. The frame rates of the coded Jurassic Park video
sequence have been used. The MPEG algorithm uses three different types
of frames: Intraframe (I), Predictive (P) and Bidirectionally-Predictive (B).
These three types of frames are organized as a group (Group of Pictures, GoP)
defined by the distance L between I frames and the distance M between P
frames. If the cyclic frame pattern is {IBBPBBPBBPBBI}, then L=12 and
M=3. These values for L and M are used in this paper.
Fig. 7. Recurrence plots of (a) the original laser time series, and the ones produced
by the (b) NARX-SP, (c) TDNN and (d) Elman networks.
The resulting time series has 2000 points which have been rescaled to the
range $[-1, 1]$. The rescaled time series was further split into two sets for
cross-validation purposes: 1500 samples for training and 500 samples for testing.
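The preprocessing just described can be sketched as follows. The text does not say how the rescaling was done, so min-max scaling over the whole series is an assumption here, as are the function names and the synthetic stand-in for the trace (the real video data are not reproduced).

```python
from typing import List, Sequence, Tuple


def rescale(series: Sequence[float], lo: float = -1.0, hi: float = 1.0) -> List[float]:
    """Min-max rescaling to [lo, hi] (assumed; the paper does not name the method)."""
    s_min, s_max = min(series), max(series)
    if s_max == s_min:
        raise ValueError("a constant series cannot be rescaled")
    return [lo + (v - s_min) * (hi - lo) / (s_max - s_min) for v in series]


def split(series: Sequence[float], n_train: int) -> Tuple[Sequence[float], Sequence[float]]:
    """Chronological split: the first n_train samples train, the rest test."""
    return series[:n_train], series[n_train:]


# Synthetic stand-in for the 2000-point trace.
trace = [float(i % 37) for i in range(2000)]
scaled = rescale(trace)
train, test = split(scaled, n_train=1500)
print(len(train), len(test), min(scaled), max(scaled))
```

The split is kept chronological rather than shuffled, since the test samples must follow the training samples in time for recursive prediction to be meaningful.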
Fig. 8. Evaluation of the sensitivity of the neural networks with respect to (a) the
embedding dimension and (b) the number of training epochs.
Evaluation of the long-term predictive performances of all networks can also
help assess the sensitivity of the neural models to important training
parameters, such as the number of training epochs and the size of the embedding
dimension, as shown in Figure 8.
Figure 8(a) shows the NMSE curves for all neural networks versus the value of
the embedding dimension, $d_E$, which varies from 3 to 24. For this simulation
we trained all the networks for 300 epochs, with $\tau = 1$ and $d_y = 24$. One can easily
note that the NARX-P and NARX-SP performed better than the TDNN
and Elman networks. In particular, the performance of the NARX-SP was
rather impressive, in the sense that it remains constant throughout the studied
range. From $d_E \geq 12$ onwards, the performances of the NARX-P and NARX-SP
are practically the same. It is worth noting that the performances of the
TDNN and Elman networks approach those of the NARX-P and NARX-SP
networks when $d_E$ is of the same order of magnitude as $d_y$. This suggests that,
for NARX-SP (or NARX-P) networks, we can select a small value for $d_E$ and
still have a very good performance.
Figure 8(b) shows the NMSE curves obtained from the simulated neural networks
versus the number of training epochs, ranging from 90 to 600. For this
simulation we trained all the networks with $\tau = 1$, $d_E = 12$ and $d_y = 2 \tau d_E = 24$.
Again, better performances were achieved by the NARX-P and NARX-SP.
The performance of the NARX-SP is practically the same from 100 epochs on.
The same behavior is observed for the NARX-P network from 200 epochs on.
This can be explained by recalling that the NARX-P uses estimated values
to compose the output regressor $\mathbf{y}_{p}(n)$ and, because of that, it learns more slowly
than the NARX-SP network.
Another important behavior can be observed for the TDNN and Elman networks.
From 200 epochs onwards, these networks increase their NMSE values
instead of decreasing them. We hypothesize that this behavior may be
evidence of overfitting, a phenomenon observed when powerful nonlinear models,
with excessive degrees of freedom (too many adjustable parameters), are
trained for a long period with a finite size data set. In this sense, the results of
Figure 8(b) strongly suggest that the NARX-SP and NARX-P networks are
much more robust than the TDNN and Elman networks. In other words, the
presence of an output regressor in the NARX-SP and NARX-P networks seems
to make them less prone to overfitting than the Elman and TDNN models, even
though the number of free parameters in the NARX networks is higher than
in the Elman and TDNN models.
Finally, we show in Figures 9(a), 9(b) and 9(c) typical estimated VBR video
traffic traces generated by the TDNN, Elman and NARX-SP networks,
respectively. For this simulation, all the neural networks are required to predict
recursively the sample values of the VBR video traffic trace for 300 steps
ahead in time. For all networks, we have set $d_E = 12$, $\tau = 1$, $d_y = 24$ and
trained the neural models for 300 epochs. For these training parameters, the
NARX-SP predicted the video traffic trace much better than the TDNN and
Elman networks.
Fig. 9. Recursive predictions obtained by (a) TDNN, (b) Elman and (c) NARX-SP
networks (horizontal axis: frame number).
As we did for the laser time series, we again emphasize that the results reported
in Figure 9 do not mean that the TDNN and Elman networks can never
predict the video traffic trace as well as the NARX-SP. They only mean
that, for the same training and configuration parameters, the NARX-SP has
greater computational power provided by the output regressor. Recall that
the MLP is a universal function approximator; thus, any MLP-based
neural model, such as the TDNN and Elman networks, is in principle able to
approximate complex functions with arbitrary accuracy, once enough training
epochs and data are provided.
5 Conclusions and Further Work
In this paper, we have shown that the NARX neural network can success-
fully use its output feedback loop to improve its predictive performance in
complex time series prediction tasks. We used the well-known chaotic laser
and real-world VBR video traffic time series to evaluate empirically the pro-
posed approach in long-term prediction tasks. The results have shown that the
proposed approach consistently outperforms standard neural network based
predictors, such as the TDNN and Elman architectures.
Currently we are evaluating the proposed approach on several other applica-
tions that require long-term predictions, such as electric load forecasting and
financial time series prediction. Applications to signal processing tasks, such
as communication channel equalization, are also being planned.
The authors would like to thank CNPq (grant #506979/2004-0), CAPES/PRODOC
and FUNCAP for their financial support.
