Received May 5, 2022, accepted May 21, 2022, date of publication May 27, 2022, date of current version June 6, 2022.
Digital Object Identifier 10.1109/ACCESS.2022.3178521
Improved Transformer Model for Enhanced
Monthly Streamflow Predictions of the
Yangtze River
CHUANFENG LIU 1, DARONG LIU 1,3, AND LIN MU 2,3
1College of Marine Science and Technology, China University of Geosciences, Wuhan 430074, China
2College of Life Science and Oceanography, Shenzhen University, Shenzhen 518061, China
3Southern Marine Science and Engineering Guangdong Laboratory (Guangzhou), Guangzhou 511458, China
Corresponding author: Darong Liu (lidr1169@cug.edu.cn)
This work was supported in part by the Key Special Project for Introduced Talents Team of Southern Marine Science and Engineering
Guangdong Laboratory (Guangzhou) under Grant GML2019ZD0604, the National Natural Science Foundation of China under
Grant U2006210, and in part by the Shenzhen Fundamental Research Program under Grant JCYJ20200109110220482.
ABSTRACT Over the past few decades, floods have severely damaged production and daily life, causing
enormous economic losses. Streamflow forecasts prepare us to fight floods ahead of time and mitigate
the disasters arising from them. Streamflow forecasting demands a high-capacity model that can make
precise long-term predictions. Traditional physics-based hydrological models can only make short-term
predictions for streamflow, while current machine learning methods can only obtain acceptable results
in normal years without floods. Previous studies have demonstrated a close relation between El Niño-
Southern Oscillation (ENSO) and the streamflow of the Yangtze River. However, traditional models, holding
the encoder–decoder architecture, only have one encoder block and cannot support bivariate time series
forecasting. In this study, a transformer-based double-encoder-enabled model was proposed, called the
double-encoder Transformer, with a distinctive characteristic: a "cross-attention" mechanism that can capture
the relation between two time series sequences. Using river flow observation collected by the Yangtze
River Water Resources Commission and El Niño-Southern Oscillation (ENSO) observation collected by
the National Oceanic and Atmospheric Administration, the model can achieve better performance. By using
variational mode decomposition (VMD) technique for preprocessing, the model can make precise long-term
predictions for the river flow of the Yangtze River. A monthly prediction of 21 years (from January 1998 to
December 2018) was made, and the results indicate that the double-encoder Transformer outperforms
mainstream time series models.
INDEX TERMS Streamflow prediction, Yangtze River, deep learning, transformer, variational mode
decomposition, flood forecasts.
I. INTRODUCTION
The Yangtze River has seen numerous floods in its history.
Affected by various factors, including the El Nino weather
pattern [1]–[3], the river flow time series are complex and
nonlinear [4]. Flow prediction is a practical problem and has
been drawing an increasing amount of attention. Many stud-
ies have been done to predict the river flow for months [5], [6]
or even years in advance [7], [8]. Streamflow predictions help
in the ability to fight floods in advance and help local admin-
istrators make better decisions and mitigate disasters [9].
The associate editor coordinating the review of this manuscript and
approving it for publication was Ehab Elsayed Elattar.
Over the past few decades, many numerical and machine
learning methods have been developed to predict streamflow.
Numerical prediction models [10]–[12] can simulate the
interactions of various physical processes, such as atmo-
spheric circulation and the evolution of long-term weather
in the physical world [13]. Running a numerical model is analogous to conducting a particular physics
experiment, and such models can achieve satisfactory results in short-term
forecasting [14]. The soil and water assessment tool (SWAT) [10]
was proposed as a way to predict the effects of land man-
agement on water, sediment, and chemicals in a large water-
shed with complex and varied soil types, land-use patterns,
and management practices. SWAT is a distributed watershed
hydrologic model based on the geographic information sys-
tem (GIS). SWAT primarily uses the space-based information
from remote sensing and geographical information systems
to simulate various hydrological physical and chemical pro-
cesses [10]. However, hydrologic numerical models have some severe drawbacks:
1. Large amounts of highly accurate data are required; 2. Long-term forecasting is less accurate;
and 3. Numerical methods are computationally intensive,
requiring vast computing resources. Statistical prediction
methods generally belong to traditional machine learning.
Many statistical models have been employed in predicting
streamflow [15]–[17]. The support vector machine (SVM) [18]
was designed as a classifier with a rigorous theoretical and mathematical basis.
In 2016, Shuang Zhu et al. used SVMs to predict streamflow in
the upper reaches of the Yangtze River; the R² could
reach 0.87 in a single-year monthly forecast [19]. Statistical
hydrological models can perform well in normal years but
poorly in flood years, because the structure of their parameters
limits model complexity, so they cannot predict anomalies precisely.
Traditional deep learning models extract features and can be used as prediction models.
The artificial neural network (ANN), convolutional neural network (CNN) [20], and
recurrent neural network (RNN) [21] are promising methods
for predicting river flow. An ANN can approximate
unknown and nonlinear functions with arbitrary precision,
which is why ANNs are known as universal function approximators [22]. ANN models
have been used to predict streamflow [23]–[25]. However, the ANN has two drawbacks:
1. As the sequence length increases, the number of trainable parameters
increases sharply. 2. An ANN cannot capture sequence information
(e.g., position and order). It is generally accepted
that an ANN is not suitable for processing time series.
CNNs and RNNs were designed to overcome the limitations of ANNs. Previous studies have found CNNs and
RNNs to be more accurate than other models for dealing
with time series. Shun-Yao Shih et al. used a CNN for
multivariate time series forecasting [26]. Shaojie Bai et al.
proposed a new CNN architecture named the temporal convolutional
network (TCN) [27] and obtained good results
on various time series datasets. The RNN was specifically
designed to deal with time series [21]. However, the vanilla
RNN architecture has problems that prevent it from being used
directly. Based on the RNN architecture, long short-term memory
(LSTM) [28] has been proposed to alleviate
the problems within the RNN. Currently, LSTM is widely
used in hydrologic prediction tasks [13], [29], [30]. In 2020,
Liu et al. used LSTM to predict the middle reaches of
the Yangtze River; in flood years, the R² of the monthly
prediction could reach 0.89 [31]. However, there are also
some shortcomings in CNNs and RNNs: 1. RNNs are
not computationally parallel, making them slow and time-consuming.
2. CNNs lose critical sequence information
while processing time series, decreasing their accuracy.
3. CNNs and RNNs cannot capture long-term dependence
effectively.
The attention mechanism was first proposed by
Bahdanau et al. in 2014 [32], and since then, it has been widely
applied in various fields of deep learning. With the attention
mechanism, models can ignore low-value information
and focus on high-value information. The attention layer
can capture long-term dependency by computing the
"attention" between all pairs of points. The process does not
need to be sequential, so it is computationally parallel. The
attention mechanism can also be used as a feature extractor,
outperforming CNN and RNN architectures. Based on an
attention mechanism, Google proposed a new model called
Transformer [33]. Transformer abandoned the traditional
CNNs and RNNs, and the entire network structure is com-
pletely composed of the attention mechanism. Transformer
was initially designed for natural language processing (NLP).
Still, currently, many studies have indicated that Transformer
is faster and stronger than CNNs and RNNs in dealing
with time series [34]–[36]. However, like the previous mod-
els, Transformer also has trouble in predicting flood peak
discharge.
There are three deficiencies in the previous works: 1. They cannot forecast the flood peak discharge precisely;
the R² in flood years has been lower than 0.9. 2. They
cannot make long-term stable predictions with high accuracy:
when making long-term predictions (e.g., more than a decade
of predictions), the average R² may only be around 0.85.
3. Previous models do not have structures for multidimensional
data processing; instead, they combine all the dimensions into
feature vectors as input to make multidimensional predictions for
river flow, limiting the models' capability. Previous
multidimensional prediction models all use this feature-vector
method [26], [37], [38]. In this study,
a restructured Transformer model combined with VMD is
proposed, namely double-encoder Transformer. Variational
mode decomposition (VMD) is widely used to preprocess
complex and nonlinear data [39]. It can decompose a signal
into several different modes to transform the signal in the time
domain into the frequency domain. The inherent features of
the original signal can be better reflected, making the model
fit well in both normal and flood years. The restructured
Transformer is still an encoder–decoder architecture with two
encoders and one decoder. The structure of the multiencoder
can support multivariate prediction. The inputs of the two
encoders are observed river flow data and observed ENSO
data, respectively. To better obtain the correlation between
El Nino and flow, the decoder receives the outputs of the
two encoders as inputs and then learns and computes the
correlation between them.
The contributions of the present paper are as follows:
• The double-encoder Transformer is proposed to enhance the prediction capability, significantly improving the flow prediction accuracy of flood years with an R² higher than 0.95.
• A reliable long-term (21-year) flow prediction of the Yangtze River has been performed, achieving high accuracy.
• The "cross-attention" mechanism is proposed to solve the bivariate prediction problem, which validates the attention-based model's potential value for multivariate prediction.
• A heuristic method is applied to determine the K value in the VMD algorithm.
II. METHODOLOGY
The specific research methods can be divided into data pre-
processing and deep learning. The critical step of data pre-
processing is the VMD algorithm. In section II-A, starting
with data preprocessing, the focus is on the details of the
VMD algorithm. Deep learning techniques generally follow
an encoder–decoder paradigm [38], mainly using CNNs and
RNNs and their variants. The double-encoder Transformer
retains the encoder–decoder architecture with attention blocks.
In section II-B, the principles and processes of the attention
operation are presented and described. Section II-C provides the
core part of the work, where the model's architecture and network
layers are presented in detail. Please refer to Fig. 1 for an
overview and to the corresponding sections for more information.
FIGURE 1. The overview of the work.
A. DATA PREPROCESSING
This section presents the process of preprocessing the original
observed flow data and ENSO data. There are three steps:
1. data normalization; 2. adding time stamps; 3. applying
the VMD algorithm to decompose the signal. The key to the
third step is choosing the appropriate number of frequency components
produced by the VMD; how to determine this number is demonstrated below.
There are two steps before employing the VMD.
1. Data normalization. The flow of the Yangtze River
changes dramatically, ranging from 5000 m3/month to nearly
70000 m3/month. If the scale difference of the data is too large,
the data with the larger scale will dominate, which
will reduce the accuracy of the model.
Algorithm 1 Complete Optimization of VMD [39]
Initialize $\hat{u}_k^{1}$, $\omega_k^{1}$, $\hat{\lambda}^{1}$, $n = 0$
repeat
  $n = n + 1$
  for $k = 1 : K$ do
    Update $\hat{u}_k$ for all $\omega \ge 0$:
    $$\hat{u}_k^{n+1}(\omega) \leftarrow \frac{\hat{f}(\omega) - \sum_{i<k}\hat{u}_i^{n+1}(\omega) - \sum_{i>k}\hat{u}_i^{n}(\omega) + \frac{\hat{\lambda}^{n}(\omega)}{2}}{1 + 2\alpha\left(\omega - \omega_k^{n}\right)^2}$$
    Update $\omega_k$:
    $$\omega_k^{n+1} \leftarrow \frac{\int_{0}^{\infty} \omega\,\lvert\hat{u}_k^{n+1}(\omega)\rvert^{2}\,d\omega}{\int_{0}^{\infty} \lvert\hat{u}_k^{n+1}(\omega)\rvert^{2}\,d\omega}$$
  end for
  Dual ascent for all $\omega \ge 0$:
  $$\hat{\lambda}^{n+1}(\omega) \leftarrow \hat{\lambda}^{n}(\omega) + \tau\left(\hat{f}(\omega) - \sum_{k}\hat{u}_k^{n+1}(\omega)\right)$$
until convergence: $\sum_{k} \lVert\hat{u}_k^{n+1} - \hat{u}_k^{n}\rVert_2^2 \,/\, \lVert\hat{u}_k^{n}\rVert_2^2 < \epsilon$
In addition, data with a large dimension will have a longer axis when updating
the model using gradient descent, which means that the
convergence of the model will be slow [40]. In this study,
we used min–max scaling to perform data normalization.
2. Adding time stamps. For time series, it is essential
to add time information while training [41], [42]. The classic
Transformer only encodes the input data with positions and
does not include more fine-grained time information such as
year, month, day, or hour [38]. The dataset used here has a monthly
resolution, so we added time stamps of the year and month to
each piece of data. Like the positional encoding [33], these time
stamps are embedded into higher dimensions and added
to the input data as sequential information.
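For illustration, the sketch below shows one way such an embedding could be assembled in PyTorch; it is not the authors' code, and the Conv1d value embedding, the nn.Embedding time-stamp encoding, and all layer sizes are assumptions loosely based on this description and on Algorithm 2.

```python
import math
import torch
import torch.nn as nn

class DataEmbedding(nn.Module):
    """Value embedding + sinusoidal positional encoding + month/year time-stamp embedding."""
    def __init__(self, c_in: int = 1, d_model: int = 64, max_len: int = 512):
        super().__init__()
        self.value_emb = nn.Conv1d(c_in, d_model, kernel_size=3, padding=1)
        self.month_emb = nn.Embedding(13, d_model)    # months 1..12 (hypothetical indexing)
        self.year_emb = nn.Embedding(200, d_model)    # year index, e.g. year - 1900 (assumption)
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x, month_idx, year_idx):
        # x: (batch, seq_len, c_in); month_idx, year_idx: (batch, seq_len) integer tensors
        val = self.value_emb(x.transpose(1, 2)).transpose(1, 2)       # (batch, seq_len, d_model)
        stamps = self.month_emb(month_idx) + self.year_emb(year_idx)  # time-stamp embedding
        return val + self.pe[: x.size(1)] + stamps                    # add positional encoding
```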
VMD works efficiently with nonlinear and nonstationary
data such as streamflow by decomposing a signal into different
modes of distinct spectral bands. In the original description,
a mode is defined as a signal whose numbers of local extrema
and zero-crossings differ by at most one. Now, the definition
has been slightly changed into the so-called intrinsic mode function
(IMF) [43], [44]. The complete VMD algorithm [39] can be
summarized in Algorithm 1, where $u_k := \{u_1, u_2, u_3, \ldots, u_K\}$
is the shorthand notation for the set of all modes (IMFs) and
$\omega_k := \{\omega_1, \omega_2, \omega_3, \ldots, \omega_K\}$ for
their center frequencies, respectively. The role of the Lagrangian
multiplier $\lambda$ is to enforce the constraint. The goal of VMD is
to decompose the original signal into the subsignals $u_k$.
VMD has proven to be a better algorithm than empirical
mode decomposition (EMD) [45]. VMD is much more
robust to sampling and noise and is supported by mathematical
theory [39]. However, the number of IMFs decomposed by
VMD is not fixed, and the effect of VMD is mainly affected
by this number; K will be used as a shorthand notation for
it in the following presentation. Applying VMD
to decompose the original signal produces losses. When
K is too small, a lot of important information in the original
signal will be filtered out, affecting the accuracy of subsequent
predictions. The loss can be observed by decomposing the
original signal into IMFs, recomposing them, and
comparing the result with the original signal. The loss is
defined as the difference between the recomposed signal and
the original signal. To measure the loss, the coefficient of
determination (R²) is introduced. An R² between 0 and 1
reflects the difference between the two distributions: the
smaller the R², the greater the difference. The loss can thus be
measured as Loss = 1 − R². Fig. 2(a) and Fig. 2(b)
show the influence of K on the loss for the flow and ENSO datasets,
respectively. The larger K is, the smaller the loss. However,
when K is too large, it will cause a few problems: 1. There
will be gaps in the high-frequency IMF, and the high-frequency
signal becomes intermittent. 2. The center frequencies of
adjacent IMFs will be close to each other, resulting in
modal repetition and extra noise. 3. More IMFs mean more
computing resources and more time.
FIGURE 2. The correlation between K and Loss on the flow and ENSO
datasets.
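The loss curves of Fig. 2 can be reproduced in spirit with a short Python sketch that decomposes a series for a given K, recomposes the IMFs, and computes Loss = 1 − R². The third-party vmdpy package and its argument names are an assumption here, and the VMD hyperparameters shown are common defaults rather than the paper's settings.

```python
import numpy as np
from vmdpy import VMD                 # third-party VMD implementation (assumed available)
from sklearn.metrics import r2_score

def recomposition_loss(signal: np.ndarray, K: int) -> float:
    """Decompose `signal` into K IMFs, recompose them, and return Loss = 1 - R^2."""
    u, _, _ = VMD(signal, alpha=2000, tau=0.0, K=K, DC=0, init=1, tol=1e-7)
    recomposed = u.sum(axis=0)
    n = min(len(signal), len(recomposed))   # VMD implementations may trim odd-length inputs
    return 1.0 - r2_score(signal[:n], recomposed[:n])

# Sweeping K as in Fig. 2: the loss shrinks as K grows
# for K in range(2, 17):
#     print(K, recomposition_loss(flow_series, K))
```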
In this study, a heuristic approach was adopted to select the
appropriate K value. We apply the Hilbert transform (HT) [46] to
each IMF. The HT can be described as (1):
$$H[x(t)] = \hat{x}(t) = x(t) * \frac{1}{\pi t} = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{x(\tau)}{t - \tau}\, d\tau \qquad (1)$$
where $*$ and $\tau$ are the convolution operator and the
integration variable, respectively. The integral is taken as a
Cauchy principal value, which avoids the singularities at
$\tau = t$ and $\tau = \pm\infty$. The instantaneous amplitude $A(t)$ and
the instantaneous phase $\psi(t)$ can be calculated from $x(t)$ and $\hat{x}(t)$:
$$A(t) = \pm\sqrt{x^2(t) + \hat{x}^2(t)} \qquad (2)$$
$$\psi(t) = \arctan\frac{\hat{x}(t)}{x(t)} \qquad (3)$$
Then, the instantaneous frequency $IF(t)$ is calculated for all
points except the two endpoints:
$$IF(t) = \psi'(t) = \frac{x(t)\hat{x}'(t) - x'(t)\hat{x}(t)}{A^2(t)} \qquad (4)$$
Finally, the mean of all the instantaneous frequencies of each
IMF is the central frequency of that IMF:
$$CF = \frac{\sum_{t=2}^{n-1} IF(t)}{n - 2} \qquad (5)$$
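Under these definitions, the central frequency of an IMF can be estimated in a few lines of Python: scipy.signal.hilbert returns the analytic signal x(t) + j·x̂(t), from which the instantaneous phase and frequency follow. The sampling rate (one sample per month) is an assumption of this sketch.

```python
import numpy as np
from scipy.signal import hilbert

def central_frequency(imf: np.ndarray, fs: float = 1.0) -> float:
    """Estimate the central frequency of one IMF via the Hilbert transform (Eqs. (1)-(5))."""
    analytic = hilbert(imf)                             # x(t) + j * x_hat(t)
    phase = np.unwrap(np.angle(analytic))               # instantaneous phase psi(t)
    inst_freq = np.diff(phase) / (2.0 * np.pi) * fs     # instantaneous frequency IF(t)
    return float(np.mean(inst_freq[1:-1]))              # average, dropping the endpoints (Eq. (5))

# The K whose highest-frequency IMF has the largest central frequency (the inflection
# point in Fig. 4) is selected: K = 9 for the flow dataset and K = 8 for ENSO.
```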
Fig. 3(a) and Fig. 3(b) show the central frequency distributions
of the two datasets under different K values, respectively.
Here, we focus on the high frequencies. When the
K value is in a reasonable range, the higher-frequency IMFs
meaningfully have higher central frequencies. When
K is too large, there is an obvious inflection point where the
central frequency goes down as K goes up. This inflection
point means that K is already large enough. When the K value
is too high, the high-frequency part of the signal will break
off, and there is no instantaneous frequency at the breakpoint;
hence, after averaging, the center frequency decreases
instead. Fig. 4(a) and Fig. 4(b) show the variation of the
highest center frequency with K, where the inflection points
can be seen. The inflection point marks the highest center
frequency, and K should be selected at that point. In this
study, K = 9 was selected for the flow dataset and K = 8 for
the ENSO dataset.
B. ATTENTION MECHANISM
Self-attention is the core part of Transformer. The attention
mechanism is a heuristic method that refers to the process
of human attention, hence enabling neural networks to focus
more on what is important [32]. Therefore, the Transformer can
also be regarded as a feature extractor. As the name suggests,
self-attention is about inner attention, which excels in dealing
with time series. For a time series sequence, self-attention can
capture the correlations between each time step and all other
time steps so that the prediction results will be more accurate.
Deep learning techniques mainly adopt an encoder–
decoder architecture using RNNs and CNNs. However,
both the RNN and CNN models are much slower than
attention-based models. Compared with RNN and CNN
architectures, the self-attention mechanism is faster [33] and
is almost unaffected by the length of the sequence. In general,
the longer the sequence, the slower the processing. With this
in mind, an experiment was performed to test the speed of
the three architectures. Three time series models, including
vanilla TCN, vanilla LSTM, and vanilla Transformer, were
selected. By recording the time consumed by three models
while training, the speed can be measured. We used the
flow time series data to train these models, respectively.
The length of the input sequence ranges from 12 to 96 and
the models were trained for 10 and 50 epochs. This exper-
iment and all the following experiments were performed in
the environment given in Table 1. Fig. 5shows the time
FIGURE 3. The original signal was decomposed into K IMFs (e.g., K = 9 means the original signal was
decomposed into nine IMFs). The range of K was from 2 to 16. The subfigure shows the central frequency of every IMF
under a particular K. LF: low frequency; HF: high frequency.
consumed to train the models of three architectures for 10 and
50 epochs under different lengths of the input sequence.
It is supposed that the longer the sequence, the slower the
processing.
FIGURE 4. The correlation between K and the central frequencies of the high-frequency IMFs. Because the high-frequency IMFs become discontinuous when K is too large, their central frequencies decrease, so a K value should be selected that maximizes the center frequency. The green dot represents the highest center frequency: K = 9 for the flow dataset and K = 8 for the ENSO dataset.
TABLE 1. The experimental environment.
FIGURE 5. Time consumed during training. The models were trained for 10 and 50 epochs. The lengths of the input sequence are (12, 24, 36, 48, 60, 72, 84, 96).
For the LSTM, the training time increases significantly as the length of the sequence grows. However,
for CNN and Transformer, the time consumption changes
gently and is less affected by the sequence length. The reason
is that these sequence lengths are not yet long enough to slow
those models down; in the settings of this study, sequence lengths
ranging from 12 to 96 are reasonable and meaningful. The time
consumption of the TCN increases noticeably once the sequence
length exceeds 48, whereas the Transformer only shows a
significant increase when the sequence length exceeds 400.
In short, the Transformer consumes the least time of the three
and is hardly affected by the sequence length.
On the one hand, the RNN and CNN architectures are
slower than the attention architecture. On the other hand, the
CNN and RNN models are powerless to capture long-term
dependency as efficiently as attention-based models. The
reasons for this are as follows:
1. RNN is a linear architecture, and to obtain the corre-
lations between time steps, the operation of each time step
depends strictly on the previous steps [21]. RNN is a step-
by-step architecture that limits its parallelism. As a result, the
RNN is slow. The CNN uses convolution kernels to extract
features. The size of the kernels and step size of the convo-
lution affect its speed. Moreover, to better extract features,
a multilayer convolutional network is required. Although a
CNN is much faster than an RNN, it is still not as fast
as self-attention. Self-attention completely abandons RNN
structures and instead introduces matrix operations. Matrix
operations are parallel, which means that each time step
is computed simultaneously instead of one by one, which
significantly increases the speed. Because it is parallel, the
length of the sequence has little effect on the speed.
2. Although gate mechanisms such as those in LSTM and its
variant, the gated recurrent unit (GRU) [47], alleviate the
problem of long-term dependence, the RNN is still powerless
in dealing with exceedingly long-term dependence [33]. If the
length of the inputs is L, the RNN obtains a length of
dependence that is shorter than L. The length of dependence
that a CNN can obtain depends completely on the size of its
convolution kernels, which are generally shorter than L.
In contrast, because the attention value between any two time
steps is computed, self-attention can obtain arbitrarily long-term
dependence, focusing on high-weight information and ignoring
low-weight information. Self-attention obtains a length of
dependence that can be L.
The input sequence is transformed into three vectors
through three matrices in the self-attention mechanism. These
are query matrix (Q), key matrix (K) and value matrix (V).
The canonical self-attention is defined based on the tuple of
inputs (Q, K, V). Self-attention performs the scaled dot product,
which can be summarized as follows [33]:
$$\text{Self-Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (6)$$
where $Q \in \mathbb{R}^{d_x \times d_k}$, $K \in \mathbb{R}^{d_x \times d_k}$, $V \in \mathbb{R}^{d_x \times d_v}$, and $d_x$ is the
input dimension. $d_k$ stands for the dimension of the Q and K matrices,
and $d_v$ stands for the dimension of the V matrix.
We use $\frac{1}{\sqrt{d_k}}$ as the scaling factor.
The concepts of query, key, and value are derived from
information retrieval systems: a key is matched against a
query to retrieve the corresponding value, and the matching
weight is the similarity between the query and the key.
In matrix operations, the dot product is a method to compute
the similarity of two matrices, and the operation $Q \cdot K^{T}$
computes the similarity of each pair of time steps. A weighted
combination is then performed based on these similarities,
which serve as the weights.
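A minimal NumPy sketch of Eq. (6) follows; the projection matrices would be learned parameters in practice, and their shapes here are only illustrative assumptions.

```python
import numpy as np

def self_attention(x: np.ndarray, Wq: np.ndarray, Wk: np.ndarray, Wv: np.ndarray) -> np.ndarray:
    """Canonical scaled dot-product self-attention of Eq. (6).
    x: (L, d_x) input sequence; Wq, Wk: (d_x, d_k); Wv: (d_x, d_v)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # (L, L) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # row-wise softmax
    return weights @ V                                         # (L, d_v) attended output
```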
In this study, based on the attention mechanism, we propose
the "cross-attention" mechanism (Fig. 6). The
cross-attention mechanism combines both dot-product
attention and additive attention to efficiently capture the
dependency between two time series sequences. As in
traditional attention mechanisms, there are a query, a key,
and a value. The cross-attention block comes immediately
after the two encoders. The outputs of the flow encoder are
transformed into the query and value matrices, and the
outputs of the ENSO encoder are transformed into the key
matrix. Let $q_i$, $k_i$, and $v_i$ stand for the $i$-th rows of Q, K, and V,
respectively. The process of cross-attention can be described
as follows:
First, additive attention is used to compute a weighted average of the key
matrix into a global key vector $k$:
$$k = \sum_{i=1}^{L} \alpha_i \cdot k_i \qquad (7)$$
FIGURE 6. The cross-attention.
The weight $\alpha_i$ is calculated as follows:
$$\alpha_i = \frac{\exp\!\left(k_i w_k^{T} / \sqrt{d}\right)}{\sum_{j=1}^{L} \exp\!\left(k_j w_k^{T} / \sqrt{d}\right)} \qquad (8)$$
where $w_k \in \mathbb{R}^{d}$ is a trainable vector. The Hadamard product
is used to model the nonlinear relation between $q_i$ and $k$ and obtain a
score $s_i$:
$$s_i = q_i \odot k \qquad (9)$$
The $i$-th query's cross-attention is defined as:
$$\text{Cross-attention}(q_i, K, V) = \mathrm{Softmax}(s_i / \sqrt{d}) \cdot V \qquad (10)$$
It has been proven that the additive attention has a lower
computational complexity than dot product attention. The
main purpose of using the additive attention is that it can
quickly summarize important information in a sequence with
linear complexity, which greatly improves the efficiency of
multivariate forecasting.
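The sketch below gives one dimensionally consistent NumPy reading of Eqs. (7)–(10); how the softmax-normalized score vector is applied to V in Eq. (10) admits several interpretations, and the element-wise gating used here is an assumption rather than the authors' exact formulation.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray, w_k: np.ndarray) -> np.ndarray:
    """Cross-attention of Eqs. (7)-(10): additive attention summarizes the ENSO keys K into a
    global key, which interacts with each flow query through a Hadamard product.
    Q, V: (L, d) from the flow encoder; K: (L, d) from the ENSO encoder; w_k: (d,) trainable."""
    d = K.shape[-1]
    alpha = softmax(K @ w_k / np.sqrt(d), axis=0)   # (L,) additive-attention weights, Eq. (8)
    k_global = alpha @ K                            # (d,)  global key vector, Eq. (7)
    S = Q * k_global                                # (L, d) Hadamard scores, Eq. (9)
    return softmax(S / np.sqrt(d), axis=-1) * V     # Eq. (10) under the element-wise reading
```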
C. DOUBLE-ENCODER TRANSFORMER
It has been proven that El Nino has a close correlation with
rainfall, which affects the flow of the river, so El Nino has
a significant impact on streamflow [1]. Classic Transformer
only has the self-attention block, which is only suitable for
dealing with a single time series. When dealing with prediction
problems involving multivariate data, it cannot capture
the attention among different kinds of variates effectively.
The aim of the attention mechanism is to obtain the corre-
lation of one point to all the other points and then weight the
average of them so that the attention operation could be more
than just self-attention. Because the self-attention process can
obtain a correlation among the time steps of one sequence,
the correlation between different sequences can be obtained,
too. In this study, we improved the classic Transformer by
using two encoders and one decoder with a cross-attention
block to make streamflow predictions with the El Niño covariate.
The whole double-encoder architecture is illustrated
in Fig. 7. Because Transformer abandoned the RNN
linear sequence structure to support parallel computing, the
positional information in time series will be lost. Therefore,
adding order signals to vectors is necessary to help the model
learn this positional information, so positional encoding [33]
is used to solve this problem. Positional encoding works
by combining order information and vectors to form a new
representation input to the model to learn order information.
Positional encoding is itself a vector with order information.
In this study, order information and date information are
added to the input sequence. The datasets are monthly, so the
year and month information were embedded into vectors and
added to the original input vectors together with the positional encoding.
The original hydrological time series are nonlinear and
nonstationary. Some features will be ignored if used directly,
causing decreases in the prediction accuracy, especially in
flood years. The VMD could decompose original data into
FIGURE 7. The architecture of double-encoder Transformer.
several IMFs. Compared with other frequency decomposition
techniques, the fundamental components decomposed
by VMD have physical significance and are much more
robust to sampling and noise. In this study, the flow sequence
was decomposed into nine IMFs and ENSO sequence into
eight IMFs. The IMFs were sent into encoders to perform
self-attention. Each IMF should have its own encoder block,
so there are nine encoder blocks for the flow sequence and
eight for the ENSO sequence. The present study adopted
the transfer learning method: flow IMFs share a common
input embedding block and self-attention block, and so do
ENSO IMFs, but the layers behind them are different. To sum
up, there are two input embedding blocks, two self-attention
blocks, 9+8 feed-forward blocks and 9+8 linear blocks.
The introduction of transfer learning can significantly save
the space occupied by the model and accelerate the training
speed.
All the IMFs of flow and all the IMFs of ENSO were recomposed
into one sequence each after being processed by the encoders.
The recomposed flow and ENSO data were sent into the
decoder to perform cross-attention. In river flow forecasting,
the ENSO dataset works as the covariate. The main purpose
of cross-attention is to capture the impact of ENSO on
flow. Flow was converted into Q and V, and ENSO was converted
into K. The details of cross-attention were presented in
section II-B.
Let $L_{in}$ and $L_{out}$ stand for the input window size and the output
window size. The proposed method can be summarized as
Algorithm 2, where $Q_{self\_f}$, $K_{self\_f}$, $V_{self\_f}$, $Q_{self\_e}$, $K_{self\_e}$,
and $V_{self\_e}$ are conversion matrices in self-attention. All the
IMFs of flow are converted into Q, K, and V using the
same $Q_{self\_f}$, $K_{self\_f}$, and $V_{self\_f}$ matrices, and all the IMFs
of ENSO are converted into Q, K, and V using the same $Q_{self\_e}$,
$K_{self\_e}$, and $V_{self\_e}$ matrices. The $w_f$ and $w_e$ are trainable
vectors used to recompose the IMFs.
Algorithm 2 The Process of the Proposed Method
Input: the flow data $F = \{f_1, \ldots, f_{L_{in}}\}$; the ENSO data $E = \{e_1, \ldots, e_{L_{in}}\}$; the time stamps $ST = \{st_1, \ldots, st_{L_{in}}\}$.
Output: the predictions of river flow $F_{pred} = \{f_{L_{in}+1}, \ldots, f_{L_{in}+L_{out}}\}$.
Initialize $Q_{self\_f}$, $K_{self\_f}$, $V_{self\_f}$, $Q_{self\_e}$, $K_{self\_e}$, $V_{self\_e}$, $w_f$, $w_e$, $w_k$, $Q_{cross}$, $K_{cross}$, $V_{cross}$;
Use the VMD to decompose the original data:
  $VMD(F) = \{F_{IMF\_1}, \ldots, F_{IMF\_9}\}$;  $VMD(E) = \{E_{IMF\_1}, \ldots, E_{IMF\_8}\}$;
foreach modal in $VMD(F)$ and $VMD(E)$ do
  Perform data embedding: $EMB = \mathrm{Conv1d}(modal) + PE + \mathrm{Conv1d}(ST)$;
  Convert $EMB$ into Q, K, V matrices using $Q_{self\_f}$, $K_{self\_f}$, $V_{self\_f}$ (or $Q_{self\_e}$, $K_{self\_e}$, $V_{self\_e}$) and compute the self-attention:
  $\text{Self-attention}(modal) = \mathrm{softmax}\!\left(QK^{T}/\sqrt{d}\right)V$;
end for
The outputs of the two encoders are $F_{self} = \{F_{self\_1}, \ldots, F_{self\_9}\}$ and $E_{self} = \{E_{self\_1}, \ldots, E_{self\_8}\}$;
Recompose $F_{self}$ and $E_{self}$ into one sequence each:
  $F' = \mathrm{concat}(F_{self\_1}, \ldots, F_{self\_9}) \cdot w_f$;  $E' = \mathrm{concat}(E_{self\_1}, \ldots, E_{self\_8}) \cdot w_e$;
Convert $F'$ into Q and V matrices using $Q_{cross}$ and $V_{cross}$; convert $E'$ into K using $K_{cross}$. Compute the cross-attention:
foreach $q_i$ in Q do
  $s_i = q_i \odot k = q_i \odot \sum_{i=1}^{L_{in}} \alpha_i \cdot k_i$;
end for
$\text{Cross-attention}(F', E') = \mathrm{Softmax}(\mathrm{concat}(s_1, \ldots, s_{L_{in}})/\sqrt{d}) \cdot V$;
The prediction of river flow is $F_{pred} = \mathrm{Feedforward}(\text{Cross-attention}(F', E'))$
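To make the data flow of Algorithm 2 concrete, the following PyTorch sketch composes the main pieces: weight-shared self-attention encoders over the flow and ENSO IMFs, learned recomposition vectors, the cross-attention interaction, and a feed-forward prediction head. This is not the authors' implementation; VMD and the data embedding are assumed to happen outside the module, and the layer sizes, head count, and the reading of Eq. (10) are all assumptions.

```python
import torch
import torch.nn as nn

class DoubleEncoderSketch(nn.Module):
    """Simplified double-encoder Transformer following Algorithm 2 (illustrative only)."""
    def __init__(self, d_model: int = 64, n_flow_imfs: int = 9, n_enso_imfs: int = 8,
                 l_in: int = 72, l_out: int = 12):
        super().__init__()
        self.flow_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.enso_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.w_f = nn.Parameter(torch.randn(n_flow_imfs))   # recomposition weights for flow IMFs
        self.w_e = nn.Parameter(torch.randn(n_enso_imfs))   # recomposition weights for ENSO IMFs
        self.w_k = nn.Parameter(torch.randn(d_model))        # additive-attention vector (Eq. (8))
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(l_in * d_model, l_out))

    def forward(self, flow_imfs: torch.Tensor, enso_imfs: torch.Tensor) -> torch.Tensor:
        # flow_imfs: (batch, 9, L_in, d_model); enso_imfs: (batch, 8, L_in, d_model), already embedded
        f_enc = torch.stack([self.flow_attn(m, m, m)[0] for m in flow_imfs.unbind(dim=1)], dim=1)
        e_enc = torch.stack([self.enso_attn(m, m, m)[0] for m in enso_imfs.unbind(dim=1)], dim=1)
        F_rec = torch.einsum("bkld,k->bld", f_enc, self.w_f)       # recompose flow IMFs
        E_rec = torch.einsum("bkld,k->bld", e_enc, self.w_e)       # recompose ENSO IMFs
        d = F_rec.size(-1)
        alpha = torch.softmax(E_rec @ self.w_k / d ** 0.5, dim=1)  # (batch, L_in) additive weights
        k_global = torch.einsum("bl,bld->bd", alpha, E_rec)        # global ENSO key vector
        scores = F_rec * k_global.unsqueeze(1)                     # Hadamard interaction (Eq. (9))
        mixed = torch.softmax(scores / d ** 0.5, dim=-1) * F_rec   # one reading of Eq. (10)
        return self.head(mixed)                                    # (batch, L_out) flow predictions
```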
III. EXPERIMENT AND RESULTS
This study focuses on the streamflow predictions on the
Hankou Hydrological Station. The goal is to predict river
flow with the flow and ENSO datasets. The flow dataset was
collected by the Yangtze River Water Resources Commission,
and the ENSO dataset was collected by the National Oceanic
and Atmospheric Administration. Both datasets were col-
lected monthly from January 1952 to December 2018, hence
totaling 67 years. The datasets were divided into two parts:
one from 1952 to 1997 and another from 1998 to 2018. The
former was used to train the model, and the latter was used to
make predictions. The Yangtze River has experienced several
floods from 1952 to 2018, most recently in 2016. In 1998,
the Yangtze River experienced a devastating flood because
of strong subtropical highs, resulting in the most significant
surge over the past 50 years. In this study, the proposed model
was proven to work well in both flood years and normal years
by making predictions from 1998 to 2018. We have chosen
streamflow data in 1998, 2016, and 2018 (two flood years and
one normal year) to make predictions and then made 21 years
of rolling predictions from January 1998 to December 2018 to
further verify the overall reliability.
Some representative models were selected as comparisons,
including the traditional statistical analysis method: autore-
gressive integrated moving average model (ARIMA), the
convolutional neural network: TCN [27], the representative of
the RNN: LSTMa [48], and the classic Transformer [33]. For
CNNs, RNNs, and classic Transformer, there is no structure
specifically designed to support multidimensional prediction,
so it is common practice to combine all dimensions into feature vectors
as input when dealing with multidimensional time series.
All the models have a 2-D correlated input to make multidimensional
time series forecasts. Under the fixed-size window forecasting
setting, for the double-encoder Transformer, the input flow
sequence is $F_t = \{f_{t-L}, \ldots, f_t \mid f_i \in \mathbb{R}^{d_x}\}$
from time $t-L$ to time $t$, and the input ENSO sequence
is $E_t = \{e_{t-L}, \ldots, e_t \mid e_i \in \mathbb{R}^{d_x}\}$. The input sequence
for TCN, LSTMa, and the classic Transformer is
$FE_t = \{(f_{t-L}, e_{t-L}), \ldots, (f_t, e_t) \mid f_i, e_i \in \mathbb{R}^{d_x}\}$. The output
is the corresponding predicted sequence $F' = \{f'_t, \ldots, f'_{t+l} \mid f'_i \in \mathbb{R}^{d_x}\}$.
$L$ is the length of the inputs. It has been proven that El Niño
is cyclical, with a cycle of about 5.5 years [49], and through
experiments we verified that better results are obtained
when the input length ($L$) is 72 (6 years). $l$ is the length of
the prediction steps. In this study, $l$ is set to 12 so that the
decision makers can take action to prepare for the flood a
year in advance. For the TCN model, the kernel size was 24,
and the stride was 1. The MSE-Loss was selected as the loss
function, here using AdamW as the optimizer. To measure
the performance of the results, two evaluation metrics were
introduced, including root mean squared error (RMSE) and
coefficient of determination (R²). The RMSE was defined as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \qquad (11)$$
and the R² was defined as:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \qquad (12)$$
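Both metrics are straightforward to compute; the following small Python functions implement Eqs. (11) and (12):

```python
import numpy as np

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared error, Eq. (11)."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination, Eq. (12)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)
```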
A. ORIGINAL DATA DECOMPOSITION
To improve the accuracy and speed up the convergence of
the model, data normalization is necessary. Min–max scaling
was used to perform data normalization. For data
$X = \{x_1, x_2, x_3, \ldots, x_n\}$, the min–max scaling is defined as:
$$x_{i(\mathrm{scaled})} = \frac{x_i - \min(X)}{\max(X) - \min(X)} \cdot (\max - \min) + \min \qquad (13)$$
where $[\min, \max]$ is the scaling interval into which all the data
are scaled. In this study, both the flow and ENSO data were
scaled into $[-1, 1]$. The scaled data from
January 1952 to December 2018 were decomposed into nine
January 1952 to December 2018 were decomposed into nine
IMFs and eight IMFs, respectively. The decomposed results
are shown in Fig. 8(a) and Fig. 8(b).
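A small sketch of Eq. (13) is given below; the inverse transform is not described in the text but is the natural way to map scaled predictions back to physical units, so it is included here as an assumption.

```python
import numpy as np

def minmax_scale(x: np.ndarray, lo: float = -1.0, hi: float = 1.0):
    """Min-max scaling of Eq. (13) into [lo, hi]; also returns (min, max) for later inversion."""
    x_min, x_max = float(x.min()), float(x.max())
    scaled = (x - x_min) / (x_max - x_min) * (hi - lo) + lo
    return scaled, (x_min, x_max)

def minmax_inverse(scaled: np.ndarray, x_min: float, x_max: float,
                   lo: float = -1.0, hi: float = 1.0) -> np.ndarray:
    """Invert Eq. (13) to recover values in the original units (m3/month for flow)."""
    return (scaled - lo) / (hi - lo) * (x_max - x_min) + x_min
```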
B. THE FLOOD YEARS AND NORMAL YEAR PREDICTIONS
The main purpose of river flow forecasting is to predict
floods, especially the flood peak discharge, helping those
affected prepare to fight the flood ahead of time. So it is
crucial to predict floods accurately, which previous models
have not been able to do. Traditional models could obtain
decent performance in normal years but did not work well
in flood years. In this section, the time series from January
1952 to December 1997 were used to train the model, and
1998, 2016, and 2018 were selected to make predictions to
measure the model’s performance. Taking 1998 predictions
as an example, the input data is the flow data and ENSO data
from January 1992 to December 1997 (a total of 72 months).
The output is a 12-month prediction for 1998.
ARIMA, TCN, LSTMa, and classic Transformer were
selected as comparisons that made the same predictions
on these models. Fig. 9(a) and Fig. 9(b) show the whole
12 months of predictions for 1998 and 2016, respectively.
Additionally, data in 2018 were selected as a representative of
normal years to measure the model’s performance in normal
years. Fig. 9(c) shows the whole 12-month prediction of
2018. Finally, the metrics including R² and RMSE were used
to measure the performances of all the models in 1998, 2016,
and 2018. Table 2 shows the RMSE and R² of all the models
for these three years of predictions.
All the models could fit the streamflow change trend, which
increases first and then decreases. They all performed well
at the beginning and end of the year (when the streamflow
is low), and the gap among their forecasts was widest in
June, July, and August (when the streamflow is at its yearly
peak). The double-encoder Transformer had a higher
R² and a lower RMSE. In the flood years (1998 and 2016), it is
evident that the double-encoder Transformer fits the actual
values better than the other models, especially at the flood peak,
and its R² could reach more than 0.95. In the normal year
(2018), all the models performed well, but the double-encoder
Transformer was more accurate than the others, with an
R² of 0.94.
C. THE 21 YEARS OF ROLLING PREDICTIONS
In section III-B, three years were selected to perform pre-
dictions, and the results showed that the double-encoder
Transformer had significant advantages in flood years and
normal years.
FIGURE 8. The IMFs of the original data.
TABLE 2. The streamflow forecasting results on three years (five models).
However, the results of only three selected years might not be
representative of long-term performance. In this section, a 21-year
rolling forecast (from 1998 to 2018) was made to test the
models’ capacity for long-term prediction. Fig. 10(a) shows
the 21 years of rolling predictions of all models and the
actual value; because the curves overlap heavily, Fig. 10(b) shows
the double-encoder Transformer and the actual value in iso-
lation. Fig. 10(c) and Fig. 10(d) are corresponding scatter
plots.
The annual R² and RMSE were calculated for each year and each of
the five models, and the 21-year averages of
R² and RMSE were then obtained. Fig. 11(a) and Fig. 11(b) show the 21 years
of annual R² and RMSE, respectively. The results of the
21-year prediction show that the double-encoder Transformer
was better than the traditional time series models, with an
average R² of more than 0.91 and an average RMSE lower
than 2600 m3/month.
The experimental results show that the double-encoder
Transformer is superior to the current time series forecasting
models. In the 21 years of rolling predictions, the average
R² of the double-encoder Transformer could reach more
than 0.91, which is higher than the other models by about
0.1, and the RMSE was just 2579 m3/month, nearly half
that of the other models. It can be seen from Fig. 10(c) that
the points of the five models are concentrated near
the actual (red) line at first, which means that they all
performed well when streamflow was low. As the flow continued
to increase, the distribution of the points became more
scattered, resulting in poor predictions.
FIGURE 9. The forecasted streamflow of the models and the validation scatter plots in 1998, 2016, and 2018.
In this case, the double-encoder Transformer had a considerable advantage
of being more accurate in predictions of high streamflow.
The results fully prove that the double-encoder Transformer
could perform very well in normal and flood years and
could be reliable enough in long-term predictions with high
accuracy.
FIGURE 10. A 21-year rolling forecast (from January 1998 to December 2018) was made to test the models' capacity. Fig. 10(a) shows the
actual values and the prediction results of all models over the 21 years. To show the proposed model's performance more clearly, Fig. 10(b) shows
the actual values and the result of the double-encoder Transformer in isolation. Fig. 10(c) and Fig. 10(d) are the corresponding scatter plots
for model comparison. It can be seen from the scatter plots that when streamflow was low, all the models performed well
(scatter points are concentrated near the red line), but when streamflow was high, all models except the double-encoder Transformer
performed worse (scatter points are away from the red line).
FIGURE 11. The R² and RMSE of the 21 years of rolling predictions.
IV. CONCLUSION
In this work, the streamflow forecasting problem of the
Yangtze River was studied, and a new Transformer-
based double-encoder-enabled model was proposed: double-
encoder Transformer. This model, combined with the VMD
algorithm, can effectively make precise long-term stream-
flow forecasting of the Yangtze River, especially in flood
years. Although the model is still of an encoder–decoder
architecture, it alleviates the limitation of a traditional
encoder–decoder architecture. Specifically, we designed the
cross-attention mechanism to handle the challenges of not
supporting multivariate prediction in traditional time series
forecasting models. Combining the additive attention with
the dot product attention, the cross-attention mechanism can
effectively capture the relation between flow and ENSO data.
The experiments on real-world data of the Hankou Hydro-
logical Station demonstrated the effectiveness of the new
model for enhancing the prediction capacity both in normal
and flood years. A reliable long-term monthly prediction
(from 1998 to 2018) was made. There were floods in two of
these 21 years (1998 and 2016). The R² in both years is higher
than 0.95, and the RMSE is just 3224 m3/month, even though
the flow reached nearly 70000 m3/month in 1998.
Other mainstream forecasting models, including ARIMA,
TCN, LSTMa, and classic Transformer, were selected as
comparisons to demonstrate the superiority of the double-
encoder Transformer. In the 21 years of predictions, the
average R² of the double-encoder Transformer was about 0.91,
which is higher than the other models by about 0.1, and the
RMSE was 2579 m3/month, which is significantly lower
than that of the other models. These experimental results show that
the double-encoder Transformer can be used in real-world
streamflow predictions.
The main work of this study is to predict the stream-
flow of the Yangtze River, focusing on flood prediction.
However, drought years are also important, because water is
a vital resource on Earth and we depend heavily on river
water. Accurate drought forecasting is promising for future
research. The variation of river flow in the Yangtze River is
related to many variables. In this work, we made a bivariate
forecast with the ENSO data. The outcomes can be improved
by adding more variables to make multivariate forecasting.
The cross-attention mechanism is a promising method that
can efficiently summarize the features of a sequence into a
vector and compute the attention between the independent
variable and covariates. This work is a good start toward
multivariate long-sequence time series forecasting. We are
excited about the future of models with the cross-attention
mechanism.
REFERENCES
[1] J. Wei, W. Wang, Q. Shao, Y. Rong, W. Xing, and C. Liu, ‘Influence of
mature El Niño–Southern oscillation phase on seasonal precipitation and
streamflow in the Yangtze River Basin, China,’’ Int. J. Climatol., vol. 40,
no. 8, pp. 3885–3905, Jun. 2020.
[2] J. Peng, X. Luo, F. Liu, and Z. Zhang, ‘Analysing the influences of ENSO
and PDO on water discharge from the Yangtze River into the sea,’’ Hydrol.
Processes, vol. 32, no. 8, pp. 1090–1103, Apr. 2018.
[3] Q. Zhang, C.-Y. Xu, T. Jiang, and Y. Wu, ‘Possible influence of ENSO
on annual maximum streamflow of the Yangtze River, China,’ J. Hydrol.,
vol. 333, nos. 2–4, pp. 265–274, Feb. 2007.
[4] F. Huang, Z. Xia, N. Zhang, Y. Zhang, and J. Li, ‘‘Flow-complexity
analysis of the upper reaches of the Yangtze River, China,'' J. Hydrol. Eng.,
vol. 16, no. 11, pp. 914–919, Nov. 2011.
[5] M. Cheng, F. Fang, T. Kinouchi, I. M. Navon, and C. C. Pain, ‘‘Long lead-
time daily and monthly streamflow forecasting using machine learning
methods,’ J. Hydrol., vol. 590, Nov. 2020, Art. no. 125376.
[6] V. K. Keteklahijani, S. Alimohammadi, and E. Fattahi, ‘‘Predicting
changes in monthly streamflow to karaj dam reservoir, iran, in climate
change condition and assessing its uncertainty,’’ Ain Shams Eng. J., vol. 10,
no. 4, pp. 669–679, Dec. 2019.
[7] C. Ma and Y. Li, ‘‘Improving forecasting accuracy of annual runoff time
series using RBFN based on EEMD decomposition,’ in Proc. DEStech
Trans. Eng. Technol. Res., 2017, pp. 211–216.
[8] G. Yu, H. Ye, Z. Xia, and X. Zhao, ‘‘Application of projection pursuit auto
regression model in predicting runoff of Yangtze River,’ J. Hohai Univ.,
Natural Sci., vol. 37, no. 3, pp. 263–266, 2009.
[9] N. Noori and L. Kalin, ‘‘Coupling SWAT and ANN models for enhanced
daily streamflow prediction,’ J. Hydrol., vol. 533, pp. 141–151, Feb. 2016.
[10] J. Arnold, ‘‘Swat-soil and water assessment tool,’ USDA Agricult.
Res. Service, Grassland, Soil Water Res. Lab., Temple, TX, USA,
Tech. Rep., 1994.
[11] M. Kirkby and K. Beven, ‘A physically based, variable contributing area
model of basin hydrology,’’ Hydrol. Sci. J., vol. 24, no. 1, pp. 43–69, 1979.
[12] R.-J. Zhao, ''The Xinanjiang model applied in China,'' J. Hydrol., vol. 135,
nos. 1–4, pp. 371–381, 1992.
[13] Z. M. Yaseen, S. O. Sulaiman, R. C. Deo, and K.-W. Chau, ‘An enhanced
extreme learning machine model for river flow forecasting: State-of-the-
art, practical applications in water resource engineering area and future
research direction,’ J. Hydrol., vol. 569, pp. 387–408, Feb. 2019.
[14] W. Collischonn, R. Haas, I. Andreolli, and C. E. M. Tucci, ‘‘Forecast-
ing river Uruguay flow using rainfall forecasts from a regional weather-
prediction model,’ J. Hydrol., vol. 305, nos. 1–4, pp. 87–98, Apr. 2005.
[15] Z. A. Al-Sudani, S. Q. Salih, A. Sharafati, and Z. M. Yaseen, ‘‘Develop-
ment of multivariate adaptive regression spline integrated with differential
evolution model for streamflow simulation,’’ J. Hydrol., vol. 573, pp. 1–12,
Jun. 2019.
[16] W.-C. Wang, K.-W. Chau, D.-M. Xu, L. Qiu, and C.-C. Liu, ‘The annual
maximum flood peak discharge forecasting using Hermite projection pur-
suit regression with SSO and LS method,’ Water Resour. Manage., vol. 31,
no. 1, pp. 461–477, Jan. 2017.
[17] Y. Sang, L. Shang, Z. Wang, C. Liu, and M. Yang, ‘Bayesian-combined
wavelet regressive modeling for hydrologic time series forecasting,’’ Chin.
Sci. Bull., vol. 58, no. 31, pp. 3796–3805, Nov. 2013.
[18] C. Cortes and V. Vapnik, ‘‘Support-vector networks,’ Mach. Learn.,
vol. 20, no. 3, pp. 273–297, 1995.
[19] S. Zhu, J. Zhou, L. Ye, and C. Meng, ‘Streamflow estimation by support
vector machine coupled with different methods of time series decompo-
sition in the upper reaches of Yangtze River, China,’’ Environ. Earth Sci.,
vol. 75, no. 6, p. 531, Mar. 2016.
[20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, ‘‘Gradient-based learn-
ing applied to document recognition,’ Proc. IEEE, vol. 86, no. 11,
pp. 2278–2324, Nov. 1998.
[21] J. L. Elman, ‘‘Finding structure in time,’’ Cogn. Sci., vol. 14, no. 2,
pp. 179–211, Mar. 1990.
[22] K.-I. Funahashi, ‘‘On the approximate realization of continuous mappings
by neural networks,’ Neural Netw., vol. 2, no. 3, pp. 183–192, 1989.
[23] O. Kisi and H. K. Cigizoglu, ‘‘Comparison of different ANN tech-
niques in river flow prediction,’’ Civil Eng. Environ. Syst., vol. 24, no. 3,
pp. 211–231, Sep. 2007.
[24] M. Rezaeianzadeh, H. Tabari, A. A. Yazdi, S. Isik, and L. Kalin, ‘‘Flood
flow forecasting using ANN, ANFIS and regression models,’’ Neural
Comput. Appl., vol. 25, no. 1, pp. 25–37, Jul. 2014.
[25] M. C. Demirel, A. Venancio, and E. Kahya, ‘‘Flow forecast by SWAT
model and ANN in Pracana basin, Portugal,’ Adv. Eng. Softw., vol. 40,
no. 7, pp. 467–473, Jul. 2009.
[26] S.-Y. Shih, F.-K. Sun, and H.-Y. Lee, ‘‘Temporal pattern attention for
multivariate time series forecasting,’’ Mach. Learn., vol. 108, nos. 8–9,
pp. 1421–1441, Sep. 2019.
[27] S. Bai, J. Z. Kolter, and V. Koltun, ‘‘An empirical evaluation of generic
convolutional and recurrent networks for sequence modeling,’’ 2018,
arXiv:1803.01271.
[28] S. Hochreiter and J. Schmidhuber, ‘Long short-term memory,’’ Neural
Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[29] Z. Xiang, J. Yan, and I. Demir, ‘A rainfall-runoff model with LSTM-
based sequence-to-sequence learning,’ Water Resour. Res., vol. 56, no. 1,
Jan. 2020, Art. no. e2019WR025326.
[30] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-C. Woo,
‘‘Convolutional LSTM network: A machine learning approach for pre-
cipitation nowcasting,’ in Proc. Adv. Neural Inf. Process. Syst., 2015,
pp. 802–810.
[31] D. Liu, W. Jiang, L. Mu, and S. Wang, ‘‘Streamflow prediction using
deep learning neural network: Case study of Yangtze River,’’ IEEE Access,
vol. 8, pp. 90069–90086, 2020.
[32] D. Bahdanau, K. Cho, and Y. Bengio, ‘‘Neural machine translation by
jointly learning to align and translate,’ 2014, arXiv:1409.0473.
[33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
L. Kaiser, and I. Polosukhin, ‘‘Attention is all you need,’’ in Proc. Adv.
Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[34] P. Schwaller, T. Laino, T. Gaudin, P. Bolgar, C. A. Hunter, C. Bekas,
and A. A. Lee, ‘Molecular transformer: A model for uncertainty-
calibrated chemical reaction prediction,’ ACS Central Sci., vol. 5, no. 9,
pp. 1572–1583, Sep. 2019.
[35] L. Yang, T. L. J. Ng, B. Smyth, and R. Dong, ‘HTML: Hierarchical
transformer-based multi-task learning for volatility prediction,’’ in Proc.
Web Conf., Apr. 2020, pp. 441–451.
[36] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo,
and L. Shao, ‘‘Pyramid vision transformer: A versatile backbone for dense
prediction without convolutions,’’ 2021, arXiv:2102.12122.
[37] S. Ha, D. Liu, and L. Mu, ‘‘Prediction of Yangtze River streamflow based
on deep learning neural network with El Niño–Southern oscillation,’ Sci.
Rep., vol. 11, no. 1, pp. 1–23, Dec. 2021.
[38] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang,
‘‘Informer: Beyond efficient transformer for long sequence time-series
forecasting,’ in Proc. AAAI, 2021, pp. 1–9.
[39] K. Dragomiretskiy and D. Zosso, ‘‘Variational mode decomposition,’
IEEE Trans. Signal Process., vol. 62, no. 3, pp. 531–544, Feb. 2014.
[40] R. A. van den Berg, H. C. Hoefsloot, J. A. Westerhuis, A. K. Smilde, and
M. J. van der Werf, ‘‘Centering, scaling, and transformations: Improving
the biological information content of metabolomics data,’ BMC Genomics,
vol. 7, no. 1, pp. 1–15, Dec. 2006.
[41] G. Sugihara and R. M. May, ‘Nonlinear forecasting as a way of distin-
guishing chaos from measurement error in time series,’ Nature, vol. 344,
no. 6268, pp. 734–741, 1990.
[42] S. Mehran Kazemi, R. Goel, S. Eghbali, J. Ramanan, J. Sahota, S. Thakur,
S. Wu, C. Smyth, P. Poupart, and M. Brubaker, ‘‘Time2Vec: Learning a
vector representation of time,’ 2019, arXiv:1907.05321.
[43] I. Daubechies, J. Lu, and H. T. Wu, ‘‘Synchrosqueezed wavelet transforms:
An empirical mode decomposition-like tool,’ Appl. Comput. Harmon.
Anal., vol. 30, no. 2, pp. 243–261, Mar. 2011.
[44] J. Gilles, ‘‘Empirical wavelet transform,’’ IEEE Trans. Signal Process.,
vol. 61, no. 16, pp. 3999–4010, Aug. 2013.
[45] N. E. Huang, Z. Shen, S. R. Long, M. C. Wu, H. H. Shih, Q. Zheng,
N.-C. Yen, C. C. Tung, and H. H. Liu, ‘The empirical mode decomposition
and the Hilbert spectrum for nonlinear and non-stationary time series
analysis,’ Proc. Roy. Soc. London A, Math., Phys. Eng. Sci., vol. 454,
no. 1971, pp. 903–995, 1998.
[46] D. Hilbert, ‘‘Mathematical problems,’’ Bull. Amer. Math. Soc., vol. 8,
no. 10, pp. 437–479, 1902.
[47] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares,
H. Schwenk, and Y. Bengio, ‘‘Learning phrase representations using
RNN encoder-decoder for statistical machine translation,’ 2014,
arXiv:1406.1078.
[48] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio,
‘‘Attention-based models for speech recognition,’’ 2015,
arXiv:1506.07503.
[49] E. Bryant, E. A. Bryant, and B. Edward, Climate Process and Change.
Cambridge, U.K.: Cambridge Univ. Press, 1997.
CHUANFENG LIU received the B.S. degree
in engineering from the China University of
Geosciences, Wuhan, China, where he is currently
pursuing the master’s degree. His research inter-
ests include machine learning, neural networks,
and physical oceanography.
DARONG LIU received the B.S. degree in engi-
neering from the China University of Geosciences,
Wuhan, China, where he is currently pursuing
the Ph.D. degree. During his undergraduate study,
his research interests included computer science,
three-dimensional modeling, and numerical simulation
in geoscience. His current research interests
include numerical simulation in physical oceanog-
raphy, machine learning, and neural networks.
LIN MU received the B.S., M.S., and Ph.D.
degrees in physical oceanography from the Ocean
University of China, Qingdao, China, in 2000,
2002, and 2007, respectively. He is currently a Pro-
fessor of physical oceanography with Shenzhen
University and the Shenzhen Research Institute,
China University of Geosciences, Guangdong,
China. He has authored or coauthored over 20 sci-
entific articles and four books. His research inter-
ests include physical oceanography: prevention
and mitigation of marine disasters, maritime search and rescue, and emer-
gency response management of offshore oil spills.
Article
As a frequent and devastating natural disaster worldwide, floods are influenced by complex factors. Building flood models for simulating, monitoring, and forecasting floods is crucial to reduce the risk of disasters and minimize damage to people and property. With advancements in computing power and the impressive capabilities of deep learning in such areas as classification and prediction, there has been growing interest in using this technology in flood research. There is also a growing body of research into building flood data‐driven models with deep learning. Based on this, this study adopts a mixed‐method approach of bibliometric and qualitative analyses to provide an overview of the research. The research status is revealed in a bibliometric visualization, where the research objects are defined from the flood perspective, and the research strategies are explained from the deep learning perspective to provide a comprehensive and in‐depth understanding of the flood problem and how to apply deep learning to solve it. In addition, the study reflects on the future direction of improvement and innovation needed to promote the further development and exploration of deep learning in flood research.
Article
Precise long-term runoff prediction holds crucial significance in water resource management. Although the long short-term memory (LSTM) model is widely adopted for long-term runoff prediction, they encounter challenges such as error accumulation and low computational efficiency. To address these challenges, we utilized a novel method to predict runoff based on a Transformer and the base flow separation approach (BS-Former) in the Ningxia section of the Yellow River Basin. To evaluate the effectiveness of the Transformer model and its responsiveness to the base flow separation technique, we constructed LSTM and artificial neural network (ANN) models as benchmarks for comparison. The results show that Transformer outperforms the other models in terms of predictive performance and that base flow separation significantly improves the performance of the Transformer model. Specifically, the performance of BS-Former in predicting runoff 7 days in advance is comparable to that of the BS-LSTM and BS-ANN models with lead times of 4 and 2 days, respectively. In general, the BS-Former model is a promising tool for long-term runoff prediction.
Article
In this paper, we address the critical task of 24-h streamflow forecasting using advanced deep-learning models, with a primary focus on the Transformer architecture which has seen limited application in this specific task. We compare the performance of five different models, including Persistence, long short-term memory (LSTM), Seq2Seq, GRU, and Transformer, across four distinct regions. The evaluation is based on three performance metrics: Nash–Sutcliffe Efficiency (NSE), Pearson's r, and normalized root mean square error (NRMSE). Additionally, we investigate the impact of two data extension methods: zero-padding and persistence, on the model's predictive capabilities. Our findings highlight the Transformer's superiority in capturing complex temporal dependencies and patterns in the streamflow data, outperforming all other models in terms of both accuracy and reliability. Specifically, the Transformer model demonstrated a substantial improvement in NSE scores by up to 20% compared to other models. The study's insights emphasize the significance of leveraging advanced deep learning techniques, such as the Transformer, in hydrological modeling and streamflow forecasting for effective water resource management and flood prediction.
Preprint
Full-text available
The imperative for a reliable and accurate flood forecasting procedure stem from the hazardous nature of the disaster. In response, researchers are increasingly turning to innovative approaches, particularly machine learning models, which offer enhanced accuracy compared to traditional methods. However, a notable gap exists in the literature concerning studies focused on the South Asian tropical region, which possesses distinct climate characteristics. This study investigates the applicability and behavior of Long Short-Term Memory (LSTM) and Transformer models in flood simulation with one day lead time, at the lower reach of Mahaweli catchment in Sri Lanka, which is mostly affected by the Northeast Monsoon. The importance of different input variables in the prediction was also a key focus of this study. Input features for the models included observed rainfall data collected from three nearby rain gauges, as well as historical discharge data from the target river gauge. Results showed that use of past water level data denotes a higher impact on the output compared to the other input features such as rainfall, for both architectures. All models denoted satisfactory performances in simulating daily water levels, especially low stream flows, with Nash Sutcliffe Efficiency (NSE) values greater than 0.77 while Transformer Encoder model showed a superior performance compared to Encoder Decoder models.
Article
Full-text available
Accurate long-term streamflow and flood forecasting has always been an important research direction in hydrology research. Nowadays, with climate change, floods, and other anomalies occurring more and more frequently and bringing great losses to society. The prediction of streamflow, especially flood prediction, is important for disaster prevention. Current hydrological models based on physical mechanisms can give accurate predictions of streamflow, but the effective prediction period is only about one month in advance, which is too short for decision making. Previous studies have shown a link between the El Niño–Southern Oscillation (ENSO) and the streamflow of the Yangtze River. In this paper, we use ENSO and the monthly streamflow data of the Yangtze River from 1952 to 2016 to predict the monthly streamflow of the Yangtze River in two extreme flood years by using deep neural networks. In this paper, three deep neural network frameworks are used: Stacked LSTM, Conv LSTM Encoder-Decoder LSTM and Conv LSTM Encoder-Decoder GRU. Experiments have shown that the months of flood occurrence and peak flows predicted by these four models become more accurate after the introduction of ENSO. And the best results were obtained on the Convolutional LSTM + Encoder Decoder Gate Recurrent Unit model.
Article
Full-text available
The most important motivation for streamflow forecasts is flood prediction and longtime continuous prediction in hydrological research. As for many traditional statistical models, forecasting flood peak discharge is nearly impossible. They can only get acceptable results in normal year. On the other hand, the numerical methods including physics mechanisms and rainfall-atmospherics could provide a better performance when floods coming, but the minima prediction period of them is about one month ahead, which is too short to be used in hydrological application. In this study, a deep neural network was employed to predict the streamflow of the Hankou Hydrological Station on the Yangtze River. This method combined the Empirical Mode Decomposition (EMD) algorithm and Encoder Decoder Long Short-Term Memory (En-De-LSTM) architecture. Owing to the hydrological series prediction problem usually contains several different frequency components, which will affect the precision of the longtime prediction. The EMD technique could read and decomposes the original data into several different frequency components. It will help the model to make longtime predictions more efficiently. The LSTM based En-De-LSTM neural network could make the forecasting closer to the observed in peak flow value through reading, training, remembering the valuable information and forgetting the useless data. Monthly streamflow data (from January 1952 to December 2008) from Hankou Hydrological Station on the Yangtze River was selected to train the model, and predictions were made in two years with catastrophic flood events and ten years rolling forecast. Furthermore, the Root Mean Square Error (RMSE), Coefficient of Determination (R2), Willmott’s Index of agreement (WI) and the Legates-McCabe’s Index (LMI) were used to evaluate the goodness-of-fit and performance of this model. The results showed the reliability of this method in catastrophic flood years and longtime continuous rolling forecasting.
Conference Paper
Full-text available
The volatility forecasting task refers to predicting the amount of variability in the price of a financial asset over a certain period. It is an important mechanism for evaluating the risk associated with an asset and, as such, is of significant theoretical and practical importance in financial analysis. While classical approaches have framed this task as a time-series prediction one-using historical pricing as a guide to future risk forecasting-recent advances in natural language processing have seen researchers turn to complementary sources of data, such as analyst reports, social media, and even the audio data from earnings calls. This paper proposes a novel hierarchical, transformer, multi-task architecture designed to harness the text and audio data from quarterly earnings conference calls to predict future price volatility in the short and long term. This includes a comprehensive comparison to a variety of baselines, which demonstrates very significant improvements in prediction accuracy, in the range 17%-49% compared to the current state-of-the-art. In addition, we describe the results of an ablation study to evaluate the relative contributions of each component of our approach and the relative contributions of text and audio data with respect to prediction accuracy.
Article
Full-text available
Rainfall‐runoff modeling is a complex nonlinear time series problem. While there is still room for improvement, researchers have been developing physical and machine learning models for decades to predict runoff using rainfall data sets. With the advancement of computational hardware resources and algorithms, deep learning methods such as the long short‐term memory (LSTM) model and sequence‐to‐sequence (seq2seq) modeling have shown a good deal of promise in dealing with time series problems by considering long‐term dependencies and multiple outputs. This study presents an application of a prediction model based on LSTM and the seq2seq structure to estimate hourly rainfall‐runoff. Focusing on two Midwestern watersheds, namely, Clear Creek and Upper Wapsipinicon River in Iowa, these models were used to predict hourly runoff for a 24‐hr period using rainfall observation, rainfall forecast, runoff observation, and empirical monthly evapotranspiration data from all stations in these two watersheds. The models were evaluated using the Nash‐Sutcliffe efficiency coefficient, the correlation coefficient, statistical bias, and the normalized root‐mean‐square error. The results show that the LSTM‐seq2seq model outperforms linear regression, Lasso regression, Ridge regression, support vector regression, Gaussian processes regression, and LSTM in all stations from these two watersheds. The LSTM‐seq2seq model shows sufficient predictive power and could be used to improve forecast accuracy in short‐term flood forecast applications. In addition, the seq2seq method was demonstrated to be an effective method for time series predictions in hydrology.
Article
Full-text available
As one of the most influential oceanic and atmospheric oscillations in the Earth system, El Niño‐Southern Oscillation (ENSO) has modulated numerous geophysical processes. This is particularly true for the Yangtze River Basin (YRB), which is vulnerable to Asian Monsoon and faces serious hydrological hazards. In this study, the co‐variability between lag–lead precipitation and sea surface temperature anomalies was evaluated utilizing singular value decomposition (SVD) method. Moreover, certain teleconnections between ENSO and streamflow were identified by wavelet methods. In addition, the contribution of related atmospheric variables was revealed by composite analysis. Results indicate that there are strong associations in lag–lead seasons between the wet condition (dry condition) and September–November (December–February) mature ENSO phase. Significant common power and coherence signals between the ENSO indices and the streamflow occur in the 4–8, 8–16 and 16–32 seasonal scales. Meanwhile, the activity cycle of the ENSO indices ahead of streamflow increases from the mid‐lower reaches to the source region. In addition, the Western Pacific Subtropical High is strengthened during the mature ENSO phase. Anomalous sinking motions and divergent water vapour flux occupy the YRB, reducing the precipitation and leading to the dry condition in the source region until the following March–May. On the other hand, ascending movements and abundant water vapour flux coming from northern Pacific, equatorial western Pacific and the Bay of Bengal result in the wet condition in the mid‐lower reaches.
Article
Full-text available
Organic synthesis is one of the key stumbling blocks in medicinal chemistry. A necessary yet unsolved step in planning synthesis is solving the forward problem: Given reactants and reagents, predict the products. Similar to other work, we treat reaction prediction as a machine translation problem between simplified molecular-input line-entry system (SMILES) strings (a text-based representation) of reactants, reagents, and the products. We show that a multihead attention Molecular Transformer model outperforms all algorithms in the literature, achieving a top-1 accuracy above 90% on a common benchmark data set. Molecular Transformer makes predictions by inferring the correlations between the presence and absence of chemical motifs in the reactant, reagent, and product present in the data set. Our model requires no handcrafted rules and accurately predicts subtle chemical transformations. Crucially, our model can accurately estimate its own uncertainty, with an uncertainty score that is 89% accurate in terms of classifying whether a prediction is correct. Furthermore, we show that the model is able to handle inputs without a reactant–reagent split and including stereochemistry, which makes our method universally applicable.
Article
Full-text available
Forecasting of multivariate time series data, for instance the prediction of electricity consumption, solar power production, and polyphonic piano pieces, has numerous valuable applications. However, complex and non-linear interdependencies between time steps and series complicate this task. To obtain accurate prediction, it is crucial to model long-term dependency in time series data, which can be achieved by recurrent neural networks (RNNs) with an attention mechanism. The typical attention mechanism reviews the information at each previous time step and selects relevant information to help generate the outputs; however, it fails to capture temporal patterns across multiple time steps. In this paper, we propose using a set of filters to extract time-invariant temporal patterns, similar to transforming time series data into its “frequency domain”. Then we propose a novel attention mechanism to select relevant time series, and use its frequency domain information for multivariate forecasting. We apply the proposed model on several real-world tasks and achieve state-of-the-art performance in almost all of cases. Our source code is available at https://github.com/gantheory/TPA-LSTM.
Article
Many real-world applications require the prediction of long sequence time-series, such as electricity consumption planning. Long sequence time-series forecasting (LSTF) demands a high prediction capacity of the model, which is the ability to capture precise long-range dependency coupling between output and input efficiently. Recent studies have shown the potential of Transformer to increase the prediction capacity. However, there are several severe issues with Transformer that prevent it from being directly applicable to LSTF, including quadratic time complexity, high memory usage, and inherent limitation of the encoder-decoder architecture. To address these issues, we design an efficient transformer-based model for LSTF, named Informer, with three distinctive characteristics: (i) a ProbSparse self-attention mechanism, which achieves O(L log L) in time complexity and memory usage, and has comparable performance on sequences' dependency alignment. (ii) the self-attention distilling highlights dominating attention by halving cascading layer input, and efficiently handles extreme long input sequences. (iii) the generative style decoder, while conceptually simple, predicts the long time-series sequences at one forward operation rather than a step-by-step way, which drastically improves the inference speed of long-sequence predictions. Extensive experiments on four large-scale datasets demonstrate that Informer significantly outperforms existing methods and provides a new solution to the LSTF problem.
Article
Long lead-time streamflow forecasting is of great significance for water resources planning and management in both the short and long terms. Despite of some studies using machine learning methods in streamflow forecasting, only few studies have been conducted to explore long lead-time forecasting capabilities of these methods, and gain an insight into systematic comparison of model forecasting performance in both the short and long terms. In this work, an artificial neural network (ANN) and a long short term memory (LSTM), a powerful tool for learning long-term temporal dependencies and capturing nonlinear relationship, have been adopted to forecast streamflow at daily and monthly scales for a long lead-time period. For long lead-time streamflow forecasting, a recursive forecasting procedure, which takes the last one-step-ahead forecast as a new input for the next-step-ahead forecast, is used in the ANN and LSTM forecasting systems. Two models are trained and validated for streamflow forecasting using the rainfall and runoff datasets collected from the Nan River Basin, Thailand, covering the period 1974 to 2014. To further explore the impact of parameter settings on model performance, two parameters, i.e. the length of time lag and the number of maximum epochs, are examined in the ANN and LSTM models. The main findings are highlighted here. First, with an optimal setting up of model parameters, both the ANN and LSTM model can provide accurate daily forecasting (up to 20 days ahead). Second, in comparison to the ANN model, the LSTM model exhibits better model performance in long lead-time daily forecasting, but less satisfactory in multi-monthly forecasting due to lack of large monthly training dataset. Third, the selection of the length of the time lag and number of maximum epochs used in both ANN and LSTM modelling are the key for long lead-time streamflow forecasting at daily and monthly scales. These findings suggest that the LSTM could be advance in daily streamflow forecasting and thus would be helpful to assist in strategy decisions in water resource management.