Received May 5, 2022, accepted May 21, 2022, date of publication May 27, 2022, date of current version June 6, 2022.
Digital Object Identifier 10.1109/ACCESS.2022.3178521
Improved Transformer Model for Enhanced
Monthly Streamflow Predictions of the
Yangtze River
CHUANFENG LIU 1, DARONG LIU 1,3, AND LIN MU 2,3
1College of Marine Science and Technology, China University of Geosciences, Wuhan 430074, China
2College of Life Science and Oceanography, Shenzhen University, Shenzhen 518061, China
3Southern Marine Science and Engineering Guangdong Laboratory (Guangzhou), Guangzhou 511458, China
Corresponding author: Darong Liu (lidr1169@cug.edu.cn)
This work was supported in part by the Key Special Project for Introduced Talents Team of Southern Marine Science and Engineering
Guangdong Laboratory (Guangzhou) under Grant GML2019ZD0604, the National Natural Science Foundation of China under
Grant U2006210, and in part by the Shenzhen Fundamental Research Program under Grant JCYJ20200109110220482.
ABSTRACT Over the past few decades, floods have severely damaged production and daily life, causing
enormous economic losses. Streamflow forecasts prepare us to fight floods ahead of time and mitigate
the disasters arising from them. Streamflow forecasting demands a high-capacity model that can make
precise long-term predictions. Traditional physics-based hydrological models can only make short-term
predictions for streamflow, while current machine learning methods can only obtain acceptable results
in normal years without floods. Previous studies have demonstrated a close relation between El Niño-
Southern Oscillation (ENSO) and the streamflow of the Yangtze River. However, traditional models, holding
the encoder–decoder architecture, only have one encoder block and cannot support bivariate time series
forecasting. In this study, a transformer-based double-encoder-enabled model was proposed, called the
double-encoder Transformer, with a distinctive characteristic: a "cross-attention" mechanism that can capture
the relation between two time series sequences. Using river flow observation collected by the Yangtze
River Water Resources Commission and El Niño-Southern Oscillation (ENSO) observation collected by
the National Oceanic and Atmospheric Administration, the model can achieve better performance. By using
variational mode decomposition (VMD) technique for preprocessing, the model can make precise long-term
predictions for the river flow of the Yangtze River. A monthly prediction of 21 years (from January 1998 to
December 2018) was made, and the results indicate that the double-encoder Transformer outperforms
mainstream time series models.
INDEX TERMS Streamflow prediction, Yangtze River, deep learning, transformer, variational mode
decomposition, flood forecasts.
I. INTRODUCTION
The Yangtze River has seen numerous floods in its history.
Affected by various factors, including the El Nino weather
pattern [1]–[3], the river flow time series are complex and
nonlinear [4]. Flow prediction is a practical problem and has
been drawing an increasing amount of attention. Many stud-
ies have been done to predict the river flow for months [5], [6]
or even years in advance [7], [8]. Streamflow predictions help
in the ability to fight floods in advance and help local admin-
istrators make better decisions and mitigate disasters [9].
The associate editor coordinating the review of this manuscript and
approving it for publication was Ehab Elsayed Elattar.
Over the past few decades, many numerical and machine
learning methods have been developed to predict streamflow.
Numerical prediction models [10]–[12] can simulate the
interactions of various physical processes, such as atmo-
spheric circulation and the evolution of long-term weather
in the physical world [13]. Running a numerical model is analogous to conducting a particular physics
experiment, and such models can achieve satisfactory results in short-term
forecasting [14]. The soil and water assessment tool (SWAT) [10]
was proposed as a way to predict the effects of land man-
agement on water, sediment, and chemicals in a large water-
shed with complex and varied soil types, land-use patterns,
and management practices. SWAT is a distributed watershed
hydrologic model based on the geographic information sys-
tem (GIS). SWAT primarily uses the space-based information
from remote sensing and geographical information systems
to simulate various hydrological physical and chemical pro-
cesses [10]. However, hydrologic numerical models have some severe drawbacks:
1. Large amounts of highly accurate data are required; 2. Long-term forecasting is less accurate;
and 3. Numerical methods are computationally intensive,
requiring vast computing resources. Statistical prediction
methods generally belong to traditional machine learning.
Many statistical models have been employed in predicting
streamflow [15]–[17]. The support vector machine (SVM) [18]
was designed as a classifier with a rigorous theoretical and mathematical basis.
In 2016, Shuang Zhu et al. used SVMs to predict streamflow in
the upper reaches of the Yangtze River; the R² could
reach 0.87 in a single-year monthly forecast [19]. Statistical
hydrological models can perform well in normal years but
poorly in flood years, because the structure of their parameters
limits model complexity, so they cannot predict anomalies precisely.
Traditional deep learning models extract features and can be used as prediction models.
The artificial neural network (ANN), convolutional neural network (CNN) [20], and
recurrent neural network (RNN) [21] are promising methods
for predicting river flow. An ANN can approximate
unknown and nonlinear functions with arbitrary precision,
which is why ANNs are known as universal function approximators [22]. ANN models
have been used to predict streamflow [23]–[25]. However, the ANN has two drawbacks:
1. As the sequence length increases, the number of trainable parameters
increases sharply. 2. An ANN cannot capture sequence information
(e.g., position and order). It is generally accepted
that an ANN is not suitable for processing time series.
CNNs and RNNs were designed to overcome the limitations of ANNs. Previous studies have found CNNs and
RNNs to be more accurate than other models for dealing
with time series. Shun-Yao Shih et al. used a CNN for
multivariate time series forecasting [26]. Shaojie Bai et al.
proposed a new CNN architecture named the temporal convolutional
network (TCN) [27] and obtained good results
on various time series datasets. The RNN was specifically
designed to deal with time series [21]. However, the vanilla
RNN architecture has problems that prevent it from being used
directly. Based on the RNN architecture, long short-term memory
(LSTM) [28] has been proposed to alleviate
the problems within the RNN. Currently, LSTM is widely
used in hydrologic prediction tasks [13], [29], [30]. In 2020,
Liu et al. used LSTM to predict the middle reaches of
the Yangtze River; in flood years, the R² of the monthly
prediction could reach 0.89 [31]. However, there are also
some shortcomings in CNNs and RNNs: 1. RNNs are
not computationally parallel, making them slow and time-consuming.
2. CNNs lose critical sequence information
while processing time series, decreasing their accuracy.
3. CNNs and RNNs cannot capture long-term dependence
effectively.
The attention mechanism was first proposed by
Bahdanau et al. in 2014 [32], and since then, it has been widely
applied in various fields of deep learning. With the attention
mechanism, models can ignore low-value information
and focus on high-value information. The attention layer
can capture long-term dependency by computing the
"attention" between all pairs of points. The process does not
need to be sequential, so it is computationally parallel. The
attention mechanism can also be used as a feature extractor,
outperforming CNN and RNN architectures. Based on an
attention mechanism, Google proposed a new model called
Transformer [33]. Transformer abandoned the traditional
CNNs and RNNs, and the entire network structure is com-
pletely composed of the attention mechanism. Transformer
was initially designed for natural language processing (NLP).
Still, currently, many studies have indicated that Transformer
is faster and stronger than CNNs and RNNs in dealing
with time series [34]–[36]. However, like the previous mod-
els, Transformer also has trouble in predicting flood peak
discharge.
There are three deficiencies in the previous works: 1. They cannot forecast the flood peak discharge precisely;
the R² in flood years has been lower than 0.9. 2. They
cannot make long-term stable predictions with high accuracy:
when making long-term predictions (e.g., more than a decade
of predictions), the average R² may only be around 0.85.
3. Previous models do not have structures for multidimensional
data processing; instead, they combine all the dimensions into
feature vectors as input to make multidimensional predictions for
river flow, limiting the models' capability. Previous
multidimensional prediction models all use this feature-vector
method [26], [37], [38]. In this study,
a restructured Transformer model combined with VMD is
proposed, namely double-encoder Transformer. Variational
mode decomposition (VMD) is widely used to preprocess
complex and nonlinear data [39]. It can decompose a signal
into several different modes to transform the signal in the time
domain into the frequency domain. The inherent features of
the original signal can be better reflected, making the model
fit well in both normal and flood years. The restructured
Transformer is still an encoder–decoder architecture with two
encoders and one decoder. The structure of the multiencoder
can support multivariate prediction. The inputs of the two
encoders are observed river flow data and observed ENSO
data, respectively. To better obtain the correlation between
El Nino and flow, the decoder receives the outputs of the
two encoders as inputs and then learns and computes the
correlation between them.
The contributions of the present paper are as follows:
• The double-encoder Transformer is proposed to enhance the prediction capability, significantly improving the flow prediction accuracy of flood years with an R² higher than 0.95.
• A reliable long-term (21-year) flow prediction of the Yangtze River has been performed, achieving high accuracy.
• The "cross-attention" mechanism is proposed to solve the bivariate prediction problem, which validates the attention-based model's potential value for multivariate prediction.
• A heuristic method is applied to determine the K value in the VMD algorithm.
II. METHODOLOGY
The specific research methods can be divided into data pre-
processing and deep learning. The critical step of data pre-
processing is the VMD algorithm. In section II-A, starting
with data preprocessing, the focus is on the details of the
VMD algorithm. Deep learning techniques generally follow
an encoder–decoder paradigm [38], mainly using CNNs and
RNNs and their variants. The double-encoder Transformer
retains the encoder–decoder architecture with attention blocks.
In section II-B, the principles and processes of the attention
operation are presented and described. Section II-C provides the
core part of the work, where the model's architecture and network
layers are presented in detail. Please refer to Fig. 1 for an
overview and to the corresponding sections for more information.
FIGURE 1. The overview of the work.
A. DATA PREPROCESSING
This section presents the process of preprocessing the original
observed flow data and ENSO data. There are three steps:
1. data normalization; 2. adding time stamps; 3. applying
the VMD algorithm to decompose the signal. The key to the
third step is choosing the appropriate number of frequency components
produced by the VMD; how to determine this number is demonstrated below.
There are two steps before employing the VMD.
1. Data normalization. The flow of the Yangtze River
changes dramatically, ranging from 5000 m3/month to nearly
70000 m3/month. If the scale difference of the data is too large,
the data with the larger scale will dominate, which
will reduce the accuracy of the model.
Algorithm 1 Complete Optimization of VMD [39]
Initialize $\hat{u}_k^{1}$, $\omega_k^{1}$, $\hat{\lambda}^{1}$, $n = 0$
repeat
  $n = n + 1$
  for $k = 1 : K$ do
    Update $\hat{u}_k$ for all $\omega \ge 0$:
    $$\hat{u}_k^{n+1}(\omega) \leftarrow \frac{\hat{f}(\omega) - \sum_{i<k}\hat{u}_i^{n+1}(\omega) - \sum_{i>k}\hat{u}_i^{n}(\omega) + \frac{\hat{\lambda}^{n}(\omega)}{2}}{1 + 2\alpha\left(\omega - \omega_k^{n}\right)^2}$$
    Update $\omega_k$:
    $$\omega_k^{n+1} \leftarrow \frac{\int_{0}^{\infty} \omega\,\lvert\hat{u}_k^{n+1}(\omega)\rvert^{2}\,d\omega}{\int_{0}^{\infty} \lvert\hat{u}_k^{n+1}(\omega)\rvert^{2}\,d\omega}$$
  end for
  Dual ascent for all $\omega \ge 0$:
  $$\hat{\lambda}^{n+1}(\omega) \leftarrow \hat{\lambda}^{n}(\omega) + \tau\left(\hat{f}(\omega) - \sum_{k}\hat{u}_k^{n+1}(\omega)\right)$$
until convergence: $\sum_{k} \lVert\hat{u}_k^{n+1} - \hat{u}_k^{n}\rVert_2^2 \,/\, \lVert\hat{u}_k^{n}\rVert_2^2 < \epsilon$
In addition, data with a large dimension will have a longer axis when updating
the model using gradient descent, which means that the
convergence of the model will be slow [40]. In this study,
we used min–max scaling to perform data normalization.
2. Adding time stamps. For time series, it is essential
to add time information while training [41], [42]. The classic
Transformer only encodes the input data with positions and
does not include more fine-grained time information such as
year, month, day, or hour [38]. The dataset used here has a monthly
resolution, so we added time stamps of the year and month to
each piece of data. Like the positional encoding [33], these time
stamps are embedded into higher dimensions and added
to the input data as sequential information.
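For illustration, the sketch below shows one way such an embedding could be assembled in PyTorch; it is not the authors' code, and the Conv1d value embedding, the nn.Embedding time-stamp encoding, and all layer sizes are assumptions loosely based on this description and on Algorithm 2.

```python
import math
import torch
import torch.nn as nn

class DataEmbedding(nn.Module):
    """Value embedding + sinusoidal positional encoding + month/year time-stamp embedding."""
    def __init__(self, c_in: int = 1, d_model: int = 64, max_len: int = 512):
        super().__init__()
        self.value_emb = nn.Conv1d(c_in, d_model, kernel_size=3, padding=1)
        self.month_emb = nn.Embedding(13, d_model)    # months 1..12 (hypothetical indexing)
        self.year_emb = nn.Embedding(200, d_model)    # year index, e.g. year - 1900 (assumption)
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x, month_idx, year_idx):
        # x: (batch, seq_len, c_in); month_idx, year_idx: (batch, seq_len) integer tensors
        val = self.value_emb(x.transpose(1, 2)).transpose(1, 2)       # (batch, seq_len, d_model)
        stamps = self.month_emb(month_idx) + self.year_emb(year_idx)  # time-stamp embedding
        return val + self.pe[: x.size(1)] + stamps                    # add positional encoding
```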
VMD works efficiently with nonlinear and nonstationary
data such as streamflow by decomposing a signal into different
modes of distinct spectral bands. In the original description,
a mode is defined as a signal whose numbers of local extrema
and zero-crossings differ by at most one. Now, the definition
has been slightly changed into the so-called intrinsic mode function
(IMF) [43], [44]. The complete VMD algorithm [39] can be
summarized in Algorithm 1, where $u_k := \{u_1, u_2, u_3, \ldots, u_K\}$
is the shorthand notation for the set of all modes (IMFs) and
$\omega_k := \{\omega_1, \omega_2, \omega_3, \ldots, \omega_K\}$ for
their center frequencies, respectively. The role of the Lagrangian
multiplier $\lambda$ is to enforce the constraint. The goal of VMD is
to decompose the original signal into the subsignals $u_k$.
VMD has proven to be a better algorithm than empirical
mode decomposition (EMD) [45]. VMD is much more
robust to sampling and noise and is supported by mathematical
theory [39]. However, the number of IMFs decomposed by
VMD is not fixed, and the effect of VMD is mainly affected
by this number; K will be used as a shorthand notation for
it in the following presentation. Applying VMD
to decompose the original signal produces losses. When
K is too small, a lot of important information in the original
signal will be filtered out, affecting the accuracy of subsequent
predictions. The loss can be observed by decomposing the
original signal into IMFs, recomposing them, and
comparing the result with the original signal. The loss is
defined as the difference between the recomposed signal and
the original signal. To measure the loss, the coefficient of
determination (R²) is introduced. An R² between 0 and 1
reflects the difference between the two distributions: the
smaller the R², the greater the difference. The loss can thus be
measured as Loss = 1 − R². Fig. 2(a) and Fig. 2(b)
show the influence of K on the loss for the flow and ENSO datasets,
respectively. The larger K is, the smaller the loss. However,
when K is too large, it will cause a few problems: 1. There
will be gaps in the high-frequency IMF, and the high-frequency
signal becomes intermittent. 2. The center frequencies of
adjacent IMFs will be close to each other, resulting in
modal repetition and extra noise. 3. More IMFs mean more
computing resources and more time.
FIGURE 2. The correlation between K and Loss on the flow and ENSO
datasets.
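The loss curves of Fig. 2 can be reproduced in spirit with a short Python sketch that decomposes a series for a given K, recomposes the IMFs, and computes Loss = 1 − R². The third-party vmdpy package and its argument names are an assumption here, and the VMD hyperparameters shown are common defaults rather than the paper's settings.

```python
import numpy as np
from vmdpy import VMD                 # third-party VMD implementation (assumed available)
from sklearn.metrics import r2_score

def recomposition_loss(signal: np.ndarray, K: int) -> float:
    """Decompose `signal` into K IMFs, recompose them, and return Loss = 1 - R^2."""
    u, _, _ = VMD(signal, alpha=2000, tau=0.0, K=K, DC=0, init=1, tol=1e-7)
    recomposed = u.sum(axis=0)
    n = min(len(signal), len(recomposed))   # VMD implementations may trim odd-length inputs
    return 1.0 - r2_score(signal[:n], recomposed[:n])

# Sweeping K as in Fig. 2: the loss shrinks as K grows
# for K in range(2, 17):
#     print(K, recomposition_loss(flow_series, K))
```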
In this study, a heuristic approach was adopted to select the
appropriate K value. We apply the Hilbert transform (HT) [46] to
each IMF. The HT can be described as (1):
$$H[x(t)] = \hat{x}(t) = x(t) * \frac{1}{\pi t} = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{x(\tau)}{t - \tau}\, d\tau \qquad (1)$$
where $*$ and $\tau$ are the convolution operator and the
integration variable, respectively. The integral is taken as a
Cauchy principal value, which avoids the singularities at
$\tau = t$ and $\tau = \pm\infty$. The instantaneous amplitude $A(t)$ and
the instantaneous phase $\psi(t)$ can be calculated from $x(t)$ and $\hat{x}(t)$:
$$A(t) = \pm\sqrt{x^2(t) + \hat{x}^2(t)} \qquad (2)$$
$$\psi(t) = \arctan\frac{\hat{x}(t)}{x(t)} \qquad (3)$$
Then, the instantaneous frequency $IF(t)$ is calculated for all
points except the two endpoints:
$$IF(t) = \psi'(t) = \frac{x(t)\hat{x}'(t) - x'(t)\hat{x}(t)}{A^2(t)} \qquad (4)$$
Finally, the mean of all the instantaneous frequencies of each
IMF is the central frequency of that IMF:
$$CF = \frac{\sum_{t=2}^{n-1} IF(t)}{n - 2} \qquad (5)$$
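Under these definitions, the central frequency of an IMF can be estimated in a few lines of Python: scipy.signal.hilbert returns the analytic signal x(t) + j·x̂(t), from which the instantaneous phase and frequency follow. The sampling rate (one sample per month) is an assumption of this sketch.

```python
import numpy as np
from scipy.signal import hilbert

def central_frequency(imf: np.ndarray, fs: float = 1.0) -> float:
    """Estimate the central frequency of one IMF via the Hilbert transform (Eqs. (1)-(5))."""
    analytic = hilbert(imf)                             # x(t) + j * x_hat(t)
    phase = np.unwrap(np.angle(analytic))               # instantaneous phase psi(t)
    inst_freq = np.diff(phase) / (2.0 * np.pi) * fs     # instantaneous frequency IF(t)
    return float(np.mean(inst_freq[1:-1]))              # average, dropping the endpoints (Eq. (5))

# The K whose highest-frequency IMF has the largest central frequency (the inflection
# point in Fig. 4) is selected: K = 9 for the flow dataset and K = 8 for ENSO.
```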
Fig. 3(a) and Fig. 3(b) show the central frequency distributions
of the two datasets under different K values, respectively.
Here, we focus on the high frequencies. When the
K value is in a reasonable range, the higher-frequency IMFs
meaningfully have higher central frequencies. When
K is too large, there is an obvious inflection point where the
central frequency goes down as K goes up. This inflection
point means that K is already large enough. When the K value
is too high, the high-frequency part of the signal will break
off, and there is no instantaneous frequency at the breakpoint;
hence, after averaging, the center frequency decreases
instead. Fig. 4(a) and Fig. 4(b) show the variation of the
highest center frequency with K, where the inflection points
can be seen. The inflection point marks the highest center
frequency, and K should be selected at that point. In this
study, K = 9 was selected for the flow dataset and K = 8 for
the ENSO dataset.
B. ATTENTION MECHANISM
Self-attention is the core part of Transformer. The attention
mechanism is a heuristic method that refers to the process
of human attention, hence enabling neural networks to focus
more on what is important [32]. Therefore, the Transformer can
also be regarded as a feature extractor. As the name suggests,
self-attention is about inner attention, which excels in dealing
with time series. For a time series sequence, self-attention can
capture the correlations between each time step and all other
time steps so that the prediction results will be more accurate.
Deep learning techniques mainly adopt an encoder–
decoder architecture using RNNs and CNNs. However,
both the RNN and CNN models are much slower than
attention-based models. Compared with RNN and CNN
architectures, the self-attention mechanism is faster [33] and
is almost unaffected by the length of the sequence. In general,
the longer the sequence, the slower the processing. With this
in mind, an experiment was performed to test the speed of
the three architectures. Three time series models, including
vanilla TCN, vanilla LSTM, and vanilla Transformer, were
selected. By recording the time consumed by three models
while training, the speed can be measured. We used the
flow time series data to train these models, respectively.
The length of the input sequence ranges from 12 to 96 and
the models were trained for 10 and 50 epochs. This exper-
iment and all the following experiments were performed in
the environment given in Table 1. Fig. 5shows the time
FIGURE 3. The original signal was decomposed into K IMFs (e.g., K = 9 means the original signal was
decomposed into nine IMFs). The range of K was from 2 to 16. The subfigure shows the central frequency of every IMF
under a particular K. LF: low frequency; HF: high frequency.
consumed to train the models of three architectures for 10 and
50 epochs under different lengths of the input sequence.
It is supposed that the longer the sequence, the slower the
processing.
FIGURE 4. The correlation between K and the central frequencies of the high-frequency IMFs. Because the high-frequency IMFs become discontinuous when K is too large, their central frequencies decrease, so a K value should be selected that maximizes the center frequency. The green dot represents the highest center frequency: K = 9 for the flow dataset and K = 8 for the ENSO dataset.
TABLE 1. The experimental environment.
FIGURE 5. Time consumed during training. The models were trained for 10 and 50 epochs. The lengths of the input sequence are (12, 24, 36, 48, 60, 72, 84, 96).
For the LSTM, the training time increases significantly as the length of the sequence grows. However,
for CNN and Transformer, the time consumption changes
gently and is less affected by the sequence length. The reason
is that these sequence lengths are not yet long enough to slow
those models down; in the settings of this study, sequence lengths
ranging from 12 to 96 are reasonable and meaningful. The time
consumption of the TCN increases noticeably once the sequence
length exceeds 48, whereas the Transformer only shows a
significant increase when the sequence length exceeds 400.
In short, the Transformer consumes the least time of the three
and is hardly affected by the sequence length.
On the one hand, the RNN and CNN architectures are
slower than the attention architecture. On the other hand, the
CNN and RNN models are powerless to capture long-term
dependency as efficiently as attention-based models. The
reasons for this are as follows:
1. RNN is a linear architecture, and to obtain the corre-
lations between time steps, the operation of each time step
depends strictly on the previous steps [21]. RNN is a step-
by-step architecture that limits its parallelism. As a result, the
RNN is slow. The CNN uses convolution kernels to extract
features. The size of the kernels and step size of the convo-
lution affect its speed. Moreover, to better extract features,
a multilayer convolutional network is required. Although a
CNN is much faster than an RNN, it is still not as fast
as self-attention. Self-attention completely abandons RNN
structures and instead introduces matrix operations. Matrix
operations are parallel, which means that each time step
is computed simultaneously instead of one by one, which
significantly increases the speed. Because it is parallel, the
length of the sequence has little effect on the speed.
2. Although gate mechanisms such as those in LSTM and its
variant, the gated recurrent unit (GRU) [47], alleviate the
problem of long-term dependence, the RNN is still powerless
in dealing with exceedingly long-term dependence [33]. If the
length of the inputs is L, the RNN obtains a length of
dependence that is shorter than L. The length of dependence
that a CNN can obtain depends completely on the size of its
convolution kernels, which are generally shorter than L.
In contrast, because the attention value between any two time
steps is computed, self-attention can obtain arbitrarily long-term
dependence, focusing on high-weight information and ignoring
low-weight information. Self-attention obtains a length of
dependence that can be L.
The input sequence is transformed into three vectors
through three matrices in the self-attention mechanism. These
are query matrix (Q), key matrix (K) and value matrix (V).
The canonical self-attention is defined based on the tuple of
inputs (Q, K, V). Self-attention performs the scaled dot product,
which can be summarized as follows [33]:
$$\text{Self-Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (6)$$
where $Q \in \mathbb{R}^{d_x \times d_k}$, $K \in \mathbb{R}^{d_x \times d_k}$, $V \in \mathbb{R}^{d_x \times d_v}$, and $d_x$ is the
input dimension. $d_k$ stands for the dimension of the Q and K matrices,
and $d_v$ stands for the dimension of the V matrix.
We use $\frac{1}{\sqrt{d_k}}$ as the scaling factor.
The concepts of query, key, and value are derived from
information retrieval systems: a key is matched against a
query to retrieve the corresponding value, and the matching
weight is the similarity between the query and the key.
In matrix operations, the dot product is a method to compute
the similarity of two matrices, and the operation $Q \cdot K^{T}$
computes the similarity of each pair of time steps. A weighted
combination is then performed based on these similarities,
which serve as the weights.
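A minimal NumPy sketch of Eq. (6) follows; the projection matrices would be learned parameters in practice, and their shapes here are only illustrative assumptions.

```python
import numpy as np

def self_attention(x: np.ndarray, Wq: np.ndarray, Wk: np.ndarray, Wv: np.ndarray) -> np.ndarray:
    """Canonical scaled dot-product self-attention of Eq. (6).
    x: (L, d_x) input sequence; Wq, Wk: (d_x, d_k); Wv: (d_x, d_v)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # (L, L) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # row-wise softmax
    return weights @ V                                         # (L, d_v) attended output
```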
In this study, based on the attention mechanism, we propose
the "cross-attention" mechanism (Fig. 6). The
cross-attention mechanism combines both dot-product
attention and additive attention to efficiently capture the
dependency between two time series sequences. As in
traditional attention mechanisms, there are a query, a key,
and a value. The cross-attention block comes immediately
after the two encoders. The outputs of the flow encoder are
transformed into the query and value matrices, and the
outputs of the ENSO encoder are transformed into the key
matrix. Let $q_i$, $k_i$, and $v_i$ stand for the $i$-th rows of Q, K, and V,
respectively. The process of cross-attention can be described
as follows:
First, additive attention is used to compute a weighted average of the key
matrix into a global key vector $k$:
$$k = \sum_{i=1}^{L} \alpha_i \cdot k_i \qquad (7)$$
FIGURE 6. The cross-attention.
The weight $\alpha_i$ is calculated as follows:
$$\alpha_i = \frac{\exp\!\left(k_i w_k^{T} / \sqrt{d}\right)}{\sum_{j=1}^{L} \exp\!\left(k_j w_k^{T} / \sqrt{d}\right)} \qquad (8)$$
where $w_k \in \mathbb{R}^{d}$ is a trainable vector. The Hadamard product
is used to model the nonlinear relation between $q_i$ and $k$ and obtain a
score $s_i$:
$$s_i = q_i \odot k \qquad (9)$$
The $i$-th query's cross-attention is defined as:
$$\text{Cross-attention}(q_i, K, V) = \mathrm{Softmax}(s_i / \sqrt{d}) \cdot V \qquad (10)$$
It has been proven that the additive attention has a lower
computational complexity than dot product attention. The
main purpose of using the additive attention is that it can
quickly summarize important information in a sequence with
linear complexity, which greatly improves the efficiency of
multivariate forecasting.
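The sketch below gives one dimensionally consistent NumPy reading of Eqs. (7)–(10); how the softmax-normalized score vector is applied to V in Eq. (10) admits several interpretations, and the element-wise gating used here is an assumption rather than the authors' exact formulation.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray, w_k: np.ndarray) -> np.ndarray:
    """Cross-attention of Eqs. (7)-(10): additive attention summarizes the ENSO keys K into a
    global key, which interacts with each flow query through a Hadamard product.
    Q, V: (L, d) from the flow encoder; K: (L, d) from the ENSO encoder; w_k: (d,) trainable."""
    d = K.shape[-1]
    alpha = softmax(K @ w_k / np.sqrt(d), axis=0)   # (L,) additive-attention weights, Eq. (8)
    k_global = alpha @ K                            # (d,)  global key vector, Eq. (7)
    S = Q * k_global                                # (L, d) Hadamard scores, Eq. (9)
    return softmax(S / np.sqrt(d), axis=-1) * V     # Eq. (10) under the element-wise reading
```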
C. DOUBLE-ENCODER TRANSFORMER
It has been proven that El Nino has a close correlation with
rainfall, which affects the flow of the river, so El Nino has
a significant impact on streamflow [1]. Classic Transformer
only has the self-attention block, which is only suitable for
dealing with a single time series. When dealing with prediction
problems involving multivariate data, it cannot capture
the attention among different kinds of variates effectively.
The aim of the attention mechanism is to obtain the corre-
lation of one point to all the other points and then weight the
average of them so that the attention operation could be more
than just self-attention. Because the self-attention process can
obtain a correlation among the time steps of one sequence,
the correlation between different sequences can be obtained,
too. In this study, we improved the classic Transformer by
using two encoders and one decoder with a cross-attention
block to make streamflow predictions with the El Niño covariate.
The whole double-encoder architecture is illustrated
in Fig. 7. Because Transformer abandoned the RNN
linear sequence structure to support parallel computing, the
positional information in time series will be lost. Therefore,
adding order signals to vectors is necessary to help the model
learn this positional information, so positional encoding [33]
is used to solve this problem. Positional encoding works
by combining order information and vectors to form a new
representation input to the model to learn order information.
Positional encoding is itself a vector with order information.
In this study, order information and date information are
added to the input sequence. The datasets are monthly, so the
year and month information were embedded into vectors and
added to the original input vectors together with the positional encoding.
The original hydrological time series are nonlinear and
nonstationary. Some features will be ignored if used directly,
causing decreases in the prediction accuracy, especially in
flood years. The VMD could decompose original data into
FIGURE 7. The architecture of double-encoder Transformer.
several IMFs. Compared with other frequency decomposition
techniques, the fundamental components decomposed
by VMD have physical significance and are much more
robust to sampling and noise. In this study, the flow sequence
was decomposed into nine IMFs and ENSO sequence into
eight IMFs. The IMFs were sent into encoders to perform
self-attention. Each IMF should have its own encoder block,
so there are nine encoder blocks for the flow sequence and
eight for the ENSO sequence. The present study adopted
the transfer learning method: flow IMFs share a common
input embedding block and self-attention block, and so do
ENSO IMFs, but the layers behind them are different. To sum
up, there are two input embedding blocks, two self-attention
blocks, 9+8 feed-forward blocks and 9+8 linear blocks.
The introduction of transfer learning can significantly save
the space occupied by the model and accelerate the training
speed.
All the IMFs of flow and all the IMFs of ENSO were recomposed
into one sequence each after being processed by the encoders.
The recomposed flow and ENSO data were sent into the
decoder to perform cross-attention. In river flow forecasting,
the ENSO dataset works as the covariate. The main purpose
of cross-attention is to capture the impact of ENSO on
flow. Flow was converted into Q and V, and ENSO was converted
into K. The details of cross-attention were presented in
section II-B.
Let $L_{in}$ and $L_{out}$ stand for the input window size and the output
window size. The proposed method can be summarized as
Algorithm 2, where $Q_{self\_f}$, $K_{self\_f}$, $V_{self\_f}$, $Q_{self\_e}$, $K_{self\_e}$,
and $V_{self\_e}$ are conversion matrices in self-attention. All the
IMFs of flow are converted into Q, K, and V using the
same $Q_{self\_f}$, $K_{self\_f}$, and $V_{self\_f}$ matrices, and all the IMFs
of ENSO are converted into Q, K, and V using the same $Q_{self\_e}$,
$K_{self\_e}$, and $V_{self\_e}$ matrices. The $w_f$ and $w_e$ are trainable
vectors used to recompose the IMFs.
Algorithm 2 The Process of the Proposed Method
Input: the flow data $F = \{f_1, \ldots, f_{L_{in}}\}$; the ENSO data $E = \{e_1, \ldots, e_{L_{in}}\}$; the time stamps $ST = \{st_1, \ldots, st_{L_{in}}\}$.
Output: the predictions of river flow $F_{pred} = \{f_{L_{in}+1}, \ldots, f_{L_{in}+L_{out}}\}$.
Initialize $Q_{self\_f}$, $K_{self\_f}$, $V_{self\_f}$, $Q_{self\_e}$, $K_{self\_e}$, $V_{self\_e}$, $w_f$, $w_e$, $w_k$, $Q_{cross}$, $K_{cross}$, $V_{cross}$;
Use the VMD to decompose the original data:
  $VMD(F) = \{F_{IMF\_1}, \ldots, F_{IMF\_9}\}$;  $VMD(E) = \{E_{IMF\_1}, \ldots, E_{IMF\_8}\}$;
foreach modal in $VMD(F)$ and $VMD(E)$ do
  Perform data embedding: $EMB = \mathrm{Conv1d}(modal) + PE + \mathrm{Conv1d}(ST)$;
  Convert $EMB$ into Q, K, V matrices using $Q_{self\_f}$, $K_{self\_f}$, $V_{self\_f}$ (or $Q_{self\_e}$, $K_{self\_e}$, $V_{self\_e}$) and compute the self-attention:
  $\text{Self-attention}(modal) = \mathrm{softmax}\!\left(QK^{T}/\sqrt{d}\right)V$;
end for
The outputs of the two encoders are $F_{self} = \{F_{self\_1}, \ldots, F_{self\_9}\}$ and $E_{self} = \{E_{self\_1}, \ldots, E_{self\_8}\}$;
Recompose $F_{self}$ and $E_{self}$ into one sequence each:
  $F' = \mathrm{concat}(F_{self\_1}, \ldots, F_{self\_9}) \cdot w_f$;  $E' = \mathrm{concat}(E_{self\_1}, \ldots, E_{self\_8}) \cdot w_e$;
Convert $F'$ into Q and V matrices using $Q_{cross}$ and $V_{cross}$; convert $E'$ into K using $K_{cross}$. Compute the cross-attention:
foreach $q_i$ in Q do
  $s_i = q_i \odot k = q_i \odot \sum_{i=1}^{L_{in}} \alpha_i \cdot k_i$;
end for
$\text{Cross-attention}(F', E') = \mathrm{Softmax}(\mathrm{concat}(s_1, \ldots, s_{L_{in}})/\sqrt{d}) \cdot V$;
The prediction of river flow is $F_{pred} = \mathrm{Feedforward}(\text{Cross-attention}(F', E'))$
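To make the data flow of Algorithm 2 concrete, the following PyTorch sketch composes the main pieces: weight-shared self-attention encoders over the flow and ENSO IMFs, learned recomposition vectors, the cross-attention interaction, and a feed-forward prediction head. This is not the authors' implementation; VMD and the data embedding are assumed to happen outside the module, and the layer sizes, head count, and the reading of Eq. (10) are all assumptions.

```python
import torch
import torch.nn as nn

class DoubleEncoderSketch(nn.Module):
    """Simplified double-encoder Transformer following Algorithm 2 (illustrative only)."""
    def __init__(self, d_model: int = 64, n_flow_imfs: int = 9, n_enso_imfs: int = 8,
                 l_in: int = 72, l_out: int = 12):
        super().__init__()
        self.flow_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.enso_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.w_f = nn.Parameter(torch.randn(n_flow_imfs))   # recomposition weights for flow IMFs
        self.w_e = nn.Parameter(torch.randn(n_enso_imfs))   # recomposition weights for ENSO IMFs
        self.w_k = nn.Parameter(torch.randn(d_model))        # additive-attention vector (Eq. (8))
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(l_in * d_model, l_out))

    def forward(self, flow_imfs: torch.Tensor, enso_imfs: torch.Tensor) -> torch.Tensor:
        # flow_imfs: (batch, 9, L_in, d_model); enso_imfs: (batch, 8, L_in, d_model), already embedded
        f_enc = torch.stack([self.flow_attn(m, m, m)[0] for m in flow_imfs.unbind(dim=1)], dim=1)
        e_enc = torch.stack([self.enso_attn(m, m, m)[0] for m in enso_imfs.unbind(dim=1)], dim=1)
        F_rec = torch.einsum("bkld,k->bld", f_enc, self.w_f)       # recompose flow IMFs
        E_rec = torch.einsum("bkld,k->bld", e_enc, self.w_e)       # recompose ENSO IMFs
        d = F_rec.size(-1)
        alpha = torch.softmax(E_rec @ self.w_k / d ** 0.5, dim=1)  # (batch, L_in) additive weights
        k_global = torch.einsum("bl,bld->bd", alpha, E_rec)        # global ENSO key vector
        scores = F_rec * k_global.unsqueeze(1)                     # Hadamard interaction (Eq. (9))
        mixed = torch.softmax(scores / d ** 0.5, dim=-1) * F_rec   # one reading of Eq. (10)
        return self.head(mixed)                                    # (batch, L_out) flow predictions
```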
III. EXPERIMENT AND RESULTS
This study focuses on the streamflow predictions on the
Hankou Hydrological Station. The goal is to predict river
flow with the flow and ENSO datasets. The flow dataset was
collected by the Yangtze River Water Resources Commission,
and the ENSO dataset was collected by the National Oceanic
and Atmospheric Administration. Both datasets were col-
lected monthly from January 1952 to December 2018, hence
totaling 67 years. The datasets were divided into two parts:
one from 1952 to 1997 and another from 1998 to 2018. The
former was used to train the model, and the latter was used to
make predictions. The Yangtze River has experienced several
floods from 1952 to 2018, most recently in 2016. In 1998,
the Yangtze River experienced a devastating flood because
of strong subtropical highs, resulting in the most significant
surge over the past 50 years. In this study, the proposed model
was proven to work well in both flood years and normal years
by making predictions from 1998 to 2018. We have chosen
streamflow data in 1998, 2016, and 2018 (two flood years and
one normal year) to make predictions and then made 21 years
of rolling predictions from January 1998 to December 2018 to
further verify the overall reliability.
Some representative models were selected as comparisons,
including the traditional statistical analysis method: autore-
gressive integrated moving average model (ARIMA), the
convolutional neural network: TCN [27], the representative of
the RNN: LSTMa [48], and the classic Transformer [33]. For
CNNs, RNNs, and classic Transformer, there is no structure
specifically designed to support multidimensional prediction,
so it is common practice to combine all dimensions into feature vectors
as input when dealing with multidimensional time series.
All the models have a 2-D correlated input to make multidimensional
time series forecasts. Under the fixed-size window forecasting
setting, for the double-encoder Transformer, the input flow
sequence is $F_t = \{f_{t-L}, \ldots, f_t \mid f_i \in \mathbb{R}^{d_x}\}$
from time $t-L$ to time $t$, and the input ENSO sequence
is $E_t = \{e_{t-L}, \ldots, e_t \mid e_i \in \mathbb{R}^{d_x}\}$. The input sequence
for TCN, LSTMa, and the classic Transformer is
$FE_t = \{(f_{t-L}, e_{t-L}), \ldots, (f_t, e_t) \mid f_i, e_i \in \mathbb{R}^{d_x}\}$. The output
is the corresponding predicted sequence $F' = \{f'_t, \ldots, f'_{t+l} \mid f'_i \in \mathbb{R}^{d_x}\}$.
$L$ is the length of the inputs. It has been proven that El Niño
is cyclical, with a cycle of about 5.5 years [49], and through
experiments we verified that better results are obtained
when the input length ($L$) is 72 (6 years). $l$ is the length of
the prediction steps. In this study, $l$ is set to 12 so that the
decision makers can take action to prepare for the flood a
year in advance. For the TCN model, the kernel size was 24,
and the stride was 1. The MSE-Loss was selected as the loss
function, here using AdamW as the optimizer. To measure
the performance of the results, two evaluation metrics were
introduced, including root mean squared error (RMSE) and
coefficient of determination (R²). The RMSE was defined as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \qquad (11)$$
and the R² was defined as:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \qquad (12)$$
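Both metrics are straightforward to compute; the following small Python functions implement Eqs. (11) and (12):

```python
import numpy as np

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared error, Eq. (11)."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination, Eq. (12)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)
```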
A. ORIGINAL DATA DECOMPOSITION
To improve the accuracy and speed up the convergence of
the model, data normalization is necessary. Min–max scaling
was used to perform data normalization. For data
$X = \{x_1, x_2, x_3, \ldots, x_n\}$, the min–max scaling is defined as:
$$x_{i(\mathrm{scaled})} = \frac{x_i - \min(X)}{\max(X) - \min(X)} \cdot (\max - \min) + \min \qquad (13)$$
where $[\min, \max]$ is the scaling interval into which all the data
are scaled. In this study, both the flow and ENSO data were
scaled into $[-1, 1]$. The scaled data from
January 1952 to December 2018 were decomposed into nine
January 1952 to December 2018 were decomposed into nine
IMFs and eight IMFs, respectively. The decomposed results
are shown in Fig. 8(a) and Fig. 8(b).
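A small sketch of Eq. (13) is given below; the inverse transform is not described in the text but is the natural way to map scaled predictions back to physical units, so it is included here as an assumption.

```python
import numpy as np

def minmax_scale(x: np.ndarray, lo: float = -1.0, hi: float = 1.0):
    """Min-max scaling of Eq. (13) into [lo, hi]; also returns (min, max) for later inversion."""
    x_min, x_max = float(x.min()), float(x.max())
    scaled = (x - x_min) / (x_max - x_min) * (hi - lo) + lo
    return scaled, (x_min, x_max)

def minmax_inverse(scaled: np.ndarray, x_min: float, x_max: float,
                   lo: float = -1.0, hi: float = 1.0) -> np.ndarray:
    """Invert Eq. (13) to recover values in the original units (m3/month for flow)."""
    return (scaled - lo) / (hi - lo) * (x_max - x_min) + x_min
```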
B. THE FLOOD YEARS AND NORMAL YEAR PREDICTIONS
The main purpose of river flow forecasting is to predict
floods, especially the flood peak discharge, helping those
affected prepare to fight the flood ahead of time. So it is
crucial to predict floods accurately, which previous models
have not been able to do. Traditional models could obtain
decent performance in normal years but did not work well
in flood years. In this section, the time series from January
1952 to December 1997 were used to train the model, and
1998, 2016, and 2018 were selected to make predictions to
measure the model’s performance. Taking 1998 predictions
as an example, the input data is the flow data and ENSO data
from January 1992 to December 1997 (a total of 72 months).
The output is a 12-month prediction for 1998.
ARIMA, TCN, LSTMa, and classic Transformer were
selected as comparisons that made the same predictions
on these models. Fig. 9(a) and Fig. 9(b) show the whole
12 months of predictions for 1998 and 2016, respectively.
Additionally, data in 2018 were selected as a representative of
normal years to measure the model’s performance in normal
years. Fig. 9(c) shows the whole 12-month prediction of
2018. Finally, the metrics including R² and RMSE were used
to measure the performances of all the models in 1998, 2016,
and 2018. Table 2 shows the RMSE and R² of all the models
for these three years of predictions.
All the models could fit the streamflow change trend, which
increases first and then decreases. They all performed well
at the beginning and end of the year (when the streamflow
is low), and the gap among their forecasts was widest in
June, July, and August (when the streamflow is at its yearly
peak). The double-encoder Transformer had a higher
R² and a lower RMSE. In the flood years (1998 and 2016), it is
evident that the double-encoder Transformer fits the actual
values better than the other models, especially at the flood peak,
and its R² could reach more than 0.95. In the normal year
(2018), all the models performed well, but the double-encoder
Transformer was more accurate than the others, with an
R² of 0.94.
C. THE 21 YEARS OF ROLLING PREDICTIONS
In section III-B, three years were selected to perform pre-
dictions, and the results showed that the double-encoder
Transformer had significant advantages in flood years and
normal years.
FIGURE 8. The IMFs of the original data.
TABLE 2. The streamflow forecasting results on three years (five models).
However, the results of only three selected years might not be
representative of long-term performance. In this section, a 21-year
rolling forecast (from 1998 to 2018) was made to test the
models’ capacity for long-term prediction. Fig. 10(a) shows
the 21 years of rolling predictions of all models and the
actual value; because the curves overlap heavily, Fig. 10(b) shows
the double-encoder Transformer and the actual value in iso-
lation. Fig. 10(c) and Fig. 10(d) are corresponding scatter
plots.
The annual R² and RMSE were calculated for each year and each of
the five models, and the 21-year averages of
R² and RMSE were then obtained. Fig. 11(a) and Fig. 11(b) show the 21 years
of annual R² and RMSE, respectively. The results of the
21-year prediction show that the double-encoder Transformer
was better than the traditional time series models, with an
average R² of more than 0.91 and an average RMSE lower
than 2600 m3/month.
The experimental results show that the double-encoder
Transformer is superior to the current time series forecasting
models. In the 21 years of rolling predictions, the average
R² of the double-encoder Transformer could reach more
than 0.91, which is higher than the other models by about
0.1, and the RMSE was just 2579 m3/month, nearly half
that of the other models. It can be seen from Fig. 10(c) that
the points of the five models are concentrated near
the actual (red) line at first, which means that they all
performed well when streamflow was low. As the flow continued
to increase, the distribution of the points became more
scattered, resulting in poor predictions.
FIGURE 9. The forecasted streamflow of the models and the validation scatter plots in 1998, 2016, and 2018.
In this case, the double-encoder Transformer had a considerable advantage
of being more accurate in predictions of high streamflow.
The results fully prove that the double-encoder Transformer
could perform very well in normal and flood years and
could be reliable enough in long-term predictions with high
accuracy.
FIGURE 10. A 21-year rolling forecast (from January 1998 to December 2018) was made to test the models' capacity. Fig. 10(a) shows the
actual values and the prediction results of all models over the 21 years. To show the proposed model's performance more clearly, Fig. 10(b) shows
the actual values and the result of the double-encoder Transformer in isolation. Fig. 10(c) and Fig. 10(d) are the corresponding scatter plots
for model comparison. It can be seen from the scatter plots that when streamflow was low, all the models performed well
(scatter points are concentrated near the red line), but when streamflow was high, all models except the double-encoder Transformer
performed worse (scatter points are away from the red line).
FIGURE 11. The R² and RMSE of the 21 years of rolling predictions.
IV. CONCLUSION
In this work, the streamflow forecasting problem of the
Yangtze River was studied, and a new Transformer-
based double-encoder-enabled model was proposed: double-
encoder Transformer. This model, combined with the VMD
algorithm, can effectively make precise long-term stream-
flow forecasting of the Yangtze River, especially in flood
years. Although the model is still of an encoder–decoder
architecture, it alleviates the limitation of a traditional
encoder–decoder architecture. Specifically, we designed the
cross-attention mechanism to handle the challenges of not
supporting multivariate prediction in traditional time series
forecasting models. Combining the additive attention with
the dot product attention, the cross-attention mechanism can
effectively capture the relation between flow and ENSO data.
The experiments on real-world data of the Hankou Hydro-
logical Station demonstrated the effectiveness of the new
model for enhancing the prediction capacity both in normal
and flood years. A reliable long-term monthly prediction
(from 1998 to 2018) was made. There were floods in two of
these 21 years (1998 and 2016). The R² in both years is higher
than 0.95, and the RMSE is just 3224 m3/month, even though
the flow reached nearly 70000 m3/month in 1998.
Other mainstream forecasting models, including ARIMA,
TCN, LSTMa, and classic Transformer, were selected as
comparisons to demonstrate the superiority of the double-
encoder Transformer. In the 21 years of predictions, the
average R² of the double-encoder Transformer was about 0.91,
which is higher than the other models by about 0.1, and the
RMSE was 2579 m3/month, which is significantly lower
than that of the other models. These experimental results show that
the double-encoder Transformer can be used in real-world
streamflow predictions.
The main work of this study is to predict the stream-
flow of the Yangtze River, focusing on flood prediction.
However, drought years are also important, because water is
a vital resource on Earth and we depend heavily on river
water. Accurate drought forecasting is promising for future
research. The variation of river flow in the Yangtze River is
related to many variables. In this work, we made a bivariate
forecast with the ENSO data. The outcomes can be improved
by adding more variables to make multivariate forecasting.
The cross-attention mechanism is a promising method that
can efficiently summarize the features of a sequence into a
vector and compute the attention between the independent
variable and covariates. This work is a good start toward
multivariate long-sequence time series forecasting. We are
excited about the future of models with the cross-attention
mechanism.
REFERENCES
[1] J. Wei, W. Wang, Q. Shao, Y. Rong, W. Xing, and C. Liu, ‘Influence of
mature El Niño–Southern oscillation phase on seasonal precipitation and
streamflow in the Yangtze River Basin, China,’’ Int. J. Climatol., vol. 40,
no. 8, pp. 3885–3905, Jun. 2020.
[2] J. Peng, X. Luo, F. Liu, and Z. Zhang, ‘Analysing the influences of ENSO
and PDO on water discharge from the Yangtze River into the sea,’’ Hydrol.
Processes, vol. 32, no. 8, pp. 1090–1103, Apr. 2018.
[3] Q. Zhang, C.-Y. Xu, T. Jiang, and Y. Wu, ‘Possible influence of ENSO
on annual maximum streamflow of the Yangtze River, China,’ J. Hydrol.,
vol. 333, nos. 2–4, pp. 265–274, Feb. 2007.
[4] F. Huang, Z. Xia, N. Zhang, Y. Zhang, and J. Li, ‘‘Flow-complexity
analysis of the upper reaches of the Yangtze River, China,'' J. Hydrol. Eng.,
vol. 16, no. 11, pp. 914–919, Nov. 2011.
[5] M. Cheng, F. Fang, T. Kinouchi, I. M. Navon, and C. C. Pain, ‘‘Long lead-
time daily and monthly streamflow forecasting using machine learning
methods,’ J. Hydrol., vol. 590, Nov. 2020, Art. no. 125376.
[6] V. K. Keteklahijani, S. Alimohammadi, and E. Fattahi, ‘‘Predicting
changes in monthly streamflow to karaj dam reservoir, iran, in climate
change condition and assessing its uncertainty,’’ Ain Shams Eng. J., vol. 10,
no. 4, pp. 669–679, Dec. 2019.
[7] C. Ma and Y. Li, ‘‘Improving forecasting accuracy of annual runoff time
series using RBFN based on EEMD decomposition,’ in Proc. DEStech
Trans. Eng. Technol. Res., 2017, pp. 211–216.
[8] G. Yu, H. Ye, Z. Xia, and X. Zhao, ‘‘Application of projection pursuit auto
regression model in predicting runoff of Yangtze River,’ J. Hohai Univ.,
Natural Sci., vol. 37, no. 3, pp. 263–266, 2009.
[9] N. Noori and L. Kalin, ‘‘Coupling SWAT and ANN models for enhanced
daily streamflow prediction,’ J. Hydrol., vol. 533, pp. 141–151, Feb. 2016.
[10] J. Arnold, ‘‘Swat-soil and water assessment tool,’ USDA Agricult.
Res. Service, Grassland, Soil Water Res. Lab., Temple, TX, USA,
Tech. Rep., 1994.
[11] M. Kirkby and K. Beven, ‘A physically based, variable contributing area
model of basin hydrology,’’ Hydrol. Sci. J., vol. 24, no. 1, pp. 43–69, 1979.
[12] R.-J. Zhao, ''The Xinanjiang model applied in China,'' J. Hydrol., vol. 135,
nos. 1–4, pp. 371–381, 1992.
[13] Z. M. Yaseen, S. O. Sulaiman, R. C. Deo, and K.-W. Chau, ‘An enhanced
extreme learning machine model for river flow forecasting: State-of-the-
art, practical applications in water resource engineering area and future
research direction,’ J. Hydrol., vol. 569, pp. 387–408, Feb. 2019.
[14] W. Collischonn, R. Haas, I. Andreolli, and C. E. M. Tucci, ‘‘Forecast-
ing river Uruguay flow using rainfall forecasts from a regional weather-
prediction model,’ J. Hydrol., vol. 305, nos. 1–4, pp. 87–98, Apr. 2005.
[15] Z. A. Al-Sudani, S. Q. Salih, A. Sharafati, and Z. M. Yaseen, ‘‘Develop-
ment of multivariate adaptive regression spline integrated with differential
evolution model for streamflow simulation,’’ J. Hydrol., vol. 573, pp. 1–12,
Jun. 2019.
[16] W.-C. Wang, K.-W. Chau, D.-M. Xu, L. Qiu, and C.-C. Liu, ‘The annual
maximum flood peak discharge forecasting using Hermite projection pur-
suit regression with SSO and LS method,’ Water Resour. Manage., vol. 31,
no. 1, pp. 461–477, Jan. 2017.
[17] Y. Sang, L. Shang, Z. Wang, C. Liu, and M. Yang, ‘Bayesian-combined
wavelet regressive modeling for hydrologic time series forecasting,’’ Chin.
Sci. Bull., vol. 58, no. 31, pp. 3796–3805, Nov. 2013.
[18] C. Cortes and V. Vapnik, ‘‘Support-vector networks,’ Mach. Learn.,
vol. 20, no. 3, pp. 273–297, 1995.
[19] S. Zhu, J. Zhou, L. Ye, and C. Meng, ‘Streamflow estimation by support
vector machine coupled with different methods of time series decompo-
sition in the upper reaches of Yangtze River, China,’’ Environ. Earth Sci.,
vol. 75, no. 6, p. 531, Mar. 2016.
[20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, ‘‘Gradient-based learn-
ing applied to document recognition,’ Proc. IEEE, vol. 86, no. 11,
pp. 2278–2324, Nov. 1998.
[21] J. L. Elman, ‘‘Finding structure in time,’’ Cogn. Sci., vol. 14, no. 2,
pp. 179–211, Mar. 1990.
[22] K.-I. Funahashi, ‘‘On the approximate realization of continuous mappings
by neural networks,’ Neural Netw., vol. 2, no. 3, pp. 183–192, 1989.
[23] O. Kisi and H. K. Cigizoglu, ‘‘Comparison of different ANN tech-
niques in river flow prediction,’’ Civil Eng. Environ. Syst., vol. 24, no. 3,
pp. 211–231, Sep. 2007.
[24] M. Rezaeianzadeh, H. Tabari, A. A. Yazdi, S. Isik, and L. Kalin, ‘‘Flood
flow forecasting using ANN, ANFIS and regression models,’’ Neural
Comput. Appl., vol. 25, no. 1, pp. 25–37, Jul. 2014.
[25] M. C. Demirel, A. Venancio, and E. Kahya, ‘‘Flow forecast by SWAT
model and ANN in Pracana basin, Portugal,’ Adv. Eng. Softw., vol. 40,
no. 7, pp. 467–473, Jul. 2009.
[26] S.-Y. Shih, F.-K. Sun, and H.-Y. Lee, ‘‘Temporal pattern attention for
multivariate time series forecasting,’’ Mach. Learn., vol. 108, nos. 8–9,
pp. 1421–1441, Sep. 2019.
[27] S. Bai, J. Z. Kolter, and V. Koltun, ‘‘An empirical evaluation of generic
convolutional and recurrent networks for sequence modeling,’’ 2018,
arXiv:1803.01271.
[28] S. Hochreiter and J. Schmidhuber, ‘Long short-term memory,’’ Neural
Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[29] Z. Xiang, J. Yan, and I. Demir, ‘A rainfall-runoff model with LSTM-
based sequence-to-sequence learning,’ Water Resour. Res., vol. 56, no. 1,
Jan. 2020, Art. no. e2019WR025326.
[30] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-C. Woo,
‘‘Convolutional LSTM network: A machine learning approach for pre-
cipitation nowcasting,’ in Proc. Adv. Neural Inf. Process. Syst., 2015,
pp. 802–810.
[31] D. Liu, W. Jiang, L. Mu, and S. Wang, ‘‘Streamflow prediction using
deep learning neural network: Case study of Yangtze River,’’ IEEE Access,
vol. 8, pp. 90069–90086, 2020.
[32] D. Bahdanau, K. Cho, and Y. Bengio, ‘‘Neural machine translation by
jointly learning to align and translate,’ 2014, arXiv:1409.0473.
[33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
L. Kaiser, and I. Polosukhin, ‘‘Attention is all you need,’’ in Proc. Adv.
Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[34] P. Schwaller, T. Laino, T. Gaudin, P. Bolgar, C. A. Hunter, C. Bekas,
and A. A. Lee, ‘Molecular transformer: A model for uncertainty-
calibrated chemical reaction prediction,’ ACS Central Sci., vol. 5, no. 9,
pp. 1572–1583, Sep. 2019.
[35] L. Yang, T. L. J. Ng, B. Smyth, and R. Dong, ‘HTML: Hierarchical
transformer-based multi-task learning for volatility prediction,’’ in Proc.
Web Conf., Apr. 2020, pp. 441–451.
[36] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo,
and L. Shao, ‘‘Pyramid vision transformer: A versatile backbone for dense
prediction without convolutions,’’ 2021, arXiv:2102.12122.
[37] S. Ha, D. Liu, and L. Mu, ‘‘Prediction of Yangtze River streamflow based
on deep learning neural network with El Niño–Southern oscillation,’ Sci.
Rep., vol. 11, no. 1, pp. 1–23, Dec. 2021.
[38] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang,
‘‘Informer: Beyond efficient transformer for long sequence time-series
forecasting,’ in Proc. AAAI, 2021, pp. 1–9.
[39] K. Dragomiretskiy and D. Zosso, ‘‘Variational mode decomposition,’
IEEE Trans. Signal Process., vol. 62, no. 3, pp. 531–544, Feb. 2014.
[40] R. A. van den Berg, H. C. Hoefsloot, J. A. Westerhuis, A. K. Smilde, and
M. J. van der Werf, ‘‘Centering, scaling, and transformations: Improving
the biological information content of metabolomics data,’ BMC Genomics,
vol. 7, no. 1, pp. 1–15, Dec. 2006.
[41] G. Sugihara and R. M. May, ‘Nonlinear forecasting as a way of distin-
guishing chaos from measurement error in time series,’ Nature, vol. 344,
no. 6268, pp. 734–741, 1990.
[42] S. Mehran Kazemi, R. Goel, S. Eghbali, J. Ramanan, J. Sahota, S. Thakur,
S. Wu, C. Smyth, P. Poupart, and M. Brubaker, ‘‘Time2Vec: Learning a
vector representation of time,’ 2019, arXiv:1907.05321.
[43] I. Daubechies, J. Lu, and H. T. Wu, ‘‘Synchrosqueezed wavelet transforms:
An empirical mode decomposition-like tool,’ Appl. Comput. Harmon.
Anal., vol. 30, no. 2, pp. 243–261, Mar. 2011.
[44] J. Gilles, ‘‘Empirical wavelet transform,’’ IEEE Trans. Signal Process.,
vol. 61, no. 16, pp. 3999–4010, Aug. 2013.
[45] N. E. Huang, Z. Shen, S. R. Long, M. C. Wu, H. H. Shih, Q. Zheng,
N.-C. Yen, C. C. Tung, and H. H. Liu, ‘The empirical mode decomposition
and the Hilbert spectrum for nonlinear and non-stationary time series
analysis,’ Proc. Roy. Soc. London A, Math., Phys. Eng. Sci., vol. 454,
no. 1971, pp. 903–995, 1998.
[46] D. Hilbert, ‘‘Mathematical problems,’’ Bull. Amer. Math. Soc., vol. 8,
no. 10, pp. 437–479, 1902.
[47] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares,
H. Schwenk, and Y. Bengio, ‘‘Learning phrase representations using
RNN encoder-decoder for statistical machine translation,’ 2014,
arXiv:1406.1078.
[48] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio,
‘‘Attention-based models for speech recognition,’’ 2015,
arXiv:1506.07503.
[49] E. Bryant, E. A. Bryant, and B. Edward, Climate Process and Change.
Cambridge, U.K.: Cambridge Univ. Press, 1997.
CHUANFENG LIU received the B.S. degree
in engineering from the China University of
Geosciences, Wuhan, China, where he is currently
pursuing the master’s degree. His research inter-
ests include machine learning, neural networks,
and physical oceanography.
DARONG LIU received the B.S. degree in engi-
neering from the China University of Geosciences,
Wuhan, China, where he is currently pursuing
the Ph.D. degree. During his undergraduate study,
his research interests included computer science,
three-dimensional modeling, and numerical simulation
in geoscience. His current research interests
include numerical simulation in physical oceanog-
raphy, machine learning, and neural networks.
LIN MU received the B.S., M.S., and Ph.D.
degrees in physical oceanography from the Ocean
University of China, Qingdao, China, in 2000,
2002, and 2007, respectively. He is currently a Pro-
fessor of physical oceanography with Shenzhen
University and the Shenzhen Research Institute,
China University of Geosciences, Guangdong,
China. He has authored or coauthored over 20 sci-
entific articles and four books. His research inter-
ests include physical oceanography: prevention
and mitigation of marine disasters, maritime search and rescue, and emer-
gency response management of offshore oil spills.
Article
As a frequent and devastating natural disaster worldwide, floods are influenced by complex factors. Building flood models for simulating, monitoring, and forecasting floods is crucial to reduce the risk of disasters and minimize damage to people and property. With advancements in computing power and the impressive capabilities of deep learning in such areas as classification and prediction, there has been growing interest in using this technology in flood research. There is also a growing body of research into building flood data‐driven models with deep learning. Based on this, this study adopts a mixed‐method approach of bibliometric and qualitative analyses to provide an overview of the research. The research status is revealed in a bibliometric visualization, where the research objects are defined from the flood perspective, and the research strategies are explained from the deep learning perspective to provide a comprehensive and in‐depth understanding of the flood problem and how to apply deep learning to solve it. In addition, the study reflects on the future direction of improvement and innovation needed to promote the further development and exploration of deep learning in flood research.
Article
Precise long-term runoff prediction holds crucial significance in water resource management. Although the long short-term memory (LSTM) model is widely adopted for long-term runoff prediction, they encounter challenges such as error accumulation and low computational efficiency. To address these challenges, we utilized a novel method to predict runoff based on a Transformer and the base flow separation approach (BS-Former) in the Ningxia section of the Yellow River Basin. To evaluate the effectiveness of the Transformer model and its responsiveness to the base flow separation technique, we constructed LSTM and artificial neural network (ANN) models as benchmarks for comparison. The results show that Transformer outperforms the other models in terms of predictive performance and that base flow separation significantly improves the performance of the Transformer model. Specifically, the performance of BS-Former in predicting runoff 7 days in advance is comparable to that of the BS-LSTM and BS-ANN models with lead times of 4 and 2 days, respectively. In general, the BS-Former model is a promising tool for long-term runoff prediction.
Article
In this paper, we address the critical task of 24-h streamflow forecasting using advanced deep-learning models, with a primary focus on the Transformer architecture which has seen limited application in this specific task. We compare the performance of five different models, including Persistence, long short-term memory (LSTM), Seq2Seq, GRU, and Transformer, across four distinct regions. The evaluation is based on three performance metrics: Nash–Sutcliffe Efficiency (NSE), Pearson's r, and normalized root mean square error (NRMSE). Additionally, we investigate the impact of two data extension methods: zero-padding and persistence, on the model's predictive capabilities. Our findings highlight the Transformer's superiority in capturing complex temporal dependencies and patterns in the streamflow data, outperforming all other models in terms of both accuracy and reliability. Specifically, the Transformer model demonstrated a substantial improvement in NSE scores by up to 20% compared to other models. The study's insights emphasize the significance of leveraging advanced deep learning techniques, such as the Transformer, in hydrological modeling and streamflow forecasting for effective water resource management and flood prediction.
Preprint
Full-text available
The imperative for a reliable and accurate flood forecasting procedure stem from the hazardous nature of the disaster. In response, researchers are increasingly turning to innovative approaches, particularly machine learning models, which offer enhanced accuracy compared to traditional methods. However, a notable gap exists in the literature concerning studies focused on the South Asian tropical region, which possesses distinct climate characteristics. This study investigates the applicability and behavior of Long Short-Term Memory (LSTM) and Transformer models in flood simulation with one day lead time, at the lower reach of Mahaweli catchment in Sri Lanka, which is mostly affected by the Northeast Monsoon. The importance of different input variables in the prediction was also a key focus of this study. Input features for the models included observed rainfall data collected from three nearby rain gauges, as well as historical discharge data from the target river gauge. Results showed that use of past water level data denotes a higher impact on the output compared to the other input features such as rainfall, for both architectures. All models denoted satisfactory performances in simulating daily water levels, especially low stream flows, with Nash Sutcliffe Efficiency (NSE) values greater than 0.77 while Transformer Encoder model showed a superior performance compared to Encoder Decoder models.
Article
Full-text available
Accurate long-term streamflow and flood forecasting has always been an important research direction in hydrology research. Nowadays, with climate change, floods, and other anomalies occurring more and more frequently and bringing great losses to society. The prediction of streamflow, especially flood prediction, is important for disaster prevention. Current hydrological models based on physical mechanisms can give accurate predictions of streamflow, but the effective prediction period is only about one month in advance, which is too short for decision making. Previous studies have shown a link between the El Niño–Southern Oscillation (ENSO) and the streamflow of the Yangtze River. In this paper, we use ENSO and the monthly streamflow data of the Yangtze River from 1952 to 2016 to predict the monthly streamflow of the Yangtze River in two extreme flood years by using deep neural networks. In this paper, three deep neural network frameworks are used: Stacked LSTM, Conv LSTM Encoder-Decoder LSTM and Conv LSTM Encoder-Decoder GRU. Experiments have shown that the months of flood occurrence and peak flows predicted by these four models become more accurate after the introduction of ENSO. And the best results were obtained on the Convolutional LSTM + Encoder Decoder Gate Recurrent Unit model.
Article
Full-text available
The most important motivation for streamflow forecasts is flood prediction and longtime continuous prediction in hydrological research. As for many traditional statistical models, forecasting flood peak discharge is nearly impossible. They can only get acceptable results in normal year. On the other hand, the numerical methods including physics mechanisms and rainfall-atmospherics could provide a better performance when floods coming, but the minima prediction period of them is about one month ahead, which is too short to be used in hydrological application. In this study, a deep neural network was employed to predict the streamflow of the Hankou Hydrological Station on the Yangtze River. This method combined the Empirical Mode Decomposition (EMD) algorithm and Encoder Decoder Long Short-Term Memory (En-De-LSTM) architecture. Owing to the hydrological series prediction problem usually contains several different frequency components, which will affect the precision of the longtime prediction. The EMD technique could read and decomposes the original data into several different frequency components. It will help the model to make longtime predictions more efficiently. The LSTM based En-De-LSTM neural network could make the forecasting closer to the observed in peak flow value through reading, training, remembering the valuable information and forgetting the useless data. Monthly streamflow data (from January 1952 to December 2008) from Hankou Hydrological Station on the Yangtze River was selected to train the model, and predictions were made in two years with catastrophic flood events and ten years rolling forecast. Furthermore, the Root Mean Square Error (RMSE), Coefficient of Determination (R2), Willmott’s Index of agreement (WI) and the Legates-McCabe’s Index (LMI) were used to evaluate the goodness-of-fit and performance of this model. The results showed the reliability of this method in catastrophic flood years and longtime continuous rolling forecasting.
Conference Paper
Full-text available
The volatility forecasting task refers to predicting the amount of variability in the price of a financial asset over a certain period. It is an important mechanism for evaluating the risk associated with an asset and, as such, is of significant theoretical and practical importance in financial analysis. While classical approaches have framed this task as a time-series prediction one-using historical pricing as a guide to future risk forecasting-recent advances in natural language processing have seen researchers turn to complementary sources of data, such as analyst reports, social media, and even the audio data from earnings calls. This paper proposes a novel hierarchical, transformer, multi-task architecture designed to harness the text and audio data from quarterly earnings conference calls to predict future price volatility in the short and long term. This includes a comprehensive comparison to a variety of baselines, which demonstrates very significant improvements in prediction accuracy, in the range 17%-49% compared to the current state-of-the-art. In addition, we describe the results of an ablation study to evaluate the relative contributions of each component of our approach and the relative contributions of text and audio data with respect to prediction accuracy.
Article
Full-text available
Rainfall‐runoff modeling is a complex nonlinear time series problem. While there is still room for improvement, researchers have been developing physical and machine learning models for decades to predict runoff using rainfall data sets. With the advancement of computational hardware resources and algorithms, deep learning methods such as the long short‐term memory (LSTM) model and sequence‐to‐sequence (seq2seq) modeling have shown a good deal of promise in dealing with time series problems by considering long‐term dependencies and multiple outputs. This study presents an application of a prediction model based on LSTM and the seq2seq structure to estimate hourly rainfall‐runoff. Focusing on two Midwestern watersheds, namely, Clear Creek and Upper Wapsipinicon River in Iowa, these models were used to predict hourly runoff for a 24‐hr period using rainfall observation, rainfall forecast, runoff observation, and empirical monthly evapotranspiration data from all stations in these two watersheds. The models were evaluated using the Nash‐Sutcliffe efficiency coefficient, the correlation coefficient, statistical bias, and the normalized root‐mean‐square error. The results show that the LSTM‐seq2seq model outperforms linear regression, Lasso regression, Ridge regression, support vector regression, Gaussian processes regression, and LSTM in all stations from these two watersheds. The LSTM‐seq2seq model shows sufficient predictive power and could be used to improve forecast accuracy in short‐term flood forecast applications. In addition, the seq2seq method was demonstrated to be an effective method for time series predictions in hydrology.
Article
Full-text available
As one of the most influential oceanic and atmospheric oscillations in the Earth system, El Niño‐Southern Oscillation (ENSO) has modulated numerous geophysical processes. This is particularly true for the Yangtze River Basin (YRB), which is vulnerable to Asian Monsoon and faces serious hydrological hazards. In this study, the co‐variability between lag–lead precipitation and sea surface temperature anomalies was evaluated utilizing singular value decomposition (SVD) method. Moreover, certain teleconnections between ENSO and streamflow were identified by wavelet methods. In addition, the contribution of related atmospheric variables was revealed by composite analysis. Results indicate that there are strong associations in lag–lead seasons between the wet condition (dry condition) and September–November (December–February) mature ENSO phase. Significant common power and coherence signals between the ENSO indices and the streamflow occur in the 4–8, 8–16 and 16–32 seasonal scales. Meanwhile, the activity cycle of the ENSO indices ahead of streamflow increases from the mid‐lower reaches to the source region. In addition, the Western Pacific Subtropical High is strengthened during the mature ENSO phase. Anomalous sinking motions and divergent water vapour flux occupy the YRB, reducing the precipitation and leading to the dry condition in the source region until the following March–May. On the other hand, ascending movements and abundant water vapour flux coming from northern Pacific, equatorial western Pacific and the Bay of Bengal result in the wet condition in the mid‐lower reaches.
Article
Full-text available
Organic synthesis is one of the key stumbling blocks in medicinal chemistry. A necessary yet unsolved step in planning synthesis is solving the forward problem: Given reactants and reagents, predict the products. Similar to other work, we treat reaction prediction as a machine translation problem between simplified molecular-input line-entry system (SMILES) strings (a text-based representation) of reactants, reagents, and the products. We show that a multihead attention Molecular Transformer model outperforms all algorithms in the literature, achieving a top-1 accuracy above 90% on a common benchmark data set. Molecular Transformer makes predictions by inferring the correlations between the presence and absence of chemical motifs in the reactant, reagent, and product present in the data set. Our model requires no handcrafted rules and accurately predicts subtle chemical transformations. Crucially, our model can accurately estimate its own uncertainty, with an uncertainty score that is 89% accurate in terms of classifying whether a prediction is correct. Furthermore, we show that the model is able to handle inputs without a reactant–reagent split and including stereochemistry, which makes our method universally applicable.
Article
Full-text available
Forecasting of multivariate time series data, for instance the prediction of electricity consumption, solar power production, and polyphonic piano pieces, has numerous valuable applications. However, complex and non-linear interdependencies between time steps and series complicate this task. To obtain accurate prediction, it is crucial to model long-term dependency in time series data, which can be achieved by recurrent neural networks (RNNs) with an attention mechanism. The typical attention mechanism reviews the information at each previous time step and selects relevant information to help generate the outputs; however, it fails to capture temporal patterns across multiple time steps. In this paper, we propose using a set of filters to extract time-invariant temporal patterns, similar to transforming time series data into its “frequency domain”. Then we propose a novel attention mechanism to select relevant time series, and use its frequency domain information for multivariate forecasting. We apply the proposed model on several real-world tasks and achieve state-of-the-art performance in almost all of cases. Our source code is available at https://github.com/gantheory/TPA-LSTM.
Article
Many real-world applications require the prediction of long sequence time-series, such as electricity consumption planning. Long sequence time-series forecasting (LSTF) demands a high prediction capacity of the model, which is the ability to capture precise long-range dependency coupling between output and input efficiently. Recent studies have shown the potential of Transformer to increase the prediction capacity. However, there are several severe issues with Transformer that prevent it from being directly applicable to LSTF, including quadratic time complexity, high memory usage, and inherent limitation of the encoder-decoder architecture. To address these issues, we design an efficient transformer-based model for LSTF, named Informer, with three distinctive characteristics: (i) a ProbSparse self-attention mechanism, which achieves O(L log L) in time complexity and memory usage, and has comparable performance on sequences' dependency alignment. (ii) the self-attention distilling highlights dominating attention by halving cascading layer input, and efficiently handles extreme long input sequences. (iii) the generative style decoder, while conceptually simple, predicts the long time-series sequences at one forward operation rather than a step-by-step way, which drastically improves the inference speed of long-sequence predictions. Extensive experiments on four large-scale datasets demonstrate that Informer significantly outperforms existing methods and provides a new solution to the LSTF problem.
Article
Long lead-time streamflow forecasting is of great significance for water resources planning and management in both the short and long terms. Despite of some studies using machine learning methods in streamflow forecasting, only few studies have been conducted to explore long lead-time forecasting capabilities of these methods, and gain an insight into systematic comparison of model forecasting performance in both the short and long terms. In this work, an artificial neural network (ANN) and a long short term memory (LSTM), a powerful tool for learning long-term temporal dependencies and capturing nonlinear relationship, have been adopted to forecast streamflow at daily and monthly scales for a long lead-time period. For long lead-time streamflow forecasting, a recursive forecasting procedure, which takes the last one-step-ahead forecast as a new input for the next-step-ahead forecast, is used in the ANN and LSTM forecasting systems. Two models are trained and validated for streamflow forecasting using the rainfall and runoff datasets collected from the Nan River Basin, Thailand, covering the period 1974 to 2014. To further explore the impact of parameter settings on model performance, two parameters, i.e. the length of time lag and the number of maximum epochs, are examined in the ANN and LSTM models. The main findings are highlighted here. First, with an optimal setting up of model parameters, both the ANN and LSTM model can provide accurate daily forecasting (up to 20 days ahead). Second, in comparison to the ANN model, the LSTM model exhibits better model performance in long lead-time daily forecasting, but less satisfactory in multi-monthly forecasting due to lack of large monthly training dataset. Third, the selection of the length of the time lag and number of maximum epochs used in both ANN and LSTM modelling are the key for long lead-time streamflow forecasting at daily and monthly scales. These findings suggest that the LSTM could be advance in daily streamflow forecasting and thus would be helpful to assist in strategy decisions in water resource management.