Abstract—The temporal dependencies of wind power are essential to the modeling of short-term wind power forecasts. However, different time series inputs contribute differently to forecasting performance, which makes selecting the relevant driving information challenging. In this paper, a Multi-Source and Temporal Attention Network (MSTAN) is proposed for short-term wind power probabilistic prediction. The MSTAN model introduces multi-source NWP and makes three specific designs to improve prediction performance. Firstly, a novel multi-source variable attention module is proposed to select the driving variables of NWP. Secondly, a temporal attention module is used to capture the implicit temporal dependency hidden in the historical measurements and the multi-source NWP sequence. Thirdly, a residual module is wrapped into MSTAN to skip unnecessary nonlinear transformations and provide adaptive complexity to the entire model. After training, MSTAN yields multi-horizon density forecasts for the next 48 hours. MSTAN is compared with state-of-the-art machine learning schemes for wind power forecasting using operational data from three wind farms. We demonstrate that MSTAN outperforms its counterparts in both deterministic and probabilistic prediction, and that the structural design of MSTAN is effective.
Index Terms—wind power probabilistic prediction, multi-step prediction, multi-source NWP, variable attention, attention mechanism, residual connection, mixture density.
I. INTRODUCTION
With the integration of high-penetration wind energy, wind power uncertainty has become a major concern for reliable and economical power system operation and planning. Wind power probabilistic forecasting (WPPF) provides detailed uncertainty information, allowing system operators and electricity traders to make better decisions in reserve setting, unit commitment, electricity trading, and so on [1]. However, accurate short-term WPPF is still challenging due to the inherent randomness of the wind resource [2].
Up to now, the realization of high accuracy WPPF mainly
relies on the following two critical technical routes:
(1) Better Data Inputs and Feature Engineering. Novel features and efficient data preprocessing methods can reduce the difficulty of modeling and improve prediction accuracy. Therefore, the introduction, construction, and selection of target-related features have been widely adopted in wind power forecasting (WPF) research. (i) Some novel features are introduced into the WPF model to reduce prediction uncertainty. For instance, multi-site Numerical Weather Prediction (NWP) [3] and ensemble NWP [4] have been used to enrich the input
[Footnote: This work was supported by the National Natural Science Foundation of China (U1765104) and the North China Electric Power University International Joint Training Graduate Program. Hao ZHANG is with the School of Renewable Energy, North China Electric Power University, Beijing, China (e-mail: zhanghaoncepu@163.com).]
information of the WPPF model. Historical measurements and
NWP data [5-7] are simultaneously used as model input data to
improve very short-term and short-term forecasts. However, how to dynamically balance the relative importance of historically observed values and NWP values at different prediction horizons is seldom discussed, leading to underutilization of the input data. Furthermore, off-site information
[8,9] and geospatial information [10] are introduced to provide
more spatial features. (ii) Besides, constructing highly target-related driving features is also an effective way to improve prediction accuracy. Many feature construction schemes have been proposed to enrich the feature inputs of wind power forecasting. Frequently used manual features include multi-time-step average features, polynomial features [11], clustering features [12], wavelet decomposition features [13], dimension-reducing features [14], and unsupervised features [15]. (iii) When
the features are redundant and noisy, the features need to be
selected to reduce the influence of irrelevant features and noise.
Feature selection is often used to reduce the model complexity
and avoid the curse of dimensionality. The classical feature selection methods include the Filter, Wrapper, and Embedded methods. For example, Filter selection based on mutual information and Embedded selection based on a tree method are proposed in [16, 17]. An Embedded selection based on Automatic Relevance Determination is presented in [7]. Traditional feature selection methods only pick out the globally important features, making the selected features unsuitable for every time slot.
(2) Accurate and Flexible Probabilistic Prediction Models.
Generally, short-term wind farm power prediction models can be divided into physical models [18, 19], statistical models [20, 21],
and Machine Learning models. In recent years, Machine
Learning models, which could efficiently provide interval,
quantile, probability density, and scenario prediction results,
have gradually become the mainstream of short-term WPPF.
Machine Learning models fall into two categories: conventional Machine Learning models and Deep Learning models. (i) From the perspective of conventional Machine Learning, many models have been proposed for short-term WPPF, such as K-Nearest Neighbors (KNN) [22], Support Vector Machine (SVM) [23], Gaussian Process (GP) [17], tree-based models [24], Bayesian learning [5], autoregressive-based models [25-27], shallow Artificial Neural Networks (ANN) [28-30], and ensemble models [31, 32]. Since conventional ML models cannot automatically extract deep-level features, achieving high-accuracy wind power prediction often requires detailed and specialized feature engineering. (ii) Deep Learning
[Footnote: Jie YAN and Yongqian LIU are the corresponding authors, with the School of Renewable Energy, North China Electric Power University, Beijing, China (e-mail: yanjie@ncepu.edu.cn). Yongqi GAO is with the Nansen Environmental and Remote Sensing Center, University of Bergen, Bergen, Norway.]
Multi-Source and Temporal Attention Network for
Probabilistic Wind Power Prediction
Hao ZHANG, Jie YAN*, Member, IEEE, Yongqian LIU, Yongqi GAO, Shuang HAN, Li LI
models have strong nonlinear fitting capabilities and flexible network structures, and can be regarded as competitive data-driven solutions for WPF. Deep Learning models used for short-term WPF can be divided into the following categories: Dense, RNN, CNN, and GCN. Early DL methods used for WPF are mostly densely connected networks, such as the Deep Belief Network [33], the Deep Boltzmann Machine [34], and the DAE [35]. All three models have an unsupervised training and fine-tuning process. Due to the limitations of the network structure, Dense models have some defects in modeling the dependence of spatiotemporal data. Recurrent Neural Networks (RNN), including LSTM [36] and GRU [37], have gradually been applied to multi-step wind power prediction, and the temporal dependence of wind power is better learned by RNNs. Convolutional Neural Network (CNN) models, including 1D-CNN, 2D-CNN, and TCN models, also have local temporal dependency learning capabilities [38-40]. For instance, a 2D CNN is used to establish the spatial dependence of regular grid data [41]. The combination of CNN and LSTM is employed to capture the spatiotemporal relationship in wind farms or wind farm clusters [42]. GCN extends the convolution operation to the non-Euclidean domain and shows better adaptability and higher efficiency than CNN in the wind power prediction task [9].
However, several problems have not been carefully
addressed in earlier studies due to the limitations of the
prediction model structure and the variety of input data. (I)
Single-source ensemble NWP provided by a weather forecast
institution with different initial conditions and parameterization
schemes has been widely used before. However, the multi-
source NWP from diverse weather forecast providers has rarely
been considered in short-term WPF. Due to the limitations of
observations available for assimilation, computing resources
and engineering experience, one weather forecast provider
cannot guarantee that single-source NWP is accurate enough in
all regions and weather conditions. It is necessary to consider a
multi-source NWP scheme to reduce the risk of wind farm
power prediction [43]. (II) The relative importance of different
NWP features changes dynamically. However, traditional feature selection methods cannot pick out the dominant features step by step, so the input features of some time slots are not optimal. For instance, a globally optimal feature set is selected by traditional feature selection methods from the entire NWP dataset; however, there may be an optimal feature set more suitable for a specific time slot. (III) Most existing
CNN and RNN based WPPF models have difficulties in
modeling the long-term temporal dependency hidden in the
observed sequence and the NWP sequence. When the concerned temporal dependency spans a long time window, RNN models suffer from the gradient vanishing problem and parallelization difficulties. CNN models focus more on local patterns and need more layers and specific layer designs to obtain long-term temporal dependencies. (IV) Temporal Deep Learning models require time sequences as inputs. Thus, the short-term WPF training set is generally small under the limitation of NWP access times. If a short-term WPF system obtains NWP data once a day, the daily received NWP sequence might provide only one sample for model training. Deep learning models have strong fitting capabilities but are also prone to overfitting, especially when the data set is small. How to avoid overfitting on data sets of different sizes is rarely discussed in short-term WPPF.
In this paper, a Multi-Source and Temporal Attention
Network (MSTAN) is proposed for multi-step short-term
WPPF. MSTAN takes the multi-source NWP from diverse weather forecast providers and the historical observations as model inputs, and produces density forecasts for the next 48 hours as model outputs. Compared with previous short-term power prediction studies, this paper makes the following contributions:
Multi-source NWP is used in WPPF, and its long-term
temporal error pattern is discussed.
A novel multi-source variable attention module/layer is
designed to extract important variables from multi-source
NWP dynamically. Compared with the general feature
selection schemes [15,16,17], the multi-source variable
attention module makes the specific selection on every
single step.
The temporal dependency in the wind power sequence is
learned by a novel temporal attention layer. Compared
with RNN [36-37] and CNN [38-40] models, the proposed
temporal attention module is more effective in capturing
the long-term dependencies.
To avoid overfitting, a residual module constructed from skip connections, a gating mechanism, and layer normalization is used to control the extent of nonlinear transformation and reduce unnecessary nonlinear transformations. Compared with some DL-based WPPF models proposed in the literature [33-42], the residual module makes the MSTAN model more stable and adaptive.
This paper is organized as follows: Section II describes the
advantage of multi-source NWP and discusses the temporal
error pattern hidden in multi-source NWP. Section III defines
the multi-horizon wind power probabilistic prediction problem
and formally introduces the proposed MSTAN model. A case
study over three wind farm data is presented in Section IV.
Section V gives the conclusions and future works.
II. MULTI-SOURCE NWP AND ITS TEMPORAL ERROR PATTERN
A. Multi-source NWP
NWP models adopted by weather forecast providers can
differ in many aspects, such as spatial and temporal resolution,
observations available for assimilation and the specific
assimilation scheme, parameterization of physical process, and
other factors [44]. When the observational data for assimilation,
computing resources, and engineering experiences are limited,
single-source NWP products are likely to perform poorly in
some regions and weather conditions, which brings significant
risks to short-term wind power forecasts. As shown in Table 1, the annual Root Mean Square Error (RMSE) of the multi-source NWP wind speed is computed for 10 wind farms. No single NWP source achieves the lowest error in all ten wind farms at the same time; even the best NWP source achieves the smallest RMSE in only 6 of the 10 wind farms. If a single NWP source is used in a WPF system that serves many wind farms, it will bring prediction risks to some of the wind farms it serves.
On the contrary, multi-source NWP comes from different
weather forecast providers with varying settings of prediction.
Each forecast provider has their advantage and disadvantage in
NWP models, observations, parameterization schemes and
computing resources, etc. Therefore, multi-source NWP, which
integrates multiple advantages of different NWP, is more likely
to achieve low prediction risk and reliable accuracy. Significant benefits can be achieved by using the multi-source NWP scheme in real WPF projects. In some regions of China, the accuracy of the WPF directly affects the wind power integration priority and the revenue of wind farms: the wind farms at the top of the prediction accuracy ranking are rewarded, and those at the bottom are penalized. Take a wind farm located in North China as an example. The electricity price is about \$0.078/kWh, the regional wind curtailment rate in 2019 is about 7.1%, the wind farm capacity is 100 MW, and the equivalent running hours are about 250 per month. If the wind farm is penalized at the average curtailment rate, the lost revenue will be

$$100{,}000\ \mathrm{kW} \times 250\ \mathrm{h} \times 0.078\ \$/\mathrm{kWh} \times 7.1\% \approx 1.3845 \times 10^{5}\ \$$$

per month. The penalty to the wind farm will far exceed the cost of purchasing multi-source NWPs. Wind energy companies would be delighted to use multi-source NWP to promote WPF accuracy and grid integration priority.
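As a quick sanity check on the figures above, the lost-revenue arithmetic can be reproduced in a few lines of Python (the variable names are ours, purely illustrative):

```python
# Reproducing the lost-revenue estimate from the text:
# 100 MW capacity, ~250 equivalent utilization hours per month,
# $0.078/kWh electricity price, 7.1% regional curtailment rate.
capacity_kw = 100_000          # 100 MW expressed in kW
hours_per_month = 250          # equivalent utilization hours per month
price_usd_per_kwh = 0.078
curtailment_rate = 0.071

lost_revenue = capacity_kw * hours_per_month * price_usd_per_kwh * curtailment_rate
print(round(lost_revenue))     # 138450, i.e. about 1.3845e5 $ per month
```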
TABLE 1. THE ANNUAL WIND SPEED RMSE OF 4 NWP SOURCES IN 10 WIND FARMS

Wind Farm | Source1 | Source2 | Source3 (ensemble) | Source4
1         | 2.69    | 3.06    | 2.76               | 3.10
2         | 2.31    | 2.66    | 2.05               | 3.18
3         | 1.91    | 2.36    | 1.68               | 3.19
4         | 1.95    | 2.67    | 1.69               | 2.87
5         | 3.03    | 2.94    | 2.93               | 2.74
6         | 2.23    | 2.24    | 1.96               | 2.61
7         | 2.00    | 2.37    | 1.83               | 2.21
8         | 2.13    | 2.30    | —                  | 2.33
9         | 2.00    | 2.37    | 1.83               | 2.21
10        | 2.06    | 2.23    | 2.11               | 2.44

The annual wind speed RMSE of the 4 NWP sources is computed from one year of data. The record of the 3rd NWP source in wind farm 8 covers far less than one year; therefore, the statistical error of NWP source 3 for wind farm 8 is omitted.
B. Temporal error pattern of multi-source NWP
It is found from the wind speed prediction errors that the multi-source NWP wind speed has its specific Temporal Error Pattern (TEP). The TEP of the multi-source NWP wind speed is illustrated in Figure 1, which shows the annual averaged 0-47th hour wind speed RMSE of the four NWP sources and two multi-step time series methods (Persistence and Seq2Seq [45]). Three major characteristics can be seen in this figure.
Firstly, the multi-step wind speed prediction results are more accurate than the NWP wind speed from the 0th to the 6th hour, whereas the NWP wind speeds are better than the multi-step wind speed predictions from the 6th to the 48th hour. Similar conclusions are reported in the literature [46, 47]. This means that historical information (measurements) and future information (NWP) both have significant value for highly accurate short-term wind power forecasts: forecasts at near horizons should focus on the historical measurements, while forecasts at far horizons should focus on the weather forecasts.
Secondly, the hourly wind speed errors of each NWP source show a 24-hour cyclical trend. Within a day, the NWP wind speed error first decreases and then increases. Moreover, the NWP wind speed errors from the 24th to the 48th hour are slightly higher than those from the 0th to the 24th hour. This trend shows that the TEP of multi-source NWP spans a long time window.
Thirdly, around the 12th and the 36th hours, the wind speed errors of the four NWP sources are close to each other, while the errors at other time slots differ considerably. This indicates that the relative wind speed prediction accuracy of the four NWP sources changes dynamically; in other words, the prediction model should pay dynamic attention to the four NWP sources.
Fig.1. The Temporal Error Pattern of multi-source NWP wind speed (wind farm
9). Four NWP sources and two kinds of time series prediction methods are
studied in this figure. A one-year dataset is used.
III. MULTI-SOURCE AND TEMPORAL ATTENTION
NETWORK
Introducing multi-source NWP and considering the TEP
hidden in the multi-source NWP is a promising way to improve
the accuracy of the WPPF models. Therefore, a deep learning
based WPPF model, called Multi-Source and Temporal
Attention Network (MSTAN), is proposed in this paper. Four
critical modules/layers are used in MSTAN.
Multi-source variable attention module. It is designed
to extract the driving variables of multi-source NWP
dynamically. The collinearity problem and the harmful
effects of irrelevant variables and noise are reduced.
Residual module. It skips unnecessary nonlinear transformations and adaptively controls the complexity of the model to reduce the overfitting risk.
Temporal attention module. It dynamically selects
historical and future information and extracts the long-
term temporal dependency.
Mixture density module. It outputs the joint probability
density of multi-horizon wind power forecasts.
In this section, the used probabilistic prediction framework is
introduced first. Then four designed modules and the loss
function used in MSTAN are presented. Finally, the overall
structure and the relationship between all modules are clarified.
A. The Probabilistic Prediction Framework
Short-term WPPF aims to establish a multi-horizon
prediction function, which takes historical information (such as
wind speed and power measurements) and future information
(NWP) as inputs, and future wind power distribution as outputs.
More formally, let $\mathbf{x}_t := (x_{t,1}, \dots, x_{t,d_x})$ be the time-varying covariates that can be treated as known feature values, such as NWP and the relative time index (hour of the day), where $d_x$ is the dimension of $\mathbf{x}_t$. Let $y_t \in \mathbb{R}$ denote the measured wind power at time $t$, with $y_{1:T_0} := (y_1, y_2, \dots, y_{T_0})$. Similarly, let $s_t \in \mathbb{R}$ denote the measured wind speed at time $t$, with $s_{1:T_0} := (s_1, s_2, \dots, s_{T_0})$. $T_0$ is the length of the historical measurements, and $\tau$ is the maximum prediction horizon.

As shown in Figure 2, given the past wind power and wind speed measurements $y_{1:T_0}$, $s_{1:T_0}$, the time-varying covariates $\mathbf{x}_{1:T_0+\tau}$, and the model parameters $\Theta$, the short-term wind power probabilistic forecasting problem is described as:

$$p(y_{T_0+1:T_0+\tau} \mid y_{1:T_0}, s_{1:T_0}, \mathbf{x}_{1:T_0+\tau}, \Theta) \tag{1}$$

Estimating the joint probability density function of future multi-step wind power values can be considered a supervised learning problem, solved by minimizing the discrepancy, or maximizing the likelihood, between measurements and model forecasts.
Fig. 2. The structure of MSTAN
B. Multi-source variable attention module

A multi-source variable attention module/layer is designed for multi-source NWP by combining prior knowledge and the attention mechanism. In terms of prior knowledge, the theoretical relationship

$$P = \frac{1}{2} C_p \rho A v^3$$

can be easily derived, where $C_p$ is the coefficient of performance, $\rho$ is the air density, $A$ is the rotor swept area, and $v$ is the wind velocity. From this equation, (1) wind speed is the most important variable in wind power forecasting, and (2) the other variables (such as pressure, temperature, humidity, and wind direction) affect the wind power output indirectly.
To reflect the importance of the wind speed variables and the other related variables, the multi-source attention module is constructed from two sub-modules: 1) the multi-source wind speed variable attention sub-module, and 2) the other-variable attention sub-module. The output of the multi-source variable attention module is the concatenation of the two sub-module outputs.

Let the multi-source NWP at time $t$ be $\mathbf{X}_t = [x_t^{w,1}, \dots, x_t^{w,n_1}, x_t^{o,1}, \dots, x_t^{o,n_2}]$, where $n_1$ is the number of wind speed variables and $n_2$ is the number of other variables. Each $x_t^{w,i}$ and $x_t^{o,j}$ is a scalar. Each wind speed variable $x_t^{w,i}$ and other variable $x_t^{o,j}$ is transformed into a vector by a nonlinear Dense layer:

$$\mathbf{v}_t^{w,i} = \sigma\big(W_{w,i}\, x_t^{w,i} + \mathbf{b}_{w,i}\big), \quad 1 \le i \le n_1 \tag{2}$$

$$\mathbf{v}_t^{o,j} = \sigma\big(W_{o,j}\, x_t^{o,j} + \mathbf{b}_{o,j}\big), \quad 1 \le j \le n_2 \tag{3}$$

where $\sigma$ is the activation function, $W_{w,i}, W_{o,j} \in \mathbb{R}^{d_v \times 1}$ are weight parameters, $\mathbf{b}_{w,i}, \mathbf{b}_{o,j} \in \mathbb{R}^{d_v}$ are bias parameters, and $\mathbf{v}_t^{w,i}, \mathbf{v}_t^{o,j} \in \mathbb{R}^{d_v}$.
The multi-source wind speed variable attention sub-module performs a weighted summation over all transformed wind speed vectors $\mathbf{v}_t^{w,i}$. The other-variable attention sub-module performs a weighted summation over all other transformed vectors $\mathbf{v}_t^{o,j}$. The attention weight of each transformed vector is determined by $\mathbf{X}_t$ and the Softmax function:

$$\boldsymbol{\alpha}_t^{w} = \mathrm{Softmax}\big(W_{s1}\, \sigma(W_{d1} \mathbf{X}_t + \mathbf{b}_{d1}) + \mathbf{b}_{s1}\big) \tag{4}$$

$$\mathbf{v}_t^{w} = \sum_{i=1}^{n_1} \alpha_{t,i}^{w}\, \mathbf{v}_t^{w,i} \tag{5}$$

$$\boldsymbol{\alpha}_t^{o} = \mathrm{Softmax}\big(W_{s2}\, \sigma(W_{d2} \mathbf{X}_t + \mathbf{b}_{d2}) + \mathbf{b}_{s2}\big) \tag{6}$$

$$\mathbf{v}_t^{o} = \sum_{j=1}^{n_2} \alpha_{t,j}^{o}\, \mathbf{v}_t^{o,j} \tag{7}$$

$$\boldsymbol{\xi}_t = [\mathbf{v}_t^{w}, \mathbf{v}_t^{o}] \tag{8}$$

where $\boldsymbol{\alpha}_t^{w} \in \mathbb{R}^{n_1}$ and $\boldsymbol{\alpha}_t^{o} \in \mathbb{R}^{n_2}$ are the selection weights of the two sub-modules at time $t$, and $\mathbf{v}_t^{w}$ and $\mathbf{v}_t^{o}$ are the outputs of the two sub-modules. The output of the multi-source variable attention module is their concatenation $\boldsymbol{\xi}_t = [\mathbf{v}_t^{w}, \mathbf{v}_t^{o}] \in \mathbb{R}^{2 d_v}$. $W_{d1}, W_{d2}, W_{s1}, W_{s2}$ and $\mathbf{b}_{d1}, \mathbf{b}_{d2}, \mathbf{b}_{s1}, \mathbf{b}_{s2}$ are the weight and bias parameters of the nonlinear Dense layers before the Softmax. The structure of the multi-source variable attention module is shown in Figure 3.
The inputs determine the attention weights at time step .
Therefore, the strengthened important variables and the
weakened irrelevant variables at each time step are different.
Such a dynamic attention mechanism can take the temporal
error pattern of multi-source NWP into account.
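The computation in Eqs. (2), (4), and (5) can be sketched in plain Python for a single time step; the layer sizes, random weights, and helper names below are illustrative assumptions, not the trained MSTAN parameters:

```python
import math, random

random.seed(0)
n1, d_v = 4, 3          # 4 NWP wind speed sources, embedding size 3 (assumed)

def dense(x, W, b, act=math.tanh):
    """One Dense layer: x (list) -> act(W x + b); W is out-by-in."""
    return [act(sum(wij * xj for wij, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def rand_mat(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

x_w = [random.gauss(8, 2) for _ in range(n1)]   # scalar wind speeds at one time step

# Eq. (2): lift each scalar to a d_v-dim vector with its own nonlinear Dense layer
Ws = [rand_mat(d_v, 1) for _ in range(n1)]
bs = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(n1)]
v = [dense([x_w[i]], Ws[i], bs[i]) for i in range(n1)]          # n1 vectors of size d_v

# Eq. (4): attention weights from the raw inputs: Dense -> Dense -> Softmax
h = dense(x_w, rand_mat(d_v, n1), [0.0] * d_v)
logits = dense(h, rand_mat(n1, d_v), [0.0] * n1, act=lambda z: z)
m = max(logits)
e = [math.exp(l - m) for l in logits]
alpha = [ei / sum(e) for ei in e]                               # sums to 1

# Eq. (5): weighted sum over the n1 transformed wind speed vectors
v_w = [sum(alpha[i] * v[i][k] for i in range(n1)) for k in range(d_v)]
```

Because the weights are recomputed from the inputs at every time step, the dominant NWP source can change from one horizon to the next.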
Fig. 3. The structure of the multi-source variable attention module.
C. Residual module
In real situations, we cannot know in advance which input variables are strongly correlated with the supervised target. It is also challenging to determine whether a nonlinear transformation is required, especially when the training data set is small and noisy.
The residual module/layer adaptively controls the model
complexity and reduces unnecessary nonlinear transformation
to prevent over-fitting. As shown in Figure 4, the used residual
module consists of three operations. The Skip Connection
operation[48] can retain the original information. The Gating
Mechanism operation [49] controls the degree of nonlinear
transformation. The Layer Normalization operation [50] is
used to stabilize the output distribution and accelerate the
convergence of the model.
Since the residual module needs to be used together with
other nonlinear layers/modules, the output of the residual
module is affected by the linear path and the nonlinear path.
When the model needs low complexity, the residual module
skips the nonlinear path and is simplified to a linear mapping or
identity mapping. When the model needs high complexity, the
residual module retains most information from the nonlinear
path and outputs the summation from the linear and nonlinear
paths.
Formally, let the residual module receive a tensor $\mathbf{A}$ as input, where $\mathbf{a}_t \in \mathbb{R}^{d_a}$ is the vector of $\mathbf{A}$ at time step $t$. The input signal passes through two paths: the nonlinear gated path and the linear/identity mapping path. In the nonlinear gated path, $\mathbf{a}_t$ is transformed to $\mathbf{h}_{t,1}$ by the nonlinear transformation function $F(\cdot)$; in MSTAN, the LSTM and the self-attention module are used as $F(\cdot)$. Then $\mathbf{h}_{t,1}$ is gated by the Gated Linear Unit (GLU) [49] to output $\mathbf{h}_{t,2}$. In the linear/identity mapping path, when $\mathbf{h}_{t,2}$ has the same shape as $\mathbf{a}_t$, $\mathbf{h}_{t,3}$ equals $\mathbf{a}_t$; when the shapes of $\mathbf{a}_t$ and $\mathbf{h}_{t,2}$ differ, $\mathbf{a}_t$ is transformed by a linear Dense layer so that $\mathbf{h}_{t,3}$ has the same shape as $\mathbf{h}_{t,2}$. Finally, the Layer Normalization module takes the summation of $\mathbf{h}_{t,2}$ and $\mathbf{h}_{t,3}$ as input and outputs $\mathbf{r}_t$.

The nonlinear gated path:

$$\mathbf{h}_{t,1} = F(\mathbf{a}_t) \tag{9}$$

$$\mathbf{h}_{t,2} = \mathrm{GLU}(\mathbf{h}_{t,1}) = \sigma\big(W_{g} \mathbf{h}_{t,1} + \mathbf{b}_{g}\big) \odot \big(W_{v} \mathbf{h}_{t,1} + \mathbf{b}_{v}\big) \tag{10}$$

The linear or identity mapping path:

$$\mathbf{h}_{t,3} = \begin{cases} W_{l}\, \mathbf{a}_t + \mathbf{b}_{l}, & \text{if the shapes of } \mathbf{a}_t \text{ and } \mathbf{h}_{t,2} \text{ differ} \\ \mathbf{a}_t, & \text{otherwise} \end{cases} \tag{11}$$

The output of the residual module:

$$\mathbf{r}_t = \mathrm{LayerNorm}\big(\mathbf{h}_{t,2} + \mathbf{h}_{t,3}\big) \tag{12}$$

where $W_{g}, W_{v}, W_{l}$ are the weight parameters, $\mathbf{b}_{g}, \mathbf{b}_{v}, \mathbf{b}_{l}$ are the bias parameters, $\sigma$ is the sigmoid function, and $\odot$ is the elementwise Hadamard product.
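A minimal sketch of the residual module's forward pass (Eqs. (9)-(12)), with a tanh layer standing in for the LSTM/self-attention sub-layer $F$; all names, sizes, and random weights are illustrative assumptions:

```python
import math, random

random.seed(1)
d = 4                                    # feature size (illustrative)

def glu(h, Wg, bg, Wh, bh):
    """Eq. (10): GLU(h) = sigmoid(Wg h + bg) * (Wh h + bh), elementwise."""
    gate = [1 / (1 + math.exp(-(sum(w * x for w, x in zip(row, h)) + b)))
            for row, b in zip(Wg, bg)]
    lin = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(Wh, bh)]
    return [g * l for g, l in zip(gate, lin)]

def layer_norm(x, eps=1e-5):
    """Eq. (12): normalize to zero mean / unit variance (no learned scale here)."""
    mu = sum(x) / len(x)
    var = sum((xi - mu) ** 2 for xi in x) / len(x)
    return [(xi - mu) / math.sqrt(var + eps) for xi in x]

def rand_mat(r, c):
    return [[random.gauss(0, 1) for _ in range(c)] for _ in range(r)]

a = [random.gauss(0, 1) for _ in range(d)]          # residual-module input a_t

# Nonlinear gated path: F(a) is a stand-in for the LSTM / self-attention sub-layer
h1 = [math.tanh(sum(w * x for w, x in zip(row, a))) for row in rand_mat(d, d)]  # Eq. (9)
h2 = glu(h1, rand_mat(d, d), [0.0] * d, rand_mat(d, d), [0.0] * d)              # Eq. (10)

# Identity path (shapes match here, so no linear projection), then Add & Norm
h3 = a                                                                          # Eq. (11)
r = layer_norm([x + y for x, y in zip(h2, h3)])                                 # Eq. (12)
```

When the gate saturates near zero, the module passes `a` through almost unchanged, which is how unnecessary nonlinear transformations are skipped.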
Fig. 4. The structure of the residual module. Left: the residual module with the
linear mapping path. Right: the residual module with the identity mapping path.
D. Temporal Attention module
For capturing the temporal dependency across all the time
steps, a temporal attention module constructed by Encoder-
Decoder, Positional Encoding (PE), and self-attention, is
employed [51].
(1) The Encoder-Decoder module receives the historical measurements and the outputs of the multi-source variable selection module. To enhance the expression of temporal dependency, two LSTMs are used as the Encoder-Decoder module. Let the historical wind speed and wind power measurement sequences be $s_{1:T_0}$ and $y_{1:T_0}$. The encoder takes $s_{1:T_0}$ and $y_{1:T_0}$ as inputs, and the decoder takes $\boldsymbol{\xi}_{T_0+1:T_0+\tau}$ as inputs. Meanwhile, the residual module is wrapped around the Encoder-Decoder:

$$\mathbf{c}_t = \begin{cases} \mathrm{LSTM}_{enc}\big(\mathbf{c}_{t-1}, [s_t, y_t]\big), & 1 \le t \le T_0 \\ \mathrm{LSTM}_{dec}\big(\mathbf{c}_{t-1}, \boldsymbol{\xi}_t\big), & T_0 + 1 \le t \le T_0 + \tau \end{cases} \tag{13}$$

$$\boldsymbol{\phi}_t = \begin{cases} \mathrm{LN}\big(\mathrm{GLU}(\mathbf{c}_t) + W_{e}[s_t, y_t]\big), & 1 \le t \le T_0 \\ \mathrm{LN}\big(\mathrm{GLU}(\mathbf{c}_t) + \boldsymbol{\xi}_t\big), & T_0 + 1 \le t \le T_0 + \tau \end{cases} \tag{14}$$

where $\mathbf{c}_0$ is the initial state and $\mathrm{LN}$ is Layer Normalization. $T_0$ is the encoder length, and $\tau$ is the maximum prediction horizon. $\boldsymbol{\phi}_t$ is the output of the Encoder-Decoder with the residual module. For simplicity and convenience, the output dimension of $\boldsymbol{\phi}_t$ is set to $d_{model}$, $\boldsymbol{\phi}_t \in \mathbb{R}^{d_{model}}$.
(2) Positional encoding (PE) generates the position information of each time step. Since self-attention is a global attention mechanism, relative position information cannot be considered when calculating the similarity. Therefore, position information is added to the input tensor $\boldsymbol{\Phi}$. The positional information is defined as:

$$\mathrm{PE}(t, 2i) = \sin\big(t / 10000^{2i/d_{model}}\big) \tag{15}$$

$$\mathrm{PE}(t, 2i+1) = \cos\big(t / 10000^{2i/d_{model}}\big) \tag{16}$$

The inner product between $\mathrm{PE}(t, :)$ and $\mathrm{PE}(t+k, :)$ decreases as $k$ increases, so PE indirectly represents the relative distance between different time steps. The input tensor of the self-attention layer can be expressed as:

$$\boldsymbol{\psi}_t = \boldsymbol{\phi}_t + \mathrm{PE}(t, :) \tag{17}$$
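The positional encoding of Eqs. (15)-(16), and the decaying inner product it induces, can be checked with a small script (the width `d_model = 8` and the offsets are arbitrary choices for illustration):

```python
import math

d_model = 8   # model width (illustrative)

def pos_enc(t, d_model):
    """Eqs. (15)-(16): interleaved sin/cos positional encoding for time step t."""
    pe = []
    for i in range(d_model // 2):
        angle = t / (10000 ** (2 * i / d_model))
        pe += [math.sin(angle), math.cos(angle)]
    return pe

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# The inner product <PE(t), PE(t+k)> shrinks as the offset k grows,
# which is how PE encodes relative distance between time steps.
p0 = pos_enc(0, d_model)
sims = [dot(p0, pos_enc(k, d_model)) for k in (1, 4, 16)]
```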
(3) In simple terms, the essence of the attention mechanism is to select, from a sequence, the information that is similar to the query information. Self-attention maps a query representation $Q$ to a new representation by a weighted sum of the value representation $V$. The attention weights are determined by the scaled dot-product of the query representation $Q$ and the key representation $K$:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\Big(\frac{Q K^{\top}}{\sqrt{d_{att}}}\Big)V \tag{18}$$

To strengthen the expressive ability of the attention mechanism, the $Q$, $K$, and $V$ tensors are usually transformed linearly by $W_{Q}, W_{K}, W_{V} \in \mathbb{R}^{d_{model} \times d_{att}}$, where $d_{att}$ is the attention dimension:

$$\mathbf{B} = \mathrm{Attention}\big(Q W_{Q}, K W_{K}, V W_{V}\big) \tag{19}$$

In the practice of multi-horizon wind power prediction, the outputs of the self-attention layer are expressed as:

$$\mathbf{B}_{1:T_0+\tau} = \mathrm{Attention}\big(\boldsymbol{\Psi}_{1:T_0+\tau} W_{Q}, \boldsymbol{\Psi}_{1:T_0+\tau} W_{K}, \boldsymbol{\Psi}_{1:T_0+\tau} W_{V}\big) \tag{20}$$

with $Q = K = V = \boldsymbol{\Psi}_{1:T_0+\tau}$. The output of the temporal attention module wrapped by the residual module is

$$\boldsymbol{\theta}_t = \mathrm{LN}\big(\mathrm{GLU}(\mathbf{b}_t) + \boldsymbol{\psi}_t\big), \quad T_0 + 1 \le t \le T_0 + \tau \tag{21}$$

where $\mathbf{b}_t$ is the row of $\mathbf{B}$ at time step $t$ and $\boldsymbol{\theta}_t \in \mathbb{R}^{d_{model}}$.
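Scaled dot-product self-attention (Eq. (18)) over a toy sequence can be written directly; the sequence length, dimension, and random inputs are illustrative, and the linear projections of Eq. (19) are omitted for brevity:

```python
import math, random

random.seed(2)
T, d = 5, 4            # sequence length and attention dimension (illustrative)

def softmax(row):
    m = max(row)
    e = [math.exp(x - m) for x in row]
    s = sum(e)
    return [x / s for x in e]

def attention(Q, K, V, d_att):
    """Eq. (18): Softmax(Q K^T / sqrt(d_att)) V over a whole sequence."""
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_att) for k in K]
        w = softmax(scores)                     # attention weights sum to 1
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

# Self-attention: Q = K = V = the (position-encoded) sequence Psi
Psi = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
B = attention(Psi, Psi, Psi, d)
```

Each output row is a convex combination of the value rows, so every position can draw information from the whole window in a single step, regardless of the distance between time steps.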
E. Mixture density module

Similar to our previous work [52], a mixture density network (MDN) module is used to model the probability density of short-term wind power. The employed MDN approximates the distribution of the normalized forecasts by weighting multiple Beta distributions. As shown in Figure 5, the mixture density module is constructed from two Dense layers and one Softmax layer. $\mathrm{Dense}_{\alpha}$ outputs the shape parameters $\boldsymbol{\alpha}_t$, $\mathrm{Dense}_{\beta}$ outputs the shape parameters $\boldsymbol{\beta}_t$, and the Softmax layer outputs the mixing coefficients $\boldsymbol{\pi}_t$. The activation function of $\mathrm{Dense}_{\alpha}$ and $\mathrm{Dense}_{\beta}$ is ReLU. The mixture density module outputs $\boldsymbol{\pi}_t, \boldsymbol{\alpha}_t, \boldsymbol{\beta}_t$ for each time step in a parameter-sharing manner:

$$\boldsymbol{\alpha}_t = \mathrm{Dense}_{\alpha}(\boldsymbol{\theta}_t) = \mathrm{ReLU}\big(W_{\alpha}\boldsymbol{\theta}_t + \mathbf{b}_{\alpha}\big) \tag{22}$$

$$\boldsymbol{\beta}_t = \mathrm{Dense}_{\beta}(\boldsymbol{\theta}_t) = \mathrm{ReLU}\big(W_{\beta}\boldsymbol{\theta}_t + \mathbf{b}_{\beta}\big) \tag{23}$$

$$\boldsymbol{\pi}_t = \mathrm{Softmax}\big(W_{\pi}\boldsymbol{\theta}_t + \mathbf{b}_{\pi}\big) \tag{24}$$

$$p(y_t) = \sum_{m=1}^{M} \pi_{t,m}\, \mathrm{Beta}\big(y_t \mid \alpha_{t,m}, \beta_{t,m}\big) \tag{25}$$

where $M$ is the number of components in the MDN, $\pi_{t,m}$ is the mixing coefficient of the $m$-th component at time step $t$, and $\alpha_{t,m}, \beta_{t,m}$ are the shape parameters of the $m$-th component at time step $t$, with $\boldsymbol{\pi}_t, \boldsymbol{\alpha}_t, \boldsymbol{\beta}_t \in \mathbb{R}^{M}$. The mixing coefficients sum to one, $\sum_{m=1}^{M} \pi_{t,m} = 1$. $W_{\alpha}, W_{\beta}, W_{\pi}$ are weight parameters and $\mathbf{b}_{\alpha}, \mathbf{b}_{\beta}, \mathbf{b}_{\pi}$ are bias parameters.

Since the domain of the Beta distribution is $[0, 1]$, the domain of the mixture density is also $[0, 1]$. The wind power forecast density can be acquired by multiplying the mixture distribution by the wind power capacity.
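A small sketch of the Beta mixture of Eq. (25) for a single time step; the two components and their shape parameters below are hypothetical, not fitted values:

```python
import math

def beta_pdf(y, a, b):
    """Beta density on (0, 1), via log-gamma for numerical stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(y) + (b - 1) * math.log(1 - y))

def mixture_pdf(y, pis, alphas, betas):
    """Eq. (25): weighted sum of M Beta components for one time step."""
    return sum(p * beta_pdf(y, a, b) for p, a, b in zip(pis, alphas, betas))

# A hypothetical 2-component mixture for one horizon; mixing weights sum to 1
pis, alphas, betas = [0.6, 0.4], [2.0, 8.0], [5.0, 2.0]
density = mixture_pdf(0.3, pis, alphas, betas)

# The mixture should integrate to ~1 over (0, 1); checked with a crude Riemann sum
grid = [i / 1000 for i in range(1, 1000)]
area = sum(mixture_pdf(y, pis, alphas, betas) for y in grid) / 1000
```

Because each component is supported on (0, 1), the mixture naturally respects the normalized power range without any post-hoc clipping.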
Fig. 5. The structure of the mixture density module.
F. Loss Function and Training Method

MSTAN is an end-to-end deep learning model. All modules/layers of MSTAN can be jointly trained by the back-propagation algorithm and gradient descent. The negative log-likelihood function is used as the loss function, and Adam is used as the optimizer. The negative log-likelihood function is:

$$\mathcal{L}\big(y, \boldsymbol{\pi}, \boldsymbol{\alpha}, \boldsymbol{\beta}\big) = -\sum_{t=T_0+1}^{T_0+\tau} \log\Big(\sum_{m=1}^{M} \pi_{t,m}\, \mathrm{Beta}\big(y_t \mid \alpha_{t,m}, \beta_{t,m}\big)\Big) \tag{26}$$

The trained model outputs the mixture distribution parameters, and the deterministic forecasts can be calculated with the mean value or median value equation.

The mean value of the mixture distribution:

$$\hat{y}_t = \sum_{m=1}^{M} \pi_{t,m}\, \frac{\alpha_{t,m}}{\alpha_{t,m} + \beta_{t,m}} \tag{27}$$

The (approximate) median value of the mixture distribution:

$$\tilde{y}_t = \sum_{m=1}^{M} \pi_{t,m}\, \frac{\alpha_{t,m} - 1/3}{\alpha_{t,m} + \beta_{t,m} - 2/3} \tag{28}$$
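The loss of Eq. (26) and the point forecasts of Eqs. (27)-(28) can be evaluated as follows for hypothetical mixture parameters (two horizons, two components; none of these values come from the trained model):

```python
import math

def beta_logpdf(y, a, b):
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1) * math.log(y) + (b - 1) * math.log(1 - y))

def nll(y_seq, pis, alphas, betas):
    """Eq. (26): negative log-likelihood of the Beta mixture over all horizons."""
    loss = 0.0
    for t, y in enumerate(y_seq):
        mix = sum(p * math.exp(beta_logpdf(y, a, b))
                  for p, a, b in zip(pis[t], alphas[t], betas[t]))
        loss -= math.log(mix)
    return loss

def point_forecasts(pis, alphas, betas):
    """Eqs. (27)-(28): mixture mean and approximate mixture median."""
    mean = [sum(p * a / (a + b) for p, a, b in zip(pi, al, be))
            for pi, al, be in zip(pis, alphas, betas)]
    median = [sum(p * (a - 1 / 3) / (a + b - 2 / 3) for p, a, b in zip(pi, al, be))
              for pi, al, be in zip(pis, alphas, betas)]
    return mean, median

# Two horizons, two components each (hypothetical outputs of Eqs. (22)-(24))
pis = [[0.7, 0.3], [0.5, 0.5]]
alphas = [[2.0, 6.0], [3.0, 9.0]]
betas = [[5.0, 3.0], [4.0, 2.0]]
loss = nll([0.35, 0.6], pis, alphas, betas)
mean, median = point_forecasts(pis, alphas, betas)
```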
G. The Relation between the Modules

To state the relationship between the modules, the tensor operations in the forward process of MSTAN are shown in Figure 6. MSTAN takes 3D tensors as the model inputs and outputs. The shape of the past inputs $[s_{1:T_0}, y_{1:T_0}]$ is $[B, T_0, 2]$, and the shape of the future inputs $\mathbf{X}_{T_0+1:T_0+\tau}$ is $[B, \tau, n_1 + n_2]$, where $B$ is the batch size.

As shown in Figure 6, the multi-source variable attention module/layer takes $\mathbf{X}_{T_0+1:T_0+\tau}$ as inputs and outputs $\boldsymbol{\xi}_{T_0+1:T_0+\tau}$ with shape $[B, \tau, 2d_v]$. The LSTM Encoder wrapped by the residual module takes $[s_{1:T_0}, y_{1:T_0}]$ as inputs and outputs $\boldsymbol{\phi}_{1:T_0}$ with shape $[B, T_0, d_{model}]$. The LSTM Decoder wrapped by the residual module takes $\boldsymbol{\xi}_{T_0+1:T_0+\tau}$ as inputs and outputs $\boldsymbol{\phi}_{T_0+1:T_0+\tau}$ with shape $[B, \tau, d_{model}]$. Then $\boldsymbol{\phi}_{1:T_0+\tau}$ adds the positional encoding to itself. The self-attention module wrapped by the residual module takes $\boldsymbol{\psi}_{1:T_0+\tau}$ as inputs and outputs $\boldsymbol{\theta}_{T_0+1:T_0+\tau}$ with shape $[B, \tau, d_{model}]$. The mixture density module takes $\boldsymbol{\theta}_{T_0+1:T_0+\tau}$ as inputs and outputs $\boldsymbol{\pi}_{T_0+1:T_0+\tau}$, $\boldsymbol{\alpha}_{T_0+1:T_0+\tau}$, $\boldsymbol{\beta}_{T_0+1:T_0+\tau}$, each with shape $[B, \tau, M]$.
Fig. 6. The relation between the used modules.
IV. APPLICATIONS IN THREE WIND FARMS
A. Dataset
Three wind farms located in North China are studied. In each wind farm, four NWP sources are used; the 3rd NWP source is an ensemble NWP. Each NWP source contains one or two NWP sites. Case 1, Case 2, and Case 3 correspond to WF 6, WF 9, and WF 2 in Table 1. The NWP data at each site contain several features: wind speed (WS), wind direction (WD), relative humidity (RH), temperature (TMP), and air pressure (PRE). Before 0:00 every day, the next 48 hours of multi-source NWP data are received; the multi-source NWP data are accessed once a day. The measured data include the wind speed of the wind tower and the power output of the booster station. The data set covers one year (2019-01-01 to 2019-12-31) with a time resolution of one hour. The wind power prediction dataset comes from a real regional wind power forecasting project. Due to the confidentiality agreement with the wind farm operator and the NWP provider, the data cannot be openly accessed yet.
The entire datasets are divided into training sets and testing
sets by the Date Time. The data from the 1st to the 24th days of
each month is divided into training sets. The data from the 25th
day to the end of each month is divided into testing sets. 20%
of the samples are randomly selected in the training set as the
validation set. This division scheme ensures that the MSTAN is
tested by the data of each month throughout the year.
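The calendar split described above can be sketched with the standard library (a schematic, assuming hourly timestamps; the 20% random validation draw is omitted for brevity):

```python
from datetime import datetime

def split_by_day(timestamps):
    """Days 1-24 of each month -> training set; day 25 onward -> testing set."""
    train_idx, test_idx = [], []
    for i, ts in enumerate(timestamps):
        (train_idx if ts.day <= 24 else test_idx).append(i)
    return train_idx, test_idx

# Small illustration: two hours each on four days of January 2019
stamps = [datetime(2019, 1, d, h) for d in (10, 24, 25, 30) for h in (0, 12)]
train_idx, test_idx = split_by_day(stamps)
```

Because the split keys on the day of the month rather than a single cut date, every calendar month contributes samples to both sets, which is the property the authors rely on.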
B. Benchmarks
Two classic technical routes are used to compare with the
proposed model. These two technical routes are widely used by
machine learning competitions and commercial applications.
The first technical route is feature engineering + regressor, and
the second technical route is deep learning.
1) Technical route 1: Feature Engineering + Regressor
The purpose of feature engineering is to alleviate the curse of
dimensionality and reduce the difficulty of learning tasks.
Following the winning feature engineering schemes in
GEFCom2012 and GEFCom2014 [10, 11, 53], a feature engineering
scheme including feature construction, feature selection,
normalization, category encoding, and target transformation
is adopted.
Four algorithms are employed as regressors: Ridge
regression, Support Vector Regression (SVR), K-Nearest
Neighbor Regression (KNNR), and LightGBM. LightGBM can be
deemed an advanced implementation of the GBM and GBDT used in
GEFCom2014 [53]. The feature engineering is applied before
each regressor. Therefore, four wind power forecasting
pipelines are built by combining feature engineering with a
regressor.
In order to obtain probabilistic results, different strategies
are used. (1) LightGBM can directly provide quantile outputs
by setting the quantile loss function; therefore, 19 models
are built to obtain the quantile forecasts corresponding to
the 0.05:0.05:0.95 quantiles. (2) Ridge, SVR, and KNNR cannot
directly produce probabilistic predictions. We therefore
estimate the distribution of training-set prediction errors
within different power prediction intervals by Kernel Density
Estimation (KDE). At test time, quantile predictions are
obtained by combining the deterministic forecast with the
corresponding error distribution.
2) Technical route 2: Deep Learning
Three classical deep learning models for multi-horizon wind
power forecasting are selected as benchmarks: Seq2Seq [45],
DeepAR [54], and MQRNN [55]. These models are integrated into
Amazon's time series prediction library GluonTS [56].
Seq2Seq: two LSTMs are used as the Encoder and Decoder. The
Encoder LSTM takes the historical measurements as inputs; the
Decoder takes the multi-source NWP and the encoder context as
inputs. Seq2Seq gives the quantile forecasts by optimizing the
quantile loss function. The considered quantiles are set as
0.05:0.05:0.95 (19 quantiles).
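The quantile (pinball) loss optimized here, and reported later as the QL index, can be written as a generic NumPy sketch:

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Quantile loss: q*(y - yhat) if y >= yhat, else (q - 1)*(y - yhat)."""
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

# Averaged over the 19 quantiles 0.05, 0.10, ..., 0.95:
quantiles = np.arange(0.05, 1.0, 0.05)
y = np.array([0.3, 0.5, 0.8])
preds = {q: y for q in quantiles}   # perfect quantile forecasts, for illustration
ql = np.mean([pinball_loss(y, preds[q], q) for q in quantiles])  # 0 when perfect
```

The asymmetric weighting penalizes under-prediction by q and over-prediction by 1 - q, which is what makes the minimizer of this loss the q-th conditional quantile.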
DeepAR: the probability density of the target variable is
parameterized by the outputs of an LSTM. The Gaussian
likelihood of the predictive distribution is maximized during
training. Multi-step predictions are obtained step by step
through sampling during forecasting. The inputs of DeepAR
also include historical measurements and multi-source NWP.
MQRNN: historical measurements are encoded by an LSTM, and
the encoded information and future inputs are then decoded
through a global MLP branch and a local MLP branch. The
multi-source NWP is used as the future input of both the
global and local MLPs. MQRNN also outputs quantile values by
optimizing the quantile loss function. The considered
quantiles are set as 0.05:0.05:0.95 (19 quantiles).
C. Training Settings and Software
Hyper-parameters of MSTAN are selected by grid search.
The best hyper-parameter settings are presented in Table 2. In
this paper, the Adam optimizer is used. For the Adam optimizer,
the learning rate (LR) and batch size are the most important
hyperparameters. The LR of Adam is usually in the range
[0.0001, 0.1], and the default recommended LR is 0.001. In
order to find the optimal LR, we set the grid of LR to [0.01,
0.003, 0.001, 0.0003, 0.0001]. Batch size generally cannot be
too small or too large; it is recommended to select it from
2^n, n ∈ {3, 4, 5, 6}, etc. Since one year yields fewer than
300 training samples, the batch size should be much smaller
than 300. Therefore, we set the grid of batch size to [8, 16,
32, 64]. d_model and d_attention are model hyperparameters
that determine the capacity of the model. Since the training
data set is small, a small model capacity is sufficient, and
the grids of d_model and d_attention are both set to
[8, 16, 32, 64].
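The grid search over these hyper-parameters amounts to evaluating every combination on the validation set and keeping the best. A standard-library sketch follows; `evaluate` is a hypothetical stand-in for training MSTAN and scoring it on the validation set:

```python
from itertools import product

grid = {
    "lr": [0.01, 0.003, 0.001, 0.0003, 0.0001],
    "batch_size": [8, 16, 32, 64],
    "d_model": [8, 16, 32, 64],
    "d_attention": [8, 16, 32, 64],
}

def grid_search(evaluate):
    """Return the hyper-parameter combination with the lowest validation loss."""
    keys = list(grid)
    best, best_loss = None, float("inf")
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        loss = evaluate(params)
        if loss < best_loss:
            best, best_loss = params, loss
    return best

# Hypothetical scorer that happens to prefer the reported Case 2 settings
target = {"lr": 0.001, "batch_size": 32, "d_model": 16, "d_attention": 16}
best = grid_search(lambda p: sum(p[k] != target[k] for k in p))
```

With this grid, 5 x 4 x 4 x 4 = 320 trainings are required, which is only feasible because the model and dataset are both small.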
All the deep learning schemes used in this paper are
implemented by Pytorch [57]. Feature engineering + regressor
schemes are implemented by Sklearn [58] and LightGBM [59].
TABLE 2 THE HYPER-PARAMETERS OF MSTAN IN CASE TWO
Hyper Parameters      d_model = 16, LSTM_layer = 2, LSTM_hidden_dim = d_model = 16,
                      d_attention = 16, m = 3, T0 = 48, forecast horizon = 48
Optimizer             Adam, learning rate = 0.001, batch size = 32
Computing Resource    Apple M1
D. Evaluation criterion
The deterministic and probabilistic forecasting performance
is evaluated by several evaluation criteria. Normalized Root
Mean Square Error (NRMSE) and Normalized Mean Absolute
Error (NMAE) are used for deterministic forecasting evaluation.
For probabilistic forecasting evaluation, the standard Quantile
Loss (QL) index [53], the Continuous Ranked Probability Score
(CRPS) [60], and the Average Coverage Error (ACE) [61] are
used.
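Sketches of the deterministic criteria and ACE are given below (normalization by installed capacity and the ACE sign convention are assumptions; exact definitions vary slightly across papers):

```python
import numpy as np

def nrmse(y_true, y_pred, capacity):
    """Root mean square error, normalized by installed capacity."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2)) / capacity

def nmae(y_true, y_pred, capacity):
    """Mean absolute error, normalized by installed capacity."""
    return np.mean(np.abs(y_true - y_pred)) / capacity

def ace(y_true, lower, upper, nominal_coverage):
    """Average Coverage Error: empirical minus nominal interval coverage."""
    empirical = np.mean((y_true >= lower) & (y_true <= upper))
    return empirical - nominal_coverage
```

An ACE near zero means the prediction intervals are well calibrated; a negative ACE (as reported for MSTAN below) indicates intervals that cover slightly fewer observations than their nominal level.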
E. Results and Discussions
1) Deterministic results
NRMSE and NMAE results of the three cases are presented in
Table 3. The NRMSE and NMAE of MSTAN are lower than those of
the other counterparts. By
comparing with the best model using the first technical route,
the NRMSE of the proposed model in the three cases is reduced
by 0.8%, 1.1%, and 0.6%, and the NMAE is reduced by 0.6%,
1.0%, and 0.5%, respectively. Among the three algorithms using
the second technical route, the Seq2Seq algorithm performs
best. Compared with Seq2Seq, the proposed algorithm reduces
the NRMSE by 1.0%, 1.7%, and 1.0%, and the NMAE by 0.6%,
1.2%, and 0.6%.
TABLE 3 NRMSE AND NMAE OF THREE WIND FARMS
Methods        Case 1           Case 2           Case 3
               NRMSE   NMAE     NRMSE   NMAE     NRMSE   NMAE
Persistence    0.331   0.246    0.373   0.279    0.323   0.249
Ridge          0.174   0.126    0.171   0.124    0.163   0.119
KNN            0.182   0.131    0.184   0.131    0.173   0.125
SVM            0.175   0.126    0.170   0.122    0.160   0.116
LightGBM       0.174   0.125    0.168   0.118    0.162   0.117
DeepAR         0.187   0.143    0.198   0.156    0.189   0.146
Seq2Seq        0.176   0.125    0.176   0.124    0.164   0.116
MQRNN          0.179   0.139    0.176   0.135    0.166   0.125
Proposed       0.166   0.119    0.159   0.112    0.154   0.110
Figure 7 shows the averaged 48-hour NRMSE and NMAE
across all test samples for the proposed model and the DL
benchmarks. The 48-hour NRMSE and NMAE of MSTAN are
below those of the other three deep learning methods. A small
number of RMSE and MAE points of MSTAN are not the lowest, but
the overall error trend of the MSTAN model is better.
Fig. 7. The NRMSE and NMAE of MSTAN from 0th to 47th hours for Case 2.
The results are the averaged values across all testing samples.
2) Probabilistic results
QL and CRPS results of the three cases are presented in Table
4. The QL and CRPS of MSTAN are lower than those of the other
counterparts. Compared with the best model using the first
technical route, the QL of the proposed model is reduced by
5%, 7.2%, and 3.0% for the three cases, and the CRPS is
reduced by 14.4%, 10.8%, and 10.3%. Among the three algorithms
using the second technical route, the Seq2Seq algorithm
performs best. Compared with Seq2Seq, the proposed algorithm
reduces the QL by 4.4%, 12.5%, and 0.7%, and the CRPS by
15.6%, 13.3%, and 6.9%.
TABLE 4 QL AND CRPS OF THREE WIND FARMS
Methods        Case 1          Case 2          Case 3
               QL      CRPS    QL      CRPS    QL      CRPS
Persistence    -       -       -       -       -       -
Ridge          0.382   0.103   0.300   0.094   0.307   0.096
KNN            0.397   0.106   0.317   0.101   0.322   0.101
SVM            0.380   0.104   0.294   0.094   0.301   0.096
LightGBM       0.378   0.107   0.285   0.092   0.302   0.100
DeepAR         0.432   0.114   0.372   0.115   0.368   0.112
Seq2Seq        0.376   0.104   0.297   0.094   0.295   0.093
MQRNN          0.421   0.114   0.322   0.106   0.318   0.110
Proposed       0.360   0.090   0.264   0.083   0.293   0.087
Figure 8 shows the averaged 48-hour QL and CRPS across
all test samples for the proposed model and the DL benchmarks.
The probabilistic prediction results of the MSTAN model are
significantly better than those of the other deep learning
algorithms at most prediction horizons.
Fig. 8. Hourly QL and CRPS of MSTAN from 0th to 47th hours for Case 2. The
results are the averaged values across all testing samples.
The ACE scores of several benchmarks and the proposed model
are shown in the Appendix. For the three studied wind farms,
the ACE of MSTAN is below 0 at most confidence levels. The ACE
of Ridge and KNN is more stable and closer to 0 than that of
MSTAN in Case 1 and Case 3, but MSTAN performs better than
most other counterparts across the three cases. The ACE of
LightGBM, MQRNN, and DeepAR is poor in all three cases.
The probability distributions of wind power forecasts for the
next 48 hours are shown in Figure 9. In each sub-figure, the
probabilistic forecasts for three consecutive days are presented.
In each wind farm, the predicted values follow the actual wind
farm power outputs very well. When the deterministic wind
power forecasts are close to 0 MW or to the wind farm
capacity, the uncertainty of the wind power forecasts is low.
When the deterministic forecasts are close to 50% of the wind
farm capacity, the prediction uncertainty is high.
Fig. 9. The short-term probabilistic forecasts for the next 48 hours are deduced
every 24 hours. The generated quantiles are 5% to 95% (19 lines).
3) Performance variation evaluation by bootstrapping
Since the used dataset is small, bootstrapping is used to
estimate the significance of the deterministic and probabilistic
results [43]. Three kinds of comparisons are implemented in
this part, (1) the performance variation of different forecasting
models, (2) the performance variation of single-source NWP
and multi-source NWP schemes, (3) the performance variation
of module ablation. The box plots show the NRMSE and CRPS
variation using the bootstrap approach with 200 bootstrap
samples.
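The bootstrap procedure resamples the test set with replacement and recomputes the score each time; the spread of the resampled scores feeds the box plots. A minimal sketch with 200 resamples, as in the paper:

```python
import numpy as np

def bootstrap_scores(y_true, y_pred, metric, n_boot=200, seed=0):
    """Distribution of a metric over test sets resampled with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample indices
        scores.append(metric(y_true[idx], y_pred[idx]))
    return np.array(scores)                     # feed this to a box plot

rmse = lambda y, p: np.sqrt(np.mean((y - p) ** 2))
y = np.linspace(0.0, 1.0, 100)
p = y + 0.05                                    # constant-error toy forecast
dist = bootstrap_scores(y, p, rmse)
```

If the score distributions of two models barely overlap across resamples, the accuracy difference between them can be treated as significant, which is how Figures 10-12 are read.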
(a) The comparison between benchmarks and MSTAN
Results similar to Table 3 and Table 4 are acquired by
bootstrapping in Figure 10 (a) and (d), Figure 11 (a) and (d),
Figure 12 (a) and (d). These figures show that the accuracy
improvement of the proposed model is significant.
(b) The comparison between single-source and multi-source
NWP
As shown in Table 1, each wind farm has 4 sources of NWP.
The most accurate two-source and three-source NWP combinations
are picked based on the NWP RMSE ranking. The NWP RMSE ranking
of Case 1 (WF 6) is 3<1<2<4; that of Case 2 (WF 9) is 3<1<4<2;
and that of Case 3 (WF 2) is 3<1<2<4.
Single-source NWPs (source 1, source 2, source 3, and source
4) and the most accurate two-source, three-source, and
four-source NWP combinations are compared in Figure 10 (b) and
(e), Figure 11 (b) and
Fig. 10. NRMSE and CRPS of case 1.
Fig. 11. NRMSE and CRPS of case 2.
Fig. 12. NRMSE and CRPS of case 3.
(a) NRMSE of benchmarks and MSTAN. (b) NRMSE of MSTAN by using single-source NWP and multi-source NWP. (c) NRMSE of MSTAN with and without
different modules.
(d) CRPS of benchmarks and MSTAN. (e) CRPS of MSTAN by using single-source NWP and multi-source NWP. (f) CRPS of MSTAN with and without different
modules.
The box plots show the NRMSE and CRPS variation using the bootstrap approach with 200 bootstrap samples.
(e), and Figure 12 (b) and (e). As depicted in these figures,
the poorest prediction performance of MSTAN always appears
when single-source NWP is used. The best prediction results of
MSTAN are achieved when using four-source NWPs for all cases.
The results of using the best two-source NWPs and the best
three-source NWPs are close. At wind farm 2, the results of
using one source (source 1 or source 3), the best two-source
NWPs, and the best three-source NWPs are close.
In addition, the NWP with the smaller RMSE does not
necessarily lead to better wind power prediction accuracy.
This phenomenon is common and may be caused by the nonlinear
relationship between wind speed and power. It brings
additional risk to single-source NWP forecasting schemes.
(c) Ablation experiments
In order to demonstrate the effectiveness of the designed
model, ablation experiments are implemented. Specifically, we
remove one module at a time from the MSTAN and readjust the
hyper-parameters. The MSTAN variants without each module are
named as follows:
w_o selection: The MSTAN model without the multi-source
variable attention module.
w_o temp_attn: The MSTAN model without the temporal
attention module.
w_o skip: The MSTAN model without the residual module.
As shown in Figure 10 (c) and (f), Figure 11 (c) and (f),
Figure 12 (c) and (f), the best prediction results are acquired by
the intact MSTAN. For all three cases, removing the temporal
attention module or the residual module causes a significant
prediction performance drop, while removing the multi-source
variable attention module causes only a slight performance
drop.
4) Temporal Attention weights pattern
When giving the forecasts at a time step, the MSTAN model
considers not only the inputs of the current time step but
also the inputs of other time steps. The temporal attention
weights determine which time steps should be attended to. As
shown in Figure 13(a), the historical wind power and the
48-hour forecasts of one sample are drawn. Figure 13(b) plots
the weight matrix of the temporal attention module for this
sample, and Figure 13(c) shows the 3D version of Figure 13(b).
The Y-axis represents the lead time step of the output
sequence, the X-axis represents the considered time step of
the input sequence, and the Z-axis represents the value of the
attention weights. At each output time step, the model must
determine how to assign the weights to each input time step.
In Figure 13(a), the next 48-hour real wind power sequence can
be divided into three parts. The first part is a downward ramp
process (0th to 18th hour). The second part is a fast upward
ramp process (18th to 30th hour). The third part is a smooth
process (30th to 47th hour).
In the first process, the wind power prediction results focus
on the input sequence from the 12th to the 18th hour. The
forecasts of the second process focus on the input sequence
from the 18th to the 48th hour. The forecasts of the last
process give higher weights to the input sequence from the
40th to the 47th hour. Therefore, the learned temporal pattern
follows the trends of the real wind power sequence. This
ability to dynamically attend to the critical parts of the
sequence is not available in DL models such as LSTM and CNN.
Fig. 13. The temporal attention weights of one day (case 2). Subfigure (c) is
the 3D version of subfigure (b).
V. CONCLUSIONS
In this paper, a Multi-Source and Temporal Attention
Network (MSTAN) is proposed for the short-term WPPF. The
MSTAN model takes the multi-source NWP data and historical
measurement sequences as inputs and yields the next 48 hours'
wind power density forecasts as outputs. The MSTAN is
constructed from four major modules. (1) In order to
dynamically select the driving variables and reduce the
harmful effects raised by introducing multi-source NWP, a
novel multi-source selection module is designed. (2) The
temporal attention module is proposed to extract the long-term
temporal dependency hidden in the multi-source NWP. (3) The
residual module is wrapped into the MSTAN model to provide
adaptive complexity and avoid overfitting. (4) The beta
kernel-based mixture density module is used to output the
multi-step probabilistic prediction results.
Based on the case study over three selected wind farms, the
MSTAN is strictly compared with two state-of-the-art technical
routes. Results demonstrate that MSTAN gives higher
deterministic prediction accuracy and better probabilistic
evaluation scores. The effectiveness of the multi-source
selection module, the temporal attention module, and the
residual module is demonstrated respectively.
Some work remains to improve the proposed MSTAN architecture
further. (1) The proposed model only considers the temporal
dependency, but spatial dependence is also important for wind
power forecasting; novel spatial attention or spatial feature
extraction modules should be merged into MSTAN. (2) In order
to meet the demands of more wind farms, the applicability of
MSTAN at other time resolutions should be verified.
VI. REFERENCE
[1] J. Yan, Y. Liu, S. Han, Y. Wang, and S. Feng, "Reviews on
uncertainty analysis of wind power forecasting," Renewable and Sustain.
Energy Rev., vol. 52, pp. 1322-1330, 2015.
[2] G. Sideratos and N. D. Hatziargyriou, “An Advanced Statistical Method
for Wind Power Forecasting,” IEEE Trans. Power Syst., vol. 22, no. 1,
pp. 258-265, Feb. 2007.
[3] J. R. Andrade and R. J. Bessa, “Improving Renewable Energy
Forecasting with a Grid of Numerical Weather Predictions,” IEEE Trans.
Sustain. Energy, vol. 8, no. 4, pp. 1571-1580, Oct. 2017.
[4] J. W. Taylor, P. E. McSharry and R. Buizza, “Wind Power Density
Forecasting Using Ensemble Predictions and Time Series Models,” IEEE
Trans. Energy Convers., vol. 24, no. 3, pp. 775-782, Sept. 2009.
[5] W. Xie, P. Zhang, R. Chen, and Z. Zhou, “A Nonparametric Bayesian
Framework for Short-Term Wind Power Probabilistic Forecast,” IEEE
Trans. Power Syst., vol. 34, no. 1, pp. 371-379, Jan. 2019.
[6] P. Du, “Ensemble Machine Learning-Based Wind Forecasting to
Combine NWP Output with Data from Weather Station,” IEEE Trans.
Sustain. Energy, vol. 10, no. 4, pp. 2133-2141, Oct. 2019.
[7] N. Chen, Z. Qian, I. T. Nabney and X. Meng, “Wind Power Forecasts
Using Gaussian Processes and Numerical Weather Prediction,” IEEE
Trans. Power Syst., vol. 29, no. 2, pp. 656-665, March 2014.
[8] Y. Zhang and J. Wang, “A Distributed Approach for Wind Power
Probabilistic Forecasting Considering Spatio-Temporal Correlation
Without Direct Access to Off-Site Information,” IEEE Trans. Power
Syst., vol. 33, no. 5, pp. 5714-5726, Sept. 2018.
[9] Z. Wang, W. Wang, C. Liu, Z. Wang, and Y. Hou, “Probabilistic
Forecast for Multiple Wind Farms Based on Regular Vine Copulas,”
IEEE Trans. Power Syst., vol. 33, no. 1, pp. 578-589, Jan. 2018.
[10] M. Khodayar and J. Wang, “Spatio-Temporal Graph Deep Neural
Network for Short-Term Wind Speed Forecasting,” IEEE Trans. Sustain.
Energy, vol. 10, no. 2, pp. 670-681, April. 2019.
[11] M. Landry, T. P. Erlinger, D. Patschke, and C. Varrichio, "Probabilistic
Gradient Boosting Machines for GEFCom2014 Wind Forecasting," Int.
J. Forecast., vol. 32, no. 3, pp. 1061-1066, 2016.
[12] L. Silva, "A Feature Engineering Approach to Wind Power
Forecasting," Int. J. Forecast., vol. 30, no. 2, pp. 395-401, 2014.
[13] K. Bhaskar and S. N. Singh, "AWNN-Assisted Wind Power Forecasting
Using Feed-Forward Neural Network," IEEE Trans. Sustain. Energy, vol.
3, no. 2, pp. 306-315, April 2012.
[14] F. Davò and S. Alessandrini, "Post-Processing Techniques and
Principal Component Analysis for Regional Wind Power and Solar
Irradiance Forecasting," Solar Energy, vol. 134, pp. 327-338, 2016.
[15] Y. Wu, Q. Wu and J. Zhu, “Data-driven wind speed forecasting using
deep feature extraction and LSTM,” IET Renew. Power Gene., vol. 13,
no. 12, pp. 2062-2069, 2019.
[16] S. Li, P. Wang, and L. Goel, “Wind Power Forecasting Using Neural
Network Ensembles with Feature Selection,” IEEE Trans. Sustain.
Energy, vol. 6, no. 4, pp. 1447-1456, Oct. 2015.
[17] K. Shi, Y. Qiao, W. Zhao, Q. Wang, M. Liu, and Z. Lu, "An
Improved Random Forest Model of Short-term Wind-power
Forecasting to Enhance Accuracy, Efficiency, and Robustness," Wind
Energy, vol. 21, no. 12, pp. 1383-1394, 2018.
[18] L. Li, Y. Liu, Y. Yang, and S. Han, "Short-term wind speed forecasting
based on CFD pre-calculated flow fields," Proceed. Chinese Soc.
Electric. Eng., vol. 33, no. 7, pp. 27-32, 2013.
[19] L. Landberg, "A mathematical look at a physical power prediction
model," Wind Energy, vol. 1, no. 1, pp. 23-28, 1998.
[20] E. Erdem, and S. Jing, “ARMA based approaches for forecasting the
tuple of wind speed and direction,” Appl Energy, vol.88, no.4, pp.1405-
1414, 2011.
[21] P. Louka, G. Galanis, N. Siebert, et al., "Improvements in wind speed
forecasts for wind power prediction purposes using Kalman filtering," J.
Wind Eng. & Indus. Aero., vol. 96, no. 12, pp. 2348-2362, 2008.
[22] E. Mangalova and O. Shesterneva, "K-nearest neighbors for
GEFCom2014 probabilistic wind power forecasting," Int. J. Forecast.,
vol. 32, no. 3, pp. 1067-1073, 2016.
[23] H. S. Dhiman, D. Deb, and J. M. Guerrero, “Hybrid machine intelligent
SVR variants for wind forecasting and ramp events,” Renew. Sustain.
Energy Rev., vol. 108, pp. 369-379, 2019.
[24] M. Landry, T.P. Erlinger, D. Patschke, and C. Varrichio, “Probabilistic
gradient boosting machines for GEFCom2014 wind forecasting,” Int. J.
Forecast., vol. 32, no. 3, pp. 1061-1066, 2016.
[25] Y. Zhao, L. Ye, P. Pinson, Y. Tang, and P. Lu, "Correlation-constrained
and sparsity-controlled vector autoregressive model for spatio-temporal
wind power forecasting," IEEE Trans. Power Syst., vol. 33, no. 5, pp.
5029-5040, 2018.
[26] J. W. Messner and P. Pinson, "Online adaptive lasso estimation in vector
autoregressive models for high dimensional wind power forecasting," Int.
J. Forecast., vol. 35, no. 4, pp. 1485-1498, 2019.
[27] L. Cavalcante, J. B. Ricardo, R. Marisa, and J. Browell, “LASSO vector
autoregression structures for very short‐term wind power forecasting,”
Wind Energy, vol.20, no. 4, pp. 657-675, 2017.
[28] H. Quan, D. Srinivasan, and A. Khosravi, “Short-Term Load and Wind
Power Forecasting Using Neural Network-Based Prediction Intervals,”
IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 2, pp. 303-315, Feb.
2014.
[29] C. Wan, Z. Xu, P. Pinson, Z. Y. Dong, and K. P. Wong, “Optimal
Prediction Intervals of Wind Power Generation,” IEEE Trans. Power
Syst., vol. 29, no. 3, pp. 1166-1174, May 2014.
[30] Z. Shi, H. Liang, and V. Dinavahi, “Direct Interval Forecast of Uncertain
Wind Power Based on Recurrent Neural Networks,” IEEE Trans. Sustain.
Energy, vol. 9, no. 3, pp. 1177-1187, Jul. 2018.
[31] Y. Lin, M. Yang, C. Wan, J. Wang, and Y. Song, “A Multi-Model
Combination Approach for Probabilistic Wind Power Forecasting,”
IEEE Trans. Sustain. Energy., vol. 10, no. 1, pp. 226-237, Jan. 2019.
[32] T. Li, Y. Wang, and N. Zhang, “Combining Probability Density
Forecasts for Power Electrical Loads,” IEEE Trans. Smart Grid, vol. 11,
no. 2, pp. 1679-1690, Mar. 2020.
[33] H. Z. Wang, G. B. Wang, G. Q. Li, J. C. Peng, and Y. T. Liu, “Deep
belief network based deterministic and probabilistic wind speed
forecasting approach,” Appl. Energy, vol. 182, pp. 80-93, 2016.
[34] C. Zhang, C. L. P. Chen, M. Gan, and L. Chen, “Predictive Deep
Boltzmann Machine for Multiperiod Wind Speed Forecasting,” IEEE
Trans. on Sustain. Energy, vol. 6, no. 4, pp. 1416-1425, Oct. 2015.
[35] J. Yan, H. Zhang, Y. Liu, S. Han, L. Li, and Z. Lu, “Forecasting the High
Penetration of Wind Power on Multiple Scales Using Multi-to-Multi
Mapping,” IEEE Trans. on Power Syst., vol. 33, no. 3, pp. 3276-3284,
May 2018.
[36] A. Banik, C. Behera, T. V. Sarathkumar and A. K. Goswami, “Uncertain
wind power forecasting using LSTM-based prediction interval,” IET
Renew. Power Gene., vol. 14, no. 14, pp. 2657-2667, Oct. 2020.
[37] C. Li, G. Tang, X. Xue, A. Saeed, and X. Hu, “Short-Term Wind Speed
Interval Prediction Based on Ensemble GRU Model,” IEEE Trans.
Sustain. Energy, vol. 11, no. 3, pp. 1370-1380, Jul. 2020.
[38] H. Wang, G. Li, G. Wang, J. Peng, H. Jiang, and Y. Liu, “Deep learning
based ensemble approach for probabilistic wind power forecasting,”
Appl. Energy, vol. 188, pp. 56-70, 2017.
[39] Y. Hong, C Lian, and P. P. Rioflorido, “A hybrid deep learning-based
neural network for 24-h ahead wind power forecasting,” Appl. Energy,
vol. 250, pp. 530-539, 2019.
[40] A. Borovykh, S. Bohte, and C. W. Oosterlee, "Conditional Time Series
Forecasting with Convolutional Neural Networks," 2017, [Online]
Available: https://arxiv.org/abs/1703.04691.
[41] P. Kou, C. Wang, D. Liang, S. Cheng, and L. Gao, “Deep learning
approach for wind speed forecasts at turbine locations in a wind farm,”
IET Renew. Power Gene., vol. 14, no. 13, pp. 2416-2428, Oct. 2020.
[42] Y. Yu, X. Han, M. Yang, and J. Yang, “Probabilistic Prediction of
Regional Wind Power Based on Spatiotemporal Quantile Regression,”
IEEE Trans. Indust. Appl., vol. 56, no. 6, pp. 6117-6127, Dec. 2020.
[43] R. J. Bessa, C. Möhrlen, V. Fundel, et al., "Towards Improved
Understanding of the Applicability of Uncertainty Forecasts in the
Electric Power Industry," Energies, vol. 10, no. 9, 2017.
[44] J. W. Messner, P. Pinson, J. Browel, M. B. Bjerregård, I. Schicker,
“Evaluation of Wind Power Forecasts – an up-to-Date View,” Wind
Energy, vol. 23, no. 6, pp.1461–1481, 2020.
[45] I. Sutskever, O. Vinyals, and Q. V. Le., “Sequence to Sequence Learning
with Neural Networks,” Proceed. The 27th Int. Conf. NIPS, pp. 3104–12,
2014.
[46] G. Giebel, "The State-of-the-Art in Short-Term Prediction of Wind
Power: A Literature Overview," Risø National Laboratory, Denmark, Aug.
2003.
[47] S. S. Soman, H. Zareipour, O. Malik and P. Mandal, “A review of wind
power and wind speed forecasting methods with different time horizons,”
North American Power Symposium 2010, Arlington, TX, USA, 2010,
pp. 1-8.
[48] Y. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language Modeling
with Gated Convolutional Networks,” In Proceed. the 34th Int. Conf.
Machine Learning, 2017.
[49] K. He, X. Zhang, S. Ren, and J. Sun, "Identity Mappings in Deep
Residual Networks," Euro. Conf. Computer Vision, pp. 630-645, 2016.
[50] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer Normalization," 2016, [Online]
Available: https://arxiv.org/pdf/1607.06450v1.pdf
[51] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
L. Kaiser, and I. Polosukhin, "Attention Is All You Need," Proceed. the
31st Int. Conf. NIPS, pp. 5998-6008, 2017.
[52] H. Zhang, Y. Liu, J. Yan, S. Han, L. Li, and Q. Long, “Improved Deep
Mixture Density Network for Regional Wind Power Probabilistic
Forecasting,” IEEE Trans. Power Syst., vol. 35, no. 4, pp. 2549-2560,
Jul. 2020.
[53] T. Hong, P. Pinson, S. Fan, H. Zareipour, A. Troccoli, and R. J.
Hyndman, "Probabilistic Energy Forecasting: Global Energy
Forecasting Competition 2014 and Beyond," Int. J. Forecast., vol. 32,
no. 3, pp. 896-913, 2016.
[54] D. Salinas, V. Flunkert, and J. Gasthaus, "DeepAR: Probabilistic
Forecasting with Autoregressive Recurrent Networks," Int. J. Forecast.,
vol. 36, no. 3, pp. 1181-1191, 2020.
[55] R. Wen, K. Torkkola, B. Narayanaswamy, and D. Madeka, “A Multi-
Horizon Quantile Recurrent Forecaster,” 2017, [Online] Available:
https://arxiv.org/pdf/1711.11053.pdf
[56] A. Alexandrov, K. Benidis, and M. B. Schneider, et al., “GluonTS:
Probabilistic Time Series Models in Python,” 2019, [Online] Available:
https://arxiv.org/pdf/1906.05264.pdf
[57] A. Paszke, S. Gross, S. Chintala, et al., “Automatic differentiation in
PyTorch,” NIPS 2017 Workshop Autodiff, Oct. 2017.
[58] F. Pedregosa, G. Varoquaux, A. Gramfort, et al, “Scikit-Learn: Machine
Learning in Python,” J. Machine. Learn. Res., vol. 12, no. 85, pp. 2825–
2830, 2011.
[59] G. Ke, Q. Meng, T. Finley, et al., “LightGBM: A Highly Efficient
Gradient Boosting Decision Tree,” Proceed. the 31st Int. Conf. NIPS, vol.
30, pp. 3149–3157, 2017.
[60] H. Hersbach, "Decomposition of the Continuous Ranked Probability Score
for Ensemble Prediction Systems," Weather and Forecasting, vol. 15, no.
5, pp. 559-570, 2000.
[61] C. Wan, Z. Xu, P. Pinson, Z. Y. Dong, and K. P. Wong, "Probabilistic
Forecasting of Wind Power Generation Using Extreme Learning
Machine," IEEE Trans. Power Syst., vol. 29, no. 3, pp. 1033-1044, May
2014.
VII. APPENDIX
A. The ACE results
Fig. 14. The Average Coverage Error (ACE) for the three studied wind farms.
B. The Temporal Error Pattern of Multi-source NWP for other two wind farms
Fig. 15. The temporal error pattern of multi-source NWP wind speed (Case 1 and Case 3).
(Four NWP sources and two kinds of time series prediction methods are studied. The one-year dataset is used.)