Autonomous Vehicle Trajectory Prediction on
Multi-Lane Highways Using Attention based Model
1st Omveer Sharma
School of Electrical Sciences
Indian Institute of Technology
Bhubaneswar, India
os10@iitbbs.ac.in
2nd N. C. Sahoo
School of Electrical Sciences
Indian Institute of Technology
Bhubaneswar, India
ncsahoo@iitbbs.ac.in
3rd Niladri B. Puhan
School of Electrical Sciences
Indian Institute of Technology
Bhubaneswar, India
nbpuhan@iitbbs.ac.in
Abstract—To navigate complex traffic scenarios safely and effectively, an autonomous vehicle anticipates its own behaviour and future trajectory from the expected trajectories of surrounding vehicles in order to prevent potential collisions. The estimated trajectories of the surrounding vehicles (target vehicles) are in turn influenced by their own past trajectories and the positions of their neighbours. In this study, a novel Transformer-based
network is used to predict autonomous vehicle trajectory in
highway driving. Transformer’s multi-head attention method is
employed to capture social-temporal interaction between the
target vehicle and its surroundings. The performance of the
proposed model is compared with Recurrent Neural Network
(RNN) based sequential models, using the NGSIM dataset. The
results show that the proposed model predicts 5s long trajectory
with 10% lower Root-Mean-Square Error (RMSE) than the
RNN-based state-of-the-art model.
Index Terms—Trajectory prediction, Transformer, Intelligent
vehicle, Sequential network, Autonomous Vehicle.
I. INTRODUCTION
Trajectory prediction is an indispensable task for an au-
tonomous vehicle (AV) in complex and autonomous driving
scenarios. An AV plans its own trajectory after anticipating the future trajectories of surrounding vehicles, as does an advanced driving assistance system (ADAS) [1–4]. The AV (black) is surrounded by red target vehicles whose trajectories must be determined, as shown in Fig. 1. Further, a target vehicle's
previous track and the positions of its surrounding vehicles
affect its future trajectory. The phrase ‘surrounding vehicles’
is used throughout the rest of the article to refer to the target
vehicle’s immediate neighbouring vehicles.
The trajectory predictors can be split into two major cat-
egories: model-based and data-driven approaches. A model-based prediction strategy forecasts a vehicle's trajectory using kinematic or statistical models. Trajectory estimation has been done using model-based techniques such as the constant acceleration (CA) model, constant velocity (CV) model, constant yaw rate and acceleration motion model, Kalman filter, hidden Markov model (HMM), and Gaussian mixture model (GMM) [5–9]. Most model-based techniques can only predict short-term trajectories, and they can produce large prediction errors if there is even a small difference between the driver's actual and predicted behaviour. The limitations of model-based
979-8-3503-1997-2/23/$31.00 ©2023 IEEE
Fig. 1: An autonomous vehicle, positioned in the center, is
predicting the future paths of nearby vehicles.
prediction techniques have been significantly overcome by
data-driven prediction techniques.
For trajectory prediction, several RNN based models have
been proposed [10–13]. Gated Recurrent Units (GRUs) were
incorporated into a generative adversarial imitation learning
trajectory prediction model in [11]. To incorporate informa-
tion from several agents, the Social Generative Adversarial
Networks (S-GAN) model uses both GAN and a recurrent
sequence-to-sequence model in [14]. The research mentioned above successfully illustrates temporal correlation in terms of individual motion states; however, it is equally crucial to consider the spatial relationships between vehicles. These
relationships significantly influence the trajectory of the target
vehicle.
In [15], the spatiotemporal attention long short-term mem-
ory (STA-LSTM) framework is proposed to predict the target
vehicle’s future trajectory. The model, however, only works
well when forecasting future 1 second long trajectories. A
novel approach is introduced through the Attention-based
LSTM encoder-decoder (LSTM-ED) frameworks, which effec-
tively correlate the time dimension and space dimension [16–
18]. Their primary purpose is to generate a spatial-temporal
navigation map. Instead of establishing vehicle spatial rela-
tionships, the authors estimated the target vehicle’s future
behaviour from past traffic participant tracks to predict the
future trajectory [19–21]. Shi et al. [20] introduced a novel
attention network that combines temporal convolution neural
2023 IEEE 3rd International Conference on Sustainable Energy and Future Electric Transportation (SEFET) | 979-8-3503-1997-2/23/$31.00 ©2023 IEEE | DOI: 10.1109/SEFET57834.2023.10245038
network (TCN) and bi-directional long-short term memory
(Bi-LSTM). This network aims to accurately predict lane
keep (LK) and lane change (LC) behavior, along with future
trajectories.
Occupancy grids are utilized in most spatial-temporal
attention-based frameworks [22, 23]. Individual sequential
models (encoders) establish temporal correlation and extract
temporal-dependent sequences to provide spatiotemporal at-
tention. Lastly, a sequential model-based decoder predicts
the target vehicle trajectory utilising the spatial-temporal envi-
ronment. Complex networks with many encoders make trajec-
tory prediction slow and RNN-based encoders propagate input
data sequentially, limiting parallel operation. Gradient vanishing in RNN-based models decreases trajectory prediction accuracy and causes instability over extended sequences. A Transformer-based encoder-decoder architecture addresses the gradient vanishing and computation time issues by accepting the complete
input sequence, unlike RNN-based models. Transformer (TF)
models are popular for sequence-to-sequence learning issues
[24–28]. However, vehicle trajectory prediction on multi-lane
highways has not been investigated using TF-based modelling.
In order to forecast future trajectory utilising a short seg-
ment of tracking data, this work introduces a novel pure
attention-based spatial-temporal attention framework (STA-
TF). The main technological contributions are summarized
below:
1) To assess the influence of surrounding vehicle trajec-
tories on the target vehicle, robust multi-head attention
mechanisms are employed.
2) The proposed model leverages the advantages of TF to
achieve faster performance compared to existing sequen-
tial networks such as LSTM, GRU, and Bi-LSTM.
3) The trajectory prediction issue has been addressed by
adaptation, customisation, and establishment of the pow-
erful Transformer model.
4) A real-world NGSIM dataset is utilized to evaluate the
potential of the proposed model in predicting vehicle
trajectories in highway driving scenarios. The results
demonstrate that the proposed model outperforms ex-
isting RNN-based models.
The structure of this paper is as follows: Section II introduces
the formulation of the trajectory prediction task. The network
architecture of the proposed model is explained in Section
III. Section IV presents the experimental results, and finally,
Section V concludes the work.
II. PROBLEM FORMULATION
The proposed model’s goal is to forecast the target ve-
hicle’s future trajectory using the most recent tracks of the
target and its surrounding vehicles as of the observation time ($t_{obs}$). While driving on a highway, various driving styles are
exhibited by drivers, each of which significantly influences
the future trajectory of the vehicle. This diversity of driving
styles adds complexity to the task of accurately predicting the
vehicle’s trajectory.
Fig. 2: Target and surrounding vehicles in modified frame of
reference.
A. Inputs and outputs

The proposed model's input consists of the previous tracks of the target (T) and its six immediate neighbours [19]: the three vehicles preceding it in the left, current, and right lanes (PLL, PCL, and PRL), as well as the three vehicles following it (FLL, FCL, and FRL), as shown in Fig. 2. The past trajectory of a surrounding vehicle $i \in \{PLL, FLL, PCL, FCL, PRL, FRL\}$ is defined as $S_i = \{X^i_{t_{obs}-L_{in}+1}, X^i_{t_{obs}-L_{in}+2}, \ldots, X^i_{t_{obs}}\}$, where $L_{in}$ is the input sequence length and $X^i_t = (x^i_t, y^i_t)$ is the positional vector. The target vehicle's historical track is specified as $S_T = \{X^T_{t_{obs}-L_{in}+1}, X^T_{t_{obs}-L_{in}+2}, \ldots, X^T_{t_{obs}}\}$, where $X^T_t = (x^T_t, y^T_t, v^T_t, \alpha^T_t, class)$ is the feature vector, $v^T_t$ is the target vehicle's velocity, $\alpha^T_t$ is the target vehicle's acceleration, and class is the type of the target vehicle (bike, car or truck). These past trajectories of the target and its surrounding vehicles are fed into the proposed model as input. The model predicts the target vehicle's future trajectory (positional feature vectors), which can be stated as follows:

$O_T = \{Y^T_{t_{obs}+1}, Y^T_{t_{obs}+2}, \ldots, Y^T_{t_{obs}+L_{out}}\}$   (1)

where $Y^T_t = (x^T_t, y^T_t)$ are the predicted future coordinates for the target vehicle.
B. Frame of reference
At the observation time $t_{obs}$, a stationary frame of reference
is established with the target vehicle serving as the origin. In
this frame, the y-axis represents forward motion, while the x-
axis is perpendicular to it, as illustrated in Fig. 2. Because of
this technique, the proposed model is unaffected by vehicle
track generation [19].
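The change of reference frame can be sketched as follows. This is a minimal numpy sketch; the heading-based rotation and the function name are assumptions, since the text only states that the target vehicle at $t_{obs}$ becomes the origin with the y-axis pointing forward:

```python
import numpy as np

def to_target_frame(tracks, target_pos, heading):
    """Express vehicle tracks in the stationary frame anchored at the
    target vehicle's position at t_obs, with the y-axis pointing along
    the target's direction of travel.

    tracks:     (num_vehicles, L_in, 2) global (x, y) positions
    target_pos: (2,) global position of the target at t_obs
    heading:    target heading angle (radians, measured from the +x axis)
    """
    shifted = tracks - target_pos  # translate: target at t_obs becomes the origin
    c, s = np.cos(heading), np.sin(heading)
    # rotate so the heading direction maps onto the +y (forward) axis
    rot = np.array([[s, -c],
                    [c,  s]])
    return shifted @ rot.T
```

Applying this to every training sample makes the model invariant to where on the highway the track was recorded, which is the stated purpose of the modified frame of reference.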
III. NETWORK ARCHITECTURE
Encoder-decoder Transformer models potentially overcome
RNN-based models’ difficulties, as described in the intro-
duction. The proposed model (STA-TF) overcomes gradient
vanishing constraints by processing the entire input sequence
with a TF encoder layer. The TF encoder prioritizes input segments using the multi-head attention (MHA) mechanism. In order to
make precise assumptions about vehicle motions, it is crucial
to have a thorough understanding of the interactions and rela-
tionships between traffic participants on the road. Therefore,
the proposed model architecture is divided into three key
Fig. 3: Proposed model architecture
components: (1) Spatial attention network, (2) Encoder, and
(3) Decoder. These components are illustrated in Fig. 3. MHA
satisfies the specific requirements of each of the three model
components.
A. Multi-head Attention (MHA) Mechanism
Attention correlates a sequence's numerous locations in order to identify the hidden representation of the sequence. The query, key, and value concepts from the information retrieval approach are used to do this. Attention, in particular, produces a weighted sum of all values, where the keys and the queries decide the weights. As an illustration, consider $Q \in \mathbb{R}^{L_{in} \times d_q}$ to be the query matrix composed of the $d_q$-dimensional query vectors corresponding to the various positions in the $L_{in}$-length sequence. Similar to $Q$, $K \in \mathbb{R}^{L_{in} \times d_k}$ and $V \in \mathbb{R}^{L_{in} \times d_v}$ represent the key-value pairs for the various positions in the sequence, where the query, key, and value vector dimensions match ($d_q = d_k = d_v$). The attention weights, denoted as $W_A = softmax(QK^T/\sqrt{d_k})$, are computed using the matrices $Q$ and $K$. Subsequently, the scaled dot product can be calculated using Eq. 2. In their work [24], the authors propose linearly projecting the matrices $Q$, $K$, and $V$ multiple times (referred to as $h$ times, or heads) to parallelize the scaled dot-product attention for each head, a technique known as “multi-head attention” (MHA). This approach allows the model to jointly attend to multiple representation subspaces. Finally, Eq. 3 is employed to combine the outputs from the various heads.

$Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$   (2)

$MultiHead(Q, K, V) = Concat(head_1, \ldots, head_h)W_o$   (3)

Fig. 4: Multi-head attention mechanism (MHA) [24].

Fig. 4 shows the scaled dot-product attention and the MHA procedure. In MHA, the dimension of the key vector, $d_k = d_{model}/h$, is calculated from the model dimension ($d_{model} = 128$) and the number of heads ($h = 8$). It should be noticed that the attention weight matrix has a dimension of $L_{in} \times L_{in}$. In this weight matrix, element $W_A^{fg}$ indicates the attention between the $f$-th position of $Q$ (the feature vector of matrix $Q$ at time instant $f$) and the $g$-th position of $K$ (the feature vector of matrix $K$ at time instant $g$). Thus, a weighted correlation between all time instants (all positions of $Q$ and $K$) can be inferred from this attention weight matrix.
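Eqs. 2-3 can be illustrated with a short numpy sketch; the random matrices below stand in for the learned projection weights ($W_q$, $W_k$, $W_v$, $W_o$), so this is an illustrative sketch rather than the trained model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Eq. 2: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.swapaxes(-2, -1) / np.sqrt(d_k))  # (L_in, L_in)
    return weights @ V, weights

def multi_head_attention(Q, K, V, h=8, seed=0):
    """Eq. 3 sketch: project h times, attend per head, concatenate, project.
    Random projections stand in for the learned weight matrices."""
    rng = np.random.default_rng(seed)
    d_model = Q.shape[-1]
    d_k = d_model // h  # per-head dimension, e.g. 128 / 8 = 16 in the paper
    heads = []
    for _ in range(h):
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
        out, _ = scaled_dot_product_attention(Q @ Wq, K @ Wk, V @ Wv)
        heads.append(out)
    Wo = rng.standard_normal((d_model, d_model))
    return np.concatenate(heads, axis=-1) @ Wo
```

Each row of the returned weight matrix sums to one, which is exactly the $L_{in} \times L_{in}$ attention weight matrix $W_A$ discussed above.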
B. Spatial attention network
A spatial attention network is used to establish the relationship of the six surrounding vehicles with the target vehicle. In this network, the positional trajectories of the seven traffic participants (target and six surrounding vehicles), $S_i$ and $S_T$, are used as input and fed to individual feed-forward layers. In the absence of any surrounding vehicle, the position of that vehicle is represented as a two-dimensional zero vector. For the six surrounding vehicles, six MHA layers are used to establish a correlation between each surrounding vehicle trajectory and the past target vehicle trajectory. The outputs of the six feed-forward layers ($O^i_{S,FF1} \in \mathbb{R}^{L_{in} \times d_{model}}$) serve as the $K$ and $V$ pair matrices, and the output for the target vehicle trajectory ($O^T_{S,FF1} \in \mathbb{R}^{L_{in} \times d_{model}}$) serves as matrix $Q$ for the six MHA layers. The outputs of the MHA layers, $O^i_{S,MHA} \in \mathbb{R}^{L_{in} \times d_{model}}$ (representing the correlation of the $i$-th surrounding vehicle with the target vehicle), and $O^T_{S,FF1}$ are concatenated and passed through another feed-forward layer to adjust the dimension of the output $O_{S,FF2} \in \mathbb{R}^{L_{in} \times d_{model}}$.
C. Encoder layer
The proposed model's encoder determines the temporal correlation in the input sequence (the output of the spatial attention network). The encoder layer of the vanilla Transformer network is adopted to perform this task [24]. To incorporate temporal information and leverage sequential correlations between time steps, the encoder receives the output ($O_{S,FF2}$) from the spatial attention network, which is then passed through a positional encoding layer. The positional encoding layer utilizes both sine and cosine functions. The resulting output of the positional encoding layer ($O^{Encoder}_{pos} \in \mathbb{R}^{L_{in} \times d_{model}}$) is calculated as follows:

$O^{Encoder}_{pos} = O_{S,FF2} + PE$   (4)

where the positional encoding coefficient matrix is represented by $PE$. After position encoding, the output is transmitted to the encoder layer of the TF. The encoder layer is composed of an MHA layer and a fully connected feed-forward network (FFN). The $Q$, $K$, and $V$ inputs for the MHA layer are provided by $O^{Encoder}_{pos}$. The output of the MHA layer ($O^{Encoder}_{MHA} \in \mathbb{R}^{L_{in} \times d_{model}}$) is calculated using Eqs. 2-3. As mentioned earlier, the attention weights allow the MHA layer to establish the hidden temporal relationships in its input sequence. The output of the MHA layer is processed through an Add & Normalization (Norm) step as follows:

$O^{Encoder}_{add\&norm1} = Norm(O^{Encoder}_{MHA} + O^{Encoder}_{pos})$   (5)

This output, $O^{Encoder}_{add\&norm1} \in \mathbb{R}^{L_{in} \times d_{model}}$, goes to the FFN, which executes the following linear transformations over the various positions:

$O^{Encoder}_{FFN} = \sigma(O^{Encoder}_{add\&norm1} W_1 + B_1) W_2 + B_2$   (6)

The output of the FFN, $O^{Encoder}_{FFN} \in \mathbb{R}^{L_{in} \times d_{model}}$, is processed through another Add & Normalization (Norm) step:

$O^{Encoder}_{add\&norm2} = Norm(O^{Encoder}_{FFN} + O^{Encoder}_{add\&norm1})$   (7)

The decoder module of the proposed model receives the output of the encoder ($O^{Encoder} = O^{Encoder}_{add\&norm2} \in \mathbb{R}^{L_{in} \times d_{model}}$).
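The sine/cosine positional encoding added in Eq. 4 can be generated as follows (a sketch of the standard formulation from the vanilla Transformer [24]; the function name is illustrative):

```python
import numpy as np

def positional_encoding(L, d_model):
    """PE[t, 2i]   = sin(t / 10000^(2i / d_model))
       PE[t, 2i+1] = cos(t / 10000^(2i / d_model))"""
    pos = np.arange(L)[:, None]                    # (L, 1) time steps
    i = np.arange(0, d_model, 2)[None, :]          # even feature indices
    angles = pos / np.power(10000.0, i / d_model)  # (L, d_model / 2)
    pe = np.zeros((L, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Eq. 4: the spatial-attention output is summed element-wise with PE,
# O_pos = O_S,FF2 + PE, so each time step carries its position.
```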
D. Decoder layer
Two distinct inter-dependencies are integrated into the decoder layer: the first, known as self-attention, is between the decoder input and the decoder output, whereas the second, known as encoder-decoder attention, is between the encoder output and the decoder output. The target vehicle's predicted future trajectory is right-shifted ($\{\tilde{Y}^T_{t_{obs}+1}, \tilde{Y}^T_{t_{obs}+2}, \ldots, \tilde{Y}^T_{t_{obs}+k}\}$ at time $t = t_{obs} + k + 1$), combined with the encoder output ($O^{Encoder}$), and used as the decoder's input to predict the future trajectory ($O^{Decoder}_{t_{obs}+1}, O^{Decoder}_{t_{obs}+2}, \ldots, O^{Decoder}_{t_{obs}+k}, O^{Decoder}_{t_{obs}+k+1}$), as shown in Fig. 3. The performance of the decoder is enhanced by this autoregressive (feedback) technique, which uses the previously predicted trajectory as input. An 'sos' input (a zero vector of size 2) is supplied to the decoder to start processing, since there is no predicted trajectory at time $t_{obs}$ (i.e., $k = 0$).

Similar to the encoder, the decoder's input (the right-shifted decoder output) features are converted to a high-dimensional space using a fully connected layer before being sent through a positional encoding layer in accordance with Eq. 4. The first MHA layer, for self-attention, receives the output of the positional encoding layer as its input, and its output ($O^{Decoder}_{MHA1} \in \mathbb{R}^{L_{out} \times d_{model}}$) is calculated using Eqs. 2-3. The matrices $Q$, $K$, and $V$ are constructed using only the decoder input, hence the first MHA layer can draw out self-attention from this decoder input. In the second MHA layer, matrix $Q$ comes from the first MHA layer (a hidden correlation in the decoder input sequence) and the $K$ and $V$ pair comes from the encoder output ($O^{Encoder}$); thus, attention is computed between these two input signals (encoder output and decoder input). The output of the second MHA layer ($O^{Decoder}_{MHA2} \in \mathbb{R}^{L_{out} \times d_{model}}$) is fed to the Add & Normalization and FFN of the decoder, and the output of the decoder ($O^{Decoder} \in \mathbb{R}^{L_{out} \times 2}$) is calculated as per Eqs. 5-7.

The proposed model is trained with a batch size of 64 and a learning rate that is iteratively decreased from 0.00001 to 0.000001. A modified teacher-forcing scheme is used: the model is trained with full teacher forcing for the first 10 epochs, after which the teacher-forcing factor is incrementally decreased until it equals zero. The model parameters are trained using the Root-Mean-Square Error (RMSE) as the loss function. These studies use a desktop computer with an Intel(R) Xeon(R) Processor E5-2643 V4.
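The modified teacher-forcing schedule can be sketched as below; the linear decay and its duration after epoch 10 are assumptions, as the text only specifies full teacher forcing for the first 10 epochs followed by an incremental decrease to zero:

```python
def teacher_forcing_factor(epoch, warmup=10, decay_epochs=20):
    """Full teacher forcing for the first `warmup` epochs, then a linear
    decrease of the factor until it reaches zero (decay length assumed)."""
    if epoch < warmup:
        return 1.0
    return max(0.0, 1.0 - (epoch - warmup) / decay_epochs)
```

At each decoding step during training, the ground-truth position is fed back with this probability; otherwise the model's own previous prediction is used, easing the transition to fully autoregressive inference.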
IV. RESULTS AND DISCUSSION
The performance of the proposed model is evaluated and compared with baseline models, revealing its superior performance in trajectory prediction over state-of-the-art models.
A. Dataset and its preprocessing
The evaluation of the proposed model is conducted using
the well-known NGSIM dataset, which comprises data from
the southbound US 101 highway in Los Angeles and the
Intersection 80 highway section in Emeryville, California. This
particular highway segment consists of six lanes, including five
motorway lanes and one auxiliary lane, spanning from the on-
ramp to the off-ramp [29]. Only lane keep and discretionary
lane changing (DLC) trajectories are taken into consideration
for vehicles travelling in lanes 2, 3, 4, and 5 throughout this
process. To decrease the complexity and computation time of
the proposed model, all trajectories are down-sampled from a
rate of 10 Hz to 5 Hz. Each trajectory is broken into segments
that last 8s. The vehicle’s historical track is determined by a 3s
long trajectory segment, and the subsequent 5s long trajectory
segments are predicted by the proposed model. The proposed
model is trained using 80% of the trajectory segments, and
the remaining 20% is dedicated to testing its performance.
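The down-sampling and segmentation steps above can be sketched as follows (a minimal numpy sketch; the function name and the non-overlapping windowing are assumptions):

```python
import numpy as np

def make_segments(track_10hz, history_s=3, horizon_s=5, out_hz=5):
    """Down-sample a 10 Hz NGSIM track to 5 Hz and cut it into 8 s windows:
    a 3 s history (model input) followed by a 5 s future (prediction target).
    track_10hz: (num_frames, 2) array of (x, y) positions at 10 Hz."""
    track = track_10hz[::10 // out_hz]   # 10 Hz -> 5 Hz: keep every 2nd frame
    L_in = history_s * out_hz            # 15 input steps
    L_out = horizon_s * out_hz           # 25 output steps
    window = L_in + L_out                # 40 steps = 8 s
    histories, futures = [], []
    for start in range(0, len(track) - window + 1, window):
        seg = track[start:start + window]
        histories.append(seg[:L_in])
        futures.append(seg[L_in:])
    return np.array(histories), np.array(futures)
```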
B. Evaluation metric
The effectiveness of the proposed model is evaluated using the metrics RMSE, Mean-Absolute Error (MAE), and Mean-Square Error (MSE), represented mathematically by Eqs. 8 to 10.

$RMSE = \sqrt{\frac{1}{t_{pred}} \sum_{t=t_{obs}+1}^{t_{obs}+t_{pred}} \left(\tilde{Y}^T_t - Y^T_t\right)^2}$   (8)

$MAE = \frac{1}{t_{pred}} \sum_{t=t_{obs}+1}^{t_{obs}+t_{pred}} \left|\tilde{Y}^T_t - Y^T_t\right|$   (9)

$MSE = \frac{1}{t_{pred}} \sum_{t=t_{obs}+1}^{t_{obs}+t_{pred}} \left(\tilde{Y}^T_t - Y^T_t\right)^2$   (10)

where $\tilde{Y}^T_t$ and $Y^T_t$ represent the predicted position and actual location of the target vehicle (T) at the prediction timestamp $t$, respectively (here 5 timestamps for 5s).
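Reading the per-timestamp error in Eqs. 8-10 as the Euclidean distance between the predicted and actual positions (an assumption of this sketch), the metrics can be computed as:

```python
import numpy as np

def trajectory_errors(pred, true):
    """RMSE, MAE and MSE over the prediction horizon (Eqs. 8-10).
    pred, true: (t_pred, 2) arrays holding (x, y) per timestamp."""
    d = np.linalg.norm(pred - true, axis=-1)  # displacement per timestamp
    mse = np.mean(d ** 2)
    return {"RMSE": np.sqrt(mse), "MAE": np.mean(d), "MSE": mse}
```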
C. Proposed model performance
Table I demonstrates the effectiveness of the proposed model for prediction horizons up to 5s, along with the worst 5% and 1% of prediction errors. The predictive capability of the proposed model is evident from its RMSE values of 0.46m, 0.80m, 1.19m, 1.67m, and 2.24m for trajectory predictions at 1s, 2s, 3s, 4s, and 5s, respectively. Similarly, the model demonstrates MSE values of 0.25m, 0.72m, 1.58m, 3.10m, and 5.61m, and MAE values of 0.38m, 0.63m, 0.91m, 1.23m, and 1.61m, for trajectory lengths of 1s to 5s, further highlighting its predictive accuracy. The proposed model's robustness is demonstrated by the fact that prediction errors even for the worst-case scenarios remain noticeably low. The longitudinal and lateral prediction RMSE of the proposed model is also shown in Fig. 5.

TABLE I: RMSE, MSE and MAE in proposed model prediction.

Evaluation metric               1s     2s     3s     4s     5s
RMSE (m)           All          0.46   0.80   1.19   1.67   2.24
                   Worst 5%     0.80   1.78   3.07   4.62   6.32
                   Worst 1%     1.19   2.68   4.55   6.68   8.99
MSE (m)            All          0.25   0.72   1.58   3.10   5.61
                   Worst 5%     1.38   4.59   11.34  23.65  43.15
                   Worst 1%     3.22   10.28  24.29  48.55  85.61
MAE (m)            All          0.38   0.63   0.91   1.23   1.61
                   Worst 5%     0.88   1.70   2.76   4.02   5.40
                   Worst 1%     1.31   2.55   4.10   5.85   7.76
Lateral (m)        All          0.11   0.16   0.20   0.23   0.26
                   Worst 5%     0.11   0.16   0.21   0.25   0.29
                   Worst 1%     0.15   0.23   0.30   0.35   0.40
Longitudinal (m)   All          0.45   0.78   1.18   1.65   2.23
                   Worst 5%     0.78   1.77   3.06   4.61   6.32
                   Worst 1%     1.17   2.65   4.53   6.68   8.99

Fig. 5: Lateral, longitudinal and total error (RMSE) of proposed model.
D. Comparative Analysis
In this section, the performance of the proposed trajectory
prediction model is evaluated and compared against state-of-
the-art models. To demonstrate the model’s effectiveness, the
same experimental settings and evaluation metrics are adopted as in [16, 21]. The following models are trained and evaluated on the NGSIM dataset under the same experimental conditions for comparison with the proposed model.
•Vanilla LSTM (V-LSTM) [10]: The LSTM network takes
the raw input trajectories of the target and surrounding
vehicles as inputs. By employing an LSTM model, future
trajectory predictions are made as point estimates.
•Bi-LSTM [13]: The bi-directional LSTM network re-
ceives the raw input trajectories as its input.
•TCN [20]: The TCN network receives the raw input
trajectories as its input.
TABLE II: Analyzing the comparative performance of the proposed model against state-of-the-art models. The best result is indicated in bold face.

RMSE (m):
Model                  1s     2s     3s     4s     5s
V-LSTM [10]            0.90   1.29   1.71   2.20   2.74
Bi-LSTM [13]           0.77   1.09   1.51   2.03   2.63
TCN [20]               0.71   1.09   1.52   2.01   2.55
LSTM-ED [18]           0.53   0.91   1.36   1.90   2.56
TCN-LSTM               0.56   0.91   1.34   1.86   2.49
STA-LSTM [16]          0.49   0.86   1.31   1.85   2.51
MHA-LSTM [22]          0.48   0.85   1.28   1.81   2.47
V-TF                   0.47   0.82   1.24   1.73   2.35
Proposed (STA-TF)      0.46   0.80   1.19   1.67   2.24

MSE (m):
Model                  1s     2s     3s     4s     5s
V-LSTM [10]            0.93   1.81   3.15   5.15   8.00
Bi-LSTM [13]           0.66   1.29   2.45   4.41   7.38
TCN [20]               0.58   1.33   2.59   4.50   7.26
LSTM-ED [18]           0.33   0.92   2.03   3.99   7.26
TCN-LSTM               0.36   0.95   2.03   3.88   6.94
STA-LSTM [16]          0.30   0.84   1.90   3.80   6.94
MHA-LSTM [22]          0.27   0.79   1.78   3.54   6.59
V-TF                   0.25   0.76   1.70   3.37   6.21
Proposed (STA-TF)      0.25   0.72   1.58   3.10   5.61

MAE (m):
Model                  1s     2s     3s     4s     5s
V-LSTM [10]            0.72   1.00   1.31   1.64   2.01
Bi-LSTM [13]           0.64   0.88   1.17   1.51   1.90
TCN [20]               0.61   0.89   1.19   1.53   1.90
LSTM-ED [18]           0.47   0.74   1.05   1.42   1.84
TCN-LSTM               0.49   0.75   1.05   1.39   1.80
STA-LSTM [16]          0.46   0.70   1.02   1.38   1.78
MHA-LSTM [22]          0.38   0.64   0.94   1.30   1.72
V-TF                   0.39   0.64   0.94   1.28   1.68
Proposed (STA-TF)      0.38   0.63   0.91   1.23   1.61
•LSTM-ED [18]: An LSTM-based encoder-decoder model in which the raw input trajectories are fed to the LSTM-based encoder and the decoder predicts the future trajectory.
•TCN-LSTM: An encoder-decoder network with a TCN-based encoder and an LSTM-based decoder.
•STA-LSTM [16]: A spatio-temporal attention-based LSTM encoder-decoder network used for vehicle trajectory prediction.
•MHA-LSTM [22]: An encoder-decoder network that employs LSTM with spatial attention to extract the spatial attention between vehicles.
•V-TF: The spatial attention network is excluded from
the proposed model, and therefore, only the raw input
trajectories of the target and surrounding vehicles are
directly fed into the vanilla Transformer network (V-TF).
The quantitative experimental results are summarised in Table
II and are shown in Fig. 6. The results show that the proposed
model has utilised the spatial attention network to establish
the hidden relationship between traffic participants, which has
resulted in a small trajectory prediction error. The proposed
method successfully predicts future trajectory with a 2.24m
RMSE for a 5s long prediction horizon, which is 10% less than
the state-of-the-art models [16, 22]. In congested traffic, where
several vehicles occupy the same drivable area, trajectory
prediction must account for surrounding vehicle correlation.
Irrespective of the prediction horizon, the proposed model
demonstrates superior performance compared to state-of-the-
art models. Short-term predictions primarily rely on recent
vehicle dynamics, whereas long-term predictions are more influenced by correlation information. Gradient vanishing, which impacts longer forecasts, does not affect the Transformer's memory mechanism. Hence, the proposed model (STA-TF) predicts future trajectories more efficiently than current state-of-the-art models due to its correlation modelling (spatial attention network) and Transformer-based architecture.

Fig. 6: Comparing the proposed model with other models based on RMSE.

TABLE III: Computing time comparisons among models.

Model                 Computation time (ms)
Proposed (STA-TF)     3.9
V-TF                  3.1
STA-LSTM [16]         6.8
MHA-LSTM [22]         10.0
During deployment, the model's computation time complexity is examined. The computation time of the proposed model is compared with that of similar LSTM-based social contextual attention-based state-of-the-art models [16, 22]. Table III
shows the models’ computation times. The proposed model
predicts a 5s trajectory 43% and 61% faster than LSTM
[16] and social contextual attention-based models [22], respec-
tively. The TF-based model offers the advantage of processing
the entire input sequence simultaneously, resulting in faster
computation compared to RNN-based models. However, it is
important to note that the TF-based model requires a longer
training time. The performance evaluation of the proposed
model encompasses both lane keeping (LK) and lane change
trajectories. Furthermore, the analysis of lane change trajecto-
ries is further divided into two types: lane change to the left
(LCL) and lane change to the right (LCR). The subsequent
section provides a comprehensive analysis of the proposed
model’s performance in relation to these lateral behaviors
(LCL, LCR, and LK).
E. Proposed model’s performance on lane keep and lane
change trajectories
In this section, the investigation focuses on trajectory pre-
diction errors related to lateral behaviours. Table IV presents
the prediction errors for each behaviour. Across all behaviours, the lateral directional error consistently appears smaller than the longitudinal directional error, with a range of 0.22m to 0.71m for 5-second long predictions. Notably, both LCL and LCR trajectories exhibit high lateral directional errors, measuring 0.60m and 0.71m, respectively. Similarly, the longitudinal directional errors are also high for LCL and LCR trajectories, measuring 2.75m and 3.00m, respectively.

It should be noted that the highest trajectory error (RMSE) is observed in lane change (LCL and LCR) related trajectories, indicating the significant impact of longitudinal error on the overall error, as depicted in Table IV. A similar observation can be drawn from the MSE and MAE for the lateral behaviour-based trajectories (LCL, LCR, and LK). Fig. 7 displays the lateral error, longitudinal error, RMSE, MSE, and MAE of the proposed model for these lateral behaviours.

Fig. 7: (a) Lateral, (b) longitudinal and total error ((c) RMSE, (d) MSE and (e) MAE) of proposed model.

TABLE IV: The RMSE, MSE, and MAE for lateral and longitudinal behaviours.

Metric                  Behaviour   1s     2s     3s     4s     5s
Lateral Error (m)       LCL         0.33   0.46   0.52   0.56   0.60
                        LCR         0.35   0.50   0.59   0.66   0.71
                        LK          0.08   0.12   0.15   0.19   0.22
                        Overall     0.11   0.16   0.20   0.23   0.26
Longitudinal Error (m)  LCL         0.65   1.07   1.53   2.08   2.75
                        LCR         0.72   1.16   1.67   2.28   3.00
                        LK          0.43   0.76   1.15   1.62   2.19
                        Overall     0.45   0.78   1.18   1.65   2.23
RMSE (m)                LCL         0.73   1.17   1.62   2.16   2.81
                        LCR         0.80   1.27   1.77   2.37   3.08
                        LK          0.44   0.77   1.15   1.63   2.20
                        Overall     0.46   0.80   1.19   1.67   2.24
MSE (m)                 LCL         0.61   1.51   2.84   4.99   8.41
                        LCR         0.69   1.69   3.32   6.00   10.16
                        LK          0.22   0.66   1.47   2.93   5.34
                        Overall     0.25   0.72   1.58   3.10   5.61
MAE (m)                 LCL         0.71   1.09   1.42   1.79   2.22
                        LCR         0.77   1.17   1.56   1.99   2.47
                        LK          0.36   0.59   0.86   1.18   1.55
                        Overall     0.38   0.63   0.91   1.23   1.61
V. CONCLUSION
This research proposes a novel vehicle trajectory predic-
tion model using raw trajectory data of the target and its
surrounding vehicles. The proposed model has three sub-
modules: Spatial attention network, encoder, and decoder. The
spatial attention network establishes the hidden relationship between the target and surrounding vehicles, and the tracking modules (encoder and decoder) use its output for trajectory prediction. Finally, thorough quantitative and qualitative experiments on the publicly available NGSIM dataset
show that the proposed model outperforms state-of-the-art
methods in long-range trajectory prediction and is comparable
in short-term prediction. Since trajectory analysis is important
to pedestrian trajectory prediction, it would be interesting to
modify the proposed model for pedestrian trajectory prediction
in future work. Further, implementing the proposed methods
for highway risk/collision estimation is also a promising
direction.
ACKNOWLEDGMENT
The research grant for the project “Driver Behavior Modelling for Autonomous Driving” has been provided by KPIT
Technologies Pvt. Ltd., Bangalore, India, offering valuable
support to this work.
REFERENCES
[1] G. S. Aoude, V. R. Desaraju, L. H. Stephens, and J. P. How, “Driver
behavior classification at intersections and validation on large naturalistic
data set,” IEEE Transactions on Intelligent Transportation Systems,
vol. 13, no. 2, pp. 724–736, 2012.
[2] O. Sharma, N. C. Sahoo, and N. B. Puhan, "Recent advances in motion and behavior planning techniques for software architecture of autonomous vehicles: A state-of-the-art survey," Engineering Applications of Artificial Intelligence, vol. 101, p. 104211, 2021.
[3] M. Brännström, E. Coelingh, and J. Sjöberg, "Model-based threat assessment for avoiding arbitrary vehicle collisions," IEEE Transactions on Intelligent Transportation Systems, vol. 11, no. 3, pp. 658–669, 2010.
[4] O. Sharma, N. C. Sahoo, and N. B. Puhan, “A survey on smooth path
generation techniques for nonholonomic autonomous vehicle systems,”
in IECON 2019 - 45th Annual Conference of the IEEE Industrial
Electronics Society. IEEE, 2019, pp. 5167–5172.
[5] A. Houenou, P. Bonnifait, V. Cherfaoui, and W. Yao, “Vehicle trajectory
prediction based on motion model and maneuver recognition,” in 2013
IEEE/RSJ international conference on intelligent robots and systems.
IEEE, 2013, pp. 4363–4369.
[6] S. Qiao, D. Shen, X. Wang, N. Han, and W. Zhu, "A self-adaptive parameter selection trajectory prediction approach via hidden Markov models," IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 1, pp. 284–296, 2014.
[7] O. Sharma, N. C. Sahoo, and N. B. Puhan, "Highway discretionary lane changing behavior recognition using continuous and discrete hidden Markov model," in 2021 IEEE International Intelligent Transportation Systems Conference (ITSC). IEEE, 2021, pp. 1476–1481.
[8] M. Treiber, A. Hennecke, and D. Helbing, "Congested traffic states in empirical observations and microscopic simulations," Physical Review E, vol. 62, no. 2, p. 1805, 2000.
[9] N. Deo, A. Rangesh, and M. M. Trivedi, “How would surround vehicles
move? a unified framework for maneuver classification and motion
prediction,” IEEE Transactions on Intelligent Vehicles, vol. 3, no. 2,
pp. 129–140, 2018.
[10] F. Altché and A. de La Fortelle, "An LSTM network for highway trajectory prediction," in 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2017, pp. 353–359.
[11] A. Kuefler, J. Morton, T. Wheeler, and M. Kochenderfer, “Imitating
driver behavior with generative adversarial networks,” in 2017 IEEE
Intelligent Vehicles Symposium (IV). IEEE, 2017, pp. 204–211.
[12] G. Xie, A. Shangguan, R. Fei, W. Ji, W. Ma, and X. Hei, "Motion trajectory prediction based on a CNN-LSTM sequential model," Science China Information Sciences, vol. 63, no. 11, pp. 1–21, 2020.
[13] M. Abdalla, A. Hendawi, H. M. Mokhtar, N. Elgamal, J. Krumm, and M. Ali, "DeepMotions: A deep learning system for path prediction using similar motions," IEEE Access, vol. 8, pp. 23881–23894, 2020.
[14] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, "Social GAN: Socially acceptable trajectories with generative adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2255–2264.
[15] L. Lin, W. Li, H. Bi, and L. Qin, "Vehicle trajectory prediction using LSTMs with spatial–temporal attention mechanisms," IEEE Intelligent Transportation Systems Magazine, vol. 14, no. 2, pp. 197–208, 2021.
[16] M. Fu, T. Zhang, W. Song, Y. Yang, and M. Wang, “Trajectory
prediction-based local spatio-temporal navigation map for autonomous
driving in dynamic highway environments,” IEEE Transactions on
Intelligent Transportation Systems, 2021.
[17] H. Kim, D. Kim, G. Kim, J. Cho, and K. Huh, “Multi-head attention
based probabilistic vehicle trajectory prediction,” in 2020 IEEE Intelli-
gent Vehicles Symposium (IV). IEEE, 2020, pp. 1720–1725.
[18] M. Khakzar, A. Rakotonirainy, A. Bond, and S. G. Dehkordi, "A dual learning model for vehicle trajectory prediction," IEEE Access, vol. 8, pp. 21897–21908, 2020.
[19] N. Deo and M. M. Trivedi, "Multi-modal trajectory prediction of surrounding vehicles with maneuver based LSTMs," in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 1179–1184.
[20] K. Shi, Y. Wu, H. Shi, Y. Zhou, and B. Ran, “An integrated car-
following and lane changing vehicle trajectory prediction algorithm
based on a deep neural network,” Physica A: Statistical Mechanics and
its Applications, vol. 599, p. 127303, 2022.
[21] N. Deo and M. M. Trivedi, “Convolutional social pooling for vehicle
trajectory prediction,” in Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition Workshops, 2018, pp. 1468–1476.
[22] K. Messaoud, I. Yahiaoui, A. Verroust, and F. Nashashibi, “Attention
based vehicle trajectory prediction,” IEEE Transactions on Intelligent
Vehicles, vol. 6, no. 1, pp. 175–185, 2020.
[23] K. Messaoud, I. Yahiaoui, A. Verroust-Blondet, and F. Nashashibi,
“Non-local social pooling for vehicle trajectory prediction,” in 2019
IEEE Intelligent Vehicles Symposium (IV). IEEE, 2019, pp. 975–980.
[24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances
in neural information processing systems, 2017, pp. 5998–6008.
[25] F. Giuliari, I. Hasan, M. Cristani, and F. Galasso, "Transformer networks for trajectory forecasting," in 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021, pp. 10335–10342.
[26] Y. Liu, J. Zhang, L. Fang, Q. Jiang, and B. Zhou, “Multimodal motion
prediction with stacked transformers,” in Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, 2021, pp.
7577–7586.
[27] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[28] O. Sharma, N. Sahoo, and N. B. Puhan, "Kernelized convolutional transformer network based driver behavior estimation for conflict resolution at unsignalized roundabout," ISA Transactions, 2022.
[29] V. Alexiadis, J. Colyar, and J. Halkias, "Next generation simulation fact sheet," Washington, DC, USA, 2007.