Lane Change Strategies for Autonomous Vehicles:
A Deep Reinforcement Learning Approach
Based on Transformer
Guofa Li, Member, IEEE, Yifan Qiu, Yifan Yang, Zhenning Li, Shen Li, Member, IEEE, Wenbo Chu,
Paul Green, and Shengbo Eben Li, Senior Member, IEEE
Abstract—End-to-end approaches are one of the most promising solutions for autonomous vehicles (AVs) decision-making. However, the deployment of these technologies is usually constrained by the high computational burden. To alleviate this problem, we proposed a lightweight transformer-based end-to-end model with risk awareness ability for AV decision-making. Specifically, a lightweight network with depth-wise separable convolution and transformer modules was first proposed for image semantic extraction from time sequences of trajectory data. Then, we assessed driving risk by a probabilistic model with position uncertainty. This model was integrated into deep reinforcement learning (DRL) to find strategies with minimum expected risk. Finally, the proposed method was evaluated in three lane change scenarios to validate its superiority.
Index Terms—Autonomous vehicles, decision-making,
reinforcement learning, lane change, transformer.
Manuscript received 27 November 2022; accepted 5 December 2022. Date of publication 9 December 2022; date of current version 27 April 2023. This work was supported in part by the National Natural Science Foundation of China under Grant 52272421, and in part by the Shenzhen Fundamental Research Fund under Grant JCYJ20190808142613246. (Corresponding author: Shen Li.)

Guofa Li is with the College of Mechanical and Vehicle Engineering, Chongqing University, Chongqing 400044, China (e-mail: hanshan198@gmail.com).

Yifan Qiu and Yifan Yang are with the College of Mechatronics and Control Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China (e-mail: rye1222@qq.com; lvan0619@qq.com).

Zhenning Li is with the State Key Laboratory of Internet of Things for Smart City and the Department of Computer and Information Science, University of Macau, Macau 999078, China (e-mail: zhenningli@um.edu.mo).

Shen Li is with the School of Civil Engineering, Tsinghua University, Beijing 100084, China (e-mail: sli299@tsinghua.edu.cn).

Wenbo Chu is with the Western China Science City Innovation Center of Intelligent and Connected Vehicles (Chongqing) Co., Ltd., Chongqing 401329, China, and also with the College of Mechanical and Vehicle Engineering, Chongqing University, Chongqing 400044, China (e-mail: chuwenbo@wicv.cn).

Paul Green is with the University of Michigan Transportation Research Institute (UMTRI) and the Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, MI 48109 USA (e-mail: pagreen@umich.edu).

Shengbo Eben Li is with the State Key Lab of Automotive Safety and Energy, School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China (e-mail: lishbo@tsinghua.edu.cn).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TIV.2022.3227921.

Digital Object Identifier 10.1109/TIV.2022.3227921

I. INTRODUCTION

AS REPORTED by the National Highway Traffic Safety Administration (NHTSA) [1], 50,000 fatal traffic accidents are attributed to driving mistakes each year in the United States [2]. Statistics in China also show that over 90% of traffic accidents are related to driving mistakes [3]. Therefore, to help
drivers make reliable decisions and reduce the frequency of
human-caused accidents, numerous safety applications for lev-
els 1 and 2 autonomous vehicles have been developed in recent
years, such as advanced driver assistance systems (ADAS),
fatigue recognition systems, etc. Furthermore, academic researchers have begun to focus on designing active safety systems for higher-level autonomous vehicles, with heavy attention on
collision avoidance systems [4], [5], [6]. In the following para-
graphs, we summarize the influential approaches in the devel-
opment of decision-making systems for collision avoidance,
which can be categorized into motion planning-based methods,
risk estimation-based methods, and data-driven-based methods.
Specifically, supervised learning and reinforcement learning are
two principal categories for data-driven-based methods.
A. Motion Planning-Based Methods
A* and artificial potential field (APF) are two representative methods in the conventional motion planning-based category for collision avoidance decision-making. For instance, Dolgov et al. [7] searched a 3D kinematic state space via a variant of the A* method. Then, a numeric non-linear optimization method was further utilized to enhance the performance of the variant A* approach. Huang et al. [8] proposed an APF method with different potential functions for road boundaries after meshing the drivable area. Subsequently, a local current comparison method was employed to generate a crash-free path. Nevertheless, these methods have two intrinsic drawbacks: 1) how they generate graphs (considering physical constraints) greatly affects their performance, and 2) paths that are infeasible for vehicle kinematics are sometimes generated.
To improve on this, solutions that consider vehicle kinematics have been developed. Li et al. [9] introduced an optimization approach based on adaptive-scaling constraints for time-optimal trajectory planning in multi-agent travelling that considers vehicle dynamics. Shen et al. [10] utilized a
predictive occupancy map (POM) to assess the potential risk
levels of surrounding vehicles based on vehicle kinematics. The
optimal path was then obtained by a random tree algorithm with
POM. Simon et al. [11] assumed that inevitable collisions were
inherently time-critical and thus introduced a novel method to
mitigate collisions based on vehicle kinematics. The proposed
method could capture the trajectory with the minimum execution
time by simulating with finite element modeling (FEM). However, given that the constraints for motion planning are usually nonlinear or nonconvex, the planning task may be an NP-hard problem that is difficult to address.
B. Risk Estimation-Based Methods
Risk estimation-based methods first estimate the risk of the current driving state and then formulate a subsequent action policy in accordance with the risk estimation results. The modular or hierarchical design of these methods makes them more accessible building blocks for breakthroughs in autonomous vehicles [12]. Currently, deterministic approaches and probabilistic approaches are the two principal mainstreams of risk estimation-based methods.
The deterministic approaches mainly estimate the occurrence
of a collision to infer the strategies for vehicle control. TTC
(time to collision) and THW (time headway) are the classical
evaluation metrics on driving safety [13], [14]. In single-lane
scenarios, these risk estimation methods are comparatively accurate for longitudinal driving with little computational burden [15]. However, because they barely consider the uncertainty of the input data, the derived policies are impractical for real-world applications and can even perform unsatisfactorily in multi-lane scenarios [16].
To avoid the uncertainty problem, probabilistic descriptions
are introduced for risk probability assessment in probabilistic
approaches [17]. After fusing traditional metrics (e.g., TTC) into
risk estimation by using a Bayesian model, Noh [18] developed a
rule-based expert strategy for subject vehicle control at intersections.
Shin et al. [19] observed the uncertainties in the motion of remote
vehicles via vehicle-to-vehicle (V2V) communication, which
was a reference for calculating the number of crashes within
uncertainty boundaries. The intrinsic drawback of probabilistic approaches is that they only formulate rule-based strategies from expert knowledge, which suffers from the complexity of realistic traffic environments and disregards human drivers' learning ability. However, complex traffic environment details can-
not always be effectively defined by countable rules, and it is
also impossible to determine all the rules in all situations [9].
C. Data-Driven-Based Methods
After the learning capability of neural networks was discovered, data-driven-based methods, including supervised learning and reinforcement learning, have become the mainstream for decision-making [20], [21]. Studies using supervised learning are booming for the development of autonomous vehicles.
Pan et al. [22] introduced a hybrid control policy network
guided by the human expert and model predictive control (MPC)
expert. It only requires images taken with a monocular camera
and rolling speed to output the steering and throttle command
directly. Xu et al. [23] introduced an FCN-LSTM architecture to
generate actions based on prior states of agents and videos taken
with a monocular camera. Moreover, this architecture leveraged
imitation learning (IL) to improve performance since semantic
segmentation as a side task enforces the FCN-LSTM architec-
ture to learn interpretable feature representation. Unfortunately,
since data collection in dangerous scenarios (e.g., inevitable
collisions) is challenging, there is a gap between reality and
training. These supervised methods always have unsatisfactory
performance in realistic scenarios due to the lack of data from
dangerous scenarios [24], [25].
To avoid the high-cost problem for data collection in dan-
gerous scenarios with supervised learning methods, researchers
utilized deep reinforcement learning (DRL) methods with af-
fordable trial-and-error to find the driving policy close to reality
for decision-making in autonomous vehicles based on driving
simulators [26], [27]. Different from rule-based methods, DRL-
based methods learn how to drive from trial-and-error, mak-
ing them suitable for various situations [28], [29]. The learning criteria defined in DRL-based methods are just some simple constraints or encouraging orientations, which are far fewer and simpler than the rules required in rule-based methods [30]. Mirchevska et al. [31] developed a reinforcement learning approach based on
deep Q-learning network (DQN) for autonomous vehicles to
take safe actions for lane change in highway driving. To cope
with the challenge in multi-agent collision avoidance, Fan et al.
[33] proposed an innovative multi-stage RL-based architecture
for safe and effective navigation in dense traffic with pedestrians.
Chen et al. [34] designed a network with a hierarchical structure
that simultaneously maintained a high-level policy and low-level operation instructions. However, the heavy time consumption of model training is still a problem for DRL-based methods [35]. Therefore, the development of lightweight DRL models has been attracting the attention of researchers in recent years.
D. Lightweight Model Design
Feature extraction from data with a fair amount of redun-
dant information, such as images, is computationally expen-
sive. Since reinforcement learning based models are mainly
for online training, models with a huge number of parameters
cannot satisfy real-time requirements. Therefore, technologies
for lightweight model design have been developed for practical
applications. Howard et al. [36] proposed a novel lightweight
method with a pre-defined architecture to reduce the number
of convolution calculations to only 1/9 of the number when
using the vanilla convolution. Hua et al. [37] introduced a
dynamic pruning method, named channel gating, to optimize
CNN inference by utilizing input-specific features. By identify-
ing the regions with insignificant contributions, channel gating
will dynamically skip weight propagation for these ineffective
regions to ease the calculation burden. However, in real implementations these dynamic methods can hardly achieve their theoretical acceleration due to extra computation overhead (e.g., indexing, zero-masking, weight-copying, etc.). To achieve hardware-efficient acceleration, [38] dynamically sliced the network parameters to realize static and contiguous storage in hardware. As for the computation itself, Jacob et al. [39]
proposed a quantization scheme to gain integer-only models,
which avoids the huge calculation cost in floating point infer-
ence. Apart from these approaches, Henning et al. [40] proposed
a multi-layer attention map (MLAM) to only process the relevant
data, which mitigates the high redundancy in feature extraction
for environment perception. Although advances have been made in recent years, far more effort is still needed for
lightweight DRL model design, especially in the area of risk-
aware decision-making for autonomous vehicles in intelligent
transportation systems.
E. Contributions
It has been widely accepted that lane change is one of the most
commonly adopted maneuvers in naturalistic driving [41], [42],
[43], [44]. In order to develop human-like autonomous driving
technologies to avoid conflicts or crashes caused by inconsistencies between human drivers and artificial drivers, automatic
lane change systems should be well designed for autonomous
driving [45], [46]. However, current end-to-end automatic lane
changing models usually suffer from high computational cost
or risk insensitivity and may not be useful for high-speed lane
change scenarios. To overcome these drawbacks, we propose
an innovative method that allows agents to learn strategies with
the minimal expected risk at a low computational burden for
highway lane change. Firstly, we proposed a lightweight image
semantic extraction network based on depth-wise separable
convolutions and used transformers to merge the image seman-
tic contexts in time series. Next, we proposed a quantification approach incorporating positional uncertainty based on Bayesian theory for risk assessment, which was then introduced into DRL
to find the policy with minimal expected risk. Lastly, some
virtual scenarios were built in a driving simulator CARLA (Car
Learning to Act) [47] to evaluate the performance of our method.
The key contributions of our work are summarized as follows:
1) An innovative end-to-end model based on depth-wise separable convolution (with a low computational burden) and a transformer network is newly proposed for lane change
decision-making in autonomous driving. To the best of our
knowledge, the comprehensive use of depth-wise separa-
ble convolution together with transformer in DRL-based
architectures for lane change decision inference has never
been reported in the previous literature.
2) The driving policy with minimal expected risk is cre-
atively integrated into DRL-based architectures for safe
lane change, giving the autonomous vehicle risk awareness ability.
3) Three lane change scenarios with different difficulties (i.e.,
one with stationary vehicles, one with moving vehicles,
and one with accelerating, decelerating, and lane-changing vehicles) are designed to evaluate the performances of the
examined methods.
4) The lightweight characteristic and superior performance
of our proposed approach can facilitate the development
of autonomous driving technologies in various driving
scenarios for intelligent transportation.
F. Paper Structure
The rest of this paper is structured as follows. The problem statement and previous solutions are presented in Section II. The proposed methodology and the details of the deep reinforcement learning framework for decision-making are described in Section III and Section IV, respectively. The experiments in CARLA are detailed in Section V. Section VI presents the experimental results. Lastly, the conclusions of this
paper are drawn in Section VII.
II. PROBLEM STATEMENT AND PREVIOUS SOLUTIONS
Generally, in the DRL framework, the agent is capable of
driving in an uncertain environment by selecting a sequence of
actions over several continuous time steps. Subsequently, it is granted rewards as feedback from the interaction with the environment. Finally, a strategy with maximum cumulative reward will
be chosen. In this study, a lane change process can be briefly
described as a Markov decision process (MDP):
$$M = \langle S, A, P, R \rangle \tag{1}$$

where $S = \{s_0, s_1, \ldots, s_t\}$ indicates the set of states, $A = \{a_0, a_1, \ldots, a_t\}$ indicates the set of actions, $P: S \times A \times S \to [0,1]$ indicates the transition probability between states, and $R: S \times A \times S \to \mathbb{R}$ indicates the reward.
In particular, a sequence of actions adapted to the particular scenario will be chosen according to a stochastic strategy $\pi: S \to P(A)$, where $P(A)$ indicates the probability that an action in $A$ will be chosen following the strategy $\pi$. A trajectory is produced through this process, which can be indicated as $\tau = \langle s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_t, a_t, r_t \rangle$, and the optimal strategy $\pi^*$ with the maximum expected cumulative reward can be found:

$$\pi^* = \arg\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{+\infty} \gamma^t r_{t+1} \,\middle|\, s_0 = s\right] \tag{2}$$

where $\gamma \in [0,1]$ is a parameter that controls the weight of the next-time-step reward $r_{t+1}$, and $\pi^*$ indicates the optimal strategy.
However, (2) is hard to solve. To address this issue, a Q-value
function is used for strategy optimization.
$$Q_{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{+\infty} \gamma^t r_{t+1} \,\middle|\, s_0 = s, a_0 = a\right] \tag{3}$$

where $Q_{\pi}(s, a)$ indicates the expected cumulative reward from the initial condition (state $s$ and action $a$) by following strategy $\pi$. Subsequently, the strategy $\pi$ can be improved by choosing the action $a$ that maximizes the Q-value, i.e., $\pi(s) = \arg\max_{a} Q_{\pi}(s, a)$. Thus, as an equivalent solution to (2), an optimal strategy $\pi^*$ with maximum $Q_{\pi}(s, a)$ will be generated, i.e., $Q_{\pi^*}(s, a) = \max_{\pi} Q_{\pi}(s, a)$.
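As a concrete illustration of (2) and (3), the following minimal Python sketch estimates the discounted return of one recorded trajectory and extracts the greedy policy from a Q-table. The toy reward values, the three-action space, and the discount value are hypothetical and used only for illustration, not taken from the paper.

```python
from typing import Dict, List, Tuple

GAMMA = 0.9  # discount factor gamma in (2); value assumed for illustration


def discounted_return(rewards: List[float], gamma: float = GAMMA) -> float:
    """Compute sum_t gamma^t * r_{t+1} for one trajectory, as in (2)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def greedy_policy(q: Dict[Tuple[str, str], float], state: str, actions: List[str]) -> str:
    """pi(s) = argmax_a Q(s, a): choose the action with the largest Q-value."""
    return max(actions, key=lambda a: q[(state, a)])


if __name__ == "__main__":
    # Hypothetical per-step rewards collected along tau = <s0, a0, r0, ...>
    print(discounted_return([0.1, 0.1, -1.0]))
    q_table = {("s0", "left"): 0.2, ("s0", "straight"): 0.7, ("s0", "right"): 0.1}
    print(greedy_policy(q_table, "s0", ["left", "straight", "right"]))
```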
A. Deep Q-Network (DQN)
In the DQN architecture [26], Q-values are estimated by two networks (i.e., $Q_{target}$ and $Q_{online}$). A transition $\langle s_t, a_t, r_t, s_{t+1} \rangle$ is generated through $Q_{online}$ and then stored in the memory $M$. Randomly sampling from memory $M$ reduces the correlation of the data used to update the network. $Q_{target}$ outputs the target Q-value to calculate the temporal difference (TD) error.
Fig. 1. The framework of DQN.
Fig. 2. The framework of PRDQN.
The weights of $Q_{online}$ are then updated according to the obtained TD error. The separation of the two processes improves the stability of the network. The computational framework is shown in Fig. 1, and the loss function is described as:

$$L = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim M}\left[\left(y - Q(s_t, a_t; \theta_t)\right)^2\right], \quad y = r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta'_t) \tag{4}$$

where $(s_t, a_t, r_t, s_{t+1})$ indicates a transition sampled from memory $M$, and $\theta_t$ and $\theta'_t$ respectively indicate the weights of $Q_{online}$ and $Q_{target}$.
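A minimal PyTorch sketch of the loss in (4) is shown below. It covers only the generic DQN TD loss, not the risk-aware variant introduced later in Section III; the network shapes and batch contents are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def dqn_loss(q_online: nn.Module, q_target: nn.Module, batch, gamma: float = 0.99):
    """TD loss of (4): L = E[(y - Q(s_t, a_t; theta_t))^2] with
    y = r_t + gamma * max_a' Q(s_{t+1}, a'; theta'_t)."""
    states, actions, rewards, next_states = batch  # actions: LongTensor of indices
    q_sa = q_online(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # the target network only provides y; it is not trained here
        y = rewards + gamma * q_target(next_states).max(dim=1).values
    return F.mse_loss(q_sa, y)
```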
B. Prioritized Replay Deep Q-Learning Network (PRDQN)
The memory replay of DQN with a uniform sampling policy does not give preference to samples with higher temporal difference (TD) errors, because samples with minor TD errors are equally likely to be drawn. To address this problem, Schaul et al. [27] developed prioritized replay based on DQN (i.e., PRDQN), which prioritizes learning from samples with higher TD errors. The sampling probability of sample $i$ is described as:
$$P(i) = \frac{p_i^a}{\sum_k p_k^a} \tag{5}$$

where $p_i$ indicates the TD error of sample $i$, and $a$ is a pre-defined parameter to control the priority.
In addition, gradient descent in prioritized experience replay is weighted by the importance of each sample, which is described as:

$$w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^{\beta} \tag{6}$$

where $N$ indicates the number of replayed experiences, and $\beta$ indicates a pre-defined parameter.
The whole computational framework of PRDQN is shown in
Fig. 2.
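The sampling rule (5) and the importance weights (6) can be sketched in a few lines of NumPy, as below. The exponent values (a = 0.6, beta = 0.4) are common defaults for prioritized replay, not the settings reported in this paper, and the TD errors in the example are hypothetical.

```python
import numpy as np


def per_sample(td_errors: np.ndarray, batch_size: int, a: float = 0.6,
               beta: float = 0.4, eps: float = 1e-6):
    """Prioritized sampling of (5) and importance weights of (6)."""
    p = (np.abs(td_errors) + eps) ** a      # priorities p_i^a
    probs = p / p.sum()                     # P(i) in (5)
    idx = np.random.choice(len(td_errors), size=batch_size, p=probs)
    n = len(td_errors)
    w = (1.0 / (n * probs[idx])) ** beta    # w_i in (6)
    w /= w.max()                            # common normalization for training stability
    return idx, w


# Example with hypothetical TD errors stored in the replay memory
indices, weights = per_sample(np.array([0.5, 0.1, 2.0, 0.05]), batch_size=2)
```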
Fig. 3. The whole architecture of our proposed approach.
Fig. 4. The computational flow of bottleneck.
III. PROPOSED APPROACH
We propose an end-to-end lightweight architecture with risk awareness to make decisions for autonomous vehicles. First, a lightweight semantic extraction network based on depth-wise separable convolution and a transformer is introduced for image-sequence processing. Then, we introduce our risk assessment module and apply it in the proposed DRL-based method to
obtain a driving policy with the minimal expected risk. The
whole architecture is illustrated in Fig. 3.
A. End-to-End Lightweight Network
To alleviate the computational burden of the decision-making network, we introduce a depth-wise separable convolution [36] based module, called the bottleneck [48], to build the image
semantic extraction network. This module consists of Conv 1×1
and Dwise 3×3. The former is designed for dimension adjust-
ment and the latter is designed to decrease the cost of standard
3×3 convolution. To alleviate the computational burden and
conserve information as much as possible at the same time, the
computational flow of bottleneck is designed as shown in Fig. 4.
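A minimal PyTorch sketch of such a bottleneck is given below. It assumes an inverted-residual layout in the spirit of [36] and [48] (1x1 expansion, depth-wise 3x3, 1x1 projection with batch normalization and ReLU6); the expansion ratio and channel numbers are placeholders, since the exact configuration of Table I is not reproduced here.

```python
import torch
import torch.nn as nn


class Bottleneck(nn.Module):
    """Conv 1x1 (expand) -> depth-wise 3x3 -> Conv 1x1 (project); a residual
    connection is added when the input and output shapes match."""

    def __init__(self, in_ch: int, out_ch: int, t: int = 6, stride: int = 1):
        super().__init__()
        hid = in_ch * t  # expansion ratio t, cf. Table I
        self.use_res = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hid, 1, bias=False), nn.BatchNorm2d(hid), nn.ReLU6(inplace=True),
            nn.Conv2d(hid, hid, 3, stride, 1, groups=hid, bias=False),  # depth-wise 3x3
            nn.BatchNorm2d(hid), nn.ReLU6(inplace=True),
            nn.Conv2d(hid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        return x + self.block(x) if self.use_res else self.block(x)


# Example: one bottleneck applied to a 64x64 feature map with 16 channels
y = Bottleneck(16, 24, t=6, stride=2)(torch.randn(1, 16, 64, 64))
```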
The whole image semantic network configuration (DSCNN:
depth-wise separable convolution neural network) is shown in
Table I. The parameter details show that our semantic extraction
network (i.e., DSCNN) has only 22.11 MFLOPs (floating-point
operations) of inferential expenditure and 0.92M parameters, which is a very tiny amount of computation, whereas the vanilla CNN (VCNN) has 40.01 MFLOPs of inferential expenditure and 1.43M parameters. It is obvious that DSCNN has a lower computational burden and fewer parameters than VCNN, indicating that DSCNN is more appropriate for real-time applications.

TABLE I: Parameters of the image semantic network DSCNN and VCNN (t: expansion ratio, c: output channels, n: repeat times, s: number of strides, FLOPs: floating point operations, M: 10^6, GAP: global average pooling).

Fig. 5. Attention module.
To date, the transformer has achieved better parallelization performance than traditional LSTM-based methods, and it has a strong capability to build embeddings over longer sequences based on the relationships among all features [49]. Therefore, to make the agent aware of changes in the traffic environment, we introduce the transformer [49] to fuse the image semantic context of the time sequence in this study. Video frames are fed into the model as input and mapped into the action space to infer an action. The transformer is used for feature extraction based on global attention, and it is composed of multi-head attention units derived from the scaled dot-product attention (a self-attention criterion). The self-attention criterion divides the input embedding into three vectors Q, K, and V. Firstly, the scaled dot-product attention is calculated according to (7). The corresponding diagram is shown in Fig. 5.
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \tag{7}$$

where $Q$ is a query vector, $K$ is a key vector, $V$ is a value vector, and $\sqrt{d_k}$ is a normalization factor.
Then, $h$ parallel scaled dot-product attention modules are merged to generate the multi-head attention module, which means that self-attention is calculated $h$ times with $Q$, $K$, and $V$ by scaled dot-product attention modules with different weights. The computation flow is shown in Fig. 5 and the corresponding equation is:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O, \quad \text{head}_j = \text{Attention}\left(W_j^Q Q,\, W_j^K K,\, W_j^V V\right) \tag{8}$$

where $W$ indicates a weight matrix, $W_j^Q \in \mathbb{R}^{d_{model} \times d_q}$, $W_j^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_j^V \in \mathbb{R}^{d_{model} \times d_v}$, $W^O \in \mathbb{R}^{hd_v \times d_{model}}$, and $d_v$ and $d_{model}$ are the dimensions of the value vector $V$ and the model, respectively.
Finally, the complete end-to-end network is shown in Fig. 6. We set the above-mentioned $d_v$ and $d_{model}$ as 64 and 512, respectively. The $N$ in Fig. 6 and the $h$ in (8) are set as 6 and 8, respectively. All the parameters mentioned above are as recommended by the authors of the transformer in [49].
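The sketch below illustrates (7) and (8) in PyTorch. The scaled dot-product function follows (7) directly, while PyTorch's built-in nn.MultiheadAttention is used as a stand-in for (8); d_model = 512 and h = 8 match the values quoted above, but the sequence length and the arrangement of the surrounding DSCNN transformer (Fig. 6) are not reproduced here.

```python
import math
import torch
import torch.nn as nn


def scaled_dot_product_attention(q, k, v):
    """Eq. (7): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ v


# Eq. (8): the built-in module merges h scaled dot-product heads with
# learned projections (the W^Q, W^K, W^V, W^O matrices of the text).
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(1, 5, 512)        # e.g., 5 per-frame semantic embeddings
out, attn_weights = mha(x, x, x)  # self-attention over the time sequence
```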
B. Risk Assessment
Different from the deterministic approaches that only predict risk occurrence, our risk assessment method can hierarchically evaluate the risk probability of the host vehicle (HV) as:

$$\tau \in \Omega = \{\text{dangerous}, \text{attentive}, \text{safe}\} \stackrel{\text{def}}{=} \{D, A, S\} \tag{9}$$

where $\tau$ and $\Omega$ respectively denote the risk level and the set of risk levels.
Taking the uncertainty $\sigma$ and the relative distance $d$ to other vehicles (OVs) into consideration, we take two stages for risk assessment with the probabilistic approach: 1) hierarchically computing the conditional probability based on the distribution of safety metrics, and 2) determining the risk level of a specific state based on Bayes inference.
Thus, the distribution of safety metrics is defined as follows:

$$P(d \mid \tau = D) = \begin{cases} e^{-\frac{\Delta d_D^2}{2\sigma^2}}, & \text{if } d \ge d_D \\ 1, & \text{otherwise} \end{cases} \qquad
P(d \mid \tau = A) = e^{-\frac{\Delta d_A^2}{2\sigma^2}} \qquad
P(d \mid \tau = S) = \begin{cases} e^{-\frac{\Delta d_S^2}{2\sigma^2}}, & \text{if } d \le d_S \\ 1, & \text{otherwise} \end{cases} \qquad
\Delta d_i = |d - d_i|,\; i \in \{D, A, S\} \tag{10}$$

where $d$ is the relative distance (from the HV to an OV), and $d_D$, $d_A$, and $d_S$ are hyper-parameters that determine the risk levels. These parameters defined in advance (i.e., $d_D$, $d_A$, $d_S$, and $\sigma$) are leveraged to smooth the curves of the different risk levels. Fig. 7 [30] is a visual representation of (10). In order to make the risk distributions smooth, we design these hyper-parameters to be reasonable according to the visualized prior distribution of risk. More details on the determination of these hyper-parameters can be found in [17] and [30].
Fig. 6. DSCNN transformer: The proposed end-to-end lightweight decision-making network.
Fig. 7. The concrete risk curves of (10).
According to Bayes inference, the posterior probability of risk level $\tau$ can be determined as:

$$P(\tau \mid d) = \frac{P(d \mid \tau) \cdot P(\tau)}{\sum_{\tau \in \Omega} P(\tau) \cdot P(d \mid \tau)} \tag{11}$$

where $P(\tau \mid d)$ indicates the posterior probability of risk level $\tau$ at a given relative distance $d$, $P(d \mid \tau)$ indicates the conditional probability determined by (10), and $P(\tau)$ indicates the prior probability of risk level $\tau$. For convenience, a uniform prior probability is set over the distinct risk levels with the restrictive condition $\sum_{\tau \in \Omega} P(\tau) = 1$.
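A compact Python sketch of (10) and (11) is given below. The threshold values $d_D$, $d_A$, $d_S$ and $\sigma$ are illustrative placeholders only; the paper refers to [17] and [30] for how these hyper-parameters are actually determined.

```python
import numpy as np

# Hypothetical hyper-parameters (the paper tunes these per [17], [30])
D_D, D_A, D_S, SIGMA = 10.0, 20.0, 30.0, 5.0


def likelihoods(d: float) -> dict:
    """Conditional probabilities P(d | tau) of (10)."""
    g = lambda d0: np.exp(-((d - d0) ** 2) / (2 * SIGMA ** 2))
    return {
        "D": 1.0 if d < D_D else g(D_D),
        "A": g(D_A),
        "S": g(D_S) if d <= D_S else 1.0,
    }


def risk_posterior(d: float) -> dict:
    """Bayes inference of (11) with a uniform prior over {D, A, S}."""
    like = likelihoods(d)
    z = sum(like.values())  # the uniform prior cancels out of (11)
    return {tau: p / z for tau, p in like.items()}


print(risk_posterior(12.0))  # e.g., posterior over risk levels at d = 12 m
```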
C. Decision-Making With Minimal Expected Risk
In order to seek the policy with minimal expected risk, we
incorporate the output of risk evaluation into the DRL-based
methods for more satisfactory performance of safe driving.
Nevertheless, the output of the risk evaluation (i.e., $P(\tau \mid d)$) is discrete, which makes it inapplicable for continuous inference using DRL methods. To solve this problem, a continuous risk parameter $\varepsilon$ is calculated in (12) by considering the risk level $\tau$. Because the abbreviated letters (i.e., $D$, $A$, and $S$) in (9) cannot be directly used in the calculation of an expectation, we respectively assign $D$, $A$, and $S$ the scores 2, 1, and 0 (i.e., $\tau \in \Omega \stackrel{\text{def}}{=} \{2, 1, 0\}$) for mathematical calculation.
$$\varepsilon = \mathbb{E}(\tau) = \sum_{\tau \in \Omega} \tau \cdot P(\tau \mid d) = \sum_{\tau \in \{2,1\}} \tau \cdot P(\tau \mid d) \tag{12}$$

where $\tau$ denotes the discrete risk levels, and $\varepsilon$ indicates the expectation used as the continuous transformation of the risk.
Subsequently, (13) generates a policy with minimal risk expectation based on the continuously quantified driving risk:

$$\pi^* = \arg\min_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{+\infty} \gamma^t \varepsilon_{t+1} \,\middle|\, s_0 = s\right] \tag{13}$$
An equivalent expression is written as:

$$\pi^* = \arg\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{+\infty} \gamma^t (\max\varepsilon - \varepsilon_{t+1}) \,\middle|\, s_0 = s\right] \tag{14}$$

where $\max\varepsilon$ indicates the maximal value of the defined risk, i.e., $\max\varepsilon = 2$.
Comparing (2) and (14), a similar term can be identified with $r_{t+1} = \max\varepsilon - \varepsilon_{t+1}$. Thus, the corresponding Q-value function can be determined as (15) to solve the problem:

$$Q_{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{+\infty} \gamma^t (\max\varepsilon - \varepsilon_{t+1}) \,\middle|\, s_0 = s, a_0 = a\right] \tag{15}$$

The DQN outputs are the values of choosing each action. By maximizing (15), the action with the maximal Q-value is selected, see (16). Thus, the actions chosen by DQN at each step are independent.

$$a^* = \arg\max_{a} Q_{\pi}(s, a) \tag{16}$$
where $a^*$ denotes the optimal action with the maximal Q-value chosen by DQN, and $Q_{\pi}(s, a)$ is the Q-value function defined in (15).

TABLE II: Camera details (Height and Width: raw size of the collected images, FOV: field of view, Freq: image collection frequency, Pose: the world coordinate of the camera; the unlisted x, y, yaw, roll, and pitch are all 0).
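Connecting (11), (12), and (19), the continuous risk and the corresponding shaping term reduce to a one-line expectation over the posterior, as in the sketch below (continuing the risk_posterior sketch above; the scores {D: 2, A: 1, S: 0} follow the text).

```python
RISK_SCORE = {"D": 2.0, "A": 1.0, "S": 0.0}  # tau in Omega def= {2, 1, 0}
MAX_EPS = 2.0                                 # max(epsilon) used in (14), (15), (19)


def expected_risk(posterior: dict) -> float:
    """Eq. (12): epsilon = E(tau) = sum_tau tau * P(tau | d)."""
    return sum(RISK_SCORE[tau] * p for tau, p in posterior.items())


def risk_shaping(posterior: dict) -> float:
    """max(epsilon) - epsilon_t, the term that replaces r_{t+1} in (14) and (15)."""
    return MAX_EPS - expected_risk(posterior)
```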
IV. DRL-BASED DRIVING DECISION-MAKING
A. State and Action
The state space consists of images from a vehicle camera in
our approach. The camera captures the environment images with
a pre-defined frequency at 50 Hz. The most recent 5 images (0.1
second per image) in the last 0.5s are used to represent the state,
ensuring that the agent can be aware of the environment changes
from the images with dynamically changing information. The
details of the camera used for data collection are introduced in Table II.
Our proposed method considers longitudinal and lateral con-
trol by steering and throttle action in the designed autonomous
driving strategies. The brake action is retained for human drivers
instead of the DRL agent to prevent over-conservative behaviors
for better travel efficiency. Despite omitting the brake action, our method retains efficient performance due to the well-designed methodology, which is supported by our obtained results. Based on the above statement, the final action space at a given time $t$ (i.e., $a_t$) is defined as:

$$a_t \in \{LTL_t, LTS_t, S_t, RTS_t, RTL_t\} \tag{17}$$

where $LTL_t$ and $RTL_t$ indicate intense steering for left-turn and right-turn, i.e., $\pm 0.5$ ($+$ denotes left-turn and $-$ denotes right-turn), $LTS_t$ and $RTS_t$ indicate slight steering for left-turn and right-turn, i.e., $\pm 0.1$, and $S_t$ indicates that the host agent keeps driving straight with no steering input (i.e., a steering value of 0).
DQN-based agents with discrete actions are usually unfavorable for driving comfort [26]. The trajectories generated by DQN-based methods are always rough [30] because DQN-based methods are only applicable to discrete action spaces. To alleviate this problem, an exponential moving averaging strategy [30] is employed to smooth the motion path. Both the previously executed action and the action chosen by the DQN method at the current step are considered to smooth the gap between two consecutive discrete actions.

$$a'_t = a_{t-1} + \gamma(a_t - a_{t-1}) \tag{18}$$

where $a'_t$ is the smoothed action, $\gamma$ is an invariable parameter defined in advance for smoothing adjustment, and $a_{t-1}$ and $a_t$ are the actions taken by the DQN-based model at times $t-1$ and $t$, respectively.
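The smoothing rule (18) can be sketched as follows; the steering magnitudes (plus/minus 0.5, plus/minus 0.1, 0) follow (17), while the smoothing factor value 0.3 is a placeholder, not the paper's setting.

```python
# Discrete steering actions of (17): {LTL, LTS, S, RTS, RTL}
STEER = {"LTL": 0.5, "LTS": 0.1, "S": 0.0, "RTS": -0.1, "RTL": -0.5}


def smooth_action(prev_steer: float, action: str, gamma: float = 0.3) -> float:
    """Eq. (18): a'_t = a_{t-1} + gamma * (a_t - a_{t-1}).
    gamma here is the smoothing factor of (18), not the RL discount."""
    target = STEER[action]
    return prev_steer + gamma * (target - prev_steer)


# Example: transitioning from straight driving to an intense left turn
steer = 0.0
for _ in range(5):
    steer = smooth_action(steer, "LTL")
    print(round(steer, 3))  # approaches +0.5 gradually instead of jumping
```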
B. Reward Function
In order to generate a policy that ensures driving safety, driving risk should be given top priority. Therefore, the risk reward is written as:

$$r_{risk} = \max\varepsilon - \varepsilon_t \tag{19}$$

where $r_{risk}$ is the reward related to driving risk, $\varepsilon_t$ is the estimated risk at time $t$, and $\max\varepsilon$ indicates the maximal value of the risk.
In reality, traffic rules should be considered in the design of
autonomous driving strategies. Vehicle collisions often result from illegal lane changes. Unlike the binary penalty for illegal lane change in [32], we propose a soft penalty to strengthen the awareness that the HV should avoid lane invasion for safe driving. A greater relative distance between the HV and the road boundary corresponds to a smaller penalty, and thus the soft penalty is defined as:

$$r_{invasion} = -e^{-\frac{(la_{ld} - la_{hv})^2}{2\sigma^2}} \tag{20}$$

where $la_{ld}$ indicates the lateral position of the road boundary, $la_{hv}$ indicates the lateral position of the host agent, and the uncertainty is described by $\sigma$.
Besides, $r_{exist}$ is designed to encourage the HV to keep driving under the above lane and boundary rules for as long as possible with no crash:

$$r_{exist} = \begin{cases} 0.1, & \text{if survive} \\ -1, & \text{otherwise} \end{cases} \tag{21}$$

where 'survive' denotes that the HV drives within the lane boundaries with no crash. The reward values of $r_{exist}$ are determined according to [30] and [50].
According to (19), (20), and (21), we obtain $r_{risk} \in [0, 2]$, $r_{invasion} \in [-1, 0]$, and $r_{exist} \in \{-1, 0.1\}$. Reducing driving risk has been well accepted as the top priority in the development of autonomous driving technologies [51]. Therefore, it is reasonable that the upper bound of $r_{risk}$ is twice the corresponding absolute values of the other sub-rewards. Following the well-accepted simplification in RL [26], [52], the weights of the different sub-rewards are set to a constant value (i.e., 1). Comprehensively considering all these reward elements, the holistic reward function is designed as:

$$r = r_{risk} + r_{invasion} + r_{exist} \tag{22}$$
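Putting (19)-(22) together, the full reward can be sketched as below; the lane-boundary uncertainty sigma and the example inputs are illustrative placeholders.

```python
import math


def invasion_reward(la_ld: float, la_hv: float, sigma: float = 1.0) -> float:
    """Eq. (20): soft penalty in [-1, 0]; closer to the boundary -> stronger penalty."""
    return -math.exp(-((la_ld - la_hv) ** 2) / (2 * sigma ** 2))


def total_reward(risk_eps: float, la_ld: float, la_hv: float, survive: bool) -> float:
    """Eq. (22): r = r_risk + r_invasion + r_exist, with (19) and (21)."""
    r_risk = 2.0 - risk_eps              # (19), with max(epsilon) = 2
    r_exist = 0.1 if survive else -1.0   # (21)
    return r_risk + invasion_reward(la_ld, la_hv) + r_exist


# Example: low estimated risk, 2 m from the lane boundary, no crash
print(total_reward(risk_eps=0.3, la_ld=3.5, la_hv=1.5, survive=True))
```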
C. Training Details
To decrease the variance when updating the network, we train the model with several techniques, including a warm-up learning rate, gradient clipping, and soft updates.

Warm-up learning rate: Because of the significant variance in early training, the neural network should be optimized with a warm-up learning rate strategy so that it updates steadily. In other words, a small learning rate is used in the early optimization process, and the learning rate is ultimately raised to its normal value. In practice, the learning rate of DRL is initially assigned to 0.01 and then changed to 0.1 after 50 episodes.
Fig. 8. Description of the OV locations in scenario-I.
Gradient clip: Gradient clipping is a prevalent mitigation for gradient explosion, which can be calculated as:

$$grad'_i = grad_i \cdot \frac{clipnorm}{\max\left(\text{norm}(grad_i),\, clipnorm\right)} \tag{23}$$

where $grad_i$ and $grad'_i$ indicate the original and clipped gradients in layer $i$; $\text{norm}$ indicates the norm computation; and $clipnorm$ denotes a hyper-parameter bounding the norm after clipping, which is usually defined as 0.1 to mitigate volatility in the training process.
Soft update: Unlike hard network updating, soft updating keeps the weights of the target network close to those of the online network, which is defined as:

$$\theta_{target} = (1 - \eta) \cdot \theta_{target} + \eta \cdot \theta_{online} \tag{24}$$

where $\theta_{online}$ and $\theta_{target}$ are the weights of the online and target networks, and $\eta$ denotes a parameter controlling the target network updating speed, which is usually set to 0.01.
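A minimal PyTorch sketch of these three training details, under the values quoted above (clipnorm = 0.1, eta = 0.01, learning rate 0.01 then 0.1 after 50 episodes), is given below; it is an illustrative utility set, not the authors' training code.

```python
import torch


def clip_gradients(net: torch.nn.Module, clipnorm: float = 0.1) -> None:
    """Eq. (23): rescale each layer's gradient so that its norm does not exceed clipnorm."""
    for p in net.parameters():
        if p.grad is not None:
            g_norm = p.grad.norm()
            p.grad.mul_(clipnorm / torch.clamp(g_norm, min=clipnorm))


def soft_update(target: torch.nn.Module, online: torch.nn.Module, eta: float = 0.01) -> None:
    """Eq. (24): theta_target <- (1 - eta) * theta_target + eta * theta_online."""
    with torch.no_grad():
        for t_p, o_p in zip(target.parameters(), online.parameters()):
            t_p.mul_(1.0 - eta).add_(eta * o_p)


def warmup_lr(episode: int) -> float:
    """Two-stage warm-up described above: 0.01 for the first 50 episodes, then 0.1."""
    return 0.01 if episode < 50 else 0.1
```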
V. SIMULATION SCENARIOS
Most DRL-based approaches are developed in simulators [53], [54], [55], [56] instead of real-world tests to avoid the unaffordable cost of trial-and-error. In this study, we
design three lane change scenarios (stationary vehicles, moving
vehicles, and moving vehicles with acceleration, deceleration,
and lane change) in a prevalent simulator called Car Learning to
Act (CARLA) [47] to examine the effectiveness of our proposed
method and the compared methods.
Scenario-I (stationary vehicles): Several motionless vehicles (10-26) are randomly placed on a 420 m road. To prevent the road from being blocked by two cars in parallel, each road segment (i.e., 60 m) is divided into four sub-segments. Each vehicle (including the HV and OVs) is independently placed in one sub-segment. The position and lane choice of each placed vehicle are randomly initialized with a Gaussian-based sampling method. The HV is expected to drive forward safely without any crash.
See Fig. 8 for more details.
The longest straight road in CARLA is 420m, which is not
sufficient to examine the effectiveness of our proposed method.
To solve this problem, the experiment with randomly distributed
vehicles is repeated 100 times (i.e., 100 episodes) on the same
420 m road in the evaluation stage. In total, the HV ran 42 km with 1600 lane changes in the test phase, which means that the HV needed to change lanes about four times in every 100 m of
driving. This 420m road is commonly used for autonomous
driving technology development in CARLA [47].
TABLE III: Evaluation metrics of the examined methods in scenario-I.
Scenario-II (moving vehicles): All the settings in this scenario
(such as the initial positioning strategy, the task of HV and the
number of episodes) are the same as in scenario-I. The difference is that all the OVs run with a speed limit of 30 m/s.
Scenario-III (moving vehicles with acceleration, deceleration,
and lane change): In this scenario, each OV has a probability of $p$ ($p = 0.5$) to accelerate, decelerate, or change lane. The steering value varies from $-1.0$ to $1.0$. The acceleration and deceleration ranges are $(0, 0.2)$ and $(-0.1, 0)$, respectively. All the other
settings are the same as scenario-II.
The initial speed of HV in all the examined scenarios is 0.
The inferred driving actions from the examined methods work
on the HV to overtake the static or moving obstacles. Therefore,
the speed of HV is controlled by the inferred driving actions to
be dynamically changed according to the estimated driving risk.
Given that the speed limit of OVs is 30 m/s, the speed of HV
needs to be generally higher than 30 m/s in scenario-II.
Apart from these statements, some details about $r_{risk}$ need attention. To mitigate trajectory fluctuation when going straight, the evaluation of the risk from OVs differs between scenario-III and the other two scenarios. Since OVs will not change lanes in scenario-I and scenario-II, we consider the risk of obstacles in both lanes at the same time only when the agent is close enough to the front obstacle in the current lane; otherwise, the effect of obstacles in the other lane is ignored. The distance threshold between the HV and the OV in the current lane should adapt to the speed of the HV, but it was set to the safe distance for convenience. In contrast, in scenario-III we do not distinguish the effect of obstacles in the current lane from those in both lanes.
VI. RESULTS AND DISCUSSION
A CNN-based method [57] and a CNN LSTM method [23] are
used for comparison to support the superiority of our approach
(i.e., DSCNN transformer). The CNN-based method only uses spatial semantic information for decision-making, with a single image frame as input. CNN LSTM is another decision-making network, which combines the semantic information from both the spatial and temporal aspects, with an image sequence as input. For a fair comparison, we used our proposed image semantic extraction network to replace the corresponding networks in the compared methods, which are then named DSCNN and DSCNN LSTM.
A. Quantitative Analysis
The reward of our proposed method when training in
scenario-I is shown in Fig. 9, and the comparison results with
different methods are presented in Fig. 10 and Table III.

Fig. 9. Reward of our proposed method when training in scenario-I.

Fig. 10. Evaluation performances of the examined methods in scenario-I. The lines describe the means of driving distances before collision, and the shaded areas describe the corresponding standard deviations.

The baseline in Fig. 10 and Table III refers to the random action strategy [26], which is used as a reference to demonstrate the effectiveness of the examined methods. The episode number in Fig. 10 means
the number of running experiments for evaluation. The Score
denotes driving distances before collision in each episode. Score
(μ) and Score (σ) in Table III respectively denote the mean and
standard deviation of driving distances before collision. nCs
is the number of crashes that occurred in the experiments. Given that
Dsafe is the total driving distance in the testing episodes without
collisions, the finish rate (FR) is defined as the percentage of
Dsafe in the total driving distance (i.e., 420m) of all the testing
episodes.
The experimental results demonstrate that our method
achieves superior performance to the compared methods. Specif-
ically, the score (μ) of the proposed method is 360.40, an improvement of 194.9% and 58.2% over DSCNN and DSCNN LSTM, respectively. The score (σ) of our proposed method decreases by 31.4% and 40.5% compared with DSCNN and DSCNN LSTM, indicating that the proposed method has better stability. The results of nCs
and FR show similar trends. The nCs when using DSCNN and
DSCNN LSTM are 76 and 37, respectively. The number when
using our proposed DSCNN transformer decreases to 18. The FR increases from 29.10 and 54.24 when using DSCNN and DSCNN LSTM, respectively, to 85.81 when using our proposed DSCNN transformer.

Fig. 11. Reward of our proposed method when training in scenario-II.

Fig. 12. Evaluation performances of the examined methods in scenario-II.

TABLE IV: Evaluation metrics of the examined methods in scenario-II.
The reward of our proposed method when training in
scenario-II is shown in Fig. 11, and the performances of the
examined methods are presented in Fig. 12 and Table IV. The
general trends are similar to their performances in scenario-I.
Specifically, the score (μ) of our proposed method is 339.82 in scenario-II, an improvement of 213.8% and 104.8% over the compared DSCNN and DSCNN LSTM. The score (σ) of our proposed method is 112.06, a decrease of 8.01% compared with DSCNN LSTM. Although
the score (σ) is higher than that of DSCNN, the driving distance
before collision (i.e., score (μ)) of our approach achieves supe-
rior performance to DSCNN. Besides, the nCs of our proposed
method is 24, only 37.5% and 52.2% of the numbers when using
DSCNN and DSCNN LSTM, respectively. The FR increases from 25.78 and 39.51 when using DSCNN and DSCNN LSTM, to 80.91 when using our proposed method.

Fig. 13. Reward of our proposed method when training in scenario-III.

Fig. 14. Evaluation performances of the examined methods in scenario-III.

TABLE V: Evaluation metrics of the examined methods in scenario-III.
The reward of our proposed method when training in
scenario-III is shown in Fig. 13, and the performances of the
examined methods are presented in Fig. 14 and Table V. The
general trends are similar to their performances in scenario-I
and scenario-II. Specifically, the score (μ) of our proposed method is 303.00, an improvement of 228.3% and 129.3% over the compared DSCNN and DSCNN LSTM. Although the score (σ) of our proposed method is higher than those of the other methods, the score (μ) of our approach is considerably higher than the other methods, which can be clearly observed in Fig. 14. The
nCs of our proposed method is 33, only 58.9% and 80.5% of the
numbers when using DSCNN and DSCNN LSTM, respectively.
The FR increases from 21.97 and 31.46 when using DSCNN and
DSCNN LSTM, to 72.14 when using our proposed method.
B. Qualitative Analysis
The outputs of our proposed DSCNN transformer running in
scenario-I, scenario-II and scenario-III are illustrated in Figs. 15,
16 and 17, respectively. The results show that the AV drives safely and steadily. The highlighted red areas in the figures illustrate that the situation becomes more dangerous when the HV gets closer to an obstacle, which complies with the risk perceived by drivers in reality. There would be an inevitable crash if the agent made no lane change while the driving risk kept getting worse.
Fortunately, our trained DRL-based agent is aware of driving risk and takes proper actions for safe driving, as shown in the green areas presented in Fig. 15. The HV is able to take proper actions to avoid driving out of the lane boundaries and learns to travel along the lane center by using an incentive mechanism.
Therefore, a series of actions will be taken by the HV to recover
to a low risk level when in dangerous situations, contributing to
the better performance of our proposed method.
Comprehensively considering the presented quantitative and
qualitative analysis, our proposed approach shows obvious supe-
riority to the compared methods. When comparing DSCNN with
our proposed method, DSCNN uses a single image frame as input,
which makes it only aware of the static environment in a single
image, but the proposed method can be aware of the dynami-
cally changing environment in the input image sequence. The
awareness of the dynamically changing environment is essential
to decision-making in the examined scenarios. Therefore, the
proposed DSCNN transformer can reach a better performance
than DSCNN. When comparing with DSCNN LSTM, the advantage of our approach mainly comes from the semantic extraction capability of the transformer. The core module of the transformer is the multi-head attention module. The dot-product operation in multi-head attention can recalibrate feature embeddings and filter out useless information to make the agent focus on the essential information (e.g., the semantic information of lanes or obstacles).
These superior quantitative and qualitative performances may
be attributed to the newly proposed deep learning architecture
with a lightweight feature extraction network. Some effective
tricks (i.e., depth-wise separable convolution, linear bottlenecks,
together with transformer) used in combination is a novel and
effective attempt. Our develop method further improves the
previous DRL-based methods by incorporating the advanced
sequential action inference technology (i.e., transformer) and
considering driving policy with minimal expected risk for de-
cision inference, which has been demonstrated to be effective
and superior to the compared methods. Besides, the designed
strategy with minimal risk expectation comprehensively uses
position and its uncertainty to model driving risk, making the
agent be aware of driving risk to improve driving safety. The
presented results demonstrate the satisfactory performance of
our approach in the static scenario-I and dynamic scenario-II
and scenario-III.
Fig. 15. Examples of the HV trajectory and the output of risk evaluation by our proposed DSCNN transformer in scenario-I.
Fig. 16. Examples of the HV trajectory and the output of risk evaluation by our proposed DSCNN transformer in scenario-II.
C. Comparison With the Vanilla CNN Methods
To demonstrate the advances of our proposed approach, we
add a comparison experiment with the vanilla CNN methods
(i.e., VCNN and VCNN transformer) in scenario-I. The com-
parison results are shown in Fig. 18 and Table VI. Specifically, the score (μ) of DSCNN transformer is 360.40, which is 104.6% higher than that of the compared VCNN transformer. This means that the driving performance is greatly improved when using DSCNN transformer. Similarly, the score (σ) of DSCNN transformer is 29.9% lower than that of VCNN transformer, indicating a more stable performance. In addition, the nCs declines from 27 to 18, and the FR increases from 41.95% to 85.81%. Another interesting finding is that the score (σ) and the nCs of DSCNN are not better than those of VCNN.

TABLE VI: Evaluation metrics of DSCNN-based and VCNN-based methods in scenario-I.
Fig. 17. Examples of the HV trajectory and the output of risk evaluation by our proposed DSCNN transformer in scenario-III.
Fig. 18. Evaluation performances of DSCNN-based and VCNN-based meth-
ods in scenario-I.
But when using DSCNN together with transformer, the best
performance can be achieved, demonstrating the effectiveness
and advances of our proposed approach.
D. Realtime Capability for Model Deployment
Three evaluation metrics about computational cost (i.e., pa-
rameters, FLOPs, and FPS: frames per second) are used to justify
the computational cost advantage of our proposed DSCNN-
based methods over the methods based on VCNN. As shown in Table I, the parameters and FLOPs results show that
our proposed semantic extraction network (DSCNN) has only
22.11 MFLOPs inferential expenditure and 0.92M parameters,
while VCNN has 40.01 MFLOPs inferential expenditure and
1.43M parameters. The FPS results of the examined methods are
shown in Table VII.

TABLE VII: Frames per second (FPS) of the examined methods. The FPS means the number of inferences per second.

The numbers show that the FPS of DSCNN transformer is slightly inferior to the numbers of DSCNN and DSCNN LSTM, but superior to that of VCNN transformer. Specifically, the FPS of DSCNN transformer is 34.07, which means that the proposed method can finish an inference in only 0.029 s. This running speed should be workable on autonomous vehicle devices. Therefore, although the FPS of DSCNN transformer is not the best among the examined methods, the overall performance, comprehensively considering the above-mentioned quantitative and qualitative results together with these computational cost metrics, shows that our DSCNN transformer is generally the best among the examined methods.
With the innovative development of AI-SOC (artificial in-
telligence system on chip), learning architectures have promis-
ing opportunities to run on embedded devices. For instance, TensorFlow Lite, a deep learning framework developed by Google and adapted to ARM-based devices, delivers a convenient solution for deploying our DSCNN transformer. The low inferential expenditure, small number of parameters, and reasonable FPS of the proposed DSCNN transformer demonstrate its low computational cost. Compared with the huge networks with large numbers of parameters used in traditional end-to-end decision-making (e.g., VGG and ResNet), our designed network possesses the ability to run inference on CPU devices. Thus, the satisfactory real-time performance gives our proposed DSCNN transformer promising potential for practical applications.
E. Limitations and Future Work of Our Proposed Approach
Although our proposed approach has the above-mentioned
novelties and advances, there are still some limitations. Firstly,
only stable highway driving scenarios are considered in this study. Given that the safe distance between vehicles changes when driving at various speeds under different road conditions, and that lane changing of autonomous vehicles relies heavily on a precise positioning function, which is challenging especially in situations with poor lighting or weather conditions [58], [59], [60], an adaptive risk level determination strategy should be developed for improvement in various driving situations. Using knowledge
transferring technologies may be a promising solution [6], [24].
Secondly, our proposed approach only uses discrete actions, which limits the stability of driving trajectories [61], [62], [63].
It has been reported that methods with continuous actions (e.g.,
DDPG (deep deterministic policy gradient)) can mitigate this
problem, but they may generate over-conservative actions [61].
How to comprehensively integrate the advantages of DQN-
based methods and DDPG-based methods is a promising re-
search topic for the development of autonomous driving tech-
nologies. Thirdly, as reported in previous studies [64], human
driving habits affect drivers’ decision-making performances and
including human driving habits in the design of autonomous
driving systems can improve drivers’ acceptance of the emerging
technologies. However, we did not consider this factor in this
study. Our future work will design methods to involve the
influence of human driving habits on the risk awareness module
to further improve our developed method, which is expected to
help design individualized systems that can better match drivers’
characteristics for acceptance improvement. In addition, more
lightweight improvements will be considered in our future work.
VII. CONCLUSION
In this paper, an innovative driving decision-making network
with risk evaluation is designed to seek an optimum driving
policy with minimal risk expectation. Our proposed approach is
compared with the other methods in three lane change scenarios
with different difficulties. The quantitative and qualitative results
reveal that the comprehensive use of depth-wise separable con-
volution together with transformer in DRL-based architectures
for lane change decision inference can generate an optimal
policy with minimal driving risk to avoid crashes in all the three
examined scenarios. The comparison results well support the
superiority of our proposed approach. The lightweight charac-
teristic and superior performance of our proposed approach can
facilitate the development of autonomous driving technologies
in various driving scenarios.
REFERENCES
[1] “Traffic Safety Facts 2016: A compilation of motor vehicle crash data from
the fatality analysis reporting system and the general estimates system,”
U.S. Dept. Transp., Nat. Highway Traffic Saf. Admin., Washington, DC,
USA, Tech. Rep. DOT HS 812 554, 2017.
[2] M. S. Shirazi and B. T. Morris, “Looking at intersections: A survey of
intersection monitoring, behavior and safety analysis of recent studies,
IEEE Trans. Intell. Transp. Syst., vol. 18, no. 1, pp. 4–24, Jan. 2017.
[3] G. Li, Y. Liao, Q. Guo, C. Shen, and W. Lai, "Traffic crash characteristics in
Shenzhen, China from 2014 to 2016,” Int. J. Environ. Res. Public Health,
vol. 18, no. 3, Jan. 2021, Art. no. 1176.
[4] W. Xue and L. Zheng, "Active collision avoidance system design based
on model predictive control with varying sampling time, Automot. Innov.,
vol. 3, no. 1, pp. 62–72, Mar. 2020.
[5] G. Li, Y. Yang, X. Qu, D. Cao, and K. Li, “A deep learning based image
enhancement approach for autonomous driving at night, Knowl.-Based
Syst., vol. 213, 2021, Art. no. 106617.
[6] G. Li, Z. Ji, and X. Qu, “Stepwise domain adaptation (SDA) for ob-
ject detection in autonomous vehicles using an adaptive CenterNet,
IEEE Trans. Intell. Transp. Syst., vol. 23, no. 10, pp. 17729–17743,
Oct. 2022.
[7] D. Dolgov, S. Thrun, M. Montemerlo, and J. Diebel, “Path planning for
autonomous vehicles in unknown semi-structured environments, Int. J.
Robot. Res., vol. 29, no. 5, pp. 485–501, Apr. 2010.
[8] Y. Huang et al., “A motion planning and tracking framework for au-
tonomous vehicles based on artificial potential field elaborated resis-
tance network approach,” IEEE Trans. Ind. Electron., vol. 67, no. 2,
pp. 1376–1386, Feb. 2020.
[9] B. Li, Y. Ouyang, Y. Zhang, T. Acarman, Q. Kong, and Z. Shao, “Optimal
cooperative maneuver planning for multiple nonholonomic robots in a tiny
environment via adaptive-scaling constrained optimization, IEEE Robot.
Automat. Lett., vol. 6, no. 2, pp. 1511–1518, Apr. 2021.
[10] D. Shen, Y. Chen, L. Li, and S. Chien, “Collision-free path planning for
automated vehicles risk assessment via predictive occupancy map, in
Proc. IEEE Intell. Veh. Symp., 2020, pp. 985–991.
[11] B. Simon, F. Franke, P. Riegl, and A. Gaull, “Motion planning for collision
mitigation via FEM–based crash severity maps, in Proc. IEEE Intell. Veh.
Symp., 2019, pp. 2187–2194.
[12] M. Ali, P. Falcone, and J. Sjöberg, “Threat assessment design under driver
parameter uncertainty,” in Proc. IEEE 51st Conf. Decis. Control, 2012,
pp. 6315–6320.
[13] S. Glaser, B. Vanholme, S. Mammar, D. Gruyer, and L. Nouvelière,
“Maneuver-based trajectory planning for highly autonomous vehicles on
real road with traffic and driver interaction, IEEE Trans . Intell. Transp.
Syst., vol. 11, no. 3, pp. 589–606, Sep. 2010.
[14] Operational Definitions of Driving Performance Measures and Statistics,
Standard SAE J2944, Society of Automotive Engineers, 2015.
[15] J. Kim and D. Kum, “Collision risk assessment algorithm via
lane-based probabilistic motion prediction of surrounding vehicles,”
IEEE Trans. Intell. Transp. Syst., vol. 19, no. 9, pp. 2965–2976,
Sep. 2018.
[16] S. Noh and K. An, “Decision-making framework for automated driving in
highway environments, IEEE Trans. Intell. Transp. Syst., vol. 19, no. 1,
pp. 58–71, Jan. 2018.
[17] G. Li et al., “Risk assessment based collision avoidance decision-making
for autonomous vehicles in multi-scenarios,” Transp. Res. Part C: Emerg.
Technol., vol. 122, Jan. 2021, Art. no. 102820.
[18] S. Noh, “Decision-making framework for autonomous driving at road
intersections: Safeguarding against collision, overly conservative behav-
ior, and violation vehicles, IEEE Trans. Ind. Electron., vol. 66, no. 4,
pp. 3275–3286, Apr. 2019.
[19] D. Shin, B. Kim, K. Yi, A. Carvalho, and F. Borrelli, “Human-centered
risk assessment of an automated vehicle using vehicular wireless commu-
nication,” IEEE Trans. Intell. Transp. Syst., vol. 20, no. 2, pp. 667–681,
Feb. 2019.
[20] J. Nidamanuri, C. Nibhanupudi, R. Assfalg, and H. Venkataraman, “A
progressive review: Emerging technologies for ADAS driven solutions,
IEEE Trans. Intell. Veh., vol. 7, no. 2, pp. 326–341, Jun. 2022.
[21] G. Li, L. Yang, S. Li, X. Luo, X. Qu, and P. Green, “Human-like
decision making of artificial drivers in intelligent transportation sys-
tems: An end-to-end driving behavior prediction approach, IEEE In-
tell. Transp. Syst. Mag., vol. 14, no. 6, pp. 188–205, Nov./Dec. 2022,
doi: 10.1109/MITS.2021.3085986.
[22] Y. Pan et al., “Imitation learning for agile autonomous driving,” Int. J.
Robot. Res., vol. 39, no. 2/3, pp. 286–302, 2020.
[23] H. Xu, Y. Gao, F. Yu, and T. Darrell, “End-to-end learning of driving
models from large-scale video datasets,” in Proc. IEEE Conf. Comput.
Vis. Pattern Recognit., 2017, pp. 3530–3538.
[24] G. Li, Z. Ji, X. Qu, R. Zhou, and D. Cao, “Cross-domain object de-
tection for autonomous driving: A stepwise domain adaptative YOLO
approach,” IEEE Trans. Intell. Veh., vol. 7, no. 3, pp. 603–615, Sep. 2022,
doi: 10.1109/TIV.2022.3165353.
Authorized licensed use limited to: Tsinghua University. Downloaded on May 20,2024 at 05:44:48 UTC from IEEE Xplore. Restrictions apply.
2210 IEEE TRANSACTIONS ON INTELLIGENT VEHICLES, VOL. 8, NO. 3, MARCH 2023
[25] L. Peng, H. Wang, and J. Li, “Uncertainty evaluation of object detection
algorithms for autonomous vehicles,” Automot. Innov., vol. 4, no. 3,
pp. 241–252, Aug. 2021.
[26] V. Mnih et al., “Human-level control through deep reinforcement learning,”
Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[27] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience
replay,” 2016, arXiv: 1511.05952.
[28] G. Li, S. Lin, S. Li, and X. Qu, “Learning automated driving in complex
intersection scenarios based on camera sensors: A deep reinforcement
learning approach,” IEEE Sensors J., vol. 22, no. 5, pp. 4687–4696,
Mar. 2022.
[29] C.-J. Hoel, K. Driggs-Campbell, K. Wolff, L. Laine, and M. J. Kochen-
derfer, “Combining planning and deep reinforcement learning in tactical
decision making for autonomous driving, IEEE Trans. Intell. Veh.,vol.5,
no. 2, pp. 294–305, Jun. 2020.
[30] G. Li, Y. Yang, S. Li, X. Qu, N. Lyu, and S. E. Li, “Decision making
of autonomous vehicles in lane change scenarios: Deep reinforcement
learning approaches with risk awareness, Transp. Res. C Emerg. Technol.,
vol. 134, Jan. 2022, Art. no. 103452.
[31] B. Mirchevska, C. Pek, M. Werling, M. Althoff, and J. Boedecker, “High-
level decision making for safe and reasonable autonomous lane changing
using reinforcement learning,” in Proc. IEEE 21st Int. Conf. Intell. Transp.
Syst., 2018, pp. 2156–2162.
[32] T. Shi, P. Wang, X. Cheng, C. Y. Chan, and D. Huang, “Driving de-
cision and control for autonomous lane change based on deep rein-
forcement learning,” in Proc. IEEE Intell. Transp. Syst. Conf., 2019,
pp. 2895–2900.
[33] T. Fan, P. Long, W. Liu, and J. Pan, “Distributed multi-robot collision
avoidance via deep reinforcement learning for navigation in complex
scenarios,” Int. J. Robot. Res., vol. 39, no. 7, pp. 856–892, 2020.
[34] Y. Chen, C. Dong, P. Palanisamy, P. Mudalige, K. Muelling, and J.
M. Dolan, “Attention-based hierarchical deep reinforcement learning for
lane change behaviors in autonomous driving, in Proc. IEEE/CVF Conf.
Comput. Vis. Pattern Recognit. Workshops, 2019, pp. 1326–1334.
[35] C. Hoel, K. Driggs-Campbell, K. Wolff, L. Laine, and M. J. Kochenderfer,
“Combining planning and deep reinforcement learning in tactical decision
making for autonomous driving, IEEE Trans. Intell. Veh., vol. 5, no. 2,
pp. 294–305, Jun. 2020.
[36] A. G. Howard et al., “MobileNets: Efficient convolutional neural networks
for mobile vision applications,” 2017, arXiv:1704.04861.
[37] W. Hua et alet al., “Channel gating neural networks, in Proc. Neural Inf.
Process. Syst., 2019, vol. 32, pp. 1886–1896.
[38] C. Li, G. Wang, B. Wang, X. Liang, Z. Li, and X. Chang, “Dynamic
slimmable network,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern
Recognit., 2021, pp. 8607–8617.
[39] B. Jacob et al., “Quantization and training of neural networks for efficient
integer-arithmetic-only inference, in Proc. IEEE/CVF Conf. Comput. Vis.
Pattern Recognit., 2018, pp. 2704–2713.
[40] M. Henning, J. C. Muller, F. Gies, M. Buchholz, and K. Dietmayer,
“Situation-aware environment perception using a multi-layer attention
map,” IEEE Trans. Intell. Veh., vol. 8, no. 1, pp. 481–491, Jan. 2023.
[41] Y. Chen, G. Li, S. Li, W. Wang, S. E. Li, and B. Cheng, “Exploring
behavioral patterns of lane change maneuvers for human-like autonomous
driving, IEEE Trans.Intell. Transp. Syst., vol. 23, no. 9, pp. 14322–14335,
Sep. 2022.
[42] T. Rehder, A. Koenig, M. Goehl, L. Louis, and D. Schramm, “Lane change
intention awareness for assisted and automated driving on highways,
IEEE Trans. Intell. Veh., vol. 4, no. 2, pp. 265–276, Jun. 2019.
[43] J. Zhang, C. Chang, X. Zeng, and L. Li, “Multi-agent DRL-based lane
change with right-of-way collaboration awareness, IEEE Trans. Intell.
Transp. Syst., vol. 24 no. 1, pp. 854–869, Jan. 2023.
[44] X. He, H. Yang, Z. Hu, and C. Lv, “Robustlane change decision making for
autonomous vehicles: An observation adversarial reinforcement learning
approach,” IEEE Trans. Intell. Veh., vol. 8, no. 1, pp. 184–193, Jan. 2023.
[45] Y. Wang, D. Pan, H. Deng, Y. Jiang, and Z. Liu, “Dynamic trajectory
planning of autonomous lane change at medium and low speeds based on
elastic soft constraint of the safety domain,” Automot. Innov., vol. 3, no. 1,
pp. 73–87, Mar. 2020.
[46] G. Li, Y. Chen, D. Cao, X. Qu, B. Cheng, and K. Li, “Extraction of descrip-
tive driving patterns from driving data using unsupervised algorithms,
Mech. Syst. Signal Process., vol. 156, Jul. 2021, Art. no. 107589.
[47] A. Dosovitskiy, G. Ros, F. Codevilla, A. López, and V. Koltun, “CARLA:
An open urban driving simulator, in Proc. Conf. Robot. Learn., 2017,
vol. 78, pp. 1–16.
[48] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. -C. Chen, “Mo-
bileNetV2: Inverted residuals and linear bottlenecks, in Proc. IEEE/CVF
Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4510–4520.
[49] A. Vaswani et al., “Attention is all you need, in Proc. Adv. Neural Inf.
Process. Syst., 2017, pp. 6000–6010.
[50] P. Long, T. Fan, X. Liao, W. Liu, H. Zhang, and J. Pan, “Towards optimally
decentralized multi-robot collision avoidance via deep reinforcement
learning,” in Proc. IEEE Int. Conf. Robot. Automat., 2018, pp. 6252–6259.
[51] M. Bouton et al., “Reinforcement learning with probabilistic guarantees
for autonomous driving, 2018, arXiv: 1904.07189.
[52] X. Qi, Y. Luo, G. Wu, K. Boriboonsomsin, and M. Barth, “Deep reinforce-
ment learning enabled self-learning control for energy efficient driving,
Transp. Res. Part C: Emerg. Technol., vol. 99, pp. 67–81, Feb. 2019.
[53] Y. Ye, X. Zhang, and J. Sun, Automated vehicle’s behavior decision
making using deep reinforcement learning and high-fidelity simulation
environment, Transp. Res. C Emerg. Technol., vol. 107, pp. 155–170,
Oct. 2019.
[54] M. Zhu, Y. Wang, Z. Pu, J. Hu, X. Wang, and R. Ke, “Safe, efficient,
and comfortable velocity control based on reinforcement learning for
autonomous driving, Transp.Res. C Emerg. Technol.,vol. 117, Aug. 2020,
Art. no. 102662.
[55] J. Duan, S. E. Li, Y. Guan, Q. Sun, and B. Cheng, “Hierarchical rein-
forcement learning for self-driving decision-making without reliance on
labelled driving data,”IET Intell. Transp. Syst., vol. 14, no. 5, pp. 297–305,
2020.
[56] B. R. Kiran et al., “Deep reinforcement learning for autonomous driving:
A survey, IEEE Trans. Intell. Transp. Syst., vol. 23, no. 6, pp. 4909–4926,
Jun. 2022, doi: 10.1109/TITS.2021.3054625.
[57] F. Codevilla, M. Miiller, A. Lopez, V. Koltun, and A. Dosovitskiy, “End-
to-end driving via conditional imitation learning,” in Proc. IEEE Int. Conf.
Robot. Automat., 2018, pp. 4693–4700.
[58] G. Li, Y. Lin, and X. Qu, “An infrared and visible image fusion method
based on multi-scale transformation and norm optimization,” Inf. Fusion,
vol. 71, pp. 109–129, 2021.
[59] G. Guo and J. Liu, “A stochastic model-based fusion algorithm for en-
hanced localization of land vehicles,” IEEE Trans. Instrum. Meas., vol. 71,
2022, Art no. 8500810, doi: 10.1109/TIM.2021.3137566.
[60] J. Liu and G. Guo, “Vehiclelocalization during GPS outages with extended
Kalman filter and deep learning,” IEEE Trans. Instrum. Meas., vol. 70,
2021, Art no. 7503410, doi: 10.1109/TIM.2021.3097401.
[61] G. Li, S. Li, S. Li, and X. Qu, “Continuous decision-making for au-
tonomous driving at intersections using deep deterministic policy gra-
dient,” IET Intell. Transp. Syst., vol. 16, no. 12, pp. 1669–1681, 2021,
doi: 10.1049/itr2.12107.
[62] G. Li et al., “Deep reinforcement learning enabled decision-making for
autonomous driving at intersections,” Automot. Innov., vol. 3, no. 4,
pp. 374–385, Dec. 2020.
[63] B. Peng et al., “End-to-end autonomous driving through dueling double
deep Q-network,” Automot. Innov., vol. 4, no. 3, pp. 328–337, Aug. 2021.
[64] G. Li, S. E. Li, B. Cheng, and P. Green, “Estimation of driving style
in naturalistic highway traffic using maneuver transition probabilities,
Transp. Res. Part C: Emerg. Technol., vol. 74, pp. 113–125, Jan. 2017.
Guofa Li (Member, IEEE) received the Ph.D. degree
in mechanical engineering from Tsinghua University,
Beijing, China, in 2016. He is currently a Professor
with the College of Mechanical and Vehicle Engineer-
ing, Chongqing University, Chongqing, China. He
has authored or coauthored more than 70 papers in his
research fields, which include environment perception,
driver behavior analysis, and human-like decision-
making and control based on artificial intelligence
technologies in autonomous vehicles and intelligent
transportation systems. He was the recipient of the
Young Elite Scientists Sponsorship Program in China, and the best paper
awards from the China Association for Science and Technology (CAST) and the
Automotive Innovation Journal. He is an Associate Editor for IEEE SENSORS
JOURNAL, the Guest Editor of IEEE Intelligent Transportation Systems Magazine
and Automotive Innovation.
Yifan Qiu received the B.E. degree in 2021 from
Shenzhen University, Shenzhen, China, where he is
currently working toward the master’s degree with the
College of Mechatronics and Control Engineering.
His research focuses on using deep reinforcement learn-
ing technologies for the development of autonomous
vehicles.
Yifan Yang received the M.E. degree from Shen-
zhen University, Shenzhen, China, in 2021. He is
currently with the Autonomous Driving Group, Ten-
cent, Shenzhen, China. His research interests include
computer vision, deep reinforcement learning, and
machine learning in automotive and transportation
engineering. He has completed five projects on pedes-
trian recognition, object detection, image enhance-
ment, risk assessment, and decision making using
deep reinforcement learning for the development of
autonomous vehicles.
Zhenning Li received the B.S. and M.S. degrees
in transportation science and engineering from the
Harbin Institute of Technology, Harbin, China, in
2014 and 2016, respectively, and the Ph.D. degree
in civil engineering from the University of Hawaii at
Mānoa, Honolulu, HI, USA, in 2019. He is currently
an Assistant Professor with the State Key Labora-
tory of Internet of Things for Smart City and the
Department of Computer and Information Science,
University of Macau, Macau, China. His research
interests include connected autonomous vehicles and
big data applications in urban transportation systems.
Shen Li (Member, IEEE) received the B.E. degree
from Jilin University, Changchun, China, in 2012,
and the Ph.D. degree from the University of Wis-
consin Madison, Madison, WI, USA, in 2019. His
research interests include cooperative control method
of connected vehicles, autonomous driving safety,
intelligent transportation systems (ITS), architecture
design of CAVH system, traffic data mining based on
cellular data, and traffic operations and management.
He has participated in many research projects funded
by the National Natural Science Foundation of China,
Ministry of Science and Technology (863 projects) and U.S. Department of
Transportation.
Wenbo Chu received the B.S. degree in automotive
engineering from Tsinghua University, Beijing, China, in 2008, the M.S.
degree in automotive engineering from RWTH-Aachen, Aachen,
Germany, and the Ph.D. degree in mechanical
engineering from Tsinghua University in 2014. He is
currently a Research Fellow with the Western China
Science City Innovation Center of Intelligent and
Connected Vehicles (Chongqing) Co., Ltd., and the Na-
tional Innovation Center of Intelligent and Connected
Vehicles.
Paul Green received the M.S.E. and Ph.D. degrees
from the University of Michigan, Ann Arbor, MI,
USA, in 1974 and 1979, respectively. He is cur-
rently a Research Professor with the University of
Michigan Transportation Research Institute Driver
Interface Group, Ann Arbor, MI, USA, and an Ad-
junct Professor with the Department of Industrial
and Operations Engineering, University of Michigan.
He teaches automotive human factors and human-
computer interaction classes. He is the Leader of
the University’s Human Factors Engineering Short
Course, the flagship continuing education course in the profession, now in its
62nd year. His research interests include driving safety, driver interfaces, driver
behavior, driver workload, and the development of standards to get research into
practice. Prof. Green is the Past President of the Human Factors and Ergonomics
Society.
Shengbo Eben Li (Senior Member, IEEE) received
the M.S. and Ph.D. degrees from Tsinghua University,
Beijing, China, in 2006 and 2009, respectively. Before
joining Tsinghua University, he was with Stanford
University, Stanford, CA, USA, the University of Michi-
gan, Ann Arbor, MI, USA, and UC Berkeley, Berke-
ley, CA, USA. He is currently a Professor leading the
Intelligent Driving Lab (iDLab), Tsinghua Univer-
sity. He is the author of more than 120 peer-reviewed
journal/conference papers, and the co-inventor of
more than 30 patents. His research interests include
intelligent vehicles and driver assistance systems, reinforcement learning and
optimal control, and distributed control and estimation. Dr. Li was the recipient
of Best Paper Award in IEEE ITSC 2020, ICCAS 2020, IEEE ICUS 2020, CCCC
2018/2019, ITSAPF 2015, and IEEE ITSC 2014. His important awards include
the National Award for Technological Invention of China in 2013, Excellent
Young Scholar of NSF China in 2016, Young Professor of ChangJiang Scholar
Program in 2016, National Award for Progress in Sci & Tech of China in 2018,
Distinguished Young Scholar of Beijing NSF in 2018, and Youth Sci & Tech
Innovation Leader from MOST in 2020. He is also a member of the Board of Governors of
the IEEE Intelligent Transportation Systems Society. He is an Associate Editor
for IEEE TRANSACTIONS ON INTELLIGENT VEHICLES, IEEE TRANSACTIONS ON
INTELLIGENT TRANSPORTATION SYSTEMS, and IEEE Intelligent Transportation
Systems Magazine.