IEEE TRANSACTIONS ON INTELLIGENT VEHICLES, VOL. 8, NO. 3, MARCH 2023 2197
Lane Change Strategies for Autonomous Vehicles:
A Deep Reinforcement Learning Approach
Based on Transformer
Guofa Li, Member, IEEE, Yifan Qiu, Yifan Yang, Zhenning Li, Shen Li, Member, IEEE, Wenbo Chu,
Paul Green, and Shengbo Eben Li, Senior Member, IEEE
Abstract—End-to-end approaches are one of the most promising
solutions for autonomous vehicles (AVs) decision-making. However,
the deployment of these technologies is usually constrained by the
high computational burden. To alleviate this problem, we proposed
a lightweight transformer-based end-to-end model with risk aware-
ness ability for AV decision-making. Specifically, a lightweight
network with depth-wise separable convolution and transformer
modules was firstly proposed for image semantic extraction from
time sequences of trajectory data. Then, we assessed driving risk
by a probabilistic model with position uncertainty. This model was
integrated into deep reinforcement learning (DRL) to find strate-
gies with minimum expected risk. Finally, the proposedmethod was
evaluated in three lane change scenarios to validate its superiority.
Index Terms—Autonomous vehicles, decision-making,
reinforcement learning, lane change, transformer.
I. INTRODUCTION
AS REPORTED by the National Highway Traffic Safety
Administration (NHTSA) [1], 50000 fatal traffic accidents
are attributed to driving mistakes each year in the United States
Manuscript received 27 November 2022; accepted 5 December 2022. Date
of publication 9 December 2022; date of current version 27 April 2023. This
work was supported in part by the National Natural Science Foundation of China
under Grant 52272421, and in part by Shenzhen Fundamental Research Fund
under Grant JCYJ20190808142613246. (Corresponding author: Shen Li.)
Guofa Li is with the College of Mechanical and Vehicle Engineering,
Chongqing University, Chongqing 400044, China (e-mail: hanshan198@
gmail.com).
Yifan Qiu and Yifan Yang are with the College of Mechatronics and Con-
trol Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China
(e-mail: rye1222@qq.com; lvan0619@qq.com).
Zhenning Li is with the State Key Laboratory of Internet of Things for Smart
City and the Department of Computer and Information Science, University of
Macau, Macau 999078, China (e-mail: zhenningli@um.edu.mo).
Shen Li is with the School of Civil Engineering, Tsinghua University, Beijing
100084, China (e-mail: sli299@tsinghua.edu.cn).
Wenbo Chu is with the Western China Science City Innovation Center of Intelligent and Connected Vehicles (Chongqing) Co., Ltd., Chongqing 401329, China,
and also with the College of Mechanical and Vehicle Engineering, Chongqing
University, Chongqing 400044, China (e-mail: chuwenbo@wicv.cn).
Paul Green is with the University of Michigan Transportation Research
Institute (UMTRI) & Department of Industrial and Operations Engineering, Uni-
versity of Michigan, Ann Arbor, MI 48109 USA (e-mail: pagreen@umich.edu).
Shengbo Eben Li is with the State Key Lab of Automotive Safety and Energy,
School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China
(e-mail: lishbo@tsinghua.edu.cn).
Color versions of one or more figures in this article are available at
https://doi.org/10.1109/TIV.2022.3227921.
Digital Object Identifier 10.1109/TIV.2022.3227921
[2]. Statistics in China also show that over 90% of traffic ac-
cidents are related to driving mistakes [3]. Therefore, to help
drivers make reliable decisions and reduce the frequency of
human-caused accidents, numerous safety applications for lev-
els 1 and 2 autonomous vehicles have been developed in recent
years, such as advanced driver assistance systems (ADAS),
fatigue recognition systems, etc. Furthermore, academic researchers have begun to focus on designing active safety systems for higher-level autonomous vehicles, with heavy attention on
collision avoidance systems [4], [5], [6]. In the following para-
graphs, we summarize the influential approaches in the devel-
opment of decision-making systems for collision avoidance,
which can be categorized into motion planning-based methods,
risk estimation-based methods, and data-driven-based methods.
Specifically, supervised learning and reinforcement learning are
two principal categories for data-driven-based methods.
A. Motion Planning-Based Methods
A* and artificial potential field (APF) are two representative methods in the conventional motion planning-based category for collision avoidance decision-making. For instance, Dolgov et al. [7] searched a 3D kinematic state space via a variant of the A* method. Then, a numeric non-linear optimization method was further utilized to enhance the performance of the variant A* approach.
Huang et al. [8] proposed an APF method with different potential
functions for road boundaries after meshing the drivable area.
Subsequently, a local current comparison method was employed to generate a crash-free path. Nevertheless, these methods have two intrinsic drawbacks: 1) how they generate graphs (considering physical constraints) greatly affects their performance, and 2) they sometimes generate paths that are infeasible for vehicle kinematics.
To improve on this, other solutions that consider vehicle kinematics have been developed. Li et al. [9] introduced an optimization approach based on adaptive-scaling constraints for multi-agent travelling that considers vehicle dynamics to achieve time-optimal trajectory planning. Shen et al. [10] utilized a
predictive occupancy map (POM) to assess the potential risk
levels of surrounding vehicles based on vehicle kinematics. The
optimal path was then obtained by a random tree algorithm with
POM. Simon et al. [11] assumed that inevitable collisions were
inherently time-critical and thus introduced a novel method to
2379-8858 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Tsinghua University. Downloaded on May 20,2024 at 05:44:48 UTC from IEEE Xplore. Restrictions apply.
mitigate collisions based on vehicle kinematics. The proposed
method could capture the trajectory with the minimum execution
time by simulating with finite element modeling (FEM). However, given that the constraints for motion planning are usually nonlinear or nonconvex, the planning task may be an NP-hard problem that is difficult to solve.
B. Risk Estimation-Based Methods
Risk estimation-based methods first estimate the risk of the current driving state and then formulate a subsequent action policy in accordance with the risk estimation results. The modular or hierarchical design of these methods makes them more amenable to breakthroughs in autonomous vehicles [12]. Currently, deterministic approaches and probabilistic approaches are the two principal streams of risk estimation-based methods.
The deterministic approaches mainly estimate the occurrence
of a collision to infer the strategies for vehicle control. TTC
(time to collision) and THW (time headway) are the classical
evaluation metrics for driving safety [13], [14]. In single-lane scenarios, these risk estimation methods are comparatively accurate for longitudinal driving without computational burden [15]. However, because they barely consider the uncertainty of the input data, the derived policies are impractical for real-world applications and may perform unsatisfactorily in multi-lane scenarios [16].
To address the uncertainty problem, probabilistic descriptions are introduced for risk probability assessment in probabilistic approaches [17]. After fusing traditional metrics (e.g., TTC) into risk estimation using a Bayesian model, Noh [18] developed a rule-based expert system for subject vehicle control at intersections. Shin et al. [19] observed the uncertainties in the motion of remote vehicles via vehicle-to-vehicle (V2V) communication, which served as a reference for calculating the number of crashes within uncertainty boundaries. The intrinsic drawback of probabilistic approaches is that they only formulate rule-based strategies based on expert knowledge, which struggles with the complexity of realistic traffic environments and disregards human drivers' learning ability. Complex traffic environment details cannot always be effectively defined by countable rules, and it is also impossible to determine all the rules for all situations [9].
C. Data-Driven-Based Methods
With the discovery of the learning capability of neural networks, data-driven-based methods, including supervised learning and reinforcement learning, have become the mainstream for decision-making [20], [21]. Studies using supervised learning are booming in the development of autonomous vehicles.
Pan et al. [22] introduced a hybrid control policy network guided by a human expert and a model predictive control (MPC) expert. It only requires images taken with a monocular camera and the rolling speed to directly output the steering and throttle commands. Xu et al. [23] introduced an FCN-LSTM architecture to generate actions based on prior states of agents and videos taken
with a monocular camera. Moreover, this architecture leveraged imitation learning (IL) to improve performance, since semantic segmentation as a side task forces the FCN-LSTM architecture to learn interpretable feature representations. Unfortunately, since data collection in dangerous scenarios (e.g., inevitable collisions) is challenging, there is a gap between reality and training. These supervised methods often perform unsatisfactorily in realistic scenarios due to the lack of data from dangerous scenarios [24], [25].
To avoid the high-cost problem for data collection in dan-
gerous scenarios with supervised learning methods, researchers
utilized deep reinforcement learning (DRL) methods with af-
fordable trial-and-error to find the driving policy close to reality
for decision-making in autonomous vehicles based on driving
simulators [26], [27]. Different from rule-based methods, DRL-based methods learn how to drive from trial-and-error, making them adaptable to various situations [28], [29]. The learning criteria defined in DRL-based methods are just a few simple constraints or encouraging orientations, which are far fewer and simpler than the rules in rule-based methods [30]. Mirchevska et al. [31] developed a reinforcement learning approach based on a deep Q-learning network (DQN) for autonomous vehicles to take safe lane change actions in highway driving. To cope with the challenge of multi-agent collision avoidance, Fan et al. [33] proposed an innovative multi-stage RL-based architecture for safe and effective navigation in dense traffic with pedestrians. Chen et al. [34] designed a network with a hierarchical structure that maintained an overarching policy and underlying operation instructions simultaneously. However, the heavy time cost of model training is still a problem for DRL-based methods [35]. Therefore, the development of lightweight DRL models has been attracting researchers' attention in recent years.
D. Lightweight Model Design
Feature extraction from data with a fair amount of redun-
dant information, such as images, is computationally expen-
sive. Since reinforcement learning-based models are mainly trained online, models with a huge number of parameters cannot satisfy real-time requirements.
for lightweight model design have been developed for practical
applications. Howard et al. [36] proposed a novel lightweight
method with a pre-defined architecture to reduce the number
of convolution calculations to only 1/9 of the number when
using the vanilla convolution. Hua et al. [37] introduced a
dynamic pruning method, named channel gating, to optimize
CNN inference by utilizing input-specific features. By identify-
ing the regions with insignificant contributions, channel gating
will dynamically skip weight propagation for these ineffective
regions to ease the calculation burden. However, such dynamic methods can hardly achieve their theoretical acceleration in real implementations due to extra computation overhead (e.g., indexing, zero-masking, weight-copying). To achieve hardware-efficient acceleration, [38] dynamically sliced the network parameters to realize static and contiguous storage in hardware. As for the computation itself, Jacob et al. [39] proposed a quantization scheme to obtain integer-only models, which avoids the huge calculation cost of floating-point inference. Apart from these approaches, Henning et al. [40] proposed
a multi-layer attention map (MLAM) to only process the relevant
data, which mitigates the high redundancy in feature extraction
for environment perception. Although advances have been made in recent years, far more effort is still needed in lightweight DRL model design, especially in the area of risk-aware decision-making for autonomous vehicles in intelligent transportation systems.
E. Contributions
It has been widely accepted that lane change is one of the most
commonly adopted maneuvers in naturalistic driving [41], [42],
[43], [44]. In order to develop human-like autonomous driving technologies that avoid conflicts or crashes caused by inconsistencies between human drivers and artificial drivers, automatic lane change systems should be well designed for autonomous driving [45], [46]. However, current end-to-end automatic lane change models usually suffer from high computational cost or risk insensitivity and may not be useful for high-speed lane change scenarios. To overcome these drawbacks, we propose
an innovative method that allows agents to learn strategies with
the minimal expected risk at a low computational burden for
highway lane change. Firstly, we proposed a lightweight image
semantic extraction network based on depth-wise separable
convolutions and used transformers to merge the image seman-
tic contexts in time series. Next, we proposed a quantization
approach containing positional uncertainty based on Bayesian
theory for risk assessment, which was then introduced into DRL
to find the policy with minimal expected risk. Lastly, several virtual scenarios were built in the CARLA (Car Learning to Act) driving simulator [47] to evaluate the performance of our method.
The key contributions of our work are summarized as follows:
1) An innovative end-to-end model based on depth-wise separable convolution, with low computational burden, and a transformer network is newly proposed for lane change decision-making in autonomous driving. To the best of our knowledge, the combined use of depth-wise separable convolution and a transformer in DRL-based architectures for lane change decision inference has never been reported in the previous literature.
2) The driving policy with minimal expected risk is creatively integrated into DRL-based architectures for safe lane change, endowing the autonomous vehicle with risk awareness.
3) Three lane change scenarios with different difficulties (i.e., one with stationary vehicles, one with moving vehicles, and one with accelerating, decelerating, and lane-changing vehicles) are designed to evaluate the performances of the examined methods.
4) The lightweight characteristic and superior performance
of our proposed approach can facilitate the development
of autonomous driving technologies in various driving
scenarios for intelligent transportation.
F. Paper Structure
The rest of this paper is structured as follows. The problem statement and previous solutions are presented in Section II. The proposed methodology and the details of the deep reinforcement learning framework for decision-making are described in Section III and Section IV, respectively. The experiments in CARLA are detailed in Section V. Section VI presents the experimental results. Lastly, the conclusions of this paper are drawn in Section VII.
II. PROBLEM STATEMENT AND PREVIOUS SOLUTIONS
Generally, in the DRL framework, the agent drives in an uncertain environment by selecting a sequence of actions over several continuous time steps. It then receives rewards as feedback from its interaction with the environment. Finally, a strategy with the maximum cumulative reward will be chosen. In this study, a lane change process can be briefly described as a Markov decision process (MDP):

$$\mathcal{M} = \langle S, A, P, R \rangle \tag{1}$$

where $S = \{s_0, s_1, \ldots, s_t\}$ indicates the set of states, $A = \{a_0, a_1, \ldots, a_t\}$ indicates the set of actions, $P: S \times A \times S \to [0, 1]$ indicates the transition probability between states, and $R: S \times A \times S \to \mathbb{R}$ indicates the reward.
In particular, a sequence of actions adapted to the particular scenario will be chosen according to a stochastic strategy $\pi: S \to P(A)$, where $P(A)$ indicates the probability that an action in $A$ will be chosen following the strategy $\pi$. A trajectory is produced through this process, which can be denoted as $\tau = s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_t, a_t, r_t$, and the optimal strategy $\pi^*$ with the maximum expected cumulative reward can be found:

$$\pi^* = \arg\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{+\infty} \gamma^t r_{t+1} \,\middle|\, s_0 = s\right] \tag{2}$$

where $\gamma \in [0, 1]$ is a parameter that controls the weight of the next time step reward $r_{t+1}$, and $\pi^*$ indicates the optimal strategy.
However, (2) is hard to solve directly. To address this issue, a Q-value function is used for strategy optimization:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{+\infty} \gamma^t r_{t+1} \,\middle|\, s_0 = s, a_0 = a\right] \tag{3}$$

where $Q^{\pi}(s, a)$ indicates the expected cumulative reward obtained from the initial condition (state $s$ and action $a$) by following strategy $\pi$. Subsequently, the strategy $\pi$ can be improved by choosing the action $a$ that maximizes the Q-value, i.e., $\pi(s) = \arg\max_a Q^{\pi}(s, a)$. Thus, as an equivalent solution to (2), an optimal strategy $\pi^*$ with maximum $Q^{\pi}(s, a)$ will be generated, i.e., $Q^{\pi^*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$.
A. Deep Q-Network (DQN)
In the DQN architecture [26], Q-values are estimated by two networks (i.e., $Q_{target}$ and $Q_{online}$). A transition $(s_t, a_t, r_t, s_{t+1})$ is generated through $Q_{online}$ and stored in memory $M$. Randomly sampling from memory $M$ reduces the correlation of the data used to update the network. $Q_{target}$ outputs the target Q-value to calculate the temporal difference (TD) error.
Fig. 1. The framework of DQN.
Fig. 2. The framework of PRDQN.
The weights of $Q_{online}$ are then updated according to the obtained TD error. The separation of the two processes improves the stability of the network. The computational framework is shown in Fig. 1, and the loss function is described as:

$$L = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim M}\left[(y - Q(s_t, a_t; \theta))^2\right], \quad y = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta') \tag{4}$$

where $(s_t, a_t, r_t, s_{t+1})$ indicates a transition sampled from memory $M$, and $\theta$ and $\theta'$ respectively indicate the weights of $Q_{online}$ and $Q_{target}$.
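The TD target and squared loss of (4) can be sketched numerically as follows; all numbers below are illustrative stand-ins for network outputs, not values from the paper's model:

```python
# Minimal sketch of the TD target and loss in (4); rewards, Q-values,
# and terminal flags here are made-up illustrative numbers.
GAMMA = 0.99

def td_target(r, q_next_all, done):
    # y = r + gamma * max_a' Q_target(s_{t+1}, a'); no bootstrap at terminals
    return r if done else r + GAMMA * max(q_next_all)

batch = [
    # (reward, Q_target(s_{t+1}, .) per action, done, Q_online(s_t, a_t))
    (0.1, [0.5, 0.9, 0.2], False, 0.8),
    (-1.0, [0.4, 0.1, 0.3], True, -0.5),
]
targets = [td_target(r, q_next, d) for r, q_next, d, _ in batch]
loss = sum((y - q) ** 2 for (_, _, _, q), y in zip(batch, targets)) / len(batch)
print([round(y, 3) for y in targets], round(loss, 4))
```

Only $Q_{online}$'s weights $\theta$ would be updated from this loss; $\theta'$ is held fixed between periodic synchronizations, which is the stabilizing separation described above.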
B. Prioritized Replay Deep Q-Learning Network (PRDQN)
Memory replay in DQN with a uniform sampling policy does not favor samples with higher temporal difference errors, because samples with minor TD errors are just as likely to be drawn. To address this problem, Schaul et al. [27] developed a prioritized replay based on DQN (i.e., PRDQN), prioritizing samples with higher TD errors for learning. The sampling probability of sample $i$ is described as:

$$P(i) = \frac{p_i^{a}}{\sum_k p_k^{a}} \tag{5}$$

where $p$ indicates the TD error and $a$ is a pre-defined parameter that controls the priority.

In addition, gradient descent in prioritized experience replay is weighted by the importance-sampling weight of each sample, which is described as:

$$w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^{\beta} \tag{6}$$

where $N$ indicates the number of replay experiences, and $\beta$ indicates a pre-defined parameter.
The whole computational framework of PRDQN is shown in
Fig. 2.
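Equations (5) and (6) can be sketched together as follows; the TD errors and the exponents $a$ and $\beta$ below are illustrative values, not the paper's settings:

```python
import random

# Sketch of prioritized sampling per (5) and importance-sampling
# weights per (6); TD errors and exponents are illustrative.
td_errors = [0.05, 0.50, 1.20, 0.10]   # |TD error| p_i per stored sample
a, beta = 0.6, 0.4                      # priority / correction exponents

p = [e ** a for e in td_errors]
P = [x / sum(p) for x in p]             # eq. (5): sampling probabilities
N = len(P)
w = [(1.0 / (N * Pi)) ** beta for Pi in P]   # eq. (6): IS weights
w = [wi / max(w) for wi in w]                # common normalization to <= 1

random.seed(0)
i = random.choices(range(N), weights=P)[0]   # biased toward large TD errors
print(i, [round(x, 3) for x in P])
```

Note the two exponents pull in opposite directions: large-TD-error samples are drawn more often via (5), but their gradient contribution is down-weighted via (6) to correct the sampling bias.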
Fig. 3. The whole architecture of our proposed approach.
Fig. 4. The computational flow of bottleneck.
III. PROPOSED APPROACH
We propose an end-to-end lightweight architecture with risk awareness to make decisions for autonomous vehicles. First, a lightweight semantic extraction network based on depth-wise separable convolution and a transformer is introduced for image sequence processing. Then, we introduce our risk assessment module and apply it in the proposed DRL-based method to obtain a driving policy with the minimal expected risk. The whole architecture is illustrated in Fig. 3.
A. End-to-End Lightweight Network
To alleviate the computational burden of the decision-making network, we introduce a depth-wise separable convolution [36] based module, called a bottleneck [48], to build the image semantic extraction network. This module consists of Conv 1×1 and Dwise 3×3 layers. The former is designed for dimension adjustment and the latter is designed to decrease the cost of the standard 3×3 convolution. To alleviate the computational burden while conserving as much information as possible, the computational flow of the bottleneck is designed as shown in Fig. 4.
The whole image semantic network configuration (DSCNN:
depth-wise separable convolution neural network) is shown in
Table I. The parameter details show that our semantic extraction network (i.e., DSCNN) requires only 22.11 MFLOPs (floating-point operations) of inference cost and 0.92M parameters, which
TABLE I
PARAMETERS OF THE IMAGE SEMANTIC NETWORK DSCNN AND VCNN (T: EXPANSION RATIO, C: OUTPUT CHANNELS, N: REPEAT TIMES, S: THE NUMBER OF STRIDES, FLOPS: FLOATING-POINT OPERATIONS, M: 10^6, GAP: GLOBAL AVERAGE POOLING)
Fig. 5. Attention module.
is a very small amount of computation, whereas the vanilla CNN (VCNN) requires 40.01 MFLOPs of inference cost and 1.43M parameters. Clearly, DSCNN has a lower computational burden and fewer parameters than VCNN, indicating that DSCNN is more appropriate for real-time applications.
To date, the transformer has achieved better parallelization than traditional LSTM-based methods, and it has a strong capability to build embeddings over longer sequences based on the relationships of all features [49]. Therefore, to make the agent aware of changes in the traffic environment, we introduce the transformer [49] to mix the image semantic context of time sequences in this study. We feed video frames as input to the model and map them into the action space to infer an action. The transformer is used for feature extraction based on global attention, and it is composed of multi-head attention units derived from the scaled dot-product attention (a self-attention criterion). The self-attention criterion projects the input embedding into three vectors $Q$, $K$, and $V$. Firstly, the scaled dot-product attention is calculated according to (7). The corresponding diagram is shown in Fig. 5.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{7}$$

where $Q$ is a query vector, $K$ is a key vector, $V$ is a value vector, and $d_k$ is a normalization factor.
Then, $h$ parallel scaled dot-product attention modules are merged to generate the multi-head attention module, which means that self-attention is calculated $h$ times with $Q$, $K$, and $V$ by scaled dot-product attention modules with different weights. The computation flow is shown in Fig. 5 and the corresponding equation is shown as:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O}, \quad \mathrm{head}_j = \mathrm{Attention}\left(W_j^{Q}Q, W_j^{K}K, W_j^{V}V\right) \tag{8}$$

where $W$ indicates a weight matrix, $W_j^{Q} \in \mathbb{R}^{d_{model} \times d_q}$, $W_j^{K} \in \mathbb{R}^{d_{model} \times d_k}$, $W_j^{V} \in \mathbb{R}^{d_{model} \times d_v}$, $W^{O} \in \mathbb{R}^{hd_v \times d_{model}}$, and $d_v$ and $d_{model}$ are the dimensions of the value vector $V$ and the model, respectively.
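Equations (7) and (8) can be sketched in a few lines of numpy; the toy dimensions below are illustrative, not the paper's $d_{model} = 512$, $h = 8$ configuration:

```python
import numpy as np

# Minimal numpy sketch of eq. (7)-(8) with toy dimensions.
def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V   # eq. (7)

def multi_head(x, Wq, Wk, Wv, Wo):
    # One head per (Wq, Wk, Wv) triple; heads are concatenated, then Wo mixes them.
    heads = [attention(x @ wq, x @ wk, x @ wv) for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo    # eq. (8)

rng = np.random.default_rng(0)
seq, d_model, h, d_v = 5, 16, 2, 8
x = rng.normal(size=(seq, d_model))
Wq = [rng.normal(size=(d_model, d_v)) for _ in range(h)]
Wk = [rng.normal(size=(d_model, d_v)) for _ in range(h)]
Wv = [rng.normal(size=(d_model, d_v)) for _ in range(h)]
Wo = rng.normal(size=(h * d_v, d_model))
out = multi_head(x, Wq, Wk, Wv, Wo)
print(out.shape)  # same sequence length and model width as the input
```

The output keeps the input's sequence length and model width, which is what allows the transformer blocks to be stacked $N$ times.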
Finally, the complete end-to-end network is shown in Fig. 6. We set the above-mentioned $d_v$ and $d_{model}$ to 64 and 512, respectively. The $N$ in Fig. 6 and the $h$ in (8) are set to 6 and 8, respectively. All the parameters mentioned above are as recommended by the authors of the transformer in [49].
B. Risk Assessment
Different from the deterministic approaches that only predict risk occurrence, our risk assessment method can hierarchically evaluate the risk level of the host vehicle (HV) as:

$$\tau \in \Omega = \{\text{dangerous}, \text{attentive}, \text{safe}\} \overset{\text{def}}{=} \{D, A, S\} \tag{9}$$

where $\tau$ and $\Omega$ respectively denote the risk level and the set of risk levels.

Taking the uncertainty $\sigma$ and the relative distance $d$ to other vehicles (OVs) into consideration, we take two stages for risk assessment with the probabilistic approach: 1) hierarchically computing the conditional probability based on the distribution of safety metrics, and 2) determining the risk level of a specific state based on Bayesian inference.

The distribution of safety metrics is defined as follows:

$$P(d\,|\,\tau = D) = \begin{cases} e^{-\frac{\Delta d_D^2}{2\sigma^2}}, & \text{if } d \ge d_D \\ 1, & \text{otherwise} \end{cases}$$
$$P(d\,|\,\tau = A) = e^{-\frac{\Delta d_A^2}{2\sigma^2}}$$
$$P(d\,|\,\tau = S) = \begin{cases} e^{-\frac{\Delta d_S^2}{2\sigma^2}}, & \text{if } d \le d_S \\ 1, & \text{otherwise} \end{cases}$$
$$\Delta d_i = |d - d_i|, \quad i \in \{D, A, S\} \tag{10}$$

where $d$ is the relative distance (from the HV to OVs), and $d_D$, $d_A$, and $d_S$ are hyper-parameters that determine the risk level. These parameters defined in advance (i.e., $d_D$, $d_A$, $d_S$, and $\sigma$) are leveraged to smooth the curves at the different risk levels. Fig. 7 [30] is the visual representation of (10). In order to make the risk distributions smooth, we set these hyper-parameters to reasonable values according to the visualized prior distribution of risk. More details on the determination of these hyper-parameters can be found in [17] and [30].
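The conditional likelihoods in (10) can be sketched as follows; the thresholds $d_D$, $d_A$, $d_S$ and the $\sigma$ below are illustrative placeholders, not the paper's tuned values:

```python
import math

# Sketch of the conditional likelihoods in (10); the thresholds and
# sigma are illustrative, not the paper's tuned hyper-parameters.
D_D, D_A, D_S, SIGMA = 5.0, 15.0, 25.0, 5.0

def likelihoods(d):
    g = lambda d_i: math.exp(-((d - d_i) ** 2) / (2 * SIGMA ** 2))
    return {
        "D": 1.0 if d < D_D else g(D_D),   # dangerous: saturates at close range
        "A": g(D_A),                        # attentive: bell centered at d_A
        "S": 1.0 if d > D_S else g(D_S),   # safe: saturates at long range
    }

close, far = likelihoods(3.0), likelihoods(40.0)
print(close["D"], far["S"])  # each saturates to 1.0 in its own regime
```

The Gaussian tails make each likelihood decay smoothly outside its regime, which is exactly the curve-smoothing role of $\sigma$ described above.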
Fig. 6. DSCNN transformer: The proposed end-to-end lightweight decision-making network.
Fig. 7. The concrete risk curves of (10).
According to Bayesian inference, the posterior probability of risk level $\tau$ can be determined as:

$$P(\tau\,|\,d) = \frac{P(d\,|\,\tau) \cdot P(\tau)}{\sum_{\tau \in \Omega} P(\tau) \cdot P(d\,|\,\tau)} \tag{11}$$

where $P(\tau\,|\,d)$ indicates the posterior probability of risk level $\tau$ at each settled relative distance $d$, $P(d\,|\,\tau)$ indicates the conditional probability determined by (10), and $P(\tau)$ indicates the prior probability of risk level $\tau$. For convenience, a uniform prior is set over the distinct risk levels under the constraint $\sum_{\tau \in \Omega} P(\tau) = 1$.
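With the uniform prior, (11) reduces to normalizing the likelihoods; a minimal sketch (the likelihood values below are illustrative stand-ins for the output of (10)):

```python
# Sketch of the posterior in (11) with a uniform prior; the likelihood
# values are illustrative stand-ins for the output of (10).
def posterior(likelihood):
    prior = {t: 1.0 / len(likelihood) for t in likelihood}  # uniform P(tau)
    evidence = sum(prior[t] * likelihood[t] for t in likelihood)
    return {t: prior[t] * likelihood[t] / evidence for t in likelihood}

post = posterior({"D": 0.9, "A": 0.4, "S": 0.05})
print({t: round(p, 3) for t, p in post.items()})
```

The posterior always sums to one, and with a uniform prior the most likely level is simply the one with the largest conditional likelihood.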
C. Decision-Making With Minimal Expected Risk
In order to seek the policy with the minimal expected risk, we incorporate the output of risk evaluation into the DRL-based methods for more satisfactory safe-driving performance. Nevertheless, the output of risk evaluation (i.e., $P(\tau\,|\,d)$) is discrete, making it inapplicable for continuous inference using DRL methods. To solve this problem, a continuous risk parameter $\varepsilon$ is calculated in (12) from the risk levels $\tau$. Because the abbreviated letters (i.e., $D$, $A$, and $S$) in (9) cannot be directly used in the calculation of the expectation, we respectively assign $D$, $A$, and $S$ the scores 2, 1, and 0 (i.e., $\tau \in \Omega \overset{\text{def}}{=} \{2, 1, 0\}$) for mathematical calculation.

$$\varepsilon = \mathbb{E}(\tau) = \sum_{\tau \in \Omega} \tau \cdot P(\tau\,|\,d) = \sum_{\tau \in \{2,1\}} \tau \cdot P(\tau\,|\,d) \tag{12}$$

where $\tau$ denotes the discrete risk levels, and $\varepsilon$ indicates the expectation used for the continuous transformation of risk.

Subsequently, (13) generates a policy with minimal expected risk based on the continuously quantified driving risk:

$$\pi^* = \arg\min_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{+\infty} \gamma^t \varepsilon_{t+1} \,\middle|\, s_0 = s\right] \tag{13}$$

An equivalent expression is written as:

$$\pi^* = \arg\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{+\infty} \gamma^t (\max\varepsilon - \varepsilon_{t+1}) \,\middle|\, s_0 = s\right] \tag{14}$$

where $\max\varepsilon$ indicates the maximal value of the defined risk, i.e., $\max\varepsilon = 2$.
A similar expression, $r_{t+1} = \max\varepsilon - \varepsilon_{t+1}$, can be observed by comparing (2) and (14). Thus, the corresponding Q-value function can be determined as (15) to solve the problem:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{+\infty} \gamma^t (\max\varepsilon - \varepsilon_{t+1}) \,\middle|\, s_0 = s, a_0 = a\right] \tag{15}$$

The DQN outputs are the Q-values of each candidate action. By maximizing (15) as in (16), the action with the maximal Q-value will be selected; thus, the actions chosen by DQN at each step are independent.

$$a^* = \arg\max_{a} Q^{\pi}(s, a) \tag{16}$$
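The chain from a risk posterior to the reward term used inside the Q-function of (15) can be sketched as follows; the posterior values are illustrative:

```python
# Sketch tying (12)-(15) together: from a risk posterior to the reward
# term max(eps) - eps; the posterior values are illustrative.
SCORES = {"D": 2, "A": 1, "S": 0}  # eq. (9) levels mapped to {2, 1, 0}
MAX_EPS = 2.0

def expected_risk(post):
    return sum(SCORES[t] * post[t] for t in post)  # eq. (12)

post = {"D": 0.2, "A": 0.5, "S": 0.3}
eps = expected_risk(post)
reward = MAX_EPS - eps       # r_{t+1} in (14)-(15): low risk -> high reward
print(eps, reward)
```

Because the safe level scores 0, only the dangerous and attentive terms contribute, which is why the sum in (12) reduces to $\tau \in \{2, 1\}$.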
TABLE II
CAMERA DETAILS (HEIGHT AND WIDTH: RAW SIZE OF THE COLLECTED IMAGES, FOV: FIELD OF VIEW, FREQ: IMAGE COLLECTION FREQUENCY, POSE: THE WORLD COORDINATE OF THE CAMERA. THE UNLISTED X, Y, YAW, ROLL, AND PITCH ARE ALL 0.)
where $a^*$ denotes the optimal action with the maximal Q-value chosen by DQN, and $Q^{\pi}(s, a)$ is the Q-value function defined in (15).
IV. DRL-BASED DRIVING DECISION-MAKING
A. State and Action
The state space consists of images from a vehicle camera in
our approach. The camera captures the environment images with
a pre-defined frequency at 50 Hz. The most recent 5 images (0.1
second per image) in the last 0.5s are used to represent the state,
ensuring that the agent can be aware of the environment changes
from the images with dynamically changing information. The
details of camera for data collection are introduced in Table II.
Our proposed method considers longitudinal and lateral control via steering and throttle actions in the designed autonomous driving strategies. The brake action is reserved for human drivers instead of the DRL agent, to prevent over-conservative behaviors and improve travel efficiency. Despite omitting the brake action, our method retains efficient performance due to the well-designed methodology, which is supported by our obtained results. Based on the above, the final action space at a given time $t$ (i.e., $a_t$) is defined as:

$$a_t \in \{LTL_t, LTS_t, S_t, RTS_t, RTL_t\} \tag{17}$$

where $LTL_t$ and $RTL_t$ indicate intense steering for left-turn and right-turn, i.e., $\pm 0.5$ (+ denotes left-turn and − denotes right-turn), $LTS_t$ and $RTS_t$ indicate slight steering for left-turn and right-turn, i.e., $\pm 0.1$, and $S_t$ indicates that the host agent keeps driving straight with zero steering.
DQN-based agents with discrete actions usually compromise driving comfort [26]. The trajectories generated by DQN-based methods are often rough [30] because DQN-based methods are only appropriate for discrete action spaces. To alleviate this problem, an exponential moving average strategy [30] is adopted to smooth the motion path. Both the previously executed action and the action chosen by the DQN method at the current step are considered to smooth the gap between two consecutive discrete actions:

$$a_t^* = a_{t-1} + \gamma (a_t - a_{t-1}) \tag{18}$$

where $a_t^*$ is the smoothed action, $\gamma$ is an invariable parameter defined in advance for smoothing adjustment, and $a_{t-1}$ and $a_t$ are the actions taken by DQN-based models at times $t - 1$ and $t$, respectively.
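The smoothing rule in (18) can be sketched over a short steering sequence; the $\gamma$ value and the raw action sequence below are illustrative:

```python
# Sketch of the action smoothing in (18); gamma and the raw discrete
# steering sequence are illustrative values.
GAMMA = 0.4

def smooth(actions, gamma=GAMMA):
    out, prev = [], 0.0  # assume straight driving (steering 0) initially
    for a in actions:
        prev = prev + gamma * (a - prev)  # eq. (18)
        out.append(round(prev, 4))
    return out

raw = [0.5, 0.5, -0.1, 0.0]   # abrupt discrete steering choices
print(smooth(raw))             # gradual transitions between them
```

Each smoothed action moves only a fraction $\gamma$ of the way toward the newly chosen discrete action, so abrupt jumps (e.g., from $+0.5$ to $-0.1$) become gradual steering transitions.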
B. Reward Function
In order to generate a policy to ensure driving safety, driving
risk should be considered in priority. Therefore, the reward of
risk is written as:
rrisk =maxε−εt(19)
where rrisk is the reward of driving risk, and εtis the estimated
risk at time t, and max εindicates the maximal value of risk.
In reality, traffic rules should be considered in the design of
autonomous driving strategies. Vehicle collisions always suffer
from illegal lane changes. Unlike the binary penalty for illegal
lane change in [32], we propose a soft penalty to strengthen
the awareness that the HV should avoid lane invasion for safe
driving. A greater relative distance between the HV and road
boundary corresponds to a smaller penalty and thus the soft
penalty is defined as:
rinvasion =−e−(lald−lahv)2
2σ2(20)
where lald indicates the road boundary, lahv indicates the lateral
position of the host agent, the uncertainty is described by σ.
Besides, rexist is designed to encourage the HV to keep driving,
following the above lane and boundary rules, as long as possible
without a crash:
rexist = 0.1, if survive; −1, otherwise (21)
where 'survive' denotes that the HV drives within lane bound-
aries with no crash. The reward values of rexist are determined
according to [30] and [50].
According to (19), (20), and (21), we can obtain that rrisk ∈
[0, 2], rinvasion ∈ [−1, 0], and rexist ∈ {−1, 0.1}. Reducing driv-
ing risk has been well accepted as the top priority in the devel-
opment of autonomous driving technologies [51]. Therefore, it
is reasonable that the upper bound of rrisk is twice the corre-
sponding absolute values of the other sub-rewards. According
to the well-accepted simplification in RL [26], [52], the weights
of the sub-rewards are set to a constant value (i.e., 1). Compre-
hensively considering all these reward elements, the holistic
reward function is designed as:
r = rrisk + rinvasion + rexist (22)
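The composition of (19)-(22) can be sketched as follows, assuming max ε = 2 so that rrisk ∈ [0, 2] as stated above; the function and parameter names are illustrative, not the authors' code.

```python
import math

MAX_RISK = 2.0  # assumed value of max(eps) so that r_risk lies in [0, 2]

def reward(eps_t: float, la_ld: float, la_hv: float,
           sigma: float, survive: bool) -> float:
    """Holistic reward of Eq. (22) from the three sub-rewards."""
    r_risk = MAX_RISK - eps_t                                         # Eq. (19)
    r_invasion = -math.exp(-(la_ld - la_hv) ** 2 / (2 * sigma ** 2))  # Eq. (20)
    r_exist = 0.1 if survive else -1.0                                # Eq. (21)
    return r_risk + r_invasion + r_exist                              # Eq. (22)

# Zero risk but hugging the boundary (la_hv == la_ld): full invasion penalty.
r_at_boundary = reward(0.0, 3.5, 3.5, 0.5, True)   # 2.0 - 1.0 + 0.1 = 1.1
```

Note how the Gaussian form of rinvasion penalizes the agent most when its lateral position coincides with the boundary and fades toward zero in the lane center.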
C. Training Details
To decrease the variance when updating the network, we train
the model by involving the technologies including warm-up
learning rate, gradient clip, and soft update.
Warm-up learning rate: Because gradient variance is large at the
beginning of training, the neural network is optimized with a
warm-up learning rate strategy so that it updates steadily. In
other words, a small learning rate is used in the early optimiza-
tion process, and the learning rate is then raised to its nominal
value. In practice, the learning rate of DRL is initially set
to 0.01 and then changed to 0.1 after 50 episodes.
Authorized licensed use limited to: Tsinghua University. Downloaded on May 20,2024 at 05:44:48 UTC from IEEE Xplore. Restrictions apply.
Fig. 8. Description of the OV locations in scenario-I.
Gradient clip: Gradient clipping is a prevalent mitigation for
gradient explosion, which can be calculated as:
grad∗i = gradi · clipnorm / max(norm(gradi), clipnorm) (23)
where gradi and grad∗i indicate the original and clipped gradi-
ents in layer i; norm indicates the gradient norm computation;
and clipnorm denotes a hyper-parameter bounding the gradient
norm after clipping, which is usually set to 0.1 to mitigate
volatility in the training process.
Soft update: Unlike hard network updating, soft updating makes
the target network slowly track the weights of the online network,
which is defined as:
θtarget = (1 − η) · θtarget + η · θonline (24)
where θonline and θtarget are the weights of the online and target
networks, and η denotes a parameter controlling the target
network updating speed, which is usually set to 0.01.
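The three training techniques above can be sketched together in plain Python. The interpretation of norm(·) in (23) as the L2 norm, and all function names, are assumptions for illustration.

```python
import math

def warmup_lr(episode: int, warm_lr: float = 0.01,
              base_lr: float = 0.1, warm_episodes: int = 50) -> float:
    """Warm-up schedule: a small learning rate early on, then the
    nominal value (0.01 -> 0.1 after 50 episodes, as in the text)."""
    return warm_lr if episode < warm_episodes else base_lr

def clip_gradient(grad: list, clipnorm: float = 0.1) -> list:
    """Eq. (23): rescale a layer's gradient so its norm is at most clipnorm.
    norm(.) is taken here as the L2 norm (an assumption)."""
    norm = math.sqrt(sum(g * g for g in grad))
    return [g * clipnorm / max(norm, clipnorm) for g in grad]

def soft_update(theta_target: list, theta_online: list,
                eta: float = 0.01) -> list:
    """Eq. (24): the target network slowly tracks the online network."""
    return [(1 - eta) * t + eta * o for t, o in zip(theta_target, theta_online)]
```

Note that gradients whose norm is already below clipnorm pass through unchanged, because max(norm, clipnorm) then equals clipnorm and the scale factor is 1.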
V. S IMULATION SCENARIOS
Most approaches based on DRL methods are developed based
on simulators [53], [54], [55], [56] instead of real world tests to
prevent the unaffordable trial-and-error cost. In this study, we
design three lane change scenarios (stationary vehicles, moving
vehicles, and moving vehicles with acceleration, deceleration,
and lane change) in a prevalent simulator called Car Learning to
Act (CARLA) [47] to examine the effectiveness of our proposed
method and the compared methods.
Scenario-I (stationary vehicles): Several motionless vehicles
(10 to 26) are randomly placed on a 420 m road. To prevent
two parallel cars from blocking the road, each road segment
(i.e., 60 m) is divided into four sub-segments. Each vehicle
(including the HV and OVs) is independently placed in one sub-
segment. The position and lane choice of each placed vehicle are
randomly initialized with a Gaussian-based sampling method.
The HV is expected to drive forward safely without any crash.
See Fig. 8 for more details.
The longest straight road in CARLA is 420 m, which is not
sufficient to examine the effectiveness of our proposed method.
To solve this problem, the experiment with randomly distributed
vehicles is repeated 100 times (i.e., 100 episodes) on the same
420 m road in the evaluation stage. In total, the HV runs 42 km
with 1600 lane changes in the test phase, which means that
the HV needs to change lanes about four times per 100 m of
driving. This 420 m road is commonly used for autonomous
driving technology development in CARLA [47].
TABLE III
EVALUATION METRICS OF THE EXAMINED METHODS IN SCENARIO-I
Scenario-II (moving vehicles): All the settings in this scenario
(such as the initial positioning strategy, the task of the HV, and
the number of episodes) are the same as in scenario-I. The
difference is that all the OVs run with a speed limit of 30 m/s.
Scenario-III (moving vehicles with acceleration, deceleration,
and lane change): In this scenario, each OV has a probability p
(p = 0.5) to accelerate, decelerate, or change lanes. The steering
value varies from −1.0 to 1.0. The acceleration and deceleration
ranges are (0, 0.2) and (−0.1, 0), respectively. All the other
settings are the same as in scenario-II.
The initial speed of the HV in all the examined scenarios is 0.
The driving actions inferred by the examined methods drive the
HV to overtake the static or moving obstacles. Therefore, the
speed of the HV, controlled by the inferred driving actions, is
dynamically changed according to the estimated driving risk.
Given that the speed limit of the OVs is 30 m/s, the speed of the
HV generally needs to be higher than 30 m/s in scenario-II.
Apart from these statements, some details about rrisk need
attention. To mitigate trajectory fluctuation when going
straight, the evaluation of OV risk differs between scenario-III
and the other two scenarios. Since OVs do not change lanes in
scenario-I and scenario-II, we consider the risk of obstacles
in both lanes simultaneously only when the agent is close
enough to the front obstacle in the current lane; otherwise,
the effect of obstacles in the other lane is ignored. The distance
threshold between the HV and the OV in the current lane should
adapt to the speed of the HV; here it was set to the safe distance
for convenience. In contrast, we do not distinguish the effects
of obstacles in the current lane from those in both lanes in
scenario-III.
VI. RESULTS AND DISCUSSION
A CNN-based method [57] and a CNN LSTM method [23] are
used for comparison to support the superiority of our approach
(i.e., DSCNN transformer). The CNN-based method only uses
spatial semantic information for decision-making, with a single
image frame as input. CNN LSTM is another decision-making
network that combines semantic information from both the
spatial and temporal aspects, with an image sequence as input.
For fair comparison, we used our proposed image semantic
extraction network to replace the corresponding networks in the
compared methods, denoted as DSCNN and DSCNN LSTM.
A. Quantitative Analysis
The reward of our proposed method when training in
scenario-I is shown in Fig. 9, and the comparison results with
different methods are presented in Fig. 10 and Table III. The
Fig. 9. Reward of our proposed method when training in scenario-I.
Fig. 10. Evaluation performances of the examined methods in scenario-I. The
lines describe the means of driving distances before collision, and the shaded
areas describe the corresponding standard deviations.
baseline in Fig. 10 and Table III denotes the random action
strategy [26], used as a reference to demonstrate the effectiveness
of the examined methods. The episode number in Fig. 10 is
the number of evaluation runs. The Score denotes the driving
distance before collision in each episode. Score (μ) and Score
(σ) in Table III respectively denote the mean and standard
deviation of driving distances before collision. nCs is the number
of crashes that occurred in the experiments. Given that Dsafe is
the total driving distance in the testing episodes without
collisions, the finish rate (FR) is defined as the percentage of
Dsafe in the total driving distance (i.e., 420 m per episode) of all
the testing episodes.
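A minimal sketch of how these evaluation metrics could be computed from per-episode results. The use of the population standard deviation for Score (σ) and all names are assumptions for illustration, not the authors' evaluation code.

```python
import statistics

ROAD_LENGTH = 420.0  # metres per evaluation episode

def evaluate(distances: list, crashed: list):
    """Score(mu), Score(sigma), nCs, and finish rate (FR, percent) from
    per-episode driving distances before collision and crash flags."""
    score_mu = statistics.mean(distances)
    score_sigma = statistics.pstdev(distances)   # population std, an assumption
    n_cs = sum(crashed)
    # D_safe: total distance driven in episodes that ended without a collision
    d_safe = sum(d for d, c in zip(distances, crashed) if not c)
    fr = 100.0 * d_safe / (ROAD_LENGTH * len(distances))
    return score_mu, score_sigma, n_cs, fr

# Two toy episodes: one completed, one crashed halfway.
mu, sigma, ncs, fr = evaluate([420.0, 210.0], [False, True])
```

With this definition, only collision-free episodes contribute to Dsafe, so a single crash lowers FR even when a large distance was covered before impact.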
The experimental results demonstrate that our method
achieves superior performance to the compared methods. Specif-
ically, the score (μ) of the proposed method is 360.40, improved
by 194.9% and 58.2% over DSCNN and DSCNN LSTM, re-
spectively. The score (σ) of our proposed method decreases by
31.4% and 40.5% relative to DSCNN and DSCNN LSTM, indi-
cating that the proposed method has better stability. The results
of nCs and FR show similar trends. The nCs when using DSCNN
and DSCNN LSTM are 76 and 37, respectively, while the number
when using our proposed DSCNN transformer decreases to 18. The
Fig. 11. Reward of our proposed method when training in scenario-II.
Fig. 12. Evaluation performances of the examined methods in scenario-II.
TABLE IV
EVALUATION METRICS OF THE EXAMINED METHODS IN SCENARIO-II
FR increases from 29.10 and 54.24 when using DSCNN and
DSCNN LSTM, respectively, to 85.81 when using our proposed
DSCNN transformer.
The reward of our proposed method when training in
scenario-II is shown in Fig. 11, and the performances of the
examined methods are presented in Fig. 12 and Table IV. The
general trends are similar to those in scenario-I. Specifically,
the score (μ) of our proposed method is 339.82 in scenario-II,
improved by 213.8% and 104.8% over the compared DSCNN
and DSCNN LSTM. The score (σ) of our proposed method is
112.06, 8.01% lower than that of DSCNN LSTM. Although the
score (σ) is higher than that of DSCNN, the driving distance
before collision (i.e., score (μ)) of our approach is superior to
that of DSCNN. Besides, the nCs of our proposed method is 24,
only 37.5% and 52.2% of the numbers when using
Fig. 13. Reward of our proposed method when training in scenario-III.
Fig. 14. Evaluation performances of the examined methods in scenario-III.
TABLE V
EVALUATION METRICS OF THE EXAMINED METHODS IN SCENARIO-III
DSCNN and DSCNN LSTM, respectively. The FR increases
from 25.78 and 39.51 when using DSCNN and DSCNN LSTM,
to 80.91 when using our proposed method.
The reward of our proposed method when training in
scenario-III is shown in Fig. 13, and the performances of the
examined methods are presented in Fig. 14 and Table V. The
general trends are similar to those in scenario-I and scenario-II.
Specifically, the score (μ) of our proposed method is 303.00,
improved by 228.3% and 129.3% over the compared DSCNN
and DSCNN LSTM. Although the score (σ) of our proposed
method is higher than that of the other methods, its score (μ)
is substantially higher, which can be clearly observed in Fig. 14.
The nCs of our proposed method is 33, only 58.9% and 80.5%
of the numbers when using DSCNN and DSCNN LSTM, re-
spectively. The FR increases from 21.97 and 31.46 when using
DSCNN and DSCNN LSTM, to 72.14 when using our proposed
method.
B. Qualitative Analysis
The outputs of our proposed DSCNN transformer running in
scenario-I, scenario-II, and scenario-III are illustrated in Figs. 15,
16, and 17, respectively. The results show that the AV drives
safely and steadily. The highlighted red areas in the figures
illustrate that the situation becomes more dangerous as the HV
gets closer to an obstacle, complying with the risk perceived by
drivers in reality. A crash would be inevitable if the agent made
no lane change while the driving risk kept worsening.
Fortunately, our trained DRL-based agent is aware of the driving
risk and takes proper actions for safe driving, as shown in the
green areas in Fig. 15. The HV is able to take proper actions
to avoid driving out of the lane boundaries, and learns to travel
along the lane center through an incentive mechanism. There-
fore, when in dangerous situations, the HV takes a series of
actions to recover to a low risk level, contributing to the better
performance of our proposed method.
Comprehensively considering the presented quantitative and
qualitative analyses, our proposed approach shows obvious supe-
riority over the compared methods. DSCNN uses a single image
frame as input, which makes it aware only of the static environ-
ment in a single image, whereas the proposed method can per-
ceive the dynamically changing environment in the input image
sequence. This awareness of the dynamically changing environ-
ment is essential for decision-making in the examined scenarios.
Therefore, the proposed DSCNN transformer achieves better
performance than DSCNN. When compared with DSCNN LSTM,
the advantage of our approach mainly comes from the semantic
extraction capability of the transformer, whose core module is
multi-head attention. The dot-product operation in multi-head
attention can recalibrate feature embeddings and filter out
useless information so that the agent focuses on the essential
information (e.g., the semantic information of lanes or
obstacles).
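The recalibration performed by the dot-product attention described above can be sketched as follows (single head, no learned projections, numerically stable softmax); this is a generic illustration of the mechanism, not the authors' network.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a weighted mix of the rows of V, with weights
    from the softmax-normalized Q.K similarities (scaled by sqrt(d_k))."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# A query aligned with the first key attends mostly to the first value row,
# i.e., irrelevant rows of V are down-weighted ("filtered").
Q = np.array([[1.0, 0.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0]])
out, w = scaled_dot_product_attention(Q, K, V)
```

In a multi-head module this computation runs in parallel over several learned projections of Q, K, and V, and the concatenated outputs are linearly mixed.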
These superior quantitative and qualitative performances may
be attributed to the newly proposed deep learning architecture
with a lightweight feature extraction network. The combined
use of several effective techniques (i.e., depth-wise separable
convolution, linear bottlenecks, and the transformer) is a novel
and effective attempt. Our developed method further improves
on previous DRL-based methods by incorporating an advanced
sequential inference technology (i.e., the transformer) and by
considering a driving policy with minimal expected risk for de-
cision inference, which has been demonstrated to be effective
and superior to the compared methods. Besides, the designed
strategy with minimal risk expectation comprehensively uses
position and its uncertainty to model driving risk, making the
agent aware of driving risk to improve driving safety. The
presented results demonstrate the satisfactory performance of
our approach in the static scenario-I and the dynamic scenario-II
and scenario-III.
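To illustrate why depth-wise separable convolution yields a lightweight network, a parameter-count comparison with a standard convolution can be sketched (bias terms omitted; the channel sizes are illustrative, not the paper's layer configuration).

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Parameters of a standard k x k convolution: k*k*c_in per output channel."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Depthwise k x k filter per input channel, then a 1 x 1 pointwise
    convolution mixing channels."""
    return k * k * c_in + c_in * c_out

# e.g. a 32 -> 64 channel layer with a 3 x 3 kernel
std = standard_conv_params(32, 64, 3)        # 18432 parameters
dsc = depthwise_separable_params(32, 64, 3)  # 288 + 2048 = 2336 parameters
```

The factorization reduces the parameter count by roughly a factor of k² for wide layers, which is the source of the FLOPs and parameter savings reported in Section VI-D.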
Fig. 15. Examples of the HV trajectory and the output of risk evaluation by our proposed DSCNN transformer in scenario-I.
Fig. 16. Examples of the HV trajectory and the output of risk evaluation by our proposed DSCNN transformer in scenario-II.
C. Comparison With the Vanilla CNN Methods
To demonstrate the advantages of our proposed approach, we
add a comparison experiment with vanilla CNN methods
(i.e., VCNN and VCNN transformer) in scenario-I. The com-
parison results are shown in Fig. 18 and Table VI. Specifically,
the score (μ) of DSCNN transformer is 360.40, which is 104.6%
higher than that of the compared VCNN transformer.
This means that the driving performance is greatly improved
when using DSCNN transformer. Similarly, the score (σ) of
DSCNN transformer is 29.9% lower than that of VCNN trans-
former, indicating more stable performance.
In addition, the nCs declines from 27 to 18, and the FR increases
TABLE VI
EVALUATION METRICS OF DSCNN-BASED AND VCNN-BASED
METHODS IN SCENARIO-I
from 41.95% to 85.81%. Another interesting finding is that the
score (σ) and the nCs of DSCNN are not better than those of VCNN.
Fig. 17. Examples of the HV trajectory and the output of risk evaluation by our proposed DSCNN transformer in scenario-III.
Fig. 18. Evaluation performances of DSCNN-based and VCNN-based meth-
ods in scenario-I.
However, when DSCNN is used together with the transformer,
the best performance is achieved, demonstrating the effectiveness
and advantages of our proposed approach.
D. Realtime Capability for Model Deployment
Three evaluation metrics of computational cost (i.e., pa-
rameters, FLOPs, and FPS: frames per second) are used to justify
the computational cost advantage of our proposed DSCNN-
based methods over the VCNN-based methods. As shown
in Table I, the parameters and FLOPs results show that
our proposed semantic extraction network (DSCNN) has only
22.11 MFLOPs of inference cost and 0.92 M parameters,
while VCNN has 40.01 MFLOPs of inference cost and
1.43 M parameters. The FPS results of the examined methods are
shown in Table VII. The numbers show that the FPS of DSCNN
transformer is slightly lower than those of DSCNN and
TABLE VII
FRAMES PER SECOND (FPS) OF THE EXAMINED METHODS. THE FPS MEANS
THE NUMBER OF INFERENCES PER SECOND
DSCNN LSTM, but higher than that of VCNN trans-
former. Specifically, the FPS of DSCNN transformer is 34.07,
which means that the proposed method can finish an inference
in only 0.029 s. This running speed should be workable on
autonomous vehicle devices. Therefore, although the FPS of
DSCNN transformer is not the best among the examined meth-
ods, the overall performance, comprehensively considering the
above-mentioned quantitative and qualitative results together
with these computational cost metrics, shows that our DSCNN
transformer is generally the best among the examined methods.
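A simple timing harness of the kind that could produce such FPS numbers is sketched below; it is an illustrative measurement loop, not the authors' benchmark code, and the warm-up/run counts are assumed values.

```python
import time

def measure_fps(infer, n_warmup: int = 10, n_runs: int = 100) -> float:
    """Average inferences per second for a callable model. A few warm-up
    calls are discarded so one-time setup cost does not skew the timing."""
    for _ in range(n_warmup):
        infer()
    start = time.perf_counter()
    for _ in range(n_runs):
        infer()
    elapsed = time.perf_counter() - start
    return n_runs / elapsed

# An FPS of 34.07 corresponds to a per-inference latency of
# 1 / 34.07, i.e. about 0.029 s, matching the figure quoted above.
```

In practice `infer` would wrap the model's forward pass on a fixed input batch, and the measurement would be repeated on the target embedded hardware.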
With the innovative development of AI-SoCs (artificial in-
telligence systems on chip), learning architectures have promis-
ing opportunities to run on embedded devices. For instance,
TensorFlow Lite, a deep learning framework developed by
Google for ARM-based devices, delivers a convenient solution
for deploying our DSCNN transformer. The lower inference
cost, smaller number of parameters, and reasonable FPS of the
proposed DSCNN transformer demonstrate its low computa-
tional cost. Compared with the huge networks with large numbers
of parameters used in traditional end-to-end decision-making
(e.g., VGG and ResNet), our designed network is capable of
inference on CPU devices. Thus, the satisfactory real-time per-
formance gives our proposed DSCNN transformer promising
potential for practical applications.
E. Limitations and Future Work of Our Proposed Approach
Although our proposed approach has the above-mentioned
novelties and advantages, there are still some limitations. Firstly,
only stable highway driving scenarios are considered in this
study. The safe distance between vehicles changes with road
conditions and speeds, and the lane changing of autonomous
vehicles relies heavily on precise positioning, which is chal-
lenging especially under poor lighting or weather conditions
[58], [59], [60]; therefore, an adaptive risk level determination
strategy should be developed to improve performance in various
driving situations. Knowledge transfer technologies may be a
promising solution [6], [24]. Secondly, our proposed approach
uses only discrete actions, which limits the stability of driving
trajectories [61], [62], [63]. It has been reported that methods
with continuous actions (e.g., DDPG (deep deterministic policy
gradient)) can mitigate this problem, but they may generate
over-conservative actions [61]. How to comprehensively inte-
grate the advantages of DQN-based and DDPG-based methods
is a promising research topic for the development of autonomous
driving technologies. Thirdly, as reported in previous studies
[64], human driving habits affect drivers' decision-making per-
formance, and including human driving habits in the design
of autonomous driving systems can improve drivers' acceptance
of the emerging technologies. However, we did not consider
this factor in this study. Our future work will design methods
to incorporate the influence of human driving habits into the
risk awareness module to further improve our developed method,
which is expected to help design individualized systems that
better match drivers' characteristics for improved acceptance.
In addition, more lightweight improvements will be considered
in our future work.
VII. CONCLUSION
In this paper, an innovative driving decision-making network
with risk evaluation is designed to seek an optimal driving
policy with minimal risk expectation. Our proposed approach is
compared with other methods in three lane change scenarios
with different difficulties. The quantitative and qualitative results
reveal that the comprehensive use of depth-wise separable con-
volution together with the transformer in DRL-based architec-
tures for lane change decision inference can generate an optimal
policy with minimal driving risk that avoids crashes in all three
examined scenarios. The comparison results well support the
superiority of our proposed approach. The lightweight charac-
teristics and superior performance of our proposed approach can
facilitate the development of autonomous driving technologies
in various driving scenarios.
REFERENCES
[1] “Traffic Safety Facts 2016: A compilation of motor vehicle crash data from
the fatality analysis reporting system and the general estimates system,”
U.S. Dept. Transp., Nat. Highway Traffic Saf. Admin., Washington, DC,
USA, Tech. Rep. DOT HS 812 554, 2017.
[2] M. S. Shirazi and B. T. Morris, “Looking at intersections: A survey of
intersection monitoring, behavior and safety analysis of recent studies,”
IEEE Trans. Intell. Transp. Syst., vol. 18, no. 1, pp. 4–24, Jan. 2017.
[3] G. Li, Y. Liao, Q. Guo, C. Shen, and W. Lai, “Traffic crash characteristics in
Shenzhen, China from 2014 to 2016,” Int. J. Environ. Res. Public Health,
vol. 18, no. 3, Jan. 2021, Art. no. 1176.
[4] W. Xue and L. Zheng, “Active collision avoidance system design based
on model predictive control with varying sampling time,” Automot. Innov.,
vol. 3, no. 1, pp. 62–72, Mar. 2020.
[5] G. Li, Y. Yang, X. Qu, D. Cao, and K. Li, “A deep learning based image
enhancement approach for autonomous driving at night,” Knowl.-Based
Syst., vol. 213, 2021, Art. no. 106617.
[6] G. Li, Z. Ji, and X. Qu, “Stepwise domain adaptation (SDA) for ob-
ject detection in autonomous vehicles using an adaptive CenterNet,”
IEEE Trans. Intell. Transp. Syst., vol. 23, no. 10, pp. 17729–17743,
Oct. 2022.
[7] D. Dolgov, S. Thrun, M. Montemerlo, and J. Diebel, “Path planning for
autonomous vehicles in unknown semi-structured environments,” Int. J.
Robot. Res., vol. 29, no. 5, pp. 485–501, Apr. 2010.
[8] Y. Huang et al., “A motion planning and tracking framework for au-
tonomous vehicles based on artificial potential field elaborated resis-
tance network approach,” IEEE Trans. Ind. Electron., vol. 67, no. 2,
pp. 1376–1386, Feb. 2020.
[9] B. Li, Y. Ouyang, Y. Zhang, T. Acarman, Q. Kong, and Z. Shao, “Optimal
cooperative maneuver planning for multiple nonholonomic robots in a tiny
environment via adaptive-scaling constrained optimization,” IEEE Robot.
Automat. Lett., vol. 6, no. 2, pp. 1511–1518, Apr. 2021.
[10] D. Shen, Y. Chen, L. Li, and S. Chien, “Collision-free path planning for
automated vehicles risk assessment via predictive occupancy map,” in
Proc. IEEE Intell. Veh. Symp., 2020, pp. 985–991.
[11] B. Simon, F. Franke, P. Riegl, and A. Gaull, “Motion planning for collision
mitigation via FEM–based crash severity maps,” in Proc. IEEE Intell. Veh.
Symp., 2019, pp. 2187–2194.
[12] M. Ali, P. Falcone, and J. Sjöberg, “Threat assessment design under driver
parameter uncertainty,” in Proc. IEEE 51st Conf. Decis. Control, 2012,
pp. 6315–6320.
[13] S. Glaser, B. Vanholme, S. Mammar, D. Gruyer, and L. Nouvelière,
“Maneuver-based trajectory planning for highly autonomous vehicles on
real road with traffic and driver interaction,” IEEE Trans. Intell. Transp.
Syst., vol. 11, no. 3, pp. 589–606, Sep. 2010.
[14] Operational Definitions of Driving Performance Measures and Statistics,
Standard SAE J2944, Society of Automotive Engineers, 2015.
[15] J. Kim and D. Kum, “Collision risk assessment algorithm via
lane-based probabilistic motion prediction of surrounding vehicles,”
IEEE Trans. Intell. Transp. Syst., vol. 19, no. 9, pp. 2965–2976,
Sep. 2018.
[16] S. Noh and K. An, “Decision-making framework for automated driving in
highway environments,” IEEE Trans. Intell. Transp. Syst., vol. 19, no. 1,
pp. 58–71, Jan. 2018.
[17] G. Li et al., “Risk assessment based collision avoidance decision-making
for autonomous vehicles in multi-scenarios,” Transp. Res. Part C: Emerg.
Technol., vol. 122, Jan. 2021, Art. no. 102820.
[18] S. Noh, “Decision-making framework for autonomous driving at road
intersections: Safeguarding against collision, overly conservative behav-
ior, and violation vehicles,” IEEE Trans. Ind. Electron., vol. 66, no. 4,
pp. 3275–3286, Apr. 2019.
[19] D. Shin, B. Kim, K. Yi, A. Carvalho, and F. Borrelli, “Human-centered
risk assessment of an automated vehicle using vehicular wireless commu-
nication,” IEEE Trans. Intell. Transp. Syst., vol. 20, no. 2, pp. 667–681,
Feb. 2019.
[20] J. Nidamanuri, C. Nibhanupudi, R. Assfalg, and H. Venkataraman, “A
progressive review: Emerging technologies for ADAS driven solutions,”
IEEE Trans. Intell. Veh., vol. 7, no. 2, pp. 326–341, Jun. 2022.
[21] G. Li, L. Yang, S. Li, X. Luo, X. Qu, and P. Green, “Human-like
decision making of artificial drivers in intelligent transportation sys-
tems: An end-to-end driving behavior prediction approach,” IEEE In-
tell. Transp. Syst. Mag., vol. 14, no. 6, pp. 188–205, Nov./Dec. 2022,
doi: 10.1109/MITS.2021.3085986.
[22] Y. Pan et al., “Imitation learning for agile autonomous driving,” Int. J.
Robot. Res., vol. 39, no. 2/3, pp. 286–302, 2020.
[23] H. Xu, Y. Gao, F. Yu, and T. Darrell, “End-to-end learning of driving
models from large-scale video datasets,” in Proc. IEEE Conf. Comput.
Vis. Pattern Recognit., 2017, pp. 3530–3538.
[24] G. Li, Z. Ji, X. Qu, R. Zhou, and D. Cao, “Cross-domain object de-
tection for autonomous driving: A stepwise domain adaptative YOLO
approach,” IEEE Trans. Intell. Veh., vol. 7, no. 3, pp. 603–615, Sep. 2022,
doi: 10.1109/TIV.2022.3165353.
[25] L. Peng, H. Wang, and J. Li, “Uncertainty evaluation of object detection
algorithms for autonomous vehicles,” Automot. Innov., vol. 4, no. 3,
pp. 241–252, Aug. 2021.
[26] V. Mnih et al., “Human-level control through deep reinforcement learning,”
Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[27] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience
replay,” 2016, arXiv: 1511.05952.
[28] G. Li, S. Lin, S. Li, and X. Qu, “Learning automated driving in complex
intersection scenarios based on camera sensors: A deep reinforcement
learning approach,” IEEE Sensors J., vol. 22, no. 5, pp. 4687–4696,
Mar. 2022.
[29] C.-J. Hoel, K. Driggs-Campbell, K. Wolff, L. Laine, and M. J. Kochen-
derfer, “Combining planning and deep reinforcement learning in tactical
decision making for autonomous driving,” IEEE Trans. Intell. Veh., vol. 5,
no. 2, pp. 294–305, Jun. 2020.
[30] G. Li, Y. Yang, S. Li, X. Qu, N. Lyu, and S. E. Li, “Decision making
of autonomous vehicles in lane change scenarios: Deep reinforcement
learning approaches with risk awareness,” Transp. Res. C Emerg. Technol.,
vol. 134, Jan. 2022, Art. no. 103452.
[31] B. Mirchevska, C. Pek, M. Werling, M. Althoff, and J. Boedecker, “High-
level decision making for safe and reasonable autonomous lane changing
using reinforcement learning,” in Proc. IEEE 21st Int. Conf. Intell. Transp.
Syst., 2018, pp. 2156–2162.
[32] T. Shi, P. Wang, X. Cheng, C. Y. Chan, and D. Huang, “Driving de-
cision and control for autonomous lane change based on deep rein-
forcement learning,” in Proc. IEEE Intell. Transp. Syst. Conf., 2019,
pp. 2895–2900.
[33] T. Fan, P. Long, W. Liu, and J. Pan, “Distributed multi-robot collision
avoidance via deep reinforcement learning for navigation in complex
scenarios,” Int. J. Robot. Res., vol. 39, no. 7, pp. 856–892, 2020.
[34] Y. Chen, C. Dong, P. Palanisamy, P. Mudalige, K. Muelling, and J.
M. Dolan, “Attention-based hierarchical deep reinforcement learning for
lane change behaviors in autonomous driving,” in Proc. IEEE/CVF Conf.
Comput. Vis. Pattern Recognit. Workshops, 2019, pp. 1326–1334.
[35] C. Hoel, K. Driggs-Campbell, K. Wolff, L. Laine, and M. J. Kochenderfer,
“Combining planning and deep reinforcement learning in tactical decision
making for autonomous driving,” IEEE Trans. Intell. Veh., vol. 5, no. 2,
pp. 294–305, Jun. 2020.
[36] A. G. Howard et al., “MobileNets: Efficient convolutional neural networks
for mobile vision applications,” 2017, arXiv:1704.04861.
[37] W. Hua et al., “Channel gating neural networks,” in Proc. Neural Inf.
Process. Syst., 2019, vol. 32, pp. 1886–1896.
[38] C. Li, G. Wang, B. Wang, X. Liang, Z. Li, and X. Chang, “Dynamic
slimmable network,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern
Recognit., 2021, pp. 8607–8617.
[39] B. Jacob et al., “Quantization and training of neural networks for efficient
integer-arithmetic-only inference,” in Proc. IEEE/CVF Conf. Comput. Vis.
Pattern Recognit., 2018, pp. 2704–2713.
[40] M. Henning, J. C. Muller, F. Gies, M. Buchholz, and K. Dietmayer,
“Situation-aware environment perception using a multi-layer attention
map,” IEEE Trans. Intell. Veh., vol. 8, no. 1, pp. 481–491, Jan. 2023.
[41] Y. Chen, G. Li, S. Li, W. Wang, S. E. Li, and B. Cheng, “Exploring
behavioral patterns of lane change maneuvers for human-like autonomous
driving,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 9, pp. 14322–14335,
Sep. 2022.
[42] T. Rehder, A. Koenig, M. Goehl, L. Louis, and D. Schramm, “Lane change
intention awareness for assisted and automated driving on highways,”
IEEE Trans. Intell. Veh., vol. 4, no. 2, pp. 265–276, Jun. 2019.
[43] J. Zhang, C. Chang, X. Zeng, and L. Li, “Multi-agent DRL-based lane
change with right-of-way collaboration awareness,” IEEE Trans. Intell.
Transp. Syst., vol. 24, no. 1, pp. 854–869, Jan. 2023.
[44] X. He, H. Yang, Z. Hu, and C. Lv, “Robust lane change decision making for
autonomous vehicles: An observation adversarial reinforcement learning
approach,” IEEE Trans. Intell. Veh., vol. 8, no. 1, pp. 184–193, Jan. 2023.
[45] Y. Wang, D. Pan, H. Deng, Y. Jiang, and Z. Liu, “Dynamic trajectory
planning of autonomous lane change at medium and low speeds based on
elastic soft constraint of the safety domain,” Automot. Innov., vol. 3, no. 1,
pp. 73–87, Mar. 2020.
[46] G. Li, Y. Chen, D. Cao, X. Qu, B. Cheng, and K. Li, “Extraction of descrip-
tive driving patterns from driving data using unsupervised algorithms,”
Mech. Syst. Signal Process., vol. 156, Jul. 2021, Art. no. 107589.
[47] A. Dosovitskiy, G. Ros, F. Codevilla, A. López, and V. Koltun, “CARLA:
An open urban driving simulator,” in Proc. Conf. Robot. Learn., 2017,
vol. 78, pp. 1–16.
[48] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mo-
bileNetV2: Inverted residuals and linear bottlenecks,” in Proc. IEEE/CVF
Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4510–4520.
[49] A. Vaswani et al., “Attention is all you need,” in Proc. Adv. Neural Inf.
Process. Syst., 2017, pp. 6000–6010.
[50] P. Long, T. Fan, X. Liao, W. Liu, H. Zhang, and J. Pan, “Towards optimally
decentralized multi-robot collision avoidance via deep reinforcement
learning,” in Proc. IEEE Int. Conf. Robot. Automat., 2018, pp. 6252–6259.
[51] M. Bouton et al., “Reinforcement learning with probabilistic guarantees
for autonomous driving,” 2019, arXiv:1904.07189.
[52] X. Qi, Y. Luo, G. Wu, K. Boriboonsomsin, and M. Barth, “Deep reinforce-
ment learning enabled self-learning control for energy efficient driving,”
Transp. Res. Part C: Emerg. Technol., vol. 99, pp. 67–81, Feb. 2019.
[53] Y. Ye, X. Zhang, and J. Sun, “Automated vehicle’s behavior decision
making using deep reinforcement learning and high-fidelity simulation
environment,” Transp. Res. Part C: Emerg. Technol., vol. 107, pp. 155–170,
Oct. 2019.
[54] M. Zhu, Y. Wang, Z. Pu, J. Hu, X. Wang, and R. Ke, “Safe, efficient,
and comfortable velocity control based on reinforcement learning for
autonomous driving,” Transp. Res. Part C: Emerg. Technol., vol. 117, Aug. 2020,
Art. no. 102662.
[55] J. Duan, S. E. Li, Y. Guan, Q. Sun, and B. Cheng, “Hierarchical rein-
forcement learning for self-driving decision-making without reliance on
labelled driving data,” IET Intell. Transp. Syst., vol. 14, no. 5, pp. 297–305,
2020.
[56] B. R. Kiran et al., “Deep reinforcement learning for autonomous driving:
A survey,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 6, pp. 4909–4926,
Jun. 2022, doi: 10.1109/TITS.2021.3054625.
[57] F. Codevilla, M. Miiller, A. Lopez, V. Koltun, and A. Dosovitskiy, “End-
to-end driving via conditional imitation learning,” in Proc. IEEE Int. Conf.
Robot. Automat., 2018, pp. 4693–4700.
[58] G. Li, Y. Lin, and X. Qu, “An infrared and visible image fusion method
based on multi-scale transformation and norm optimization,” Inf. Fusion,
vol. 71, pp. 109–129, 2021.
[59] G. Guo and J. Liu, “A stochastic model-based fusion algorithm for en-
hanced localization of land vehicles,” IEEE Trans. Instrum. Meas., vol. 71,
2022, Art. no. 8500810, doi: 10.1109/TIM.2021.3137566.
[60] J. Liu and G. Guo, “Vehicle localization during GPS outages with extended
Kalman filter and deep learning,” IEEE Trans. Instrum. Meas., vol. 70,
2021, Art. no. 7503410, doi: 10.1109/TIM.2021.3097401.
[61] G. Li, S. Li, S. Li, and X. Qu, “Continuous decision-making for au-
tonomous driving at intersections using deep deterministic policy gra-
dient,” IET Intell. Transp. Syst., vol. 16, no. 12, pp. 1669–1681, 2021,
doi: 10.1049/itr2.12107.
[62] G. Li et al., “Deep reinforcement learning enabled decision-making for
autonomous driving at intersections,” Automot. Innov., vol. 3, no. 4,
pp. 374–385, Dec. 2020.
[63] B. Peng et al., “End-to-end autonomous driving through dueling double
deep Q-network,” Automot. Innov., vol. 4, no. 3, pp. 328–337, Aug. 2021.
[64] G. Li, S. E. Li, B. Cheng, and P. Green, “Estimation of driving style
in naturalistic highway traffic using maneuver transition probabilities,”
Transp. Res. Part C: Emerg. Technol., vol. 74, pp. 113–125, Jan. 2017.
Guofa Li (Member, IEEE) received the Ph.D. degree
in mechanical engineering from Tsinghua University,
Beijing, China, in 2016. He is currently a Professor
with the College of Mechanical and Vehicle Engineer-
ing, Chongqing University, Chongqing, China. He
has authored or coauthored more than 70 papers in his
research field, which include environment perception,
driver behavior analysis, and human-like decision-
making and control based on artificial intelligence
technologies in autonomous vehicles and intelligent
transportation systems. He was the recipient of the
Young Elite Scientists Sponsorship Program in China, and the best paper
awards from the China Association for Science and Technology (CAST) and the
Automotive Innovation Journal. He is an Associate Editor for IEEE SENSORS
JOURNAL, and a Guest Editor of IEEE Intelligent Transportation Systems Magazine
and Automotive Innovation.
Authorized licensed use limited to: Tsinghua University. Downloaded on May 20,2024 at 05:44:48 UTC from IEEE Xplore. Restrictions apply.
Yifan Qiu received the B.E. degree in 2021 from
Shenzhen University, Shenzhen, China, where he is
currently working toward the master’s degree with the
College of Mechatronics and Control Engineering.
His research focuses on using deep reinforcement learning technologies for the development of autonomous
vehicles.
Yifan Yang received the M.E. degree from Shen-
zhen University, Shenzhen, China, in 2021. He is
currently with the Autonomous Driving Group, Ten-
cent, Shenzhen, China. His research interests include
computer vision, deep reinforcement learning, and
machine learning in automotive and transportation
engineering. He has completed five projects on pedes-
trian recognition, object detection, image enhance-
ment, risk assessment, and decision making using
deep reinforcement learning for the development of
autonomous vehicles.
Zhenning Li received the B.S. and M.S. degrees
in transportation science and engineering from the
Harbin Institute of Technology, Harbin, China, in
2014 and 2016, respectively, and the Ph.D. degree
in civil engineering from the University of Hawaii at
Mānoa, Honolulu, HI, USA, in 2019. He is currently
an Assistant Professor with the State Key Labora-
tory of Internet of Things for Smart City and the
Department of Computer and Information Science,
University of Macau, Macau, China. His research
interests include connected autonomous vehicles and big data applications in urban transportation systems.
Shen Li (Member, IEEE) received the B.E. degree
from Jilin University, Changchun, China, in 2012,
and the Ph.D. degree from the University of Wis-
consin – Madison, Madison, WI, USA, in 2019. His
research interests include cooperative control methods of connected vehicles, autonomous driving safety,
intelligent transportation systems (ITS), architecture
design of CAVH system, traffic data mining based on
cellular data, and traffic operations and management.
He has participated in many research projects funded by the National Natural Science Foundation of China,
Ministry of Science and Technology (863 projects) and U.S. Department of
Transportation.
Wenbo Chu received the B.S. degree in automotive engineering from Tsinghua University, Beijing, China, in 2008, the M.S. degree in automotive engineering from RWTH Aachen University, Aachen, Germany, and the Ph.D. degree in mechanical engineering from Tsinghua University in 2014. He is
currently a Research Fellow with the Western China
Science City Innovation Center of Intelligent and
Connected Vehicles (Chongqing) Co., Ltd., and the National Innovation Center of Intelligent and Connected
Vehicles.
Paul Green received the M.S.E. and Ph.D. degrees
from the University of Michigan, Ann Arbor, MI,
USA, in 1974 and 1979, respectively. He is cur-
rently a Research Professor with the University of
Michigan Transportation Research Institute Driver
Interface Group, Ann Arbor, MI, USA, and an Ad-
junct Professor with the Department of Industrial
and Operations Engineering, University of Michigan.
He teaches automotive human factors and human-
computer interaction classes. He is the Leader of
the University’s Human Factors Engineering Short
Course, the flagship continuing education course in the profession, now in its 62nd year. His research interests include driving safety, driver interfaces, driver
behavior, driver workload, and the development of standards to get research into
practice. Prof. Green is the Past President of the Human Factors and Ergonomics
Society.
Shengbo Eben Li (Senior Member, IEEE) received
the M.S. and Ph.D. degrees from Tsinghua University,
Beijing, China, in 2006 and 2009, respectively. Before
joining Tsinghua University, he was with Stanford
University, Stanford, CA, USA, the University of Michigan, Ann Arbor, MI, USA, and UC Berkeley, Berkeley, CA, USA. He is currently a Professor leading the
Intelligent Driving Lab (iDLab), Tsinghua Univer-
sity. He is the author of more than 120 peer-reviewed
journal/conference papers, and the co-inventor of
more than 30 patents. His research interests include
intelligent vehicles and driver assistance systems, reinforcement learning and
optimal control, and distributed control and estimation. Dr. Li was the recipient
of Best Paper Award in IEEE ITSC 2020, ICCAS 2020, IEEE ICUS 2020, CCCC
2018/2019, ITSAPF 2015, and IEEE ITSC 2014. His important awards include
the National Award for Technological Invention of China in 2013, Excellent
Young Scholar of NSF China in 2016, Young Professor of ChangJiang Scholar
Program in 2016, National Award for Progress in Sci & Tech of China in 2018,
Distinguished Young Scholar of Beijing NSF in 2018, and Youth Sci & Tech
Innovation Leader from MOST in 2020. He is also a member of the Board of Governors of the IEEE Intelligent Transportation Systems Society. He is an Associate Editor
for IEEE TRANSACTIONS ON INTELLIGENT VEHICLES, IEEE TRANSACTIONS ON
INTELLIGENT TRANSPORTATION SYSTEMS, and IEEE Intelligent Transportation
Systems Magazine.