IEEE TRANSACTIONS ON INTELLIGENT VEHICLES, VOL. 8, NO. 3, MARCH 2023 2197
Lane Change Strategies for Autonomous Vehicles:
A Deep Reinforcement Learning Approach
Based on Transformer
Guofa Li, Member, IEEE, Yifan Qiu, Yifan Yang, Zhenning Li, Shen Li, Member, IEEE, Wenbo Chu,
Paul Green, and Shengbo Eben Li, Senior Member, IEEE
Abstract—End-to-end approaches are one of the most promising
solutions for autonomous vehicles (AVs) decision-making. However,
the deployment of these technologies is usually constrained by the
high computational burden. To alleviate this problem, we proposed
a lightweight transformer-based end-to-end model with risk aware-
ness ability for AV decision-making. Specifically, a lightweight
network with depth-wise separable convolution and transformer
modules was firstly proposed for image semantic extraction from
time sequences of trajectory data. Then, we assessed driving risk
by a probabilistic model with position uncertainty. This model was
integrated into deep reinforcement learning (DRL) to find strate-
gies with minimum expected risk. Finally, the proposedmethod was
evaluated in three lane change scenarios to validate its superiority.
Index Terms—Autonomous vehicles, decision-making,
reinforcement learning, lane change, transformer.
I. INTRODUCTION
AS REPORTED by the National Highway Traffic Safety
Administration (NHTSA) [1], 50000 fatal traffic accidents
are attributed to driving mistakes each year in the United States
Manuscript received 27 November 2022; accepted 5 December 2022. Date
of publication 9 December 2022; date of current version 27 April 2023. This
work was supported in part by the National Natural Science Foundation of China
under Grant 52272421, and in part by Shenzhen Fundamental Research Fund
under Grant JCYJ20190808142613246. (Corresponding author: Shen Li.)
Guofa Li is with the College of Mechanical and Vehicle Engineering,
Chongqing University, Chongqing 400044, China (e-mail: hanshan198@
gmail.com).
Yifan Qiu and Yifan Yang are with the College of Mechatronics and Con-
trol Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China
(e-mail: rye1222@qq.com; lvan0619@qq.com).
Zhenning Li is with the State Key Laboratory of Internet of Things for Smart
City and the Department of Computer and Information Science, University of
Macau, Macau 999078, China (e-mail: zhenningli@um.edu.mo).
Shen Li is with the School of Civil Engineering, Tsinghua University, Beijing
100084, China (e-mail: sli299@tsinghua.edu.cn).
Wenbo Chu is with the Western China Science City Innovation Center of Intelligent and Connected Vehicles (Chongqing) Co., Ltd., Chongqing 401329, China,
and also with the College of Mechanical and Vehicle Engineering, Chongqing
University, Chongqing 400044, China (e-mail: chuwenbo@wicv.cn).
Paul Green is with the University of Michigan Transportation Research
Institute (UMTRI) & Department of Industrial and Operations Engineering, Uni-
versity of Michigan, Ann Arbor, MI 48109 USA (e-mail: pagreen@umich.edu).
Shengbo Eben Li is with the State Key Lab of Automotive Safety and Energy,
School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China
(e-mail: lishbo@tsinghua.edu.cn).
Color versions of one or more figures in this article are available at
https://doi.org/10.1109/TIV.2022.3227921.
Digital Object Identifier 10.1109/TIV.2022.3227921
[2]. Statistics in China also show that over 90% of traffic ac-
cidents are related to driving mistakes [3]. Therefore, to help
drivers make reliable decisions and reduce the frequency of
human-caused accidents, numerous safety applications for lev-
els 1 and 2 autonomous vehicles have been developed in recent
years, such as advanced driver assistance systems (ADAS),
fatigue recognition systems, etc. Furthermore, academic researchers have begun to focus on designing active safety systems for higher-level autonomous vehicles, with heavy attention on
collision avoidance systems [4], [5], [6]. In the following para-
graphs, we summarize the influential approaches in the devel-
opment of decision-making systems for collision avoidance,
which can be categorized into motion planning-based methods,
risk estimation-based methods, and data-driven-based methods.
Specifically, supervised learning and reinforcement learning are
two principal categories for data-driven-based methods.
A. Motion Planning-Based Methods
A* and artificial potential field (APF) are two representative methods in the conventional motion planning-based category for collision avoidance decision-making. For instance, Dolgov et al. [7] searched a 3D kinematic state space via a variant of the A* method. Then, a numeric non-linear optimization method was further utilized to enhance the performance of the variant A* approach.
Huang et al. [8] proposed an APF method with different potential
functions for road boundaries after meshing the drivable area.
Subsequently, a local current comparison method was employed to generate a crash-free path. Nevertheless, these methods have two intrinsic drawbacks: 1) how they generate graphs (considering physical constraints) greatly affects their performance, and 2) they sometimes generate paths that are infeasible for vehicle kinematics.
To improve on this, other solutions that consider vehicle kinematics have been developed. Li et al. [9] introduced an optimization approach based on adaptive-scaling constraints for multi-agent travelling that considers vehicle dynamics to achieve time-optimal trajectory planning. Shen et al. [10] utilized a
predictive occupancy map (POM) to assess the potential risk
levels of surrounding vehicles based on vehicle kinematics. The
optimal path was then obtained by a random tree algorithm with
POM. Simon et al. [11] assumed that inevitable collisions were
inherently time-critical and thus introduced a novel method to
2379-8858 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Tsinghua University. Downloaded on May 20,2024 at 05:44:48 UTC from IEEE Xplore. Restrictions apply.
mitigate collisions based on vehicle kinematics. The proposed
method could capture the trajectory with the minimum execution
time by simulating with finite element modeling (FEM). However, given that the constraints for motion planning are usually nonlinear or nonconvex, the planning task may be an NP-hard problem that is difficult to solve.
B. Risk Estimation-Based Methods
Risk estimation-based methods first estimate the risk of the current driving state and then formulate a subsequent action policy in accordance with the risk estimation results. The modular or hierarchical design of these methods makes them more amenable to breakthroughs in autonomous vehicles [12]. Currently, deterministic approaches and probabilistic approaches are the two principal streams of risk estimation-based methods.
The deterministic approaches mainly estimate the occurrence
of a collision to infer the strategies for vehicle control. TTC
(time to collision) and THW (time headway) are the classical
evaluation metrics for driving safety [13], [14]. In single-lane scenarios, these risk estimation methods are comparatively accurate for longitudinal driving without computational burden [15]. However, because they barely consider the uncertainty of the input data, the derived policies are impractical for real-world applications and may perform unsatisfactorily in multi-lane scenarios [16].
To address the uncertainty problem, probabilistic descriptions are introduced for risk probability assessment in probabilistic approaches [17]. After fusing traditional metrics (e.g., TTC) into risk estimation using a Bayesian model, Noh [18] developed a rule-based expert system for subject vehicle control at intersections. Shin et al. [19] observed the uncertainties in the motion of remote vehicles via vehicle-to-vehicle (V2V) communication, which served as a reference for calculating the number of crashes within uncertainty boundaries. The intrinsic drawback of probabilistic approaches is that they only formulate rule-based strategies based on expert knowledge, which struggles with the complexity of realistic traffic environments and disregards human drivers' learning ability. Complex traffic environment details cannot always be effectively defined by countable rules, and it is also impossible to determine all the rules for all situations [9].
C. Data-Driven-Based Methods
With the discovery of the learning capability of neural networks, data-driven-based methods, including supervised learning and reinforcement learning, have become the mainstream for decision-making [20], [21]. Studies using supervised learning are booming in the development of autonomous vehicles.
Pan et al. [22] introduced a hybrid control policy network guided by a human expert and a model predictive control (MPC) expert. It only requires images taken with a monocular camera and the rolling speed to directly output the steering and throttle commands. Xu et al. [23] introduced an FCN-LSTM architecture to generate actions based on prior states of agents and videos taken
with a monocular camera. Moreover, this architecture leveraged imitation learning (IL) to improve performance, since semantic segmentation as a side task forces the FCN-LSTM architecture to learn interpretable feature representations. Unfortunately, since data collection in dangerous scenarios (e.g., inevitable collisions) is challenging, there is a gap between reality and training. These supervised methods often perform unsatisfactorily in realistic scenarios due to the lack of data from dangerous scenarios [24], [25].
To avoid the high-cost problem for data collection in dan-
gerous scenarios with supervised learning methods, researchers
utilized deep reinforcement learning (DRL) methods with af-
fordable trial-and-error to find the driving policy close to reality
for decision-making in autonomous vehicles based on driving
simulators [26], [27]. Different from rule-based methods, DRL-based methods learn how to drive from trial-and-error, making them adaptable to various situations [28], [29]. The learning criteria defined in DRL-based methods are just a few simple constraints or encouraging orientations, which are far fewer and simpler than the rules in rule-based methods [30]. Mirchevska et al. [31] developed a reinforcement learning approach based on a deep Q-learning network (DQN) for autonomous vehicles to take safe lane change actions in highway driving. To cope with the challenge of multi-agent collision avoidance, Fan et al. [33] proposed an innovative multi-stage RL-based architecture for safe and effective navigation in dense traffic with pedestrians. Chen et al. [34] designed a network with a hierarchical structure that maintained an overarching policy and underlying operation instructions simultaneously. However, the heavy time cost of model training is still a problem for DRL-based methods [35]. Therefore, the development of lightweight DRL models has been attracting researchers' attention in recent years.
D. Lightweight Model Design
Feature extraction from data with a fair amount of redun-
dant information, such as images, is computationally expen-
sive. Since reinforcement learning-based models are mainly trained online, models with a huge number of parameters cannot satisfy real-time requirements.
for lightweight model design have been developed for practical
applications. Howard et al. [36] proposed a novel lightweight
method with a pre-defined architecture to reduce the number
of convolution calculations to only 1/9 of the number when
using the vanilla convolution. Hua et al. [37] introduced a
dynamic pruning method, named channel gating, to optimize
CNN inference by utilizing input-specific features. By identify-
ing the regions with insignificant contributions, channel gating
will dynamically skip weight propagation for these ineffective
regions to ease the calculation burden. However, such dynamic methods can hardly achieve their theoretical acceleration in real implementations due to extra computation overhead (e.g., indexing, zero-masking, weight-copying). To achieve hardware-efficient acceleration, [38] dynamically sliced the network parameters to realize static and contiguous storage in hardware. As for the computation itself, Jacob et al. [39] proposed a quantization scheme to obtain integer-only models, which avoids the huge calculation cost of floating-point inference. Apart from these approaches, Henning et al. [40] proposed
a multi-layer attention map (MLAM) to only process the relevant
data, which mitigates the high redundancy in feature extraction
for environment perception. Although advances have been made in recent years, far more effort is still needed in lightweight DRL model design, especially in the area of risk-aware decision-making for autonomous vehicles in intelligent transportation systems.
E. Contributions
It has been widely accepted that lane change is one of the most
commonly adopted maneuvers in naturalistic driving [41], [42],
[43], [44]. In order to develop human-like autonomous driving technologies that avoid conflicts or crashes caused by inconsistencies between human drivers and artificial drivers, automatic lane change systems should be well designed for autonomous driving [45], [46]. However, current end-to-end automatic lane change models usually suffer from high computational cost or risk insensitivity and may not be useful for high-speed lane change scenarios. To overcome these drawbacks, we propose
an innovative method that allows agents to learn strategies with
the minimal expected risk at a low computational burden for
highway lane change. Firstly, we proposed a lightweight image
semantic extraction network based on depth-wise separable
convolutions and used transformers to merge the image seman-
tic contexts in time series. Next, we proposed a quantization
approach containing positional uncertainty based on Bayesian
theory for risk assessment, which was then introduced into DRL
to find the policy with minimal expected risk. Lastly, several virtual scenarios were built in the CARLA (Car Learning to Act) driving simulator [47] to evaluate the performance of our method.
The key contributions of our work are summarized as follows:
1) An innovative end-to-end model based on depth-wise separable convolution, with low computational burden, and a transformer network is newly proposed for lane change decision-making in autonomous driving. To the best of our knowledge, the combined use of depth-wise separable convolution and a transformer in DRL-based architectures for lane change decision inference has never been reported in the previous literature.
2) The driving policy with minimal expected risk is creatively integrated into DRL-based architectures for safe lane change, endowing the autonomous vehicle with risk awareness.
3) Three lane change scenarios with different difficulties (i.e., one with stationary vehicles, one with moving vehicles, and one with accelerating, decelerating, and lane-changing vehicles) are designed to evaluate the performances of the examined methods.
4) The lightweight characteristic and superior performance
of our proposed approach can facilitate the development
of autonomous driving technologies in various driving
scenarios for intelligent transportation.
F. Paper Structure
The rest of this paper is structured as follows. The problem statement and previous solutions are presented in Section II. The proposed methodology and the details of the deep reinforcement learning framework for decision-making are described in Section III and Section IV, respectively. The experiments in CARLA are detailed in Section V. Section VI presents the experimental results. Lastly, the conclusions of this paper are drawn in Section VII.
II. PROBLEM STATEMENT AND PREVIOUS SOLUTIONS
Generally, in the DRL framework, the agent drives in an uncertain environment by selecting a sequence of actions over several continuous time steps. It then receives rewards as feedback from its interaction with the environment. Finally, a strategy with the maximum cumulative reward will be chosen. In this study, a lane change process can be briefly described as a Markov decision process (MDP):

$$\mathcal{M} = \langle S, A, P, R \rangle \tag{1}$$

where $S = \{s_0, s_1, \ldots, s_t\}$ indicates the set of states, $A = \{a_0, a_1, \ldots, a_t\}$ indicates the set of actions, $P: S \times A \times S \to [0, 1]$ indicates the transition probability between states, and $R: S \times A \times S \to \mathbb{R}$ indicates the reward.
In particular, a sequence of actions adapted to the particular scenario will be chosen according to a stochastic strategy $\pi: S \to P(A)$, where $P(A)$ indicates the probability that an action in $A$ will be chosen following the strategy $\pi$. A trajectory is produced through this process, which can be denoted as $\tau = s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_t, a_t, r_t$, and the optimal strategy $\pi^*$ with the maximum expected cumulative reward can be found:

$$\pi^* = \arg\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{+\infty} \gamma^t r_{t+1} \,\middle|\, s_0 = s\right] \tag{2}$$

where $\gamma \in [0, 1]$ is a parameter that controls the weight of the next time step reward $r_{t+1}$, and $\pi^*$ indicates the optimal strategy.
However, (2) is hard to solve directly. To address this issue, a Q-value function is used for strategy optimization:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{+\infty} \gamma^t r_{t+1} \,\middle|\, s_0 = s, a_0 = a\right] \tag{3}$$

where $Q^{\pi}(s, a)$ indicates the expected cumulative reward obtained from the initial condition (state $s$ and action $a$) by following strategy $\pi$. Subsequently, the strategy $\pi$ can be improved by choosing the action $a$ that maximizes the Q-value, i.e., $\pi(s) = \arg\max_a Q^{\pi}(s, a)$. Thus, as an equivalent solution to (2), an optimal strategy $\pi^*$ with maximum $Q^{\pi}(s, a)$ will be generated, i.e., $Q^{\pi^*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$.
A. Deep Q-Network (DQN)
In the DQN architecture [26], Q-values are estimated by two networks (i.e., $Q_{target}$ and $Q_{online}$). A transition $(s_t, a_t, r_t, s_{t+1})$ is generated through $Q_{online}$ and stored in memory $M$. Randomly sampling from memory $M$ reduces the correlation of the data used to update the network. $Q_{target}$ outputs the target Q-value to calculate the temporal difference (TD) error.
Fig. 1. The framework of DQN.
Fig. 2. The framework of PRDQN.
The weights of $Q_{online}$ are then updated according to the obtained TD error. The separation of the two processes improves the stability of the network. The computational framework is shown in Fig. 1, and the loss function is described as:

$$L = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim M}\left[(y - Q(s_t, a_t; \theta))^2\right], \quad y = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta') \tag{4}$$

where $(s_t, a_t, r_t, s_{t+1})$ indicates a transition sampled from memory $M$, and $\theta$ and $\theta'$ respectively indicate the weights of $Q_{online}$ and $Q_{target}$.
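The TD target and squared loss of (4) can be sketched numerically as follows; all numbers below are illustrative stand-ins for network outputs, not values from the paper's model:

```python
# Minimal sketch of the TD target and loss in (4); rewards, Q-values,
# and terminal flags here are made-up illustrative numbers.
GAMMA = 0.99

def td_target(r, q_next_all, done):
    # y = r + gamma * max_a' Q_target(s_{t+1}, a'); no bootstrap at terminals
    return r if done else r + GAMMA * max(q_next_all)

batch = [
    # (reward, Q_target(s_{t+1}, .) per action, done, Q_online(s_t, a_t))
    (0.1, [0.5, 0.9, 0.2], False, 0.8),
    (-1.0, [0.4, 0.1, 0.3], True, -0.5),
]
targets = [td_target(r, q_next, d) for r, q_next, d, _ in batch]
loss = sum((y - q) ** 2 for (_, _, _, q), y in zip(batch, targets)) / len(batch)
print([round(y, 3) for y in targets], round(loss, 4))
```

Only $Q_{online}$'s weights $\theta$ would be updated from this loss; $\theta'$ is held fixed between periodic synchronizations, which is the stabilizing separation described above.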
B. Prioritized Replay Deep Q-Learning Network (PRDQN)
Memory replay in DQN with a uniform sampling policy does not favor samples with higher temporal difference errors, because samples with minor TD errors are just as likely to be drawn. To address this problem, Schaul et al. [27] developed a prioritized replay based on DQN (i.e., PRDQN), prioritizing samples with higher TD errors for learning. The sampling probability of sample $i$ is described as:

$$P(i) = \frac{p_i^{a}}{\sum_k p_k^{a}} \tag{5}$$

where $p$ indicates the TD error and $a$ is a pre-defined parameter that controls the priority.

In addition, gradient descent in prioritized experience replay is weighted by the importance-sampling weight of each sample, which is described as:

$$w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^{\beta} \tag{6}$$

where $N$ indicates the number of replay experiences, and $\beta$ indicates a pre-defined parameter.
The whole computational framework of PRDQN is shown in
Fig. 2.
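Equations (5) and (6) can be sketched together as follows; the TD errors and the exponents $a$ and $\beta$ below are illustrative values, not the paper's settings:

```python
import random

# Sketch of prioritized sampling per (5) and importance-sampling
# weights per (6); TD errors and exponents are illustrative.
td_errors = [0.05, 0.50, 1.20, 0.10]   # |TD error| p_i per stored sample
a, beta = 0.6, 0.4                      # priority / correction exponents

p = [e ** a for e in td_errors]
P = [x / sum(p) for x in p]             # eq. (5): sampling probabilities
N = len(P)
w = [(1.0 / (N * Pi)) ** beta for Pi in P]   # eq. (6): IS weights
w = [wi / max(w) for wi in w]                # common normalization to <= 1

random.seed(0)
i = random.choices(range(N), weights=P)[0]   # biased toward large TD errors
print(i, [round(x, 3) for x in P])
```

Note the two exponents pull in opposite directions: large-TD-error samples are drawn more often via (5), but their gradient contribution is down-weighted via (6) to correct the sampling bias.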
Fig. 3. The whole architecture of our proposed approach.
Fig. 4. The computational flow of bottleneck.
III. PROPOSED APPROACH
We propose an end-to-end lightweight architecture with risk awareness to make decisions for autonomous vehicles. First, a lightweight semantic extraction network based on depth-wise separable convolution and a transformer is introduced for image sequence processing. Then, we introduce our risk assessment module and apply it in the proposed DRL-based method to obtain a driving policy with the minimal expected risk. The whole architecture is illustrated in Fig. 3.
A. End-to-End Lightweight Network
To alleviate the computational burden of the decision-making network, we introduce a depth-wise separable convolution [36] based module, called a bottleneck [48], to build the image semantic extraction network. This module consists of Conv 1×1 and Dwise 3×3 layers. The former is designed for dimension adjustment and the latter is designed to decrease the cost of the standard 3×3 convolution. To alleviate the computational burden while conserving as much information as possible, the computational flow of the bottleneck is designed as shown in Fig. 4.
The whole image semantic network configuration (DSCNN:
depth-wise separable convolution neural network) is shown in
Table I. The parameter details show that our semantic extraction network (i.e., DSCNN) requires only 22.11 MFLOPs (floating-point operations) of inference cost and 0.92M parameters, which
TABLE I
PARAMETERS OF THE IMAGE SEMANTIC NETWORK DSCNN AND VCNN (T: EXPANSION RATIO, C: OUTPUT CHANNELS, N: REPEAT TIMES, S: THE NUMBER OF STRIDES, FLOPS: FLOATING-POINT OPERATIONS, M: 10^6, GAP: GLOBAL AVERAGE POOLING)
Fig. 5. Attention module.
is a very small amount of computation, whereas the vanilla CNN (VCNN) requires 40.01 MFLOPs of inference cost and 1.43M parameters. Clearly, DSCNN has a lower computational burden and fewer parameters than VCNN, indicating that DSCNN is more appropriate for real-time applications.
To date, the transformer has achieved better parallelization than traditional LSTM-based methods, and it has a strong capability to build embeddings over longer sequences based on the relationships of all features [49]. Therefore, to make the agent aware of changes in the traffic environment, we introduce the transformer [49] to mix the image semantic context of time sequences in this study. We feed video frames as input to the model and map them into the action space to infer an action. The transformer is used for feature extraction based on global attention, and it is composed of multi-head attention units derived from the scaled dot-product attention (a self-attention criterion). The self-attention criterion projects the input embedding into three vectors $Q$, $K$, and $V$. Firstly, the scaled dot-product attention is calculated according to (7). The corresponding diagram is shown in Fig. 5.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{7}$$

where $Q$ is a query vector, $K$ is a key vector, $V$ is a value vector, and $d_k$ is a normalization factor.
Then, $h$ parallel scaled dot-product attention modules are merged to generate the multi-head attention module, which means that self-attention is calculated $h$ times with $Q$, $K$, and $V$ by scaled dot-product attention modules with different weights. The computation flow is shown in Fig. 5 and the corresponding equation is shown as:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O}, \quad \mathrm{head}_j = \mathrm{Attention}\left(W_j^{Q}Q, W_j^{K}K, W_j^{V}V\right) \tag{8}$$

where $W$ indicates a weight matrix, $W_j^{Q} \in \mathbb{R}^{d_{model} \times d_q}$, $W_j^{K} \in \mathbb{R}^{d_{model} \times d_k}$, $W_j^{V} \in \mathbb{R}^{d_{model} \times d_v}$, $W^{O} \in \mathbb{R}^{hd_v \times d_{model}}$, and $d_v$ and $d_{model}$ are the dimensions of the value vector $V$ and the model, respectively.
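Equations (7) and (8) can be sketched in a few lines of numpy; the toy dimensions below are illustrative, not the paper's $d_{model} = 512$, $h = 8$ configuration:

```python
import numpy as np

# Minimal numpy sketch of eq. (7)-(8) with toy dimensions.
def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V   # eq. (7)

def multi_head(x, Wq, Wk, Wv, Wo):
    # One head per (Wq, Wk, Wv) triple; heads are concatenated, then Wo mixes them.
    heads = [attention(x @ wq, x @ wk, x @ wv) for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo    # eq. (8)

rng = np.random.default_rng(0)
seq, d_model, h, d_v = 5, 16, 2, 8
x = rng.normal(size=(seq, d_model))
Wq = [rng.normal(size=(d_model, d_v)) for _ in range(h)]
Wk = [rng.normal(size=(d_model, d_v)) for _ in range(h)]
Wv = [rng.normal(size=(d_model, d_v)) for _ in range(h)]
Wo = rng.normal(size=(h * d_v, d_model))
out = multi_head(x, Wq, Wk, Wv, Wo)
print(out.shape)  # same sequence length and model width as the input
```

The output keeps the input's sequence length and model width, which is what allows the transformer blocks to be stacked $N$ times.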
Finally, the complete end-to-end network is shown in Fig. 6. We set the above-mentioned $d_v$ and $d_{model}$ to 64 and 512, respectively. The $N$ in Fig. 6 and the $h$ in (8) are set to 6 and 8, respectively. All the parameters mentioned above are as recommended by the authors of the transformer in [49].
B. Risk Assessment
Different from the deterministic approaches that only predict risk occurrence, our risk assessment method can hierarchically evaluate the risk level of the host vehicle (HV) as:

$$\tau \in \Omega = \{\text{dangerous}, \text{attentive}, \text{safe}\} \overset{\text{def}}{=} \{D, A, S\} \tag{9}$$

where $\tau$ and $\Omega$ respectively denote the risk level and the set of risk levels.

Taking the uncertainty $\sigma$ and the relative distance $d$ to other vehicles (OVs) into consideration, we take two stages for risk assessment with the probabilistic approach: 1) hierarchically computing the conditional probability based on the distribution of safety metrics, and 2) determining the risk level of a specific state based on Bayesian inference.

The distribution of safety metrics is defined as follows:

$$P(d\,|\,\tau = D) = \begin{cases} e^{-\frac{\Delta d_D^2}{2\sigma^2}}, & \text{if } d \ge d_D \\ 1, & \text{otherwise} \end{cases}$$
$$P(d\,|\,\tau = A) = e^{-\frac{\Delta d_A^2}{2\sigma^2}}$$
$$P(d\,|\,\tau = S) = \begin{cases} e^{-\frac{\Delta d_S^2}{2\sigma^2}}, & \text{if } d \le d_S \\ 1, & \text{otherwise} \end{cases}$$
$$\Delta d_i = |d - d_i|, \quad i \in \{D, A, S\} \tag{10}$$

where $d$ is the relative distance (from the HV to OVs), and $d_D$, $d_A$, and $d_S$ are hyper-parameters that determine the risk level. These parameters defined in advance (i.e., $d_D$, $d_A$, $d_S$, and $\sigma$) are leveraged to smooth the curves at the different risk levels. Fig. 7 [30] is the visual representation of (10). In order to make the risk distributions smooth, we set these hyper-parameters to reasonable values according to the visualized prior distribution of risk. More details on the determination of these hyper-parameters can be found in [17] and [30].
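The conditional likelihoods in (10) can be sketched as follows; the thresholds $d_D$, $d_A$, $d_S$ and the $\sigma$ below are illustrative placeholders, not the paper's tuned values:

```python
import math

# Sketch of the conditional likelihoods in (10); the thresholds and
# sigma are illustrative, not the paper's tuned hyper-parameters.
D_D, D_A, D_S, SIGMA = 5.0, 15.0, 25.0, 5.0

def likelihoods(d):
    g = lambda d_i: math.exp(-((d - d_i) ** 2) / (2 * SIGMA ** 2))
    return {
        "D": 1.0 if d < D_D else g(D_D),   # dangerous: saturates at close range
        "A": g(D_A),                        # attentive: bell centered at d_A
        "S": 1.0 if d > D_S else g(D_S),   # safe: saturates at long range
    }

close, far = likelihoods(3.0), likelihoods(40.0)
print(close["D"], far["S"])  # each saturates to 1.0 in its own regime
```

The Gaussian tails make each likelihood decay smoothly outside its regime, which is exactly the curve-smoothing role of $\sigma$ described above.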
Fig. 6. DSCNN transformer: The proposed end-to-end lightweight decision-making network.
Fig. 7. The concrete risk curves of (10).
According to Bayesian inference, the posterior probability of risk level $\tau$ can be determined as:

$$P(\tau\,|\,d) = \frac{P(d\,|\,\tau) \cdot P(\tau)}{\sum_{\tau \in \Omega} P(\tau) \cdot P(d\,|\,\tau)} \tag{11}$$

where $P(\tau\,|\,d)$ indicates the posterior probability of risk level $\tau$ at each settled relative distance $d$, $P(d\,|\,\tau)$ indicates the conditional probability determined by (10), and $P(\tau)$ indicates the prior probability of risk level $\tau$. For convenience, a uniform prior is set over the distinct risk levels under the constraint $\sum_{\tau \in \Omega} P(\tau) = 1$.
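With the uniform prior, (11) reduces to normalizing the likelihoods; a minimal sketch (the likelihood values below are illustrative stand-ins for the output of (10)):

```python
# Sketch of the posterior in (11) with a uniform prior; the likelihood
# values are illustrative stand-ins for the output of (10).
def posterior(likelihood):
    prior = {t: 1.0 / len(likelihood) for t in likelihood}  # uniform P(tau)
    evidence = sum(prior[t] * likelihood[t] for t in likelihood)
    return {t: prior[t] * likelihood[t] / evidence for t in likelihood}

post = posterior({"D": 0.9, "A": 0.4, "S": 0.05})
print({t: round(p, 3) for t, p in post.items()})
```

The posterior always sums to one, and with a uniform prior the most likely level is simply the one with the largest conditional likelihood.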
C. Decision-Making With Minimal Expected Risk
In order to seek the policy with the minimal expected risk, we incorporate the output of risk evaluation into the DRL-based methods for more satisfactory safe-driving performance. Nevertheless, the output of risk evaluation (i.e., $P(\tau\,|\,d)$) is discrete, making it inapplicable for continuous inference using DRL methods. To solve this problem, a continuous risk parameter $\varepsilon$ is calculated in (12) from the risk levels $\tau$. Because the abbreviated letters (i.e., $D$, $A$, and $S$) in (9) cannot be directly used in the calculation of the expectation, we respectively assign $D$, $A$, and $S$ the scores 2, 1, and 0 (i.e., $\tau \in \Omega \overset{\text{def}}{=} \{2, 1, 0\}$) for mathematical calculation.

$$\varepsilon = \mathbb{E}(\tau) = \sum_{\tau \in \Omega} \tau \cdot P(\tau\,|\,d) = \sum_{\tau \in \{2,1\}} \tau \cdot P(\tau\,|\,d) \tag{12}$$

where $\tau$ denotes the discrete risk levels, and $\varepsilon$ indicates the expectation used for the continuous transformation of risk.

Subsequently, (13) generates a policy with minimal expected risk based on the continuously quantified driving risk:

$$\pi^* = \arg\min_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{+\infty} \gamma^t \varepsilon_{t+1} \,\middle|\, s_0 = s\right] \tag{13}$$

An equivalent expression is written as:

$$\pi^* = \arg\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{+\infty} \gamma^t (\max\varepsilon - \varepsilon_{t+1}) \,\middle|\, s_0 = s\right] \tag{14}$$

where $\max\varepsilon$ indicates the maximal value of the defined risk, i.e., $\max\varepsilon = 2$.
A similar expression, $r_{t+1} = \max\varepsilon - \varepsilon_{t+1}$, can be observed by comparing (2) and (14). Thus, the corresponding Q-value function can be determined as (15) to solve the problem:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{+\infty} \gamma^t (\max\varepsilon - \varepsilon_{t+1}) \,\middle|\, s_0 = s, a_0 = a\right] \tag{15}$$

The DQN outputs are the Q-values of each candidate action. By maximizing (15) as in (16), the action with the maximal Q-value will be selected; thus, the actions chosen by DQN at each step are independent.

$$a^* = \arg\max_{a} Q^{\pi}(s, a) \tag{16}$$
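The chain from a risk posterior to the reward term used inside the Q-function of (15) can be sketched as follows; the posterior values are illustrative:

```python
# Sketch tying (12)-(15) together: from a risk posterior to the reward
# term max(eps) - eps; the posterior values are illustrative.
SCORES = {"D": 2, "A": 1, "S": 0}  # eq. (9) levels mapped to {2, 1, 0}
MAX_EPS = 2.0

def expected_risk(post):
    return sum(SCORES[t] * post[t] for t in post)  # eq. (12)

post = {"D": 0.2, "A": 0.5, "S": 0.3}
eps = expected_risk(post)
reward = MAX_EPS - eps       # r_{t+1} in (14)-(15): low risk -> high reward
print(eps, reward)
```

Because the safe level scores 0, only the dangerous and attentive terms contribute, which is why the sum in (12) reduces to $\tau \in \{2, 1\}$.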
TABLE II
CAMERA DETAILS (HEIGHT AND WIDTH: RAW SIZE OF THE COLLECTED IMAGES, FOV: FIELD OF VIEW, FREQ: IMAGE COLLECTION FREQUENCY, POSE: THE WORLD COORDINATE OF THE CAMERA. THE UNLISTED X, Y, YAW, ROLL, AND PITCH ARE ALL 0.)
where $a^*$ denotes the optimal action with the maximal Q-value chosen by DQN, and $Q^{\pi}(s, a)$ is the Q-value function defined in (15).
IV. DRL-BASED DRIVING DECISION-MAKING
A. State and Action
The state space consists of images from a vehicle camera in
our approach. The camera captures the environment images with
a pre-defined frequency at 50 Hz. The most recent 5 images (0.1
second per image) in the last 0.5s are used to represent the state,
ensuring that the agent can be aware of the environment changes
from the images with dynamically changing information. The
details of camera for data collection are introduced in Table II.
Our proposed method considers longitudinal and lateral control via steering and throttle actions in the designed autonomous driving strategies. The brake action is reserved for human drivers instead of the DRL agent, to prevent over-conservative behaviors and improve travel efficiency. Despite omitting the brake action, our method retains efficient performance due to the well-designed methodology, which is supported by our obtained results. Based on the above, the final action space at a given time $t$ (i.e., $a_t$) is defined as:

$$a_t \in \{LTL_t, LTS_t, S_t, RTS_t, RTL_t\} \tag{17}$$

where $LTL_t$ and $RTL_t$ indicate intense steering for left-turn and right-turn, i.e., $\pm 0.5$ (+ denotes left-turn and − denotes right-turn), $LTS_t$ and $RTS_t$ indicate slight steering for left-turn and right-turn, i.e., $\pm 0.1$, and $S_t$ indicates that the host agent keeps driving straight with zero steering.
DQN-based agents with discrete actions usually compromise driving comfort [26]. The trajectories generated by DQN-based methods are often rough [30] because DQN-based methods are only appropriate for discrete action spaces. To alleviate this problem, an exponential moving average strategy [30] is adopted to smooth the motion path. Both the previously executed action and the action chosen by the DQN method at the current step are considered to smooth the gap between two consecutive discrete actions:

$$a_t^* = a_{t-1} + \gamma (a_t - a_{t-1}) \tag{18}$$

where $a_t^*$ is the smoothed action, $\gamma$ is an invariable parameter defined in advance for smoothing adjustment, and $a_{t-1}$ and $a_t$ are the actions taken by DQN-based models at times $t - 1$ and $t$, respectively.
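The smoothing rule in (18) can be sketched over a short steering sequence; the $\gamma$ value and the raw action sequence below are illustrative:

```python
# Sketch of the action smoothing in (18); gamma and the raw discrete
# steering sequence are illustrative values.
GAMMA = 0.4

def smooth(actions, gamma=GAMMA):
    out, prev = [], 0.0  # assume straight driving (steering 0) initially
    for a in actions:
        prev = prev + gamma * (a - prev)  # eq. (18)
        out.append(round(prev, 4))
    return out

raw = [0.5, 0.5, -0.1, 0.0]   # abrupt discrete steering choices
print(smooth(raw))             # gradual transitions between them
```

Each smoothed action moves only a fraction $\gamma$ of the way toward the newly chosen discrete action, so abrupt jumps (e.g., from $+0.5$ to $-0.1$) become gradual steering transitions.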
B. Reward Function
In order to generate a policy to ensure driving safety, driving
risk should be considered in priority. Therefore, the reward of
risk is written as:
rrisk =maxε−εt(19)
where rrisk is the reward of driving risk, and εtis the estimated
risk at time t, and max εindicates the maximal value of risk.
In reality, traffic rules should be considered in the design of
autonomous driving strategies. Vehicle collisions always suffer
from illegal lane changes. Unlike the binary penalty for illegal
lane change in [32], we propose a soft penalty to strengthen
the awareness that the HV should avoid lane invasion for safe
driving. A greater relative distance between the HV and road
boundary corresponds to a smaller penalty and thus the soft
penalty is defined as:
rinvasion =−e−(lald−lahv)2
2σ2(20)
where lald indicates the road boundary, lahv indicates the lateral
position of the host agent, the uncertainty is described by σ.
Besides, rexist is designed to encourage the HV to keep driving,
following the above lane and boundary rules, as long as possible
without a crash:
rexist = 0.1, if survive; −1, otherwise (21)
where 'survive' denotes that the HV drives within lane bound-
aries with no crash. The reward values of rexist are determined
according to [30] and [50].
According to (19), (20), and (21), we can obtain that rrisk ∈
[0, 2], rinvasion ∈ [−1, 0], and rexist ∈ {−1, 0.1}. Reducing driv-
ing risk has been well accepted as the top priority in the devel-
opment of autonomous driving technologies [51]. Therefore, it
is reasonable that the upper bound of rrisk is twice the corre-
sponding absolute values of the other sub-rewards. According
to the well-accepted simplification in RL [26], [52], the weights
of the sub-rewards are set to a constant value (i.e., 1). Compre-
hensively considering all these reward elements, the holistic
reward function is designed as:
r = rrisk + rinvasion + rexist (22)
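The composition of (19)-(22) can be sketched as follows, assuming max ε = 2 so that rrisk ∈ [0, 2] as stated above; the function and parameter names are illustrative, not the authors' code.

```python
import math

MAX_RISK = 2.0  # assumed value of max(eps) so that r_risk lies in [0, 2]

def reward(eps_t: float, la_ld: float, la_hv: float,
           sigma: float, survive: bool) -> float:
    """Holistic reward of Eq. (22) from the three sub-rewards."""
    r_risk = MAX_RISK - eps_t                                         # Eq. (19)
    r_invasion = -math.exp(-(la_ld - la_hv) ** 2 / (2 * sigma ** 2))  # Eq. (20)
    r_exist = 0.1 if survive else -1.0                                # Eq. (21)
    return r_risk + r_invasion + r_exist                              # Eq. (22)

# Zero risk but hugging the boundary (la_hv == la_ld): full invasion penalty.
r_at_boundary = reward(0.0, 3.5, 3.5, 0.5, True)   # 2.0 - 1.0 + 0.1 = 1.1
```

Note how the Gaussian form of rinvasion penalizes the agent most when its lateral position coincides with the boundary and fades toward zero in the lane center.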
C. Training Details
To decrease the variance when updating the network, we train
the model by involving the technologies including warm-up
learning rate, gradient clip, and soft update.
Warm-up learning rate: Because gradient variance is large at the
beginning of training, the neural network is optimized with a
warm-up learning rate strategy so that it updates steadily. In
other words, a small learning rate is used in the early optimiza-
tion process, and the learning rate is then raised to its nominal
value. In practice, the learning rate of DRL is initially set
to 0.01 and then changed to 0.1 after 50 episodes.
Authorized licensed use limited to: Tsinghua University. Downloaded on May 20,2024 at 05:44:48 UTC from IEEE Xplore. Restrictions apply.
Fig. 8. Description of the OV locations in scenario-I.
Gradient clip: Gradient clipping is a prevalent mitigation for
gradient explosion, which can be calculated as:
grad∗i = gradi · clipnorm / max(norm(gradi), clipnorm) (23)
where gradi and grad∗i indicate the original and clipped gradi-
ents in layer i; norm indicates the gradient norm computation;
and clipnorm denotes a hyper-parameter bounding the gradient
norm after clipping, which is usually set to 0.1 to mitigate
volatility in the training process.
Soft update: Unlike hard network updating, soft updating makes
the target network slowly track the weights of the online network,
which is defined as:
θtarget = (1 − η) · θtarget + η · θonline (24)
where θonline and θtarget are the weights of the online and target
networks, and η denotes a parameter controlling the target
network updating speed, which is usually set to 0.01.
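The three training techniques above can be sketched together in plain Python. The interpretation of norm(·) in (23) as the L2 norm, and all function names, are assumptions for illustration.

```python
import math

def warmup_lr(episode: int, warm_lr: float = 0.01,
              base_lr: float = 0.1, warm_episodes: int = 50) -> float:
    """Warm-up schedule: a small learning rate early on, then the
    nominal value (0.01 -> 0.1 after 50 episodes, as in the text)."""
    return warm_lr if episode < warm_episodes else base_lr

def clip_gradient(grad: list, clipnorm: float = 0.1) -> list:
    """Eq. (23): rescale a layer's gradient so its norm is at most clipnorm.
    norm(.) is taken here as the L2 norm (an assumption)."""
    norm = math.sqrt(sum(g * g for g in grad))
    return [g * clipnorm / max(norm, clipnorm) for g in grad]

def soft_update(theta_target: list, theta_online: list,
                eta: float = 0.01) -> list:
    """Eq. (24): the target network slowly tracks the online network."""
    return [(1 - eta) * t + eta * o for t, o in zip(theta_target, theta_online)]
```

Note that gradients whose norm is already below clipnorm pass through unchanged, because max(norm, clipnorm) then equals clipnorm and the scale factor is 1.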
V. S IMULATION SCENARIOS
Most approaches based on DRL methods are developed based
on simulators [53], [54], [55], [56] instead of real world tests to
prevent the unaffordable trial-and-error cost. In this study, we
design three lane change scenarios (stationary vehicles, moving
vehicles, and moving vehicles with acceleration, deceleration,
and lane change) in a prevalent simulator called Car Learning to
Act (CARLA) [47] to examine the effectiveness of our proposed
method and the compared methods.
Scenario-I (stationary vehicles): Several motionless vehicles
(10 to 26) are randomly placed on a 420 m road. To prevent
two parallel cars from blocking the road, each road segment
(i.e., 60 m) is divided into four sub-segments. Each vehicle
(including the HV and OVs) is independently placed in one sub-
segment. The position and lane choice of each placed vehicle are
randomly initialized with a Gaussian-based sampling method.
The HV is expected to drive forward safely without any crash.
See Fig. 8 for more details.
The longest straight road in CARLA is 420 m, which is not
sufficient to examine the effectiveness of our proposed method.
To solve this problem, the experiment with randomly distributed
vehicles is repeated 100 times (i.e., 100 episodes) on the same
420 m road in the evaluation stage. In total, the HV runs 42 km
with 1600 lane changes in the test phase, which means that
the HV needs to change lanes about four times per 100 m of
driving. This 420 m road is commonly used for autonomous
driving technology development in CARLA [47].
TABLE III
EVALUATION METRICS OF THE EXAMINED METHODS IN SCENARIO-I
Scenario-II (moving vehicles): All the settings in this scenario
(such as the initial positioning strategy, the task of the HV, and
the number of episodes) are the same as in scenario-I. The
difference is that all the OVs run with a speed limit of 30 m/s.
Scenario-III (moving vehicles with acceleration, deceleration,
and lane change): In this scenario, each OV has a probability p
(p = 0.5) to accelerate, decelerate, or change lanes. The steering
value varies from −1.0 to 1.0. The acceleration and deceleration
ranges are (0, 0.2) and (−0.1, 0), respectively. All the other
settings are the same as in scenario-II.
The initial speed of the HV in all the examined scenarios is 0.
The driving actions inferred by the examined methods drive the
HV to overtake the static or moving obstacles. Therefore, the
speed of the HV, controlled by the inferred driving actions, is
dynamically changed according to the estimated driving risk.
Given that the speed limit of the OVs is 30 m/s, the speed of the
HV generally needs to be higher than 30 m/s in scenario-II.
Apart from these statements, some details about rrisk need
attention. To mitigate trajectory fluctuation when going
straight, the evaluation of OV risk differs between scenario-III
and the other two scenarios. Since OVs do not change lanes in
scenario-I and scenario-II, we consider the risk of obstacles
in both lanes simultaneously only when the agent is close
enough to the front obstacle in the current lane; otherwise,
the effect of obstacles in the other lane is ignored. The distance
threshold between the HV and the OV in the current lane should
adapt to the speed of the HV; here it was set to the safe distance
for convenience. In contrast, we do not distinguish the effects
of obstacles in the current lane from those in both lanes in
scenario-III.
VI. RESULTS AND DISCUSSION
A CNN-based method [57] and a CNN LSTM method [23] are
used for comparison to support the superiority of our approach
(i.e., DSCNN transformer). The CNN-based method only uses
spatial semantic information for decision-making, with a single
image frame as input. CNN LSTM is another decision-making
network that combines semantic information from both the
spatial and temporal aspects, with an image sequence as input.
For fair comparison, we used our proposed image semantic
extraction network to replace the corresponding networks in the
compared methods, denoted as DSCNN and DSCNN LSTM.
A. Quantitative Analysis
The reward of our proposed method when training in
scenario-I is shown in Fig. 9, and the comparison results with
different methods are presented in Fig. 10 and Table III. The
Fig. 9. Reward of our proposed method when training in scenario-I.
Fig. 10. Evaluation performances of the examined methods in scenario-I. The
lines describe the means of driving distances before collision, and the shaded
areas describe the corresponding standard deviations.
baseline in Fig. 10 and Table III denotes the random action
strategy [26], used as a reference to demonstrate the effectiveness
of the examined methods. The episode number in Fig. 10 is
the number of evaluation runs. The Score denotes the driving
distance before collision in each episode. Score (μ) and Score
(σ) in Table III respectively denote the mean and standard
deviation of driving distances before collision. nCs is the number
of crashes that occurred in the experiments. Given that Dsafe is
the total driving distance in the testing episodes without
collisions, the finish rate (FR) is defined as the percentage of
Dsafe in the total driving distance (i.e., 420 m per episode) of all
the testing episodes.
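A minimal sketch of how these evaluation metrics could be computed from per-episode results. The use of the population standard deviation for Score (σ) and all names are assumptions for illustration, not the authors' evaluation code.

```python
import statistics

ROAD_LENGTH = 420.0  # metres per evaluation episode

def evaluate(distances: list, crashed: list):
    """Score(mu), Score(sigma), nCs, and finish rate (FR, percent) from
    per-episode driving distances before collision and crash flags."""
    score_mu = statistics.mean(distances)
    score_sigma = statistics.pstdev(distances)   # population std, an assumption
    n_cs = sum(crashed)
    # D_safe: total distance driven in episodes that ended without a collision
    d_safe = sum(d for d, c in zip(distances, crashed) if not c)
    fr = 100.0 * d_safe / (ROAD_LENGTH * len(distances))
    return score_mu, score_sigma, n_cs, fr

# Two toy episodes: one completed, one crashed halfway.
mu, sigma, ncs, fr = evaluate([420.0, 210.0], [False, True])
```

With this definition, only collision-free episodes contribute to Dsafe, so a single crash lowers FR even when a large distance was covered before impact.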
The experimental results demonstrate that our method
achieves superior performance to the compared methods. Specif-
ically, the score (μ) of the proposed method is 360.40, improved
by 194.9% and 58.2% over DSCNN and DSCNN LSTM, re-
spectively. The score (σ) of our proposed method decreases by
31.4% and 40.5% relative to DSCNN and DSCNN LSTM, indi-
cating that the proposed method has better stability. The results
of nCs and FR show similar trends. The nCs when using DSCNN
and DSCNN LSTM are 76 and 37, respectively, while the number
when using our proposed DSCNN transformer decreases to 18. The
Fig. 11. Reward of our proposed method when training in scenario-II.
Fig. 12. Evaluation performances of the examined methods in scenario-II.
TABLE IV
EVALUATION METRICS OF THE EXAMINED METHODS IN SCENARIO-II
FR increases from 29.10 and 54.24 when using DSCNN and
DSCNN LSTM, respectively, to 85.81 when using our proposed
DSCNN transformer.
The reward of our proposed method when training in
scenario-II is shown in Fig. 11, and the performances of the
examined methods are presented in Fig. 12 and Table IV. The
general trends are similar to those in scenario-I. Specifically,
the score (μ) of our proposed method is 339.82 in scenario-II,
improved by 213.8% and 104.8% over the compared DSCNN
and DSCNN LSTM. The score (σ) of our proposed method is
112.06, 8.01% lower than that of DSCNN LSTM. Although the
score (σ) is higher than that of DSCNN, the driving distance
before collision (i.e., score (μ)) of our approach is superior to
that of DSCNN. Besides, the nCs of our proposed method is 24,
only 37.5% and 52.2% of the numbers when using
Fig. 13. Reward of our proposed method when training in scenario-III.
Fig. 14. Evaluation performances of the examined methods in scenario-III.
TABLE V
EVALUATION METRICS OF THE EXAMINED METHODS IN SCENARIO-III
DSCNN and DSCNN LSTM, respectively. The FR increases
from 25.78 and 39.51 when using DSCNN and DSCNN LSTM,
to 80.91 when using our proposed method.
The reward of our proposed method when training in
scenario-III is shown in Fig. 13, and the performances of the
examined methods are presented in Fig. 14 and Table V. The
general trends are similar to those in scenario-I and scenario-II.
Specifically, the score (μ) of our proposed method is 303.00,
improved by 228.3% and 129.3% over the compared DSCNN
and DSCNN LSTM. Although the score (σ) of our proposed
method is higher than that of the other methods, its score (μ)
is substantially higher, which can be clearly observed in Fig. 14.
The nCs of our proposed method is 33, only 58.9% and 80.5%
of the numbers when using DSCNN and DSCNN LSTM, re-
spectively. The FR increases from 21.97 and 31.46 when using
DSCNN and DSCNN LSTM, to 72.14 when using our proposed
method.
B. Qualitative Analysis
The outputs of our proposed DSCNN transformer running in
scenario-I, scenario-II, and scenario-III are illustrated in Figs. 15,
16, and 17, respectively. The results show that the AV drives
safely and steadily. The highlighted red areas in the figures
illustrate that the situation becomes more dangerous as the HV
gets closer to an obstacle, complying with the risk perceived by
drivers in reality. A crash would be inevitable if the agent made
no lane change while the driving risk kept worsening.
Fortunately, our trained DRL-based agent is aware of the driving
risk and takes proper actions for safe driving, as shown in the
green areas in Fig. 15. The HV is able to take proper actions
to avoid driving out of the lane boundaries, and learns to travel
along the lane center through an incentive mechanism. There-
fore, when in dangerous situations, the HV takes a series of
actions to recover to a low risk level, contributing to the better
performance of our proposed method.
Comprehensively considering the presented quantitative and
qualitative analyses, our proposed approach shows obvious supe-
riority over the compared methods. DSCNN uses a single image
frame as input, which makes it aware only of the static environ-
ment in a single image, whereas the proposed method can per-
ceive the dynamically changing environment in the input image
sequence. This awareness of the dynamically changing environ-
ment is essential for decision-making in the examined scenarios.
Therefore, the proposed DSCNN transformer achieves better
performance than DSCNN. When compared with DSCNN LSTM,
the advantage of our approach mainly comes from the semantic
extraction capability of the transformer, whose core module is
multi-head attention. The dot-product operation in multi-head
attention can recalibrate feature embeddings and filter out
useless information so that the agent focuses on the essential
information (e.g., the semantic information of lanes or
obstacles).
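The recalibration performed by the dot-product attention described above can be sketched as follows (single head, no learned projections, numerically stable softmax); this is a generic illustration of the mechanism, not the authors' network.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a weighted mix of the rows of V, with weights
    from the softmax-normalized Q.K similarities (scaled by sqrt(d_k))."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# A query aligned with the first key attends mostly to the first value row,
# i.e., irrelevant rows of V are down-weighted ("filtered").
Q = np.array([[1.0, 0.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0]])
out, w = scaled_dot_product_attention(Q, K, V)
```

In a multi-head module this computation runs in parallel over several learned projections of Q, K, and V, and the concatenated outputs are linearly mixed.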
These superior quantitative and qualitative performances may
be attributed to the newly proposed deep learning architecture
with a lightweight feature extraction network. The combined
use of several effective techniques (i.e., depth-wise separable
convolution, linear bottlenecks, and the transformer) is a novel
and effective attempt. Our developed method further improves
on previous DRL-based methods by incorporating an advanced
sequential inference technology (i.e., the transformer) and by
considering a driving policy with minimal expected risk for de-
cision inference, which has been demonstrated to be effective
and superior to the compared methods. Besides, the designed
strategy with minimal risk expectation comprehensively uses
position and its uncertainty to model driving risk, making the
agent aware of driving risk to improve driving safety. The
presented results demonstrate the satisfactory performance of
our approach in the static scenario-I and the dynamic scenario-II
and scenario-III.
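To illustrate why depth-wise separable convolution yields a lightweight network, a parameter-count comparison with a standard convolution can be sketched (bias terms omitted; the channel sizes are illustrative, not the paper's layer configuration).

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Parameters of a standard k x k convolution: k*k*c_in per output channel."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Depthwise k x k filter per input channel, then a 1 x 1 pointwise
    convolution mixing channels."""
    return k * k * c_in + c_in * c_out

# e.g. a 32 -> 64 channel layer with a 3 x 3 kernel
std = standard_conv_params(32, 64, 3)        # 18432 parameters
dsc = depthwise_separable_params(32, 64, 3)  # 288 + 2048 = 2336 parameters
```

The factorization reduces the parameter count by roughly a factor of k² for wide layers, which is the source of the FLOPs and parameter savings reported in Section VI-D.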
Fig. 15. Examples of the HV trajectory and the output of risk evaluation by our proposed DSCNN transformer in scenario-I.
Fig. 16. Examples of the HV trajectory and the output of risk evaluation by our proposed DSCNN transformer in scenario-II.
C. Comparison With the Vanilla CNN Methods
To demonstrate the advantages of our proposed approach, we
add a comparison experiment with vanilla CNN methods
(i.e., VCNN and VCNN transformer) in scenario-I. The com-
parison results are shown in Fig. 18 and Table VI. Specifically,
the score (μ) of DSCNN transformer is 360.40, which is 104.6%
higher than that of the compared VCNN transformer.
This means that the driving performance is greatly improved
when using DSCNN transformer. Similarly, the score (σ) of
DSCNN transformer is 29.9% lower than that of VCNN trans-
former, indicating more stable performance.
In addition, the nCs declines from 27 to 18, and the FR increases
TABLE VI
EVALUATION METRICS OF DSCNN-BASED AND VCNN-BASED
METHODS IN SCENARIO-I
from 41.95% to 85.81%. Another interesting finding is that the
score (σ) and the nCs of DSCNN are not better than those of VCNN.
Fig. 17. Examples of the HV trajectory and the output of risk evaluation by our proposed DSCNN transformer in scenario-III.
Fig. 18. Evaluation performances of DSCNN-based and VCNN-based meth-
ods in scenario-I.
However, when DSCNN is used together with the transformer,
the best performance is achieved, demonstrating the effectiveness
and advantages of our proposed approach.
D. Realtime Capability for Model Deployment
Three evaluation metrics of computational cost (i.e., pa-
rameters, FLOPs, and FPS: frames per second) are used to justify
the computational cost advantage of our proposed DSCNN-
based methods over the VCNN-based methods. As shown
in Table I, the parameters and FLOPs results show that
our proposed semantic extraction network (DSCNN) has only
22.11 MFLOPs of inference cost and 0.92 M parameters,
while VCNN has 40.01 MFLOPs of inference cost and
1.43 M parameters. The FPS results of the examined methods are
shown in Table VII. The numbers show that the FPS of DSCNN
transformer is slightly lower than those of DSCNN and
TABLE VII
FRAMES PER SECOND (FPS) OF THE EXAMINED METHODS. THE FPS MEANS
THE NUMBER OF INFERENCES PER SECOND
DSCNN LSTM, but higher than that of VCNN trans-
former. Specifically, the FPS of DSCNN transformer is 34.07,
which means that the proposed method can finish an inference
in only 0.029 s. This running speed should be workable on
autonomous vehicle devices. Therefore, although the FPS of
DSCNN transformer is not the best among the examined meth-
ods, the overall performance, comprehensively considering the
above-mentioned quantitative and qualitative results together
with these computational cost metrics, shows that our DSCNN
transformer is generally the best among the examined methods.
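A simple timing harness of the kind that could produce such FPS numbers is sketched below; it is an illustrative measurement loop, not the authors' benchmark code, and the warm-up/run counts are assumed values.

```python
import time

def measure_fps(infer, n_warmup: int = 10, n_runs: int = 100) -> float:
    """Average inferences per second for a callable model. A few warm-up
    calls are discarded so one-time setup cost does not skew the timing."""
    for _ in range(n_warmup):
        infer()
    start = time.perf_counter()
    for _ in range(n_runs):
        infer()
    elapsed = time.perf_counter() - start
    return n_runs / elapsed

# An FPS of 34.07 corresponds to a per-inference latency of
# 1 / 34.07, i.e. about 0.029 s, matching the figure quoted above.
```

In practice `infer` would wrap the model's forward pass on a fixed input batch, and the measurement would be repeated on the target embedded hardware.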
With the innovative development of AI-SoCs (artificial in-
telligence systems on chip), learning architectures have promis-
ing opportunities to run on embedded devices. For instance,
TensorFlow Lite, a deep learning framework developed by
Google for ARM-based devices, delivers a convenient solution
for deploying our DSCNN transformer. The lower inference
cost, smaller number of parameters, and reasonable FPS of the
proposed DSCNN transformer demonstrate its low computa-
tional cost. Compared with the huge networks with large numbers
of parameters used in traditional end-to-end decision-making
(e.g., VGG and ResNet), our designed network is capable of
inference on CPU devices. Thus, the satisfactory real-time per-
formance gives our proposed DSCNN transformer promising
potential for practical applications.
E. Limitations and Future Work of Our Proposed Approach
Although our proposed approach has the above-mentioned
novelties and advantages, there are still some limitations. Firstly,
only stable highway driving scenarios are considered in this
study. The safe distance between vehicles changes with road
conditions and speeds, and the lane changing of autonomous
vehicles relies heavily on precise positioning, which is chal-
lenging especially under poor lighting or weather conditions
[58], [59], [60]; therefore, an adaptive risk level determination
strategy should be developed to improve performance in various
driving situations. Knowledge transfer technologies may be a
promising solution [6], [24]. Secondly, our proposed approach
uses only discrete actions, which limits the stability of driving
trajectories [61], [62], [63]. It has been reported that methods
with continuous actions (e.g., DDPG (deep deterministic policy
gradient)) can mitigate this problem, but they may generate
over-conservative actions [61]. How to comprehensively inte-
grate the advantages of DQN-based and DDPG-based methods
is a promising research topic for the development of autonomous
driving technologies. Thirdly, as reported in previous studies
[64], human driving habits affect drivers' decision-making per-
formance, and including human driving habits in the design
of autonomous driving systems can improve drivers' acceptance
of the emerging technologies. However, we did not consider
this factor in this study. Our future work will design methods
to incorporate the influence of human driving habits into the
risk awareness module to further improve our developed method,
which is expected to help design individualized systems that
better match drivers' characteristics for improved acceptance.
In addition, more lightweight improvements will be considered
in our future work.
VII. CONCLUSION
In this paper, an innovative driving decision-making network
with risk evaluation is designed to seek an optimal driving
policy with minimal risk expectation. Our proposed approach is
compared with other methods in three lane change scenarios
with different difficulties. The quantitative and qualitative results
reveal that the comprehensive use of depth-wise separable con-
volution together with the transformer in DRL-based architec-
tures for lane change decision inference can generate an optimal
policy with minimal driving risk that avoids crashes in all three
examined scenarios. The comparison results well support the
superiority of our proposed approach. The lightweight charac-
teristics and superior performance of our proposed approach can
facilitate the development of autonomous driving technologies
in various driving scenarios.
REFERENCES
[1] “Traffic Safety Facts 2016: A compilation of motor vehicle crash data from
the fatality analysis reporting system and the general estimates system,”
U.S. Dept. Transp., Nat. Highway Traffic Saf. Admin., Washington, DC,
USA, Tech. Rep. DOT HS 812 554, 2017.
[2] M. S. Shirazi and B. T. Morris, “Looking at intersections: A survey of
intersection monitoring, behavior and safety analysis of recent studies,”
IEEE Trans. Intell. Transp. Syst., vol. 18, no. 1, pp. 4–24, Jan. 2017.
[3] G. Li, Y. Liao, Q. Guo, C. Shen, and W. Lai, “Traffic crash characteristics in
Shenzhen, China from 2014 to 2016,” Int. J. Environ. Res. Public Health,
vol. 18, no. 3, Jan. 2021, Art. no. 1176.
[4] W. Xue and L. Zheng, “Active collision avoidance system design based
on model predictive control with varying sampling time,” Automot. Innov.,
vol. 3, no. 1, pp. 62–72, Mar. 2020.
[5] G. Li, Y. Yang, X. Qu, D. Cao, and K. Li, “A deep learning based image
enhancement approach for autonomous driving at night,” Knowl.-Based
Syst., vol. 213, 2021, Art. no. 106617.
[6] G. Li, Z. Ji, and X. Qu, “Stepwise domain adaptation (SDA) for ob-
ject detection in autonomous vehicles using an adaptive CenterNet,”
IEEE Trans. Intell. Transp. Syst., vol. 23, no. 10, pp. 17729–17743,
Oct. 2022.
[7] D. Dolgov, S. Thrun, M. Montemerlo, and J. Diebel, “Path planning for
autonomous vehicles in unknown semi-structured environments,” Int. J.
Robot. Res., vol. 29, no. 5, pp. 485–501, Apr. 2010.
[8] Y. Huang et al., “A motion planning and tracking framework for au-
tonomous vehicles based on artificial potential field elaborated resis-
tance network approach,” IEEE Trans. Ind. Electron., vol. 67, no. 2,
pp. 1376–1386, Feb. 2020.
[9] B. Li, Y. Ouyang, Y. Zhang, T. Acarman, Q. Kong, and Z. Shao, “Optimal
cooperative maneuver planning for multiple nonholonomic robots in a tiny
environment via adaptive-scaling constrained optimization,” IEEE Robot.
Automat. Lett., vol. 6, no. 2, pp. 1511–1518, Apr. 2021.
[10] D. Shen, Y. Chen, L. Li, and S. Chien, “Collision-free path planning for
automated vehicles risk assessment via predictive occupancy map,” in
Proc. IEEE Intell. Veh. Symp., 2020, pp. 985–991.
[11] B. Simon, F. Franke, P. Riegl, and A. Gaull, “Motion planning for collision
mitigation via FEM–based crash severity maps,” in Proc. IEEE Intell. Veh.
Symp., 2019, pp. 2187–2194.
[12] M. Ali, P. Falcone, and J. Sjöberg, “Threat assessment design under driver
parameter uncertainty,” in Proc. IEEE 51st Conf. Decis. Control, 2012,
pp. 6315–6320.
[13] S. Glaser, B. Vanholme, S. Mammar, D. Gruyer, and L. Nouvelière,
“Maneuver-based trajectory planning for highly autonomous vehicles on
real road with traffic and driver interaction,” IEEE Trans. Intell. Transp.
Syst., vol. 11, no. 3, pp. 589–606, Sep. 2010.
[14] Operational Definitions of Driving Performance Measures and Statistics,
Standard SAE J2944, Society of Automotive Engineers, 2015.
[15] J. Kim and D. Kum, “Collision risk assessment algorithm via
lane-based probabilistic motion prediction of surrounding vehicles,”
IEEE Trans. Intell. Transp. Syst., vol. 19, no. 9, pp. 2965–2976,
Sep. 2018.
[16] S. Noh and K. An, “Decision-making framework for automated driving in
highway environments,” IEEE Trans. Intell. Transp. Syst., vol. 19, no. 1,
pp. 58–71, Jan. 2018.
[17] G. Li et al., “Risk assessment based collision avoidance decision-making
for autonomous vehicles in multi-scenarios,” Transp. Res. Part C: Emerg.
Technol., vol. 122, Jan. 2021, Art. no. 102820.
[18] S. Noh, “Decision-making framework for autonomous driving at road
intersections: Safeguarding against collision, overly conservative behav-
ior, and violation vehicles,” IEEE Trans. Ind. Electron., vol. 66, no. 4,
pp. 3275–3286, Apr. 2019.
[19] D. Shin, B. Kim, K. Yi, A. Carvalho, and F. Borrelli, “Human-centered
risk assessment of an automated vehicle using vehicular wireless commu-
nication,” IEEE Trans. Intell. Transp. Syst., vol. 20, no. 2, pp. 667–681,
Feb. 2019.
[20] J. Nidamanuri, C. Nibhanupudi, R. Assfalg, and H. Venkataraman, “A
progressive review: Emerging technologies for ADAS driven solutions,”
IEEE Trans. Intell. Veh., vol. 7, no. 2, pp. 326–341, Jun. 2022.
[21] G. Li, L. Yang, S. Li, X. Luo, X. Qu, and P. Green, “Human-like
decision making of artificial drivers in intelligent transportation sys-
tems: An end-to-end driving behavior prediction approach,” IEEE In-
tell. Transp. Syst. Mag., vol. 14, no. 6, pp. 188–205, Nov./Dec. 2022,
doi: 10.1109/MITS.2021.3085986.
[22] Y. Pan et al., “Imitation learning for agile autonomous driving,” Int. J.
Robot. Res., vol. 39, no. 2/3, pp. 286–302, 2020.
[23] H. Xu, Y. Gao, F. Yu, and T. Darrell, “End-to-end learning of driving
models from large-scale video datasets,” in Proc. IEEE Conf. Comput.
Vis. Pattern Recognit., 2017, pp. 3530–3538.
[24] G. Li, Z. Ji, X. Qu, R. Zhou, and D. Cao, “Cross-domain object de-
tection for autonomous driving: A stepwise domain adaptative YOLO
approach,” IEEE Trans. Intell. Veh., vol. 7, no. 3, pp. 603–615, Sep. 2022,
doi: 10.1109/TIV.2022.3165353.
[25] L. Peng, H. Wang, and J. Li, “Uncertainty evaluation of object detection
algorithms for autonomous vehicles,” Automot. Innov., vol. 4, no. 3,
pp. 241–252, Aug. 2021.
[26] V. Mnih et al., “Human-level control through deep reinforcement learning,”
Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[27] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience
replay,” 2016, arXiv: 1511.05952.
[28] G. Li, S. Lin, S. Li, and X. Qu, “Learning automated driving in complex
intersection scenarios based on camera sensors: A deep reinforcement
learning approach,” IEEE Sensors J., vol. 22, no. 5, pp. 4687–4696,
Mar. 2022.
[29] C.-J. Hoel, K. Driggs-Campbell, K. Wolff, L. Laine, and M. J. Kochen-
derfer, “Combining planning and deep reinforcement learning in tactical
decision making for autonomous driving,” IEEE Trans. Intell. Veh., vol. 5,
no. 2, pp. 294–305, Jun. 2020.
[30] G. Li, Y. Yang, S. Li, X. Qu, N. Lyu, and S. E. Li, “Decision making
of autonomous vehicles in lane change scenarios: Deep reinforcement
learning approaches with risk awareness,” Transp. Res. C Emerg. Technol.,
vol. 134, Jan. 2022, Art. no. 103452.
[31] B. Mirchevska, C. Pek, M. Werling, M. Althoff, and J. Boedecker, “High-
level decision making for safe and reasonable autonomous lane changing
using reinforcement learning,” in Proc. IEEE 21st Int. Conf. Intell. Transp.
Syst., 2018, pp. 2156–2162.
[32] T. Shi, P. Wang, X. Cheng, C. Y. Chan, and D. Huang, “Driving de-
cision and control for autonomous lane change based on deep rein-
forcement learning,” in Proc. IEEE Intell. Transp. Syst. Conf., 2019,
pp. 2895–2900.
[33] T. Fan, P. Long, W. Liu, and J. Pan, “Distributed multi-robot collision
avoidance via deep reinforcement learning for navigation in complex
scenarios,” Int. J. Robot. Res., vol. 39, no. 7, pp. 856–892, 2020.
[34] Y. Chen, C. Dong, P. Palanisamy, P. Mudalige, K. Muelling, and J.
M. Dolan, “Attention-based hierarchical deep reinforcement learning for
lane change behaviors in autonomous driving,” in Proc. IEEE/CVF Conf.
Comput. Vis. Pattern Recognit. Workshops, 2019, pp. 1326–1334.
[35] C. Hoel, K. Driggs-Campbell, K. Wolff, L. Laine, and M. J. Kochenderfer,
“Combining planning and deep reinforcement learning in tactical decision
making for autonomous driving,” IEEE Trans. Intell. Veh., vol. 5, no. 2,
pp. 294–305, Jun. 2020.
[36] A. G. Howard et al., “MobileNets: Efficient convolutional neural networks
for mobile vision applications,” 2017, arXiv:1704.04861.
[37] W. Hua et al., “Channel gating neural networks,” in Proc. Neural Inf.
Process. Syst., 2019, vol. 32, pp. 1886–1896.
[38] C. Li, G. Wang, B. Wang, X. Liang, Z. Li, and X. Chang, “Dynamic
slimmable network,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern
Recognit., 2021, pp. 8607–8617.
[39] B. Jacob et al., “Quantization and training of neural networks for efficient
integer-arithmetic-only inference,” in Proc. IEEE/CVF Conf. Comput. Vis.
Pattern Recognit., 2018, pp. 2704–2713.
[40] M. Henning, J. C. Muller, F. Gies, M. Buchholz, and K. Dietmayer,
“Situation-aware environment perception using a multi-layer attention
map,” IEEE Trans. Intell. Veh., vol. 8, no. 1, pp. 481–491, Jan. 2023.
[41] Y. Chen, G. Li, S. Li, W. Wang, S. E. Li, and B. Cheng, “Exploring
behavioral patterns of lane change maneuvers for human-like autonomous
driving,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 9, pp. 14322–14335,
Sep. 2022.
[42] T. Rehder, A. Koenig, M. Goehl, L. Louis, and D. Schramm, “Lane change
intention awareness for assisted and automated driving on highways,”
IEEE Trans. Intell. Veh., vol. 4, no. 2, pp. 265–276, Jun. 2019.
[43] J. Zhang, C. Chang, X. Zeng, and L. Li, “Multi-agent DRL-based lane
change with right-of-way collaboration awareness,” IEEE Trans. Intell.
Transp. Syst., vol. 24, no. 1, pp. 854–869, Jan. 2023.
[44] X. He, H. Yang, Z. Hu, and C. Lv, “Robust lane change decision making for
autonomous vehicles: An observation adversarial reinforcement learning
approach,” IEEE Trans. Intell. Veh., vol. 8, no. 1, pp. 184–193, Jan. 2023.
[45] Y. Wang, D. Pan, H. Deng, Y. Jiang, and Z. Liu, “Dynamic trajectory
planning of autonomous lane change at medium and low speeds based on
elastic soft constraint of the safety domain,” Automot. Innov., vol. 3, no. 1,
pp. 73–87, Mar. 2020.
[46] G. Li, Y. Chen, D. Cao, X. Qu, B. Cheng, and K. Li, “Extraction of descrip-
tive driving patterns from driving data using unsupervised algorithms,”
Mech. Syst. Signal Process., vol. 156, Jul. 2021, Art. no. 107589.
[47] A. Dosovitskiy, G. Ros, F. Codevilla, A. López, and V. Koltun, “CARLA:
An open urban driving simulator,” in Proc. Conf. Robot. Learn., 2017,
vol. 78, pp. 1–16.
[48] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mo-
bileNetV2: Inverted residuals and linear bottlenecks,” in Proc. IEEE/CVF
Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4510–4520.
[49] A. Vaswani et al., “Attention is all you need,” in Proc. Adv. Neural Inf.
Process. Syst., 2017, pp. 6000–6010.
[50] P. Long, T. Fan, X. Liao, W. Liu, H. Zhang, and J. Pan, “Towards optimally
decentralized multi-robot collision avoidance via deep reinforcement
learning,” in Proc. IEEE Int. Conf. Robot. Automat., 2018, pp. 6252–6259.
[51] M. Bouton et al., “Reinforcement learning with probabilistic guarantees
for autonomous driving,” 2019, arXiv:1904.07189.
[52] X. Qi, Y. Luo, G. Wu, K. Boriboonsomsin, and M. Barth, “Deep reinforce-
ment learning enabled self-learning control for energy efficient driving,”
Transp. Res. Part C: Emerg. Technol., vol. 99, pp. 67–81, Feb. 2019.
[53] Y. Ye, X. Zhang, and J. Sun, “Automated vehicle’s behavior decision
making using deep reinforcement learning and high-fidelity simulation
environment,” Transp. Res. Part C: Emerg. Technol., vol. 107, pp. 155–170,
Oct. 2019.
[54] M. Zhu, Y. Wang, Z. Pu, J. Hu, X. Wang, and R. Ke, “Safe, efficient,
and comfortable velocity control based on reinforcement learning for
autonomous driving,” Transp. Res. Part C: Emerg. Technol., vol. 117, Aug. 2020,
Art. no. 102662.
[55] J. Duan, S. E. Li, Y. Guan, Q. Sun, and B. Cheng, “Hierarchical rein-
forcement learning for self-driving decision-making without reliance on
labelled driving data,” IET Intell. Transp. Syst., vol. 14, no. 5, pp. 297–305,
2020.
[56] B. R. Kiran et al., “Deep reinforcement learning for autonomous driving:
A survey,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 6, pp. 4909–4926,
Jun. 2022, doi: 10.1109/TITS.2021.3054625.
[57] F. Codevilla, M. Miiller, A. Lopez, V. Koltun, and A. Dosovitskiy, “End-
to-end driving via conditional imitation learning,” in Proc. IEEE Int. Conf.
Robot. Automat., 2018, pp. 4693–4700.
[58] G. Li, Y. Lin, and X. Qu, “An infrared and visible image fusion method
based on multi-scale transformation and norm optimization,” Inf. Fusion,
vol. 71, pp. 109–129, 2021.
[59] G. Guo and J. Liu, “A stochastic model-based fusion algorithm for en-
hanced localization of land vehicles,” IEEE Trans. Instrum. Meas., vol. 71,
2022, Art. no. 8500810, doi: 10.1109/TIM.2021.3137566.
[60] J. Liu and G. Guo, “Vehicle localization during GPS outages with extended
Kalman filter and deep learning,” IEEE Trans. Instrum. Meas., vol. 70,
2021, Art. no. 7503410, doi: 10.1109/TIM.2021.3097401.
[61] G. Li, S. Li, S. Li, and X. Qu, “Continuous decision-making for au-
tonomous driving at intersections using deep deterministic policy gra-
dient,” IET Intell. Transp. Syst., vol. 16, no. 12, pp. 1669–1681, 2021,
doi: 10.1049/itr2.12107.
[62] G. Li et al., “Deep reinforcement learning enabled decision-making for
autonomous driving at intersections,” Automot. Innov., vol. 3, no. 4,
pp. 374–385, Dec. 2020.
[63] B. Peng et al., “End-to-end autonomous driving through dueling double
deep Q-network,” Automot. Innov., vol. 4, no. 3, pp. 328–337, Aug. 2021.
[64] G. Li, S. E. Li, B. Cheng, and P. Green, “Estimation of driving style
in naturalistic highway traffic using maneuver transition probabilities,”
Transp. Res. Part C: Emerg. Technol., vol. 74, pp. 113–125, Jan. 2017.
Guofa Li (Member, IEEE) received the Ph.D. degree
in mechanical engineering from Tsinghua University,
Beijing, China, in 2016. He is currently a Professor
with the College of Mechanical and Vehicle Engineer-
ing, Chongqing University, Chongqing, China. He
has authored or coauthored more than 70 papers in his
research field, which include environment perception,
driver behavior analysis, and human-like decision-
making and control based on artificial intelligence
technologies in autonomous vehicles and intelligent
transportation systems. He was the recipient of the
Young Elite Scientists Sponsorship Program in China, and the best paper
awards from the China Association for Science and Technology (CAST) and the
Automotive Innovation Journal. He is an Associate Editor for IEEE SENSORS
JOURNAL, and a Guest Editor of IEEE Intelligent Transportation Systems Magazine
and Automotive Innovation.
Authorized licensed use limited to: Tsinghua University. Downloaded on May 20,2024 at 05:44:48 UTC from IEEE Xplore. Restrictions apply.
Yifan Qiu received the B.E. degree in 2021 from
Shenzhen University, Shenzhen, China, where he is
currently working toward the master’s degree with the
College of Mechatronics and Control Engineering.
His research focuses on using deep reinforcement learning technologies for the development of autonomous
vehicles.
Yifan Yang received the M.E. degree from Shen-
zhen University, Shenzhen, China, in 2021. He is
currently with the Autonomous Driving Group, Ten-
cent, Shenzhen, China. His research interests include
computer vision, deep reinforcement learning, and
machine learning in automotive and transportation
engineering. He has completed five projects on pedes-
trian recognition, object detection, image enhance-
ment, risk assessment, and decision making using
deep reinforcement learning for the development of
autonomous vehicles.
Zhenning Li received the B.S. and M.S. degrees
in transportation science and engineering from the
Harbin Institute of Technology, Harbin, China, in
2014 and 2016, respectively, and the Ph.D. degree
in civil engineering from the University of Hawaii at
Mānoa, Honolulu, HI, USA, in 2019. He is currently
an Assistant Professor with the State Key Labora-
tory of Internet of Things for Smart City and the
Department of Computer and Information Science,
University of Macau, Macau, China. His research
interests include connected autonomous vehicles and big data applications in urban transportation systems.
Shen Li (Member, IEEE) received the B.E. degree
from Jilin University, Changchun, China, in 2012,
and the Ph.D. degree from the University of Wis-
consin – Madison, Madison, WI, USA, in 2019. His
research interests include cooperative control methods of connected vehicles, autonomous driving safety,
intelligent transportation systems (ITS), architecture
design of CAVH system, traffic data mining based on
cellular data, and traffic operations and management.
He has participated in many research projects funded by the National Natural Science Foundation of China,
Ministry of Science and Technology (863 projects) and U.S. Department of
Transportation.
Wenbo Chu received the B.S. degree in automotive engineering from Tsinghua University, Beijing, China, in 2008, the M.S. degree in automotive engineering from RWTH Aachen University, Aachen, Germany, and the Ph.D. degree in mechanical engineering from Tsinghua University in 2014. He is
currently a Research Fellow with the Western China
Science City Innovation Center of Intelligent and
Connected Vehicles (Chongqing) Co., Ltd., and the National Innovation Center of Intelligent and Connected
Vehicles.
Paul Green received the M.S.E. and Ph.D. degrees
from the University of Michigan, Ann Arbor, MI,
USA, in 1974 and 1979, respectively. He is cur-
rently a Research Professor with the University of
Michigan Transportation Research Institute Driver
Interface Group, Ann Arbor, MI, USA, and an Ad-
junct Professor with the Department of Industrial
and Operations Engineering, University of Michigan.
He teaches automotive human factors and human-
computer interaction classes. He is the Leader of
the University’s Human Factors Engineering Short
Course, the flagship continuing education course in the profession, now in its 62nd year. His research interests include driving safety, driver interfaces, driver
behavior, driver workload, and the development of standards to get research into
practice. Prof. Green is the Past President of the Human Factors and Ergonomics
Society.
Shengbo Eben Li (Senior Member, IEEE) received
the M.S. and Ph.D. degrees from Tsinghua University,
Beijing, China, in 2006 and 2009, respectively. Before
joining Tsinghua University, he was with Stanford
University, Stanford, CA, USA, the University of Michigan, Ann Arbor, MI, USA, and UC Berkeley, Berkeley, CA, USA. He is currently a Professor leading the
Intelligent Driving Lab (iDLab), Tsinghua Univer-
sity. He is the author of more than 120 peer-reviewed
journal/conference papers, and the co-inventor of
more than 30 patents. His research interests include
intelligent vehicles and driver assistance systems, reinforcement learning and
optimal control, and distributed control and estimation. Dr. Li was the recipient
of Best Paper Award in IEEE ITSC 2020, ICCAS 2020, IEEE ICUS 2020, CCCC
2018/2019, ITSAPF 2015, and IEEE ITSC 2014. His important awards include
the National Award for Technological Invention of China in 2013, Excellent
Young Scholar of NSF China in 2016, Young Professor of ChangJiang Scholar
Program in 2016, National Award for Progress in Sci & Tech of China in 2018,
Distinguished Young Scholar of Beijing NSF in 2018, and Youth Sci & Tech
Innovation Leader from MOST in 2020. He is also a member of the Board of Governors of the IEEE Intelligent Transportation Systems Society. He is an Associate Editor
for IEEE TRANSACTIONS ON INTELLIGENT VEHICLES, IEEE TRANSACTIONS ON
INTELLIGENT TRANSPORTATION SYSTEMS, and IEEE Intelligent Transportation
Systems Magazine.