Decision-making Strategy on Highway for Autonomous Vehicles using Deep Reinforcement Learning
Jiangdong Liao1, Teng Liu2, Xiaolin Tang2, Xingyu Mu2, Bing Huang2, and Dongpu Cao3
1School of Mathematics and Statistics, Yangtze Normal University, Fuling, 408100, China
2College of Automotive Engineering, Chongqing University, Chongqing, 400044, China
3Department of Mechanical and Mechatronics Engineering, University of Waterloo, N2L 3G1, Canada
Corresponding author: Teng Liu (e-mail: tengliu17@gmail.com), Xiaolin Tang (e-mail: tangxl0923@cqu.edu.cn).
This work was in part supported by the State Key Laboratory of Mechanical System and Vibration (Grant No. MSV202016).
ABSTRACT Autonomous driving is a promising technology to reduce traffic accidents and improve
driving efficiency. In this work, a deep reinforcement learning (DRL)-enabled decision-making policy is
constructed for autonomous vehicles to address the overtaking behaviors on the highway. First, a highway
driving environment is founded, wherein the ego vehicle aims to pass through the surrounding vehicles with
an efficient and safe maneuver. A hierarchical control framework is presented to control these vehicles,
which indicates the upper-level manages the driving decisions, and the lower-level cares about the
supervision of vehicle speed and acceleration. Then, the particular DRL method named dueling deep Q-
network (DDQN) algorithm is applied to derive the highway decision-making strategy. The exhaustive
calculative procedures of deep Q-network and DDQN algorithms are discussed and compared. Finally, a
series of estimation simulation experiments are conducted to evaluate the effectiveness of the proposed
highway decision-making policy. The advantages of the proposed framework in convergence rate and
control performance are illuminated. Simulation results reveal that the DDQN-based overtaking policy
could accomplish highway driving tasks efficiently and safely.
INDEX TERMS Autonomous driving, decision-making, deep reinforcement learning, dueling deep Q-
network, deep Q-learning, overtaking policy.
I. INTRODUCTION
Autonomous driving (AD) enables a vehicle to undertake different driving missions without a human driver [1, 2].
Motivated by the enormous potential of artificial intelligence (AI), autonomous or automated vehicles have become one of the research hotspots all over the world [3]. Many automobile manufacturers, such as Toyota, Tesla, Ford, Audi, Waymo, Mercedes-Benz, and General Motors, are developing their own autonomous cars and achieving tremendous progress. Meanwhile, automotive researchers are striving to master the essential technologies needed to build automated cars with full automation [4].
Four significant modules are contained in autonomous vehicles: perception, decision-making, planning, and control [5]. Perception means that the autonomous vehicle acquires information about the driving environment through a variety of sensors, such as radar, lidar, and the global positioning system (GPS). The decision-making controller manages the driving behaviors of the vehicle, including acceleration, braking, lane-changing, and lane-keeping [6]. The planning function helps the automated car find reasonable running trajectories from one point to another. Finally, the control module commands the onboard powertrain components to operate accurately, finish the driving maneuvers, and follow the planned path. According to the degree of intelligence of these modules, AD is classified into six levels, from L0 to L5 [7].
The decision-making strategy is regarded as the brain of an autonomous vehicle and is extremely important [8].

FIGURE 1. The constructed deep reinforcement learning-enabled highway overtaking driving policy for autonomous vehicles.

This policy is often generated from manual rules based on
human driving experience or from imitated manipulation learned by supervised learning approaches. For example, Song et al. applied a continuous hidden Markov chain to predict the motion intentions of the surrounding vehicles; a partially observable Markov decision process (POMDP) is then used to construct the general decision-making framework [9]. The authors in [10] developed an advanced ability to make appropriate decisions in city road traffic situations. The presented decision-making policy is based on multiple criteria, which helps city cars make feasible choices under different conditions. In Ref. [11], Nie et al. discussed the lane-changing decision-making strategy for connected automated cars; the related model combines cooperative car-following models with a candidate decision generation module. Furthermore, the authors in [12] introduced the idea of a human-like driving system, which adjusts the driving decisions by considering the driving demands of human drivers.
Deep reinforcement learning (DRL) techniques are regarded as a powerful tool for long sequential decision-making problems [13]. In recent years, many attempts have been made to study DRL-based autonomous driving topics. For example, Duan et al. built a hierarchical structure to learn the decision-making policy via the reinforcement learning (RL) method [14]. The advantage of this work is that it does not rely on historical labeled driving data. Refs. [15, 16] utilized DRL approaches to handle collision avoidance and path-following problems for automated vehicles; in both studies, the control performance is better than that of conventional RL methods. Furthermore, the authors in [17] considered not only path planning but also fuel consumption for autonomous vehicles. The related algorithm is deep Q-learning (DQL), and it was proven to accomplish these two driving missions suitably. Han et al. employed the DQL algorithm to decide between lane changing and lane keeping for connected autonomous cars, in which the information of the nearby vehicles is treated as feedback knowledge from the network [18]. The resulting policy is able to improve traffic flow and driving comfort. However, common DRL methods are unable to address highway overtaking problems because of the continuous action space and large state space [19].
In this work, a DRL-enabled highway overtaking driving policy is constructed for autonomous vehicles. The proposed decision-making strategy is evaluated and shown to be adaptive to other complicated scenarios, as depicted in Fig. 1.
First, the studied driving environment is founded on the
highway, wherein an ego vehicle aims to run through a
particular driving scenario efficiently and safely. Then, a
hierarchical control structure is shown to manipulate the
lateral and longitudinal motions of the ego and surrounding
vehicles. Furthermore, the special DRL algorithm called
dueling deep Q-network (DDQN) is derived and utilized to
obtain the highway decision-making strategy. The DQL and
DDQN algorithms are compared and analyzed theoretically.
Finally, the performance of the proposed control framework
is discussed via executing a series of simulation experiments.
Simulation results reveal that the DDQN-based overtaking
policy could accomplish highway driving tasks efficiently
and safely.
The main contributions and innovations of this work can
be cast into three perspectives: 1) an adaptive and optimal
DRL-based highway overtaking strategy is proposed for
automated vehicles; 2) the dueling deep Q-network (DDQN)
algorithm is leveraged to address the large state space of the
decision-making problem; 3) the convergence rate and
control optimization of the derived decision-making policy
are demonstrated by multiple designed experiments.
The remainder of this article is organized as follows: the highway driving environment and the control modules of the ego and surrounding vehicles are described in Section II. The DQL and DDQN algorithms are defined in Section III, in which the parameters of the RL framework are discussed in detail. Section IV shows the relevant results of a series of simulation experiments. Finally, conclusions are drawn in Section V.
II. DRIVING ENVIRONMENT AND CONTROL MODULE
In this section, the studied driving scenario on the highway is
introduced. Without loss of generality, a three-lane freeway
environment is constructed. Furthermore, a hierarchical
motion controller is described to manage the lateral and
longitudinal movements of the ego and surrounding vehicles.
The upper-level contains two models, which are the intelligent driver model (IDM) and the model of minimizing overall braking induced by lane changes (MOBIL) [20]. The lower-level focuses on regulating vehicle velocity and acceleration.
A. HIGHWAY DRIVING SCENARIO
Decision-making in autonomous driving means selecting a sequence of reasonable driving behaviors to achieve specific driving missions. On the highway, these behaviors involve lane-changing, lane-keeping, acceleration, and braking. The main objectives are avoiding collisions, running efficiently, and driving on the preferred lane. Accelerating and surpassing other vehicles is a typical driving behavior called overtaking.
FIGURE 2. Highway driving environment for decision-making problem
with three lanes.
This work discusses the decision-making problem on the highway for autonomous vehicles, and the studied driving scenario is depicted in Fig. 2. The orange vehicle is the ego vehicle, and the green cars are the surrounding vehicles. There are three lanes in the driving environment, and the derived decision-making policy is easily generalized to different situations. The ego vehicle is initialized in the middle lane at a random speed.
The objective of the ego vehicle is to run from the starting point to the end point as soon as possible without crashing into the surrounding vehicles. Hence, this goal is interpreted as efficiency and safety. The initial velocities and positions of the surrounding vehicles are set randomly, which implies that the driving scenario contains uncertainties, as in actual driving. Furthermore, to imitate real conditions, the ego vehicle prefers to stay on lane 1 (L=1), and it can overtake other vehicles from the right or left side.
At the beginning of this driving task, all the surrounding vehicles are located in front of the ego vehicle. In each lane, the number of surrounding vehicles is M, which indicates there are 3M nearby cars in this situation. Two conditions terminate an episode for the ego vehicle: crashing into another vehicle or reaching the destination. The procedure of running from the starting point to the ending point is called one episode in this work.
Without loss of generality, the parameters of the driving scenario are set as follows: the original speed of the ego vehicle is chosen from [23, 25] m/s, its maximum speed is 40 m/s, and the length and width of all vehicles are 5 m and 2 m, respectively. The duration of one episode is 100 s, and the simulation frequency is 20 Hz (these settings are collected in the sketch below). The initial velocity of the surrounding vehicles is randomly chosen from [20, 23] m/s, and their behaviors are manipulated by IDM and MOBIL. The next subsection discusses these two models in detail.
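As a concrete reference, the scenario parameters above can be gathered into a single configuration. The sketch below assumes the open-source highway-env simulator built on the OpenAI Gym toolkit that Section III.D later refers to; the package name, environment id, and configuration keys are assumptions and may differ between versions.

```python
import gym
import highway_env  # assumed package that registers the "highway-v0" environment

# Scenario parameters from Section II.A, expressed as a configuration dictionary
# (key names follow highway-env conventions and are assumptions here).
scenario_config = {
    "lanes_count": 3,            # three-lane freeway
    "vehicles_count": 30,        # 3M surrounding vehicles with M = 10 (Section III.D)
    "duration": 100,             # one episode lasts 100 s
    "simulation_frequency": 20,  # 20 Hz physics update
    "policy_frequency": 1,       # one decision per second (Section III.D)
}

env = gym.make("highway-v0")
env.configure(scenario_config)
obs = env.reset()
```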
B. VEHICLE BEHAVIOR CONTROLLER
The movements of all the vehicles in the highway environment are governed by a hierarchical control framework, as shown in Fig. 3. The upper-level applies IDM and MOBIL to manage the vehicle behaviors, and the lower-level aims to enable the ego vehicle to track a given target speed and follow a target lane. In this work, the DRL method is used to control the ego vehicle. The reference model means that the ego vehicle is instead controlled by the IDM- and MOBIL-based bi-level structure in Fig. 3, and it is taken as a benchmark to evaluate the DRL-based decision-making strategy.
IDM in the upper-level is a prevalent microscopic model [21] for realizing car-following and collision avoidance. In the adaptive cruise controller of automated cars, the longitudinal behavior is usually decided by IDM. In general, the longitudinal acceleration in IDM is determined as [22]:
$a = a_{max}\left[1 - \left(\frac{v}{v_{tar}}\right)^{\delta} - \left(\frac{d_{tar}}{d}\right)^{2}\right]$   (1)
where v and a are the current vehicle speed and acceleration, respectively, $a_{max}$ is the maximum acceleration, d is the distance to the front car, and δ is the constant acceleration exponent. $v_{tar}$ and $d_{tar}$ are the target velocity and target distance, and the desired speed is approached according to $a_{max}$ and $d_{tar}$. In IDM, the expected distance $d_{tar}$ is affected by the front vehicle and is calculated as follows:
[Figure 3 layout: the reference model and the unified control model for the surrounding vehicles use Equations (1)-(4) in the upper-level and Equations (5)-(7) in the lower-level, while the DRL model uses the DDQN algorithm in the upper-level and Equations (23)-(24) in the lower-level; all vehicles are given random initial speeds and positions.]
FIGURE 3. The hierarchical control framework discussed in this work for the ego vehicle and surrounding vehicles.
$d_{tar} = d_0 + Tv + \frac{v\,\Delta v}{2\sqrt{a_{max} b}}$   (2)
where $d_0$ is the predefined minimum relative distance, T is the expected time interval for the safety goal, $\Delta v$ is the relative speed between the two vehicles, and b is the deceleration rate chosen for comfort.
In IDM, the relative speed and distance are defined a priori to induce the vehicle velocity and acceleration at each time step. The default configuration is introduced as follows: the maximum acceleration $a_{max}$ is 6 m/s2, the acceleration exponent δ is 4, the desired time gap T is 1.5 s, the comfortable deceleration rate b is -5 m/s2, and the minimum relative distance $d_0$ is 10 m.
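As a concrete reference, the following Python sketch evaluates (1)-(2) with the default parameters listed above; the function and variable names are illustrative, and the comfortable deceleration b enters the square root through its magnitude.

```python
import math

# Default IDM parameters from Section II.B.
A_MAX = 6.0    # maximum acceleration a_max (m/s^2)
DELTA = 4.0    # constant acceleration exponent
T_GAP = 1.5    # desired time gap T (s)
B_COMF = 5.0   # magnitude of the comfortable deceleration b (m/s^2)
D0 = 10.0      # minimum relative distance d_0 (m)

def idm_acceleration(v, v_tar, d, dv):
    """IDM longitudinal acceleration: v is the own speed, v_tar the target speed,
    d the gap to the front car, and dv the relative (closing) speed."""
    d_tar = D0 + T_GAP * v + v * dv / (2.0 * math.sqrt(A_MAX * B_COMF))   # Eq. (2)
    return A_MAX * (1.0 - (v / v_tar) ** DELTA - (d_tar / d) ** 2)        # Eq. (1)

# Example: following a car 60 m ahead while closing at 2 m/s gives mild braking,
# because the desired gap is slightly smaller than the actual one.
a = idm_acceleration(v=24.0, v_tar=30.0, d=60.0, dv=2.0)
```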
While IDM is utilized to determine the longitudinal behavior, MOBIL is employed to make the lateral lane-change decisions [23]. MOBIL states that lane-changing behaviors should satisfy two restrictions, which are the safety criterion and the incentive condition. These constraints are related to the ego vehicle and its two followers (before changing and after changing). Assume $a_i^{old}$ and $a_j^{old}$ are the accelerations of these followers before changing, and $a_i^{new}$ and $a_j^{new}$ are the accelerations after changing.
The safety criterion requires the follower in the desired lane (after changing) to limit its acceleration to avoid a collision. The mathematical expression is:
$a_j^{new} \geq -b_{safe}$   (3)
where $b_{safe}$ is the maximum braking imposed on the follower during the lane-changing behavior. By following (3), collisions and accidents can be avoided effectively.
The incentive condition is imposed on the ego vehicle and
its followers by an acceleration threshold ath:
$a_e^{new} - a_e^{old} + z\left[\left(a_i^{new} - a_i^{old}\right) + \left(a_j^{new} - a_j^{old}\right)\right] \geq a_{th}$   (4)
where z is the politeness coefficient that determines how strongly the followers are considered in the lane-changing behavior. This incentive condition means that changing to the desired lane should be more beneficial than staying in the old lane. For application, the parameters in MOBIL are defined as follows: the politeness factor z is 0.001, the safe deceleration limit $b_{safe}$ is 2 m/s2, and the acceleration threshold $a_{th}$ is 0.2 m/s2. After deciding the longitudinal and lateral behaviors in the upper-level, the lower-level is applied to follow the target speed and lane.
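The two MOBIL conditions can be checked in a few lines of Python; the sketch below is a minimal illustration of (3)-(4) with the parameters given above, and the argument names are assumptions.

```python
# MOBIL parameters from Section II.B.
B_SAFE = 2.0   # safe deceleration limit b_safe (m/s^2)
A_TH = 0.2     # acceleration threshold a_th (m/s^2)
Z = 0.001      # politeness coefficient z

def mobil_allows_lane_change(a_e_new, a_e_old, a_i_new, a_i_old, a_j_new, a_j_old):
    """Return True if a lane change satisfies both MOBIL restrictions.
    The arguments are the accelerations of the ego vehicle (e) and of its followers
    in the current lane (i) and target lane (j), before (old) and after (new) the change."""
    safety = a_j_new >= -B_SAFE                                              # Eq. (3)
    incentive = (a_e_new - a_e_old
                 + Z * ((a_i_new - a_i_old) + (a_j_new - a_j_old))) >= A_TH  # Eq. (4)
    return safety and incentive
```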
C. VEHICLE MOTION CONTROLLER
In the lower-level, the motions of the vehicles in the longitudinal and lateral directions are controlled. The former regulates the acceleration by a proportional controller as:
$a = K_p\left(v_{tar} - v\right)$   (5)
where Kp is the proportional gain.
In the lateral direction, the controller deals with the position and heading of the vehicle with a simple proportional-derivative action. For the position, the lateral speed $v_{lat}$ of the vehicle is computed as follows:

$v_{lat} = -K_{p,lat}\,\Delta_{lat}$   (6)

where $K_{p,lat}$ is the position gain and $\Delta_{lat}$ is the lateral position of the vehicle with respect to the center-line of the lane. Then, the heading control computes the yaw rate command from the heading error as:

$\dot{\varphi} = K_{p,\varphi}\left(\varphi_{tar} - \varphi\right)$   (7)

where $\varphi_{tar}$ is the target heading angle to follow the desired lane and $K_{p,\varphi}$ is the heading gain.
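A minimal sketch of the lower-level controllers (5)-(7) is given below; the gain values are placeholders rather than the ones used in the paper.

```python
# Placeholder gains for the lower-level controllers.
K_P = 0.3        # speed gain K_p
K_P_LAT = 0.2    # position gain K_p,lat
K_P_HEAD = 1.0   # heading gain

def longitudinal_acceleration(v, v_tar):
    return K_P * (v_tar - v)                  # Eq. (5): proportional speed control

def lateral_speed(delta_lat):
    return -K_P_LAT * delta_lat               # Eq. (6): pull toward the lane center-line

def yaw_rate_command(phi, phi_tar):
    return K_P_HEAD * (phi_tar - phi)         # Eq. (7): proportional heading control
```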
Hence, the movements of the surrounding vehicles are achieved by the bi-level control framework in Fig. 3. The position, speed, and acceleration of these vehicles are assumed to be known to the ego vehicle, whereas their control policies are not. This limitation propels the ego vehicle to learn how to drive in the scenario via a trial-and-error procedure. In the next section, the DRL approach is introduced and established to realize this learning process and derive the highway decision-making policy.
III. DRL METHODOLOGY
This section introduces the RL method and presents the specific DRL algorithms. The interaction in RL between the agent and the environment is first explained. Then, the DQL algorithm, which incorporates a neural network into the Q-learning algorithm, is formulated. Finally, a dueling network is constructed within the DQL algorithm to reconstitute the output layer of the neural network, which yields the DDQN method.
A. RL CONCEPT
The RL approach describes the process in which an intelligent agent interacts with its environment. It is a powerful and useful tool for solving sequential decision-making problems. The goal of the agent is to search for an optimal sequence of control actions based on feedback from the environment. Owing to its characteristics of self-evaluation and self-promotion, RL is widely used in many research fields [24-28].
In the decision-making problem on the highway, the agent and the environment are the ego vehicle and the surrounding vehicles (including the driving conditions), respectively. A Markov decision process (MDP) is built on the Markov property, which states that the next state depends only on the current state and action [29]. The MDP is often utilized to represent the RL interaction as a tuple (S, A, P, R, γ), in which S and A are the state and control action sets. P and R are the essential elements of the environment in RL, representing the transition model and the reward model, respectively. In RL, the current action influences the immediate and future rewards simultaneously. Hence, γ is a discount factor to balance these two parts of the reward.
To represent the list of future rewards, the accumulated
reward Rt is defined as follows:
$R_t = \sum_{t} \gamma^{t} r_{t}$   (8)
where t is the time instant and $r_t$ is the relevant reward. To record the worth of the state s and the state-action pair (s, a), two value functions are expressed by the accumulated reward as:
$V_{\pi}(s_t) = \mathbb{E}_{\pi}\left[R_t \mid s_t\right]$   (9)

$Q_{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[R_t \mid s_t, a_t\right]$   (10)
where π is the control action policy, V is the state-value function, and Q is the state-action value function (called the Q table for short). To be updated easily, the state-action function is usually rewritten in the recursive form:
$Q_{\pi}(s_t, a_t) = \mathbb{E}\left[r_{t+1} + \gamma \max_{a_{t+1}} Q_{\pi}(s_{t+1}, a_{t+1})\right]$   (11)
Finally, the optimal control action with respect to the
control policy π is determined by the state-action function:
$\pi(s_t) = \arg\max_{a_t} Q(s_t, a_t)$   (12)
Therefore, the essence of different RL algorithms is updating the state-action function Q(s, a) in various ways. According to the style of the updating rules, RL algorithms can be classified in diverse ways, such as model-based and model-free, policy-based and value-based, and temporal-difference (TD) and Monte-Carlo (MC) methods [30].
B. DEEP Q NETWORK
The deep Q-network (DQN) was first presented to play Atari games in [13]. It synthesizes the strengths of deep learning (neural networks) and Q-learning to obtain the state-action value function. In common Q-learning, the updating rule of this function is as follows:
$Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s,a)\right]$   (13)
where α ∈ [0, 1] is the learning rate that trades off the old and newly learned experiences from the environment, and s' and a' are the state and action at the next time step.
Common Q-learning is unable to handle problems with a large state space, because it would need an enormous amount of time to obtain the complete Q table. Thus, in DQN, a neural network is employed to approximate the Q table as Q(s, a; θ). For this neural network, the inputs are the arrays of state variables and control actions, and the output is the state-action value function [31].
To measure the discrepancy between the approximated and actual Q tables in DQN, the loss function is introduced as follows:

$L(\theta) = \mathbb{E}\left[\sum_{t=1}^{N}\left(y_t - Q(s, a; \theta)\right)^{2}\right]$   (14)
where

$y_t = r_t + \gamma \max_{a'} Q(s', a'; \theta')$   (15)
As can be seen, there are two parameter sets (θ and θ') of the neural network, which represent two networks in DQN: the prediction network and the target network. The former is applied to estimate the current control action, and the latter aims to generate the target value.

[Figure 4 layout: the state variables (vehicle speed, relative position) and control actions (acceleration, lane-changing) are processed by fully connected layers that split into a state-value network V(s) and an advantage network A(s, a).]

FIGURE 4. The dueling network combined with the state-value network and the advantage network for Q table updating.

In general, the target
network copies the parameters from the prediction network every certain number of time steps. By doing this, the target Q table converges toward the predicted one to some extent, which mitigates the instability of the network.
In DQN, the online neural network is updated by gradient descent as follows:
$\nabla_{\theta} L(\theta) = \mathbb{E}\left[\left(y_t - Q(s, a; \theta)\right)\nabla_{\theta} Q(s, a; \theta)\right]$   (16)
This operation makes DQN an off-policy algorithm, and the states and rewards are acquired by a special criterion. This rule is known as epsilon-greedy, which indicates that the agent executes exploration (choosing a random action) with probability ε and exploitation (using the current best action) with probability 1-ε.
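The procedure of (13)-(16), together with the epsilon-greedy rule, can be sketched in PyTorch as follows; the network width, learning rate, and state/action dimensions are placeholders, not the exact settings of this paper.

```python
import random
import torch
import torch.nn as nn

# Schematic DQN update implementing (13)-(16); architecture and hyper-parameters
# below are illustrative placeholders.
class QNet(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))

    def forward(self, s):
        return self.net(s)

GAMMA, EPS, N_ACTIONS = 0.8, 0.1, 5
policy_net, target_net = QNet(4, N_ACTIONS), QNet(4, N_ACTIONS)
target_net.load_state_dict(policy_net.state_dict())   # periodic parameter copy
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def select_action(state):
    """Epsilon-greedy rule: explore with probability EPS, otherwise exploit."""
    if random.random() < EPS:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return int(policy_net(state).argmax())

def dqn_update(s, a, r, s_next, done):
    """One gradient step on the loss (14) with the target (15); the gradient (16)
    is obtained automatically by backpropagation. Inputs are batched tensors."""
    q = policy_net(s).gather(1, a.unsqueeze(1)).squeeze(1)               # Q(s, a; theta)
    with torch.no_grad():
        y = r + GAMMA * target_net(s_next).max(1).values * (1.0 - done)  # Eq. (15)
    loss = nn.functional.mse_loss(q, y)                                  # Eq. (14)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```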
C. DUELING DQN ALGORITHM
In some RL problems, the selection of the current control action may not cause obviously negative results. For example, in the highway environment, many actions would not lead to a collision immediately; however, these choices may indirectly result in bad rewards afterward [32]. Motivated by this insight, a dueling network is adopted in this work to estimate the worth of the control actions at each step. A new neural network is constructed to approximate the Q table in the highway decision-making problem, as shown in Fig. 4.
Two streams of fully connected layers are used to estimate
the state-value function V(s) and the advantage function A(s,
a) of each action. Therefore, the state-action function (Q
table) is constituted as follows:
$Q_{\pi}(s, a) = A_{\pi}(s, a) + V_{\pi}(s)$   (17)
It is obvious that the output of this new dueling network is also a Q table, and thus the neural network used in DQN can also be employed to approximate this Q table. The network with two parameter sets is computed as:

$Q(s, a; \theta_1, \theta_2) = V(s; \theta_1) + A(s, a; \theta_2)$   (18)
where θ1 and θ2 are the parameters of state-value function
and advantage function, respectively.
To update the Q table in DDQN and achieve the optimal
control action, (18) is reformulated as follows:
$Q(s, a; \theta_1, \theta_2) = V(s; \theta_1) + \left(A(s, a; \theta_2) - \max_{a'} A(s, a'; \theta_2)\right)$   (19)

$a^{*} = \arg\max_{a} Q(s, a; \theta_1, \theta_2) = \arg\max_{a} A(s, a; \theta_2)$   (20)
It can be discerned that the input-output interfaces of DDQN and DQN are the same. Hence, the gradient descent in (16) can be reused to train the Q table in this work.
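The dueling head of (17)-(19) can be written as a small PyTorch module; the layer width below is illustrative, and the max-based aggregation follows (19).

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """A shared trunk splits into a state-value stream V(s; theta_1) and an
    advantage stream A(s, a; theta_2), aggregated as in Eq. (19)."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
        self.value = nn.Linear(128, 1)               # V(s; theta_1)
        self.advantage = nn.Linear(128, n_actions)   # A(s, a; theta_2)

    def forward(self, s):
        h = self.trunk(s)
        v, adv = self.value(h), self.advantage(h)
        # Subtracting the maximum advantage keeps argmax_a Q = argmax_a A, Eq. (20).
        return v + adv - adv.max(dim=-1, keepdim=True).values
```

Because the output is again a Q table, the DQN update sketched in Section III.B can be reused without modification.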
D. VARIABLES SPECIFICATION
To derive the DDQN-based decision-making strategy, the preliminaries are initialized as follows, and the calculative procedure can easily be transferred to an analogous driving environment. The control actions are the longitudinal and lateral accelerations ($a_1$ and $a_2$), with units of m/s2 and rad:

$a_1 \in [-5, 5]\ \mathrm{m/s^2}$   (21)

$a_2 \in [-\pi/4, \pi/4]\ \mathrm{rad}$   (22)
It is noted that when these two accelerations are zero, the ego vehicle adopts an idling control.
After obtaining the acceleration actions, the speed and position of the vehicle can be computed as follows:

$v_1^{t+1} = v_1^{t} + a_1 \Delta t, \qquad v_2^{t+1} = v_2^{t} + a_2 \Delta t$   (23)

$d_1 = v_1^{t} \Delta t + \tfrac{1}{2} a_1 \Delta t^2, \qquad d_2 = v_2^{t} \Delta t + \tfrac{1}{2} a_2 \Delta t^2$   (24)
where $v_1$ and $v_2$ are the longitudinal and lateral speeds of the vehicle, respectively, and $d_1$ and $d_2$ are the corresponding displacements. The policy frequency is 1 Hz, which indicates the time interval $\Delta t$ is 1 s. It should be noticed that (23) and (24) hold for the ego vehicle and the surrounding vehicles simultaneously, and these expressions are considered as the transition model P in RL. Then, the state variables are defined as the relative speed and distance between the ego and nearby cars:
$\Delta d_t = d_t^{ego} - d_t^{sur}$   (25)

$\Delta v_t = v_t^{ego} - v_t^{sur}$   (26)
where the superscripts ego and sur represent the ego vehicle and the surrounding vehicles, respectively.
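A direct transcription of the transition model (23)-(24) and the relative-state features (25)-(26) is given below; the 1-s step follows the policy frequency stated above, and the function names are illustrative.

```python
DT = 1.0  # policy step (s), from the 1 Hz policy frequency

def step_kinematics(v1, v2, a1, a2):
    """Propagate longitudinal/lateral speeds and displacements over one policy step."""
    v1_next = v1 + a1 * DT                      # Eq. (23)
    v2_next = v2 + a2 * DT
    d1 = v1 * DT + 0.5 * a1 * DT ** 2           # Eq. (24)
    d2 = v2 * DT + 0.5 * a2 * DT ** 2
    return v1_next, v2_next, d1, d2

def relative_state(d_ego, v_ego, d_sur, v_sur):
    """State features of Eqs. (25)-(26): relative distance and speed to a nearby car."""
    return d_ego - d_sur, v_ego - v_sur
```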
Finally, the reward model R is constituted by the optimal control objectives, which are avoiding collisions, running as fast as possible, and trying to drive on lane 1 (L=1). To bring this insight to fruition, the instantaneous reward function is defined as follows:

$r_t = -1 \cdot \mathrm{collision} + 0.1\left(\frac{v_t^{ego}}{v_{max}^{ego}}\right)^{2} - 0.4\,(L-1)^{2}$   (27)
where collision ∈ {0, 1}, and the goal of the DDQN-based highway decision-making strategy is to maximize the cumulative reward.
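The instantaneous reward (27) translates directly into a one-line function; the weights below follow the reconstruction of (27) and should be read as illustrative.

```python
V_MAX = 40.0  # maximum speed of the ego vehicle (m/s)

def instantaneous_reward(collision, v_ego, lane):
    """Reward of Eq. (27): collision is 0 or 1, lane = 1 is the preferred lane."""
    return -1.0 * collision + 0.1 * (v_ego / V_MAX) ** 2 - 0.4 * (lane - 1) ** 2
```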
The proposed decision-making control policy is trained and evaluated in a simulation environment based on the OpenAI Gym Python toolkit [31]. The numbers of lanes and surrounding vehicles are 3 and 30, respectively. The discount factor γ and the learning rate α are 0.8 and 0.2. The hidden layers of the value network and the advantage network both have 128 units. The value of ε decreases from 1 to 0.05 over 6000 time steps. The number of training episodes for each DRL approach is 2000. The next section discusses the effectiveness of the presented decision-making strategy for autonomous vehicles.
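For reference, the training settings listed above can be gathered in one place together with a linear exploration schedule; the schedule shape and key names are assumptions, and only the numerical values come from the text.

```python
# Training settings from Section III.D.
train_config = {
    "lanes_count": 3,
    "vehicles_count": 30,
    "gamma": 0.8,          # discount factor
    "learning_rate": 0.2,  # learning rate alpha
    "hidden_units": 128,   # value and advantage streams
    "episodes": 2000,      # training episodes per DRL approach
}

def epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=6000):
    """Exploration rate decaying from 1 to 0.05 over 6000 time steps (assumed linear)."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```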
IV. RESULTS AND EVALUATION
In this section, the proposed highway decision-making policy is evaluated by comparing it with benchmark methods. These techniques are the reference model in Fig. 3 and the common DQN in Section III.B. The optimality is analyzed by comparing these three methods. Furthermore, the adaptability of the presented approach is verified by implementing the trained model in a similar highway driving scenario.
A. OPTIMALITY EVALUATION
The reference model, DQN, and DDQN are compared in this subsection. All of them adopt a hierarchical control framework. The lower-levels are the same and utilize (5)-(7) to regulate the acceleration, position, and heading. The upper-levels are different: IDM and MOBIL, the DQN algorithm in Section III.B, and the DDQN algorithm in Section III.C, respectively. The default parameters are the same in DQN and DDQN.
Fig. 5 depicts the normalized average rewards of these three methods. Based on the definition of the reward function in (27), a higher reward indicates driving on the preferred lane with a more efficient maneuver. It is obvious that the training stability and learning speed of DDQN are better than those of the other two approaches. Besides, after about 500 episodes, the reward of DDQN exceeds that of the other two approaches and keeps this advantage thereafter, and both DRL methods outperform the reference model. This is mainly caused by the advantage network in DDQN, which assesses the worth of the chosen action at each step and thus helps the ego vehicle find a better decision-making policy more quickly.
FIGURE 5. Average reward variation in three compared methods: the
reference model, DQN and DDQN.
FIGURE 6. Vehicle speed and traveling distance of the ego vehicle in
each episode of these compared techniques.
To observe the trajectories of the state variables in this work, Fig. 6 shows the average vehicle speed and traveling distance of these three compared techniques. They are all trained for 2000 episodes. A higher average speed implies that the ego vehicle runs through the driving scenario faster and achieves greater cumulative rewards. A longer traveling distance means the ego vehicle drives longer without collision. These results directly reflect the safety and efficiency demands. The noticeable differences certify the optimality of the proposed algorithm.
As the ego vehicle is not willing to crash into the surrounding vehicles, the collision conditions of these three control cases are described in Fig. 7, wherein collision takes two values (collision = 0 or 1). It can be noticed that the DDQN-, DQN-, and reference model-enabled agents could avoid collisions after about 1300, 1700, and 1950 episodes, respectively. This observation also proves that the DDQN-based agent is more intelligent than the other two agents. As safety is the first concern for the practical application of automated driving, the learned decision-making model based on DDQN is more promising for deployment in real-world environments.
FIGURE 7. Collision conditions of the ego vehicle in each compared
method: collision=0, the ego vehicle does not crash other vehicles;
collision=1, the ego vehicle crashes other vehicles.
FIGURE 8. Control actions in one successful episode of three compared
methods: Index=1, changing left lane; Index=2, idling speed; Index=3,
changing right lane; Index=4, running faster; Index=5, running slower.
Furthermore, to demonstrate that the concrete control actions are different in these three methods, the curves of the control action sequences in one successful episode (meaning the ego vehicle drives from the starting point to the ending point) are given in Fig. 8. The actions in the longitudinal and lateral directions of the ego vehicle are unified into five selections: changing to the left lane, changing to the right lane, idling speed, running faster, and running slower. The differences between these trajectories indicate that the proposed decision-making policy differs from the two benchmark methods (in the same successful episode).
Overall, according to all the results displayed in this subsection, the optimality of the DDQN-enabled decision-making strategy is illuminated.
B. COMPARISON BETWEEN DQN AND DDQN
Since DQN and DDQN are two prominent DRL algorithms, this experiment aims to appraise the learning and training procedures of these two approaches. As the target of the neural network is to approximate the Q table, the normalized mean discrepancy of the Q table during the training process of these two methods is displayed in Fig. 9. The downward trends indicate that both ego vehicles become more familiar with the driving environment by interacting with it. Furthermore, it can be discerned that DDQN learns more knowledge about the traffic situation within the same number of episodes, which results in a faster learning course. Hence, the ego vehicle can be manipulated more efficiently and safely under the guidance of the DDQN algorithm.
FIGURE 9. Mean discrepancy of Q table in the training process of two
DRL approaches.
To exhibit the effect of the dueling network on future decisions in this autonomous driving problem, Fig. 10 depicts the track of the cumulative rewards. The upward trend implies that the chosen control actions are capable of improving future rewards. As the cumulative reward of DDQN is larger than that of DQN, the related agent achieves better control performance. This is also attributed to the advantage network in DDQN, which enables the ego vehicle to quantify the potential worth of the current control action. To assess the above-mentioned decision-making policies in a similar driving condition, the next subsection discusses the adaptability of these strategies.
FIGURE 10. Accumulated rewards in DQN and DDQN: the higher
accumulated reward indicates better control action choices.
C. ADAPTABILITY ESTIMATION
After learning and training the automated vehicles in the highway driving environment, short episodes are applied to test their adaptive capacity. The number of testing episodes is 10 in this work. The default settings and the numbers of lanes and surrounding vehicles are the same as in the training process. The learned parameters of the neural networks are saved and can be utilized directly in the new conditions. The elements of most concern are the average reward and the collision conditions of the testing operation.
FIGURE 11. Normalized reward in the testing experiment of the three compared methods (points A and B mark the two episodes analyzed below).
FIGURE 12. One typical testing driving condition (snapshots at t = 21 s, 35 s, and 56 s): the ego vehicle has to execute car-following behavior for a long time.
Fig. 11 shows the normalized average reward of the reference model, DQN, and DDQN methods in the testing experiment. From (27), the reward is mainly influenced by the collision conditions and the vehicle speed. The average reward may not achieve the highest score (100 in this work) because the ego vehicle sometimes has to slow down to avoid a collision, and it also needs to change to other lanes to realize the overtaking process. Without loss of generality, two typical situations (the two episodes marked as points A and B in Fig. 11) are chosen to analyze the decision-making behaviors of the ego vehicle.
FIGURE 13. Another representative testing driving condition (t = 40 s: car-following; t = 50 s: the ego vehicle finds a small space for overtaking; t = 52 s: a collision happens): the ego vehicle makes a dangerous lane change and a collision happens.
Fig. 12 depicts one driving situation in which there are three surrounding vehicles in front of the ego vehicle (the episode represented by point A). The ego vehicle has to execute the car-following maneuver for a long time and wait for the opportunity to overtake them. As a consequence, the vehicle speed may not reach the maximum value, and the ego vehicle may not surpass all the surrounding vehicles before the destination. Furthermore, an infrequent driving condition is described in Fig. 13 (the episode represented by point B). The ego vehicle attempts a risky lane change to obtain higher rewards. However, it crashed into nearby vehicles because the operating space was not sufficient. This situation may not have appeared in the training process, and thus the ego vehicle could cause a collision.
TABLE I
TRAINING AND TESTING TIME OF DQN AND DDQN METHODS

Techniques#    Training Time (h)    Testing Time (s)*
DQN            8                    27.89
DDQN           6                    26.56

# A 2.30 GHz microprocessor with 31.8 GB RAM was used.
* The time that uses the trained parameters in a new driving situation.
Based on the detailed analysis of Figs. 12 and 13, it is suggested that more time should be spent training the decision-making strategy. These results also indicate that the relevant control policy has the potential to be applied in real-world environments. Table I provides the training and testing times of the DQN and DDQN approaches. Although the training can only be conducted offline, the learned parameters and policies are able to be utilized online. This inspires us to implant our decision-making policy into visualization simulation environments and to conduct the related in-the-loop experiments in the future.
V. CONCLUSION
This paper discusses the highway decision-making problem using the DRL technique. By applying the DDQN algorithm in the designed driving environments, an efficient and safe control framework is constructed. Through a series of simulation experiments, the optimality, convergence rate, and adaptability are demonstrated. In addition, the testing results are analyzed, and the potential of the presented method to be applied in real-world environments is shown. Future work includes the online application of highway decision-making by executing hardware-in-the-loop (HIL) experiments. Moreover, real-world collected highway data can be used to evaluate the related overtaking strategy.
REFERENCES
[1] A. Raj, J. A. Kumar, and P. Bansal, “A multicriteria decision making
approach to study barriers to the adoption of autonomous vehicles,”
Transportation research part A: policy and practice, vol. 133, pp.
122-137, 2020.
[2] T. Liu, B. Tian, Y. Ai, L. Chen, F. Liu, and D. Cao, Dynamic States
Prediction in Autonomous Vehicles: Comparison of Three Different
Methods, IEEE Intelligent Transportation Systems Conference
(ITSC 2019), 27-30 Oct 2019.
[3] A. Rasouli, and J. K. Tsotsos, “Autonomous vehicles that interact
with pedestrians: A survey of theory and practice,” IEEE Tran. on
Intell.Trans. Sys., vol. 21, no. 3, pp. 900-918, 2019.
[4] C. Gkartzonikas, and K. Gkritza, “What have we learned? A review
of stated preference and choice studies on autonomous vehicles,”
Transportation Research Part C: Emerging Technologies, vol. 98,
pp. 323-337, 2019.
[5] C. J. Hoel, K. Driggs-Campbell, K. Wolff, L. Laine, and M. J.
Kochenderfer, “Combining planning and deep reinforcement
learning in tactical decision making for autonomous driving,” IEEE
Transactions on Intelligent Vehicles, vol. 5, no. 2, pp. 294-305, 2019.
[6] C. J. Hoel, K. Wolff, and L. Laine, “Tactical decision-making in
autonomous driving by reinforcement learning with uncertainty
estimation,” arXiv preprint arXiv:2004.10439, 2020.
[7] A. Driving, “Levels of driving automation are defined in new SAE
international standard J3016: 2014,” SAE International: Warrendale,
PA, USA, 2014.
[8] P. Hart, and A. Knoll, “Using counterfactual reasoning and
reinforcement learning for decision-making in autonomous driving,”
arXiv preprint arXiv:2003.11919, 2020.
[9] W. Song, G. Xiong, and H. Chen, “Intention-aware autonomous
driving decision-making in an uncontrolled intersection,”
Mathematical Problems in Engineering, 2016.
[10] A. Furda, and L. Vlacic, “Enabling safe autonomous driving in
real-world city traffic using multiple criteria decision making,”
IEEE Intelli. Trans. Sys. Magaz., vol. 3, no. 1, pp. 4-17, 2011.
[11] J. Nie, J. Zhang, E. Ding, X. Wan, X. Chen, and B. Ran,
“Decentralized cooperative lane-changing decision-making for
connected autonomous vehicles,” IEEE Access, vol. 4, pp. 9413-
9420, 2016.
[12] L. Li, K. Ota, and M. Dong, “Humanlike driving: Empirical
decision-making system for autonomous vehicles,” IEEE
Transactions on Vehicular Technology, vol. 67, no. 8, pp. 6814-6823,
2018.
[13] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. G.
Bellemare, and S. Petersen, “Human-level control through deep
reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529-533,
2015.
[14] J. Duan, S. E. Li, Y. Guan, Q. Sun, and B. Cheng, “Hierarchical
reinforcement learning for self-driving decision-making without
reliance on labelled driving data,” IET Intelligent Transport Systems,
vol. 14, no. 5, pp. 297-305, 2020.
[15] M. Kim, S. Lee, J. Lim, J. Choi, and S. Kang, “Unexpected
Collision Avoidance Driving Strategy Using Deep Reinforcement
Learning,” IEEE Access, vol. 8, pp. 17243-17252, 2020.
[16] Q. Zhang, J. Lin, Q. Sha, B. He, and G. Li, “Deep Interactive
Reinforcement Learning for Path Following of Autonomous
Underwater Vehicle,” IEEE Access, vol. 8, pp. 24258-24268, 2020.
[17] C. Chen, J. Jiang, N. Lv, and S. Li, “An Intelligent Path Planning
Scheme of Autonomous Vehicles Platoon Using Deep
Reinforcement Learning on Network Edge,” IEEE Access, 2020.
[18] S. Han, and F. Miao, “Behavior Planning For Connected
Autonomous Vehicles Using Feedback Deep Reinforcement
Learning,” arXiv preprint arXiv:2003.04371, 2020.
[19] S. Nageshrao, H. E. Tseng, and D. Filev, “Autonomous highway
driving using deep reinforcement learning,” In 2019 IEEE
International Conference on Systems, Man and Cybernetics (SMC),
pp. 2326-2331, 2019.
[20] T. Liu, B. Huang, Z. Deng, H. Wang, X. Tang, X. Wang and D. Cao,
“Heuristics-oriented overtaking decision making for autonomous
vehicles using reinforcement learning”, IET Electrical Systems in
Transportation, 2020.
[21] M. Treiber, A. Hennecke, and D. Helbing, “Congested traffic states
in empirical observations and microscopic simulations,” Phys. Rev.
E, vol. 62, pp. 1805-1824, 2000.
[22] M. Zhou, X. Qu, and S. Jin, “On the impact of cooperative
autonomous vehicles in improving freeway merging: a modified
intelligent driver model-based approach,” IEEE Trans. Intell.
Transport. Syst., vol. 18, no. 6, pp. 1422-1428, June 2017.
[23] A. Kesting, M. Treiber, and D. Helbing, “General lane-changing
model MOBIL for car-following models,” Transportation Research
Record, vol. 1999, no. 1, pp. 86-94, 2007.
[24] T. Liu, X. Hu, W. Hu, Y. Zou, “A heuristic planning reinforcement
learning-based energy management for power-split plug-in hybrid
electric vehicles,” IEEE Transactions on Industrial Informatics, vol.
15, no. 12, pp. 6436-6445, 2019.
[25] T. Liu, X. Tang, H. Wang, H. Yu, and X Hu, “Adaptive Hierarchical
Energy Management Design for a Plug-in Hybrid Electric Vehicle,”
IEEE Transactions on Vehicular Technology, vol. 68, no. 12, pp.
11513-11522, 2019.
[26] X. Hu, T. Liu, X. Qi, and M. Barth, “Reinforcement learning for
hybrid and plug-in hybrid electric vehicle energy management:
Recent advances and prospects,” IEEE Industrial Electronics
Magazine, vol. 13, no. 3, pp. 16-25, 2019.
[27] T. Liu, B. Tian, Y. Ai, L. Chen, F. Liu, and D. Cao, Dynamic States
Prediction in Autonomous Vehicles: Comparison of Three Different
Methods, IEEE Intelligent Transportation Systems Conference
(ITSC 2019), 27-30 Oct 2019.
[28] T. Liu, H. Yu, H. Guo, Y. Qin, and Y. Zou, “Online energy
management for multimode plug-in hybrid electric vehicles,” IEEE
Transactions on Industrial Informatics, vol. 15, no. 7, pp. 4352-
4361, July 2019.
[29] M. L. Puterman, “Markov decision processes: Discrete stochastic dynamic programming,” John Wiley & Sons, 2014.
[30] R. Sutton, and A. Barto, “Reinforcement learning: An introduction
(second edition),” MIT press, 2018.
[31] L. Edouard, “An environment for autonomous driving decision-making,” https://github.com/eleurent/highway-env, GitHub, 2018.