Decision-making Strategy on Highway for Autonomous Vehicles using Deep Reinforcement Learning
Jiangdong Liao1, Teng Liu2, Xiaolin Tang2, Xingyu Mu2, Bing Huang2, and Dongpu Cao3
1School of Mathematics and Statistics, Yangtze Normal University, Fuling, 408100, China
2College of Automotive Engineering, Chongqing University, Chongqing, 400044, China
3Department of Mechanical and Mechatronics Engineering, University of Waterloo, N2L 3G1, Canada
Corresponding author: Teng Liu (e-mail: tengliu17@gmail.com), Xiaolin Tang (e-mail: tangxl0923@cqu.edu.cn).
This work was in part supported by the State Key Laboratory of Mechanical System and Vibration (Grant No. MSV202016).
ABSTRACT Autonomous driving is a promising technology to reduce traffic accidents and improve
driving efficiency. In this work, a deep reinforcement learning (DRL)-enabled decision-making policy is
constructed for autonomous vehicles to address the overtaking behaviors on the highway. First, a highway
driving environment is founded, wherein the ego vehicle aims to pass through the surrounding vehicles with
an efficient and safe maneuver. A hierarchical control framework is presented to control these vehicles,
which indicates the upper-level manages the driving decisions, and the lower-level cares about the
supervision of vehicle speed and acceleration. Then, the particular DRL method named dueling deep Q-
network (DDQN) algorithm is applied to derive the highway decision-making strategy. The exhaustive
calculative procedures of deep Q-network and DDQN algorithms are discussed and compared. Finally, a
series of estimation simulation experiments are conducted to evaluate the effectiveness of the proposed
highway decision-making policy. The advantages of the proposed framework in convergence rate and
control performance are illuminated. Simulation results reveal that the DDQN-based overtaking policy
could accomplish highway driving tasks efficiently and safely.
INDEX TERMS Autonomous driving, decision-making, deep reinforcement learning, dueling deep Q-
network, deep Q-learning, overtaking policy.
I. INTRODUCTION
Autonomous driving (AD) enables a vehicle to undertake different driving missions without a human driver [1, 2].
Motivated by the enormous potential of artificial intelligence (AI), autonomous or automated vehicles have become one of the research hotspots all over the world [3]. Many automobile manufacturers, such as Toyota, Tesla, Ford, Audi, Waymo, Mercedes-Benz, and General Motors, are developing their own autonomous cars and achieving tremendous progress. Meanwhile, automotive researchers are striving to master the essential technologies needed to build automated cars with full automation [4].
Four significant modules are contained in autonomous vehicles: perception, decision-making, planning, and control [5]. Perception means that the autonomous vehicle acquires information about the driving environment through a variety of sensors, such as radar, lidar, and the global positioning system (GPS). The decision-making controller manages the driving behaviors of the vehicle, including acceleration, braking, lane-changing, and lane-keeping [6]. The planning function helps the automated car find reasonable running trajectories from one point to another. Finally, the control module commands the onboard powertrain components to operate accurately, finish the driving maneuvers, and follow the planned path. According to the degree of intelligence of these modules, AD is classified into six levels, from L0 to L5 [7].
The decision-making strategy is regarded as the brain of an autonomous vehicle and is extremely important [8].

FIGURE 1. The constructed deep reinforcement learning-enabled highway overtaking driving policy for autonomous vehicles.

This policy is often generated from manual rules based on
human driving experience or from imitated manipulation learned by supervised learning approaches. For example, Song et al. applied a continuous hidden Markov chain to predict the motion intentions of the surrounding vehicles; a partially observable Markov decision process (POMDP) is then used to construct the general decision-making framework [9]. The authors in [10] developed an advanced ability to make appropriate decisions in city road traffic situations. The presented decision-making policy is based on multiple criteria, which helps city cars make feasible choices under different conditions. In Ref. [11], Nie et al. discussed the lane-changing decision-making strategy for connected automated cars; the related model combines cooperative car-following models with a candidate decision generation module. Furthermore, the authors in [12] introduced the idea of a human-like driving system, which adjusts the driving decisions by considering the driving demands of human drivers.
Deep reinforcement learning (DRL) techniques are regarded as a powerful tool for long sequential decision-making problems [13]. In recent years, many attempts have been made to study DRL-based autonomous driving topics. For example, Duan et al. built a hierarchical structure to learn the decision-making policy via the reinforcement learning (RL) method [14]. The advantage of this work is that it does not rely on historical labeled driving data. Refs. [15, 16] utilized DRL approaches to handle collision avoidance and path-following problems for automated vehicles; in both studies, the control performance is better than that of conventional RL methods. Furthermore, the authors in [17] considered not only path planning but also fuel consumption for autonomous vehicles. The related algorithm is deep Q-learning (DQL), and it was proven to accomplish these two driving missions suitably. Han et al. employed the DQL algorithm to decide between lane changing and lane keeping for connected autonomous cars, in which the information of the nearby vehicles is treated as feedback knowledge from the network [18]. The resulting policy is able to improve traffic flow and driving comfort. However, common DRL methods are unable to address highway overtaking problems because of the continuous action space and large state space [19].
In this work, a DRL-enabled highway overtaking driving policy is constructed for autonomous vehicles. The proposed decision-making strategy is evaluated and shown to be adaptive to other complicated scenarios, as depicted in Fig. 1.
First, the studied driving environment is founded on the
highway, wherein an ego vehicle aims to run through a
particular driving scenario efficiently and safely. Then, a
hierarchical control structure is shown to manipulate the
lateral and longitudinal motions of the ego and surrounding
vehicles. Furthermore, the special DRL algorithm called
dueling deep Q-network (DDQN) is derived and utilized to
obtain the highway decision-making strategy. The DQL and
DDQN algorithms are compared and analyzed theoretically.
Finally, the performance of the proposed control framework
is discussed via executing a series of simulation experiments.
Simulation results reveal that the DDQN-based overtaking
policy could accomplish highway driving tasks efficiently
and safely.
The main contributions and innovations of this work can
be cast into three perspectives: 1) an adaptive and optimal
DRL-based highway overtaking strategy is proposed for
automated vehicles; 2) the dueling deep Q-network (DDQN)
algorithm is leveraged to address the large state space of the
decision-making problem; 3) the convergence rate and
control optimization of the derived decision-making policy
are demonstrated by multiple designed experiments.
The remainder of this article is organized as follows: the highway driving environment and the control modules of the ego and surrounding vehicles are described in Section II. The DQL and DDQN algorithms are defined in Section III, in which the parameters of the RL framework are discussed in detail. Section IV shows the relevant results of a series of simulation experiments. Finally, conclusions are drawn in Section V.
II. DRIVING ENVIRONMENT AND CONTROL MODULE
In this section, the studied driving scenario on the highway is
introduced. Without loss of generality, a three-lane freeway
environment is constructed. Furthermore, a hierarchical
motion controller is described to manage the lateral and
longitudinal movements of the ego and surrounding vehicles.
The upper-level contains two models, which are the intelligent driver model (IDM) and the model of minimizing overall braking induced by lane changes (MOBIL) [20]. The lower-level focuses on regulating vehicle velocity and acceleration.
A. HIGHWAY DRIVING SCENARIO
Decision-making in autonomous driving means selecting a sequence of reasonable driving behaviors to achieve specific driving missions. On the highway, these behaviors involve lane-changing, lane-keeping, acceleration, and braking. The main objectives are avoiding collisions, running efficiently, and driving on the preferred lane. Accelerating and surpassing other vehicles is a typical driving behavior called overtaking.
FIGURE 2. Highway driving environment for decision-making problem
with three lanes.
This work discusses the decision-making problem on the highway for autonomous vehicles, and the studied driving scenario is depicted in Fig. 2. The orange vehicle is the ego vehicle, and the green cars are the surrounding vehicles. There are three lanes in the driving environment, and the derived decision-making policy is easily generalized to different situations. The ego vehicle is initialized in the middle lane at a random speed.
The objective of the ego vehicle is to run from the starting point to the end point as soon as possible without crashing into the surrounding vehicles. Hence, this goal is interpreted as efficiency and safety. The initial velocities and positions of the surrounding vehicles are set randomly, which implies that the driving scenario contains uncertainties, as in actual driving. Furthermore, to imitate real conditions, the ego vehicle prefers to stay on lane 1 (L=1), and it can overtake other vehicles from the right or left side.
At the beginning of this driving task, all the surrounding vehicles are located in front of the ego vehicle. In each lane, the number of surrounding vehicles is M, which indicates there are 3M nearby cars in this situation. Two conditions terminate an episode for the ego vehicle: crashing into another vehicle or reaching the destination. The procedure of running from the starting point to the ending point is called one episode in this work.
Without loss of generality, the parameters of the driving scenario are set as follows: the original speed of the ego vehicle is chosen from [23, 25] m/s, its maximum speed is 40 m/s, and the length and width of all vehicles are 5 m and 2 m, respectively. The duration of one episode is 100 s, and the simulation frequency is 20 Hz (these settings are collected in the sketch below). The initial velocity of the surrounding vehicles is randomly chosen from [20, 23] m/s, and their behaviors are manipulated by IDM and MOBIL. The next subsection discusses these two models in detail.
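As a concrete reference, the scenario parameters above can be gathered into a single configuration. The sketch below assumes the open-source highway-env simulator built on the OpenAI Gym toolkit that Section III.D later refers to; the package name, environment id, and configuration keys are assumptions and may differ between versions.

```python
import gym
import highway_env  # assumed package that registers the "highway-v0" environment

# Scenario parameters from Section II.A, expressed as a configuration dictionary
# (key names follow highway-env conventions and are assumptions here).
scenario_config = {
    "lanes_count": 3,            # three-lane freeway
    "vehicles_count": 30,        # 3M surrounding vehicles with M = 10 (Section III.D)
    "duration": 100,             # one episode lasts 100 s
    "simulation_frequency": 20,  # 20 Hz physics update
    "policy_frequency": 1,       # one decision per second (Section III.D)
}

env = gym.make("highway-v0")
env.configure(scenario_config)
obs = env.reset()
```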
B. VEHICLE BEHAVIOR CONTROLLER
The movements of all the vehicles in the highway environment are governed by a hierarchical control framework, as shown in Fig. 3. The upper-level applies IDM and MOBIL to manage the vehicle behaviors, and the lower-level aims to enable the ego vehicle to track a given target speed and follow a target lane. In this work, the DRL method is used to control the ego vehicle. The reference model means that the ego vehicle is instead controlled by the IDM- and MOBIL-based bi-level structure in Fig. 3, and it is taken as a benchmark to evaluate the DRL-based decision-making strategy.
IDM in the upper-level is a prevalent microscopic model [21] for realizing car-following and collision avoidance. In the adaptive cruise controller of automated cars, the longitudinal behavior is usually decided by IDM. In general, the longitudinal acceleration in IDM is determined as [22]:
$a = a_{max}\left[1 - \left(\frac{v}{v_{tar}}\right)^{\delta} - \left(\frac{d_{tar}}{d}\right)^{2}\right]$   (1)
where v and a are the current vehicle speed and acceleration, respectively, $a_{max}$ is the maximum acceleration, d is the distance to the front car, and δ is the constant acceleration exponent. $v_{tar}$ and $d_{tar}$ are the target velocity and target distance, and the desired speed is approached according to $a_{max}$ and $d_{tar}$. In IDM, the expected distance $d_{tar}$ is affected by the front vehicle and is calculated as follows:
[Figure 3 layout: the reference model and the unified control model for the surrounding vehicles use Equations (1)-(4) in the upper-level and Equations (5)-(7) in the lower-level, while the DRL model uses the DDQN algorithm in the upper-level and Equations (23)-(24) in the lower-level; all vehicles are given random initial speeds and positions.]
FIGURE 3. The hierarchical control framework discussed in this work for the ego vehicle and surrounding vehicles.
$d_{tar} = d_0 + Tv + \frac{v\,\Delta v}{2\sqrt{a_{max} b}}$   (2)
where $d_0$ is the predefined minimum relative distance, T is the expected time interval for the safety goal, $\Delta v$ is the relative speed between the two vehicles, and b is the deceleration rate chosen for comfort.
In IDM, the relative speed and distance are defined a priori to induce the vehicle velocity and acceleration at each time step. The default configuration is introduced as follows: the maximum acceleration $a_{max}$ is 6 m/s2, the acceleration exponent δ is 4, the desired time gap T is 1.5 s, the comfortable deceleration rate b is -5 m/s2, and the minimum relative distance $d_0$ is 10 m.
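As a concrete reference, the following Python sketch evaluates (1)-(2) with the default parameters listed above; the function and variable names are illustrative, and the comfortable deceleration b enters the square root through its magnitude.

```python
import math

# Default IDM parameters from Section II.B.
A_MAX = 6.0    # maximum acceleration a_max (m/s^2)
DELTA = 4.0    # constant acceleration exponent
T_GAP = 1.5    # desired time gap T (s)
B_COMF = 5.0   # magnitude of the comfortable deceleration b (m/s^2)
D0 = 10.0      # minimum relative distance d_0 (m)

def idm_acceleration(v, v_tar, d, dv):
    """IDM longitudinal acceleration: v is the own speed, v_tar the target speed,
    d the gap to the front car, and dv the relative (closing) speed."""
    d_tar = D0 + T_GAP * v + v * dv / (2.0 * math.sqrt(A_MAX * B_COMF))   # Eq. (2)
    return A_MAX * (1.0 - (v / v_tar) ** DELTA - (d_tar / d) ** 2)        # Eq. (1)

# Example: following a car 60 m ahead while closing at 2 m/s gives mild braking,
# because the desired gap is slightly smaller than the actual one.
a = idm_acceleration(v=24.0, v_tar=30.0, d=60.0, dv=2.0)
```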
While IDM is utilized to determine the longitudinal behavior, MOBIL is employed to make the lateral lane-change decisions [23]. MOBIL states that lane-changing behaviors should satisfy two restrictions, which are the safety criterion and the incentive condition. These constraints are related to the ego vehicle and its two followers (before changing and after changing). Assume $a_i^{old}$ and $a_j^{old}$ are the accelerations of these followers before changing, and $a_i^{new}$ and $a_j^{new}$ are the accelerations after changing.
The safety criterion requires the follower in the desired lane (after changing) to limit its acceleration to avoid a collision. The mathematical expression is:
$a_j^{new} \geq -b_{safe}$   (3)
where $b_{safe}$ is the maximum braking imposed on the follower during the lane-changing behavior. By following (3), collisions and accidents can be avoided effectively.
The incentive condition is imposed on the ego vehicle and
its followers by an acceleration threshold ath:
$a_e^{new} - a_e^{old} + z\left[\left(a_i^{new} - a_i^{old}\right) + \left(a_j^{new} - a_j^{old}\right)\right] \geq a_{th}$   (4)
where z is the politeness coefficient that determines how strongly the followers are considered in the lane-changing behavior. This incentive condition means that changing to the desired lane should be more beneficial than staying in the old lane. For application, the parameters in MOBIL are defined as follows: the politeness factor z is 0.001, the safe deceleration limit $b_{safe}$ is 2 m/s2, and the acceleration threshold $a_{th}$ is 0.2 m/s2. After deciding the longitudinal and lateral behaviors in the upper-level, the lower-level is applied to follow the target speed and lane.
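The two MOBIL conditions can be checked in a few lines of Python; the sketch below is a minimal illustration of (3)-(4) with the parameters given above, and the argument names are assumptions.

```python
# MOBIL parameters from Section II.B.
B_SAFE = 2.0   # safe deceleration limit b_safe (m/s^2)
A_TH = 0.2     # acceleration threshold a_th (m/s^2)
Z = 0.001      # politeness coefficient z

def mobil_allows_lane_change(a_e_new, a_e_old, a_i_new, a_i_old, a_j_new, a_j_old):
    """Return True if a lane change satisfies both MOBIL restrictions.
    The arguments are the accelerations of the ego vehicle (e) and of its followers
    in the current lane (i) and target lane (j), before (old) and after (new) the change."""
    safety = a_j_new >= -B_SAFE                                              # Eq. (3)
    incentive = (a_e_new - a_e_old
                 + Z * ((a_i_new - a_i_old) + (a_j_new - a_j_old))) >= A_TH  # Eq. (4)
    return safety and incentive
```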
C. VEHICLE MOTION CONTROLLER
In the lower-level, the motions of the vehicles in the longitudinal and lateral directions are controlled. The former regulates the acceleration by a proportional controller as:
$a = K_p\left(v_{tar} - v\right)$   (5)
where Kp is the proportional gain.
In the lateral direction, the controller deals with the position and heading of the vehicle with a simple proportional-derivative action. For the position, the lateral speed $v_{lat}$ of the vehicle is computed as follows:

$v_{lat} = -K_{p,lat}\,\Delta_{lat}$   (6)

where $K_{p,lat}$ is the position gain and $\Delta_{lat}$ is the lateral position of the vehicle with respect to the center-line of the lane. Then, the heading control computes the yaw rate command from the heading error as:

$\dot{\varphi} = K_{p,\varphi}\left(\varphi_{tar} - \varphi\right)$   (7)

where $\varphi_{tar}$ is the target heading angle to follow the desired lane and $K_{p,\varphi}$ is the heading gain.
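A minimal sketch of the lower-level controllers (5)-(7) is given below; the gain values are placeholders rather than the ones used in the paper.

```python
# Placeholder gains for the lower-level controllers.
K_P = 0.3        # speed gain K_p
K_P_LAT = 0.2    # position gain K_p,lat
K_P_HEAD = 1.0   # heading gain

def longitudinal_acceleration(v, v_tar):
    return K_P * (v_tar - v)                  # Eq. (5): proportional speed control

def lateral_speed(delta_lat):
    return -K_P_LAT * delta_lat               # Eq. (6): pull toward the lane center-line

def yaw_rate_command(phi, phi_tar):
    return K_P_HEAD * (phi_tar - phi)         # Eq. (7): proportional heading control
```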
Hence, the movements of the surrounding vehicles are achieved by the bi-level control framework in Fig. 3. The position, speed, and acceleration of these vehicles are assumed to be known to the ego vehicle, whereas their control policies are not. This limitation propels the ego vehicle to learn how to drive in the scenario via a trial-and-error procedure. In the next section, the DRL approach is introduced and established to realize this learning process and derive the highway decision-making policy.
III. DRL METHODOLOGY
This section introduces the RL method and presents the specific DRL algorithms. The interaction in RL between the agent and the environment is first explained. Then, the DQL algorithm, which incorporates a neural network into the Q-learning algorithm, is formulated. Finally, a dueling network is constructed within the DQL algorithm to reconstitute the output layer of the neural network, which yields the DDQN method.
A. RL CONCEPT
The RL approach describes the process in which an intelligent agent interacts with its environment. It is a powerful and useful tool for solving sequential decision-making problems. The goal of the agent is to search for an optimal sequence of control actions based on feedback from the environment. Owing to its characteristics of self-evaluation and self-promotion, RL is widely used in many research fields [24-28].
In the decision-making problem on the highway, the agent and the environment are the ego vehicle and the surrounding vehicles (including the driving conditions), respectively. A Markov decision process (MDP) is built on the Markov property, which states that the next state depends only on the current state and action [29]. The MDP is often utilized to represent the RL interaction as a tuple (S, A, P, R, γ), in which S and A are the state and control action sets. P and R are the essential elements of the environment in RL, representing the transition model and the reward model, respectively. In RL, the current action influences the immediate and future rewards simultaneously. Hence, γ is a discount factor to balance these two parts of the reward.
To represent the list of future rewards, the accumulated
reward Rt is defined as follows:
$R_t = \sum_{t} \gamma^{t} r_{t}$   (8)
where t is the time instant and $r_t$ is the relevant reward. To record the worth of the state s and the state-action pair (s, a), two value functions are expressed by the accumulated reward as:
$V_{\pi}(s_t) = \mathbb{E}_{\pi}\left[R_t \mid s_t\right]$   (9)

$Q_{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[R_t \mid s_t, a_t\right]$   (10)
where π is the control action policy, V is the state-value function, and Q is the state-action value function (called the Q table for short). To be updated easily, the state-action function is usually rewritten in the recursive form:
$Q_{\pi}(s_t, a_t) = \mathbb{E}\left[r_{t+1} + \gamma \max_{a_{t+1}} Q_{\pi}(s_{t+1}, a_{t+1})\right]$   (11)
Finally, the optimal control action with respect to the
control policy π is determined by the state-action function:
$\pi(s_t) = \arg\max_{a_t} Q(s_t, a_t)$   (12)
Therefore, the essence of different RL algorithms is updating the state-action function Q(s, a) in various ways. According to the style of the updating rules, RL algorithms can be classified in diverse ways, such as model-based and model-free, policy-based and value-based, and temporal-difference (TD) and Monte-Carlo (MC) methods [30].
B. DEEP Q NETWORK
The deep Q-network (DQN) was first presented to play Atari games in [13]. It synthesizes the strengths of deep learning (neural networks) and Q-learning to obtain the state-action value function. In common Q-learning, the updating rule of this function is as follows:
$Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s,a)\right]$   (13)
where α ∈ [0, 1] is the learning rate that trades off the old and newly learned experiences from the environment, and s' and a' are the state and action at the next time step.
Common Q-learning is unable to handle problems with a large state space, because it would need an enormous amount of time to obtain the complete Q table. Thus, in DQN, a neural network is employed to approximate the Q table as Q(s, a; θ). For this neural network, the inputs are the arrays of state variables and control actions, and the output is the state-action value function [31].
To measure the discrepancy between the approximated and actual Q tables in DQN, the loss function is introduced as follows:

$L(\theta) = \mathbb{E}\left[\sum_{t=1}^{N}\left(y_t - Q(s, a; \theta)\right)^{2}\right]$   (14)
where

$y_t = r_t + \gamma \max_{a'} Q(s', a'; \theta')$   (15)
As can be seen, there are two parameter sets (θ and θ') of the neural network, which represent two networks in DQN: the prediction network and the target network. The former is applied to estimate the current control action, and the latter aims to generate the target value.

[Figure 4 layout: the state variables (vehicle speed, relative position) and control actions (acceleration, lane-changing) are processed by fully connected layers that split into a state-value network V(s) and an advantage network A(s, a).]

FIGURE 4. The dueling network combined with the state-value network and the advantage network for Q table updating.

In general, the target
network copies the parameters from the prediction network every certain number of time steps. By doing this, the target Q table converges toward the predicted one to some extent, which mitigates the instability of the network.
In DQN, the online neural network is updated by gradient descent as follows:
$\nabla_{\theta} L(\theta) = \mathbb{E}\left[\left(y_t - Q(s, a; \theta)\right)\nabla_{\theta} Q(s, a; \theta)\right]$   (16)
This operation makes DQN an off-policy algorithm, and the states and rewards are acquired by a special criterion. This rule is known as epsilon-greedy, which indicates that the agent executes exploration (choosing a random action) with probability ε and exploitation (using the current best action) with probability 1-ε.
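The procedure of (13)-(16), together with the epsilon-greedy rule, can be sketched in PyTorch as follows; the network width, learning rate, and state/action dimensions are placeholders, not the exact settings of this paper.

```python
import random
import torch
import torch.nn as nn

# Schematic DQN update implementing (13)-(16); architecture and hyper-parameters
# below are illustrative placeholders.
class QNet(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))

    def forward(self, s):
        return self.net(s)

GAMMA, EPS, N_ACTIONS = 0.8, 0.1, 5
policy_net, target_net = QNet(4, N_ACTIONS), QNet(4, N_ACTIONS)
target_net.load_state_dict(policy_net.state_dict())   # periodic parameter copy
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def select_action(state):
    """Epsilon-greedy rule: explore with probability EPS, otherwise exploit."""
    if random.random() < EPS:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return int(policy_net(state).argmax())

def dqn_update(s, a, r, s_next, done):
    """One gradient step on the loss (14) with the target (15); the gradient (16)
    is obtained automatically by backpropagation. Inputs are batched tensors."""
    q = policy_net(s).gather(1, a.unsqueeze(1)).squeeze(1)               # Q(s, a; theta)
    with torch.no_grad():
        y = r + GAMMA * target_net(s_next).max(1).values * (1.0 - done)  # Eq. (15)
    loss = nn.functional.mse_loss(q, y)                                  # Eq. (14)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```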
C. DUELING DQN ALGORITHM
In some RL problems, the selection of the current control action may not cause obviously negative results. For example, in the highway environment, many actions would not lead to a collision immediately; however, these choices may indirectly result in bad rewards afterward [32]. Motivated by this insight, a dueling network is adopted in this work to estimate the worth of the control actions at each step. A new neural network is constructed to approximate the Q table in the highway decision-making problem, as shown in Fig. 4.
Two streams of fully connected layers are used to estimate
the state-value function V(s) and the advantage function A(s,
a) of each action. Therefore, the state-action function (Q
table) is constituted as follows:
$Q_{\pi}(s, a) = A_{\pi}(s, a) + V_{\pi}(s)$   (17)
It is obvious that the output of this new dueling network is also a Q table, and thus the neural network used in DQN can also be employed to approximate this Q table. The network with two parameter sets is computed as:

$Q(s, a; \theta_1, \theta_2) = V(s; \theta_1) + A(s, a; \theta_2)$   (18)
where θ1 and θ2 are the parameters of state-value function
and advantage function, respectively.
To update the Q table in DDQN and achieve the optimal
control action, (18) is reformulated as follows:
$Q(s, a; \theta_1, \theta_2) = V(s; \theta_1) + \left(A(s, a; \theta_2) - \max_{a'} A(s, a'; \theta_2)\right)$   (19)

$a^{*} = \arg\max_{a} Q(s, a; \theta_1, \theta_2) = \arg\max_{a} A(s, a; \theta_2)$   (20)
It can be discerned that the input-output interfaces of DDQN and DQN are the same. Hence, the gradient descent in (16) can be reused to train the Q table in this work.
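The dueling head of (17)-(19) can be written as a small PyTorch module; the layer width below is illustrative, and the max-based aggregation follows (19).

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """A shared trunk splits into a state-value stream V(s; theta_1) and an
    advantage stream A(s, a; theta_2), aggregated as in Eq. (19)."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
        self.value = nn.Linear(128, 1)               # V(s; theta_1)
        self.advantage = nn.Linear(128, n_actions)   # A(s, a; theta_2)

    def forward(self, s):
        h = self.trunk(s)
        v, adv = self.value(h), self.advantage(h)
        # Subtracting the maximum advantage keeps argmax_a Q = argmax_a A, Eq. (20).
        return v + adv - adv.max(dim=-1, keepdim=True).values
```

Because the output is again a Q table, the DQN update sketched in Section III.B can be reused without modification.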
D. VARIABLES SPECIFICATION
To derive the DDQN-based decision-making strategy, the preliminaries are initialized as follows, and the calculative procedure can easily be transferred to an analogous driving environment. The control actions are the longitudinal and lateral accelerations ($a_1$ and $a_2$), with units of m/s2 and rad:

$a_1 \in [-5, 5]\ \mathrm{m/s^2}$   (21)

$a_2 \in [-\pi/4, \pi/4]\ \mathrm{rad}$   (22)
It is noted that when these two accelerations are zero, the ego vehicle adopts an idling control.
After obtaining the acceleration actions, the speed and position of the vehicle can be computed as follows:

$v_1^{t+1} = v_1^{t} + a_1 \Delta t, \qquad v_2^{t+1} = v_2^{t} + a_2 \Delta t$   (23)

$d_1 = v_1^{t} \Delta t + \tfrac{1}{2} a_1 \Delta t^2, \qquad d_2 = v_2^{t} \Delta t + \tfrac{1}{2} a_2 \Delta t^2$   (24)
where $v_1$ and $v_2$ are the longitudinal and lateral speeds of the vehicle, respectively, and $d_1$ and $d_2$ are the corresponding displacements. The policy frequency is 1 Hz, which indicates the time interval $\Delta t$ is 1 s. It should be noticed that (23) and (24) hold for the ego vehicle and the surrounding vehicles simultaneously, and these expressions are considered as the transition model P in RL. Then, the state variables are defined as the relative speed and distance between the ego and nearby cars:
$\Delta d_t = d_t^{ego} - d_t^{sur}$   (25)

$\Delta v_t = v_t^{ego} - v_t^{sur}$   (26)
where the superscripts ego and sur represent the ego vehicle and the surrounding vehicles, respectively.
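A direct transcription of the transition model (23)-(24) and the relative-state features (25)-(26) is given below; the 1-s step follows the policy frequency stated above, and the function names are illustrative.

```python
DT = 1.0  # policy step (s), from the 1 Hz policy frequency

def step_kinematics(v1, v2, a1, a2):
    """Propagate longitudinal/lateral speeds and displacements over one policy step."""
    v1_next = v1 + a1 * DT                      # Eq. (23)
    v2_next = v2 + a2 * DT
    d1 = v1 * DT + 0.5 * a1 * DT ** 2           # Eq. (24)
    d2 = v2 * DT + 0.5 * a2 * DT ** 2
    return v1_next, v2_next, d1, d2

def relative_state(d_ego, v_ego, d_sur, v_sur):
    """State features of Eqs. (25)-(26): relative distance and speed to a nearby car."""
    return d_ego - d_sur, v_ego - v_sur
```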
Finally, the reward model R is constituted by the optimal control objectives, which are avoiding collisions, running as fast as possible, and trying to drive on lane 1 (L=1). To bring this insight to fruition, the instantaneous reward function is defined as follows:

$r_t = -1 \cdot \mathrm{collision} + 0.1\left(\frac{v_t^{ego}}{v_{max}^{ego}}\right)^{2} - 0.4\,(L-1)^{2}$   (27)
where collision ∈ {0, 1}, and the goal of the DDQN-based highway decision-making strategy is to maximize the cumulative reward.
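The instantaneous reward (27) translates directly into a one-line function; the weights below follow the reconstruction of (27) and should be read as illustrative.

```python
V_MAX = 40.0  # maximum speed of the ego vehicle (m/s)

def instantaneous_reward(collision, v_ego, lane):
    """Reward of Eq. (27): collision is 0 or 1, lane = 1 is the preferred lane."""
    return -1.0 * collision + 0.1 * (v_ego / V_MAX) ** 2 - 0.4 * (lane - 1) ** 2
```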
The proposed decision-making control policy is trained and evaluated in a simulation environment based on the OpenAI Gym Python toolkit [31]. The numbers of lanes and surrounding vehicles are 3 and 30, respectively. The discount factor γ and the learning rate α are 0.8 and 0.2. The hidden layers of the value network and the advantage network both have 128 units. The value of ε decreases from 1 to 0.05 over 6000 time steps. The number of training episodes for each DRL approach is 2000. The next section discusses the effectiveness of the presented decision-making strategy for autonomous vehicles.
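For reference, the training settings listed above can be gathered in one place together with a linear exploration schedule; the schedule shape and key names are assumptions, and only the numerical values come from the text.

```python
# Training settings from Section III.D.
train_config = {
    "lanes_count": 3,
    "vehicles_count": 30,
    "gamma": 0.8,          # discount factor
    "learning_rate": 0.2,  # learning rate alpha
    "hidden_units": 128,   # value and advantage streams
    "episodes": 2000,      # training episodes per DRL approach
}

def epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=6000):
    """Exploration rate decaying from 1 to 0.05 over 6000 time steps (assumed linear)."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```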
IV. RESULTS AND EVALUATION
In this section, the proposed highway decision-making policy is evaluated by comparing it with benchmark methods. These techniques are the reference model in Fig. 3 and the common DQN in Section III.B. The optimality is analyzed by comparing these three methods. Furthermore, the adaptability of the presented approach is verified by implementing the trained model in a similar highway driving scenario.
A. OPTIMALITY EVALUATION
The reference model, DQN, and DDQN are compared in this subsection. All of them adopt a hierarchical control framework. The lower-levels are the same and utilize (5)-(7) to regulate the acceleration, position, and heading. The upper-levels are different: IDM and MOBIL, the DQN algorithm in Section III.B, and the DDQN algorithm in Section III.C, respectively. The default parameters are the same in DQN and DDQN.
Fig. 5 depicts the normalized average rewards of these three methods. Based on the definition of the reward function in (27), a higher reward indicates driving on the preferred lane with a more efficient maneuver. It is obvious that the training stability and learning speed of DDQN are better than those of the other two approaches. Besides, after about 500 episodes, the reward of DDQN exceeds that of the other two approaches and keeps this advantage thereafter, and both DRL methods outperform the reference model. This is mainly caused by the advantage network in DDQN, which assesses the worth of the chosen action at each step and thus helps the ego vehicle find a better decision-making policy more quickly.
FIGURE 5. Average reward variation in three compared methods: the
reference model, DQN and DDQN.
FIGURE 6. Vehicle speed and traveling distance of the ego vehicle in
each episode of these compared techniques.
To observe the trajectories of the state variables in this work, Fig. 6 shows the average vehicle speed and traveling distance of these three compared techniques. They are all trained for 2000 episodes. A higher average speed implies that the ego vehicle runs through the driving scenario faster and achieves greater cumulative rewards. A longer traveling distance means the ego vehicle drives longer without collision. These results directly reflect the safety and efficiency demands. The noticeable differences certify the optimality of the proposed algorithm.
As the ego vehicle is not willing to crash into the surrounding vehicles, the collision conditions of these three control cases are described in Fig. 7, wherein collision takes two values (collision = 0 or 1). It can be noticed that the DDQN-, DQN-, and reference model-enabled agents could avoid collisions after about 1300, 1700, and 1950 episodes, respectively. This observation also proves that the DDQN-based agent is more intelligent than the other two agents. As safety is the first concern for the practical application of automated driving, the learned decision-making model based on DDQN is more promising for deployment in real-world environments.
FIGURE 7. Collision conditions of the ego vehicle in each compared
method: collision=0, the ego vehicle does not crash other vehicles;
collision=1, the ego vehicle crashes other vehicles.
FIGURE 8. Control actions in one successful episode of three compared
methods: Index=1, changing left lane; Index=2, idling speed; Index=3,
changing right lane; Index=4, running faster; Index=5, running slower.
Furthermore, to demonstrate that the concrete control actions are different in these three methods, the curves of the control action sequences in one successful episode (meaning the ego vehicle drives from the starting point to the ending point) are given in Fig. 8. The actions in the longitudinal and lateral directions of the ego vehicle are unified into five selections: changing to the left lane, changing to the right lane, idling speed, running faster, and running slower. The differences between these trajectories indicate that the proposed decision-making policy differs from the two benchmark methods (in the same successful episode).
Overall, according to all the results displayed in this subsection, the optimality of the DDQN-enabled decision-making strategy is illuminated.
B. COMPARISON BETWEEN DQN AND DDQN
Since DQN and DDQN are two prominent DRL algorithms, this experiment aims to appraise the learning and training procedures of these two approaches. As the target of the neural network is to approximate the Q table, the normalized mean discrepancy of the Q table during the training process of these two methods is displayed in Fig. 9. The downward trends indicate that both ego vehicles become more familiar with the driving environment by interacting with it. Furthermore, it can be discerned that DDQN learns more knowledge about the traffic situation within the same number of episodes, which results in a faster learning course. Hence, the ego vehicle can be manipulated more efficiently and safely under the guidance of the DDQN algorithm.
FIGURE 9. Mean discrepancy of Q table in the training process of two
DRL approaches.
To exhibit the effect of the dueling network on future decisions in this autonomous driving problem, Fig. 10 depicts the track of the cumulative rewards. The upward trend implies that the chosen control actions are capable of improving future rewards. As the cumulative reward of DDQN is larger than that of DQN, the related agent achieves better control performance. This is also attributed to the advantage network in DDQN, which enables the ego vehicle to quantify the potential worth of the current control action. To assess the above-mentioned decision-making policies in a similar driving condition, the next subsection discusses the adaptability of these strategies.
FIGURE 10. Accumulated rewards in DQN and DDQN: the higher
accumulated reward indicates better control action choices.
C. ADAPTABILITY ESTIMATION
After learning and training the automated vehicles in the highway driving environment, short episodes are applied to test their adaptive capacity. The number of testing episodes is 10 in this work. The default settings and the numbers of lanes and surrounding vehicles are the same as in the training process. The learned parameters of the neural networks are saved and can be utilized directly in the new conditions. The elements of most concern are the average reward and the collision conditions of the testing operation.
FIGURE 11. Normalized reward in the testing experiment of the three compared methods (points A and B mark the two episodes analyzed below).
FIGURE 12. One typical testing driving condition (snapshots at t = 21 s, 35 s, and 56 s): the ego vehicle has to execute car-following behavior for a long time.
Fig. 11 shows the normalized average reward of the reference model, DQN, and DDQN methods in the testing experiment. From (27), the reward is mainly influenced by the collision conditions and the vehicle speed. The average reward may not achieve the highest score (100 in this work) because the ego vehicle sometimes has to slow down to avoid a collision, and it also needs to change to other lanes to realize the overtaking process. Without loss of generality, two typical situations (the two episodes marked as points A and B in Fig. 11) are chosen to analyze the decision-making behaviors of the ego vehicle.
FIGURE 13. Another representative testing driving condition (t = 40 s: car-following; t = 50 s: the ego vehicle finds a small space for overtaking; t = 52 s: a collision happens): the ego vehicle makes a dangerous lane change and a collision happens.
Fig. 12 depicts one driving situation in which there are three surrounding vehicles in front of the ego vehicle (the episode represented by point A). The ego vehicle has to execute the car-following maneuver for a long time and wait for the opportunity to overtake them. As a consequence, the vehicle speed may not reach the maximum value, and the ego vehicle may not surpass all the surrounding vehicles before the destination. Furthermore, an infrequent driving condition is described in Fig. 13 (the episode represented by point B). The ego vehicle attempts a risky lane change to obtain higher rewards. However, it crashed into nearby vehicles because the operating space was not sufficient. This situation may not have appeared in the training process, and thus the ego vehicle could cause a collision.
TABLE I
TRAINING AND TESTING TIME OF DQN AND DDQN METHODS

Techniques#    Training Time (h)    Testing Time (s)*
DQN            8                    27.89
DDQN           6                    26.56

# A 2.30 GHz microprocessor with 31.8 GB RAM was used.
* The time that uses the trained parameters in a new driving situation.
Based on the detailed analysis of Figs. 12 and 13, it is suggested that more time should be spent training the decision-making strategy. These results also indicate that the relevant control policy has the potential to be applied in real-world environments. Table I provides the training and testing times of the DQN and DDQN approaches. Although the training can only be conducted offline, the learned parameters and policies are able to be utilized online. This inspires us to implant our decision-making policy into visualization simulation environments and to conduct the related in-the-loop experiments in the future.
V. CONCLUSION
This paper discusses the highway decision-making problem using the DRL technique. By applying the DDQN algorithm in the designed driving environments, an efficient and safe control framework is constructed. Through a series of simulation experiments, the optimality, convergence rate, and adaptability are demonstrated. In addition, the testing results are analyzed, and the potential of the presented method to be applied in real-world environments is shown. Future work includes the online application of highway decision-making by executing hardware-in-the-loop (HIL) experiments. Moreover, real-world collected highway data can be used to evaluate the related overtaking strategy.
REFERENCES
[1] A. Raj, J. A. Kumar, and P. Bansal, “A multicriteria decision making
approach to study barriers to the adoption of autonomous vehicles,”
Transportation research part A: policy and practice, vol. 133, pp.
122-137, 2020.
[2] T. Liu, B. Tian, Y. Ai, L. Chen, F. Liu, and D. Cao, Dynamic States
Prediction in Autonomous Vehicles: Comparison of Three Different
Methods, IEEE Intelligent Transportation Systems Conference
(ITSC 2019), 27-30 Oct 2019.
[3] A. Rasouli, and J. K. Tsotsos, “Autonomous vehicles that interact
with pedestrians: A survey of theory and practice,” IEEE Tran. on
Intell.Trans. Sys., vol. 21, no. 3, pp. 900-918, 2019.
[4] C. Gkartzonikas, and K. Gkritza, “What have we learned? A review
of stated preference and choice studies on autonomous vehicles,”
Transportation Research Part C: Emerging Technologies, vol. 98,
pp. 323-337, 2019.
[5] C. J. Hoel, K. Driggs-Campbell, K. Wolff, L. Laine, and M. J.
Kochenderfer, “Combining planning and deep reinforcement
learning in tactical decision making for autonomous driving,” IEEE
Transactions on Intelligent Vehicles, vol. 5, no. 2, pp. 294-305, 2019.
[6] C. J. Hoel, K. Wolff, and L. Laine, “Tactical decision-making in
autonomous driving by reinforcement learning with uncertainty
estimation,” arXiv preprint arXiv:2004.10439, 2020.
[7] A. Driving, “Levels of driving automation are defined in new SAE
international standard J3016: 2014,” SAE International: Warrendale,
PA, USA, 2014.
[8] P. Hart, and A. Knoll, “Using counterfactual reasoning and
reinforcement learning for decision-making in autonomous driving,”
arXiv preprint arXiv:2003.11919, 2020.
[9] W. Song, G. Xiong, and H. Chen, “Intention-aware autonomous
driving decision-making in an uncontrolled intersection,”
Mathematical Problems in Engineering, 2016.
[10] A. Furda, and L. Vlacic, “Enabling safe autonomous driving in
real-world city traffic using multiple criteria decision making,”
IEEE Intelli. Trans. Sys. Magaz., vol. 3, no. 1, pp. 4-17, 2011.
[11] J. Nie, J. Zhang, E. Ding, X. Wan, X. Chen, and B. Ran,
“Decentralized cooperative lane-changing decision-making for
connected autonomous vehicles,” IEEE Access, vol. 4, pp. 9413-
9420, 2016.
[12] L. Li, K. Ota, and M. Dong, “Humanlike driving: Empirical
decision-making system for autonomous vehicles,” IEEE
Transactions on Vehicular Technology, vol. 67, no. 8, pp. 6814-6823,
2018.
[13] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. G.
Bellemare, and S. Petersen, “Human-level control through deep
reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529-533,
2015.
[14] J. Duan, S. E. Li, Y. Guan, Q. Sun, and B. Cheng, “Hierarchical
reinforcement learning for self-driving decision-making without
reliance on labelled driving data,” IET Intelligent Transport Systems,
vol. 14, no. 5, pp. 297-305, 2020.
[15] M. Kim, S. Lee, J. Lim, J. Choi, and S. Kang, “Unexpected
Collision Avoidance Driving Strategy Using Deep Reinforcement
Learning,” IEEE Access, vol. 8, pp. 17243-17252, 2020.
[16] Q. Zhang, J. Lin, Q. Sha, B. He, and G. Li, “Deep Interactive
Reinforcement Learning for Path Following of Autonomous
Underwater Vehicle,” IEEE Access, vol. 8, pp. 24258-24268, 2020.
[17] C. Chen, J. Jiang, N. Lv, and S. Li, “An Intelligent Path Planning
Scheme of Autonomous Vehicles Platoon Using Deep
Reinforcement Learning on Network Edge,” IEEE Access, 2020.
[18] S. Han, and F. Miao, “Behavior Planning For Connected
Autonomous Vehicles Using Feedback Deep Reinforcement
Learning,” arXiv preprint arXiv:2003.04371, 2020.
[19] S. Nageshrao, H. E. Tseng, and D. Filev, “Autonomous highway
driving using deep reinforcement learning,” In 2019 IEEE
International Conference on Systems, Man and Cybernetics (SMC),
pp. 2326-2331, 2019.
[20] T. Liu, B. Huang, Z. Deng, H. Wang, X. Tang, X. Wang and D. Cao,
“Heuristics-oriented overtaking decision making for autonomous
vehicles using reinforcement learning”, IET Electrical Systems in
Transportation, 2020.
[21] M. Treiber, A. Hennecke, and D. Helbing, “Congested traffic states
in empirical observations and microscopic simulations,” Phys. Rev.
E, vol. 62, pp. 1805-1824, 2000.
[22] M. Zhou, X. Qu, and S. Jin, “On the impact of cooperative
autonomous vehicles in improving freeway merging: a modified
intelligent driver model-based approach,” IEEE Trans. Intell.
Transport. Syst., vol. 18, no. 6, pp. 1422-1428, June 2017.
[23] A. Kesting, M. Treiber, and D. Helbing, “General lane-changing
model MOBIL for car-following models,” Transportation Research
Record, vol. 1999, no. 1, pp. 86-94, 2007.
[24] T. Liu, X. Hu, W. Hu, Y. Zou, “A heuristic planning reinforcement
learning-based energy management for power-split plug-in hybrid
electric vehicles,” IEEE Transactions on Industrial Informatics, vol.
15, no. 12, pp. 6436-6445, 2019.
[25] T. Liu, X. Tang, H. Wang, H. Yu, and X Hu, “Adaptive Hierarchical
Energy Management Design for a Plug-in Hybrid Electric Vehicle,”
IEEE Transactions on Vehicular Technology, vol. 68, no. 12, pp.
11513-11522, 2019.
[26] X. Hu, T. Liu, X. Qi, and M. Barth, “Reinforcement learning for
hybrid and plug-in hybrid electric vehicle energy management:
Recent advances and prospects,” IEEE Industrial Electronics
Magazine, vol. 13, no. 3, pp. 16-25, 2019.
[27] T. Liu, B. Tian, Y. Ai, L. Chen, F. Liu, and D. Cao, Dynamic States
Prediction in Autonomous Vehicles: Comparison of Three Different
Methods, IEEE Intelligent Transportation Systems Conference
(ITSC 2019), 27-30 Oct 2019.
[28] T. Liu, H. Yu, H. Guo, Y. Qin, and Y. Zou, “Online energy
management for multimode plug-in hybrid electric vehicles,” IEEE
Transactions on Industrial Informatics, vol. 15, no. 7, pp. 4352-
4361, July 2019.
[29] M. L. Puterman, “Markov decision processes: Discrete stochastic dynamic programming,” John Wiley & Sons, 2014.
[30] R. Sutton, and A. Barto, “Reinforcement learning: An introduction
(second edition),” MIT press, 2018.
[31] L. Edouard, “An environment for autonomous driving decision-making,” https://github.com/eleurent/highway-env, GitHub, 2018.