Decision-Making for Oncoming Traffic Overtaking
Scenario using Double DQN
Shuojie Mo, Xiaofei Pei*, Zhenfu Chen
School of Automotive Engineering
Wuhan University of Technology
Wuhan, China
peixiaofei7@whut.edu.cn
Abstract—Great progress has been made in the field of
machine learning in recent years, and learning-based methods
have been widely utilized for developing highly autonomous
vehicles. To this end, we introduce a reinforcement learning
based decision-making method for autonomous vehicles in an
oncoming-traffic overtaking scenario. The goal of reinforcement
learning is to learn to take the optimal decision for each
observation through interactions with the environment, using a
reward function to estimate how good each decision is. A Double
Deep Q-learning (Double DQN) agent was used to learn policies
(control strategies) for both longitudinal speed and lane-change
decisions. Prioritized Experience Replay (PER) was used to
accelerate convergence of the policies. A two-way 3-car scenario
with oncoming traffic was built in SUMO (Simulation of Urban
Mobility) to train and test the policies.
Keywords—autonomous vehicle, decision-making,
overtaking, reinforcement learning.
I. INTRODUCTION
Overtaking is one of the most complex driving maneuvers,
since it combines both longitudinal and lateral control.
Overtaking can efficiently increase the average speed of an
autonomous vehicle, but it is also dangerous and a frequent
cause of accidents. In this paper, we focus in particular on a
two-way 3-car overtaking scenario with oncoming traffic. A
reinforcement learning based decision-making method is
introduced to increase the safety and efficiency of the
overtaking maneuver.
Overtaking occurs many times in daily driving and is
closely related to both safety and average velocity. Many
approaches have been developed to solve the decision-making
problem of overtaking. In [1] and [2], fuzzy control based
methods were used to mimic human behavior and reactions
during overtaking maneuvers; throttle, brake, and steering-angle
commands were output by the fuzzy controllers to overtake the
front vehicle. Fenghui Wang et al. [3] proposed an MPC-based
overtaking controller built on estimation of the conflict
probability from the uncertain relative distance between
vehicles; the goal was to find an optimal control input sequence
that minimizes an objective function. Karaduman et al. [4] built
a 3-car overtaking scenario using a Bayesian network to
estimate the probability of collision.
In this work, a learning-based method is introduced to
solve the overtaking decision-making problem.
Reinforcement learning (RL) aims to solve sequential
decision-making problems under uncertainty by interacting
with the world to maximize cumulative reward. Xin Li et al.
used a Q-learning method to obtain optimal decisions in a
highway overtaking scenario [5]. A multiple-goal RL
framework was presented in [6]: considering all sub-goals,
Ngai and Yung employed seven Q-learning agents, one per
goal, together with a fusion function to determine the best
action while overtaking on a curve. Both Q-learning methods
work well. However, the oncoming overtaking scenario is not
discussed in these works, and continuous states need to be
quantized in Q-learning methods. Discretizing the state loses
information and thus degrades the resulting decisions.
In this paper, we mainly consider the decision-making
problem in a two-way 3-car oncoming overtaking scenario
using Double Deep Q-learning (DDQN). The goal of the
presented RL-based method is to train an agent that outperforms
traditional approaches in safety and time efficiency across
changeable scenarios. We offer two main contributions: 1) the
neural network in DDQN can handle continuous raw inputs
without losing information; 2) compared with traditional
decision-making methods, no vehicle model is required. The
paper is organized as follows. In Section II, we introduce the
framework of reinforcement learning and the algorithm we use.
Section III outlines the problem formulation, and Section IV
evaluates the proposed method in a SUMO scenario. Finally,
conclusions are given in Section V.
II. BACKGROUND
A. Markov Decision Process
Reinforcement learning learns optimal policies by
executing actions to interact with the environment and
improving the action strategy according to the received
reward [7]. We can model this as a Markov decision process
(MDP) defined by a tuple (S, A, T, R, γ). An agent's behavior
is defined by a policy π, which maps states S to a probability
distribution over the actions A. A transition function T(s'|s, a)
defines the probability of the state changing from s to s' under
action a.
A reward function R(s, a) is used to evaluate decisions, and
a discount factor γ ∈ [0, 1] is employed to calculate the
cumulative discounted expected reward. Specifically, the
agent seeks a behavior policy π* = π(a|s) that maximizes the
cumulative expected reward, also called the action-value
function (Q function):
Q^*(s, a) = \max_\pi \mathbb{E}\Big[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \;\Big|\; s_t = s, a_t = a, \pi \Big]    (1)
B. Double Deep Q-learning (Double DQN)
A deep Q-network (DQN) is a neural network that, for a
given state s, outputs a vector of action values Q(s, ·; θ), where
θ are the parameters of the network. Two important
ingredients of the DQN algorithm as proposed by Mnih et al.
[8] are the use of a replay memory and of a target network
with parameters θ⁻. The replay memory removes correlations
in the observation sequence, dramatically improving
performance. The target network is identical to the original
(main) network except that its parameters are copied from the
main network every τ steps (θ⁻ = θ) and kept fixed on all
other steps. The target used by DQN is

\hat{y}_t^{DQN} = r + \gamma \max_{a'} Q(s', a'; \theta^-)    (2)
However, the max operator in this target uses the same
values both to select and to evaluate an action, which leads to
overestimation in both Q-learning and DQN. This can be fixed
by decoupling action selection from action evaluation, which
is the main idea of Double DQN. Double DQN uses the same
two-network architecture as DQN, as shown in Fig. 1: the
main network selects the greedy action, while the target
network estimates its value [9]. The target in Double DQN is

\hat{y}_t^{DDQN} = r + \gamma \, Q\big(s', \arg\max_{a' \in A} Q(s', a'; \theta); \theta^-\big)    (3)
During learning, the parameters are updated on samples
drawn from the replay memory. The Q-learning update at
iteration i uses the following loss function:

L_i(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim \mu(D)} \Big[ \big( \hat{y}^{DDQN} - Q(s, a; \theta_i) \big)^2 \Big]    (4)

in which D is the replay memory and experience tuples
(s, a, r, s') ~ μ(D) are sampled from it. Here we use prioritized
experience replay (PER) to pick each minibatch of experience,
making the most effective use of the replay memory for
learning. The central component of prioritized replay is the
criterion by which the importance of each transition is
measured, namely its TD-error [10]: the larger the TD-error,
the more the agent can learn from that transition.
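For concreteness, the target in (3) and the PER-weighted loss in (4) might be computed as in the following minimal PyTorch sketch; the network objects, batch layout, and priority update shown are illustrative assumptions, not the authors' implementation.

import torch

def ddqn_loss(main_net, target_net, batch, weights, gamma=0.99):
    """Double DQN target (Eq. 3) and PER-weighted loss (Eq. 4).

    Sketch assumptions: main_net/target_net are torch.nn.Module
    Q-networks mapping a state batch to per-action values; `batch`
    holds tensors (s, a, r, s_next, done); `weights` are the PER
    importance-sampling weights of the sampled transitions.
    """
    s, a, r, s_next, done = batch
    with torch.no_grad():
        # Select a' with the main network, evaluate it with the target
        # network: this decoupling is the core of Double DQN.
        a_star = main_net(s_next).argmax(dim=1, keepdim=True)
        q_next = target_net(s_next).gather(1, a_star).squeeze(1)
        y = r + gamma * (1.0 - done) * q_next
    q_sa = main_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    td_error = y - q_sa
    # PER: importance weights correct the sampling bias; |td_error|
    # would be fed back as the new priorities of these transitions.
    loss = (weights * td_error.pow(2)).mean()
    return loss, td_error.abs().detach()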
III. PROBLEM FORMULATION
A. Oncoming Overtaking Scenario
In this paper we focus on decision-making in an oncoming
overtaking scenario. Fig. 2 shows a typical two-way 3-car
oncoming overtaking scenario built in SUMO (Simulation of
Urban Mobility). We assume Auto can accurately obtain its
own state and information about its surroundings through its
sensor system. At the start of each driving episode, the speed
of Auto is set to 10 m/s. The initial speed of Vehicle1 is drawn
from [5 m/s, 7 m/s] with an interval of 0.5 m/s, and its initial
distance from Auto from [30 m, 50 m] with an interval of 5 m.
The initial speed of Vehicle2 varies from 10 m/s to 15 m/s
with an interval of 0.5 m/s, and its initial distance to Auto
from 100 m to 300 m with an interval of 5 m. A constant-speed
model is applied to Vehicle1, and the Intelligent Driver Model
(IDM) [11], the default car-following model in SUMO, is
employed for the longitudinal dynamics of Vehicle2. In
addition, the acceleration of all three vehicles is limited to
[-3 m/s², 3 m/s²].
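The episode initialization described above can be summarized by a small sketch; the function and key names are illustrative, not taken from the paper:

import random

def sample_initial_conditions():
    """Draw one episode's initial conditions from the stated ranges."""
    return {
        "v_auto": 10.0,                                               # m/s, fixed
        "v_veh1": random.choice([5.0 + 0.5 * k for k in range(5)]),   # 5-7 m/s
        "d_veh1": float(random.choice(range(30, 55, 5))),             # 30-50 m ahead
        "v_veh2": random.choice([10.0 + 0.5 * k for k in range(11)]), # 10-15 m/s
        "d_veh2": float(random.choice(range(100, 305, 5))),           # 100-300 m away
    }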
B. State Space
Fig. 1. Framework of Double DQN.
Fig. 2. A typical two-way 3-car oncoming overtaking scenario in SUMO. Green stands for the autonomous vehicle Auto, red for the slower front vehicle Vehicle1 being overtaken, and yellow for the oncoming vehicle Vehicle2.

The state provides the agent with information about the ego
vehicle and its surroundings for decision-making. In this paper,
only longitudinal distances are considered. An abstract state
vector is defined by (5) to provide adequate environment
information:

s^k = \big( v_{auto}^k, x_{auto}^k, \Delta d_i^k, \Delta v_i^k \big)^T, \quad i = 1, 2, \ldots, 6    (5)
where v_{auto} is the current speed of Auto and x_{auto} is the
global position of Auto along the x-axis. Δd and Δv represent
the relationship between Auto and the surrounding vehicles:
the six elements of each stand for the relative distance d or
relative velocity v with respect to the front, left-front, right-front,
behind, left-behind, and right-behind vehicles, respectively.
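Assembled in code, the state of (5) is a 14-dimensional vector. The following sketch assumes a `neighbors` mapping and a large placeholder distance for empty slots, both of which are illustrative conventions:

import numpy as np

# Slot order as listed above.
SLOTS = ("front", "left_front", "right_front",
         "behind", "left_behind", "right_behind")

def build_state(v_auto, x_auto, neighbors):
    """Build the state vector of Eq. (5): 2 ego values plus
    (relative distance, relative velocity) for each of the 6 slots."""
    state = [v_auto, x_auto]
    for slot in SLOTS:
        dd, dv = neighbors.get(slot, (1000.0, 0.0))  # assumed default for empty slot
        state += [dd, dv]
    return np.asarray(state, dtype=np.float32)       # shape (14,)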
C. Action Space
The action space includes longitudinal speed control and a
lateral lane-change decision. Five accelerations are chosen to
control longitudinal speed: -3 m/s² for medium braking,
-1 m/s² for soft braking, 0 m/s² for maintaining speed, 1 m/s²
for soft acceleration, and 3 m/s² for rapid acceleration. A
lane-change signal LC is used for lateral control. The action
space is defined as

A = \big( a_i, LC \big), \quad i = 1, 2, 3, 4, 5
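One straightforward way to expose this action set to the Q-network is as six discrete indices; the encoding below is our illustration, not the paper's:

# Index -> (action type, value): five accelerations plus the
# lane-change signal LC, giving a 6-dimensional Q-network output.
ACTIONS = {
    0: ("accelerate", -3.0),   # medium brake (m/s^2)
    1: ("accelerate", -1.0),   # soft brake
    2: ("accelerate",  0.0),   # maintain speed
    3: ("accelerate",  1.0),   # soft acceleration
    4: ("accelerate",  3.0),   # rapid acceleration
    5: ("lane_change", None),  # lateral lane-change signal LC
}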
D. Reward Function
The reward function guides the agent toward its goals, so
its shape should reflect the learning objectives. In this paper,
we expect the agent to overtake the slower front vehicle as
quickly as possible without collision. The following aspects
are considered:
1) Collision avoidance: Collision is the most dangerous
situation and must be avoided. A TTC-based piecewise
function is used to formulate this reward:

R_{collision} = \begin{cases} -1.5, & 2.5 \le TTC \le 3 \\ -2, & 2 \le TTC \le 2.5 \\ -4, & TTC \le 2 \\ -40, & \text{collision} \end{cases}    (6)
2) Speed: The ultimate objective of overtaking is to
increase speed for quicker passage. The following speed
reward encourages the agent to speed up:

R_{velocity} = 0.2 \times (v_{auto} - 10)    (7)
3) Opposite lane occupancy: Long-term occupancy of the
opposite lane is obviously dangerous, so a small penalty is
applied to reduce occupancy time:

R_{opposite-lane} = -1    (8)
4) Overtake: When the agent reaches the goal state, i.e.,
successfully overtakes the slower front vehicle, a large reward
is given:

R_{overtake} = 200    (9)
Formally, the final reward is defined as the sum of the
sub-goal rewards:

R = R_{velocity} + R_{overtake} + R_{opposite-lane} + R_{collision}    (10)
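Taken together, (6)-(10) translate directly into a per-step reward. The sketch below assumes TTC values outside the listed intervals incur no penalty, which the paper leaves implicit:

def step_reward(v_auto, ttc, collided, on_opposite_lane, overtook):
    """Sum of the sub-goal rewards in Eqs. (6)-(10)."""
    r = 0.2 * (v_auto - 10.0)       # speed term, Eq. (7)
    if collided:                    # collision term, Eq. (6)
        r -= 40.0
    elif ttc <= 2.0:
        r -= 4.0
    elif ttc <= 2.5:
        r -= 2.0
    elif ttc <= 3.0:
        r -= 1.5
    if on_opposite_lane:            # occupancy penalty, Eq. (8)
        r -= 1.0
    if overtook:                    # goal reward, Eq. (9)
        r += 200.0
    return r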
IV. EVALUATION
The Double DQN algorithm described above is applied to
decision-making and outputs a reasonable overtaking policy
for the two-way scenario with oncoming traffic. The decision
time step is set to 0.1 s. The hyper-parameter values were not
systematically tuned due to the high computational cost; a
relatively good set was selected after several tests and is shown
in Table I. Fig. 3 shows the learning curve of average
cumulative reward. Rewards are averaged per epoch (red line)
to show the training process more clearly. As can be seen, the
agent converges after roughly 70 epochs of training. With more
training time, the Double DQN agent would likely perform
even better.
Following the oncoming overtaking scenario presented in
Section III, we built a two-way scenario with two other
vehicles in SUMO. The speed and position of Vehicle2 are
randomly initialized every episode, and an episode ends either
in a collision or when the slower Vehicle1 has been
successfully overtaken.
Fig. 4 shows how the agent acts under different overtaking
policies in distinct, randomly initialized scenarios. In Fig. 4(a),
where Vehicle2 is either not very fast or starts far from Auto,
there is enough time for Auto to directly overtake the slower
front Vehicle1.
Fig. 3. Learning curve for Double DQN.
Fig. 4. Two typical policies for overtaking. The markers on the left side represent the agent's actions: acceleration, speed maintenance, braking, and lane-change maneuvers.
TABLE I. LIST OF HYPER-PARAMETERS

Hyper-parameter             | Value
----------------------------|-------------------------------------
episodes                    | 150000
minibatch size              | 32
replay memory size          | 250000
target network soft update  | 0.01
discount factor             | 0.99
learning rate               | 5e-5
exploration                 | 1 → 0.1 (annealed over 5e5 steps)
replay start size           | 25000
prioritized replay alpha    | 0.6
importance weights          | 0.4 → 1 (increased over 5e5 steps)
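The two annealed quantities in Table I (exploration ε and the PER importance weight β) can be produced by a common linear schedule, as in this small sketch:

def linear_schedule(start, end, horizon, step):
    """Linearly interpolate from start to end over `horizon` steps."""
    frac = min(step / horizon, 1.0)
    return start + frac * (end - start)

# Table I values: epsilon annealed 1 -> 0.1 and importance weight
# beta increased 0.4 -> 1, both over 5e5 steps.
epsilon = linear_schedule(1.0, 0.1, 5e5, step=1e5)   # -> 0.82
beta    = linear_schedule(0.4, 1.0, 5e5, step=1e5)   # -> 0.52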
A different policy is used in Fig. 4(b), where Auto cannot
overtake the front vehicle immediately. In this situation, the
trained agent behaves like a human driver, waiting to overtake
until the oncoming Vehicle2 has passed.
To emphasize the time efficiency of the learning-based
method in oncoming traffic scenarios, we compare against
SUMO's default lane-change model 'LC2013', in which a
benefit is computed to estimate the advantage of changing
from one lane to another:

b_{l_n}(t) = \frac{ v_{pos}(t, l_n) - v_{pos}(t, l_c) }{ v_{max}(l_c) }    (11)
where b_{l_n}(t) is the benefit of changing to lane l_n; l_c and
l_n are the vehicle's current and neighbor lanes, respectively;
v_{pos}(t, l) is the velocity the vehicle could safely drive at on
lane l; and v_{max}(l) is the maximum velocity the vehicle can
reach on lane l.
More information about the ‘LC2013’ can be found in [12]
and [13].
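As a point of reference, the benefit of (11) is a one-line computation; this sketch follows our reading of the reconstructed formula:

def lane_change_benefit(v_pos_neighbor, v_pos_current, v_max_current):
    """LC2013 benefit of changing to the neighbor lane, Eq. (11):
    the safe-speed gain normalized by the current lane's maximum speed."""
    return (v_pos_neighbor - v_pos_current) / v_max_current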
TABLE II. COMPARISON OF DOUBLE DQN AND THE SUMO LANE-CHANGE METHOD

Indicator                                  | LC2013 (baseline) | Double DQN | Improvement
-------------------------------------------|-------------------|------------|------------
Average speed v_auto^average (m/s)         | 11.92             | 13.40      | 12.4%
Time in opposite lane t_op (s)             | 7.08              | 3.80       | 46.3%
Overtaking duration t_total (s)            | 10.43             | 7.02       | 32.7%
Opposite lane occupancy t_op / t_total (%) | 67.9%             | 54.1%      | 20.3%
We tested both methods for 1000 trials in SUMO. The
learning-based approach we presented achieves a collision-free
overtaking rate of 98.5%. Average overtaking duration t_total,
time in the opposite lane t_op, and average speed v_auto^average
are the indicators for comparison. The results in TABLE II
show that the RL-based decision-making method efficiently
increases the average speed of Auto and reduces the overtaking
time in contrast to the traditional method. In addition, the
opposite-lane occupancy indicates that the specially designed
sub-goal reward function works well, enhancing the safety of
the autonomous vehicle.
V. CONCLUSIONS
With Double DQN, continuous states can be used directly
to learn a more precise policy without discretization.
Evaluation results indicate that the trained model-free agent
performed better in nearly all of the random scenarios in
SUMO. In this paper we modeled the decision-making
problem as an MDP, where all information was assumed to be
completely observable. However, uncertainties from imperfect
sensor systems and other traffic participants are unavoidable in
reality. Future work on decision-making for autonomous
vehicles will explore POMDP-based approaches, as the
POMDP provides a powerful mathematical framework for
dealing with uncertainty.
REFERENCES
[1] J. E. Naranjo, C. Gonzalez, R. Garcia, and T. de Pedro, "Lane-Change
Fuzzy Control in Autonomous Vehicles for the Overtaking Maneuver,"
IEEE Transactions on Intelligent Transportation Systems, vol. 9, no. 3,
pp. 438-450, 2008.
[2] J. Perez, V. Milanes, E. Onieva, J. Godoy, and J. Alonso, "Longitudinal
fuzzy control for autonomous overtaking," in 2011 IEEE International
Conference on Mechatronics, 2011, pp. 188-193: IEEE.
[3] F. Wang, M. Yang, and R. Yang, "Conflict-probability-estimation-
based overtaking for intelligent vehicles," IEEE Transactions on
Intelligent Transportation Systems, vol. 10, no. 2, pp. 366-370, 2009.
[4] O. Karaduman, H. Eren, H. Kurum, and M. Celenk, "Interactive risky
behavior model for 3-car overtaking scenario using joint Bayesian
network," in 2013 IEEE Intelligent Vehicles Symposium (IV), 2013,
pp. 1279-1284: IEEE.
[5] X. Li, X. Xu, and L. Zuo, "Reinforcement learning based overtaking
decision-making for highway autonomous driving," in 2015 Sixth
International Conference on Intelligent Control and Information
Processing, 2015, pp. 336-342.
[6] D. C. K. Ngai and N. H. C. Yung, "A multiple-goal reinforcement
learning method for complex vehicle overtaking maneuvers," IEEE
Transactions on Intelligent Transportation Systems, vol. 12, no. 2,
pp. 509-522, Jun. 2011.
[7] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.
MIT press, 2018.
[8] V. Mnih et al., "Human-level control through deep reinforcement
learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.
[9] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning
with double q-learning," in Thirtieth AAAI Conference on Artificial
Intelligence, 2016.
[10] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized
experience replay," arXiv preprint arXiv:1511.05952, 2015.
[11] M. Treiber, A. Hennecke, and D. Helbing, "Congested traffic states
in empirical observations and microscopic simulations," Physical
Review E, vol. 62, no. 2, p. 1805, 2000.
[12] J. Erdmann, "SUMO’s lane-changing model," in Modeling Mobility
with Open Data: Springer, 2015, pp. 105-123.
[13] D. Krajzewicz, "Traffic simulation with SUMO–simulation of urban
mobility," in Fundamentals of traffic simulation: Springer, 2010, pp.
269-293.