POMDP and Hierarchical Options MDP with Continuous Actions for Autonomous Driving at Intersections
Zhiqian Qiao1, Katharina Muelling2, John Dolan1,2, Praveen Palanisamy3 and Priyantha Mudalige3
Abstract: When applying autonomous driving technology to real-world scenarios, environmental uncertainties make the development of decision-making algorithms difficult. Modeling the problem as a Partially Observable Markov Decision Process (POMDP) [1] allows the algorithm to consider these uncertainties in the decision process, which makes it more robust to real sensor characteristics. However, solving a POMDP with reinforcement learning (RL) [2] often requires storing a large number of observations. Furthermore, for continuous action spaces, the system is computationally inefficient. This paper addresses these problems by modeling the problem as an MDP and learning a policy with RL using hierarchical options (HOMDP). The suggested algorithm can store state-action pairs and uses only the current observations to solve a POMDP problem. We compare the results to the time-to-collision method [3] and the proposed POMDP-with-LSTM method. Our results show that the HOMDP approach is able to improve the performance of the agent for a four-way intersection task with two-way stop signs. The HOMDP method can generate both higher-level discrete options and lower-level continuous actions with only the observations of the current step.
I. INTRODUCTION
Traversing a four-way intersection with two-way stop
signs can be difficult for autonomous vehicles. Upon arrival,
the agent has to time its actions properly to make a turn onto
the right-of-way road safely. If the car enters the intersection
too soon, it can result in a collision or cause the approaching
right-of-way vehicles to brake hard. On the other hand, if the
driver keeps waiting for too long to make sure the situation
is safe, valuable time is lost.
At first glance, rule-based methods [4][5] seem to
be a plausible approach to describe the human decision
process for intersection traversal. The driver estimates the
time an approaching vehicle needs to reach and traverse the
intersection. If the traversal time is less than the approaching
time, then the driver may make the decision to begin the
traversal. Otherwise, the driver may choose to wait. For an
autonomous vehicle, an algorithm called time-to-collision
(TTC) [3] takes a similar approach. However, estimating the
time accurately and dealing with unexpected situations by
adjusting the ego vehicle’s decision according to changes in
the environment are hard for autonomous vehicles. As a result, such an approach is not reliable enough for autonomous driving. An alternative to rule-based methods is to
model the problem as a Markov Decision Process (MDP)
and to learn an optimal policy using reinforcement learning.
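As a rough illustration of this rule-based reasoning, the sketch below compares an approaching vehicle's time to reach the conflict point with the ego vehicle's estimated traversal time. The safety margin and the numbers are our own placeholders, not the exact TTC formulation of [3]:

```python
def time_to_conflict(gap_m: float, approach_speed_mps: float) -> float:
    """Seconds until an approaching right-of-way vehicle reaches the conflict point."""
    if approach_speed_mps <= 0.0:
        return float("inf")  # stopped or receding vehicle never reaches the conflict point
    return gap_m / approach_speed_mps


def rule_based_go(gap_m, approach_speed_mps, traversal_time_s, margin_s=2.0):
    """Enter only if the ego car can clear the intersection before the other car arrives."""
    return time_to_conflict(gap_m, approach_speed_mps) > traversal_time_s + margin_s


# Example: a car 60 m away at 10 m/s gives 6 s; with a 3 s traversal and 2 s margin we may go.
print(rule_based_go(60.0, 10.0, 3.0))  # True
```

The weakness discussed above shows up directly in this sketch: a wrong speed or gap estimate silently flips the decision.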
1Department of Electrical and Computer Engineering, Carnegie
Mellon University, 5000 Forbes Ave, Pittsburgh, PA, USA.
zhiqianq@andrew.cmu.edu
2The Robotics Institute, Carnegie Mellon University, Pittsburgh, USA.
3Research & Development, General Motors, Warren, MI, USA.
Fig. 1: The proposed intersection traversal scenario with
simulated ray traces. The ego vehicle starts from the stop
line with zero velocity and other simulated vehicles run in
other lanes using the Krauss Traffic Model.
Some previous work [6][7] on the intersection traversal problem uses a bird's-eye view to construct the state information for decision making, which is hard to obtain in a real vehicle. In this work, we only use information from sources available to an autonomous vehicle as input, such as a simulated LIDAR and the ego car's state information. The challenge is that the LIDAR ray trace information does not include everything happening near the intersection, but only the information within the visibility range of the ego vehicle. In this paper, we assume that the intersection traversal problem is an MDP; however, the ego vehicle cannot directly observe what is happening outside its visibility range due to the geometry of the intersection. For any driving scenario, it is hard for the ego car to make proper decisions based only on the current states of the surrounding environment. Meanwhile, the agent cannot access the historical information of other vehicles in order to make a proper decision. Using a POMDP to model the whole system allows maintaining a probability distribution over the agent's observations. As a result, our first contribution is to model the problem as a POMDP and use deep reinforcement learning (DRL) to optimize the policy for generating continuous actions. The second contribution is proposing Hierarchical Options for MDP (HOMDP) to solve the problem. Instead of modeling the problem as a POMDP, we still apply the MDP model and solve the problem by adding hierarchical options, which gives better results than the POMDP alone.
II. RELATED WORK
This section introduces previous work related to this paper,
which can be categorized as follows: 1) papers that address
deep reinforcement learning (DRL) algorithms; 2) work
focused on using DRL to solve the POMDP problem; 3)
papers which use hierarchical structure and the concept of
options in combination with DRL methods.
A. Deep Reinforcement Learning
Recently, various DRL algorithms have shown advantages
in handling high-dimensional sensory inputs with discrete
or continuous actions. Deep Q-Networks [8] (DQN) and
its variants have been successfully applied to various fields.
However, DQN is restricted to the discrete action domain, which makes it hard to apply to autonomous driving. DDPG [9], proposed by Lillicrap et al., adapted the ideas underlying the success of DQN to the continuous action domain. All these solutions have in common that they model the problem as an MDP, i.e., they assume full knowledge of the environment, which is often not true in the real world.
B. POMDP with Reinforcement Learning
[10] proposed to extend the DQN method to POMDP problems by learning the optimal policy of a model-based RL problem. They used a fully-connected network to approximate the mapping function from observations and belief vectors to Q-values for different actions in order to solve a POMDP problem through DQN. [11] proposed a method called Deep Recurrent Q-Network (DRQN), which can be applied to a model-free representation by adding LSTM layers to the policy network. The trained policy network is capable of capturing all the historical information in the LSTM layer, and the output actions are based on all the historical observations with the help of the LSTM layers. [7] used DRQN with states from a bird's-eye view of the intersection to learn a policy for traversing the intersection. Based on DRQN, [12] proposed an algorithm called Deep Distributed Recurrent Q-Networks (DDRQN), which uses not only the previous observations but also the previous actions as inputs. The results show the first success in learning communication protocols by RL. [13] proposed a recent method called Action-specific Deep Recurrent Q-Network (ADRQN), which also uses historical observations and actions as input to obtain an optimal policy. However, instead of feeding all historical information into the LSTM layer, they pair the last step's action with the current step's observation as input to the LSTM layer for training, which showed improved results on some POMDP tasks.
C. Hierarchical Reinforcement Learning and Options
In the context of RL, [14] proposed the idea of options to generate actions at different levels. Options define a policy governing when the action policy is initiated and terminated. [15] introduced a hierarchical form of Q-learning called MAXQ. They proved the convergence of MAXQ mathematically and showed experimentally that it is computationally faster than the original Q-learning.
[16] proposed an improved MAXQ method by combining the
R-MAX [17] algorithm with MAXQ. It has both the efficient
model-based exploration of R-MAX and the opportunities
for abstraction provided by the MAXQ framework. [18]
presented hierarchical DQN (h-DQN), a method which is
based on the structure of DQN. H-DQN adds another high-
level Q-value function to learn a high-level policy while the
DQN is responsible for the low-level policy. The algorithm
is effective on the sparse and delayed feedback of ATARI
games. Instead of using observations in the memory, [19]
proposed a memoryless options method to solve the POMDP
problem by RL. They compared their results with finite state
controllers (FSC) [20], which is a state-of-the-art approach
that is commonly used to solve the POMDP problem. The
results showed that the method can both be as expressive
as FSC and more efficient than a recurrent neural network
when learning optimal policies in challenging POMDPs.
In this paper, we first extend the original POMDP-with-LSTM [7] structure to an application that generates continuous actions instead of discrete actions. Second, we propose a structure that combines HMDP and DRL and apply it to the problem of an autonomous vehicle traversing an intersection.
III. PRELIMINARIES
This work is based mainly on the ideas of Markov Decision Processes (MDPs), Hierarchical MDP (HMDP) and Partially Observable MDP (POMDP). For a better understanding
of the proposed methodology, we will give a short overview
of all three frameworks before explaining our approach in
Section IV.
A. Markov Decision Process
An MDP is defined as a tuple $\{S, A, R, T, \gamma\}$ in which $S$ denotes a set of states, $A$ defines the set of available actions, and $T(s_{t+1}, a_t, s_t)$ is a transition function that maps a state-action pair $(s_t, a_t)$ to a new state $s_{t+1}$. The reward function $R$ defines the immediate reward for each state-action pair, and $\gamma$ is a discount factor for long-term rewards. At every time step $t$, the agent receives the state $s_t$ from the environment and then selects an action $a_t$ according to the policy $\pi$, which results in the agent transitioning to a new state $s_{t+1}$ according to the transition function $T$. Meanwhile, the agent receives an immediate reward $r_t$ after taking action $a_t$ in state $s_t$.
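For concreteness, the interaction loop implied by this definition can be sketched as follows; the `env` interface and the discounted-return bookkeeping are illustrative assumptions, not part of the paper's implementation:

```python
def run_episode(env, policy, gamma=0.99, max_steps=1000):
    """Roll out one episode of an MDP and return the discounted return."""
    s = env.reset()                      # initial state s_0
    ret, discount = 0.0, 1.0
    for t in range(max_steps):
        a = policy(s)                    # a_t chosen by the policy pi
        s_next, r, done = env.step(a)    # environment applies T and R
        ret += discount * r              # accumulate gamma^t * r_t
        discount *= gamma
        s = s_next
        if done:
            break
    return ret
```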
B. Hierarchical MDP
Hierarchical MDP (HMDP) is a general framework to
solve problems with large state and action spaces. The
framework can restrict the space of policies by separating
the action policy into different levels. One way of realizing
an HMDP is via the options framework [21]. An HMDP includes a set of options $O$. An option $o \in O$ is a tuple $\{\pi_o, I_o, \beta_o\}$: $\pi_o$ is the policy for the option $o$, $I_o \subseteq S$ is the set of states in which the option can be initiated, and $\beta_o(s_t)$ defines the probability that the option will be terminated in state $s_t$.
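As a sketch only (the class and field names are ours, not notation from the paper), an option can be represented directly as this triple:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Option:
    """One temporally extended action {pi_o, I_o, beta_o} in the options framework."""
    policy: Callable[[Any], Any]              # pi_o: state -> low-level action
    initiation_set: Callable[[Any], bool]     # membership test for I_o (can o start in s?)
    termination_prob: Callable[[Any], float]  # beta_o(s_t): probability that o ends in s_t
```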
C. Partially Observable MDP
When applying decision processes to real-world problems,
we run into the problem that we often can only infer the state
of the environment from potentially incomplete and noisy
sensor data, i.e., we have a partially observable environment.

Fig. 2: POMDP with LSTM network for generating continuous actions. The input observation $o_t$ is a 126-D vector which generates a 3-D continuous action vector $a_t$ through two 512-D LSTM layers and one fully connected (FC) layer. Meanwhile, the observation vector followed by an FC layer is concatenated with the action vector coming from the previous step followed by an FC layer, and then followed by a 128-D LSTM layer and three FC layers to produce a 1-D Q-value $Q_t$ corresponding to the continuous action vector.

POMDPs [1] provide a powerful tool to model such partially observable processes. A POMDP is defined as a tuple $\{S, A, R, T, \Omega, V, \gamma\}$, which extends the MDP formulation with a set of environment observations $\Omega$ and conditional observation probabilities $V(v_{t+1} \mid s_{t+1}, a_t)$.
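The practical consequence of partial observability is that the agent has to reason over a belief, i.e., a distribution over hidden states. As a generic illustration of that bookkeeping (standard POMDP belief filtering, not code from this paper), a discrete belief update after taking action a and receiving observation v is:

```python
def update_belief(belief, a, v, T, V):
    """One Bayes-filter step over a finite state set.

    belief: dict mapping state -> probability
    T(s_next, a, s): transition probability to s_next from s under action a
    V(v, s_next, a): probability of observing v after reaching s_next via a
    """
    new_belief = {}
    for s_next in belief:
        prior = sum(T(s_next, a, s) * p for s, p in belief.items())
        new_belief[s_next] = V(v, s_next, a) * prior
    z = sum(new_belief.values())
    return {s: p / z for s, p in new_belief.items()} if z > 0 else dict(belief)
```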
IV. METHODOLOGY
In this paper, we model the intersection problem as a
reinforcement learning problem. In particular, we consider
the scenario of a four-way intersection with two-way stop
signs (see Figure 1). We assume a start configuration in
which the ego-vehicle (purple rectangle in Figure 1) has
already stopped near the stop line and would like to traverse
the intersection in order to reach a pre-defined destination beyond it and complete a Turn Right, Go Straight or Turn Left task. The yellow rectangles are
simulated vehicles generated by the SUMO simulator [22]
and are all moving according to the Krauss Car Following
Traffic Model [23].
In this section, we propose two methods to solve the RL
problem. The first method models the problem as a POMDP
and uses the past observation and action pairs as inputs for
the reinforcement learning algorithm. It generates continuous
actions as outputs at each time step. The second method
models the problem as an MDP with hierarchical options,
which only takes the latest observations as input and then
generates discrete high-level options as well as low-level
actions through the option policy $\pi_o$ simultaneously.
A. POMDP with LSTM
In our work, we propose a network based on the ideas of
DDPG [9] and ADRQN [13] to generate continuous actions
based on the observations coming from previous steps. In the
network, instead of using all the previous observations, only a fixed number of previous steps ($n_{steps} = 20$) are used as the inputs to the LSTM layer.

Fig. 3: MDP with Hierarchical Options network for generating continuous actions. The 126-D input state vector $s_t$ is followed by three FC layers in order to generate 2-D Q-values $O_t$ corresponding to the two hierarchical option candidates. Meanwhile, the input state vector produces a 2-D continuous action vector $a_t$ through four FC layers. Then, the state vector followed by an FC layer is concatenated with the action vector followed by one FC layer. The output produces a 1-D Q-value $Q_t$ corresponding to the action vector through four FC layers.

Figure 2 shows the structure of the proposed network. The input of the network is the observation at each time step; through the first LSTM layer and an FC layer, the network incorporates information from previous steps to produce the action output for each step. After that, we pair the observation from the current step with the action taken at the last step, $(o_t, a_{t-1})$, as the input for generating the Q-value $Q_t$.
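A minimal PyTorch sketch of the actor branch of the POMDP-with-LSTM network in Figure 2 (two 512-unit LSTM layers followed by an FC layer, fed with the 20-step observation history); the tanh squashing of the output is our assumption, not a detail stated in the paper:

```python
import torch
import torch.nn as nn

class LSTMActor(nn.Module):
    """Maps a short history of 126-D observations to a 3-D continuous action."""
    def __init__(self, obs_dim=126, act_dim=3, hidden=512, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden, act_dim)

    def forward(self, obs_seq):
        # obs_seq: (batch, n_steps, obs_dim), with n_steps = 20 past observations
        out, _ = self.lstm(obs_seq)
        return torch.tanh(self.fc(out[:, -1]))  # a_t for the current step

actor = LSTMActor()
a_t = actor(torch.zeros(1, 20, 126))  # dummy 20-step observation history
```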
B. MDP with Hierarchical Options
For MDP with hierarchical options, we use three networks to produce hierarchical options, low-level actions, and Q-values. The structure of the network is shown in Figure 3. Unlike the POMDP network, the system does not take observations from previous time steps into account in the decision process. Instead, it uses hierarchical options to decide whether the ego vehicle can trust the environment or not. Then, based on the observations and the high-level decision, the agent decides on the low-level policy. The lower-level decision here is the acceleration or deceleration the agent applies at the current step. The overall flow of the algorithm is given in Algorithm 1.
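A schematic sketch of the three heads described in Figure 3 (option Q-network, actor, and critic). The hidden-layer widths are our placeholders, since the paper only specifies the layer counts and the input/output dimensions:

```python
import torch
import torch.nn as nn

def mlp(sizes):
    """Stack of FC (Linear) layers with ReLU activations between them."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

class HOMDPNets(nn.Module):
    """Option Q-values O_t, continuous action a_t, and critic value Q_t from the 126-D state."""
    def __init__(self, s_dim=126, a_dim=2, n_options=2):
        super().__init__()
        self.option_q = mlp([s_dim, 256, 256, n_options])  # three FC layers -> 2-D option values
        self.actor = mlp([s_dim, 256, 256, 256, a_dim])    # four FC layers -> 2-D action
        self.state_fc = nn.Linear(s_dim, 128)              # state branch of the critic
        self.action_fc = nn.Linear(a_dim, 128)             # action branch of the critic
        self.critic = mlp([256, 256, 256, 256, 1])         # four FC layers -> scalar Q_t

    def forward(self, s, a):
        q_options = self.option_q(s)
        action = torch.tanh(self.actor(s))
        z = torch.cat([self.state_fc(s), self.action_fc(a)], dim=-1)
        return q_options, action, self.critic(z)

nets = HOMDPNets()
q_opt, act, q_val = nets(torch.zeros(1, 126), torch.zeros(1, 2))  # shapes (1, 2), (1, 2), (1, 1)
```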
C. State Space
In the representation of the state space, we assume that
the map information is known by the agent. We define the
state space as containing the ego vehicle's velocity $v$ and road geometry information, including the distance between the ego vehicle and the lower intersection boundary $d_{lb}$, the mid-point $d_{mp}$ and the destination $d_{goal}$ of the intersection.
Figure 4 shows the geometric information related to the
intersection and the generated/simulated rays. The video1 shows the simulated ray trace representations during Go Straight, Turn Right and Turn Left maneuvers at the intersection in SUMO [22].
At each time step, a square of size 100 m × 100 m is constructed whose center is the middle front of the ego
1https://youtu.be/5HMB8SWanW4
Algorithm 1 MDP with Hierarchical Options via RL
1: procedure HOMDP
2:   Construct two empty replay buffers $B_a$ and $B_o$.
3:   Randomly initialize actor network $NN_\mu$, critic network $NN_Q$ and option network $NN_O$ with weights $\theta^\mu$, $\theta^Q$ and $\theta^O$, and the corresponding target actor network $NN_{\mu'}$, critic network $NN_{Q'}$ and option network $NN_{O'}$ with weights $\theta^{\mu'}$, $\theta^{Q'}$ and $\theta^{O'}$.
4:   for $e \leftarrow 1$ to $E$ epochs do
5:     Get initial state $s_0$. The initial option is $o_0 = $ SlowForward. $r_o = 0$.
6:     for $t \leftarrow 1$ to $T$ time steps do
7:       $o_t, a_t$ = GetAction($s_t$, $o_{t-1}$)
8:       $s_{t+1}, r_t, done$ = StepForward($s_t$, $o_t$, $a_t$)
9:       if $o_t$ is Forward then
10:        $r_o \mathrel{+}= r_t$ and add $(s_t, a_t, r_t, s_{t+1}, done)$ to $B_a$.
11:        if done then
12:          Add $(s_t, o_t, r_o, s_{t+1}, done)$ to $B_o$.
13:      else
14:        Add $(s_t, o_t, r_t, s_{t+1}, done)$ to $B_o$.
15:      Sample a random mini-batch of $M$ transitions $(s_i, o_i, r_i, s_{i+1})$ from $B_o$ and $(s_j, a_j, r_j, s_{j+1})$ from $B_a$.
16:      $o_{i+1} = \arg\max_o O'(s_{i+1} \mid \theta^{O'})$, $y^o_i = r_i + \gamma O'(s_{i+1} \mid \theta^{O'})$.
17:      Minimize $L^o = \frac{1}{M}\sum_i \big(y^o_i - O(s_i \mid \theta^O)\big)^2$ to update $NN_O$.
18:      $y^\mu_j = r_j + \gamma Q'(s_{j+1}, a_{j+1} \mid \theta^{Q'})$
19:      Minimize $L^\mu = \frac{1}{M}\sum_j \big(y^\mu_j - Q(s_j, a_j \mid \theta^Q)\big)^2$ to update $NN_Q$.
20:      $\frac{1}{M}\sum_j \nabla_a Q(s_j, a \mid \theta^Q)\big|_{a=\mu(s_j)} \nabla_{\theta^\mu}\mu(s_j \mid \theta^\mu)$ is the policy gradient and is used to update the actor network.
21:      Update the target networks: $\theta^{z'} \leftarrow \tau\theta^z + (1-\tau)\theta^{z'}$ for $z$ in $\{\mu, Q, O\}$.
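For example, the soft target-network update in the last line (Polyak averaging over the parameters of each network pair; the value of tau is a typical choice, not one reported in the paper) can be written as:

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.001):
    """theta' <- tau * theta + (1 - tau) * theta', applied parameter-wise."""
    for p_target, p_online in zip(target_net.parameters(), online_net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p_online)
```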
Algorithm 2 Get Action
1: procedure GETACTION($s$, $o$)
2:   if $o$ is SlowForward then
3:     $o \leftarrow \arg\max_o O'(s \mid \theta^{O'})$ according to $\epsilon$-greedy.
4:     $a = 0$.
5:   if $o$ is Forward then
6:     $a = \mu(s \mid \theta^\mu) + \mathcal{N}$, where $\mathcal{N}$ is a random process.
Algorithm 3 Move One Step Forward
1: procedure STEPFORWARD($s$, $o$, $a$)
2:   if $o$ is SlowForward then
3:     Slowly move forward by $d$ meters and stop.
4:   else if any $l_i / w_i \leq a$ then
5:     Decrease the speed of the ego car with deceleration $c_d$.
6:   else
7:     Increase the velocity of the ego car with acceleration $c_a$.
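Read together, Algorithms 2 and 3 amount to the control skeleton below. This is a paraphrase under our own naming; the epsilon-greedy constants, the noise scale, the creep/brake/accelerate magnitudes, and the reading of the gap test $l_i / w_i \leq a$ as a time-gap threshold are all assumptions on our part, not values stated in the paper:

```python
import random

OPTIONS = ("SlowForward", "Forward")

def get_action(s, o, option_q, actor, epsilon=0.1, noise_scale=0.1):
    """Algorithm 2: choose the high-level option, then the low-level action."""
    if o == "SlowForward":
        if random.random() < epsilon:                      # epsilon-greedy over options
            o = random.choice(OPTIONS)
        else:
            o = max(OPTIONS, key=lambda opt: option_q(s, opt))
        a = 0.0
    if o == "Forward":
        a = actor(s) + random.gauss(0.0, noise_scale)      # mu(s) plus random process N
    return o, a

def step_forward(s, o, a, rays, creep_dist=0.5, decel=2.0, accel=1.5):
    """Algorithm 3: map (option, action) to a longitudinal command."""
    if o == "SlowForward":
        return ("creep", creep_dist)       # slowly move forward d meters and stop
    if any(l / max(w, 1e-6) <= a for l, w in rays):
        return ("brake", decel)            # some observed gap is too tight: decelerate with c_d
    return ("accelerate", accel)           # otherwise increase velocity with c_a
```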
Fig. 4: Explanation of state space and method of constructing
ray trace system
vehicle. The square is divided into a million small squares
with size 0.1m×0.1m. Each small square is occupied if there
is any obstacle or moving object in this area. There are 61
ray traces produced from the middle front of the ego vehicle
spanning $\pi$ radians that cover the front view of the vehicle.
Each ray has a resolution of 0.5 meter and has a maximum
reach of 50 meters. Each ray is emitted from the front center
of the ego vehicle and if it reaches any obstacle like the road
boundary or a moving vehicle, the corresponding distance is
sensed. Meanwhile, we assume that the velocity of the end-
point of each ray can be determined as well. To sum up, the
agent obtains the distance and the velocity at the end point
of each ray trace, which results in a 126-D vector:
$$s_t = \left[v,\ d_{lb},\ d_{mp},\ d_{goal},\ l_i,\ c_i\right] \qquad (1)$$
in which $i \in [0, 60]$, and $l_i$ and $c_i$ are the length and the end point's velocity for each ray trace $i$. $v$ is the velocity of the ego vehicle at the current time step $t$.
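Assembling that vector is mechanical; the sketch below (NumPy, with the ray lengths grouped before the ray end-point velocities, an ordering the paper does not actually specify) just makes the 126-D layout of Equation (1) explicit:

```python
import numpy as np

def build_state(v, d_lb, d_mp, d_goal, ray_lengths, ray_velocities):
    """s_t = [v, d_lb, d_mp, d_goal, l_0..l_60, c_0..c_60] per Equation (1)."""
    assert len(ray_lengths) == 61 and len(ray_velocities) == 61
    s_t = np.concatenate(([v, d_lb, d_mp, d_goal], ray_lengths, ray_velocities))
    assert s_t.shape == (126,)   # 4 scalars + 61 ray lengths + 61 end-point velocities
    return s_t.astype(np.float32)
```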
D. Reward Function
The reward function is used to evaluate the goodness of
the action taken at every time step based on the current states
or observations from the environment. For the proposed
problem, the reward function is designed to be the sum of
the following terms:
$$r_t = r_{ego} + r_{ray} + \sigma_5 + \sigma_6 + \sigma_7 \qquad (2)$$
For the ego vehicle, a positive reward is assigned that reflects the progress towards the destination position. A negative constant is added at each time step to encourage the ego vehicle to make fast progress. The initial position and destination of the ego vehicle are fixed for each training iteration. Equation (3) gives the reward term for the ego vehicle, in which $p_{des}$, $p_{ego}$ and $p_{init}$ are the destination position, current position and initial position, respectively:
$$r_{ego} = \sigma_1 \frac{\|p_{des} - p_{ego}\|_2}{\|p_{des} - p_{init}\|_2} - \sigma_2 \qquad (3)$$
where $\sigma_1$ and $\sigma_2$ are constants.
To penalize the agent for getting too close to other traffic participants and obstacles, a negative reward is imposed if the ego vehicle interacts with other simulated vehicles within its visibility. An interaction means that the ego vehicle can see an approaching vehicle that has the right of way through the simulated ray trace while the ego vehicle is blocking the future path of that approaching vehicle (i.e., it has already passed the lower boundary of the intersection); such a step counts as an interaction between the ego vehicle and the other vehicles.
$$r_{ray} = \sum_i \frac{\sigma_3}{l_i}\, \mathbb{1}\!\left(\frac{l_i}{v_i} - \sigma_4\right) \qquad (4)$$
in which $\mathbb{1}(x) = 1$ if $x \geq 0$ and $\mathbb{1}(x) = 0$ if $x < 0$.
A constant penalty $\sigma_5$ is imposed in case of a crash between the ego vehicle and any simulated vehicle:
$$\sigma_5 = -10000 \quad \text{if } \|p^i_{sim} - p_{ego}\|_2 = 0, \qquad (5)$$
where $p^i_{sim}$ is the position of the $i$-th simulated vehicle within the visibility of the ego vehicle.
Fig. 5: Two initial scenarios for experiments
A constant penalty $\sigma_6$ is added if the task cannot be finished within 1000 steps, which is 100 seconds in the real world.
Moreover, a positive reward $\sigma_7$ is added if the ego vehicle reaches the goal:
$$\sigma_7 = 1000 \quad \text{if } \|p_{des} - p_{ego}\|_2 = 0. \qquad (6)$$
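To make the composition in Equation (2) explicit, a sketch of the per-step sum follows; sigma_5 and sigma_7 use the values from Equations (5) and (6), while the timeout penalty magnitude is a placeholder because the paper does not state sigma_6:

```python
def step_reward(r_ego, r_ray, crashed, timed_out, reached_goal, sigma6=-100.0):
    """r_t = r_ego + r_ray + sigma_5 + sigma_6 + sigma_7 (Equation 2).

    r_ego and r_ray are assumed to be precomputed from Equations (3) and (4).
    """
    sigma5 = -10000.0 if crashed else 0.0        # crash penalty, Equation (5)
    sigma6_term = sigma6 if timed_out else 0.0   # penalty for exceeding 1000 steps
    sigma7 = 1000.0 if reached_goal else 0.0     # goal reward, Equation (6)
    return r_ego + r_ray + sigma5 + sigma6_term + sigma7
```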
V. EXPERIMENTS
We tested our algorithm in the SUMO simulator [22]
with two scenarios which are shown in Figure 5. In the
two scenarios, the initial positions of the ego vehicle, the
position of the stop-line, the position of the mid-point of the
intersection and the position of the destination are all the
same. However, we change the position of the lower-bound
of the intersection, which may result in less information from
the ego vehicle’s ray traces, especially at the beginning of
each training epoch. This kind of situation is closer to real-
world scenarios, in which the average distance between the
lower bound of the intersection and the stop-line is four to
six meters. As a result, most vehicles don’t roll all the way
up to the edge of the intersection. The farther the stop-line
is away from the lower bound of the intersection, the more
difficult it is to make a proper decision on when to traverse,
because the geometry of the intersection blocks visibility.
During the traverse process, the vehicle has to either
Go Straight, Turn Right or Turn Left successfully while
limiting the required interactions between the ego vehicle
and other simulated vehicles. To evaluate the effectiveness of the proposed algorithms, we tested the following four methods: 1) time-to-collision (TTC) [3], 2) DDPG [9], 3) our proposed POMDP with LSTM algorithm, and 4) HOMDP with vanilla DRL for continuous actions, which we also propose in this paper. For the evaluation, we applied the resulting policies for 1000 iterations on the Go Straight, Turn Right and Turn Left tasks separately. We compared the different algorithms using five metrics:
Percentage of Success: the percentage of epochs in
which the ego vehicle successfully reaches the destination.
Percentage of Collision: the percentage of epochs in
which the ego vehicle collides with any other simulated
vehicle before reaching the destination.
TABLE I: Results for Scenarios One and Two with TTC and DDPG methods

Task         Metrics        | Scenario One     | Scenario Two
                            | TTC      DDPG    | TTC      DDPG
Go Straight  % Success      | 95.6     97.8    | 95.3     96.2
             % Collision    | 2.6      2.2     | 4.7      3.1
             % Unfinished   | 1.8      0.0     | 0.0      0.7
             Total Steps    | 385      136     | 70       195
             % Interaction  | 44.82    24.49   | 18.87    30.02
             Reward         | 673      735     | 552      587
Turn Right   % Success      | 98.6     100.0   | 98.2     97.6
             % Collision    | 1.4      0.0     | 1.8      2.4
             % Unfinished   | 0.0      0.0     | 0.0      0.0
             Total Steps    | 117      106     | 66       63
             % Interaction  | 17.05    10.14   | 13.97    11.4
             Reward         | 743      792     | 665      721
Turn Left    % Success      | 86.6     97.6    | 91.42    92.4
             % Collision    | 1.0      2.0     | 8.58     5.4
             % Unfinished   | 12.38    0.4     | 0.0      2.2
             Total Steps    | 514      196     | 108      289
             % Interaction  | 35.47    33.26   | 39.57    37.31
             Reward         | -821     217     | -437     -223

Percentage of Unfinished cases: the percentage of epochs in which the ego vehicle does not achieve the goal within 1000 simulation steps, which equals 100 seconds in the real world.
Total steps $step_{total}$: the average number of total steps the ego vehicle takes to finish each test iteration, including all the successful and failed cases. Fewer steps means less time is taken to finish the whole process.
Percentage of Interaction $step_{inter}/step_{total}$: the average percentage of steps in each test iteration in which the ego vehicle has an interaction with one of the other traffic participants. An interaction is defined as an unscheduled behavior change of a vehicle in response to the ego vehicle's behavior. Other traffic participants can react to the ego vehicle if the ego vehicle has passed the lower bound of the intersection and they are within the visibility range of the ego vehicle.
Total Reward $\sum_{t=0}^{T} r_t$: the reward the agent gets for each test iteration. More interactions mean that the ego vehicle relies more on other vehicles' decelerations to avoid collisions.
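Purely for illustration (the episode-log field names are ours, and the three outcome categories are assumed to be mutually exclusive), the five metrics can be aggregated from per-episode records like this:

```python
def summarize(episodes):
    """episodes: list of dicts with keys 'success', 'collision',
    'steps', 'interaction_steps', and 'reward' for one test run."""
    n = len(episodes)
    pct = lambda key: 100.0 * sum(1 for e in episodes if e[key]) / n
    return {
        "% Success": pct("success"),
        "% Collision": pct("collision"),
        "% Unfinished": 100.0 - pct("success") - pct("collision"),
        "Total Steps": sum(e["steps"] for e in episodes) / n,
        "% Interaction": 100.0 * sum(e["interaction_steps"] / e["steps"]
                                     for e in episodes) / n,
        "Reward": sum(e["reward"] for e in episodes) / n,
    }
```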
VI. RESULTS
First, we compare the results for our two baselines (i.e.,
TTC and the vanilla DDPG) for both scenarios (Figure 5).
The results are shown in Table I . In Scenario One, DDPG
outperforms TTC in terms of the success rate, total steps
to finish the traversing, and the interaction rate, which is in
alignment with other reported results [7] that use bird's-eye view
data instead of ray trace data. However, for Scenario Two,
where there are more unobservable states initially and during
the training process, DDPG and TTC vary as to which has
better performance. Although DDPG has a higher success
rate for Go Straight and Left Turn scenarios, it also has
a higher interaction rate, which means the lower collision
rate is due at least in part to the other vehicles’ adjustments
instead of the ego vehicle’s skill.
Next, we compared POMDP with LSTM and HOMDP to get the results for Scenario Two alone, which are presented in Table II. The performance of the final policies for POMDP and HOMDP can be seen in the video2.

TABLE II: Results for Scenario Two with Four Methods Together

Task         Metrics        TTC      DDPG     POMDP    HOMDP
Go Straight  % Success      95.3     96.2     97.1     98.3
             % Collision    4.7      3.1      1.7      1.7
             % Unfinished   0.0      0.7      1.2      0.0
             Total Steps    70       195      261      102
             % Interaction  18.87    30.02    27.86    26.41
             Reward         552      587      621      873
Turn Right   % Success      98.2     97.6     99.5     99.8
             % Collision    1.8      2.4      0.5      0.2
             % Unfinished   0.0      0.0      0.0      0.0
             Total Steps    66       63       57       62
             % Interaction  13.97    11.4     6.26     9.28
             Reward         665      721      892      903
Turn Left    % Success      91.4     92.4     95.6     97.3
             % Collision    8.6      5.4      2.4      2.6
             % Unfinished   0.0      2.2      2.0      0.1
             Total Steps    108      289      276      132
             % Interaction  39.57    37.31    33.21    28.90
             Reward         -437     -223     213      632

In general, TTC
can cause a higher collision rate and may result in more interactions, which means the method relies on the other simulated vehicles yielding to the ego vehicle in order to avoid collisions, which does not reflect the situation in the real world. The result from POMDP is more conservative, which causes more unfinished failure cases, i.e., epochs in which the ego vehicle cannot finish the traversal task within 1000 simulation steps. The HOMDP method, which is our main contribution, achieves the highest total reward and success rate on all of the Go Straight, Turn Right and Turn Left tasks. For the Go Straight and Turn Left tasks, the interaction rate is lowest with HOMDP, which means the results rely least on the other vehicles' yielding. HOMDP takes more total steps than some other methods (TTC for the Go Straight and Turn Left tasks, POMDP for the Turn Right task), which means HOMDP is a more conservative strategy that values the success rate most.
VII. CONCLUSION AND FUTURE WORK
This paper proposes two methods to deal with the problem
of an autonomous vehicle traversing an urban partially non-
controlled stop-sign intersection. The first approach is based
on the ideas of DDPG and ADRQN, which use previous
steps’ observations to learn an optimal policy with respect
to the current observations by using an LSTM layer in the
DRL method. The second method is based on the idea of
hierarchical MDPs, which learn a discrete option in the
high-level learning process and continuous low-level actions
simultaneously. The results show that HOMDP can provide better performance in terms of both success rate and interaction rate. In future work, we want to apply our model to real traffic data instead of the simulated traffic model in SUMO. As a starting point, we will apply the model to the Peachtree dataset in NGSIM [24].
2https://youtu.be/Rj5bYtKshh8
REFERENCES
[1] G. E. Monahan, "State of the art - a survey of partially observable Markov decision processes: Theory, models, and algorithms," Management Science, vol. 28, no. 1, pp. 1-16, 1982.
[2] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.
MIT press Cambridge, 1998, vol. 1, no. 1.
[3] D. N. Lee, “A theory of visual control of braking based on information
about time-to-collision,” Perception, vol. 5, no. 4, pp. 437–459, 1976.
[4] C. Dong, J. M. Dolan, and B. Litkouhi, “Intention estimation for
ramp merging control in autonomous driving,” in Intelligent Vehicles
Symposium (IV), 2017 IEEE. IEEE, 2017, pp. 1584–1589.
[5] C. Dong, Y. Zhang, and J. M. Dolan, “Lane-change social behavior
generator for autonomous driving car by non-parametric regression in
reproducing kernel hilbert space,” in Intelligent Robots and Systems
(IROS), 2017 IEEE/RSJ International Conference on. IEEE, 2017,
pp. 4489–4494.
[6] W. Song, G. Xiong, and H. Chen, “Intention-aware autonomous driv-
ing decision-making in an uncontrolled intersection,” Mathematical
Problems in Engineering, vol. 2016, 2016.
[7] D. Isele, A. Cosgun, K. Subramanian, and K. Fujimura, “Navigating
intersections with autonomous vehicles using deep reinforcement
learning,” arXiv preprint arXiv:1705.01196, 2017.
[8] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou,
D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement
learning,” arXiv preprint arXiv:1312.5602, 2013.
[9] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa,
D. Silver, and D. Wierstra, “Continuous control with deep reinforce-
ment learning,” arXiv preprint arXiv:1509.02971, 2015.
[10] M. Egorov, “Deep reinforcement learning with pomdps,” 2015.
[11] M. Hausknecht and P. Stone, "Deep recurrent Q-learning for partially observable MDPs," CoRR, abs/1507.06527, 2015.
[12] J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson, “Learning
to communicate to solve riddles with deep distributed recurrent q-
networks,” arXiv preprint arXiv:1602.02672, 2016.
[13] P. Zhu, X. Li, P. Poupart, and G. Miao, “On improving deep reinforce-
ment learning for pomdps,” arXiv preprint arXiv:1804.06309, 2018.
[14] S. P. Singh, “Transfer of learning by composing solutions of elemental
sequential tasks,” Machine Learning, vol. 8, no. 3-4, pp. 323–339,
1992.
[15] T. G. Dietterich, “The maxq method for hierarchical reinforcement
learning.” in ICML, vol. 98. Citeseer, 1998, pp. 118–126.
[16] N. K. Jong and P. Stone, “Hierarchical model-based reinforcement
learning: R-max+ maxq,” in Proceedings of the 25th international
conference on Machine learning. ACM, 2008, pp. 432–439.
[17] R. I. Brafman and M. Tennenholtz, “R-max-a general polynomial
time algorithm for near-optimal reinforcement learning,” Journal of
Machine Learning Research, vol. 3, no. Oct, pp. 213–231, 2002.
[18] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum, “Hier-
archical deep reinforcement learning: Integrating temporal abstraction
and intrinsic motivation,” in Advances in neural information process-
ing systems, 2016, pp. 3675–3683.
[19] D. Steckelmacher, D. M. Roijers, A. Harutyunyan, P. Vrancx, and A. Nowé, "Reinforcement learning in POMDPs with memoryless options and option-observation initiation sets," arXiv preprint arXiv:1708.06551, 2017.
[20] P. Poupart and C. Boutilier, “Bounded finite state controllers,” in
Advances in neural information processing systems, 2004, pp. 823–
830.
[21] M. Hauskrecht, N. Meuleau, L. P. Kaelbling, T. Dean, and C. Boutilier,
“Hierarchical solution of markov decision processes using macro-
actions,” in Proceedings of the Fourteenth conference on Uncertainty
in artificial intelligence. Morgan Kaufmann Publishers Inc., 1998,
pp. 220–229.
[22] D. Krajzewicz, J. Erdmann, M. Behrisch, and L. Bieker, “Recent
development and applications of sumo-simulation of urban mobility,”
International Journal On Advances in Systems and Measurements,
vol. 5, no. 3&4, 2012.
[23] J. Song, Y. Wu, Z. Xu, and X. Lin, “Research on car-following model
based on sumo,” in Advanced Infocomm Technology (ICAIT), 2014
IEEE 7th International Conference on. IEEE, 2014, pp. 47–55.
[24] “NGSIM homepage. FHWA.” 2005-2006. [Online]. Available:
http://ngsim.fhwa.dot.gov.