POMDP and Hierarchical Options MDP with Continuous Actions for
Autonomous Driving at Intersections
Zhiqian Qiao1, Katharina Muelling2, John Dolan1,2, Praveen Palanisamy3 and Priyantha Mudalige3
Abstract— When applying autonomous driving technology
to real-world scenarios, environmental uncertainties make the
development of decision-making algorithms difficult. Modeling
the problem as a Partially Observable Markov Decision Process
(POMDP) [1] allows the algorithm to consider these uncertain-
ties in the decision process, which makes it more robust to
real sensor characteristics. However, solving the POMDP with
reinforcement learning (RL) [2] often requires storing a large
number of observations. Furthermore, for continuous action
spaces, the system is computationally inefficient. This paper
addresses these problems by proposing to model the problem as
an MDP and learn a policy with RL using hierarchical options
(HOMDP). The suggested algorithm can store the state-action
pairs and only uses current observations to solve a POMDP
problem. We compare the results to those of the time-to-collision
method [3] and the proposed POMDP-with-LSTM method. Our
results show that the HOMDP approach is able to improve the
performance of the agent for a four-way intersection task with
two-way stop signs. The HOMDP method can generate both
higher-level discrete options and lower-level continuous actions
with only the observations of the current step.
I. INTRODUCTION
Traversing a four-way intersection with two-way stop
signs can be difficult for autonomous vehicles. Upon arrival,
the agent has to time its actions properly to make a turn onto
the right-of-way road safely. If the car enters the intersection
too soon, it can result in a collision or cause the approaching
right-of-way vehicles to brake hard. On the other hand, if the
driver keeps waiting for too long to make sure the situation
is safe, valuable time is lost.
At first glance, rule-based methods [4][5] seem to
be a plausible approach to describe the human decision
process for intersection traversal. The driver estimates the
time an approaching vehicle needs to reach and traverse the
intersection. If the traversal time is less than the approaching
time, then the driver may make the decision to begin the
traversal. Otherwise, the driver may choose to wait. For an
autonomous vehicle, an algorithm called time-to-collision
(TTC) [3] takes a similar approach. However, estimating the
time accurately and dealing with unexpected situations by
adjusting the ego vehicle’s decision according to changes in
the environment is hard for autonomous vehicles. As a result,
such an approach is not reliable enough for the autonomous
vehicle. An alternative to using rule-based methods is to
model the problem as a Markov Decision Process (MDP)
and to learn an optimal policy using reinforcement learning.
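To make the rule concrete, a minimal gap-acceptance check in the spirit of TTC [3] could look like the sketch below; the function name, the safety margin, and the decision structure are our own illustrative assumptions rather than details from the paper.

```python
def ttc_gap_accept(gap_m: float, closing_speed_mps: float,
                   traversal_time_s: float, margin_s: float = 1.0) -> bool:
    """Go only if the approaching vehicle arrives later than our own
    estimated traversal time plus a safety margin (a simple TTC-style rule)."""
    if closing_speed_mps <= 0.0:
        return True                       # vehicle not actually approaching
    time_to_arrival = gap_m / closing_speed_mps
    return time_to_arrival > traversal_time_s + margin_s

# Example: a car 40 m away closing at 10 m/s arrives in 4 s; with a 3 s
# traversal estimate and a 1 s margin the rule says wait (4 > 4 is False).
print(ttc_gap_accept(40.0, 10.0, 3.0))    # False -> wait
```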
1Department of Electrical and Computer Engineering, Carnegie
Mellon University, 5000 Forbes Ave, Pittsburgh, PA, USA.
zhiqianq@andrew.cmu.edu
2The Robotics Institute, Carnegie Mellon University, Pittsburgh, USA.
3Research & Development, General Motors, Warren, MI, USA.
Fig. 1: The proposed intersection traversal scenario with
simulated ray traces. The ego vehicle starts from the stop
line with zero velocity and other simulated vehicles run in
other lanes using the Krauss Traffic Model.
Some previous work [6][7] on the intersection traversal
problem uses a bird's-eye view to construct the state information
for decision making, which is hard to obtain in
a real vehicle. In this work, we only use information
coming from resources available to an autonomous vehicle,
such as a simulated LIDAR and the ego car's
state information. The challenge is that the LIDAR ray trace
information does not include everything happening near the
intersection, but only the information within the visibility
range of the ego vehicle. In this paper, we assume that
the intersection traversal problem is an MDP; however, the
ego vehicle cannot directly observe anything happening
outside its visibility range due to the geometry of the
intersection. For any driving scenario, it is hard for the ego car
to make proper decisions based only on the current state of
the surrounding environment. Meanwhile, the agent cannot
access the historical information of other vehicles in order to
make a proper decision. Modeling the whole system as a
POMDP maintains a probability distribution over the agent's
observations. As a result, our first contribution is to model the
problem as a POMDP and use deep reinforcement learning
(DRL) to optimize the policy for generating continuous
actions. The second contribution is proposing Hierarchical
Options for MDP (HOMDP) to solve the problem. Instead
of modeling the problem as a POMDP, we still apply the
MDP model and solve the problem by adding hierarchical
options, which gives better results than POMDP alone.
II. RELATED WORK
This section introduces previous work related to this paper,
which can be categorized as follows: 1) papers that address
deep reinforcement learning (DRL) algorithms; 2) work
focused on using DRL to solve the POMDP problem; 3)
papers which use hierarchical structure and the concept of
options in combination with DRL methods.
A. Deep Reinforcement Learning
Recently, various DRL algorithms have shown advantages
in handling high-dimensional sensory inputs with discrete
or continuous actions. Deep Q-Networks [8] (DQN) and
its variants have been successfully applied to various fields.
However, DQN is restricted to the discrete action domain,
which makes it hard to apply to autonomous driving. DDPG
[9], which was proposed by Lillicrap et al., adapted the ideas
underlying the success of DQN to the continuous action
domain. All these solutions have in common that they model
the problem as an MDP, i.e., they assume full knowledge of
the environment, which is often not true in the real world.
B. POMDP with Reinforcement Learning
[10] proposed to extend the DQN method for POMDP
problems by learning the optimal policy of a model-based
RL problem. They used a fully-connected network to ap-
proximate the mapping function from observations and belief
vectors to Q-values for different actions in order to solve
a POMDP problem through DQN. [11] proposed a method
called Deep Recurrent Q-network (DRQN) which can be ap-
plied to a model-free representation by adding LSTM layers
in the policy network. The trained policy network is capable
of capturing all the historical information in the LSTM
layer and the output actions are based on all the historical
observations with the help of LSTM layers. [7] used DRQN
with states from a bird’s-eye view of the intersection to
learn the policy of how to traverse the intersection. Based on
DRQN, [12] proposed an algorithm called Deep Distributed
Recurrent Q-networks (DDRQN) which not only used the
previous observations but also previous actions as the inputs.
The results show the first success in learning communication
protocols by RL. [13] recently proposed the Action-specific
Deep Recurrent Q-Network (ADRQN), which also uses
historical observations and actions as input to obtain an
optimal policy. However, instead of feeding all historical
information into the LSTM layer, they pair the last step's
action with the current step's observation as the LSTM input
for training, which showed improved results on some
POMDP tasks.
C. Hierarchical Reinforcement Learning and Options
In the context of RL, [14] proposed the idea of
options to generate actions at different levels. The options
are used to define a policy governing when the action policy
is initialized and terminated. [15] introduced the concept of
hierarchical Q-learning called MAXQ. They proved the convergence
of MAXQ mathematically and showed experimentally
that it computes faster than the original Q-learning.
[16] proposed an improved MAXQ method by combining the
R-MAX [17] algorithm with MAXQ. It has both the efficient
model-based exploration of R-MAX and the opportunities
for abstraction provided by the MAXQ framework. [18]
presented hierarchical DQN (h-DQN), a method which is
based on the structure of DQN. H-DQN adds another high-
level Q-value function to learn a high-level policy while the
DQN is responsible for the low-level policy. The algorithm
is effective on ATARI games with sparse and delayed
feedback. Instead of storing observations in memory, [19]
proposed a memoryless options method to solve the POMDP
problem by RL. They compared their results with finite state
controllers (FSC) [20], which is a state-of-the-art approach
that is commonly used to solve the POMDP problem. The
results showed that the method is both as expressive
as FSC and more efficient than a recurrent neural network
when learning optimal policies in challenging POMDPs.
In this paper, we first extend the original POMDP-with-LSTM
[7] structure to an application that can generate
continuous actions instead of discrete actions. Second, we
propose a structure which combines HMDP and DRL and
apply the new structure to the problem of an autonomous
vehicle traversing an intersection.
III. PRELIMINARIES
This work is based mainly on the ideas of Markov Deci-
sion Processes (MDPs), Hierarchical MDP (HMDP) and Par-
tially Observable MDP (POMDP). For a better understanding
of the proposed methodology, we will give a short overview
of all three frameworks before explaining our approach in
Section IV.
A. Markov Decision Process
An MDP is defined as a tuple $\{S, A, R, T, \gamma\}$ in which $S$
denotes a set of states, $A$ defines the set of available actions,
and $T(s_{t+1}, a_t, s_t)$ is a transition function that maps a state-action
pair $(s_t, a_t)$ to a new state $s_{t+1}$. The reward function $R$
defines the immediate rewards for each state-action pair, and
$\gamma$ is a discount factor for long-term rewards. At every time
step $t$, the agent receives the state $s_t$ from the environment
and then selects an action $a_t$ according to the policy $\pi$, which
results in the agent transitioning to a new state $s_{t+1}$
according to the transition function $T$. Meanwhile, the agent
receives an immediate reward $r_t$ after taking action $a_t$ in
state $s_t$.
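To make the notation concrete, here is a toy two-state MDP rolled out under a random stand-in policy; the states, actions, and reward numbers are illustrative inventions, not part of the paper.

```python
import random

# Toy MDP {S, A, R, T, gamma}: all states, actions, and rewards are made up.
S = ["wait", "go"]
A = ["brake", "accelerate"]
gamma = 0.99
T = {("wait", "brake"): "wait", ("wait", "accelerate"): "go",
     ("go", "brake"): "wait", ("go", "accelerate"): "go"}
R = {("wait", "brake"): -1.0, ("wait", "accelerate"): 0.0,
     ("go", "brake"): -1.0, ("go", "accelerate"): 1.0}

s, ret = "wait", 0.0
for t in range(10):
    a = random.choice(A)                 # stand-in for the policy pi
    ret += (gamma ** t) * R[(s, a)]      # accumulate discounted reward
    s = T[(s, a)]                        # transition T(s_{t+1}, a_t, s_t)
print(f"discounted return: {ret:.2f}")
```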
B. Hierarchical MDP
Hierarchical MDP (HMDP) is a general framework for
solving problems with large state and action spaces. The
framework can restrict the space of policies by separating
the action policy into different levels. One way of realizing
an HMDP is via the options framework [21]. An HMDP
includes a set of options $O$. An option $o \in O$ is a tuple
$\{\pi_o, I_o, \beta_o\}$: $\pi_o$ is the policy for the option $o$, $I_o \subseteq S$ is the set
of states in which the option can be initiated, and $\beta_o(s_t)$ defines
the probability that the option will be terminated in state $s_t$.
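A literal rendering of the option tuple $\{\pi_o, I_o, \beta_o\}$ in Python might look as follows; the SlowForward example mirrors the option used later in Section IV, but the state labels are our own placeholders.

```python
from dataclasses import dataclass
from typing import Any, Callable, Set

@dataclass
class Option:
    pi: Callable[[Any], Any]        # intra-option policy pi_o: state -> action
    initiation: Set[Any]            # I_o: subset of S where the option may start
    beta: Callable[[Any], float]    # beta_o(s): probability of terminating in s

# A SlowForward-like option: creep with zero acceleration, start at the stop
# line, and terminate once the vehicle is inside the intersection.
slow_forward = Option(
    pi=lambda s: 0.0,
    initiation={"at_stop_line"},
    beta=lambda s: 1.0 if s == "in_intersection" else 0.0,
)
```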
C. Partially Observable MDP
When applying decision processes to real-world problems,
we run into the problem that we often can only infer the state
of the environment from potentially incomplete and noisy
sensor data, i.e., we have a partially observable environment.
POMDPs [1] provide a powerful tool to model such
partially observable processes. A POMDP is defined as a
tuple $\{S, A, R, T, \Omega, V, \gamma\}$, which extends the MDP formulation
through the definition of a set of environment observations
$\Omega$ and a set of conditional observation probabilities
$V(v_{t+1} \mid s_{t+1}, a_t)$.

Fig. 2: POMDP with LSTM network for generating continuous
actions. The input observation $o_t$ is a 126-D vector
that generates a 3-D continuous action vector $a_t$ through
two 512-D LSTM layers and one fully connected (FC) layer.
Meanwhile, the observation vector, followed by an FC layer,
is concatenated with the previous step's action vector, also
followed by an FC layer; the result then passes through a
128-D LSTM layer and three FC layers to produce a 1-D
Q-value $Q_t$ corresponding to the continuous action vector.
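Returning to the POMDP definition: although the paper does not write it out, the belief over states implied by this formulation is maintained with the standard Bayesian filter; in the notation above:

$$b_{t+1}(s_{t+1}) = \frac{V(v_{t+1} \mid s_{t+1}, a_t) \sum_{s_t \in S} T(s_{t+1}, a_t, s_t)\, b_t(s_t)}{\Pr(v_{t+1} \mid b_t, a_t)}$$

The denominator is a normalizing constant; recurrent approaches such as DRQN sidestep computing this belief explicitly by letting an LSTM summarize the observation history.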
IV. METHODOLOGY
In this paper, we model the intersection problem as a
reinforcement learning problem. In particular, we consider
the scenario of a four-way intersection with two-way stop
signs (see Figure 1). We assume a start configuration in
which the ego-vehicle (purple rectangle in Figure 1) has
already stopped near the stop line and would like to traverse
the intersection to reach a pre-defined destination beyond it,
completing a Turn Right, Go Straight, or Turn Left task.
The yellow rectangles are
simulated vehicles generated by the SUMO simulator [22]
and are all moving according to the Krauss Car Following
Traffic Model [23].
In this section, we propose two methods to solve the RL
problem. The first method models the problem as a POMDP
and uses the past observation and action pairs as inputs for
the reinforcement learning algorithm. It generates continuous
actions as outputs at each time step. The second method
models the problem as an MDP with hierarchical options,
which only takes the latest observations as input and then
generates discrete high-level options as well as low-level
actions through the option policy $\pi_o$ simultaneously.
A. POMDP with LSTM
In our work, we propose a network based on the ideas of
DDPG [9] and ADRQN [13] to generate continuous actions
based on the observations coming from previous steps. In the
network, instead of using all the previous observations, only
a fixed number of previous steps ($n_{steps} = 20$) are used as the
inputs to the LSTM layer. Figure 2 shows the structure of the
proposed network. The input of the network is the observation
at each time step; through the first LSTM layer and an FC
layer, the network incorporates information from previous
steps to produce the action output at each step. After that, we
pair the observation from the current step with the action
taken at the last step, $(o_t, a_{t-1})$, as the input for
generating the Q-value $Q_t$.

Fig. 3: MDP with Hierarchical Options network for generating
continuous actions. The 126-D input state vector
$s_t$ is followed by three FC layers to generate
2-D Q-values $O_t$ corresponding to the two hierarchical option
candidates. Meanwhile, the input state vector produces a 2-D
continuous action vector $a_t$ through four FC layers. Then, the
state vector, followed by an FC layer, is concatenated with the
action vector, followed by one FC layer; the output passes
through four FC layers to produce a 1-D Q-value $Q_t$
corresponding to the action vector.
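To make Figure 2 concrete, the following is a minimal PyTorch sketch of the actor and critic just described; the LSTM widths (2×512 and 128) and the I/O dimensions follow the figure, while the widths of the small FC head are our assumptions.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a window of past observations to a 3-D continuous action (Fig. 2)."""
    def __init__(self, obs_dim=126, act_dim=3, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, act_dim)

    def forward(self, obs_seq):                  # obs_seq: (batch, 20, 126)
        out, _ = self.lstm(obs_seq)
        return torch.tanh(self.fc(out[:, -1]))   # (batch, 3)

class Critic(nn.Module):
    """Pairs (o_t, a_{t-1}) through FC layers, a 128-D LSTM, and an FC head."""
    def __init__(self, obs_dim=126, act_dim=3, hidden=128):
        super().__init__()
        self.obs_fc = nn.Linear(obs_dim, hidden)
        self.act_fc = nn.Linear(act_dim, hidden)
        self.lstm = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(),
                                  nn.Linear(64, 32), nn.ReLU(),
                                  nn.Linear(32, 1))

    def forward(self, obs_seq, prev_act_seq):    # (batch, 20, 126), (batch, 20, 3)
        x = torch.cat([torch.relu(self.obs_fc(obs_seq)),
                       torch.relu(self.act_fc(prev_act_seq))], dim=-1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])             # (batch, 1) Q-value

actor, critic = Actor(), Critic()
obs = torch.randn(4, 20, 126)                    # 20-step observation windows
prev_act = torch.randn(4, 20, 3)                 # actions shifted by one step
a_t, q_t = actor(obs), critic(obs, prev_act)     # (4, 3), (4, 1)
```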
B. MDP with Hierarchical Options
For MDP with hierarchical options, we use three networks
to produce hierarchical options, low-level actions, and Q-
values. The structure of the network is shown in Figure
3. Unlike the POMDP network, the system does not take
observations from previous time steps into account for the
decision process. Instead, it uses hierarchical options to
decide whether the ego vehicle can trust the
environment or not. Then, based on the observations and the
high-level decision, the agent decides on the low-level
policy; the low-level decision here is the acceleration or
deceleration the agent applies at the current step.
The overall flow of the algorithm is given in Algorithm 1.
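A minimal PyTorch sketch of the three networks in Figure 3 follows; the input/output dimensions and layer counts match the figure, but the hidden widths are our assumptions.

```python
import torch
import torch.nn as nn

class OptionNet(nn.Module):
    """Three FC layers: 126-D state -> Q-values over the two options (Fig. 3)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(126, 256), nn.ReLU(),
                                 nn.Linear(256, 128), nn.ReLU(),
                                 nn.Linear(128, 2))
    def forward(self, s): return self.net(s)

class ActorNet(nn.Module):
    """Four FC layers: 126-D state -> 2-D continuous action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(126, 256), nn.ReLU(),
                                 nn.Linear(256, 128), nn.ReLU(),
                                 nn.Linear(128, 64), nn.ReLU(),
                                 nn.Linear(64, 2), nn.Tanh())
    def forward(self, s): return self.net(s)

class CriticNet(nn.Module):
    """State and action branches, concatenated, then four FC layers -> Q_t."""
    def __init__(self):
        super().__init__()
        self.s_fc, self.a_fc = nn.Linear(126, 128), nn.Linear(2, 128)
        self.head = nn.Sequential(nn.Linear(256, 128), nn.ReLU(),
                                  nn.Linear(128, 64), nn.ReLU(),
                                  nn.Linear(64, 32), nn.ReLU(),
                                  nn.Linear(32, 1))
    def forward(self, s, a):
        return self.head(torch.cat([torch.relu(self.s_fc(s)),
                                    torch.relu(self.a_fc(a))], dim=-1))
```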
C. State Space
In the representation of the state space, we assume that
the map information is known to the agent. We define the
state space as containing the ego vehicle's velocity $v$ and
road geometry information, including the distance between
the ego vehicle and the lower intersection boundary $d_{lb}$, the
mid-point $d_{mp}$, and the destination $d_{goal}$ of the intersection.
Figure 4 shows the geometric information related to the
intersection and the generated/simulated rays. A video
(https://youtu.be/5HMB8SWanW4) shows the simulated ray
trace representations during Go Straight, Turn Right, and
Turn Left maneuvers at the intersection in SUMO [22].

At each time step, a square of size 100 m × 100 m is
constructed whose center is the middle front of the ego
vehicle.
Algorithm 1 MDP with Hierarchical Options via RL
1: procedure HOMDP
2:   Construct two empty replay buffers $B_a$ and $B_o$.
3:   Randomly initialize actor network $NN_\mu$, critic network $NN_Q$, and option network $NN_O$ with weights $\theta^\mu$, $\theta^Q$, and $\theta^O$, and the corresponding target actor network $NN_{\mu'}$, critic network $NN_{Q'}$, and option network $NN_{O'}$ with weights $\theta^{\mu'}$, $\theta^{Q'}$, and $\theta^{O'}$.
4:   for $e \leftarrow 1$ to $E$ epochs do
5:     Get initial state $s_0$. The initial option is $o_0 = $ SlowForward; $r_o = 0$.
6:     for $t \leftarrow 1$ to $T$ time steps do
7:       $o_t, a_t$ = GetAction($s_t$, $o_{t-1}$)
8:       $s_{t+1}, r_t, done$ = StepForward($s_t$, $o_t$, $a_t$)
9:       if $o_t$ is Forward then
10:        $r_o \mathrel{+}= r_t$ and add $(s_t, a_t, r_t, s_{t+1}, done)$ to $B_a$.
11:        if $done$ then
12:          Add $(s_t, o_t, r_o, s_{t+1}, done)$ to $B_o$.
13:      else
14:        Add $(s_t, o_t, r_t, s_{t+1}, done)$ to $B_o$.
15:      Sample a random mini-batch of $M$ transitions $(s_i, o_i, r_i, s_{i+1})$ from $B_o$ and $(s_j, a_j, r_j, s_{j+1})$ from $B_a$.
16:      $o_{i+1} = \arg\max_o O'(s_{i+1} \mid \theta^{O'})$; $\; y^o_i = r_i + \gamma O'(s_{i+1}, o_{i+1} \mid \theta^{O'})$.
17:      Minimize $L_o = \frac{1}{M}\sum_i \big(y^o_i - O(s_i, o_i \mid \theta^O)\big)^2$ to update $NN_O$.
18:      $y^\mu_j = r_j + \gamma Q'(s_{j+1}, a_{j+1} \mid \theta^{Q'})$
19:      Minimize $L_\mu = \frac{1}{M}\sum_j \big(y^\mu_j - Q(s_j, a_j \mid \theta^Q)\big)^2$ to update $NN_Q$.
20:      $\frac{1}{M}\sum_j \nabla_{\mu(s_j)} Q(s_j, \mu(s_j) \mid \theta^Q)\, \nabla_{\theta^\mu}\mu(s_j \mid \theta^\mu)$ is the policy gradient and is used to update the actor network.
21:      Update the target networks: $\theta^{z'} \leftarrow \tau\theta^z + (1-\tau)\theta^{z'}$ for $z$ in $\{\mu, Q, O\}$.
Algorithm 2 Get Action
1: procedure GETACTION($s$, $o$)
2:   if $o$ is SlowForward then
3:     $o \leftarrow \arg\max_o O'(s \mid \theta^{O'})$ according to $\varepsilon$-greedy.
4:     $a = 0$.
5:   if $o$ is Forward then
6:     $a = \mu(s \mid \theta^\mu) + \mathcal{N}$, where $\mathcal{N}$ is a random process.
Algorithm 3 Move One Step Forward
1: procedure STEPFORWARD($s$, $o$, $a$)
2:   if $o$ is SlowForward then
3:     Slowly move forward by $d$ meters and stop.
4:   else if any $l_i / w_i \le a$ then
5:     Decrease the speed of the ego car with deceleration $c_d$.
6:   else
7:     Increase the velocity of the ego car with acceleration $c_a$.
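In Python, the option/action selection of Algorithm 2 might be sketched as below, reusing the OptionNet and ActorNet sketches above; the epsilon and noise parameters are our own illustrative assumptions.

```python
import random
import torch

SLOW_FORWARD, FORWARD = 0, 1

def get_action(s, o, option_net, actor_net, eps=0.1, noise_std=0.1):
    """Algorithm 2: re-pick the option epsilon-greedily while creeping,
    then query the actor only when the Forward option is active.
    `s` is a 126-D torch tensor; `option_net`/`actor_net` are callables."""
    if o == SLOW_FORWARD:
        if random.random() < eps:                      # epsilon-greedy exploration
            o = random.choice([SLOW_FORWARD, FORWARD])
        else:
            o = int(option_net(s).argmax().item())     # argmax_o O'(s)
    if o == FORWARD:
        a = actor_net(s) + noise_std * torch.randn(2)  # mu(s) + N
    else:
        a = torch.zeros(2)                             # creep forward, a = 0
    return o, a
```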
Fig. 4: Explanation of the state space and the method of constructing the ray trace system.

The square is divided into one million small squares
with size 0.1 m × 0.1 m. Each small square is marked occupied
if there is any obstacle or moving object in its area. There are 61
ray traces produced from the middle front of the ego vehicle,
spanning $\pi$ radians to cover the front view of the vehicle.
Each ray has a resolution of 0.5 m and a maximum
reach of 50 m. Each ray is emitted from the front center
of the ego vehicle, and if it reaches any obstacle such as the road
boundary or a moving vehicle, the corresponding distance is
sensed. Meanwhile, we assume that the velocity of the endpoint
of each ray can be determined as well. To sum up, the
agent obtains the distance and the velocity at the endpoint
of each ray trace, which results in a 126-D vector:

$$s_t = \left[\, v,\; d_{lb},\; d_{mp},\; d_{goal},\; l_i,\; c_i \,\right] \quad (1)$$

in which $i \in [0, 60]$, and $l_i$ and $c_i$ are the length and the
endpoint velocity of ray trace $i$; $v$ is the velocity of the
ego vehicle at the current time step $t$.
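Assembling the state of Eq. (1) is then a simple concatenation; a minimal sketch follows, with function and argument names of our own choosing.

```python
import numpy as np

def build_state(v, d_lb, d_mp, d_goal, ray_lengths, ray_velocities):
    """Pack the 126-D state of Eq. (1): 4 scalars, 61 ray lengths l_i,
    and 61 ray endpoint velocities c_i."""
    assert len(ray_lengths) == 61 and len(ray_velocities) == 61
    s_t = np.concatenate(([v, d_lb, d_mp, d_goal],
                          ray_lengths, ray_velocities)).astype(np.float32)
    assert s_t.shape == (126,)
    return s_t
```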
D. Reward Function
The reward function is used to evaluate the goodness of
the action taken at every time step based on the current states
or observations from the environment. For the proposed
problem, the reward function is designed to be the sum of
the following terms (a code sketch follows the list):

$$r_t = r_{ego} + r_{ray} + \sigma_5 + \sigma_6 + \sigma_7 \quad (2)$$
• For the ego vehicle, a positive reward is assigned that
reflects the progress towards the destination position.
A negative constant is added at each time step to
encourage the ego vehicle to make fast progress. The
initial position and destination of the ego vehicle are
fixed for each training iteration. Equation 3 represents
the reward term for the ego vehicle, in which $p_{des}$, $p_{ego}$,
and $p_{init}$ are the destination position, current position, and
initial position, respectively:

$$r_{ego} = \sigma_1 \frac{\lVert p_{des} - p_{ego} \rVert_2}{\lVert p_{des} - p_{init} \rVert_2} - \sigma_2 \quad (3)$$

where $\sigma_1$ and $\sigma_2$ are constants.
• To penalize the agent for getting too close to other traffic
participants and obstacles, a negative reward is imposed
if the ego vehicle interacts with other simulated vehicles
within its visibility. An interaction means that the
ego vehicle can see, through the simulated ray traces, an
approaching vehicle that has the right of way, while the
ego vehicle is blocking the future path of any approaching
vehicle (i.e., it has already passed the lower boundary of
the intersection):

$$r_{ray} = -\sum_i \frac{\sigma_3}{l_i} \, \mathbb{1}\!\left( \frac{l_i}{v_i} \le \sigma_4 \right) \quad (4)$$

in which the indicator $\mathbb{1}(\cdot)$ equals 1 if its condition holds and 0 otherwise.
• A constant penalty $\sigma_5$ is imposed in case of a crash
between the ego vehicle and any simulated vehicle:

$$\sigma_5 = -10000 \quad \text{if } \lVert p^i_{sim} - p_{ego} \rVert_2 = 0, \quad (5)$$

where $p^i_{sim}$ is the position of the $i$-th simulated vehicle
within the visibility of the ego vehicle.

Fig. 5: Two initial scenarios for the experiments.

• A constant penalty $\sigma_6$ is added if the task cannot be
finished within 1000 steps, which is 100 seconds in the
real world.

• Moreover, a positive reward $\sigma_7$ is added if the ego
vehicle reaches the goal:

$$\sigma_7 = 1000 \quad \text{if } \lVert p_{des} - p_{ego} \rVert_2 = 0. \quad (6)$$
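Putting the terms together, a sketch of the per-step reward might look as follows; the paper fixes $\sigma_5 = -10000$ and $\sigma_7 = 1000$, so the remaining constants here are placeholders of our own.

```python
import numpy as np

SIG1, SIG2, SIG3, SIG4, SIG6 = 100.0, 0.5, 1.0, 2.0, -500.0  # placeholder values

def step_reward(p_des, p_ego, p_init, rays, crashed, timed_out, reached_goal):
    """Sum of the reward terms in Eq. (2); `rays` is a list of (l_i, v_i) pairs
    and the position arguments are numpy arrays."""
    r_ego = SIG1 * (np.linalg.norm(p_des - p_ego) /
                    np.linalg.norm(p_des - p_init)) - SIG2      # Eq. (3)
    r_ray = -sum(SIG3 / l for l, v in rays
                 if v > 0 and l / v <= SIG4)                    # Eq. (4)
    r = r_ego + r_ray
    if crashed:
        r += -10000.0          # sigma_5, Eq. (5)
    if timed_out:
        r += SIG6              # sigma_6, timeout after 1000 steps
    if reached_goal:
        r += 1000.0            # sigma_7, Eq. (6)
    return r
```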
V. EXPERIMENTS
We tested our algorithm in the SUMO simulator [22]
with two scenarios which are shown in Figure 5. In the
two scenarios, the initial positions of the ego vehicle, the
position of the stop-line, the position of the mid-point of the
intersection and the position of the destination are all the
same. However, we change the position of the lower-bound
of the intersection, which may result in less information from
the ego vehicle’s ray traces, especially at the beginning of
each training epoch. This kind of situation is closer to real-
world scenarios, in which the average distance between the
lower bound of the intersection and the stop-line is four to
six meters. As a result, most vehicles don’t roll all the way
up to the edge of the intersection. The farther the stop-line
is from the lower bound of the intersection, the more
difficult it is to make a proper decision on when to traverse,
because the geometry of the intersection blocks visibility.
During the traversal process, the vehicle has to
Go Straight, Turn Right, or Turn Left successfully while
limiting the required interactions between the ego vehicle
and other simulated vehicles. To evaluate the effectiveness
of the proposed algorithms, we tested the following four
methods: 1) time-to-collision (TTC) [3]; 2) DDPG [9]; 3)
our proposed POMDP-with-LSTM algorithm; and 4) the HOMDP
with vanilla DRL for continuous actions that we also
propose in this paper. For the evaluation, we applied the resulting
policies for 1000 iterations on the Go Straight, Turn Right,
and Turn Left tasks separately. We compared the different
algorithms using the following metrics (a small evaluation
sketch follows the list):
• Percentage of Success: the percentage of epochs in
which the ego vehicle successfully reaches the destination.

• Percentage of Collision: the percentage of epochs in
which the ego vehicle collides with any other simulated
vehicle before reaching the destination.

• Percentage of Unfinished cases: the percentage of
epochs in which the ego vehicle does not reach the
goal within 1000 simulation steps, which equals 100
seconds in the real world.

TABLE I: Results for Scenarios One and Two with the TTC and DDPG methods

                             Scenario One        Scenario Two
Task          Metric         TTC      DDPG      TTC      DDPG
Go Straight   % Success      95.6     97.8      95.3     96.2
              % Collision    2.6      2.2       4.7      3.1
              % Unfinished   1.8      0.0       0.0      0.7
              Total Steps    385      136       70       195
              % Interaction  44.82    24.49     18.87    30.02
              Reward         673      735       552      587
Turn Right    % Success      98.6     100.0     98.2     97.6
              % Collision    1.4      0.0       1.8      2.4
              % Unfinished   0.0      0.0       0.0      0.0
              Total Steps    117      106       66       63
              % Interaction  17.05    10.14     13.97    11.4
              Reward         743      792       665      721
Turn Left     % Success      86.6     97.6      91.42    92.4
              % Collision    1.0      2.0       8.58     5.4
              % Unfinished   12.38    0.4       0.0      2.2
              Total Steps    514      196       108      289
              % Interaction  35.47    33.26     39.57    37.31
              Reward         -821     217       -437     -223
• Total steps $step_{total}$: the average number of total steps
the ego vehicle takes to finish each test iteration, including
all successful and failed cases. Fewer steps means
less time is taken to finish the whole process.

• Percentage of Interaction $step_{inter} / step_{total}$: the average
percentage of steps in each test iteration in which the ego
vehicle has an interaction with one of the other traffic
participants. An interaction is defined as an unscheduled
behavior change of a vehicle in response to the ego
vehicle's behavior. Other traffic participants can react to
the ego vehicle if the ego vehicle has passed the lower
bound of the intersection and is within the visibility
range of the ego vehicle. More interactions means that the ego
vehicle relies more on other vehicles' decelerations to
avoid collision.

• Total Reward $\sum_{t=0}^{T} r_t$: the reward the agent accumulates in
each test iteration.
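For reference, the metrics above could be computed from per-episode logs roughly as follows; the log format is our own assumption.

```python
def summarize(episodes):
    """episodes: list of dicts with boolean `success`/`collision`, integer
    `steps`, integer `interaction_steps`, and float `reward` per test run."""
    n = len(episodes)
    pct = lambda key: 100.0 * sum(e[key] for e in episodes) / n
    return {
        "% Success": pct("success"),
        "% Collision": pct("collision"),
        "% Unfinished": 100.0 - pct("success") - pct("collision"),
        "Total Steps": sum(e["steps"] for e in episodes) / n,
        "% Interaction": 100.0 * sum(e["interaction_steps"] / e["steps"]
                                     for e in episodes) / n,
        "Reward": sum(e["reward"] for e in episodes) / n,
    }
```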
VI. RESULTS
First, we compare the results for our two baselines (i.e.,
TTC and the vanilla DDPG) for both scenarios (Figure 5).
The results are shown in Table I. In Scenario One, DDPG
outperforms TTC in terms of success rate, total steps
to finish the traversal, and interaction rate, which is in
alignment with other reported results [7] that use bird's-eye view
data instead of ray trace data. However, for Scenario Two,
where there are more unobservable states initially and during
the training process, DDPG and TTC vary as to which has
better performance. Although DDPG has a higher success
rate for the Go Straight and Turn Left tasks, it also has
a higher interaction rate, which means the lower collision
rate is due at least in part to the other vehicles' adjustments
rather than the ego vehicle's skill.
Next, we compared POMDP with LSTM and HOMDP in
Scenario Two alone; the results are presented in Table II. The
performance of the final policies for POMDP and HOMDP
can be seen in a video (https://youtu.be/Rj5bYtKshh8).

TABLE II: Results for Scenario Two with the four methods together

Task          Metric         TTC      DDPG     POMDP    HOMDP
Go Straight   % Success      95.3     96.2     97.1     98.3
              % Collision    4.7      3.1      1.7      1.7
              % Unfinished   0.0      0.7      1.2      0.0
              Total Steps    70       195      261      102
              % Interaction  18.87    30.02    27.86    26.41
              Reward         552      587      621      873
Turn Right    % Success      98.2     97.6     99.5     99.8
              % Collision    1.8      2.4      0.5      0.2
              % Unfinished   0.0      0.0      0.0      0.0
              Total Steps    66       63       57       62
              % Interaction  13.97    11.4     6.26     9.28
              Reward         665      721      892      903
Turn Left     % Success      91.4     92.4     95.6     97.3
              % Collision    8.6      5.4      2.4      2.6
              % Unfinished   0.0      2.2      2.0      0.1
              Total Steps    108      289      276      132
              % Interaction  39.57    37.31    33.21    28.90
              Reward         -437     -223     213      632
In general, TTC
causes a higher collision rate and may result in more
interactions, which means the method relies on the other
simulated vehicles yielding to the ego vehicle to
avoid collision, which does not match real-world behavior.
The POMDP result is more conservative,
which causes more unfinished failure cases, i.e., epochs
in which the ego vehicle cannot finish the traversal task
within 1000 simulation steps. The HOMDP method, our
main contribution, achieves the highest total reward
and success rate on all of the Go Straight, Turn Right, and Turn
Left tasks. For the Go Straight and Turn Left tasks, the interaction
rate is lowest with HOMDP, which means the results rely least
on other vehicles' yielding. HOMDP takes more total steps
than some other methods (TTC for the Go Straight and Turn Left tasks,
POMDP for the Turn Right task), which means HOMDP is a more
conservative strategy that values the success rate most.
VII. CONCLUSION AND FUTURE WORK

This paper proposes two methods to deal with the problem
of an autonomous vehicle traversing a partially uncontrolled
urban stop-sign intersection. The first approach is based
on the ideas of DDPG and ADRQN, which use previous
steps' observations to learn an optimal policy with respect
to the current observations by using an LSTM layer in the
DRL method. The second method is based on the idea of
hierarchical MDPs, which learns a discrete option in the
high-level learning process and continuous low-level actions
simultaneously. The results show that HOMDP provides
better performance in terms of both success rate and interaction rate.
In future work, we want to apply our model to real
traffic data instead of the simulated traffic model in SUMO.
As a starting point, we will apply the model to the Peachtree
dataset in NGSIM [24].
REFERENCES
[1] G. E. Monahan, “State of the art: A survey of partially observable
Markov decision processes: Theory, models, and algorithms,” Manage-
ment Science, vol. 28, no. 1, pp. 1–16, 1982.
[2] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.
MIT press Cambridge, 1998, vol. 1, no. 1.
[3] D. N. Lee, “A theory of visual control of braking based on information
about time-to-collision,” Perception, vol. 5, no. 4, pp. 437–459, 1976.
[4] C. Dong, J. M. Dolan, and B. Litkouhi, “Intention estimation for
ramp merging control in autonomous driving,” in Intelligent Vehicles
Symposium (IV), 2017 IEEE. IEEE, 2017, pp. 1584–1589.
[5] C. Dong, Y. Zhang, and J. M. Dolan, “Lane-change social behavior
generator for autonomous driving car by non-parametric regression in
reproducing kernel Hilbert space,” in Intelligent Robots and Systems
(IROS), 2017 IEEE/RSJ International Conference on. IEEE, 2017,
pp. 4489–4494.
[6] W. Song, G. Xiong, and H. Chen, “Intention-aware autonomous driv-
ing decision-making in an uncontrolled intersection,” Mathematical
Problems in Engineering, vol. 2016, 2016.
[7] D. Isele, A. Cosgun, K. Subramanian, and K. Fujimura, “Navigating
intersections with autonomous vehicles using deep reinforcement
learning,” arXiv preprint arXiv:1705.01196, 2017.
[8] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou,
D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement
learning,” arXiv preprint arXiv:1312.5602, 2013.
[9] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa,
D. Silver, and D. Wierstra, “Continuous control with deep reinforce-
ment learning,” arXiv preprint arXiv:1509.02971, 2015.
[10] M. Egorov, “Deep reinforcement learning with POMDPs,” 2015.
[11] M. Hausknecht and P. Stone, “Deep recurrent Q-learning for partially
observable MDPs,” CoRR, abs/1507.06527, 2015.
[12] J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson, “Learning
to communicate to solve riddles with deep distributed recurrent Q-
networks,” arXiv preprint arXiv:1602.02672, 2016.
[13] P. Zhu, X. Li, P. Poupart, and G. Miao, “On improving deep reinforce-
ment learning for POMDPs,” arXiv preprint arXiv:1804.06309, 2018.
[14] S. P. Singh, “Transfer of learning by composing solutions of elemental
sequential tasks,” Machine Learning, vol. 8, no. 3-4, pp. 323–339,
1992.
[15] T. G. Dietterich, “The MAXQ method for hierarchical reinforcement
learning,” in ICML, vol. 98. Citeseer, 1998, pp. 118–126.
[16] N. K. Jong and P. Stone, “Hierarchical model-based reinforcement
learning: R-MAX + MAXQ,” in Proceedings of the 25th International
Conference on Machine Learning. ACM, 2008, pp. 432–439.
[17] R. I. Brafman and M. Tennenholtz, “R-MAX: A general polynomial
time algorithm for near-optimal reinforcement learning,” Journal of
Machine Learning Research, vol. 3, no. Oct, pp. 213–231, 2002.
[18] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum, “Hier-
archical deep reinforcement learning: Integrating temporal abstraction
and intrinsic motivation,” in Advances in neural information process-
ing systems, 2016, pp. 3675–3683.
[19] D. Steckelmacher, D. M. Roijers, A. Harutyunyan, P. Vrancx,
and A. Nowé, “Reinforcement learning in POMDPs with memory-
less options and option-observation initiation sets,” arXiv preprint
arXiv:1708.06551, 2017.
[20] P. Poupart and C. Boutilier, “Bounded finite state controllers,” in
Advances in neural information processing systems, 2004, pp. 823–
830.
[21] M. Hauskrecht, N. Meuleau, L. P. Kaelbling, T. Dean, and C. Boutilier,
“Hierarchical solution of Markov decision processes using macro-
actions,” in Proceedings of the Fourteenth conference on Uncertainty
in artificial intelligence. Morgan Kaufmann Publishers Inc., 1998,
pp. 220–229.
[22] D. Krajzewicz, J. Erdmann, M. Behrisch, and L. Bieker, “Recent
development and applications of SUMO - Simulation of Urban MObility,”
International Journal On Advances in Systems and Measurements,
vol. 5, no. 3&4, 2012.
[23] J. Song, Y. Wu, Z. Xu, and X. Lin, “Research on car-following model
based on SUMO,” in Advanced Infocomm Technology (ICAIT), 2014
IEEE 7th International Conference on. IEEE, 2014, pp. 47–55.
[24] “NGSIM homepage. FHWA.” 2005-2006. [Online]. Available:
http://ngsim.fhwa.dot.gov.