POMDP and Hierarchical Options MDP with Continuous Actions for
Autonomous Driving at Intersections
Zhiqian Qiao1, Katharina Muelling2, John Dolan1,2, Praveen Palanisamy3 and Priyantha Mudalige3
Abstract— When applying autonomous driving technology
to real-world scenarios, environmental uncertainties make the
development of decision-making algorithms difficult. Modeling
the problem as a Partially Observable Markov Decision Process
(POMDP) [1] allows the algorithm to consider these uncertain-
ties in the decision process, which makes it more robust to
real sensor characteristics. However, solving the POMDP with
reinforcement learning (RL) [2] often requires storing a large
number of observations. Furthermore, for continuous action
spaces, the system is computationally inefficient. This paper
addresses these problems by proposing to model the problem as
an MDP and learn a policy with RL using hierarchical options
(HOMDP). The suggested algorithm can store the state-action
pairs and only uses current observations to solve a POMDP
problem. We compare the results to those of the time-to-collision
method [3] and the proposed POMDP-with-LSTM method. Our
results show that the HOMDP approach is able to improve the
performance of the agent for a four-way intersection task with
two-way stop signs. The HOMDP method can generate both
higher-level discrete options and lower-level continuous actions
with only the observations of the current step.
I. INTRODUCTION
Traversing a four-way intersection with two-way stop
signs can be difficult for autonomous vehicles. Upon arrival,
the agent has to time its actions properly to make a turn onto
the right-of-way road safely. If the car enters the intersection
too soon, it can result in a collision or cause the approaching
right-of-way vehicles to brake hard. On the other hand, if the
driver keeps waiting for too long to make sure the situation
is safe, valuable time is lost.
At first glance, rule-based methods [4][5] seem to
be a plausible approach to describe the human decision
process for intersection traversal. The driver estimates the
time an approaching vehicle needs to reach and traverse the
intersection. If the traversal time is less than the approaching
time, then the driver may make the decision to begin the
traversal. Otherwise, the driver may choose to wait. For an
autonomous vehicle, an algorithm called time-to-collision
(TTC) [3] takes a similar approach. However, estimating the
time accurately and dealing with unexpected situations by
adjusting the ego vehicle’s decision according to changes in
the environment is hard for autonomous vehicles. As a result,
such an approach is not reliable enough for the autonomous
vehicle. An alternative to using rule-based methods is to
model the problem as a Markov Decision Process (MDP)
and to learn an optimal policy using reinforcement learning.
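To make the rule concrete, a minimal gap-acceptance check in the spirit of TTC [3] could look like the sketch below; the function name, the safety margin, and the decision structure are our own illustrative assumptions rather than details from the paper.

```python
def ttc_gap_accept(gap_m: float, closing_speed_mps: float,
                   traversal_time_s: float, margin_s: float = 1.0) -> bool:
    """Go only if the approaching vehicle arrives later than our own
    estimated traversal time plus a safety margin (a simple TTC-style rule)."""
    if closing_speed_mps <= 0.0:
        return True                       # vehicle not actually approaching
    time_to_arrival = gap_m / closing_speed_mps
    return time_to_arrival > traversal_time_s + margin_s

# Example: a car 40 m away closing at 10 m/s arrives in 4 s; with a 3 s
# traversal estimate and a 1 s margin the rule says wait (4 > 4 is False).
print(ttc_gap_accept(40.0, 10.0, 3.0))    # False -> wait
```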
1Department of Electrical and Computer Engineering, Carnegie
Mellon University, 5000 Forbes Ave, Pittsburgh, PA, USA.
zhiqianq@andrew.cmu.edu
2The Robotics Institute, Carnegie Mellon University, Pittsburgh, USA.
3Research & Development, General Motors, Warren, MI, USA.
Fig. 1: The proposed intersection traversal scenario with
simulated ray traces. The ego vehicle starts from the stop
line with zero velocity and other simulated vehicles run in
other lanes using the Krauss Traffic Model.
Some previous work [6][7] on the intersection traversal
problem uses a bird's-eye view to construct the state information
for decision making, which is hard to obtain in
a real vehicle. In this work, we only use information
coming from resources available to an autonomous vehicle,
such as a simulated LIDAR and the ego car's
state information. The challenge is that the LIDAR ray trace
information does not include everything happening near the
intersection, but only the information within the visibility
range of the ego vehicle. In this paper, we assume that
the intersection traversal problem is an MDP; however, the
ego vehicle cannot directly observe anything happening
outside its visibility range due to the geometry of the
intersection. For any driving scenario, it is hard for the ego car
to make proper decisions based only on the current state of
the surrounding environment. Meanwhile, the agent cannot
access the historical information of other vehicles in order to
make a proper decision. Modeling the whole system as a
POMDP maintains a probability distribution over the agent's
observations. As a result, our first contribution is to model the
problem as a POMDP and use deep reinforcement learning
(DRL) to optimize the policy for generating continuous
actions. The second contribution is proposing Hierarchical
Options for MDP (HOMDP) to solve the problem. Instead
of modeling the problem as a POMDP, we still apply the
MDP model and solve the problem by adding hierarchical
options, which gives better results than POMDP alone.
II. RELATED WORK
This section introduces previous work related to this paper,
which can be categorized as follows: 1) papers that address
deep reinforcement learning (DRL) algorithms; 2) work
focused on using DRL to solve the POMDP problem; 3)
papers which use hierarchical structure and the concept of
options in combination with DRL methods.
A. Deep Reinforcement Learning
Recently, various DRL algorithms have shown advantages
in handling high-dimensional sensory inputs with discrete
or continuous actions. Deep Q-Networks [8] (DQN) and
its variants have been successfully applied to various fields.
However, DQN is restricted to the discrete action domain,
which makes it hard to apply to autonomous driving. DDPG
[9], which was proposed by Lillicrap et al., adapted the ideas
underlying the success of DQN to the continuous action
domain. All these solutions have in common that they model
the problem as an MDP, i.e., they assume full knowledge of
the environment, which is often not true in the real world.
B. POMDP with Reinforcement Learning
[10] proposed to extend the DQN method for POMDP
problems by learning the optimal policy of a model-based
RL problem. They used a fully-connected network to ap-
proximate the mapping function from observations and belief
vectors to Q-values for different actions in order to solve
a POMDP problem through DQN. [11] proposed a method
called Deep Recurrent Q-network (DRQN) which can be ap-
plied to a model-free representation by adding LSTM layers
in the policy network. The trained policy network is capable
of capturing all the historical information in the LSTM
layer and the output actions are based on all the historical
observations with the help of LSTM layers. [7] used DRQN
with states from a bird’s-eye view of the intersection to
learn the policy of how to traverse the intersection. Based on
DRQN, [12] proposed an algorithm called Deep Distributed
Recurrent Q-networks (DDRQN) which not only used the
previous observations but also previous actions as the inputs.
The results show the first success in learning communication
protocols by RL. [13] recently proposed the Action-specific
Deep Recurrent Q-Network (ADRQN), which also uses
historical observations and actions as input to obtain an
optimal policy. However, instead of feeding all historical
information into the LSTM layer, they pair the last step's
action with the current step's observation as the LSTM input
for training, which showed improved results on some
POMDP tasks.
C. Hierarchical Reinforcement Learning and Options
In the context of RL, [14] proposed the idea of
options to generate actions at different levels. The options
are used to define a policy governing when the action policy
is initialized and terminated. [15] introduced the concept of
hierarchical Q-learning called MAXQ. They proved the convergence
of MAXQ mathematically and showed experimentally
that it computes faster than the original Q-learning.
[16] proposed an improved MAXQ method by combining the
R-MAX [17] algorithm with MAXQ. It has both the efficient
model-based exploration of R-MAX and the opportunities
for abstraction provided by the MAXQ framework. [18]
presented hierarchical DQN (h-DQN), a method which is
based on the structure of DQN. H-DQN adds another high-
level Q-value function to learn a high-level policy while the
DQN is responsible for the low-level policy. The algorithm
is effective on ATARI games with sparse and delayed
feedback. Instead of storing observations in memory, [19]
proposed a memoryless options method to solve the POMDP
problem by RL. They compared their results with finite state
controllers (FSC) [20], which is a state-of-the-art approach
that is commonly used to solve the POMDP problem. The
results showed that the method is both as expressive
as FSC and more efficient than a recurrent neural network
when learning optimal policies in challenging POMDPs.
In this paper, we first extend the original POMDP-with-LSTM
[7] structure to an application that can generate
continuous actions instead of discrete actions. Second, we
propose a structure which combines HMDP and DRL and
apply the new structure to the problem of an autonomous
vehicle traversing an intersection.
III. PRELIMINARIES
This work is based mainly on the ideas of Markov Deci-
sion Processes (MDPs), Hierarchical MDP (HMDP) and Par-
tially Observable MDP (POMDP). For a better understanding
of the proposed methodology, we will give a short overview
of all three frameworks before explaining our approach in
Section IV.
A. Markov Decision Process
An MDP is defined as a tuple $\{S, A, R, T, \gamma\}$ in which $S$
denotes a set of states, $A$ defines the set of available actions,
and $T(s_{t+1}, a_t, s_t)$ is a transition function that maps a state-action
pair $(s_t, a_t)$ to a new state $s_{t+1}$. The reward function $R$
defines the immediate rewards for each state-action pair, and
$\gamma$ is a discount factor for long-term rewards. At every time
step $t$, the agent receives the state $s_t$ from the environment
and then selects an action $a_t$ according to the policy $\pi$, which
results in the agent transitioning to a new state $s_{t+1}$
according to the transition function $T$. Meanwhile, the agent
receives an immediate reward $r_t$ after taking action $a_t$ in
state $s_t$.
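To make the notation concrete, here is a toy two-state MDP rolled out under a random stand-in policy; the states, actions, and reward numbers are illustrative inventions, not part of the paper.

```python
import random

# Toy MDP {S, A, R, T, gamma}: all states, actions, and rewards are made up.
S = ["wait", "go"]
A = ["brake", "accelerate"]
gamma = 0.99
T = {("wait", "brake"): "wait", ("wait", "accelerate"): "go",
     ("go", "brake"): "wait", ("go", "accelerate"): "go"}
R = {("wait", "brake"): -1.0, ("wait", "accelerate"): 0.0,
     ("go", "brake"): -1.0, ("go", "accelerate"): 1.0}

s, ret = "wait", 0.0
for t in range(10):
    a = random.choice(A)                 # stand-in for the policy pi
    ret += (gamma ** t) * R[(s, a)]      # accumulate discounted reward
    s = T[(s, a)]                        # transition T(s_{t+1}, a_t, s_t)
print(f"discounted return: {ret:.2f}")
```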
B. Hierarchical MDP
Hierarchical MDP (HMDP) is a general framework for
solving problems with large state and action spaces. The
framework can restrict the space of policies by separating
the action policy into different levels. One way of realizing
an HMDP is via the options framework [21]. An HMDP
includes a set of options $O$. An option $o \in O$ is a tuple
$\{\pi_o, I_o, \beta_o\}$: $\pi_o$ is the policy for the option $o$, $I_o \subseteq S$ is the set
of states in which the option can be initiated, and $\beta_o(s_t)$ defines
the probability that the option will be terminated in state $s_t$.
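A literal rendering of the option tuple $\{\pi_o, I_o, \beta_o\}$ in Python might look as follows; the SlowForward example mirrors the option used later in Section IV, but the state labels are our own placeholders.

```python
from dataclasses import dataclass
from typing import Any, Callable, Set

@dataclass
class Option:
    pi: Callable[[Any], Any]        # intra-option policy pi_o: state -> action
    initiation: Set[Any]            # I_o: subset of S where the option may start
    beta: Callable[[Any], float]    # beta_o(s): probability of terminating in s

# A SlowForward-like option: creep with zero acceleration, start at the stop
# line, and terminate once the vehicle is inside the intersection.
slow_forward = Option(
    pi=lambda s: 0.0,
    initiation={"at_stop_line"},
    beta=lambda s: 1.0 if s == "in_intersection" else 0.0,
)
```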
C. Partially Observable MDP
When applying decision processes to real-world problems,
we run into the problem that we often can only infer the state
of the environment from potentially incomplete and noisy
sensor data, i.e., we have a partially observable environment.
POMDPs [1] provide a powerful tool to model such
partially observable processes. A POMDP is defined as a
tuple $\{S, A, R, T, \Omega, V, \gamma\}$, which extends the MDP formulation
through the definition of a set of environment observations
$\Omega$ and a set of conditional observation probabilities
$V(v_{t+1} \mid s_{t+1}, a_t)$.

Fig. 2: POMDP with LSTM network for generating continuous
actions. The input observation $o_t$ is a 126-D vector
that generates a 3-D continuous action vector $a_t$ through
two 512-D LSTM layers and one fully connected (FC) layer.
Meanwhile, the observation vector, followed by an FC layer,
is concatenated with the previous step's action vector, also
followed by an FC layer; the result then passes through a
128-D LSTM layer and three FC layers to produce a 1-D
Q-value $Q_t$ corresponding to the continuous action vector.
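Returning to the POMDP definition: although the paper does not write it out, the belief over states implied by this formulation is maintained with the standard Bayesian filter; in the notation above:

$$b_{t+1}(s_{t+1}) = \frac{V(v_{t+1} \mid s_{t+1}, a_t) \sum_{s_t \in S} T(s_{t+1}, a_t, s_t)\, b_t(s_t)}{\Pr(v_{t+1} \mid b_t, a_t)}$$

The denominator is a normalizing constant; recurrent approaches such as DRQN sidestep computing this belief explicitly by letting an LSTM summarize the observation history.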
IV. METHODOLOGY
In this paper, we model the intersection problem as a
reinforcement learning problem. In particular, we consider
the scenario of a four-way intersection with two-way stop
signs (see Figure 1). We assume a start configuration in
which the ego-vehicle (purple rectangle in Figure 1) has
already stopped near the stop line and would like to traverse
the intersection to reach a pre-defined destination beyond it,
completing a Turn Right, Go Straight, or Turn Left task.
The yellow rectangles are
simulated vehicles generated by the SUMO simulator [22]
and are all moving according to the Krauss Car Following
Traffic Model [23].
In this section, we propose two methods to solve the RL
problem. The first method models the problem as a POMDP
and uses the past observation and action pairs as inputs for
the reinforcement learning algorithm. It generates continuous
actions as outputs at each time step. The second method
models the problem as an MDP with hierarchical options,
which only takes the latest observations as input and then
generates discrete high-level options as well as low-level
actions through the option policy $\pi_o$ simultaneously.
A. POMDP with LSTM
In our work, we propose a network based on the ideas of
DDPG [9] and ADRQN [13] to generate continuous actions
based on the observations coming from previous steps. In the
network, instead of using all the previous observations, only
a fixed number of previous steps ($n_{steps} = 20$) are used as the
inputs to the LSTM layer. Figure 2 shows the structure of the
proposed network. The input of the network is the observation
at each time step; through the first LSTM layer and an FC
layer, the network incorporates information from previous
steps to produce the action output at each step. After that, we
pair the observation from the current step with the action
taken at the last step, $(o_t, a_{t-1})$, as the input for
generating the Q-value $Q_t$.

Fig. 3: MDP with Hierarchical Options network for generating
continuous actions. The 126-D input state vector
$s_t$ is followed by three FC layers to generate
2-D Q-values $O_t$ corresponding to the two hierarchical option
candidates. Meanwhile, the input state vector produces a 2-D
continuous action vector $a_t$ through four FC layers. Then, the
state vector, followed by an FC layer, is concatenated with the
action vector, followed by one FC layer; the output passes
through four FC layers to produce a 1-D Q-value $Q_t$
corresponding to the action vector.
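To make Figure 2 concrete, the following is a minimal PyTorch sketch of the actor and critic just described; the LSTM widths (2×512 and 128) and the I/O dimensions follow the figure, while the widths of the small FC head are our assumptions.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a window of past observations to a 3-D continuous action (Fig. 2)."""
    def __init__(self, obs_dim=126, act_dim=3, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, act_dim)

    def forward(self, obs_seq):                  # obs_seq: (batch, 20, 126)
        out, _ = self.lstm(obs_seq)
        return torch.tanh(self.fc(out[:, -1]))   # (batch, 3)

class Critic(nn.Module):
    """Pairs (o_t, a_{t-1}) through FC layers, a 128-D LSTM, and an FC head."""
    def __init__(self, obs_dim=126, act_dim=3, hidden=128):
        super().__init__()
        self.obs_fc = nn.Linear(obs_dim, hidden)
        self.act_fc = nn.Linear(act_dim, hidden)
        self.lstm = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(),
                                  nn.Linear(64, 32), nn.ReLU(),
                                  nn.Linear(32, 1))

    def forward(self, obs_seq, prev_act_seq):    # (batch, 20, 126), (batch, 20, 3)
        x = torch.cat([torch.relu(self.obs_fc(obs_seq)),
                       torch.relu(self.act_fc(prev_act_seq))], dim=-1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])             # (batch, 1) Q-value

actor, critic = Actor(), Critic()
obs = torch.randn(4, 20, 126)                    # 20-step observation windows
prev_act = torch.randn(4, 20, 3)                 # actions shifted by one step
a_t, q_t = actor(obs), critic(obs, prev_act)     # (4, 3), (4, 1)
```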
B. MDP with Hierarchical Options
For MDP with hierarchical options, we use three networks
to produce hierarchical options, low-level actions, and Q-
values. The structure of the network is shown in Figure
3. Unlike the POMDP network, the system does not take
observations from previous time steps into account for the
decision process. Instead, it uses hierarchical options to
decide whether the ego vehicle can trust the
environment or not. Then, based on the observations and the
high-level decision, the agent decides on the low-level
policy; the low-level decision here is the acceleration or
deceleration the agent applies at the current step.
The overall flow of the algorithm is given in Algorithm 1.
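A minimal PyTorch sketch of the three networks in Figure 3 follows; the input/output dimensions and layer counts match the figure, but the hidden widths are our assumptions.

```python
import torch
import torch.nn as nn

class OptionNet(nn.Module):
    """Three FC layers: 126-D state -> Q-values over the two options (Fig. 3)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(126, 256), nn.ReLU(),
                                 nn.Linear(256, 128), nn.ReLU(),
                                 nn.Linear(128, 2))
    def forward(self, s): return self.net(s)

class ActorNet(nn.Module):
    """Four FC layers: 126-D state -> 2-D continuous action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(126, 256), nn.ReLU(),
                                 nn.Linear(256, 128), nn.ReLU(),
                                 nn.Linear(128, 64), nn.ReLU(),
                                 nn.Linear(64, 2), nn.Tanh())
    def forward(self, s): return self.net(s)

class CriticNet(nn.Module):
    """State and action branches, concatenated, then four FC layers -> Q_t."""
    def __init__(self):
        super().__init__()
        self.s_fc, self.a_fc = nn.Linear(126, 128), nn.Linear(2, 128)
        self.head = nn.Sequential(nn.Linear(256, 128), nn.ReLU(),
                                  nn.Linear(128, 64), nn.ReLU(),
                                  nn.Linear(64, 32), nn.ReLU(),
                                  nn.Linear(32, 1))
    def forward(self, s, a):
        return self.head(torch.cat([torch.relu(self.s_fc(s)),
                                    torch.relu(self.a_fc(a))], dim=-1))
```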
C. State Space
In the representation of the state space, we assume that
the map information is known to the agent. We define the
state space as containing the ego vehicle's velocity $v$ and
road geometry information, including the distance between
the ego vehicle and the lower intersection boundary $d_{lb}$, the
mid-point $d_{mp}$, and the destination $d_{goal}$ of the intersection.
Figure 4 shows the geometric information related to the
intersection and the generated/simulated rays. A video
(https://youtu.be/5HMB8SWanW4) shows the simulated ray
trace representations during Go Straight, Turn Right, and
Turn Left maneuvers at the intersection in SUMO [22].

At each time step, a square of size 100 m × 100 m is
constructed whose center is the middle front of the ego
vehicle.
Algorithm 1 MDP with Hierarchical Options via RL
1: procedure HOMDP
2:   Construct two empty replay buffers $B_a$ and $B_o$.
3:   Randomly initialize actor network $NN_\mu$, critic network $NN_Q$, and option network $NN_O$ with weights $\theta^\mu$, $\theta^Q$, and $\theta^O$, and the corresponding target actor network $NN_{\mu'}$, critic network $NN_{Q'}$, and option network $NN_{O'}$ with weights $\theta^{\mu'}$, $\theta^{Q'}$, and $\theta^{O'}$.
4:   for $e \leftarrow 1$ to $E$ epochs do
5:     Get initial state $s_0$. The initial option is $o_0 = $ SlowForward; $r_o = 0$.
6:     for $t \leftarrow 1$ to $T$ time steps do
7:       $o_t, a_t$ = GetAction($s_t$, $o_{t-1}$)
8:       $s_{t+1}, r_t, done$ = StepForward($s_t$, $o_t$, $a_t$)
9:       if $o_t$ is Forward then
10:        $r_o \mathrel{+}= r_t$ and add $(s_t, a_t, r_t, s_{t+1}, done)$ to $B_a$.
11:        if $done$ then
12:          Add $(s_t, o_t, r_o, s_{t+1}, done)$ to $B_o$.
13:      else
14:        Add $(s_t, o_t, r_t, s_{t+1}, done)$ to $B_o$.
15:      Sample a random mini-batch of $M$ transitions $(s_i, o_i, r_i, s_{i+1})$ from $B_o$ and $(s_j, a_j, r_j, s_{j+1})$ from $B_a$.
16:      $o_{i+1} = \arg\max_o O'(s_{i+1} \mid \theta^{O'})$; $\; y^o_i = r_i + \gamma O'(s_{i+1}, o_{i+1} \mid \theta^{O'})$.
17:      Minimize $L_o = \frac{1}{M}\sum_i \big(y^o_i - O(s_i, o_i \mid \theta^O)\big)^2$ to update $NN_O$.
18:      $y^\mu_j = r_j + \gamma Q'(s_{j+1}, a_{j+1} \mid \theta^{Q'})$
19:      Minimize $L_\mu = \frac{1}{M}\sum_j \big(y^\mu_j - Q(s_j, a_j \mid \theta^Q)\big)^2$ to update $NN_Q$.
20:      $\frac{1}{M}\sum_j \nabla_{\mu(s_j)} Q(s_j, \mu(s_j) \mid \theta^Q)\, \nabla_{\theta^\mu}\mu(s_j \mid \theta^\mu)$ is the policy gradient and is used to update the actor network.
21:      Update the target networks: $\theta^{z'} \leftarrow \tau\theta^z + (1-\tau)\theta^{z'}$ for $z$ in $\{\mu, Q, O\}$.
Algorithm 2 Get Action
1: procedure GETACTION($s$, $o$)
2:   if $o$ is SlowForward then
3:     $o \leftarrow \arg\max_o O'(s \mid \theta^{O'})$ according to $\varepsilon$-greedy.
4:     $a = 0$.
5:   if $o$ is Forward then
6:     $a = \mu(s \mid \theta^\mu) + \mathcal{N}$, where $\mathcal{N}$ is a random process.
Algorithm 3 Move One Step Forward
1: procedure STEPFORWARD($s$, $o$, $a$)
2:   if $o$ is SlowForward then
3:     Slowly move forward by $d$ meters and stop.
4:   else if any $l_i / w_i \le a$ then
5:     Decrease the speed of the ego car with deceleration $c_d$.
6:   else
7:     Increase the velocity of the ego car with acceleration $c_a$.
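In Python, the option/action selection of Algorithm 2 might be sketched as below, reusing the OptionNet and ActorNet sketches above; the epsilon and noise parameters are our own illustrative assumptions.

```python
import random
import torch

SLOW_FORWARD, FORWARD = 0, 1

def get_action(s, o, option_net, actor_net, eps=0.1, noise_std=0.1):
    """Algorithm 2: re-pick the option epsilon-greedily while creeping,
    then query the actor only when the Forward option is active.
    `s` is a 126-D torch tensor; `option_net`/`actor_net` are callables."""
    if o == SLOW_FORWARD:
        if random.random() < eps:                      # epsilon-greedy exploration
            o = random.choice([SLOW_FORWARD, FORWARD])
        else:
            o = int(option_net(s).argmax().item())     # argmax_o O'(s)
    if o == FORWARD:
        a = actor_net(s) + noise_std * torch.randn(2)  # mu(s) + N
    else:
        a = torch.zeros(2)                             # creep forward, a = 0
    return o, a
```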
Fig. 4: Explanation of the state space and the method of constructing the ray trace system.

The square is divided into one million small squares
with size 0.1 m × 0.1 m. Each small square is marked occupied
if there is any obstacle or moving object in its area. There are 61
ray traces produced from the middle front of the ego vehicle,
spanning $\pi$ radians to cover the front view of the vehicle.
Each ray has a resolution of 0.5 m and a maximum
reach of 50 m. Each ray is emitted from the front center
of the ego vehicle, and if it reaches any obstacle such as the road
boundary or a moving vehicle, the corresponding distance is
sensed. Meanwhile, we assume that the velocity of the endpoint
of each ray can be determined as well. To sum up, the
agent obtains the distance and the velocity at the endpoint
of each ray trace, which results in a 126-D vector:

$$s_t = \left[\, v,\; d_{lb},\; d_{mp},\; d_{goal},\; l_i,\; c_i \,\right] \quad (1)$$

in which $i \in [0, 60]$, and $l_i$ and $c_i$ are the length and the
endpoint velocity of ray trace $i$; $v$ is the velocity of the
ego vehicle at the current time step $t$.
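Assembling the state of Eq. (1) is then a simple concatenation; a minimal sketch follows, with function and argument names of our own choosing.

```python
import numpy as np

def build_state(v, d_lb, d_mp, d_goal, ray_lengths, ray_velocities):
    """Pack the 126-D state of Eq. (1): 4 scalars, 61 ray lengths l_i,
    and 61 ray endpoint velocities c_i."""
    assert len(ray_lengths) == 61 and len(ray_velocities) == 61
    s_t = np.concatenate(([v, d_lb, d_mp, d_goal],
                          ray_lengths, ray_velocities)).astype(np.float32)
    assert s_t.shape == (126,)
    return s_t
```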
D. Reward Function
The reward function is used to evaluate the goodness of
the action taken at every time step based on the current states
or observations from the environment. For the proposed
problem, the reward function is designed to be the sum of
the following terms (a code sketch follows the list):

$$r_t = r_{ego} + r_{ray} + \sigma_5 + \sigma_6 + \sigma_7 \quad (2)$$
• For the ego vehicle, a positive reward is assigned that
reflects the progress towards the destination position.
A negative constant is added at each time step to
encourage the ego vehicle to make fast progress. The
initial position and destination of the ego vehicle are
fixed for each training iteration. Equation 3 represents
the reward term for the ego vehicle, in which $p_{des}$, $p_{ego}$,
and $p_{init}$ are the destination position, current position, and
initial position, respectively:

$$r_{ego} = \sigma_1 \frac{\lVert p_{des} - p_{ego} \rVert_2}{\lVert p_{des} - p_{init} \rVert_2} - \sigma_2 \quad (3)$$

where $\sigma_1$ and $\sigma_2$ are constants.
• To penalize the agent for getting too close to other traffic
participants and obstacles, a negative reward is imposed
if the ego vehicle interacts with other simulated vehicles
within its visibility. An interaction means that the
ego vehicle can see, through the simulated ray traces, an
approaching vehicle that has the right of way, while the
ego vehicle is blocking the future path of any approaching
vehicle (i.e., it has already passed the lower boundary of
the intersection):

$$r_{ray} = -\sum_i \frac{\sigma_3}{l_i} \, \mathbb{1}\!\left( \frac{l_i}{v_i} \le \sigma_4 \right) \quad (4)$$

in which the indicator $\mathbb{1}(\cdot)$ equals 1 if its condition holds and 0 otherwise.
• A constant penalty $\sigma_5$ is imposed in case of a crash
between the ego vehicle and any simulated vehicle:

$$\sigma_5 = -10000 \quad \text{if } \lVert p^i_{sim} - p_{ego} \rVert_2 = 0, \quad (5)$$

where $p^i_{sim}$ is the position of the $i$-th simulated vehicle
within the visibility of the ego vehicle.

Fig. 5: Two initial scenarios for the experiments.

• A constant penalty $\sigma_6$ is added if the task cannot be
finished within 1000 steps, which is 100 seconds in the
real world.

• Moreover, a positive reward $\sigma_7$ is added if the ego
vehicle reaches the goal:

$$\sigma_7 = 1000 \quad \text{if } \lVert p_{des} - p_{ego} \rVert_2 = 0. \quad (6)$$
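Putting the terms together, a sketch of the per-step reward might look as follows; the paper fixes $\sigma_5 = -10000$ and $\sigma_7 = 1000$, so the remaining constants here are placeholders of our own.

```python
import numpy as np

SIG1, SIG2, SIG3, SIG4, SIG6 = 100.0, 0.5, 1.0, 2.0, -500.0  # placeholder values

def step_reward(p_des, p_ego, p_init, rays, crashed, timed_out, reached_goal):
    """Sum of the reward terms in Eq. (2); `rays` is a list of (l_i, v_i) pairs
    and the position arguments are numpy arrays."""
    r_ego = SIG1 * (np.linalg.norm(p_des - p_ego) /
                    np.linalg.norm(p_des - p_init)) - SIG2      # Eq. (3)
    r_ray = -sum(SIG3 / l for l, v in rays
                 if v > 0 and l / v <= SIG4)                    # Eq. (4)
    r = r_ego + r_ray
    if crashed:
        r += -10000.0          # sigma_5, Eq. (5)
    if timed_out:
        r += SIG6              # sigma_6, timeout after 1000 steps
    if reached_goal:
        r += 1000.0            # sigma_7, Eq. (6)
    return r
```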
V. EXPERIMENTS
We tested our algorithm in the SUMO simulator [22]
with two scenarios which are shown in Figure 5. In the
two scenarios, the initial positions of the ego vehicle, the
position of the stop-line, the position of the mid-point of the
intersection and the position of the destination are all the
same. However, we change the position of the lower-bound
of the intersection, which may result in less information from
the ego vehicle’s ray traces, especially at the beginning of
each training epoch. This kind of situation is closer to real-
world scenarios, in which the average distance between the
lower bound of the intersection and the stop-line is four to
six meters. As a result, most vehicles don’t roll all the way
up to the edge of the intersection. The farther the stop-line
is from the lower bound of the intersection, the more
difficult it is to make a proper decision on when to traverse,
because the geometry of the intersection blocks visibility.
During the traversal process, the vehicle has to
Go Straight, Turn Right, or Turn Left successfully while
limiting the required interactions between the ego vehicle
and other simulated vehicles. To evaluate the effectiveness
of the proposed algorithms, we tested the following four
methods: 1) time-to-collision (TTC) [3]; 2) DDPG [9]; 3)
our proposed POMDP-with-LSTM algorithm; and 4) the HOMDP
with vanilla DRL for continuous actions that we also
propose in this paper. For the evaluation, we applied the resulting
policies for 1000 iterations on the Go Straight, Turn Right,
and Turn Left tasks separately. We compared the different
algorithms using the following metrics (a small evaluation
sketch follows the list):
• Percentage of Success: the percentage of epochs in
which the ego vehicle successfully reaches the destination.

• Percentage of Collision: the percentage of epochs in
which the ego vehicle collides with any other simulated
vehicle before reaching the destination.

• Percentage of Unfinished cases: the percentage of
epochs in which the ego vehicle does not reach the
goal within 1000 simulation steps, which equals 100
seconds in the real world.

TABLE I: Results for Scenarios One and Two with the TTC and DDPG methods

                             Scenario One        Scenario Two
Task          Metric         TTC      DDPG      TTC      DDPG
Go Straight   % Success      95.6     97.8      95.3     96.2
              % Collision    2.6      2.2       4.7      3.1
              % Unfinished   1.8      0.0       0.0      0.7
              Total Steps    385      136       70       195
              % Interaction  44.82    24.49     18.87    30.02
              Reward         673      735       552      587
Turn Right    % Success      98.6     100.0     98.2     97.6
              % Collision    1.4      0.0       1.8      2.4
              % Unfinished   0.0      0.0       0.0      0.0
              Total Steps    117      106       66       63
              % Interaction  17.05    10.14     13.97    11.4
              Reward         743      792       665      721
Turn Left     % Success      86.6     97.6      91.42    92.4
              % Collision    1.0      2.0       8.58     5.4
              % Unfinished   12.38    0.4       0.0      2.2
              Total Steps    514      196       108      289
              % Interaction  35.47    33.26     39.57    37.31
              Reward         -821     217       -437     -223
• Total steps $step_{total}$: the average number of total steps
the ego vehicle takes to finish each test iteration, including
all successful and failed cases. Fewer steps means
less time is taken to finish the whole process.

• Percentage of Interaction $step_{inter} / step_{total}$: the average
percentage of steps in each test iteration in which the ego
vehicle has an interaction with one of the other traffic
participants. An interaction is defined as an unscheduled
behavior change of a vehicle in response to the ego
vehicle's behavior. Other traffic participants can react to
the ego vehicle if the ego vehicle has passed the lower
bound of the intersection and is within the visibility
range of the ego vehicle. More interactions means that the ego
vehicle relies more on other vehicles' decelerations to
avoid collision.

• Total Reward $\sum_{t=0}^{T} r_t$: the reward the agent accumulates in
each test iteration.
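For reference, the metrics above could be computed from per-episode logs roughly as follows; the log format is our own assumption.

```python
def summarize(episodes):
    """episodes: list of dicts with boolean `success`/`collision`, integer
    `steps`, integer `interaction_steps`, and float `reward` per test run."""
    n = len(episodes)
    pct = lambda key: 100.0 * sum(e[key] for e in episodes) / n
    return {
        "% Success": pct("success"),
        "% Collision": pct("collision"),
        "% Unfinished": 100.0 - pct("success") - pct("collision"),
        "Total Steps": sum(e["steps"] for e in episodes) / n,
        "% Interaction": 100.0 * sum(e["interaction_steps"] / e["steps"]
                                     for e in episodes) / n,
        "Reward": sum(e["reward"] for e in episodes) / n,
    }
```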
VI. RESULTS
First, we compare the results for our two baselines (i.e.,
TTC and the vanilla DDPG) for both scenarios (Figure 5).
The results are shown in Table I. In Scenario One, DDPG
outperforms TTC in terms of success rate, total steps
to finish the traversal, and interaction rate, which is in
alignment with other reported results [7] that use bird's-eye view
data instead of ray trace data. However, for Scenario Two,
where there are more unobservable states initially and during
the training process, DDPG and TTC vary as to which has
better performance. Although DDPG has a higher success
rate for the Go Straight and Turn Left tasks, it also has
a higher interaction rate, which means the lower collision
rate is due at least in part to the other vehicles' adjustments
rather than the ego vehicle's skill.
Next, we compared POMDP with LSTM and HOMDP in
Scenario Two alone; the results are presented in Table II. The
performance of the final policies for POMDP and HOMDP
can be seen in a video (https://youtu.be/Rj5bYtKshh8).

TABLE II: Results for Scenario Two with the four methods together

Task          Metric         TTC      DDPG     POMDP    HOMDP
Go Straight   % Success      95.3     96.2     97.1     98.3
              % Collision    4.7      3.1      1.7      1.7
              % Unfinished   0.0      0.7      1.2      0.0
              Total Steps    70       195      261      102
              % Interaction  18.87    30.02    27.86    26.41
              Reward         552      587      621      873
Turn Right    % Success      98.2     97.6     99.5     99.8
              % Collision    1.8      2.4      0.5      0.2
              % Unfinished   0.0      0.0      0.0      0.0
              Total Steps    66       63       57       62
              % Interaction  13.97    11.4     6.26     9.28
              Reward         665      721      892      903
Turn Left     % Success      91.4     92.4     95.6     97.3
              % Collision    8.6      5.4      2.4      2.6
              % Unfinished   0.0      2.2      2.0      0.1
              Total Steps    108      289      276      132
              % Interaction  39.57    37.31    33.21    28.90
              Reward         -437     -223     213      632
In general, TTC
causes a higher collision rate and may result in more
interactions, which means the method relies on the other
simulated vehicles yielding to the ego vehicle to
avoid collision, which does not match real-world behavior.
The POMDP result is more conservative,
which causes more unfinished failure cases, i.e., epochs
in which the ego vehicle cannot finish the traversal task
within 1000 simulation steps. The HOMDP method, our
main contribution, achieves the highest total reward
and success rate on all of the Go Straight, Turn Right, and Turn
Left tasks. For the Go Straight and Turn Left tasks, the interaction
rate is lowest with HOMDP, which means the results rely least
on other vehicles' yielding. HOMDP takes more total steps
than some other methods (TTC for the Go Straight and Turn Left tasks,
POMDP for the Turn Right task), which means HOMDP is a more
conservative strategy that values the success rate most.
VII. CONCLUSION AND FUTURE WORK

This paper proposes two methods to deal with the problem
of an autonomous vehicle traversing a partially uncontrolled
urban stop-sign intersection. The first approach is based
on the ideas of DDPG and ADRQN, which use previous
steps' observations to learn an optimal policy with respect
to the current observations by using an LSTM layer in the
DRL method. The second method is based on the idea of
hierarchical MDPs, which learns a discrete option in the
high-level learning process and continuous low-level actions
simultaneously. The results show that HOMDP provides
better performance in terms of both success rate and interaction rate.
In future work, we want to apply our model to real
traffic data instead of the simulated traffic model in SUMO.
As a starting point, we will apply the model to the Peachtree
dataset in NGSIM [24].
REFERENCES
[1] G. E. Monahan, “State of the art: A survey of partially observable
Markov decision processes: Theory, models, and algorithms,” Manage-
ment Science, vol. 28, no. 1, pp. 1–16, 1982.
[2] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.
MIT press Cambridge, 1998, vol. 1, no. 1.
[3] D. N. Lee, “A theory of visual control of braking based on information
about time-to-collision,” Perception, vol. 5, no. 4, pp. 437–459, 1976.
[4] C. Dong, J. M. Dolan, and B. Litkouhi, “Intention estimation for
ramp merging control in autonomous driving,” in Intelligent Vehicles
Symposium (IV), 2017 IEEE. IEEE, 2017, pp. 1584–1589.
[5] C. Dong, Y. Zhang, and J. M. Dolan, “Lane-change social behavior
generator for autonomous driving car by non-parametric regression in
reproducing kernel Hilbert space,” in Intelligent Robots and Systems
(IROS), 2017 IEEE/RSJ International Conference on. IEEE, 2017,
pp. 4489–4494.
[6] W. Song, G. Xiong, and H. Chen, “Intention-aware autonomous driv-
ing decision-making in an uncontrolled intersection,” Mathematical
Problems in Engineering, vol. 2016, 2016.
[7] D. Isele, A. Cosgun, K. Subramanian, and K. Fujimura, “Navigating
intersections with autonomous vehicles using deep reinforcement
learning,” arXiv preprint arXiv:1705.01196, 2017.
[8] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou,
D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement
learning,” arXiv preprint arXiv:1312.5602, 2013.
[9] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa,
D. Silver, and D. Wierstra, “Continuous control with deep reinforce-
ment learning,” arXiv preprint arXiv:1509.02971, 2015.
[10] M. Egorov, “Deep reinforcement learning with POMDPs,” 2015.
[11] M. Hausknecht and P. Stone, “Deep recurrent Q-learning for partially
observable MDPs,” CoRR, abs/1507.06527, 2015.
[12] J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson, “Learning
to communicate to solve riddles with deep distributed recurrent Q-
networks,” arXiv preprint arXiv:1602.02672, 2016.
[13] P. Zhu, X. Li, P. Poupart, and G. Miao, “On improving deep reinforce-
ment learning for POMDPs,” arXiv preprint arXiv:1804.06309, 2018.
[14] S. P. Singh, “Transfer of learning by composing solutions of elemental
sequential tasks,” Machine Learning, vol. 8, no. 3-4, pp. 323–339,
1992.
[15] T. G. Dietterich, “The MAXQ method for hierarchical reinforcement
learning,” in ICML, vol. 98. Citeseer, 1998, pp. 118–126.
[16] N. K. Jong and P. Stone, “Hierarchical model-based reinforcement
learning: R-MAX + MAXQ,” in Proceedings of the 25th International
Conference on Machine Learning. ACM, 2008, pp. 432–439.
[17] R. I. Brafman and M. Tennenholtz, “R-MAX: A general polynomial
time algorithm for near-optimal reinforcement learning,” Journal of
Machine Learning Research, vol. 3, no. Oct, pp. 213–231, 2002.
[18] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum, “Hier-
archical deep reinforcement learning: Integrating temporal abstraction
and intrinsic motivation,” in Advances in neural information process-
ing systems, 2016, pp. 3675–3683.
[19] D. Steckelmacher, D. M. Roijers, A. Harutyunyan, P. Vrancx,
and A. Nowé, “Reinforcement learning in POMDPs with memory-
less options and option-observation initiation sets,” arXiv preprint
arXiv:1708.06551, 2017.
[20] P. Poupart and C. Boutilier, “Bounded finite state controllers,” in
Advances in neural information processing systems, 2004, pp. 823–
830.
[21] M. Hauskrecht, N. Meuleau, L. P. Kaelbling, T. Dean, and C. Boutilier,
“Hierarchical solution of Markov decision processes using macro-
actions,” in Proceedings of the Fourteenth conference on Uncertainty
in artificial intelligence. Morgan Kaufmann Publishers Inc., 1998,
pp. 220–229.
[22] D. Krajzewicz, J. Erdmann, M. Behrisch, and L. Bieker, “Recent
development and applications of SUMO - Simulation of Urban MObility,”
International Journal On Advances in Systems and Measurements,
vol. 5, no. 3&4, 2012.
[23] J. Song, Y. Wu, Z. Xu, and X. Lin, “Research on car-following model
based on SUMO,” in Advanced Infocomm Technology (ICAIT), 2014
IEEE 7th International Conference on. IEEE, 2014, pp. 47–55.
[24] “NGSIM homepage. FHWA.” 2005-2006. [Online]. Available:
http://ngsim.fhwa.dot.gov.