Conference PaperPDF Available

Learn collision-free self-driving skills at urban intersections with model-based reinforcement learning

September 2021

September 2021

DOI:10.1109/ITSC48978.2021.9564880

Conference: 2021 IEEE International Intelligent Transportation Systems Conference (ITSC)

Authors:

Yang Guan

Tsinghua University

Ma Haitong

Harvard University

Shengbo Eben Li

Tsinghua University

Show all 7 authorsHide

Intersection and candidate paths generation

…

Visualization of control under different tasks and scenarios

…

Actions variation in different noise level

…

Figures - uploaded by Shengbo Eben Li

Content may be subject to copyright.

Content uploaded by Shengbo Eben Li

Content may be subject to copyright.

Learn collision-free self-driving skills at urban intersections with

model-based reinforcement learning

Yang Guan#1, Yangang Ren#1, Haitong Ma1, Shengbo Eben Li∗1, Qi Sun1, Yifan Dai2, Bo Cheng1

Abstract— Intersection is one of the most complex and

accident-prone urban trafﬁc scenarios for autonomous driving

wherein making safe and computationally efﬁcient decisions

with high-density trafﬁc ﬂow is usually non-trivial. Current

rule-based methods decompose the decision-making task into

several serial sub-modules, resulting in long computation time

at complex scenarios for on-board computing devices. In this

paper, we formulate the decision-making and control problem

under intersections as a process of optimal path selection

and tracking, where the former selects a path with the best

safety measure from a set generated only considering static

information, while the latter then considers dynamic obstacles

and solve a tracking problem with safety constraints using the

chosen path. To avoid the heavy computation introduced by

that, we develop a reinforcement learning algorithm called

generalized exterior point (GEP) to ﬁnd a neural network

(NN) solution ofﬂine. It ﬁrst constructs a multi-task problem

involving all the candidate paths and transforms it into an

unconstrained problem with a penalty on safety violations.

Afterward, the approximate feasible optimal control policy

is obtained by alternatively performing gradient descent and

enlarging the penalty. As an exterior point type method, GEP

permits control policy to violate inequality constraints during

the iterations. To verify the effectiveness of our method, we

carried out experiments both in simulation and in a real road

test. Results demonstrate that the learned policy can realize

collision-free driving under different trafﬁc conditions while

reducing the computation time by a large margin.

Index Terms— Intersection, decision-making, reinforcement

learning, penalty function

I. INTRODUCTION

Autonomous driving has great potential to beneﬁt trafﬁc

safety and efﬁciency as well as change the pattern of fu-

ture transportation. Among multifarious urban scenarios in

autonomous driving, intersection has been regarded as the

most challenging one due to the highly dynamic, stochastic

and sophisticated nature of the trafﬁc environment [1]. At

intersections, the complex conﬂicting relationship between

vehicles results in complicated vehicle operations to avoid

collision and maintain high computing efﬁciency. Statistics

have found that intersection accidents account for 30–60% of

severe trafﬁc injuries in Europe. Therefore, developing safe

This work is supported by the National Key Research and Development

Plan of China under Grant 2017YFB0102603. It is supported in part by

International Science & Technology Cooperation Program of China under

2019YFE0100200, NSF China with 51575293, and U20A20334. It is also

partially supported by Tsinghua University-Toyota Joint Research Center

for AI Technology of Automated Vehicle.

#The ﬁrst two authors contributed equally to this study. 1School

of Vehicle and Mobility, Tsinghua University, Beijing, 100084, China.

2Suzhou Automotive Research Institute, Tsinghua University, Suzhou,

215200, China. All correspondence should be sent to S. E. Li with email:

lisb04@gmail.com.

and computationally efﬁcient decision-making strategies at

intersections becomes vital for popularization of autonomous

driving.

Currently, most of decision-making algorithms at inter-

sections can be categorized into two major paradigms: rule-

based method and learning-based method [2]. The former

usually decomposes decision-making module into a series of

sub-modules such as prediction, behavior selection and tra-

jectory planning, whereas the latter learns a policy mapping

from output of perception to control signal directly using

advanced learning techniques [3], [4]. Under the rule-based

scheme, prediction module aims to predict the trajectory of

trafﬁc participants to construct the feasible region. Recently,

recurrent neural network makes an increasingly important

role in trafﬁc prediction because of its high representation

and approximation capability for long time series [5]. Be-

havior selection module decides on a local driving task that

drives the car towards the destination and abides by rules

of the road. Traditionally, ﬁnite state machine is widely

adopted to transfer between priorly deﬁned driving behaviors

in terms of current vehicle situation [6]. Trajectory planning

module selects a continuous path through the environment

to accomplish a local navigational task [7]–[10]. Commonly,

this process is time-consuming to search one feasible and

safe trajectory, which would be exacerbated by inadequate

computing capability of industrial computers. The state-of-

the-art result of average planning time is 31ms using a

directed acyclic graph search and improved A* algorithm

[11]. However, the time tends to increase largely with the

growing number of surrounding vehicles. To date, rule-

based framework has been widely attempted to handle

some typical urban scenarios like signalized junctions and

pedestrian crossings. However, much human design and high

computation complexity restrict its scalability to the complex

scenarios with large-scale vehicles.

Learning-based method aims to train one driving pol-

icy usually carried by a deep neural network which maps

from perception information to control commands, thus this

is also referred as “end-to-end” scheme [12]. Inspired by

recent success of reinforcement learning (RL) on games

and robotics [13], [14], some researchers have attempted to

utilize RL paradigms to conquer decision-making at com-

plicated intersections. The advantage lies in its self-learning

ability and high online computation efﬁciency due to the

fast propagation of neural networks (NN). As a pioneering

work for decision-making at intersections, Intel laboratory

introduced the asynchronous advantage actor-critic (A3C)

algorithm [15] to develop driving policies based on urban

2021 IEEE Intelligent Transportation Systems Conference (ITSC)

Indianapolis, USA. September 19-21, 2021

Authorized licensed use limited to: Tsinghua University. Downloaded on February 16,2022 at 02:02:37 UTC from IEEE Xplore. Restrictions apply.

road simulator in the presence of other vehicles and pedes-

trians [16]. They concluded that RL-enabled methods was

not enough successful at avoiding collisions with cars and

static objects compared with decomposed and supervised

methods. Guan et al. (2020) developed a decision-making

framework for cooperation of eight connected automated

vehicles at unsignalized intersection [17]. They adopted

Proximal Policy Optimization (PPO) algorithm and obtained

the ﬁnal converged policy after days of training, which is

quite slow to obtain a fair performance. Li et al. (2020)

established an end-to-end framework using convolutional

neural networks to map relationship between trafﬁc images

and vehicle operations. Relative velocity between vehicles

was calculated by trafﬁc images collected at two consecutive

time steps and the famous deep Q-network algorithm [18]

was utilized to obtain the optimal driving policy regard-

ing safety and efﬁciency [19]. However, only two vehicles

from different directions are considered. A shortcoming of

aforementioned works is that the intersection trafﬁc scenarios

they studied are rather simple with only a few surrounding

vehicles, which suffers a great discrepancy with the real

intersection condition. Besides, although current algorithms

have been developed and work well on sorts of simulators

like CARLA [16] and SUMO [20], the driving policy is in

fact trained without safety guarantee, making it impractical

to be applied on real world intersections.

In this paper, we propose an integrated decision and

control (IDC) framework for automated vehicles at com-

plex intersections, which combines prior knowledge and

RL technique to make safe and computationally efﬁcient

decisions. Besides, as an important support for IDC, we

develop a model-based RL algorithm called generalized

exterior point (GEP) method to obtain NN solutions of large-

scale constrained optimal control problems (OCP). The main

contributions of the paper can be summarised as follows:

(1) Present the IDC framework for automated vehicles at

intersections. The IDC framework decomposes the driving

task into optimal path selection and tracking, where the

former chooses a path with the best safety measure from a

path set generated in advance, while the latter then solves a

tracking problem with safety constraints to realize collision-

free driving. To mitigate the online computation burden,

the path set is formed only considering static information,

and GEP is developed to solve a control policy of the

tracking problem ofﬂine. The framework enhances the com-

puting efﬁciency by removing intensive optimizations using

RL technique and exhibits complex driving behaviors by

selecting and tracking among different paths. Moreover, it

owns great adaptability and is promising to be deployed

on different kinds of scenarios with a variety of trafﬁc

participants.

(2) Propose a model-based RL algorithm called GEP

to approximately solve a NN control policy for the con-

strained tracking problems to support the real-time online

application. The GEP is in fact an extension of the exterior

point method in optimization domain to the ﬁeld of NN,

for the propose of obtaining NN solutions of large-scale

constrained OCPs. It ﬁrst constructs a multi-task problem

involving all the candidate paths and transforms it into an

unconstrained problem with a penalty on safety violations.

Afterwards, the approximate feasible optimal control policy

is obtained by alternatively performing gradient descent and

enlarging the penalty. To the best of our knowledge, GEP

is the ﬁrst approximate solver for constrained OCPs that are

parameterized by NNs.

II. INTEGRATED DECISION AND CONTROL FRAMEWORK

Here we demonstrate the integrated decision and control

framework at intersections. As shown in Fig.1, the frame-

work consists of four modules: path generating, training,

path selecting and optimal tracking, of which the ﬁrst two

mainly involve ofﬂine training and another two are related to

the online application. Note that the training module design

is the core segment under this framework, which is prone

to substitute all the expensive online optimizations with an

ofﬂine trained policy network based on RL technique.

The path generating module will produce multiple candi-

date paths only considering static information such as road

structure, speed limit and trafﬁc lights. Note that the intention

is to generate one path set rather than ﬁnd the optimal

path, by which the planning time will be reduced largely.

Then the training module will further considers these paths

and the dynamic information such as surrounding vehicles,

pedestrians and bicycles. Formally, an constrained OCP is

constructed concerning each path, where the objective func-

tion is to minimize the tracking error within a ﬁnite horizon

and the constraints characterize safety requirements. Instead

of solving these OCPs in sequence, we develop a model-

based RL algorithm to solve a control policy parameterized

by NN which is capable of tracking different shape of paths

while maintaining high ﬂexibility to avoid collision with

surrounding vehicles. Subsequently, in the online application,

the ego vehicle choose a relatively better path which is prone

to reduce potential collision and passing time using simple

rules. For this given path, the trained policy will quickly

calculate the driving actions by the feed-forward propagation

of NN, leading the ego vehicle to the next location.

Compared to traditional methods, the framework owns

several advantages to be applied in intersections with large-

scale trafﬁc ﬂow. Above all, it has high online computing ef-

ﬁciency beneﬁting from two improvements. One is that path

planning module can embed key parameters of multiple paths

priorly into the electronic map and read directly to generate

candidate paths. The other is that tracking module utilizes

the trained policy network to compute control commands,

which also enjoys high efﬁciency due to the fast propagation

of neural networks. Second, our algorithm is ﬂexible enough

to deal with the dynamic and complexity at intersections.

Generally, the more complex the intersection, the more

candidate paths there are. Fortunately, the design of our

algorithm guarantees the trained policy can handle various

paths at complex intersections. Besides, the formulation of

constrained optimal problem is applicable for different kinds

3463

Authorized licensed use limited to: Tsinghua University. Downloaded on February 16,2022 at 02:02:37 UTC from IEEE Xplore. Restrictions apply.

Ref 1

Ref 2

Ref 3

Model

SelectingSelecting

②

③

Offline training

Online application

①

Ref 1

Ref 2

Ref 3

Ref∗

①

②

③

Start

End

②

Training

Constraint

𝜋 >𝜋 𝜋 >𝜋

Policy

Safe Space

Whole space

Safety shield

𝑢∗

𝑢safe

④

𝑢∗

①

②

③

Start

End

Path

generating

Ref∗

Policy

Safe Space

Whole space

Safety shield

𝑢∗

𝑢safe

𝑢∗

Tracking

Fig. 1: Integrated decision and control framework under

intersections

of dynamic trafﬁc participants including red lights, bicycles

and pedestrians.

III. METHODOLOGY

In this section, we mainly focus on the details of the

training module, the central task, including the formulation

of the path tracking problem and the proposed algorithm

GEP. Besides, the details of the online application will also

be involved.

A. Problem formulation

Provided one possible path τ∈Πwhere Πis the candidate

path set, the constrained OCP can be constructed as (1),

aiming to optimize tracking performance while satisfying the

safety constraints.

min

T−1

i=0

(xref

i|t−xi|t)>Q(xref

i|t−xi|t) + u>

i|tRui|t

s.t. xi+1|t=Fego(xi|t, ui|t),

i+1|t=Fpred(xj

i|t),

(xi|t−xj

i|t)>M(xi|t−xj

i|t)≥Dsafe

veh ,

(xi|t−xroad

i|t)>M(xi|t−xroad

i|t)≥Dsafe

road,

xi|t≤Lstop,if light = red

x0|t=xt, xj

0|t=xj

t, u0|t=ut

i= 0 : T−1, j ∈I

(1)

where Tis the prediction horizon and iis the virtual time

step starting from the current time step t,xi|tand xref

i|trespec-

tively represent the ego vehicle state and its corresponding

reference state at predictive time step i, whose deviation is

usually weighted by a positive-deﬁnite matrices Q.ui|tis

control action to be solved and Ris the weighting matrix. xj

i|t

is the state of the j-th vehicle in the interested vehicle set I.

Fego represents the dynamic model of ego vehicle and Fpred

is the kinematics model of surrounding vehicles. In addition,

Dsafe

veh and Dsafe

road deﬁne the safe distance considering other

vehicles and the road edge, and Mis the distance weighted

matrix. Lstop indicates the position of stop line and deﬁnes

the constraints regarding red lights. Note that in (1) the

virtual states can be generated by the dynamics and the

kinematics model except that x0|tand xj

i|tare assigned with

the current real state xtand xj

As for the dynamics of ego vehicle, here we adopt the

classic 2-DOF bicycle model widely adopted in vehicle

control, which owns a 6-dimensional state vector and a 2-

dimensional action

xi|t=pxpyvlon vlat φ ωT

i|t, ui|t=δ aT

i|t,

where px, pyare the position coordinates of the ego vehicle

center of gravity (CG), vlon, vlat are the longitudinal and

lateral velocities, φis the heading angle, ωis the yaw rate,

δand aare the front wheel angle and the acceleration

commands, respectively. And the state space equation is

Fego =







px+ ∆t(vlon cos φ−vlat sin φ)

py+ ∆t(vlon sin φ+vlat cos φ)

vlon + ∆t(a+vlatω)

mvlonvlat +∆t[(Lfkf−Lrkr)ω−kfδvlon−mv2

lonω]

mvlon−∆t(kf+kr)

φ+ ∆tω

−Izωvlon−∆t[(Lfkf−Lrkr)vy−Lfkfδ vlon]

∆t(L2

fkf+L2

rkr)−Izvlon







(2)

Specially, throughout this paper, the key parameters of ego

vehicle are well-designed in accordance with the physical

vehicle we use in real car test, as shown in TABLE I.

Considering the surrounding vehicle only involves safety

TABLE I: Parameters for Fego

Parameter Meaning Value

kfFront wheel cornering stiffness -155495 [N/rad]

krRear wheel cornering stiffness -155495 [N/rad]

LfDistance from CG to front axle 1.19 [m]

LrDistance from CG to rear axle 1.46 [m]

mMass 1520 [kg]

IzPolar moment of inertia at CG 2642 [kg·m2]

∆tDiscrete time 0.1 [s]

requirements, we simply construct a kinematics model to

prediction its motion

Fpred =







x+ ∆t(vj

lon cos φj−vj

lat sin φj)

y+ ∆t(vj

lon sin φj+vj

lat cos φj)

lon

φj+ ∆tωj

pred







(3)

in which ωj

pred depends on the route and position of the j-th

vehicle. Besides, the states of surrounding vehicles, reference

3464

Authorized licensed use limited to: Tsinghua University. Downloaded on February 16,2022 at 02:02:37 UTC from IEEE Xplore. Restrictions apply.

path and road are represented with a deduced state vector

respectively:

x†

i|t=hp†

xp†

yv†

lon 0φ†0iT

i|t.

where †∈{j, ref,road}.

B. Generalized exterior point method (GEP)

The OCP (1) can be solved by traditional online optimiza-

tion methods like model predictive control (MPC), but this

would lead to computational intractability for the on-board

device especially in the presence of large-scale constraints.

Therefore, we propose a model-based RL algorithm, which

is utilized to ﬁnd a neural network solution for the con-

strained OCP in ofﬂine with the model guidance, facilitating

the online computing efﬁciency and overall performance.

However, the problem (1) has limited generalization because

it is deﬁned on a particular path and seeks to ﬁnd a single

control under a single state.

So, ﬁrstly and signiﬁcantly, our algorithm convert the

original problem (1) to the following one:

min

Eτ,xt,xj

tT−1

i=0

(xref

i|t−xi|t)>Q(xref

i|t−xi|t)+

π>

θ(si|t)Rπθ(si|t)

s.t. xi+1|t=Fego(xi|t, πθ(si|t)),

i+1|t=Fpred(xj

i|t),

g(si|t) = 









(xi|t−xj

i|t)>M(xi|t−xj

i|t)≥Dsafe

veh ,

(xi|t−xroad

i|t)>M(xi|t−xroad

i|t)≥Dsafe

road,

−xi|t≥ −Lstop,if light = red,

si|t= [xref

i|t, xi|t, xj

i|t]T,

i=0:T−1, j ∈I

(4)

The ﬁrst distinction between (1) and (4) is the variable to

be optimized is no longer a single control quantity but the

parameters of a policy πθdeﬁned as a continuous function

mapping from state space to action space, i.e. ut=πθ(st),

where sis a notation of the input of the policy and is

designed to contain all necessary information to determine

driving actions, including that of the reference path, the ego

vehicle and surrounding vehicles, as shown in (4). With the

policy we can attain the optimal control actions for every

state. Second, the objective function is no longer about a

single state s, that is, a particular pack of ego state, vehicle

states and reference path, but about a joint distribution of

these components. Thus it brings another great advantage that

one policy is enough to handle tracking tasks for different

paths, avoiding training a policy for each path. That is why

we also incorporate the path information as a part of state s.

During training, the reference path will be sampled uniformly

from the path set to generate a series of states such that our

policy can experience different paths information.

For the large-scale constrained problem (4), we sub-

sequently propose GEP, which is adapted from the one

introduced in optimization ﬁeld, wherein the constrained

problem is converted into an unconstrained one. For clarity,

we use l(si|t, πθ(si|t)) to denote one-step cost in (4) and

use g(si|t)≥0as the uniﬁed denotation of constraints in

(4). Then the unconstrained problem can be derived as

min

θJp=Jactor +ρJpenalty

=Es0|tT−1

i=0

l(si|t, πθ(si|t))+ρEs0|tT−1

i=0

ϕi(θ)

(5)

where ϕi(θ) = [max{0,−g(si|t)}]2is the penalty function

and ρis the penalty factor. The illustration of GEP for

constrained optimal control is shown in Fig. 2. The opti-

mization process mainly involves alternative update of policy

parameters and the penalty factor. The former can be updated

by gradient descent methods while the latter is designed

as a given monotonic increasing sequence. Intuitively, this

factor aims to penalize the violation of constraint, which will

become prohibitive with a fair large penalty factor and push

the policy to the feasible region progressively. This algorithm

is shown in the ofﬂine training part of Algorithm 1.

Policy evaluation

critic =



2

Policy improvement

Model rollout

=+1





Cumulative penalty

Cumulative cost

 

1

=0

+1 =(,) +1 =(,)

Value fitting

target



 

1

=0

=() =()

k%1000=0

actor = 

1

=0 

penalty = 

1

=0 

+1

 +(actor +penalty )

=

+1





=+1

Generalized penalty function method



=+1

Policy optimization

Model rollout

=+1





Cumulative penalty

Cumulative cost

 

1

=0

+1 =(,) +1 =(,)

 

1

=0

=() =()

k%m=0

actor = 

1

=0 

penalty = 

1

=0 

+1

 +(actor +penalty )

=

Generalized penalty function method



=+1

Infeasible region

|< 0

Feasible region

|  0

Penalty

 <  <   









Update 

Update 

Fig. 2: Diagram of the generalized exterior point method

C. Online application

Now that we have obtained the optimal control policy after

ofﬂine training, the next step is to deploy it online. In each

time step we ﬁrst choose the optimal path from the candidate

set Πin terms of a rule-based safety indicator shown as

below,

V(τ) = X

j∈I

d(τ, xj)

where d(τ, xj)means distance of the j-th surrounding ve-

hicle to the path τ. Then the optimal path is the one with

3465

Authorized licensed use limited to: Tsinghua University. Downloaded on February 16,2022 at 02:02:37 UTC from IEEE Xplore. Restrictions apply.

maximal V, i.e.,

τ∗= arg max

τi

{V(τi)|τi∈Π}.(6)

After that, we establish the state stusing the optimal path

τ∗and the trained policy πθwill tell corresponding driving

actions u∗

t. Moreover, we further consider the application

safety issue caused by approximation technique. Actually, the

trained policy has no guarantee on the safety in a given point

of the state space for the reason that we rely on a sample-

based method to approximately solve this problem and the

whole state space cannot be swept during training. To tackle

this, a safety shield is further added before employing the u∗

to vehicle, which projected the unsafe action to the nearest

safe one usafe

usafe

t=(u∗

t,if u∗

t∈ Usafe(st)

arg min

u∈Usafe(st)||u−u∗

t||2,else (7)

where Usafe(st) = {ut|g(s1|t)≥0}guarantees the next state

s1|tis safe. In practical, s1|tis replaced by model prediction

to judge whether an action is safe. Finally, the safety action

will be acted on the ego. The integrated decision and control

framework is shown in Algorithm 1.

Algorithm 1: IDC framework

Input: Path set Π, policy network πθwith random

parameters θ, buffer B ← ∅, learning rate βθ,

penalty factor ρ= 1, penalty ampliﬁer c

Ofﬂine training

for each iteration kdo

Randomly select a path τ∈Π, initialize ego

state xtand vehicle states xj

t, j ∈I;

for each environment step do

st← {τ, xt, xj

t, j ∈I};

B ∪ {st}, at=πθ(st);

Apply atto observe xt+1 and xj

t+1, j ∈I

Sample a batch of states from Band for each

state, predict Jpin (5);

if kmod mthen

ρ←cρ;

θ←θ+βθ∇θJp

Obtain the optimal policy π∗

Online application

for each environment step do

for each τi∈Πdo

V(τi) = Pj∈Id(τi, xj);

τ∗= arg max

τi

{V(τi)|τi∈Π};

st← {τ∗, xt, xj

t, j ∈I}, u∗

t=π∗

θ(st);

Calculate usafe

tby (7);

Apply usafe

tto observe xt+1 and xj

t+1, j ∈I

IV. SIMULATION VERIFICATION

Here we focus on a typical 4-direction intersection, shown

as Fig. 3, which is signalized and each direction has three

lanes. The controlled ego vehicle aims to travel through this

intersection safely and efﬁciently with a dense trafﬁc ﬂow

over 800 vehicles per hour on each lane of the entrance.

There are three tasks in total including left-turn, right-turn

and straight-go. The simulated videos are available online

(https://youtu.be/MEm2ZCi2iao).

Traffic density

Left: 800/h

Straight: 800/h

Right: 800/h

22.75

3.75

Fig. 3: Intersection and candidate paths generation

A. Algorithm Details and Training Results

The state sis constructed as a 41-d vector composed of

a 6-d ego state described in section III, a 4-d vector for

each of 8 vehicles who have potential conﬂicts with ego

vehicle and a 3-d vector for the reference path including the

tracking error of position, speed and heading. The weighting

matrices are Q=diag(0.04,0.04,0.01,0.01,0.1,0.02),R=

diag(0.1,0.005) and the predictive horizon Tis set to be 25

with a frequency of 10Hz. For the training parameters, the

policy network is a multi-layer perceptron with two hidden

layer, each with 256 units. The update interval mof penalty

factor is 10000 and the ampliﬁer is 1.1. Besides, the batch

size is chosen as 1024 and a decayed learning rate 3e−4→

1e−5is employed. Training curves are shown in Fig. 4, in

which both policy loss and penalty loss decrease consistently

for all the tasks, indicating an improving tracking and safety

performance. Specially, penalty loss decreases nearly to 0,

which shows the proposed algorithm GEP eventually ﬁnd

the policy to satisfy safety constraints approximately.

B. Performance of the GEP

To verify the control precision and computing efﬁciency

of our model-based solver on constrained optimal problem,

we conduct comparison with MPC. Here we adopt the Ipopt

solver [21], an open and popular package to solve nonlinear

optimization problem, to obtain the exact solution of the

constrained optimal control problem, which is then used

as the benchmark for our solver. Fig. 5 demonstrates the

comparison of our algorithm on control effect and compu-

tation time. Results show the output actions, steer wheel

and acceleration, have similar trends, which indicates our

proposed algorithm can approximate the optimal solution

with small error. Besides, there exists obvious difference in

computation time that our method can output the actions

3466

Authorized licensed use limited to: Tsinghua University. Downloaded on February 16,2022 at 02:02:37 UTC from IEEE Xplore. Restrictions apply.

(a) (b)

Fig. 4: Training process of model-based RL. (a) Tracking

performance. (b) Safety performance. The solid lines corre-

spond to mean value and the shaded regions correspond to

95% conﬁdence interval over 5 runs.

within 10ms while MPC will take 1000ms to perform that,

demonstrating the superiority of the proposed ofﬂine solver.

(a) (b) (c)

Fig. 5: Comparison with MPC. (a) Front wheel angle (b)

Acceleration (c) Computing time

C. Driving performance comparison

We compare our method with a rule-based method which

adopts A* algorithm to generate a feasible trajectory and

a PID controller to track it [11], as well as a model-free

RL method which uses a punish and reward system to

learn a policy for maximizing the long-term reward (If the

ego vehicle passes the intersection safely, a large positive

reward 100 is given, otherwise -100 is given wherever a

collision happens) [22]. We choose six indicators including

computing time, comfort, travel efﬁciency, collisions, failure

rate and driving compliance to evaluate the three algorithms.

Comfort is reﬂected by the mean root square of lateral and

longitudinal acceleration, i.e., Icomfort = 1.4q(a2

x) + a2

y.

Travel efﬁciency is evaluated by the average time used to

pass the intersection. Failure rate means the accumulated

times of that decision signal is generated for more than 1

seconds and driving compliance shows times of breaking red

light. The results of 100 times simulation are shown in Table

II. Beneﬁting from the framework design, the computational

efﬁciency of our method remains as fast as the model-

free approach, and the driving performance such as safety,

compliance and failure rate, is better than the other two

approaches.

TABLE II: Comparison of driving performance

IDC (Ours) Rule-based Model-free RL

Computing time [ms]

Upper-quantile 5.81 73.99 4.91

Standard deviation 0.60 36.59 0.65

Comfort index 1.84 1.41 3.21

Time to pass [s] 7.86(±3.52) 24.4(±16.48) 6.73(±3.32)

Collisions 0 0 31

Failure Rate 0 13 0

Decision Compliance 0 0 17

V. RE AL ROA D TES T

To validate the effectiveness in real world, our algorithm is

deployed on a real automated vehicle at a two-way two-lane

intersection. The system composition is shown as Fig. 6. The

test vehicle is a Chang-An CS55 SUV, in which an industrial

computer is equipped and served as the vehicle controller,

while the state of ego vehicle can be obtained from RTK

and CAN bus. Specially, we adopt the digital twin system

to generate different driving modes of surrounding vehicle,

which is a simulator developed by 51WORLD Co., Ltd. and

is able to synchronize with the real-world system in real

time. The experiment intersection is located at (36°1802000N,

120°4002500E) in Suzhou, China. The pipeline of the real road

test is as follows: ﬁrst, we construct the training environment

in accordance with the real road condition and train a policy

network by GEP. Then it is applied online. In each time step,

the ego vehicle state will be gathered from RTK and CAN,

and mapped to the digital twin system, where we collect

the state of surrounding vehicles. Then they both are sent

to the on-board computer, in which the trained policies is

embedded, to calculate the optimal and safe steering angle

and acceleration. Finally, these actions will be delivered

through the CAN bus to act on the automated vehicle.

CAN

RTK GPS

Digital tw in system

Vehicle & In tersection On-board computing

platform

Traffic

states

Ego states

Control

Ref∗

Policy

Safe Space

Whole space

Safety shield

𝑢∗

𝑢safe

𝑢∗

Fig. 6: Real road test system

A. Functionality veriﬁcation

Especially, we design 9 typical cases of surrounding

vehicles, 3 cases for each task. Fig. 7 shows the control effect

of trained policies for ego vehicle under different behavior

of surrounding vehicles. The steer and velocity show our

policy has learned reasonable collision avoidance behaviors

such as acceleration (2nd right-turn case), deceleration (2nd

left-turn case), pulling up (1st left-turn case) and turning

3467

Authorized licensed use limited to: Tsinghua University. Downloaded on February 16,2022 at 02:02:37 UTC from IEEE Xplore. Restrictions apply.

(a) lst left-turn case (b) 2nd left-turn case (c) 3rd left-turn case

(d) lst straight-go case (e) 2nd straight-go case (f) 3rd straight-go case

(g) lst right-turn case (h) 2nd right-turn case (i) 3rd right-turn case

Fig. 7: Visualization of control under different tasks and scenarios

(2nd straight-go case). All the videos are available online

(https://youtu.be/XQOewggafjc).

B. Robustness to noise

This experiment aims to test the robustness of the trained

policies under different levels of noises added manually.

To that end, we specially design 7 levels of noise, where

each item is Gaussian white noise with different variance,

as shown in Table III. Noises are implemented on different

dimension of states in the ﬁrst left-turn case and we observe

their effect on output actions, shown as in Fig. 8. We can

conclude that our method is rather robust to low level noises

(0-3) because the data distributions in those noises have

no signiﬁcant change. Although the variation of actions are

inevitably enlarged if we add more noise, the bounds of

featured values always remain in a reasonable range, proving

the robustness of our algorithm.

TABLE III: Noise level and the corresponding standard

deviation

Noise level 0 1 2 3 4 5 6

δp[m] 0 0.017 0.033 0.051 0.068 0.085 0.102

δφ[°] 0 0.017 0.033 0.051 0.068 0.085 0.102

x, pj

y[m] 0 0.05 0.10 0.15 0.20 0.25 0.30

lon [m/s] 0 0.05 0.10 0.15 0.20 0.25 0.30

φj[°] 0 1.4 2.8 4.2 5.6 7.0 8.4

(a) Acceleration (b) Steer

Fig. 8: Actions variation in different noise level

C. Robustness to human interference

The experiment is conducted to verify the ability of our

algorithm to cope with human disturbance. We also use the

ﬁrst left-turn case, during which we perform two times of

human intervention on the steer wheel, as shown in Fig. 9(a)

and the ego vehicle heading is plotted in Fig. 9(b). The ﬁrst

one is conducted on 6s when ego vehicle is still out of

intersection and the steer wheel is turned to left manually

to make it turn in advance. The second happens at 17s,

when the ego is turning to the left to pass the crossroad.

We turn the steering wheel right to interrupt the process. We

also draw vehicle heading and actions in Fig. 9, where two

colored regions are the process where human disturbance is

acted on. After the ﬁrst interfere, our policy turn the steering

3468

Authorized licensed use limited to: Tsinghua University. Downloaded on February 16,2022 at 02:02:37 UTC from IEEE Xplore. Restrictions apply.

wheel right immediately to correct the excessive ego heading.

About the second intervention, the driving system is able

to turn the steering wheel left immediately to continue to

complete the turn left operation. To sum up, the trained

policy is capable of handling human disturbance on the

steering wheel by quick responses after taking over.

(a) Disturbance position (b) Vehicle heading

Fig. 9: Robustness to human interference

VI. CONCLUSIONS

In this paper, we focus on the complex intersection sce-

nario and propose the integrated decision and control frame-

work to obtain safe and computationally efﬁcient driving per-

formance. Specially, a model-based reinforcement learning

algorithm is developed to approximately ﬁnd an optimal con-

trol policy ofﬂine and its online application can guarantee the

real-time requirements for on-board computer. Furthermore,

we apply our algorithm on a real world intersection to verify

the effectiveness. Results show that the trained policy can

output the control commands within 10ms and is persistence

to the disruption from environmental perception and human

interference. This study provides a promising approach to

utilize reinforcement learning to conquer the complex driving

tasks, wherein the safety can be assured strictly and high

computation efﬁciency is obtained. About the future work,

we intend to design a learning method to choose the optimal

path to replace the current rule-based version. Besides, we

will consider other trafﬁc participants like pedestrians and

bicycles with more complex behavior.

REFERENCES

[1] R. Tay, “A random parameters probit model of urban and rural

intersection crashes,” Accident Analysis & Prevention, vol. 84, pp.

38–40, 2015.

[2] J. Duan, S. E. Li, Y. Guan, Q. Sun, and B. Cheng, “Hierarchical rein-

forcement learning for self-driving decision-making without reliance

on labeled driving data,” IET Intelligent Transport Systems, 2019.

[3] B. Paden, M. ˇ

C´

ap, S. Z. Yong, D. Yershov, and E. Frazzoli, “A

survey of motion planning and control techniques for self-driving

urban vehicles,” IEEE Transactions on intelligent vehicles, vol. 1,

no. 1, pp. 33–55, 2016.

[4] J. Chen, B. Yuan, and M. Tomizuka, “Model-free deep reinforcement

learning for urban autonomous driving,” in 2019 IEEE Intelligent

Transportation Systems Conference (ITSC), 2019, pp. 2765–2771.

[5] L. Hou, L. Xin, S. E. Li, B. Cheng, and W. Wang, “Interactive

trajectory prediction of surrounding road users for autonomous driv-

ing using structural-lstm network,” IEEE Transactions on Intelligent

Transportation Systems, vol. 21, no. 11, pp. 4615–4625, 2020.

[6] R. Kala and K. Warwick, “Motion planning of autonomous vehicles

in a non-autonomous vehicle environment without speed lanes,” En-

gineering Applications of Artiﬁcial Intelligence, vol. 26, no. 5-6, pp.

1588–1601, 2013.

[7] S. Karaman and E. Frazzoli, “Sampling-based algorithms for optimal

motion planning,” The international journal of robotics research,

vol. 30, no. 7, pp. 846–894, 2011.

[8] E. Schmerling, L. Janson, and M. Pavone, “Optimal sampling-based

motion planning under differential constraints: The driftless case,”

in 2015 IEEE International Conference on Robotics and Automation

(ICRA), 2015, pp. 2368–2375.

[9] S. Karaman and E. Frazzoli, “Sampling-based optimal motion planning

for non-holonomic dynamical systems,” in 2013 IEEE International

Conference on Robotics and Automation, 2013, pp. 5041–5047.

[10] Y. Li, Z. Littleﬁeld, and K. E. Bekris, “Sparse methods for efﬁcient

asymptotically optimal kinodynamic planning,” in Algorithmic foun-

dations of robotics XI. Springer, 2015, pp. 263–282.

[11] L. Xin, Y. Kong, S. E. Li, J. Chen, Y. Guan, M. Tomizuka, and

B. Cheng, “Enable faster and smoother spatio-temporal trajectory

planning for autonomous vehicles in constrained dynamic environ-

ment,” Proceedings of the Institution of Mechanical Engineers, Part

D: Journal of Automobile Engineering, vol. 235, no. 4, pp. 1101–1112,

2021.

[12] S. Aradi, “Survey of deep reinforcement learning for motion planning

of autonomous vehicles,” IEEE Transactions on Intelligent Transporta-

tion Systems, 2020.

[13] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang,

A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering

the game of go without human knowledge,” nature, vol. 550, no. 7676,

pp. 354–359, 2017.

[14] S. E. Li, “Reinforcement learning and control,” 2020,

tsinghua University: Lecture Notes. http://www.idlab-

tsinghua.com/thulab/labweb/publications.html.

[15] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley,

D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep rein-

forcement learning,” in International conference on machine learning.

PMLR, 2016, pp. 1928–1937.

[16] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “Carla:

An open urban driving simulator,” in Conference on robot learning.

PMLR, 2017, pp. 1–16.

[17] Y. Guan, Y. Ren, S. E. Li, Q. Sun, L. Luo, and K. Li, “Centralized

cooperation for connected and automated vehicles at intersections

by proximal policy optimization,” IEEE Transactions on Vehicular

Technology, vol. 69, no. 11, pp. 12 597–12 608, 2020.

[18] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G.

Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski

et al., “Human-level control through deep reinforcement learning,”

nature, vol. 518, no. 7540, pp. 529–533, 2015.

[19] G. Li, S. Li, S. E. Li, Y. Qin, D. Cao, X. Qu, and B. Cheng, “Deep re-

inforcement learning enabled decision-making for autonomous driving

at intersections,” Automotive Innovation, vol. 3, no. 4, pp. 374–385,

2020.

[20] P. Lopez, M. Behrisch, L. Bieker-Walz, J. Erdmann, Y.-P. Fl¨

otter¨

od,

R. Hilbrich, L. L¨

ucken, and Johannes, “Microscopic trafﬁc simulation

using sumo,” in International Conference on Intelligent Transportation

Systems (ITSC). IEEE, 2018.

[21] J. Andersson, J. Gillis, G. Horn, J. B. Rawlings, and M. Diehl,

“CasADi-A software framework for nonlinear optimization and opti-

mal control,” Mathematical Programming Computation, vol. 11, no. 1,

pp. 1–36, 2019.

[22] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic:

Off-policy maximum entropy deep reinforcement learning with a

stochastic actor,” in Proceedings of the 35th International Conference

on Machine Learning. PMLR, 2018, pp. 1861–1870.

3469

Authorized licensed use limited to: Tsinghua University. Downloaded on February 16,2022 at 02:02:37 UTC from IEEE Xplore. Restrictions apply.

Learning to Control Autonomous Fleets from Observation via Offline Reinforcement Learning

Preprint

Feb 2023

Autonomous Mobility-on-Demand (AMoD) systems are a rapidly evolving mode of transportation in which a centrally coordinated fleet of self-driving vehicles dynamically serves travel requests. The control of these systems is typically formulated as a large network optimization problem, and reinforcement learning (RL) has recently emerged as a promising approach to solve the open challenges in this space. However, current RL-based approaches exclusively focus on learning from online data, fundamentally ignoring the per-sample-cost of interactions within real-world transportation systems. To address these limitations, we propose to formalize the control of AMoD systems through the lens of offline reinforcement learning and learn effective control strategies via solely offline data, thus readily available to current mobility operators. We further investigate design decisions and provide experiments on real-world mobility systems showing how offline learning allows to recover AMoD control policies that (i) exhibit performance on par with online methods, (ii) drastically improve data efficiency, and (iii) completely eliminate the need for complex simulated environments. Crucially, this paper demonstrates that offline reinforcement learning is a promising paradigm for the application of RL-based solutions within economically-critical systems, such as mobility systems.

Urban Intersection Simulation and Verification via Deep Reinforcement Learning Algorithms

Article

Full-text available

Feb 2023

Hanlin Liao

Reinforcement Learning (RL) uses rewards to have iteration and update the next state for training in an unknown and complex environment. This paper aims to find a possible solution for the traffic congestion problem and train four Deep Reinforcement Learning (DRL) algorithms to verify the urban intersection simulation environment in the different discussed dimensions, including practicability, efficiency, safety, complexity, and limitation. The experiment result shows that the four DRL algorithms are efficient in the RL intersection simulation. This paper has succeeded in verifying this RL environment in the comparison and expands the experiment with three conclusions: The agent can train by Deep Q-network(DQN), DoubleDQN, DuelingNet DQN, and Categorical DQN algorithms to be practical and efficient. As the experiment results show, DuelingNet takes less time to finish the training, and Categorical DQN has reduced the collision rate after a while. However, the RL simulation environment lacks complexity, causing limitations in solving more complex problems, including the lack of simulation of pedestrian behaviors and the prediction of emergency events. This paper recommends creating a more complex urban intersection simulation that includes exceptional cases for the RL agent environment and more traffic pressure for the intersection to improve the faster and safer response in future automatic driving.

Proactive Motion Planning for Uncontrolled Blind Intersections to Improve the Safety and Traffic Efficiency of Autonomous Vehicles

Article

Full-text available

Nov 2022

For the last two decades, autonomous vehicles have been proposed and developed to extend the operational design domain from the motorway to urban environments. However, there have been few studies on autonomous driving for uncontrolled and blind intersections. This paper presents a proactive motion planning algorithm to enhance safety and traffic efficiency simultaneously for autonomous driving in uncontrolled blind intersections. The target states of approach motion are decided based on the field of view of the laser scanner and the pre-defined intersection map with connectivity information. The model predictive controller is used to follow the target states and determine the longitudinal motion of an autonomous vehicle. A Monte Carlo simulation with a case study was conducted to evaluate the performance of the proposed proactive motion planner. The simulation results show that the risk caused by approaching vehicles from the occluded region is properly managed. In addition, the traffic flow is improved by reducing the required time to cross the intersections.

"Curse of rarity" for autonomous vehicles

Preprint

Full-text available

Jul 2022

Achieving the human-level safety performance for autonomous vehicles (AVs) remains a challenge. One critical bottleneck is the so-called "long-tail challenge", which usually refers to the problem that AVs should be able to handle seemingly endless low-probability safety-critical driving scenarios, even though millions of testing miles have been accumulated on public roads. However, there is neither rigorous definition nor analysis of properties of such problems, which hinders the progress of addressing them. In this paper, we systematically analyze the "long-tail challenge" and propose the concept of "curse of rarity" (CoR) for AVs. We conclude that the compounding effects of the CoR on top of the "curse of dimensionality" (CoD) are the root cause of the "long-tail challenge", because of the rarity of safety-critical events in high dimensionality of driving environments. We discuss the CoR in various aspects of AV development including perception, prediction, and planning, as well as validation and verification. Based on these analyses and discussions, we propose potential solutions to address the CoR in order to accelerate AV development and deployment.

Improve Generalization of Driving Policy at Signalized Intersections with Adversarial Learning

Preprint

Full-text available

Apr 2022

Intersections are quite challenging among various driving scenes wherein the interaction of signal lights and distinct traffic actors poses great difficulty to learn a wise and robust driving policy. Current research rarely considers the diversity of intersections and stochastic behaviors of traffic participants. For practical applications, the randomness usually leads to some devastating events, which should be the focus of autonomous driving. This paper introduces an adversarial learning paradigm to boost the intelligence and robustness of driving policy for signalized intersections with dense traffic flow. Firstly, we design a static path planner which is capable of generating trackable candidate paths for multiple intersections with diversified topology. Next, a constrained optimal control problem (COCP) is built based on these candidate paths wherein the bounded uncertainty of dynamic models is considered to capture the randomness of driving environment. We propose adversarial policy gradient (APG) to solve the COCP wherein the adversarial policy is introduced to provide disturbances by seeking the most severe uncertainty while the driving policy learns to handle this situation by competition. Finally, a comprehensive system is established to conduct training and testing wherein the perception module is introduced and the human experience is incorporated to solve the yellow light dilemma. Experiments indicate that the trained policy can handle the signal lights flexibly meanwhile realizing the smooth and efficient passing with a humanoid paradigm. Besides, APG enables a large-margin improvement of the resistance to the abnormal behaviors and thus ensures a high safety level for the autonomous vehicle.

Model-based Development for ROS 2-based Autonomous-driving Software

Conference Paper

Nov 2023

Neural MPC-Based Decision-Making Framework for Autonomous Driving in Multi-Lane Roundabout

Conference Paper

Sep 2023

Improve generalization of driving policy at signalized intersections with adversarial learning

Article

Jul 2023

Game Theory-Based Decision-Making and Iterative Predictive Lateral Control for Cooperative Obstacle Avoidance of Guided Vehicle Platoon

Article

Jun 2023

This paper presents a systematic decision-making and lateral control framework to realize cooperative obstacle avoidance (OA) of the guided vehicle platoons safely and integrally in multiple scenarios. In the control framework, a centralized decision-making strategy is developed to determine optimal overtaking maneuver (i.e., OA) in the dynamic driving environment, and a distributed lateral controller with a safe driving corridor planner is designed for lateral reference trajectories tracking. The proposed decision-making strategy formulates the influence mechanism between the controlled platoon and potential surrounding vehicles based on an interactive behavior model. Moreover, a game theory-based algorithm is applied to calculate safe timing of overtaking and select efficient overtaking mode while avoiding the negative impact on the surrounding vehicles. For the distributed lateral control strategy, the control problem is solved by a linear time-varying model predictive control (LTV-MPC) algorithm so as to reduce the computational burden. Furthermore, to improve the accuracy of LTV-MPC algorithm in varying scenarios, a tailored iterative control logic is presented. The performance of the proposed framework is evaluated on the dSPACE hardware-in-loop platform. The results show that the proposed framework works well under various test scenarios involving multiple static and moving obstacles.

Safe Reinforcement Learning for Urban Driving using Invariably Safe Braking Sets

Conference Paper

Oct 2022

Deep reinforcement learning (RL) has been widely applied to motion planning problems of autonomous vehicles in urban traffic. However, traditional deep RL algorithms cannot ensure safe trajectories throughout training and deployment. We propose a provably safe RL algorithm for urban autonomous driving to address this. We add a novel safety layer to the RL process to verify the safety of high-level actions before they are performed. Our safety layer is based on invariably safe braking sets to constrain actions for safe lane changing and safe intersection crossing. We introduce a generalized discrete high-level action space, which can represent all high-level intersection driving maneuvers and various desired accelerations. Finally, we conducted extensive experiments on the inD dataset containing urban driving scenarios. Our analysis demonstrates that the safe agent never causes a collision and that the safety layer's lane changing verification can even improve the goal-reaching performance compared to the unsafe baseline agent.

Survey of Deep Reinforcement Learning for Motion Planning of Autonomous Vehicles

Article

Full-text available

Feb 2022

Szilárd Aradi

Academic research in the field of autonomous vehicles has reached high popularity in recent years related to several topics as sensor technologies, V2X communications, safety, security, decision making, control, and even legal and standardization rules. Besides classic control design approaches, Artificial Intelligence and Machine Learning methods are present in almost all of these fields. Another part of research focuses on different layers of Motion Planning, such as strategic decisions, trajectory planning, and control. A wide range of techniques in Machine Learning itself have been developed, and this article describes one of these fields, Deep Reinforcement Learning (DRL). The paper provides insight into the hierarchical motion planning problem and describes the basics of DRL. The main elements of designing such a system are the modeling of the environment, the modeling abstractions, the description of the state and the perception models, the appropriate rewarding, and the realization of the underlying neural network. The paper describes vehicle models, simulation possibilities and computational requirements. Strategic decisions on different layers and the observation models, e.g., continuous and discrete state representations, grid-based, and camera-based solutions are presented. The paper surveys the state-of-art solutions systematized by the different tasks and levels of autonomous driving, such as car-following, lane-keeping, trajectory following, merging, or driving in dense traffic. Finally, open questions and future challenges are discussed.

Centralized Cooperation for Connected and Automated Vehicles at Intersections by Proximal Policy Optimization

Article

Full-text available

Sep 2020

Connected vehicles will change the modes of future transportation management and organization, especially at an intersection without any traffic light. Centralized coordination methods globally coordinate vehicles approaching the intersection from all sections by considering their states altogether. However, they need substantial computation resources since they own a centralized controller to optimize the trajectories for all approaching vehicles in real-time. In this paper, we propose a centralized coordination scheme of automated vehicles at an intersection without traffic signals using reinforcement learning (RL) to address low computation efficiency suffered by current centralized coordination methods. We first propose an RL training algorithm, model accelerated proximal policy optimization (MA-PPO), which incorporates a prior model into proximal policy optimization (PPO) algorithm to accelerate the learning process in terms of sample efficiency. Then we present the design of state, action and reward to formulate centralized coordination as an RL problem. Finally, we train a coordinate policy in a simulation setting and compare computing time and traffic efficiency with a coordination scheme based on model predictive control (MPC) method. Results show that our method spends only 1/400 of the computing time of MPC and increase the efficiency of the intersection by 4.5 times.

Enable faster and smoother spatio-temporal trajectory planning for autonomous vehicles in constrained dynamic environment

Article

Full-text available

Apr 2020

Trajectory planning is of vital importance to decision-making for autonomous vehicles. Currently, there are three popular classes of cost-based trajectory planning methods: sampling-based, graph-search-based, and optimization-based. However, each of them has its own shortcomings, for example, high computational expense for sampling-based methods, low resolution for graph-search-based methods, and lack of global awareness for optimization-based methods. It leads to one of the challenges for trajectory planning for autonomous vehicles, which is improving planning efficiency while guaranteeing model feasibility. Therefore, this paper proposes a hybrid planning framework composed of two modules, which preserves the strength of both graph-search-based methods and optimization-based methods, thus enabling faster and smoother spatio-temporal trajectory planning in constrained dynamic environment. The proposed method first constructs spatio-temporal driving space based on directed acyclic graph and efficiently searches a spatio-temporal trajectory using the improved A* algorithm. Then taking the search result as reference, locally convex feasible driving area is designed and model predictive control is applied to further optimize the trajectory with a comprehensive consideration of vehicle kinematics and moving obstacles. Results simulated in four different scenarios all demonstrated feasible trajectories without emergency stop or abrupt steering change, which is kinematic-smooth to follow. Moreover, the average planning time was 31 ms, which only took 59.05%, 18.87%, and 0.69%, respectively, of that consumed by other state-of-the-art trajectory planning methods, namely, maximum interaction defensive policy, sampling-based method with iterative optimizations, and Graph-search-based method with Dynamic Programming.

Hierarchical Reinforcement Learning for Self-Driving Decision-Making without Reliance on Labeled Driving Data

Article

Full-text available

Feb 2020
IET INTELL TRANSP SY

Decision making for self‐driving cars is usually tackled by manually encoding rules from drivers’ behaviours or imitating drivers’ manipulation using supervised learning techniques. Both of them rely on mass driving data to cover all possible driving scenarios. This study presents a hierarchical reinforcement learning method for decision making of self‐driving cars, which does not depend on a large amount of labelled driving data. This method comprehensively considers both high‐level manoeuvre selection and low‐level motion control in both lateral and longitudinal directions. The authors firstly decompose the driving tasks into three manoeuvres, including driving in lane, right lane change and left lane change, and learn the sub‐policy for each manoeuvre. Then, a master policy is learned to choose the manoeuvre policy to be executed in the current state. All policies, including master policy and manoeuvre policies, are represented by fully‐connected neural networks and trained by using asynchronous parallel reinforcement learners, which builds a mapping from the sensory outputs to driving decisions. Different state spaces and reward functions are designed for each manoeuvre. They apply this method to a highway driving scenario, which demonstrates that it can realise smooth and safe decision making for self‐driving cars.

Interactive Trajectory Prediction of Surrounding Road Users for Autonomous Driving Using Structural-LSTM Network

Article

Full-text available

Sep 2019

Accurate trajectory prediction of surrounding road users is critical to autonomous driving systems. In mixed traffic flows, road users with different kinds of behaviors and styles bring complexity to the environment, which requires considering interactions among road users when anticipating their future trajectories. This paper presents a long-term interactive trajectory prediction method for surrounding vehicles using a hierarchical multi-sequence learning network. In contrast to non-interactive method which assumes that road users are independent of each other, this method can automatically learn high-level dependencies among multiple interacting vehicles through the proposed structural-LSTM (long short-term memory) network. Specifically, structural-LSTM first assigns one LSTM for each interacting vehicle. Then these LSTMs share their cell states and hidden states with their spatial-neighboring LSTMs by a radial connection, and recurrently analyze the output state of itself as well as the other LSTMs in a deeper layer. Finally based on all output states, the network predicts trajectories for surrounding vehicles. The proposed method is evaluated on the NGSIM dataset, and its results show that satisfyingly accurate prediction performance of long-term trajectories of surrounding vehicles is accessible, e.g., longitudinal and lateral RMS error can be reduced to less than 1.93m and 0.31m over 5s time horizon, respectively.

Deep Reinforcement Learning Enabled Decision-Making for Autonomous Driving at Intersections

Article

Nov 2020

Road intersection is one of the most complex and accident-prone traffic scenarios, so it’s challenging for autonomous vehicles (AVs) to make safe and efficient decisions at the intersections. Most of the related studies focus on the solution to a single scenario or only guarantee safety without considering driving efficiency. To address these problems, this study proposed a deep reinforcement learning enabled decision-making framework for AVs to drive through intersections automatically, safely and efficiently. The mapping relationship between traffic images and vehicle operations was obtained by an end-to-end decision-making framework established by convolutional neural networks. Traffic images collected at two timesteps were used to calculate the relative velocity between vehicles. Markov decision process was employed to model the interaction between AVs and other vehicles, and the deep Q-network algorithm was utilized to obtain the optimal driving policy regarding safety and efficiency. To verify the effectiveness of the proposed decision-making framework, the top three accident-prone crossing path crash scenarios at intersections were simulated, when different initial vehicle states were adopted for better generalization capability. The results showed that the developed method could make AVs drive safely and efficiently through intersections in all of the tested scenarios.

Model-free Deep Reinforcement Learning for Urban Autonomous Driving

Conference Paper

Oct 2019

Microscopic Traffic Simulation using SUMO

Conference Paper

Nov 2018

Microscopic traffic simulation is an invaluable tool for traffic research. In recent years, both the scope of research and the capabilities of the tools have been extended considerably. This article presents the latest developments concerning intermodal traffic solutions, simulator coupling and model development and validation on the example of the open source traffic simulator SUMO.

CasADi: a software framework for nonlinear optimization and optimal control

Article

Jul 2018

We present CasADi, an open-source software framework for numerical optimization. CasADi is a general-purpose tool that can be used to model and solve optimization problems with a large degree of flexibility, larger than what is associated with popular algebraic modeling languages such as AMPL, GAMS, JuMP or Pyomo. Of special interest are problems constrained by differential equations, i.e. optimal control problems. CasADi is written in self-contained C++, but is most conveniently used via full-featured interfaces to Python, MATLAB or Octave. Since its inception in late 2009, it has been used successfully for academic teaching as well as in applications from multiple fields, including process control, robotics and aerospace. This article gives an up-to-date and accessible introduction to the CasADi framework, which has undergone numerous design improvements over the last 7 years.

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Article

Jan 2018

Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy - that is, succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.

Learn collision-free self-driving skills at urban intersections with model-based reinforcement learning

Figures

Recommended publications

Reinforcement Learning Method with Generalization Ability Developed by Using Deep Learning for Solvi...

Competitive Meta-Learning

A Fair Channel Access Using Reinforcement Learning: Poster

Learning to Rate Player Positioning in Soccer