Learn collision-free self-driving skills at urban intersections with
model-based reinforcement learning
Yang Guan#1, Yangang Ren#1, Haitong Ma1, Shengbo Eben Li1, Qi Sun1, Yifan Dai2, Bo Cheng1
Abstract—Intersection is one of the most complex and accident-prone urban traffic scenarios for autonomous driving, where making safe and computationally efficient decisions in high-density traffic flow is usually non-trivial. Current rule-based methods decompose the decision-making task into several serial sub-modules, resulting in long computation times on on-board computing devices in complex scenarios. In this paper, we formulate the decision-making and control problem at intersections as a process of optimal path selection and tracking, where the former selects the path with the best safety measure from a set generated using only static information, while the latter then considers dynamic obstacles and solves a tracking problem with safety constraints for the chosen path. To avoid the heavy online computation this would otherwise require, we develop a reinforcement learning algorithm called generalized exterior point (GEP) to find a neural network (NN) solution offline. It first constructs a multi-task problem involving all the candidate paths and transforms it into an unconstrained problem with a penalty on safety violations. Afterward, an approximately feasible optimal control policy is obtained by alternately performing gradient descent and enlarging the penalty. As an exterior-point-type method, GEP permits the control policy to violate inequality constraints during the iterations. To verify the effectiveness of our method, we carried out experiments both in simulation and in a real road test. Results demonstrate that the learned policy realizes collision-free driving under different traffic conditions while reducing the computation time by a large margin.

Index Terms—Intersection, decision-making, reinforcement learning, penalty function
I. INTRODUCTION
Autonomous driving has great potential to benefit traffic safety and efficiency as well as to change the pattern of future transportation. Among the multifarious urban scenarios in autonomous driving, the intersection is regarded as the most challenging one due to the highly dynamic, stochastic and sophisticated nature of its traffic environment [1]. At intersections, the complex conflict relationships between vehicles make it difficult to avoid collisions while maintaining high computing efficiency. Statistics show that intersection accidents account for 30–60% of severe traffic injuries in Europe. Therefore, developing safe and computationally efficient decision-making strategies at intersections becomes vital for the popularization of autonomous driving.

This work is supported by the National Key Research and Development Plan of China under Grant 2017YFB0102603. It is supported in part by the International Science & Technology Cooperation Program of China under 2019YFE0100200, NSF China under 51575293 and U20A20334. It is also partially supported by the Tsinghua University-Toyota Joint Research Center for AI Technology of Automated Vehicle.
#The first two authors contributed equally to this study. 1School of Vehicle and Mobility, Tsinghua University, Beijing, 100084, China. 2Suzhou Automotive Research Institute, Tsinghua University, Suzhou, 215200, China. All correspondence should be sent to S. E. Li (email: lisb04@gmail.com).
Currently, most decision-making algorithms at intersections fall into two major paradigms: rule-based methods and learning-based methods [2]. The former usually decomposes the decision-making module into a series of sub-modules such as prediction, behavior selection and trajectory planning, whereas the latter learns a policy that maps perception outputs directly to control signals using advanced learning techniques [3], [4]. Under the rule-based scheme, the prediction module predicts the trajectories of traffic participants to construct the feasible region. Recently, recurrent neural networks have played an increasingly important role in traffic prediction because of their high representation and approximation capability for long time series [5]. The behavior selection module decides on a local driving task that drives the car towards its destination while abiding by the rules of the road. Traditionally, finite state machines are widely adopted to switch between predefined driving behaviors according to the current vehicle situation [6]. The trajectory planning module selects a continuous path through the environment to accomplish a local navigational task [7]–[10]. Searching for a feasible and safe trajectory in this process is commonly time-consuming, which is exacerbated by the limited computing capability of on-board industrial computers. The state-of-the-art average planning time is 31 ms, achieved with a directed acyclic graph search and an improved A* algorithm [11]. However, this time tends to increase considerably as the number of surrounding vehicles grows. To date, the rule-based framework has been widely applied to typical urban scenarios such as signalized junctions and pedestrian crossings. However, its heavy reliance on human design and its high computational complexity restrict its scalability to complex scenarios with many vehicles.
Learning-based methods aim to train a driving policy, usually carried by a deep neural network, that maps perception information to control commands; this is also referred to as the "end-to-end" scheme [12]. Inspired by the recent success of reinforcement learning (RL) in games and robotics [13], [14], some researchers have attempted to utilize RL paradigms to tackle decision-making at complicated intersections. The advantage lies in RL's self-learning ability and high online computation efficiency owing to the fast forward propagation of neural networks (NNs). As a pioneering work on decision-making at intersections, Intel's laboratory introduced the asynchronous advantage actor-critic (A3C) algorithm [15] to develop driving policies in an urban
road simulator in the presence of other vehicles and pedestrians [16]. They concluded that RL-based methods were not successful enough at avoiding collisions with cars and static objects compared with decomposed and supervised methods. Guan et al. (2020) developed a decision-making framework for the cooperation of eight connected automated vehicles at an unsignalized intersection [17]. They adopted the proximal policy optimization (PPO) algorithm and obtained the final converged policy after days of training, which is quite slow for achieving fair performance. Li et al. (2020) established an end-to-end framework using convolutional neural networks to map the relationship between traffic images and vehicle operations. The relative velocity between vehicles was calculated from traffic images collected at two consecutive time steps, and the well-known deep Q-network algorithm [18] was utilized to obtain the optimal driving policy regarding safety and efficiency [19]. However, only two vehicles from different directions were considered. A shortcoming of the aforementioned works is that the studied intersection scenarios are rather simple, with only a few surrounding vehicles, which differs greatly from real intersection conditions. Besides, although current algorithms have been developed and work well in simulators such as CARLA [16] and SUMO [20], the driving policies are in fact trained without safety guarantees, making them impractical to apply at real-world intersections.
In this paper, we propose an integrated decision and control (IDC) framework for automated vehicles at complex intersections, which combines prior knowledge and RL techniques to make safe and computationally efficient decisions. Besides, as an important support for IDC, we develop a model-based RL algorithm called the generalized exterior point (GEP) method to obtain NN solutions of large-scale constrained optimal control problems (OCPs). The main contributions of the paper can be summarized as follows:

(1) We present the IDC framework for automated vehicles at intersections. The IDC framework decomposes the driving task into optimal path selection and tracking, where the former chooses the path with the best safety measure from a path set generated in advance, while the latter solves a tracking problem with safety constraints to realize collision-free driving. To mitigate the online computation burden, the path set is formed using only static information, and GEP is developed to solve a control policy of the tracking problem offline. The framework enhances computing efficiency by replacing intensive online optimizations with an RL-trained policy, and it exhibits complex driving behaviors by selecting among and tracking different paths. Moreover, it has great adaptability and is promising for deployment in different kinds of scenarios with a variety of traffic participants.

(2) We propose a model-based RL algorithm called GEP to approximately solve for an NN control policy for the constrained tracking problems, supporting real-time online application. GEP is in fact an extension of the exterior point method from the optimization domain to the field of NNs, for the purpose of obtaining NN solutions of large-scale constrained OCPs. It first constructs a multi-task problem involving all the candidate paths and transforms it into an unconstrained problem with a penalty on safety violations. Afterwards, the approximately feasible optimal control policy is obtained by alternately performing gradient descent and enlarging the penalty. To the best of our knowledge, GEP is the first approximate solver for constrained OCPs whose solutions are parameterized by NNs.
II. INTEGRATED DECISION AND CONTROL FRAMEWORK
Here we describe the integrated decision and control framework at intersections. As shown in Fig. 1, the framework consists of four modules: path generating, training, path selecting and optimal tracking, of which the first two mainly involve offline training and the other two are related to the online application. Note that the training module is the core segment of this framework, which aims to substitute all the expensive online optimizations with an offline-trained policy network based on RL.

The path generating module produces multiple candidate paths considering only static information such as road structure, speed limits and traffic lights. Note that the intention is to generate a path set rather than to find the optimal path, which largely reduces planning time. The training module then further considers these paths together with dynamic information such as surrounding vehicles, pedestrians and bicycles. Formally, a constrained OCP is constructed for each path, where the objective function is to minimize the tracking error within a finite horizon and the constraints characterize safety requirements. Instead of solving these OCPs in sequence, we develop a model-based RL algorithm to solve for a control policy parameterized by an NN which is capable of tracking paths of different shapes while maintaining high flexibility to avoid collision with surrounding vehicles. Subsequently, in the online application, the ego vehicle chooses a relatively better path, one that tends to reduce potential collisions and passing time, using simple rules. For this given path, the trained policy quickly calculates the driving actions through the feed-forward propagation of the NN, leading the ego vehicle to the next location.
Compared to traditional methods, the framework has several advantages for application at intersections with large-scale traffic flow. Above all, it has high online computing efficiency benefiting from two improvements. One is that the path generating module can embed the key parameters of multiple paths into the electronic map in advance and read them directly to generate candidate paths. The other is that the tracking module utilizes the trained policy network to compute control commands, which also enjoys high efficiency due to the fast propagation of neural networks. Second, our algorithm is flexible enough to deal with the dynamics and complexity of intersections. Generally, the more complex the intersection, the more candidate paths there are. Fortunately, the design of our algorithm guarantees that the trained policy can handle various paths at complex intersections. Besides, the formulation of the constrained optimal control problem is applicable to different kinds of dynamic traffic elements, including red lights, bicycles and pedestrians.
Fig. 1: Integrated decision and control framework at intersections
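The candidate paths themselves are static objects whose key parameters can be stored in, and read back from, the electronic map. As a rough, hypothetical illustration (the field layout and the toy left-turn generator below are assumptions, not the paper's data format), one candidate path can be discretized into position, heading and reference-speed samples:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ReferencePath:
    """One candidate path, stored as discretized samples (illustrative layout)."""
    x: np.ndarray      # global x coordinates [m]
    y: np.ndarray      # global y coordinates [m]
    phi: np.ndarray    # reference heading at each sample [rad]
    v_ref: np.ndarray  # reference speed at each sample [m/s]

    def nearest_index(self, px: float, py: float) -> int:
        """Index of the path point closest to a query position."""
        return int(np.argmin((self.x - px) ** 2 + (self.y - py) ** 2))

def left_turn_path(radius: float = 20.0, v_ref: float = 5.0, n: int = 100) -> ReferencePath:
    """Toy quarter-circle left-turn path; a real path set would be read from the map."""
    theta = np.linspace(-np.pi / 2, 0.0, n)
    return ReferencePath(
        x=radius * np.cos(theta),
        y=radius + radius * np.sin(theta),
        phi=theta + np.pi / 2,          # tangent direction of the arc
        v_ref=np.full(n, v_ref),
    )

# A candidate path set for one task, e.g. turning left into one of three exit lanes.
path_set = [left_turn_path(radius=r) for r in (18.0, 21.75, 25.5)]
```

A path set for one driving task would then simply be a small list of such objects, one per reachable exit lane.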
III. METHODOLOGY
In this section, we mainly focus on the details of the training module, the central component, including the formulation of the path tracking problem and the proposed GEP algorithm. Besides, the details of the online application will also be described.
A. Problem formulation
Provided one candidate path τ ∈ Π, where Π is the candidate path set, the constrained OCP can be constructed as (1), aiming to optimize tracking performance while satisfying the safety constraints:
$$
\begin{aligned}
\min_{u_{i|t}}\;\; & J=\sum_{i=0}^{T-1}\left[(x^{\mathrm{ref}}_{i|t}-x_{i|t})^{\top} Q\,(x^{\mathrm{ref}}_{i|t}-x_{i|t}) + u_{i|t}^{\top} R\, u_{i|t}\right]\\
\text{s.t.}\;\; & x_{i+1|t}=F_{\mathrm{ego}}(x_{i|t}, u_{i|t}),\\
& x^{j}_{i+1|t}=F_{\mathrm{pred}}(x^{j}_{i|t}),\\
& (x_{i|t}-x^{j}_{i|t})^{\top} M\,(x_{i|t}-x^{j}_{i|t})\ge D^{\mathrm{safe}}_{\mathrm{veh}},\\
& (x_{i|t}-x^{\mathrm{road}}_{i|t})^{\top} M\,(x_{i|t}-x^{\mathrm{road}}_{i|t})\ge D^{\mathrm{safe}}_{\mathrm{road}},\\
& x_{i|t}\ge -L_{\mathrm{stop}},\;\text{if light = red},\\
& x_{0|t}=x_{t},\quad x^{j}_{0|t}=x^{j}_{t},\quad u_{0|t}=u_{t},\\
& i=0:T-1,\quad j\in\mathcal{I}
\end{aligned}
\tag{1}
$$
where $T$ is the prediction horizon and $i$ is the virtual time step starting from the current time step $t$. $x_{i|t}$ and $x^{\mathrm{ref}}_{i|t}$ respectively represent the ego vehicle state and its corresponding reference state at predictive step $i$, whose deviation is weighted by a positive-definite matrix $Q$. $u_{i|t}$ is the control action to be solved and $R$ is its weighting matrix. $x^{j}_{i|t}$ is the state of the $j$-th vehicle in the interested vehicle set $\mathcal{I}$. $F_{\mathrm{ego}}$ represents the dynamic model of the ego vehicle and $F_{\mathrm{pred}}$ is the kinematic model of the surrounding vehicles. In addition, $D^{\mathrm{safe}}_{\mathrm{veh}}$ and $D^{\mathrm{safe}}_{\mathrm{road}}$ define the safe distances with respect to other vehicles and the road edge, and $M$ is the distance weighting matrix. $L_{\mathrm{stop}}$ indicates the position of the stop line and defines the constraint regarding red lights. Note that in (1) the virtual states can be generated by the dynamic and kinematic models, except that $x_{0|t}$ and $x^{j}_{0|t}$ are assigned the current real states $x_t$ and $x^{j}_t$.
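To make the notation concrete, the sketch below evaluates the quadratic tracking cost and the safety constraints of (1) for a single virtual step with numpy. The weighting matrices reuse the values reported later in Sec. IV-A, while the distance matrix M, the safe-distance thresholds and the use of the first state entry as a longitudinal position for the red-light constraint are illustrative assumptions.

```python
import numpy as np

Q = np.diag([0.04, 0.04, 0.01, 0.01, 0.1, 0.02])   # state weighting (Sec. IV-A)
R = np.diag([0.1, 0.005])                            # action weighting
M = np.diag([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])          # distance weighting on position terms (assumed)
D_SAFE_VEH, D_SAFE_ROAD = 3.5 ** 2, 1.0 ** 2          # illustrative squared safe distances

def stage_cost(x, x_ref, u):
    """One-step tracking cost l(s, u) in problem (1)."""
    e = x_ref - x
    return e @ Q @ e + u @ R @ u

def safety_constraints(x, x_others, x_road, stop_pos=None, light_red=False):
    """Constraint values g(s) <= 0 (negative means satisfied), following (1)/(4)."""
    g = []
    for x_j in x_others:                       # keep-clear ellipse around each vehicle
        d = x - x_j
        g.append(D_SAFE_VEH - d @ M @ d)
    d = x - x_road                              # keep-clear distance to the road edge
    g.append(D_SAFE_ROAD - d @ M @ d)
    if light_red and stop_pos is not None:      # do not pass the stop line on red
        # x[0] is treated here as a longitudinal position along the path (assumption)
        g.append(-stop_pos - x[0])              # x[0] >= -L_stop  <=>  -L_stop - x[0] <= 0
    return np.array(g)
```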
As for the dynamics of the ego vehicle, we adopt the classic 2-DOF dynamic bicycle model widely used in vehicle control, which has a 6-dimensional state vector and a 2-dimensional action vector,

$$x_{i|t}=\left[p_x,\; p_y,\; v_{\mathrm{lon}},\; v_{\mathrm{lat}},\; \varphi,\; \omega\right]^{\top}_{i|t},\qquad u_{i|t}=\left[\delta,\; a\right]^{\top}_{i|t},$$

where $p_x, p_y$ are the position coordinates of the ego vehicle center of gravity (CG), $v_{\mathrm{lon}}, v_{\mathrm{lat}}$ are the longitudinal and lateral velocities, $\varphi$ is the heading angle, $\omega$ is the yaw rate, and $\delta$ and $a$ are the front wheel angle and the acceleration commands, respectively. The state space equation is
$$
F_{\mathrm{ego}}(x,u)=
\begin{bmatrix}
p_x + \Delta t\,(v_{\mathrm{lon}}\cos\varphi - v_{\mathrm{lat}}\sin\varphi)\\[2pt]
p_y + \Delta t\,(v_{\mathrm{lon}}\sin\varphi + v_{\mathrm{lat}}\cos\varphi)\\[2pt]
v_{\mathrm{lon}} + \Delta t\,(a + v_{\mathrm{lat}}\omega)\\[2pt]
\dfrac{m v_{\mathrm{lon}} v_{\mathrm{lat}} + \Delta t\left[(L_f k_f - L_r k_r)\omega - k_f\delta v_{\mathrm{lon}} - m v_{\mathrm{lon}}^{2}\omega\right]}{m v_{\mathrm{lon}} - \Delta t\,(k_f + k_r)}\\[8pt]
\varphi + \Delta t\,\omega\\[2pt]
\dfrac{I_z\omega v_{\mathrm{lon}} + \Delta t\left[(L_f k_f - L_r k_r)v_{\mathrm{lat}} - L_f k_f\delta v_{\mathrm{lon}}\right]}{I_z v_{\mathrm{lon}} - \Delta t\,(L_f^{2} k_f + L_r^{2} k_r)}
\end{bmatrix}.
\tag{2}
$$
Specially, throughout this paper, the key parameters of the ego vehicle are set in accordance with the physical vehicle used in the real road test, as shown in TABLE I.

TABLE I: Parameters for $F_{\mathrm{ego}}$

| Parameter | Meaning | Value |
| --- | --- | --- |
| $k_f$ | Front wheel cornering stiffness | -155495 [N/rad] |
| $k_r$ | Rear wheel cornering stiffness | -155495 [N/rad] |
| $L_f$ | Distance from CG to front axle | 1.19 [m] |
| $L_r$ | Distance from CG to rear axle | 1.46 [m] |
| $m$ | Mass | 1520 [kg] |
| $I_z$ | Polar moment of inertia at CG | 2642 [kg·m²] |
| $\Delta t$ | Discrete time step | 0.1 [s] |

Since the surrounding vehicles are only involved in the safety requirements, we simply construct a kinematic model to predict their motion:
$$
F_{\mathrm{pred}}(x^{j})=
\begin{bmatrix}
p^{j}_x + \Delta t\,(v^{j}_{\mathrm{lon}}\cos\varphi^{j} - v^{j}_{\mathrm{lat}}\sin\varphi^{j})\\[2pt]
p^{j}_y + \Delta t\,(v^{j}_{\mathrm{lon}}\sin\varphi^{j} + v^{j}_{\mathrm{lat}}\cos\varphi^{j})\\[2pt]
v^{j}_{\mathrm{lon}}\\[2pt]
0\\[2pt]
\varphi^{j} + \Delta t\,\omega^{j}_{\mathrm{pred}}\\[2pt]
0
\end{bmatrix}
\tag{3}
$$
in which $\omega^{j}_{\mathrm{pred}}$ depends on the route and position of the $j$-th vehicle. Besides, the states of the surrounding vehicles, the reference
path and the road are each represented with a reduced state vector:

$$x^{\dagger}_{i|t}=\left[p^{\dagger}_x,\; p^{\dagger}_y,\; v^{\dagger}_{\mathrm{lon}},\; 0,\; \varphi^{\dagger},\; 0\right]^{\top}_{i|t},$$

where $\dagger\in\{j,\mathrm{ref},\mathrm{road}\}$.
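For reference, both models can be written as explicit one-step update functions. The Python sketch below implements the ego dynamics (2) with the parameters of TABLE I and the surrounding-vehicle prediction (3); it follows the reconstructed equations above and is an illustration rather than the exact on-board implementation (it assumes $v_{\mathrm{lon}}>0$).

```python
import numpy as np

# Ego vehicle parameters (TABLE I)
KF, KR = -155495.0, -155495.0   # front/rear cornering stiffness [N/rad]
LF, LR = 1.19, 1.46             # CG to front/rear axle [m]
MASS, IZ = 1520.0, 2642.0       # mass [kg], yaw inertia [kg*m^2]
DT = 0.1                        # discrete time step [s]

def f_ego(x, u):
    """One step of the 2-DOF dynamic bicycle model F_ego in (2); requires v_lon > 0."""
    px, py, v_lon, v_lat, phi, omega = x
    delta, a = u                                    # front wheel angle [rad], acceleration [m/s^2]
    px_next = px + DT * (v_lon * np.cos(phi) - v_lat * np.sin(phi))
    py_next = py + DT * (v_lon * np.sin(phi) + v_lat * np.cos(phi))
    v_lon_next = v_lon + DT * (a + v_lat * omega)
    v_lat_next = (MASS * v_lon * v_lat
                  + DT * ((LF * KF - LR * KR) * omega - KF * delta * v_lon - MASS * v_lon**2 * omega)
                  ) / (MASS * v_lon - DT * (KF + KR))
    phi_next = phi + DT * omega
    omega_next = (IZ * v_lon * omega
                  + DT * ((LF * KF - LR * KR) * v_lat - LF * KF * delta * v_lon)
                  ) / (IZ * v_lon - DT * (LF**2 * KF + LR**2 * KR))
    return np.array([px_next, py_next, v_lon_next, v_lat_next, phi_next, omega_next])

def f_pred(x_j, omega_pred):
    """One step of the surrounding-vehicle kinematic model F_pred in (3);
    omega_pred is looked up from the vehicle's route and position."""
    px, py, v_lon, v_lat, phi, _ = x_j
    return np.array([
        px + DT * (v_lon * np.cos(phi) - v_lat * np.sin(phi)),
        py + DT * (v_lon * np.sin(phi) + v_lat * np.cos(phi)),
        v_lon,
        0.0,
        phi + DT * omega_pred,
        0.0,
    ])
```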
B. Generalized exterior point method (GEP)
The OCP (1) could be solved by traditional online optimization methods such as model predictive control (MPC), but this would be computationally intractable for the on-board device, especially in the presence of large-scale constraints. Therefore, we propose a model-based RL algorithm, which is used to find a neural network solution for the constrained OCP offline with model guidance, improving the online computing efficiency and overall performance. However, problem (1) has limited generalization because it is defined on a particular path and seeks a single control sequence for a single state. Therefore, our algorithm first converts the original problem (1) into the following one:
original problem (1) to the following one:
min
θ
Eτ,xt,xj
tT1
X
i=0
(xref
i|txi|t)>Q(xref
i|txi|t)+
π>
θ(si|t)θ(si|t)
s.t. xi+1|t=Fego(xi|t, πθ(si|t)),
xj
i+1|t=Fpred(xj
i|t),
g(si|t) =
(xi|txj
i|t)>M(xi|txj
i|t)Dsafe
veh ,
(xi|txroad
i|t)>M(xi|txroad
i|t)Dsafe
road,
xi|t≥ −Lstop,if light = red,
si|t= [xref
i|t, xi|t, xj
i|t]T,
i=0:T1, j I
(4)
The first distinction between (1) and (4) is that the variable to be optimized is no longer a single control sequence but the parameters of a policy $\pi_{\theta}$, defined as a continuous function mapping from the state space to the action space, i.e., $u_t=\pi_{\theta}(s_t)$, where $s$ denotes the input of the policy and is designed to contain all the information necessary to determine the driving actions, including that of the reference path, the ego vehicle and the surrounding vehicles, as shown in (4). With this policy we can obtain the optimal control action for every state. Second, the objective function is no longer about a single state $s$, i.e., a particular combination of ego state, surrounding-vehicle states and reference path, but about a joint distribution over these components. This brings another great advantage: one policy is enough to handle the tracking tasks for different paths, avoiding training a separate policy for each path. That is also why we incorporate the path information as part of the state $s$. During training, the reference path is sampled uniformly from the path set to generate a series of states, so that our policy can experience information from different paths.
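As a minimal sketch of such a parameterization, assuming a PyTorch implementation, the policy can be a small fully connected network that consumes the concatenated state s and outputs the front wheel angle and acceleration. The hidden-layer sizes and the 41-dimensional state follow Sec. IV-A; the activation function and the action bounds used for squashing are assumptions.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """pi_theta: state s -> action u = [front wheel angle, acceleration]."""
    def __init__(self, state_dim: int = 41, hidden: int = 256,
                 delta_max: float = 0.4, a_max: float = 3.0):   # action bounds are assumptions
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, 2),
        )
        self.scale = torch.tensor([delta_max, a_max])

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.net(s)) * self.scale     # squash to bounded actions

policy = PolicyNetwork()
s = torch.zeros(1, 41)           # [ego state (6), 8 vehicles x 4-d, path errors (3)] = 41-d (Sec. IV-A)
u = policy(s)                    # -> tensor of shape (1, 2): [delta, a]
```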
For the large-scale constrained problem (4), we then propose GEP, which is adapted from the exterior point method in the optimization field, wherein the constrained problem is converted into an unconstrained one. For clarity, we use $l(s_{i|t}, \pi_{\theta}(s_{i|t}))$ to denote the one-step cost in (4) and $g(s_{i|t})\le 0$ as the unified notation for the constraints in (4). Then the unconstrained problem can be derived as
$$
\min_{\theta}\; J_p=J_{\mathrm{actor}}+\rho J_{\mathrm{penalty}}
=\mathbb{E}_{s_{0|t}}\Bigg\{\sum_{i=0}^{T-1} l(s_{i|t}, \pi_{\theta}(s_{i|t}))\Bigg\}
+\rho\,\mathbb{E}_{s_{0|t}}\Bigg\{\sum_{i=0}^{T-1}\phi_i(\theta)\Bigg\}
\tag{5}
$$
where $\phi_i(\theta)=\left[\max\{0, g(s_{i|t})\}\right]^{2}$ is the penalty function and $\rho$ is the penalty factor. The illustration of GEP for constrained optimal control is shown in Fig. 2. The optimization process mainly involves the alternating update of the policy parameters and the penalty factor. The former are updated by gradient descent, while the latter is designed as a given monotonically increasing sequence. Intuitively, the penalty factor penalizes constraint violations; as it grows large, violations become prohibitive and the policy is pushed progressively toward the feasible region. The procedure is shown in the offline training part of Algorithm 1.
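Putting the pieces together, one GEP iteration performs a gradient step on the penalized objective $J_p$ in (5) along a model rollout, and enlarges the penalty factor every $m$ iterations. The PyTorch sketch below mirrors the offline-training part of Algorithm 1; `rollout_model` is a hypothetical differentiable wrapper around the cost, constraint and model functions sketched earlier, and the whole snippet is a simplified illustration rather than the authors' implementation.

```python
import torch

def gep_update(policy, optimizer, batch_states, rollout_model, horizon=25,
               rho=1.0, amplifier=1.1, interval=10000, step=0):
    """One GEP iteration: gradient descent on J_p = J_actor + rho * J_penalty,
    with the penalty factor enlarged every `interval` iterations."""
    j_actor, j_penalty = 0.0, 0.0
    s = batch_states                                     # sampled initial states s_{0|t}
    for _ in range(horizon):
        u = policy(s)                                     # a = pi_theta(s)
        cost, g, s = rollout_model(s, u)                  # stage cost l, constraints g, next virtual state
        j_actor = j_actor + cost.mean()
        j_penalty = j_penalty + (torch.clamp(g, min=0.0) ** 2).sum(dim=-1).mean()

    loss = j_actor + rho * j_penalty                      # J_p in (5)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (step + 1) % interval == 0:                        # enlarge the penalty factor
        rho *= amplifier
    return rho, float(j_actor), float(j_penalty)
```

In practice the optimizer could be, e.g., `torch.optim.Adam(policy.parameters(), lr=3e-4)`, with the learning rate decayed as reported in Sec. IV-A.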
Fig. 2: Diagram of the generalized exterior point method
C. Online application
Now that we have obtained the optimal control policy through offline training, the next step is to deploy it online. At each time step, we first choose the optimal path from the candidate set $\Pi$ in terms of a rule-based safety indicator defined as
$$V(\tau)=\sum_{j\in\mathcal{I}} d(\tau, x^{j}),$$

where $d(\tau, x^{j})$ denotes the distance of the $j$-th surrounding vehicle to the path $\tau$. Then the optimal path is the one with the
maximal $V$, i.e.,

$$\tau^{*}=\arg\max_{\tau_i}\{V(\tau_i)\,|\,\tau_i\in\Pi\}. \tag{6}$$
After that, we construct the state $s_t$ using the optimal path $\tau^{*}$, and the trained policy $\pi_{\theta}$ outputs the corresponding driving action $u^{*}_t$. Moreover, we further consider the application safety issue caused by the approximation. In fact, the trained policy has no safety guarantee at an arbitrary point of the state space, because we rely on a sample-based method to approximately solve the problem and the whole state space cannot be swept during training. To tackle this, a safety shield is added before applying $u^{*}_t$ to the vehicle, which projects an unsafe action onto the nearest safe one $u^{\mathrm{safe}}_t$:

$$
u^{\mathrm{safe}}_t=
\begin{cases}
u^{*}_t, & \text{if } u^{*}_t\in\mathcal{U}_{\mathrm{safe}}(s_t),\\
\arg\min_{u\in\mathcal{U}_{\mathrm{safe}}(s_t)}\|u-u^{*}_t\|_2, & \text{otherwise},
\end{cases}
\tag{7}
$$

where $\mathcal{U}_{\mathrm{safe}}(s_t)=\{u_t\,|\,g(s_{1|t})\le 0\}$ guarantees that the next state $s_{1|t}$ is safe. In practice, $s_{1|t}$ is replaced by the model prediction to judge whether an action is safe. Finally, the safe action is applied to the ego vehicle. The integrated decision and control framework is summarized in Algorithm 1.
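To illustrate the online cycle, the sketch below scores the candidate paths with (6) and applies a crude, sampling-based approximation of the projection in (7). It reuses the illustrative `ReferencePath` layout from Sec. II; the distance measure (minimum point-to-path distance), the candidate-sampling shield and the `predict_next_state`/`constraints` callables are assumptions for illustration.

```python
import numpy as np

def select_path(path_set, vehicle_positions):
    """Pick the path with the largest summed distance to surrounding vehicles, eq. (6)."""
    def score(path):
        d = np.hypot(path.x[None, :] - vehicle_positions[:, 0:1],
                     path.y[None, :] - vehicle_positions[:, 1:2]).min(axis=1)
        return d.sum()                         # V(tau) = sum_j d(tau, x^j)
    return max(path_set, key=score)

def safety_shield(u, s, predict_next_state, constraints, n_samples=200):
    """Project an unsafe action toward a nearby safe one, eq. (7), by sampling candidates."""
    if np.all(constraints(predict_next_state(s, u)) <= 0.0):
        return u                               # u is already in U_safe(s)
    candidates = u + np.random.uniform(-1.0, 1.0, size=(n_samples, u.shape[0]))
    safe = [c for c in candidates if np.all(constraints(predict_next_state(s, c)) <= 0.0)]
    if not safe:
        return u                               # fall back; a real system would brake or re-plan
    return min(safe, key=lambda c: np.linalg.norm(c - u))
```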
Algorithm 1: IDC framework
Input: path set Π, policy network π_θ with random parameters θ, buffer B ← ∅, learning rate β_θ, penalty factor ρ = 1, penalty amplifier c
Offline training:
  for each iteration k do
    Randomly select a path τ ∈ Π; initialize the ego state x_t and the surrounding-vehicle states x_t^j, j ∈ I
    for each environment step do
      s_t ← {τ, x_t, x_t^j, j ∈ I}
      B ← B ∪ {s_t}, a_t = π_θ(s_t)
      Apply a_t to observe x_{t+1} and x_{t+1}^j, j ∈ I
    Sample a batch of states from B and, for each state, predict J_p in (5)
    if k mod m = 0 then ρ ← cρ
    θ ← θ − β_θ ∇_θ J_p
  Obtain the optimal policy π_θ*
Online application:
  for each environment step do
    for each τ_i ∈ Π do V(τ_i) = Σ_{j∈I} d(τ_i, x^j)
    τ* = arg max_{τ_i} {V(τ_i) | τ_i ∈ Π}
    s_t ← {τ*, x_t, x_t^j, j ∈ I}, u_t* = π_θ*(s_t)
    Calculate u_t^safe by (7)
    Apply u_t^safe to observe x_{t+1} and x_{t+1}^j, j ∈ I
IV. SIMULATION VERIFICATION
Here we focus on a typical signalized four-way intersection, shown in Fig. 3, where each direction has three lanes. The controlled ego vehicle aims to travel through this intersection safely and efficiently in a dense traffic flow of over 800 vehicles per hour on each entrance lane. There are three tasks in total: left turn, right turn and going straight. The simulation videos are available online (https://youtu.be/MEm2ZCi2iao).
Fig. 3: Intersection and candidate paths generation
A. Algorithm Details and Training Results
The state sis constructed as a 41-d vector composed of
a 6-d ego state described in section III, a 4-d vector for
each of 8 vehicles who have potential conflicts with ego
vehicle and a 3-d vector for the reference path including the
tracking error of position, speed and heading. The weighting
matrices are Q=diag(0.04,0.04,0.01,0.01,0.1,0.02),R=
diag(0.1,0.005) and the predictive horizon Tis set to be 25
with a frequency of 10Hz. For the training parameters, the
policy network is a multi-layer perceptron with two hidden
layer, each with 256 units. The update interval mof penalty
factor is 10000 and the amplifier is 1.1. Besides, the batch
size is chosen as 1024 and a decayed learning rate 3e4
1e5is employed. Training curves are shown in Fig. 4, in
which both policy loss and penalty loss decrease consistently
for all the tasks, indicating an improving tracking and safety
performance. Specially, penalty loss decreases nearly to 0,
which shows the proposed algorithm GEP eventually find
the policy to satisfy safety constraints approximately.
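For convenience, the training settings reported above can be collected into a single configuration; the values restate the paper's numbers, while the tuple-based layout and the linear shape of the learning-rate decay are assumptions.

```python
train_config = dict(
    state_dim=41,                 # 6-d ego + 8 vehicles x 4-d + 3-d path-tracking errors
    hidden_layers=(256, 256),     # MLP policy
    Q=(0.04, 0.04, 0.01, 0.01, 0.1, 0.02),
    R=(0.1, 0.005),
    horizon=25,                   # prediction horizon T at 10 Hz
    batch_size=1024,
    lr_schedule=(3e-4, 1e-5),     # learning rate decayed from 3e-4 to 1e-5 (assumed linear)
    penalty_interval=10000,       # iterations between penalty-factor updates (m)
    penalty_amplifier=1.1,        # c
    n_runs=5,                     # independent runs behind the reported training curves
)
```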
B. Performance of the GEP
To verify the control precision and computing efficiency of our model-based solver on the constrained optimal control problem, we conduct a comparison with MPC. Here we adopt the Ipopt solver [21], an open-source and popular package for nonlinear optimization, to obtain the exact solution of the constrained optimal control problem, which is then used as the benchmark for our solver. Fig. 5 shows the comparison in terms of control effect and computation time. The results show that the output actions, front wheel angle and acceleration, have similar trends, which indicates that our proposed algorithm approximates the optimal solution with small error. Besides, there is an obvious difference in computation time: our method can output the actions
within 10 ms, while MPC takes 1000 ms to do so, demonstrating the superiority of the proposed offline solver.

Fig. 4: Training process of the model-based RL algorithm. (a) Tracking performance. (b) Safety performance. The solid lines correspond to the mean value and the shaded regions to the 95% confidence interval over 5 runs.
Fig. 5: Comparison with MPC. (a) Front wheel angle. (b) Acceleration. (c) Computing time.
C. Driving performance comparison
We compare our method with a rule-based method, which adopts the A* algorithm to generate a feasible trajectory and a PID controller to track it [11], as well as a model-free RL method, which uses a reward and punishment system to learn a policy maximizing the long-term return (if the ego vehicle passes the intersection safely, a large positive reward of 100 is given; otherwise, -100 is given whenever a collision happens) [22]. We choose six indicators, including computing time, comfort, travel efficiency, collisions, failure rate and driving compliance, to evaluate the three algorithms. Comfort is reflected by the root mean square of the longitudinal and lateral accelerations, i.e., $I_{\mathrm{comfort}}=1.4\sqrt{\overline{a_x^2}+\overline{a_y^2}}$. Travel efficiency is evaluated by the average time used to pass the intersection. Failure rate counts the cases in which a decision signal takes more than 1 second to generate, and driving compliance counts the number of red-light violations. The results of 100 simulations are shown in TABLE II. Benefiting from the framework design, the computational efficiency of our method remains as fast as the model-free approach, while the driving performance, in terms of safety, compliance and failure rate, is better than that of the other two approaches.
TABLE II: Comparison of driving performance

| | IDC (Ours) | Rule-based | Model-free RL |
| --- | --- | --- | --- |
| Computing time, upper quantile [ms] | 5.81 | 73.99 | 4.91 |
| Computing time, standard deviation [ms] | 0.60 | 36.59 | 0.65 |
| Comfort index | 1.84 | 1.41 | 3.21 |
| Time to pass [s] | 7.86 (±3.52) | 24.4 (±16.48) | 6.73 (±3.32) |
| Collisions | 0 | 0 | 31 |
| Failure rate | 0 | 13 | 0 |
| Decision compliance (violations) | 0 | 0 | 17 |
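The comfort index above can be computed directly from logged accelerations; a small sketch (the sample log in the example is synthetic):

```python
import numpy as np

def comfort_index(ax, ay):
    """I_comfort = 1.4 * sqrt(mean(ax^2) + mean(ay^2)) from logged accelerations [m/s^2]."""
    ax, ay = np.asarray(ax, dtype=float), np.asarray(ay, dtype=float)
    return 1.4 * np.sqrt(np.mean(ax ** 2) + np.mean(ay ** 2))

# Example: a gentle 10 s log at 10 Hz gives a small comfort index (lower is smoother).
t = np.arange(0, 10, 0.1)
print(round(comfort_index(0.5 * np.sin(t), 0.2 * np.cos(t)), 2))
```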
V. REAL ROAD TEST
To validate its effectiveness in the real world, our algorithm is deployed on a real automated vehicle at a two-way, two-lane intersection. The system composition is shown in Fig. 6. The test vehicle is a Chang-An CS55 SUV equipped with an industrial computer serving as the vehicle controller, and the state of the ego vehicle is obtained from RTK positioning and the CAN bus. Specially, we adopt a digital twin system to generate different driving modes of the surrounding vehicles; it is a simulator developed by 51WORLD Co., Ltd. that can synchronize with the real-world system in real time. The experimental intersection is located at (36°1802000N, 120°4002500E) in Suzhou, China. The pipeline of the real road test is as follows. First, we construct the training environment in accordance with the real road conditions and train a policy network by GEP. Then the policy is applied online. At each time step, the ego vehicle state is gathered from RTK and CAN and mapped into the digital twin system, from which we collect the states of the surrounding vehicles. Both are then sent to the on-board computer, in which the trained policy is embedded, to calculate the optimal and safe steering angle and acceleration. Finally, these actions are delivered through the CAN bus to actuate the automated vehicle.
Fig. 6: Real road test system
A. Functionality verification
Specifically, we design 9 typical cases of surrounding-vehicle behavior, 3 cases for each task. Fig. 7 shows the control performance of the trained policy for the ego vehicle under the different behaviors of the surrounding vehicles. The steering and velocity profiles show that our policy has learned reasonable collision-avoidance behaviors such as accelerating (2nd right-turn case), decelerating (2nd left-turn case), pulling up (1st left-turn case) and turning
(2nd straight-go case). All the videos are available online (https://youtu.be/XQOewggafjc).

Fig. 7: Visualization of control under different tasks and scenarios. (a)–(c) 1st–3rd left-turn cases; (d)–(f) 1st–3rd straight-go cases; (g)–(i) 1st–3rd right-turn cases.
B. Robustness to noise
This experiment tests the robustness of the trained policy under different levels of manually added noise. To that end, we design 7 noise levels, where each item is Gaussian white noise with a different standard deviation, as shown in TABLE III. Noise is applied to different dimensions of the state in the first left-turn case, and we observe its effect on the output actions, as shown in Fig. 8. We can conclude that our method is rather robust to low-level noise (levels 0–3), because the action distributions under these noise levels show no significant change. Although the variation of the actions is inevitably enlarged as more noise is added, the bounds of the featured values always remain within a reasonable range, demonstrating the robustness of our algorithm.
TABLE III: Noise level and the corresponding standard deviation

| Noise level | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| δ_p [m] | 0 | 0.017 | 0.033 | 0.051 | 0.068 | 0.085 | 0.102 |
| δ_φ [°] | 0 | 0.017 | 0.033 | 0.051 | 0.068 | 0.085 | 0.102 |
| p_x^j, p_y^j [m] | 0 | 0.05 | 0.10 | 0.15 | 0.20 | 0.25 | 0.30 |
| v_lon^j [m/s] | 0 | 0.05 | 0.10 | 0.15 | 0.20 | 0.25 | 0.30 |
| φ^j [°] | 0 | 1.4 | 2.8 | 4.2 | 5.6 | 7.0 | 8.4 |
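The noise levels in TABLE III translate to zero-mean Gaussian perturbations on individual observed quantities before they are fed to the policy. A small sketch of how level-k noise could be injected (the key names and grouping are assumptions):

```python
import numpy as np

# Standard deviations per noise level (TABLE III), one row per perturbed quantity.
NOISE_STD = {
    "ego_position": [0.0, 0.017, 0.033, 0.051, 0.068, 0.085, 0.102],  # [m]
    "ego_heading":  [0.0, 0.017, 0.033, 0.051, 0.068, 0.085, 0.102],  # [deg]
    "veh_position": [0.0, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30],        # [m]
    "veh_speed":    [0.0, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30],        # [m/s]
    "veh_heading":  [0.0, 1.4, 2.8, 4.2, 5.6, 7.0, 8.4],              # [deg]
}

def perturb(value, key, level, rng=np.random.default_rng()):
    """Add zero-mean Gaussian noise of the given level to one observed quantity."""
    return value + rng.normal(0.0, NOISE_STD[key][level], size=np.shape(value))
```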
Fig. 8: Action variation under different noise levels. (a) Acceleration. (b) Steering.
C. Robustness to human interference
This experiment verifies the ability of our algorithm to cope with human disturbances. We again use the first left-turn case, during which we intervene twice on the steering wheel, as shown in Fig. 9(a), with the ego vehicle heading plotted in Fig. 9(b). The first intervention is performed at 6 s, when the ego vehicle is still outside the intersection, and the steering wheel is turned to the left manually to make it turn in advance. The second happens at 17 s, when the ego vehicle is turning left to pass the crossroad, and we turn the steering wheel to the right to interrupt the process. We also plot the vehicle heading and actions in Fig. 9, where the two colored regions mark the periods during which the human disturbance is applied. After the first intervention, our policy turns the steering
wheel to the right immediately to correct the excessive ego heading. For the second intervention, the driving system turns the steering wheel to the left immediately to continue and complete the left-turn maneuver. To sum up, the trained policy is capable of handling human disturbances on the steering wheel through quick responses after taking back control.
Fig. 9: Robustness to human interference. (a) Disturbance position. (b) Vehicle heading. (c) Acceleration. (d) Steering.
VI. CONCLUSIONS
In this paper, we focus on the complex intersection scenario and propose the integrated decision and control framework to obtain safe and computationally efficient driving performance. Specially, a model-based reinforcement learning algorithm is developed to approximately find an optimal control policy offline, and its online application can meet the real-time requirements of the on-board computer. Furthermore, we apply our algorithm at a real-world intersection to verify its effectiveness. Results show that the trained policy can output control commands within 10 ms and is resistant to disruptions from perception noise and human interference. This study provides a promising approach to using reinforcement learning for complex driving tasks, wherein safety can be strictly assured and high computational efficiency is obtained. Regarding future work, we intend to design a learning method to choose the optimal path, replacing the current rule-based version. Besides, we will consider other traffic participants, such as pedestrians and bicycles, with more complex behaviors.
REFERENCES
[1] R. Tay, “A random parameters probit model of urban and rural
intersection crashes,” Accident Analysis & Prevention, vol. 84, pp.
38–40, 2015.
[2] J. Duan, S. E. Li, Y. Guan, Q. Sun, and B. Cheng, “Hierarchical reinforcement learning for self-driving decision-making without reliance on labeled driving data,” IET Intelligent Transport Systems, 2019.
[3] B. Paden, M. Čáp, S. Z. Yong, D. Yershov, and E. Frazzoli, “A survey of motion planning and control techniques for self-driving urban vehicles,” IEEE Transactions on Intelligent Vehicles, vol. 1, no. 1, pp. 33–55, 2016.
[4] J. Chen, B. Yuan, and M. Tomizuka, “Model-free deep reinforcement
learning for urban autonomous driving,” in 2019 IEEE Intelligent
Transportation Systems Conference (ITSC), 2019, pp. 2765–2771.
[5] L. Hou, L. Xin, S. E. Li, B. Cheng, and W. Wang, “Interactive
trajectory prediction of surrounding road users for autonomous driv-
ing using structural-lstm network,” IEEE Transactions on Intelligent
Transportation Systems, vol. 21, no. 11, pp. 4615–4625, 2020.
[6] R. Kala and K. Warwick, “Motion planning of autonomous vehicles in a non-autonomous vehicle environment without speed lanes,” Engineering Applications of Artificial Intelligence, vol. 26, no. 5-6, pp. 1588–1601, 2013.
[7] S. Karaman and E. Frazzoli, “Sampling-based algorithms for optimal
motion planning,” The international journal of robotics research,
vol. 30, no. 7, pp. 846–894, 2011.
[8] E. Schmerling, L. Janson, and M. Pavone, “Optimal sampling-based
motion planning under differential constraints: The driftless case,”
in 2015 IEEE International Conference on Robotics and Automation
(ICRA), 2015, pp. 2368–2375.
[9] S. Karaman and E. Frazzoli, “Sampling-based optimal motion planning
for non-holonomic dynamical systems,” in 2013 IEEE International
Conference on Robotics and Automation, 2013, pp. 5041–5047.
[10] Y. Li, Z. Littlefield, and K. E. Bekris, “Sparse methods for efficient
asymptotically optimal kinodynamic planning,” in Algorithmic foun-
dations of robotics XI. Springer, 2015, pp. 263–282.
[11] L. Xin, Y. Kong, S. E. Li, J. Chen, Y. Guan, M. Tomizuka, and
B. Cheng, “Enable faster and smoother spatio-temporal trajectory
planning for autonomous vehicles in constrained dynamic environ-
ment,” Proceedings of the Institution of Mechanical Engineers, Part
D: Journal of Automobile Engineering, vol. 235, no. 4, pp. 1101–1112,
2021.
[12] S. Aradi, “Survey of deep reinforcement learning for motion planning
of autonomous vehicles,” IEEE Transactions on Intelligent Transporta-
tion Systems, 2020.
[13] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, 2017.
[14] S. E. Li, “Reinforcement learning and control,” Tsinghua University: Lecture Notes, 2020. http://www.idlab-tsinghua.com/thulab/labweb/publications.html.
[15] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley,
D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep rein-
forcement learning,” in International conference on machine learning.
PMLR, 2016, pp. 1928–1937.
[16] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “Carla:
An open urban driving simulator,” in Conference on robot learning.
PMLR, 2017, pp. 1–16.
[17] Y. Guan, Y. Ren, S. E. Li, Q. Sun, L. Luo, and K. Li, “Centralized
cooperation for connected and automated vehicles at intersections
by proximal policy optimization,” IEEE Transactions on Vehicular
Technology, vol. 69, no. 11, pp. 12 597–12 608, 2020.
[18] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[19] G. Li, S. Li, S. E. Li, Y. Qin, D. Cao, X. Qu, and B. Cheng, “Deep re-
inforcement learning enabled decision-making for autonomous driving
at intersections,” Automotive Innovation, vol. 3, no. 4, pp. 374–385,
2020.
[20] P. Lopez, M. Behrisch, L. Bieker-Walz, J. Erdmann, Y.-P. Flötteröd, R. Hilbrich, L. Lücken et al., “Microscopic traffic simulation using SUMO,” in International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018.
[21] J. Andersson, J. Gillis, G. Horn, J. B. Rawlings, and M. Diehl,
“CasADi-A software framework for nonlinear optimization and opti-
mal control,” Mathematical Programming Computation, vol. 11, no. 1,
pp. 1–36, 2019.
[22] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic:
Off-policy maximum entropy deep reinforcement learning with a
stochastic actor,” in Proceedings of the 35th International Conference
on Machine Learning. PMLR, 2018, pp. 1861–1870.