Centralized Conflict-free Cooperation for Connected
and Automated Vehicles at Intersections by
Proximal Policy Optimization
Yang Guan, Yangang Ren, Shengbo Eben Li*, Qi Sun, Laiquan Luo, Koji Taguchi, and Keqiang Li
Abstract—Connected vehicles will change the modes of future transportation management and organization, especially at intersections. There are mainly two categories of coordination methods at unsignalized intersections, i.e. centralized and distributed methods. Centralized coordination methods need huge computation resources, since a single centralized controller optimizes the trajectories of all approaching vehicles, while in distributed methods each approaching vehicle owns an individual controller that optimizes its trajectory considering the motion information and the conflict relationships with its neighboring vehicles, which avoids the heavy computation but needs sophisticated manual design. In this paper, we propose a centralized conflict-free cooperation method for multiple connected vehicles at unsignalized intersections using reinforcement learning (RL), which naturally relieves the online computation burden by training offline. We first incorporate a prior model into the proximal policy optimization (PPO) algorithm to accelerate the learning process. Then we present the design of the state, action and reward to formulate centralized cooperation as an RL problem. Finally, we train a coordination policy with our model-accelerated PPO (MA-PPO) in a simulation setting and analyze the results. The results show that the proposed method improves the traffic efficiency of the intersection while ensuring no collision.
Index Terms—Connected and automated vehicles, centralized coordination method, reinforcement learning, traffic intersection
I. INTRODUCTION
The increasing demand for mobility poses great challenges to road transport. Connected and automated vehicles are attracting extensive attention due to their potential to benefit traffic safety, efficiency, and economy [1]. A widely studied, but simplified, form of connected vehicle cooperation is the platoon control system on highways. Platoon control aims to ensure that a group of connected vehicles in the same lane move at a harmonized longitudinal speed while maintaining desired inter-vehicle spaces [2]–[4]. As a typical scenario in urban areas, the intersection is more complex and challenging for multi-vehicle coordination than the highway. At an intersection, vehicles enter from different entrances, cross their specific trajectories in the intersection zone, and leave the intersection at different exits.
Y. Guan, Y. Ren, S. E. Li, Q. Sun, L. Luo and K. Li are with the State Key Lab of Automotive Safety and Energy, School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China. They are also with the Center for Intelligent Connected Vehicles and Transportation, Tsinghua University, Beijing, China. Email: (guany17, ryg18, qisun)@mails.tsinghua.edu.cn; luolaiquan@cau.edu.cn; (lishbo, likq)@tsinghua.edu.cn
The complex conflict relationships between vehicles result in complicated vehicle decisions to avoid collisions at the intersection. Hence, a complicated design is needed to guarantee traffic safety while improving traffic efficiency. To resolve multi-vehicle coordination at intersections, several studies focus on traffic signal design schemes. Goodall et al. (2013) developed a decentralized fully adaptive traffic control algorithm to optimize traffic signal timing [5]. Feng et al. (2015) presented a real-time adaptive signal phase allocation algorithm using connected vehicle data, which optimizes the phase sequence and duration by solving a two-level optimization problem [6]. These traffic-signal-based coordination methods can ensure traffic safety, but they may result in inefficient intersection management. Hence, researchers have started to focus on signal-free methods for intersection coordination.
Currently, there are mainly two types of methods to handle coordination at unsignalized intersections, i.e. centralized and distributed coordination methods. Centralized coordination methods utilize the global information of the whole intersection to centrally control every vehicle at the intersection. Dresner and Stone (2008) treated drivers and intersections as autonomous agents in a multi-agent system and built a new reservation-based approach around a detailed communication protocol [7]. Lee and Park (2012) eliminated potential overlaps of vehicular trajectories coming from all conflicting approaches at the intersection, then sought a safe maneuver for every vehicle approaching the intersection and manipulated each of them [8]. Dai et al. (2016) formulated an intersection control model and transformed it into a convex optimization problem, with consideration of safety and efficiency [9]. However, these centralized coordination methods suffer from a huge computation requirement, since they coordinate the approaching vehicles by optimizing all of their trajectories with a single centralized controller.
In distributed coordination methods, there is no central controller; instead, a distributed controller in each approaching vehicle optimizes its own trajectory considering the motion and conflict relationships with its neighboring vehicles. Ahmane et al. proposed a model based on Timed Petri Nets with Multipliers (TPNM) and used it to design the control policy through structural analysis [10]. Xu et al. proposed a conflict-free geometry topology and a communication topology to transform the two-dimensional vehicle cluster at the intersection into a one-dimensional virtual vehicle platoon and eventually designed a distributed feedback controller [11]. Distributed coordination methods relieve the huge computation requirement through distributed computation; however, they need careful design of a sophisticated dynamic model and complicated communication relationships.
One of the most fundamental goals in artificial intelligence is how to learn a new skill, especially from high-dimensional sensor input. Reinforcement learning (RL) gradually learns a better policy from trial-and-error interaction with the environment, which is highly similar to how humans learn and has the potential to address a large number of complex problems [12]. Recently, significant progress has been made on a variety of problems by combining advances in deep learning and RL. Mnih et al. (2015) proposed the Deep Q-learning Network (DQN) and attained human-level performance on Atari video games with raw pixels as input [13]. Silver et al. used RL and tree search methods to conquer the game of Go, producing the famous programs AlphaGo and AlphaGo Zero and defeating the best human champions [14], [15]. Considering that DQN is only suitable for problems with discrete action spaces, the Deep Deterministic Policy Gradient (DDPG) algorithm was proposed to solve continuous control problems [16]. As vanilla policy gradient methods suffer from poor data efficiency and robustness, Trust Region Policy Optimization (TRPO) was proposed [17]. However, TRPO is not compatible with architectures that include noise and rarely implements parameter sharing between the policy and value function. Proximal Policy Optimization (PPO) was proposed as an updated version of TRPO, which alternates between sampling data through interaction and optimizing a surrogate objective function using stochastic gradient ascent [18].
RL is poised to help conquer the autonomous driving problem because of its super-human potential. Most existing RL research on autonomous driving focuses on the intelligence of a single vehicle driving in relatively simple traffic scenarios. DQN was initially used to control the high-frequency discrete steering actions of a vehicle [19], [20]. After the Asynchronous Advantage Actor-Critic (A3C) method was proposed, some studies adopted this framework to accelerate learning and maintain training stability [21]–[23]. Due to its advantage in long-horizon credit assignment, hierarchical reinforcement learning was used for both high-level maneuver selection and low-level motion control in the decision making of self-driving cars [24]. Besides, other researchers successfully applied DDPG to autonomous driving, realizing control of continuous acceleration, steering and braking actions [25], [26].
In this paper, we employ RL as our method for the centralized control of multiple connected vehicles to realize autonomous collision-free passing at unsignalized intersections. This is realized by first formulating the state space, action space and reward function in the RL framework, and then training the policy with a distributed PPO algorithm. Besides, to enhance sample efficiency and accelerate the training process, we incorporate a prior model into the PPO algorithm. Since we train a centralized controller, there is no need to elaborately design the complex components used by distributed controllers. Moreover, RL trains offline and infers online, so it naturally unloads the online computation burden. Our results show that the learned policy is able to increase driving safety and traffic efficiency at intersections.
The rest of this paper is organized as follows. Section II introduces preliminaries of Markov decision processes and policy gradient methods. Section III proposes model-accelerated PPO (MA-PPO), which is an improvement based on the original PPO and model-based RL methods. Section IV presents our problem statement and formulation, and Section V describes the experimental settings and results. Finally, Section VI summarizes this work.
II. PRELIMINARIES OF RL
Consider an infinite-horizon discounted Markov decision process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, p, r, d_0, \gamma)$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $p: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the transition probability distribution, $r: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the reward function, $d_0: \mathcal{S} \to \mathbb{R}$ is the distribution of the initial state $s_0$, and $\gamma \in (0, 1)$ is the discount factor.

Let $\pi$ denote a stochastic policy $\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]$. We seek to learn the optimal policy $\pi^*$, which has the maximum value function $v_\pi(s)$ for all $s \in \mathcal{S}$, where the value function $v_\pi(s)$ is the expected sum of discounted rewards from a state when following policy $\pi$:

$$v_\pi(s) := \mathbb{E}_{a_t, s_{t+1}, \ldots}\Big\{ \sum_{l=t}^{\infty} \gamma^{l-t} r_l \,\Big|\, s_t = s \Big\} \tag{1}$$

where $a_t \sim \pi(a_t|s_t)$, $s_{t+1} \sim p(s_{t+1}|s_t, a_t)$, and $r_t := r(s_t, a_t, s_{t+1})$ for short. Similarly, we use the following standard definition of the state-action value function $q_\pi(s, a)$:

$$q_\pi(s, a) := \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\Big\{ \sum_{l=t}^{\infty} \gamma^{l-t} r_l \,\Big|\, s_t = s, a_t = a \Big\} \tag{2}$$
A. Vanilla policy gradient

In practice, finding the optimal policy for every state is impractical for large state spaces, so we consider parameterized policies $\pi_\theta(a|s)$ with parameter vector $\theta$. For the same reason, the state-value function is parameterized as $V(s, w)$ with parameter vector $w$.

Policy optimization methods seek to find the optimal $\theta$ which maximizes the average performance of policy $\pi_\theta$,

$$J(\theta) = \mathbb{E}_{s_0, a_0, \ldots}\Big\{ \sum_{t=0}^{\infty} \gamma^t r_t \Big\} \tag{3}$$

where $s_0 \sim d_0(s_0)$, $a_t \sim \pi_\theta(a_t|s_t)$, $s_{t+1} \sim p(s_{t+1}|s_t, a_t)$, and $r_t := r(s_t, a_t, s_{t+1})$.
Vanilla methods optimize (3) by the stochastic policy gradient [27], whose gradient is shown in (4):

$$\nabla_\theta J(\theta) = \sum_{s} d_\gamma(s) \sum_{a} \nabla_\theta \pi_\theta(a|s)\, q_{\pi_\theta}(s, a) \tag{4}$$

$$d_\gamma(s) = \sum_{t=0}^{\infty} \gamma^t d_t(s|\pi_\theta) \tag{5}$$

where $d_t(s|\pi_\theta)$ is the state distribution at time $t$, and $d_\gamma(s)$ is called the discounted visiting frequency, which in practice is usually replaced with the stationary state distribution under $\pi_\theta$, denoted by $d_{\pi_\theta}$ [28]. Combined with the likelihood ratio and baseline techniques [12], we can write (4) in expectation form:

$$\nabla_\theta J(\theta) \propto \mathbb{E}_{s \sim d_{\pi_\theta},\, a \sim \pi_\theta}\big\{ \nabla_\theta \log \pi_\theta(a|s)\, A_{\pi_\theta}(s, a) \big\} \tag{6}$$

where $A_{\pi_\theta}(s, a) := q_{\pi_\theta}(s, a) - v_{\pi_\theta}(s)$ is the advantage function, which can be estimated by several methods [29].
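As a concrete illustration of (6), the following minimal sketch (ours, not from the paper; PyTorch assumed) builds a surrogate loss whose gradient matches the advantage-weighted score function; the function and tensor names are hypothetical.

```python
import torch

def vanilla_pg_loss(log_probs, advantages):
    # Surrogate loss whose gradient matches Eq. (6):
    # grad_theta J ∝ E[ grad_theta log pi_theta(a|s) * A_pi(s, a) ].
    # log_probs:  log pi_theta(a_t|s_t) of the sampled actions, shape [T]
    # advantages: advantage estimates, treated as constants,    shape [T]
    return -(log_probs * advantages.detach()).mean()
```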
B. Trust region method

While the vanilla policy gradient is simple to implement, it often leads to destructively large policy updates. TRPO optimizes a lower bound of (3), i.e. (7), to guarantee performance improvement:

$$\underset{\theta}{\text{maximize}} \;\; \mathbb{E}_{s,a}\Big[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} A_{\pi_{\theta_{\text{old}}}} - \beta\, \text{KL}\big[ \pi_{\theta_{\text{old}}}(\cdot|s), \pi_\theta(\cdot|s) \big] \Big] \tag{7}$$

However, it is hard to choose a single value of $\beta$ that performs well across different problems, so TRPO uses the constrained form shown in (8) instead:

$$\begin{aligned}
\underset{\theta}{\text{maximize}} \;\;& \mathbb{E}_{s,a}\Big[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} A_{\pi_{\theta_{\text{old}}}} \Big] \\
\text{s.t.} \;\;& \mathbb{E}_s\big[ D_{\text{KL}}\big( \pi_{\theta_{\text{old}}}(\cdot|s) \,\|\, \pi_\theta(\cdot|s) \big) \big] \le \delta
\end{aligned} \tag{8}$$

where we denote $\mathbb{E}_{s,a}[\ldots] := \mathbb{E}_{s \sim d_{\pi_{\theta_{\text{old}}}},\, a \sim \pi_{\theta_{\text{old}}}}[\ldots]$, $\mathbb{E}_s[\ldots] := \mathbb{E}_{s \sim d_{\pi_{\theta_{\text{old}}}}}[\ldots]$ and $A_{\pi_{\theta_{\text{old}}}} := A_{\pi_{\theta_{\text{old}}}}(s, a)$. TRPO can be regarded as a natural policy gradient method [30]. It finds the steepest policy gradient in the Fisher-matrix-normed space rather than in Euclidean space, which helps to reduce the impact of the policy parameterization when calculating the gradient and stabilizes the learning process.
III. MODEL ACCELERATED PPO

A. Proximal policy optimization

In this paper, the PPO algorithm is employed as our baseline. It is inspired by TRPO and has two main differences, i.e. an unconstrained surrogate objective function and generalized advantage estimation.

1) Unconstrained surrogate objective function: The theory behind TRPO shows that a stable policy update requires punishing policy deviation on top of the unconstrained optimization (9),

$$\underset{\theta}{\text{maximize}} \;\; \mathbb{E}_{s,a}\Big[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} A_{\pi_{\theta_{\text{old}}}} \Big] \tag{9}$$

PPO alternatively constructs a subtle lower bound of (9) that directly removes the incentive for the policy distribution to deviate too much. Its objective is (10):

$$J_{\text{ppo}}(\theta) = \mathbb{E}_{s,a}\big[ \min\big( \rho_\theta A_{\pi_{\theta_{\text{old}}}},\ \text{clip}\big( \rho_\theta, 1-\epsilon, 1+\epsilon \big) A_{\pi_{\theta_{\text{old}}}} \big) \big] \tag{10}$$

where $\rho_\theta = \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}$ is the probability ratio. When $A_{\pi_{\theta_{\text{old}}}} > 0$, under objective (9) $\rho_\theta$ tends to become much larger than 1 to make the objective as large as possible, which leads to unstable learning; the PPO objective (10) removes this incentive by clipping $\rho_\theta$ within $[1-\epsilon, 1+\epsilon]$ and taking the minimum of the unclipped objective (9) and the clipped term. The same holds when $A_{\pi_{\theta_{\text{old}}}} < 0$.
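For illustration, the clipped surrogate (10) can be written as a loss to minimize, as in the minimal sketch below; the function name, argument names and PyTorch usage are our assumptions, not the paper's code.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    # rho_theta = pi_theta(a|s) / pi_theta_old(a|s), computed in log space.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Minimizing this loss maximizes the clipped objective of Eq. (10).
    return -torch.min(unclipped, clipped).mean()
```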
2) Generalized advantage estimation: The advantage function is necessary for the policy gradient calculation, and it can be estimated by (11):

$$\hat{A}_\pi(s, a) = \hat{Q}_\pi(s, a) - V(s, w) \tag{11}$$

where $\hat{Q}_\pi(s, a)$ is the action-value function estimated from samples of $\pi$, and $V(s, w)$ is the state-value function approximation. TRPO and A3C use the Monte-Carlo method to construct $\hat{Q}_\pi(s, a)$ as in (12):

$$\hat{Q}_\pi(s_t, a_t) = \sum_{l=t}^{\infty} \gamma^{l-t} r_l \tag{12}$$

It is an unbiased estimate of $q_\pi(s, a)$, but suffers from high variance. Actor-critic methods use the one-step TD target to form $\hat{Q}_\pi(s, a)$ as in (13),

$$\hat{Q}_\pi(s_t, a_t) = r_t + \gamma V(s_{t+1}, w) \tag{13}$$

which has low variance but is biased. Generalized advantage estimation is essentially the same as TD($\lambda$), except that it uses a linear combination of $n$-step TD targets to estimate $q_\pi$ instead of $v_\pi$. The backward view of TD($\lambda$) is shown in (14),

$$\hat{Q}_\pi(s_t, a_t) = \sum_{l=t}^{\infty} (\gamma\lambda)^{l-t} \delta_l \tag{14}$$

where $\delta_l$ is the TD error,

$$\delta_l = r_l + \gamma V(s_{l+1}, w) - V(s_l, w) \tag{15}$$
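One standard way to compute the sum of discounted TD errors in (14)–(15) over a finite rollout, used as the advantage estimate in Algorithm 1, is the backward recursion sketched below; this is our illustration, assuming a single segment bootstrapped with the value of the last state, not the authors' implementation.

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    # rewards: r_0..r_{T-1}; values: V(s_0)..V(s_{T-1}); last_value: V(s_T).
    T = len(rewards)
    v = np.append(values, last_value)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * v[t + 1] - v[t]   # TD error, Eq. (15)
        running = delta + gamma * lam * running        # accumulates Eq. (14)
        adv[t] = running
    targets = adv + v[:-1]        # TD(lambda) targets, used to fit V(s, w)
    return adv, targets
```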
Compared to TRPO, PPO is much simpler and faster to implement because it only involves first-order optimization, and it has better convergence due to the use of generalized advantage estimation. However, PPO is an on-policy method and inevitably has high sample complexity.
B. Model-based RL
Recent model-free reinforcement learning algorithms have proposed incorporating given or learned dynamics models as a source of additional data, with the intention of reducing sample complexity. Generally, there are two ways to use a model: value gradient methods, and using the model for imagination.

Value gradient methods link together the policy, model, and reward function to compute an analytic policy gradient by backpropagation of the reward along a trajectory [31]–[33]. A major limitation of this approach is that the dynamic model can only be used to retrieve information already present in the observed data, and albeit with lower variance, the actual improvement in efficiency is relatively small. Alternatively, the given or learned model can also be used for imagination rollouts. This usage can be naturally incorporated into the model-free RL framework; however, learned models suffer from overfitting to the experience data, which leads to large errors over long horizons [34], [35].
C. Model accelerated PPO (MA-PPO)
PPO is a model-free, on-policy RL algorithm. Model-free means it knows nothing about the environment and can only learn from interactions with it. As a result, it inevitably requires a large amount of experience data despite its excellent final performance. Besides, the training speed is limited by the interaction with the real world or a simulator. Even worse, the on-policy property makes experience produced by previously trained policies useless, which aggravates the sample inefficiency. This is our motivation to accelerate PPO.

Basically, there are two ways to reduce sample complexity. The first one is incorporating off-policy data into the learning process, i.e. reusing experience gathered during training. The second one is being given, or learning, a dynamic model. We claim that off-policy data cannot be used here due to state distribution mismatch. Assume that the off-policy data are generated by another policy $b$; we rewrite the PPO objective (10) as (16):

$$J_{\text{ppo}}(\theta) = \mathbb{E}_{s \sim d_b,\, a \sim b}\Big[ \frac{d_{\pi_{\theta_{\text{old}}}}(s)\, \pi_{\theta_{\text{old}}}(a|s)}{d_b(s)\, b(a|s)} \min\big( \rho(\theta) A_{\pi_{\theta_{\text{old}}}},\ \text{clip}\big(\rho(\theta), 1-\epsilon, 1+\epsilon\big) A_{\pi_{\theta_{\text{old}}}} \big) \Big] \tag{16}$$

To acquire a correct gradient from off-policy data, we need to correct not only the action distribution through the action probability ratio $\frac{\pi_{\theta_{\text{old}}}(a|s)}{b(a|s)}$, but also the state distribution through the stationary probability ratio $\frac{d_{\pi_{\theta_{\text{old}}}}(s)}{d_b(s)}$. However, the stationary probability ratio is hard to estimate, which leads to a distribution mismatch and hinders the use of off-policy data both in theory and in practice. As a result, we only employ a model to accelerate PPO.

In the field of centralized control at intersections, a dynamic model is available from human prior knowledge, so we construct a model ourselves rather than learning one. To combine the model with the PPO algorithm naturally, we employ the second type of model usage from Section III-B, i.e. imagination rollouts. MA-PPO is shown in Algorithm 1.
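To make the control flow of Algorithm 1 concrete, the skeleton below sketches its outer loop in Python; the callables env_rollout, model_rollout and ppo_update are hypothetical placeholders for the rollout collection and the clipped-surrogate/critic updates described above, so this is only our structural reading, not the authors' implementation.

```python
def ma_ppo(env_rollout, model_rollout, ppo_update, M=1, U=10, iterations=1000):
    # env_rollout() / model_rollout(): return a batch of transitions with
    # advantages and TD(lambda) targets collected under the current policy.
    # ppo_update(batch): one epoch of clipped-surrogate and critic updates.
    for _ in range(iterations):
        batch = env_rollout()                 # T timesteps from the simulation
        for _ in range(U):
            ppo_update(batch)
        for _ in range(M):                    # model-generated inner loop
            model_batch = model_rollout()     # T_model timesteps from the prior model
            for _ in range(U):
                ppo_update(model_batch)
```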
IV. PROBLEM STATEMENT AND FORMULATION

A. Problem statement

In this paper, we focus on a typical 4-direction intersection shown in Fig. 1. Each direction is denoted by its location in the figure, i.e. up (U), down (D), left (L) and right (R). We only focus on vehicles within a certain distance from the intersection center. The intersection is unsignalized, and each entrance or exit is assumed to have only one lane; as a result, there are 4 entrances in total. A vehicle at each entrance is allowed to turn right, go straight or turn left. Thus there are 12 types of vehicles, denoted by their entrance and exit, i.e. DR, DU, DL, RU, RL, RD, LD, LR, LU, UL, UD, and UR. Their numbers and meanings are listed in Table I. All possible conflict relations are also illustrated in the figure, which can be categorized into three classes: crossing conflicts (denoted by red dots), converging conflicts (denoted by purple dots), and diverging conflicts (denoted by pink dots). To simplify our problem, we choose 8 vehicle modes out of the 12 to cover the main conflict modes in our experiment. The 8 modes are DR, DL, RU, RL, LD, LU, UL and UD, as shown in Fig. 2. From the figure, we can summarize the typical modes of conflict they contain, which are shown in Fig. 3.
Algorithm 1 MA-PPO

Randomly initialize the critic network $V(s, w)$ and the actor $\pi_\theta(a|s)$ with weights $w$ and $\theta$; set $\lambda$, $\gamma$, $\epsilon$, total timesteps $T_{\text{total}}$, inner iteration $M$, batch size $T$, minibatch size $B$, and epoch $U$
for iteration = 1, ..., $T_{\text{total}}/T$ do
    Run policy $\pi_\theta$ in the environment for $T$ timesteps, collecting $(s_t, a_t, r_t)$
    Estimate advantages $\hat{A}_t = \sum_{l=t}^{\text{end of episode}} (\lambda\gamma)^{l-t} \delta_l$
    Estimate TD($\lambda$) targets $\hat{V}_t = \hat{A}_t + V(s_t, w)$
    $\pi_{\theta_{\text{old}}} \leftarrow \pi_\theta$
    for epoch = 1, ..., $U$ do
        $J_{\text{PPO}}(\theta) = \sum_{t=1}^{T} \min\big( \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)} \hat{A}_t,\ \text{clip}\big(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}, 1-\epsilon, 1+\epsilon\big) \hat{A}_t \big)$
        Update $\theta$ by a gradient method w.r.t. $J_{\text{PPO}}(\theta)$
        $J_{\text{Critic}}(w) = \sum_{t=1}^{T} \big(\hat{V}_t - V(s_t, w)\big)^2$
        Update $w$ by a gradient method w.r.t. $J_{\text{Critic}}(w)$
    end for
    for model iteration = 1, ..., $M$ do
        Run policy $\pi_\theta$ in the model for $T_{\text{model}}$ timesteps, collecting $(\hat{s}_t, \hat{a}_t, \hat{r}_t)$
        Estimate advantages $\hat{A}_t = \sum_{l=t}^{\text{end of episode}} (\lambda\gamma)^{l-t} \hat{\delta}_l$
        Estimate TD($\lambda$) targets $\hat{V}_t = \hat{A}_t + V(s_t, w)$
        $\pi_{\theta_{\text{old}}} \leftarrow \pi_\theta$
        for epoch = 1, ..., $U$ do
            Update $\theta$ by a gradient method w.r.t. $J_{\text{PPO}}(\theta)$
            Update $w$ by a gradient method w.r.t. $J_{\text{Critic}}(w)$
        end for
    end for
end for
We adopt the following assumptions. First, all vehicles are equipped with positioning and velocity devices so that we can gather location and movement information when they enter the zone of interest around the intersection. Second, all approaching vehicles are assumed to be automated vehicles so that they can strictly follow the desired acceleration, control their speed, and pass the intersection automatically. Additionally, there is at most one vehicle of each type in each entrance lane, but the order of the different types is stochastic.

TABLE I
Different types of vehicles

Type  Number  Meaning
DR    1   From Down, turn right to Right
DU    2   From Down, go straight to Up
DL    3   From Down, turn left to Left
RU    4   From Right, turn right to Up
RL    5   From Right, go straight to Left
RD    6   From Right, turn left to Down
LD    7   From Left, turn right to Down
LR    8   From Left, go straight to Right
LU    9   From Left, turn left to Up
UL    10  From Up, turn right to Left
UD    11  From Up, go straight to Down
UR    12  From Up, turn left to Right
B. RL formulation

We are now ready to transform our problem into an RL problem by defining the state space, action space and reward function, which are the basic elements of RL.

1) State and action space: By our assumption, we need to control at most 8 vehicles at a time, i.e. two vehicles of different types at each entrance. Vector forms are used for both the state and the action, which are respectively the concatenations of each vehicle's state and control in a fixed order, as shown in (17):

$$\begin{aligned}
s &= [s_{\text{DR}}, s_{\text{DL}}, s_{\text{RU}}, s_{\text{RL}}, s_{\text{LD}}, s_{\text{LU}}, s_{\text{UL}}, s_{\text{UD}}] \\
a &= [a_{\text{DR}}, a_{\text{DL}}, a_{\text{RU}}, a_{\text{RL}}, a_{\text{LD}}, a_{\text{LU}}, a_{\text{UL}}, a_{\text{UD}}]
\end{aligned} \tag{17}$$

where $s_*$ and $a_*$ denote the state and action of the vehicle of type $*$.
[Fig. 1: Intersection scenario settings — a 4-direction intersection with entrances Up (U), Down (D), Left (L) and Right (R), and the 12 vehicle types numbered 1–12.]

[Fig. 2: Vehicle modes chosen for the experiment — DR, DL, RU, RL, LD, LU, UL and UD.]
[Fig. 3: Typical modes of conflict in the experiment — crossing, converging, diverging, and no conflict.]

The state of each type should contain the position and velocity information of the corresponding vehicle. Intuitively, we could form the state as a tuple of coordinates and velocity, i.e. $(x, y, v)$, where $(x, y)$ is the coordinate of the vehicle's position and $v$ is its velocity. However, by our task formulation, every vehicle has a fixed path corresponding to its type, so this formulation would carry redundant information. Besides, for a continuous state it is necessary to reduce the state space dimension to speed up learning and enhance stability. Observing that all paths cross the intersection, we further compress the state of each vehicle to $(d, v)$, where $d$ is the distance between the vehicle and the center point of its path. Note that $d$ is positive when the vehicle is heading toward the center and negative when it is leaving. The state formulation is shown in Fig. 4.
For the action space, we choose the acceleration of each vehicle. In total, a 16-dimensional state space and an 8-dimensional action space are constructed.
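As an illustration only, the sketch below shows one way the state and action vectors of (17) could be assembled and split; the fixed ordering constant and helper names are our own, not from the paper.

```python
import numpy as np

VEHICLE_ORDER = ["DR", "DL", "RU", "RL", "LD", "LU", "UL", "UD"]

def build_state(per_vehicle):
    # per_vehicle: dict mapping type -> (d, v), where d is the signed distance
    # to the center point of that vehicle's path. Returns a shape-(16,) vector.
    return np.array([x for t in VEHICLE_ORDER for x in per_vehicle[t]],
                    dtype=np.float32)

def split_action(action):
    # action: shape-(8,) acceleration vector -> per-vehicle accelerations.
    return {t: float(a) for t, a in zip(VEHICLE_ORDER, action)}
```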
2) Reward settings: The reward function is designed with consideration of safety, efficiency and task completion. First of all, the task is designed in an episodic manner, in which two types of termination are given: a collision happens or all vehicles pass the intersection. To avoid collisions, a large negative reward is given if one happens. To enhance efficiency, a minor negative step reward is given at every time step. To encourage task completion, a positive reward is given whenever some vehicle passes the intersection, and a large positive reward is given when all vehicles have passed the intersection. All reward settings are listed in Table II.

TABLE II
Reward settings

Reward item            Reward
Collision              -50
Step reward            -1
Some vehicle passes    +10
All vehicles pass      +50
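One possible reading of Table II as a reward function is sketched below; how the step penalty and the passing bonuses combine on a single step is our assumption, not stated in the paper.

```python
def step_reward(collided, n_newly_passed, all_passed):
    # Collision terminates the episode with a large penalty.
    if collided:
        return -50.0
    r = -1.0                      # per-step penalty to encourage efficiency
    r += 10.0 * n_newly_passed    # each vehicle that passes on this step
    if all_passed:
        r += 50.0                 # bonus when every vehicle has passed
    return r
```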
C. Algorithm architecture

In this section, we illustrate how to apply the MA-PPO algorithm to this centralized control problem.

[Fig. 4: State formulation — for each vehicle type, the state is $(d, v)$, where $d$ is the signed distance to the center point of the vehicle's path (e.g. $s_{\text{DL}} = [d_1, v_1]$ before the center and $[-d_2, v_2]$ after it).]

1) Model construction: MA-PPO learns from data coming from both the simulation and the model. The simulation incorporates the true dynamics of the environment, i.e. a kinematics module with noise, but it takes too much time to interact with the simulation. MA-PPO accelerates the learning process by incorporating a prior model that generates data which are also used for learning. The model is constructed as a simple kinematics model: given the current position, velocity and expected acceleration of each vehicle, its next position and velocity can be inferred.
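A minimal sketch of one prediction step of such a point-mass kinematics model in the (d, v) coordinates defined above; the step length dt and the non-negative-speed clamp are our assumptions.

```python
def kinematics_step(d, v, a, dt=0.1):
    # d is the signed distance to the center point of the path, so moving
    # forward decreases d (it becomes negative after passing the center).
    d_next = d - (v * dt + 0.5 * a * dt ** 2)
    v_next = max(v + a * dt, 0.0)   # assume vehicles do not reverse
    return d_next, v_next
```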
2) Overall architecture: The learning algorithm for this RL problem consists of two main parts: the MA-PPO learner and the worker. The worker is in charge of getting the updated policy from the learner and using it to collect experience data from the simulation or the kinematics model. The MA-PPO learner then uses the experience data from the worker to update the value and policy networks by backpropagation, and finally sends the updated policy back to the worker for the next iteration. This overall architecture is shown in Fig. 5.
V. EXPERIMENTS
A. Experimental settings
In this section, we train and test MA-PPO and the original PPO on the set of vehicles mentioned above, in which there are two vehicles of different types at each entrance of the intersection. Thus, we have 8 vehicles in total in this experiment. These vehicles are chosen to cover all the conflict modes shown in Fig. 3. The initial positions of all vehicles are random, and multiple vehicles enter the intersection from different entrances, follow their trajectories in the intersection zone, and leave it at different exits. The central controller is capable of controlling the acceleration of all vehicles to adjust their speeds and positions to ensure traffic safety and efficiency, i.e. all vehicles pass through the intersection as quickly as possible without collision. For the results, the training processes of MA-PPO and PPO are shown and compared to illustrate our improvement over PPO, and we also visualize the behavior of the policy at the start and at the end of training in simulation to show what the trained policy has learned.
B. Implementation details

[Fig. 5: Overall architecture of the algorithm — the worker samples data from the simulation and the kinematics model (e.g. v = v0 + at, d = d0 + v0 t + at²/2) with the current policy, and the MA-PPO learner updates the value network (MSE loss) and the policy network (clipped loss) from these data.]

We employ multilayer perceptrons with two hidden layers as the approximate functions of the actor and the critic. Both the actor and the critic have 128 units in each hidden layer; the actor has 16 output units, i.e. the mean and standard deviation of a Gaussian distribution for each vehicle, while the critic has only one output unit for the state value. Note that the actor and critic networks have no shared parameters. We use Adam as the optimizer. For MA-PPO, we collect T = 2048 transitions per iteration and update with minibatch size B = 64 for U = 10 epochs. For the model simulation loop, we set M = 1. Besides, we train both PPO and MA-PPO under 5 random seeds to eliminate the impact of randomness. The complete parameter setting is listed in Table III.
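The description above corresponds to an actor-critic pair like the sketch below (PyTorch assumed); the Tanh activations and the log-standard-deviation parameterization are our choices, not specified in the paper.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, state_dim=16, n_vehicles=8, hidden=128):
        super().__init__()
        # Actor: 16 outputs, i.e. a Gaussian mean and (log) std per vehicle.
        self.actor = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 2 * n_vehicles))
        # Critic: one state-value output; no parameters shared with the actor.
        self.critic = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1))

    def forward(self, s):
        mean, log_std = self.actor(s).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mean, log_std.exp())
        return dist, self.critic(s).squeeze(-1)
```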
TABLE III
Hyperparameters of the experiment

Parameter                 Value
Discount factor γ         0.99
λ                         0.95
Clip range ε              0.2
Total timesteps T_total   5 × 10^7
Inner iteration M         1
Seed number               5
Batch size T              2048
Minibatch size B          64
Epoch U                   10
Learning rate             0.0003 → 0
Hidden layer number       2
Hidden units number       128
Adam                      ε = 10^-5, β1 = 0.9, β2 = 0.999

We use parallel workers to improve exploration, stabilize the policy gradient, and thereby speed up the learning process. Concretely, 16 parallel workers learn simultaneously. In each iteration, every worker interacts with the environment and collects a batch of 2048 timesteps, then takes the first minibatch to calculate a local gradient; the global gradient is obtained by averaging the local gradients of all workers. Each worker updates its parameters with this global gradient in Adam, then takes the next minibatch, and so on.
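A minimal sketch of the synchronous gradient averaging described above; representing gradients as lists of arrays is a simplification of whatever communication layer is actually used.

```python
import numpy as np

def average_gradients(local_grads):
    # local_grads: one list of per-parameter gradient arrays per worker,
    # all with identical shapes. Returns the element-wise average.
    n_workers = len(local_grads)
    return [sum(worker[i] for worker in local_grads) / n_workers
            for i in range(len(local_grads[0]))]

# Example with 16 workers and two parameter tensors each:
grads = [[np.ones((128, 16)), np.ones(128)] for _ in range(16)]
avg = average_gradients(grads)   # two arrays, each filled with 1.0
```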
C. Results and discussion
In this section, we show the performance of our algorithm at the intersection and analyze the empirical results.
[Fig. 6: Training processes of MA-PPO and PPO — (a) mean episode reward and (b) mean episode length versus training iteration.]
Fig. 6a shows the mean episode reward of MA-PPO and PPO during the training process. Both MA-PPO and PPO reach the highest reward of about 50, which means that all 8 vehicles pass through the intersection successfully. Compared with PPO, MA-PPO converges at around 500 iterations, while the PPO algorithm needs nearly 1000 iterations, which shows that MA-PPO converges about twice as fast as PPO.

Fig. 6b shows the change of the mean episode length during the training process. The episode lengths of both MA-PPO and PPO first increase rapidly and then decrease to a similar value. This can be explained as follows: at the beginning, the temporary policy mainly focuses on how to avoid collisions, because a collision incurs a large negative reward. At that stage, one reasonable policy is to let the vehicles with no conflicting trajectories pass through the intersection, such as RL and LD, or DR and UD. Meanwhile, other vehicles have to wait until the next collision-free chance, which leads to long episode lengths. However, such a policy is too conservative and suffers from poor efficiency, because every step carries a negative reward of -1. Therefore, the policy subsequently optimizes this process to avoid long waiting times, leading to the decrease of the mean episode length. Besides, MA-PPO also converges faster than the original PPO in terms of mean episode length, showing the same trend as Fig. 6a.
[Fig. 7: Example of a colliding episode in the experiment, including (a) the episode with collision, and the (b) distance to the intersection center, (c) velocity and (d) acceleration of VEH1(DR)–VEH8(UD) over the whole episode.]
Fig. 7 visualizes one episode in the 20th iteration of the training process. At this stage, VEH5 (mode: LD) has passed through the intersection successfully; however, VEH4 (mode: RL) and VEH8 (mode: UD) collide at the last step of the episode. From Fig. 7(b) we can see that all 8 vehicles are approaching the center of the intersection; however, almost none of them decelerates to avoid the collision except VEH2. During the last few steps before the collision, the velocities of VEH4 and VEH8 maintain their trends without significant change. Besides, VEH2 recognizes the potential front collision, and the policy begins to control its acceleration to avoid another collision. In this episode, the agent finally receives a reward of 10 because of the successful pass of VEH2. We can conclude that at this point the learned policy cannot coordinate all vehicles successfully, and some vehicles such as VEH4 and VEH8 have not yet learned an effective policy to handle this intersection traffic situation.
Fig. 8 shows a successful example in which the central decision agent has learned a good policy after 1000 iterations of training. At this stage, VEH3 (mode: RU) has passed through the intersection successfully. VEH8 slows down from the beginning to step 26 to wait for VEH7 to turn right. Besides, VEH2 (mode: DL) has to wait for the pass of VEH7, which is closer to the center of the intersection. Also, VEH4 (mode: RL) has to wait for VEH2 to complete its turn and decreases its acceleration. The agent learned a human-like policy, detecting potential collisions according to the distances to the center of the intersection and assigning the order in which vehicles pass through. One reasonable explanation for why VEH8 has to wait and pass last is that it is farther from the center of the intersection than any other vehicle, as shown in Fig. 8(b). After step 26, when VEH7, VEH2 and VEH4 have passed the central area of the intersection, VEH8 starts to speed up and then passes the intersection successfully. As shown in Fig. 8(d), the acceleration of VEH8 remains between -2 and -1 from the initial time to time-step 26, and then it accelerates rapidly after time-step 26, which also illustrates that VEH8 learned a waiting policy to avoid collision.
[Fig. 8: Successful example in which all vehicles pass the intersection, including (a) the collision-free episode, and the (b) distance to the intersection center, (c) velocity and (d) acceleration of VEH1(DR)–VEH8(UD) over the whole episode.]
VEH5 and VEH6 have similar velocity curves, both of which show large changes in velocity. At the beginning, they slow down and keep a low velocity until VEH2, VEH3 and VEH4 have passed the intersection. After time-step 25, both of them begin to speed up and pass the intersection, because with no potential collision around the intersection area, a larger acceleration reduces the accumulated negative step reward. On the other hand, the velocity curves of VEH2, VEH3 and VEH4 demonstrate that they learned to speed up so that they could pass the intersection quickly. From Fig. 8(b), compared with the other vehicles, the distances of VEH5 and VEH6 decrease slowly at first; after time-step 25 the distance curves become steeper, which also shows that VEH5 and VEH6 learned a waiting policy to avoid collisions.

In conclusion, the results show that RL-based control can handle the intersection situation with multiple vehicles, not only ensuring collision avoidance but also improving passing efficiency. Unlike hand-crafted rules applied to intersection control, our algorithm coordinates vehicles from different directions according to their velocities and distances to the center of the intersection. Our reinforcement-learning-based method is expected to show greater advantages when there are more vehicles, where hand-crafted rules may not work or may have difficulty finding the optimal coordination of all vehicles. Besides, we use a model to accelerate the learning process and obtain a good speed-up, which shows the importance of a prior model in the learning algorithm.
VI. CONCLUSION
In this paper, we employ a reinforcement learning method to solve centralized conflict-free cooperation for connected and automated vehicles at intersections, which has long been regarded as a challenging problem due to its large scale and high dimensionality. We use the PPO algorithm as our baseline, which has state-of-the-art performance on several benchmarks, and we propose MA-PPO to enhance sample efficiency and speed up the learning process. A typical 4-direction intersection containing 8 different vehicle modes is studied. We find that our method is more efficient than PPO, and the learned driving policy shows intelligent behaviors that increase driving safety and traffic efficiency, which indicates that RL is promising for centralized cooperative driving at intersections.
VII. ACKNOWLEDGMENTS

This work is partially supported by the International Science & Technology Cooperation Program of China under 2016YFE0102200. Special thanks should be given to TOYOTA for funding this study. We would like to acknowledge Mr. Jingliang Duan and Mr. Zhengyu Liu for their valuable suggestions throughout this research.
REFERENCES
[1] S. E. Li, S. Xu, X. Huang, B. Cheng, and H. Peng, “Eco-departure of
connected vehicles with v2x communication at signalized intersections,”
IEEE Transactions on Vehicular Technology, vol. 64, no. 12, pp. 5439–
5449, 2015.
[2] Y. Wu, S. E. Li, J. Cortés, and K. Poolla, “Distributed sliding mode con-
trol for nonlinear heterogeneous platoon systems with positive definite
topologies,” IEEE Transactions on Control Systems Technology, 2019.
[3] F. Gao, X. Hu, S. E. Li, K. Li, and Q. Sun, “Distributed adaptive sliding
mode control of vehicular platoon with uncertain interaction topology,
IEEE Transactions on Industrial Electronics, vol. 65, no. 8, pp. 6352–
6361, 2018.
[4] S. E. Li, X. Qin, K. Li, J. Wang, and B. Xie, “Robustness analysis and
controller synthesis of homogeneous vehicular platoons with bounded
parameter uncertainty, IEEE/ASME Transactions on Mechatronics,
vol. 22, no. 2, pp. 1014–1025, 2017.
[5] N. J. Goodall, B. L. Smith, and B. Park, “Traffic signal control with
connected vehicles,” Transportation Research Record, vol. 2381, no. 1,
pp. 65–72, 2013.
[6] Y. Feng, K. L. Head, S. Khoshmagham, and M. Zamanipour, “A real-
time adaptive signal control in a connected vehicle environment, Trans-
portation Research Part C: Emerging Technologies, vol. 55, pp. 460–
473, 2015.
[7] K. Dresner and P. Stone, “A multiagent approach to autonomous
intersection management,” Journal of artificial intelligence research,
vol. 31, pp. 591–656, 2008.
[8] J. Lee and B. Park, “Development and evaluation of a cooperative
vehicle intersection control algorithm under the connected vehicles
environment, IEEE Transactions on Intelligent Transportation Systems,
vol. 13, no. 1, pp. 81–90, 2012.
[9] P. Dai, K. Liu, Q. Zhuge, E. H.-M. Sha, V. C. S. Lee, and S. H.
Son, “Quality-of-experience-oriented autonomous intersection control in
vehicular networks, IEEE Transactions on Intelligent Transportation
Systems, vol. 17, no. 7, pp. 1956–1967, 2016.
[10] M. Ahmane, A. Abbas-Turki, F. Perronnet, J. Wu, A. El Moudni,
J. Buisson, and R. Zeo, “Modeling and controlling an isolated urban
intersection based on cooperative vehicles, Transportation Research
Part C: Emerging Technologies, vol. 28, pp. 44–62, 2013.
[11] B. Xu, S. E. Li, Y. Bian, S. Li, X. J. Ban, J. Wang, and K. Li, “Distributed
conflict-free cooperation for multiple connected vehicles at unsignalized
intersections,” Transportation Research Part C: Emerging Technologies,
vol. 93, pp. 322–334, 2018.
[12] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.
MIT press, 2018.
[13] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G.
Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski,
S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran,
D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through
deep reinforcement learning.,” Nature, vol. 518, no. 7540, pp. 529–33,
2015.
[14] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van
Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam,
M. Lanctot, et al., “Mastering the game of go with deep neural networks
and tree search,” nature, vol. 529, no. 7587, p. 484, 2016.
[15] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang,
A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al., “Mastering
the game of go without human knowledge, Nature, vol. 550, no. 7676,
p. 354, 2017.
[16] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa,
D. Silver, and D. Wierstra, “Continuous control with deep reinforcement
learning,” arXiv preprint arXiv:1509.02971, 2015.
[17] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust
region policy optimization, in International Conference on Machine
Learning, pp. 1889–1897, 2015.
[18] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox-
imal policy optimization algorithms,” arXiv preprint arXiv:1707.06347,
2017.
[19] P. Wolf, C. Hubschneider, M. Weber, A. Bauer, J. Härtl, F. Dürr, and
J. M. Zöllner, “Learning how to drive in a real world simulation with
deep q-networks,” in 2017 IEEE Intelligent Vehicles Symposium (IV),
pp. 244–250, IEEE, 2017.
[20] Z. Ruiming, L. Chengju, and C. Qijun, “End-to-end control of kart
agent with deep reinforcement learning,” in 2018 IEEE International
Conference on Robotics and Biomimetics (ROBIO), pp. 1688–1693,
IEEE, 2018.
[21] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley,
D. Silver, and K. Kavukcuoglu, Asynchronous methods for deep rein-
forcement learning,” in International conference on machine learning,
pp. 1928–1937, 2016.
[22] M. Jaritz, R. De Charette, M. Toromanoff, E. Perot, and F. Nashashibi,
“End-to-end race driving with deep reinforcement learning, in 2018
IEEE International Conference on Robotics and Automation (ICRA),
pp. 2070–2075, IEEE, 2018.
[23] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “Carla:
An open urban driving simulator, arXiv preprint arXiv:1711.03938,
2017.
[24] J. Duan, S. E. Li, Y. Guan, Q. Sun, and B. Cheng, “Hierarchical
reinforcement learning for self-driving decision-making without reliance
on labeled driving data,” IET Intelligent Transport Systems, 2019.
[25] A. Kendall, J. Hawke, D. Janz, P. Mazur, D. Reda, J.-M. Allen, V.-D.
Lam, A. Bewley, and A. Shah, “Learning to Drive in a Day,” 2018.
[26] X. Xiong, J. Wang, F. Zhang, and K. Li, “Combining deep reinforce-
ment learning and safety based control for autonomous driving,” arXiv
preprint arXiv:1612.00147, 2016.
[27] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gra-
dient methods for reinforcement learning with function approximation,”
in Advances in neural information processing systems, pp. 1057–1063,
2000.
[28] P. Thomas, “Bias in natural actor-critic algorithms, in International
Conference on Machine Learning, pp. 441–448, 2014.
[29] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-
dimensional continuous control using generalized advantage estimation,”
arXiv preprint arXiv:1506.02438, 2015.
[30] S. M. Kakade, “A natural policy gradient, in Advances in neural
information processing systems, pp. 1531–1538, 2002.
[31] M. Deisenroth and C. E. Rasmussen, “Pilco: A model-based and
data-efficient approach to policy search, in Proceedings of the 28th
International Conference on machine learning (ICML-11), pp. 465–472,
2011.
[32] I. Grondman, “Online model learning algorithms for actor-critic control,”
2015.
[33] N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa,
“Learning continuous control policies by stochastic value gradients, in
Advances in Neural Information Processing Systems, pp. 2944–2952,
2015.
[34] T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel,
“Model-ensemble trust-region policy optimization, arXiv preprint
arXiv:1802.10592, 2018.
[35] V. Feinberg, A. Wan, I. Stoica, M. I. Jordan, J. E. Gonzalez, and
S. Levine, “Model-based value estimation for efficient model-free re-
inforcement learning,” arXiv preprint arXiv:1803.00101, 2018.