Centralized Conflict-free Cooperation for Connected
and Automated Vehicles at Intersections by
Proximal Policy Optimization
Yang Guan, Yangang Ren, Shengbo Eben Li*, Qi Sun, Laiquan Luo, Koji Taguchi, and Keqiang Li
Abstract—Connected vehicles will change the modes of future transportation management and organization, especially at intersections. There are mainly two categories of coordination methods at unsignalized intersections, i.e. centralized and distributed methods. Centralized coordination methods need huge computation resources, since a single centralized controller optimizes the trajectories of all approaching vehicles, while in distributed methods each approaching vehicle owns an individual controller that optimizes its trajectory considering the motion information and the conflict relationships with its neighboring vehicles, which avoids the heavy computation but needs sophisticated manual design. In this paper, we propose a centralized conflict-free cooperation method for multiple connected vehicles at unsignalized intersections using reinforcement learning (RL), which naturally relieves the online computation burden by training offline. We first incorporate a prior model into the proximal policy optimization (PPO) algorithm to accelerate the learning process. Then we present the design of the state, action and reward to formulate centralized cooperation as an RL problem. Finally, we train a coordination policy with our model-accelerated PPO (MA-PPO) in a simulation setting and analyze the results. The results show that the proposed method improves the traffic efficiency of the intersection while ensuring no collision.
Index Terms—Connected and automated vehicles, centralized coordination method, reinforcement learning, traffic intersection
I. INTRODUCTION
The increasing demand for mobility poses great challenges to road transport. Connected and automated vehicles are attracting extensive attention due to their potential to benefit traffic safety, efficiency, and economy [1]. A widely studied, but simplified, form of connected vehicle cooperation is the platoon control system on highways. Platoon control aims to ensure that a group of connected vehicles in the same lane move at a harmonized longitudinal speed while maintaining desired inter-vehicle spaces [2]–[4]. As a typical scenario in urban areas, the intersection is more complex and challenging for multi-vehicle coordination than the highway. At an intersection, vehicles enter from different entrances, cross their specific trajectories in the intersection zone, and leave the intersection at different exits.
Y. Guan, Y. Ren, S. E. Li, Q. Sun, L. Luo and K. Li are with the State Key Lab of Automotive Safety and Energy, School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China. They are also with the Center for Intelligent Connected Vehicles and Transportation, Tsinghua University, Beijing, China. Email: (guany17, ryg18, qisun)@mails.tsinghua.edu.cn; luolaiquan@cau.edu.cn; (lishbo, likq)@tsinghua.edu.cn
The complex conflict relationships between vehicles result in complicated vehicle decisions to avoid collisions at the intersection. Hence, a complicated design is needed to guarantee traffic safety while improving traffic efficiency. To resolve multi-vehicle coordination at intersections, several studies focus on traffic signal design schemes. Goodall et al. (2013) developed a decentralized fully adaptive traffic control algorithm to optimize traffic signal timing [5]. Feng et al. (2015) presented a real-time adaptive signal phase allocation algorithm using connected vehicle data, which optimizes the phase sequence and duration by solving a two-level optimization problem [6]. These traffic-signal-based coordination methods can ensure traffic safety, but they may result in inefficient intersection management. Hence, researchers have started to focus on signal-free methods for intersection coordination.
Currently, there are mainly two types of methods to handle coordination at unsignalized intersections, i.e. centralized and distributed coordination methods. Centralized coordination methods utilize the global information of the whole intersection to centrally control every vehicle at the intersection. Dresner and Stone (2008) treated drivers and intersections as autonomous agents in a multi-agent system and built a new reservation-based approach around a detailed communication protocol [7]. Lee and Park (2012) eliminated potential overlaps of vehicular trajectories coming from all conflicting approaches at the intersection, then sought a safe maneuver for every vehicle approaching the intersection and manipulated each of them [8]. Dai et al. (2016) formulated an intersection control model and transformed it into a convex optimization problem, with consideration of safety and efficiency [9]. However, these centralized coordination methods suffer from a huge computation requirement, since they coordinate the approaching vehicles by optimizing all of their trajectories with a single centralized controller.
In distributed coordination methods, there is no central controller; instead, a distributed controller in each approaching vehicle optimizes its own trajectory considering the motion and conflict relationships with its neighboring vehicles. Ahmane et al. proposed a model based on Timed Petri Nets with Multipliers (TPNM) and used it to design the control policy through structural analysis [10]. Xu et al. proposed a conflict-free geometry topology and a communication topology to transform the two-dimensional vehicle cluster at the intersection into a one-dimensional virtual vehicle platoon and eventually designed a distributed feedback controller [11]. Distributed coordination methods relieve the huge computation requirement through distributed computation; however, they need careful design of a sophisticated dynamic model and complicated communication relationships.
One of the most fundamental goals in artificial intelligence is how to learn a new skill, especially from high-dimensional sensor input. Reinforcement learning (RL) gradually learns a better policy from trial-and-error interaction with the environment, which is highly similar to how humans learn and has the potential to address a large number of complex problems [12]. Recently, significant progress has been made on a variety of problems by combining advances in deep learning and RL. Mnih et al. (2015) proposed the Deep Q-learning Network (DQN) and attained human-level performance on Atari video games with raw pixels as input [13]. Silver et al. used RL and tree search methods to conquer the game of Go, producing the famous programs AlphaGo and AlphaGo Zero and defeating the best human champions [14], [15]. Considering that DQN is only suitable for problems with discrete action spaces, the Deep Deterministic Policy Gradient (DDPG) algorithm was proposed to solve continuous control problems [16]. As vanilla policy gradient methods suffer from poor data efficiency and robustness, Trust Region Policy Optimization (TRPO) was proposed [17]. However, TRPO is not compatible with architectures that include noise and rarely implements parameter sharing between the policy and value function. Proximal Policy Optimization (PPO) was proposed as an updated version of TRPO, which alternates between sampling data through interaction and optimizing a surrogate objective function using stochastic gradient ascent [18].
RL is poised to help conquer the autonomous driving problem because of its super-human potential. Most existing RL research on autonomous driving focuses on the intelligence of a single vehicle driving in relatively simple traffic scenarios. DQN was initially used to control the high-frequency discrete steering actions of a vehicle [19], [20]. After the Asynchronous Advantage Actor-Critic (A3C) method was proposed, some studies adopted this framework to accelerate learning and maintain training stability [21]–[23]. Due to its advantage in long-horizon credit assignment, hierarchical reinforcement learning was used for both high-level maneuver selection and low-level motion control in the decision making of self-driving cars [24]. Besides, other researchers successfully applied DDPG to autonomous driving, realizing control of continuous acceleration, steering and braking actions [25], [26].
In this paper, we employ RL as our method for the centralized control of multiple connected vehicles to realize autonomous collision-free passing at unsignalized intersections. This is realized by first formulating the state space, action space and reward function in the RL framework, and then training the policy with a distributed PPO algorithm. Besides, to enhance sample efficiency and accelerate the training process, we incorporate a prior model into the PPO algorithm. Since we train a centralized controller, there is no need to elaborately design the complex components used by distributed controllers. Moreover, RL trains offline and infers online, so it naturally unloads the online computation burden. Our results show that the learned policy is able to increase driving safety and traffic efficiency at intersections.
The rest of this paper is organized as follows. Section II introduces preliminaries of Markov decision processes and policy gradient methods. Section III proposes model-accelerated PPO (MA-PPO), which is an improvement based on the original PPO and model-based RL methods. Section IV presents our problem statement and formulation, and Section V describes the experimental settings and results. Finally, Section VI summarizes this work.
II. PRELIMINARIES OF RL
Consider an infinite-horizon discounted Markov decision process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, p, r, d_0, \gamma)$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $p: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the transition probability distribution, $r: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the reward function, $d_0: \mathcal{S} \to \mathbb{R}$ is the distribution of the initial state $s_0$, and $\gamma \in (0, 1)$ is the discount factor.

Let $\pi$ denote a stochastic policy $\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]$. We seek to learn the optimal policy $\pi^*$, which has the maximum value function $v_\pi(s)$ for all $s \in \mathcal{S}$, where the value function $v_\pi(s)$ is the expected sum of discounted rewards from a state when following policy $\pi$:

$$v_\pi(s) := \mathbb{E}_{a_t, s_{t+1}, \ldots}\Big\{ \sum_{l=t}^{\infty} \gamma^{l-t} r_l \,\Big|\, s_t = s \Big\} \tag{1}$$

where $a_t \sim \pi(a_t|s_t)$, $s_{t+1} \sim p(s_{t+1}|s_t, a_t)$, and $r_t := r(s_t, a_t, s_{t+1})$ for short. Similarly, we use the following standard definition of the state-action value function $q_\pi(s, a)$:

$$q_\pi(s, a) := \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\Big\{ \sum_{l=t}^{\infty} \gamma^{l-t} r_l \,\Big|\, s_t = s, a_t = a \Big\} \tag{2}$$
A. Vanilla policy gradient

In practice, finding the optimal policy for every state is impractical for large state spaces, so we consider parameterized policies $\pi_\theta(a|s)$ with parameter vector $\theta$. For the same reason, the state-value function is parameterized as $V(s, w)$ with parameter vector $w$.

Policy optimization methods seek to find the optimal $\theta$ which maximizes the average performance of policy $\pi_\theta$,

$$J(\theta) = \mathbb{E}_{s_0, a_0, \ldots}\Big\{ \sum_{t=0}^{\infty} \gamma^t r_t \Big\} \tag{3}$$

where $s_0 \sim d_0(s_0)$, $a_t \sim \pi_\theta(a_t|s_t)$, $s_{t+1} \sim p(s_{t+1}|s_t, a_t)$, and $r_t := r(s_t, a_t, s_{t+1})$.
Vanilla methods optimize (3) by the stochastic policy gradient [27], whose gradient is shown in (4):

$$\nabla_\theta J(\theta) = \sum_{s} d_\gamma(s) \sum_{a} \nabla_\theta \pi_\theta(a|s)\, q_{\pi_\theta}(s, a) \tag{4}$$

$$d_\gamma(s) = \sum_{t=0}^{\infty} \gamma^t d_t(s|\pi_\theta) \tag{5}$$

where $d_t(s|\pi_\theta)$ is the state distribution at time $t$, and $d_\gamma(s)$ is called the discounted visiting frequency, which in practice is usually replaced with the stationary state distribution under $\pi_\theta$, denoted by $d_{\pi_\theta}$ [28]. Combined with the likelihood ratio and baseline techniques [12], we can write (4) in expectation form:

$$\nabla_\theta J(\theta) \propto \mathbb{E}_{s \sim d_{\pi_\theta},\, a \sim \pi_\theta}\big\{ \nabla_\theta \log \pi_\theta(a|s)\, A_{\pi_\theta}(s, a) \big\} \tag{6}$$

where $A_{\pi_\theta}(s, a) := q_{\pi_\theta}(s, a) - v_{\pi_\theta}(s)$ is the advantage function, which can be estimated by several methods [29].
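As a concrete illustration of (6), the following minimal sketch (ours, not from the paper; PyTorch assumed) builds a surrogate loss whose gradient matches the advantage-weighted score function; the function and tensor names are hypothetical.

```python
import torch

def vanilla_pg_loss(log_probs, advantages):
    # Surrogate loss whose gradient matches Eq. (6):
    # grad_theta J ∝ E[ grad_theta log pi_theta(a|s) * A_pi(s, a) ].
    # log_probs:  log pi_theta(a_t|s_t) of the sampled actions, shape [T]
    # advantages: advantage estimates, treated as constants,    shape [T]
    return -(log_probs * advantages.detach()).mean()
```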
B. Trust region method

While the vanilla policy gradient is simple to implement, it often leads to destructively large policy updates. TRPO optimizes a lower bound of (3), i.e. (7), to guarantee performance improvement:

$$\underset{\theta}{\text{maximize}} \;\; \mathbb{E}_{s,a}\Big[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} A_{\pi_{\theta_{\text{old}}}} - \beta\, \text{KL}\big[ \pi_{\theta_{\text{old}}}(\cdot|s), \pi_\theta(\cdot|s) \big] \Big] \tag{7}$$

However, it is hard to choose a single value of $\beta$ that performs well across different problems, so TRPO uses the constrained form shown in (8) instead:

$$\begin{aligned}
\underset{\theta}{\text{maximize}} \;\;& \mathbb{E}_{s,a}\Big[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} A_{\pi_{\theta_{\text{old}}}} \Big] \\
\text{s.t.} \;\;& \mathbb{E}_s\big[ D_{\text{KL}}\big( \pi_{\theta_{\text{old}}}(\cdot|s) \,\|\, \pi_\theta(\cdot|s) \big) \big] \le \delta
\end{aligned} \tag{8}$$

where we denote $\mathbb{E}_{s,a}[\ldots] := \mathbb{E}_{s \sim d_{\pi_{\theta_{\text{old}}}},\, a \sim \pi_{\theta_{\text{old}}}}[\ldots]$, $\mathbb{E}_s[\ldots] := \mathbb{E}_{s \sim d_{\pi_{\theta_{\text{old}}}}}[\ldots]$ and $A_{\pi_{\theta_{\text{old}}}} := A_{\pi_{\theta_{\text{old}}}}(s, a)$. TRPO can be regarded as a natural policy gradient method [30]. It finds the steepest policy gradient in the Fisher-matrix-normed space rather than in Euclidean space, which helps to reduce the impact of the policy parameterization when calculating the gradient and stabilizes the learning process.
III. MODEL ACCELERATED PPO

A. Proximal policy optimization

In this paper, the PPO algorithm is employed as our baseline. It is inspired by TRPO and has two main differences, i.e. an unconstrained surrogate objective function and generalized advantage estimation.

1) Unconstrained surrogate objective function: The theory behind TRPO shows that a stable policy update requires punishing policy deviation on top of the unconstrained optimization (9),

$$\underset{\theta}{\text{maximize}} \;\; \mathbb{E}_{s,a}\Big[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} A_{\pi_{\theta_{\text{old}}}} \Big] \tag{9}$$

PPO alternatively constructs a subtle lower bound of (9) that directly removes the incentive for the policy distribution to deviate too much. Its objective is (10):

$$J_{\text{ppo}}(\theta) = \mathbb{E}_{s,a}\big[ \min\big( \rho_\theta A_{\pi_{\theta_{\text{old}}}},\ \text{clip}\big( \rho_\theta, 1-\epsilon, 1+\epsilon \big) A_{\pi_{\theta_{\text{old}}}} \big) \big] \tag{10}$$

where $\rho_\theta = \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}$ is the probability ratio. When $A_{\pi_{\theta_{\text{old}}}} > 0$, under objective (9) $\rho_\theta$ tends to become much larger than 1 to make the objective as large as possible, which leads to unstable learning; the PPO objective (10) removes this incentive by clipping $\rho_\theta$ within $[1-\epsilon, 1+\epsilon]$ and taking the minimum of the unclipped objective (9) and the clipped term. The same holds when $A_{\pi_{\theta_{\text{old}}}} < 0$.
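For illustration, the clipped surrogate (10) can be written as a loss to minimize, as in the minimal sketch below; the function name, argument names and PyTorch usage are our assumptions, not the paper's code.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    # rho_theta = pi_theta(a|s) / pi_theta_old(a|s), computed in log space.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Minimizing this loss maximizes the clipped objective of Eq. (10).
    return -torch.min(unclipped, clipped).mean()
```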
2) Generalized advantage estimation: The advantage function is necessary for the policy gradient calculation, and it can be estimated by (11):

$$\hat{A}_\pi(s, a) = \hat{Q}_\pi(s, a) - V(s, w) \tag{11}$$

where $\hat{Q}_\pi(s, a)$ is the action-value function estimated from samples of $\pi$, and $V(s, w)$ is the state-value function approximation. TRPO and A3C use the Monte-Carlo method to construct $\hat{Q}_\pi(s, a)$ as in (12):

$$\hat{Q}_\pi(s_t, a_t) = \sum_{l=t}^{\infty} \gamma^{l-t} r_l \tag{12}$$

It is an unbiased estimate of $q_\pi(s, a)$, but suffers from high variance. Actor-critic methods use the one-step TD target to form $\hat{Q}_\pi(s, a)$ as in (13),

$$\hat{Q}_\pi(s_t, a_t) = r_t + \gamma V(s_{t+1}, w) \tag{13}$$

which has low variance but is biased. Generalized advantage estimation is essentially the same as TD($\lambda$), except that it uses a linear combination of $n$-step TD targets to estimate $q_\pi$ instead of $v_\pi$. The backward view of TD($\lambda$) is shown in (14),

$$\hat{Q}_\pi(s_t, a_t) = \sum_{l=t}^{\infty} (\gamma\lambda)^{l-t} \delta_l \tag{14}$$

where $\delta_l$ is the TD error,

$$\delta_l = r_l + \gamma V(s_{l+1}, w) - V(s_l, w) \tag{15}$$
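One standard way to compute the sum of discounted TD errors in (14)–(15) over a finite rollout, used as the advantage estimate in Algorithm 1, is the backward recursion sketched below; this is our illustration, assuming a single segment bootstrapped with the value of the last state, not the authors' implementation.

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    # rewards: r_0..r_{T-1}; values: V(s_0)..V(s_{T-1}); last_value: V(s_T).
    T = len(rewards)
    v = np.append(values, last_value)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * v[t + 1] - v[t]   # TD error, Eq. (15)
        running = delta + gamma * lam * running        # accumulates Eq. (14)
        adv[t] = running
    targets = adv + v[:-1]        # TD(lambda) targets, used to fit V(s, w)
    return adv, targets
```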
Compared to TRPO, PPO is much simpler and faster to implement because it only involves first-order optimization, and it has better convergence due to the use of generalized advantage estimation. However, PPO is an on-policy method and inevitably has high sample complexity.
B. Model-based RL
Recent model-free reinforcement learning algorithms have proposed incorporating given or learned dynamics models as a source of additional data, with the intention of reducing sample complexity. Generally, there are two ways to use a model: value gradient methods, and using the model for imagination.

Value gradient methods link together the policy, model, and reward function to compute an analytic policy gradient by backpropagation of the reward along a trajectory [31]–[33]. A major limitation of this approach is that the dynamic model can only be used to retrieve information already present in the observed data, and albeit with lower variance, the actual improvement in efficiency is relatively small. Alternatively, the given or learned model can also be used for imagination rollouts. This usage can be naturally incorporated into the model-free RL framework; however, learned models suffer from overfitting to the experience data, which leads to large errors over long horizons [34], [35].
C. Model accelerated PPO (MA-PPO)
PPO is a model-free, on-policy RL algorithm. Model-free means it knows nothing about the environment and can only learn from interactions with it. As a result, it inevitably requires a large amount of experience data despite its excellent final performance. Besides, the training speed is limited by the interaction with the real world or a simulator. Even worse, the on-policy property makes experience produced by previously trained policies useless, which aggravates the sample inefficiency. This is our motivation to accelerate PPO.

Basically, there are two ways to reduce sample complexity. The first one is incorporating off-policy data into the learning process, i.e. reusing experience gathered during training. The second one is being given, or learning, a dynamic model. We claim that off-policy data cannot be used here due to state distribution mismatch. Assume that the off-policy data are generated by another policy $b$; we rewrite the PPO objective (10) as (16):

$$J_{\text{ppo}}(\theta) = \mathbb{E}_{s \sim d_b,\, a \sim b}\Big[ \frac{d_{\pi_{\theta_{\text{old}}}}(s)\, \pi_{\theta_{\text{old}}}(a|s)}{d_b(s)\, b(a|s)} \min\big( \rho(\theta) A_{\pi_{\theta_{\text{old}}}},\ \text{clip}\big(\rho(\theta), 1-\epsilon, 1+\epsilon\big) A_{\pi_{\theta_{\text{old}}}} \big) \Big] \tag{16}$$

To acquire a correct gradient from off-policy data, we need to correct not only the action distribution through the action probability ratio $\frac{\pi_{\theta_{\text{old}}}(a|s)}{b(a|s)}$, but also the state distribution through the stationary probability ratio $\frac{d_{\pi_{\theta_{\text{old}}}}(s)}{d_b(s)}$. However, the stationary probability ratio is hard to estimate, which leads to a distribution mismatch and hinders the use of off-policy data both in theory and in practice. As a result, we only employ a model to accelerate PPO.

In the field of centralized control at intersections, a dynamic model is available from human prior knowledge, so we construct a model ourselves rather than learning one. To combine the model with the PPO algorithm naturally, we employ the second type of model usage from Section III-B, i.e. imagination rollouts. MA-PPO is shown in Algorithm 1.
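To make the control flow of Algorithm 1 concrete, the skeleton below sketches its outer loop in Python; the callables env_rollout, model_rollout and ppo_update are hypothetical placeholders for the rollout collection and the clipped-surrogate/critic updates described above, so this is only our structural reading, not the authors' implementation.

```python
def ma_ppo(env_rollout, model_rollout, ppo_update, M=1, U=10, iterations=1000):
    # env_rollout() / model_rollout(): return a batch of transitions with
    # advantages and TD(lambda) targets collected under the current policy.
    # ppo_update(batch): one epoch of clipped-surrogate and critic updates.
    for _ in range(iterations):
        batch = env_rollout()                 # T timesteps from the simulation
        for _ in range(U):
            ppo_update(batch)
        for _ in range(M):                    # model-generated inner loop
            model_batch = model_rollout()     # T_model timesteps from the prior model
            for _ in range(U):
                ppo_update(model_batch)
```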
IV. PROBLEM STATEMENT AND FORMULATION

A. Problem statement

In this paper, we focus on a typical 4-direction intersection shown in Fig. 1. Each direction is denoted by its location in the figure, i.e. up (U), down (D), left (L) and right (R). We only focus on vehicles within a certain distance from the intersection center. The intersection is unsignalized, and each entrance or exit is assumed to have only one lane; as a result, there are 4 entrances in total. A vehicle at each entrance is allowed to turn right, go straight or turn left. Thus there are 12 types of vehicles, denoted by their entrance and exit, i.e. DR, DU, DL, RU, RL, RD, LD, LR, LU, UL, UD, and UR. Their numbers and meanings are listed in Table I. All possible conflict relations are also illustrated in the figure, which can be categorized into three classes: crossing conflicts (denoted by red dots), converging conflicts (denoted by purple dots), and diverging conflicts (denoted by pink dots). To simplify our problem, we choose 8 vehicle modes out of the 12 to cover the main conflict modes in our experiment. The 8 modes are DR, DL, RU, RL, LD, LU, UL and UD, as shown in Fig. 2. From the figure, we can summarize the typical modes of conflict they contain, which are shown in Fig. 3.
Algorithm 1 MA-PPO

Randomly initialize the critic network $V(s, w)$ and the actor $\pi_\theta(a|s)$ with weights $w$ and $\theta$; set $\lambda$, $\gamma$, $\epsilon$, total timesteps $T_{\text{total}}$, inner iteration $M$, batch size $T$, minibatch size $B$, and epoch $U$
for iteration = 1, ..., $T_{\text{total}}/T$ do
    Run policy $\pi_\theta$ in the environment for $T$ timesteps, collecting $(s_t, a_t, r_t)$
    Estimate advantages $\hat{A}_t = \sum_{l=t}^{\text{end of episode}} (\lambda\gamma)^{l-t} \delta_l$
    Estimate TD($\lambda$) targets $\hat{V}_t = \hat{A}_t + V(s_t, w)$
    $\pi_{\theta_{\text{old}}} \leftarrow \pi_\theta$
    for epoch = 1, ..., $U$ do
        $J_{\text{PPO}}(\theta) = \sum_{t=1}^{T} \min\big( \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)} \hat{A}_t,\ \text{clip}\big(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}, 1-\epsilon, 1+\epsilon\big) \hat{A}_t \big)$
        Update $\theta$ by a gradient method w.r.t. $J_{\text{PPO}}(\theta)$
        $J_{\text{Critic}}(w) = \sum_{t=1}^{T} \big(\hat{V}_t - V(s_t, w)\big)^2$
        Update $w$ by a gradient method w.r.t. $J_{\text{Critic}}(w)$
    end for
    for model iteration = 1, ..., $M$ do
        Run policy $\pi_\theta$ in the model for $T_{\text{model}}$ timesteps, collecting $(\hat{s}_t, \hat{a}_t, \hat{r}_t)$
        Estimate advantages $\hat{A}_t = \sum_{l=t}^{\text{end of episode}} (\lambda\gamma)^{l-t} \hat{\delta}_l$
        Estimate TD($\lambda$) targets $\hat{V}_t = \hat{A}_t + V(s_t, w)$
        $\pi_{\theta_{\text{old}}} \leftarrow \pi_\theta$
        for epoch = 1, ..., $U$ do
            Update $\theta$ by a gradient method w.r.t. $J_{\text{PPO}}(\theta)$
            Update $w$ by a gradient method w.r.t. $J_{\text{Critic}}(w)$
        end for
    end for
end for
We adopt the following assumptions. First, all vehicles are equipped with positioning and velocity devices so that we can gather location and movement information when they enter the zone of interest around the intersection. Second, all approaching vehicles are assumed to be automated vehicles so that they can strictly follow the desired acceleration, control their speed, and pass the intersection automatically. Additionally, there is at most one vehicle of each type in each entrance lane, but the order of the different types is stochastic.

TABLE I
Different types of vehicles

Type  Number  Meaning
DR    1   From Down, turn right to Right
DU    2   From Down, go straight to Up
DL    3   From Down, turn left to Left
RU    4   From Right, turn right to Up
RL    5   From Right, go straight to Left
RD    6   From Right, turn left to Down
LD    7   From Left, turn right to Down
LR    8   From Left, go straight to Right
LU    9   From Left, turn left to Up
UL    10  From Up, turn right to Left
UD    11  From Up, go straight to Down
UR    12  From Up, turn left to Right
B. RL formulation

We are now ready to transform our problem into an RL problem by defining the state space, action space and reward function, which are the basic elements of RL.

1) State and action space: By our assumption, we need to control at most 8 vehicles at a time, i.e. two vehicles of different types at each entrance. Vector forms are used for both the state and the action, which are respectively the concatenations of each vehicle's state and control in a fixed order, as shown in (17):

$$\begin{aligned}
s &= [s_{\text{DR}}, s_{\text{DL}}, s_{\text{RU}}, s_{\text{RL}}, s_{\text{LD}}, s_{\text{LU}}, s_{\text{UL}}, s_{\text{UD}}] \\
a &= [a_{\text{DR}}, a_{\text{DL}}, a_{\text{RU}}, a_{\text{RL}}, a_{\text{LD}}, a_{\text{LU}}, a_{\text{UL}}, a_{\text{UD}}]
\end{aligned} \tag{17}$$

where $s_*$ and $a_*$ denote the state and action of the vehicle of type $*$.
[Fig. 1: Intersection scenario settings — a 4-direction intersection with entrances Up (U), Down (D), Left (L) and Right (R), and the 12 vehicle types numbered 1–12.]

[Fig. 2: Vehicle modes chosen for the experiment — DR, DL, RU, RL, LD, LU, UL and UD.]
[Fig. 3: Typical modes of conflict in the experiment — crossing, converging, diverging, and no conflict.]

The state of each type should contain the position and velocity information of the corresponding vehicle. Intuitively, we could form the state as a tuple of coordinates and velocity, i.e. $(x, y, v)$, where $(x, y)$ is the coordinate of the vehicle's position and $v$ is its velocity. However, by our task formulation, every vehicle has a fixed path corresponding to its type, so this formulation would carry redundant information. Besides, for a continuous state it is necessary to reduce the state space dimension to speed up learning and enhance stability. Observing that all paths cross the intersection, we further compress the state of each vehicle to $(d, v)$, where $d$ is the distance between the vehicle and the center point of its path. Note that $d$ is positive when the vehicle is heading toward the center and negative when it is leaving. The state formulation is shown in Fig. 4.
For the action space, we choose the acceleration of each vehicle. In total, a 16-dimensional state space and an 8-dimensional action space are constructed.
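As an illustration only, the sketch below shows one way the state and action vectors of (17) could be assembled and split; the fixed ordering constant and helper names are our own, not from the paper.

```python
import numpy as np

VEHICLE_ORDER = ["DR", "DL", "RU", "RL", "LD", "LU", "UL", "UD"]

def build_state(per_vehicle):
    # per_vehicle: dict mapping type -> (d, v), where d is the signed distance
    # to the center point of that vehicle's path. Returns a shape-(16,) vector.
    return np.array([x for t in VEHICLE_ORDER for x in per_vehicle[t]],
                    dtype=np.float32)

def split_action(action):
    # action: shape-(8,) acceleration vector -> per-vehicle accelerations.
    return {t: float(a) for t, a in zip(VEHICLE_ORDER, action)}
```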
2) Reward settings: The reward function is designed with consideration of safety, efficiency and task completion. First of all, the task is designed in an episodic manner, in which two types of termination are given: a collision happens or all vehicles pass the intersection. To avoid collisions, a large negative reward is given if one happens. To enhance efficiency, a minor negative step reward is given at every time step. To encourage task completion, a positive reward is given whenever some vehicle passes the intersection, and a large positive reward is given when all vehicles have passed the intersection. All reward settings are listed in Table II.

TABLE II
Reward settings

Reward item            Reward
Collision              -50
Step reward            -1
Some vehicle passes    +10
All vehicles pass      +50
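One possible reading of Table II as a reward function is sketched below; how the step penalty and the passing bonuses combine on a single step is our assumption, not stated in the paper.

```python
def step_reward(collided, n_newly_passed, all_passed):
    # Collision terminates the episode with a large penalty.
    if collided:
        return -50.0
    r = -1.0                      # per-step penalty to encourage efficiency
    r += 10.0 * n_newly_passed    # each vehicle that passes on this step
    if all_passed:
        r += 50.0                 # bonus when every vehicle has passed
    return r
```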
C. Algorithm architecture

In this section, we illustrate how to apply the MA-PPO algorithm to this centralized control problem.

[Fig. 4: State formulation — for each vehicle type, the state is $(d, v)$, where $d$ is the signed distance to the center point of the vehicle's path (e.g. $s_{\text{DL}} = [d_1, v_1]$ before the center and $[-d_2, v_2]$ after it).]

1) Model construction: MA-PPO learns from data coming from both the simulation and the model. The simulation incorporates the true dynamics of the environment, i.e. a kinematics module with noise, but it takes too much time to interact with the simulation. MA-PPO accelerates the learning process by incorporating a prior model that generates data which are also used for learning. The model is constructed as a simple kinematics model: given the current position, velocity and expected acceleration of each vehicle, its next position and velocity can be inferred.
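A minimal sketch of one prediction step of such a point-mass kinematics model in the (d, v) coordinates defined above; the step length dt and the non-negative-speed clamp are our assumptions.

```python
def kinematics_step(d, v, a, dt=0.1):
    # d is the signed distance to the center point of the path, so moving
    # forward decreases d (it becomes negative after passing the center).
    d_next = d - (v * dt + 0.5 * a * dt ** 2)
    v_next = max(v + a * dt, 0.0)   # assume vehicles do not reverse
    return d_next, v_next
```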
2) Overall architecture: The learning algorithm for this RL problem consists of two main parts: the MA-PPO learner and the worker. The worker is in charge of getting the updated policy from the learner and using it to collect experience data from the simulation or the kinematics model. The MA-PPO learner then uses the experience data from the worker to update the value and policy networks by backpropagation, and finally sends the updated policy back to the worker for the next iteration. This overall architecture is shown in Fig. 5.
V. EXPERIMENTS
A. Experimental settings
In this section, we train and test MA-PPO and the original PPO on the set of vehicles mentioned above, in which there are two vehicles of different types at each entrance of the intersection. Thus, we have 8 vehicles in total in this experiment. These vehicles are chosen to cover all the conflict modes shown in Fig. 3. The initial positions of all vehicles are random, and multiple vehicles enter the intersection from different entrances, follow their trajectories in the intersection zone, and leave it at different exits. The central controller is capable of controlling the acceleration of all vehicles to adjust their speeds and positions to ensure traffic safety and efficiency, i.e. all vehicles pass through the intersection as quickly as possible without collision. For the results, the training processes of MA-PPO and PPO are shown and compared to illustrate our improvement over PPO, and we also visualize the behavior of the policy at the start and at the end of training in simulation to show what the trained policy has learned.
B. Implementation details

[Fig. 5: Overall architecture of the algorithm — the worker samples data from the simulation and the kinematics model (e.g. v = v0 + at, d = d0 + v0 t + at²/2) with the current policy, and the MA-PPO learner updates the value network (MSE loss) and the policy network (clipped loss) from these data.]

We employ multilayer perceptrons with two hidden layers as the approximate functions of the actor and the critic. Both the actor and the critic have 128 units in each hidden layer; the actor has 16 output units, i.e. the mean and standard deviation of a Gaussian distribution for each vehicle, while the critic has only one output unit for the state value. Note that the actor and critic networks have no shared parameters. We use Adam as the optimizer. For MA-PPO, we collect T = 2048 transitions per iteration and update with minibatch size B = 64 for U = 10 epochs. For the model simulation loop, we set M = 1. Besides, we train both PPO and MA-PPO under 5 random seeds to eliminate the impact of randomness. The complete parameter setting is listed in Table III.
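The description above corresponds to an actor-critic pair like the sketch below (PyTorch assumed); the Tanh activations and the log-standard-deviation parameterization are our choices, not specified in the paper.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, state_dim=16, n_vehicles=8, hidden=128):
        super().__init__()
        # Actor: 16 outputs, i.e. a Gaussian mean and (log) std per vehicle.
        self.actor = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 2 * n_vehicles))
        # Critic: one state-value output; no parameters shared with the actor.
        self.critic = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1))

    def forward(self, s):
        mean, log_std = self.actor(s).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mean, log_std.exp())
        return dist, self.critic(s).squeeze(-1)
```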
TABLE III
Hyperparameters of the experiment

Parameter                 Value
Discount factor γ         0.99
λ                         0.95
Clip range ε              0.2
Total timesteps T_total   5 × 10^7
Inner iteration M         1
Seed number               5
Batch size T              2048
Minibatch size B          64
Epoch U                   10
Learning rate             0.0003 → 0
Hidden layer number       2
Hidden units number       128
Adam                      ε = 10^-5, β1 = 0.9, β2 = 0.999

We use parallel workers to improve exploration, stabilize the policy gradient, and thereby speed up the learning process. Concretely, 16 parallel workers learn simultaneously. In each iteration, every worker interacts with the environment and collects a batch of 2048 timesteps, then takes the first minibatch to calculate a local gradient; the global gradient is obtained by averaging the local gradients of all workers. Each worker updates its parameters with this global gradient in Adam, then takes the next minibatch, and so on.
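A minimal sketch of the synchronous gradient averaging described above; representing gradients as lists of arrays is a simplification of whatever communication layer is actually used.

```python
import numpy as np

def average_gradients(local_grads):
    # local_grads: one list of per-parameter gradient arrays per worker,
    # all with identical shapes. Returns the element-wise average.
    n_workers = len(local_grads)
    return [sum(worker[i] for worker in local_grads) / n_workers
            for i in range(len(local_grads[0]))]

# Example with 16 workers and two parameter tensors each:
grads = [[np.ones((128, 16)), np.ones(128)] for _ in range(16)]
avg = average_gradients(grads)   # two arrays, each filled with 1.0
```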
C. Results and discussion
In this section, we show the performance of our algorithm at the intersection and analyze the empirical results.
[Fig. 6: Training processes of MA-PPO and PPO — (a) mean episode reward and (b) mean episode length versus training iteration.]
Fig. 6a shows the mean episode reward of MA-PPO and PPO during the training process. Both MA-PPO and PPO reach the highest reward of about 50, which means that all 8 vehicles pass through the intersection successfully. Compared with PPO, MA-PPO converges at around 500 iterations, while the PPO algorithm needs nearly 1000 iterations, which shows that MA-PPO converges about twice as fast as PPO.

Fig. 6b shows the change of the mean episode length during the training process. The episode lengths of both MA-PPO and PPO first increase rapidly and then decrease to a similar value. This can be explained as follows: at the beginning, the temporary policy mainly focuses on how to avoid collisions, because a collision incurs a large negative reward. At that stage, one reasonable policy is to let the vehicles with no conflicting trajectories pass through the intersection, such as RL and LD, or DR and UD. Meanwhile, other vehicles have to wait until the next collision-free chance, which leads to long episode lengths. However, such a policy is too conservative and suffers from poor efficiency, because every step carries a negative reward of -1. Therefore, the policy subsequently optimizes this process to avoid long waiting times, leading to the decrease of the mean episode length. Besides, MA-PPO also converges faster than the original PPO in terms of mean episode length, showing the same trend as Fig. 6a.
[Fig. 7: Example of a colliding episode in the experiment, including (a) the episode with collision, and the (b) distance to the intersection center, (c) velocity and (d) acceleration of VEH1(DR)–VEH8(UD) over the whole episode.]
Fig. 7 visualizes one episode in the 20th iteration of the training process. At this stage, VEH5 (mode: LD) has passed through the intersection successfully; however, VEH4 (mode: RL) and VEH8 (mode: UD) collide at the last step of the episode. From Fig. 7(b) we can see that all 8 vehicles are approaching the center of the intersection; however, almost none of them decelerates to avoid the collision except VEH2. During the last few steps before the collision, the velocities of VEH4 and VEH8 maintain their trends without significant change. Besides, VEH2 recognizes the potential front collision, and the policy begins to control its acceleration to avoid another collision. In this episode, the agent finally receives a reward of 10 because of the successful pass of VEH2. We can conclude that at this point the learned policy cannot coordinate all vehicles successfully, and some vehicles such as VEH4 and VEH8 have not yet learned an effective policy to handle this intersection traffic situation.
Fig. 8 shows a successful example in which the central decision agent has learned a good policy after 1000 iterations of training. At this stage, VEH3 (mode: RU) has passed through the intersection successfully. VEH8 slows down from the beginning to step 26 to wait for VEH7 to turn right. Besides, VEH2 (mode: DL) has to wait for the pass of VEH7, which is closer to the center of the intersection. Also, VEH4 (mode: RL) has to wait for VEH2 to complete its turn and decreases its acceleration. The agent learned a human-like policy, detecting potential collisions according to the distances to the center of the intersection and assigning the order in which vehicles pass through. One reasonable explanation for why VEH8 has to wait and pass last is that it is farther from the center of the intersection than any other vehicle, as shown in Fig. 8(b). After step 26, when VEH7, VEH2 and VEH4 have passed the central area of the intersection, VEH8 starts to speed up and then passes the intersection successfully. As shown in Fig. 8(d), the acceleration of VEH8 remains between -2 and -1 from the initial time to time-step 26, and then it accelerates rapidly after time-step 26, which also illustrates that VEH8 learned a waiting policy to avoid collision.
[Fig. 8: Successful example in which all vehicles pass the intersection, including (a) the collision-free episode, and the (b) distance to the intersection center, (c) velocity and (d) acceleration of VEH1(DR)–VEH8(UD) over the whole episode.]
VEH5 and VEH6 have similar velocity curves, both of which show large changes in velocity. At the beginning, they slow down and keep a low velocity until VEH2, VEH3 and VEH4 have passed the intersection. After time-step 25, both of them begin to speed up and pass the intersection, because with no potential collision around the intersection area, a larger acceleration reduces the accumulated negative step reward. On the other hand, the velocity curves of VEH2, VEH3 and VEH4 demonstrate that they learned to speed up so that they could pass the intersection quickly. From Fig. 8(b), compared with the other vehicles, the distances of VEH5 and VEH6 decrease slowly at first; after time-step 25 the distance curves become steeper, which also shows that VEH5 and VEH6 learned a waiting policy to avoid collisions.

In conclusion, the results show that RL-based control can handle the intersection situation with multiple vehicles, not only ensuring collision avoidance but also improving passing efficiency. Unlike hand-crafted rules applied to intersection control, our algorithm coordinates vehicles from different directions according to their velocities and distances to the center of the intersection. Our reinforcement-learning-based method is expected to show greater advantages when there are more vehicles, where hand-crafted rules may not work or may have difficulty finding the optimal coordination of all vehicles. Besides, we use a model to accelerate the learning process and obtain a good speed-up, which shows the importance of a prior model in the learning algorithm.
VI. CONCLUSION
In this paper, we employ a reinforcement learning method to solve centralized conflict-free cooperation for connected and automated vehicles at intersections, which has long been regarded as a challenging problem due to its large scale and high dimensionality. We use the PPO algorithm as our baseline, which has state-of-the-art performance on several benchmarks, and we propose MA-PPO to enhance sample efficiency and speed up the learning process. A typical 4-direction intersection containing 8 different vehicle modes is studied. We find that our method is more efficient than PPO, and the learned driving policy shows intelligent behaviors that increase driving safety and traffic efficiency, which indicates that RL is promising for centralized cooperative driving at intersections.
VII. ACKNOWLEDGMENTS

This work is partially supported by the International Science & Technology Cooperation Program of China under 2016YFE0102200. Special thanks should be given to TOYOTA for funding this study. We would like to acknowledge Mr. Jingliang Duan and Mr. Zhengyu Liu for their valuable suggestions throughout this research.
REFERENCES
[1] S. E. Li, S. Xu, X. Huang, B. Cheng, and H. Peng, “Eco-departure of
connected vehicles with v2x communication at signalized intersections,”
IEEE Transactions on Vehicular Technology, vol. 64, no. 12, pp. 5439–
5449, 2015.
[2] Y. Wu, S. E. Li, J. Cortés, and K. Poolla, “Distributed sliding mode con-
trol for nonlinear heterogeneous platoon systems with positive definite
topologies,” IEEE Transactions on Control Systems Technology, 2019.
[3] F. Gao, X. Hu, S. E. Li, K. Li, and Q. Sun, “Distributed adaptive sliding
mode control of vehicular platoon with uncertain interaction topology,
IEEE Transactions on Industrial Electronics, vol. 65, no. 8, pp. 6352–
6361, 2018.
[4] S. E. Li, X. Qin, K. Li, J. Wang, and B. Xie, “Robustness analysis and
controller synthesis of homogeneous vehicular platoons with bounded
parameter uncertainty, IEEE/ASME Transactions on Mechatronics,
vol. 22, no. 2, pp. 1014–1025, 2017.
[5] N. J. Goodall, B. L. Smith, and B. Park, “Traffic signal control with
connected vehicles,” Transportation Research Record, vol. 2381, no. 1,
pp. 65–72, 2013.
[6] Y. Feng, K. L. Head, S. Khoshmagham, and M. Zamanipour, “A real-
time adaptive signal control in a connected vehicle environment, Trans-
portation Research Part C: Emerging Technologies, vol. 55, pp. 460–
473, 2015.
[7] K. Dresner and P. Stone, “A multiagent approach to autonomous
intersection management,” Journal of artificial intelligence research,
vol. 31, pp. 591–656, 2008.
[8] J. Lee and B. Park, “Development and evaluation of a cooperative
vehicle intersection control algorithm under the connected vehicles
environment, IEEE Transactions on Intelligent Transportation Systems,
vol. 13, no. 1, pp. 81–90, 2012.
[9] P. Dai, K. Liu, Q. Zhuge, E. H.-M. Sha, V. C. S. Lee, and S. H.
Son, “Quality-of-experience-oriented autonomous intersection control in
vehicular networks, IEEE Transactions on Intelligent Transportation
Systems, vol. 17, no. 7, pp. 1956–1967, 2016.
[10] M. Ahmane, A. Abbas-Turki, F. Perronnet, J. Wu, A. El Moudni,
J. Buisson, and R. Zeo, “Modeling and controlling an isolated urban
intersection based on cooperative vehicles, Transportation Research
Part C: Emerging Technologies, vol. 28, pp. 44–62, 2013.
[11] B. Xu, S. E. Li, Y. Bian, S. Li, X. J. Ban, J. Wang, and K. Li, “Distributed
conflict-free cooperation for multiple connected vehicles at unsignalized
intersections,” Transportation Research Part C: Emerging Technologies,
vol. 93, pp. 322–334, 2018.
[12] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.
MIT press, 2018.
[13] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G.
Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski,
S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran,
D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through
deep reinforcement learning.,” Nature, vol. 518, no. 7540, pp. 529–33,
2015.
[14] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van
Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam,
M. Lanctot, et al., “Mastering the game of go with deep neural networks
and tree search,” nature, vol. 529, no. 7587, p. 484, 2016.
[15] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang,
A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al., “Mastering
the game of go without human knowledge, Nature, vol. 550, no. 7676,
p. 354, 2017.
[16] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa,
D. Silver, and D. Wierstra, “Continuous control with deep reinforcement
learning,” arXiv preprint arXiv:1509.02971, 2015.
[17] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust
region policy optimization, in International Conference on Machine
Learning, pp. 1889–1897, 2015.
[18] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox-
imal policy optimization algorithms,” arXiv preprint arXiv:1707.06347,
2017.
[19] P. Wolf, C. Hubschneider, M. Weber, A. Bauer, J. Härtl, F. Dürr, and
J. M. Zöllner, “Learning how to drive in a real world simulation with
deep q-networks,” in 2017 IEEE Intelligent Vehicles Symposium (IV),
pp. 244–250, IEEE, 2017.
[20] Z. Ruiming, L. Chengju, and C. Qijun, “End-to-end control of kart
agent with deep reinforcement learning,” in 2018 IEEE International
Conference on Robotics and Biomimetics (ROBIO), pp. 1688–1693,
IEEE, 2018.
[21] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley,
D. Silver, and K. Kavukcuoglu, Asynchronous methods for deep rein-
forcement learning,” in International conference on machine learning,
pp. 1928–1937, 2016.
[22] M. Jaritz, R. De Charette, M. Toromanoff, E. Perot, and F. Nashashibi,
“End-to-end race driving with deep reinforcement learning, in 2018
IEEE International Conference on Robotics and Automation (ICRA),
pp. 2070–2075, IEEE, 2018.
[23] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “Carla:
An open urban driving simulator, arXiv preprint arXiv:1711.03938,
2017.
[24] J. Duan, S. E. Li, Y. Guan, Q. Sun, and B. Cheng, “Hierarchical
reinforcement learning for self-driving decision-making without reliance
on labeled driving data,” IET Intelligent Transport Systems, 2019.
[25] A. Kendall, J. Hawke, D. Janz, P. Mazur, D. Reda, J.-M. Allen, V.-D.
Lam, A. Bewley, and A. Shah, “Learning to Drive in a Day,” 2018.
[26] X. Xiong, J. Wang, F. Zhang, and K. Li, “Combining deep reinforce-
ment learning and safety based control for autonomous driving,” arXiv
preprint arXiv:1612.00147, 2016.
[27] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gra-
dient methods for reinforcement learning with function approximation,”
in Advances in neural information processing systems, pp. 1057–1063,
2000.
[28] P. Thomas, “Bias in natural actor-critic algorithms, in International
Conference on Machine Learning, pp. 441–448, 2014.
[29] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-
dimensional continuous control using generalized advantage estimation,”
arXiv preprint arXiv:1506.02438, 2015.
[30] S. M. Kakade, “A natural policy gradient, in Advances in neural
information processing systems, pp. 1531–1538, 2002.
[31] M. Deisenroth and C. E. Rasmussen, “Pilco: A model-based and
data-efficient approach to policy search, in Proceedings of the 28th
International Conference on machine learning (ICML-11), pp. 465–472,
2011.
[32] I. Grondman, “Online model learning algorithms for actor-critic control,”
2015.
[33] N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa,
“Learning continuous control policies by stochastic value gradients, in
Advances in Neural Information Processing Systems, pp. 2944–2952,
2015.
[34] T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel,
“Model-ensemble trust-region policy optimization, arXiv preprint
arXiv:1802.10592, 2018.
[35] V. Feinberg, A. Wan, I. Stoica, M. I. Jordan, J. E. Gonzalez, and
S. Levine, “Model-based value estimation for efficient model-free re-
inforcement learning,” arXiv preprint arXiv:1803.00101, 2018.