A Reinforcement Learning Benchmark for
Autonomous Driving in General
Urban Scenarios
Yuxuan Jiang , Guojian Zhan , Zhiqian Lan , Chang Liu, Bo Cheng,
and Shengbo Eben Li , Senior Member, IEEE
Abstract Reinforcement learning (RL) has gained significant
interest for its potential to improve decision and control in
autonomous driving. However, current approaches have yet
to demonstrate sufficient scenario generality and observation
generality, hindering their wider utilization. To address these
limitations, we propose a unified benchmark simulator for RL
algorithms (called IDSim) to facilitate decision and control
for high-level autonomous driving, with emphasis on diverse
scenarios and a unified observation interface. IDSim is composed
of a scenario library and a simulation engine, and is designed
with execution efficiency and determinism in mind. The scenario
library covers common urban scenarios, with automated random
generation of road structure and traffic flow, and the simulation
engine operates on the generated scenarios with dynamic interac-
tion support. We conduct four groups of benchmark experiments
with five common RL algorithms and focus on challenging
signalized intersection scenarios with varying conditions. The
results showcase the reliability of the simulator and reveal
its potential to improve the generality of RL algorithms. Our
analysis suggests that multi-task learning and observation design
are potential areas for further algorithm improvement.
Index Terms Autonomous driving, benchmark simulator,
reinforcement learning.
I. INTRODUCTION
AUTONOMOUS driving has garnered significant attention
from both academia and industry due to its potential
to improve travel efficiency, reduce accidents, and alleviate
the burden on human drivers [1]. At the core of achieving
high-level driving intelligence is the decision-making process.
While there have been attempts to replicate human driving
Manuscript received 20 February 2023; revised 22 August 2023;
accepted 30 October 2023. The work of Shengbo Eben Li was supported in
part by the National Key Research and Development Program of China under
Grant 2022YFB2502901, in part by NSF China under Grant 52221005, in part
by the Tsinghua University Initiative Scientific Research Program, and in part
by the Tsinghua University-Toyota Joint Research Center for AI Technology
of Automated Vehicle. The Associate Editor for this article was H. Jula.
(Yuxuan Jiang and Guojian Zhan contributed equally to this work.)
(Corresponding author: Shengbo Eben Li.)
Yuxuan Jiang, Guojian Zhan, Zhiqian Lan, Bo Cheng, and Shengbo Eben Li
are with the State Key Laboratory of Automotive Safety and Energy,
School of Vehicle and Mobility, Tsinghua University, Beijing 100084,
China (e-mail: jyx21@mails.tsinghua.edu.cn; zgj21@mails.tsinghua.edu.cn;
lanzq21@mails.tsinghua.edu.cn; chengbo@tsinghua.edu.cn; lishbo@
tsinghua.edu.cn).
Chang Liu is with the Department of Advanced Manufacturing and
Robotics, College of Engineering, Peking University, Beijing 100084, China
(e-mail: changliucoe@pku.edu.cn).
This article has supplementary downloadable material available at
https://doi.org/10.1109/TITS.2023.3329823, provided by the authors.
Digital Object Identifier 10.1109/TITS.2023.3329823
behaviors by imitating expert data, obtaining a large quan-
tity of high-quality data can be expensive and potentially
infeasible. With the capacity to self-evolve through trial-and-
error independent of reliance on external labeled data [2],
reinforcement learning (RL) has emerged as a promising
solution for achieving real-time, high-accuracy decision and
control of autonomous vehicles [3].
Recently, RL has been applied to certain driving tasks and scenarios. Lillicrap et al. [4] proposed the DDPG algorithm and realized lane keeping on a race track in the TORCS simulator, reporting comparable performance between low-dimensional and pixel observations. Duan et al. [5] realized decision and control on a two-lane highway using hierarchical RL, designing complex reward functions for its high-level maneuver selection and three low-level maneuvers. Leveraging the structure of the scenario, it selects one leading and one following vehicle on each lane, then uses their longitudinal distances to the ego vehicle to represent surrounding information. Chen et al. [6] focused on the roundabout scenario, designing a specific reward function to achieve safe and comfortable driving using several model-free RL algorithms. This work proposed latent state encoding with bird-view images to decrease sample complexity as compared to using front-view images. Guan et al. [7] proposed the integrated decision and control framework to handle the crossroad scenario and demonstrated its effectiveness through real-world experiments. For surrounding information, this work selects two surrounding vehicles for each conflicting connection and observes their positions and velocities relative to the center of the intersection.
There are two major concerns regarding the applicability of
the aforementioned works. One is the lack of scenario gener-
ality. Most of these works focus on a single specific scenario
and require the policy to be retrained even for slight changes in
the environment. It is necessary to aim for a universal policy
that can handle general urban scenarios in order to achieve
high-level autonomous driving. Another concern is the lack
of observation generality. Some of the works use specialized
observation, such as filtering surrounding vehicles to observe
based on rules adapted to two-lane highway, which may have
advantage on the targeted scenario but be inapplicable in other
scenarios. In fact, these two issues relate to the policy gener-
ality from structural level and performance level, respectively.
The observation generality implies that the policy input is
general, that is, the same form of observation can be used in
any working condition and contains sufficient information; the
TABLE I
OPEN-SOURCE AUTONOMOUS DRIVING SIMULATOR
scenario generality means that the performance of the policy
is strong enough to conquer general scenarios. To achieve
high-level driving intelligence, the observation generality is
the fundamental condition, while the scenario generality is the
ultimate goal. Handling these two issues at the algorithm side
is tedious and makes fair performance comparison difficult.
In light of this, we propose a unified benchmark simulator,
Intelligent Driving Simulator (IDSim), for RL algorithms to
facilitate decision and control for high-level autonomous driv-
ing. The emphasis is placed on diverse scenarios and a unified
observation interface. IDSim addresses scenario generality by
providing sufficient diversity at the road structure level and
traffic flow level. Urban roads can have complex and varied
layouts, including multi-lane roads, intersections, roundabouts,
etc. Traffic participants (vehicles, cyclists, and pedestrians)
can be significantly different from each other, each with
unique geometries, dynamics, and behavioral patterns (e.g.,
conservative or aggressive). Thus, the generality of a policy can be validated by training and testing on two distinct sets of scenarios. Regarding observation generality, IDSim
provides sufficient information for the ego vehicle to make
optimal decisions in a unified style. The ego vehicle’s states,
road structure, traffic participants’ states, and other dynamic
traffic information can be easily composed and flattened to
a vector as policy input. Based on the composable and
customizable configuration support, we provide several general
observation designs. Different observation configurations can
have significant impact on policy performance, and relevant
experiments will be presented in the benchmark experiments
section.
Systematically, IDSim is composed of two main parts: a
scenario library and a simulation engine, in accordance with
the integrated decision and control (IDC) framework [7]. The
scenario library achieves automated random generation of road structure and traffic flow covering common urban scenarios, such as multi-lane roads, intersections, and roundabouts. The
simulation engine operates on a set of generated scenarios
and provides dynamic interaction support. It is built on top of
Eclipse SUMO [12], a microscopic traffic simulation package,
for traffic flow simulation. Lastly, IDSim is designed with
execution efficiency and determinism in mind. We incorporate
careful profiling and optimization to accelerate RL training,
and regularly run tests to guarantee that simulation results are exactly reproducible with a fixed seed and input.
Our key contributions include:
(1) We propose a unified, lightweight benchmark simulator
called IDSim, with the standard OpenAI Gym [13] interface that is widely adopted by popular RL training frameworks such
as RLlib [14], Tianshou [15], and GOPS [16]. The simulator
concentrates on general urban scenarios and is compatible
with the IDC framework designed for high-level autonomous
driving.
(2) We address scenario generality by generating diverse
scenarios as training or test environments. Specifically,
we offer diversity at the road structure level and traffic flow
level. The former encompasses general urban scenarios includ-
ing multi-lane roads, intersections, and roundabouts, while the
latter considers various types of traffic participants such as
vehicles, cyclists, and pedestrians.
(3) We focus on observation generality and provide a unified
observation interface under all supported scenarios. The ego
vehicle’s states, navigation information, road structure, traffic
participants’ states, and other relevant dynamic traffic infor-
mation can be easily composed and flattened as policy input.
Besides, we provide several observation designs which can be
used in general driving conditions.
(4) We conduct four groups of experiments with five com-
mon RL algorithms to demonstrate the functionality of our
benchmark simulator, with a particular focus on diverse sig-
nalized intersection scenarios. The results show that different
scenario settings have a strong impact on algorithm perfor-
mance and verify the generality and reliability of the simulator.
We further point out several directions of interest, such as
multi-task learning and observation design, for algorithms to
improve upon.
II. RELATED WORKS
A proper RL environment is critical to the application of
RL algorithms. In general, the interaction between an RL
agent and the environment either happens in real world or in
simulation. However, trial-and-error in the real world implies
economic costs, security risks, and low sample efficiency.
Thus, it has become an inevitable choice to employ a simulator
to collect samples for prototyping and validation of algorithms.
Recently, various open-source autonomous driving simu-
lators have emerged and can serve as RL environments.
A detailed and selective comparison with our proposed IDSim
is illustrated in Table I. TORCS [8] was initially developed as
a 3D racing simulation game. Thanks to its embedded vehicle
dynamics, high efficiency, and open-source nature, it has been
repurposed for autonomous driving simulation, yet the scope
is typically limited to basic lane-keeping tasks. CARLA [9] is
one of the most well-known open-source autonomous driving
simulators currently available. Many researchers have made
customizations to it for RL integration, such as AirSim [17], SUMMIT [18], and MACAD [19]. CARLA's use of the Unreal
Engine enables photorealistic rendering, but makes simulation
more resource-consuming and significantly complicates the
process of automated scenario generation. SMARTS [10]
is a multi-agent autonomous driving simulator, focusing on
realistic and diverse interactions. It is developed around a typical set of urban driving conditions, such as lane merging at a ramp
and unprotected left turn at a four-way intersection. IDSim,
instead, puts more focus on scenario generality and provides
more diversity through randomized generation. MetaDrive [11]
is a driving simulation platform with the explicit focus on
RL’s generalizability issue. It designs a procedural generation
method to unlimitedly produce new scenarios by randomly
sampling and splicing a bunch of built-in blocks. IDSim
employs a different approach by generating a skeleton then
applying perturbations, which allows for more fine-grained diversity compared to MetaDrive's approach, where a block is the smallest unit. Besides, MetaDrive's traffic flow mod-
ule is developed from scratch, which lacks several desirable
properties, including cooperation, periodic spawning and cus-
tomizability. IDSim, on the other hand, integrates SUMO as
the traffic flow backend, making traffic flow configuration
more friendly to work with. Finally, the environment model is
an extra feature of IDSim, which facilitates model-based RL
algorithms and planning-based methods to train and evaluate
on our platform.
III. PRELIMINARIES
A. Reinforcement Learning
Reinforcement learning (RL) considers a Markov Decision Process (MDP), where the optimal action depends only on the current state [2]. More specifically, at each timestep $t$, the agent takes an action $a_t \in \mathcal{A}$ according to the current state $s_t \in \mathcal{S}$ and the policy $\pi: \mathcal{S} \to \mathcal{A}$; the environment then transits to the next state according to the environment dynamics, i.e., $s_{t+1} = f(s_t, a_t)$, and feeds back a scalar reward signal $r_t$. The value function $v^{\pi}: \mathcal{S} \to \mathbb{R}$ is defined as the expected sum of rewards of policy $\pi$ given initial state $s$.
Actor-Critic is the most popular structure of RL algorithms. It simultaneously learns a value function $V_w$ parameterized by $w$, called the critic, to approximate the true value function, and a policy $\pi_\theta$ parameterized by $\theta$, called the actor, to maximize the expected sum of rewards with the help of the critic [20].
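For concreteness, these quantities can be written as follows (a standard formulation; the discount factor $\gamma$ is an assumption, since the text does not state whether the return is discounted):
$$v^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s,\ a_t = \pi(s_t),\ s_{t+1} = f(s_t, a_t)\right],$$
$$\min_{w}\ \mathbb{E}_{s}\left[\left(V_w(s) - v^{\pi_\theta}(s)\right)^{2}\right], \qquad \max_{\theta}\ \mathbb{E}_{s}\left[v^{\pi_\theta}(s)\right],$$
where the critic $V_w$ is regressed toward the value of the current policy and the actor $\pi_\theta$ is updated to maximize that value with the help of the critic.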
Safe RL is a subfield of RL that focuses on the safety of an agent, which is especially meaningful for real-world deployment [21]. Safe RL seeks to ensure that the agent
operates within the boundaries imposed by constraints to
satisfy safety requirements, while maintaining reasonable
performance. To achieve this, various methods that incorpo-
rate penalties [22], Lagrangian multipliers [23], and energy
functions [24] have been proposed to minimize constraint
violations during training and deployment.
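As a sketch of the Lagrangian approach mentioned above (the notation is assumed here: $J_r$ and $J_c$ denote the expected return and expected cost of $\pi_\theta$, and $d$ is the cost limit):
$$\max_{\theta}\ J_r(\pi_\theta) \quad \text{s.t.} \quad J_c(\pi_\theta) \le d, \qquad \mathcal{L}(\theta, \lambda) = J_r(\pi_\theta) - \lambda\left(J_c(\pi_\theta) - d\right),$$
which is typically solved by alternating a policy update on $\theta$ with a multiplier update $\lambda \leftarrow \max\!\left(0,\ \lambda + \alpha_{\lambda}\left(J_c(\pi_\theta) - d\right)\right)$.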
B. Integrated Decision and Control Framework
The integrated decision and control (IDC) framework is an autonomous driving decision and control framework tailored for RL algorithms. The IDC framework consists of two essential modules: the static path planner and the dynamic optimal tracker, which subtly correspond to the critic and actor of Actor-Critic RL algorithms [7]. The static path planner generates the candidate path set $\Pi$, which only considers static traffic information that is irrelevant to traffic participants' behaviors. The path set can be constructed by prior knowledge with consideration of the road structure. Then, the dynamic optimal tracker is where RL algorithms come in. In detail, the critic learns to approximate the tracking cost of each candidate path in the path set, and the actor learns to track each path while assuring safety by meeting the constraints. The optimal critic $v^*$ and actor $\pi^*$ are approximated by $V_w$ and $\pi_\theta$ through offline training. During online application, $V_w$ selects the path with the lowest cost from $\Pi$, and $\pi_\theta$ tracks this path to output the control commands. The IDC framework can incorporate RL in a more efficient and interpretable way compared to end-to-end training. Thanks to the inherited advantages of RL, the IDC framework can cast off tedious human designs and is promising for improving driving intelligence.
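To make the online IDC procedure concrete, a minimal sketch is given below; the function and argument names (candidate_paths, critic, actor) are illustrative assumptions, not the actual IDSim or GOPS API.

```python
import numpy as np

def idc_step(obs, candidate_paths, critic, actor):
    """One online step of the IDC framework (illustrative sketch).

    candidate_paths: list of static candidate paths from the path planner.
    critic(obs, path) -> scalar tracking cost approximated by V_w.
    actor(obs, path)  -> control command [acceleration, steering] from pi_theta.
    """
    # The critic evaluates the tracking cost of every candidate path.
    costs = np.array([critic(obs, path) for path in candidate_paths])
    # The path with the lowest cost is selected for execution.
    best_path = candidate_paths[int(np.argmin(costs))]
    # The actor tracks the selected path and outputs the control command.
    return actor(obs, best_path)
```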
IV. IDSIM SIMULATOR
A. Overview
The IDSim simulator mainly consists of two parts, a sce-
nario library and a simulation engine, supporting the static
path planner module and the dynamic optimal tracker module
in the IDC framework, respectively. The scenario library creates a
set of scenarios with configurable road structure, traffic light
layouts and traffic flow characteristics. The simulation engine
samples scenarios from the pre-generated set and provides
interactive simulation support. It is also possible to use these
two parts in conjunction to generate scenarios on the fly, i.e.,
provide a unique scenario for each new simulation. The overall
architecture is shown in Fig. 1.
B. Scenario Library
The scenario library helps scenario generality by providing
sufficient randomness in scenario generation. To obtain a new
scenario in IDSim, there are three required procedures: gener-
ating road structure, planning static paths, and characterizing
traffic flow.
IDSim is capable of generating randomized road structure
by establishing basic structures and applying randomness to
the positions and topological relationship of junction nodes
and road edges. A demonstration of random segments can be
found in Fig. 2. IDSim also supports importation of arbitrary
custom maps compatible with SUMO’s network format, which
means real-world data in formats such as OpenStreetMap and
OpenDRIVE, or maps created with NetEdit can be directly
integrated. Fig. 3 shows an example map imported from Tsinghua University and its surroundings, and four typical parts are cut out for manual editing. In addition, IDSim implements
various checks to reject malformed maps that can cause trouble
in subsequent procedures or simulations.
To plan static paths, IDSim utilizes a general method
outlined in [25]. The center line of each lane is reused as
Fig. 1. The overall architecture of IDSim, consisting of two parts: a scenario library and a simulation engine. The scenario library creates a set of scenarios
with flexible configuration. For each scenario, the library either generates or imports the road structure, plans static path and deploys traffic flow. The simulation
engine samples one scenario from the pre-generated set for each simulation and provides dynamic interaction support. The scenario manager, traffic manager
and agent manager share a common simulation context and support different aspects of simulation.
Fig. 2. Automated generation of road structure based on a standard junction.
Randomness is applied to edge angles, lane numbers and edge numbers.
a static path segment, while cubic Bézier curves connect the
entrance lane and exit lane of different types of connections,
as depicted in Figure 4. These two types of path segments can
be assembled on demand into a static path that corresponds to
a vehicle’s global route.
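As an illustration of the connection segments, a cubic Bézier curve between an entrance-lane endpoint and an exit-lane start point can be evaluated as below; the choice of control points is a plausible assumption rather than IDSim's exact planner.

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=50):
    """Sample a cubic Bezier curve defined by 2D control points p0..p3."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

# Example: connect the end of an entrance lane to the start of an exit lane,
# placing the middle control points along the lane headings (assumed heuristic).
entry_end, entry_heading = np.array([0.0, 0.0]), np.array([1.0, 0.0])
exit_start, exit_heading = np.array([20.0, 15.0]), np.array([0.0, 1.0])
d = 0.4 * np.linalg.norm(exit_start - entry_end)  # control-point offset
path = cubic_bezier(entry_end, entry_end + d * entry_heading,
                    exit_start - d * exit_heading, exit_start)
```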
Finally, IDSim inherits the configurations from SUMO to
characterize traffic flow, which includes models for vehicles,
cyclists, and pedestrians and allows for customization of traffic
flow characteristics such as density, maximum speed and
emergency braking deceleration. By default, IDSim assigns
randomized traffic flow to the map automatically, yet manual
customization is also supported.
Fig. 3. The map of Tsinghua University, provided as a demonstration that
can be imported into IDSim.
C. Simulation Engine
The simulation engine takes the static scenario information
from the scenario library and offers dynamic interaction sup-
port. Three main components of the engine are the scenario
manager, traffic manager, and agent manager. They share
a common context containing latest information about the
current simulation.
The scenario manager samples and loads a scenario from the
pre-generated scenario set for each simulation, then initializes
Fig. 4. Static paths generated for different types of road structure in common
urban scenarios.
Fig. 5. 3-DOF non-linear vehicle dynamic model.
the underlying SUMO session with the map and traffic flow
definition. The scenario is reused for a configurable number of episodes until the current simulation terminates, after which a new scenario is resampled.
Based on the loaded scenario, the traffic manager randomly
takes over an ego vehicle from the traffic flow at the beginning
of each episode. At each step of the episode, the traffic
manager synchronizes traffic lights, traffic participants, and
other traffic elements between SUMO and the simulation
context through Libsumo API [12]. For each ego vehicle agent,
the traffic manager provides static paths based on its global
route and subscribes to its surrounding traffic information for
perception simulation.
The agent manager is the main component for an external
policy to interact with. At each step of the episode, it takes
action from the policy, updates the ego vehicle’s dynamic
states, and computes the new observation from updated simu-
lation context. The dynamics of the ego vehicle is a classic 3-DOF bicycle model widely adopted in vehicle control, with a 6-dimensional state and a 2-dimensional action [26]:
$$x = \begin{bmatrix} p_x & p_y & v_x & v_y & \phi & \omega \end{bmatrix}^{\mathsf{T}}, \quad u = \begin{bmatrix} a & \delta \end{bmatrix}^{\mathsf{T}},$$
where $p_x$, $p_y$ are the ground coordinates of the ego vehicle's center of gravity (CG), $v_x$, $v_y$ are the longitudinal and lateral velocities, $\phi$ is the heading angle, $\omega$ is the yaw rate, $a$ is the acceleration command, and $\delta$ is the front wheel angle. The state space equation is
$$x' = F(x, u) = \begin{bmatrix}
p_x + \Delta t \, (v_x \cos\phi - v_y \sin\phi) \\
p_y + \Delta t \, (v_x \sin\phi + v_y \cos\phi) \\
v_x + \Delta t \, (a + v_y \omega) \\
\dfrac{m v_x v_y + \Delta t \left[ (l_f k_f - l_r k_r)\,\omega - k_f \delta v_x - m v_x^2 \omega \right]}{m v_x - \Delta t \, (k_f + k_r)} \\
\phi + \Delta t \, \omega \\
\dfrac{I_z \omega v_x + \Delta t \left[ (l_f k_f - l_r k_r)\, v_y - l_f k_f \delta v_x \right]}{I_z v_x - \Delta t \, (l_f^2 k_f + l_r^2 k_r)}
\end{bmatrix} \quad (1)$$
where $\Delta t$ is the discrete time step, and the dynamical parameters (mass $m$, yaw inertia $I_z$, distances $l_f$, $l_r$ from the CG to the front and rear axles, and front and rear cornering stiffnesses $k_f$, $k_r$) are listed in Table IV.
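A direct transcription of (1) into code might look as follows; the parameter values below are placeholders for illustration only (the actual values and sign conventions are those of [26] and Table IV, which is not reproduced here).

```python
import numpy as np

# Placeholder dynamical parameters; the real values are listed in Table IV.
m, I_z = 1500.0, 2500.0        # mass [kg], yaw inertia [kg*m^2]
l_f, l_r = 1.2, 1.6            # CG-to-axle distances [m]
k_f, k_r = -90000.0, -90000.0  # front/rear cornering stiffness [N/rad]
dt = 0.1                       # discrete time step [s]

def bicycle_step(x, u):
    """One step of the discrete 3-DOF bicycle model in (1).
    x = [p_x, p_y, v_x, v_y, phi, omega], u = [a, delta]."""
    p_x, p_y, v_x, v_y, phi, omega = x
    a, delta = u
    return np.array([
        p_x + dt * (v_x * np.cos(phi) - v_y * np.sin(phi)),
        p_y + dt * (v_x * np.sin(phi) + v_y * np.cos(phi)),
        v_x + dt * (a + v_y * omega),
        (m * v_x * v_y + dt * ((l_f * k_f - l_r * k_r) * omega
         - k_f * delta * v_x - m * v_x ** 2 * omega))
        / (m * v_x - dt * (k_f + k_r)),
        phi + dt * omega,
        (I_z * omega * v_x + dt * ((l_f * k_f - l_r * k_r) * v_y
         - l_f * k_f * delta * v_x))
        / (I_z * v_x - dt * (l_f ** 2 * k_f + l_r ** 2 * k_r)),
    ])
```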
One focus of the agent manager is to provide a unified
observation interface under all supported scenarios to achieve
observation generality. The ego vehicle’s states, navigation
information, road structure, traffic participants’ states, and
other relevant dynamic traffic information can be easily
composed and flattened as policy input. The traffic participant observation has the special property that the number of perceived participants can vary over time. To retrieve a
fixed-dimensional vector, multiple choices are discussed and
compared in section V-A4. The observation interface is also
customizable to support more use cases, such as adding noise
in robust RL research and sensitivity analysis. The agent
manager also checks if the ego vehicle has to be removed
because either the route is completed or traffic regulations are
violated. The cause will be recorded for reward computation
and the engine will be notified to terminate the episode.
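As a rough sketch of how such a unified observation vector could be composed from these components (the component names and shapes here are illustrative assumptions, not IDSim's exact layout):

```python
import numpy as np

def compose_observation(ego_state, waypoints, lidar_edges, light_state, participants):
    """Flatten heterogeneous observation components into a single policy input.

    ego_state:    (6,)   ego vehicle state from the dynamics model
    waypoints:    (N, 2) upcoming reference points on the static path
    lidar_edges:  (B,)   distances to road edges in B angular bins
    light_state:  (K,)   encoded traffic light information
    participants: (M, F) fixed-size surrounding participant block (see Sec. V-A4)
    """
    parts = (ego_state, waypoints, lidar_edges, light_state, participants)
    return np.concatenate([np.asarray(p, dtype=np.float32).ravel() for p in parts])
```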
Each agent manager has independent states, so multiple agent managers can operate concurrently to form a multi-agent
simulation.
D. Environment Model & Auxiliary Tools
IDSim offers an environment model to facilitate the
training and evaluation of model-based RL algorithms and
planning-based methods. The environment model is a close
approximation of the simulator, consisting of ego vehicle
dynamic model, surrounding vehicle prediction model, ref-
erence trajectory model and reward model. Starting at states
collected from the simulator at certain time steps, the environ-
ment model can roll out future states given a policy or some actions. Two main features of the environment model are:
1) Differentiable. One can differentiate through the environment model and retrieve gradients of the cost function with respect to the actions.
2) Suitable for planning. All computation of the environment model depends only on its inputs. This enables multiple rollouts starting at the same state, suitable for planning purposes. In contrast, the simulator can only go forward in time; replaying a certain time step in a simulator is often costly and imperfect.
Based on this model, a built-in model predictive control (MPC)
driver is also provided for reference.
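A minimal sketch of how such an environment model can be used for planning-style rollouts follows; the names env_model.step and reward_model are assumptions standing for the ego dynamics, prediction, reference, and reward sub-models described above.

```python
import numpy as np

def rollout(env_model, reward_model, state, action_sequence):
    """Roll the environment model forward from one start state.

    Because the model is a pure function of its inputs, many candidate action
    sequences can be evaluated from the same state, which a forward-only
    simulator cannot do cheaply.
    """
    total_reward, s = 0.0, state
    for a in action_sequence:
        next_s = env_model.step(s, a)        # ego dynamics + predicted traffic
        total_reward += reward_model(s, a)   # tracking / safety reward terms
        s = next_s
    return total_reward, s

# Planning by random shooting: evaluate several candidate sequences from the
# same start state and keep the best one (a simple MPC-style use case).
def plan(env_model, reward_model, state, horizon=20, n_candidates=64, act_dim=2):
    candidates = np.random.uniform(-1.0, 1.0, size=(n_candidates, horizon, act_dim))
    returns = [rollout(env_model, reward_model, state, seq)[0] for seq in candidates]
    return candidates[int(np.argmax(returns))][0]  # first action of the best sequence
```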
In addition to the core components mentioned above, IDSim
also comes with several built-in auxiliary tools. This includes
a logger that can save simulation results and configurations,
a renderer that can provide real-time visualization and output
videos for playback, and an evaluator that can provide five-dimensional metrics: driving safety, regulatory compliance,
driving comfort, travel efficiency and energy efficiency to test
the driving performance of the trained policy systematically.
E. Efficiency & Determinism
The sample efficiency of RL algorithms has long been deemed unsatisfactory. On-policy algorithms, like proximal policy optimization (PPO) [27], employ a parallel sampling technique to speed up the training process. Each worker instantiates and samples from an independent environment, where the environment's execution and RAM efficiency become
crucial to scale up. IDSim is designed with efficiency in
mind. We incorporate careful profiling to locate inefficient
and bottleneck code segments and use various optimization
to accelerate execution and reduce memory footprint. As a
simple experiment, it takes an average of 0.915 ms per step and a maximum resident set size of 136.8 MiB to roll out an environment with a set of 100 scenarios for $2 \times 10^4$ steps on a single thread. With 10 parallel workers, training a PPO agent for $5 \times 10^6$ steps finishes in under 40 minutes. Both experiments above run on an AMD Ryzen Threadripper 3960X CPU, with no GPU involvement.
A proper RL environment shall be fully deterministic with
fixed seed and input. Failure to meet this requirement commonly results from inconsistent PRNG (pseudo random number generator) usage. For instance, MetaDrive is not fully deterministic out of the box and has to be subtly tweaked for reproducible RL training.¹ IDSim ensures that all sources of randomness are controlled through a single master seed. We regularly run integration tests (against a deterministic policy) and end-to-end tests (against RL frameworks) to guarantee that simulation results are exactly reproducible.
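The determinism property can be checked with a standard Gym-style loop; the environment constructor below is a hypothetical placeholder for however IDSim is instantiated, since the exact entry point is not given in the text.

```python
import numpy as np

def rollout_checksum(make_env, seed, n_steps=1000):
    """Run a fixed policy in a seeded environment and hash the trajectory."""
    env = make_env(seed=seed)          # hypothetical constructor; all randomness
    obs = env.reset()                  # inside the env derives from this one seed
    trace = []
    for _ in range(n_steps):
        action = np.zeros(env.action_space.shape)   # deterministic dummy policy
        obs, reward, done, info = env.step(action)  # classic Gym step signature
        trace.append((float(np.asarray(obs).sum()), float(reward), bool(done)))
        if done:
            obs = env.reset()
    return hash(tuple(trace))

# Two runs with the same master seed should produce identical trajectories, e.g.:
# assert rollout_checksum(make_idsim_env, seed=42) == rollout_checksum(make_idsim_env, seed=42)
```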
V. BENCHMARK EXPERIMENTS
We benchmark multiple algorithms in four groups of tasks,
to demonstrate the generality and reliability of our simula-
tor and address some common concerns regarding RL for
autonomous driving.
A. Task Description
All tasks share the common objective of driving the ego
vehicle to the destination within limited time while adhering
to various traffic regulations. This includes maintaining the
vehicle within the road lanes, following traffic light signals,
and avoiding collisions with other vehicles on the road.
The observations used to guide the vehicle’s actions include
the state of the ego vehicle, waypoints, LiDAR points out-
lining the edges of the road, traffic light information, and
observations of the surrounding participants. The control input
consists of the steering angle and acceleration of the vehicle.
The following groups of tasks explore the impact of various
factors on the performance of the RL algorithm, using a con-
trol variable approach. The first three groups vary the scenario
(traffic light layout, traffic flow density, and road structure)
1See a workaround in https://github.com/jjyyxx/srlnbc/blob/main/srlnbc/
env/my_metadrive.py, which targets MetaDrive release https://github.com/
metadriverse/metadrive/releases/tag/MetaDrive-0.2.5.
Fig. 6. Scenario demonstration for group (A). The junction follows right-hand
traffic rules. Thick lines above four incoming edges’ stop lines indicate the
traffic light phase, and the dark green phase has a lower priority than the light
green phase. The rectangle in magenta is the ego vehicle, and rectangles in
blue are vehicles controlled by SUMO. This plotting convention also applies
to Figs. 7, 8, and 9.
Fig. 7. Scenario demonstration for group (B).
Fig. 8. Scenario demonstration for group (C). In group (C), 100 scenarios
are generated as the training set and 1 distinct scenario is used for testing.
The C-i notation in (a)-(e) indicates the i-th scenario in the training set.
while maintaining a common environment definition. The
fourth group varies the surrounding participant observation
presented to the algorithm. The four groups are referred to as
(A), (B), (C) and (D), and variants in each group are referred
to as A1, A2, etc.
1) Traffic Light Layout (A): Traffic lights are used to
regulate the flow of traffic at intersections. The layout of the
traffic lights can significantly affect the characteristics of the
traffic flow and present varying levels of difficulty for the agent
to learn proper behavior. In our experiments, we test three
different layouts at a four-way intersection: opposites (A1),
incoming (A2), and none (A3). For tasks in other groups, the
traffic light layout defaults to opposites.
The opposites layout (A1) is the most common for intersec-
tions, where the two orthogonal directions are allowed to pass
Fig. 9. Scenario demonstration for group (D). Green and red rectangles under the D1 and D2 variants represent the observation vector of the corresponding surrounding vehicle. Orange dots under the D3 variant represent scalar distances
normalized to detection range shown as the light orange circle. The rectangles
and dots are arranged in order (D1: from close to distant; D3: counter-
clockwise) and concatenated into the final fixed-dimensional surrounding
participant observation.
alternately. For each of the two directions, a straight phase is
followed by a left-turn phase. During the straight phase, left-
turn vehicles are permitted to pass if there are no conflicting
straight-going vehicles in the opposite direction. The incoming
layout (A2) further reduces potential conflicts at the cost of
longer waiting times for each incoming road. Each road has
a dedicated phase where all vehicles are allowed to pass. The
none layout (A3) literally disables the traffic lights, posing a
greater challenge for the agent to learn proper driving behavior.
2) Traffic Flow Density (B): The difficulty of collision
avoidance increases with the density of traffic flow. Quantita-
tively, the density is represented by the frequency at which new
vehicles are introduced into the network. In our experiments,
we test three levels of density: sparse (B1), normal (B2),
and dense (B3), which are characterized by vehicle spawn
periods of 1.8, 1.2, and 0.8 seconds, respectively. The sparse
level is relatively easy to navigate, often requiring little to no
interaction with surrounding participants. The dense level is
more challenging, but is capped to avoid traffic jams. For tasks
in other groups, the traffic flow density defaults to normal.
3) Road Structure (C): The road structure is randomly
generated to evaluate the generalizability of the learned policy.
The algorithms are trained on 1 (C1), 10 (C2), and 100 (C3)
scenarios and then tested on a previously unseen scenario. For
tasks in other groups, the road structure of both training and test scenarios is the same as the training scenario of C1.
4) Surrounding Participant Observation (D): Interaction
with surrounding participants is a primary challenge in
autonomous driving. The surrounding participant observation
has the special property of being a set with varying dimen-
sion [28]. This group of tasks examines various methods of
converting it into a fixed-dimensional observation.
The distance-sorted observation (D1) sorts the surrounding
participants by their distance from the ego vehicle [29]. It then
selects at most the 8 closest vehicles and concatenates their states
into a vector. If there are fewer vehicles than the expected
size, the observation is padded with zeros. An additional
TABLE II
ON-POLICY ALGORITHMS HYPERPARAMETERS
mask is also appended to each vehicle’s state, indicating
whether it is a real vehicle or a padding element. The top-
risky observation (D2) selects a single vehicle that is deemed
most risky according to a set of rules and uses its state as the
observation. The criterion to judge the top-risky vehicle is:
outside the intersection, it is the vehicle in front; and inside
the intersection, it is the nearest vehicle in the conflicting
traffic flow. The fixed-directional (D3) observation simulates
a LiDAR sensor by dividing the surroundings into a number
of bins and casting rays to surrounding participants to obtain
a fixed-size observation [30]. Each element in the observation
reflects the ego vehicle’s closest distance to surrounding par-
ticipants in a specific bin and direction. We set the number
of bins to 16 and the elements are arranged in counter-
clockwise order starting from the bin in front. For tasks in other groups, the surrounding participant observation defaults to distance-sorted.
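A compact sketch of the distance-sorted (D1) conversion described above is given below; the per-vehicle feature layout is an assumption for illustration.

```python
import numpy as np

def distance_sorted_obs(ego_xy, participants, max_num=8):
    """D1: sort surrounding participants by distance to the ego vehicle, keep at
    most the max_num closest, zero-pad to a fixed size, and append a validity
    mask to each slot so padding can be distinguished from real vehicles.

    participants: (M, F) array whose first two features are assumed to be the
                  participant's (x, y) position; M may vary from step to step.
    Returns a flat vector of length max_num * (F + 1).
    """
    num, feat_dim = participants.shape
    out = np.zeros((max_num, feat_dim + 1), dtype=np.float32)  # last column = mask
    if num > 0:
        dists = np.linalg.norm(participants[:, :2] - ego_xy, axis=1)
        order = np.argsort(dists)[:max_num]
        out[:len(order), :feat_dim] = participants[order]
        out[:len(order), feat_dim] = 1.0  # mark real vehicles
    return out.ravel()
```

The fixed-directional observation (D3) would instead bucket the same participants into a fixed number of angular bins and record the closest normalized distance per bin.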
B. RL Algorithms
Five model-free algorithms are selected for benchmarking
on the aforementioned groups of tasks. Proximal policy optimization (PPO) [27], soft actor-critic (SAC) [31] and distributional
soft actor-critic (DSAC) [3] are general-purpose RL algo-
rithms. PPO-Lagrangian (PPO-Lag) [30] and SAC-Lagrangian
(SAC-Lag) [32] are common baselines for safe RL. For safe
RL algorithms, constraints are handled with dual ascent, where the Lagrange multiplier is updated with gradient ascent. Also, the environment during training is adjusted so that collision with surrounding participants does not immediately terminate the episode or cause a penalty. Instead, a cost of +1 is given for every step during which a collision is occurring, indicating constraint violation. During evaluation, the environment settings are kept the same as for the general-purpose RL algorithms.
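As a minimal illustration of how the per-step collision cost can feed the dual-ascent update described above (the cost limit and step size are assumptions; the exact values are not stated in the text):

```python
class LagrangeMultiplier:
    """Dual-ascent update for the collision-cost constraint (sketch).

    Each training step yields cost = 1.0 while a collision is occurring and
    0.0 otherwise; the multiplier grows whenever the accumulated episode cost
    exceeds the limit and is projected back onto lambda >= 0 otherwise.
    """

    def __init__(self, cost_limit=0.0, lr=1e-2):
        self.lam = 0.0
        self.cost_limit = cost_limit  # assumed limit on expected episode cost
        self.lr = lr                  # assumed dual step size

    def update(self, episode_cost):
        # Gradient ascent on the dual variable, projected onto lam >= 0.
        self.lam = max(0.0, self.lam + self.lr * (episode_cost - self.cost_limit))
        return self.lam

    def penalized_reward(self, reward, cost):
        # Reward signal used by the policy update: r - lambda * c.
        return reward - self.lam * cost
```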
The hyperparameters for two on-policy algorithms (PPO,
PPO-Lag) and three off-policy algorithms (SAC, DSAC, and
SAC-Lag) are listed in Table II and Table III, respectively.
Most of them are directly taken from the original literature.
C. Simulation Parameter Configuration
There are two main sampling times in IDSim. One is the sampling time of the environment $\Delta t_{\mathrm{env}}$, which directly interacts with the algorithm; the other is the sampling time of the
TABLE III
OFF-POLICY ALGORITHMS HYPERPARAMETERS
TABLE IV
DYNAMICAL PARAMETERS OF EGO VEHICLE
TABLE V
REWARD TERMS FOR ALL TASKS
dynamic model $\Delta t$ in (1). Here we set $\Delta t_{\mathrm{env}} = \Delta t = 0.1$ s. As for the other dynamical parameters of the vehicle model, the details are listed in Table IV.
The reward is composed of a dense term for driving forward
along the route, as well as a conditional term for positioning
the vehicle appropriately at intersections. An episode termi-
nates when one of the following events happens to the agent:
reaching the end of its route (referred to as success), collision with other traffic participants, leaving the driving area, violating traffic lights, or exceeding the time limit, at which point a terminal reward or penalty is issued. The specific descriptions and values are listed in Table V. The notation $d$ represents the distance traveled along the reference path over the last time step, and $v_d$ denotes the desired velocity. $v_d$ is set to 8 m/s in the case of a green light and 0 when encountering a red light.
We train each algorithm on each task for five runs, using
different random seeds for the environment and neural network
initialization. The policy checkpoints are stored every 2% of the training process and evaluated in the corresponding test
scenario for 100 episodes.
D. Results
The success rates of the final policies are reported in
Table VI. The performance of intermediate policies can be
found in Supplementary Material.
In general, PPO and SAC-Lag perform well consistently across all tasks, and DSAC outperforms SAC except for task A2. Typically, an agent can learn to follow its path, keep itself in the driving area, and obey traffic light regulations. Failures commonly fall into two categories: (1) failing to learn collision avoidance and colliding repeatedly; (2) being overly conservative to avoid collision, leading to exceeding the time limit.
Groups (A) and (B) meet the design goal of allowing for different levels of scenario difficulty through the adjustment of traffic light layout and traffic flow density. The results of group (A) are consistent with the discussion in section V-A1: the incoming layout is the easiest for all algorithms, and the none layout is quite challenging, preventing algorithms from achieving reasonable performance without careful design. In group (B), all algorithms' performance declines when the traffic flow density increases from sparse to dense, which meets expectations. For sparse density, the algorithms' performances are on par with each other and the variance is low, but for harder scenarios, the differences between algorithms become larger, which may result from their different exploration-exploitation characteristics.
Group (C) focuses on the scenario generalizability of var-
ious RL algorithms. Fig. 10 reports the final training and test performance of algorithms trained on different numbers of
scenarios. From the viewpoint of multi-task learning, driving
in each scenario can be viewed as a task. The approach
of mixing all scenarios equally corresponds to the simplest
form of multi-task learning. This approach is known to be
susceptible to unbalanced optimization, where certain sce-
narios dominate others, leading to poor overall performance.
This is supported by Fig. 10, where the success rate drops
as the number of training scenarios increases. Multi-task
learning could leverage the problem structure, and achieve
improved performance on training scenarios. On the other
hand, the evaluation on the unseen test scenario is a matter
of generalization. When trained on a single scenario, this can
be thought of as out-of-distribution (OOD) generalization and
extrapolation, which is a well-known difficult topic. But when training on 100 scenarios, this is closer to in-distribution generalization and interpolation. So, for generalizability, it is desirable to have multiple scenarios, where multi-task learning could help. In Fig. 10, the train-test gap is obvious,
suggesting that training on a single scenario is not enough.
With a larger number of training scenarios, PPO obtains clear test performance improvements. For other algorithms, the
improvements are minor, possibly because of their deterio-
rated training performances. Therefore, if multi-task learning
achieves improved training performance, its test performance
should also surpass the baselines. Further training curve
comparison can be found in Supplementary Material.
Group (D) shows that the design of surrounding partici-
pant observation has a strong impact on test performance.
As is shown in Fig. 11, the on-policy algorithm PPO learns
faster and achieves better final performance with D2 and
TABLE VI
ALGORITHM PERFORMANCE EVALUATION ON 4 GROUPS OF TASKS
Fig. 10. Success rate between training and test scenarios for task group (C).
It is desirable to have better test performance and smaller gap between training
and test performance.
D3 observations; for the off-policy algorithm DSAC, the D2 and D3 observations also lead to an initial boost, but the final performance difference is not obvious. For other algorithms,
the training curves can be found in Supplementary Material.
Taking a closer look at each kind of surrounding participant
observation, D2 discards all information about other vehicles
except for the top-risky one. D3 uses a LiDAR-like observation
common in robotic control. Both D2 and D3 lose certain aspects of information, but help the agent achieve better performance. D1 appears to be the most information-complete, yet it has the worst performance in the group. This is because the raw surrounding observation is a set of participant states. Like a set in mathematics, it has no inherent ordering. In other words, arbitrary permutations of the set will not change the scenario, and the policy should produce an identical action. The distance-sorted observation meets this requirement by introducing an order and sorting accordingly before passing the result to the policy. However, unlike an ideal order such as sorting by risk level, which is not easily available, sorting by distance creates discontinuity and harms policy learning. Consider two participants at roughly the same distance from the ego
Fig. 11. Typical on-policy and off-policy algorithms’ success rate with
different kinds of surrounding participant observations in task group (D). The
solid lines correspond to the mean and the shaded regions correspond to 90%
confidence interval over five runs.
vehicle but in different directions: when their order is swapped, the observation changes discontinuously. In contrast, the other two observations, though leaving out certain information,
are much easier for the policy to leverage to avoid risk. It is clear that there is plenty of room to improve the observation design. For instance, permutation-invariant encoding introduces a hard inductive bias to meet the permutation-invariance requirement and has the potential to overcome the discontinuity while utilizing all state information.
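As an example of such an encoding, a sum-pooling encoder in the spirit of [28] can be sketched as follows; the layer sizes are arbitrary, and this is not the architecture used in the benchmark.

```python
import torch
import torch.nn as nn

class PermutationInvariantEncoder(nn.Module):
    """Encode a variable-size set of participant states into a fixed vector.

    Each participant is embedded independently and the embeddings are summed,
    so any permutation of the input set yields exactly the same output.
    """

    def __init__(self, feat_dim, embed_dim=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(feat_dim, embed_dim), nn.ReLU(),
                                 nn.Linear(embed_dim, embed_dim))

    def forward(self, participants, mask):
        # participants: (batch, max_num, feat_dim); mask: (batch, max_num)
        emb = self.phi(participants) * mask.unsqueeze(-1)  # zero out padding slots
        return emb.sum(dim=1)                              # order-independent pooling
```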
VI. CONCLUSION
We introduce IDSim, an autonomous driving benchmark
simulator for general urban scenarios, supporting the scenario
generality and observation generality of RL algorithms. Based
on systematic benchmark experiments, we suggest that multi-task learning and observation design are potential areas for algorithms to improve upon. We further hope that IDSim can support research into how RL algorithms perform in the decision and control of autonomous driving in a generalizable way.
REFERENCES
[1] Y. Guan, Y. Ren, S. E. Li, Q. Sun, L. Luo, and K. Li, “Centralized
cooperation for connected and automated vehicles at intersections by
proximal policy optimization,” IEEE Trans. Veh. Technol., vol. 69,
no. 11, pp. 12597–12608, Nov. 2020.
[2] S. E. Li, Reinforcement Learning for Sequential Decision-Making and
Control. Cham, Switzerland: Springer, 2023.
[3] J. Duan, Y. Guan, S. E. Li, Y. Ren, Q. Sun, and B. Cheng, “Distributional
soft actor-critic: Off-policy reinforcement learning for addressing value
estimation errors,” IEEE Trans. Neural Netw. Learn. Syst., vol. 33,
no. 11, pp. 6584–6598, Nov. 2022.
[4] T. P. Lillicrap et al., “Continuous control with deep reinforcement
learning,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2016, pp. 1–14.
[5] J. Duan, S. Eben Li, Y. Guan, Q. Sun, and B. Cheng, “Hierarchical reinforcement learning for self-driving decision-making without reliance on labelled driving data,” IET Intell. Transp. Syst., vol. 14, no. 5, pp. 297–305, May 2020.
[6] J. Chen, B. Yuan, and M. Tomizuka, “Model-free deep reinforcement learning for urban autonomous driving,” in Proc. Int. Conf. Intell. Transp. Syst. (ITSC), 2019, pp. 2765–2771.
[7] Y. Guan et al., “Integrated decision and control: Toward interpretable and computationally efficient driving intelligence,” IEEE Trans. Cybern., vol. 53, no. 2, pp. 859–873, Feb. 2023.
[8] B. Wymann et al., “TORCS, the open racing car simulator,” vol. 4, no. 6, p. 2, 2000. [Online]. Available: http://torcs.sourceforge.net
[9] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An open urban driving simulator,” in Proc. Conf. Robot Learn. (CoRL), 2017, pp. 1–16.
[10] M. Zhou et al., “SMARTS: An open-source scalable multi-agent RL training school for autonomous driving,” in Proc. Conf. Robot Learn. (CoRL), 2021, pp. 264–285.
[11] Q. Li, Z. Peng, L. Feng, Q. Zhang, Z. Xue, and B. Zhou, “MetaDrive:
Composing diverse driving scenarios for generalizable reinforcement
learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 3,
pp. 3461–3475, Mar. 2023.
[12] P. A. Lopez et al., “Microscopic traffic simulation using SUMO,” in Proc. Int. Conf. Intell. Transp. Syst. (ITSC), 2018, pp. 2575–2582.
[13] G. Brockman et al., “OpenAI gym,” 2016, arXiv:1606.01540.
[14] E. Liang et al., “RLlib: Abstractions for distributed reinforcement learn-
ing,” in Proc. Int. Conf. Mach. Learn. (ICML), 2018, pp. 3053–3062.
[15] J. Weng et al., “Tianshou: A highly modularized deep reinforcement learning library,” J. Mach. Learn. Res., vol. 23, no. 267, pp. 1–6, 2022.
[16] W. Wang et al., “GOPS: A general optimal control problem solver for autonomous driving and industrial control applications,” Commun. Transp. Res., vol. 3, Dec. 2023, Art. no. 100096.
[17] S. Shah, D. Dey, C. Lovett, and A. Kapoor, “AirSim: High-fidelity visual and physical simulation for autonomous vehicles,” in Field and Service Robotics. Cham, Switzerland: Springer, 2018, pp. 621–635.
[18] P. Cai, Y. Lee, Y. Luo, and D. Hsu, “SUMMIT: A simulator for urban
driving in massive mixed traffic,” in Proc. IEEE Int. Conf. Robot. Autom.
(ICRA), May 2020, pp. 4023–4029.
[19] P. Palanisamy, “Multi-agent connected autonomous driving using deep
reinforcement learning,” in Proc. Int. Joint Conf. Neural Netw. (IJCNN),
Jul. 2020, pp. 1–7.
[20] Y. Guan et al., “Direct and indirect reinforcement learning,” Int. J. Intell. Syst., vol. 36, no. 8, pp. 4439–4467, Aug. 2021.
[21] J. García and F. Fernández, “A comprehensive survey on safe reinforcement learning,” J. Mach. Learn. Res., vol. 16, no. 42, pp. 1437–1480, Aug. 2015.
[22] C. Tessler, D. J. Mankowitz, and S. Mannor, “Reward constrained
policy optimization,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2019,
pp. 1–15.
[23] Y. Chow, A. Tamar, S. Mannor, and M. Pavone, “Risk-sensitive and
robust decision-making: A CVaR optimization approach,” in Proc. Adv.
Neural Inf. Process. Syst. (NeurIPS), 2015, pp. 1–9.
[24] T. Wei and C. Liu, “Safe control algorithms using energy functions: A unified framework, benchmark, and new directions,” in Proc. IEEE 58th Conf. Decis. Control (CDC), Dec. 2019, pp. 238–243.
[25] Y. Ren, G. Zhan, L. Tang, S. Eben Li, J. Jiang, and J. Duan, “Improve
generalization of driving policy at signalized intersections with adver-
sarial learning,” 2022, arXiv:2204.04403.
[26] Q. Ge, Q. Sun, S. E. Li, S. Zheng, W. Wu, and X. Chen, “Numeri-
cally stable dynamic bicycle model for discrete-time control,” in Proc.
IEEE Intell. Vehicles Symp. Workshops (IV Workshops), Jul. 2021,
pp. 128–134.
[27] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov,
“Proximal policy optimization algorithms,” 2017, arXiv:1707.06347.
[28] J. Duan et al., “Fixed-dimensional and permutation invariant state representation of autonomous driving,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 7, pp. 9518–9528, Jul. 2022.
[29] Y. Ren et al., “Self-learned intelligence for integrated decision and
control of automated vehicles at signalized intersections,” IEEE Trans.
Intell. Transp. Syst., vol. 23, no. 12, pp. 24145–24156, Dec. 2022.
[30] S. Fujimoto, E. Conti, M. Ghavamzadeh, and J. Pineau, “Benchmarking
batch deep reinforcement learning algorithms,” 2019, arXiv:1910.01708.
[31] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in Proc. Int. Conf. Mach. Learn. (ICML), 2018, pp. 1861–1870.
[32] S. Ha, P. Xu, Z. Tan, S. Levine, and J. Tan, “Learning to walk in the real world with minimal human effort,” in Proc. Conf. Robot Learn. (CoRL), 2021, pp. 1110–1120.
Yuxuan Jiang received the B.Eng. degree from the
School of Vehicle and Mobility, Tsinghua Univer-
sity, Beijing, China, in 2021, where he is currently
pursuing the master’s degree with the School of
Vehicle and Mobility.
His research interests include optimal control
and reinforcement learning and its application to
autonomous driving.
Guojian Zhan received the B.Eng. degree from the
School of Vehicle and Mobility, Tsinghua Univer-
sity, Beijing, China, in 2021, where he is currently
pursuing the Ph.D. degree with the School of Vehicle
and Mobility.
His research interests lie in autonomous driving
and reinforcement learning.
Zhiqian Lan received the B.Eng. degree from the
School of Vehicle and Mobility, Tsinghua Univer-
sity, Beijing, China, in 2021, where he is currently
pursuing the M.S. degree with the School of Vehicle
and Mobility.
His research interests lie in autonomous driving,
deep learning, and optimal filtering.
Chang Liu received the B.S. degree in electrical
engineering and in applied mathematics from Peking
University, China, in 2011, and the dual M.S. degree
in mechanical engineering and in computer science
and the Ph.D. degree in mechanical engineering from
the University of California at Berkeley, Berkeley,
in 2014, 2016, and 2017, respectively.
He was a Software Engineer with NVIDIA Corpo-
ration and a Post-Doctoral Associate with the Sibley
School of Mechanical and Aerospace Engineering,
Cornell University. He is currently an Assistant
Professor with the College of Engineering, Peking University, China. His
research interests include the robotic motion planning, active sensing, and
human–robot collaboration.
Bo Cheng received the B.S. and M.S. degrees in
automotive engineering from Tsinghua University,
Beijing, China, in 1985 and 1988, respectively, and
the Ph.D. degree in mechanical engineering from
The University of Tokyo, Tokyo, Japan, in 1998.
He is currently a Professor with the School of
Vehicle and Mobility, Tsinghua University, where
he is also the Dean of the Suzhou Automotive
Research Institute. His active research interests
include autonomous vehicles, driver-assistance sys-
tems, active safety, and vehicular ergonomics. He is
also the Chairperson of the Academic Board of SAE-Beijing and a member
of the Council of the Chinese Ergonomics Society.
Shengbo Eben Li (Senior Member, IEEE) received
the M.S. and Ph.D. degrees in mechanical engi-
neering from Tsinghua University, Beijing, China,
in 2006 and 2009, respectively. He worked with
Stanford University, University of Michigan, and
University of California at Berkeley, Berkeley.
He is currently a tenured Professor with Tsinghua
University. He is the author of more than 100 journal/conference papers and the co-inventor of
over 20 Chinese patents. His active research interests
include intelligent vehicles and driver assistance,
reinforcement learning and distributed control, optimal control, and estima-
tion. He serves as an Associate Editor for IEEE Intelligent Transportation Systems Magazine and IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS.
... Despite the extensive benchmarks established for RL in domains such as games [25], [26] and autonomous driving [27], a similar benchmark in DTRs has been conspicuously lacking. This gap highlights a significant need in the field to provide a robust platform that can systematically evaluate and compare the performance of RL algorithms in complex, dynamic healthcare settings. ...
Preprint
Full-text available
Reinforcement learning (RL) has garnered increasing recognition for its potential to optimise dynamic treatment regimes (DTRs) in personalised medicine, particularly for drug dosage prescriptions and medication recommendations. However, a significant challenge persists: the absence of a unified framework for simulating diverse healthcare scenarios and a comprehensive analysis to benchmark the effectiveness of RL algorithms within these contexts. To address this gap, we introduce \textit{DTR-Bench}, a benchmarking platform comprising four distinct simulation environments tailored to common DTR applications, including cancer chemotherapy, radiotherapy, glucose management in diabetes, and sepsis treatment. We evaluate various state-of-the-art RL algorithms across these settings, particularly highlighting their performance amidst real-world challenges such as pharmacokinetic/pharmacodynamic (PK/PD) variability, noise, and missing data. Our experiments reveal varying degrees of performance degradation among RL algorithms in the presence of noise and patient variability, with some algorithms failing to converge. Additionally, we observe that using temporal observation representations does not consistently lead to improved performance in DTR settings. Our findings underscore the necessity of developing robust, adaptive RL algorithms capable of effectively managing these complexities to enhance patient-specific healthcare. We have open-sourced our benchmark and code at https://github.com/GilesLuo/DTR-Bench.
Book
Full-text available
As one of the most important AI branches, reinforcement learning (RL) has attracted increasing attention in recent years. RL is an interdisciplinary field of trial-and-error learning and optimal control that promises to provide optimal solutions for decision-making or control in large-scale and complex dynamic processes. One of its most conspicuous successes is AlphaZero from Google DeepMind, which has beaten the highest-level professional human player. The underlying key technology is the so-called deep reinforcement learning, which equips AlphaGo with amazing self-evolution ability and high playing intelligence. This book aims to provide a systematic introduction to fundamental RL theories, mainstream RL algorithms and typical RL applications for researchers and engineers. The main topics include Markov decision processes, Monte Carlo learning, temporal difference learning, RL with function approximation, policy gradient method, approximate dynamic programming, deep reinforcement learning, etc.
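Among the topics listed, temporal-difference learning admits a compact illustration. The sketch below is a minimal tabular TD(0) policy-evaluation loop, assuming a Gymnasium-style environment with discrete integer states; it is an illustration of the standard update rule, not code from the book.

```python
# Tabular TD(0) prediction: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).
import numpy as np

def td0_evaluate(env, policy, num_states, episodes=500, alpha=0.1, gamma=0.99):
    V = np.zeros(num_states)
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, terminated, truncated, _ = env.step(a)
            target = r + (0.0 if terminated else gamma * V[s_next])
            V[s] += alpha * (target - V[s])   # temporal-difference update
            s, done = s_next, terminated or truncated
    return V
```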
Article
Full-text available
Intersection is one of the most accident-prone urban scenarios for autonomous driving wherein making safe and computationally efficient decisions is non-trivial. Current research mainly focuses on the simplified traffic conditions while ignoring the existence of mixed traffic flows, i.e., vehicles, cyclists and pedestrians. For urban roads, different participants lead to a quite dynamic and complex interaction, posing great difficulty to learn an intelligent policy. This paper develops the dynamic permutation state representation in the framework of integrated decision and control (IDC) to handle signalized intersections with mixed traffic flows. Specially, this representation introduces an encoding function and summation operator to construct driving states from environmental observation, capable of dealing with different types and variant number of traffic participants. A constrained optimal control problem is built wherein the objective involves tracking performance and the constraints for different participants, roads and signal lights are designed respectively to assure safety. We solve this problem by gradient-based optimization, wherein the reasonable state will be given by the encoding function and then served as the input of policy and value function. An off-policy training is designed to reuse observations from driving environment and backpropagation through time is utilized to update the policy function and encoding function jointly. Verification result shows that the dynamic permutation state representation can enhance the driving performance of IDC, including comfort, decision compliance and safety with a large margin. The trained driving policy can realize efficient and smooth passing in the complex intersection, guaranteeing driving intelligence and safety simultaneously.
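The encoding-and-summation idea for mixed traffic can be sketched as follows. This is our own simplified PyTorch illustration (class and dimension names are assumptions, not the paper's code): each participant type gets an encoding network, per-participant encodings are summed, and the sum is concatenated with ego features, so the state dimension stays fixed regardless of how many participants are observed.

```python
import torch
import torch.nn as nn

class MixedTrafficState(nn.Module):
    def __init__(self, feat_dims=None, encode_dim=64):
        super().__init__()
        feat_dims = feat_dims or {"vehicle": 8, "cyclist": 6, "pedestrian": 4}
        self.encoders = nn.ModuleDict({
            kind: nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, encode_dim))
            for kind, dim in feat_dims.items()})
        self.encode_dim = encode_dim

    def forward(self, observations, ego):
        # observations: dict of participant type -> (N_type, feat_dim) tensor; ego: (ego_dim,)
        summed = ego.new_zeros(self.encode_dim)
        for kind, feats in observations.items():
            if feats.numel() > 0:
                summed = summed + self.encoders[kind](feats).sum(dim=0)  # order-independent
        return torch.cat([summed, ego], dim=-1)  # fixed-dimensional policy/value input
```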
Article
Full-text available
In reinforcement learning (RL), function approximation errors are known to easily lead to the Q-value overestimations, thus greatly reducing policy performance. This article presents a distributional soft actor-critic (DSAC) algorithm, which is an off-policy RL method for continuous control setting, to improve the policy performance by mitigating Q-value overestimations. We first discover in theory that learning a distribution function of state-action returns can effectively mitigate Q-value overestimations because it is capable of adaptively adjusting the update step size of the Q-value function. Then, a distributional soft policy iteration (DSPI) framework is developed by embedding the return distribution function into maximum entropy RL. Finally, we present a deep off-policy actor-critic variant of DSPI, called DSAC, which directly learns a continuous return distribution by keeping the variance of the state-action returns within a reasonable range to address exploding and vanishing gradient problems. We evaluate DSAC on the suite of MuJoCo continuous control tasks, achieving the state-of-the-art performance.
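A simplified sketch of the distributional-critic idea follows; it is a hedged illustration rather than the authors' implementation. The critic outputs a Gaussian over returns, is trained by negative log-likelihood toward a bootstrapped target, and the predicted standard deviation is clamped to a bounded range, mirroring the abstract's point about keeping the return variance reasonable.

```python
import torch
import torch.nn as nn

class DistributionalCritic(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256, min_std=0.1, max_std=10.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 2))
        self.min_std, self.max_std = min_std, max_std

    def forward(self, obs, act):
        mean, log_std = self.net(torch.cat([obs, act], dim=-1)).chunk(2, dim=-1)
        std = log_std.exp().clamp(self.min_std, self.max_std)  # bounded return variance
        return mean, std

def critic_loss(critic, obs, act, target_return):
    # Negative log-likelihood of the target return under the predicted distribution.
    mean, std = critic(obs, act)
    return -torch.distributions.Normal(mean, std).log_prob(target_return).mean()
```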
Article
Full-text available
Reinforcement learning (RL) algorithms have been successfully applied to a range of challenging sequential decision-making and control tasks. In this paper, we classify RL into direct and indirect RL according to how they seek the optimal policy of the Markov decision process problem. The former solves the optimal policy by directly maximizing an objective function using gradient descent methods, in which the objective function is usually the expectation of accumulative future rewards. The latter indirectly finds the optimal policy by solving the Bellman equation, which is the sufficient and necessary condition from Bellman's principle of optimality. We study policy gradient (PG) forms of direct and indirect RL and show that both of them can derive the actor–critic architecture and can be unified into a PG with the approximate value function and the stationary state distribution, revealing the equivalence of direct and indirect RL. We employ a Gridworld task to verify the influence of different forms of PG, suggesting their differences and relationships experimentally. Finally, we classify current mainstream RL algorithms using the direct and indirect taxonomy, together with other ones, including value-based and policy-based, model-based and model-free.
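The "direct" route (maximizing the expected return by gradient ascent) can be made concrete with a minimal REINFORCE-style update. The sketch below uses PyTorch and our own names; it assumes a trajectory of (log-probability, reward) pairs collected with the current policy.

```python
import torch

def policy_gradient_step(optimizer, trajectory, gamma=0.99):
    # trajectory: list of (log_prob, reward) pairs from the current policy
    returns, g = [], 0.0
    for _, r in reversed(trajectory):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # variance reduction
    log_probs = torch.stack([lp for lp, _ in trajectory])
    loss = -(log_probs * returns).sum()   # ascend the expected-return objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```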
Article
Driving safely requires multiple capabilities from human and intelligent agents, such as the generalizability to unseen environments, the safety awareness of the surrounding traffic, and the decision-making in complex multi-agent settings. Despite the great success of Reinforcement Learning (RL), most of the RL research works investigate each capability separately due to the lack of integrated environments. In this work, we develop a new driving simulation platform called MetaDrive to support the research of generalizable reinforcement learning algorithms for machine autonomy. MetaDrive is highly compositional, which can generate an infinite number of diverse driving scenarios from both the procedural generation and the real data importing. Based on MetaDrive, we construct a variety of RL tasks and baselines in both single-agent and multi-agent settings, including benchmarking generalizability across unseen scenes, safe exploration, and learning multi-agent traffic. The generalization experiments conducted on both procedurally generated scenarios and real-world scenarios show that increasing the diversity and the size of the training set leads to the improvement of the RL agent's generalizability. We further evaluate various safe reinforcement learning and multi-agent reinforcement learning algorithms in MetaDrive environments and provide the benchmarks. Source code, documentation, and demo video are available at https://metadriverse.github.io/metadrive .
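A hedged usage sketch for MetaDrive is shown below. The exact configuration keys and the reset/step signatures depend on the installed version, so treat the details as assumptions and consult https://metadriverse.github.io/metadrive for the definitive interface.

```python
from metadrive import MetaDriveEnv

env = MetaDriveEnv()        # default config with procedurally generated scenarios
obs = env.reset()           # newer Gymnasium-style releases return (obs, info)
for _ in range(1000):
    action = env.action_space.sample()   # replace with a trained RL policy
    step_result = env.step(action)
    obs = step_result[0]
    done = step_result[2] if len(step_result) == 4 else (step_result[2] or step_result[3])
    if done:
        obs = env.reset()
env.close()
```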
Article
Decision and control are core functionalities of high-level automated vehicles. Current mainstream methods, such as functional decomposition and end-to-end reinforcement learning (RL), suffer high time complexity or poor interpretability and adaptability on real-world autonomous driving tasks. In this article, we present an interpretable and computationally efficient framework called integrated decision and control (IDC) for automated vehicles, which decomposes the driving task into static path planning and dynamic optimal tracking that are structured hierarchically. First, the static path planning generates several candidate paths only considering static traffic elements. Then, the dynamic optimal tracking is designed to track the optimal path while considering the dynamic obstacles. To that end, we formulate a constrained optimal control problem (OCP) for each candidate path, optimize them separately, and follow the one with the best tracking performance. To unload the heavy online computation, we propose a model-based RL algorithm that can be served as an approximate-constrained OCP solver. Specifically, the OCPs for all paths are considered together to construct a single complete RL problem and then solved offline in the form of value and policy networks for real-time online path selecting and tracking, respectively. We verify our framework in both simulations and the real world. Results show that compared with baseline methods, IDC has an order of magnitude higher online computing efficiency, as well as better driving performance, including traffic efficiency and safety. In addition, it yields great interpretability and adaptability among different driving scenarios and tasks.
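The online selection step of this hierarchical scheme can be summarized in a few lines. The sketch below is our own simplification with hypothetical function names, not the authors' implementation: each candidate static path is scored with a learned value function, and the tracking policy acts on the best-scoring path.

```python
import numpy as np

def idc_step(ego_state, obstacles, candidate_paths, value_fn, policy_fn):
    """Score candidate paths with the value network, then track the best one."""
    scores = [value_fn(ego_state, obstacles, path) for path in candidate_paths]
    best_path = candidate_paths[int(np.argmax(scores))]    # online path selection
    return policy_fn(ego_state, obstacles, best_path)       # online optimal tracking control
```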
Article
In this paper, we propose a new state representation method, called encoding sum and concatenation (ESC), to describe the environment observation for decision-making in autonomous driving. Unlike existing state representation methods, ESC is applicable to the situation where the number of surrounding vehicles is variable and eliminates the need for manually pre-designed sorting rules, leading to higher representation ability and generality. The proposed ESC method introduces a feature neural network (NN) to encode the real-valued feature of each surrounding vehicle into an encoding vector, and then adds these vectors up to obtain the representation vector of the set of surrounding vehicles. Then, a fixed-dimensional and permutation-invariance state representation can be obtained by concatenating the set representation with other variables, such as indicators of the ego vehicle and road. By introducing the sum-of-power mapping, this paper has further proved that the injectivity of the ESC state representation can be guaranteed if the output dimension of the feature NN is greater than the number of variables of all surrounding vehicles. This means that the ESC representation can be used to describe the environment and taken as the inputs of learning-based policy functions. Experiments demonstrate that compared with the fixed-permutation representation method, the policy learning accuracy based on ESC representation is improved by 62.2%.
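The ESC construction itself is compact enough to sketch. The code below is a hedged illustration with our own dimensions and names: each surrounding vehicle's features are encoded, the encodings are summed, and the sum is concatenated with ego/road indicators. Per the abstract's injectivity condition, the encoder output dimension should exceed the total number of surrounding-vehicle variables; the check at the end confirms permutation invariance.

```python
import torch
import torch.nn as nn

vehicle_dim, ego_road_dim, encode_dim = 6, 10, 128
feature_nn = nn.Sequential(nn.Linear(vehicle_dim, 64), nn.ReLU(), nn.Linear(64, encode_dim))

def esc_state(surrounding, ego_road):
    # surrounding: (N, vehicle_dim) with variable N; ego_road: (ego_road_dim,)
    return torch.cat([feature_nn(surrounding).sum(dim=0), ego_road], dim=-1)

# Shuffling the surrounding vehicles leaves the state unchanged (permutation invariance).
x = torch.randn(5, vehicle_dim)
ego = torch.randn(ego_road_dim)
assert torch.allclose(esc_state(x, ego), esc_state(x[torch.randperm(5)], ego), atol=1e-5)
```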