A Reinforcement Learning Benchmark for
Autonomous Driving in General
Urban Scenarios
Yuxuan Jiang , Guojian Zhan , Zhiqian Lan , Chang Liu, Bo Cheng,
and Shengbo Eben Li , Senior Member, IEEE
Abstract Reinforcement learning (RL) has gained significant
interest for its potential to improve decision and control in
autonomous driving. However, current approaches have yet
to demonstrate sufficient scenario generality and observation
generality, hindering their wider utilization. To address these
limitations, we propose a unified benchmark simulator for RL
algorithms (called IDSim) to facilitate decision and control
for high-level autonomous driving, with emphasis on diverse
scenarios and a unified observation interface. IDSim is composed
of a scenario library and a simulation engine, and is designed
with execution efficiency and determinism in mind. The scenario
library covers common urban scenarios, with automated random
generation of road structure and traffic flow, and the simulation
engine operates on the generated scenarios with dynamic interac-
tion support. We conduct four groups of benchmark experiments
with five common RL algorithms and focus on challenging
signalized intersection scenarios with varying conditions. The
results showcase the reliability of the simulator and reveal
its potential to improve the generality of RL algorithms. Our
analysis suggests that multi-task learning and observation design
are potential areas for further algorithm improvement.
Index Terms Autonomous driving, benchmark simulator,
reinforcement learning.
I. INTRODUCTION
AUTONOMOUS driving has garnered significant attention
from both academia and industry due to its potential
to improve travel efficiency, reduce accidents, and alleviate
the burden on human drivers [1]. At the core of achieving
high-level driving intelligence is the decision-making process.
While there have been attempts to replicate human driving
Manuscript received 20 February 2023; revised 22 August 2023;
accepted 30 October 2023. The work of Shengbo Eben Li was supported in
part by the National Key Research and Development Program of China under
Grant 2022YFB2502901, in part by NSF China under Grant 52221005, in part
by the Tsinghua University Initiative Scientific Research Program, and in part
by the Tsinghua University-Toyota Joint Research Center for AI Technology
of Automated Vehicle. The Associate Editor for this article was H. Jula.
(Yuxuan Jiang and Guojian Zhan contributed equally to this work.)
(Corresponding author: Shengbo Eben Li.)
Yuxuan Jiang, Guojian Zhan, Zhiqian Lan, Bo Cheng, and Shengbo Eben Li
are with the State Key Laboratory of Automotive Safety and Energy,
School of Vehicle and Mobility, Tsinghua University, Beijing 100084,
China (e-mail: jyx21@mails.tsinghua.edu.cn; zgj21@mails.tsinghua.edu.cn;
lanzq21@mails.tsinghua.edu.cn; chengbo@tsinghua.edu.cn; lishbo@
tsinghua.edu.cn).
Chang Liu is with the Department of Advanced Manufacturing and
Robotics, College of Engineering, Peking University, Beijing 100084, China
(e-mail: changliucoe@pku.edu.cn).
This article has supplementary downloadable material available at
https://doi.org/10.1109/TITS.2023.3329823, provided by the authors.
Digital Object Identifier 10.1109/TITS.2023.3329823
behaviors by imitating expert data, obtaining a large quan-
tity of high-quality data can be expensive and potentially
infeasible. With the capacity to self-evolve through trial-and-
error independent of reliance on external labeled data [2],
reinforcement learning (RL) has emerged as a promising
solution for achieving real-time, high-accuracy decision and
control of autonomous vehicles [3].
Recently, RL has been applied to certain driving tasks and scenarios. Lillicrap et al. [4] proposed the DDPG algorithm and realized lane keeping on a race track in the TORCS simulator, reporting comparable performance between low-dimensional and pixel observations. Duan et al. [5] realized decision and control on a two-lane highway using hierarchical RL, designing complex reward functions for its high-level maneuver selection and three low-level maneuvers. Leveraging the structure of the scenario, it selects one leading and one following vehicle on each lane, then uses their longitudinal distances to the ego vehicle to represent surrounding information. Chen et al. [6] focused on the roundabout scenario, designing a specific reward function to achieve safe and comfortable driving using several model-free RL algorithms. This work proposed latent state encoding with bird-view images to decrease sample complexity as compared to using front-view images. Guan et al. [7] proposed the integrated decision and control framework to handle the crossroad scenario and demonstrated its effectiveness through real-world experiments. For surrounding information, this work selects two surrounding vehicles for each conflicting connection and observes their positions and velocities relative to the center of the intersection.
There are two major concerns regarding the applicability of
the aforementioned works. One is the lack of scenario gener-
ality. Most of these works focus on a single specific scenario
and require the policy to be retrained even for slight changes in
the environment. It is necessary to aim for a universal policy
that can handle general urban scenarios in order to achieve
high-level autonomous driving. Another concern is the lack
of observation generality. Some of the works use specialized
observation, such as filtering surrounding vehicles to observe
based on rules adapted to two-lane highway, which may have
advantage on the targeted scenario but be inapplicable in other
scenarios. In fact, these two issues relate to the policy gener-
ality from structural level and performance level, respectively.
The observation generality implies that the policy input is
general, that is, the same form of observation can be used in
any working condition and contains sufficient information; the
TABLE I
OPEN-SOURCE AUTONOMOUS DRIVING SIMULATOR
scenario generality means that the performance of the policy
is strong enough to conquer general scenarios. To achieve
high-level driving intelligence, the observation generality is
the fundamental condition, while the scenario generality is the
ultimate goal. Handling these two issues at the algorithm side
is tedious and makes fair performance comparison difficult.
In light of this, we propose a unified benchmark simulator,
Intelligent Driving Simulator (IDSim), for RL algorithms to
facilitate decision and control for high-level autonomous driv-
ing. The emphasis is placed on diverse scenarios and a unified
observation interface. IDSim addresses scenario generality by
providing sufficient diversity at the road structure level and
traffic flow level. Urban roads can have complex and varied
layouts, including multi-lane roads, intersections, roundabouts,
etc. Traffic participants (vehicles, cyclists, and pedestrians)
can be significantly different from each other, each with
unique geometries, dynamics, and behavioral patterns (e.g.,
conservative or aggressive). Thus, the generality of a policy can be validated by training and testing on two distinct sets of scenarios. Regarding observation generality, IDSim
provides sufficient information for the ego vehicle to make
optimal decisions in a unified style. The ego vehicle’s states,
road structure, traffic participants’ states, and other dynamic
traffic information can be easily composed and flattened to
a vector as policy input. Based on the composable and
customizable configuration support, we provide several general
observation designs. Different observation configurations can
have significant impact on policy performance, and relevant
experiments will be presented in the benchmark experiments
section.
Systematically, IDSim is composed of two main parts: a
scenario library and a simulation engine, in accordance with
the integrated decision and control (IDC) framework [7]. The
scenario library achieves automated random generation of road structure and traffic flow covering common urban scenarios, such as multi-lane roads, intersections, and roundabouts. The
simulation engine operates on a set of generated scenarios
and provides dynamic interaction support. It is built on top of
Eclipse SUMO [12], a microscopic traffic simulation package,
for traffic flow simulation. Lastly, IDSim is designed with
execution efficiency and determinism in mind. We incorporate
careful profiling and optimization to accelerate RL training,
and regularly run tests to guarantee that simulation results are exactly reproducible with a fixed seed and input.
Our key contributions include:
(1) We propose a unified, lightweight benchmark simulator
called IDSim, with the standard OpenAI Gym [13] interface that is widely adopted by popular RL training frameworks such
as RLlib [14], Tianshou [15], and GOPS [16]. The simulator
concentrates on general urban scenarios and is compatible
with the IDC framework designed for high-level autonomous
driving.
(2) We address scenario generality by generating diverse
scenarios as training or test environments. Specifically,
we offer diversity at the road structure level and traffic flow
level. The former encompasses general urban scenarios includ-
ing multi-lane roads, intersections, and roundabouts, while the
latter considers various types of traffic participants such as
vehicles, cyclists, and pedestrians.
(3) We focus on observation generality and provide a unified
observation interface under all supported scenarios. The ego
vehicle’s states, navigation information, road structure, traffic
participants’ states, and other relevant dynamic traffic infor-
mation can be easily composed and flattened as policy input.
Besides, we provide several observation designs which can be
used in general driving conditions.
(4) We conduct four groups of experiments with five com-
mon RL algorithms to demonstrate the functionality of our
benchmark simulator, with a particular focus on diverse sig-
nalized intersection scenarios. The results show that different
scenario settings have a strong impact on algorithm perfor-
mance and verify the generality and reliability of the simulator.
We further point out several directions of interest, such as
multi-task learning and observation design, for algorithms to
improve upon.
II. RELATED WORKS
A proper RL environment is critical to the application of
RL algorithms. In general, the interaction between an RL
agent and the environment either happens in real world or in
simulation. However, trial-and-error in the real world implies
economic costs, security risks, and low sample efficiency.
Thus, it has become an inevitable choice to employ a simulator
to collect samples for prototyping and validation of algorithms.
Recently, various open-source autonomous driving simu-
lators have emerged and can serve as RL environments.
A detailed and selective comparison with our proposed IDSim
is illustrated in Table I. TORCS [8] was initially developed as
a 3D racing simulation game. Thanks to its embedded vehicle
dynamics, high efficiency, and open-source nature, it has been
repurposed for autonomous driving simulation, yet the scope
is typically limited to basic lane-keeping tasks. CARLA [9] is
one of the most well-known open-source autonomous driving
simulators currently available. Many researchers have made
customizations to it for RL integration, such as AirSim [17], SUMMIT [18], and MACAD [19]. CARLA's use of the Unreal
Engine enables photorealistic rendering, but makes simulation
more resource-consuming and significantly complicates the
process of automated scenario generation. SMARTS [10]
is a multi-agent autonomous driving simulator, focusing on
realistic and diverse interactions. It is developed around a typical set of urban driving conditions, such as lane merging at a ramp
and unprotected left turn at a four-way intersection. IDSim,
instead, puts more focus on scenario generality and provides
more diversity through randomized generation. MetaDrive [11]
is a driving simulation platform with the explicit focus on
RL’s generalizability issue. It designs a procedural generation
method to unlimitedly produce new scenarios by randomly
sampling and splicing a bunch of built-in blocks. IDSim
employs a different approach by generating a skeleton then
applying perturbations, which allows for more fine-grained diversity compared to MetaDrive's approach, where a block is the smallest unit. Besides, MetaDrive's traffic flow mod-
ule is developed from scratch, which lacks several desirable
properties, including cooperation, periodic spawning and cus-
tomizability. IDSim, on the other hand, integrates SUMO as
the traffic flow backend, making traffic flow configuration
more friendly to work with. Finally, the environment model is
an extra feature of IDSim, which facilitates model-based RL
algorithms and planning-based methods to train and evaluate
on our platform.
III. PRELIMINARIES
A. Reinforcement Learning
Reinforcement learning (RL) considers a Markov Decision Process (MDP), where the optimal action depends only on the current state [2]. More specifically, at each timestep $t$, the agent takes an action $a_t \in \mathcal{A}$ according to the current state $s_t \in \mathcal{S}$ and the policy $\pi: \mathcal{S} \to \mathcal{A}$; the environment then transits to the next state according to the environment dynamics, i.e., $s_{t+1} = f(s_t, a_t)$, and feeds back a scalar reward signal $r_t$. The value function $v^{\pi}: \mathcal{S} \to \mathbb{R}$ is defined as the expected sum of rewards of policy $\pi$ given initial state $s$.
Actor-Critic is the most popular structure of RL algorithms. It simultaneously learns a value function $V_w$ parameterized by $w$, called the critic, to approximate the true value function, and a policy $\pi_\theta$ parameterized by $\theta$, called the actor, to maximize the expected sum of rewards with the help of the critic [20].
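For concreteness, these quantities can be written as follows (a standard formulation; the discount factor $\gamma$ is an assumption, since the text does not state whether the return is discounted):
$$v^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s,\ a_t = \pi(s_t),\ s_{t+1} = f(s_t, a_t)\right],$$
$$\min_{w}\ \mathbb{E}_{s}\left[\left(V_w(s) - v^{\pi_\theta}(s)\right)^{2}\right], \qquad \max_{\theta}\ \mathbb{E}_{s}\left[v^{\pi_\theta}(s)\right],$$
where the critic $V_w$ is regressed toward the value of the current policy and the actor $\pi_\theta$ is updated to maximize that value with the help of the critic.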
Safe RL is a subfield of RL that focuses on the safety of an agent, which is especially meaningful for real-world deployment [21]. Safe RL seeks to ensure that the agent
operates within the boundaries imposed by constraints to
satisfy safety requirements, while maintaining reasonable
performance. To achieve this, various methods that incorpo-
rate penalties [22], Lagrangian multipliers [23], and energy
functions [24] have been proposed to minimize constraint
violations during training and deployment.
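As a sketch of the Lagrangian approach mentioned above (the notation is assumed here: $J_r$ and $J_c$ denote the expected return and expected cost of $\pi_\theta$, and $d$ is the cost limit):
$$\max_{\theta}\ J_r(\pi_\theta) \quad \text{s.t.} \quad J_c(\pi_\theta) \le d, \qquad \mathcal{L}(\theta, \lambda) = J_r(\pi_\theta) - \lambda\left(J_c(\pi_\theta) - d\right),$$
which is typically solved by alternating a policy update on $\theta$ with a multiplier update $\lambda \leftarrow \max\!\left(0,\ \lambda + \alpha_{\lambda}\left(J_c(\pi_\theta) - d\right)\right)$.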
B. Integrated Decision and Control Framework
The integrated decision and control (IDC) framework is an autonomous driving decision and control framework tailored for RL algorithms. The IDC framework consists of two essential modules: the static path planner and the dynamic optimal tracker, which subtly correspond to the critic and actor of Actor-Critic RL algorithms [7]. The static path planner generates the candidate path set $\Pi$, which only considers static traffic information that is irrelevant to traffic participants' behaviors. The path set can be constructed by prior knowledge with consideration of the road structure. Then, the dynamic optimal tracker is where RL algorithms come in. In detail, the critic learns to approximate the tracking cost of each candidate path in the path set, and the actor learns to track each path while assuring safety by meeting the constraints. The optimal critic $v^*$ and actor $\pi^*$ are approximated by $V_w$ and $\pi_\theta$ through offline training. During online application, $V_w$ selects the path with the lowest cost from $\Pi$, and $\pi_\theta$ tracks this path to output the control commands. The IDC framework can incorporate RL in a more efficient and interpretable way compared to end-to-end training. Thanks to the inherited advantages of RL, the IDC framework can cast off tedious human designs and is promising for improving driving intelligence.
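To make the online IDC procedure concrete, a minimal sketch is given below; the function and argument names (candidate_paths, critic, actor) are illustrative assumptions, not the actual IDSim or GOPS API.

```python
import numpy as np

def idc_step(obs, candidate_paths, critic, actor):
    """One online step of the IDC framework (illustrative sketch).

    candidate_paths: list of static candidate paths from the path planner.
    critic(obs, path) -> scalar tracking cost approximated by V_w.
    actor(obs, path)  -> control command [acceleration, steering] from pi_theta.
    """
    # The critic evaluates the tracking cost of every candidate path.
    costs = np.array([critic(obs, path) for path in candidate_paths])
    # The path with the lowest cost is selected for execution.
    best_path = candidate_paths[int(np.argmin(costs))]
    # The actor tracks the selected path and outputs the control command.
    return actor(obs, best_path)
```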
IV. IDSIM SIMULATOR
A. Overview
The IDSim simulator mainly consists of two parts, a sce-
nario library and a simulation engine, supporting the static
path planner module and the dynamic optimal tracker module
in the IDC framework, respectively. The scenario library creates a
set of scenarios with configurable road structure, traffic light
layouts and traffic flow characteristics. The simulation engine
samples scenarios from the pre-generated set and provides
interactive simulation support. It is also possible to use these
two parts in conjunction to generate scenarios on the fly, i.e.,
provide a unique scenario for each new simulation. The overall
architecture is shown in Fig. 1.
B. Scenario Library
The scenario library helps scenario generality by providing
sufficient randomness in scenario generation. To obtain a new
scenario in IDSim, there are three required procedures: gener-
ating road structure, planning static paths, and characterizing
traffic flow.
IDSim is capable of generating randomized road structure
by establishing basic structures and applying randomness to
the positions and topological relationship of junction nodes
and road edges. A demonstration of random segments can be
found in Fig. 2. IDSim also supports importation of arbitrary
custom maps compatible with SUMO’s network format, which
means real-world data in formats such as OpenStreetMap and
OpenDRIVE, or maps created with NetEdit can be directly
integrated. Fig. 3 shows an example map imported from Tsinghua University and its surroundings, and four typical parts are cut out for manual editing. In addition, IDSim implements
various checks to reject malformed maps that can cause trouble
in subsequent procedures or simulations.
To plan static paths, IDSim utilizes a general method
outlined in [25]. The center line of each lane is reused as
Fig. 1. The overall architecture of IDSim, consisting of two parts: a scenario library and a simulation engine. The scenario library creates a set of scenarios
with flexible configuration. For each scenario, the library either generates or imports the road structure, plans static path and deploys traffic flow. The simulation
engine samples one scenario from the pre-generated set for each simulation and provides dynamic interaction support. The scenario manager, traffic manager
and agent manager share a common simulation context and support different aspects of simulation.
Fig. 2. Automated generation of road structure based on a standard junction.
Randomness is applied to edge angles, lane numbers and edge numbers.
a static path segment, while cubic Bézier curves connect the
entrance lane and exit lane of different types of connections,
as depicted in Figure 4. These two types of path segments can
be assembled on demand into a static path that corresponds to
a vehicle’s global route.
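As an illustration of the connection segments, a cubic Bézier curve between an entrance-lane endpoint and an exit-lane start point can be evaluated as below; the choice of control points is a plausible assumption rather than IDSim's exact planner.

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=50):
    """Sample a cubic Bezier curve defined by 2D control points p0..p3."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

# Example: connect the end of an entrance lane to the start of an exit lane,
# placing the middle control points along the lane headings (assumed heuristic).
entry_end, entry_heading = np.array([0.0, 0.0]), np.array([1.0, 0.0])
exit_start, exit_heading = np.array([20.0, 15.0]), np.array([0.0, 1.0])
d = 0.4 * np.linalg.norm(exit_start - entry_end)  # control-point offset
path = cubic_bezier(entry_end, entry_end + d * entry_heading,
                    exit_start - d * exit_heading, exit_start)
```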
Finally, IDSim inherits the configurations from SUMO to
characterize traffic flow, which includes models for vehicles,
cyclists, and pedestrians and allows for customization of traffic
flow characteristics such as density, maximum speed and
emergency braking deceleration. By default, IDSim assigns
randomized traffic flow to the map automatically, yet manual
customization is also supported.
Fig. 3. The map of Tsinghua University, provided as a demonstration that
can be imported into IDSim.
C. Simulation Engine
The simulation engine takes the static scenario information
from the scenario library and offers dynamic interaction sup-
port. Three main components of the engine are the scenario
manager, traffic manager, and agent manager. They share
a common context containing latest information about the
current simulation.
The scenario manager samples and loads a scenario from the
pre-generated scenario set for each simulation, then initializes
Fig. 4. Static paths generated for different types of road structure in common
urban scenarios.
Fig. 5. 3-DOF non-linear vehicle dynamic model.
the underlying SUMO session with the map and traffic flow
definition. The scenario is reused for a configurable number of episodes until the current simulation terminates, after which a new scenario is resampled.
Based on the loaded scenario, the traffic manager randomly
takes over an ego vehicle from the traffic flow at the beginning
of each episode. At each step of the episode, the traffic
manager synchronizes traffic lights, traffic participants, and
other traffic elements between SUMO and the simulation
context through Libsumo API [12]. For each ego vehicle agent,
the traffic manager provides static paths based on its global
route and subscribes to its surrounding traffic information for
perception simulation.
The agent manager is the main component for an external
policy to interact with. At each step of the episode, it takes
action from the policy, updates the ego vehicle’s dynamic
states, and computes the new observation from updated simu-
lation context. The dynamics of the ego vehicle is a classic 3-DOF bicycle model widely adopted in vehicle control, with a 6-dimensional state and a 2-dimensional action [26]:
$$x = \begin{bmatrix} p_x & p_y & v_x & v_y & \phi & \omega \end{bmatrix}^{\mathsf{T}}, \quad u = \begin{bmatrix} a & \delta \end{bmatrix}^{\mathsf{T}},$$
where $p_x$, $p_y$ are the ground coordinates of the ego vehicle's center of gravity (CG), $v_x$, $v_y$ are the longitudinal and lateral velocities, $\phi$ is the heading angle, $\omega$ is the yaw rate, $a$ is the acceleration command, and $\delta$ is the front wheel angle. The state space equation is
$$x' = F(x, u) = \begin{bmatrix}
p_x + \Delta t \, (v_x \cos\phi - v_y \sin\phi) \\
p_y + \Delta t \, (v_x \sin\phi + v_y \cos\phi) \\
v_x + \Delta t \, (a + v_y \omega) \\
\dfrac{m v_x v_y + \Delta t \left[ (l_f k_f - l_r k_r)\,\omega - k_f \delta v_x - m v_x^2 \omega \right]}{m v_x - \Delta t \, (k_f + k_r)} \\
\phi + \Delta t \, \omega \\
\dfrac{I_z \omega v_x + \Delta t \left[ (l_f k_f - l_r k_r)\, v_y - l_f k_f \delta v_x \right]}{I_z v_x - \Delta t \, (l_f^2 k_f + l_r^2 k_r)}
\end{bmatrix} \quad (1)$$
where $\Delta t$ is the discrete time step, and the dynamical parameters (mass $m$, yaw inertia $I_z$, distances $l_f$, $l_r$ from the CG to the front and rear axles, and front and rear cornering stiffnesses $k_f$, $k_r$) are listed in Table IV.
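A direct transcription of (1) into code might look as follows; the parameter values below are placeholders for illustration only (the actual values and sign conventions are those of [26] and Table IV, which is not reproduced here).

```python
import numpy as np

# Placeholder dynamical parameters; the real values are listed in Table IV.
m, I_z = 1500.0, 2500.0        # mass [kg], yaw inertia [kg*m^2]
l_f, l_r = 1.2, 1.6            # CG-to-axle distances [m]
k_f, k_r = -90000.0, -90000.0  # front/rear cornering stiffness [N/rad]
dt = 0.1                       # discrete time step [s]

def bicycle_step(x, u):
    """One step of the discrete 3-DOF bicycle model in (1).
    x = [p_x, p_y, v_x, v_y, phi, omega], u = [a, delta]."""
    p_x, p_y, v_x, v_y, phi, omega = x
    a, delta = u
    return np.array([
        p_x + dt * (v_x * np.cos(phi) - v_y * np.sin(phi)),
        p_y + dt * (v_x * np.sin(phi) + v_y * np.cos(phi)),
        v_x + dt * (a + v_y * omega),
        (m * v_x * v_y + dt * ((l_f * k_f - l_r * k_r) * omega
         - k_f * delta * v_x - m * v_x ** 2 * omega))
        / (m * v_x - dt * (k_f + k_r)),
        phi + dt * omega,
        (I_z * omega * v_x + dt * ((l_f * k_f - l_r * k_r) * v_y
         - l_f * k_f * delta * v_x))
        / (I_z * v_x - dt * (l_f ** 2 * k_f + l_r ** 2 * k_r)),
    ])
```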
One focus of the agent manager is to provide a unified
observation interface under all supported scenarios to achieve
observation generality. The ego vehicle’s states, navigation
information, road structure, traffic participants’ states, and
other relevant dynamic traffic information can be easily
composed and flattened as policy input. The traffic participant observation has the special property that the number of perceived participants can vary over time. To retrieve a
fixed-dimensional vector, multiple choices are discussed and
compared in section V-A4. The observation interface is also
customizable to support more use cases, such as adding noise
in robust RL research and sensitivity analysis. The agent
manager also checks if the ego vehicle has to be removed
because either the route is completed or traffic regulations are
violated. The cause will be recorded for reward computation
and the engine will be notified to terminate the episode.
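As a rough sketch of how such a unified observation vector could be composed from these components (the component names and shapes here are illustrative assumptions, not IDSim's exact layout):

```python
import numpy as np

def compose_observation(ego_state, waypoints, lidar_edges, light_state, participants):
    """Flatten heterogeneous observation components into a single policy input.

    ego_state:    (6,)   ego vehicle state from the dynamics model
    waypoints:    (N, 2) upcoming reference points on the static path
    lidar_edges:  (B,)   distances to road edges in B angular bins
    light_state:  (K,)   encoded traffic light information
    participants: (M, F) fixed-size surrounding participant block (see Sec. V-A4)
    """
    parts = (ego_state, waypoints, lidar_edges, light_state, participants)
    return np.concatenate([np.asarray(p, dtype=np.float32).ravel() for p in parts])
```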
Each agent manager has independent states, so multiple agent managers can operate concurrently to form a multi-agent
simulation.
D. Environment Model & Auxiliary Tools
IDSim offers an environment model to facilitate the
training and evaluation of model-based RL algorithms and
planning-based methods. The environment model is a close
approximation of the simulator, consisting of ego vehicle
dynamic model, surrounding vehicle prediction model, ref-
erence trajectory model and reward model. Starting at states
collected from the simulator at certain time steps, the environ-
ment model can roll out future states given a policy or some actions. Two main features of the environment model are:
1) Differentiable. One can differentiate through the environment model and retrieve gradients of the cost function with respect to the actions.
2) Suitable for planning. All computation of the environment model depends only on its inputs. This enables multiple rollouts starting at the same state, suitable for planning purposes. In contrast, the simulator can only go forward in time; replaying a certain time step in a simulator is often costly and imperfect.
Based on this model, a built-in model predictive control (MPC)
driver is also provided for reference.
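A minimal sketch of how such an environment model can be used for planning-style rollouts follows; the names env_model.step and reward_model are assumptions standing for the ego dynamics, prediction, reference, and reward sub-models described above.

```python
import numpy as np

def rollout(env_model, reward_model, state, action_sequence):
    """Roll the environment model forward from one start state.

    Because the model is a pure function of its inputs, many candidate action
    sequences can be evaluated from the same state, which a forward-only
    simulator cannot do cheaply.
    """
    total_reward, s = 0.0, state
    for a in action_sequence:
        next_s = env_model.step(s, a)        # ego dynamics + predicted traffic
        total_reward += reward_model(s, a)   # tracking / safety reward terms
        s = next_s
    return total_reward, s

# Planning by random shooting: evaluate several candidate sequences from the
# same start state and keep the best one (a simple MPC-style use case).
def plan(env_model, reward_model, state, horizon=20, n_candidates=64, act_dim=2):
    candidates = np.random.uniform(-1.0, 1.0, size=(n_candidates, horizon, act_dim))
    returns = [rollout(env_model, reward_model, state, seq)[0] for seq in candidates]
    return candidates[int(np.argmax(returns))][0]  # first action of the best sequence
```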
In addition to the core components mentioned above, IDSim
also comes with several built-in auxiliary tools. This includes
a logger that can save simulation results and configurations,
a renderer that can provide real-time visualization and output
videos for playback, and an evaluator that can provide five-dimensional metrics: driving safety, regulatory compliance,
driving comfort, travel efficiency and energy efficiency to test
the driving performance of the trained policy systematically.
E. Efficiency & Determinism
The sample efficiency of RL algorithms has long been deemed unsatisfactory. On-policy algorithms, like proximal policy optimization (PPO) [27], employ a parallel sampling technique to speed up the training process. Each worker instantiates and samples from an independent environment, where the environment's execution and RAM efficiency become
crucial to scale up. IDSim is designed with efficiency in
mind. We incorporate careful profiling to locate inefficient
and bottleneck code segments and use various optimization
to accelerate execution and reduce memory footprint. As a
simple experiment, it takes an average of 0.915 ms per step and a maximum resident set size of 136.8 MiB to roll out an environment with a set of 100 scenarios for $2 \times 10^4$ steps on a single thread. With 10 parallel workers, training a PPO agent for $5 \times 10^6$ steps finishes in under 40 minutes. Both experiments above run on an AMD Ryzen Threadripper 3960X CPU, with no GPU involvement.
A proper RL environment shall be fully deterministic with
fixed seed and input. Failure to meet this requirement commonly results from inconsistent PRNG (pseudo random number generator) usage. For instance, MetaDrive is not fully deterministic out of the box and has to be subtly tweaked for reproducible RL training.¹ IDSim ensures that all sources of randomness are controlled through a single master seed. We regularly run integration tests (against a deterministic policy) and end-to-end tests (against RL frameworks) to guarantee that simulation results are exactly reproducible.
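The determinism property can be checked with a standard Gym-style loop; the environment constructor below is a hypothetical placeholder for however IDSim is instantiated, since the exact entry point is not given in the text.

```python
import numpy as np

def rollout_checksum(make_env, seed, n_steps=1000):
    """Run a fixed policy in a seeded environment and hash the trajectory."""
    env = make_env(seed=seed)          # hypothetical constructor; all randomness
    obs = env.reset()                  # inside the env derives from this one seed
    trace = []
    for _ in range(n_steps):
        action = np.zeros(env.action_space.shape)   # deterministic dummy policy
        obs, reward, done, info = env.step(action)  # classic Gym step signature
        trace.append((float(np.asarray(obs).sum()), float(reward), bool(done)))
        if done:
            obs = env.reset()
    return hash(tuple(trace))

# Two runs with the same master seed should produce identical trajectories, e.g.:
# assert rollout_checksum(make_idsim_env, seed=42) == rollout_checksum(make_idsim_env, seed=42)
```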
V. BENCHMARK EXPERIMENTS
We benchmark multiple algorithms in four groups of tasks,
to demonstrate the generality and reliability of our simula-
tor and address some common concerns regarding RL for
autonomous driving.
A. Task Description
All tasks share the common objective of driving the ego
vehicle to the destination within limited time while adhering
to various traffic regulations. This includes maintaining the
vehicle within the road lanes, following traffic light signals,
and avoiding collisions with other vehicles on the road.
The observations used to guide the vehicle’s actions include
the state of the ego vehicle, waypoints, LiDAR points out-
lining the edges of the road, traffic light information, and
observations of the surrounding participants. The control input
consists of the steering angle and acceleration of the vehicle.
The following groups of tasks explore the impact of various
factors on the performance of the RL algorithm, using a con-
trol variable approach. The first three groups vary the scenario
(traffic light layout, traffic flow density, and road structure)
1See a workaround in https://github.com/jjyyxx/srlnbc/blob/main/srlnbc/
env/my_metadrive.py, which targets MetaDrive release https://github.com/
metadriverse/metadrive/releases/tag/MetaDrive-0.2.5.
Fig. 6. Scenario demonstration for group (A). The junction follows right-hand
traffic rules. Thick lines above four incoming edges’ stop lines indicate the
traffic light phase, and the dark green phase has a lower priority than the light
green phase. The rectangle in magenta is the ego vehicle, and rectangles in
blue are vehicles controlled by SUMO. This plotting convention also applies
to Figs. 7, 8, and 9.
Fig. 7. Scenario demonstration for group (B).
Fig. 8. Scenario demonstration for group (C). In group (C), 100 scenarios
are generated as the training set and 1 distinct scenario is used for testing.
The C-i notation in (a)-(e) indicates the i-th scenario in the training set.
while maintaining a common environment definition. The
fourth group varies the surrounding participant observation
presented to the algorithm. The four groups are referred to as
(A), (B), (C) and (D), and variants in each group are referred
to as A1, A2, etc.
1) Traffic Light Layout (A): Traffic lights are used to
regulate the flow of traffic at intersections. The layout of the
traffic lights can significantly affect the characteristics of the
traffic flow and present varying levels of difficulty for the agent
to learn proper behavior. In our experiments, we test three
different layouts at a four-way intersection: opposites (A1),
incoming (A2), and none (A3). For tasks in other groups, the
traffic light layout defaults to opposites.
The opposites layout (A1) is the most common for intersec-
tions, where the two orthogonal directions are allowed to pass
Fig. 9. Scenario demonstration for group (D). Green and red rectangles under the D1 and D2 variants represent the observation vector of the corresponding surrounding vehicle. Orange dots under the D3 variant represent scalar distances
normalized to detection range shown as the light orange circle. The rectangles
and dots are arranged in order (D1: from close to distant; D3: counter-
clockwise) and concatenated into the final fixed-dimensional surrounding
participant observation.
alternately. For each of the two directions, a straight phase is
followed by a left-turn phase. During the straight phase, left-
turn vehicles are permitted to pass if there are no conflicting
straight-going vehicles in the opposite direction. The incoming
layout (A2) further reduces potential conflicts at the cost of
longer waiting times for each incoming road. Each road has
a dedicated phase where all vehicles are allowed to pass. The
none layout (A3) literally disables the traffic lights, posing a
greater challenge for the agent to learn proper driving behavior.
2) Traffic Flow Density (B): The difficulty of collision
avoidance increases with the density of traffic flow. Quantita-
tively, the density is represented by the frequency at which new
vehicles are introduced into the network. In our experiments,
we test three levels of density: sparse (B1), normal (B2),
and dense (B3), which are characterized by vehicle spawn
periods of 1.8, 1.2, and 0.8 seconds, respectively. The sparse
level is relatively easy to navigate, often requiring little to no
interaction with surrounding participants. The dense level is
more challenging, but is capped to avoid traffic jams. For tasks
in other groups, the traffic flow density defaults to normal.
3) Road Structure (C): The road structure is randomly
generated to evaluate the generalizability of the learned policy.
The algorithms are trained on 1 (C1), 10 (C2), and 100 (C3)
scenarios and then tested on a previously unseen scenario. For
tasks in other groups, the road structure of both training and test scenarios is the same as the training scenario of C1.
4) Surrounding Participant Observation (D): Interaction
with surrounding participants is a primary challenge in
autonomous driving. The surrounding participant observation
has the special property of being a set with varying dimen-
sion [28]. This group of tasks examines various methods of
converting it into a fixed-dimensional observation.
The distance-sorted observation (D1) sorts the surrounding
participants by their distance from the ego vehicle [29]. It then
selects at most the 8 closest vehicles and concatenates their states
into a vector. If there are fewer vehicles than the expected
size, the observation is padded with zeros. An additional
TABLE II
ON-POLICY ALGORITHMS HYPERPARAMETERS
mask is also appended to each vehicle’s state, indicating
whether it is a real vehicle or a padding element. The top-
risky observation (D2) selects a single vehicle that is deemed
most risky according to a set of rules and uses its state as the
observation. The criterion to judge the top-risky vehicle is:
outside the intersection, it is the vehicle in front; and inside
the intersection, it is the nearest vehicle in the conflicting
traffic flow. The fixed-directional (D3) observation simulates
a LiDAR sensor by dividing the surroundings into a number
of bins and casting rays to surrounding participants to obtain
a fixed-size observation [30]. Each element in the observation
reflects the ego vehicle’s closest distance to surrounding par-
ticipants in a specific bin and direction. We set the number
of bins to 16 and the elements are arranged in counter-
clockwise order starting from the bin in front. For tasks in other groups, the surrounding participant observation defaults to distance-sorted.
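A compact sketch of the distance-sorted (D1) conversion described above is given below; the per-vehicle feature layout is an assumption for illustration.

```python
import numpy as np

def distance_sorted_obs(ego_xy, participants, max_num=8):
    """D1: sort surrounding participants by distance to the ego vehicle, keep at
    most the max_num closest, zero-pad to a fixed size, and append a validity
    mask to each slot so padding can be distinguished from real vehicles.

    participants: (M, F) array whose first two features are assumed to be the
                  participant's (x, y) position; M may vary from step to step.
    Returns a flat vector of length max_num * (F + 1).
    """
    num, feat_dim = participants.shape
    out = np.zeros((max_num, feat_dim + 1), dtype=np.float32)  # last column = mask
    if num > 0:
        dists = np.linalg.norm(participants[:, :2] - ego_xy, axis=1)
        order = np.argsort(dists)[:max_num]
        out[:len(order), :feat_dim] = participants[order]
        out[:len(order), feat_dim] = 1.0  # mark real vehicles
    return out.ravel()
```

The fixed-directional observation (D3) would instead bucket the same participants into a fixed number of angular bins and record the closest normalized distance per bin.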
B. RL Algorithms
Five model-free algorithms are selected for benchmarking
on the aforementioned groups of tasks. Proximal policy optimization (PPO) [27], soft actor-critic (SAC) [31] and distributional
soft actor-critic (DSAC) [3] are general-purpose RL algo-
rithms. PPO-Lagrangian (PPO-Lag) [30] and SAC-Lagrangian
(SAC-Lag) [32] are common baselines for safe RL. For safe
RL algorithms, constraints are handled with dual ascent, where the Lagrange multiplier is updated with gradient ascent. Also, the environment during training is adjusted so that collision with surrounding participants does not immediately terminate the episode or cause a penalty. Instead, a cost of +1 is given for every step during which a collision is occurring, indicating constraint violation. During evaluation, the environment settings are kept the same as for the general-purpose RL algorithms.
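As a minimal illustration of how the per-step collision cost can feed the dual-ascent update described above (the cost limit and step size are assumptions; the exact values are not stated in the text):

```python
class LagrangeMultiplier:
    """Dual-ascent update for the collision-cost constraint (sketch).

    Each training step yields cost = 1.0 while a collision is occurring and
    0.0 otherwise; the multiplier grows whenever the accumulated episode cost
    exceeds the limit and is projected back onto lambda >= 0 otherwise.
    """

    def __init__(self, cost_limit=0.0, lr=1e-2):
        self.lam = 0.0
        self.cost_limit = cost_limit  # assumed limit on expected episode cost
        self.lr = lr                  # assumed dual step size

    def update(self, episode_cost):
        # Gradient ascent on the dual variable, projected onto lam >= 0.
        self.lam = max(0.0, self.lam + self.lr * (episode_cost - self.cost_limit))
        return self.lam

    def penalized_reward(self, reward, cost):
        # Reward signal used by the policy update: r - lambda * c.
        return reward - self.lam * cost
```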
The hyperparameters for two on-policy algorithms (PPO,
PPO-Lag) and three off-policy algorithms (SAC, DSAC, and
SAC-Lag) are listed in Table II and Table III, respectively.
Most of them are directly taken from the original literature.
C. Simulation Parameter Configuration
There are two main sampling times in IDSim. One is the sampling time of the environment $\Delta t_{\mathrm{env}}$, which directly interacts with the algorithm; the other is the sampling time of the
TABLE III
OFF-POLICY ALGORITHMS HYPERPARAMETERS
TABLE IV
DYNAMICAL PARAMETERS OF EGO VEHICLE
TABLE V
REWARD TERMS FOR ALL TASKS
dynamic model $\Delta t$ in (1). Here we set $\Delta t_{\mathrm{env}} = \Delta t = 0.1$ s. As for the other dynamical parameters of the vehicle model, the details are listed in Table IV.
The reward is composed of a dense term for driving forward
along the route, as well as a conditional term for positioning
the vehicle appropriately at intersections. An episode termi-
nates when one of the following events happens to the agent:
reaching the end of its route (referred to as success), collision with other traffic participants, leaving the driving area, violating traffic lights, or exceeding the time limit, at which point a terminal reward or penalty is issued. The specific descriptions and values are listed in Table V. The notation $d$ represents the distance traveled along the reference path over the last time step, and $v_d$ denotes the desired velocity. $v_d$ is set to 8 m/s in the case of a green light and 0 when encountering a red light.
We train each algorithm on each task for five runs, using
different random seeds for the environment and neural network
initialization. The policy checkpoints are stored every 2% of the training process and evaluated in the corresponding test
scenario for 100 episodes.
D. Results
The success rates of the final policies are reported in
Table VI. The performance of intermediate policies can be
found in Supplementary Material.
In general, PPO and SAC-Lag perform well consistently across all tasks, and DSAC outperforms SAC except for task A2. Typically, an agent can learn to follow its path, keep itself in the driving area, and obey traffic light regulations. Failures commonly fall into two categories: (1) failing to learn collision avoidance and colliding repeatedly; (2) being overly conservative to avoid collision, leading to exceeding the time limit.
Groups (A) and (B) meet the design goal of allowing for different levels of scenario difficulty through the adjustment of traffic light layout and traffic flow density. The results of group (A) are consistent with the discussion in section V-A1: the incoming layout is the easiest for all algorithms, and the none layout is quite challenging, preventing algorithms from achieving reasonable performance without careful design. In group (B), all algorithms' performance declines when the traffic flow density increases from sparse to dense, which meets expectations. For sparse density, the algorithms' performances are on par with each other and the variance is low, but for harder scenarios, the differences between algorithms become larger, which may result from their different exploration-exploitation characteristics.
Group (C) focuses on the scenario generalizability of var-
ious RL algorithms. Fig. 10 reports the final training and test performance of algorithms trained on different numbers of
scenarios. From the viewpoint of multi-task learning, driving
in each scenario can be viewed as a task. The approach
of mixing all scenarios equally corresponds to the simplest
form of multi-task learning. This approach is known to be
susceptible to unbalanced optimization, where certain sce-
narios dominate others, leading to poor overall performance.
This is supported by Fig. 10, where the success rate drops
as the number of training scenarios increases. Multi-task
learning could leverage the problem structure, and achieve
improved performance on training scenarios. On the other
hand, the evaluation on the unseen test scenario is a matter
of generalization. When trained on a single scenario, this can
be thought of as out-of-distribution (OOD) generalization and
extrapolation, which is a well-known difficult topic. But when training on 100 scenarios, this is closer to in-distribution generalization and interpolation. So, for generalizability, it is desirable to have multiple scenarios, where multi-task learning could help. In Fig. 10, the train-test gap is obvious,
suggesting that training on a single scenario is not enough.
With a larger number of training scenarios, PPO obtains clear test performance improvements. For other algorithms, the
improvements are minor, possibly because of their deterio-
rated training performances. Therefore, if multi-task learning
achieves improved training performance, its test performance
should also surpass the baselines. Further training curve
comparison can be found in Supplementary Material.
Group (D) shows that the design of surrounding partici-
pant observation has a strong impact on test performance.
As is shown in Fig. 11, the on-policy algorithm PPO learns
faster and achieves better final performance with D2 and
TABLE VI
ALGORITHM PERFORMANCE EVALUATION ON 4 GROUPS OF TASKS
Fig. 10. Success rate between training and test scenarios for task group (C).
It is desirable to have better test performance and smaller gap between training
and test performance.
D3 observations; for the off-policy algorithm DSAC, the D2 and D3 observations also lead to an initial boost, but the final performance difference is not obvious. For other algorithms,
the training curves can be found in Supplementary Material.
Taking a closer look at each kind of surrounding participant
observation, D2 discards all information about other vehicles
except for the top-risky one. D3 uses a LiDAR-like observation
common in robotic control. Both D2 and D3 lose certain aspects of information, but help the agent achieve better performance. D1 appears to be the most information-complete, yet it has the worst performance in the group. This is because the raw surrounding observation is a set of participant states. Like a set in mathematics, it has no inherent ordering. In other words, arbitrary permutations of the set will not change the scenario, and the policy should produce an identical action. The distance-sorted observation meets this requirement by introducing an order and sorting accordingly before passing the result to the policy. However, unlike an ideal order such as sorting by risk level, which is not easily available, sorting by distance creates discontinuity and harms policy learning. Consider two participants at roughly the same distance from the ego
Fig. 11. Typical on-policy and off-policy algorithms’ success rate with
different kinds of surrounding participant observations in task group (D). The
solid lines correspond to the mean and the shaded regions correspond to 90%
confidence interval over five runs.
vehicle but in different directions: when their order is swapped, the observation changes discontinuously. In contrast, the other two observations, though leaving out certain information,
are much easier for the policy to leverage to avoid risk. It is clear that there is plenty of room to improve the observation design. For instance, permutation-invariant encoding introduces a hard inductive bias to meet the permutation-invariance requirement and has the potential to overcome the discontinuity while utilizing all state information.
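As an example of such an encoding, a sum-pooling encoder in the spirit of [28] can be sketched as follows; the layer sizes are arbitrary, and this is not the architecture used in the benchmark.

```python
import torch
import torch.nn as nn

class PermutationInvariantEncoder(nn.Module):
    """Encode a variable-size set of participant states into a fixed vector.

    Each participant is embedded independently and the embeddings are summed,
    so any permutation of the input set yields exactly the same output.
    """

    def __init__(self, feat_dim, embed_dim=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(feat_dim, embed_dim), nn.ReLU(),
                                 nn.Linear(embed_dim, embed_dim))

    def forward(self, participants, mask):
        # participants: (batch, max_num, feat_dim); mask: (batch, max_num)
        emb = self.phi(participants) * mask.unsqueeze(-1)  # zero out padding slots
        return emb.sum(dim=1)                              # order-independent pooling
```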
VI. CONCLUSION
We introduce IDSim, an autonomous driving benchmark
simulator for general urban scenarios, supporting the scenario
generality and observation generality of RL algorithms. Based
on systematic benchmark experiments, we suggest that multi-task learning and observation design are potential areas for algorithms to improve upon. We further hope that IDSim can support research into how RL algorithms perform in the decision and control of autonomous driving in a generalizable way.
REFERENCES
[1] Y. Guan, Y. Ren, S. E. Li, Q. Sun, L. Luo, and K. Li, “Centralized
cooperation for connected and automated vehicles at intersections by
proximal policy optimization,” IEEE Trans. Veh. Technol., vol. 69,
no. 11, pp. 12597–12608, Nov. 2020.
[2] S. E. Li, Reinforcement Learning for Sequential Decision-Making and
Control. Cham, Switzerland: Springer, 2023.
[3] J. Duan, Y. Guan, S. E. Li, Y. Ren, Q. Sun, and B. Cheng, “Distributional
soft actor-critic: Off-policy reinforcement learning for addressing value
estimation errors,” IEEE Trans. Neural Netw. Learn. Syst., vol. 33,
no. 11, pp. 6584–6598, Nov. 2022.
[4] T. P. Lillicrap et al., “Continuous control with deep reinforcement
learning,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2016, pp. 1–14.
[5] J. Duan, S. Eben Li, Y. Guan, Q. Sun, and B. Cheng, “Hierarchical reinforcement learning for self-driving decision-making without reliance on labelled driving data,” IET Intell. Transp. Syst., vol. 14, no. 5, pp. 297–305, May 2020.
[6] J. Chen, B. Yuan, and M. Tomizuka, “Model-free deep reinforcement learning for urban autonomous driving,” in Proc. Int. Conf. Intell. Transp. Syst. (ITSC), 2019, pp. 2765–2771.
[7] Y. Guan et al., “Integrated decision and control: Toward interpretable and computationally efficient driving intelligence,” IEEE Trans. Cybern., vol. 53, no. 2, pp. 859–873, Feb. 2023.
[8] B. Wymann et al., “TORCS, the open racing car simulator,” vol. 4, no. 6, p. 2, 2000. [Online]. Available: http://torcs.sourceforge.net
[9] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An open urban driving simulator,” in Proc. Conf. Robot Learn. (CoRL), 2017, pp. 1–16.
[10] M. Zhou et al., “SMARTS: An open-source scalable multi-agent RL training school for autonomous driving,” in Proc. Conf. Robot Learn. (CoRL), 2021, pp. 264–285.
[11] Q. Li, Z. Peng, L. Feng, Q. Zhang, Z. Xue, and B. Zhou, “MetaDrive:
Composing diverse driving scenarios for generalizable reinforcement
learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 3,
pp. 3461–3475, Mar. 2023.
[12] P. A. Lopez et al., “Microscopic traffic simulation using SUMO,” in Proc. Int. Conf. Intell. Transp. Syst. (ITSC), 2018, pp. 2575–2582.
[13] G. Brockman et al., “OpenAI gym,” 2016, arXiv:1606.01540.
[14] E. Liang et al., “RLlib: Abstractions for distributed reinforcement learn-
ing,” in Proc. Int. Conf. Mach. Learn. (ICML), 2018, pp. 3053–3062.
[15] J. Weng et al., “Tianshou: A highly modularized deep reinforcement learning library,” J. Mach. Learn. Res., vol. 23, no. 267, pp. 1–6, 2022.
[16] W. Wang et al., “GOPS: A general optimal control problem solver for autonomous driving and industrial control applications,” Commun. Transp. Res., vol. 3, Dec. 2023, Art. no. 100096.
[17] S. Shah, D. Dey, C. Lovett, and A. Kapoor, “AirSim: High-fidelity visual and physical simulation for autonomous vehicles,” in Field and Service Robotics. Cham, Switzerland: Springer, 2018, pp. 621–635.
[18] P. Cai, Y. Lee, Y. Luo, and D. Hsu, “SUMMIT: A simulator for urban
driving in massive mixed traffic,” in Proc. IEEE Int. Conf. Robot. Autom.
(ICRA), May 2020, pp. 4023–4029.
[19] P. Palanisamy, “Multi-agent connected autonomous driving using deep
reinforcement learning,” in Proc. Int. Joint Conf. Neural Netw. (IJCNN),
Jul. 2020, pp. 1–7.
[20] Y. Guan et al., “Direct and indirect reinforcement learning,” Int. J. Intell. Syst., vol. 36, no. 8, pp. 4439–4467, Aug. 2021.
[21] J. García and F. Fernández, “A comprehensive survey on safe reinforcement learning,” J. Mach. Learn. Res., vol. 16, no. 42, pp. 1437–1480, Aug. 2015.
[22] C. Tessler, D. J. Mankowitz, and S. Mannor, “Reward constrained
policy optimization,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2019,
pp. 1–15.
[23] Y. Chow, A. Tamar, S. Mannor, and M. Pavone, “Risk-sensitive and
robust decision-making: A CVaR optimization approach,” in Proc. Adv.
Neural Inf. Process. Syst. (NeurIPS), 2015, pp. 1–9.
[24] T. Wei and C. Liu, “Safe control algorithms using energy functions: A unified framework, benchmark, and new directions,” in Proc. IEEE 58th Conf. Decis. Control (CDC), Dec. 2019, pp. 238–243.
[25] Y. Ren, G. Zhan, L. Tang, S. Eben Li, J. Jiang, and J. Duan, “Improve
generalization of driving policy at signalized intersections with adver-
sarial learning,” 2022, arXiv:2204.04403.
[26] Q. Ge, Q. Sun, S. E. Li, S. Zheng, W. Wu, and X. Chen, “Numeri-
cally stable dynamic bicycle model for discrete-time control,” in Proc.
IEEE Intell. Vehicles Symp. Workshops (IV Workshops), Jul. 2021,
pp. 128–134.
[27] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov,
“Proximal policy optimization algorithms,” 2017, arXiv:1707.06347.
[28] J. Duan et al., “Fixed-dimensional and permutation invariant state representation of autonomous driving,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 7, pp. 9518–9528, Jul. 2022.
[29] Y. Ren et al., “Self-learned intelligence for integrated decision and
control of automated vehicles at signalized intersections,” IEEE Trans.
Intell. Transp. Syst., vol. 23, no. 12, pp. 24145–24156, Dec. 2022.
[30] S. Fujimoto, E. Conti, M. Ghavamzadeh, and J. Pineau, “Benchmarking
batch deep reinforcement learning algorithms,” 2019, arXiv:1910.01708.
[31] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in Proc. Int. Conf. Mach. Learn. (ICML), 2018, pp. 1861–1870.
[32] S. Ha, P. Xu, Z. Tan, S. Levine, and J. Tan, “Learning to walk in the real world with minimal human effort,” in Proc. Conf. Robot Learn. (CoRL), 2021, pp. 1110–1120.
Yuxuan Jiang received the B.Eng. degree from the
School of Vehicle and Mobility, Tsinghua Univer-
sity, Beijing, China, in 2021, where he is currently
pursuing the master’s degree with the School of
Vehicle and Mobility.
His research interests include optimal control
and reinforcement learning and its application to
autonomous driving.
Guojian Zhan received the B.Eng. degree from the
School of Vehicle and Mobility, Tsinghua Univer-
sity, Beijing, China, in 2021, where he is currently
pursuing the Ph.D. degree with the School of Vehicle
and Mobility.
His research interests lie in autonomous driving
and reinforcement learning.
Zhiqian Lan received the B.Eng. degree from the
School of Vehicle and Mobility, Tsinghua Univer-
sity, Beijing, China, in 2021, where he is currently
pursuing the M.S. degree with the School of Vehicle
and Mobility.
His research interests lie in autonomous driving,
deep learning, and optimal filtering.
Chang Liu received the B.S. degree in electrical
engineering and in applied mathematics from Peking
University, China, in 2011, and the dual M.S. degree
in mechanical engineering and in computer science
and the Ph.D. degree in mechanical engineering from
the University of California at Berkeley, Berkeley,
in 2014, 2016, and 2017, respectively.
He was a Software Engineer with NVIDIA Corpo-
ration and a Post-Doctoral Associate with the Sibley
School of Mechanical and Aerospace Engineering,
Cornell University. He is currently an Assistant
Professor with the College of Engineering, Peking University, China. His
research interests include the robotic motion planning, active sensing, and
human–robot collaboration.
Bo Cheng received the B.S. and M.S. degrees in
automotive engineering from Tsinghua University,
Beijing, China, in 1985 and 1988, respectively, and
the Ph.D. degree in mechanical engineering from
The University of Tokyo, Tokyo, Japan, in 1998.
He is currently a Professor with the School of
Vehicle and Mobility, Tsinghua University, where
he is also the Dean of the Suzhou Automotive
Research Institute. His active research interests
include autonomous vehicles, driver-assistance sys-
tems, active safety, and vehicular ergonomics. He is
also the Chairperson of the Academic Board of SAE-Beijing and a member
of the Council of the Chinese Ergonomics Society.
Shengbo Eben Li (Senior Member, IEEE) received
the M.S. and Ph.D. degrees in mechanical engi-
neering from Tsinghua University, Beijing, China,
in 2006 and 2009, respectively. He worked with
Stanford University, University of Michigan, and
University of California at Berkeley, Berkeley.
He is currently a tenured Professor with Tsinghua
University. He is the author of more than 100 journal/conference papers and the co-inventor of
over 20 Chinese patents. His active research interests
include intelligent vehicles and driver assistance,
reinforcement learning and distributed control, optimal control, and estima-
tion. He serves as an Associate Editor for IEEE Intelligent Transportation Systems Magazine and IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS.
... Despite the extensive benchmarks established for RL in domains such as games [25], [26] and autonomous driving [27], a similar benchmark in DTRs has been conspicuously lacking. This gap highlights a significant need in the field to provide a robust platform that can systematically evaluate and compare the performance of RL algorithms in complex, dynamic healthcare settings. ...
Preprint
Full-text available
Reinforcement learning (RL) has garnered increasing recognition for its potential to optimise dynamic treatment regimes (DTRs) in personalised medicine, particularly for drug dosage prescriptions and medication recommendations. However, a significant challenge persists: the absence of a unified framework for simulating diverse healthcare scenarios and a comprehensive analysis to benchmark the effectiveness of RL algorithms within these contexts. To address this gap, we introduce \textit{DTR-Bench}, a benchmarking platform comprising four distinct simulation environments tailored to common DTR applications, including cancer chemotherapy, radiotherapy, glucose management in diabetes, and sepsis treatment. We evaluate various state-of-the-art RL algorithms across these settings, particularly highlighting their performance amidst real-world challenges such as pharmacokinetic/pharmacodynamic (PK/PD) variability, noise, and missing data. Our experiments reveal varying degrees of performance degradation among RL algorithms in the presence of noise and patient variability, with some algorithms failing to converge. Additionally, we observe that using temporal observation representations does not consistently lead to improved performance in DTR settings. Our findings underscore the necessity of developing robust, adaptive RL algorithms capable of effectively managing these complexities to enhance patient-specific healthcare. We have open-sourced our benchmark and code at https://github.com/GilesLuo/DTR-Bench.
Book
Full-text available
As one of the most important AI branches, reinforcement learning (RL) has attracted increasing attention in recent years. RL is an interdisciplinary field of trial-and-error learning and optimal control that promises to provide optimal solutions for decision-making or control in large-scale and complex dynamic processes. One of its most conspicuous successes is AlphaZero from Google DeepMind, which has beaten the highest-level professional human player. The underlying key technology is the so-called deep reinforcement learning, which equips AlphaGo with amazing self-evolution ability and high playing intelligence. This book aims to provide a systematic introduction to fundamental RL theories, mainstream RL algorithms and typical RL applications for researchers and engineers. The main topics include Markov decision processes, Monte Carlo learning, temporal difference learning, RL with function approximation, policy gradient method, approximate dynamic programming, deep reinforcement learning, etc.
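Among the topics listed, temporal-difference learning admits a compact illustration. The sketch below is a minimal tabular TD(0) policy-evaluation loop, assuming a Gymnasium-style environment with discrete integer states; it is an illustration of the standard update rule, not code from the book.

```python
# Tabular TD(0) prediction: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).
import numpy as np

def td0_evaluate(env, policy, num_states, episodes=500, alpha=0.1, gamma=0.99):
    V = np.zeros(num_states)
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, terminated, truncated, _ = env.step(a)
            target = r + (0.0 if terminated else gamma * V[s_next])
            V[s] += alpha * (target - V[s])   # temporal-difference update
            s, done = s_next, terminated or truncated
    return V
```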
Article
Full-text available
Intersection is one of the most accident-prone urban scenarios for autonomous driving wherein making safe and computationally efficient decisions is non-trivial. Current research mainly focuses on the simplified traffic conditions while ignoring the existence of mixed traffic flows, i.e., vehicles, cyclists and pedestrians. For urban roads, different participants lead to a quite dynamic and complex interaction, posing great difficulty to learn an intelligent policy. This paper develops the dynamic permutation state representation in the framework of integrated decision and control (IDC) to handle signalized intersections with mixed traffic flows. Specially, this representation introduces an encoding function and summation operator to construct driving states from environmental observation, capable of dealing with different types and variant number of traffic participants. A constrained optimal control problem is built wherein the objective involves tracking performance and the constraints for different participants, roads and signal lights are designed respectively to assure safety. We solve this problem by gradient-based optimization, wherein the reasonable state will be given by the encoding function and then served as the input of policy and value function. An off-policy training is designed to reuse observations from driving environment and backpropagation through time is utilized to update the policy function and encoding function jointly. Verification result shows that the dynamic permutation state representation can enhance the driving performance of IDC, including comfort, decision compliance and safety with a large margin. The trained driving policy can realize efficient and smooth passing in the complex intersection, guaranteeing driving intelligence and safety simultaneously.
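The encoding-and-summation idea for mixed traffic can be sketched as follows. This is our own simplified PyTorch illustration (class and dimension names are assumptions, not the paper's code): each participant type gets an encoding network, per-participant encodings are summed, and the sum is concatenated with ego features, so the state dimension stays fixed regardless of how many participants are observed.

```python
import torch
import torch.nn as nn

class MixedTrafficState(nn.Module):
    def __init__(self, feat_dims=None, encode_dim=64):
        super().__init__()
        feat_dims = feat_dims or {"vehicle": 8, "cyclist": 6, "pedestrian": 4}
        self.encoders = nn.ModuleDict({
            kind: nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, encode_dim))
            for kind, dim in feat_dims.items()})
        self.encode_dim = encode_dim

    def forward(self, observations, ego):
        # observations: dict of participant type -> (N_type, feat_dim) tensor; ego: (ego_dim,)
        summed = ego.new_zeros(self.encode_dim)
        for kind, feats in observations.items():
            if feats.numel() > 0:
                summed = summed + self.encoders[kind](feats).sum(dim=0)  # order-independent
        return torch.cat([summed, ego], dim=-1)  # fixed-dimensional policy/value input
```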
Article
Full-text available
In reinforcement learning (RL), function approximation errors are known to easily lead to the Q-value overestimations, thus greatly reducing policy performance. This article presents a distributional soft actor-critic (DSAC) algorithm, which is an off-policy RL method for continuous control setting, to improve the policy performance by mitigating Q-value overestimations. We first discover in theory that learning a distribution function of state-action returns can effectively mitigate Q-value overestimations because it is capable of adaptively adjusting the update step size of the Q-value function. Then, a distributional soft policy iteration (DSPI) framework is developed by embedding the return distribution function into maximum entropy RL. Finally, we present a deep off-policy actor-critic variant of DSPI, called DSAC, which directly learns a continuous return distribution by keeping the variance of the state-action returns within a reasonable range to address exploding and vanishing gradient problems. We evaluate DSAC on the suite of MuJoCo continuous control tasks, achieving the state-of-the-art performance.
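A simplified sketch of the distributional-critic idea follows; it is a hedged illustration rather than the authors' implementation. The critic outputs a Gaussian over returns, is trained by negative log-likelihood toward a bootstrapped target, and the predicted standard deviation is clamped to a bounded range, mirroring the abstract's point about keeping the return variance reasonable.

```python
import torch
import torch.nn as nn

class DistributionalCritic(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256, min_std=0.1, max_std=10.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 2))
        self.min_std, self.max_std = min_std, max_std

    def forward(self, obs, act):
        mean, log_std = self.net(torch.cat([obs, act], dim=-1)).chunk(2, dim=-1)
        std = log_std.exp().clamp(self.min_std, self.max_std)  # bounded return variance
        return mean, std

def critic_loss(critic, obs, act, target_return):
    # Negative log-likelihood of the target return under the predicted distribution.
    mean, std = critic(obs, act)
    return -torch.distributions.Normal(mean, std).log_prob(target_return).mean()
```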
Article
Full-text available
Reinforcement learning (RL) algorithms have been successfully applied to a range of challenging sequential decision-making and control tasks. In this paper, we classify RL into direct and indirect RL according to how they seek the optimal policy of the Markov decision process problem. The former solves the optimal policy by directly maximizing an objective function using gradient descent methods, in which the objective function is usually the expectation of accumulative future rewards. The latter indirectly finds the optimal policy by solving the Bellman equation, which is the sufficient and necessary condition from Bellman's principle of optimality. We study policy gradient (PG) forms of direct and indirect RL and show that both of them can derive the actor–critic architecture and can be unified into a PG with the approximate value function and the stationary state distribution, revealing the equivalence of direct and indirect RL. We employ a Gridworld task to verify the influence of different forms of PG, suggesting their differences and relationships experimentally. Finally, we classify current mainstream RL algorithms using the direct and indirect taxonomy, together with other ones, including value-based and policy-based, model-based and model-free.
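The "direct" route (maximizing the expected return by gradient ascent) can be made concrete with a minimal REINFORCE-style update. The sketch below uses PyTorch and our own names; it assumes a trajectory of (log-probability, reward) pairs collected with the current policy.

```python
import torch

def policy_gradient_step(optimizer, trajectory, gamma=0.99):
    # trajectory: list of (log_prob, reward) pairs from the current policy
    returns, g = [], 0.0
    for _, r in reversed(trajectory):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # variance reduction
    log_probs = torch.stack([lp for lp, _ in trajectory])
    loss = -(log_probs * returns).sum()   # ascend the expected-return objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```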
Article
Driving safely requires multiple capabilities from human and intelligent agents, such as the generalizability to unseen environments, the safety awareness of the surrounding traffic, and the decision-making in complex multi-agent settings. Despite the great success of Reinforcement Learning (RL), most of the RL research works investigate each capability separately due to the lack of integrated environments. In this work, we develop a new driving simulation platform called MetaDrive to support the research of generalizable reinforcement learning algorithms for machine autonomy. MetaDrive is highly compositional, which can generate an infinite number of diverse driving scenarios from both the procedural generation and the real data importing. Based on MetaDrive, we construct a variety of RL tasks and baselines in both single-agent and multi-agent settings, including benchmarking generalizability across unseen scenes, safe exploration, and learning multi-agent traffic. The generalization experiments conducted on both procedurally generated scenarios and real-world scenarios show that increasing the diversity and the size of the training set leads to the improvement of the RL agent's generalizability. We further evaluate various safe reinforcement learning and multi-agent reinforcement learning algorithms in MetaDrive environments and provide the benchmarks. Source code, documentation, and demo video are available at https://metadriverse.github.io/metadrive .
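A hedged usage sketch for MetaDrive is shown below. The exact configuration keys and the reset/step signatures depend on the installed version, so treat the details as assumptions and consult https://metadriverse.github.io/metadrive for the definitive interface.

```python
from metadrive import MetaDriveEnv

env = MetaDriveEnv()        # default config with procedurally generated scenarios
obs = env.reset()           # newer Gymnasium-style releases return (obs, info)
for _ in range(1000):
    action = env.action_space.sample()   # replace with a trained RL policy
    step_result = env.step(action)
    obs = step_result[0]
    done = step_result[2] if len(step_result) == 4 else (step_result[2] or step_result[3])
    if done:
        obs = env.reset()
env.close()
```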
Article
Decision and control are core functionalities of high-level automated vehicles. Current mainstream methods, such as functional decomposition and end-to-end reinforcement learning (RL), suffer high time complexity or poor interpretability and adaptability on real-world autonomous driving tasks. In this article, we present an interpretable and computationally efficient framework called integrated decision and control (IDC) for automated vehicles, which decomposes the driving task into static path planning and dynamic optimal tracking that are structured hierarchically. First, the static path planning generates several candidate paths only considering static traffic elements. Then, the dynamic optimal tracking is designed to track the optimal path while considering the dynamic obstacles. To that end, we formulate a constrained optimal control problem (OCP) for each candidate path, optimize them separately, and follow the one with the best tracking performance. To unload the heavy online computation, we propose a model-based RL algorithm that can be served as an approximate-constrained OCP solver. Specifically, the OCPs for all paths are considered together to construct a single complete RL problem and then solved offline in the form of value and policy networks for real-time online path selecting and tracking, respectively. We verify our framework in both simulations and the real world. Results show that compared with baseline methods, IDC has an order of magnitude higher online computing efficiency, as well as better driving performance, including traffic efficiency and safety. In addition, it yields great interpretability and adaptability among different driving scenarios and tasks.
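The online selection step of this hierarchical scheme can be summarized in a few lines. The sketch below is our own simplification with hypothetical function names, not the authors' implementation: each candidate static path is scored with a learned value function, and the tracking policy acts on the best-scoring path.

```python
import numpy as np

def idc_step(ego_state, obstacles, candidate_paths, value_fn, policy_fn):
    """Score candidate paths with the value network, then track the best one."""
    scores = [value_fn(ego_state, obstacles, path) for path in candidate_paths]
    best_path = candidate_paths[int(np.argmax(scores))]    # online path selection
    return policy_fn(ego_state, obstacles, best_path)       # online optimal tracking control
```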
Article
In this paper, we propose a new state representation method, called encoding sum and concatenation (ESC), to describe the environment observation for decision-making in autonomous driving. Unlike existing state representation methods, ESC is applicable to the situation where the number of surrounding vehicles is variable and eliminates the need for manually pre-designed sorting rules, leading to higher representation ability and generality. The proposed ESC method introduces a feature neural network (NN) to encode the real-valued feature of each surrounding vehicle into an encoding vector, and then adds these vectors up to obtain the representation vector of the set of surrounding vehicles. Then, a fixed-dimensional and permutation-invariance state representation can be obtained by concatenating the set representation with other variables, such as indicators of the ego vehicle and road. By introducing the sum-of-power mapping, this paper has further proved that the injectivity of the ESC state representation can be guaranteed if the output dimension of the feature NN is greater than the number of variables of all surrounding vehicles. This means that the ESC representation can be used to describe the environment and taken as the inputs of learning-based policy functions. Experiments demonstrate that compared with the fixed-permutation representation method, the policy learning accuracy based on ESC representation is improved by 62.2%.
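The ESC construction itself is compact enough to sketch. The code below is a hedged illustration with our own dimensions and names: each surrounding vehicle's features are encoded, the encodings are summed, and the sum is concatenated with ego/road indicators. Per the abstract's injectivity condition, the encoder output dimension should exceed the total number of surrounding-vehicle variables; the check at the end confirms permutation invariance.

```python
import torch
import torch.nn as nn

vehicle_dim, ego_road_dim, encode_dim = 6, 10, 128
feature_nn = nn.Sequential(nn.Linear(vehicle_dim, 64), nn.ReLU(), nn.Linear(64, encode_dim))

def esc_state(surrounding, ego_road):
    # surrounding: (N, vehicle_dim) with variable N; ego_road: (ego_road_dim,)
    return torch.cat([feature_nn(surrounding).sum(dim=0), ego_road], dim=-1)

# Shuffling the surrounding vehicles leaves the state unchanged (permutation invariance).
x = torch.randn(5, vehicle_dim)
ego = torch.randn(ego_road_dim)
assert torch.allclose(esc_state(x, ego), esc_state(x[torch.randperm(5)], ego), atol=1e-5)
```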