©2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future
media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
2018 IEEE Intelligent Vehicles Symposium (IV), pp. 993-1000, 26-30 June 2018.
Adaptive Behavior Generation for Autonomous Driving using Deep
Reinforcement Learning with Compact Semantic States
Peter Wolf1, Karl Kurzer2, Tobias Wingert2, Florian Kuhnt1 and J. Marius Zöllner1,2
Abstract— Making the right decision in traffic is a challenging
task that is highly dependent on individual preferences as well
as the surrounding environment. Therefore it is hard to model
solely based on expert knowledge. In this work we use Deep
Reinforcement Learning to learn maneuver decisions based
on a compact semantic state representation. This ensures a
consistent model of the environment across scenarios as well
as a behavior adaptation function, enabling on-line changes
of desired behaviors without re-training. The input for the
neural network is a simulated object list similar to that of
Radar or Lidar sensors, superimposed by a relational semantic
scene description. The state as well as the reward are extended
by a behavior adaptation function and a parameterization
respectively. With little expert knowledge and a set of mid-level
actions, it can be seen that the agent is capable of adhering to
traffic rules and learns to drive safely in a variety of situations.
I. INTRODUCTION
While sensors are improving at a staggering pace and actuators as well as control theory are well up to par with the challenging task of autonomous driving, it is yet to be seen how a robot can devise decisions that navigate it safely in a heterogeneous environment that is partially made up of humans who do not always take rational decisions or have known cost functions.
Early approaches for maneuver decisions focused on
predefined rules embedded in large state machines, each
requiring thoughtful engineering and expert knowledge [1],
[2], [3].
Recent work focuses on more complex models with addi-
tional domain knowledge to predict and generate maneuver
decisions [4]. Some approaches explicitly model the inter-
dependency between the actions of traffic participants [5] as
well as address their replanning capabilities [6].
With the large variety of challenges that vehicles with
a higher degree of autonomy need to face, the limitations
of rule- and model-based approaches devised by human
expert knowledge that proved successful in the past become
apparent.
At least since the introduction of AlphaZero, which dis-
covered the same game-playing strategies as humans did
in Chess and Go in the past, but also learned entirely
unknown strategies, it is clear that human expert knowledge
is overvalued [7], [8]. Hence, it is only reasonable to apply
the same techniques to the task of behavior planning in
1 FZI Research Center for Information Technology, Haid-und-Neu-Str. 10-14, 76131 Karlsruhe, Germany. {wolf, kuhnt, zoellner}@fzi.de
2 Karlsruhe Institute of Technology, Kaiserstr. 12, 76131 Karlsruhe, Germany. kurzer@kit.edu, wingerttobias@gmail.com
Fig. 1: The initial traffic scene is transformed into a compact semantic state representation s and used as input for the reinforcement learning (RL) agent. The agent estimates the action a with the highest return (Q-value) and executes it, e.g., changing lanes. Afterwards a reward r is collected and a new state s′ is reached. The transition (s, a, r, s′) is stored in the agent's replay memory.
autonomous driving, relying on data-driven instead of model-
based approaches.
The contributions of this work are twofold. First, we
employ a compact semantic state representation, which is
based on the most significant relations between other entities
and the ego vehicle. This representation is neither dependent on the road geometry nor on the number of surrounding vehicles, making it suitable for a variety of traffic scenarios. Second, using parameterization and a behavior adaptation function we demonstrate the ability to train agents with a changeable desired behavior, adaptable on-line without requiring new training.
The remainder of this work is structured as follows: In
Section II we give a brief overview of the research on
behavior generation in the automated driving domain and
deep reinforcement learning. A detailed description of our
approach, methods and framework follows in Section III
and Section IV respectively. In Section V we present the
evaluation of our trained agents. Finally, we discuss our
results in Section VI.
II. RELATED WORK
Initially most behavior planners were handcrafted state machines, made up of a variety of modules to handle different driving tasks. During the DARPA Urban Challenge, Boss (CMU) for example used five different modules to conduct on-road driving. The responsibilities of the modules ranged from lane selection and merge planning to distance keeping [1]. Other participants such as Odin (Virginia Tech)
or Talos (MIT) developed very similar behavior generators
[2], [3].
Due to the limitations of state machines, current research
has expanded on the initial efforts by creating more complex
and formal models: A mixture of POMDP, stochastic non-
linear MPC and domain knowledge can be used to generate
lane change decisions in a variety of traffic scenarios [4].
Capturing the mutual dependency of maneuver decisions
between different agents, planning can be conducted with
foresight [5], [6]. While [5] plans only the next maneuver
focusing on the reduction of collision probabilities between
all traffic participants, [6] explicitly addresses longer plan-
ning horizons and the replanning capabilities of others.
Fairly different is the high-level maneuver planning pre-
sented by [9] that is based on [10], where a semantic method
to model the state and action space of the ego vehicle is
introduced. Here the ego vehicle’s state is described by its
relations to relevant objects of a structured traffic scene, such
as other vehicles, the road layout and road infrastructure.
Transitions between semantic states can be restricted by
traffic rules, physical barriers or other constraints. Combined
with an A* search and a path-time-velocity planner a se-
quence of semantic states and their implicit transitions can
be generated.
In recent years, deep reinforcement learning (DRL) has
been successfully used to learn policies for various chal-
lenges. Silver et al. used DRL in conjunction with supervised
learning on human game data to train the policy networks of
their program AlphaGo [11]; [12], [13] present an overview
of RL and DRL respectively. In [7] and [8] their agents
achieve superhuman performance in their respective domains
solely using a self-play reinforcement learning algorithm
which utilizes Monte Carlo Tree Search (MCTS) to acceler-
ate the training. Mnih et al. proposed their deep Q-network
(DQN) [14], [15] which was able to learn policies for a
plethora of different Atari 2600 games and reach or surpass
human level of performance. The DQN approach offers
a high generalizability and versatility in tasks with high
dimensional state spaces and has been extended in various
work [16], [17], [18]. Especially actor-critic approaches have
shown huge success in learning complex policies and are also
able to learn behavior policies in domains with a continuous
action space [19], [20].
In the domain of autonomous driving, DRL has been used
to directly control the movements of simulated vehicles to
solve tasks like lane-keeping [21], [22]. In these approaches,
the agents receive sensor input from the respective simulation
environments and are trained to determine a suitable steering
angle to keep the vehicle within its driving lane. Thereby, the
focus lies mostly on low-level control.
In contrast, DRL may also be applied to the problem of
forming long term driving strategies for different driving
scenarios. For example, DRL can be used in intersection
scenarios to determine whether to cross an intersection by
controlling the longitudinal movement of the vehicle along a
predefined path [23]. The resulting policies achieve a lower
wait time than using a Time-To-Collision policy. In the same
(a) The ego vehicle (blue) is driving on a two lane road. Five other vehicles are in its sensor range. Based on a vehicle scope with Λ_lateral = Λ_ahead = Λ_behind = 1 only four vehicles (green) are considered in the semantic state representation.
(b) The relational grid resulting from the scene of (a) with Λ_lateral = Λ_ahead = Λ_behind = 1. Moreover, there is no adjacent lane to the right, which is represented in the grid as well.
Fig. 2: An example scene (a) is transformed to a relational grid (b) using a vehicle scope Λ with Λ_lateral = Λ_ahead = Λ_behind = 1. The red vehicle, which is in sensor range, is not represented in the grid. Since there is no vehicle driving on the same lane in front of the ego vehicle, the respective relational position in the grid is empty.
way, DRL techniques can be used to learn lateral movement,
e.g. lane change maneuvers in a highway simulation [24].
Since it can be problematic to model multi-agent sce-
narios as a Markov Decision Process (MDP) due to the
unpredictable behavior of other agents, one possibility is
to decompose the problem into learning a cost function
for driving trajectories [25]. To make learning faster and
more data efficient, expert knowledge can be incorporated
by restricting certain actions during the training process [26].
Additionally, there is the option to handle task and motion
planning by learning low-level controls for lateral as well
as longitudinal maneuvers from a predefined set and a high-
level maneuver policy [27].
Our approach addresses the problem of generating adap-
tive behavior that is applicable to a multitude of traffic
scenarios using a semantic state representation. Not wanting
to limit our concept to a single scenario, we chose a more
general set of “mid-level” actions, e.g. decelerate instead of
Stop or Follow. This does not restrict the possible behavior
of the agent and enables the agent to identify suitable high-
level actions implicitly. Further, our agent is trained without
restrictive expert knowledge, but rather we let the agent learn
from its own experience, allowing it to generate superior
solutions without the limiting biases introduced by expert
knowledge [7], [8].
III. APPROACH
We employ a deep reinforcement learning approach to
generate adaptive behavior for autonomous driving.
Fig. 3: Partially visualized entity-relationship model of the scene in Fig. 2a.
The vehicle topology is modeled by vehicle-vehicle relations whereas the
lane topology is modeled by lane-lane relations.
A reinforcement learning process is commonly modeled as an MDP [28] $(S, A, R, \delta, T)$, where $S$ is the set of states, $A$ the set of actions, $R: S \times A \times S \rightarrow \mathbb{R}$ the reward function, $\delta: S \times A \rightarrow S$ the state transition model and $T$ the set of terminal states. At timestep $i$ an agent in state $s \in S$ can choose an action $a \in A$ according to a policy $\pi$ and will progress into the successor state $s'$, receiving reward $r$. This is defined as a transition $t = (s, a, s', r)$.
The aim of reinforcement learning is to maximize the future discounted return $G_i = \sum_{n=i} \gamma^{\,n-i} r_n$. A DQN uses Q-Learning [29] to learn Q-values for each action given input state $s$, based on past transitions. The predicted Q-values of the DQN are used to adapt the policy $\pi$ and therefore change the agent's behavior. A schematic of this process is depicted in Fig. 1.
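To make the quantities above concrete, the following minimal Python sketch computes the discounted return G_i for a reward sequence and the one-step Q-learning target for a stored transition (s, a, s′, r). It only illustrates the formulas; it is not code from the paper, and the function names are ours.

```python
import numpy as np

def discounted_return(rewards, gamma=0.9):
    """G_i for a reward sequence starting at timestep i: sum_n gamma^(n-i) * r_n."""
    return sum(gamma ** n * r for n, r in enumerate(rewards))

def dqn_target(reward, q_values_next, gamma=0.9, terminal=False):
    """One-step Q-learning target for a stored transition (s, a, s', r)."""
    if terminal:
        return reward  # no future reward can be earned from a terminal state
    return reward + gamma * float(np.max(q_values_next))
```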
For the input state representation we adapt the ontology-
based concept from [10] focusing on relations with other
traffic participants as well as the road topology. We design
the state representation to use high level preprocessed sen-
sory inputs such as object lists provided by common Radar
and Lidar sensors and lane boundaries from visual sensors
or map data. To generate the desired behavior the reward is
comprised of different factors with varying priorities. In the
following the different aspects are described in more detail.
A. Semantic Entity-Relationship Model
A traffic scene τ is described by a semantic entity-relationship model, consisting of all scene objects and relations between them. We define it as the tuple (E, R), where
– E = {e_0, e_1, ..., e_n}: set of scene objects (entities),
– R = {r_0, r_1, ..., r_m}: set of relations.
The scene objects contain all static and dynamic objects, such as vehicles, pedestrians, lane segments, signs and traffic lights.
In this work we focus on vehicles V ⊂ E, lane segments L ⊂ E and the three relation types vehicle-vehicle relations, vehicle-lane relations and lane-lane relations. Using these entities and relations an entity-relationship representation of a traffic scene can be created as depicted in Fig. 3. Every entity and relation holds several properties or attributes of the scene objects, such as absolute positions or relative velocities.
This scene description combines low level attributes with
high level relational knowledge in a generic way. It is thus
applicable to any traffic scene and vehicle sensor setup,
making it a beneficial state representation.
But the representation is of varying size and includes more
aspects than are relevant for a given driving task. In order to
use this representation as the input to a neural network we
transform it to a fixed-size relational grid that includes only
the most relevant relations.
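Purely as an illustration (the paper does not prescribe a data structure), such an entity-relationship model could be held in plain containers like the ones below; all class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Entity:
    kind: str                                    # e.g. "vehicle", "lane", "sign"
    attributes: Dict[str, float] = field(default_factory=dict)

@dataclass
class Relation:
    kind: str                                    # e.g. "vehicle-vehicle", "vehicle-lane"
    source: Entity
    target: Entity
    attributes: Dict[str, float] = field(default_factory=dict)

@dataclass
class TrafficScene:                              # the tuple (E, R)
    entities: List[Entity]
    relations: List[Relation]
```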
B. Relational Grid
We define a relational grid, centered at the ego vehicle v_ego ∈ V, see Fig. 2. The rows correspond to the relational lane topology, whereas the columns correspond to the vehicle topology on these lanes.
To define the size of the relational grid, a vehicle scope Λ is introduced that captures the lateral and longitudinal dimensions, defined by the following parameters:
– Λ_lateral ∈ ℕ: maximum number of laterally adjacent lanes to the ego vehicle's lane that is considered,
– Λ_ahead ∈ ℕ: maximum number of vehicles per lane driving in front of the ego vehicle that is considered,
– Λ_behind ∈ ℕ: maximum number of vehicles per lane driving behind the ego vehicle that is considered.
The set of vehicles inside the vehicle scope Λ is denoted by V_Λ ⊆ V. The different features of vehicles and lanes are represented by separate layers of the grid, resulting in a semantic state representation, see Fig. 4. The column vector of each grid cell holds attributes of the vehicle and the lane it belongs to. While the ego vehicle's features are absolute values, other vehicles' features are relative to the ego vehicle or the lane they are in (induced by the vehicle-vehicle and vehicle-lane relations of the entity-relationship model):
1) Non-ego vehicle features: The features of non-ego vehicles v_i ∈ V_Λ \ {v_ego} are f^V_i = (Δs_i, Δṡ_i, d_i, φ_i):
– Δs_i: longitudinal position relative to v_ego,
– Δṡ_i: longitudinal velocity relative to v_ego,
– d_i: lateral alignment relative to the lane center,
– φ_i: heading relative to the lane orientation.
2) Ego vehicle features: To generate adaptive behavior, the ego vehicle's features include a function Ω(τ, θ) that describes the divergence from the desired behavior of the ego vehicle parameterized by θ given the traffic scene τ. Thus the features for v_ego are f^V_ego = (Ω(τ, θ), ṡ_ego, k):
– Ω(τ, θ): behavior adaptation function,
– ṡ_ego: longitudinal velocity of the ego vehicle,
– k: lane index of the ego vehicle (k = 0 being the right most lane).
3) Lane features: The lane topology features of l_k ∈ L are f^L_k = (Γ_k, s_k):
– Γ_k: lane type (normal, acceleration),
– s_k: lane ending relative to v_ego.
Non-perceivable features or missing entities are indicated
in the semantic state representation by a dedicated value.
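A possible way to realize the mapping from an object list to this fixed-size grid is sketched below. It is not the authors' implementation: the object-list field names (s, v, lane, d, phi), the sentinel value and the assumption that the list excludes the ego vehicle are ours, and the ego and lane feature layers are omitted for brevity.

```python
import numpy as np

MISSING = -1.0  # dedicated value for missing entities / non-perceivable features

def build_relational_grid(ego, vehicles, lam_lateral=2, lam_ahead=2, lam_behind=1):
    """Fixed-size grid: rows = lanes relative to the ego lane,
    columns = vehicle slots behind/ahead, layers = vehicle features."""
    rows, cols = 2 * lam_lateral + 1, lam_ahead + lam_behind
    grid = np.full((rows, cols, 4), MISSING)      # (delta_s, delta_v, d, phi)

    for dl in range(-lam_lateral, lam_lateral + 1):
        on_lane = [v for v in vehicles if v["lane"] - ego["lane"] == dl]
        # keep only the closest vehicles inside the vehicle scope
        ahead = sorted([v for v in on_lane if v["s"] >= ego["s"]],
                       key=lambda v: v["s"])[:lam_ahead]
        behind = sorted([v for v in on_lane if v["s"] < ego["s"]],
                        key=lambda v: -v["s"])[:lam_behind]
        row = lam_lateral + dl
        # column order: furthest behind ... closest behind, closest ahead ... furthest ahead
        for col, v in enumerate(behind[::-1] + ahead):
            grid[row, col] = [v["s"] - ego["s"],  # relative longitudinal position
                              v["v"] - ego["v"],  # relative longitudinal velocity
                              v["d"],             # lateral alignment in its lane
                              v["phi"]]           # heading relative to lane orientation
    return grid
```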
Fig. 4: The relational grid contains one layer per feature. The vehicle features f^V_i and f^V_ego share layers and are placed in the cells of v_i and v_ego respectively. The lane features f^L_k are on additional layers in the k-th row of the grid.
The relational grid ensures a consistent representation of
the environment, independent of the road geometry or the
number of surrounding vehicles.
The resulting input state s ∈ S is depicted in Fig. 4 and fed into a DQN.
C. Action Space
The vehicle’s actions space is defined by a set of semantic
actions that is deemed sufficient for most on-road driving
tasks, excluding special cases such as U-turns. The longitu-
dinal movement of the vehicle is controlled by the actions
accelerate and decelerate. While executing these actions, the
ego vehicle keeps its lane. Lateral movement is generated
by the actions lane change left as well as lane change right
respectively. Only a single action is executed at a time and
actions are executed in their entirety, the vehicle is not able
to prematurely abort an action. The default action results in
no change of either velocity, lateral alignment or heading.
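For illustration, the five semantic actions could be encoded as an enumeration that matches the five output neurons of the Q-network described in Section IV-B. The action names follow the text above; the concrete indices are an assumption on our part.

```python
from enum import Enum

class Action(Enum):
    DEFAULT = 0             # keep velocity, lateral alignment and heading
    ACCELERATE = 1
    DECELERATE = 2
    LANE_CHANGE_LEFT = 3
    LANE_CHANGE_RIGHT = 4
```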
D. Adaptive Desired Behavior through Reward Function
With the aim to generate adaptive behavior we extend the reward function R(s, a) by a parameterization θ. This parameterization is used in the behavior adaptation function Ω(τ, θ), so that the agent is able to learn different desired behaviors without the need to train a new model for varying parameter values.
Furthermore, the desired driving behavior consists of several individual goals, modeled by separate rewards. We rank these reward functions by three different priorities: collision avoidance has the highest priority, rewards associated with traffic rules are important but to a lesser extent, and rewards connected to the driving style are prioritized least.
The overall reward function R(s, a, θ) can be expressed as follows:

$$
R(s, a, \theta) =
\begin{cases}
R_{\text{collision}}(s, \theta) & \text{if } s \in S_{\text{collision}} \\
R_{\text{rules}}(s, \theta) & \text{if } s \in S_{\text{rules}} \\
R_{\text{driving style}}(s, a, \theta) & \text{otherwise}
\end{cases}
\tag{1}
$$
The subset S_collision ⊆ S consists of all states s describing a collision state of the ego vehicle v_ego and another vehicle v_i. In these states the agent only receives the immediate reward without any possibility to earn any other future rewards. Additionally, attempting a lane change to a nonexistent adjacent lane is also treated as a collision.
The state dependent evaluation of the reward factors
facilitates the learning process. As the reward for a state is
independent of rewards with lower priority, the eligibility
trace is more concise for the agent being trained. For
example, driving at the desired velocity does not mitigate
the reward for collisions.
IV. EXPERIMENTS
A. Framework
While our concept is able to handle data from many
preprocessing methods used in autonomous vehicles, we
tested the approach with the traffic simulation SUMO [30].
A schematic overview of the framework is depicted in Fig. 5.
We use SUMO in our setup as it allows the initialization and
execution of various traffic scenarios with adjustable road
layout, traffic density and the vehicles’s driving behavior.
To achieve this, we extend TensorForce [31] with a highly
configurable interface to the SUMO environment. Tensor-
Force is a reinforcement learning library based on TensorFlow [32],
which enables the deployment of various customizable DRL
methods, including DQN.
Fig. 5: The current state from SUMO is retrieved and transformed into
the semantic state representation. This is the input to the RL agent, which
is trained using the TensorForce library. Chosen actions are mapped to
a respective state change of the simulation. The agent’s reward is then
computed based on the initial state, the chosen action and the successor
state.
The interface fulfills three functions: the state extraction
from SUMO, the calculation of the reward and the mapping
of the chosen maneuver onto valid commands for the SUMO
controlled ego vehicle. The necessary information about the
ego vehicle, the surrounding vehicles as well as additional
information about the lanes are extracted from the running
simulation. These observed features are transformed into the
state representation of the current scene. For this purpose the
surrounding positions relative to the ego vehicle are checked
and if another vehicle fulfills a relation, the selected feature
set for this vehicle is written into the state representation for
its relation. The different reward components can be added
or removed, given a defined value or range of values and a
priority.
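The sketch below illustrates what such an interface layer could look like using SUMO's TraCI Python client. It is not the authors' implementation: it assumes an already started simulation (via traci.start), a hypothetical ego-vehicle id, and an arbitrary speed increment for the accelerate/decelerate actions.

```python
import traci  # SUMO's TraCI python client; assumes traci.start(...) was called

EGO = "ego"   # hypothetical id of the RL-controlled vehicle
DV = 1.0      # illustrative speed change per accelerate/decelerate action [m/s]

def observe():
    """Extract the raw features needed for the semantic state representation."""
    vehicles = []
    for vid in traci.vehicle.getIDList():
        vehicles.append({
            "id": vid,
            "s": traci.vehicle.getLanePosition(vid),   # longitudinal position on the lane
            "v": traci.vehicle.getSpeed(vid),
            "lane": traci.vehicle.getLaneIndex(vid),   # index 0 is the right most lane
            "phi": traci.vehicle.getAngle(vid),        # absolute heading in degrees
        })
    return vehicles

def apply_action(action, duration=1.0):
    """Map a semantic action onto SUMO commands for the ego vehicle."""
    if action == "accelerate":
        traci.vehicle.setSpeed(EGO, traci.vehicle.getSpeed(EGO) + DV)
    elif action == "decelerate":
        traci.vehicle.setSpeed(EGO, max(0.0, traci.vehicle.getSpeed(EGO) - DV))
    elif action == "lane_change_left":
        traci.vehicle.changeLane(EGO, traci.vehicle.getLaneIndex(EGO) + 1, duration)
    elif action == "lane_change_right":
        traci.vehicle.changeLane(EGO, traci.vehicle.getLaneIndex(EGO) - 1, duration)
    traci.simulationStep()     # advance the simulation by one step
```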
(a) Highway Driving (b) Highway Merging
Fig. 6: To examine the agent’s compliance to traffic rules, it is trained and evaluated on two different traffic scenarios. In (a) the agent has the obligation
to drive on the right most lane and must not pass others from the right, amongst other constraints. In (b) the agent is allowed to accelerate while on the
on-ramp and also might overtake vehicles on its left. But it has to leave the on-ramp before it ends.
This setup allows us to deploy agents using various DRL
methods, state representations and rewards in multiple traffic
scenarios. In our experiments we used a vehicle scope with Λ_behind = 1 and Λ_ahead = Λ_lateral = 2. This allows the agent to always perceive all lanes of a 3 lane highway and increases its potential anticipation.
B. Network
In this work we use the DQN approach introduced by
Mnih et al. [15] as it has shown its capabilities to successfully
learn behavior policies for a range of different tasks. While
we use the general learning algorithm described in [15],
including the usage of experience replay and a secondary
target network, our actual network architecture differs from
theirs. The network from Mnih et al. was designed for a
visual state representation of the environment. In that case,
a series of convolutional layers is commonly used to learn a
suitable low-dimensional feature set from this kind of high-
dimensional sensor input. This set of features is usually
further processed in fully-connected network layers.
Since the state representation in this work already consists
of selected features, the learning of a low-dimensional feature
set using convolutional layers is not necessary. Therefore we
use a network with solely fully-connected layers, see Tab. I.
The size of the input layer depends on the number of
features in the state representation. On the output layer there
is a neuron for each action. The given value for each action
is its estimated Q-value.
Layer            Neurons
Input            (number of state features)
Hidden Layer 1   512
Hidden Layer 2   512
Hidden Layer 3   256
Hidden Layer 4   64
Output           5

TABLE I: The network layout of our DQN agent. The size of the input layer is determined by the number of features in the state space. The five output neurons predict the Q-values of the respective actions.
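The paper's agents are built with TensorForce; purely as an illustration of the layer layout in Tab. I, an equivalent fully-connected network can be written in Keras as follows. The ReLU activations are an assumption, as the paper does not state the activation function.

```python
import tensorflow as tf

def build_q_network(n_state_features, n_actions=5):
    """Fully-connected Q-network with the layer sizes from Tab. I."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(512, activation="relu",
                              input_shape=(n_state_features,)),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(n_actions),   # linear output: one Q-value per action
    ])
```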
C. Training
During training the agents are driving on one or more traf-
fic scenarios in SUMO. An agent is trained for a maximum of
2 million timesteps, each generating a transition consisting of
the observed state, the selected action, the subsequent state
and the received reward. The transitions are stored in the
replay memory, which holds up to 500,000 transitions. After
reaching a threshold of at least 50,000 transitions in memory,
a batch of 32 transitions is randomly selected to update the
network’s weights every fourth timestep. We discount future
rewards by a factor of 0.9during the weight update. The
target network is updated every 50,000th step.
To allow for exploration early on in the training, an
-greedy policy is followed. With a probability of the
action to be executed is selected randomly, otherwise the
action with the highest estimated Q-value is chosen. The
variable is initialized as 1, but decreased linearly over the
course of 500,000 timesteps until it reaches a minimum of
0.1, reducing the exploration in favor of exploitation. As
optimization method for our DQN we use the RMSProp
algorithm [33] with a learning rate of 105and decay of
0.95.
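The exploration schedule described above can be summarized in a few lines. The sketch mirrors the stated values (ε annealed from 1.0 to 0.1 over 500,000 steps) but is otherwise illustrative; the function names are ours.

```python
import numpy as np

def epsilon(step, eps_start=1.0, eps_min=0.1, decay_steps=500_000):
    """Linearly anneal epsilon from eps_start to eps_min over decay_steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_min - eps_start)

def select_action(q_values, step):
    """Epsilon-greedy selection over the Q-values predicted by the DQN."""
    if np.random.random() < epsilon(step):
        return np.random.randint(len(q_values))   # explore: random action
    return int(np.argmax(q_values))               # exploit: greedy action
```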
The training process is segmented into episodes. If an
agent is trained on multiple scenarios, the scenarios alternate
after the end of an episode. To ensure the agent experiences a
wide range of different scenes, it is started with a randomized
departure time, lane, velocity and θ in the selected scenario
at the beginning and after every reset of the simulation.
In a similar vein, it is important that the agent is able to
observe a broad spectrum of situations in the scenario early
in the training. Therefore, should the agent reach a collision
state sScollision by either colliding with another vehicle
or attempting to drive off the road, the current episode is
finished with a terminal state. Afterwards, a new episode
is started immediately without resetting the simulation or
changing the agent’s position or velocity. Since we want to
avoid learning an implicit goal state at the end of a scenario’s
course, the simulation is reset if a maximum amount of 200
timesteps per episode has passed or the course’s end has been
reached and the episode ends with a non-terminal state.
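The episode handling described above reduces to the following logic; this is a sketch with the 200-step limit from the text, and the function and variable names are ours.

```python
MAX_EPISODE_STEPS = 200

def episode_status(steps_in_episode, collided, course_finished):
    """Return (terminal_state, reset_simulation) for the current episode."""
    if collided:                    # collision with a vehicle or driving off the road
        return True, False          # terminal state, but the simulation keeps running
    if steps_in_episode >= MAX_EPISODE_STEPS or course_finished:
        return False, True          # non-terminal episode end, reset the simulation
    return False, False
```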
D. Scenarios
Experiments are conducted using two different scenarios,
see Fig. 6. One is a straight multi-lane highway scenario. The
other is a merging scenario on a highway with an on-ramp.
To generate the desired adaptive behavior, parameterized reward functions are defined (see Eq. 1). We base R_rules on German traffic rules such as the obligation to drive on the right most lane (r_keepRight), prohibiting overtaking on the right (r_passRight) as well as keeping a minimum distance to vehicles in front (r_safeDistance). A special case is the acceleration lane, where the agent is allowed to pass on the right and is not required to stay on the right most lane. Instead, the agent is not allowed to enter the acceleration lane (r_notEnter).
Fig. 7: Collision Rate of the agents g_H and g_M during training.
Fig. 8: Relative duration of rule violations of the agents g_H and g_M during training.
Similarly, R_driving style entails driving at a desired velocity (r_velocity) ranging from 80 km/h to 115 km/h on the highway scenario and from 40 km/h to 80 km/h on the merging scenario. The desired velocity in each training episode is defined by θ_v, which is sampled uniformly over the scenario's velocity range. Additionally, R_driving style aims to avoid unnecessary lane and velocity changes (r_action).
With these constraints in mind, the parameterized reward functions are implemented as follows to produce the desired behavior.
$$R_{\text{collision}}(s, \theta) = \theta_t \, r_{\text{collision}} \tag{2}$$

$$R_{\text{rules}}(s, \theta) = r_{\text{passRight}}(s, \theta_p) + r_{\text{notEnter}}(s, \theta_n) + r_{\text{safeDistance}}(s, \theta_s) + r_{\text{keepRight}}(s, \theta_k) \tag{3}$$

$$R_{\text{driving style}}(s, a, \theta) = r_{\text{action}}(a, \theta_a) + r_{\text{velocity}}(s, \theta_v) \tag{4}$$
To enable different velocity preferences, the behavior adaptation function Ω returns the difference between the desired and the actual velocity of v_ego.
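Putting Eq. (1)-(4) together, a prioritized, parameterized reward could be sketched as below. Only the structure (priorities, parameterization θ, and the velocity-based behavior adaptation Ω) follows the paper; the individual penalty values and the state encoding are illustrative stand-ins.

```python
R_COLLISION = -1.0        # illustrative magnitudes, not the values used in the paper
R_RULE_VIOLATION = -0.5
R_ACTION_CHANGE = -0.05

def behavior_adaptation(v_ego, theta_v):
    """Omega(tau, theta): deviation of the actual from the desired velocity."""
    return theta_v - v_ego

def reward(state, action, theta):
    """Prioritized reward following Eq. (1)-(4): higher-priority terms mask lower ones."""
    if state["collision"]:                      # includes lane changes onto nonexistent lanes
        return theta["t"] * R_COLLISION         # Eq. (2)
    if state["rule_violation"]:                 # unsafe distance, passing on the right, ...
        return R_RULE_VIOLATION                 # Eq. (3), collapsed into a single term
    r_act = R_ACTION_CHANGE if action != "default" else 0.0
    r_vel = -abs(behavior_adaptation(state["v_ego"], theta["v"]))
    return r_act + r_vel                        # Eq. (4)
```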
V. EVALUATION
During evaluation we trained an agent g_H only on the highway scenario and an agent g_M only on the merging scenario. In order to show the versatility of our approach, we additionally trained an agent g_C both on the highway as well as the merging scenario (see Tab. II). Due to the nature of our
compact semantic state representation we are able to achieve
this without further modifications. The agents are evaluated
during and after training by running the respective scenarios
100 times. To assess the capabilities of the trained agents,
using the concept mentioned in Section III, we introduce the
following metrics.
Collision Rate [%]: The collision rate denotes the average
amount of collisions over all test runs. In contrast to the
training, a run is terminated if the agent collides. As this is
such a critical measure it acts as the most expressive gauge
assessing the agent's performance.
Avg. Distance between Collisions [km]: The average
distance travelled between collisions is used to remove the
bias of the episode length and the vehicle’s speed.
Rule Violations [%]: Relative duration during which the
agent is not keeping a safe distance or is overtaking on the right.
Lane Distribution [%]: The lane distribution is an ad-
ditional weak indicator for the agent’s compliance with the
traffic rules.
Avg. Speed [m/s]: The average speed of the agent does
not only indicate how fast the agent drives, but also displays
how accurate the agent matches its desired velocity.
The results of the agents trained on the different scenarios
are shown in Tab. II. The agents generally achieve the desired
behavior. An example of an overtaking maneuver is presented
in Fig. 9. During training the collision rate of g_H decreases to a decent level (see Fig. 7). Agent g_M takes more training iterations to reduce its collision rate to a reasonable level, as it not only has to avoid other vehicles, but also needs to leave the on-ramp. Additionally, g_M successfully learns
to accelerate to its desired velocity on the on-ramp. But for
higher desired velocities this causes difficulties leaving the
ramp or braking in time in case the middle lane is occupied.
This effect increases the collision rate in the latter half of
the training process.
Fig. 9: The agent (blue) is driving behind a slower vehicle (green). In this situation the action lane change left has the highest estimated Q-value. After the lane change, the agent accelerates and overtakes the slower vehicle. Subsequently, the agent changes back to the right most lane.
The relative duration of rule violations by g_H decreases over the course of the training, but stagnates at approximately 2% (see Fig. 8). A potential cause is our strict definition of when an overtaking on the right occurs. The agent almost never performs a full overtaking maneuver from the right, but might drive faster than another vehicle on the left hand side, which already counts towards our metric. For g_M the duration of rule violations is generally shorter, starting low, peaking and then also stagnating similarly to g_H. This is explained by overtaking on the right not being considered
on the acceleration lane. The peak emerges as a result of the
agent leaving the lane more often at this point.
The lane distribution of g_H (see Tab. II) demonstrates that
the agent most often drives on the right lane of the highway,
to a lesser extent on the middle lane and only seldom on
the left lane. This reflects the desired behavior of adhering
to the obligation of driving on the right most lane and only
using the other lanes for overtaking slower vehicles. In the
merging scenario this distribution is less informative since
the task does not provide the same opportunities for lane
changes.
TABLE II: Results of the trained agents. The agents g_H and g_M were only evaluated on their respective scenario, while g_C was evaluated on both. The results for g_C are listed separately for each scenario.

                   g_H        g_M        g_C (highway)   g_C (merging)
Collision Rate     2%         1%         2%              4%
Avg. Distance      98.37 km   28.98 km   75.92 km        7.19 km
Rule Violations    1.52%      0.87%      1.62%           0.5%
Lane 0             64.12%     87.99%     57.11%          87.8%
Lane 1             22.07%     11.81%     34.46%          11.85%
Lane 2             13.81%     0.2%       8.43%           0.35%
To measure the speed deviation of the agents, additional
test runs with fixed values for the desired velocity were
performed. The results are shown in Tab. III. As can be
seen, the agents adapt their behavior, as an increase in the
desired velocity raises the average speed of the agents. In
tests with other traffic participants, the average speed is
expectedly lower than the desired velocity, as the agents often
have to slow down and wait for an opportunity to overtake.
Especially in the merging scenario the agent is unable to
reach higher velocities due to these circumstances. During
runs on an empty highway scenario, the difference between
the average and desired velocity diminishes.
Although g_H and g_M outperform it on their individual tasks, g_C achieves satisfactory behavior on both. In particular, it is able to learn task-specific knowledge such as overtaking in the acceleration lane of the on-ramp while not overtaking from the right on the highway.
A video of our agents' behavior is provided online.1
VI. CONCLUSIONS
In this work two main contributions have been presented.
First, we introduced a compact semantic state representation
that is applicable to a variety of traffic scenarios. Using
1http://url.fzi.de/behavior-iv2018
a relational grid our representation is independent of road
topology, traffic constellation and sensor setup.
Second, we proposed a behavior adaptation function which
enables changing desired driving behavior online without the
need to retrain the agent. This eliminates the requirement to
generate new models for different driving style preferences
or other varying parameter values.
TABLE III: The average speed of each agent given varying desired velocities [m/s]. The agents have been evaluated on the training scenarios with normal traffic density as well as on an empty highway. While not every agent is able to achieve the desired velocity precisely, their behavior adapts to the different parameter values.

         g_H                 g_M                 g_C
v_des    highway   empty     merging   empty     highway   merging   empty
12       12.8      13.2      10.8      13.3      12.1      10.3      12.6
17       17.1      17.1      13.5      18.6      16.9      13.2      17.1
22       21.1      21.9      14.7      23.7      20.9      14.5      21.8
25       23.4      24.8      15.1      26.5      23.0      15.0      23.9
30       25.7      29.2      15.3      30.4      25.2      14.8      28.0
Agents trained with this approach performed well on
different traffic scenarios, i.e. highway driving and highway
merging. Due to the design of our state representation
and behavior adaptation, we were able to develop a single
model applicable to both scenarios. The agent trained on
the combined model was able to successfully learn scenario
specific behavior.
One of the major goals for future work we kept in mind
while designing the presented concept is the transfer from
the simulation environment to real world driving tasks. A
possible option is to use the trained networks as a heuristic in
MCTS methods, similar to [27]. Alternatively, our approach
can be used in combination with model-driven systems to
plan or evaluate driving behavior.
To achieve this transfer to real-world application, we will
apply our state representation to further traffic scenarios, e.g.,
intersections. Additionally, we will extend the capabilities
of the agents by adopting more advanced reinforcement
learning techniques.
ACKNOWLEDGEMENTS
We wish to thank the German Research Foundation (DFG)
for funding the project Cooperatively Interacting Automo-
biles (CoInCar) within which the research leading to this con-
tribution was conducted. The information as well as views
presented in this publication are solely the ones expressed
by the authors.
REFERENCES
[1] C. Urmson, J. Anhalt, D. Bagnell, et al., “Autonomous driving in urban environments: Boss and the urban challenge,” Journal of Field Robotics, vol. 25, no. 8, pp. 425–466, 2008.
[2] A. Bacha, C. Bauman, R. Faruque, et al., “Odin: Team victortango’s
entry in the darpa urban challenge,” Journal of field Robotics, vol. 25,
no. 8, pp. 467–492, 2008.
[3] J. Leonard, J. How, S. Teller, et al., “A perception-driven autonomous
urban vehicle,” Journal of Field Robotics, vol. 25, no. 10, pp. 727–774,
2008.
[4] S. Ulbrich and M. Maurer, “Towards tactical lane change behavior
planning for automated vehicles,” in 18th IEEE International Confer-
ence on Intelligent Transportation Systems, ITSC, 2015, pp. 989–995.
[5] A. Lawitzky, D. Althoff, C. F. Passenberg, et al., “Interactive scene
prediction for automotive applications,” in Intelligent Vehicles Sympo-
sium (IV), IEEE, 2013, pp. 1028–1033.
[6] M. Bahram, A. Lawitzky, J. Friedrichs, et al., “A game-theoretic ap-
proach to replanning-aware interactive scene prediction and planning,
IEEE Transactions on Vehicular Technology, vol. 65, no. 6, pp. 3981–
3992, 2016.
[7] D. Silver, J. Schrittwieser, K. Simonyan, et al., “Mastering the game
of go without human knowledge,” Nature, vol. 550, no. 7676, p. 354,
2017.
[8] D. Silver, T. Hubert, J. Schrittwieser, et al., “Mastering chess and
shogi by self-play with a general reinforcement learning algorithm,”
2017.
[9] R. Kohlhaas, D. Hammann, T. Schamm, and J. M. Zöllner, “Planning of High-Level Maneuver Sequences on Semantic State Spaces,” in 18th IEEE International Conference on Intelligent Transportation Systems, ITSC, 2015, pp. 2090–2096.
[10] R. Kohlhaas, T. Bittner, T. Schamm, and J. M. Zöllner, “Semantic state space for high-level maneuver planning in structured traffic scenes,” in 17th IEEE International Conference on Intelligent Transportation Systems, ITSC, 2014, pp. 1060–1065.
[11] D. Silver, A. Huang, C. J. Maddison, et al., “Mastering the game of
go with deep neural networks and tree search,” nature, vol. 529, no.
7587, pp. 484–489, 2016.
[12] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in
robotics: A survey,” The International Journal of Robotics Research,
vol. 32, no. 11, pp. 1238–1274, 2013.
[13] Y. Li, “Deep reinforcement learning: An overview,” 2017.
[14] V. Mnih, K. Kavukcuoglu, D. Silver, et al., “Playing atari with deep
reinforcement learning,” 2013.
[15] V. Mnih et al., “Human-level control through deep reinforcement
learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
[16] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized Experi-
ence Replay,” 2015.
[17] H. van Hasselt, A. Guez, and D. Silver, “Deep Reinforcement Learning
with Double Q-learning,” 2015.
[18] Z. Wang, N. de Freitas, and M. Lanctot, “Dueling Network Architec-
tures for Deep Reinforcement Learning,” 2016.
[19] V. Mnih, A. P. Badia, M. Mirza, et al., “Asynchronous methods for
deep reinforcement learning,” in International Conference on Machine
Learning, 2016, pp. 1928–1937.
[20] Y. Wu, E. Mansimov, R. B. Grosse, et al., “Scalable trust-region
method for deep reinforcement learning using kronecker-factored ap-
proximation,” in Advances in Neural Information Processing Systems,
2017, pp. 5285–5294.
[21] A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani, “Deep rein-
forcement learning framework for autonomous driving,Electronic
Imaging, no. 19, pp. 70–76, 2017.
[22] P. Wolf, C. Hubschneider, M. Weber, et al., “Learning how to drive in
a real world simulation with deep q-networks,” in Intelligent Vehicles
Symposium (IV), IEEE, 2017, pp. 244–250.
[23] D. Isele, A. Cosgun, K. Subramanian, and K. Fujimura, “Navigating
Intersections with Autonomous Vehicles using Deep Reinforcement
Learning,” 2017.
[24] B. Mirchevska, M. Blum, L. Louis, et al., “Reinforcement learning
for autonomous maneuvering in highway scenarios,” 2017.
[25] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “Safe, Multi-Agent,
Reinforcement Learning for Autonomous Driving,” 2016.
[26] M. Mukadam, A. Cosgun, A. Nakhaei, and K. Fujimura, “Tactical
decision making for lane changing with deep reinforcement learning,”
Conference on Neural Information Processing Systems (NIPS), 2017.
[27] C. Paxton, V. Raman, G. D. Hager, and M. Kobilarov, “Combining Neural Networks and Tree Search for Task and Motion Planning in Challenging Environments,” IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 6059–6066, 2017.
[28] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.
MIT press Cambridge, 1998, vol. 1, no. 1.
[29] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[30] D. Krajzewicz, J. Erdmann, M. Behrisch, and L. Bieker, “Recent
Development and Applications of {SUMO - Simulation of Urban
MObility},” International Journal On Advances in Systems and Mea-
surements, vol. 5, no. 3, pp. 128–138, 2012.
[31] M. Schaarschmidt, A. Kuhnle, and K. Fricke, “Tensorforce: A
tensorflow library for applied reinforcement learning,” Web page,
2017. [Online]. Available: https://github.com/reinforceio/tensorforce
[32] M. Abadi, A. Agarwal, P. Barham, et al., “TensorFlow: Large-scale
machine learning on heterogeneous systems,” 2015, software available
from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
[33] G. Hinton, N. Srivastava, and K. Swersky, “RMSProp: Divide the gradient by a running average of its recent magnitude,” Neural Networks for Machine Learning, Coursera lecture 6e, 2012.