A Modern Perspective on Safe Automated Driving for Different Traffic
Dynamics using Constrained Reinforcement Learning
Danial Kamran¹, Thiago D. Simão², Qisong Yang³,
Canmanie T. Ponnambalam³, Johannes Fischer¹, Matthijs T. J. Spaan³ and Martin Lauer¹
Abstract—The use of reinforcement learning (RL) in real-world domains often requires extensive effort to ensure safe behavior. While this compromises the autonomy of the system, it might still be too risky to allow a learning agent to freely explore its environment. These strict impositions come at the cost of flexibility, and applying them often relies on complex parameters and hard-coded knowledge modelled by the reward function. Autonomous driving is one such domain that could greatly benefit from more efficient and verifiable methods for safe automation. We propose to approach the automated driving problem using constrained RL, a method that automates the trade-off between risk and utility, thereby significantly reducing the burden on the designer. We first show that a reward function engineered to ensure safety and utility in one specific environment might not result in optimal behavior when the traffic dynamics change in that same environment. Next, we show how algorithms based on constrained RL, which are more robust to environmental disturbances, can address this challenge. These algorithms use a simple and easy-to-interpret reward and cost function, and are able to maintain both efficiency and safety without requiring reward parameter tuning. We demonstrate our approach in an automated merging scenario with different traffic configurations, such as a low or high chance of cooperative drivers and different cooperative driving strategies.
I. INTRODUCTION
Reinforcement learning (RL) promises to produce agents
that learn to optimize decision-making problems with limited
to no knowledge of the environment. This makes it an attrac-
tive approach to automating complex and high-dimensional
tasks. The drawback is that RL agents must interact with
the environment, exhaustively taking both good and bad
actions, in order to learn the best long-term decisions. In
real-world and safety-critical domains such as driving, the
consequences of taking bad actions are severe, which
diminishes the appeal of classical RL. An ideal response
to this problem is safe reinforcement learning, a class of
RL methods that guarantees safety during learning or upon
execution. Safe RL methods have recently been applied
to various automated driving problems with some success
[1]–[6]. Existing safe RL approaches to autonomous driving
require extensive designer knowledge coded into the solution.
*Authors have equal contribution.
¹Karlsruhe Institute of Technology, Germany {danial.kamran, johannes.fischer, martin.lauer}@kit.edu
²Radboud University, Nijmegen, The Netherlands thiago.simao@ru.nl
³Delft University of Technology, The Netherlands {q.yang, c.t.ponnambalam, m.t.j.spaan}@tudelft.nl
Fig. 1. Two structures for learning safe policies. In normal RL, the user searches for the best reward function that produces a policy that satisfies the required constraint. In constrained RL, the algorithm design is simplified as the safety constraint will automatically be satisfied during training.
In many cases, these methods impose a heavy burden on the designer to identify unsafe states or actions, tune hyperparameters, or define complex reward functions [3], [5], [7], as Figure 1 illustrates. In addition to requiring extensive prior knowledge, these methods can be overly conservative, as they often impose hard restrictions on the search space.
Instead, we propose constrained reinforcement learning as an
elegant approach to safety in the automated vehicle domain.
Constrained RL models the problem as a constrained Markov
decision process (CMDP), introducing a cost function to
encode safety-relevant information (such as whether a crash
has occurred). This clearly separates the specification of
reward (to be maximized) and cost (to be minimized).
The constraint is then defined as a threshold regarding the
acceptable expected cost, resulting in a simple and highly
interpretable parameter. The agent learns to optimize reward
while respecting this safety constraint, automatically tuning
the trade-off between the two conflicting goals.
Merging into a highway with dense traffic is a challenging
task for automated vehicles. The dense traffic means that the
window in which a successful merge can occur is small.
The positions of and interactions between the other vehicles in the
environment are a crucial aspect of the state description that
determines when a merge can be successfully executed. This
complex state space makes it particularly challenging to
define a set of safe states or actions by hand, and manipu-
lating the reward function to include safety considerations is
difficult and requires considerable tuning. Further, applying
a safe RL method that is overly conservative in this scenario
can result in the freezing robot problem, whereby the vehicle
is unable to merge at all. This makes the highway merge
problem a prime candidate for a constrained RL approach.
In this paper, we first describe existing safe RL approaches
to the automated driving problem, and highlight recent work
with similar goals to our approach as well as their limitations.
We then formulate the dense highway merge scenario as a
constrained MDP and compare two constrained RL methods with
a traditional safe approach to this problem. The experiments
demonstrate how constrained RL successfully manages the
trade-off between merging as quickly as possible and avoiding
crashes, without additional hyper-parameters or extensive
tuning.
II. RELATED WORK
The field of safe reinforcement learning (RL) encompasses
several different types of approaches with varying levels
of safety guarantees, of which formulating the problem as
a constrained Markov decision process (as we propose) is
only one subset. For a comprehensive overview of safe RL in
general, we refer the reader to [8].
In this work, we focus on methods situated in the auto-
mated driving domain that aim to adhere to safe behavior
either during learning or on execution of the trained agent.
The most relevant methods can be divided into two cate-
gories: those that encode safety in the reward function, and
those that shield unsafe actions from the agent.
A popular approach to safe RL is to include penalties
in the reward function that discourage unsafe behavior, an
indirect way to incorporate safety [1], [5], [6], [9]. These
methods lay the burden of specifying unsafe behavior on the
designer, resulting in reward functions that can be hard to
specify and even more difficult to verify. A more explicit way
to produce safe behavior is to restrict the action space to safe
actions, often referred to as shielding [10]–[12]. Determining
unsafe actions can be done using, for example, a model
checker [3] or predictive model [4], [7]. Efforts have been
made to combine the two types of approaches, with one using
a parameterized reward penalty to restrict actions determined
to be unsafe [13]. In general, restricting the search space can
produce conservative behavior, as it enforces hard limits
on the space of acceptable policies. Further, such methods
are very sensitive to incorrect specifications or predictions of
unsafe actions.
The use of constrained RL in the autonomous vehicle
domain has so far been limited. One work uses linear temporal
logic (LTL) specifications to define unsafe states, framing the
result as a constrained optimization problem [14]. However,
this decoupled approach does not attempt to balance reward
and safety, instead enforcing hard constraints on the search
space. A budgeted MDP, which is similar to a constrained
MDP but offers additional control over the budget, has
also been used to model the problem of automated driving
[15]. Most recently, constrained RL has been evaluated
on lane keeping and intersection navigation tasks, where
a parallel learning approach was proposed that employs
multiple agents to speed up convergence [16]. Our paper
highlights the limitations of reinforcement learning that are
addressed by taking a constrained optimization approach. We
focus on the improvements provided in terms of the ease of
specification, robustness to scalarization issues, and elegant
trade-off of reward and risk, evaluated on a dense highway
merge scenario.
III. BACKGROUND
In this section, we formalize the definition of constrained
RL, and present the algorithms that are used to solve it.
A. Constrained Markov Decision Process
A CMDP [17] is a model that separates reward and safety signals. Similar to a Markov decision process (MDP) [18], a CMDP is a tuple $(\mathcal{S}, \mathcal{A}, P, \iota, r, c, d, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the set of actions, $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ is a transition kernel indicating the probability of reaching state $s'$ after taking action $a$ in state $s$, $\iota$ is the initial state distribution, $r : \mathcal{S} \times \mathcal{A} \to [r_{\min}, r_{\max}]$ is the reward function, $c : \mathcal{S} \times \mathcal{A} \to [c_{\min}, c_{\max}]$ is the cost function, $d$ is the safety threshold, and $\gamma \in [0,1]$ is the discount factor.

As in an MDP, the goal in a CMDP is to compute a policy that maximizes the accumulated discounted reward
$$\max_{\pi} \; J_R(\pi) \doteq \mathbb{E}_{(s_t, a_t) \sim \mathcal{T}_\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right], \tag{1}$$
where $\mathcal{T}_\pi = (s_0, a_0, s_1, \ldots)$ is the trajectory distribution induced by $s_0 \sim \iota$, $a_t \sim \pi(\cdot \mid s_t)$, and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. Additionally, the optimal policy has to keep the expected accumulated discounted cost bounded,
$$J_C(\pi) \doteq \mathbb{E}_{(s_t, a_t) \sim \mathcal{T}_\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^t c(s_t, a_t) \right] \le d, \tag{2}$$
according to the predefined safety threshold $d$. Depending on the task, this might resemble a bound on the probability of failure, for instance if $c(s, a) = \mathbf{1}_{\text{failure}}(s)$, although this requires $\gamma = 1$. An MDP can be seen as an unbounded CMDP with $d = \infty$, which essentially allows the cost function to be ignored, yielding the MDP $(\mathcal{S}, \mathcal{A}, P, \iota, r, \gamma)$.
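To make the separation between reward and cost concrete, the following minimal Python sketch shows how an environment could report both signals; the wrapper class, its gym-style interface, and the `crashed` info flag are illustrative assumptions rather than part of any specific library.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass
class CMDPSpec:
    """Hypothetical container for the scalar parameters of the CMDP tuple."""
    gamma: float = 0.99   # discount factor
    d: float = 0.01       # safety threshold on the expected (discounted) cost


class CostWrapper:
    """Illustrative wrapper that augments a standard MDP environment with a cost signal.

    The wrapped `env` is assumed to expose gym-style reset()/step() methods and to
    report collisions in its info dict; both are assumptions of this sketch.
    """

    def __init__(self, env, spec: CMDPSpec):
        self.env = env
        self.spec = spec

    def reset(self):
        return self.env.reset()

    def step(self, action) -> Tuple[Any, float, float, bool, Dict]:
        obs, reward, done, info = self.env.step(action)
        # The cost encodes only safety-relevant events (here: whether a crash
        # occurred), keeping it separate from the reward being maximized.
        cost = 1.0 if info.get("crashed", False) else 0.0
        return obs, reward, cost, done, info
```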
B. Constrained Reinforcement Learning
Constrained RL addresses the problem of solving an
unknown CMDP [19]. Although off-policy methods for
constrained RL have been proposed [20]–[22], this paper
focuses on on-policy variants. Specifically, we apply PPO-
Lagrangian [23] and Constrained Policy Optimization (CPO)
[24] to the automated driving domain. These methods repre-
sent two main directions in on-policy constrained RL. The
first direction is to adapt RL algorithms to their Lagrangian
variants, as seen in TRPO-Lagrangian and PPO-Lagrangian
[23]. The second direction uses constrained policy optimiza-
tion methods [25], [26] built on the work of [24].
a) PPO-Lagrangian: Proximal policy optimization (PPO) [27], designed for regular RL problems, not only retains the benefits of trust region policy optimization (TRPO) [28], but also has better sample complexity and is more convenient to implement. Constrained optimization problems can be solved by a Lagrangian variant of PPO [23], [29]. Instead of fixing the value of the Lagrangian multiplier, we adapt it based on the constraint-satisfying performance: when the policy is unsafe, we increase the Lagrangian multiplier to enhance safety, and we decrease it when safe performance is attained. This allows us to leverage an adaptive safety weight $\lambda$
in the constrained optimization problem:
$$\max_{\pi} \min_{\lambda \ge 0} \; G(\pi, \lambda) \doteq f(\pi) - \lambda\, g(\pi), \tag{3}$$
where $f(\pi) = J_R(\pi)$ and $g(\pi) = J_C(\pi) - d$ in the case of Equations (1) and (2). So, we update the safety weight using
$$\lambda_{k+1} = \max\!\big(0,\; \lambda_k + \alpha_\lambda \big(J_C(\pi) - d\big)\big), \tag{4}$$
where $\alpha_\lambda$ is the penalty learning rate. In our experiments, we use the undiscounted cumulative cost to measure the real constraint satisfaction.
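As a concrete illustration of Equation (4), the snippet below performs one dual ascent step on $\lambda$ using an estimate of $J_C(\pi)$ obtained from recent episode costs; the function name and the simple averaging scheme are assumptions of this sketch, not the exact implementation used in the experiments.

```python
def update_lagrange_multiplier(lam: float,
                               episode_costs: list,
                               cost_limit: float,
                               lr_lambda: float) -> float:
    """One dual ascent step on the safety weight, Eq. (4):
    lambda_{k+1} = max(0, lambda_k + alpha_lambda * (J_C(pi) - d)).

    `episode_costs` holds the (undiscounted) accumulated cost of recent
    episodes, used here as a simple estimate of J_C(pi).
    """
    jc_estimate = sum(episode_costs) / max(len(episode_costs), 1)
    lam = lam + lr_lambda * (jc_estimate - cost_limit)
    return max(0.0, lam)


# Illustrative usage: the recent episodes violate the limit, so lambda increases.
lam = update_lagrange_multiplier(lam=1.0, episode_costs=[0.0, 1.0, 0.0],
                                 cost_limit=0.01, lr_lambda=0.05)
```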
b) Constrained Policy Optimization (CPO): CPO is a trust-region method for constrained RL with guarantees for near-constraint satisfaction at each iteration [24]. At each gradient step, CPO constrains the policy change to the cost constraint and a divergence neighborhood while guaranteeing reward improvement. Similar to TRPO, Equations (1) and (2) are subjected to an additional Kullback-Leibler (KL) divergence constraint:
$$\begin{aligned} \pi_{k+1} = \arg\max_{\pi} \quad & \mathbb{E}_{s \sim \mathcal{T}_{\pi_k},\, a \sim \pi}\!\left[ A^{\pi_k}_R(s, a) \right] \\ \text{s.t.} \quad & J_C(\pi_k) + \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim \mathcal{T}_{\pi_k},\, a \sim \pi}\!\left[ A^{\pi_k}_C(s, a) \right] \le d, \\ & \mathbb{E}_{s \sim \mathcal{T}_{\pi_k}}\!\left[ D_{KL}(\pi \,\|\, \pi_k)[s] \right] \le \delta, \end{aligned} \tag{5}$$
where $\delta$ is the maximum step size and $D_{KL}$ is the KL divergence indicating the trust region. The advantage functions $A_R$ and $A_C$ express, respectively, the performance change $\mathbb{E}_{s \sim \mathcal{T}_{\pi_k}, a \sim \pi}[A^{\pi_k}_R(s, a)]$ (in reward) and $\mathbb{E}_{s \sim \mathcal{T}_{\pi_k}, a \sim \pi}[A^{\pi_k}_C(s, a)]$ (in cost) of policy $\pi$ over the current policy $\pi_k$. After the transition from Equations (1) and (2) to Equation (5), CPO further approximates the objective and constraints using first- and second-order expansions for small step sizes $\delta$, to ensure the problem is solvable. We refer the reader to [24] for more details on the CPO algorithm.
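The reward and cost advantages $A^{\pi_k}_R$ and $A^{\pi_k}_C$ in Equation (5) are estimated from on-policy rollouts. The sketch below uses generalized advantage estimation (GAE), a common choice assumed here purely for illustration, with separate value critics for reward and cost.

```python
import numpy as np


def gae_advantages(signal: np.ndarray,
                   values: np.ndarray,
                   gamma: float = 0.99,
                   lam: float = 0.95) -> np.ndarray:
    """Generalized advantage estimation for a single trajectory.

    `signal` is either the per-step reward (for A_R) or cost (for A_C), and
    `values` are the corresponding critic predictions with one extra bootstrap
    value appended (length len(signal) + 1).
    """
    deltas = signal + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(signal, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(signal))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


# The same estimator is applied twice, once per critic:
# adv_r = gae_advantages(rewards, reward_values)
# adv_c = gae_advantages(costs, cost_values)
```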
IV. AUTONOMOUS DRIVING AS AN MDP
We formulate the automated driving problem as a Markov decision process (MDP), where at every decision step $t$ the decision-making policy $\pi$ chooses the best action. The overall goal is to learn the actions that maximize the expected future reward (return) at every time step. In this paper, we focus on merging in a highway environment, where the ego vehicle is a reinforcement learning agent that observes the positions and velocities of the surrounding vehicles and controls its acceleration. The aim is to avoid collisions during merging without acting too conservatively. To this end, we model the merging scenario depicted in Figure 2 and define the input state as
$$s_t = \begin{bmatrix} d_e & d_{\text{goal}} & d_1 & \ldots & d_n \\ v_e & a_e & v_1 & \ldots & v_n \end{bmatrix}, \tag{6}$$
where $d_e$ is the ego vehicle's distance to the conflict merging area, $d_{\text{goal}}$ is the distance from the conflict area to the goal, and $v_e$ and $a_e$ are the velocity and acceleration of the ego vehicle, respectively. We also include the relative distances $d_i$ and velocities $v_i$ between the vehicles on the main lane and the projection of the ego vehicle onto the main lane, for a maximum of $N = 15$ surrounding vehicles in the state, as shown in Figure 2.

Fig. 2. Example of a merging scenario and the features that make up the observation of the reinforcement learning agent. Here the ego vehicle (blue) has to prevent collisions with vehicles on the main lane and also drive as fast as possible to reach the goal.
The policy maps a state to an action $a_t$ from the discrete action space $\mathcal{A} = \{\text{Decelerate}, \text{Idle}, \text{Accelerate}\}$, which controls the ego vehicle's behavior during merging by sending high-level commands to a low-level speed controller.
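For illustration, the sketch below assembles the observation of Equation (6) and the discrete action set in Python; the vehicle attribute names and the zero-padding scheme are our own assumptions, since the exact feature-extraction code is not specified here.

```python
import numpy as np

ACTIONS = ("Decelerate", "Idle", "Accelerate")  # high-level commands to a low-level speed controller
N_MAX = 15  # maximum number of surrounding vehicles included in the state


def build_observation(ego, others, d_goal: float) -> np.ndarray:
    """Assemble the 2 x (N_MAX + 2) observation of Eq. (6).

    `ego` is assumed to expose d_to_merge, v, a; each entry of `others` exposes
    the relative distance d and relative velocity v with respect to the
    projection of the ego vehicle onto the main lane.
    """
    first_row = [ego.d_to_merge, d_goal]
    second_row = [ego.v, ego.a]
    for veh in others[:N_MAX]:
        first_row.append(veh.d)
        second_row.append(veh.v)
    # Pad with zeros when fewer than N_MAX vehicles are observed.
    while len(first_row) < N_MAX + 2:
        first_row.append(0.0)
        second_row.append(0.0)
    return np.array([first_row, second_row], dtype=np.float32)
```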
Some of the key performance metrics we consider in this domain include:
$$\text{risk}(s_t, a_t) = \begin{cases} c_{\text{collision}}, & \text{if collision}, \\ 0, & \text{otherwise}, \end{cases} \tag{7}$$
$$\text{utility}(s_t, a_t) = \begin{cases} c_{\text{success}}, & \text{if success}, \\ c_{\text{time}}, & \text{otherwise}. \end{cases} \tag{8}$$
In some works, the time penalty $c_{\text{time}}$ is not used. Instead, discounting future rewards with $\gamma < 1$ also encourages faster driving.
In order to learn the desired behavior, both safety and utility must be considered, as they are the two most important aspects of automated driving. It is preferred to learn policies that are safe, thus preventing collisions with other vehicles, while also acting efficiently, thereby exhibiting behavior that is not too conservative.
A. Penalty-based safety
Traditionally, such desired behavior is encoded in the reward function by combining these components and adjusting their parameters to increase speed, or by enforcing time penalties that encourage faster driving while at the same time employing high collision penalties to encourage safe driving [1]–[6]. The resulting reward function is given as
$$r(s_t, a_t) = \text{utility}(s_t, a_t) - \lambda\, \text{risk}(s_t, a_t), \tag{9}$$
where $\lambda$ is the safety weight, which is responsible for balancing utility and safety.
In this case, assuming the values $c_{\text{collision}}$, $c_{\text{time}}$ and $c_{\text{success}}$ are already defined, the user must choose an appropriate safety weight $\lambda$. Notice, however, that the appropriate value for $\lambda$ depends on the structure of the reward function; in other words, for different values of $c_{\text{collision}}$, $c_{\text{time}}$ and $c_{\text{success}}$, the appropriate safety weight $\lambda$ could change significantly.
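As a worked example of Equations (7)–(9), the snippet below computes the scalarized per-step reward; the constants are illustrative (only c_time = 0.1 and c_success = 1 correspond to values mentioned later in Section VI), and the sign convention for the time penalty is our own interpretation.

```python
# Example constants; c_time = 0.1 and c_success = 1 match values mentioned in
# Section VI, while c_collision = 1 is only an illustrative choice.
C_COLLISION = 1.0
C_SUCCESS = 1.0
C_TIME = 0.1


def risk(collided: bool) -> float:
    """Eq. (7): non-zero only when a collision occurs."""
    return C_COLLISION if collided else 0.0


def utility(succeeded: bool) -> float:
    """Eq. (8): success bonus, otherwise a per-step time penalty
    (interpreting c_time as a positive penalty magnitude)."""
    return C_SUCCESS if succeeded else -C_TIME


def penalty_based_reward(collided: bool, succeeded: bool, safety_weight: float) -> float:
    """Eq. (9): r = utility - lambda * risk, with lambda tuned by the designer."""
    return utility(succeeded) - safety_weight * risk(collided)
```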
V. AUTONOMOUS DRIVING AS A CMDP
In order to overcome the hyper-parameter sensitivity of such a complex reward function, which is especially important in safety-related applications like automated driving, we propose to instead use constrained RL and formulate safety explicitly in a cost function. In this way, the RL agent automatically satisfies safety constraints, specified as cost limits on the policy, without requiring any parameter tuning in the reward function. We define the following reward and cost functions for our highway merging scenario:
$$r(s_t, a_t) = \text{utility}(s_t, a_t), \tag{10}$$
$$c(s_t, a_t) = \text{risk}(s_t, a_t). \tag{11}$$
Now, when the user of this system trains a policy to drive safely, she only needs to define the cost limit $d$, removing the burden of choosing an appropriate balance between utility and risk, represented by the value $\lambda$.
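Under the constrained formulation, the same two signals are simply reported separately, so no safety weight appears in the reward; the short, self-contained sketch below is again only illustrative.

```python
C_SUCCESS, C_TIME = 1.0, 0.1  # example utility constants, as in the previous sketch


def cmdp_signals(collided: bool, succeeded: bool) -> tuple:
    """Eqs. (10) and (11): reward carries only utility, cost carries only risk.

    The only safety-related parameter left to the user is the cost limit d
    (e.g. d = 0.01 for the PPO-Lagrangian agent in Section VI); no safety
    weight appears in the reward itself.
    """
    reward = C_SUCCESS if succeeded else -C_TIME   # Eq. (10): utility only
    cost = 1.0 if collided else 0.0                # Eq. (11): risk only (unit collision cost assumed)
    return reward, cost
```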
a) The trade-off between safety and utility: In a conventional reward scheme, safety and utility are considered simultaneously in the return, implying that at some point the RL agent may sacrifice safety to reach a higher reward, or alternatively become too conservative due to large safety punishments. This trade-off is often tuned based on the $c_{\text{collision}}/c_{\text{time}}$ or $c_{\text{collision}}/c_{\text{success}}$ ratio in the reward function. However, this leads to two main issues: hyper-parameter sensitivity and environment over-fitting. After small changes in the environment configuration (like denser traffic or a higher average speed of vehicles), the reward function may not lead to the desired behavior anymore; a new round of reward parameter tuning needs to be applied and the agent needs to be retrained with the new rewarding scheme.
We propose to consider safety as the cost of the policy and decouple it from the other factors in the desired behavior of the RL agent by leveraging constrained RL. We can then enforce safety by setting a suitable cost limit $d$, which is a meaningful parameter, in our case specifying the average number of safety violations of the policy, without the requirement of tuning the parameters of the reward function again.
We may notice that other objectives, such as comfort,
compliance with traffic rules or fuel consumption, could also
be defined as separate constraints. This makes the goals
easier to interpret and avoids a highly complex scalarized
reward function.
Reward engineering can also become easier with this approach. Consider for instance the task of choosing the values for $c_{\text{time}}$ and $c_{\text{success}}$. On the one hand, if $c_{\text{time}} > c_{\text{success}}$, a regular RL agent might choose to crash in order to avoid accumulating time penalties, ignoring the reward for completing the task. On the other hand, a constrained RL agent would still behave reasonably due to the safety constraints.
VI. EXPERIMENTAL ANALYSIS
In this section, we empirically evaluate the two ways we may tackle the automated driving task: normal reinforcement learning and constrained reinforcement learning. The goal is to validate the hypothesis that training a safe and high-performing policy using constrained RL is easier from the user's perspective than using regular RL.
A. Experimental Set-Up
For our evaluations we use the highway-env framework,
which provides environments for tactical decision-making in
different automated driving tasks [30]. In this framework, the
RL agent controls the ego vehicle, while the other vehicles
follow an Intelligent Driver Model (IDM) [31] and only
react to the ego vehicle once it enters their lane. At the beginning of each episode, each vehicle is cooperative with probability $p_{\text{coop}}$. Cooperative vehicles consider the projected position of the ego vehicle on their lane as the position of their front vehicle and open a merging gap for the ego vehicle, with different comfortable deceleration limits ($a_{\text{comf-max}}$) in their IDM controller. In order to simulate different traffic dynamics, we implement three different environments for the automated merging scenario:
- Low Cooperative: low chance of having cooperative drivers ($p_{\text{coop}} = 0.3$) and early cooperative braking ($a_{\text{comf-max}} = 1.0\,\text{m/s}^2$).
- High Cooperative: high chance of having cooperative drivers ($p_{\text{coop}} = 0.6$) and early cooperative braking ($a_{\text{comf-max}} = 1.0\,\text{m/s}^2$).
- Low Cooperative with Late Brake: low chance of having cooperative drivers ($p_{\text{coop}} = 0.3$) and late cooperative braking ($a_{\text{comf-max}} = 5.0\,\text{m/s}^2$).
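For reference, the three configurations boil down to two parameters each; the dictionary below collects the values listed above, with key names of our own choosing that do not correspond to actual highway-env configuration fields.

```python
# Cooperative-driver probability and comfortable deceleration limit [m/s^2]
# for the three merging environments; key names are illustrative only.
TRAFFIC_CONFIGS = {
    "low_cooperative":            {"p_coop": 0.3, "a_comf_max": 1.0},
    "high_cooperative":           {"p_coop": 0.6, "a_comf_max": 1.0},
    "low_cooperative_late_brake": {"p_coop": 0.3, "a_comf_max": 5.0},
}
```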
1) Baselines: We consider three different RL agents in order to solve each merging scenario: normal PPO with multiple collision penalties, PPO-Lagrangian with multiple cost limits (we set $\alpha_\lambda = 0.05$ and update the penalty 40 times per epoch; the remaining hyperparameters are the same as for PPO), and CPO, also using multiple cost limits.
2) Metrics: We evaluate the average episode cost (AverageEpCost), which is the expected accumulated cost of a trajectory, and the episode length (EpLen), which indicates how fast the policy can finish one episode. Hence, in all the plots lower values are better.
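Both metrics can be computed directly from episode logs; the sketch below assumes each evaluation episode is recorded as a list of per-step costs, which is an implementation detail not specified in the paper.

```python
from statistics import mean
from typing import List


def average_episode_cost(episodes: List[List[float]]) -> float:
    """AverageEpCost: mean accumulated (undiscounted) cost per episode."""
    return mean(sum(step_costs) for step_costs in episodes)


def average_episode_length(episodes: List[List[float]]) -> float:
    """EpLen: mean number of decision steps per episode (lower means faster merging)."""
    return mean(len(step_costs) for step_costs in episodes)
```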
B. Results
1) Ease of use: As we discussed in Section IV and Section V, finding the appropriate value to balance utility and safety can be a challenge. Figure 3 shows the performance of the different algorithms on the Low Cooperative environment using different values for the safety weight $\lambda$ (in the case of PPO) and the safety bound $d$ (in the case of PPO-Lagrangian and CPO). On the one hand, we notice that equipping PPO with a $\lambda$ that is too low, such as 5, can lead to extremely unsafe policies, while setting it to 20 or 100 provides safer policies. On the other hand, the constrained methods can find safe policies. It is easy to see that PPO-Lagrangian approaches the desired safety bound. This experiment makes clear that, from the user's perspective, choosing a value for $d$ is much more meaningful than choosing a value for $\lambda$, since there is an obvious connection between the safety bound $d$ and the safety level of the policy returned. Consider for instance a user who is willing to allow an expected cost of 0.1: after observing the results from PPO with three different values for $\lambda$ in Figure 3, it is not clear what the value of $\lambda$ should be in that case.

Fig. 3. Training results for PPO with different safety weights (left), PPO-Lagrangian with different cost limits (middle) and the CPO algorithm with different cost limits (right). The grey dashed lines indicate the cost limits.
Considering the results for PPO with $\lambda = 5$, we may conclude that setting $c_{\text{time}} = 0.1$ and $c_{\text{success}} = 1$ encourages the agent to terminate the episode as soon as possible, leading to more crashes, making it mandatory to set $\lambda = 5$. On the other hand, we notice that the constrained agents manage to reduce the number of collisions almost independently of the cost limit.
2) Safety satisfaction: Although both PPO-Lagrangian and CPO try to learn safe policies, according to Figure 3, PPO-Lagrangian is more successful at satisfying the specified cost limit $d$ in its configuration. This suggests that, based on the desired safety requirement, one can directly specify the required cost limit for a PPO-Lagrangian agent before training, without the necessity to tune the reward function for safety satisfaction.
3) Evaluations on Different Traffic Dynamics: In traditional RL, the reward function needs to be specialized for every new environment with a different configuration. In order to study whether constrained RL can address this challenge, we trained PPO agents with different collision penalties ($\lambda$) in their reward function in environments with different traffic dynamics. After training, we evaluated each trained policy for 100 episodes in the configured environment and compared the collision rate and average episode time of each agent in Table I and Table II. The first conclusion from these results is that the PPO agent requires a specialized collision penalty in order to learn safe behavior for each environment. For the High Cooperative environment, all PPO agents have collision rates below 5%, while for the Low Cooperative environment the PPO agent with $\lambda = 0.1$ has a 14% collision rate. Moreover, in the Low Cooperative with Late Brake environment (the most challenging configuration), only PPO agents with $\lambda \ge 5$ have collision rates below 5%. Next, we trained PPO-Lagrangian as a constrained RL agent in the three environments and compared its evaluations with the PPO agents. The PPO-Lagrangian agent learns policies with collision rates below 5% in all three environments with fixed parameters in the reward and cost functions ($d = 0.01$ and $\alpha_\lambda = 0.1$). The important conclusion is that the PPO-Lagrangian algorithm is not sensitive to these environment disturbances, and therefore the designer can put less effort into training safe RL policies in automated driving environments which may have different traffic configurations.
TABLE I
Comparing baselines in the Low Cooperative and High Cooperative environments.

Low Cooperative
  Agent               PPO (λ=0.1)  PPO (λ=1)  PPO (λ=2.5)  PPO (λ=5)  PPO (λ=10)  PPO (λ=100)  PPO-Lag
  Collision Rate (%)      14           3          4.5         0.6        2.6           0          3.3
  Avg. Time (s)          48.2        51.1        52.7        59.9       56.8        79.3         56.7

High Cooperative
  Agent               PPO (λ=0.1)  PPO (λ=1)  PPO (λ=2.5)  PPO (λ=5)  PPO (λ=10)  PPO (λ=100)  PPO-Lag
  Collision Rate (%)      4.6         4.3          1          4.3         2            0          0.33
  Avg. Time (s)          47.5        47.3        49.4        48.1       49.1        60.3         49.1
TABLE II
Comparing baselines in the Low Cooperative with Late Brake environment.

  Agent               PPO (λ=0.1)  PPO (λ=1)  PPO (λ=2.5)  PPO (λ=5)  PPO (λ=10)  PPO (λ=100)  PPO-Lag
  Collision Rate (%)      16          14         19.3        3.3        0.1           0          1.3
  Avg. Time (s)          47.4         45         47.5       60.6       68.8        78.2         81.9
4) Effect of penalty learning rate $\alpha_\lambda$: We also performed a hyper-parameter analysis for PPO-Lagrangian's penalty learning rate $\alpha_\lambda$. We considered the values $\alpha_\lambda \in \{0.005, 0.01, 0.05, 0.1, 0.5\}$. Figure 4 shows that, overall, the PPO-Lagrangian algorithm has a low hyper-parameter sensitivity with respect to $\alpha_\lambda$ in terms of safety. That is, for all learning rates the algorithm converges to a constraint-satisfying policy. We also notice that $\alpha_\lambda$ has a more significant impact on the performance during learning, as reflected in the average episode length. PPO-Lagrangian finds policies with lower episode length using the lower learning rates (0.01 and 0.005), indicating that the results presented in Figure 3, Table I and Table II could still be improved, while larger learning rates can increase the episode length.
5) Video Demonstration: The supplementary video compares the use of different safety penalties for a penalty-based RL model and the constrained RL algorithms. Large penalties generally lead RL agents to drive safely but overly conservatively (slower), while smaller penalties lead to faster driving but also reckless behavior. We observe that the PPO-Lagrangian agent learns to balance safe and fast driving, a behavior that is less conservative than an RL agent with large penalties and less reckless than an RL agent with small penalties.
C. Limitations
Notice that constrained RL defines safety in expectation, which in our application still allows a number of collisions even after learning has finished. To mitigate this issue, we could combine such an approach with methods that enforce hard constraints [10]–[12]. This method also does not guarantee safety while the agent is still learning; finding ways to ensure constrained RL satisfies the safety constraints during training is an active line of research [32]–[34]. Furthermore, more sophisticated penalty learning schedules could be applied in the constrained RL algorithm in order to achieve faster convergence and even more adaptive policies.
Fig. 4. Training results for the PPO-Lagrangian method using different penalty learning rates on the Low Cooperative environment with cost limit $d = 0.01$, as indicated by the grey dashed line.
VII. CONCLUSION
In this paper, we addressed the challenge of learning safe and efficient policies in automated driving with RL. In contrast to traditional RL methods that learn safe policies by discouraging unsafe outcomes using penalties in the reward function, we investigated a new perspective on the safety of the learned policies using constrained RL. We showed that the main drawback of the traditional RL algorithms is the requirement of reward engineering for every specific traffic configuration (e.g., fewer cooperative drivers or different cooperative strategies) in order to learn safe policies. The proposed methodology provides a clear interface for the designer, who only needs to set the desired cost limit for the policy being learned instead of manually balancing safety and utility until finding the best policy. In light of our experiments, this helps to learn safe and efficient policies in environments with different traffic dynamics using a fixed setup for the constrained RL agent.
ACKNOWLEDGMENT
This research was partly accomplished within the project "UNICARagil" (FKZ 6EMO0287). We acknowledge the financial support for the project by the Federal Ministry of Education and Research of Germany (BMBF).
REFERENCES
[1] D. Isele, R. Rahimi, A. Cosgun, K. Subramanian, and K. Fujimura, "Navigating occluded intersections with autonomous vehicles using deep reinforcement learning," in ICRA, IEEE, 2018, pp. 2034–2039.
[2] T. Tram, A. Jansson, R. Grönberg, M. Ali, and J. Sjöberg, "Learning negotiating behavior between cars in intersections using deep Q-learning," in 2018 21st International Conference on Intelligent Transportation Systems (ITSC), IEEE, 2018, pp. 3169–3174.
[3] M. Bouton, A. Nakhaei, K. Fujimura, and M. J. Kochenderfer, "Safe reinforcement learning with scene decomposition for navigating complex urban environments," in 2019 IEEE Intelligent Vehicles Symposium (IV), IEEE, 2019, pp. 1469–1476.
[4] M. Bouton, A. Nakhaei, D. Isele, K. Fujimura, and M. J. Kochenderfer, "Reinforcement learning with iterative reasoning for merging in dense traffic," in 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), 2020, pp. 1–6.
[5] D. Kamran, C. F. Lopez, M. Lauer, and C. Stiller, "Risk-aware high-level decisions for automated driving at occluded intersections with reinforcement learning," in 2020 IEEE Intelligent Vehicles Symposium (IV), IEEE, 2020, pp. 1205–1212.
[6] M. Bouton, A. Nakhaei, K. Fujimura, and M. J. Kochenderfer, "Cooperation-aware reinforcement learning for merging in dense traffic," in 2019 IEEE Intelligent Transportation Systems Conference (ITSC), IEEE, 2019, pp. 3441–3447.
[7] D. Isele, A. Nakhaei, and K. Fujimura, "Safe reinforcement learning on autonomous vehicles," in IROS, IEEE, 2018, pp. 1–6.
[8] J. García and F. Fernández, "A comprehensive survey on safe reinforcement learning," JMLR, vol. 16, pp. 1437–1480, 2015.
[9] P. Wang, C.-Y. Chan, and A. de La Fortelle, "A reinforcement learning based approach for automated lane change maneuvers," in 2018 IEEE Intelligent Vehicles Symposium (IV), 2018, pp. 1379–1384.
[10] M. Alshiekh, R. Bloem, R. Ehlers, B. Könighofer, S. Niekum, and U. Topcu, "Safe reinforcement learning via shielding," in AAAI, AAAI Press, 2018, pp. 2669–2678.
[11] G. Kalweit, M. Huegle, M. Werling, and J. Boedecker, "Deep constrained Q-learning," arXiv:2003.09398, 2020.
[12] N. Jansen, B. Könighofer, S. Junges, A. Serban, and R. Bloem, "Safe reinforcement learning using probabilistic shields," in CONCUR, ser. LIPIcs, vol. 171, 2020, 3:1–3:16.
[13] S. Mo, X. Pei, and C. Wu, "Safe reinforcement learning for autonomous vehicle using Monte Carlo tree search," IEEE Transactions on Intelligent Transportation Systems, pp. 1–8, 2021.
[14] M. Bouton, J. Karlsson, A. Nakhaei, K. Fujimura, M. J. Kochenderfer, and J. Tumova, "Reinforcement learning with probabilistic guarantees for autonomous driving," arXiv:1904.07189, 2019.
[15] N. Carrara, E. Leurent, R. Laroche, T. Urvoy, O.-A. Maillard, and O. Pietquin, "Budgeted reinforcement learning in continuous state space," in NeurIPS, Curran Associates, Inc., 2019, pp. 9295–9305.
[16] L. Wen, J. Duan, S. E. Li, S. Xu, and H. Peng, "Safe reinforcement learning for autonomous vehicles through parallel constrained policy optimization," in 23rd IEEE International Conference on Intelligent Transportation Systems, IEEE, 2020, pp. 1–7.
[17] E. Altman, Constrained Markov Decision Processes. CRC Press, 1999, vol. 7.
[18] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st ed. John Wiley & Sons, Inc., 1994.
[19] L. Brunke, M. Greeff, A. W. Hall, Z. Yuan, S. Zhou, J. Panerati, et al., "Safe learning in robotics: From learning-based control to safe reinforcement learning," Annual Review of Control, Robotics, and Autonomous Systems, vol. 5, no. 1, pp. 411–444, 2022.
[20] S. Ha, P. Xu, Z. Tan, S. Levine, and J. Tan, "Learning to walk in the real world with minimal human effort," arXiv:2002.08550, 2020.
[21] Q. Yang, T. D. Simão, S. H. Tindemans, and M. T. J. Spaan, "WCSAC: Worst-case soft actor critic for safety-constrained reinforcement learning," in AAAI, 2021.
[22] Q. Yang, T. D. Simão, S. H. Tindemans, and M. T. J. Spaan, "Safety-constrained reinforcement learning with a distributional safety critic," Machine Learning, pp. 1–29, 2022.
[23] A. Ray, J. Achiam, and D. Amodei, "Benchmarking safe exploration in deep reinforcement learning," https://cdn.openai.com/safexp-short.pdf, 2019.
[24] J. Achiam, D. Held, A. Tamar, and P. Abbeel, "Constrained policy optimization," in ICML, PMLR, 2017, pp. 22–31.
[25] Y. Liu, J. Ding, and X. Liu, "IPO: Interior-point policy optimization under constraints," in AAAI, 2020, pp. 4940–4947.
[26] T.-Y. Yang, J. Rosca, K. Narasimhan, and P. J. Ramadge, "Projection-based constrained policy optimization," in ICLR, 2020.
[27] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv:1707.06347, 2017.
[28] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in ICML, JMLR.org, 2015, pp. 1889–1897.
[29] D. P. Bertsekas, Constrained Optimization and Lagrange Multiplier Methods. Academic Press, 1982, vol. 1.
[30] E. Leurent, "An environment for autonomous driving decision-making," https://github.com/eleurent/highway-env, 2018.
[31] M. Treiber, A. Hennecke, and D. Helbing, "Congested traffic states in empirical observations and microscopic simulations," Phys. Rev. E, vol. 62, pp. 1805–1824, 2000.
[32] T. Liu, R. Zhou, D. Kalathil, P. R. Kumar, and C. Tian, "Learning policies with zero or bounded constraint violation for constrained MDPs," in NeurIPS, 2021, pp. 17183–17193.
[33] T. D. Simão, N. Jansen, and M. T. J. Spaan, "AlwaysSafe: Reinforcement learning without safety constraint violations during training," in AAMAS, IFAAMAS, 2021, pp. 1226–1235.
[34] Q. Bai, A. S. Bedi, M. Agarwal, A. Koppel, and V. Aggarwal, "Achieving zero constraint violation for constrained reinforcement learning via primal-dual approach," in AAAI, AAAI Press, 2022, pp. 3682–3689.