A Modern Perspective on Safe Automated Driving for Different Traffic Dynamics Using Constrained Reinforcement Learning

November 2022

DOI:10.1109/ITSC55140.2022.9921907

Conference: ITSC 2022
At: Macau, China

Authors:

Danial Kamran

Karlsruhe Institute of Technology

Qisong Yang

Delft University of Technology

A Modern Perspective on Safe Automated Driving for Different Traffic Dynamics using Constrained Reinforcement Learning

Danial Kamran*1, Thiago D. Simão*2, Qisong Yang3, Canmanie T. Ponnambalam3, Johannes Fischer1, Matthijs T. J. Spaan3 and Martin Lauer1

Abstract— The use of reinforcement learning (RL) in real-world domains often requires extensive effort to ensure safe behavior. While this compromises the autonomy of the system, it might still be too risky to allow a learning agent to freely explore its environment. These strict impositions come at the cost of flexibility, and applying them often relies on complex parameters and hard-coded knowledge modelled by the reward function. Autonomous driving is one such domain that could greatly benefit from more efficient and verifiable methods for safe automation. We propose to approach the automated driving problem using constrained RL, a method that automates the trade-off between risk and utility, thereby significantly reducing the burden on the designer. We first show that a reward function engineered to ensure safety and utility in one specific environment might not result in optimal behavior when the traffic dynamics change within that same environment. Next, we show how algorithms based on constrained RL, which are more robust to environmental disturbances, can address this challenge. These algorithms use a simple and easy-to-interpret reward and cost function, and are able to maintain both efficiency and safety without requiring reward parameter tuning. We demonstrate our approach in the automated merging scenario with different traffic configurations, such as a low or high chance of cooperative drivers and different cooperative driving strategies.

I. INTRODUCTION

Reinforcement learning (RL) promises to produce agents

that learn to optimize decision-making problems with limited

to no knowledge of the environment. This makes it an attrac-

tive approach to automating complex and high-dimensional

tasks. The drawback is that RL agents must interact with

the environment, exhaustively taking both good and bad

actions, in order to learn the best long-term decisions. In

real-world and safety-critical domains such as driving,

the consequences of taking bad actions are severe, which

diminishes the appeal of classical RL. An ideal response

to this problem is safe reinforcement learning, a class of

RL methods that guarantees safety during learning or upon

execution. Safe RL methods have recently been applied

to various automated driving problems with some success

[1]–[6]. Existing safe RL approaches to autonomous driving

require extensive designer knowledge coded into the solution.

In many cases, these methods impose a heavy burden on the

*Authors have equal contribution.

1Karlsruhe Institute of Technology, Germany {danial.kamran,

johannes.fischer, martin.lauer}@kit.edu

2Radboud University, Nijmegen, The Netherlands

thiago.simao@ru.nl

3Delft University of Technology, The Netherlands {q.yang,

c.t.ponnambalam, m.t.j.spaan}@tudelft.nl

[Figure 1: two block diagrams. Normal RL: policy training, a check of whether the safety constraint is satisfied, and reward tuning until it is, which yields the safe policy. Constrained RL: policy training with the safety constraint yields the safe policy directly.]

Fig. 1. Two structures for learning safe policies. In normal RL, the user searches for the best reward function that produces a policy that satisfies the required constraint. In constrained RL, the algorithm design is simplified as the safety constraint will automatically be satisfied during training.

designer to identify unsafe states or actions, tune hyperpa-

rameters, or deﬁne complex reward functions [3], [5], [7], as

Figure 1 illustrates. In addition to the issue of extensive prior

knowledge needed, these methods can be overly conservative

as they often impose hard restrictions on the search space.

Instead, we propose constrained reinforcement learning as an

elegant approach to safety in the automated vehicle domain.

Constrained RL models the problem as a constrained Markov

decision process (CMDP), introducing a cost function to

encode safety-relevant information (such as whether a crash

has occurred). This clearly separates the speciﬁcation of

reward (to be maximized) and cost (to be minimized).

The constraint is then deﬁned as a threshold regarding the

acceptable expected cost, resulting in a simple and highly

interpretable parameter. The agent learns to optimize reward

while respecting this safety constraint, automatically tuning

the trade-off between the two conﬂicting goals.

Merging into a highway with dense trafﬁc is a challenging

task for automated vehicles. The dense trafﬁc means that the

window in which a successful merge can occur is small.

The position and interaction between other vehicles in the

environment is a crucial aspect of the state description that

determines when a merge can be successfully executed. This

complex state space makes it particularly challenging to

deﬁne a set of safe states or actions by hand, and manipu-

lating the reward function to include safety considerations is

difﬁcult and requires considerable tuning. Further, applying

a safe RL method that is overly conservative in this scenario

can result in the freezing robot problem, whereby the vehicle

is unable to merge at all. This makes the highway merge

problem a prime candidate for a constrained RL approach.

In this paper, we ﬁrst describe existing safe RL approaches

to the automated driving problem, and highlight recent work

with similar goals to our approach as well as their limitations.

We then formulate the dense highway merge scenario as a

constrained MDP and compare two constrained RL methods to

a traditional safe approach to this problem. The experiments

demonstrate how constrained RL successfully mitigates the

trade-off between merging as quickly as possible and avoid-

ing crashes without additional hyper-parameters or extensive

tuning.

II. RELATED WORK

The ﬁeld of safe reinforcement learning (RL) encompasses

several different types of approaches with varying levels

of safety guarantees, of which formulating the problem as

a constrained Markov decision process (as we propose) is

only a sub-set. For a comprehensive overview of safe RL in

general, we refer the reader to [8].

In this work, we focus on methods situated in the auto-

mated driving domain that aim to adhere to safe behavior

either during learning or on execution of the trained agent.

The most relevant methods can be divided into two cate-

gories: those that encode safety in the reward function, and

those that shield unsafe actions from the agent.

A popular approach to safe RL is to include penalties

in the reward function that discourage unsafe behavior, an

indirect way to incorporate safety [1], [5], [6], [9]. These

methods lay the burden of specifying unsafe behavior on the

designer, resulting in reward functions that can be hard to

specify and even more difﬁcult to verify. A more explicit way

to produce safe behavior is to restrict the action space to safe

actions, often referred to as shielding [10]–[12]. Determining

unsafe actions can be done using, for example, a model

checker [3] or predictive model [4], [7]. Efforts have been

made to combine the two types of approaches, with one using

a parameterized reward penalty to restrict actions determined

to be unsafe [13]. In general, restricting the search space can

produce conservative behavior, as it enforces hard limits

on the space of acceptable policies. Further, such methods

are very sensitive to incorrect speciﬁcations or predictions of

unsafe actions.

The use of constrained RL in the autonomous vehicle

domain has so far been limited. One paper used

LTL speciﬁcations to deﬁne unsafe states, referring to the

result as a constrained optimization problem [14]. However,

this decoupled approach does not attempt to balance reward

and safety, instead enforcing hard constraints on the search

space. A budgeted MDP, which is similar to a constrained

MDP, but offers additional control over the budget, has

also been used to model the problem of automated driving

[15]. Most recently, constrained RL has been evaluated

on lane keeping and intersection navigation tasks, where

a parallel learning approach was proposed that employs

multiple agents to speed up convergence [16]. Our paper

highlights the limitations of reinforcement learning that are

addressed by taking a constrained optimization approach. We

focus on the improvements provided in terms of the ease of

speciﬁcation, robustness to scalarization issues, and elegant

trade-off of reward and risk, evaluated on a dense highway

merge scenario.

III. BACKGROUND

In this section, we formalize the deﬁnition of constrained

RL, and present the algorithms that are used to solve it.

A. Constrained Markov Decision Process

A CMDP [17] is a model that separates reward and safety signals. Similar to a Markov decision process (MDP) [18], a CMDP is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \iota, r, c, d, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the set of actions, $\mathcal{P} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is a transition kernel giving the probability of reaching state $s'$ after taking action $a$ in state $s$, $\iota$ is the initial state distribution, $r : \mathcal{S} \times \mathcal{A} \to [r_{\min}, r_{\max}]$ is the reward function, $c : \mathcal{S} \times \mathcal{A} \to [c_{\min}, c_{\max}]$ is the cost function, $d$ is the safety threshold, and $\gamma \in [0, 1]$ is the discount factor.

As in an MDP, the goal in a CMDP is to compute a policy that maximizes the accumulated discounted reward

$$\max_{\pi} J_R(\pi) \doteq \mathbb{E}_{(s_t, a_t) \sim T_\pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right], \tag{1}$$

where $T_\pi = (s_0, a_0, s_1, \ldots)$ is the trajectory distribution induced by $s_0 \sim \iota$, $a_t \sim \pi(\cdot \mid s_t)$, and $s_{t+1} \sim \mathcal{P}(\cdot \mid s_t, a_t)$. Additionally, the optimal policy has to keep the expected accumulated discounted cost bounded

$$J_C(\pi) \doteq \mathbb{E}_{(s_t, a_t) \sim T_\pi}\left[\sum_{t=0}^{\infty} \gamma^t c(s_t, a_t)\right] \le d \tag{2}$$

according to the predefined safety threshold $d$. Depending on the task, this constraint might resemble a bound on the probability of failure, for instance if $c(s, a) = \mathbb{1}_{\text{failure}}(s)$, although this requires $\gamma = 1$. An MDP can be seen as an unbounded CMDP: setting $d = \infty$ essentially allows the cost function to be ignored, which yields the MDP $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \iota, r, \gamma)$.
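As a minimal sketch of how the objectives in Equations (1) and (2) can be estimated from sampled trajectories, the following Python snippet computes Monte Carlo estimates of the discounted reward and cost and checks the constraint. The function names and the two example episodes are hypothetical and chosen only for illustration; they do not come from the experiments in this paper.

```python
from typing import List, Sequence, Tuple

Episode = Tuple[Sequence[float], Sequence[float]]  # (per-step rewards, per-step costs)


def discounted_sum(values: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma^t * values[t] for a single trajectory."""
    total = 0.0
    for t, value in enumerate(values):
        total += (gamma ** t) * value
    return total


def estimate_objectives(episodes: List[Episode], gamma: float) -> Tuple[float, float]:
    """Monte Carlo estimates of J_R (Eq. 1) and J_C (Eq. 2)."""
    n = len(episodes)
    j_r = sum(discounted_sum(rewards, gamma) for rewards, _ in episodes) / n
    j_c = sum(discounted_sum(costs, gamma) for _, costs in episodes) / n
    return j_r, j_c


def is_feasible(j_c: float, d: float) -> bool:
    """Constraint of Eq. (2): the expected accumulated cost must not exceed d."""
    return j_c <= d


# Two made-up episodes (illustrative values only).
episodes = [
    ([-0.1, -0.1, 1.0], [0.0, 0.0, 0.0]),   # merge succeeds at the last step
    ([-0.1, -0.1, -0.1], [0.0, 0.0, 1.0]),  # collision at the last step
]
j_r, j_c = estimate_objectives(episodes, gamma=1.0)
print(j_r, j_c, is_feasible(j_c, d=0.1))
```

With $\gamma = 1$ and an indicator cost, the estimate of $J_C$ in this toy example is simply the fraction of episodes that end in a failure, which is why the threshold $d$ can then be read as a bound on the failure probability.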

B. Constrained Reinforcement Learning

Constrained RL addresses the problem of solving an

unknown CMDP [19]. Although off-policy methods for

constrained RL have been proposed [20]–[22], this paper

focuses on on-policy variants. Speciﬁcally, we apply PPO-

Lagrangian [23] and Constrained Policy Optimization (CPO)

[24] to the automated driving domain. These methods repre-

sent two main directions in on-policy constrained RL. The

ﬁrst direction is to adapt RL algorithms to their Lagrangian

variants, as seen in TRPO-Lagrangian and PPO-Lagrangian

[23]. The second direction uses constrained policy optimiza-

tion methods [25], [26] built on the work of [24].

a) PPO-Lagrangian: Proximal policy optimization

(PPO) [27], designed for regular RL problems, not only re-

tains the beneﬁts of trust region policy optimization (TRPO)

[28], but also has better sample complexity and is simpler

to implement. Constrained optimization problems can be

solved by a Lagrangian variant of PPO [23], [29]. Instead

of ﬁxing the value of the Lagrangian multiplier, we adapt

it based on the constraint-satisfying performance. When the

policy is unsafe, we increase the Lagrangian multiplier to

enhance safety, but decrease it when attaining safe perfor-

mance. This allows us to leverage an adaptive safety weight λ

in the constrained optimization problem:

$$\max_{\pi} \min_{\lambda \ge 0} G(\pi, \lambda) \doteq f(\pi) - \lambda g(\pi), \tag{3}$$

where $f(\pi) = J_R(\pi)$ and $g(\pi) = J_C(\pi) - d$ in the case of Equations (1) and (2). We therefore update the safety weight using

$$\lambda_{k+1} = \max\left(0, \lambda_k + \alpha_\lambda (J_C(\pi) - d)\right), \tag{4}$$

where $\alpha_\lambda$ is the penalty learning rate. In our experiments, we use the undiscounted cumulative cost to measure the real constraint satisfaction.
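The multiplier update of Equation (4) can be sketched in a few lines of Python. This is only an illustration of the update rule itself, not of the full PPO-Lagrangian training loop; the sequence of measured costs, the cost limit, and the learning rate below are made-up values.

```python
def update_multiplier(lam: float, avg_cost: float, d: float, lr: float) -> float:
    """One step of Eq. (4): raise lambda when the cost exceeds d, lower it otherwise."""
    return max(0.0, lam + lr * (avg_cost - d))


lam = 0.0
history = []
# Hypothetical average episode costs observed over successive updates.
for avg_cost in [0.30, 0.20, 0.05, 0.0, 0.0, 0.0]:
    lam = update_multiplier(lam, avg_cost, d=0.1, lr=0.5)
    history.append(lam)

print(history)
```

In this toy run the multiplier grows while the measured cost is above the limit, shrinks once the policy becomes safe, and is clipped at zero by the projection onto $\lambda \ge 0$.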

b) Constrained Policy Optimization (CPO): CPO is a

trust-region method for constrained RL with guarantees for

near-constraint satisfaction at each iteration [24]. At each

gradient step, CPO constrains the policy changes to the cost

constraint and divergence neighborhood while guaranteeing

reward improvement. Similar to TRPO, the problem given by Equations (1) and (2) is additionally subject to a Kullback-Leibler (KL) divergence constraint:

$$\begin{aligned}
\pi_{k+1} = \arg\max_{\pi} \; & \mathbb{E}_{s \sim T_{\pi_k},\, a \sim \pi}\left[A^{\pi_k}_R(s, a)\right] \\
\text{s.t.} \; & J_C(\pi_k) + \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim T_{\pi_k},\, a \sim \pi}\left[A^{\pi_k}_C(s, a)\right] \le d \\
& \mathbb{E}_{s \sim T_{\pi_k}}\left[D_{KL}(\pi \,\|\, \pi_k)[s]\right] \le \delta
\end{aligned} \tag{5}$$

where $\delta$ is the maximum step size and $D_{KL}$ is the KL divergence that defines the trust region. The advantage functions $A_R$ and $A_C$ express the performance change $\mathbb{E}_{s \sim T_{\pi_k},\, a \sim \pi}[A^{\pi_k}_R(s, a)]$ (in reward) and $\mathbb{E}_{s \sim T_{\pi_k},\, a \sim \pi}[A^{\pi_k}_C(s, a)]$ (in cost) of policy $\pi$ over the current policy $\pi_k$. After the transition from Equations (1) and (2) to Equation (5), CPO further approximates the reward function and constraints using local approximations (first- and second-order expansions) for small step sizes $\delta$, to ensure the problem is solvable. We refer the reader to [24] for more details on the CPO algorithm.
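To make the two constraints of Equation (5) more concrete, the sketch below checks whether a candidate policy would be admissible given samples collected with the current policy $\pi_k$. It is not the CPO update itself, which solves an approximated optimization problem as described in [24]. The sketch assumes a discrete action space and importance ratios $\pi(a \mid s) / \pi_k(a \mid s)$ to estimate the expectation over $a \sim \pi$; all function names and numbers are hypothetical.

```python
import math
from typing import Sequence


def surrogate_cost(j_c_old: float, ratios: Sequence[float],
                   cost_advantages: Sequence[float], gamma: float) -> float:
    """Estimate the left-hand side of the cost constraint in Eq. (5).

    ratios[i] = pi(a_i | s_i) / pi_k(a_i | s_i) for samples drawn with pi_k,
    so the weighted mean estimates E_{s ~ T_pi_k, a ~ pi}[A_C(s, a)].
    """
    weighted = [w * adv for w, adv in zip(ratios, cost_advantages)]
    return j_c_old + (sum(weighted) / len(weighted)) / (1.0 - gamma)


def mean_kl(new_probs: Sequence[Sequence[float]],
            old_probs: Sequence[Sequence[float]]) -> float:
    """Average over sampled states of D_KL(pi || pi_k) for discrete actions."""
    total = 0.0
    for p_new, p_old in zip(new_probs, old_probs):
        total += sum(p * math.log(p / q) for p, q in zip(p_new, p_old) if p > 0.0)
    return total / len(new_probs)


def is_admissible(cost_lhs: float, d: float, kl: float, delta: float) -> bool:
    """Both constraints of Eq. (5) must hold for the candidate policy."""
    return cost_lhs <= d and kl <= delta


# Hypothetical batch of four samples and two sampled states.
cost_lhs = surrogate_cost(0.08, [1.1, 0.9, 1.0, 1.0],
                          [0.002, -0.004, 0.0, 0.001], gamma=0.99)
uniform = [1.0 / 3.0] * 3
kl = mean_kl([[0.3, 0.4, 0.3], [0.35, 0.35, 0.3]], [uniform, uniform])
print(cost_lhs, kl, is_admissible(cost_lhs, d=0.1, kl=kl, delta=0.01))
```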

IV. AUTONOMOUS DRIVING AS AN MDP

We formulate the automated driving problem as a Markov

decision process (MDP), where at every decision step t, the

decision making policy πchooses the best action. The overall

goal is to learn the actions that maximize the expected future

reward (return) at every time step. In this paper, we focus

on merging in a highway environment where the ego vehicle

is a reinforcement learning agent that observes the positions

Fig. 2. Example of a merging scenario and the features that make up the

observation of the reinforcement learning agent. Here the ego vehicle (blue)

has to prevent collisions with vehicles on the main lane and also drive as

fast as possible to reach the goal.

and velocities of the surrounding vehicles and controls its

acceleration. The aim is to avoid collisions during merging without acting too conservatively. To this end, we model the merging scenario depicted in Figure 2 and define the input state as

$$s_t = \begin{bmatrix} d_e & d_{goal} & d_1 & \ldots & d_n \\ v_e & a_e & v_1 & \ldots & v_n \end{bmatrix}, \tag{6}$$

where $d_e$ is the ego vehicle's distance to the conflict merging area, $d_{goal}$ is the distance from the conflict area to the goal, and $v_e$ and $a_e$ are the velocity and acceleration of the ego vehicle, respectively. We also include the relative distances $d_i$ and velocities $v_i$ between vehicles on the main lane and the projection of the ego vehicle onto the main lane, for a maximum of $N = 15$ surrounding vehicles in the state, as shown in Figure 2.
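The state layout of Equation (6) could be assembled as in the following sketch. The zero-padding used when fewer than $N$ vehicles are present, and the truncation to the first $N$ vehicles, are assumptions made for illustration; the text does not specify how missing vehicles are encoded. The helper name `build_state` is likewise hypothetical.

```python
from typing import List, Sequence, Tuple

N_MAX = 15  # maximum number of surrounding vehicles stated in the text


def build_state(d_e: float, d_goal: float, v_e: float, a_e: float,
                others: Sequence[Tuple[float, float]],
                pad: float = 0.0) -> List[List[float]]:
    """Return the 2 x (N_MAX + 2) state of Eq. (6) as nested lists.

    others holds (relative distance d_i, relative velocity v_i) pairs.
    Padding with `pad` for absent vehicles is an assumption, not from the paper.
    """
    kept = list(others)[:N_MAX]
    missing = N_MAX - len(kept)
    distances = [d for d, _ in kept] + [pad] * missing
    velocities = [v for _, v in kept] + [pad] * missing
    return [[d_e, d_goal] + distances, [v_e, a_e] + velocities]


# Hypothetical observation with two vehicles on the main lane.
state = build_state(40.0, 120.0, 8.0, 0.5, [(-12.0, 3.0), (15.0, -1.5)])
print(len(state), len(state[0]))
```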

The policy maps a state to an action $a_t$ from the discrete action space $\mathcal{A} = \{\text{Decelerate}, \text{Idle}, \text{Accelerate}\}$ that controls the ego vehicle behavior during merging by sending high-level commands to a low-level speed controller.

Some of the key performance metrics we consider in this domain include:

$$\text{risk}(s_t, a_t) = \begin{cases} c_{collision}, & \text{if collision}, \\ 0, & \text{otherwise}. \end{cases} \tag{7}$$

$$\text{utility}(s_t, a_t) = \begin{cases} c_{success}, & \text{if success}, \\ -c_{time}, & \text{otherwise}. \end{cases} \tag{8}$$

In some works, the time penalty $c_{time}$ is not used. Instead, discounting future rewards with $\gamma < 1$ also encourages faster driving.

In order to learn the desired behavior, both safety and

utility must be considered, as the two most important aspects

for automated driving. It is preferred to learn policies that

are safe, thus preventing collisions with other vehicles, while

also acting efﬁciently, thereby exhibiting behavior that is not

too conservative.

A. Penalty-based safety

Traditionally, such desired behavior is encoded in the

reward function using a combination of these components

and adjusting their parameters to increase speed or enforcing

time penalties that encourage faster driving, while at the same

time employing high collision penalties to encourage safe

driving [1]–[6]. The resulting reward function is given as

$$r(s_t, a_t) = \text{utility}(s_t, a_t) - \lambda \, \text{risk}(s_t, a_t), \tag{9}$$

where $\lambda$ is the safety weight, which balances utility against safety.

In this case, assuming the values $c_{collision}$, $c_{time}$ and $c_{success}$ are already defined, the user must choose an appropriate safety weight $\lambda$. Notice, however, that the appropriate value for $\lambda$ depends on the structure of the reward function; in other words, for different values of $c_{collision}$, $c_{time}$ and $c_{success}$, the appropriate safety weight $\lambda$ could change significantly.
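A small worked example may help to see why the appropriate safety weight depends on the other reward constants. The sketch below compares the expected undiscounted return, under the penalty-based reward, of a hypothetical aggressive policy and a hypothetical cautious policy. The episode lengths, crash probabilities, and the value of the collision constant are invented for illustration, and each policy is simplified to a fixed episode length.

```python
def expected_return(length: int, p_crash: float, lam: float,
                    c_time: float = 0.1, c_success: float = 1.0,
                    c_collision: float = 1.0) -> float:
    """Expected undiscounted return of a policy under the penalty-based reward.

    Every non-terminal step yields -c_time. A successful final step yields
    c_success; a colliding final step yields -c_time - lam * c_collision.
    """
    success_return = -(length - 1) * c_time + c_success
    crash_return = -length * c_time - lam * c_collision
    return (1.0 - p_crash) * success_return + p_crash * crash_return


def preferred(lam: float, c_time: float) -> str:
    """Which of the two made-up policies has the higher expected return."""
    aggressive = expected_return(45, 0.15, lam, c_time=c_time)  # fast, risky
    cautious = expected_return(60, 0.01, lam, c_time=c_time)    # slow, safe
    return "cautious" if cautious > aggressive else "aggressive"


for lam, c_time in [(5.0, 0.1), (20.0, 0.1), (5.0, 0.05)]:
    print(lam, c_time, preferred(lam, c_time))
```

In this toy setting the same weight of 5 favors the aggressive policy when the time penalty is 0.1 but favors the cautious one when the time penalty is halved, so the safety weight cannot be chosen independently of the remaining constants.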

V. AUTONOMOUS DRIVING AS A CMDP

In order to overcome the hyper-parameter sensitivity of

such a complex reward function, which is especially im-

portant in safety-related applications like automated driving,

we propose to instead use constrained RL and formulate

safety explicitly in a cost function. In this way, the RL agent

automatically satisﬁes safety constraints identiﬁed as cost

limits of the policy without requiring any parameter tuning

in the reward function. We define the following reward and cost functions for our highway merging scenario:

$$r(s_t, a_t) = \text{utility}(s_t, a_t), \tag{10}$$

$$c(s_t, a_t) = \text{risk}(s_t, a_t). \tag{11}$$

Now, when the user of this system trains a policy to drive safely, she only needs to define the cost limit $d$, which removes the burden of choosing an appropriate balance between utility and risk, represented by the value $\lambda$.
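The difference between the two formulations can be sketched as follows: the penalty-based agent receives a single scalar that mixes utility and risk through the safety weight, whereas the constrained agent receives the two signals separately. The class and function names are hypothetical, and the value of the collision constant is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SignalParams:
    c_collision: float = 1.0  # assumed value, chosen for illustration
    c_success: float = 1.0
    c_time: float = 0.1


def risk(collision: bool, p: SignalParams) -> float:
    """Risk signal: c_collision on a collision step, 0 otherwise."""
    return p.c_collision if collision else 0.0


def utility(success: bool, p: SignalParams) -> float:
    """Utility signal: c_success on success, a time penalty otherwise."""
    return p.c_success if success else -p.c_time


def scalarized_reward(success: bool, collision: bool, lam: float,
                      p: SignalParams) -> float:
    """Penalty-based reward: utility minus lambda times risk."""
    return utility(success, p) - lam * risk(collision, p)


def cmdp_signals(success: bool, collision: bool,
                 p: SignalParams) -> Tuple[float, float]:
    """Constrained formulation: reward and cost are kept as separate signals."""
    return utility(success, p), risk(collision, p)


params = SignalParams()
print(scalarized_reward(False, True, lam=10.0, p=params))
print(cmdp_signals(False, True, params))
```

Note that the constrained formulation has no safety weight in its signature; the trade-off is governed by the cost limit $d$ during training instead.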

a) The trade-off between safety and utility: In a con-

ventional reward scheme, safety and utility are considered

simultaneously in the return, implying that at some points

the RL agent may sacriﬁce safety to reach higher reward

or alternatively become too conservative due to large safety

punishments. This trade-off is often tuned via the ratio $c_{collision}/c_{time}$ or $c_{collision}/c_{success}$ in the reward function. However, this leads to two main issues: hyper-parameter sensitivity and environment over-fitting. After small changes in the environment configuration (such as denser traffic or a higher average vehicle speed), the reward function may no longer lead to the desired behavior; the reward parameters then need to be tuned again and the agent retrained with the new reward scheme.

We propose to consider safety as the cost of the policy and decouple it from other factors in the desired behavior of the RL agent by leveraging constrained RL. We can then enforce safety by setting a suitable cost limit $d$, which is a meaningful parameter (in our case, it specifies the average number of safety violations of the policy), without having to tune the parameters of the reward function again.

We may notice that other objectives, such as comfort,

compliance with trafﬁc rules or fuel consumption, could also

be deﬁned as separate constraints. This makes the goals

easier to interpret and avoids a highly complex scalarized

reward function.

Reward engineering can also become easier with this ap-

proach. Consider for instance the task of choosing the values

for $c_{time}$ and $c_{success}$. On the one hand, if $c_{time} > c_{success}$, a regular RL agent might choose to crash in order to avoid accumulating time penalties, ignoring the reward for completing the task. On the other hand, a constrained RL agent would still behave reasonably because of the safety constraint.

VI. EXPERIMENTAL ANALYSIS

In this section, we empirically compare two ways of tackling the automated driving task: normal reinforcement learning and constrained reinforcement learning. The

goal is to validate the hypothesis that training a safe and high

performing policy using constrained RL is easier from the

user perspective than using regular RL.

A. Experimental Set-Up

For our evaluations we use the highway-env framework,

which provides environments for tactical decision-making in

different automated driving tasks [30]. In this framework, the

RL agent controls the ego vehicle, while the other vehicles

follow an Intelligent Driver Model (IDM) [31] and only

react to the ego vehicle once it enters their lane. At the beginning of each episode, vehicles are made cooperative with probability $p_{coop}$. A cooperative vehicle treats the projected position of the ego vehicle on its lane as the position of its front vehicle and opens a merging gap for the ego vehicle, subject to the comfortable deceleration limit of its IDM controller ($a_{comf\text{-}max}$). In order to simulate different traffic dynamics, we implement three different environments for the automated merging scenario:

• Low Cooperative: Low chance of having cooperative drivers with $p_{coop} = 0.3$ and early cooperative brake with $a_{comf\text{-}max} = 1.0$ m/s²

• High Cooperative: High chance of having cooperative drivers with $p_{coop} = 0.6$ and early cooperative brake with $a_{comf\text{-}max} = 1.0$ m/s²

• Low Cooperative with Late Brake: Low chance of having cooperative drivers with $p_{coop} = 0.3$ and late cooperative brake with $a_{comf\text{-}max} = 5.0$ m/s²
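The three traffic configurations listed above can be captured in a small data structure, as in the sketch below. It deliberately does not use the highway-env API; the class `TrafficConfig` and the helper `sample_cooperative_flags` are hypothetical names, and drawing the cooperative flag independently for each vehicle is an assumption about how the cooperation probability is applied.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TrafficConfig:
    name: str
    p_coop: float      # probability that a vehicle is cooperative
    a_comf_max: float  # comfortable deceleration limit of the IDM, in m/s^2


CONFIGS = [
    TrafficConfig("Low Cooperative", p_coop=0.3, a_comf_max=1.0),
    TrafficConfig("High Cooperative", p_coop=0.6, a_comf_max=1.0),
    TrafficConfig("Low Cooperative with Late Brake", p_coop=0.3, a_comf_max=5.0),
]


def sample_cooperative_flags(config: TrafficConfig, n_vehicles: int,
                             rng: random.Random) -> List[bool]:
    """Draw one cooperative/non-cooperative flag per vehicle (assumed i.i.d.)."""
    return [rng.random() < config.p_coop for _ in range(n_vehicles)]


rng = random.Random(0)  # fixed seed so that the sketch is reproducible
for config in CONFIGS:
    flags = sample_cooperative_flags(config, n_vehicles=15, rng=rng)
    print(config.name, sum(flags), "of", len(flags), "vehicles cooperative")
```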

1) Baselines: We consider three different RL agents in

order to solve each merging scenario: Normal PPO with

multiple collision penalties, PPO-Lagrangian with multiple

cost limits (we set $\alpha_\lambda = 0.05$ and update the penalty 40

times per epoch; the remaining hyperparameters are the same

as for PPO), and CPO also using multiple cost limits.

2) Metrics: We evaluate the average episode cost (Av-

erageEpCost), which is the expected accumulated cost of a

trajectory and the episode length (EpLen), which indicates

how fast the policy can ﬁnish one episode. Hence, in all the

plots lower values are better.

B. Results

1) Ease of use: As we discussed in Section IV and

Section V, ﬁnding the appropriate value to balance between

utility and safety can be a challenge. Figure 3 shows the per-

formance of the different algorithms on the Low Cooperative

Fig. 3. Training results for PPO with different safety weights (left), PPO-Lagrangian with different cost limits (middle) and CPO algorithm with different

cost limits (right). The grey dashed lines indicate the cost limits.

environment using different values for the safety weight $\lambda$ (in the case of PPO) and the safety bound $d$ (in the case of PPO-Lagrangian and CPO). On the one hand, we notice that equipping PPO with a $\lambda$ that is too low, such as 5, can lead to extremely unsafe policies, while setting it to 20 or 100 provides safer policies. On the other hand, the constrained methods can find safe policies, and PPO-Lagrangian clearly approaches the desired safety bound. This experiment makes clear that, from the user's perspective, choosing a value for $d$ is much more meaningful than choosing a value for $\lambda$, since there is an obvious connection between the safety bound $d$ and the safety level of the returned policy. Consider, for instance, a user who is willing to allow an expected cost of 0.1: after observing the results from PPO with three different values for $\lambda$ in Figure 3, it is still not clear what the value of $\lambda$ should be in that case.

Considering the results for PPO with $\lambda = 5$, we may conclude that setting $c_{time} = 0.1$ and $c_{success} = 1$ encourages the agent to terminate the episode as soon as possible, leading to more crashes and making it necessary to set $\lambda > 5$. On the other hand, we notice that the constrained agents manage to reduce the number of collisions, almost independently of the cost limit.

2) Safety satisfaction: Although both PPO-Lagrangian and CPO try to learn safe policies, according to Figure 3, PPO-Lagrangian is more successful at satisfying the cost limit $d$ specified in its configuration. This suggests that, based

on the desired safety requirement, one can directly specify

the required cost limit for a PPO-Lagrangian agent before

training without the necessity to tune the reward function

for safety satisfaction.

3) Evaluations on Different Traffic Dynamics: In traditional RL, the reward function needs to be specialized for every new environment with a different configuration. In order to study whether constrained RL can address this challenge, we first trained PPO agents with different collision penalties ($\lambda$) in their reward function in environments with different traffic dynamics. After training, we evaluated each trained policy for 100 episodes in the configured environment and compared the collision rate and average episode time of each agent in Table I and Table II. The first conclusion from these results is that the PPO agent requires a specialized collision penalty in order to learn safe behavior in each environment. In the High Cooperative environment, all PPO agents have collision rates below 5%, while in the Low Cooperative environment the PPO agent with $\lambda = 0.1$ has a 14% collision rate. Moreover, in the Low Cooperative with Late Brake environment (the most challenging configuration), only PPO agents with $\lambda \ge 5$ have collision rates below 5%. Next, we trained PPO-Lagrangian as a constrained RL agent in the three environments and compared its evaluations with those of the PPO agents. The PPO-Lagrangian agent learned policies with collision rates below 5% in all three environments with fixed parameters in the reward and cost functions ($d = 0.01$ and $\alpha_\lambda = 0.1$). The important conclusion is that the PPO-Lagrangian algorithm is not sensitive to the

TABLE I
COMPARING BASELINES IN LOW COOPERATIVE AND HIGH COOPERATIVE ENVIRONMENTS.

Env. Config: Low Cooperative
Agent               PPO    PPO    PPO    PPO    PPO    PPO    PPO-Lag
λ                   0.1    1      2.5    5      10     100    -
Collision Rate (%)  14     3      4.5    0.6    2.6    0      3.3
Avg. Time (s)       48.2   51.1   52.7   59.9   56.8   79.3   56.7

Env. Config: High Cooperative
Agent               PPO    PPO    PPO    PPO    PPO    PPO    PPO-Lag
λ                   0.1    1      2.5    5      10     100    -
Collision Rate (%)  4.6    4.3    1      4.3    2      0      0.33
Avg. Time (s)       47.5   47.3   49.4   48.1   49.1   60.3   49.1

TABLE II
COMPARING BASELINES IN LATE COOPERATIVE BRAKE ENVIRONMENT.

Env. Config: Low Cooperative with Late Brake
Agent               PPO    PPO    PPO    PPO    PPO    PPO    PPO-Lag
λ                   0.1    1      2.5    5      10     100    -
Collision Rate (%)  16     14     19.3   3.3    0.1    0      1.3
Avg. Time (s)       47.4   45     47.5   60.6   68.8   78.2   81.9

environment disturbances, and therefore the designer can put less effort into training safe RL policies in automated driving environments that may have different traffic configurations.

4) Effect of penalty learning rate $\alpha_\lambda$: We also performed a hyper-parameter analysis for the penalty learning rate $\alpha_\lambda$ of PPO-Lagrangian. We considered the values $\alpha_\lambda \in \{0.005, 0.01, 0.05, 0.1, 0.5\}$. Figure 4 shows that, overall, the PPO-Lagrangian algorithm has a low hyper-parameter sensitivity with respect to $\alpha_\lambda$ in terms of safety. That is, for all learning rates the algorithm converges to a constraint-satisfying policy. We also notice that $\alpha_\lambda$ has a more significant impact on the performance during learning, as demonstrated by the average episode length. PPO-Lagrangian finds policies with lower episode length using lower learning rates (0.01 and 0.005), indicating that the results presented in Figure 3, Table I and Table II could still be improved, while larger learning rates can increase the episode length.

5) Video Demonstration: The supplementary video com-

pares the use of different safety penalties for a penalty-based

RL model and constrained RL algorithms. Large penalties generally lead RL agents to drive safely but overly conservatively (slower), while smaller penalties lead to faster but also more reckless driving. We observe that the PPO-Lagrangian agent learns to balance safe and fast driving,

a behavior less conservative than an RL agent with large

penalties and also less reckless than an RL agent with small

penalties.

C. Limitations

Notice that constrained RL defines safety in expectation, which in our application still allows a number of collisions even after the learning is finished. To mitigate this issue, we could combine such an approach with methods that enforce hard constraints [10]–[12]. This method also does not guarantee safety while the agent is still learning; finding ways to ensure that constrained RL satisfies the safety constraints during learning is an active line of research [32]–[34]. Furthermore, more sophisticated penalty learning schedules could be investigated for the constrained RL algorithm in order to achieve faster convergence and even more adaptive policies.

Fig. 4. Training results for the PPO-Lagrangian method using different

penalty learning rates on the Low Cooperative environment with cost limit

d= 0.01, as indicated by the grey dashed line.

VII. CONCLUSION

In this paper, we addressed the challenge of learning

safe and efﬁcient policies in automated driving with RL. In

contrast to traditional RL methods that learn safe policies

by discouraging unsafe outcomes using penalties in the

reward function, we investigate a new perspective on safety

of the learned policies using constrained RL. We showed that the main drawback of traditional RL algorithms is the need for reward engineering for every specific traffic configuration (e.g. fewer cooperative drivers or different cooperative strategies) in order to learn safe policies. The

proposed methodology provides a clear interface for the

designer who only needs to set the desired cost limit for the

policy being learned instead of manually balancing safety

and utility until ﬁnding the best policy. In light of our

experiments, this helps to learn safe and efﬁcient policies

in environments with different trafﬁc dynamics using a ﬁxed

setup for the constrained RL agent.

ACKNOWLEDGMENT

This research is partly accomplished within the project

“UNICARagil” (FKZ 6EMO0287). We acknowledge the

ﬁnancial support for the project by the Federal Ministry of

Education and Research of Germany (BMBF).

REFERENCES

[1] D. Isele, R. Rahimi, A. Cosgun, K. Subramanian, and K. Fu-

jimura, “Navigating occluded intersections with autonomous

vehicles using deep reinforcement learning,” in ICRA, IEEE,

2018, pp. 2034–2039.

[2] T. Tram, A. Jansson, R. Grönberg, M. Ali, and J. Sjöberg,

“Learning negotiating behavior between cars in intersections

using deep q-learning,” in 2018 21st International Confer-

ence on Intelligent Transportation Systems (ITSC), IEEE,

2018, pp. 3169–3174.

[3] M. Bouton, A. Nakhaei, K. Fujimura, and M. J. Kochender-

fer, “Safe reinforcement learning with scene decomposition

for navigating complex urban environments,” in 2019 IEEE

Intelligent Vehicles Symposium (IV), IEEE, 2019, pp. 1469–

1476.

[4] M. Bouton, A. Nakhaei, D. Isele, K. Fujimura, and M. J.

Kochenderfer, “Reinforcement learning with iterative rea-

soning for merging in dense trafﬁc,” in 2020 IEEE 23rd

International Conference on Intelligent Transportation Sys-

tems (ITSC), 2020, pp. 1–6.

[5] D. Kamran, C. F. Lopez, M. Lauer, and C. Stiller, “Risk-

aware high-level decisions for automated driving at occluded

intersections with reinforcement learning,” in 2020 IEEE

Intelligent Vehicles Symposium (IV), IEEE, 2020, pp. 1205–

1212.

[6] M. Bouton, A. Nakhaei, K. Fujimura, and M. J. Kochender-

fer, “Cooperation-aware reinforcement learning for merging

in dense trafﬁc,” in 2019 IEEE Intelligent Transportation

Systems Conference (ITSC), IEEE, 2019, pp. 3441–3447.

[7] D. Isele, A. Nakhaei, and K. Fujimura, “Safe reinforcement

learning on autonomous vehicles,” in IROS, IEEE, 2018,

pp. 1–6.

[8] J. Garc´

ıa and F. Fern´

andez, “A comprehensive survey on

safe reinforcement learning,” JMLR, vol. 16, pp. 1437–1480,

2015.

[9] P. Wang, C.-Y. Chan, and A. de La Fortelle, “A reinforce-

ment learning based approach for automated lane change

maneuvers,” in 2018 IEEE Intelligent Vehicles Symposium

(IV), 2018, pp. 1379–1384.

[10] M. Alshiekh, R. Bloem, R. Ehlers, B. K¨

onighofer, S.

Niekum, and U. Topcu, “Safe reinforcement learning via

shielding,” in AAAI, AAAI Press, 2018, pp. 2669–2678.

[11] G. Kalweit, M. Huegle, M. Werling, and J. Boedecker, Deep Constrained Q-learning, arXiv:2003.09398, 2020.

[12] N. Jansen, B. Könighofer, S. Junges, A. Serban, and R. Bloem, “Safe reinforcement learning using probabilistic shields,” in CONCUR, ser. LIPIcs, vol. 171, 2020, 3:1–3:16.

[13] S. Mo, X. Pei, and C. Wu, “Safe reinforcement learning for autonomous vehicle using monte carlo tree search,” IEEE Transactions on Intelligent Transportation Systems, pp. 1–8, 2021.

[14] M. Bouton, J. Karlsson, A. Nakhaei, K. Fujimura, M. J. Kochenderfer, and J. Tumova, “Reinforcement learning with probabilistic guarantees for autonomous driving,” 2019, arXiv:1904.07189.

[15] N. Carrara, E. Leurent, R. Laroche, T. Urvoy, O.-A. Maillard, and O. Pietquin, “Budgeted reinforcement learning in continuous state space,” in NeurIPS, Curran Associates, Inc., 2019, pp. 9295–9305.

[16] L. Wen, J. Duan, S. E. Li, S. Xu, and H. Peng, “Safe reinforcement learning for autonomous vehicles through parallel constrained policy optimization,” in 23rd IEEE International Conference on Intelligent Transportation Systems, IEEE, 2020, pp. 1–7.

[17] E. Altman, Constrained Markov Decision Processes. CRC Press, 1999, vol. 7.

[18] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st. John Wiley & Sons, Inc., 1994.

[19] L. Brunke, M. Greeff, A. W. Hall, Z. Yuan, S. Zhou, J. Panerati, et al., “Safe Learning in Robotics: From Learning-Based Control to Safe Reinforcement Learning,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 5, no. 1, pp. 411–444, 2022.

[20] S. Ha, P. Xu, Z. Tan, S. Levine, and J. Tan, Learning to Walk in the Real World with Minimal Human Effort, arXiv:2002.08550, 2020.

[21] Q. Yang, T. D. Simão, S. H. Tindemans, and M. T. J. Spaan, “WCSAC: Worst-Case Soft Actor Critic for Safety-Constrained Reinforcement Learning,” in AAAI, 2021.

[22] ——, “Safety-constrained reinforcement learning with a distributional safety critic,” Machine Learning, pp. 1–29, 2022.

[23] A. Ray, J. Achiam, and D. Amodei, Benchmarking Safe Exploration in Deep Reinforcement Learning, https://cdn.openai.com/safexp-short.pdf, 2019.

[24] J. Achiam, D. Held, A. Tamar, and P. Abbeel, “Constrained policy optimization,” in ICML, PMLR, 2017, pp. 22–31.

[25] Y. Liu, J. Ding, and X. Liu, “Ipo: Interior-point policy optimization under constraints,” in AAAI, 2020, pp. 4940–4947.

[26] T.-Y. Yang, J. Rosca, K. Narasimhan, and P. J. Ramadge, “Projection-based constrained policy optimization,” in ICLR, 2020.

[27] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, Proximal Policy Optimization Algorithms, arXiv:1707.06347, 2017.

[28] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust Region Policy Optimization,” in ICML, JMLR.org, 2015, pp. 1889–1897.

[29] D. P. Bertsekas, Constrained optimization and Lagrange multiplier methods. Academic press, 1982, vol. 1.

[30] E. Leurent, An environment for autonomous driving decision-making, https://github.com/eleurent/highway-env, 2018.

[31] M. Treiber, A. Hennecke, and D. Helbing, “Congested Traffic States in Empirical Observations and Microscopic Simulations,” Phys. Rev. E, vol. 62, pp. 1805–1824, 2 2000.

[32] T. Liu, R. Zhou, D. Kalathil, P. R. Kumar, and C. Tian, “Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs,” in NeurIPS, 2021, pp. 17183–17193.

[33] T. D. Simão, N. Jansen, and M. T. J. Spaan, “AlwaysSafe: Reinforcement Learning Without Safety Constraint Violations During Training,” in AAMAS, IFAAMAS, 2021, pp. 1226–1235.

[34] Q. Bai, A. S. Bedi, M. Agarwal, A. Koppel, and V. Aggarwal, “Achieving Zero Constraint Violation for Constrained Reinforcement Learning via Primal-Dual Approach,” in AAAI, AAAI Press, 2022, pp. 3682–3689.
