Multi-Objective Distributional Value Iteration
Conor F. Hayes
National University of Ireland Galway (IE)
c.hayes13@nuigalway.ie
Diederik M. Roijers
Vrije Universiteit Brussel (BE)
& HU Univ. of Appl. Sci. Utrecht (NL)
Enda Howley
National University of Ireland Galway (IE)
Patrick Mannion
National University of Ireland Galway (IE)
ABSTRACT
In sequential multi-objective decision making (MODeM) settings, when the utility of a user is derived from a single execution of a policy, policies for the expected scalarised returns (ESR) criterion should be computed. In multi-objective settings, a user's preferences over objectives, or utility function, may be unknown at the time of planning. When the utility function of a user is unknown, multi-policy methods are deployed to compute a set of optimal policies. However, the state-of-the-art sequential MODeM multi-policy algorithms compute a set of optimal policies for the scalarised expected returns (SER) criterion. Algorithms that compute a set of optimal policies for the SER criterion utilise expected value vectors which cannot be used when optimising for the ESR criterion. We propose a novel multi-policy multi-objective distributional value iteration (MODVI) algorithm that replaces value vectors with distributions over the returns and computes a set of optimal policies for the ESR criterion. MODVI is evaluated using several sequential multi-objective problem domains, where, for each problem, a set of optimal policies for the ESR criterion is computed.
KEYWORDS
Multi-objective; distributional; value iteration; expected scalarised returns
1 INTRODUCTION
When making decisions in the real world, trade-offs between multiple, often conflicting, objectives must be made [44]. In many real-world decision making settings, a policy is only executed once. For example, consider a government body planning to implement a tax incentive on imported electric vehicles. The tax incentive would increase sales of electric vehicles, reducing CO₂ emissions; however, it may cause the sales of domestically produced petrol/diesel vehicles to plummet, resulting in local unemployment. The tax incentive will only be implemented once and, therefore, the government body must carefully consider the effects and likelihood of all potential outcomes. The current state-of-the-art multi-objective decision making (MODeM) literature focuses almost exclusively on computing policies that are optimal over multiple executions. Therefore, to fully utilise MODeM in the real world, we must develop algorithms to compute a policy, or set of policies, that are optimal given the single-execution nature of the problem.
In MODeM, a policy, or set of policies, is computed to maximise the user's preferences over objectives, or utility function. However, the user's utility function is often unknown at the time of planning [37]. Therefore, we are deemed to be in the unknown utility function scenario [22], where a set of optimal policies must be computed and returned to the user. Once the user's utility function becomes known, the user can select a policy from the computed set of optimal policies that best reflects their preferences [37].

(This paper extends our AAMAS 2022 extended abstract [21].)
MODeM distinguishes between two optimality criteria. In scenarios where the utility of a user is derived from multiple executions of a policy, the scalarised expected returns (SER) criterion should be optimised [22]. In scenarios where the utility of a user is derived from a single execution of a policy, the expected scalarised returns (ESR) criterion should be optimised [19, 20]. The SER criterion is the most commonly used optimality criterion in the sequential multi-objective planning literature [38]. In contrast to the SER criterion, the ESR criterion has been understudied by the single-agent MODeM community, with some exceptions [19, 20, 33, 36, 43].
The majority of multi-policy MODeM algorithms are designed to compute a set of optimal policies for the SER criterion [11, 17, 49]. However, if the utility function of a user is non-linear, the policies computed under the SER criterion and the ESR criterion can be different, given the SER criterion and the ESR criterion utilise the utility function differently [39]. Moreover, sub-optimal policies can be computed if the choice of optimality criterion is not taken into consideration when planning [24]. Therefore, new methods that can compute policies for the ESR criterion must be developed.
The current state-of-the-art SER methods [30, 48] are fundamentally incompatible with the ESR criterion. When the utility function of a user is unknown, SER methods use expected value vectors to compute a set of optimal policies [48, 49]. However, expected value vectors cannot be used to compute policies under the ESR criterion [33]. Instead, a distribution over the returns, or return distribution, must be maintained to compute policies for the ESR criterion [23]. Given that, in the real world, policies are often only executed once, a user must have sufficient information about the potential positive or negative outcomes a policy may have. Maintaining a distribution over the returns for each computed policy ensures a user has sufficient information to take the potential outcomes into consideration at decision time [19, 20]. Utilising a distribution over the returns ensures the ESR criterion can be considered in real-world decision making scenarios.
In Section 3, we highlight why multi-policy methods for the SER criterion cannot be used for the ESR criterion and show why maintaining a distribution over the returns is necessary to compute a set of optimal policies under the ESR criterion. In Section 4, we present a novel multi-objective distributional value iteration (MODVI) algorithm that computes a set of optimal policies for the ESR criterion in scenarios when the utility function of a user is unknown at the time of planning. In Section 5, we show MODVI can compute a set of optimal policies for the ESR criterion using two sequential multi-objective benchmark problems, and show how these could be visualised for a user. Finally, we show that MODVI can compute a set of optimal policies for the ESR criterion in a practical real-world problem domain.

Figure 1: The unknown utility function scenario [22]. An MOMDP algorithm computes a solution set during the planning or learning phase; the user selects a single solution during the selection phase; the selected solution is then executed during the execution phase.
2 BACKGROUND
In Section 2, we formally define multi-objective Markov decision processes, the unknown utility function scenario, and commonly studied optimality criteria in multi-objective decision making.
2.1 Multi-Objective Markov Decision Processes
A multi-objective Markov decision process (MOMDP) is a tuple, $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, \gamma, \mathbf{R})$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the set of actions, $\mathcal{T}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]$ is the probabilistic transition function, $\gamma$ is the discount factor, and $\mathbf{R}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}^n$ is the probabilistic vectorial reward function for each of the $n$ objectives. An agent acts according to a policy $\pi: \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$: given a state, actions are selected according to a certain probability distribution.
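For concreteness, a tabular MOMDP of this form can be held directly in code. The sketch below is illustrative only: the field names and container types (nested dictionaries for the transition and reward functions) are our own assumptions and not part of the formal definition above; it is reused by later sketches in this paper.

```python
from typing import Dict, List, Tuple
import numpy as np

State = int
Action = int

class MOMDP:
    """A minimal tabular MOMDP container (illustrative sketch).

    transitions[s][a] maps each successor state s' to T(s' | s, a).
    rewards[(s, a, s_next)] is the n-dimensional reward vector r_{s,a,s'}.
    """

    def __init__(self,
                 states: List[State],
                 actions: List[Action],
                 transitions: Dict[State, Dict[Action, Dict[State, float]]],
                 rewards: Dict[Tuple[State, Action, State], np.ndarray],
                 gamma: float,
                 n_objectives: int):
        self.states = states
        self.actions = actions
        self.transitions = transitions
        self.rewards = rewards
        self.gamma = gamma
        self.n_objectives = n_objectives
```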
2.2 The Unknown Utility Function Scenario
In MODeM, a user's preferences over objectives can be modelled as a utility function [37]. However, a user's utility function is often unknown at the time of planning. In the taxonomy of MODeM, this is known as the unknown utility function scenario, where a set of optimal policies must be computed and returned to the user [37]. Figure 1 outlines the three phases in the unknown utility function scenario: the planning phase, the selection phase, and the execution phase [22]. During the planning phase, a multi-policy algorithm [41] is deployed to compute a set of policies that are optimal for all possible utility functions [50]. The set of optimal policies is then returned to the user. During the selection phase, the user selects a policy from the computed set of optimal policies according to their preferences. Finally, during the execution phase, the selected policy is executed.
2.3 Optimality Criteria in Multi-Objective
Decision Making
When applying a user's utility function, the MODeM literature distinguishes between two optimality criteria. Calculating the expected value of the return of a policy before applying the utility function leads to the scalarised expected returns (SER) optimisation criterion:

$$V^\pi_u = u\left(\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \mathbf{r}_t \,\Big|\, \pi, \mu_0\right]\right). \quad (1)$$

In scenarios where the utility of a user is derived from the expected outcome over multiple executions of a policy, the SER criterion should be optimised [22]. SER is the most commonly used criterion in the multi-objective (single agent) planning literature [48, 49]. For SER, a set of non-dominated policies that are optimal for all possible utility functions is known as a coverage set. Applying the utility function to the returns and then calculating the expected value leads to the ESR optimisation criterion:

$$V^\pi_u = \mathbb{E}\left[u\left(\sum_{t=0}^{\infty} \gamma^t \mathbf{r}_t\right) \,\Big|\, \pi, \mu_0\right]. \quad (2)$$

In scenarios where the utility function of a user is derived from single executions of a policy, the ESR criterion should be optimised [22]. The ESR criterion is the most commonly used criterion in the game theory literature on multi-objective games [32].
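To make the distinction between Equations 1 and 2 concrete, the sketch below evaluates both criteria for a single policy whose returns follow a small discrete distribution. The utility function here is a hypothetical, non-linear example of our own choosing; the returns and probabilities happen to match lottery L1 of Table 1 in Section 3.

```python
import numpy as np

def ser_value(returns, probs, utility):
    """SER (Equation 1): apply the utility function to the expected return vector."""
    expected_return = np.sum(probs[:, None] * returns, axis=0)
    return utility(expected_return)

def esr_value(returns, probs, utility):
    """ESR (Equation 2): apply the utility function to each return, then take the expectation."""
    return float(np.sum(probs * np.array([utility(r) for r in returns])))

# A hypothetical non-linear utility function (an assumption for illustration only).
utility = lambda v: v[0] * v[1]
returns = np.array([[8.0, 2.0], [6.0, 1.0]])   # possible returns of one policy execution
probs = np.array([0.6, 0.4])                   # their probabilities
print(ser_value(returns, probs, utility))      # u(E[R]) = u((7.2, 1.6)) = 11.52
print(esr_value(returns, probs, utility))      # E[u(R)] = 0.6*16 + 0.4*6 = 12.0
```

Because the utility function is non-linear, the two criteria assign different values to the same policy, which is why the choice of criterion matters.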
The current state-of-the-art multi-policy MODeM methods fo-
cus almost exclusively on the SER criterion [
48
,
49
], leaving the
ESR criterion largely understudied [
19
,
20
,
26
]. Given that the SER
criterion and the ESR criterion utilise the utility function dierently,
SER methods cannot be used to compute a set of optimal policies
for the ESR criterion. Additionally, a set of optimal policies under
the SER criterion can exclude policies that are optimal under the
ESR criterion [
24
]. In all decision-making problems where a policy
is only executed once, the ESR criterion must be utilised. As such
problems are salient [
22
], new methods to compute a set of optimal
policies for the ESR criterion must be developed to ensure optimal
decision making in the real world.
3 EXPECTED SCALARISED RETURNS WITH
UNKNOWN UTILITY FUNCTIONS
The choice of optimality criterion in MODeM has implications for the policies computed. Recently, it has been shown that if a user's utility function is non-linear, the policies computed under the SER criterion and the ESR criterion can be different [39]¹. Moreover, sets of policies that are optimal under the SER criterion can potentially exclude policies that are optimal under the ESR criterion [24]. If the optimality criterion is not carefully chosen, one could potentially exclude policies that could lead to a higher utility.

SER methods cannot be used to compute policies for the ESR criterion. This is because SER methods determine optimality on the basis of expected value vectors [53]; these are insufficient to determine optimality in ESR settings, as we demonstrate with the example below. To highlight why different methods must be used, consider the lotteries $L_1$ and $L_2$ in Table 1. In this example the utility function, $u$, is unknown. To determine which lottery to play in Table 1 when optimising for the SER criterion, the expected value vectors for $L_1$ and $L_2$ must be computed first (see Equation 1):

$$\mathbb{E}(L_1) = 0.6\,(8, 2) + 0.4\,(6, 1) = (4.8, 1.2) + (2.4, 0.4) = (7.2, 1.6), \qquad u(\mathbb{E}(L_1)) = u((7.2, 1.6)),$$
$$\mathbb{E}(L_2) = 0.9\,(5, 1) + 0.1\,(8, 0) = (4.5, 0.9) + (0.8, 0) = (5.3, 0.9), \qquad u(\mathbb{E}(L_2)) = u((5.3, 0.9)).$$

Given that the utility function is unknown, Pareto dominance [31] can be used to define a partial ordering over expected value vectors for all monotonically increasing utility functions. For example, methods like [48–50] compute a set of policies known as the Pareto front, which are optimal under the SER criterion.

¹ It is important to note that if the utility function is linear, the distinction between SER and ESR does not exist [23, 39]. Additionally, multi-policy approaches that compute a set of optimal policies using linear scalarisation weights [5, 47] fail to locate policies in non-convex regions of the Pareto front [45].

Table 1: Lottery $L_1$ has two possible returns, (8, 2) with probability 0.6 and (6, 1) with probability 0.4. Lottery $L_2$ has two possible returns, (5, 1) with probability 0.9 and (8, 0) with probability 0.1.

    L1:  P(L1 = R)   R          L2:  P(L2 = R)   R
         0.6         (8, 2)          0.9         (5, 1)
         0.4         (6, 1)          0.1         (8, 0)
To determine which lottery to play while optimising for the ESR criterion, the utility function must first be applied to each possible return, and then the expected utility can be computed (see Equation 2):

$$\mathbb{E}(u(L_1)) = 0.6\,u((8, 2)) + 0.4\,u((6, 1)),$$
$$\mathbb{E}(u(L_2)) = 0.9\,u((5, 1)) + 0.1\,u((8, 0)).$$

Given the utility function is unknown, it is impossible to compute the expected utility. Instead, a distribution over the returns received from a policy execution must be maintained in order to optimise for the ESR criterion. Maintaining a distribution over the returns ensures the expected utility can be computed once the user's utility function becomes known during the selection phase. Therefore, while computing a set of optimal policies under the ESR criterion, a distribution over the returns must be maintained to determine optimality.
Prior to this work, no algorithm existed to compute sets of optimal policies in sequential settings for the ESR criterion when the utility function is unknown. Therefore, new methods must be formulated that compute a set of optimal policies for the ESR criterion in sequential MODeM settings in the unknown utility function scenario.
Recently, a new solution concept for ESR with unknown utility functions, called the ESR set, was proposed by Hayes et al. [23, 24]. However, their work did not propose any algorithms to compute ESR sets for sequential decision making problems. Hayes et al. [23, 24] define a multi-objective return distribution, $\mathbf{z}^\pi$, which represents the distribution over returns for a policy, $\pi$, such that

$$\mathbb{E}\,\mathbf{z}^\pi = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \mathbf{r}_t \,\Big|\, \pi, \mu_0\right]. \quad (3)$$

A return distribution² is a distribution over the returns of a random vector when a policy, $\pi$, is executed [23].
Hayes et al. [23, 24] define ESR dominance, which gives a partial ordering over return distributions, where each return distribution is associated with a policy that could be executed. ESR dominance builds on the principles of first-order stochastic dominance [6, 18] in multivariate settings [4, 40]. Stochastic dominance gives a partial ordering over random variables and random vectors. Stochastic dominance has been used in economics [12], finance [3, 7] and game theory [15] to make decisions under uncertainty.

² The term value distribution is used in [8, 23, 33]. However, a value distribution is a distribution over the returns, not over values. Therefore, we prefer the term return distribution.
To calculate ESR dominance, the cumulative distribution function (CDF) of the given return distributions must be calculated. For a return distribution $\mathbf{z}^\pi$, the CDF of $\mathbf{z}^\pi$ is denoted by $F_{\mathbf{z}^\pi}$. A return distribution $\mathbf{z}^\pi$ ESR dominates a return distribution $\mathbf{z}^{\pi'}$ if the following is true:

$$\mathbf{z}^\pi >_{ESR} \mathbf{z}^{\pi'} \iff \forall \mathbf{v}: F_{\mathbf{z}^\pi}(\mathbf{v}) \leq F_{\mathbf{z}^{\pi'}}(\mathbf{v}) \;\wedge\; \exists \mathbf{v}: F_{\mathbf{z}^\pi}(\mathbf{v}) < F_{\mathbf{z}^{\pi'}}(\mathbf{v}). \quad (4)$$

Hayes et al. [23] prove that if a return distribution $\mathbf{z}^\pi$ ESR dominates a return distribution $\mathbf{z}^{\pi'}$, then $\mathbf{z}^\pi$ has a higher expected utility than $\mathbf{z}^{\pi'}$ for all strictly monotonically increasing utility functions, $u$:

$$\mathbf{z}^\pi >_{ESR} \mathbf{z}^{\pi'} \implies \mathbb{E}(u(\mathbf{z}^\pi)) > \mathbb{E}(u(\mathbf{z}^{\pi'})). \quad (5)$$

Finally, Hayes et al. [23, 24] define a set of non-dominated return distributions known as the ESR set, which is defined as follows:

$$ESR(\Pi) = \{\pi \in \Pi \mid \nexists\, \pi' \in \Pi: \mathbf{z}^{\pi'} >_{ESR} \mathbf{z}^\pi\}. \quad (6)$$
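For discrete return distributions, Equation 4 can be checked directly. The sketch below is one possible implementation under our own representation (an array of support vectors with a matching array of probabilities); because the CDFs of discrete distributions are piecewise constant, it is sufficient to compare them on the grid of per-objective coordinates appearing in either support.

```python
import itertools
import numpy as np

def cdf(returns, probs, v):
    """P(R <= v component-wise) for a discrete multivariate return distribution."""
    return probs[np.all(returns <= v, axis=1)].sum()

def esr_dominates(returns_a, probs_a, returns_b, probs_b):
    """Check ESR dominance (Equation 4) between two discrete return distributions."""
    both = np.vstack([returns_a, returns_b])
    axes = [np.unique(both[:, d]) for d in range(both.shape[1])]
    grid = [np.array(v) for v in itertools.product(*axes)]
    fa = np.array([cdf(returns_a, probs_a, v) for v in grid])
    fb = np.array([cdf(returns_b, probs_b, v) for v in grid])
    return bool(np.all(fa <= fb) and np.any(fa < fb))

# Example: the lotteries of Table 1.
L1 = (np.array([[8., 2.], [6., 1.]]), np.array([0.6, 0.4]))
L2 = (np.array([[5., 1.], [8., 0.]]), np.array([0.9, 0.1]))
print(esr_dominates(*L1, *L2))  # True: F_L1 lies at or below F_L2 everywhere on the grid
```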
4 MULTI-OBJECTIVE DISTRIBUTIONAL
VALUE ITERATION
To compute a set of optimal policies for the ESR criterion when the utility function of a user is unknown, we propose a novel multi-objective distributional value iteration (MODVI) algorithm. MODVI maintains sets of return distributions for each state and uses ESR dominance [23] to compute a set of non-dominated return distributions, known as the ESR set.

The state-of-the-art multi-objective decision making (MODeM) algorithms use expected value vectors to compute sets of optimal policies [48–50]. However, expected value vectors can only be used when optimising for the SER criterion. As previously highlighted, to compute a set of optimal policies for the ESR criterion, expected value vectors must be replaced with return distributions. Generally, expected value MODeM algorithms utilise the Bellman operator [9] to compute the expected value vectors for each state. Given our approach is distributional, we adopt the distributional Bellman operator [8], $\mathcal{T}^\pi_D$, to update the return distribution for each state-action pair:

$$\mathcal{T}^\pi_D\, \mathbf{z}(s, a) \overset{D}{=} \mathbf{r}_{s,a} + \gamma\, \mathbf{z}(s', a'), \quad (7)$$

where $s'$ is the successor state and $a'$ the successor action.
To represent a return distribution in multi-objective settings, we use a multivariate categorical distribution similar to the distributions used by Reymond et al. [33] and Bellemare et al. [8]. The categorical distribution is parameterised by a number of atoms, $N \in \mathbb{N}$, where the distribution has a dimension per objective, $n$. The atoms outline the width of each category and are bounded by the minimum returns, $\mathbf{R}_{min}$, and maximum returns, $\mathbf{R}_{max}$. The multivariate categorical distribution has a set of atoms defined as follows [33]:

$$\{\mathbf{z}_{i \ldots k} = (\mathbf{R}_{min_0} + i\Delta \mathbf{z}_0, \ldots, \mathbf{R}_{min_n} + k\Delta \mathbf{z}_n) : 0 \leq i < N, \ldots, 0 \leq k < N\}, \quad (8)$$

where each objective, $b$, has a separate $\mathbf{R}_{min_b}, \mathbf{R}_{max_b}$ for $0 < b \leq n$, and $\Delta \mathbf{z} = \frac{\mathbf{R}_{max} - \mathbf{R}_{min}}{N - 1}$. The distribution is a set of $N$ discrete categories, where each category, $p_i$, represents the probability of receiving a return [33]. To ensure the distribution is an accurate representation of the returns of the execution of a policy, it is crucial that a sufficient number of atoms is selected to cover the range of values from $\mathbf{R}_{min}$ to $\mathbf{R}_{max}$. For example, if $\gamma = 1$ and reward values are expected to be integers in the range $\mathbf{R}_{min} = [0, 0]$ to $\mathbf{R}_{max} = [1, 10]$, $N = 11$ is the required value to ensure that the distribution is represented without aliasing between different reward levels.
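As an illustration of Equation 8, the sketch below builds the atom grid for a two-objective categorical distribution using the $N = 11$ example above. The function and variable names are our own, and representing a return distribution as a dense probability tensor over the atom grid is only one possible design choice.

```python
import numpy as np

def build_atoms(r_min, r_max, n_atoms):
    """Per-objective atom locations for the multivariate categorical
    distribution of Equation 8 (illustrative sketch)."""
    r_min, r_max = np.asarray(r_min, float), np.asarray(r_max, float)
    delta = (r_max - r_min) / (n_atoms - 1)
    # axes[b][i] = R_min_b + i * delta_b, for 0 <= i < N
    axes = [r_min[b] + np.arange(n_atoms) * delta[b] for b in range(len(r_min))]
    return axes, delta

axes, delta = build_atoms(r_min=[0, 0], r_max=[1, 10], n_atoms=11)
# A return distribution is then a probability tensor over the atom grid,
# e.g. probs[i, k] = P(return == (axes[0][i], axes[1][k])).
probs = np.zeros((11, 11))
probs[0, 0] = 1.0  # all probability mass on the return [0, 0]
```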
To update the multivariate categorical distribution, we utilise the state space, action space and reward function of the model. During an update of the multivariate categorical distribution, we iterate over each atom, $j$, for each objective. To update the return distribution, $\mathbf{z}_s$, for state $s$, we compute the distributional Bellman update $\hat{\mathcal{T}} \mathbf{z}_{s', j} = \mathbf{r}_{s,a,s'} + \gamma \mathbf{z}_{s', j}$ for each atom $j$, for a given reward $\mathbf{r}_{s,a,s'}$ and the return distribution, $\mathbf{z}_{s'}$, of state $s'$. We then distribute the probability, $p_j(\mathbf{z}_{s'})$, of atom $j$ of the return distribution in state $s'$ to the corresponding atom of the updated return distribution, $\mathbf{z}_s$, for state $s$. Therefore, the return distribution, $\mathbf{z}_s$, for state $s$ is equivalent to the return distribution, $\mathbf{z}_{s'}$, in state $s'$, shifted relative to the reward, $\mathbf{r}_{s,a,s'}$.
At each iteration, $k$, of MODVI, for each state, $s$, and action, $a$, a set of optimal return distributions is backed up once. In Equation 9, the Bellman operator has been replaced with the distributional Bellman operator [8]:

$$\mathbf{Q}_{k+1}(s, a) \leftarrow \bigoplus_{s'} T(s' \mid s, a)\, \big[\mathbf{r}_{s,a,s'} + \gamma \mathbf{Z}_k(s')\big], \quad (9)$$

where $\mathbf{Q}_{k+1}(s, a)$ and $\mathbf{Z}_k(s')$ represent sets of return distributions, $\oplus$ denotes the cross-sum between sets of return distributions, and $T(s' \mid s, a)$ represents the probability of transitioning to state $s'$ from state $s$ after taking action $a$.

During a distributional Bellman backup, each return distribution, $\mathbf{z}_{s'}$, in the set $\mathbf{Z}_k(s')$, is updated with the reward, $\mathbf{r}_{s,a,s'}$, for action, $a$, in state, $s$, as follows: $\{\mathbf{r}_{s,a,s'} + \gamma \mathbf{z}_{s'} : \mathbf{z}_{s'} \in \mathbf{Z}_k(s')\}$. Each updated return distribution in the set for state $s'$ is then multiplied by the transition probability, $T(s' \mid s, a)$. The cross-sum over the resulting sets of updated return distributions is then computed across all possible next states, $s'$. The cross-sum between two sets of return distributions, $\mathbf{X} \oplus \mathbf{Y}$, is defined as follows: $\{\mathbf{x} + \mathbf{y} : \mathbf{x} \in \mathbf{X} \wedge \mathbf{y} \in \mathbf{Y}\}$, where $\mathbf{x}$ and $\mathbf{y}$ are return distributions. For a detailed overview of how a set of return distributions for an action in an MOMDP can be computed, please consider the example outlined in Figure 2.
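The worked example of Figure 2 can be reproduced step by step with a very simple representation. In the sketch below, a return distribution is a dictionary mapping a return tuple to its probability; this representation is our own choice (the paper itself uses the multivariate categorical distribution), but it makes the shift, weighting and cross-sum operations of Figure 2(c)-(f) explicit.

```python
import itertools

def shift(dist, reward, gamma=1.0):
    """Apply r + gamma * z to every return in a distribution (Figure 2(c))."""
    return {tuple(r + gamma * z for r, z in zip(reward, ret)): p
            for ret, p in dist.items()}

def scale(dist, prob):
    """Weight a distribution by the transition probability T(s'|s, a) (Figure 2(d))."""
    return {ret: p * prob for ret, p in dist.items()}

def cross_sum(set_x, set_y):
    """Cross-sum of two sets of (already weighted) return distributions (Figure 2(e)):
    every pair of distributions is merged by summing probabilities of equal returns."""
    result = []
    for x, y in itertools.product(set_x, set_y):
        merged = dict(x)
        for ret, p in y.items():
            merged[ret] = merged.get(ret, 0.0) + p
        result.append(merged)
    return result

# Figure 2(b): sets of return distributions at s1 and s2.
X = [{(0, 1): 0.7, (2, 0): 0.3}, {(2, 1): 0.5, (2, 2): 0.5}]
Y = [{(1, 0): 0.75, (0, 2): 0.25}, {(0, 1): 0.9, (3, 0): 0.1}]

# Action a: s0 -> s1 with probability 0.9 and reward (1, 0);
#           s0 -> s2 with probability 0.1 and reward (0, 0).
X_hat = [scale(shift(z, (1, 0)), 0.9) for z in X]
Y_hat = [scale(shift(z, (0, 0)), 0.1) for z in Y]
Z = cross_sum(X_hat, Y_hat)  # the set {z1, z2, z3, z4} of Figure 2(f)
```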
To compute a set of ESR non-dominated policies for each state, we define an algorithm known as ESRPrune (Algorithm 1), which computes a set of ESR non-dominated policies by removing ESR-dominated return distributions from a given set:

$$\mathbf{Z}_{k+1}(s) \leftarrow \text{ESRPrune}\left(\bigcup_a \mathbf{Q}_{k+1}(s, a)\right). \quad (10)$$

Equation 10 calculates the set of return distributions for a given state, $s$, by taking the union of each set of return distributions over each action, $a$. The resulting set of return distributions is then passed to the ESRPrune algorithm as input. ESRPrune utilises ESR dominance as defined by Hayes et al. [23, 24] (see Equation 4). Like Pareto dominance, ESR dominance is transitive [52]; therefore we can apply ESRPrune in sequence.
Figure 2: A worked example outlining the necessary steps to compute a set of return distributions for an MOMDP with stochastic state transitions.

(a) An action, $a$, in an MOMDP with stochastic state transitions. States $s_1$ and $s_2$ have sets of non-dominated return distributions $\mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2\}$ and $\mathbf{Y} = \{\mathbf{y}_1, \mathbf{y}_2\}$. For action $a$, transitioning from $s_0$ to $s_1$ occurs with a probability of 0.9 and a reward of [1, 0] is received. For action $a$, transitioning from $s_0$ to $s_2$ occurs with a probability of 0.1 and a reward of [0, 0] is received.

(b) The return distributions $\mathbf{x}_1, \mathbf{x}_2, \mathbf{y}_1$ and $\mathbf{y}_2$ in the sets of policies for $s_1$ and $s_2$. To compute a set of policies for state $s_0$, the distributional Bellman operator is utilised (Equation 9).

    π    r1  r2  P(r1, r2)        π    r1  r2  P(r1, r2)
    x1   0   1   0.7              y1   1   0   0.75
         2   0   0.3                   0   2   0.25
    x2   2   1   0.5              y2   0   1   0.9
         2   2   0.5                   3   0   0.1

(c) The reward, $\mathbf{r}_{s,a,s'}$, is used to update each return distribution for states $s_1$ and $s_2$; for example, $\dot{\mathbf{x}}_1 = \mathbf{r}_{s,a,s'} + \gamma \mathbf{x}_1$. For this example $\gamma = 1$.

    π    r1  r2  P(r1, r2)        π    r1  r2  P(r1, r2)
    ẋ1   1   1   0.7              ẏ1   1   0   0.75
         3   0   0.3                   0   2   0.25
    ẋ2   3   1   0.5              ẏ2   0   1   0.9
         3   2   0.5                   3   0   0.1

(d) Each return distribution for $s_1$ and $s_2$ is then multiplied by the transition probabilities, $T(s' \mid s, a)$; for example, $\hat{\mathbf{x}}_1 = \dot{\mathbf{x}}_1 \times T(s' \mid s, a)$.

    π    r1  r2  P(r1, r2)        π    r1  r2  P(r1, r2)
    x̂1   1   1   0.63             ŷ1   1   0   0.075
         3   0   0.27                  0   2   0.025
    x̂2   3   1   0.45             ŷ2   0   1   0.09
         3   2   0.45                  3   0   0.01

(e) A set of return distributions, $\mathbf{Z}$, is computed for state $s_0$. The cross-sum, $\oplus$, is utilised to sum all combinations of return distributions from the previously updated sets: $\mathbf{Z} = \mathbf{X} \oplus \mathbf{Y} = \{\hat{\mathbf{x}} + \hat{\mathbf{y}} : \hat{\mathbf{x}} \in \mathbf{X} \wedge \hat{\mathbf{y}} \in \mathbf{Y}\}$, where $\hat{\mathbf{x}}$ and $\hat{\mathbf{y}}$ are return distributions. The resulting set contains $\mathbf{z}_1 = \hat{\mathbf{x}}_1 + \hat{\mathbf{y}}_1$, $\mathbf{z}_2 = \hat{\mathbf{x}}_1 + \hat{\mathbf{y}}_2$, $\mathbf{z}_3 = \hat{\mathbf{x}}_2 + \hat{\mathbf{y}}_1$ and $\mathbf{z}_4 = \hat{\mathbf{x}}_2 + \hat{\mathbf{y}}_2$.

(f) The set of return distributions, $\mathbf{Z}$, at state $s_0$. $\mathbf{Z}$ will be passed to the ESRPrune algorithm.

    π    r1  r2  P(r1, r2)
    z1   1   1   0.63
         3   0   0.27
         1   0   0.075
         0   2   0.025
    z2   1   1   0.63
         3   0   0.28
         0   1   0.09
    z3   3   1   0.45
         3   2   0.45
         1   0   0.075
         0   2   0.025
    z4   3   1   0.45
         3   2   0.45
         0   1   0.09
         3   0   0.01
To compute ESR dominance, the cumulative distribution function (CDF) of each return distribution in the given set must be calculated. ESRPrune iterates over the given set of return distributions and compares the CDFs of the return distributions to determine which are ESR non-dominated. The return distributions that are ESR dominated are removed from the set. A set of non-dominated return distributions is known as the ESR set [23].
Algorithm 1: ESRPrune
 1: Input: Z, a set of return distributions
 2: Z* ← ∅
 3: while Z ≠ ∅ do
 4:     z ← the first element of Z
 5:     for z' ∈ Z do
 6:         if z' >ESR z then
 7:             z ← z'
 8:         end
 9:     end
10:     Remove z and all return distributions ESR-dominated by z from Z
11:     Add z to Z*
12: end
13: Return Z*
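A possible implementation of ESRPrune is sketched below, reusing the esr_dominates check sketched in Section 3 and the dictionary-based return distributions used in the Figure 2 sketch; the helper names and the conversion step are our own assumptions rather than part of the algorithm as stated.

```python
import numpy as np

def _as_arrays(dist):
    """Convert a {return tuple: probability} dict to the (returns, probs)
    arrays expected by `esr_dominates` (sketched in Section 3)."""
    returns = np.array(list(dist.keys()), dtype=float)
    probs = np.array(list(dist.values()), dtype=float)
    return returns, probs

def esr_prune(candidates):
    """Remove ESR-dominated return distributions: a sketch of Algorithm 1."""
    def dominates(a, b):
        ra, pa = _as_arrays(a)
        rb, pb = _as_arrays(b)
        return esr_dominates(ra, pa, rb, pb)

    remaining = list(candidates)
    esr_set = []
    while remaining:
        best = remaining[0]
        for other in remaining[1:]:
            if dominates(other, best):
                best = other
        # Remove `best` and everything it ESR-dominates; keep `best` in the ESR set.
        remaining = [z for z in remaining if not (z is best or dominates(best, z))]
        esr_set.append(best)
    return esr_set
```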
To highlight how ESRPrune determines which return distributions are ESR non-dominated, consider the example outlined in Figure 3(a), Figure 3(b) and Figure 4. To determine ESR dominance, ESRPrune compares a return distribution $\mathbf{X}$ with a return distribution $\mathbf{Y}$. The CDF for $\mathbf{X}$ is denoted by $F_{\mathbf{X}}$ (Figure 3(a)) and the CDF for $\mathbf{Y}$ is denoted by $F_{\mathbf{Y}}$ (Figure 3(b)). In order for $\mathbf{X} >_{ESR} \mathbf{Y}$, the following condition must be true [23]:

$$\forall \mathbf{v}: F_{\mathbf{X}}(\mathbf{v}) \leq F_{\mathbf{Y}}(\mathbf{v}) \;\wedge\; \exists \mathbf{v}: F_{\mathbf{X}}(\mathbf{v}) < F_{\mathbf{Y}}(\mathbf{v}).$$

Additionally, if $\mathbf{X} >_{ESR} \mathbf{Y}$, the following condition must also be true:

$$\forall \mathbf{v}: F_{\mathbf{X}}(\mathbf{v}) - F_{\mathbf{Y}}(\mathbf{v}) \leq 0 \;\wedge\; \exists \mathbf{v}: F_{\mathbf{X}}(\mathbf{v}) - F_{\mathbf{Y}}(\mathbf{v}) < 0.$$
Figure 3: The CDFs, $F_{\mathbf{X}}$ and $F_{\mathbf{Y}}$, of two return distributions, $\mathbf{X}$ and $\mathbf{Y}$, plotted over objectives $o_1$ and $o_2$. (a) The CDF, $F_{\mathbf{X}}$, of a return distribution $\mathbf{X}$. $\mathbf{X}$ is a multivariate normal probability distribution with mean $\mu = [1, 2]$ and covariance matrix $\Sigma = \begin{bmatrix} 0.5 & 0.25 \\ 0.25 & 0.5 \end{bmatrix}$. (b) The CDF, $F_{\mathbf{Y}}$, of a return distribution $\mathbf{Y}$. $\mathbf{Y}$ is a multivariate normal probability distribution with mean $\mu = [1, 1]$ and covariance matrix $\Sigma = \begin{bmatrix} 0.15 & 0.05 \\ 0.05 & 0.15 \end{bmatrix}$.
Figure 4: The difference in probability mass, $F_{\mathbf{X}} - F_{\mathbf{Y}}$, which is used to visualise the requirements for ESR dominance. A dotted line (a) is drawn to highlight that $F_{\mathbf{X}} - F_{\mathbf{Y}} > 0$ for at least one point. Therefore, $\mathbf{X}$ does not ESR dominate $\mathbf{Y}$.

Figure 4 highlights the difference in probability for $F_{\mathbf{X}} - F_{\mathbf{Y}}$. The dotted line in Figure 4, labelled (a), highlights that, for at least one point, $F_{\mathbf{X}} - F_{\mathbf{Y}} > 0$. Therefore, the return distribution $\mathbf{X}$ cannot ESR dominate the return distribution $\mathbf{Y}$.
Algorithm 2: MODVI
1: Initialise all return distributions and sets
2: while not converged do
3:     for s ∈ S do
4:         for a ∈ A do
5:             Q_{k+1}(s, a) ← ⊕_{s'} T(s' | s, a) [R(s, a, s') + γ Z_k(s')]
6:         end
7:         Z_{k+1}(s) ← ESRPrune(∪_a Q_{k+1}(s, a))
8:     end
9: end
Algorithm 2 describes the MODVI algorithm³. On initialisation of MODVI, a set of return distributions is generated for each state-action pair. For infinite horizon settings, each set contains a single return distribution that is randomly initialised, where an atom is selected at random and a probability mass of 1.0 is assigned to that atom. In finite horizon settings, each return distribution is initialised by assigning a probability mass of 1.0 to the atom which corresponds to the return [0, 0]. During each iteration of MODVI, a set of return distributions is computed (Algorithm 2, Line 5) for each state, $s$, and action, $a$. The union of the resulting sets of return distributions is then passed to the ESRPrune algorithm to remove the dominated return distributions. Once ESRPrune (Algorithm 2, Line 7) has been executed for the given iteration of MODVI, a set of non-dominated return distributions is backed up for the state $s$. Once MODVI has converged, a set of ESR non-dominated policies, or the ESR set, is available at the start state, $s_0$.

³ Algorithm 2 describes MODVI for infinite horizon settings. However, it is trivial to alter MODVI for finite horizon settings.
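Tying the pieces together, the sketch below condenses Algorithm 2 into a single loop, reusing the MOMDP container from Section 2.1 and the shift, scale, cross_sum and esr_prune helpers sketched above. It is an illustration under those assumptions, not a reference implementation: it runs for a fixed number of iterations rather than testing convergence, and terminal states without successors are handled crudely.

```python
from functools import reduce

def modvi(momdp, n_iterations=50):
    """A condensed sketch of Algorithm 2 (MODVI), reusing the helpers above."""
    zero = tuple([0.0] * momdp.n_objectives)
    # Z[s]: a set (list) of return distributions for state s, initialised with a
    # single distribution that places all probability mass on the zero return.
    Z = {s: [{zero: 1.0}] for s in momdp.states}
    for _ in range(n_iterations):  # a real implementation would test convergence
        Z_next = {}
        for s in momdp.states:
            Q = {}
            for a in momdp.actions:
                # Equation 9: shift by the reward, weight by T(s'|s,a), cross-sum over s'.
                per_successor = [
                    [scale(shift(z, momdp.rewards[(s, a, s_next)], momdp.gamma), p)
                     for z in Z[s_next]]
                    for s_next, p in momdp.transitions[s][a].items()]
                Q[a] = reduce(cross_sum, per_successor) if per_successor else [{zero: 1.0}]
            # Equation 10: union over actions, then remove ESR-dominated distributions.
            union = [z for a in momdp.actions for z in Q[a]]
            Z_next[s] = esr_prune(union)
        Z = Z_next
    return Z  # Z[s0] approximates the ESR set at the start state
```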
5 EXPERIMENTS
In this section we show that MODVI can compute a set of optimal
policies for the ESR criterion for two multi-objective benchmark
problems and a practical multi-objective real-world problem.
5.1 Space Traders
First, we evaluate MODVI on a multi-objective benchmark problem known as Space Traders [43]. Space Traders is a problem with nine policies and a small number of returns per policy. Therefore, it is possible to visualise each policy in the ESR set, illustrating how policies can be returned to a user during the selection phase in practice. Of course, for larger problems, the user could select subsets of the policies to visualise and compare.
Space Traders has two timesteps, two non-terminal states and three available actions per state. In Space Traders, an agent must deliver cargo from its home planet (planet A) to a destination planet (planet B) and then return home to planet A. While delivering the cargo, the agent must avoid being intercepted by space pirates. An agent acting in the Space Traders environment aims to complete the mission and minimise time. The agent receives a reward of 1 for returning home to planet A and completing the mission; in all other states the agent receives a reward of 0 for mission success. After each action, the agent receives a negative reward corresponding to the time taken to reach the next planet. Finally, after taking each action there is a probability the agent will be intercepted by space pirates. If the agent is intercepted by space pirates, the agent receives a reward of 0 for mission success and a negative time penalty, and the episode terminates. All remaining implementation details for the Space Traders environment are available in the works of Vamplew et al. [42, 43].
MODVI has the following parameters: $\gamma = 1$, $N = 23$, $\mathbf{R}_{min} = [0, -22]$ and $\mathbf{R}_{max} = [1, 0]$. Figure 7(a) outlines the six return distributions in the computed ESR set. Figure 5 plots the expected value vectors of each return distribution in the ESR set and also plots the expected value vectors for the Pareto front [43]. It is important to note that the ESR set for Space Traders contains a policy that is not present on the Pareto front. The Pareto front is a set of optimal policies for the SER criterion. Therefore, certain policies that are optimal under the ESR criterion are not optimal under the SER criterion. In real-world decision making, incorrectly selecting an optimality criterion can lead to sub-optimal performance, given some optimal policies may not be returned to the user.
During the selection phase, visualisations like Figure 5 are returned to the user to aid in their decision making. However, in Figure 5, the details of the return distributions for each policy in the ESR set are lost. Computing expected value vectors for each return distribution reduces the information available about a policy, given the information about each individual return of a policy is no longer available. As already highlighted, under the ESR criterion the utility of a user is derived from a single execution of a policy. Therefore, it is crucial that a user has sufficient information available at decision time, given a policy may only be executed once. Figure 6 visualises each potential return and the corresponding probability of the return distributions in the ESR set. In Figure 6, each return distribution has a shape, where the position of each shape corresponds to a return and the colour of each shape corresponds to the probability of receiving the return. In practice, a user would be able to choose which return distributions in the ESR set to display at a given moment, allowing the user to compare and contrast different policies individually. Figure 6 provides an intuitive aid which can be returned to a user when making decisions under the ESR criterion.

Figure 5: The expected value vectors of the return distributions in the ESR set (red) are plotted against the expected value vectors of the Pareto front (blue), over objective 1 and objective 2.

Figure 6: The return distributions in the ESR set computed by MODVI. Each shape corresponds to a computed policy in the ESR set, where the location of the shape corresponds to a return in the policy. Colours correspond to the probability of receiving the specific return when executing the policy.
5.2 Resource Gathering
Next, we evaluate MODVI on the Resource Gathering benchmark [5]. Resource Gathering is a multi-objective benchmark problem with intuitive trade-offs between objectives, motivating the need to consider the ESR criterion in real-world decision making. MODVI is evaluated on a four-objective version of Resource Gathering, where time is added as an objective. The Resource Gathering environment is shown in Figure 7(b). The agent starts in a home state and navigates the grid environment to collect the available resources ($R_1$ and $R_2$) while avoiding the two enemy states before returning home again. At each timestep, the agent receives a reward of [−1, 0, 0, 0]. If the agent returns to the home state having gathered the available resources, the agent receives one of the following rewards: [−1, 0, 10, 0] for collecting $R_1$, [−1, 0, 0, 10] for collecting $R_2$, and [−1, 0, 10, 10] for collecting both $R_1$ and $R_2$.
Figure 7: Figure 7(a), Figure 7(c) and Figure 7(d) show the return distributions in the ESR set computed by the MODVI algorithm for Space Traders, Resource Gathering and the Control Problem. Figure 7(b) shows the grid layout for the Resource Gathering environment.

(a) The return distributions in the ESR set for the Space Traders environment, with γ = 1.

    π    r1  r2   P(r1, r2)
    π1   1   -22  1.0
    π2   0   -1   0.1
         1   -16  0.9
    π3   0   -7   0.085
         0   0    0.15
         1   -8   0.765
    π4   0   0    0.15
         0   -10  0.85
    π5   0   0    0.2775
         1   0    0.7225
    π6   0   -6   0.135
         0   -1   0.1
         1   -6   0.765

(b) The grid for the Resource Gathering environment. The two enemy states must be avoided; $R_1$ and $R_2$ are the resources that need to be gathered before returning to the home state.

(c) The return distributions in the ESR set for the Resource Gathering environment, with γ = 1.

    π    r1   r2   r3  r4  P(r1, r2, r3, r4)
    π1   -18  0    10  10  1.0
    π2   -12  0    10  0   1.0
    π3   -16  -10  0   0   0.1
         -14  0    10  10  0.9
    π4   -12  -10  0   0   0.1
         -16  0    10  10  0.9
    π5   -12  -10  0   0   0.1
         -10  0    10  0   0.9
    π6   -14  -10  0   0   0.09
         -12  -10  0   0   0.1
         -12  0    10  10  0.81
    π7   -14  -10  0   0   0.09
         -12  -10  0   0   0.1
         -8   0    10  0   0.81
    π8   -10  0    0   10  1.0

(d) The return distributions in the ESR set for the Control Problem environment, with γ = 1.

    π    r1  r2     P(r1, r2)
    π1   -1  -0.06  0.0995
         -1  0.0    0.3210
         0   -0.06  0.3778
         0   0.0    0.2017
    π2   -1  -0.06  0.0597
         -1  0.0    0.3609
         0   -0.06  0.2264
         0   0.0    0.3530
    π3   -1  0.0    0.4206
         0   0.0    0.5794
The agent must avoid the enemy states. If the agent enters an enemy state, there is a 0.1 chance the agent will be attacked. If the agent is attacked in an enemy state, the agent receives a reward of [−10, −10, 0, 0]. In this case, the agent also receives a time penalty for being attacked and the episode terminates.
For Resource Gathering, the following parameters were set for MODVI: $\gamma = 1$, $N = 25$, $\mathbf{R}_{min} = [-24, -24, -14, -14]$ and $\mathbf{R}_{max} = [0, 0, 10, 10]$. Figure 7(c) outlines the return distributions in the ESR set for Resource Gathering. The ESR set contains eight policies, where each policy gathers one or both resources before returning home. An important aspect of the distributional approach applied by MODVI is that a user will have sufficient information about the trade-offs between each objective for each policy in the ESR set. For example, there is a clear trade-off between objectives in $\pi_3$ and $\pi_6$ in Figure 7(c). When considering $\pi_3$, fourteen timesteps are taken to gather both resources and the agent enters one enemy state with a 0.1 chance of being attacked. When considering $\pi_6$, twelve timesteps are taken to gather both resources, but the agent must enter both enemy states, which poses a 0.09 chance and a 0.1 chance of being attacked. Using a distributional approach ensures a user has sufficient information to understand the trade-offs between objectives across different policies. In Resource Gathering, a user looking to minimise time, while also being indifferent about being attacked, may select $\pi_6$ having fully understood the probabilities of being attacked. Therefore, having sufficient critical information available at decision time enables the user to make more informed decisions that could potentially better reflect their preferences over objectives, when compared to expected value vector based methods.
5.3 Feedtank Control Problem
Finally, we evaluate MODVI on the risk-based Feedtank Control Problem (FCP) proposed by Geibel and Wysotzki [16], which is a practical real-world problem domain that highlights how MODVI and the ESR criterion can be applied. In FCP, the agent must control the outflow of a tank that lies upstream of a distillation column, while minimising the risk of the tank overflowing. The purpose of the distillation column is to separate two substances. There are a finite number of timesteps $0, \ldots, T$, where $t$ denotes the current timestep. The feed-stream of the distillation column, or outflow of the tank, is denoted by $F(t)$ and is controlled by the agent. The tank level $y(t)$ depends on the two stochastic inflow streams characterised by the flow rates $F_1(t)$ and $F_2(t)$. The dynamics of the tank level are outlined in the following equation:

$$y(t+1) = y(t) + A^{-1}\delta(t)\left(\sum_{j=1,2} F_j(t) - F(t)\right). \quad (11)$$

The tank level must not violate the following constraint:

$$y_{min} \leq y(t) \leq y_{max}. \quad (12)$$

The inflows $F_j(t)$ are random and controlled by probability distributions (Table 2). Therefore, the inflows may also cause the tank level to violate the constraint in Equation 12. At each timestep there is also a chance, $p$, that the inflows may randomly violate the constraint in Equation 12. To take a random constraint violation into consideration, the probabilities for each inflow in Table 2 must be multiplied by $1 - p$. If the tank level violates the constraint in Equation 12, the system shuts down, the agent enters a terminal state, and receives a reward of [-1, 0]. The agent takes an action, $a$, to control the outflow of the tank. If the action does not cause a violation of Equation 12, the agent receives a reward defined as follows:

$$\mathbf{r}_{s,a,s'} = [0, -|F(t) - F_{spec}|], \quad (13)$$

where $F(t)$ is the discretised action value for the selected action, which adheres to $F_{min} \leq F(t) \leq F_{max}$, where $F_{min}$ and $F_{max}$ are the bounds of the action interval, and $F_{spec}$ is the optimal action value. The state parameters for the FCP are defined as follows:

$$s(t) = [t, y(t)]. \quad (14)$$

Finally, the initial state, $s_0$, is defined as follows: $[0, y_0]$. For the version of FCP used in this paper, there are 11 actions available to the agent and 8 timesteps.
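To make the transition and reward structure concrete, the sketch below simulates one step of the feedtank under Equations 11-13. How the two inflow streams of Table 2 and the random violation probability $p$ are sampled is abstracted into a user-supplied function, since that detail is our own simplification; the default parameter values mirror those listed for FCP below.

```python
import numpy as np

def feedtank_step(t, y, F, sample_total_inflow,
                  y_min=0.25, y_max=0.75, F_spec=0.8, A_inv_delta=0.1):
    """One step of the feedtank under Equations 11-13 (illustrative sketch).

    `sample_total_inflow(t)` should return the realised total inflow
    sum_j F_j(t) at timestep t, drawn according to Table 2 (inflow sampling
    and the random violation probability p are abstracted away here).
    Returns (next_state, reward, done)."""
    inflow = sample_total_inflow(t)
    y_next = y + A_inv_delta * (inflow - F)          # Equation 11
    if not (y_min <= y_next <= y_max):               # Equation 12 violated: shutdown
        return (t + 1, y_next), np.array([-1.0, 0.0]), True
    reward = np.array([0.0, -abs(F - F_spec)])       # Equation 13
    return (t + 1, y_next), reward, False
```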
Table 2: The inflows ($F_1$, $F_2$) for the feedtank with the corresponding probabilities ($P(F_1)$, $P(F_2)$) for each timestep, $t$.

    t    F1          P(F1)       F2          P(F2)
    1    1.70843345  0.78341724  1.85062176  0.21658276
    2    1.40843345  0.40060469  1.55062176  0.59939531
    3    0.56537807  0.83222158  0.70876186  0.16777842
    4    0.37336325  0.81546855  0.50537012  0.18453145
    5    0.11927879  0.41123876  0.31832656  0.58876124
    6    0.02762233  0.7665067   0.20677226  0.2334933
    7    0.45139631  0.62905513  0.59104772  0.37094487
    8    1.10806585  0.04634063  1.20835887  0.95365937
The following parameters were set for FCP: $[F_{min}, F_{max}] = [0.55, 1.05]$, $F_{spec} = 0.8$, $y_0 = 0.4$, $[y_{min}, y_{max}] = [0.25, 0.75]$, $A^{-1}\delta(t) = 0.1$ and $p = 0.1$. MODVI has the following parameters: $\gamma = 1$, $N = 101$, $\mathbf{R}_{min} = [-1, -3]$ and $\mathbf{R}_{max} = [0, 0]$. Figure 7(d) outlines the three return distributions computed by MODVI in the ESR set for FCP.
To provide an intuitive aid for decision making during the selection phase, the policies in the ESR set can be visualised, as in Figure 6, and returned to the user. It is important to note that $\pi_1$ and $\pi_2$ in the ESR set contain the same returns, although with different probabilities. If the expected value vectors for $\pi_1$ and $\pi_2$ are returned to a user, the user will lose all knowledge of how similar the returns for $\pi_1$ and $\pi_2$ are. Therefore, taking a distributional approach can aid in decision making, given a user has more information about the individual returns of a policy. It is also important to note that each return distribution in Figure 7(d) could easily be interpreted by a domain expert.
FCP is motivated by minimising risk as an important objective, given that violating certain constraints can shut down the distillation process. Therefore, FCP should be optimised under the ESR criterion, given a single execution of a policy is used to derive utility. If the SER criterion is used as an optimality criterion, the average risk over multiple policy executions would be computed. However, making decisions based on average risk is not sufficient for FCP, given a single violation of the constraints could lead to a system shutdown, resulting in loss of productivity and profits. Using a distributional approach for FCP under the ESR criterion ensures that a user has sufficient information about the probability of a constraint violation to make decisions that mitigate such risks.
6 RELATED WORK
In recent years, using distributions in decision making has become an active area of research for both single and multi-objective problem domains. For example, Martin et al. [28] use a single-objective distributional C51 algorithm with stochastic dominance to make risk-aware decisions. Abdolmaleki et al. [1] take a distributional approach to multi-objective decision making to compute a set of optimal policies for the SER criterion. It is important to note that taking a distributional approach to decision making is not new: methods like conditional value-at-risk (CVaR) [35] and value-at-risk (VaR) [14] have been used extensively in finance [27, 34] to make decisions under uncertainty. Beyond a distributional approach, many algorithms can compute a set of optimal policies for the SER criterion, for example, multi-objective Monte Carlo tree search [48], Pareto value iteration [49], convex hull value iteration [5] and CON-MODP [50, 51]. In contrast to the SER criterion, the ESR criterion has been largely understudied, with some exceptions. Several single-policy algorithms have been developed which can compute a single optimal policy for the ESR criterion. However, the single-policy ESR algorithms cannot compute sets of optimal policies for the ESR criterion, which heavily restricts their use in real-world decision making scenarios. Reymond et al. [33] define a multi-objective distributional actor-critic algorithm that can compute optimal policies for the ESR criterion. Roijers et al. [36] define a multi-objective policy gradient algorithm that can compute a single optimal policy for the ESR criterion. Hayes et al. [19, 20] outline a distributional Monte Carlo tree search (DMCTS) algorithm to compute policies for the ESR criterion. However, all of the highlighted methods require the utility function of a user to be known a priori. For scenarios where the utility function is unknown, Hayes et al. [23] outline a distributional algorithm that computes a set of policies for the ESR criterion in a multi-objective multi-armed bandit [13] setting. However, the work of Hayes et al. [23] is limited to bandit settings and cannot be used for sequential decision making.
7 CONCLUSION & FUTURE WORK
In this paper we propose a multi-objective distributional value iteration (MODVI) algorithm that can compute a set of optimal policies for the ESR criterion. MODVI utilises return distributions, which replace expected value vectors in multi-objective decision making. MODVI is the first algorithm that can compute a set of optimal policies under the ESR criterion in sequential multi-objective decision making settings. We show that MODVI can compute a set of optimal policies for several multi-objective benchmark problems and a practical real-world decision making problem. Because it is the first of its kind, MODVI opens up decision-theoretic planning for a key range of real-world problems.

We plan to use return distributions in multi-objective reinforcement learning (RL) settings. Model-based RL algorithms, like R-max [10], and model-free RL algorithms, like multi-objective Q-learning [46], could form the basis for new multi-objective distributional algorithms that can compute sets of policies for the ESR criterion. For MODVI, when the range of potential returns increases, maintaining a sufficient number of atoms for the return distribution requires a large amount of memory. It is expected that in larger scenarios, like [2], the range of possible potential returns would be difficult to maintain using a categorical distribution. A potential solution would be to use Dirichlet distributions [29] to represent return distributions. Finally, ESR dominance is a strict dominance criterion. In many settings, ESR dominance may produce very large sets of policies that would be optimal for all decision makers. It would be possible to relax the ESR dominance requirements by using almost stochastic dominance to generate smaller solution sets, where each policy in the set is optimal for most decision makers [25].
ACKNOWLEDGEMENTS
Conor F. Hayes is funded by the National University of Ireland Hardiman Scholarship. This research was supported by funding from the Flemish Government under the "Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen" program.
REFERENCES
[1]
Abbas Abdolmaleki, Sandy Huang, Leonard Hasenclever, Michael Neunert, Fran-
cis Song, Martina Zambelli, Murilo Martins, Nicolas Heess, Raia Hadsell, and
Martin Riedmiller. 2020. A distributional view on multi-objective policy opti-
mization. In International Conference on Machine Learning. PMLR, 11–22.
[2]
Steven Abrams, James Wambua, Eva Santermans, Lander Willem, Elise Kuylen,
Pietro Coletti, Pieter Libin, Christel Faes, Oana Petrof, Sereina A. Herzog, Philippe
Beutels, and Niel Hens. 2021. Modelling the early phase of the Belgian COVID-19
epidemic using a stochastic compartmental model and studying its implied future
trajectories. Epidemics 35 (2021), 100449. https://doi.org/10.1016/j.epidem.2021.
100449
[3]
Mukhtar M. Ali. 1975. Stochastic dominance and portfolio analysis. Journal of
Financial Economics 2, 2 (1975), 205–229. https://doi.org/10.1016/0304- 405X(75)
90005-7
[4]
Anthony B Atkinson and Francois Bourguignon. 1982. The Compari-
son of Multi-Dimensioned Distributions of Economic Status. The Re-
view of Economic Studies 49, 2 (04 1982), 183–201. https://doi.org/10.2307/
2297269 arXiv:https://academic.oup.com/restud/article-pdf/49/2/183/4720580/49-
2-183.pdf
[5]
Leon Barrett and Srini Narayanan. 2008. Learning all optimal policies with
multiple criteria. In Proceedings of the 25th international conference on Machine
learning. 41–47.
[6]
Vijay S. Bawa. 1975. Optimal rules for ordering uncertain prospects. Journal of
Financial Economics 2, 1 (1975), 95–121. https://doi.org/10.1016/0304-405X(75)
90025-2
[7]
Vijay S. Bawa. 1978. Safety-First, Stochastic Dominance, and Optimal Portfolio
Choice. The Journal of Financial and Quantitative Analysis 13, 2 (1978), 255–271.
http://www.jstor.org/stable/2330386
[8]
Marc G Bellemare, Will Dabney, and Rémi Munos. 2017. A distributional perspec-
tive on reinforcement learning. In Proceedings of the 34th International Conference
on Machine Learning-Volume 70. JMLR. org, 449–458.
[9] Richard Bellman. 1957. Dynamic programming. Courier Corporation.
[10]
Ronen I Brafman and Moshe Tennenholtz. 2002. R-max-a general polynomial
time algorithm for near-optimal reinforcement learning. Journal of Machine
Learning Research 3, Oct (2002), 213–231.
[11]
Daniel Bryce, William Cushing, and Subbarao Kambhampati. 2007. Probabilistic
planning is multi-objective. Arizona State University, Tech. Rep. ASU-CSE-07-006
(2007).
[12]
E. Choi and Stanley Johnson. 1988. Stochastic Dominance and Uncertain Price
Prospects. Center for Agricultural and Rural Development (CARD) at Iowa State
University, Center for Agricultural and Rural Development (CARD) Publications 55
(01 1988). https://doi.org/10.2307/1059583
[13]
Madalina M. Drugan and Ann Nowe. 2013. Designing multi-objective multi-
armed bandits algorithms: A study. In The 2013 International Joint Conference on
Neural Networks (IJCNN). 1–8. https://doi.org/10.1109/IJCNN.2013.6707036
[14]
Darrell Due and Jun Pan. 1997. An overview of value at risk. Journal of
derivatives 4, 3 (1997), 7–49.
[15]
Peter C Fishburn. 1978. Non-cooperative stochastic dominance games. Interna-
tional Journal of Game Theory 7, 1 (1978), 51–61.
[16]
Peter Geibel and Fritz Wysotzki. 2005. Risk-sensitive reinforcement learning
applied to control under constraints. Journal of Artificial Intelligence Research 24
(2005), 81–108.
[17]
Peichen Gong. 1992. Multiobjective dynamic programming for forest resource
management. Forest Ecology and Management 48, 1 (1992), 43–54. https://doi.
org/10.1016/0378-1127(92)90120- X
[18]
Josef Hadar and William R. Russell. 1969. Rules for Ordering Uncertain Prospects.
The American Economic Review 59, 1 (1969), 25–34. http://www.jstor.org/stable/
1811090
[19]
Conor F. Hayes, Mathieu Reymond, Diederik M. Roijers, Enda Howley, and
Patrick Mannion. 2021. Risk-Aware and Multi-Objective Decision Making with
Distributional Monte Carlo Tree Search. In: Proceedings of the Adaptive and
Learning Agents workshop at AAMAS 2021) (2021).
[20]
Conor F. Hayes, Mathieu Reymond, Diederik M. Roijers, Enda Howley, and
Patrick Mannion. 2021 In Press. Distributional Monte Carlo Tree Search for
Risk-Aware and Multi-Objective Reinforcement Learning. In Proceedings of the
20th International Conference on Autonomous Agents and MultiAgent Systems,
Vol. 2021. IFAAMAS.
[21]
Conor F. Hayes, Diederik M. Roijers, Enda Howley, and Mannion Patrick. 2022.
Decision-Theoretic Planning for the Expected Scalarised Returns. In Proceedings
of the 21st International Conference on AAMAS (2022).
[22]
Conor F. Hayes, Roxana Rădulescu, Eugenio Bargiacchi, Johan Källström,
Matthew Macfarlane, Mathieu Reymond, Timothy Verstraeten, Luisa M. Zintgraf,
Richard Dazeley, Fredrik Heintz, Enda Howley, Athirai A. Irissappane, Patrick
Mannion, Ann Nowé, Gabriel Ramos, Marcello Restelli, Peter Vamplew, and
Diederik M. Roijers. 2022. A Practical Guide to Multi-Objective Reinforcement
Learning and Planning. Autonomous Agents and Multi-Agent Systems 36, 1 (2022),
26. https://doi.org/10.1007/s10458-022- 09552-y
[23]
Conor F. Hayes, Timothy Verstraeten, Diederik M. Roijers, Enda Howley, and
Patrick Mannion. 2021. Dominance Criteria and Solution Sets for the Expected
Scalarised Returns. In Proceedings of the Adaptive and Learning Agents workshop
at AAMAS 2021.
[24]
Conor F. Hayes, Timothy Verstraeten, Diederik M. Roijers, Enda Howley, and
Patrick Mannion. 2021. Expected Scalarised Returns Dominance: A New Solution
Concept for Multi-Objective Decision Making. arXiv preprint arXiv:2106.01048
(2021).
[25] Moshe Leshno and Haim Levy. 2002. Preferred by “all” and preferred by “most”
decision makers: Almost stochastic dominance. Management Science 48, 8 (2002),
1074–1085.
[26]
Federico Malerba and Patrick Mannion. 2021. Evaluating Tunable Agents
with Non-Linear Utility Functions under Expected Scalarised Returns. In Multi-
Objective Decision Making Workshop (MODeM 2021).
[27]
Simone Manganelli and Robert F Engle. 2001. Value at risk models in finance.
(2001).
[28]
John Martin, Michal Lyskawinski, Xiaohu Li, and Brendan Englot. 2020. Stochas-
tically Dominant Distributional Reinforcement Learning. In International Confer-
ence on Machine Learning. PMLR, 6745–6754.
[29]
Ingram Olkin and Herman Rubin. 1964. Multivariate beta distributions and
independence properties of the Wishart distribution. The Annals of Mathematical
Statistics (1964), 261–269.
[30]
Michael Painter, Bruno Lacerda, and Nick Hawes. 2020. Convex Hull Monte-
Carlo Tree-Search. In Proceedings of the Thirtieth International Conference on
Automated Planning and Scheduling, Nancy, France, October 26-30, 2020. AAAI
Press, 217–225.
[31] Vilfredo Pareto. 1896. Manuel d’Economie Politique. Vol. 1. Giard, Paris.
[32]
Roxana Rădulescu, Patrick Mannion, Diederik M. Roijers, and Ann Nowé. 2020.
Multi-objective multi-agent decision making: a utility-based analysis and survey.
Autonomous Agents and Multi-Agent Systems 34, 10 (2020).
[33]
Mathieu Reymond, Conor F. Hayes, Diederik M. Roijers, Denis Steckelmacher,
and Ann Nowé. 2021. Actor-Critic Multi-Objective Reinforcement Learning
for Non-Linear Utility Functions. Multi-Objective Decision Making Workshop
(MODeM 2021) (2021).
[34]
R Tyrrell Rockafellar and Stanislav Uryasev. 2002. Conditional value-at-risk for
general loss distributions. Journal of Banking & Finance 26, 7 (2002), 1443–1471.
[35]
R Tyrrell Rockafellar, Stanislav Uryasev, et al
.
2000. Optimization of conditional
value-at-risk. Journal of risk 2, 3 (2000), 21–41.
[36]
Diederik M. Roijers, Denis Steckelmacher, and Ann Nowé. 2018. Multi-objective
Reinforcement Learning for the Expected Utility of the Return. In Proceedings of
the Adaptive and Learning Agents workshop at FAIM 2018.
[37]
Diederik M. Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley.
2013. A survey of multi-objective sequential decision-making. Journal of Artificial
Intelligence Research 48 (2013), 67–113.
[38]
Diederik M. Roijers, Shimon Whiteson, and Frans A. Oliehoek. 2015. Comput-
ing Convex Coverage Sets for Faster Multi-Objective Coordination. Journal of
Articial Intelligence Research 52 (2015), 399–443.
[39]
Roxana Rădulescu, Patrick Mannion, Yijie Zhang, Diederik Marijn Roijers, and
Ann Nowé. 2020. A utility-based analysis of equilibria in multi-objective normal
form games. The Knowledge Engineering Review 35, e32 (2020).
[40]
Songsak Sriboonchitta, Wing-Keung Wong, s Dhompongsa, and Hung Nguyen.
2009. Stochastic Dominance and Applications to Finance, Risk and Economics.
https://doi.org/10.1201/9781420082678
[41]
Peter Vamplew, Richard Dazeley, Adam Berry, Rustam Issabekov, and Evan
Dekker. 2011. Empirical evaluation methods for multiobjective reinforcement
learning algorithms. Machine Learning (2011).
[42]
Peter Vamplew, Cameron Foale, and Richard Dazeley. 2020. A Demonstration
of Issues with Value-Based Multi Objective Reinforcement Learning Under Sto-
chastic State Transitions. Adaptive and Learning Agents Workshop (AAMAS
2020).
[43]
Peter Vamplew, Cameron Foale, and Richard Dazeley. 2021. The impact of envi-
ronmental stochasticity on value-based multiobjective reinforcement learning. In
Neural Computing and Applications. https://doi.org/10.1007/s00521-021-05859- 1
[44]
Peter Vamplew, Benjamin J Smith, Johan Kallstrom, Gabriel Ramos, Roxana
Radulescu, Diederik M Roijers, Conor F Hayes, Fredrik Heintz, Patrick Mannion,
Pieter JK Libin, et al. 2021. Scalar reward is not enough: A response to Silver,
Singh, Precup and Sutton (2021). arXiv preprint arXiv:2112.15422 (2021).
[45]
Peter Vamplew, John Yearwood, Richard Dazeley, and Adam Berry. 2008. On
the Limitations of Scalarisation for Multi-objective Reinforcement Learning of
Pareto Fronts. In AI 2008: Advances in Artificial Intelligence, Wayne Wobcke and
Mengjie Zhang (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 372–378.
[46]
Kristof Van Moaert and Ann Nowé. 2014. Multi-objective reinforcement learning
using sets of Pareto dominating policies. The Journal of Machine Learning Research
15, 1 (2014), 3483–3512.
[47]
K Wakuta and K Togawa. 1998. Solution procedures for multi-objective Markov
decision processes. Optimization 43, 1 (1998), 29–46.
[48]
Weijia Wang and Michèle Sebag. 2012. Multi-objective Monte-Carlo Tree Search
(Proceedings of Machine Learning Research, Vol. 25), Steven C. H. Hoi and Wray Buntine (Eds.). PMLR, Singapore Management University, Singapore, 507–522.
[49]
DJ White. 1982. Multi-objective infinite-horizon discounted Markov decision
processes. Journal of mathematical analysis and applications 89, 2 (1982), 639–647.
[50]
Marco A. Wiering and Edwin D. de Jong. 2007. Computing Optimal Stationary
Policies for Multi-Objective Markov Decision Processes. In 2007 IEEE International
Symposium on Approximate Dynamic Programming and Reinforcement Learning.
158–165. https://doi.org/10.1109/ADPRL.2007.368183
[51]
Marco A Wiering, Maikel Withagen, and Mădălina M Drugan. 2014. Model-based
multi-objective reinforcement learning. In 2014 IEEE Symposium on Adaptive
Dynamic Programming and Reinforcement Learning (ADPRL). IEEE, 1–6.
[52]
Elmar Wolfstetter. 1999. Topics in Microeconomics: Industrial Organization, Auctions, and Incentives. Cambridge University Press. https://doi.org/10.1017/CBO9780511625787
[53]
Kyle Hollins Wray, Shlomo Zilberstein, and Abdel-Illah Mouaddib. 2015. Multi-objective MDPs with conditional lexicographic reward preferences. In Twenty-Ninth AAAI Conference on Artificial Intelligence.