Multi-Objective Distributional Value Iteration∗
Conor F. Hayes
National University of Ireland Galway (IE)
c.hayes13@nuigalway.ie
Diederik M. Roijers
Vrije Universiteit Brussel (BE)
& HU Univ. of Appl. Sci. Utrecht (NL)
Enda Howley
National University of Ireland Galway (IE)
Patrick Mannion
National University of Ireland Galway (IE)
ABSTRACT
In sequential multi-objective decision making (MODeM) settings, when the utility of a user is derived from a single execution of a policy, policies for the expected scalarised returns (ESR) criterion should be computed. In multi-objective settings, a user's preferences over objectives, or utility function, may be unknown at the time of planning. When the utility function of a user is unknown, multi-policy methods are deployed to compute a set of optimal policies. However, the state-of-the-art sequential MODeM multi-policy algorithms compute a set of optimal policies for the scalarised expected returns (SER) criterion. Algorithms that compute a set of optimal policies for the SER criterion utilise expected value vectors, which cannot be used when optimising for the ESR criterion. We propose a novel multi-policy multi-objective distributional value iteration (MODVI) algorithm that replaces value vectors with distributions over the returns and computes a set of optimal policies for the ESR criterion. MODVI is evaluated using several sequential multi-objective problem domains, where, for each problem, a set of optimal policies for the ESR criterion is computed.
KEYWORDS
Multi-objective; distributional; value iteration; expected scalarised returns
1 INTRODUCTION
When making decisions in the real world, trade-offs between multiple, often conflicting, objectives must be made [44]. In many real-world decision making settings, a policy is only executed once. For example, consider a government body planning to implement a tax incentive on imported electric vehicles. The tax incentive would increase sales of electric vehicles, reducing CO2 emissions; however, it may cause the sales of domestically produced petrol/diesel vehicles to plummet, resulting in local unemployment. The tax incentive will only be implemented once and, therefore, the government body must carefully consider the effects and likelihood of all potential outcomes. The current state-of-the-art multi-objective decision making (MODeM) literature focuses almost exclusively on computing policies that are optimal over multiple executions. Therefore, to fully utilise MODeM in the real world, we must develop algorithms to compute a policy, or set of policies, that are optimal given the single-execution nature of the problem.
In MODeM, a policy, or set of policies, is computed to maximise the user's preferences over objectives, or utility function. However, the user's utility function is often unknown at the time of planning [37]. Therefore, we are deemed to be in the unknown utility function scenario [22], where a set of optimal policies must be computed and returned to the user. Once the user's utility function becomes known, the user can select a policy from the computed set of optimal policies that best reflects their preferences [37].

∗This paper extends our AAMAS 2022 extended abstract [21].

Proc. of the Adaptive and Learning Agents Workshop (ALA 2022), Cruz, Hayes, da Silva, Santos (eds.), May 9-10, 2022, Online, https://ala2022.github.io/ . 2022.
MODeM distinguishes between two optimality criteria. In scenarios where the utility of a user is derived from multiple executions of a policy, the scalarised expected returns (SER) criterion should be optimised [22]. In scenarios where the utility of a user is derived from a single execution of a policy, the expected scalarised returns (ESR) criterion should be optimised [19, 20]. The SER criterion is the most commonly used optimality criterion in the sequential multi-objective planning literature [38]. In contrast to the SER criterion, the ESR criterion has been understudied by the single-agent MODeM community, with some exceptions [19, 20, 33, 36, 43].
The majority of multi-policy MODeM algorithms are designed to compute a set of optimal policies for the SER criterion [11, 17, 49]. However, if the utility function of a user is non-linear, the policies computed under the SER criterion and the ESR criterion can be different, given that the SER criterion and the ESR criterion utilise the utility function differently [39]. Moreover, sub-optimal policies can be computed if the choice of optimality criterion is not taken into consideration when planning [24]. Therefore, new methods that can compute policies for the ESR criterion must be developed.
The current state-of-the-art SER methods [30, 48] are fundamentally incompatible with the ESR criterion. When the utility function of a user is unknown, SER methods use expected value vectors to compute a set of optimal policies [48, 49]. However, expected value vectors cannot be used to compute policies under the ESR criterion [33]. Instead, a distribution over the returns, or return distribution, must be maintained to compute policies for the ESR criterion [23].
Given that, in the real world, policies are often only executed once, a user must have sufficient information about the potential positive or negative outcomes a policy may have. Maintaining a distribution over the returns for each computed policy ensures a user has sufficient information to take the potential outcomes into consideration at decision time [19, 20]. Utilising a distribution over the returns ensures the ESR criterion can be considered in real-world decision making scenarios.
In Section 3, we highlight why multi-policy methods for the SER criterion cannot be used for the ESR criterion and show why maintaining a distribution over the returns is necessary to compute a set of optimal policies under the ESR criterion. In Section 4, we present a novel multi-objective distributional value iteration (MODVI) algorithm that computes a set of optimal policies for the ESR criterion in scenarios when the utility function of a user is unknown at the time of planning. In Section 5, we show MODVI can compute a set of optimal policies for the ESR criterion using two sequential multi-objective benchmark problems, and show how these could be visualised for a user. Finally, we show that MODVI can compute a set of optimal policies for the ESR criterion in a practical real-world problem domain.

ALA '22, May 9-10, 2022, Online, https://ala2022.github.io/ Conor F. Hayes, Diederik M. Roijers, Enda Howley, and Patrick Mannion

[Figure 1: The unknown utility function scenario [22]. In the planning or learning phase, a MOMDP algorithm produces a solution set; in the selection phase, the user selects a single solution; in the execution phase, the selected solution is executed.]
2 BACKGROUND
In Section 2, we formally dene multi-objective Markov decision
processes, the unknown utility function scenario, and commonly
studied optimality criteria in multi-objective decision making.
2.1 Multi-Objective Markov Decision Processes
A multi-objective Markov decision process (MOMDP) is a tuple,
M=(S,A,T, 𝛾, R)
, where
S
is the state space,
A
is the set of
actions,
T:S × A × S → [0,1]
is the probabilistic transition
function,
𝛾
is the discount factor, and
R:S × A × S → R𝑛
is the
probabilistic vectorial reward function for each of the
𝑛
objectives.
An agent acts according to a policy
𝜋
:
S×A → [0,1]
. Given a state,
actions are selected according to a certain probability distribution.
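As a concrete illustration of this definition, a tabular MOMDP can be represented directly as data; the following sketch is our own, with hypothetical names, not code from the paper:

```python
# A minimal sketch of a tabular MOMDP M = (S, A, T, gamma, R);
# class and field names are illustrative, not from the paper.
from dataclasses import dataclass

@dataclass
class MOMDP:
    states: list        # S
    actions: list       # A
    T: dict             # (s, a, s') -> transition probability
    R: dict             # (s, a, s') -> reward vector, one entry per objective
    gamma: float = 1.0  # discount factor

# A two-state example with two objectives:
m = MOMDP(
    states=["s0", "s1"],
    actions=["go"],
    T={("s0", "go", "s1"): 0.9, ("s0", "go", "s0"): 0.1},
    R={("s0", "go", "s1"): (1, 0), ("s0", "go", "s0"): (0, 0)},
)

# Transition probabilities out of (s0, go) sum to one.
total = sum(p for (s, a, _), p in m.T.items() if (s, a) == ("s0", "go"))
```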
2.2 The Unknown Utility Function Scenario
In MODeM, a user’s preferences over objectives can be modelled
as a utility function [
37
]. However, a user’s utility function is often
unknown at the time of planning. In the taxonomy of MODeM, this
is known as the unknown utility function scenario, where a set of
optimal policies must be computed and returned to the user [
37
].
Figure 1 outlines the three phases in the unknown utility function
scenario: the planning phase, the selection phase, and the execution
phase [
22
]. During the planning phase a multi-policy algorithm
[
41
] is deployed to compute a set of policies that are optimal for all
possible utility functions [
50
]. The set of optimal policies is then
returned to the user. During the selection phase, the user selects a
policy from the computed set of optimal policies according to their
preferences. Finally, during the execution phase, the selected policy
is executed.
2.3 Optimality Criteria in Multi-Objective Decision Making

When applying a user's utility function, the MODeM literature distinguishes between two optimality criteria. Calculating the expected value of the return of a policy before applying the utility function leads to the scalarised expected returns (SER) optimisation criterion:

$$V^\pi_u = u\left(\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \mathbf{r}_t \,\middle|\, \pi, \mu_0\right]\right). \qquad (1)$$

In scenarios where the utility of a user is derived from the expected outcome over multiple executions of a policy, the SER criterion should be optimised [22]. SER is the most commonly used criterion in the multi-objective (single agent) planning literature [48, 49].
For SER, a set of non-dominated policies that are optimal for all possible utility functions is known as a coverage set. Applying the utility function to the returns and then calculating the expected value leads to the ESR optimisation criterion:

$$V^\pi_u = \mathbb{E}\left[u\left(\sum_{t=0}^{\infty} \gamma^t \mathbf{r}_t\right) \,\middle|\, \pi, \mu_0\right]. \qquad (2)$$

In scenarios where the utility function of a user is derived from single executions of a policy, the ESR criterion should be optimised [22]. The ESR criterion is the most commonly used criterion in the game theory literature on multi-objective games [32].
The current state-of-the-art multi-policy MODeM methods focus almost exclusively on the SER criterion [48, 49], leaving the ESR criterion largely understudied [19, 20, 26]. Given that the SER criterion and the ESR criterion utilise the utility function differently, SER methods cannot be used to compute a set of optimal policies for the ESR criterion. Additionally, a set of optimal policies under the SER criterion can exclude policies that are optimal under the ESR criterion [24]. In all decision-making problems where a policy is only executed once, the ESR criterion must be utilised. As such problems are salient [22], new methods to compute a set of optimal policies for the ESR criterion must be developed to ensure optimal decision making in the real world.
3 EXPECTED SCALARISED RETURNS WITH UNKNOWN UTILITY FUNCTIONS

The choice of optimality criterion in MODeM has implications for the policies computed. Recently, it has been shown that if a user's utility function is non-linear, the policies computed under the SER criterion and the ESR criterion can be different [39]¹. Moreover, sets of policies that are optimal under the SER criterion can potentially exclude policies that are optimal under the ESR criterion [24]. If the optimality criterion is not carefully chosen, one could potentially exclude policies that could lead to a higher utility.
SER methods cannot be used to compute policies for the ESR criterion. This is because SER methods determine optimality on the basis of expected value vectors [53]; these are insufficient to determine optimality in ESR settings, as we demonstrate with the example below. To highlight why different methods must be used, consider the lotteries $L_1$ and $L_2$ in Table 1. In this example the utility function, $u$, is unknown. To determine which lottery to play in Table 1 when optimising for the SER criterion, the expected value vectors for $L_1$ and $L_2$ must be computed first (see Equation 1):

$$\mathbb{E}(L_1) = 0.6\,(8, 2) + 0.4\,(6, 1) = (4.8, 1.2) + (2.4, 0.4) = (7.2, 1.6)$$
$$u(\mathbb{E}(L_1)) = u((7.2, 1.6))$$
$$\mathbb{E}(L_2) = 0.9\,(5, 1) + 0.1\,(8, 0) = (4.5, 0.9) + (0.8, 0) = (5.3, 0.9)$$
$$u(\mathbb{E}(L_2)) = u((5.3, 0.9))$$
Given that the utility function is unknown, Pareto dominance [31] can be used to define a partial ordering over expected value vectors for all monotonically increasing utility functions. For example, methods like [48-50] compute a set of policies known as the Pareto front, which are optimal under the SER criterion.

¹It is important to note that if the utility function is linear, the distinction between SER and ESR does not exist [23, 39]. Additionally, multi-policy approaches that compute a set of optimal policies using linear scalarisation weights [5, 47] fail to locate policies in non-convex regions of the Pareto front [45].

Table 1: Lottery $L_1$ has two possible returns: (8, 2) with probability 0.6 and (6, 1) with probability 0.4. Lottery $L_2$ has two possible returns: (5, 1) with probability 0.9 and (8, 0) with probability 0.1.

    L1: P(L1 = R)   R          L2: P(L2 = R)   R
        0.6         (8, 2)         0.9         (5, 1)
        0.4         (6, 1)         0.1         (8, 0)
To determine which lottery to play while optimising for the ESR criterion, the utility function must first be applied to each possible return, then the expected utility can be computed (see Equation 2):

$$\mathbb{E}(u(L_1)) = 0.6\,u((8, 2)) + 0.4\,u((6, 1))$$
$$\mathbb{E}(u(L_2)) = 0.9\,u((5, 1)) + 0.1\,u((8, 0))$$
Given that the utility function is unknown, it is impossible to compute the expected utility. Moreover, a distribution over the returns received from a policy execution must be maintained in order to optimise for the ESR criterion. Maintaining a distribution over the returns ensures the expected utility can be computed once the user's utility function becomes known during the selection phase. Therefore, while computing a set of optimal policies under the ESR criterion, a distribution over the returns must be maintained to determine optimality.
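The contrast between the two criteria for the lotteries of Table 1 can be sketched in a few lines of code. The non-linear utility u below is a hypothetical stand-in of our own choosing, since the actual utility function is unknown:

```python
# Sketch (not the paper's code): SER vs. ESR for lotteries L1 and L2,
# under a hypothetical non-linear utility u(v) = v[0] * v[1].

def expected_value(lottery):
    """Expected value vector: probability-weighted sum of return vectors."""
    n = len(lottery[0][1])
    return tuple(sum(p * r[i] for p, r in lottery) for i in range(n))

def expected_utility(lottery, u):
    """ESR: apply the utility to each return, then take the expectation."""
    return sum(p * u(r) for p, r in lottery)

L1 = [(0.6, (8, 2)), (0.4, (6, 1))]
L2 = [(0.9, (5, 1)), (0.1, (8, 0))]

u = lambda v: v[0] * v[1]  # hypothetical non-linear utility

# SER: utility of the expected value vector
ser_L1 = u(expected_value(L1))  # u((7.2, 1.6)) = 11.52
ser_L2 = u(expected_value(L2))  # u((5.3, 0.9)) = 4.77

# ESR: expected value of the per-return utilities
esr_L1 = expected_utility(L1, u)  # 0.6*16 + 0.4*6 = 12.0
esr_L2 = expected_utility(L2, u)  # 0.9*5  + 0.1*0 = 4.5
```

Under this particular utility both criteria prefer L1, but the numbers they assign differ, illustrating that the two criteria apply the utility function at different points.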
Prior to this work, no algorithm existed to compute sets of optimal policies in sequential settings for the ESR criterion when the utility function is unknown. Therefore, new methods must be formulated that compute a set of optimal policies for the ESR criterion in sequential MODeM settings in the unknown utility function scenario.
Recently, a new solution concept for ESR with unknown utility functions, called the ESR set, was proposed by Hayes et al. [23, 24]. However, their work did not propose any algorithms to compute ESR sets for sequential decision making problems. Hayes et al. [23, 24] define a multi-objective return distribution, $\mathbf{z}^\pi$, which represents the distribution over returns for a policy, $\pi$, such that

$$\mathbb{E}\,\mathbf{z}^\pi = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \mathbf{r}_t \,\middle|\, \pi, \mu_0\right]. \qquad (3)$$

A return distribution² is a distribution over the returns of a random vector when a policy, $\pi$, is executed [23].
Hayes et al. [23, 24] define ESR dominance, which gives a partial ordering over return distributions, where each return distribution is associated with a policy that could be executed. ESR dominance builds on the principles of first-order stochastic dominance [6, 18] in multivariate settings [4, 40]. Stochastic dominance gives a partial ordering over random variables and random vectors. Stochastic dominance has been used in economics [12], finance [3, 7] and game theory [15] to make decisions under uncertainty.

²The term value distribution is used in [8, 23, 33]. However, a value distribution is a distribution over the returns, not over values. Therefore, we prefer the term return distribution.
To calculate ESR dominance, the cumulative distribution function (CDF) of the given return distributions must be calculated. For a return distribution $\mathbf{z}^\pi$, the CDF of $\mathbf{z}^\pi$ is denoted by $F_{\mathbf{z}^\pi}$. A return distribution $\mathbf{z}^\pi$ ESR dominates a return distribution $\mathbf{z}^{\pi'}$ if the following is true:

$$\mathbf{z}^\pi >_{ESR} \mathbf{z}^{\pi'} \iff \forall \mathbf{v}\colon F_{\mathbf{z}^\pi}(\mathbf{v}) \leq F_{\mathbf{z}^{\pi'}}(\mathbf{v}) \;\land\; \exists \mathbf{v}\colon F_{\mathbf{z}^\pi}(\mathbf{v}) < F_{\mathbf{z}^{\pi'}}(\mathbf{v}). \qquad (4)$$

Hayes et al. [23] prove that if a return distribution $\mathbf{z}^\pi$ ESR dominates a return distribution $\mathbf{z}^{\pi'}$, then $\mathbf{z}^\pi$ has a higher expected utility than $\mathbf{z}^{\pi'}$ for all strictly monotonically increasing utility functions, $u$:

$$\mathbf{z}^\pi >_{ESR} \mathbf{z}^{\pi'} \implies \mathbb{E}(u(\mathbf{z}^\pi)) > \mathbb{E}(u(\mathbf{z}^{\pi'})). \qquad (5)$$

Finally, Hayes et al. [23, 24] define a set of non-dominated return distributions known as the ESR set, which is defined as follows:

$$ESR(\Pi) = \{\pi \in \Pi \mid \nexists\, \pi' \in \Pi\colon \mathbf{z}^{\pi'} >_{ESR} \mathbf{z}^\pi\}. \qquad (6)$$
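A minimal sketch of an ESR dominance check for discrete return distributions follows; it is our own illustration, not the authors' implementation, and it compares the CDFs on the grid spanned by the support coordinates, which is where a discrete multivariate CDF changes value:

```python
# Sketch: ESR dominance (Equation 4) for discrete multivariate return
# distributions, represented as dicts {return_vector: probability}.
from itertools import product

def cdf(dist, v):
    """F_z(v) = P(Z <= v), with <= taken componentwise."""
    return sum(p for r, p in dist.items()
               if all(ri <= vi for ri, vi in zip(r, v)))

def support_grid(z, z_prime):
    """Grid of all coordinate combinations; the CDFs only change here."""
    dims = len(next(iter(z)))
    coords = [sorted({r[i] for r in list(z) + list(z_prime)})
              for i in range(dims)]
    return product(*coords)

def esr_dominates(z, z_prime, eps=1e-12):
    """True iff z >_ESR z': F_z <= F_z' everywhere, strictly somewhere."""
    pts = list(support_grid(z, z_prime))
    leq = all(cdf(z, v) <= cdf(z_prime, v) + eps for v in pts)
    lt = any(cdf(z, v) < cdf(z_prime, v) - eps for v in pts)
    return leq and lt

# z concentrates mass on componentwise-higher returns than z_prime,
# so z ESR dominates z_prime but not vice versa.
z = {(8, 2): 0.6, (6, 1): 0.4}
z_prime = {(6, 1): 0.6, (5, 0): 0.4}
```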
4 MULTI-OBJECTIVE DISTRIBUTIONAL VALUE ITERATION

To compute a set of optimal policies for the ESR criterion when the utility function of a user is unknown, we propose a novel multi-objective distributional value iteration (MODVI) algorithm. MODVI maintains sets of return distributions for each state and uses ESR dominance [23] to compute a set of non-dominated return distributions, known as the ESR set.

The state-of-the-art multi-objective decision making (MODeM) algorithms use expected value vectors to compute sets of optimal policies [48-50]. However, expected value vectors can only be used when optimising for the SER criterion. As previously highlighted, to compute a set of optimal policies for the ESR criterion, expected value vectors must be replaced with return distributions. Generally, expected value MODeM algorithms utilise the Bellman operator [9] to compute the expected value vectors for each state. Given that our approach is distributional, we adopt the distributional Bellman operator [8], $\mathcal{T}^\pi_D$, to update the return distribution for each state-action pair:

$$\mathcal{T}^\pi_D\, \mathbf{z}(s, a) \stackrel{D}{=} \mathbf{r}_{s,a} + \gamma\, \mathbf{z}(s', a'). \qquad (7)$$
To represent a return distribution in multi-objective settings, we use a multivariate categorical distribution similar to the distributions used by Reymond et al. [33] and Bellemare et al. [8]. The categorical distribution is parameterised by a number of atoms, $N \in \mathbb{N}$, where the distribution has a dimension per objective, $n$. The atoms outline the width of each category and are bounded by the minimum returns, $\mathbf{R}_{min}$, and maximum returns, $\mathbf{R}_{max}$. The multivariate categorical distribution has a set of atoms defined as follows [33]:

$$\{\mathbf{z}_{i \ldots k} = (R_{min_0} + i \Delta z_0, \ldots, R_{min_n} + k \Delta z_n)\colon 0 \leq i < N, \ldots, 0 \leq k < N\}, \qquad (8)$$

where each objective has a separate $R_{min_b}, R_{max_b}$ for $0 < b \leq n$, and $\Delta \mathbf{z} = \frac{\mathbf{R}_{max} - \mathbf{R}_{min}}{N - 1}$.
The distribution is a set of $N$ discrete categories, where each category, $p_i$, represents the probability of receiving a return [33]. To ensure the distribution is an accurate representation of the returns of the execution of a policy, it is crucial that the number of atoms is selected to sufficiently cover the range of values from $\mathbf{R}_{min}$ to $\mathbf{R}_{max}$. For example, if $\gamma = 1$ and reward values are expected to be integers in the range $\mathbf{R}_{min} = [0, 0]$ to $\mathbf{R}_{max} = [1, 10]$, $N = 11$ is the required value to ensure that the distribution is represented without aliasing between different reward levels.
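The atom layout of Equation 8 can be sketched as follows; the helper is our own, assuming the per-objective spacing described above:

```python
# Sketch of the per-objective atom positions behind Equation 8:
# N atoms per objective, spaced delta_b = (Rmax_b - Rmin_b) / (N - 1).
def atom_support(r_min, r_max, n_atoms):
    """One list of atom positions per objective; the joint support of the
    multivariate categorical distribution is their Cartesian product."""
    deltas = [(hi - lo) / (n_atoms - 1) for lo, hi in zip(r_min, r_max)]
    return [[lo + i * d for i in range(n_atoms)]
            for lo, d in zip(r_min, deltas)]

# The paper's example: integer returns between Rmin = [0, 0] and
# Rmax = [1, 10] with gamma = 1 need N = 11 atoms to avoid aliasing.
support = atom_support([0, 0], [1, 10], 11)
```

For the second objective this places one atom on each integer return from 0 to 10.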
To update the multivariate categorical distribution, we utilise the state space, action space and reward function of the model. During an update of the multivariate categorical distribution, we iterate over each atom, $j$, for each objective. To update the return distribution, $\mathbf{z}_s$, for state $s$, we compute the distributional Bellman update $\hat{\mathcal{T}} \mathbf{z}_{s,j} = \mathbf{r}_{s,a,s'} + \gamma\, \mathbf{z}_{s',j}$ for each atom $j$, for a given reward $\mathbf{r}_{s,a,s'}$ and return distribution, $\mathbf{z}_{s'}$, for state $s'$. We then distribute the probability, $p$, for the atom, $j$, of the return distribution, $p_j(\mathbf{z}_{s'})$, in state $s'$, to the corresponding atom of the updated return distribution, $\mathbf{z}_s$, for state $s$. Therefore, the return distribution, $\mathbf{z}_s$, for state $s$ is equivalent to the return distribution, $\mathbf{z}_{s'}$, in state $s'$, shifted relative to the reward, $\mathbf{r}_{s,a,s'}$.
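For a single objective, the shift described above can be sketched as follows; this is a hypothetical helper of our own, while the paper's update acts per objective on the multivariate distribution:

```python
# Sketch: mass at atom value v moves to the atom nearest r + gamma * v,
# clipped to the support bounds (single-objective case for brevity).
def shift_distribution(probs, atoms, reward, gamma=1.0):
    """Return the shifted categorical distribution over the same atoms."""
    out = [0.0] * len(atoms)
    for p, v in zip(probs, atoms):
        if p == 0.0:
            continue
        target = min(max(reward + gamma * v, atoms[0]), atoms[-1])
        # index of the atom closest to the shifted return
        j = min(range(len(atoms)), key=lambda i: abs(atoms[i] - target))
        out[j] += p
    return out

atoms = [0, 1, 2, 3, 4]
probs = [0.0, 0.7, 0.3, 0.0, 0.0]   # mass on returns 1 and 2
shifted = shift_distribution(probs, atoms, reward=2)
```

Here the mass on returns 1 and 2 moves to returns 3 and 4, i.e. the whole distribution shifts by the reward.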
At each iteration, $k$, of MODVI, for each state, $s$, and action, $a$, a set of optimal return distributions is backed up once. In Equation 9, the Bellman operator has been replaced with the distributional Bellman operator [8]:

$$Q_{k+1}(s, a) \leftarrow \bigoplus_{s'} T(s' \mid s, a)\, [\mathbf{r}_{s,a,s'} + \gamma\, Z_k(s')], \qquad (9)$$

where $Q_{k+1}(s, a)$ and $Z_k(s')$ represent sets of return distributions, $\oplus$ denotes the cross-sum between sets of return distributions, and $T(s' \mid s, a)$ represents the probability of transitioning to state $s'$ from state $s$ after taking action $a$.
During a distributional Bellman backup, each return distribution, $\mathbf{z}_{s'}$, in the set $Z_k(s')$, is updated with the reward, $\mathbf{r}_{s,a,s'}$, for action, $a$, in state, $s$, as follows: $\{\mathbf{r}_{s,a,s'} + \gamma\, \mathbf{z}_{s'} : \forall \mathbf{z}_{s'} \in Z_k(s')\}$. Each updated return distribution in the set for state $s'$ is then multiplied by the transition probability, $T(s' \mid s, a)$. The cross sum of the resulting sets of updated return distributions is then computed over each possible next state, $s'$. The cross sum between two sets of return distributions, $X \bigoplus Y$, is defined as follows: $\{\mathbf{x} + \mathbf{y} : \mathbf{x} \in X \land \mathbf{y} \in Y\}$, where $\mathbf{x}$ and $\mathbf{y}$ are return distributions. For a detailed overview of how a set of return distributions for an action in a MOMDP can be computed, please consider the example outlined in Figure 2.
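The scale, shift, and cross-sum steps of this backup can be sketched in code for one branch pairing from the worked example in Figure 2; the helper names are our own:

```python
# Sketch of one branch of the distributional backup in Equation 9,
# reproducing z1 = x1_hat + y1_hat from the worked example in Figure 2.
def scale(dist, p):
    """Weight a return distribution by a transition probability."""
    return {r: p * q for r, q in dist.items()}

def merge(x, y):
    """Combine two probability-weighted sub-distributions into one."""
    out = dict(x)
    for r, q in y.items():
        out[r] = out.get(r, 0.0) + q
    return out

def cross_sum(X, Y):
    """All pairings across two sets of (sub-)distributions."""
    return [merge(x, y) for x in X for y in Y]

x1 = {(0, 1): 0.7, (2, 0): 0.3}                       # at s1
y1 = {(1, 0): 0.75, (0, 2): 0.25}                     # at s2
x1 = {(r1 + 1, r2): p for (r1, r2), p in x1.items()}  # shift by reward [1, 0]
Z = cross_sum([scale(x1, 0.9)], [scale(y1, 0.1)])     # T = 0.9 and 0.1
z1 = Z[0]
```

The resulting z1 assigns mass 0.63 to (1, 1), 0.27 to (3, 0), 0.075 to (1, 0) and 0.025 to (0, 2), matching Figure 2(f).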
To compute a set of ESR non-dominated policies for each state, we define an algorithm known as ESRPrune (Algorithm 1), which computes a set of ESR non-dominated policies by removing ESR dominated return distributions from a given set:

$$Z_{k+1}(s) \leftarrow \mathrm{ESRPrune}\left(\bigcup_{a} Q_{k+1}(s, a)\right). \qquad (10)$$

Equation 10 calculates the set of return distributions for a given state, $s$, by taking the union of each set of return distributions over each action, $a$. The resulting set of return distributions is then passed to the ESRPrune algorithm as input.
ESRPrune utilises ESR dominance as defined by Hayes et al. [23, 24] (see Equation 4). Like Pareto dominance, ESR dominance is transitive [52]; therefore we can apply ESRPrune in sequence. To compute ESR dominance, the cumulative distribution function (CDF) of each
Figure 2: A worked example outlining the necessary steps to compute a set of return distributions for a MOMDP with stochastic state transitions.

(a) An action, $a$, in a MOMDP with stochastic state transitions. States $s_1$ and $s_2$ have sets of non-dominated return distributions $X = \{\mathbf{x}_1, \mathbf{x}_2\}$ and $Y = \{\mathbf{y}_1, \mathbf{y}_2\}$. For action $a$, transitioning from $s_0$ to $s_1$ occurs with a probability of 0.9 and a reward of [1, 0] is received. For action $a$, transitioning from $s_0$ to $s_2$ occurs with a probability of 0.1 and a reward of [0, 0] is received.

(b) The return distributions $\mathbf{x}_1, \mathbf{x}_2, \mathbf{y}_1$ and $\mathbf{y}_2$ in the sets of policies for $s_1$ and $s_2$. To compute a set of policies for state $s_0$, the distributional Bellman operator is utilised (Equation 9).

    π    r1  r2  P(r1, r2)
    x1    0   1  0.7
          2   0  0.3
    x2    2   1  0.5
          2   2  0.5
    y1    1   0  0.75
          0   2  0.25
    y2    0   1  0.9
          3   0  0.1

(c) The reward, $\mathbf{r}_{s,a,s'}$, is used to update each return distribution for states $s_1$ and $s_2$. For example, $\dot{\mathbf{x}}_1 = \mathbf{r}_{s,a,s'} + \gamma \mathbf{x}_1$. For this example $\gamma = 1$.

    π     r1  r2  P(r1, r2)
    ẋ1     1   1  0.7
           3   0  0.3
    ẋ2     3   1  0.5
           3   2  0.5
    ẏ1     1   0  0.75
           0   2  0.25
    ẏ2     0   1  0.9
           3   0  0.1

(d) Each return distribution for $s_1$ and $s_2$ is then multiplied by the transition probabilities, $T(s' \mid s, a)$. For example, $\hat{\mathbf{x}}_1 = \dot{\mathbf{x}}_1 \times T(s' \mid s, a)$.

    π     r1  r2  P(r1, r2)
    x̂1     1   1  0.63
           3   0  0.27
    x̂2     3   1  0.45
           3   2  0.45
    ŷ1     1   0  0.075
           0   2  0.025
    ŷ2     0   1  0.09
           3   0  0.01

(e) A set of return distributions, $Z$, is computed for state $s_0$. The cross sum, $\bigoplus$, is utilised to sum all combinations of return distributions from the previously updated sets. The set of return distributions at state $s_0$ is defined as follows: $Z = X \bigoplus Y = \{\hat{\mathbf{x}} + \hat{\mathbf{y}} : \hat{\mathbf{x}} \in X \land \hat{\mathbf{y}} \in Y\}$, where $\hat{\mathbf{x}}$ and $\hat{\mathbf{y}}$ are return distributions. The resulting set contains $\mathbf{z}_1 = \hat{\mathbf{x}}_1 + \hat{\mathbf{y}}_1$, $\mathbf{z}_2 = \hat{\mathbf{x}}_1 + \hat{\mathbf{y}}_2$, $\mathbf{z}_3 = \hat{\mathbf{x}}_2 + \hat{\mathbf{y}}_1$, and $\mathbf{z}_4 = \hat{\mathbf{x}}_2 + \hat{\mathbf{y}}_2$.

(f) The set of return distributions, $Z$, at state $s_0$. $Z$ will be passed to the ESRPrune algorithm.

    π    r1  r2  P(r1, r2)
    z1    1   1  0.63
          3   0  0.27
          1   0  0.075
          0   2  0.025
    z2    1   1  0.63
          3   0  0.28
          0   1  0.09
    z3    3   1  0.45
          3   2  0.45
          1   0  0.075
          0   2  0.025
    z4    3   1  0.45
          3   2  0.45
          0   1  0.09
          3   0  0.01
return distribution in the given set must be calculated.
ESRPrune
iterates over the given set of return distributions and compares the
CDFs of the return distributions to determine which are ESR non-
dominated. The return distributions that are ESR dominated are
removed from the set. A set of non-dominated return distributions
is known as the ESR set [23].
Algorithm 1: ESRPrune
    Input: Z ← a set of return distributions
    Z* ← ∅
    while Z ≠ ∅ do
        z ← the first element of Z
        for z' ∈ Z do
            if z' >_ESR z then
                z ← z'
            end
        end
        Remove z and all return distributions ESR-dominated by z from Z
        Add z to Z*
    end
    Return Z*
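A direct sketch of Algorithm 1 in code follows; this is our own illustration, where esr_dominates implements Equation 4 for discrete return distributions represented as dicts from return vectors to probabilities:

```python
# Sketch of Algorithm 1 (ESRPrune) over discrete return distributions.
def cdf(dist, v):
    """F_z(v) = P(Z <= v), componentwise."""
    return sum(p for r, p in dist.items()
               if all(ri <= vi for ri, vi in zip(r, v)))

def esr_dominates(z, zp, eps=1e-12):
    """Equation 4, checked on the union of both supports."""
    pts = set(z) | set(zp)
    return (all(cdf(z, v) <= cdf(zp, v) + eps for v in pts)
            and any(cdf(z, v) < cdf(zp, v) - eps for v in pts))

def esr_prune(Z):
    """Remove ESR-dominated return distributions; keep the ESR set."""
    Z, kept = list(Z), []
    while Z:
        z = Z[0]
        for zp in Z[1:]:
            if esr_dominates(zp, z):
                z = zp  # climb to a dominating distribution
        # drop z and everything z dominates from Z; keep z
        Z = [zp for zp in Z if zp is not z and not esr_dominates(z, zp)]
        kept.append(z)
    return kept

good = {(8, 2): 0.6, (6, 1): 0.4}
bad = {(6, 1): 0.6, (5, 0): 0.4}   # ESR-dominated by `good`
other = {(9, 0): 1.0}              # incomparable with `good`
pruned = esr_prune([bad, good, other])
```

Only the dominated distribution is removed; incomparable distributions both survive, which is what makes the ESR set a set rather than a single optimum.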
To highlight how ESRPrune determines which return distributions are ESR non-dominated, consider the example outlined in Figure 3(a), Figure 3(b) and Figure 4. To determine ESR dominance, ESRPrune compares a return distribution $X$ with a return distribution $Y$. The CDF for $X$ is denoted by $F_X$ (Figure 3(a)) and the CDF for $Y$ is denoted by $F_Y$ (Figure 3(b)). In order for $X >_{ESR} Y$, the following condition must be true [23]:

$$\forall \mathbf{v}\colon F_X(\mathbf{v}) \leq F_Y(\mathbf{v}) \;\land\; \exists \mathbf{v}\colon F_X(\mathbf{v}) < F_Y(\mathbf{v}).$$

Additionally, if $X >_{ESR} Y$, the following condition must also be true:

$$\forall \mathbf{v}\colon F_X(\mathbf{v}) - F_Y(\mathbf{v}) \leq 0 \;\land\; \exists \mathbf{v}\colon F_X(\mathbf{v}) - F_Y(\mathbf{v}) < 0.$$
[Figure 3: The CDFs, $F_X$ and $F_Y$, of two return distributions, $X$ and $Y$. (a) The CDF, $F_X$, of a return distribution $X$. $X$ is a multivariate normal probability distribution, with mean $\mu = [1, 2]$ and covariance matrix $\Sigma = \begin{pmatrix} 0.5 & 0.25 \\ 0.25 & 0.5 \end{pmatrix}$. (b) The CDF, $F_Y$, of a return distribution $Y$. $Y$ is a multivariate normal probability distribution, with mean $\mu = [1, 1]$ and covariance matrix $\Sigma = \begin{pmatrix} 0.15 & 0.05 \\ 0.05 & 0.15 \end{pmatrix}$. Axes: objectives $o_1$ and $o_2$ against probability.]
[Figure 4: The difference in probability mass for $F_X - F_Y$, which is used to visualise the requirements for ESR dominance. A dotted line (a) is drawn to highlight that $F_X - F_Y > 0$ for at least one point. Therefore, $X$ does not ESR dominate $Y$.]
Figure 4 highlights the difference in probability for $F_X - F_Y$. The dotted line in Figure 4, labelled (a), highlights that, for at least one point, $F_X - F_Y > 0$. Therefore, the return distribution $X$ cannot ESR dominate the return distribution $Y$.
Algorithm 2: MODVI
    1  Initialise all return distributions and sets
    2  while not converged do
    3      for s ∈ S do
    4          for a ∈ A do
    5              Q_{k+1}(s, a) ← ⊕_{s'} T(s'|s, a) [R(s, a, s') + γ Z_k(s')]
    6          end
    7          Z_{k+1}(s) ← ESRPrune(∪_a Q_{k+1}(s, a))
    8      end
    9  end
Algorithm 2 describes the MODVI algorithm³. On initialisation of MODVI, a set of return distributions is generated for each state-action pair. For infinite horizon settings, each set contains a single return distribution that is randomly initialised, where an atom is selected at random and a probability mass of 1.0 is assigned to that atom. In finite horizon settings, each return distribution is initialised by assigning a probability mass of 1.0 to the atom which corresponds to the return [0, 0]. During each iteration of MODVI, a set of return distributions is computed (Algorithm 2, Line 5) for each state, $s$, and action, $a$. The union of the resulting sets of return distributions is then passed to the ESRPrune algorithm to remove the dominated return distributions. Once ESRPrune (Algorithm 2, Line 7) has been executed for the given iteration of MODVI, a set of non-dominated return distributions is backed up for the state $s$. Once MODVI has converged, a set of ESR non-dominated policies, or the ESR set, is available at the start state, $s_0$.

³Algorithm 2 describes MODVI for infinite horizon settings. However, it is trivial to alter MODVI for finite horizon settings.
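The finite-horizon initialisation described above amounts to a one-hot distribution over the atoms; a trivial sketch with hypothetical names:

```python
# Sketch: finite-horizon initialisation places probability mass 1.0 on
# the atom corresponding to the zero return; infinite-horizon
# initialisation would instead pick the atom index at random.
def one_hot_distribution(n_atoms, index):
    probs = [0.0] * n_atoms
    probs[index] = 1.0
    return probs

# e.g. 23 atoms, with the atom for the zero return assumed at index 0
init = one_hot_distribution(23, 0)
```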
5 EXPERIMENTS
In this section we show that MODVI can compute a set of optimal
policies for the ESR criterion for two multi-objective benchmark
problems and a practical multi-objective real-world problem.
5.1 Space Traders
First, we evaluate MODVI on a multi-objective benchmark problem known as Space Traders [43]. Space Traders is a problem with nine policies and a small number of returns per policy. Therefore, it is possible to visualise each policy in the ESR set, illustrating how policies can be returned to a user during the selection phase in practice. Of course, for larger problems, the user could select subsets of the policies to visualise and compare.

Space Traders has two timesteps, two non-terminal states and three available actions per state. In Space Traders, an agent must deliver cargo from its home planet (planet A) to some destination planet (planet B) and then return home to planet A. While delivering the cargo, the agent must avoid being intercepted by space pirates. An agent acting in the Space Traders environment aims to complete the mission and minimise time. An agent receives a reward of 1 for returning home to planet A and completing the mission, and at all other states the agent receives a reward of 0 for mission success. After each action, the agent receives a negative reward corresponding to the time taken to reach the next planet. Finally, after taking each action there is a probability the agent will be intercepted by space pirates. If the agent is intercepted by space pirates, the agent will receive a reward of 0 for mission success, a negative time penalty, and the episode will terminate. All remaining implementation details for the Space Traders environment are available in the works of Vamplew et al. [42, 43].
MODVI has the following parameters: $\gamma = 1$, $N = 23$, $\mathbf{R}_{min} = [0, -22]$ and $\mathbf{R}_{max} = [1, 0]$. Figure 7(a) outlines the six return distributions in the computed ESR set. Figure 5 plots the expected value vectors of each return distribution in the ESR set and also plots the expected value vectors for the Pareto front [43]. It is important to note that the ESR set for Space Traders contains a policy that is not present on the Pareto front. The Pareto front is a set of optimal policies for the SER criterion. Therefore, certain policies that are optimal under the ESR criterion are not optimal under the SER criterion. In real-world decision making, incorrectly selecting an optimality criterion can lead to sub-optimal performance, given that some optimal policies may not be returned to the user.
During the selection phase, visualisations like Figure 5 are returned to the user to aid in their decision making. However, in Figure 5, the details of the return distributions for each policy in the ESR set are lost. Computing expected value vectors for each return distribution reduces the information available about a policy, given that the information about each individual return of a policy is no longer available. As already highlighted, under the ESR criterion the utility of a user is derived from a single execution of a policy. Therefore, it is crucial that a user has sufficient information available at decision time, given that a policy may only be executed once. Figure 6 visualises each potential return and the corresponding probability of the return distributions in the ESR set. In Figure 6, each return distribution has a shape, where the position of each shape corresponds to a return and the colour of each shape corresponds to the
[Figure 5: The expected value vectors of the return distributions in the ESR set (red) are plotted against the expected value vectors of the Pareto front (blue). Axes: objective 2 (from −20 to 0) against objective 1 (from 0.8 to 1.0).]
[Figure 6: The return distributions in the ESR set computed by MODVI. Each shape corresponds to a computed policy in the ESR set, where the location of the shape corresponds to a return in the policy. Colours correspond to the probability of receiving the specific return when executing the policy. Axes: objective 2 (from −20 to 0) against objective 1 (from 0.0 to 1.0), with a colour bar for probability.]
probability of receiving the return. In practice, a user would be able to choose which return distributions in the ESR set to display at a given moment, allowing the user to compare and contrast different policies individually. Figure 6 provides an intuitive aid which can be returned to a user when making decisions under the ESR criterion.
5.2 Resource Gathering
Next, we evaluate MODVI on the Resource Gathering benchmark [5]. Resource Gathering is a multi-objective benchmark problem with intuitive trade-offs between objectives, motivating the need to consider the ESR criterion in real-world decision making. MODVI is evaluated on a four-objective version of Resource Gathering, where time is added as an objective. The Resource Gathering environment is shown in Figure 7(b). The agent starts in a home state and navigates the grid environment to collect the available resources ($R_1$ and $R_2$) while avoiding the enemy states (†1 and †2) before returning home again. At each timestep, the agent receives a reward of [−1, 0, 0, 0]. If the agent returns to the home state having gathered the available resources, the agent receives one of the following rewards: [−1, 0, 10, 0] for collecting $R_1$, [−1, 0, 0, 10] for collecting
Multi-Objective Distributional Value Iteration ALA ’22, May 9-10, 2022, Online, https://ala2022.github.io/
π     r1    r2    P(r1, r2)
π1     1   -22    1.0
π2     0    -1    0.1
       1   -16    0.9
π3     0    -7    0.085
       0     0    0.15
       1    -8    0.765
π4     0     0    0.15
       0   -10    0.85
π5     0     0    0.2775
       1     0    0.7225
π6     0    -6    0.135
       0    -1    0.1
       1    -6    0.765

(a) The return distributions in the ESR set for the Space Traders environment, with γ = 1.

(b) The grid for the Resource Gathering environment. †1 and †2 are enemy states. 𝑅1 and 𝑅2 are the resources that need to be gathered, before returning to the home state.

π     r1    r2    r3    r4    P(r1, r2, r3, r4)
π1   -18     0    10    10    1.0
π2   -12     0    10     0    1.0
π3   -16   -10     0     0    0.1
     -14     0    10    10    0.9
π4   -12   -10     0     0    0.1
     -16     0    10    10    0.9
π5   -12   -10     0     0    0.1
     -10     0    10     0    0.9
π6   -14   -10     0     0    0.09
     -12   -10     0     0    0.1
     -12     0    10    10    0.81
π7   -14   -10     0     0    0.09
     -12   -10     0     0    0.1
      -8     0    10     0    0.81
π8   -10     0     0    10    1.0

(c) The return distributions in the ESR set for the Resource Gathering environment, with γ = 1.

π     r1     r2      P(r1, r2)
π1    -1   -0.06    0.0995
      -1    0.0     0.3210
       0   -0.06    0.3778
       0    0.0     0.2017
π2    -1   -0.06    0.0597
      -1    0.0     0.3609
       0   -0.06    0.2264
       0    0.0     0.3530
π3    -1    0.0     0.4206
       0    0.0     0.5794

(d) The return distributions in the ESR set for the Control Problem environment, with γ = 1.

Figure 7: Figure 7(a), Figure 7(c) and Figure 7(d) show the return distributions in the ESR set computed by the MODVI algorithm for the Space Traders, Resource Gathering and Control Problem. Figure 7(b) shows the grid layout for the Resource Gathering environment.
𝑅2, and [−1, 0, 10, 10] for collecting 𝑅1 and 𝑅2. The agent must avoid the enemy states. If the agent enters an enemy state, there is a 0.1 chance the agent will be attacked. If the agent is attacked in an enemy state, the agent receives a reward of [−10, −10, 0, 0]. In this case, the agent also receives a time penalty for being attacked and the episode terminates.
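The reward structure described above can be summarised in a short sketch. The function below models only the reward branches, not the full grid dynamics: the state is reduced to flags for the agent's position and gathered resources, and the 0.1 attack probability in an enemy state is assumed to be sampled by the caller (here passed in as an explicit `attacked` flag); the function name and signature are illustrative.

```python
def resource_gathering_reward(at_home, gathered_r1, gathered_r2, attacked):
    """Four-objective reward [time, attack, R1, R2] for a single
    timestep of the Resource Gathering environment described above."""
    if attacked:
        # Attack in an enemy state: time and attack penalties,
        # and the episode terminates.
        return [-10, -10, 0, 0]
    if at_home and (gathered_r1 or gathered_r2):
        # Returning home with resources yields the resource rewards.
        return [-1, 0, 10 if gathered_r1 else 0, 10 if gathered_r2 else 0]
    # Ordinary step: only the time objective is penalised.
    return [-1, 0, 0, 0]

print(resource_gathering_reward(True, True, True, False))    # [-1, 0, 10, 10]
print(resource_gathering_reward(False, False, False, False)) # [-1, 0, 0, 0]
```

Under this reward structure, a policy's return distribution is induced entirely by the stochastic attack outcomes along its route, which is why the probabilities in Figure 7(c) are products of 0.1 and 0.9 factors.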
For Resource Gathering, the following parameters were set for MODVI: 𝛾 = 1, 𝑁 = 25, R𝑚𝑖𝑛 = [−24, −24, −14, −14] and R𝑚𝑎𝑥 = [0, 0, 10, 10]. Figure 7(c) outlines the return distributions in the ESR
set for Resource Gathering. The ESR set contains eight policies,
where each policy gathers one or both resources before returning
home. An important aspect of the distributional approach applied by MODVI is that a user will have sufficient information about the trade-offs between each objective for each policy in the ESR set. For example, there is a clear trade-off between objectives in 𝜋3 and 𝜋6 in Figure 7(c). When considering 𝜋3, fourteen timesteps are taken to gather both resources and the agent enters one enemy state with a 0.1 chance of being attacked. When considering 𝜋6, twelve timesteps are taken to gather both resources, but the agent must enter both enemy states, which pose a 0.09 chance and a 0.1 chance of being attacked. Using a distributional approach ensures a user has sufficient information to understand the trade-offs between objectives across different policies. In Resource Gathering, a user looking to minimise time, while also being indifferent about being attacked, may select 𝜋6 having fully understood the probabilities of being attacked. Therefore, having sufficient critical information available at decision time enables the user to make more informed decisions that could potentially better reflect their preferences over objectives, when compared to expected value vector based methods.
5.3 Feedtank Control Problem
Finally, we evaluate MODVI on the risk-based Feedtank Control Problem (FCP) proposed by Geibel and Wysotzki [16], which is a practical real-world problem domain that highlights how MODVI and the ESR criterion can be applied. In FCP, the agent must control the outflow of a tank that lies upstream of a distillation column, while minimising the risk of the tank overflowing. The purpose of the distillation column is to separate two substances. There are a finite number of timesteps 0, ..., 𝑇, where 𝑡 denotes the current timestep. The feed-stream of the distillation column, or outflow of the tank, is denoted by 𝐹(𝑡) and is controlled by the agent. The tank level 𝑦(𝑡) depends on the two stochastic inflow streams characterized by the flow rates 𝐹1(𝑡) and 𝐹2(𝑡). The dynamics of the tank level are outlined in the following equation:

𝑦(𝑡+1) = 𝑦(𝑡) + 𝐴⁻¹𝛿(𝑡) (∑𝑗=1,2 𝐹𝑗(𝑡) − 𝐹(𝑡)). (11)
The tank level must not violate the following constraint:

𝑦𝑚𝑖𝑛 ≤ 𝑦(𝑡) ≤ 𝑦𝑚𝑎𝑥. (12)

The inflows 𝐹𝑗(𝑡) are random and controlled by probability distributions (Table 2). Therefore, the inflows may also cause the tank level to violate the constraint in Equation 12. At each timestep there is also a chance, 𝑝, that the inflows may randomly violate the constraint in Equation 12. To take a random constraint violation into consideration, the probabilities for each inflow in Table 2 must be multiplied by 1 − 𝑝. If the tank level violates the constraint in Equation 12, the system shuts down, the agent enters a terminal state, and receives a reward of [−1, 0]. The agent takes an action, 𝑎, to control the outflow of the tank. If the action does not cause a violation of Equation 12, the agent receives a reward defined as follows:

r𝑠,𝑎,𝑠′ = [0, −|𝐹(𝑡) − 𝐹𝑠𝑝𝑒𝑐|], (13)

where 𝐹(𝑡) is the discretised action value for the selected action that adheres to 𝐹𝑚𝑖𝑛 ≤ 𝐹(𝑡) ≤ 𝐹𝑚𝑎𝑥, where 𝐹𝑚𝑖𝑛 and 𝐹𝑚𝑎𝑥 bound the interval of admissible actions, and 𝐹𝑠𝑝𝑒𝑐 is the optimal action value. The state parameters for the FCP are defined as follows:

𝑠(𝑡) = [𝑡, 𝑦(𝑡)]. (14)

Finally, the initial state, 𝑠0, is defined as follows: [0, 𝑦0]. For the version of FCP used in this paper there are 11 actions available to the agent, with 8 timesteps.
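Equations 11 to 14 can be combined into a one-step simulator. The sketch below is illustrative rather than the paper's implementation; the constants match the FCP parameter settings used in this section (𝐴⁻¹𝛿(𝑡) = 0.1, 𝐹𝑠𝑝𝑒𝑐 = 0.8, [𝑦𝑚𝑖𝑛, 𝑦𝑚𝑎𝑥] = [0.25, 0.75]), while the sampling of the total inflow from Table 2 is left to the caller.

```python
Y_MIN, Y_MAX = 0.25, 0.75   # tank-level constraint (Equation 12)
F_SPEC = 0.8                # optimal outflow value
A_INV_DELTA = 0.1           # A^{-1} * delta(t)

def fcp_step(t, y, outflow, inflow_total):
    """One step of the feedtank: Equation 11 dynamics, the constraint of
    Equation 12 and the two-objective reward of Equation 13.
    inflow_total is the summed inflow, sampled by the caller from Table 2.
    Returns the next state [t, y(t)] (Equation 14), reward, and done."""
    y_next = y + A_INV_DELTA * (inflow_total - outflow)   # Equation 11
    if not (Y_MIN <= y_next <= Y_MAX):
        # Constraint violated: system shuts down, terminal reward [-1, 0].
        return (t + 1, y_next), [-1, 0], True
    reward = [0, -abs(outflow - F_SPEC)]                  # Equation 13
    return (t + 1, y_next), reward, False

state, reward, done = fcp_step(t=0, y=0.4, outflow=0.9, inflow_total=1.0)
```

With the initial level 𝑦0 = 0.4 and a net inflow of 0.1, the level rises to 0.41, well inside the constraint, and the second objective is penalised by the deviation of the chosen outflow from 𝐹𝑠𝑝𝑒𝑐.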
t    F1           P(F1)        F2           P(F2)
1    1.70843345   0.78341724   1.85062176   0.21658276
2    1.40843345   0.40060469   1.55062176   0.59939531
3    0.56537807   0.83222158   0.70876186   0.16777842
4    0.37336325   0.81546855   0.50537012   0.18453145
5    0.11927879   0.41123876   0.31832656   0.58876124
6    0.02762233   0.7665067    0.20677226   0.2334933
7    0.45139631   0.62905513   0.59104772   0.37094487
8    1.10806585   0.04634063   1.20835887   0.95365937

Table 2: The inflows (𝐹1, 𝐹2) for the feedtank with the corresponding probabilities (𝑃(𝐹1), 𝑃(𝐹2)) for each timestep, 𝑡.
The following parameters were set for FCP: [𝐹𝑚𝑖𝑛, 𝐹𝑚𝑎𝑥] = [0.55, 1.05], 𝐹𝑠𝑝𝑒𝑐 = 0.8, 𝑦0 = 0.4, [𝑦𝑚𝑖𝑛, 𝑦𝑚𝑎𝑥] = [0.25, 0.75], 𝐴⁻¹𝛿(𝑡) = 0.1 and 𝑝 = 0.1. MODVI has the following parameters: 𝛾 = 1, 𝑁 = 101, R𝑚𝑖𝑛 = [−1, −3] and R𝑚𝑎𝑥 = [0, 0]. Figure 7(d) outlines the three return distributions computed by MODVI in the ESR set for FCP. To provide an intuitive aid for decision making during the selection phase, the policies in the ESR set can be visualised, like in Figure 6, and returned to the user. It is important to note that 𝜋1 and 𝜋2 in the ESR set contain the same returns, although with different probabilities. If the expected value vectors for 𝜋1 and 𝜋2 are returned to a user, the user will lose all knowledge of how similar the returns for 𝜋1 and 𝜋2 are. Therefore, taking a distributional approach can aid in decision making, given that a user has more information about the individual returns of a policy. It is important to note that each return distribution in Figure 7(d) could easily be interpreted by a domain expert.
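The information loss incurred by expected value vectors can be made concrete with 𝜋1 and 𝜋2 from Figure 7(d): the two distributions share the same four returns, yet differ only in their probabilities, which is exactly what the expectation discards. The sketch below uses the published values; the dict representation and helper name are assumptions for illustration.

```python
# Return distributions for pi_1 and pi_2 from Figure 7(d),
# as {(r1, r2): probability}.
pi_1 = {(-1, -0.06): 0.0995, (-1, 0.0): 0.3210,
        (0, -0.06): 0.3778, (0, 0.0): 0.2017}
pi_2 = {(-1, -0.06): 0.0597, (-1, 0.0): 0.3609,
        (0, -0.06): 0.2264, (0, 0.0): 0.3530}

def expected_value_vector(dist):
    """Collapse a return distribution into an expected value vector."""
    return tuple(sum(p * r[i] for r, p in dist.items()) for i in range(2))

ev1 = expected_value_vector(pi_1)  # roughly (-0.4205, -0.0286)
ev2 = expected_value_vector(pi_2)  # roughly (-0.4206, -0.0172)

# The supports are identical; only the probabilities distinguish the
# two policies, and the expectation discards exactly that information.
assert set(pi_1) == set(pi_2)
```

On the first objective the two expected values differ by only 0.0001, so a user shown expected value vectors alone could hardly distinguish the policies, while the full distributions make their differences explicit.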
FCP is motivated by minimising risk as an important objective, given that violating certain constraints can shut down the distillation process. Therefore, FCP should be optimised under the ESR criterion, given that a single execution of a policy is used to derive utility. If the SER criterion is used as an optimality criterion, the average risk over multiple policy executions would be computed. However, making decisions based on average risk is not sufficient for FCP, given that a single violation of the constraints could lead to a system shutdown, resulting in loss of productivity and profits. Using a distributional approach for FCP under the ESR criterion ensures that a user has sufficient information about the probability of a constraint violation to make decisions that mitigate such risks.
6 RELATED WORK
In recent years, using distributions in decision making has become an active area of research for both single and multi-objective problem domains. For example, Martin et al. [28] use a single-objective distributional C51 algorithm with stochastic dominance to make risk-aware decisions. Abdolmaleki et al. [1] take a distributional approach to multi-objective decision making to compute a set of optimal policies for the SER criterion. It is important to note that taking a distributional approach to decision making is not new; methods like conditional value-at-risk (CVaR) [35] and value-at-risk (VaR) [14] have been used extensively in finance [27, 34] to make decisions under uncertainty. Beyond distributional approaches, many algorithms can compute a set of optimal policies for the SER criterion, for example multi-objective Monte Carlo tree search [48], Pareto value iteration [49], convex hull value iteration [5] and CON-MODP [50, 51]. In contrast to the SER criterion, the ESR criterion has been largely understudied, with some exceptions. Several single-policy algorithms have been developed which can compute a single optimal policy for the ESR criterion. However, the single-policy ESR algorithms cannot compute sets of optimal policies for the ESR criterion, which heavily restricts their use in real-world decision making scenarios. Reymond et al. [33] define a multi-objective distributional actor-critic algorithm that can compute optimal policies for the ESR criterion. Roijers et al. [36] define a multi-objective policy gradient algorithm that can compute a single optimal policy for the ESR criterion. Hayes et al. [19, 20] outline a distributional Monte Carlo tree search (DMCTS) algorithm to compute policies for the ESR criterion. However, all of the highlighted methods require the utility function of a user to be known a priori. For scenarios where the utility function is unknown, Hayes et al. [23] outline a distributional algorithm that computes a set of policies for the ESR criterion in a multi-objective multi-armed bandit [13] setting. However, the work of Hayes et al. [23] is limited to bandit settings and cannot be used for sequential decision making.
7 CONCLUSION & FUTURE WORK
In this paper we propose a multi-objective distributional value iteration (MODVI) algorithm that can compute a set of optimal policies for the ESR criterion. MODVI utilises return distributions which replace expected value vectors in multi-objective decision making. MODVI is the first algorithm that can compute a set of optimal policies under the ESR criterion in sequential multi-objective decision making settings. We show that MODVI can compute a set of optimal policies for several multi-objective benchmark problems and a practical real-world decision making problem. Because it is the first of its kind, MODVI opens up decision-theoretic planning for a key range of real-world problems.
We plan to use return distributions in multi-objective reinforcement learning (RL) settings. Model-based RL algorithms, like R-max [10], and model-free RL algorithms, like multi-objective Q-learning [46], could form the basis for new multi-objective distributional algorithms that can compute sets of policies for the ESR criterion. For MODVI, when the range of potential returns increases, maintaining a sufficient number of atoms for the return distribution requires a large amount of memory. It is expected that in larger scenarios, like [2], the range of possible potential returns would be difficult to maintain using a categorical distribution. A potential solution would be to use Dirichlet distributions [29] to represent return distributions. Finally, ESR dominance is a strict dominance criterion. In many settings, ESR dominance may produce very large sets of policies that would be optimal for all decision makers. It would be possible to relax the ESR dominance requirements by using almost stochastic dominance to generate smaller solution sets, where each policy in the set is optimal for most decision makers [25].
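The memory concern above can be illustrated with back-of-the-envelope arithmetic: if the joint return distribution over 𝑑 objectives is represented categorically with 𝑁 atoms per objective (an assumption of this sketch), then 𝑁^𝑑 probabilities must be stored per distribution, which grows quickly with the number of objectives.

```python
def joint_atoms(n_atoms, n_objectives):
    """Number of probabilities stored by a joint categorical return
    distribution with n_atoms atoms per objective (illustrative)."""
    return n_atoms ** n_objectives

# Resource Gathering: N = 25 atoms over 4 objectives.
print(joint_atoms(25, 4))   # 390625
# Feedtank Control Problem: N = 101 atoms over 2 objectives.
print(joint_atoms(101, 2))  # 10201
```

The four-objective setting already requires almost 400,000 atoms per distribution, which motivates the more compact parametric representations, such as Dirichlet distributions, suggested above.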
ACKNOWLEDGEMENTS
Conor F. Hayes is funded by the National University of Ireland
Hardiman Scholarship. This research was supported by funding
from the Flemish Government under the “Onderzoeksprogramma
Artificiële Intelligentie (AI) Vlaanderen” program.
REFERENCES
[1]
Abbas Abdolmaleki, Sandy Huang, Leonard Hasenclever, Michael Neunert, Fran-
cis Song, Martina Zambelli, Murilo Martins, Nicolas Heess, Raia Hadsell, and
Martin Riedmiller. 2020. A distributional view on multi-objective policy opti-
mization. In International Conference on Machine Learning. PMLR, 11–22.
[2]
Steven Abrams, James Wambua, Eva Santermans, Lander Willem, Elise Kuylen,
Pietro Coletti, Pieter Libin, Christel Faes, Oana Petrof, Sereina A. Herzog, Philippe
Beutels, and Niel Hens. 2021. Modelling the early phase of the Belgian COVID-19
epidemic using a stochastic compartmental model and studying its implied future
trajectories. Epidemics 35 (2021), 100449. https://doi.org/10.1016/j.epidem.2021.
100449
[3]
Mukhtar M. Ali. 1975. Stochastic dominance and portfolio analysis. Journal of
Financial Economics 2, 2 (1975), 205–229. https://doi.org/10.1016/0304-405X(75)90005-7
[4]
Anthony B Atkinson and Francois Bourguignon. 1982. The Comparison of Multi-Dimensioned Distributions of Economic Status. The Review of Economic Studies 49, 2 (1982), 183–201. https://doi.org/10.2307/2297269
[5]
Leon Barrett and Srini Narayanan. 2008. Learning all optimal policies with
multiple criteria. In Proceedings of the 25th international conference on Machine
learning. 41–47.
[6]
Vijay S. Bawa. 1975. Optimal rules for ordering uncertain prospects. Journal of
Financial Economics 2, 1 (1975), 95–121. https://doi.org/10.1016/0304-405X(75)90025-2
[7]
Vijay S. Bawa. 1978. Safety-First, Stochastic Dominance, and Optimal Portfolio
Choice. The Journal of Financial and Quantitative Analysis 13, 2 (1978), 255–271.
http://www.jstor.org/stable/2330386
[8]
Marc G Bellemare, Will Dabney, and Rémi Munos. 2017. A distributional perspec-
tive on reinforcement learning. In Proceedings of the 34th International Conference
on Machine Learning-Volume 70. JMLR. org, 449–458.
[9] Richard Bellman. 1957. Dynamic programming. Courier Corporation.
[10]
Ronen I Brafman and Moshe Tennenholtz. 2002. R-max-a general polynomial
time algorithm for near-optimal reinforcement learning. Journal of Machine
Learning Research 3, Oct (2002), 213–231.
[11]
Daniel Bryce, William Cushing, and Subbarao Kambhampati. 2007. Probabilistic
planning is multi-objective. Arizona State University, Tech. Rep. ASU-CSE-07-006
(2007).
[12]
E. Choi and Stanley Johnson. 1988. Stochastic Dominance and Uncertain Price
Prospects. Center for Agricultural and Rural Development (CARD) at Iowa State
University, Center for Agricultural and Rural Development (CARD) Publications 55
(01 1988). https://doi.org/10.2307/1059583
[13]
Madalina M. Drugan and Ann Nowe. 2013. Designing multi-objective multi-
armed bandits algorithms: A study. In The 2013 International Joint Conference on
Neural Networks (IJCNN). 1–8. https://doi.org/10.1109/IJCNN.2013.6707036
[14]
Darrell Duffie and Jun Pan. 1997. An overview of value at risk. Journal of
derivatives 4, 3 (1997), 7–49.
[15]
Peter C Fishburn. 1978. Non-cooperative stochastic dominance games. Interna-
tional Journal of Game Theory 7, 1 (1978), 51–61.
[16]
Peter Geibel and Fritz Wysotzki. 2005. Risk-sensitive reinforcement learning
applied to control under constraints. Journal of Artificial Intelligence Research 24
(2005), 81–108.
[17]
Peichen Gong. 1992. Multiobjective dynamic programming for forest resource
management. Forest Ecology and Management 48, 1 (1992), 43–54. https://doi.
org/10.1016/0378-1127(92)90120-X
[18]
Josef Hadar and William R. Russell. 1969. Rules for Ordering Uncertain Prospects.
The American Economic Review 59, 1 (1969), 25–34. http://www.jstor.org/stable/
1811090
[19]
Conor F. Hayes, Mathieu Reymond, Diederik M. Roijers, Enda Howley, and
Patrick Mannion. 2021. Risk-Aware and Multi-Objective Decision Making with
Distributional Monte Carlo Tree Search. In Proceedings of the Adaptive and Learning Agents Workshop at AAMAS 2021 (2021).
[20]
Conor F. Hayes, Mathieu Reymond, Diederik M. Roijers, Enda Howley, and
Patrick Mannion. 2021 In Press. Distributional Monte Carlo Tree Search for
Risk-Aware and Multi-Objective Reinforcement Learning. In Proceedings of the
20th International Conference on Autonomous Agents and MultiAgent Systems,
Vol. 2021. IFAAMAS.
[21]
Conor F. Hayes, Diederik M. Roijers, Enda Howley, and Mannion Patrick. 2022.
Decision-Theoretic Planning for the Expected Scalarised Returns. In Proceedings
of the 21st International Conference on AAMAS (2022).
[22]
Conor F. Hayes, Roxana Rădulescu, Eugenio Bargiacchi, Johan Källström,
Matthew Macfarlane, Mathieu Reymond, Timothy Verstraeten, Luisa M. Zintgraf,
Richard Dazeley, Fredrik Heintz, Enda Howley, Athirai A. Irissappane, Patrick
Mannion, Ann Nowé, Gabriel Ramos, Marcello Restelli, Peter Vamplew, and
Diederik M. Roijers. 2022. A Practical Guide to Multi-Objective Reinforcement
Learning and Planning. Autonomous Agents and Multi-Agent Systems 36, 1 (2022),
26. https://doi.org/10.1007/s10458-022-09552-y
[23]
Conor F. Hayes, Timothy Verstraeten, Diederik M. Roijers, Enda Howley, and
Patrick Mannion. 2021. Dominance Criteria and Solution Sets for the Expected
Scalarised Returns. In Proceedings of the Adaptive and Learning Agents workshop
at AAMAS 2021.
[24]
Conor F. Hayes, Timothy Verstraeten, Diederik M. Roijers, Enda Howley, and
Patrick Mannion. 2021. Expected Scalarised Returns Dominance: A New Solution
Concept for Multi-Objective Decision Making. arXiv preprint arXiv:2106.01048
(2021).
[25] Moshe Leshno and Haim Levy. 2002. Preferred by “all” and preferred by “most”
decision makers: Almost stochastic dominance. Management Science 48, 8 (2002),
1074–1085.
[26]
Federico Malerba and Patrick Mannion. 2021. Evaluating Tunable Agents
with Non-Linear Utility Functions under Expected Scalarised Returns. In Multi-
Objective Decision Making Workshop (MODeM 2021).
[27]
Simone Manganelli and Robert F Engle. 2001. Value at risk models in finance.
(2001).
[28]
John Martin, Michal Lyskawinski, Xiaohu Li, and Brendan Englot. 2020. Stochas-
tically Dominant Distributional Reinforcement Learning. In International Confer-
ence on Machine Learning. PMLR, 6745–6754.
[29]
Ingram Olkin and Herman Rubin. 1964. Multivariate beta distributions and
independence properties of the Wishart distribution. The Annals of Mathematical
Statistics (1964), 261–269.
[30]
Michael Painter, Bruno Lacerda, and Nick Hawes. 2020. Convex Hull Monte-
Carlo Tree-Search. In Proceedings of the Thirtieth International Conference on
Automated Planning and Scheduling, Nancy, France, October 26-30, 2020. AAAI
Press, 217–225.
[31] Vilfredo Pareto. 1896. Manuel d’Economie Politique. Vol. 1. Giard, Paris.
[32]
Roxana Rădulescu, Patrick Mannion, Diederik M. Roijers, and Ann Nowé. 2020.
Multi-objective multi-agent decision making: a utility-based analysis and survey.
Autonomous Agents and Multi-Agent Systems 34, 10 (2020).
[33]
Mathieu Reymond, Conor F. Hayes, Diederik M. Roijers, Denis Steckelmacher,
and Ann Nowé. 2021. Actor-Critic Multi-Objective Reinforcement Learning
for Non-Linear Utility Functions. Multi-Objective Decision Making Workshop
(MODeM 2021) (2021).
[34]
R Tyrrell Rockafellar and Stanislav Uryasev. 2002. Conditional value-at-risk for
general loss distributions. Journal of Banking & Finance 26, 7 (2002), 1443–1471.
[35]
R Tyrrell Rockafellar, Stanislav Uryasev, et al. 2000. Optimization of conditional value-at-risk. Journal of Risk 2, 3 (2000), 21–41.
[36]
Diederik M. Roijers, Denis Steckelmacher, and Ann Nowé. 2018. Multi-objective
Reinforcement Learning for the Expected Utility of the Return. In Proceedings of
the Adaptive and Learning Agents workshop at FAIM 2018.
[37]
Diederik M. Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley.
2013. A survey of multi-objective sequential decision-making. Journal of Artificial
Intelligence Research 48 (2013), 67–113.
[38]
Diederik M. Roijers, Shimon Whiteson, and Frans A. Oliehoek. 2015. Comput-
ing Convex Coverage Sets for Faster Multi-Objective Coordination. Journal of
Artificial Intelligence Research 52 (2015), 399–443.
[39]
Roxana Rădulescu, Patrick Mannion, Yijie Zhang, Diederik Marijn Roijers, and
Ann Nowé. 2020. A utility-based analysis of equilibria in multi-objective normal
form games. The Knowledge Engineering Review 35, e32 (2020).
[40]
Songsak Sriboonchitta, Wing-Keung Wong, S. Dhompongsa, and Hung Nguyen.
2009. Stochastic Dominance and Applications to Finance, Risk and Economics.
https://doi.org/10.1201/9781420082678
[41]
Peter Vamplew, Richard Dazeley, Adam Berry, Rustam Issabekov, and Evan
Dekker. 2011. Empirical evaluation methods for multiobjective reinforcement
learning algorithms. Machine Learning (2011).
[42]
Peter Vamplew, Cameron Foale, and Richard Dazeley. 2020. A Demonstration
of Issues with Value-Based Multi Objective Reinforcement Learning Under Sto-
chastic State Transitions. Adaptive and Learning Agents Workshop (AAMAS
2020).
[43]
Peter Vamplew, Cameron Foale, and Richard Dazeley. 2021. The impact of envi-
ronmental stochasticity on value-based multiobjective reinforcement learning. In
Neural Computing and Applications. https://doi.org/10.1007/s00521-021-05859-1
[44]
Peter Vamplew, Benjamin J Smith, Johan Kallstrom, Gabriel Ramos, Roxana
Radulescu, Diederik M Roijers, Conor F Hayes, Fredrik Heintz, Patrick Mannion,
Pieter JK Libin, et al. 2021. Scalar reward is not enough: A response to Silver,
Singh, Precup and Sutton (2021). arXiv preprint arXiv:2112.15422 (2021).
[45]
Peter Vamplew, John Yearwood, Richard Dazeley, and Adam Berry. 2008. On
the Limitations of Scalarisation for Multi-objective Reinforcement Learning of
Pareto Fronts. In AI 2008: Advances in Artificial Intelligence, Wayne Wobcke and
Mengjie Zhang (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 372–378.
[46]
Kristof Van Moffaert and Ann Nowé. 2014. Multi-objective reinforcement learning
using sets of Pareto dominating policies. The Journal of Machine Learning Research
15, 1 (2014), 3483–3512.
[47]
K Wakuta and K Togawa. 1998. Solution procedures for multi-objective Markov
decision processes. Optimization 43, 1 (1998), 29–46.
[48]
Weijia Wang and Michèle Sebag. 2012. Multi-objective Monte-Carlo Tree Search
(Proceedings of Machine Learning Research, Vol. 25), Steven C. H. Hoi and Wray
Buntine (Eds.). PMLR, Singapore Management University, Singapore, 507–522.
[49]
DJ White. 1982. Multi-objective infinite-horizon discounted Markov decision
processes. Journal of mathematical analysis and applications 89, 2 (1982), 639–647.
[50]
Marco A. Wiering and Edwin D. de Jong. 2007. Computing Optimal Stationary
Policies for Multi-Objective Markov Decision Processes. In 2007 IEEE International
Symposium on Approximate Dynamic Programming and Reinforcement Learning.
158–165. https://doi.org/10.1109/ADPRL.2007.368183
[51]
Marco A Wiering, Maikel Withagen, and Mădălina M Drugan. 2014. Model-based
multi-objective reinforcement learning. In 2014 IEEE Symposium on Adaptive
Dynamic Programming and Reinforcement Learning (ADPRL). IEEE, 1–6.
[52]
Elmar Wolfstetter. 1999. Topics in Microeconomics: Industrial Organization, Auc-
tions, and Incentives. Cambridge University Press. https://doi.org/10.1017/
CBO9780511625787
[53]
Kyle Hollins Wray, Shlomo Zilberstein, and Abdel-Illah Mouaddib. 2015. Multi-
objective MDPs with conditional lexicographic reward preferences. In Twenty-
ninth AAAI conference on articial intelligence.