Multi-Objective Distributional Value Iteration∗
Conor F. Hayes
National University of Ireland Galway (IE)
c.hayes13@nuigalway.ie
Diederik M. Roijers
Vrije Universiteit Brussel (BE)
& HU Univ. of Appl. Sci. Utrecht (NL)
Enda Howley
National University of Ireland Galway (IE)
Patrick Mannion
National University of Ireland Galway (IE)
ABSTRACT
In sequential multi-objective decision making (MODeM) settings, when the utility of a user is derived from a single execution of a policy, policies for the expected scalarised returns (ESR) criterion should be computed. In multi-objective settings, a user's preferences over objectives, or utility function, may be unknown at the time of planning. When the utility function of a user is unknown, multi-policy methods are deployed to compute a set of optimal policies. However, the state-of-the-art sequential MODeM multi-policy algorithms compute a set of optimal policies for the scalarised expected returns (SER) criterion. Algorithms that compute a set of optimal policies for the SER criterion utilise expected value vectors, which cannot be used when optimising for the ESR criterion. We propose a novel multi-policy multi-objective distributional value iteration (MODVI) algorithm that replaces value vectors with distributions over the returns and computes a set of optimal policies for the ESR criterion. MODVI is evaluated using several sequential multi-objective problem domains, where, for each problem, a set of optimal policies for the ESR criterion is computed.
KEYWORDS
Multi-objective; distributional; value iteration; expected scalarised returns
1 INTRODUCTION
When making decisions in the real world, trade-offs between multiple, often conflicting, objectives must be made [44]. In many real-world decision making settings, a policy is only executed once. For example, consider a government body planning to implement a tax incentive on imported electric vehicles. The tax incentive would increase sales of electric vehicles, reducing CO2 emissions; however, it may cause the sales of domestically produced petrol/diesel vehicles to plummet, resulting in local unemployment. The tax incentive will only be implemented once and, therefore, the government body must carefully consider the effects and likelihood of all potential outcomes. The current state-of-the-art multi-objective decision making (MODeM) literature focuses almost exclusively on computing policies that are optimal over multiple executions. Therefore, to fully utilise MODeM in the real world, we must develop algorithms to compute a policy, or set of policies, that are optimal given the single-execution nature of the problem.
In MODeM, a policy, or set of policies, is computed to maximise the user's preferences over objectives, or utility function. However, the user's utility function is often unknown at the time of planning [37]. Therefore, we are deemed to be in the unknown utility function scenario [22], where a set of optimal policies must be computed and returned to the user. Once the user's utility function becomes known, the user can select a policy from the computed set of optimal policies that best reflects their preferences [37].

∗This paper extends our AAMAS 2022 extended abstract [21].

Proc. of the Adaptive and Learning Agents Workshop (ALA 2022), Cruz, Hayes, da Silva, Santos (eds.), May 9-10, 2022, Online, https://ala2022.github.io/ . 2022.
MODeM distinguishes between two optimality criteria. In scenarios where the utility of a user is derived from multiple executions of a policy, the scalarised expected returns (SER) criterion should be optimised [22]. In scenarios where the utility of a user is derived from a single execution of a policy, the expected scalarised returns (ESR) criterion should be optimised [19, 20]. The SER criterion is the most commonly used optimality criterion in the sequential multi-objective planning literature [38]. In contrast to the SER criterion, the ESR criterion has been understudied by the single-agent MODeM community, with some exceptions [19, 20, 33, 36, 43].
The majority of multi-policy MODeM algorithms are designed to compute a set of optimal policies for the SER criterion [11, 17, 49]. However, if the utility function of a user is non-linear, the policies computed under the SER criterion and the ESR criterion can be different, given that the SER criterion and the ESR criterion utilise the utility function differently [39]. Moreover, sub-optimal policies can be computed if the choice of optimality criterion is not taken into consideration when planning [24]. Therefore, new methods that can compute policies for the ESR criterion must be developed.
The current state-of-the-art SER methods [30, 48] are fundamentally incompatible with the ESR criterion. When the utility function of a user is unknown, SER methods use expected value vectors to compute a set of optimal policies [48, 49]. However, expected value vectors cannot be used to compute policies under the ESR criterion [33]. Instead, a distribution over the returns, or return distribution, must be maintained to compute policies for the ESR criterion [23].
Given that, in the real world, policies are often only executed once, a user must have sufficient information about the potential positive or negative outcomes a policy may have. Maintaining a distribution over the returns for each computed policy ensures a user has sufficient information to take the potential outcomes into consideration at decision time [19, 20]. Utilising a distribution over the returns ensures the ESR criterion can be considered in real-world decision making scenarios.
In Section 3, we highlight why multi-policy methods for the SER criterion cannot be used for the ESR criterion and show why maintaining a distribution over the returns is necessary to compute a set of optimal policies under the ESR criterion. In Section 4, we present a novel multi-objective distributional value iteration (MODVI) algorithm that computes a set of optimal policies for the ESR criterion in scenarios when the utility function of a user is unknown at the time of planning. In Section 5, we show MODVI can compute a set of optimal policies for the ESR criterion using two sequential multi-objective benchmark problems, and show how these could be visualised for a user. Finally, we show that MODVI can compute a set of optimal policies for the ESR criterion in a practical real-world problem domain.

ALA '22, May 9-10, 2022, Online, https://ala2022.github.io/ Conor F. Hayes, Diederik M. Roijers, Enda Howley, and Patrick Mannion

[Figure 1: The unknown utility function scenario [22]. In the planning or learning phase, a MOMDP algorithm produces a solution set; in the selection phase, the user selects a single solution; in the execution phase, the selected solution is executed.]
2 BACKGROUND
In Section 2, we formally dene multi-objective Markov decision
processes, the unknown utility function scenario, and commonly
studied optimality criteria in multi-objective decision making.
2.1 Multi-Objective Markov Decision Processes
A multi-objective Markov decision process (MOMDP) is a tuple,
M=(S,A,T, 𝛾, R)
, where
S
is the state space,
A
is the set of
actions,
T:S × A × S → [0,1]
is the probabilistic transition
function,
𝛾
is the discount factor, and
R:S × A × S → R𝑛
is the
probabilistic vectorial reward function for each of the
𝑛
objectives.
An agent acts according to a policy
𝜋
:
S×A → [0,1]
. Given a state,
actions are selected according to a certain probability distribution.
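As a concrete illustration of this definition, a tabular MOMDP can be represented directly as data; the following sketch is our own, with hypothetical names, not code from the paper:

```python
# A minimal sketch of a tabular MOMDP M = (S, A, T, gamma, R);
# class and field names are illustrative, not from the paper.
from dataclasses import dataclass

@dataclass
class MOMDP:
    states: list        # S
    actions: list       # A
    T: dict             # (s, a, s') -> transition probability
    R: dict             # (s, a, s') -> reward vector, one entry per objective
    gamma: float = 1.0  # discount factor

# A two-state example with two objectives:
m = MOMDP(
    states=["s0", "s1"],
    actions=["go"],
    T={("s0", "go", "s1"): 0.9, ("s0", "go", "s0"): 0.1},
    R={("s0", "go", "s1"): (1, 0), ("s0", "go", "s0"): (0, 0)},
)

# Transition probabilities out of (s0, go) sum to one.
total = sum(p for (s, a, _), p in m.T.items() if (s, a) == ("s0", "go"))
```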
2.2 The Unknown Utility Function Scenario
In MODeM, a user’s preferences over objectives can be modelled
as a utility function [
37
]. However, a user’s utility function is often
unknown at the time of planning. In the taxonomy of MODeM, this
is known as the unknown utility function scenario, where a set of
optimal policies must be computed and returned to the user [
37
].
Figure 1 outlines the three phases in the unknown utility function
scenario: the planning phase, the selection phase, and the execution
phase [
22
]. During the planning phase a multi-policy algorithm
[
41
] is deployed to compute a set of policies that are optimal for all
possible utility functions [
50
]. The set of optimal policies is then
returned to the user. During the selection phase, the user selects a
policy from the computed set of optimal policies according to their
preferences. Finally, during the execution phase, the selected policy
is executed.
2.3 Optimality Criteria in Multi-Objective Decision Making

When applying a user's utility function, the MODeM literature distinguishes between two optimality criteria. Calculating the expected value of the return of a policy before applying the utility function leads to the scalarised expected returns (SER) optimisation criterion:

$$V^\pi_u = u\left(\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \mathbf{r}_t \,\middle|\, \pi, \mu_0\right]\right). \qquad (1)$$

In scenarios where the utility of a user is derived from the expected outcome over multiple executions of a policy, the SER criterion should be optimised [22]. SER is the most commonly used criterion in the multi-objective (single agent) planning literature [48, 49].
For SER, a set of non-dominated policies that are optimal for all possible utility functions is known as a coverage set. Applying the utility function to the returns and then calculating the expected value leads to the ESR optimisation criterion:

$$V^\pi_u = \mathbb{E}\left[u\left(\sum_{t=0}^{\infty} \gamma^t \mathbf{r}_t\right) \,\middle|\, \pi, \mu_0\right]. \qquad (2)$$

In scenarios where the utility function of a user is derived from single executions of a policy, the ESR criterion should be optimised [22]. The ESR criterion is the most commonly used criterion in the game theory literature on multi-objective games [32].
The current state-of-the-art multi-policy MODeM methods focus almost exclusively on the SER criterion [48, 49], leaving the ESR criterion largely understudied [19, 20, 26]. Given that the SER criterion and the ESR criterion utilise the utility function differently, SER methods cannot be used to compute a set of optimal policies for the ESR criterion. Additionally, a set of optimal policies under the SER criterion can exclude policies that are optimal under the ESR criterion [24]. In all decision-making problems where a policy is only executed once, the ESR criterion must be utilised. As such problems are salient [22], new methods to compute a set of optimal policies for the ESR criterion must be developed to ensure optimal decision making in the real world.
3 EXPECTED SCALARISED RETURNS WITH UNKNOWN UTILITY FUNCTIONS

The choice of optimality criterion in MODeM has implications for the policies computed. Recently, it has been shown that if a user's utility function is non-linear, the policies computed under the SER criterion and the ESR criterion can be different [39]¹. Moreover, sets of policies that are optimal under the SER criterion can potentially exclude policies that are optimal under the ESR criterion [24]. If the optimality criterion is not carefully chosen, one could potentially exclude policies that could lead to a higher utility.
SER methods cannot be used to compute policies for the ESR criterion. This is because SER methods determine optimality on the basis of expected value vectors [53]; these are insufficient to determine optimality in ESR settings, as we demonstrate with the example below. To highlight why different methods must be used, consider the lotteries $L_1$ and $L_2$ in Table 1. In this example the utility function, $u$, is unknown. To determine which lottery to play in Table 1 when optimising for the SER criterion, the expected value vectors for $L_1$ and $L_2$ must be computed first (see Equation 1):

$$\mathbb{E}(L_1) = 0.6\,(8, 2) + 0.4\,(6, 1) = (4.8, 1.2) + (2.4, 0.4) = (7.2, 1.6)$$
$$u(\mathbb{E}(L_1)) = u((7.2, 1.6))$$
$$\mathbb{E}(L_2) = 0.9\,(5, 1) + 0.1\,(8, 0) = (4.5, 0.9) + (0.8, 0) = (5.3, 0.9)$$
$$u(\mathbb{E}(L_2)) = u((5.3, 0.9))$$
Given that the utility function is unknown, Pareto dominance [31] can be used to define a partial ordering over expected value vectors for all monotonically increasing utility functions. For example, methods like [48-50] compute a set of policies known as the Pareto front, which are optimal under the SER criterion.

¹It is important to note that if the utility function is linear, the distinction between SER and ESR does not exist [23, 39]. Additionally, multi-policy approaches that compute a set of optimal policies using linear scalarisation weights [5, 47] fail to locate policies in non-convex regions of the Pareto front [45].

Table 1: Lottery $L_1$ has two possible returns: (8, 2) with probability 0.6 and (6, 1) with probability 0.4. Lottery $L_2$ has two possible returns: (5, 1) with probability 0.9 and (8, 0) with probability 0.1.

    L1: P(L1 = R)   R          L2: P(L2 = R)   R
        0.6         (8, 2)         0.9         (5, 1)
        0.4         (6, 1)         0.1         (8, 0)
To determine which lottery to play while optimising for the ESR criterion, the utility function must first be applied to each possible return, then the expected utility can be computed (see Equation 2):

$$\mathbb{E}(u(L_1)) = 0.6\,u((8, 2)) + 0.4\,u((6, 1))$$
$$\mathbb{E}(u(L_2)) = 0.9\,u((5, 1)) + 0.1\,u((8, 0))$$
Given that the utility function is unknown, it is impossible to compute the expected utility. Moreover, a distribution over the returns received from a policy execution must be maintained in order to optimise for the ESR criterion. Maintaining a distribution over the returns ensures the expected utility can be computed once the user's utility function becomes known during the selection phase. Therefore, while computing a set of optimal policies under the ESR criterion, a distribution over the returns must be maintained to determine optimality.
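The contrast between the two criteria for the lotteries of Table 1 can be sketched in a few lines of code. The non-linear utility u below is a hypothetical stand-in of our own choosing, since the actual utility function is unknown:

```python
# Sketch (not the paper's code): SER vs. ESR for lotteries L1 and L2,
# under a hypothetical non-linear utility u(v) = v[0] * v[1].

def expected_value(lottery):
    """Expected value vector: probability-weighted sum of return vectors."""
    n = len(lottery[0][1])
    return tuple(sum(p * r[i] for p, r in lottery) for i in range(n))

def expected_utility(lottery, u):
    """ESR: apply the utility to each return, then take the expectation."""
    return sum(p * u(r) for p, r in lottery)

L1 = [(0.6, (8, 2)), (0.4, (6, 1))]
L2 = [(0.9, (5, 1)), (0.1, (8, 0))]

u = lambda v: v[0] * v[1]  # hypothetical non-linear utility

# SER: utility of the expected value vector
ser_L1 = u(expected_value(L1))  # u((7.2, 1.6)) = 11.52
ser_L2 = u(expected_value(L2))  # u((5.3, 0.9)) = 4.77

# ESR: expected value of the per-return utilities
esr_L1 = expected_utility(L1, u)  # 0.6*16 + 0.4*6 = 12.0
esr_L2 = expected_utility(L2, u)  # 0.9*5  + 0.1*0 = 4.5
```

Under this particular utility both criteria prefer L1, but the numbers they assign differ, illustrating that the two criteria apply the utility function at different points.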
Prior to this work, no algorithm existed to compute sets of optimal policies in sequential settings for the ESR criterion when the utility function is unknown. Therefore, new methods must be formulated that compute a set of optimal policies for the ESR criterion in sequential MODeM settings in the unknown utility function scenario.
Recently, a new solution concept for ESR with unknown utility functions, called the ESR set, was proposed by Hayes et al. [23, 24]. However, their work did not propose any algorithms to compute ESR sets for sequential decision making problems. Hayes et al. [23, 24] define a multi-objective return distribution, $\mathbf{z}^\pi$, which represents the distribution over returns for a policy, $\pi$, such that

$$\mathbb{E}\,\mathbf{z}^\pi = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \mathbf{r}_t \,\middle|\, \pi, \mu_0\right]. \qquad (3)$$

A return distribution² is a distribution over the returns of a random vector when a policy, $\pi$, is executed [23].
Hayes et al. [23, 24] define ESR dominance, which gives a partial ordering over return distributions, where each return distribution is associated with a policy that could be executed. ESR dominance builds on the principles of first-order stochastic dominance [6, 18] in multivariate settings [4, 40]. Stochastic dominance gives a partial ordering over random variables and random vectors. Stochastic dominance has been used in economics [12], finance [3, 7] and game theory [15] to make decisions under uncertainty.

²The term value distribution is used in [8, 23, 33]. However, a value distribution is a distribution over the returns, not over values. Therefore, we prefer the term return distribution.
To calculate ESR dominance, the cumulative distribution function (CDF) of the given return distributions must be calculated. For a return distribution $\mathbf{z}^\pi$, the CDF of $\mathbf{z}^\pi$ is denoted by $F_{\mathbf{z}^\pi}$. A return distribution $\mathbf{z}^\pi$ ESR dominates a return distribution $\mathbf{z}^{\pi'}$ if the following is true:

$$\mathbf{z}^\pi >_{ESR} \mathbf{z}^{\pi'} \iff \forall \mathbf{v}\colon F_{\mathbf{z}^\pi}(\mathbf{v}) \leq F_{\mathbf{z}^{\pi'}}(\mathbf{v}) \;\land\; \exists \mathbf{v}\colon F_{\mathbf{z}^\pi}(\mathbf{v}) < F_{\mathbf{z}^{\pi'}}(\mathbf{v}). \qquad (4)$$

Hayes et al. [23] prove that if a return distribution $\mathbf{z}^\pi$ ESR dominates a return distribution $\mathbf{z}^{\pi'}$, then $\mathbf{z}^\pi$ has a higher expected utility than $\mathbf{z}^{\pi'}$ for all strictly monotonically increasing utility functions, $u$:

$$\mathbf{z}^\pi >_{ESR} \mathbf{z}^{\pi'} \implies \mathbb{E}(u(\mathbf{z}^\pi)) > \mathbb{E}(u(\mathbf{z}^{\pi'})). \qquad (5)$$

Finally, Hayes et al. [23, 24] define a set of non-dominated return distributions known as the ESR set, which is defined as follows:

$$ESR(\Pi) = \{\pi \in \Pi \mid \nexists\, \pi' \in \Pi\colon \mathbf{z}^{\pi'} >_{ESR} \mathbf{z}^\pi\}. \qquad (6)$$
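A minimal sketch of an ESR dominance check for discrete return distributions follows; it is our own illustration, not the authors' implementation, and it compares the CDFs on the grid spanned by the support coordinates, which is where a discrete multivariate CDF changes value:

```python
# Sketch: ESR dominance (Equation 4) for discrete multivariate return
# distributions, represented as dicts {return_vector: probability}.
from itertools import product

def cdf(dist, v):
    """F_z(v) = P(Z <= v), with <= taken componentwise."""
    return sum(p for r, p in dist.items()
               if all(ri <= vi for ri, vi in zip(r, v)))

def support_grid(z, z_prime):
    """Grid of all coordinate combinations; the CDFs only change here."""
    dims = len(next(iter(z)))
    coords = [sorted({r[i] for r in list(z) + list(z_prime)})
              for i in range(dims)]
    return product(*coords)

def esr_dominates(z, z_prime, eps=1e-12):
    """True iff z >_ESR z': F_z <= F_z' everywhere, strictly somewhere."""
    pts = list(support_grid(z, z_prime))
    leq = all(cdf(z, v) <= cdf(z_prime, v) + eps for v in pts)
    lt = any(cdf(z, v) < cdf(z_prime, v) - eps for v in pts)
    return leq and lt

# z concentrates mass on componentwise-higher returns than z_prime,
# so z ESR dominates z_prime but not vice versa.
z = {(8, 2): 0.6, (6, 1): 0.4}
z_prime = {(6, 1): 0.6, (5, 0): 0.4}
```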
4 MULTI-OBJECTIVE DISTRIBUTIONAL VALUE ITERATION

To compute a set of optimal policies for the ESR criterion when the utility function of a user is unknown, we propose a novel multi-objective distributional value iteration (MODVI) algorithm. MODVI maintains sets of return distributions for each state and uses ESR dominance [23] to compute a set of non-dominated return distributions, known as the ESR set.

The state-of-the-art multi-objective decision making (MODeM) algorithms use expected value vectors to compute sets of optimal policies [48-50]. However, expected value vectors can only be used when optimising for the SER criterion. As previously highlighted, to compute a set of optimal policies for the ESR criterion, expected value vectors must be replaced with return distributions. Generally, expected value MODeM algorithms utilise the Bellman operator [9] to compute the expected value vectors for each state. Given that our approach is distributional, we adopt the distributional Bellman operator [8], $\mathcal{T}^\pi_D$, to update the return distribution for each state-action pair:

$$\mathcal{T}^\pi_D\, \mathbf{z}(s, a) \stackrel{D}{=} \mathbf{r}_{s,a} + \gamma\, \mathbf{z}(s', a'). \qquad (7)$$
To represent a return distribution in multi-objective settings, we use a multivariate categorical distribution similar to the distributions used by Reymond et al. [33] and Bellemare et al. [8]. The categorical distribution is parameterised by a number of atoms, $N \in \mathbb{N}$, where the distribution has a dimension per objective, $n$. The atoms outline the width of each category and are bounded by the minimum returns, $\mathbf{R}_{min}$, and maximum returns, $\mathbf{R}_{max}$. The multivariate categorical distribution has a set of atoms defined as follows [33]:

$$\{\mathbf{z}_{i \ldots k} = (R_{min_0} + i \Delta z_0, \ldots, R_{min_n} + k \Delta z_n)\colon 0 \leq i < N, \ldots, 0 \leq k < N\}, \qquad (8)$$

where each objective has a separate $R_{min_b}, R_{max_b}$ for $0 < b \leq n$, and $\Delta \mathbf{z} = \frac{\mathbf{R}_{max} - \mathbf{R}_{min}}{N - 1}$.
The distribution is a set of $N$ discrete categories, where each category, $p_i$, represents the probability of receiving a return [33]. To ensure the distribution is an accurate representation of the returns of the execution of a policy, it is crucial that the number of atoms is selected to sufficiently cover the range of values from $\mathbf{R}_{min}$ to $\mathbf{R}_{max}$. For example, if $\gamma = 1$ and reward values are expected to be integers in the range $\mathbf{R}_{min} = [0, 0]$ to $\mathbf{R}_{max} = [1, 10]$, $N = 11$ is the required value to ensure that the distribution is represented without aliasing between different reward levels.
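The atom layout of Equation 8 can be sketched as follows; the helper is our own, assuming the per-objective spacing described above:

```python
# Sketch of the per-objective atom positions behind Equation 8:
# N atoms per objective, spaced delta_b = (Rmax_b - Rmin_b) / (N - 1).
def atom_support(r_min, r_max, n_atoms):
    """One list of atom positions per objective; the joint support of the
    multivariate categorical distribution is their Cartesian product."""
    deltas = [(hi - lo) / (n_atoms - 1) for lo, hi in zip(r_min, r_max)]
    return [[lo + i * d for i in range(n_atoms)]
            for lo, d in zip(r_min, deltas)]

# The paper's example: integer returns between Rmin = [0, 0] and
# Rmax = [1, 10] with gamma = 1 need N = 11 atoms to avoid aliasing.
support = atom_support([0, 0], [1, 10], 11)
```

For the second objective this places one atom on each integer return from 0 to 10.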
To update the multivariate categorical distribution, we utilise the state space, action space and reward function of the model. During an update of the multivariate categorical distribution, we iterate over each atom, $j$, for each objective. To update the return distribution, $\mathbf{z}_s$, for state $s$, we compute the distributional Bellman update $\hat{\mathcal{T}} \mathbf{z}_{s,j} = \mathbf{r}_{s,a,s'} + \gamma\, \mathbf{z}_{s',j}$ for each atom $j$, for a given reward $\mathbf{r}_{s,a,s'}$ and return distribution, $\mathbf{z}_{s'}$, for state $s'$. We then distribute the probability, $p$, for the atom, $j$, of the return distribution, $p_j(\mathbf{z}_{s'})$, in state $s'$, to the corresponding atom of the updated return distribution, $\mathbf{z}_s$, for state $s$. Therefore, the return distribution, $\mathbf{z}_s$, for state $s$ is equivalent to the return distribution, $\mathbf{z}_{s'}$, in state $s'$, shifted relative to the reward, $\mathbf{r}_{s,a,s'}$.
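For a single objective, the shift described above can be sketched as follows; this is a hypothetical helper of our own, while the paper's update acts per objective on the multivariate distribution:

```python
# Sketch: mass at atom value v moves to the atom nearest r + gamma * v,
# clipped to the support bounds (single-objective case for brevity).
def shift_distribution(probs, atoms, reward, gamma=1.0):
    """Return the shifted categorical distribution over the same atoms."""
    out = [0.0] * len(atoms)
    for p, v in zip(probs, atoms):
        if p == 0.0:
            continue
        target = min(max(reward + gamma * v, atoms[0]), atoms[-1])
        # index of the atom closest to the shifted return
        j = min(range(len(atoms)), key=lambda i: abs(atoms[i] - target))
        out[j] += p
    return out

atoms = [0, 1, 2, 3, 4]
probs = [0.0, 0.7, 0.3, 0.0, 0.0]   # mass on returns 1 and 2
shifted = shift_distribution(probs, atoms, reward=2)
```

Here the mass on returns 1 and 2 moves to returns 3 and 4, i.e. the whole distribution shifts by the reward.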
At each iteration, $k$, of MODVI, for each state, $s$, and action, $a$, a set of optimal return distributions is backed up once. In Equation 9, the Bellman operator has been replaced with the distributional Bellman operator [8]:

$$Q_{k+1}(s, a) \leftarrow \bigoplus_{s'} T(s' \mid s, a)\, [\mathbf{r}_{s,a,s'} + \gamma\, Z_k(s')], \qquad (9)$$

where $Q_{k+1}(s, a)$ and $Z_k(s')$ represent sets of return distributions, $\oplus$ denotes the cross-sum between sets of return distributions, and $T(s' \mid s, a)$ represents the probability of transitioning to state $s'$ from state $s$ after taking action $a$.
During a distributional Bellman backup, each return distribution, $\mathbf{z}_{s'}$, in the set $Z_k(s')$, is updated with the reward, $\mathbf{r}_{s,a,s'}$, for action, $a$, in state, $s$, as follows: $\{\mathbf{r}_{s,a,s'} + \gamma\, \mathbf{z}_{s'} : \forall \mathbf{z}_{s'} \in Z_k(s')\}$. Each updated return distribution in the set for state $s'$ is then multiplied by the transition probability, $T(s' \mid s, a)$. The cross sum of the resulting sets of updated return distributions is then computed over each possible next state, $s'$. The cross sum between two sets of return distributions, $X \bigoplus Y$, is defined as follows: $\{\mathbf{x} + \mathbf{y} : \mathbf{x} \in X \land \mathbf{y} \in Y\}$, where $\mathbf{x}$ and $\mathbf{y}$ are return distributions. For a detailed overview of how a set of return distributions for an action in a MOMDP can be computed, please consider the example outlined in Figure 2.
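The scale, shift, and cross-sum steps of this backup can be sketched in code for one branch pairing from the worked example in Figure 2; the helper names are our own:

```python
# Sketch of one branch of the distributional backup in Equation 9,
# reproducing z1 = x1_hat + y1_hat from the worked example in Figure 2.
def scale(dist, p):
    """Weight a return distribution by a transition probability."""
    return {r: p * q for r, q in dist.items()}

def merge(x, y):
    """Combine two probability-weighted sub-distributions into one."""
    out = dict(x)
    for r, q in y.items():
        out[r] = out.get(r, 0.0) + q
    return out

def cross_sum(X, Y):
    """All pairings across two sets of (sub-)distributions."""
    return [merge(x, y) for x in X for y in Y]

x1 = {(0, 1): 0.7, (2, 0): 0.3}                       # at s1
y1 = {(1, 0): 0.75, (0, 2): 0.25}                     # at s2
x1 = {(r1 + 1, r2): p for (r1, r2), p in x1.items()}  # shift by reward [1, 0]
Z = cross_sum([scale(x1, 0.9)], [scale(y1, 0.1)])     # T = 0.9 and 0.1
z1 = Z[0]
```

The resulting z1 assigns mass 0.63 to (1, 1), 0.27 to (3, 0), 0.075 to (1, 0) and 0.025 to (0, 2), matching Figure 2(f).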
To compute a set of ESR non-dominated policies for each state, we define an algorithm known as ESRPrune (Algorithm 1), which computes a set of ESR non-dominated policies by removing ESR dominated return distributions from a given set:

$$Z_{k+1}(s) \leftarrow \mathrm{ESRPrune}\left(\bigcup_{a} Q_{k+1}(s, a)\right). \qquad (10)$$

Equation 10 calculates the set of return distributions for a given state, $s$, by taking the union of each set of return distributions over each action, $a$. The resulting set of return distributions is then passed to the ESRPrune algorithm as input.
ESRPrune utilises ESR dominance as defined by Hayes et al. [23, 24] (see Equation 4). Like Pareto dominance, ESR dominance is transitive [52]; therefore we can apply ESRPrune in sequence. To compute ESR dominance, the cumulative distribution function (CDF) of each
Figure 2: A worked example outlining the necessary steps to compute a set of return distributions for a MOMDP with stochastic state transitions.

(a) An action, $a$, in a MOMDP with stochastic state transitions. States $s_1$ and $s_2$ have sets of non-dominated return distributions $X = \{\mathbf{x}_1, \mathbf{x}_2\}$ and $Y = \{\mathbf{y}_1, \mathbf{y}_2\}$. For action $a$, transitioning from $s_0$ to $s_1$ occurs with a probability of 0.9 and a reward of [1, 0] is received. For action $a$, transitioning from $s_0$ to $s_2$ occurs with a probability of 0.1 and a reward of [0, 0] is received.

(b) The return distributions $\mathbf{x}_1, \mathbf{x}_2, \mathbf{y}_1$ and $\mathbf{y}_2$ in the sets of policies for $s_1$ and $s_2$. To compute a set of policies for state $s_0$, the distributional Bellman operator is utilised (Equation 9).

    π    r1  r2  P(r1, r2)
    x1    0   1  0.7
          2   0  0.3
    x2    2   1  0.5
          2   2  0.5
    y1    1   0  0.75
          0   2  0.25
    y2    0   1  0.9
          3   0  0.1

(c) The reward, $\mathbf{r}_{s,a,s'}$, is used to update each return distribution for states $s_1$ and $s_2$. For example, $\dot{\mathbf{x}}_1 = \mathbf{r}_{s,a,s'} + \gamma \mathbf{x}_1$. For this example $\gamma = 1$.

    π     r1  r2  P(r1, r2)
    ẋ1     1   1  0.7
           3   0  0.3
    ẋ2     3   1  0.5
           3   2  0.5
    ẏ1     1   0  0.75
           0   2  0.25
    ẏ2     0   1  0.9
           3   0  0.1

(d) Each return distribution for $s_1$ and $s_2$ is then multiplied by the transition probabilities, $T(s' \mid s, a)$. For example, $\hat{\mathbf{x}}_1 = \dot{\mathbf{x}}_1 \times T(s' \mid s, a)$.

    π     r1  r2  P(r1, r2)
    x̂1     1   1  0.63
           3   0  0.27
    x̂2     3   1  0.45
           3   2  0.45
    ŷ1     1   0  0.075
           0   2  0.025
    ŷ2     0   1  0.09
           3   0  0.01

(e) A set of return distributions, $Z$, is computed for state $s_0$. The cross sum, $\bigoplus$, is utilised to sum all combinations of return distributions from the previously updated sets. The set of return distributions at state $s_0$ is defined as follows: $Z = X \bigoplus Y = \{\hat{\mathbf{x}} + \hat{\mathbf{y}} : \hat{\mathbf{x}} \in X \land \hat{\mathbf{y}} \in Y\}$, where $\hat{\mathbf{x}}$ and $\hat{\mathbf{y}}$ are return distributions. The resulting set contains $\mathbf{z}_1 = \hat{\mathbf{x}}_1 + \hat{\mathbf{y}}_1$, $\mathbf{z}_2 = \hat{\mathbf{x}}_1 + \hat{\mathbf{y}}_2$, $\mathbf{z}_3 = \hat{\mathbf{x}}_2 + \hat{\mathbf{y}}_1$, and $\mathbf{z}_4 = \hat{\mathbf{x}}_2 + \hat{\mathbf{y}}_2$.

(f) The set of return distributions, $Z$, at state $s_0$. $Z$ will be passed to the ESRPrune algorithm.

    π    r1  r2  P(r1, r2)
    z1    1   1  0.63
          3   0  0.27
          1   0  0.075
          0   2  0.025
    z2    1   1  0.63
          3   0  0.28
          0   1  0.09
    z3    3   1  0.45
          3   2  0.45
          1   0  0.075
          0   2  0.025
    z4    3   1  0.45
          3   2  0.45
          0   1  0.09
          3   0  0.01
return distribution in the given set must be calculated.
ESRPrune
iterates over the given set of return distributions and compares the
CDFs of the return distributions to determine which are ESR non-
dominated. The return distributions that are ESR dominated are
removed from the set. A set of non-dominated return distributions
is known as the ESR set [23].
Algorithm 1: ESRPrune
    Input: Z ← a set of return distributions
    Z* ← ∅
    while Z ≠ ∅ do
        z ← the first element of Z
        for z' ∈ Z do
            if z' >_ESR z then
                z ← z'
            end
        end
        Remove z and all return distributions ESR-dominated by z from Z
        Add z to Z*
    end
    Return Z*
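A direct sketch of Algorithm 1 in code follows; this is our own illustration, where esr_dominates implements Equation 4 for discrete return distributions represented as dicts from return vectors to probabilities:

```python
# Sketch of Algorithm 1 (ESRPrune) over discrete return distributions.
def cdf(dist, v):
    """F_z(v) = P(Z <= v), componentwise."""
    return sum(p for r, p in dist.items()
               if all(ri <= vi for ri, vi in zip(r, v)))

def esr_dominates(z, zp, eps=1e-12):
    """Equation 4, checked on the union of both supports."""
    pts = set(z) | set(zp)
    return (all(cdf(z, v) <= cdf(zp, v) + eps for v in pts)
            and any(cdf(z, v) < cdf(zp, v) - eps for v in pts))

def esr_prune(Z):
    """Remove ESR-dominated return distributions; keep the ESR set."""
    Z, kept = list(Z), []
    while Z:
        z = Z[0]
        for zp in Z[1:]:
            if esr_dominates(zp, z):
                z = zp  # climb to a dominating distribution
        # drop z and everything z dominates from Z; keep z
        Z = [zp for zp in Z if zp is not z and not esr_dominates(z, zp)]
        kept.append(z)
    return kept

good = {(8, 2): 0.6, (6, 1): 0.4}
bad = {(6, 1): 0.6, (5, 0): 0.4}   # ESR-dominated by `good`
other = {(9, 0): 1.0}              # incomparable with `good`
pruned = esr_prune([bad, good, other])
```

Only the dominated distribution is removed; incomparable distributions both survive, which is what makes the ESR set a set rather than a single optimum.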
To highlight how ESRPrune determines which return distributions are ESR non-dominated, consider the example outlined in Figure 3(a), Figure 3(b) and Figure 4. To determine ESR dominance, ESRPrune compares a return distribution $X$ with a return distribution $Y$. The CDF for $X$ is denoted by $F_X$ (Figure 3(a)) and the CDF for $Y$ is denoted by $F_Y$ (Figure 3(b)). In order for $X >_{ESR} Y$, the following condition must be true [23]:

$$\forall \mathbf{v}\colon F_X(\mathbf{v}) \leq F_Y(\mathbf{v}) \;\land\; \exists \mathbf{v}\colon F_X(\mathbf{v}) < F_Y(\mathbf{v}).$$

Additionally, if $X >_{ESR} Y$, the following condition must also be true:

$$\forall \mathbf{v}\colon F_X(\mathbf{v}) - F_Y(\mathbf{v}) \leq 0 \;\land\; \exists \mathbf{v}\colon F_X(\mathbf{v}) - F_Y(\mathbf{v}) < 0.$$
[Figure 3: The CDFs, $F_X$ and $F_Y$, of two return distributions, $X$ and $Y$. (a) The CDF, $F_X$, of a return distribution $X$. $X$ is a multivariate normal probability distribution, with mean $\mu = [1, 2]$ and covariance matrix $\Sigma = \begin{pmatrix} 0.5 & 0.25 \\ 0.25 & 0.5 \end{pmatrix}$. (b) The CDF, $F_Y$, of a return distribution $Y$. $Y$ is a multivariate normal probability distribution, with mean $\mu = [1, 1]$ and covariance matrix $\Sigma = \begin{pmatrix} 0.15 & 0.05 \\ 0.05 & 0.15 \end{pmatrix}$. Axes: objectives $o_1$ and $o_2$ against probability.]
[Figure 4: The difference in probability mass for $F_X - F_Y$, which is used to visualise the requirements for ESR dominance. A dotted line (a) is drawn to highlight that $F_X - F_Y > 0$ for at least one point. Therefore, $X$ does not ESR dominate $Y$.]
Figure 4 highlights the difference in probability for $F_X - F_Y$. The dotted line in Figure 4, labelled (a), highlights that, for at least one point, $F_X - F_Y > 0$. Therefore, the return distribution $X$ cannot ESR dominate the return distribution $Y$.
Algorithm 2: MODVI
    1  Initialise all return distributions and sets
    2  while not converged do
    3      for s ∈ S do
    4          for a ∈ A do
    5              Q_{k+1}(s, a) ← ⊕_{s'} T(s'|s, a) [R(s, a, s') + γ Z_k(s')]
    6          end
    7          Z_{k+1}(s) ← ESRPrune(∪_a Q_{k+1}(s, a))
    8      end
    9  end
Algorithm 2 describes the MODVI algorithm³. On initialisation of MODVI, a set of return distributions is generated for each state-action pair. For infinite horizon settings, each set contains a single return distribution that is randomly initialised, where an atom is selected at random and a probability mass of 1.0 is assigned to that atom. In finite horizon settings, each return distribution is initialised by assigning a probability mass of 1.0 to the atom which corresponds to the return [0, 0]. During each iteration of MODVI, a set of return distributions is computed (Algorithm 2, Line 5) for each state, $s$, and action, $a$. The union of the resulting sets of return distributions is then passed to the ESRPrune algorithm to remove the dominated return distributions. Once ESRPrune (Algorithm 2, Line 7) has been executed for the given iteration of MODVI, a set of non-dominated return distributions is backed up for the state $s$. Once MODVI has converged, a set of ESR non-dominated policies, or the ESR set, is available at the start state, $s_0$.

³Algorithm 2 describes MODVI for infinite horizon settings. However, it is trivial to alter MODVI for finite horizon settings.
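The finite-horizon initialisation described above amounts to a one-hot distribution over the atoms; a trivial sketch with hypothetical names:

```python
# Sketch: finite-horizon initialisation places probability mass 1.0 on
# the atom corresponding to the zero return; infinite-horizon
# initialisation would instead pick the atom index at random.
def one_hot_distribution(n_atoms, index):
    probs = [0.0] * n_atoms
    probs[index] = 1.0
    return probs

# e.g. 23 atoms, with the atom for the zero return assumed at index 0
init = one_hot_distribution(23, 0)
```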
5 EXPERIMENTS
In this section we show that MODVI can compute a set of optimal
policies for the ESR criterion for two multi-objective benchmark
problems and a practical multi-objective real-world problem.
5.1 Space Traders
First, we evaluate MODVI on a multi-objective benchmark problem known as Space Traders [43]. Space Traders is a problem with nine policies and a small number of returns per policy. Therefore, it is possible to visualise each policy in the ESR set, illustrating how policies can be returned to a user during the selection phase in practice. Of course, for larger problems, the user could select subsets of the policies to visualise and compare.

Space Traders has two timesteps, two non-terminal states and three available actions per state. In Space Traders, an agent must deliver cargo from its home planet (planet A) to some destination planet (planet B) and then return home to planet A. While delivering the cargo, the agent must avoid being intercepted by space pirates. An agent acting in the Space Traders environment aims to complete the mission and minimise time. An agent receives a reward of 1 for returning home to planet A and completing the mission, and at all other states the agent receives a reward of 0 for mission success. After each action, the agent receives a negative reward corresponding to the time taken to reach the next planet. Finally, after taking each action there is a probability the agent will be intercepted by space pirates. If the agent is intercepted by space pirates, the agent will receive a reward of 0 for mission success, a negative time penalty, and the episode will terminate. All remaining implementation details for the Space Traders environment are available in the works of Vamplew et al. [42, 43].
MODVI has the following parameters: $\gamma = 1$, $N = 23$, $\mathbf{R}_{min} = [0, -22]$ and $\mathbf{R}_{max} = [1, 0]$. Figure 7(a) outlines the six return distributions in the computed ESR set. Figure 5 plots the expected value vectors of each return distribution in the ESR set and also plots the expected value vectors for the Pareto front [43]. It is important to note that the ESR set for Space Traders contains a policy that is not present on the Pareto front. The Pareto front is a set of optimal policies for the SER criterion. Therefore, certain policies that are optimal under the ESR criterion are not optimal under the SER criterion. In real-world decision making, incorrectly selecting an optimality criterion can lead to sub-optimal performance, given that some optimal policies may not be returned to the user.
During the selection phase, visualisations like Figure 5 are returned to the user to aid in their decision making. However, in Figure 5, the details of the return distributions for each policy in the ESR set are lost. Computing expected value vectors for each return distribution reduces the information available about a policy, given that the information about each individual return of a policy is no longer available. As already highlighted, under the ESR criterion the utility of a user is derived from a single execution of a policy. Therefore, it is crucial that a user has sufficient information available at decision time, given that a policy may only be executed once. Figure 6 visualises each potential return and the corresponding probability of the return distributions in the ESR set. In Figure 6, each return distribution has a shape, where the position of each shape corresponds to a return and the colour of each shape corresponds to the
[Figure 5: The expected value vectors of the return distributions in the ESR set (red) are plotted against the expected value vectors of the Pareto front (blue). Axes: objective 2 (from −20 to 0) against objective 1 (from 0.8 to 1.0).]
[Figure 6: The return distributions in the ESR set computed by MODVI. Each shape corresponds to a computed policy in the ESR set, where the location of the shape corresponds to a return in the policy. Colours correspond to the probability of receiving the specific return when executing the policy. Axes: objective 2 (from −20 to 0) against objective 1 (from 0.0 to 1.0), with a colour bar for probability.]
probability of receiving the return. In practice, a user would be able to choose which return distributions in the ESR set to display at a given moment, allowing the user to compare and contrast different policies individually. Figure 6 provides an intuitive aid which can be returned to a user when making decisions under the ESR criterion.
5.2 Resource Gathering
Next, we evaluate MODVI on the Resource Gathering benchmark [5]. Resource Gathering is a multi-objective benchmark problem with intuitive trade-offs between objectives, motivating the need to consider the ESR criterion in real-world decision making. MODVI is evaluated on a four-objective version of Resource Gathering, where time is added as an objective. The Resource Gathering environment is shown in Figure 7(b). The agent starts in a home state and navigates the grid environment to collect the available resources ($R_1$ and $R_2$) while avoiding the enemy states (†1 and †2) before returning home again. At each timestep, the agent receives a reward of [−1, 0, 0, 0]. If the agent returns to the home state having gathered the available resources, the agent receives one of the following rewards: [−1, 0, 10, 0] for collecting $R_1$, [−1, 0, 0, 10] for collecting
Multi-Objective Distributional Value Iteration ALA ’22, May 9-10, 2022, Online, https://ala2022.github.io/
π     r1    r2    P(r1, r2)
π1     1   -22    1.0
π2     0    -1    0.1
       1   -16    0.9
π3     0    -7    0.085
       0     0    0.15
       1    -8    0.765
π4     0     0    0.15
       0   -10    0.85
π5     0     0    0.2775
       1     0    0.7225
π6     0    -6    0.135
       0    -1    0.1
       1    -6    0.765

(a) The return distributions in the ESR set for the Space Traders environment, with γ = 1.

(b) The grid for the Resource Gathering environment. †1 and †2 are enemy states. 𝑅1 and 𝑅2 are the resources that need to be gathered, before returning to the home state.

π     r1    r2    r3    r4    P(r1, r2, r3, r4)
π1   -18     0    10    10    1.0
π2   -12     0    10     0    1.0
π3   -16   -10     0     0    0.1
     -14     0    10    10    0.9
π4   -12   -10     0     0    0.1
     -16     0    10    10    0.9
π5   -12   -10     0     0    0.1
     -10     0    10     0    0.9
π6   -14   -10     0     0    0.09
     -12   -10     0     0    0.1
     -12     0    10    10    0.81
π7   -14   -10     0     0    0.09
     -12   -10     0     0    0.1
      -8     0    10     0    0.81
π8   -10     0     0    10    1.0

(c) The return distributions in the ESR set for the Resource Gathering environment, with γ = 1.

π     r1     r2      P(r1, r2)
π1    -1   -0.06    0.0995
      -1    0.0     0.3210
       0   -0.06    0.3778
       0    0.0     0.2017
π2    -1   -0.06    0.0597
      -1    0.0     0.3609
       0   -0.06    0.2264
       0    0.0     0.3530
π3    -1    0.0     0.4206
       0    0.0     0.5794

(d) The return distributions in the ESR set for the Control Problem environment, with γ = 1.

Figure 7: Figure 7(a), Figure 7(c) and Figure 7(d) show the return distributions in the ESR set computed by the MODVI algorithm for the Space Traders, Resource Gathering and Control Problem. Figure 7(b) shows the grid layout for the Resource Gathering environment.
𝑅2, and [−1, 0, 10, 10] for collecting 𝑅1 and 𝑅2. The agent must avoid the enemy states. If the agent enters an enemy state, there is a 0.1 chance the agent will be attacked. If the agent is attacked in an enemy state, the agent receives a reward of [−10, −10, 0, 0]. In this case, the agent also receives a time penalty for being attacked and the episode terminates.
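The reward structure described above can be summarised in a short sketch. The function below models only the reward branches, not the full grid dynamics: the state is reduced to flags for the agent's position and gathered resources, and the 0.1 attack probability in an enemy state is assumed to be sampled by the caller (here passed in as an explicit `attacked` flag); the function name and signature are illustrative.

```python
def resource_gathering_reward(at_home, gathered_r1, gathered_r2, attacked):
    """Four-objective reward [time, attack, R1, R2] for a single
    timestep of the Resource Gathering environment described above."""
    if attacked:
        # Attack in an enemy state: time and attack penalties,
        # and the episode terminates.
        return [-10, -10, 0, 0]
    if at_home and (gathered_r1 or gathered_r2):
        # Returning home with resources yields the resource rewards.
        return [-1, 0, 10 if gathered_r1 else 0, 10 if gathered_r2 else 0]
    # Ordinary step: only the time objective is penalised.
    return [-1, 0, 0, 0]

print(resource_gathering_reward(True, True, True, False))    # [-1, 0, 10, 10]
print(resource_gathering_reward(False, False, False, False)) # [-1, 0, 0, 0]
```

Under this reward structure, a policy's return distribution is induced entirely by the stochastic attack outcomes along its route, which is why the probabilities in Figure 7(c) are products of 0.1 and 0.9 factors.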
For Resource Gathering, the following parameters were set for MODVI: 𝛾 = 1, 𝑁 = 25, R𝑚𝑖𝑛 = [−24, −24, −14, −14] and R𝑚𝑎𝑥 = [0, 0, 10, 10]. Figure 7(c) outlines the return distributions in the ESR
set for Resource Gathering. The ESR set contains eight policies,
where each policy gathers one or both resources before returning
home. An important aspect of the distributional approach applied by MODVI is that a user will have sufficient information about the trade-offs between each objective for each policy in the ESR set. For example, there is a clear trade-off between objectives in 𝜋3 and 𝜋6 in Figure 7(c). When considering 𝜋3, fourteen timesteps are taken to gather both resources and the agent enters one enemy state with a 0.1 chance of being attacked. When considering 𝜋6, twelve timesteps are taken to gather both resources, but the agent must enter both enemy states, which pose a 0.09 chance and a 0.1 chance of being attacked. Using a distributional approach ensures a user has sufficient information to understand the trade-offs between objectives across different policies. In Resource Gathering, a user looking to minimise time, while also being indifferent about being attacked, may select 𝜋6 having fully understood the probabilities of being attacked. Therefore, having sufficient critical information available at decision time enables the user to make more informed decisions that could potentially better reflect their preferences over objectives, when compared to expected value vector based methods.
5.3 Feedtank Control Problem
Finally, we evaluate MODVI on the risk-based Feedtank Control Problem (FCP) proposed by Geibel and Wysotzki [16], which is a practical real-world problem domain that highlights how MODVI and the ESR criterion can be applied. In FCP, the agent must control the outflow of a tank that lies upstream of a distillation column, while minimising the risk of the tank overflowing. The purpose of the distillation column is to separate two substances. There are a finite number of timesteps 0, ..., 𝑇, where 𝑡 denotes the current timestep. The feed-stream of the distillation column, or outflow of the tank, is denoted by 𝐹(𝑡) and is controlled by the agent. The tank level 𝑦(𝑡) depends on the two stochastic inflow streams characterized by the flow rates 𝐹1(𝑡) and 𝐹2(𝑡). The dynamics of the tank level are outlined in the following equation:

𝑦(𝑡+1) = 𝑦(𝑡) + 𝐴⁻¹𝛿(𝑡) (∑𝑗=1,2 𝐹𝑗(𝑡) − 𝐹(𝑡)). (11)
The tank level must not violate the following constraint:

𝑦𝑚𝑖𝑛 ≤ 𝑦(𝑡) ≤ 𝑦𝑚𝑎𝑥. (12)

The inflows 𝐹𝑗(𝑡) are random and controlled by probability distributions (Table 2). Therefore, the inflows may also cause the tank level to violate the constraint in Equation 12. At each timestep there is also a chance, 𝑝, that the inflows may randomly violate the constraint in Equation 12. To take a random constraint violation into consideration, the probabilities for each inflow in Table 2 must be multiplied by 1 − 𝑝. If the tank level violates the constraint in Equation 12, the system shuts down, the agent enters a terminal state, and receives a reward of [−1, 0]. The agent takes an action, 𝑎, to control the outflow of the tank. If the action does not cause a violation of Equation 12, the agent receives a reward defined as follows:

r𝑠,𝑎,𝑠′ = [0, −|𝐹(𝑡) − 𝐹𝑠𝑝𝑒𝑐|], (13)

where 𝐹(𝑡) is the discretised action value for the selected action that adheres to 𝐹𝑚𝑖𝑛 ≤ 𝐹(𝑡) ≤ 𝐹𝑚𝑎𝑥, where 𝐹𝑚𝑖𝑛 and 𝐹𝑚𝑎𝑥 bound the interval of admissible actions, and 𝐹𝑠𝑝𝑒𝑐 is the optimal action value. The state parameters for the FCP are defined as follows:

𝑠(𝑡) = [𝑡, 𝑦(𝑡)]. (14)

Finally, the initial state, 𝑠0, is defined as follows: [0, 𝑦0]. For the version of FCP used in this paper there are 11 actions available to the agent, with 8 timesteps.
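Equations 11 to 14 can be combined into a one-step simulator. The sketch below is illustrative rather than the paper's implementation; the constants match the FCP parameter settings used in this section (𝐴⁻¹𝛿(𝑡) = 0.1, 𝐹𝑠𝑝𝑒𝑐 = 0.8, [𝑦𝑚𝑖𝑛, 𝑦𝑚𝑎𝑥] = [0.25, 0.75]), while the sampling of the total inflow from Table 2 is left to the caller.

```python
Y_MIN, Y_MAX = 0.25, 0.75   # tank-level constraint (Equation 12)
F_SPEC = 0.8                # optimal outflow value
A_INV_DELTA = 0.1           # A^{-1} * delta(t)

def fcp_step(t, y, outflow, inflow_total):
    """One step of the feedtank: Equation 11 dynamics, the constraint of
    Equation 12 and the two-objective reward of Equation 13.
    inflow_total is the summed inflow, sampled by the caller from Table 2.
    Returns the next state [t, y(t)] (Equation 14), reward, and done."""
    y_next = y + A_INV_DELTA * (inflow_total - outflow)   # Equation 11
    if not (Y_MIN <= y_next <= Y_MAX):
        # Constraint violated: system shuts down, terminal reward [-1, 0].
        return (t + 1, y_next), [-1, 0], True
    reward = [0, -abs(outflow - F_SPEC)]                  # Equation 13
    return (t + 1, y_next), reward, False

state, reward, done = fcp_step(t=0, y=0.4, outflow=0.9, inflow_total=1.0)
```

With the initial level 𝑦0 = 0.4 and a net inflow of 0.1, the level rises to 0.41, well inside the constraint, and the second objective is penalised by the deviation of the chosen outflow from 𝐹𝑠𝑝𝑒𝑐.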
t    F1           P(F1)        F2           P(F2)
1    1.70843345   0.78341724   1.85062176   0.21658276
2    1.40843345   0.40060469   1.55062176   0.59939531
3    0.56537807   0.83222158   0.70876186   0.16777842
4    0.37336325   0.81546855   0.50537012   0.18453145
5    0.11927879   0.41123876   0.31832656   0.58876124
6    0.02762233   0.7665067    0.20677226   0.2334933
7    0.45139631   0.62905513   0.59104772   0.37094487
8    1.10806585   0.04634063   1.20835887   0.95365937

Table 2: The inflows (𝐹1, 𝐹2) for the feedtank with the corresponding probabilities (𝑃(𝐹1), 𝑃(𝐹2)) for each timestep, 𝑡.
The following parameters were set for FCP: [𝐹𝑚𝑖𝑛, 𝐹𝑚𝑎𝑥] = [0.55, 1.05], 𝐹𝑠𝑝𝑒𝑐 = 0.8, 𝑦0 = 0.4, [𝑦𝑚𝑖𝑛, 𝑦𝑚𝑎𝑥] = [0.25, 0.75], 𝐴⁻¹𝛿(𝑡) = 0.1 and 𝑝 = 0.1. MODVI has the following parameters: 𝛾 = 1, 𝑁 = 101, R𝑚𝑖𝑛 = [−1, −3] and R𝑚𝑎𝑥 = [0, 0]. Figure 7(d) outlines the three return distributions computed by MODVI in the ESR set for FCP. To provide an intuitive aid for decision making during the selection phase, the policies in the ESR set can be visualised, like in Figure 6, and returned to the user. It is important to note that 𝜋1 and 𝜋2 in the ESR set contain the same returns, although with different probabilities. If the expected value vectors for 𝜋1 and 𝜋2 are returned to a user, the user will lose all knowledge of how similar the returns for 𝜋1 and 𝜋2 are. Therefore, taking a distributional approach can aid in decision making, given that a user has more information about the individual returns of a policy. It is important to note that each return distribution in Figure 7(d) could easily be interpreted by a domain expert.
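The information loss incurred by expected value vectors can be made concrete with 𝜋1 and 𝜋2 from Figure 7(d): the two distributions share the same four returns, yet differ only in their probabilities, which is exactly what the expectation discards. The sketch below uses the published values; the dict representation and helper name are assumptions for illustration.

```python
# Return distributions for pi_1 and pi_2 from Figure 7(d),
# as {(r1, r2): probability}.
pi_1 = {(-1, -0.06): 0.0995, (-1, 0.0): 0.3210,
        (0, -0.06): 0.3778, (0, 0.0): 0.2017}
pi_2 = {(-1, -0.06): 0.0597, (-1, 0.0): 0.3609,
        (0, -0.06): 0.2264, (0, 0.0): 0.3530}

def expected_value_vector(dist):
    """Collapse a return distribution into an expected value vector."""
    return tuple(sum(p * r[i] for r, p in dist.items()) for i in range(2))

ev1 = expected_value_vector(pi_1)  # roughly (-0.4205, -0.0286)
ev2 = expected_value_vector(pi_2)  # roughly (-0.4206, -0.0172)

# The supports are identical; only the probabilities distinguish the
# two policies, and the expectation discards exactly that information.
assert set(pi_1) == set(pi_2)
```

On the first objective the two expected values differ by only 0.0001, so a user shown expected value vectors alone could hardly distinguish the policies, while the full distributions make their differences explicit.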
FCP is motivated by minimising risk as an important objective, given that violating certain constraints can shut down the distillation process. Therefore, FCP should be optimised under the ESR criterion, given that a single execution of a policy is used to derive utility. If the SER criterion is used as an optimality criterion, the average risk over multiple policy executions would be computed. However, making decisions based on average risk is not sufficient for FCP, given that a single violation of the constraints could lead to a system shutdown, resulting in loss of productivity and profits. Using a distributional approach for FCP under the ESR criterion ensures that a user has sufficient information about the probability of a constraint violation to make decisions that mitigate such risks.
6 RELATED WORK
In recent years, using distributions in decision making has become an active area of research for both single and multi-objective problem domains. For example, Martin et al. [28] use a single-objective distributional C51 algorithm with stochastic dominance to make risk-aware decisions. Abdolmaleki et al. [1] take a distributional approach to multi-objective decision making to compute a set of optimal policies for the SER criterion. It is important to note that taking a distributional approach to decision making is not new; methods like conditional value-at-risk (CVaR) [35] and value-at-risk (VaR) [14] have been used extensively in finance [27, 34] to make decisions under uncertainty. Beyond distributional approaches, many algorithms can compute a set of optimal policies for the SER criterion, for example multi-objective Monte Carlo tree search [48], Pareto value iteration [49], convex hull value iteration [5] and CON-MODP [50, 51]. In contrast to the SER criterion, the ESR criterion has been largely understudied, with some exceptions. Several single-policy algorithms have been developed which can compute a single optimal policy for the ESR criterion. However, the single-policy ESR algorithms cannot compute sets of optimal policies for the ESR criterion, which heavily restricts their use in real-world decision making scenarios. Reymond et al. [33] define a multi-objective distributional actor-critic algorithm that can compute optimal policies for the ESR criterion. Roijers et al. [36] define a multi-objective policy gradient algorithm that can compute a single optimal policy for the ESR criterion. Hayes et al. [19, 20] outline a distributional Monte Carlo tree search (DMCTS) algorithm to compute policies for the ESR criterion. However, all of the highlighted methods require the utility function of a user to be known a priori. For scenarios where the utility function is unknown, Hayes et al. [23] outline a distributional algorithm that computes a set of policies for the ESR criterion in a multi-objective multi-armed bandit [13] setting. However, the work of Hayes et al. [23] is limited to bandit settings and cannot be used for sequential decision making.
7 CONCLUSION & FUTURE WORK
In this paper we propose a multi-objective distributional value iteration (MODVI) algorithm that can compute a set of optimal policies for the ESR criterion. MODVI utilises return distributions which replace expected value vectors in multi-objective decision making. MODVI is the first algorithm that can compute a set of optimal policies under the ESR criterion in sequential multi-objective decision making settings. We show that MODVI can compute a set of optimal policies for several multi-objective benchmark problems and a practical real-world decision making problem. Because it is the first of its kind, MODVI opens up decision-theoretic planning for a key range of real-world problems.
We plan to use return distributions in multi-objective reinforcement learning (RL) settings. Model-based RL algorithms, like R-max [10], and model-free RL algorithms, like multi-objective Q-learning [46], could form the basis for new multi-objective distributional algorithms that can compute sets of policies for the ESR criterion. For MODVI, when the range of potential returns increases, maintaining a sufficient number of atoms for the return distribution requires a large amount of memory. It is expected that in larger scenarios, like [2], the range of possible potential returns would be difficult to maintain using a categorical distribution. A potential solution would be to use Dirichlet distributions [29] to represent return distributions. Finally, ESR dominance is a strict dominance criterion. In many settings, ESR dominance may produce very large sets of policies that would be optimal for all decision makers. It would be possible to relax the ESR dominance requirements by using almost stochastic dominance to generate smaller solution sets, where each policy in the set is optimal for most decision makers [25].
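The memory concern above can be illustrated with back-of-the-envelope arithmetic: if the joint return distribution over 𝑑 objectives is represented categorically with 𝑁 atoms per objective (an assumption of this sketch), then 𝑁^𝑑 probabilities must be stored per distribution, which grows quickly with the number of objectives.

```python
def joint_atoms(n_atoms, n_objectives):
    """Number of probabilities stored by a joint categorical return
    distribution with n_atoms atoms per objective (illustrative)."""
    return n_atoms ** n_objectives

# Resource Gathering: N = 25 atoms over 4 objectives.
print(joint_atoms(25, 4))   # 390625
# Feedtank Control Problem: N = 101 atoms over 2 objectives.
print(joint_atoms(101, 2))  # 10201
```

The four-objective setting already requires almost 400,000 atoms per distribution, which motivates the more compact parametric representations, such as Dirichlet distributions, suggested above.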
ACKNOWLEDGEMENTS
Conor F. Hayes is funded by the National University of Ireland
Hardiman Scholarship. This research was supported by funding
from the Flemish Government under the “Onderzoeksprogramma
Artificiële Intelligentie (AI) Vlaanderen” program.
REFERENCES
[1]
Abbas Abdolmaleki, Sandy Huang, Leonard Hasenclever, Michael Neunert, Fran-
cis Song, Martina Zambelli, Murilo Martins, Nicolas Heess, Raia Hadsell, and
Martin Riedmiller. 2020. A distributional view on multi-objective policy opti-
mization. In International Conference on Machine Learning. PMLR, 11–22.
[2]
Steven Abrams, James Wambua, Eva Santermans, Lander Willem, Elise Kuylen,
Pietro Coletti, Pieter Libin, Christel Faes, Oana Petrof, Sereina A. Herzog, Philippe
Beutels, and Niel Hens. 2021. Modelling the early phase of the Belgian COVID-19
epidemic using a stochastic compartmental model and studying its implied future
trajectories. Epidemics 35 (2021), 100449. https://doi.org/10.1016/j.epidem.2021.
100449
[3]
Mukhtar M. Ali. 1975. Stochastic dominance and portfolio analysis. Journal of
Financial Economics 2, 2 (1975), 205–229. https://doi.org/10.1016/0304-405X(75)90005-7
[4]
Anthony B Atkinson and Francois Bourguignon. 1982. The Comparison of Multi-Dimensioned Distributions of Economic Status. The Review of Economic Studies 49, 2 (1982), 183–201. https://doi.org/10.2307/2297269
[5]
Leon Barrett and Srini Narayanan. 2008. Learning all optimal policies with
multiple criteria. In Proceedings of the 25th international conference on Machine
learning. 41–47.
[6]
Vijay S. Bawa. 1975. Optimal rules for ordering uncertain prospects. Journal of
Financial Economics 2, 1 (1975), 95–121. https://doi.org/10.1016/0304-405X(75)90025-2
[7]
Vijay S. Bawa. 1978. Safety-First, Stochastic Dominance, and Optimal Portfolio
Choice. The Journal of Financial and Quantitative Analysis 13, 2 (1978), 255–271.
http://www.jstor.org/stable/2330386
[8]
Marc G Bellemare, Will Dabney, and Rémi Munos. 2017. A distributional perspec-
tive on reinforcement learning. In Proceedings of the 34th International Conference
on Machine Learning-Volume 70. JMLR. org, 449–458.
[9] Richard Bellman. 1957. Dynamic programming. Courier Corporation.
[10]
Ronen I Brafman and Moshe Tennenholtz. 2002. R-max-a general polynomial
time algorithm for near-optimal reinforcement learning. Journal of Machine
Learning Research 3, Oct (2002), 213–231.
[11]
Daniel Bryce, William Cushing, and Subbarao Kambhampati. 2007. Probabilistic
planning is multi-objective. Arizona State University, Tech. Rep. ASU-CSE-07-006
(2007).
[12]
E. Choi and Stanley Johnson. 1988. Stochastic Dominance and Uncertain Price
Prospects. Center for Agricultural and Rural Development (CARD) at Iowa State
University, Center for Agricultural and Rural Development (CARD) Publications 55
(01 1988). https://doi.org/10.2307/1059583
[13]
Madalina M. Drugan and Ann Nowe. 2013. Designing multi-objective multi-
armed bandits algorithms: A study. In The 2013 International Joint Conference on
Neural Networks (IJCNN). 1–8. https://doi.org/10.1109/IJCNN.2013.6707036
[14]
Darrell Duffie and Jun Pan. 1997. An overview of value at risk. Journal of
derivatives 4, 3 (1997), 7–49.
[15]
Peter C Fishburn. 1978. Non-cooperative stochastic dominance games. Interna-
tional Journal of Game Theory 7, 1 (1978), 51–61.
[16]
Peter Geibel and Fritz Wysotzki. 2005. Risk-sensitive reinforcement learning
applied to control under constraints. Journal of Artificial Intelligence Research 24
(2005), 81–108.
[17]
Peichen Gong. 1992. Multiobjective dynamic programming for forest resource
management. Forest Ecology and Management 48, 1 (1992), 43–54. https://doi.
org/10.1016/0378-1127(92)90120-X
[18]
Josef Hadar and William R. Russell. 1969. Rules for Ordering Uncertain Prospects.
The American Economic Review 59, 1 (1969), 25–34. http://www.jstor.org/stable/
1811090
[19]
Conor F. Hayes, Mathieu Reymond, Diederik M. Roijers, Enda Howley, and
Patrick Mannion. 2021. Risk-Aware and Multi-Objective Decision Making with
Distributional Monte Carlo Tree Search. In Proceedings of the Adaptive and Learning Agents Workshop at AAMAS 2021 (2021).
[20]
Conor F. Hayes, Mathieu Reymond, Diederik M. Roijers, Enda Howley, and
Patrick Mannion. 2021 In Press. Distributional Monte Carlo Tree Search for
Risk-Aware and Multi-Objective Reinforcement Learning. In Proceedings of the
20th International Conference on Autonomous Agents and MultiAgent Systems,
Vol. 2021. IFAAMAS.
[21]
Conor F. Hayes, Diederik M. Roijers, Enda Howley, and Mannion Patrick. 2022.
Decision-Theoretic Planning for the Expected Scalarised Returns. In Proceedings
of the 21st International Conference on AAMAS (2022).
[22]
Conor F. Hayes, Roxana Rădulescu, Eugenio Bargiacchi, Johan Källström,
Matthew Macfarlane, Mathieu Reymond, Timothy Verstraeten, Luisa M. Zintgraf,
Richard Dazeley, Fredrik Heintz, Enda Howley, Athirai A. Irissappane, Patrick
Mannion, Ann Nowé, Gabriel Ramos, Marcello Restelli, Peter Vamplew, and
Diederik M. Roijers. 2022. A Practical Guide to Multi-Objective Reinforcement
Learning and Planning. Autonomous Agents and Multi-Agent Systems 36, 1 (2022),
26. https://doi.org/10.1007/s10458-022-09552-y
[23]
Conor F. Hayes, Timothy Verstraeten, Diederik M. Roijers, Enda Howley, and
Patrick Mannion. 2021. Dominance Criteria and Solution Sets for the Expected
Scalarised Returns. In Proceedings of the Adaptive and Learning Agents workshop
at AAMAS 2021.
[24]
Conor F. Hayes, Timothy Verstraeten, Diederik M. Roijers, Enda Howley, and
Patrick Mannion. 2021. Expected Scalarised Returns Dominance: A New Solution
Concept for Multi-Objective Decision Making. arXiv preprint arXiv:2106.01048
(2021).
[25] Moshe Leshno and Haim Levy. 2002. Preferred by “all” and preferred by “most”
decision makers: Almost stochastic dominance. Management Science 48, 8 (2002),
1074–1085.
[26]
Federico Malerba and Patrick Mannion. 2021. Evaluating Tunable Agents
with Non-Linear Utility Functions under Expected Scalarised Returns. In Multi-
Objective Decision Making Workshop (MODeM 2021).
[27]
Simone Manganelli and Robert F Engle. 2001. Value at risk models in finance.
(2001).
[28]
John Martin, Michal Lyskawinski, Xiaohu Li, and Brendan Englot. 2020. Stochas-
tically Dominant Distributional Reinforcement Learning. In International Confer-
ence on Machine Learning. PMLR, 6745–6754.
[29]
Ingram Olkin and Herman Rubin. 1964. Multivariate beta distributions and
independence properties of the Wishart distribution. The Annals of Mathematical
Statistics (1964), 261–269.
[30]
Michael Painter, Bruno Lacerda, and Nick Hawes. 2020. Convex Hull Monte-
Carlo Tree-Search. In Proceedings of the Thirtieth International Conference on
Automated Planning and Scheduling, Nancy, France, October 26-30, 2020. AAAI
Press, 217–225.
[31] Vilfredo Pareto. 1896. Manuel d’Economie Politique. Vol. 1. Giard, Paris.
[32]
Roxana Rădulescu, Patrick Mannion, Diederik M. Roijers, and Ann Nowé. 2020.
Multi-objective multi-agent decision making: a utility-based analysis and survey.
Autonomous Agents and Multi-Agent Systems 34, 10 (2020).
[33]
Mathieu Reymond, Conor F. Hayes, Diederik M. Roijers, Denis Steckelmacher,
and Ann Nowé. 2021. Actor-Critic Multi-Objective Reinforcement Learning
for Non-Linear Utility Functions. Multi-Objective Decision Making Workshop
(MODeM 2021) (2021).
[34]
R Tyrrell Rockafellar and Stanislav Uryasev. 2002. Conditional value-at-risk for
general loss distributions. Journal of Banking & Finance 26, 7 (2002), 1443–1471.
[35]
R Tyrrell Rockafellar, Stanislav Uryasev, et al. 2000. Optimization of conditional value-at-risk. Journal of Risk 2, 3 (2000), 21–41.
[36]
Diederik M. Roijers, Denis Steckelmacher, and Ann Nowé. 2018. Multi-objective
Reinforcement Learning for the Expected Utility of the Return. In Proceedings of
the Adaptive and Learning Agents workshop at FAIM 2018.
[37]
Diederik M. Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley.
2013. A survey of multi-objective sequential decision-making. Journal of Artificial
Intelligence Research 48 (2013), 67–113.
[38]
Diederik M. Roijers, Shimon Whiteson, and Frans A. Oliehoek. 2015. Comput-
ing Convex Coverage Sets for Faster Multi-Objective Coordination. Journal of
Artificial Intelligence Research 52 (2015), 399–443.
[39]
Roxana Rădulescu, Patrick Mannion, Yijie Zhang, Diederik Marijn Roijers, and
Ann Nowé. 2020. A utility-based analysis of equilibria in multi-objective normal
form games. The Knowledge Engineering Review 35, e32 (2020).
[40]
Songsak Sriboonchitta, Wing-Keung Wong, S. Dhompongsa, and Hung Nguyen.
2009. Stochastic Dominance and Applications to Finance, Risk and Economics.
https://doi.org/10.1201/9781420082678
[41]
Peter Vamplew, Richard Dazeley, Adam Berry, Rustam Issabekov, and Evan
Dekker. 2011. Empirical evaluation methods for multiobjective reinforcement
learning algorithms. Machine Learning (2011).
[42]
Peter Vamplew, Cameron Foale, and Richard Dazeley. 2020. A Demonstration
of Issues with Value-Based Multi Objective Reinforcement Learning Under Sto-
chastic State Transitions. Adaptive and Learning Agents Workshop (AAMAS
2020).
[43]
Peter Vamplew, Cameron Foale, and Richard Dazeley. 2021. The impact of envi-
ronmental stochasticity on value-based multiobjective reinforcement learning. In
Neural Computing and Applications. https://doi.org/10.1007/s00521-021-05859-1
[44]
Peter Vamplew, Benjamin J Smith, Johan Kallstrom, Gabriel Ramos, Roxana
Radulescu, Diederik M Roijers, Conor F Hayes, Fredrik Heintz, Patrick Mannion,
Pieter JK Libin, et al. 2021. Scalar reward is not enough: A response to Silver,
Singh, Precup and Sutton (2021). arXiv preprint arXiv:2112.15422 (2021).
[45]
Peter Vamplew, John Yearwood, Richard Dazeley, and Adam Berry. 2008. On
the Limitations of Scalarisation for Multi-objective Reinforcement Learning of
Pareto Fronts. In AI 2008: Advances in Artificial Intelligence, Wayne Wobcke and
Mengjie Zhang (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 372–378.
[46]
Kristof Van Moffaert and Ann Nowé. 2014. Multi-objective reinforcement learning
using sets of Pareto dominating policies. The Journal of Machine Learning Research
15, 1 (2014), 3483–3512.
[47]
K Wakuta and K Togawa. 1998. Solution procedures for multi-objective Markov
decision processes. Optimization 43, 1 (1998), 29–46.
[48]
Weijia Wang and Michèle Sebag. 2012. Multi-objective Monte-Carlo Tree Search
(Proceedings of Machine Learning Research, Vol. 25), Steven C. H. Hoi and Wray
Buntine (Eds.). PMLR, Singapore Management University, Singapore, 507–522.
[49]
DJ White. 1982. Multi-objective infinite-horizon discounted Markov decision
processes. Journal of mathematical analysis and applications 89, 2 (1982), 639–647.
[50]
Marco A. Wiering and Edwin D. de Jong. 2007. Computing Optimal Stationary
Policies for Multi-Objective Markov Decision Processes. In 2007 IEEE International
Symposium on Approximate Dynamic Programming and Reinforcement Learning.
158–165. https://doi.org/10.1109/ADPRL.2007.368183
[51]
Marco A Wiering, Maikel Withagen, and Mădălina M Drugan. 2014. Model-based
multi-objective reinforcement learning. In 2014 IEEE Symposium on Adaptive
Dynamic Programming and Reinforcement Learning (ADPRL). IEEE, 1–6.
[52]
Elmar Wolfstetter. 1999. Topics in Microeconomics: Industrial Organization, Auc-
tions, and Incentives. Cambridge University Press. https://doi.org/10.1017/
CBO9780511625787
[53]
Kyle Hollins Wray, Shlomo Zilberstein, and Abdel-Illah Mouaddib. 2015. Multi-
objective MDPs with conditional lexicographic reward preferences. In Twenty-
ninth AAAI conference on articial intelligence.