UNIVERSITY OF COPENHAGEN
DEPARTMENT OF MATHEMATICAL SCIENCES
Master Thesis
Magnus Grønnegaard Frandsen
Greeks Need Not Apply
Using Market Generators and Deep Hedging for Model-Free Data-Driven Hedging
Advisor: Rolf Poulsen
Submitted on: June 15, 2021
Abstract
In recent years, research in deep learning has intensified, and deep learning methods have been developed in all areas of finance. This thesis aims to combine deep learning-based market generators with deep hedging methods to create a model- and Greek-free framework that finds risk-optimal hedging strategies from observed price paths.
First, we explain and test the deep hedging method using virtually unrestricted amounts of synthetic data from Black-Scholes and Heston models. We show that the deep hedging method can find reasonable hedging strategies for simple claims and for path-dependent options, both with and without transaction costs. However, architecture and inputs (including their processing) have significant impacts on performance. We also observe that the deep hedging method can yield unstable hedging strategies in multivariate models.
Next, we explain and test data-driven market generators based on variational autoencoders. We observe that the market generators can produce paths with marginal distributions and correlations similar to those of a Black-Scholes model. The market generators struggle in the Heston model and when conditioning on initial instantaneous variance. We propose several moment regularization terms that partly alleviate these issues.
Finally, we combine and test the market generators with the deep hedging methods when assuming that a Black-Scholes or Heston model drives the market. In a Black-Scholes model, we observe that the combined framework can learn reasonable hedging strategies for call and down-and-out call options from a modest number of observations. In a Heston model, we also observe that the framework can learn hedging strategies (conditioned on initial instantaneous variance) from a single path. Still, the framework struggles with high initial instantaneous variance. We also show that it is possible to improve performance by utilizing overlapping paths, which increase the number of training paths. The results are promising. However, the techniques still require refining and further development before they are feasible for commercial use.
Keywords— Hedging, Deep Learning, Model Free, Market Generator, Variational Autoencoder, Conditional Variational Autoencoder, Stochastic Volatility, Transaction Costs, Barrier Option.
Contents

1 Introduction ........ 1
2 Simple Hedging Experiment ........ 2
3 General Hedging Problem ........ 6
  3.1 Market Setup ........ 6
  3.2 Risk measures and optimal trading strategies ........ 7
  3.3 CVaR and its practicalities ........ 11
  3.4 Practicalities of minimizing over δ ........ 11
4 Artificial Neural Networks ........ 12
  4.1 Architecture ........ 13
  4.2 Universal Representation and Representation Benefits of deep ANNs ........ 14
  4.3 Backpropagation ........ 15
  4.4 Implementation and Training Neural Networks ........ 16
5 Deep Hedging Experiments ........ 17
  5.1 Black-Scholes with 1 asset and a simple claim ........ 17
    5.1.1 No transaction costs ........ 17
    5.1.2 Training on wrong volatility and The Fundamental Theorem of Derivatives Trading ........ 22
    5.1.3 0.5% transaction costs (Black-Scholes 1 asset) ........ 23
  5.2 Black-Scholes with multiple assets and no transaction costs ........ 27
    5.2.1 Training with the wrong correlation ........ 29
  5.3 Black-Scholes model with path-dependent options ........ 31
  5.4 Heston model with one tradable asset (incomplete and non-Markovian model) ........ 35
  5.5 Subconclusion ........ 38
6 Market Generators ........ 38
  6.1 Variational Autoencoders ........ 39
  6.2 Conditional VAEs ........ 43
  6.3 Connecting ANNs to VAEs and CVAEs ........ 44
  6.4 Experiments with VAEs and performance evaluation (in a simple Black-Scholes model) ........ 45
  6.5 Cheating in the Heston model - capturing path dependency ........ 50
  6.6 Conditioning on instantaneous variance in the Heston model ........ 54
  6.7 Overlapping training paths (is it possible?) ........ 58
7 Data Driven Hedge Experiments ........ 60
  7.1 VAE powered hedge experiments - Black-Scholes ........ 60
  7.2 CVAE powered hedge experiments - Heston ........ 65
8 Conclusion ........ 67
A Appendix ........ 70
  A.1 Outperforming delta hedging using MSE (in theory) ........ 70
  A.2 Price and delta of a call option in the Heston model ........ 71
1 Introduction
Pricing and hedging derivatives are among the cornerstones of mathematical finance and play a vital role for derivatives dealers and traders. In classical theory, prices are determined as expected discounted payoffs under a pricing measure $Q$, and derivatives can be hedged (locally) by constructing a portfolio with matching Greeks (derivatives of the pricing function). This classical approach works perfectly in an idealized complete market without transaction costs, where assets follow simple Itô processes. However, in real-world scenarios, expert traders must combine other techniques and market knowledge to set prices and efficiently hedge portfolios against unwanted risks.
In recent years, all scientific fields have been overwhelmed with interest and research in machine learning. This
also includes finance, where machine learning techniques have been proposed in virtually all areas. The strength of
machine learning (especially deep learning) is that it offers solutions to complex and general problems with minimal
assumptions. The disadvantage of machine learning is that the results can be sensitive to the problem formulation, network architecture (for deep learning) and data quality. Results from machine learning methods are also difficult to interpret, and their errors are poorly understood.
In this thesis, we aim to combine and test the ideas of [1] and [2]. In [1], Buehler et al. propose a deep hedging framework that can find optimal hedging strategies for option portfolios in a market with multiple assets and market frictions. The approach is entirely model-free and does not depend on the model dynamics. Hence, the approach does not depend on any $Q$-measures or Greeks. The framework optimizes a hedging portfolio under a coherent risk measure using synthetic price data and artificial neural networks (ANNs). The framework will depend heavily on the synthetic data. If we choose a classic model (like Black-Scholes or Heston) to drive the market, then the deep hedging framework loses its inherent model-free status (although it would still be $Q$- and Greek-free). In [2], Buehler et al. propose a data-driven, flexible and non-parametric generative model based on variational autoencoders (VAEs), which can generate new asset paths from a small/modest number of observations. This generative model (also called a market generator) can (hopefully) be utilized to create synthetic data for a deep hedging model. Therefore, combining the generative model and the deep hedging model would yield a data-driven and model-free hedging framework. This thesis aims to explain, test, and analyze both the deep hedging approach and the VAE-based generative models.
The thesis is organized as follows: We motivate our formulation of a hedging problem in section 2 with a
simple example in a Binomial Model. We do this by approaching the hedging problem as a regression problem.
Section 3 introduces a general hedging problem with coherent risk measures. We aim to formulate the problem to
enable optimal hedging strategies to be found by minimizing over possible hedging strategies. In this thesis, we
utilize ANNs to represent trading decisions. For completeness, we introduce ANNs in section 4, including universal
approximation, backpropagation and training practicalities. In section 5, we perform various hedge experiments to
evaluate the deep hedging approach. For these experiments, we assume that the market is driven by a standard model
like Black-Scholes or Heston. This enables better testing of the deep hedging models since we can compare the deep
hedging models’ performance to classic delta hedging. To test the deep hedging model (looking at performance
and stability), we perform experiments with call options, down-and-out call options, multiple assets and stochastic
volatility (using a Heston model).
At this point, we should understand (some of) the possibilities and limitations of the deep hedging approach.
We, therefore, wish to step back from hedging to introduce a generative model that can create synthetic data for
the deep hedging models. In section 6, we introduce the theory behind variational and conditional variational
autoencoders (VAEs and CVAEs). We then test the VAEs' and CVAEs' abilities to create new paths of an asset,
based on a modest number of observed paths. The market will again be assumed to be driven by a Black-Scholes
or Heston model. To evaluate the performance of the generative models, we analyze the marginal distributions
and time correlations of the simulated paths/returns. To improve the performance of the generative models,
we propose various moment regularization terms, which guide the generative models towards distributions with
similar moments and correlations to the utilized training data. Section 7 combines the generative models and
the deep hedging approach to create and test a framework that learns optimal hedging strategies from a modest
number of observations. For simplicity, we again assume that the market is driven by either a Black-Scholes or
Heston model. However, in these experiments, we train the deep hedging model on synthetic samples created
by generative models that are trained on a set of observed paths from the actual model, i.e. Black-Scholes or
Heston. We aim to analyze the performance of the developed framework when varying the number of training samples and when utilizing overlapping sample paths. Finally, we investigate if it is possible to learn the
optimal hedging strategy from training samples with varying levels of underlying volatility (e.g. from a Heston model).
All implementations for this thesis are written in Python 3.8 with TensorFlow 2.4.1 and are available on GitHub in the following repo: https://github.com/jnr494/MasterThesis
2 Simple Hedging Experiment
In this section, we present a simple hedging problem in a binomial model, which we can formulate as a regression problem.
The binomial model and hedge setup

We assume that we are in a binomial model $(S(t), B(t))$ where

$$S(t+\Delta t) = S(t)\cdot\begin{cases} u = \exp\!\big(\alpha\Delta t + \sigma\sqrt{\Delta t}\big) & \text{w.p. } p \\ d = \exp\!\big(\alpha\Delta t - \sigma\sqrt{\Delta t}\big) & \text{w.p. } 1-p \end{cases}$$

$$B(t+\Delta t) = B(t)e^{r\Delta t}, \qquad B(0) = 1,$$

where we assume $\Delta t = T/N$, $T$ is the time of maturity for some option (that we wish to hedge) and $N$ is the number of steps from $t=0$ to $t=T$. For notational simplicity, we write $S_i = S(i\Delta t)$, $B_i = B(i\Delta t)$ and so on.
We assume that we wish to hedge an option on $S$ with payoff $C_N = g(S_N)$. Note that at time $T$ the underlying asset $S$ can take on $N+1$ different values

$$S_0 u^N,\; S_0 u^{N-1}d,\; \ldots,\; S_0 u d^{N-1},\; S_0 d^N.$$
It is quite easy to hedge options in a binomial model. Standing at time $t$, we can hedge the option value at time $t+\Delta t$ by solving the system

$$aS(t)u + bB(t)e^{r\Delta t} = C_u(t+\Delta t)$$
$$aS(t)d + bB(t)e^{r\Delta t} = C_d(t+\Delta t)$$

where $(a, b)$ is a portfolio of $(S(t), B(t))$ and $C_u(t+\Delta t)$, $C_d(t+\Delta t)$ are the values of the option in the two possible future states at time $t+\Delta t$. The solution

$$a = \frac{C_u(t+\Delta t) - C_d(t+\Delta t)}{S(t)(u-d)}, \qquad b = \frac{uC_d(t+\Delta t) - dC_u(t+\Delta t)}{B(t+\Delta t)(u-d)}$$

is the optimal hedging strategy, and the price at time $t$ is $C(t) = aS(t) + bB(t)$, i.e. the value of the hedging portfolio. Solving the above problem recursively backwards in time from $T$ to $t=0$ gives us the optimal hedging portfolio and the fair/arbitrage-free price of the option.
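The backward recursion is straightforward to implement. The following sketch is our own illustration (not from the thesis repository), using the binomial discretisation stated above and the parameters of the experiment later in this section; it prices the ATM call and records the replicating portfolio $(a, b)$ at the root node:

```python
import numpy as np

# Backward recursion in the binomial model; parameters match the experiment
# later in this section (S0=100, alpha=0.03, sigma=0.2, r=0, T=1, N=10).
S0, alpha, sigma, r, T, N, K = 100.0, 0.03, 0.2, 0.0, 1.0, 10, 100.0
dt = T / N
u = np.exp(alpha * dt + sigma * np.sqrt(dt))
d = np.exp(alpha * dt - sigma * np.sqrt(dt))
q = (np.exp(r * dt) - d) / (u - d)      # risk-neutral up-probability (not p)

# Terminal values S0 * u^j * d^(N-j) for j = 0..N up-moves, and call payoffs.
j = np.arange(N + 1)
C = np.maximum(S0 * u**j * d**(N - j) - K, 0.0)

a_root = b_root = None
for _ in range(N):
    if C.size == 2:                     # last step back: record root hedge (a, b)
        a_root = (C[1] - C[0]) / (S0 * (u - d))
        b_root = (u * C[0] - d * C[1]) / (np.exp(r * dt) * (u - d))
    C = np.exp(-r * dt) * (q * C[1:] + (1 - q) * C[:-1])

price = C[0]                            # fair/arbitrage-free price at t = 0
```

With these parameters, the recursion should reproduce an option price close to the 8.04567 reported in table 2.1, and the root portfolio value $aS_0 + b$ equals the price, as it should.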
We now turn our attention to solving the hedging problem by learning $a$ and $b$ for all the states. We do this by framing the problem as a regression problem.
Imagine that $p_0$ is the initial value of our hedging portfolio ($p_0$ should be the option's price, but we do not know this). We represent our investment strategy by functions $(f_0(s), \ldots, f_{N-1}(s))$ which represent the investments in $S$. The function $f_i$ could be a polynomial, an ANN or any other parametric function that can represent the optimal hedging strategy.
The value of our portfolio $PF_i$ can be recursively determined by

$$PF_0 = p_0$$

$$PF_i = f_{i-1}(S_{i-1})S_i + \overbrace{\big(PF_{i-1} - f_{i-1}(S_{i-1})S_{i-1}\big)}^{\text{amount invested in } B_{i-1}}\frac{B_i}{B_{i-1}} \qquad \text{for } i = 1, \ldots, N.$$
Rewriting $PF_N$, we get

$$PF_N = p_0 B_N + \sum_{k=0}^{N} S_k\big(f_{k-1}(S_{k-1}) - f_k(S_k)\big)\frac{B_N}{B_k},$$

where $f_{-1}, f_N := 0$. We can, therefore, think of $PF_N$ as a parametric function/algorithm that takes a path of $(S, B)$ and returns the portfolio value at time $T$ when starting at $p_0$ and following the trading strategy given by $(f_0, \ldots, f_{N-1})$. Since we are looking for a hedge portfolio, we want to find $p_0, f_0, \ldots, f_{N-1}$ s.t.

$$PF_N = g(S_N) \quad \text{almost surely.}$$
We can now think of the problem as a problem of finding $p_0, f_0, \ldots, f_{N-1}$ given $M$ sample paths to minimize the mean squared error (MSE)

$$\frac{1}{M}\sum_{i=1}^{M}\Big(C_N^{(i)} - PF_N^{(i)}\Big)^2.$$

If we, given all possible sample paths (of which there are $2^N$), can find $p_0, f_0, \ldots, f_{N-1}$ s.t. the mean squared error is zero, then we have found the desired optimal strategy since the optimal strategy is unique due to the completeness of the model.
The practical issues of the loss function and choosing the functional form of $f_i$ will be covered in the later chapters. For this experiment, we choose to represent the $f_i$s with ANNs and apply a standard gradient descent algorithm (ADAM) to update $p_0$ and the parameters of the $f_i$s. Note, however, that the choice of ANNs to represent the $f_i$s is not essential from a theoretical point of view. Any parametric function capable of representing the optimal hedging strategy would work. The choice of ANNs is made solely from a practical standpoint.
                      Value (Standard Error)    % of option price
Option Price          8.04567
Model p_0             8.04524                   99.995%
Avg. PnL              0.00111 (0.00011)         0.014%
Avg. abs. PnL         0.01507 (0.00008)         0.187%
Avg. squared PnL      0.00052 (0.00001)

Table 2.1: Average PnL, absolute PnL and squared PnL for the trained model in a test of 50,000 samples.
[Figure 2.1: Hedge accuracy (a) and PnLs (b) across terminal values of S over 50,000 samples for the ANN model. Panel (a) shows the option payoff and PF_N; panel (b) shows the PnL.]
A simple experiment in a Binomial model

For this simple experiment, we wish to hedge an ATM call option in a binomial model over one year. The binomial model will have step size $1/10$, implying that we have 10 trading decisions over the lifetime of the option. We assume the following model parameters and option characteristics:

Model parameters: $S_0 = 100$, $\alpha = 0.03$, $\sigma = 0.2$, $r = 0$ and $\Delta t = 1/10$.
Option: Type: European call, Maturity: $T = 1$, Strike: $K = 100$.
For this experiment, we do not focus on the details of the ANN-based model and training (see section 4). However, to quickly summarize: We generate $2^{18}$ samples of $S$ and estimate the option payoff on each sample. We then train the model (with an ANN representing each trading decision, each with three layers of four units) on these samples. During training, we seek to minimize the MSE of the total Profit-and-Loss (PnL) of selling the call option and trading in $(S, B)$, i.e. $PnL = -C_N + PF_N$.
To test our model, we perform a hedge experiment where the model has to hedge the call option on 50,000 new paths of $(S, B)$ (it will already have seen the vast majority). The result of this test can be seen in table 2.1 and figures 2.1, 2.2 and 2.3.
In figure 2.1, we see that the model quite successfully hedges the call option (at least visually), with absolute PnLs below 0.2 corresponding to 2.5% of the option price. From table 2.1, we see that the average PnL and even the average absolute PnL are pretty low, with values of 0.00111 and 0.01507 corresponding to 0.014% and 0.187% of the option price, respectively. The learned option price $p_0$ is 8.04524, corresponding to 99.995% of the actual option price, which is also quite impressive.
In figure 2.2, we see two examples of holdings across time. As the black dashed line indicates, the ANN model can mimic the analytical trading strategy. This is also seen in figure 2.3 where we see the learned strategy for holdings
[Figure 2.2: Holdings in S across time for two different sample paths; panels (a) Sample 1 and (b) Sample 2.]
[Figure 2.3: Holdings in S (units of S) at time t = 0.6.]
in $S$ at time $t = 0.6$ across values of $S$. The strategy learned by the ANN model seems to be close to the analytical strategy for the possible values of $S$ (of which there are only seven). However, the ANN model is not super-smooth in between the possible values of $S$, which makes sense since nothing is gained from this.
Whether or not this is impressive and/or good enough is debatable. Still, one should remember that the model
learned the option price and hedging strategy by
only
observing the MSE between the option payoffs and the values
of the hedging portfolios. The ANN model has no information on model dynamics or how it relates to the option.
There is, in principle, nothing stopping us from applying the same model/method to an entire option portfolio in a
stochastic volatility model with transaction costs. The power of this method comes from its flexibility. In the next
section, we formalize the problem in a more general setting, allowing us to investigate these tools further.
3 General Hedging Problem
In this section, we wish to introduce a more general hedging problem, which involves multiple tradable assets, stochastic interest rates and transaction costs. This section is based on the work by Buehler et al. in [1].
3.1 Market Setup
We imagine a market consisting of $d$ risky tradable assets with price processes represented by $S(t) \in \mathbb{R}^d$ which are adapted to $\mathcal{F}_t$, where $(\mathcal{F}_t)_{t \in \mathbb{R}_+}$ is a filtration based on all relevant market information (prices, interest rates, news etc.). The market also contains a tradable locally risk-free asset with price process $B(t) \in \mathbb{R}$, which is a predictable process. We observe and interact with these assets under the real-world probability measure $P$, and it is under this measure that we wish to find the optimal hedging strategy.
We assume that we are selling (and wish to hedge) a portfolio of claims $Z \in \mathcal{F}_T$ in the described market. For simplicity, we assume that all claims in this portfolio have maturity $T$. Being the seller of $Z$ implies that $Z$ is a liability at time $T$, but it may contain both long and short positions in the underlying assets (which are not necessarily $S$ or $B$). As compensation for selling the portfolio $Z$, we imagine being compensated $p_0 \in \mathbb{R}$ (which may be negative if we effectively hold long positions).
Our goal is to trade $(S, B)$ to optimally minimize the risk of our combined portfolio (containing both $Z$ and our hedging portfolio).
We assume that we are allowed to trade in $(S, B)$ at $N$ time-points, $0 = t_0 < t_1 < \ldots < t_{N-1} < T$, at prices $S_k = S(t_k)$ and $B_k = B(t_k)$, with $t_N = T$ being the time-point at which our combined portfolio is evaluated. However, these trades might be subject to transaction costs. We denote our trading strategy $(\delta_k)_{k=0,\ldots,N-1}$ with $\delta_k \in \mathbb{R}^d$ representing holdings of $S$ at time $t_k$. We assume that the strategy is kept self-financing using $B$.
For simplicity, we assume that trades in $S$ are subject to proportional transaction costs, implying that trading $\delta^i_k - \delta^i_{k-1}$ units of $S^i$ at time $t_k$ costs

$$c^i_k S^i_k\,|\delta^i_k - \delta^i_{k-1}|$$

where $c^i_k > 0$ (e.g. $c^i_k = 0.001$ representing transaction costs of 0.1%). We also assume that liquidation of the portfolio is free at time $T$.
Representing the terminal portfolio value and PnL

The value of the tradable portfolio (excluding $Z$), $PF$, obtained by following $(\delta_k)_{k=0,\ldots,N-1}$ and having initial capital $p_0$, is

$$PF_0 = p_0 - \sum_{i=1}^{d} c^i_0 S^i_0 |\delta^i_0|$$

$$PF_k = \left(\sum_{i=1}^{d} S^i_k \delta^i_{k-1}\right) + \left(PF_{k-1} - \sum_{i=1}^{d}\delta^i_{k-1} S^i_{k-1}\right)\frac{B_k}{B_{k-1}} - \sum_{i=1}^{d} c^i_k S^i_k\,|\delta^i_k - \delta^i_{k-1}| \quad \text{for } k = 1, \ldots, N-1$$

$$PF_N = \left(\sum_{i=1}^{d} S^i_N \delta^i_{N-1}\right) + \left(PF_{N-1} - \sum_{i=1}^{d}\delta^i_{N-1} S^i_{N-1}\right)\frac{B_N}{B_{N-1}}$$

where $PF_{k-1} - \sum_{i=1}^{d}\delta^i_{k-1} S^i_{k-1}$ is the amount invested in $B$ in order to keep the portfolio self-financing. Also, notice that we have applied our assumption of zero transaction costs at time $T$. The value of the tradable portfolio at time $T$, $PF_N$, can be written as

$$PF_N = p_0 B_N + \sum_{k=0}^{N}\sum_{i=1}^{d} S^i_k\big(\delta^i_{k-1} - \delta^i_k\big)\frac{B_N}{B_k} - \sum_{k=0}^{N-1}\sum_{i=1}^{d} c^i_k S^i_k\,|\delta^i_k - \delta^i_{k-1}|\,\frac{B_N}{B_k} = B_N \tilde{PF}_N \tag{1}$$

with

$$\tilde{PF}_N := p_0 + \sum_{k=0}^{N}\sum_{i=1}^{d} \tilde{S}^i_k\big(\delta^i_{k-1} - \delta^i_k\big) - \sum_{k=0}^{N-1}\sum_{i=1}^{d} c^i_k \tilde{S}^i_k\,|\delta^i_k - \delta^i_{k-1}|,$$

where $\delta^i_{-1} = 0$ and $\delta^i_N := 0$ for all $i$ (free liquidation at time $T$), and where $\tilde{PF}_N$ is the value of the tradable portfolio in the discounted market $(\tilde{S}, 1)$ with $\tilde{S}^i(t) = S^i(t)/B(t)$. The discounted portfolio value $\tilde{PF}_N$ will be justified later. To simplify notation, we define

$$(\tilde{S}\cdot\delta)_T := \sum_{k=0}^{N}\sum_{i=1}^{d} \tilde{S}^i_k\big(\delta^i_{k-1} - \delta^i_k\big) \tag{2}$$

$$C_T(\delta) := \sum_{k=0}^{N-1}\sum_{i=1}^{d} c^i_k \tilde{S}^i_k\,|\delta^i_k - \delta^i_{k-1}|, \tag{3}$$

implying that $\tilde{PF}_N = p_0 + (\tilde{S}\cdot\delta)_T - C_T(\delta)$. We can now move on to consider our Profit-and-Loss (PnL). At time $T$ the PnL of our combined portfolio is

$$PnL_T(Z, p_0, \delta) := -Z + PF_N = B_N\underbrace{\big(-\tilde{Z} + \tilde{PF}_N\big)}_{\tilde{PnL}_T}$$

where $\tilde{Z} = Z/B_N$.
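As a sanity check on the closed-form rewriting (1), the following snippet (with made-up path, cost and rate parameters, and using the conventions $\delta_{-1} = \delta_N = 0$ for free liquidation at $T$) computes the portfolio value once through the forward self-financing recursion and once through the discounted closed form:

```python
import numpy as np

rng = np.random.default_rng(1)
d_assets, N = 2, 6
r, c = 0.01, 0.002                     # flat short rate and proportional cost (made up)
B = np.exp(r * np.arange(N + 1))       # B_k with B_0 = 1
S = 100 * np.exp(np.cumsum(rng.normal(0, 0.02, (N + 1, d_assets)), axis=0))
delta = rng.normal(0, 1, (N, d_assets))  # holdings delta_0, ..., delta_{N-1}
p0 = 5.0

# Forward recursion: trade at t_0..t_{N-1}, free liquidation at t_N = T.
pf = p0 - np.sum(c * S[0] * np.abs(delta[0]))
for k in range(1, N):
    pf = (S[k] @ delta[k - 1]
          + (pf - S[k - 1] @ delta[k - 1]) * B[k] / B[k - 1]
          - np.sum(c * S[k] * np.abs(delta[k] - delta[k - 1])))
pf_N = S[N] @ delta[N - 1] + (pf - S[N - 1] @ delta[N - 1]) * B[N] / B[N - 1]

# Closed form (1): discounted prices, conventions delta_{-1} = delta_N = 0.
S_t = S / B[:, None]                   # tilde S
dpad = np.vstack([np.zeros(d_assets), delta, np.zeros(d_assets)])  # k = -1..N
gains = sum(S_t[k] @ (dpad[k] - dpad[k + 1]) for k in range(N + 1))
costs = sum(np.sum(c * S_t[k] * np.abs(dpad[k + 1] - dpad[k])) for k in range(N))
pf_N_closed = B[N] * (p0 + gains - costs)
```

The two values agree to machine precision, since the closed form is just summation by parts applied to the self-financing recursion.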
3.2 Risk measures and optimal trading strategies
If the portfolio of claims $Z$ is reachable with $(S, B)$ (only traded at $t_0, \ldots, t_{N-1}$) and we were not subject to transaction costs, then a reasonable objective would be to find a trading strategy $(\delta_k)_{k=0,\ldots,N-1}$ and initial portfolio value $p_0$ s.t.

$$PnL_T(Z, p_0, \delta) = 0 \quad \text{almost surely.}$$

This would then be a perfect hedging strategy for $Z$. This is essentially what we did in the experiment with a binomial model in section 2.
Generally, we do not think that $Z$ is reachable with our available trading strategies. We, therefore, choose to find $p_0$ and $(\delta_k)_{k=0,\ldots,N-1}$ s.t. they are optimal when considering the risk of $\tilde{PnL}_T(Z, p_0, \delta)$. To do this, we focus on measuring the risk with coherent risk measures.
Definition 3.1. Let $L, L_1, L_2 \in \mathcal{L}$ be loss random variables (liabilities); then $\rho : \mathcal{L} \to \mathbb{R}$ is a coherent risk measure if it satisfies the following properties (axioms of risk measures):
- Translation (cash) invariance: $\rho(L + c) = \rho(L) + c$ for $c \in \mathbb{R}$.
- Subadditivity: $\rho(L_1 + L_2) \le \rho(L_1) + \rho(L_2)$.
- Positive homogeneity: $\rho(\lambda L) = \lambda\rho(L)$ for $\lambda > 0$.
- Monotonicity: If $L_1 \le L_2$ then $\rho(L_1) \le \rho(L_2)$.
In this thesis, we always assume that a coherent risk measure is normalized, meaning that $\rho(0) = 0$. Given a coherent risk measure $\rho$, we can quantifiably measure the risk of $PnL_T$ by evaluating $\rho(-\tilde{PnL}_T(Z, p_0, \delta))$. Note that we choose to evaluate the risk of the discounted PnL. We can now formulate our objective as
1. For a given $p_0$, we wish to find a trading strategy $(\delta_k)_{k=0,\ldots,N-1}$ that minimizes $\rho(-\tilde{PnL}_T(Z, p_0, \delta))$.
2. Find a fair value for $p_0$ given $Z$, $\rho$ and the optimal $(\delta_k)_{k=0,\ldots,N-1}$.
Note that solving both objectives 1 and 2 might be difficult if the optimal trading strategy depends on $p_0$, since the fair value for $p_0$ might, in turn, depend on the optimal trading strategy. We will see later that this is not a problem when we measure the risk with a coherent risk measure on the discounted PnL.
We start by considering the first objective. We assume that $p_0$ is given and consider the minimization problem

$$\inf_\delta \rho\big(-\tilde{PnL}_T(Z, p_0, \delta)\big).$$

Using the definition of $PnL_T$ and the shorthand expression for $\tilde{PF}_N$, we can express the optimization problem as

$$\inf_\delta \rho\big(-\tilde{PnL}_T(Z, p_0, \delta)\big) = \inf_\delta \rho\big(\tilde{Z} - p_0 - (\tilde{S}\cdot\delta)_T + C_T(\delta)\big).$$

Using the cash invariance of $\rho$, we obtain

$$\inf_\delta \rho\big(\tilde{Z} - p_0 - (\tilde{S}\cdot\delta)_T + C_T(\delta)\big) = -p_0 + \inf_\delta \rho\big(\tilde{Z} - (\tilde{S}\cdot\delta)_T + C_T(\delta)\big),$$

showing us that the optimization problem (and therefore the optimal trading strategy) is independent of $p_0$. We, therefore, define the following relevant optimization problem

$$\pi(\tilde{Z}) := \inf_\delta \rho\big(-\tilde{PnL}_T(Z, 0, \delta)\big) = \inf_\delta \rho\big(\tilde{Z} - (\tilde{S}\cdot\delta)_T + C_T(\delta)\big). \tag{4}$$

We think of the optimal trading strategy $\delta^*$ associated with $\pi(\tilde{Z})$ as the optimal trading strategy for a trader with an option portfolio $Z$ given risk measure $\rho$.
We wish to understand $\pi$ before we discuss how to determine $p_0$. One can show that $\pi$ is itself a coherent risk measure.
Proposition 3.2. $\pi$ is a coherent risk measure.
Proof. That $\pi$ satisfies the axioms of cash invariance, positive homogeneity and monotonicity follows directly from cash invariance, positive homogeneity and monotonicity of $\rho$. We, therefore, focus on showing subadditivity.
To prove subadditivity, assume we have two loss random variables $L_1$ and $L_2$. Then by the definition of $\pi$ (see equation (4))

$$\pi(L_1 + L_2) = \inf_\delta \rho\big(L_1 + L_2 - (\tilde{S}\cdot\delta)_T + C_T(\delta)\big).$$

As we have no restrictions on $\delta$, we can reformulate the problem with $\delta = \delta_1 + \delta_2$ (1st step), then utilize linearity of $(\tilde{S}\cdot\delta)_T$ together with subadditivity of $C_T(\delta)$ (the triangle inequality) and monotonicity of $\rho$ (2nd step), utilize subadditivity of $\rho$ (3rd step) and finally use the definition of $\pi$ (4th step):

$$\inf_\delta \rho\big(L_1 + L_2 - (\tilde{S}\cdot\delta)_T + C_T(\delta)\big) = \inf_{\delta_1, \delta_2} \rho\big(L_1 + L_2 - (\tilde{S}\cdot(\delta_1 + \delta_2))_T + C_T(\delta_1 + \delta_2)\big)$$
$$\le \inf_{\delta_1, \delta_2} \rho\Big(\big(L_1 - (\tilde{S}\cdot\delta_1)_T + C_T(\delta_1)\big) + \big(L_2 - (\tilde{S}\cdot\delta_2)_T + C_T(\delta_2)\big)\Big)$$
$$\le \inf_{\delta_1} \rho\big(L_1 - (\tilde{S}\cdot\delta_1)_T + C_T(\delta_1)\big) + \inf_{\delta_2} \rho\big(L_2 - (\tilde{S}\cdot\delta_2)_T + C_T(\delta_2)\big)$$
$$= \pi(L_1) + \pi(L_2).$$

This shows that $\pi$ is subadditive, which proves that $\pi$ is a coherent risk measure.
Finding a fair price $p_0$

We can now consider finding a fair compensation $p_0$ for selling the option portfolio $Z$. We first notice that, because of cash invariance of $\pi$,

$$\pi\big(\tilde{Z} - \pi(\tilde{Z})\big) = \pi(\tilde{Z}) - \pi(\tilde{Z}) = 0,$$

which justifies thinking of $\pi(\tilde{Z})$ as the smallest amount added to position $Z$ to make the position acceptable in terms of $\rho$, i.e. the smallest amount $c$ satisfying $\pi(\tilde{Z} - c) \le 0$. This might seem like a reasonable price. However, it does not consider the fact that the trader could obtain better portfolio risk by not selling $Z$. This is the case if $\pi(0) < 0$, i.e. the trader can obtain better than 0 risk by trading $(S, B)$. This might happen if the tradable assets have high expected returns or if we use a risk measure $\rho$ that focuses less on large losses (e.g. CVaR with a low confidence level). We, therefore, consider the so-called indifference price $p(Z)$ satisfying

$$\pi\big(\tilde{Z} - p(Z)\big) = \pi(0),$$

i.e. the price making the trader indifferent between selling $Z$ and getting $p(Z)$ or not selling $Z$ (assuming the current portfolio is 0). Note that because of the cash invariance of $\pi$, the indifference price $p(Z)$ has solution

$$p(Z) = \pi(\tilde{Z}) - \pi(0). \tag{5}$$

The indifference price is a more sensible choice as a fair value for $p_0$ given a risk measure $\rho$.
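To see the indifference price (5) in action, consider a deliberately tiny example of our own (all numbers made up for illustration): a one-period market with four terminal states, no transaction costs, $r = 0$, and $\rho = CVaR_{0.95}$, a risk measure formally introduced in section 3.3. Because each scenario here carries more than 5% probability, $CVaR_{0.95}$ reduces to the worst-case loss, so the indifference price of a call becomes its scenario superhedging price:

```python
import numpy as np

# One-period scenario market (made-up numbers): r = 0, no transaction costs.
S0 = 100.0
S1 = np.array([90.0, 100.0, 110.0, 120.0])     # terminal asset prices
prob = np.array([0.3, 0.3, 0.3, 0.1])          # real-world probabilities
Z = np.maximum(S1 - 100.0, 0.0)                # call option we are selling
alpha = 0.95

def cvar(loss, prob, alpha):
    # Rockafellar-Uryasev form; the optimal w is one of the scenario losses.
    cand = [w + prob @ np.maximum(loss - w, 0.0) / (1 - alpha) for w in loss]
    return min(cand)

def pi(claim):
    # pi(claim) = inf_delta rho(claim - delta * (S1 - S0)), one tradable asset,
    # with delta searched over a simple grid.
    deltas = np.linspace(-2.0, 2.0, 401)
    return min(cvar(claim - dl * (S1 - S0), prob, alpha) for dl in deltas)

p_indiff = pi(Z) - pi(np.zeros_like(S1))       # indifference price, eq. (5)
```

Here $\pi(0) = 0$ (doing nothing is optimal), and $p(Z) \approx 6.67$, the scenario superhedging price; a lower confidence level would typically give a lower price.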
Be aware that $p(Z)$ might not exist if $\pi(0) = -\infty$, which happens if the market exhibits a tradable arbitrage or if $\rho$ is chosen in an unfortunate way. We will, however, not worry about this as we (mostly) wish to work with conditional value-at-risk with high confidence levels (and, of course, in settings with no tradable arbitrages).
To show that $p(Z)$ does not cause issues with standard pricing, we can show that if $Z$ is reachable and there are no transaction costs, then $p(Z)$ equals the unique hedging price.
Proposition 3.3. Assume $C_T(\delta) = 0$ for all $\delta$. If $Z$ is reachable, i.e. if $\exists\,\delta^*$ and $p_0^* \in \mathbb{R}$ s.t. $Z = B_N\big(p_0^* + (\tilde{S}\cdot\delta^*)_T\big)$, then $p(Z) = p_0^*$.

Proof. We wish to start by reassuring ourselves that $Z = B_N\big(p_0^* + (\tilde{S}\cdot\delta^*)_T\big)$ is actually what we think of as reachable assuming no transaction costs. The right-hand side should be the value of a portfolio starting at $p_0^*$ and
trading with $\delta^*$ until time $T$. Using equations (1) and (2), we know that the portfolio value $PF_N$ is

$$PF_N = B_N p_0^* + \sum_{k=0}^{N}\sum_{i=1}^{d} B_N \tilde{S}^i_k\big(\delta^{*,i}_{k-1} - \delta^{*,i}_k\big) = B_N\big(p_0^* + (\tilde{S}\cdot\delta^*)_T\big),$$

which is exactly what we proposed to represent the fact that $Z$ is reachable.
We know that $p(Z) = \pi(\tilde{Z}) - \pi(0)$, so naturally, to show $p(Z) = p_0^*$, we consider $\pi(\tilde{Z})$. We start by using our current assumptions of reachability and no transaction costs (1st step); note that we divide by $B_N$. Then we utilize cash invariance and the definition of $\pi$ (2nd step) and the definition of $(\tilde{S}\cdot\delta)_T$ (3rd step). In the final step, we utilize the definition of $\pi$ and the fact that the minimizing trading strategy $\delta$ can completely offset $\delta^*$ since we assume no restrictions on $\delta$:

$$\pi(\tilde{Z}) = \pi\big(p_0^* + (\tilde{S}\cdot\delta^*)_T\big)$$
$$= p_0^* + \inf_\delta \rho\big((\tilde{S}\cdot\delta^*)_T - (\tilde{S}\cdot\delta)_T\big)$$
$$= p_0^* + \inf_\delta \rho\big((\tilde{S}\cdot[\delta^* - \delta])_T\big)$$
$$= p_0^* + \pi(0),$$

showing us that $p_0^* = \pi(\tilde{Z}) - \pi(0)$, which proves that $p(Z) = p_0^*$.
This shows that
p(Z)
is not arbitrary and is still tied to classical theory. We now have a better understanding of the
optimization problem faced by a trader. However, to better understand the effect of working under risk measures, we consider another example.
Finding a fair price (of a new option portfolio) in the case of a preexisting option portfolio

Assume now that we are selling option portfolio $Z$ at a price $p_0$, and we consider selling $Z_1$ (in addition to $Z$). What would then be a fair price for $Z_1$?
Using the same principle as before, the price for $Z_1$, $p_1$, should make the trader indifferent between selling $Z$ and $Z + Z_1$, where the price for $Z$ is $p_0$. This implies that $p_1$ should satisfy

$$\pi\big(\tilde{Z} + \tilde{Z}_1 - p_0 - p_1\big) = \pi\big(\tilde{Z} - p_0\big).$$
Using cash invariance of $\pi$, we obtain

$$p_1 = \pi\big(\tilde{Z} + \tilde{Z}_1\big) - p_0 - \pi(\tilde{Z}) + p_0 = \pi\big(\tilde{Z} + \tilde{Z}_1\big) - \pi(\tilde{Z}),$$

which is equal to $p(Z_1)$ in the case where $Z = 0$, but in general might be different from $p(Z_1)$. By subadditivity of $\pi$, we can show that the fair value for $p_1$ is less than or equal to $p(Z_1)$. To see this, we first notice that by subadditivity of $\pi$

$$\pi\big(\tilde{Z} + \tilde{Z}_1\big) \le \pi(\tilde{Z}) + \pi(\tilde{Z}_1),$$

which easily shows that

$$p_1 = \pi\big(\tilde{Z} + \tilde{Z}_1\big) - \pi(\tilde{Z}) \le \pi(\tilde{Z}_1) \le p(Z_1),$$

since $p(Z_1) = \pi(\tilde{Z}_1) - \pi(0)$ and $\pi(0) \le 0$.
This makes intuitive sense, since the correlation between our current option portfolio $Z$ and the new one $Z_1$ might be advantageous due to the subadditivity of $\rho$ (and hence of $\pi$). However, this result does not hold if we assume a general convex risk measure (instead of a coherent risk measure). In this case, $\pi$ might not be subadditive, which could result in the indifference price for $Z_1$ being larger than $p(Z_1)$. We will, however, only be working with coherent risk measures, so we are not concerned about this.
3.3 CVaR and its practicalities
In this thesis, we consider optimizing the PnL under the risk measure conditional value-at-risk (CVaR or expected
shortfall). To define CVaR, we must first define value-at-risk (VaR).
Definition 3.4. The VaR of a loss random variable $L$ at confidence level $\alpha \in (0,1)$ is defined as
$$\mathrm{VaR}_\alpha(L) = \inf\{x \in \mathbb{R} : P(L > x) \le 1 - \alpha\}.$$
Note that $\mathrm{VaR}_\alpha(L)$ is simply the $\alpha$-quantile of $L$. We can now define CVaR.
Definition 3.5. Given a loss random variable $L$ with $E(|L|) < \infty$, we define the CVaR of $L$ at confidence level $\alpha \in (0,1)$ as
$$\mathrm{CVaR}_\alpha(L) := \frac{1}{1-\alpha}\int_\alpha^1 \mathrm{VaR}_u(L)\,du.$$
One can easily show that if the loss random variable $L$ is integrable with a continuous distribution function, then
$$\mathrm{CVaR}_\alpha(L) = E[L \mid L \ge \mathrm{VaR}_\alpha(L)]$$
(see lemma 2.13 in [3]). We can, therefore, think of CVaR as the average loss exceeding the corresponding value-at-risk. This is a widespread and useful risk measure, and it satisfies the conditions of a coherent risk measure.
Proposition 3.6. $\mathrm{CVaR}_\alpha$ is a coherent risk measure.
Proof. See example 2.26 in [3].
For practical purposes, CVaR can be quite tough to evaluate. We, therefore, utilize the alternative form given below.
Proposition 3.7. For a loss random variable $L$ and confidence level $\alpha \in (0,1)$, we have
$$\mathrm{CVaR}_\alpha(L) = \inf_w \left\{ w + \frac{1}{1-\alpha} E\big[(L - w)^+\big] \right\}$$
where $(x)^+ = \max(x, 0)$, and the optimal $w$ is $\mathrm{VaR}_\alpha(L)$.
Proof. See proposition 4.51 in [4].
If we choose our risk measure $\rho$ to be $\mathrm{CVaR}_\alpha$, then our optimization problem becomes
$$\pi(\tilde Z) = \inf_\delta \inf_w \left\{ w + \frac{1}{1-\alpha} E\big[(-\widetilde{PnL}_T(Z, 0, \delta) - w)^+\big] \right\} = \inf_{\delta, w} \left\{ w + \frac{1}{1-\alpha} E\big[(-\widetilde{PnL}_T(Z, 0, \delta) - w)^+\big] \right\} \quad (6)$$
where, of course, $\widetilde{PnL}_T(Z, 0, \delta) = -\tilde Z + (\tilde S\cdot\delta)_T - C_T(\delta)$. So when finding the optimal trading strategy $\delta$ under CVaR, we have to solve the above minimization problem jointly over $(\delta, w)$, which fortunately does not pose any extra issues (other than slightly more computational work).
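For an empirical PnL sample, the representation in proposition 3.7 can be evaluated directly, and since the infimum over $w$ is attained at the empirical VaR, it suffices to search over the sample points themselves. A minimal NumPy sketch (the function name is our own):

```python
import numpy as np

def cvar_rockafellar_uryasev(losses, alpha):
    """Estimate CVaR_alpha(L) via inf_w { w + E[(L - w)^+] / (1 - alpha) }.

    The infimum is attained at w = VaR_alpha(L), so for an empirical
    sample it suffices to search over the sample points themselves.
    """
    losses = np.asarray(losses, dtype=float)
    w = np.sort(losses)                                  # candidate w values
    excess = np.maximum(losses[None, :] - w[:, None], 0.0)
    objective = w + excess.mean(axis=1) / (1.0 - alpha)
    return objective.min()

# Sanity check against the direct tail-average form E[L | L >= VaR_alpha(L)].
rng = np.random.default_rng(0)
losses = rng.standard_normal(2_000)
alpha = 0.95
direct = losses[losses >= np.quantile(losses, alpha)].mean()
ru = cvar_rockafellar_uryasev(losses, alpha)
assert abs(ru - direct) < 1e-9
```

The same representation is what makes CVaR usable as a training loss: the extra scalar $w$ is simply treated as one more trainable parameter.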
3.4 Practicalities of minimizing over δ
At this point, we can express the trader's problem as the minimization problem in (4) (see (6) under CVaR), and we have seen that a fair price can be expressed as the indifference price found in (5). However, solving these minimization problems is easier said than done. Up until now, all minimization problems have minimized the risk of a portfolio's PnL w.r.t. the trading strategy $\delta$. In practice, we have to assume some structure and/or functional form of $\delta$.
To do this, we first have to figure out how much information is needed or available to trade at time $t_k$ for all $k$. When we are in a Markovian model with deterministic interest rate and zero transaction costs, and we are only selling simple claims, then an optimal trading decision is likely only dependent on the current value of the risky assets. However, once we are in a more general setting, we might need more information. Examples of this could be:

Non-Markovian model and/or path-dependent options: More information about the path of the risky assets is needed for optimal trading. For a down-and-out claim, this could be knowledge of the minimum of the corresponding risky asset.

Non-zero transaction costs: Information about current holdings might be needed to trade optimally, since the current holdings affect transaction costs.

We assume that all relevant and available information is contained in the $\mathcal{F}_t$-measurable process $I(t) \in \mathbb{R}^q$. Note that previous trading decisions might also be included in $I(t)$.
Assume now that we have chosen some parametric family of functions $f_{\theta_k}: \mathbb{R}^q \to \mathbb{R}^d$ to represent the trading decision at time $t_k$ (i.e. $\delta_k$) for all $k = 0, \dots, N-1$. At time $t_k$, these are functions of the available information $I_k$ and represent the holdings in the $d$ risky assets from time $t_k$ to $t_{k+1}$. By parametric family, we mean functions whose differences depend only on the parameters $\theta_k$.
With these structural decisions, we can represent the trader's problem as
$$\pi^f(\tilde Z) := \inf_\theta \rho\!\left( \tilde Z - \sum_{k=1}^{N}\sum_{i=1}^{d} \tilde S^i_k\big(f_{\theta_k}(I_k)^{(i)} - f_{\theta_{k-1}}(I_{k-1})^{(i)}\big) + \sum_{k=0}^{N-1}\sum_{i=1}^{d} c^i_k \tilde S^i_k \big|f_{\theta_k}(I_k)^{(i)} - f_{\theta_{k-1}}(I_{k-1})^{(i)}\big| \right)$$
where $\theta = (\theta_0, \dots, \theta_{N-1})$. With this formulation of the problem, we should now be able to utilize some optimization algorithm to (approximately) solve the problem and hence find the (approximately) optimal trading strategy represented by $(f_{\theta^*_k}(I_k))_k$, where $\theta^*_k$ corresponds to the solution to the problem above. Note that one could (possibly) reformulate this problem to solve it with the Bellman equation. However, we will not pursue that approach.

In this thesis, we choose ANNs as the parametric family of functions to represent the trading decisions. We should then be able to train the ANNs simultaneously to minimize the above expression. The construction of the ANNs, their ability as universal approximators and the practicalities of training are discussed in section 4.
4 Artificial Neural Networks
In section 3, we saw that the optimal hedging strategies could be derived by solving problems of the form
$$\inf_\theta\, l\big(f_{\theta_1}(X_1), \dots, f_{\theta_n}(X_n)\big)$$
where the $X_i$ are random variables in $\mathbb{R}^q$ (possibly affected by $f_{\theta_j}$ for $j < i$), the $f_{\theta_i}$ are parametric functions into $\mathbb{R}^d$ representing trading decisions, and $l$ is some kind of loss function (in our case CVaR or MSE). Note that when we say $X_i$ might depend on $f_{\theta_j}$ for $j < i$ (i.e. previous holdings/trades), we mean that $X_i$ can be written as a function of the general market information $Y_i$ and previous holdings, i.e. $X_i = g_i\big(Y_i, \{f_{\theta_j}(X_j)\}_{j<i}\big)$ where $g_i$ is some map.
In practice, we have to solve the problem based on samples and an empirical version $\hat l$ of $l$. If we assume $m$ samples $X = [x^{(1)}, \dots, x^{(m)}]$ (where $x^{(i)} \in \mathbb{R}^{q\times n}$), then the problem can be represented as
$$\inf_\theta\, \hat l\big(f_{\theta_1}(x_1), \dots, f_{\theta_n}(x_n)\big)$$
where $x_i := (x^{(1)}_i, \dots, x^{(m)}_i) \in \mathbb{R}^{q\times m}$ and $f_{\theta_i}(x_i) = \big(f_{\theta_i}(x^{(1)}_i), \dots, f_{\theta_i}(x^{(m)}_i)\big) \in \mathbb{R}^{d\times m}$. Note again that samples might depend on previous trading decisions.
Artificial Neural Networks (ANNs) provide a framework for solving such optimization problems. In this section, we wish to explain and discuss different aspects of ANNs in relation to our optimization problem:
1. Explain the architecture of feed-forward ANNs.
2. Briefly explain the universal approximation theorem and the representation benefits of deep ANNs.
3. Explain the backpropagation algorithm.
4. Discuss practicalities of training and working with ANNs (especially concerning the upcoming experiments).
4.1 Architecture
An ANN (sometimes just called an NN) is a parametric family of functions $\{f_\theta \mid \theta \in \Theta\}$, and among many properties, $f_\theta$ can represent any continuous function.

In this thesis, we work with multilayered feed-forward ANNs. We define these ANNs as functions of the form
$$f(x) = h_L \circ h_{L-1} \circ \dots \circ h_1(x) \quad (7)$$
where $h_i: \mathbb{R}^{n_{i-1}} \to \mathbb{R}^{n_i}$ with $h_i(z) = \sigma_i(A_i z + b_i)$. Here we assume $A_i \in \mathbb{R}^{n_i\times n_{i-1}}$, $b_i \in \mathbb{R}^{n_i}$ and $\sigma_i: \mathbb{R} \to \mathbb{R}$, but the $\sigma_i$ are applied element-wise to vectors (and matrices).
Using machine learning lingo, we say that $x \in \mathbb{R}^{n_0}$ is the input layer, $A_i\, h_{i-1}\circ\dots\circ h_1(x) + b_i \in \mathbb{R}^{n_i}$ is the $i$'th hidden layer with $n_i$ units for $i = 1, \dots, L-1$, and $f(x) \in \mathbb{R}^{n_L}$ is the output layer. We furthermore refer to $\sigma_i$ as the activation function in the $i$'th layer.
As a parametric family of functions, we view
θ
as representing all the
A
s and
b
s, which we refer to as trainable
parameters. In contrast, the number of layers, the number of units in each layer and the activation functions are
referred to as untrainable parameters. One could, therefore, view the ANNs as infinitely many parametric families,
but we do not wish to complicate matters unnecessarily.
With this architecture, we use the term shallow ANN for ANNs with only one hidden layer, and we use the term
deep ANN for ANNs with multiple hidden layers.
What is the idea behind using ANNs? For one, ANNs are universal approximators and can (in theory) approximate any continuous function arbitrarily well given enough layers and units (as we will see later). ANNs are also very efficient to evaluate (i.e. calculate $f(x)$ for some $x$) due to their construction as layered activated affine transformations (activated coming from the $\sigma_i$s, which enable non-linear behavior). The simple construction and affine nature also enable efficient evaluation of the ANN for multiple inputs $(x^{(1)}, \dots, x^{(m)}) \in \mathbb{R}^{n_0\times m}$. After defining the collection of inputs as $X := (x^{(1)}, \dots, x^{(m)})$, it is only natural to consider/define
$$h_1(X) := \big(h_1(x^{(1)}), \dots, h_1(x^{(m)})\big) = \sigma_1(A_1X + b_1) \in \mathbb{R}^{n_1\times m}$$
where $b_1$ is added to each column of $A_1X$, and $\sigma_1$ is applied element-wise to the entire matrix. From this we easily obtain
$$h_i \circ \dots \circ h_1(X) = \sigma_i\big(A_i\, h_{i-1}\circ\dots\circ h_1(X) + b_i\big) \in \mathbb{R}^{n_i\times m}, \qquad f_\theta(X) = h_L \circ \dots \circ h_1(X),$$
which are all extremely efficient to compute due to the affine nature of the ANNs.
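As an illustration, the batched forward pass amounts to $L$ matrix products. A minimal NumPy sketch (the layer sizes and tanh activations are our own illustrative choices; the weight scaling anticipates the initialization heuristic of section 4.4):

```python
import numpy as np

def forward(X, params, activations):
    """Batched forward pass of (7): X has shape (n_0, m), one column per sample."""
    Z = X
    for (A, b), sigma in zip(params, activations):
        Z = sigma(A @ Z + b)            # b broadcasts across the m columns
    return Z

rng = np.random.default_rng(1)
sizes = [3, 5, 5, 1]                    # n_0, n_1, n_2, n_L
params = [(rng.standard_normal((n_out, n_in)) / np.sqrt(n_in),
           np.zeros((n_out, 1)))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
activations = [np.tanh, np.tanh, lambda z: z]   # identity in the output layer

X = rng.standard_normal((3, 10))        # batch of m = 10 inputs
out = forward(X, params, activations)
assert out.shape == (1, 10)
```

All $m$ samples pass through the network in one call, which is exactly the $\mathbb{R}^{n_i \times m}$ matrix view above.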
The simple structure and efficiency of evaluations are also crucial for efficiently computing the derivatives of the
loss function w.r.t. the inputs
X
and the parameters
θ
of the ANN. This is enabled by ingenious use of the chain rule,
referred to as backpropagation (or the backpropagation algorithm).
4.2 Universal Representation and Representation Benefits of deep ANNs
In this section, we wish to explain and discuss the aforementioned universal approximation theorem. First, we should
note that there is no single universal approximation theorem, but many theorems that describe neural networks
theoretical ability to approximate different functions. For our explanation, we stick to the one presented by Leshno et
al. [7], which is general enough for our purposes (regarding the applicable activation functions).
As a starting point, we wish to approximate continuous mappings from $\mathbb{R}^{n_0}$ to $\mathbb{R}^{n_L}$. Note, however, that any continuous map $f: \mathbb{R}^{n_0} \to \mathbb{R}^{n_L}$ can be decomposed into $n_L$ functions $f_i: \mathbb{R}^{n_0} \to \mathbb{R}$. This implies that approximating mappings into $\mathbb{R}^{n_L}$ can be broken down into approximating functions that take values in $\mathbb{R}$.

To reduce complexity even further, we wish to study only simple shallow ANNs without activation and bias in the output layer. Such ANNs can be described as functions $f: \mathbb{R}^{n_0} \to \mathbb{R}$ of the form
$$f(x) = A_2\,\sigma(A_1x + b_1)$$
where $A_1 \in \mathbb{R}^{n_1\times n_0}$, $b_1 \in \mathbb{R}^{n_1}$ and $A_2 \in \mathbb{R}^{1\times n_1}$. We can then define the set of shallow ANNs of this form as
$$\Sigma_{n_0} = \left\{ f: \mathbb{R}^{n_0} \to \mathbb{R} \,\middle|\, f(x) = A_2\,\sigma(A_1x + b_1),\ n_1 \in \mathbb{N},\ A_1 \in \mathbb{R}^{n_1\times n_0},\ b_1 \in \mathbb{R}^{n_1},\ A_2 \in \mathbb{R}^{1\times n_1} \right\}.$$
Leshno et al. [7] showed that if $\sigma$ is locally essentially bounded on $\mathbb{R}$ (with the property that the closure of the set of discontinuity points has Lebesgue measure zero), then $\Sigma_{n_0}$ is dense in $C(\mathbb{R}^{n_0})$ if and only if $\sigma$ is not a polynomial almost everywhere.
That $\sigma$ is locally essentially bounded on $\mathbb{R}$ means that $|\sigma(x)|$ is bounded a.e. on $K$ for all compact sets $K \subset \mathbb{R}$. By $\Sigma_{n_0}$ being dense in $C(\mathbb{R}^{n_0})$, we mean that for any continuous function $g \in C(\mathbb{R}^{n_0})$ and every compact set $K \subset \mathbb{R}^{n_0}$, there exists a sequence of functions $\{f_j\}$ in $\Sigma_{n_0}$ such that
$$\lim_{j\to\infty} \inf\big\{ c \,\big|\, \lambda\{x \in K : |g(x) - f_j(x)| \ge c\} = 0 \big\} = 0$$
where $\lambda$ is the Lebesgue measure. This implies that for every continuous function $g$ and every compact set $K$, there exist functions/ANNs in $\Sigma_{n_0}$ that are arbitrarily close (in absolute terms) to $g$ on $K$ (except maybe on a null set).
This result shows that for reasonable choices of activation functions, shallow ANNs can approximate contin-
uous functions arbitrarily well
if
the ANNs have enough units. Of course, the result also holds for deep ANNs since
they are a generalization of shallow ANNs. We are interested in deep ANNs because they have certain representation
benefits over shallow ANNs. If computational resources are limited and/or we cannot use infinitely many units and
layers, then deep ANNs have been empirically shown to be superior to shallow ANNs. This is likely because the
composite structure of the deep ANNs can produce relatively more complexity than the additive structure of shallow
ANNs.
In [
8
], Matus Telgarsky provides a simple example of a continuous function that deep ANNs can approximate
better with limited units than shallow ANNs. Together with the empirical evidence, it is clear that the use of deep
ANNs is justified. However, one should also consider the training procedures’ ability to find optimal trainable
parameters for the ANNs. This is usually harder for deep ANNs than shallow ANNs, which highlights a trade-off:
Precise approximation can more easily be obtained with many layers, but shallow networks are more manageable.
Another practical remark is that even though ANNs can represent any continuous function, we should still
consider helping the ANN by transforming the inputs into more relevant features (if possible). The ANN can (in
theory) do all necessary feature extraction. Still, if we can transform the inputs into something more closely related to
the outputs, then that transformation might decrease the complexity of training the ANN. We will see this in section
5.3.
4.3 Backpropagation
We start by assuming a simplified minimization problem, which depends on only one ANN:
$$\inf_\theta\, \underbrace{\hat l(f_\theta(X))}_{=:\, \hat c} \in \mathbb{R}$$
where we assume that the empirical loss $\hat l$ is based on observations $X = [x^{(1)}, \dots, x^{(m)}] \in \mathbb{R}^{n_0\times m}$.
To solve the minimization problem using ANNs (as our approximator), we need a method to find the optimal trainable parameters/weights of the ANNs. To do this, it is common to deploy a gradient descent algorithm, which updates the weights according to the derivative of the loss w.r.t. the trainable weights. Starting with a small non-zero guess of $A_i$ and $b_i$ for all $i$, we iteratively update the weights according to
$$A_i \leftarrow A_i - \gamma \frac{\partial \hat c}{\partial A_i}, \qquad b_i \leftarrow b_i - \gamma \frac{\partial \hat c}{\partial b_i}$$
where $\gamma > 0$ is the so-called learning rate. Choosing an appropriate sequence of learning rates guarantees (in theory) that we converge to a local minimum. Several improvements can be made to this algorithm; in this thesis, we utilize the ADAM algorithm [5]. However, common to all gradient descent algorithms is the necessity of fast computation of the derivatives of the loss w.r.t. the trainable parameters. This is where the backpropagation algorithm enters the picture.
Our goal is to determine $\frac{\partial \hat c}{\partial \theta_j}$ for all $\theta_j$ in $\theta$, which corresponds to determining $\frac{\partial \hat c}{\partial b_i}$ and $\frac{\partial \hat c}{\partial A_i}$ for all the $A$s and $b$s, which make up the trainable parameters of the ANN.

First, we assume that we know $\frac{\partial \hat c}{\partial f_\theta(X)} \in \mathbb{R}^{n_L\times m}$, which should come naturally from the definition of the empirical loss function $\hat l$.
The ingenious idea behind the backpropagation algorithm is that, given $\frac{\partial \hat c}{\partial f_\theta(X)} \in \mathbb{R}^{n_L\times m}$, we can find the derivatives of $\hat c$ w.r.t. all intermediate values from the computation of $f_\theta$ by utilizing the chain rule. That is, using the chain rule to compute $\frac{\partial \hat c}{\partial h_{i,1}(X)} \in \mathbb{R}^{n_i\times m}$ for all $i$ (where we introduce the notation $h_{i,1}(X) := h_i \circ \dots \circ h_1(X)$). From there, we can again apply the chain rule to find the derivatives of $\hat c$ w.r.t. $A_i$ and $b_i$ for all $i$. These applications of the chain rule can be particularly tedious. Pedersen and Frandsen [6] derive these derivatives in detail, which (in our notation) can be boiled down to the following equations:
$$\begin{aligned}
\frac{\partial \hat c}{\partial (A_i h_{i-1,1}(X) + b_i)} &= \sigma_i'\big(A_i h_{i-1,1}(X) + b_i\big) \odot \frac{\partial \hat c}{\partial h_{i,1}(X)} \in \mathbb{R}^{n_i\times m}\\
\frac{\partial \hat c}{\partial A_i} &= \frac{\partial \hat c}{\partial (A_i h_{i-1,1}(X) + b_i)}\, \big(h_{i-1,1}(X)\big)^\top \in \mathbb{R}^{n_i\times n_{i-1}}\\
\frac{\partial \hat c}{\partial b_i} &= \frac{\partial \hat c}{\partial (A_i h_{i-1,1}(X) + b_i)}\, \mathbf{1}_m \in \mathbb{R}^{n_i}\\
\frac{\partial \hat c}{\partial h_{i-1,1}(X)} &= A_i^\top\, \frac{\partial \hat c}{\partial (A_i h_{i-1,1}(X) + b_i)} \in \mathbb{R}^{n_{i-1}\times m}
\end{aligned}$$
where $\mathbf{1}_m \in \mathbb{R}^m$ is a vector of $1$s and $\odot$ denotes the element-wise product. Looking at the first and last equations, we see that given $\frac{\partial \hat c}{\partial f_\theta(X)} \in \mathbb{R}^{n_L\times m}$ it is possible to iterate backwards through the intermediate calculations of $f_\theta(X)$ to find the derivatives of $\hat c$ w.r.t. the trainable weights. An important observation is that this algorithm requires knowledge of all intermediate values from the calculation of $f_\theta(X)$. To use the algorithm, it is, therefore, necessary to first evaluate $f_\theta(X)$ (which we call a forward pass of the ANN), storing all intermediate values. After that, we do a so-called backward pass of the ANN using the above equations. This is, in its essence, the backpropagation algorithm.
We should note that this algorithm becomes too simplistic when the loss $\hat c$ depends on multiple ANNs that affect each other. This is the case when we wish to find the optimal trading strategy, where each trading decision is represented by a separate ANN and each trading decision (ANN) affects future trading decisions. However, this does not change the backbone of the backpropagation algorithm.
First, assume that we wish to backpropagate through
$$\inf_\theta\, \underbrace{\hat l\big(f_{\theta_1}(x_1), \dots, f_{\theta_n}(x_n)\big)}_{=:\, \hat c}$$
to determine $\frac{\partial \hat c}{\partial \theta_i}$ for all $i$. Remember that $x_i$ might depend on $f_{\theta_j}(x_j)$ for $j < i$. However, we can directly apply the simple backpropagation technique to derive $\frac{\partial \hat c}{\partial \theta_i}$ given $\frac{\partial \hat c}{\partial f_{\theta_i}}$. The challenge is, therefore, to determine $\frac{\partial \hat c}{\partial f_{\theta_i}}$ for all $i$, which might not be straightforward. Note that as $f_{\theta_n}$ is the last trading decision (meaning no further hidden dependencies in the $x$s), it should be possible to determine $\frac{\partial \hat c}{\partial f_{\theta_n}}$ from the definition of the empirical loss function $\hat l$. This enables calculation of $\frac{\partial \hat c}{\partial \theta_n}$ using the simple backpropagation technique. From here, it should be possible to determine $\frac{\partial f_{\theta_n}}{\partial f_{\theta_{n-1}}}$ from the defined connection between $x_n$ and $f_{\theta_{n-1}}(x_{n-1})$. This enables calculation of $\frac{\partial \hat c}{\partial f_{\theta_{n-1}}}$ and then $\frac{\partial \hat c}{\partial \theta_{n-1}}$. At this point, we can continue to iteratively determine $\frac{\partial \hat c}{\partial f_{\theta_i}}$ and then $\frac{\partial \hat c}{\partial \theta_i}$ from $i = n-2$ down to $i = 1$. As a result, we can determine $\frac{\partial \hat c}{\partial \theta_i}$ for all $i$, which is exactly what we wanted. This may seem difficult; luckily, it is easily handled by general automatic adjoint differentiation (AAD) algorithms (as implemented in TensorFlow).
4.4 Implementation and Training Neural Networks
In this thesis, all implementations and training of ANNs are done in Python with TensorFlow. There exists an endless number of training procedures and methods for obtaining faster and more accurate convergence of ANNs. Below is a brief summary of some of the techniques that we utilize in this thesis.
Normalization and Batch normalization
Normalization of input and output data can significantly improve convergence speed and accuracy. The ANN could, in theory, scale all inputs and outputs itself, but pre-processing the data (if possible) can help ensure that some inputs and/or outputs are not overly prioritized over others. Many activation functions also work best if the data lies between -1 and 1. For this reason, we choose to normalize input and output data by subtracting the mean and dividing by the standard deviation of the data.
However, in our implementations, it is not always feasible/practical to pre-process the inputs and outputs since
we are not performing regression but rather solving a minimization problem over hedging strategies. For this reason,
we may choose to use batch normalization. Batch normalization can be viewed as a layer in the ANN, which scales
the data according to the mean and standard deviation. However, the batch normalization layer does not utilize the
mean and standard deviation for the current data passed through the layer but rather a moving average of the mean
and standard deviation of current and previous data.
Batch normalization can even be done between every ordinary layer of the ANN to help increase the stability of
the ANN, which in turn helps convergence of the gradient descent algorithm.
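A minimal sketch of what such a layer computes (our own simplified version: the learnable scale and shift parameters of standard implementations are omitted, and the momentum value is an illustrative choice):

```python
import numpy as np

class BatchNorm:
    """Normalizes each feature across the batch; keeps moving statistics."""

    def __init__(self, n_features, momentum=0.99, eps=1e-5):
        self.mu = np.zeros((n_features, 1))
        self.var = np.ones((n_features, 1))
        self.momentum, self.eps = momentum, eps

    def __call__(self, X, training=True):
        if training:
            mu = X.mean(axis=1, keepdims=True)
            var = X.var(axis=1, keepdims=True)
            # Update the moving averages with the current batch statistics.
            self.mu = self.momentum * self.mu + (1 - self.momentum) * mu
            self.var = self.momentum * self.var + (1 - self.momentum) * var
        else:                      # at test time, use the moving averages
            mu, var = self.mu, self.var
        return (X - mu) / np.sqrt(var + self.eps)

rng = np.random.default_rng(3)
bn = BatchNorm(2)
X = 5.0 + 2.0 * rng.standard_normal((2, 1000))
Y = bn(X, training=True)
assert abs(Y.mean()) < 1e-8 and abs(Y.std() - 1.0) < 1e-2
```

Calling the layer with `training=False` reproduces the behavior described above: the current batch no longer drives the scaling, only the accumulated moving averages do.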
Initialization of trainable parameters
In practice, the resulting optimal trainable parameters and the convergence speed depend heavily on the initialization of the trainable parameters. A common heuristic is to use variance scaling, where we initialize $b_i = 0$ and draw the entries of $A_i$ from independent normal random variables with mean zero and variance $1/n_{i-1}$.
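As a sketch, the heuristic amounts to the following (the helper name is our own):

```python
import numpy as np

def variance_scaling_init(sizes, rng=None):
    """b_i = 0 and A_i ~ N(0, 1/n_{i-1}), independently entrywise."""
    rng = rng or np.random.default_rng(0)
    return [(rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in)),
             np.zeros((n_out, 1)))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

params = variance_scaling_init([100, 50, 1])
A1, b1 = params[0]
assert A1.shape == (50, 100) and np.allclose(b1, 0)
assert abs(A1.var() - 1 / 100) < 2e-3   # empirical variance near 1/n_0
```

Scaling the variance by $1/n_{i-1}$ keeps the typical magnitude of each pre-activation $A_i z + b_i$ roughly independent of the layer width.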
Gradient descent algorithm
We utilize the ADAM algorithm, which has heuristically been shown to provide excellent convergence in terms of speed and accuracy. The ADAM algorithm uses adaptive learning rates for each parameter and momentum to ensure faster convergence, for example, by accelerating learning on a plateau and decelerating learning close to a minimum.

However, we might still decrease the learning rate manually when the algorithm plateaus, and stop the algorithm when it plateaus despite the lowered learning rate. Typically, plateauing is monitored on a validation set, which is separate from the training set. This is done to reduce (the probability of) overfitting. However, we are not too concerned with this since we generate our own training data.
Mini batches
Even though evaluation of an ANN and backpropagation is efficient, we still want to improve convergence speed if
possible. It is, therefore, common to update the parameters based on the gradient calculated on a small batch of the
original data set. This is called mini-batch gradient descent, and it is heuristically shown to improve the gradient
descent algorithm’s convergence significantly. It is common to use a mini-batch size of 32. However, in this thesis,
we choose to use batch sizes of 128-1024 since we utilize the CVaR risk measure as the loss function.
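A sketch of the mini-batch update loop on a toy scalar problem (plain gradient descent for illustration; in our setting the ADAM update and the hedging loss would take the place of the update line and `grad_fn`, whose names are our own):

```python
import numpy as np

def minibatch_sgd(X, grad_fn, theta, batch_size=128, epochs=10, lr=1e-2, seed=0):
    """Plain mini-batch gradient descent (ADAM would replace the update line)."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    for _ in range(epochs):
        # Shuffle the samples and sweep over them in small batches.
        for idx in np.array_split(rng.permutation(m), max(1, m // batch_size)):
            theta = theta - lr * grad_fn(theta, X[:, idx])
    return theta

# Toy problem: minimize E[(theta - x)^2] over samples x; the optimum is the mean.
rng = np.random.default_rng(4)
X = rng.normal(3.0, 1.0, size=(1, 4096))
theta = minibatch_sgd(X, lambda t, xb: 2 * (t - xb).mean(), np.array(0.0),
                      batch_size=128, epochs=50, lr=0.05)
assert abs(theta - X.mean()) < 0.1
```

The larger batch sizes we use for CVaR follow the same loop; a bigger batch simply gives a less noisy estimate of the tail of the loss distribution per update.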
5 Deep Hedging Experiments
In this section, we run multiple hedging experiments to evaluate the performance of the deep hedging approach and
to better understand its pros and cons. For these experiments, we assume that the risky assets have dynamics of
well-established models such as the Black-Scholes model and the Heston model. We also assume a fixed interest rate $r$ in all experiments.
When working in these established frameworks, we can compare the ANN model’s performance to standard
delta hedging. We measure performance by looking at discounted PnLs calculated on out-of-sample hedges.
In all of our experiments, we have $\pi(0) = 0$, i.e. the trader cannot gain extra risk performance by trading the market with no initial portfolio value. This is a result of our choice of risk measure, $\mathrm{CVaR}_{0.95}$, and the fact that all assets have moderate drift rates. This implies that the indifference price is $p(Z) = \pi(\tilde Z)$. As we will see later, the interesting results come from the trading strategies and not the indifference prices. For this reason, we are not too concerned with indifference prices.
The different experiments:
1.
Hedging a European call option in a one dimensional Black-Scholes model with and without transaction costs.
We include an analysis of the stability of the deep hedging models when training with the wrong volatility.
2.
Hedging a portfolio of put and call options in a four-dimensional Black-Scholes model without transaction
costs. We include an analysis of the stability of the deep hedging model when training with the wrong
correlation matrix.
3. Hedging a down-and-out call option in a one-dimensional Black-Scholes model.
4. Hedging a call option in a Heston model by only trading the underlying asset.
5.1 Black-Scholes with 1 asset and a simple claim
5.1.1 No transaction costs
In this experiment, we assume that we can trade one risky asset $S$ that is driven by a Black-Scholes model. This implies that $S$ has the following dynamics under the real-world measure $P$:
$$dS(t)/S(t) = \mu\, dt + \sigma\, dW^P(t).$$
The price of the risky asset $S$ is a geometric Brownian motion and has, for $t > s$, the conditional solution
$$S(t) = S(s)\, e^{\left(\mu - \frac{\sigma^2}{2}\right)(t-s) + \sigma\left(W^P(t) - W^P(s)\right)} \overset{d}{=} S(s)\, e^{\left(\mu - \frac{\sigma^2}{2}\right)(t-s) + \sigma\sqrt{t-s}\, Z}$$
where $Z \sim N(0,1)$. Hence, we can sample from the distribution of $S$ without error.
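Because the conditional law is known in closed form, paths can be simulated exactly on the hedge grid with no discretization bias. A sketch (the function name is our own; the parameter values match the experimental setup in this section):

```python
import numpy as np

def simulate_gbm_paths(s0, mu, sigma, T, n_steps, n_paths, seed=0):
    """Exact GBM sampling on an equidistant grid via the closed-form solution."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    Z = rng.standard_normal((n_paths, n_steps))
    increments = (mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * Z
    log_paths = np.log(s0) + np.concatenate(
        [np.zeros((n_paths, 1)), np.cumsum(increments, axis=1)], axis=1)
    return np.exp(log_paths)

paths = simulate_gbm_paths(s0=1.0, mu=0.05, sigma=0.3, T=3 / 12, n_steps=60,
                           n_paths=100_000)
# Monte Carlo check of the known first moment E[S(T)] = s0 * exp(mu * T).
assert abs(paths[:, -1].mean() - np.exp(0.05 * 3 / 12)) < 5e-3
```

The returned array has one row per path and $n_{\text{steps}} + 1$ columns, matching the 60 equidistant hedge points plus the initial time.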
For this experiment, we wish to hedge a single call option over 60 hedge points and with no transaction costs. For completeness, we include the price and delta of a European call option below.

Proposition 5.1 (Prop. 7.10 in [9]). The price of a European call option at time $t$ with maturity $T$ and strike $K$ is given by
$$C(S(t), t) = S(t)\Phi(d_1) - e^{-r(T-t)}K\Phi(d_2)$$
where $\Phi$ is the cumulative distribution function of a standard normal random variable and
$$d_1 = \frac{1}{\sigma\sqrt{T-t}}\left[\ln\frac{S(t)}{K} + \left(r + \frac{\sigma^2}{2}\right)(T-t)\right], \qquad d_2 = d_1 - \sigma\sqrt{T-t}. \quad (8)$$
Proposition 5.2 (Prop. 9.5 in [9]). The delta of a European call option at time $t$ with maturity $T$ and strike $K$ is given by
$$\Delta^{BS} = \frac{\partial C(S(t), t)}{\partial S(t)} = \Phi(d_1)$$
where $\Phi$ is the cumulative distribution function of a standard normal random variable and $d_1$ is as in equation (8).

In this experiment, we refer to delta hedging as the analytical approach. Note that the delta hedging strategy invests $\Delta^{BS}$ in $S$ at every trading opportunity. In this simple experiment, we know that delta hedging can yield an arbitrarily low hedge error given enough hedge points.
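Both propositions translate directly into code; the sketch below (helper names are our own) reproduces the initial portfolio value 0.06216 used in the hedge test of this section:

```python
from math import erf, exp, log, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call_price_delta(s, k, r, sigma, tau):
    """Black-Scholes price and delta of a European call, tau = T - t."""
    d1 = (log(s / k) + (r + 0.5 * sigma ** 2) * tau) / (sigma * sqrt(tau))
    d2 = d1 - sigma * sqrt(tau)
    price = s * norm_cdf(d1) - exp(-r * tau) * k * norm_cdf(d2)
    return price, norm_cdf(d1)

# The ATM call of this experiment: S(0) = 1, K = 1, r = 0.02, sigma = 0.3, T = 3/12.
price, delta = bs_call_price_delta(1.0, 1.0, 0.02, 0.3, 3 / 12)
assert abs(price - 0.06216) < 1e-4
assert 0.0 < delta < 1.0
```

This is the benchmark the deep hedging models are measured against: the analytical strategy simply holds `delta` units of $S$ at each hedge point.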
The setup for the experiment is:

Model: One-dimensional Black-Scholes.
Model parameters: $S(0) = 1$, $\mu = 0.05$, $\sigma = 0.3$, $r = 0.02$.
Option: Type: Call, Strike: $K = 1$ (ATM), Maturity: $T = 3/12$.
Hedging: Hedge points: 60 (equidistant), Transaction costs: 0.
Hedge strategies: Standard delta hedging (analytical), deep hedging with MSE loss (ANN-MSE) and deep hedging with risk measure $\mathrm{CVaR}_{0.95}$ (ANN-CVaR).
ANN architecture (for each trading decision): Layers: 4, Units: 5, Input: $\tilde S_k \in \mathbb{R}$ at time $t_k$, Output: $\delta_k \in \mathbb{R}$ (holdings in $S$) from time $t_k$ to $t_{k+1}$.
ANN training: Training samples: $2^{18}$, Batch size: 1024, Epochs: 100 with learning rate reduction and early stopping.
Hedge test: Test samples: 50,000 (independent of training), Initial portfolio value: 0.06216 (the actual option value).

Performance measurement: To measure performance, we utilize the following measures:

Average absolute PnL: Calculated as the average of the absolute value of the PnLs from the 50,000 test samples. A low value is preferable as it indicates precise replication of the option.
PnL standard deviation: Calculated as the standard deviation of the 50,000 PnLs from the test samples. A low value is preferable as it indicates stability and more precise replication.
$\mathrm{CVaR}_{0.95}$: Calculated as the empirical $\mathrm{CVaR}_{0.95}$ based on the 50,000 PnLs from the test samples. A low value is preferable as it indicates lower downside tail risk.
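The three measures are simple sample statistics of the out-of-sample PnLs. A sketch (our own helper; the empirical CVaR is computed as the average of the worst 5% of PnLs, i.e. the tail of the loss $-\mathrm{PnL}$):

```python
import numpy as np

def hedge_metrics(pnls, alpha=0.95):
    """Average |PnL|, PnL standard deviation and empirical CVaR_alpha of -PnL."""
    pnls = np.asarray(pnls, dtype=float)
    losses = np.sort(-pnls)                       # losses, ascending
    n_tail = max(1, int(np.ceil((1 - alpha) * len(losses))))
    return {
        "avg_abs_pnl": np.abs(pnls).mean(),
        "pnl_std": pnls.std(ddof=1),
        "cvar": losses[-n_tail:].mean(),          # mean of the worst tail
    }

# Illustration on synthetic Gaussian PnLs; the thesis' PnLs come from hedging.
rng = np.random.default_rng(5)
m = hedge_metrics(rng.normal(0.0, 0.00665, size=50_000))
assert m["cvar"] > m["pnl_std"] > 0               # tail mean exceeds one sigma
```

Note that for a roughly Gaussian PnL, the empirical $\mathrm{CVaR}_{0.95}$ lands near twice the PnL standard deviation, which is consistent with the magnitudes reported in table 5.1.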
Metric          Strategy          Value (Standard Error)    % of option price
Option price                      0.06216
p_0             ANN-MSE           FIXED
p_0             ANN-CVaR_0.95     0.07560                   121.615%
Avg. abs. PnL   Analytical        0.00498 (0.000020)        8.017%
Avg. abs. PnL   ANN-MSE           0.00502 (0.000020)        8.072%
Avg. abs. PnL   ANN-CVaR_0.95     0.00589 (0.000021)        9.482%
PnL Std.        Analytical        0.00665                   10.695%
PnL Std.        ANN-MSE           0.00668                   10.745%
PnL Std.        ANN-CVaR_0.95     0.00758                   12.188%
CVaR_0.95       Analytical        0.01545                   24.847%
CVaR_0.95       ANN-MSE           0.01539                   24.753%
CVaR_0.95       ANN-CVaR_0.95     0.01347                   21.675%

Table 5.1: Results of the hedge experiment over 50,000 out-of-sample trials. The experiment involves hedging an ATM call option over 60 hedge points in a single-asset Black-Scholes model without transaction costs.
This experiment aims to showcase the performance of the deep hedging approach compared to standard delta hedging
and to illustrate the effect of minimizing tail risk over mean square error.
To ensure a fair performance measurement, we choose to fix
p0
to the actual option price for the ANN-MSE
model during training. We do this since we fix the initial portfolio value for all strategies to the actual option price
when testing, and the ANN-MSE model is particularly sensitive to its initial portfolio value. From section 3.2, we
know that optimizing the ANN-CVaR model does not depend on the initial portfolio value. When training the
ANN-CVaR model, we, therefore, set the initial portfolio value to zero and calculate
p0
using the indifference price
as in equation (5). However, to reiterate, when
testing
the three different strategies, all strategies start with an initial
portfolio value equal to the actual option price.
In this experiment, we do not focus on training times, as we are more concerned with the models’ capabilities.
However, for this particular experiment, we train a deep hedging model for approximately 15 minutes. This does, of
course, depend heavily on the number of hedge points, the number of training samples, the architecture, computational
resources (CPU vs GPU and available RAM), batch size, epochs, learning rate schedule and settings for early
stopping. We should also note that our experienced training times will be relatively slow since our implemented
framework contains many redundancies in its architecture from other experiments.
The results of the experiment can be seen in table 5.1 and figures 5.1, 5.2, 5.3 and 5.4. In figure 5.1, the PnL and portfolio values are illustrated for the 50,000 hedge trials and for all three approaches: analytical, ANN-MSE and ANN-CVaR. We notice that even the analytical delta hedging approach is not perfect, which is expected since the experiment uses only 60 hedge points. At first glance, the PnLs for the analytical approach and the ANN-MSE model look quite similar. In comparison, the PnLs for the ANN-CVaR model are slightly different, yet promising. A visual assessment of the PnLs therefore indicates that the ANN models have effectively learned how to hedge the call option.
Looking at table 5.1, we get a better sense of the quantitative performance of the different models. It is clear that the ANN-MSE model is very close to the analytical strategy. On average absolute PnL, PnL standard deviation and empirical CVaR alike, the ANN-MSE model's performance is only slightly worse than the analytical approach's. For example, the average absolute PnL is 0.00498 for the analytical approach and 0.00502 for the ANN-MSE model, a difference of only 0.8%.
Looking at the performance of the ANN-CVaR model, it is clear that the model is outperformed by the two
other approaches, based on the metrics average absolute PnL and PnL standard deviation. For example, the average
absolute PnL is 0.00589 for the ANN-CVaR model, which is 18.3% more than the analytical approach. However,
when we look at the empirical CVaR on the test set, it is evident that the ANN-CVaR model outperforms the two
other approaches. Using this metric, the CVaR of the analytical approach (0.01545) is 14.7% higher than that of the ANN-CVaR model (0.01347). It is not surprising that the ANN-CVaR model performs well when measured on CVaR, since the model was trained to be optimal under this metric. However, this shows that there exist alternative, learnable trading strategies that focus on reducing the tail risk of the PnL.

Figure 5.1: Portfolio values vs option payoff ((a), (c), (e)) and PnLs ((b), (d), (f)) for the analytical strategy, the ANN optimized with MSE and the ANN optimized with CVaR_0.95. Model: Black-Scholes model without transaction costs. Option: ATM call option.

Figure 5.2: Holdings in S across time for two different sample paths ((a) sample 1, (b) sample 2). Model: Black-Scholes model without transaction costs. Option: ATM call option.

Figure 5.3: Holdings in S at time t = 0.125. Model: Black-Scholes model without transaction costs. Option: ATM call option.

Figure 5.4: Out-of-sample PnL distribution, ANN models vs the analytical strategy ((a) ANN optimized with MSE, (b) ANN optimized with CVaR_0.95). Model: Black-Scholes model without transaction costs. Option: ATM call option.
In figure 5.2, we can see the holdings in the underlying risky asset over time for two test samples. In these two
samples, we observe that the analytical approach and the ANN-MSE model employ virtually the same trading strategy.
This is to be expected since their performance metrics were similar. However, we should note that delta hedging and
an MSE-optimal strategy will generally not produce the same trading strategy. See for example appendix A.1. Still,
they might be pretty close, as we have observed. We also observe that the ANN-CVaR model follows a different
trading strategy compared to the two other approaches. It seems like the ANN-CVaR model is more conservative
than the two other models. This is also partially confirmed in figure 5.3, which shows that the ANN-CVaR model
employs a flatter trading strategy compared to the option delta at time t= 0.125.
In figure 5.4, we see the empirical PnL distribution of the two ANN models compared to the analytical approach.
The empirical PnL distributions strengthen the conclusions from our previous analysis. We again see that the ANN-MSE model manages to imitate the PnL distribution of the analytical approach, whereas the PnL distribution of the ANN-CVaR model is skewed, resulting in less downside tail risk.
Going back to table 5.1, we observe that the indifference price p_0 for the ANN-CVaR model is 0.07560, which is 21.615% higher than the actual option price. This price reflects the fair price needed to obtain an empirical CVaR of 0. Remember that π(0) = 0. Note that this is not in conflict with proposition 3.3 since the option is not reachable with only 60 hedge points.
One might criticize the experiment because we have only trained one ANN model for each approach, one
for MSE and one for CVaR. However, we find that the ANN models are pretty stable with the current training
procedure, at least with only one tradable asset. We would also expect to see significantly higher PnL standard
deviations if the models were unstable. On that note, we have seen that the PnL standard deviation is comparable to
the analytical approach for both ANN models. Therefore, for most experiments, we will not worry about ANN stability (the exceptions are sections 5.2 and 7.1).
5.1.2 Training on wrong volatility and the Fundamental Theorem of Derivative Trading
We have showcased that the ANN models can solve a simple trading problem in the Black-Scholes model without
transactions costs. Before moving on to experiments with non-zero transaction costs, it would be interesting to
evaluate the stability of the learned trading strategies. In general, we (traders) cannot accept huge unexpected drops in hedging performance if the actual dynamics of S deviate from the ones used during training.
In our analysis, we consider deviations in volatility. This is a well-researched topic, and a well-known result is
the so-called Fundamental Theorem of Derivative Trading (see [10]). The theorem is discussed below:
Under both the P- and Q-measure, we assume that S follows the dynamics

dS(t)/S(t) = σ(t) dW(t)

where (σ(t)) is some random process, i.e. we assume no interest rate, no drift under P and no dividends. We assume to be selling an option with expiry T at implied volatility σ_H. We also choose to Black-Scholes delta hedge the option using hedge volatility σ_H. The Fundamental Theorem of Derivative Trading (FTDT) then states that our PnL at expiry is

PnL_T = (1/2) ∫_0^T Γ(t) S(t)² (σ_H² − σ²(t)) dt. (9)
If we assume that the option payoff is convex (i.e. positive gamma), then the FTDT shows that we make money when realized volatility σ(t) is low relative to σ_H and lose money when realized volatility is high. Note that the sign is flipped compared to the usual version of the theorem since we are selling the option. The FTDT also shows that hedging with the wrong delta has a bleeding effect on our PnL since equation (9) is not a stochastic integral. This is (somewhat) reassuring, but this result does not hold when hedging with our ANN models.
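The sign prediction of the FTDT is easy to verify numerically. The sketch below sells a call at implied volatility σ_H = 0.3 and delta hedges it with that same (wrong) volatility while the simulated paths realize σ = 0.2, with µ = r = 0 as in this section. With positive gamma and σ_H above the realized volatility, the seller should profit on average. The script is our own illustrative reconstruction, not the thesis code.

```python
import numpy as np
from statistics import NormalDist

ncdf = np.vectorize(NormalDist().cdf)  # vectorized standard normal cdf

def bs_call(S, K, tau, sigma):
    # Black-Scholes call price with r = 0, matching the section's setup
    d1 = (np.log(S / K) + 0.5 * sigma**2 * tau) / (sigma * np.sqrt(tau))
    return S * ncdf(d1) - K * ncdf(d1 - sigma * np.sqrt(tau))

def bs_delta(S, K, tau, sigma):
    return ncdf((np.log(S / K) + 0.5 * sigma**2 * tau) / (sigma * np.sqrt(tau)))

rng = np.random.default_rng(1)
S0, K, T, n_steps, n_paths = 1.0, 1.0, 0.25, 60, 2000
sigma_real, sigma_hedge = 0.2, 0.3   # realized vol below the hedge vol

# simulate GBM paths under the realized volatility (mu = r = 0)
dt = T / n_steps
Z = rng.standard_normal((n_paths, n_steps))
S = np.empty((n_paths, n_steps + 1))
S[:, 0] = S0
for k in range(n_steps):
    S[:, k + 1] = S[:, k] * np.exp(-0.5 * sigma_real**2 * dt
                                   + sigma_real * np.sqrt(dt) * Z[:, k])

# sell the call at implied vol sigma_hedge, then delta hedge with sigma_hedge
cash = np.full(n_paths, bs_call(S0, K, T, sigma_hedge))
delta = np.zeros(n_paths)
for k in range(n_steps):
    new_delta = bs_delta(S[:, k], K, T - k * dt, sigma_hedge)
    cash -= (new_delta - delta) * S[:, k]   # rebalance holdings
    delta = new_delta
pnl = cash + delta * S[:, -1] - np.maximum(S[:, -1] - K, 0.0)
```

With these parameters the average PnL comes out clearly positive, as equation (9) predicts for a seller hedging above the realized volatility.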
The idea of this section and experiment is to analyze the effect on PnL (in terms of average PnL, average absolute PnL and empirical CVaR) when the ANN models are trained with the wrong volatility. To see a potential pitfall for the ANN models, we can look at figure 5.3, which shows the trading strategies of the ANNs compared to the delta of the option at a fixed time point. We see that for extreme values of the current spot, the ANN models diverge or exhibit erratic behaviour. This is likely because the models have not seen many extreme examples during training. Therefore, we would expect to see abnormal behaviour in our experiment if the ANN models are not trained on enough samples and/or if the underlying asset's volatility is significantly different from the one used in training.
The experiment is as follows:
- We hedge a call option with strike K = 1 and maturity T = 3/12 over 60 equidistant trading days without transaction costs.
- We train two ANN models (one on MSE and one on CVaR, as in section 5.1.1) to hedge the call option on 2^18 training samples from a Black-Scholes model with S(0) = 1, µ = r = 0 and σ = 0.3.
- We run an out-of-sample hedging test on 10,000 samples where a similar Black-Scholes model governs S but with σ varying from 0.2 to 0.6. In this test, the initial portfolio value is chosen as the option price in the model used for training the ANNs (i.e. with σ = 0.3).
- As a benchmark, we also run the test with a Black-Scholes delta hedging strategy with volatility σ = 0.3, i.e. the same model as the one used for training the ANNs.
The results of the experiment can be seen in figure 5.5.
In figure 5.5 (a), we see the average PnL for the three strategies across different test volatilities. All models/strategies are seemingly identical on average PnL, and they all decrease linearly. In figure 5.5 (b), we have subtracted the average PnL of the analytical approach (delta hedging with σ_H = 0.3) to magnify the differences. We see from the scale that the differences are indeed small (on the order of 0.0001 compared to values of 0.02 to −0.06). However, it is noteworthy that the ANN-MSE model seems to diverge around σ = 0.4. We have previously seen that the hedging strategy produced by the ANN-MSE model is close to the analytical approach. Therefore, divergence in the average PnL signifies that the ANN-MSE model is struggling, likely because of extreme and unseen values of S.
In figure 5.5 (c), we see the average absolute PnL of the three strategies. As expected, the average absolute PnL is minimized for σ = σ_H = 0.3, which makes sense since the average PnL was linear and decreasing. Again, all three strategies are quite close, but the ANN-CVaR model lags behind when σ is close to σ = 0.3. This is confirmed by looking at the differences to the average absolute PnL of the analytical strategy in figure 5.5 (d). We postulate that this is because strategies become relatively more similar when viewed from a model with much higher or lower volatility. Another interesting observation is that the ANN-MSE model again diverges around σ = 0.4, which we still suspect is due to extreme values of S.
In figure 5.5 (e), we see the empirical CVaR of the PnLs, and in figure 5.5 (f), we see the differences to the analytical approach. The interesting observation is that the ANN-CVaR model gets relatively better than the other strategies when σ increases. However, when σ is low (compared to σ_H = 0.3), the ANN-CVaR model loses out to the analytical approach and the ANN-MSE model. This is quite concerning because it shows that a strategy that is CVaR optimal does not retain its (relative) CVaR optimality when the real volatility is lower than expected, even when compared to the MSE-optimal strategy.
5.1.3 0.5% transaction costs (Black-Scholes 1 asset)
In this experiment, we wish to analyze the effects of non-zero transaction costs. To do this, we repeat the previous
experiment but now add proportional transaction costs of 0.5%.
23
(a) Average PnL (b) Average PnL difference from analytical.
(c) Average absolute PnL (d) Average absolute PnL difference from analytical.
(low = good)
(e) Empirical CV aR0.95
(f) Empirical C V aR0.95 difference from analytical.
(low = good)
Figure 5.5: Average PnL, average absolute PnL and
CV aR0.95
across varying volatilities for ANN models
trained on
σ= 0.3
and Black-Scholes delta hedging with
σ= 0.3
(analytical). (a), (c) and (e) show average
PnL, average absolute PnL and
CV aR0.95
, respectively. (b), (d), (f) show the respective differences to that
of the analytical approach. No transaction costs.
24
Metric                   Strategy       Value (Standard Error)   % of option price
Option Price                            0.06216
Model p0                 ANN MSE        FIXED
                         ANN CVaR0.95   0.08852                  125.275%
Avg. abs. PnL            Analytical     0.01493 (3.73e-05)       24.022%
                         ANN MSE        0.01104 (3.57e-05)       17.764%
                         ANN CVaR0.95   0.01199 (3.02e-05)       19.284%
PnL Std.                 Analytical     0.00843                  13.561%
                         ANN MSE        0.00965                  15.522%
                         ANN CVaR0.95   0.00863                  13.888%
CVaR0.95                 Analytical     0.03672                  59.063%
                         ANN MSE        0.03183                  51.209%
                         ANN CVaR0.95   0.02641                  42.482%
Avg. turnover            Analytical     2.98644 (0.00402)
                         ANN MSE        1.94404 (0.00212)
                         ANN CVaR0.95   2.16006 (0.00229)
Avg. transaction costs   Analytical     0.01493 (2.00e-05)       24.014%
                         ANN MSE        0.00973 (1.06e-05)       15.647%
                         ANN CVaR0.95   0.01080 (1.14e-05)       17.375%

Table 5.2: Results of hedge experiment over 50,000 out-of-sample trials. The experiment involves hedging an ATM call option with 60 hedge points in a single-asset Black-Scholes model with 0.5% transaction costs.
Important change: To fully leverage the ANN models, we augment the input to the ANN models with the current holdings in the tradable asset. Hopefully, the ANN models will use this information to adjust their trading strategies to take current holdings and transaction costs into account.
We still consider the standard delta hedging strategy and the two ANN models, one using MSE and the other using CVaR. The two ANN models are again trained on 2^18 samples, and all three models are tested on 50,000 independent samples. For the test, all strategies are initialized with an initial portfolio value equal to the actual option price when not considering transaction costs. To further analyze the test results, we record the average accumulated transaction costs and average turnover. To calculate the turnover, we sum the absolute number of units traded at every trading opportunity, not including T, i.e.

Turnover_i = Σ_{k=0}^{N−1} |δ^i_k − δ^i_{k−1}|

is the turnover for asset i. The results of this experiment can be seen in table 5.2 and in figures 5.6 and 5.7.
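A minimal implementation of this turnover definition, assuming the position before the first trade is zero (so the initial purchase counts as traded units; that boundary convention is our assumption):

```python
import numpy as np

def turnover(deltas, initial_holding=0.0):
    """Sum of absolute units traded at trading dates t_0, ..., t_{N-1}.

    `deltas` holds the positions chosen at each trading date; the first
    trade is measured against `initial_holding` (assumed zero here).
    """
    trades = np.diff(np.asarray(deltas, dtype=float), prepend=initial_holding)
    return float(np.abs(trades).sum())
```

For example, the position path 0.5 → 0.7 → 0.4 trades 0.5 + 0.2 + 0.3 = 1.0 units in total.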
In table 5.2, we observe that both ANN models outperform the analytical delta hedging approach. On average absolute PnL, the ANN-MSE model is the best: it obtains 0.01104, which is 26.05% and 7.92% lower than the analytical approach and the ANN-CVaR model, respectively. However, the ANN-CVaR model is, once again, the best-performing model when considering empirical CVaR. The ANN-CVaR model obtains an empirical CVaR of 0.02641, which is 28.08% and 17.03% lower than the analytical approach and the ANN-MSE model, respectively. From table 5.2, we also observe that both ANN models have lower average turnover and average transaction costs, which shows that the ANN models trade less aggressively than the analytical approach and explains the increased performance. This is also what we observe in figure 5.6, where we see the holdings in the risky asset S over time for two samples. It is clear that both ANN models deploy less aggressive trading strategies to avoid unnecessarily large transaction costs. However, in table 5.2, we can also observe that both ANN models have a higher PnL standard deviation than the analytical approach.
Figure 5.6: Holdings in S across time for two different sample paths ((a) Sample 1, (b) Sample 2). Model: Black-Scholes model with 0.5% proportional transaction costs. Option: ATM call option.

Figure 5.7: Out-of-sample PnL distribution, ANN models vs. analytical strategy ((a) ANN optimized with MSE, (b) ANN optimized with CVaR0.95). Model: Black-Scholes model with 0.5% proportional transaction costs. Option: ATM call option.
In figure 5.7, we see the PnL distributions for the two ANN models compared to that of the analytical approach.
We observe that both ANN models have managed (through less aggressive trading) to shift the PnL distributions in
the favourable direction. Once again, the ANN-CVaR model has adopted a trading strategy that creates a skewed PnL
distribution, which decreases the downside tail risk.
In conclusion, it is clear that the ANN models can find trading strategies superior (on every metric) to classic delta hedging. To reiterate, the ANN models have no information about the model, the transaction costs or the option. This is a promising sign.
5.2 Black Scholes with multiple assets and no transaction costs
For this experiment, we step up the complexity of the model by introducing more tradable assets. The idea is to see if
our ANN model setup can trade multiple assets by only observing the total PnL of the entire portfolio.
In this experiment, we assume that we are selling an option portfolio of 10 put and call options on four different assets. We assume that the four assets come from a four-dimensional Black-Scholes model. The multidimensional Black-Scholes model with n assets has dynamics

dS(t)/S(t) = µ dt + σ ∘ C dW^P(t)

where µ, σ ∈ R^n are the vectors of individual drifts and volatilities, C ∈ R^{n×n} is a correlation matrix, W^P(t) ∈ R^n is a vector of n independent Brownian motions, and where "∘" denotes element-wise multiplication between vectors. We still assume that we can trade in a locally risk-free asset with constant interest rate, i.e. dB(t)/B(t) = r dt.
Since our portfolio consists of put and call options on individual assets, the price of the option portfolio is simply the weighted sum of individual option prices calculated using the Black-Scholes formula (see proposition 5.1) and put-call parity in their respective one-dimensional models. Similarly, the analytical delta hedging strategy is the collection of individual delta hedging strategies for all the options.
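Sample paths from this multidimensional model can be generated by correlating the Brownian increments through a Cholesky factor of the correlation matrix. The sketch below uses the parameters of the experiment listed next; the function name and vectorized layout are our own choices, not the thesis implementation.

```python
import numpy as np

def simulate_multi_bs(S0, mu, sigma, C, T, n_steps, n_paths, seed=0):
    """Simulate the multivariate Black-Scholes model on a time grid.
    Correlated Brownian increments are obtained from the Cholesky factor
    of the correlation matrix C; each asset then follows its own GBM."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(C)          # C = L @ L.T
    dt = T / n_steps
    n = len(S0)
    S = np.empty((n_paths, n_steps + 1, n))
    S[:, 0] = S0
    for k in range(n_steps):
        dW = np.sqrt(dt) * rng.standard_normal((n_paths, n)) @ L.T
        S[:, k + 1] = S[:, k] * np.exp((mu - 0.5 * sigma**2) * dt + sigma * dW)
    return S

# parameters of the experiment below
S0 = np.ones(4)
mu = np.array([0.03, 0.08, 0.04, 0.08])
sigma = np.array([0.25, 0.14, 0.09, 0.07])
C = np.array([[1.0, 0.673292, 0.69783, 0.732783],
              [0.673292, 1.0, 0.787874, 0.0763763],
              [0.69783, 0.787874, 1.0, 0.317681],
              [0.732783, 0.0763763, 0.317681, 1.0]])
paths = simulate_multi_bs(S0, mu, sigma, C, T=3/12, n_steps=60, n_paths=500)
```

The same generator (with 2^18 paths) would produce the training set for the deep hedging models.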
The setup for the experiment is:
- Model: Four-dimensional Black-Scholes.
- Model parameters: S(0) = (1, 1, 1, 1)^T, µ = (0.03, 0.08, 0.04, 0.08)^T, σ = (0.25, 0.14, 0.09, 0.07)^T, r = 0.02 and correlation matrix C as in equation (10).
- Option: Portfolio of 10 call and put options with maturity T = 3/12. See table 5.3.
- Hedging: Hedge points: 60 (equidistant), Transaction costs: 0.
- Hedge strategies: Standard delta hedging and deep hedging with risk measure CVaR0.95.
- ANN architecture (for each trading decision): Layers: 4, Units: 10, Input: S̃_k ∈ R^4 at time t_k, Output: δ_k ∈ R^4 (holdings in S) from time t_k to t_{k+1}.
- ANN training: Training samples: 2^18, Batch size: 128, Epochs: 100 with learning rate reduction and early stopping.
- Hedge test: Test samples: 50,000 (independent of training), Initial portfolio value: 0.50967 (actual option value).
The correlation matrix we utilize for this experiment is

C = ( 1          0.673292   0.69783    0.732783
      0.673292   1          0.787874   0.0763763
      0.69783    0.787874   1          0.317681
      0.732783   0.0763763  0.317681   1         ).  (10)
This experiment aims to showcase the performance of the deep hedging approach compared to standard delta hedging in the case of multiple tradable assets and multiple options. Even though there are no transaction costs, we expect/hope that the deep hedging model finds a trading strategy that utilizes the correlation between assets.
Type Underlying Strike Units
call 4 1.1 0.43
call 3 1.06 -0.77
put 1 0.98 -0.9
call 4 1.03 -0.54
call 4 1.04 -0.23
call 3 0.99 -0.79
put 1 0.78 1.78
call 2 0.81 0.64
call 4 0.8 1.9
put 1 0.97 1.69
Table 5.3: Option portfolio for the experiment in the multidimensional Black-Scholes model. Underlying refers to the index (starting at 1) of the risky asset. Units refers to the number of options in the portfolio, which we imagine ourselves to be selling.
Metric          Strategy       Value (Standard Error)                 % of option price
Option Price                   0.50967
Model p0        ANN CVaR0.95   0.51839
Avg. abs. PnL   Analytical     0.00360 (1.49e-05)                     0.707%
                ANN CVaR0.95   0.01564 (5.22e-05)                     3.069%
PnL std.        Analytical     0.00491                                0.963%
                ANN CVaR0.95   0.01318                                2.585%
CVaR0.95        Analytical     0.01162                                2.280%
                ANN CVaR0.95   0.00952                                1.868%
Avg. turnover   Analytical     (2.65478, 0.64926, 3.37903, 3.37182)
pr. asset       ANN CVaR0.95   (4.33312, 3.18385, 5.22738, 8.41255)

Table 5.4: Results of hedge experiment over 50,000 out-of-sample trials. The experiment involves hedging an option portfolio with 10 put and call options over 60 hedge points in a multidimensional Black-Scholes model without transaction costs.
The results of the experiment can be seen in table 5.4 and figure 5.8.
In figure 5.8, we see that the PnL distribution of the ANN-CVaR model is significantly different from that
of the analytical strategy. The PnL distribution of the ANN-CVaR model is shifted towards higher PnLs, implying that
the ANN model has learned a strategy that for a large portion of samples obtains a higher PnL. This is encouraging.
However, we also observe that the PnL distribution of the ANN-CVaR model has a much higher variance compared
to the analytical approach, which is undesirable.
In table 5.4, we observe that the ANN-CVaR model performs best on empirical CVaR, where it obtains 0.00952, which is 18.07% lower than the analytical approach. However, we can also see that the ANN-CVaR model performs significantly worse on average absolute PnL and PnL standard deviation than the analytical approach. On average absolute PnL, the ANN-CVaR model obtained 0.01564, which is 334.44% higher than the analytical approach. On PnL standard deviation, the ANN-CVaR model obtained 0.01318, which is 168.43% higher than the analytical approach. This suggests that the ANN-CVaR model has learned a significantly different hedging strategy from the analytical approach. This is confirmed when we look at the turnover, which is between 1.5 and 5 times higher for the ANN-CVaR model than for the analytical approach. This suggests that the ANN-CVaR model is unstable and exploits the correlation of the assets, and it might even suggest that the model is poorly trained.
Figure 5.8: Out-of-sample PnL distribution. ANN model using CVaR vs. analytical strategy. Model: four-dimensional Black-Scholes model without transaction costs. Option portfolio: see table 5.3.
It might seem odd to include an unstable and poorly trained model. However, we found that the ANN model was significantly harder to train in this experiment. When repeating the experiment, we have also found that some ANN models have turnovers up to 10 times that of the analytical approach. We have chosen to include these strange results to showcase the difficulties of utilizing deep hedging methods in a multi-asset scenario.
As mentioned, we suspect that the instability is a product of no transaction costs and the ANN model trying to
exploit the correlation matrix (that is close to being singular). We, therefore, test the stability of the ANN models
under shocks to the correlation matrix in the next section.
5.2.1 Training with the wrong correlation
Although the ANN-CVaR model outperformed the analytical delta hedging strategy, on empirical CVaR in the
previous experiment with a multidimensional Black-Scholes model, one might suspect that the ANN model relies too
heavily on the correlation matrix. This is a problem in practice where estimating correlation accurately is virtually
impossible.
This is the exact problem an investor faces when trying to utilize mean-variance analysis (Markowitz). The
problem is that the obtained optimal portfolio is empirically unstable when considering minor changes to the return
and/or covariance matrix since the covariance matrix is often close to being singular. We suspect that the trading
strategy learned by the ANN model might face similar difficulties due to the performance and exaggerated turnover
seen in the previous experiment (see table 5.4).
To test the stability of the ANN models when faced with changes in the correlation, we repeat the previous
experiment with different shocks to the correlation matrix. The way we choose to change the correlation matrix is
simple: we add a normally distributed number with a predetermined variance to each off-diagonal entry of the correlation matrix. The method we employ for doing this is:
- Denote the new correlation matrix C̄. Then for all i, j with i ≠ j and i < j,
  C̄_ij = C̄_ji = C_ij + k · Z_ij,
  where the Z_ij are independent N(0, 1) and k is the scale of the shock.
- If any C̄_ij is outside (−1, 1), then we set it to −1 or 1, respectively.
Strategy             Scale of shock, k
                     0        0.05     0.1      0.15
Analytical           0.01157  0.01156  0.01159  0.01159
ANN model η = 1      0.00949  0.01016  0.02266  0.04588
ANN model η = 0      0.01284  0.01283  0.01283  0.01288
ANN model η = 0.75   0.01045  0.01043  0.01061  0.01076

Table 5.5: Average empirical CVaR0.95 over 500 hedge experiments for each shock k, each including 2,000 samples. The standard errors range from 0.00002 to 0.00005, except for the ANN model with η = 1, where the standard errors range from 0.00004 to 0.005. The experiment involves hedging an option portfolio of puts and calls in a four-dimensional Black-Scholes model without transaction costs, but the correlation matrix is not the same as during training.
Most likely C̄ will not be positive definite, so we employ an algorithm from [11]¹ to find the closest positive definite matrix to C̄. This yields positive definite correlation matrices that are slightly different from the original correlation matrix C. The size of the shock is easily controlled by changing k.
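The shock-and-repair procedure can be sketched as follows. Note that, instead of the algorithm from [11], we use simple eigenvalue clipping followed by diagonal rescaling as a stand-in projection; this substitution is our own and only approximates the nearest correlation matrix.

```python
import numpy as np

def shock_correlation(C, k, seed=0):
    """Shock the off-diagonal entries with i.i.d. N(0, k^2) noise (applied
    symmetrically), clip entries to (-1, 1), then repair: clip negative
    eigenvalues to (near) zero and rescale the diagonal back to one."""
    rng = np.random.default_rng(seed)
    n = C.shape[0]
    noise = k * np.triu(rng.standard_normal((n, n)), 1)  # strictly upper part
    Cbar = np.clip(C + noise + noise.T, -1.0, 1.0)
    np.fill_diagonal(Cbar, 1.0)
    w, V = np.linalg.eigh(Cbar)                  # symmetric eigendecomposition
    Cpsd = (V * np.maximum(w, 1e-8)) @ V.T       # clip negative eigenvalues
    d = np.sqrt(np.diag(Cpsd))
    return Cpsd / np.outer(d, d)                 # restore unit diagonal

C = np.array([[1.0, 0.673292, 0.69783, 0.732783],
              [0.673292, 1.0, 0.787874, 0.0763763],
              [0.69783, 0.787874, 1.0, 0.317681],
              [0.732783, 0.0763763, 0.317681, 1.0]])
C_shocked = shock_correlation(C, k=0.15, seed=3)
```

The diagonal rescaling is a congruence transform, so it preserves positive semi-definiteness of the clipped matrix.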
Before we move on to the experiment, we wish to add another trading strategy, inspired by Robust Portfolio Optimization (see [12]). The idea (in our case) is to solve the problem in a way in which the trading strategy is not as dependent on the correlation matrix C. We do this by training our ANN models on samples coming from a model with correlation

C_RO(η) = (1 − η) I_n + η C

where I_n is the n-dimensional identity matrix. When η = 1, we have C_RO(1) = C, which is the correlation matrix in our original problem. If η = 0, then C_RO = I_n is the identity matrix, implying that all underlying assets will be independent. Note that no matter the choice of η, the marginal distribution of each tradable asset is the same, since all we change is the correlation between the assets.
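The blend is a one-liner; since C_RO(η) is a convex combination of two positive definite correlation matrices for η ∈ [0, 1], it is automatically a valid correlation matrix with unit diagonal:

```python
import numpy as np

def robust_correlation(C, eta):
    """C_RO(eta) = (1 - eta) * I_n + eta * C. For eta in [0, 1] this is a
    convex combination of positive definite correlation matrices, hence
    again a valid correlation matrix (off-diagonals scaled by eta)."""
    return (1.0 - eta) * np.eye(C.shape[0]) + eta * C
```

For example, a pairwise correlation of 0.8 is dampened to 0.6 at η = 0.75.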
Our idea is to include two extra ANN models: one trained on η = 0 (independent assets) and one trained on η = 0.75 (assets that have a similar dependence structure but scaled down).
Our experiment can now be described in the following way:
- Train the relevant ANN models on 2^18 samples generated with correlation matrix C from equation (10).
- For each shock size k in (0, 0.05, 0.1, 0.15), run 500 hedge experiments with a new positive definite correlation matrix C̄ (found using the above method). Each of the 500 hedge experiments for each k includes 2,000 sample paths. The result is an empirical average CVaR for each of the four shock sizes and for each model/strategy.
The results can be seen in table 5.5, where we have listed the average empirical CVaR for the four different models for each of the four shock sizes k. We observe that all models except the ANN model with η = 1 (i.e. the one trained on the true correlation) are remarkably stable under shocks to the correlation matrix. This is to be expected of the analytical approach and the ANN model with η = 0: the analytical approach is a collection of delta hedging strategies for the individual options, which do not depend on correlation, and the ANN model with η = 0 is trained on samples with no correlation between the different assets. However, it is surprising that the ANN model with η = 0.75 (dampened correlation) is stable under shocks to the correlation matrix when the ANN model with η = 1 is quite unstable. The ANN model with η = 1 obtains an average CVaR twice that of the other models when k = 0.1 and does even worse for k = 0.15. This clearly shows that the ANN model with η = 1 has learned a trading strategy that relies too heavily on the correlation between the assets, as suspected in the previous section.
¹The implementation is borrowed from: https://stackoverflow.com/questions/10939213/how-can-i-calculate-the-nearest-positive-semi-definite-matrix
Strategy             Scale of shock, k
                     0        0.05     0.1      0.15
Analytical           0.06844  0.06844  0.06855  0.06862
ANN model η = 1      0.04289  0.04312  0.04433  0.04519
ANN model η = 0      0.04929  0.04925  0.04925  0.04921
ANN model η = 0.75   0.04517  0.04517  0.04524  0.04530

Table 5.6: Same experiment as table 5.5, but with 0.5% transaction costs. Standard deviations range from 0.00003 to 0.0002.
In table 5.5, we see that the ANN model with η = 0 (independent samples) obtains a higher empirical CVaR than the analytical approach. In theory, and with proper training (and enough samples), all ANN models should beat the analytical approach on CVaR. The fact that the ANN model with η = 0 fails to do so shows that training the ANN models is difficult, especially with the increased complexity of multiple assets. However, it is interesting that the ANN model with η = 0.75 (dampened correlation) performs better than the analytical approach and is stable under correlation shocks. As in Robust Portfolio Optimization, dampening the correlation has a stabilizing effect on the resulting strategy. This is very encouraging since it shows that it is possible to find effective trading strategies that do not suffer from a high degree of instability in the utilized correlation matrix.
One might suspect that the instability would disappear if we introduced transaction costs since they would
discourage high trading volumes, which was a significant characteristic of the unstable ANN trading strategy. We,
therefore, perform the same experiment but with
0.5%
transaction costs. The results of which can be seen in table
5.6. In table 5.6, we observe that all strategies are quite stable under shocks to the correlation. This is as suspected
since transaction costs have a regularizing effect on trading volumes, which counteracts the benefits of exploiting the
correlation matrix.
To round off, one should not read too much into the exact level of stability/instability of the deep hedging models, since that may depend heavily on the initial correlation matrix and the specific model structure/design and training. However, one can conclude that stability is not a given with these deep hedging models, and that there are natural ways of discouraging exploitation of the correlation between assets.
5.3 Black Scholes model with path-dependent options
For this experiment, we wish to hedge a path-dependent option. We do this to test the performance of the ANN
model under increased complexity and because hedging a path-dependent option introduces interesting architectural
decisions. To simplify the problem, we choose to hedge a down-and-out call option in a one-dimensional Black-Scholes
model. The down-and-out call option is among the more straightforward barrier options, with payoff

(S(T) − K)⁺ · 1{min_{u≤T} S(u) > L}

where K is the strike and L is the knock-out barrier. The down-and-out call also has a simple analytical price when working in a Black-Scholes model, which can be seen below.
Proposition 5.3 (Prop. 18.17 in [9]). The price of a European down-and-out call option at time t (issued at time 0) with maturity T, strike K and barrier L < K is given by

C_LO(S(t), t) = C(S(t), t) − (L/S(t))^{2r̃/σ²} C(L²/S(t), t)   if min_{u≤t} S(u) > L,
C_LO(S(t), t) = 0                                              if min_{u≤t} S(u) ≤ L,

where C is the price of a regular European call option with strike K (see proposition 5.1) and r̃ = r − ½σ².
This makes it easy to compare the hedging strategies learned by the ANN models with an analytical strategy. The
analytical strategy will utilize the delta hedging approach, which can hedge the option arbitrarily well, given enough
hedge points. The delta of the barrier option can be found analytically, but for convenience, we utilize algorithmic
adjoint differentiation (see [13, Chapter 9]), which is easily implemented using Tensorflow in Python.
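A cheap sanity check for any algorithmically computed delta (via AAD or otherwise) is bump-and-revalue: a central finite difference of the pricer should agree with the analytical delta to high accuracy. The sketch below checks this for a vanilla Black-Scholes call; it is our own illustration, not the thesis code.

```python
from math import exp, log, sqrt
from statistics import NormalDist

N = NormalDist().cdf  # standard normal cdf

def bs_call(S, K, T, r, sigma):
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    return S * N(d1) - K * exp(-r * T) * N(d1 - sigma * sqrt(T))

def bump_delta(price, S, h=1e-4):
    # central-difference "bump and revalue" delta of an arbitrary pricer
    return (price(S + h) - price(S - h)) / (2.0 * h)

S, K, T, r, sigma = 1.0, 1.0, 0.25, 0.02, 0.3
fd_delta = bump_delta(lambda s: bs_call(s, K, T, r, sigma), S)
d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
analytic_delta = N(d1)
```

The same check can be run against the barrier pricer of proposition 5.3 away from the barrier.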
To reduce complexity even further, we choose only to calculate the minimum of
S
over the discrete set of time
points where trading is allowed. The actual price of the option should, therefore, be higher than the price found in
proposition 5.3, which assumes continuous-time monitoring of the barrier. However, since prices only affect the
initial portfolio value and CVaR is cash invariant, we are not overly concerned with the discrepancy between discrete
and continuous time.
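The sketch below implements the pricing formula of proposition 5.3 (as we read it) with the parameters used later in this section and compares it against a Monte Carlo price in which the barrier is monitored only at the 20 trading dates. Consistent with the remark above, discrete monitoring knocks out less often and should give a slightly higher price. This is our own illustrative reconstruction, not the thesis code.

```python
import numpy as np
from math import exp, log, sqrt
from statistics import NormalDist

N = NormalDist().cdf

def bs_call(S, K, T, r, sigma):
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    return S * N(d1) - K * exp(-r * T) * N(d1 - sigma * sqrt(T))

def down_and_out_call(S, K, L, T, r, sigma):
    """Proposition 5.3 (continuous monitoring), valid for L < K and S > L."""
    r_tilde = r - 0.5 * sigma**2
    return bs_call(S, K, T, r, sigma) \
        - (L / S) ** (2 * r_tilde / sigma**2) * bs_call(L**2 / S, K, T, r, sigma)

S0, K, L, T, r, sigma = 1.0, 1.0, 0.95, 1 / 12, 0.02, 0.3
p_cont = down_and_out_call(S0, K, L, T, r, sigma)

# Monte Carlo price with the barrier monitored only at the 20 trading dates
rng = np.random.default_rng(0)
n_steps, n_paths = 20, 50_000
dt = T / n_steps
Z = rng.standard_normal((n_paths, n_steps))
logS = np.cumsum((r - 0.5 * sigma**2) * dt + sigma * sqrt(dt) * Z, axis=1)
S = S0 * np.exp(logS)
alive = S.min(axis=1) > L                     # barrier never breached
p_disc = exp(-r * T) * (np.maximum(S[:, -1] - K, 0.0) * alive).mean()
```

With these parameters the continuous-monitoring price lands close to the 0.0295 initial portfolio value quoted below, and the discretely monitored Monte Carlo price comes out above it.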
Suppose we wish to hedge a down-and-out call option optimally. Then, at each trading decision, we need to know the current value of the underlying asset S and whether or not the underlying asset has previously crossed the barrier, rendering the option worthless. Previously, we have only provided the ANN models with information on the current value of the underlying asset S. We therefore need to decide which extra input to give the ANN models. The additional information should provide knowledge of the state of the option (regarding the barrier). There are three ways of handling this issue that we wish to investigate.
The first is to provide the actual minimum of the underlying S as a separate input to the model. This is a sensible idea, but it puts a burden on the programmer/trader (read "us") to keep track of more information.
The second is to provide the ANN model only with the current value of the underlying asset S but allow the model to send information to its future self for the next trading decision. We say that the ANN model has memory. The idea is that the ANN model should itself be capable of keeping track of the minimum of the underlying asset S, and hence of whether the barrier has been crossed.
The third is not to provide the ANN model with any information other than the current value of the underlying asset. This strategy will not be able to trade optimally, but we include it as a benchmark for the other two methods.
The three different ANN models will be referred to as ANN w. min. info, ANN w. memory and ANN raw, respectively. Note that when we say min. info, we mean information about the minimum of S and not (virtually) no information.
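The memory mechanism can be sketched structurally as follows: at each trading date, a small feedforward network receives the current spot together with the memory vector m_{k−1} produced at the previous date, and outputs both the holding δ_k and the next memory vector m_k. The network below is untrained and purely illustrative; the weights, sizes and function names are our own assumptions, not the thesis architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(n_in, n_hidden, n_out):
    return {"W1": rng.normal(0.0, 0.5, (n_in, n_hidden)), "b1": np.zeros(n_hidden),
            "W2": rng.normal(0.0, 0.5, (n_hidden, n_out)), "b2": np.zeros(n_out)}

def mlp_step(x, params):
    # one small feedforward network applied at a single trading date
    h = np.tanh(x @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]

def rollout_with_memory(S_path, params, mem_dim=2):
    """Structural sketch: at each step the net sees (S_k, m_{k-1}) and
    outputs (delta_k, m_k); m is forwarded to the next trading decision."""
    m = np.zeros(mem_dim)
    deltas = []
    for S_k in S_path[:-1]:                    # no trade at maturity
        out = mlp_step(np.concatenate(([S_k], m)), params)
        deltas.append(out[0])                  # holding in S until next step
        m = out[1:]                            # memory for the next step
    return np.array(deltas)

params = init_params(n_in=3, n_hidden=6, n_out=3)   # (S_k, m) -> (delta, m)
deltas = rollout_with_memory(np.linspace(1.0, 1.1, 21), params)
```

In training, the whole rollout would sit inside the loss so that gradients flow backwards through the memory channel.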
The setup for the experiment is:
- Model: One-dimensional Black-Scholes.
- Model parameters: S(0) = 1, µ = 0.05, σ = 0.3, r = 0.02.
- Option: Type: Down-and-out call, Strike: K = 1 (ATM), Barrier: L = 0.95, Maturity: T = 1/12.
- Hedging: Hedge points: 20 (equidistant), Transaction costs: 0.
- Hedge strategies: Standard delta hedging and three ANN-CVaR models. The differences among the ANN models lie in their architecture and available information (see below).
- ANN architecture (for each trading decision): Layers: 4, Units: 6, Common input: S̃_k ∈ R at time t_k, Common output: δ_k ∈ R (holdings in S) from time t_k to t_{k+1}.
  - ANN model w. min. info: Additional input at time t_k: min_{i≤k} S(t_i).
  - ANN model w. memory: Additional output at time t_k: 2-dimensional vector m_k ∈ R², and additional input at time t_k: the 2-dimensional vector m_{k−1} ∈ R² from the previous trading decision.
  - ANN model raw: No additional input or output.
- ANN training: Training samples: 2^18, Batch size: 1024, Epochs: 100 with learning rate reduction and early stopping.
- Hedge test: Test samples: 50,000 (independent of training), Initial portfolio value: 0.029523 (true continuous-time option value).
This experiment aims to see if the ANN models can learn how to optimally hedge a path-dependent option, and to compare different methods of handling the extra information that hedging a path-dependent option requires. The results of the experiment can be seen in table 5.7 and figures 5.9 and 5.10.
Metric          Strategy                    Value (Standard Error)   % of option price
Option Price                                0.02953
Model p0        ANN CVaR0.95 w. min. info   0.04392
                ANN CVaR0.95 w. memory      0.04528
                ANN CVaR0.95 raw            0.04737
Avg. abs. PnL   Analytical                  0.00496 (0.00430)        16.793%
                ANN CVaR0.95 w. min. info   0.00603 (0.00639)        20.412%
                ANN CVaR0.95 w. memory      0.00712 (0.00928)        24.101%
                ANN CVaR0.95 raw            0.00902 (0.01194)        30.550%
PnL std.        Analytical                  0.00628                  20.754%
                ANN CVaR0.95 w. min. info   0.00900                  29.743%
                ANN CVaR0.95 w. memory      0.01155                  38.170%
                ANN CVaR0.95 raw            0.01477                  48.811%
CVaR0.95        Analytical                  0.01646                  55.749%
                ANN CVaR0.95 w. min. info   0.01440                  48.767%
                ANN CVaR0.95 w. memory      0.01579                  53.483%
                ANN CVaR0.95 raw            0.01780                  60.292%

Table 5.7: Results of a hedge experiment over 50,000 out-of-sample trials. The experiment involves hedging a down-and-out call option with 20 hedge points in a single-asset Black-Scholes model without transaction costs.
Figure 5.9: Two samples of holdings for different trading strategies ((a) Sample 1, (b) Sample 2). The goal is to hedge a down-and-out call option in a one-dimensional Black-Scholes model without transaction costs.
(a) Analytical (b) ANN model with min-info (c) ANN model with memory (d) ANN model without min-info or memory
Figure 5.10: PnLs over the terminal value of the underlying asset for four different hedging strategies. The goal is to hedge a down-and-out call option in a one-dimensional Black-Scholes model without transaction costs.
In table 5.7, we see that the ANN model with min. info performed the best. It obtained an empirical CVaR of 0.01440, which was 12.45% lower than the analytical approach and 8.80% and 19.10% lower than that of the ANN model with memory and the raw ANN model, respectively. This is expected, as the ANN model with min. info has the best available information. We should also note that the ANN model with memory performed better than the analytical approach and the raw ANN model. This shows that the model was partially successful in utilizing its memory to more accurately hedge the down-and-out call option. We say partially, since it is advantageous for the ANN model to receive the minimum of S as input instead of keeping track of it itself.
To illustrate the performance of the ANNs, we have chosen two test samples where the price of the underlying asset hit the barrier, which can be seen in figure 5.9 (a) and (b). From these two samples, we observe that the raw ANN model does not capture the crossing of the barrier, which makes sense since it has no information about the minimum of S and no memory. We also see that the ANN model with min. info and the ANN model with memory can capture the crossing of the barrier. However, the ANN model with memory seems to find this more challenging, as shown in figure 5.9 (b). It is also worth noting that the ANN models find it difficult to stop trading altogether. This may seem counter-intuitive, as taking no action would seem like an easy action to learn (for a human trader, at least). However, this is not the case for ANNs, which struggle to learn not to take action.
Lastly, we observe from table 5.7 that the average absolute PnL and the standard deviation of the PnL are significantly higher for the ANN models than for the analytical approach. The explanation for this can partly be seen in figure 5.10 (a)-(d). Here we see the PnLs of the four different methods across terminal values of S. It is interesting to see that the ANN models produced large positive PnLs for some test samples. These likely come from paths where the option was knocked out, but the portfolio still had positive value. This may seem odd. However, we must remember that the ANN models only care about downside risk (CVaR) and do not care about large hedge errors that yield a positive PnL.
All in all, we find that the deep hedging models are capable of hedging down-and-out call options. However, we find it advantageous to provide the ANN models with helpful but redundant information (in this case, about the minimum of S) instead of expecting the ANN models to keep track of it themselves.
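As an illustration of how such a redundant input can be precomputed, the running minimum min_{i≤k} S(t_i) for a whole batch of simulated paths reduces to a cumulative minimum along the time axis. This is a sketch in numpy (our own code and naming, not the thesis implementation):

```python
import numpy as np

def running_min_feature(paths):
    """Running minimum min_{i<=k} S(t_i) along the time axis.

    paths: array of shape (n_samples, n_steps + 1) of price paths.
    Returns an array of the same shape; column k holds the minimum of
    S(t_0), ..., S(t_k), which is the extra input fed to the
    "min. info" network at trading decision k.
    """
    return np.minimum.accumulate(paths, axis=1)

paths = np.array([[1.00, 0.95, 0.97, 0.90, 1.10]])
print(running_min_feature(paths))  # [[1.   0.95 0.95 0.9  0.9 ]]
```

Precomputing the feature once per batch keeps the per-decision networks small, since they never have to learn the minimum from the raw path.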
5.4 Heston model with one tradable asset (incomplete and non-Markovian model)
For this final deep hedging experiment, we wish to hedge a call option with maturity T = 3/12 using 60 hedge points, in a model with increased complexity. We do this by assuming that a Heston model governs the underlying asset. The model consists of the pair (S(t), ν(t)) (the price of the underlying and a variance process), which has dynamics

dS(t)/S(t) = μ dt + √(ν(t)) dW₁^P(t)
dν(t) = κ(θ − ν(t)) dt + σ √(ν(t)) dW₂^P(t)

where W₁ and W₂ are Brownian motions with correlation ρ. For this experiment, we use the parameters: S(0) = 1, drift μ = 0.05, ν(0) = 0.1, mean reversion κ = 5, long-term variance θ = 0.1, vol-of-vol σ = 1 and correlation ρ = −0.9.
In this model, we cannot simulate paths from (S, ν) without error. Hence, we have to consider which discretization scheme we wish to employ. In this thesis, we use the full truncation scheme (introduced in [14])

S(t + Δt) = S(t) exp((μ − ν(t)⁺/2) Δt + √(ν(t)⁺ Δt) Z₁)
ν(t + Δt) = ν(t) + κ(θ − ν(t)⁺) Δt + σ √(ν(t)⁺ Δt) (ρ Z₁ + √(1 − ρ²) Z₂)

where Z₁ and Z₂ are iid standard normal random variables and (·)⁺ = max(0, ·). We utilize this scheme as it is found to introduce relatively little discretization bias. For simplicity, we choose to perform daily discretization, which corresponds to one discretization step between each hedge point. This implies that Δt = (12 · 20)⁻¹.
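The full truncation scheme with the stated parameters can be sketched as follows (a minimal numpy implementation of our own; function and variable names are assumptions, not the thesis code):

```python
import numpy as np

def simulate_heston_full_truncation(S0=1.0, v0=0.1, mu=0.05, kappa=5.0,
                                    theta=0.1, sigma=1.0, rho=-0.9,
                                    T=3 / 12, n_steps=60, n_paths=10_000,
                                    seed=0):
    """Full truncation Euler scheme for the Heston model: the variance is
    floored at zero, v+ = max(v, 0), wherever it enters a drift or a
    diffusion term, so the variance process itself may go negative but
    never feeds a negative value into a square root."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    S = np.full(n_paths, S0)
    v = np.full(n_paths, v0)
    paths = np.empty((n_paths, n_steps + 1))
    paths[:, 0] = S0
    for k in range(n_steps):
        Z1 = rng.standard_normal(n_paths)
        Z2 = rng.standard_normal(n_paths)
        vp = np.maximum(v, 0.0)                      # v(t)^+
        S = S * np.exp((mu - vp / 2) * dt + np.sqrt(vp * dt) * Z1)
        v = v + kappa * (theta - vp) * dt \
              + sigma * np.sqrt(vp * dt) * (rho * Z1
                                            + np.sqrt(1 - rho ** 2) * Z2)
        paths[:, k + 1] = S
    return paths

paths = simulate_heston_full_truncation(n_paths=1_000)
```

Because the asset update exponentiates a real number, the simulated prices stay strictly positive, while the variance is only truncated where it enters the dynamics.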
As the Heston model is incomplete, it is essential to consider how we hedge the option. For this experiment, we
assume that we can only utilize the underlying asset to hedge the option. This is, of course, not enough to hedge the
option perfectly (even in continuous time) as the model is incomplete. Still, it poses an interesting challenge for
our deep hedging model. Another crucial consideration is the information available at each trading decision. We
suspect that for optimal hedging, the deep hedging model needs to know the instantaneous variance at each trading
decision. To test this, we wish to perform this experiment with and without this extra information. Note that the
model becomes non-Markovian when we do not know the instantaneous variance.
As in the previous experiments, we wish to compare the deep hedging models to the ordinary delta hedging approach. However, in the Heston model there exists no unique price and delta for the option without further assumptions. For simplicity, we assume a zero market price of volatility risk, implying that the dynamics of (S, ν) under the Q-measure are

dS(t)/S(t) = r dt + √(ν(t)) dW₁^Q(t)
dν(t) = κ(θ − ν(t)) dt + σ √(ν(t)) dW₂^Q(t)

where again W₁^Q and W₂^Q are Brownian motions with correlation ρ. With this assumption, we can compute a unique price and delta for the call option. Note that one could perform an entire analysis based on the choice of Q-measure. However, for simplicity, we choose to ignore this aspect.
In practice, we utilize the simple and relatively stable representation of the price, also referred to as the Lipton-Lewis representation (see [15]). This formulation of the price also lends itself to an easy and efficient computation of the option delta. For completeness, we have included the option price and delta in appendix A.2.
We are now ready to perform the experiment. For completeness, the experiment setup is:
Model: Heston model with one asset.
Model parameters: S(0) = 1, μ = 0.05, ν(0) = 0.1, κ = 5, θ = 0.1, σ = 1, ρ = −0.9 and r = 0.02.
Option: Type: Call, Strike: K = 1 (ATM), Maturity: T = 3/12.
Hedging: Hedge-points: 60 (equidistant), Transaction costs: 0
Hedge strategies: Standard delta hedging and two ANN-CVaR models. The difference between the ANN models lies in their architecture and available information (see below).
ANN architecture (for each trading decision): Layers: 4, Units: 6, Common input: S̃_k ∈ ℝ at time t_k, Common output: δ_k ∈ ℝ (holdings in S) from time t_k to t_{k+1}.
ANN model raw: No additional input or output.
ANN model w. ν: Additional input at time t_k: ν_k ∈ ℝ (current level of the variance process). No additional output.
ANN training: Training samples: 2^18, Batch size: 1,024, Epochs: 100 with learning rate reduction and early stopping.
Hedge-test: Test samples: 50,000 (independent of training), Initial portfolio value: 0.05868 (actual option value with zero market price of volatility risk)
Strategy                                   Value (Standard Error)   % of option price
Option Price                               0.05868
Model p0:
  ANN CVaR0.95 raw                         0.08366
  ANN CVaR0.95 w. ν                        0.08320
Avg. abs. PnL:
  Analytical                               0.01708 (3.24e-05)       29.099%
  ANN CVaR0.95 raw                         0.01013 (2.40e-05)       17.260%
  ANN CVaR0.95 w. ν                        0.00969 (2.30e-05)       16.504%
PnL std.:
  Analytical                               0.01991                  33.928%
  ANN CVaR0.95 raw                         0.01257                  21.420%
  ANN CVaR0.95 w. ν                        0.01200                  20.449%
CVaR0.95:
  Analytical                               0.04158                  70.854%
  ANN CVaR0.95 raw                         0.02498                  42.565%
  ANN CVaR0.95 w. ν                        0.02470                  42.095%
Table 5.8: Results of a hedge experiment over 50,000 out-of-sample trials. The experiment involves hedging an ATM call option with 60 hedge points in a single asset Heston model without transaction costs.
(a) Sample 1 (b) Sample 2
Figure 5.11: Two samples of holdings for different trading strategies. The goal is to hedge a call option in a one-dimensional Heston model without transaction costs.
This experiment aims to show the ability of the deep hedging model to hedge a simple call option in an incomplete
model and to analyze the importance of knowing the exact instantaneous variance at each trading decision. The
results of the experiment can be seen in table 5.8 and figure 5.11 (a) and (b).
In table 5.8, we observe that the two ANN models (one without extra information and one with ν) performed very similarly. Both ANN models significantly outperformed the delta hedging approach. However, we notice that the ANN model with ν performed slightly better (with a CVaR of 0.02470) than the raw ANN model (with a CVaR of 0.02498). This suggests that the ANN model can learn a slightly superior hedging strategy when it also knows the exact current level of volatility. This is also evident in figure 5.11, where we see the holdings in S across time for two test samples. Here we observe that the two ANN models choose slightly different trading strategies, more different than we would expect from two strategies learned by ANN models with identical architecture.
Overall, we are not surprised that the ANN models outperform the delta hedging approach, even in the Heston model. However, it might be surprising that there was only a tiny gap in performance between the two ANN models. This could, of course, vary with the parameters and the option type. Still, it suggests that the ANN model can learn a close-to-optimal hedging strategy even without knowing the exact level of volatility (in certain situations).
5.5 Subconclusion
We have seen that deep hedging models are indeed capable of learning optimal trading strategies just by observing a large number of simulated paths from the underlying asset under the P-measure. This was even the case when including transaction costs, multiple assets, path-dependent options and stochastic volatility. However, even when avoiding Q-measures and utilizing deep learning, we still had to make many critical choices regarding the information available to the ANN models, overfitting and more:
- We saw that the naive implementation could lead to unstable trading strategies in multi-asset models. Here we saw that training on paths with less correlation had a stabilizing effect.
- When hedging path-dependent options, we observed that the architecture of the ANNs and the available information had a significant impact on performance. It was (not surprisingly) clear that providing the deep hedging models with more relevant and processed information greatly improved their performance.
- In the Heston model, we also saw that information on the variance process did improve performance slightly. We did, however, argue that this may depend on the model parameters and the option.
Overall, it is possible to learn complex trading strategies from simulated paths alone. However, it is clear that no single hedging model/setup works in all situations. Knowledge of the options and underlying assets is still crucial for maximizing results (as in classical mathematical finance).
6 Market Generators
An issue with the deep hedging approach for learning optimal trading strategies is that the training procedure requires many paths under the P-measure to converge properly. This implies that historical data cannot be used directly to train the model. Accurate time-series simulation is, therefore, a necessary tool for practical implementations of deep hedging models. Up until now, we have assumed that the underlying assets follow the dynamics of a well-known model such as Black-Scholes or Heston. Fitting actual time-series data to these models can, however, be difficult due to model inflexibilities.
In this section, we wish to explore how we can utilize ANNs (specifically variational autoencoders) to create model-free market generators. Our goal is to train a model to produce new paths (of some length n) based on a single historical path. The method that we employ in this thesis is inspired by the methods presented by Buehler et al. in [2].
Imagine that we have N + 1 equidistant observations of the price of an asset, S₀, …, S_N, with corresponding rates of return r₁, …, r_N where, of course, rᵢ = Sᵢ/Sᵢ₋₁ − 1 (which we for simplicity refer to as returns). We can split this time series into M = ⌊N/n⌋ return-paths of length n

r₁⁽¹⁾, …, r_n⁽¹⁾, …, r₁⁽ᴹ⁾, …, r_n⁽ᴹ⁾.

We can now reformulate our goal as creating a model capable of producing synthetic return-paths (r̃₁, …, r̃_n) with a similar distribution as our observed return-paths. These synthetic paths could then hopefully be used to train a deep hedging model. We should note that we do not assume stationarity of the underlying price process, implying that the return-paths r₁⁽ⁱ⁾, …, r_n⁽ⁱ⁾ and r₁⁽ʲ⁾, …, r_n⁽ʲ⁾ need not be identically distributed; their distributions might be influenced by some market state.
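The splitting into return-paths can be sketched as follows (our own numpy sketch; names are assumptions, and any leftover returns beyond M·n are simply dropped):

```python
import numpy as np

def to_return_paths(prices, n):
    """Turn N+1 prices S_0, ..., S_N into M = floor(N/n) return-paths
    of length n, where r_i = S_i / S_{i-1} - 1."""
    returns = prices[1:] / prices[:-1] - 1.0  # r_1, ..., r_N
    N = len(returns)
    M = N // n
    return returns[:M * n].reshape(M, n)

prices = np.array([1.0, 1.1, 0.99, 1.05, 1.05, 0.9])  # N = 5 returns
print(to_return_paths(prices, n=2).shape)  # (2, 2); the 5th return is dropped
```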
To create such a model, we use variational autoencoders (VAEs) and conditional variational autoencoders (CVAEs). Other frameworks are capable of creating generative models, such as generative adversarial networks (GANs). However, we stick to VAEs and CVAEs, in line with [2], since VAEs and CVAEs are more stable during training and require less data to converge (GANs are known to be quite data hungry).
It is important to note that we choose to generate returns directly, which deviates from [2]. In [2], the authors are big proponents of generating truncated log-signatures of lead-lag transformed paths (and then transforming them back to paths). We will not go into detail about the log-signature transformation in this thesis. However, the idea behind the transformation is that it should be a more efficient and robust encoding of the information in the path. We do not employ this transformation because there is (to our knowledge) no tractable and fast inverse transformation from log-signatures to paths. In a notebook² on Github for [2], the authors show an example of an inverse signature transform for a path with 20 returns. In this example, it takes 51 seconds to perform the inverse transformation (on unknown machinery). The method for transforming log-signatures to paths is slow because it semi-randomly searches for a path with a close enough log-signature. Remember that we need 2^18 paths to train a deep hedging model. In our opinion, this renders the log-signature approach impractical (for now).
6.1 Variational Autoencoders
In this section, we introduce the concept of variational autoencoders (VAEs). This introduction is partly based on a
brilliant tutorial on VAEs by Carl Doersch [16].
To introduce VAEs, we step away from the world of returns and financial time series. We start by assuming that we wish to simulate some random variable X ∈ ℝⁿ based on iid samples {X₁, …, X_N}. We are not interested in simply sampling from {X₁, …, X_N}, but rather in creating a model capable of generating new and unseen samples which have the same (or similar) distributional properties as {X₁, …, X_N}. Note that we let X with density p(X) refer to the simulated random variable.
We will not sample X directly, but rather through a latent variable z ∈ ℝᵏ with some distribution p(z). We, therefore, imagine sampling z from p(z) and then sampling X|z from some conditional distribution p(X|z). Our goal can be thought of as finding p(X|z) such that p(X) is maximized for our data, as in maximum likelihood. The connection between p(X) and p(X|z) comes from the law of total probability

p(X) = ∫_{ℝᵏ} p(X|z) p(z) dz.
When working with VAEs it is common to assume that z ∼ N(0, I_k) and X|z ∼ N(f(z), σ²Iₙ), where I_k ∈ ℝ^{k×k} is the k×k identity matrix. Note that we do not assume that X is normally distributed, since f(z) is some (possibly complicated) transformation of a normal random variable. We simply assume that z is a multivariate standard normal random variable and that X|z (i.e. X conditioned on z) is normally distributed.
We could maximize p(X) by approximation. If we have M samples of z, then p(X) is approximately

p(X) ≈ (1/M) Σ_{i=1}^{M} p(X|zᵢ).

It would then be possible to find f(z) s.t. p(X) was maximized for our data. However, this is likely not feasible. In practice, p(X|zᵢ) is likely to be very small for each zᵢ due to a potentially high dimensionality of X. This implies that we would have to draw an impractically large sample of zs to accurately estimate p(X), and then try to perform maximum likelihood estimation to find the optimal f.
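A small toy computation illustrates why the naive estimate fails in high dimension. Here the decoder is the identity (k = n) and σ = 0.1; both choices, and all names, are illustrative assumptions, not the thesis setup:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50                        # dimension of X
sigma = 0.1
X = rng.standard_normal(n)    # one observation

# naive estimate p(X) ~ (1/M) sum_i p(X | z_i) with z_i ~ N(0, I_n)
# and a toy identity decoder f(z) = z, so X|z ~ N(z, sigma^2 I_n)
M = 100_000
z = rng.standard_normal((M, n))
log_p_X_given_z = (-0.5 * np.sum((X - z) ** 2, axis=1) / sigma ** 2
                   - n / 2 * np.log(2 * np.pi * sigma ** 2))
print(log_p_X_given_z.max())  # far below 0: every sampled z "misses" X
```

Even the best of 100,000 prior samples yields a log-density in the minus hundreds, so the Monte Carlo sum is dominated by terms that are numerically zero.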
VAEs solve this by sampling zs which are likely to have produced X. To do this, we need a function which supplies a distribution over z that is more likely to produce X. We name this distribution Q(z|X). A sensible choice would be to choose Q close to p(z|X), i.e. the distribution of z given X.
2See https://github.com/imanolperez/market_simulator/blob/master/notebooks/logsig_inversion.ipynb
Imagine that we now sample z from Q(z|X) based on our observations. How does this help us calculate p(X)? We somehow need to relate p(X) to E_{z∼Q}[p(X|z)]. To do this, we make the (maybe) odd choice of considering the relationship between Q(z|X) and p(z|X), specifically the Kullback-Leibler divergence (KL-divergence), which is defined as

Definition 6.1. For two distributions (densities) p, q on ℝⁿ, the KL-divergence is defined as

D_KL(p ∥ q) = ∫_{ℝⁿ} p(x) ln(p(x)/q(x)) dx.

The KL-divergence between Q(z|X) and p(z|X) is, therefore,

D_KL(Q(z|X) ∥ p(z|X)) = E_{z∼Q}[ln Q(z|X) − ln p(z|X)].
We can now utilize Bayes' rule on p(z|X), i.e. p(z|X) = p(X|z)p(z)/p(X). We can, therefore, express the KL-divergence between Q(z|X) and p(z|X) as

D_KL(Q(z|X) ∥ p(z|X)) = E_{z∼Q}[ln Q(z|X) − ln p(X|z) − ln p(z)] + ln p(X).

We can now see that p(X) and p(X|z) enter the picture, which is exactly what we were looking for. Rearranging the terms and using the definition of KL-divergence again yields

ln p(X) − D_KL(Q(z|X) ∥ p(z|X)) = E_{z∼Q}[ln p(X|z)] − D_KL(Q(z|X) ∥ p(z)).    (11)
This is one of the defining relationships for VAEs. The left-hand side is what we wish to maximize, i.e. (the log of) p(X), minus an error term that depends on the difference between Q(z|X) and p(z|X). Note that we postulated earlier that a sensible choice for Q(z|X) would be p(z|X). So if we choose the framework of Q(z|X) in a way that is capable of representing p(z|X), then maximizing the left-hand side (w.r.t. f and Q) would also maximize p(X), since the KL-divergence would also be minimized, i.e. the term would be zero if Q(z|X) = p(z|X). This is exactly what we want! To maximize the left-hand side of equation (11), we can use the shown relation, since the right-hand side can be maximized with gradient ascent (which we will explain later).
Remember that we want to maximize p(X), or equation (11), over our dataset {X₁, …, X_N}. To do this, we can sum (or average) equation (11) over our dataset and maximize that expression. This will (because of the iid assumption and the log) exactly correspond to maximizing the log-likelihood if Q(z|X) is capable of representing p(z|X).
Let us reiterate what we have found: We want to sample X by first sampling a random variable z ∼ N(0, I_k) from a latent space and then sampling X|z ∼ N(f(z), σ²Iₙ). Therefore, we need to find a function f s.t. we maximize the probability p(X) of observing a specific dataset (as in maximum likelihood).
In practice, we wish to maximize over f by sampling z from a distribution Q(z|X), which is more likely to produce X. We do not know Q or p(z|X), so we also have to maximize over Q. Equation (11) links all this together by providing us with an expression that we can maximize over f and Q (the right-hand side), which will in return maximize p(X), if Q is capable of representing p(z|X).
The structure of VAEs and equation (11) can be seen visually in figure 6.1. Note that this includes assumptions on Q that will be specified in the upcoming section.
Specifying Q(z|X) and finding an expression for the RHS of equation (11)
We now wish to be more concrete about our assumptions on Q and how we can maximize the right-hand side of equation (11). We start by assuming that

Q(z|X) = N(g(X), diag(h(X)))
Figure 6.1: VAE structure when assuming Q(z|X) = N(g(X), diag(h(X))). The blue boxes represent terms from the right-hand side of equation (11).
for some maps g, h : ℝⁿ → ℝᵏ. We now wish to simplify the right-hand side of equation (11) based on this assumption. We start with E_{z∼Q}[ln p(X|z)]. Remember that we assumed X|z ∼ N(f(z), σ²Iₙ), so

p(X|z) = (2π)^{−n/2} (σ²)^{−n/2} exp(−½ (X − f(z))ᵀ (1/σ²) Iₙ (X − f(z))) = (2πσ²)^{−n/2} exp(−‖X − f(z)‖₂² / (2σ²)),

which implies that

E_{z∼Q}[ln p(X|z)] = −E_{z∼Q}[‖X − f(z)‖₂² / (2σ²)] + c

where c is some constant that does not depend on X or f(z) (and so is irrelevant for the optimization).
We can now move on to the second term of the right-hand side of equation (11), i.e. D_KL(Q(z|X) ∥ p(z)). Since we have assumed that Q(z|X) = N(g(X), diag(h(X))), the KL-divergence is between two multivariate normal random variables (remember z ∼ N(0, I_k)). The KL-divergence between two multivariate normal distributions is well-known.

Proposition 6.2. Given two multivariate normals N₁ ∼ N(μ₁, Σ₁) and N₂ ∼ N(μ₂, Σ₂), the KL-divergence between the two distributions is

D_KL(N(μ₁, Σ₁) ∥ N(μ₂, Σ₂)) = ½ (tr(Σ₂⁻¹Σ₁) + (μ₂ − μ₁)ᵀ Σ₂⁻¹ (μ₂ − μ₁) − k + ln(det Σ₂ / det Σ₁)).

Proof. See [17] page 13.
In our case, μ₁ = g(X), μ₂ = 0, Σ₁ = diag(h(X)) and Σ₂ = I_k. We see that

1. Σ₂⁻¹Σ₁ = I_k diag(h(X)) = diag(h(X)), so we have tr(Σ₂⁻¹Σ₁) = h(X)ᵀ1.
2. (μ₂ − μ₁)ᵀ Σ₂⁻¹ (μ₂ − μ₁) = g(X)ᵀ I_k g(X) = g(X)ᵀ g(X).
3. det Σ₂ = det I_k = 1 and det Σ₁ = det(diag(h(X))) = Π_{i=1}^{k} h(X)ᵢ, which implies that

ln(det Σ₂ / det Σ₁) = ln(1 / Π_{i=1}^{k} h(X)ᵢ) = −(ln h(X))ᵀ1

where the log is applied element-wise to h(X).
We, therefore, get that the KL-divergence between Q(z|X) and p(z) is

D_KL(Q(z|X) ∥ p(z)) = ½ (h(X)ᵀ1 + g(X)ᵀg(X) − k − (ln h(X))ᵀ1)

which is easy to compute for some X given h and g.
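The closed form can be checked numerically against the general formula from proposition 6.2 (a numpy sketch of our own; the test vectors are arbitrary):

```python
import numpy as np

def kl_diag_gaussian_to_std_normal(g, h):
    """KL( N(g, diag(h)) || N(0, I_k) ) via the closed form:
    0.5 * ( sum(h) + g.g - k - sum(log h) )."""
    k = len(g)
    return 0.5 * (h.sum() + g @ g - k - np.log(h).sum())

def kl_general(mu1, S1, mu2, S2):
    """General multivariate-normal KL from proposition 6.2."""
    k = len(mu1)
    S2inv = np.linalg.inv(S2)
    return 0.5 * (np.trace(S2inv @ S1)
                  + (mu2 - mu1) @ S2inv @ (mu2 - mu1) - k
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))

g = np.array([0.5, -1.0, 0.2])   # arbitrary g(X)
h = np.array([0.8, 1.5, 0.3])    # arbitrary h(X), strictly positive
a = kl_diag_gaussian_to_std_normal(g, h)
b = kl_general(g, np.diag(h), np.zeros(3), np.eye(3))
print(np.isclose(a, b))  # True
```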
Putting it all together, we can express the right-hand side of equation (11) as

E_{z∼Q}[ln p(X|z)] − D_KL(Q(z|X) ∥ p(z)) = −E_{z∼Q}[‖X − f(z)‖₂² / (2σ²)] − ½ (h(X)ᵀ1 + g(X)ᵀg(X) − k − (ln h(X))ᵀ1) + c.    (12)
Deriving the empirical objective function
Assume again that we have a dataset {X₁, …, X_N} and corresponding N realizations of z|X from Q(z|X). We can then create the empirical objective function, to which we can apply gradient ascent by differentiating w.r.t. the parameters in f, g, h. To emulate the maximum likelihood method, we create the empirical objective function as a sum (or actually a mean) of equation (12) over the dataset (as argued previously). Hence, we can express the empirical objective function as

−(1/N) Σ_{i=1}^{N} ‖Xᵢ − f(zᵢ|Xᵢ)‖₂²/σ² − (1/N) Σ_{i=1}^{N} (h(Xᵢ)ᵀ1 + g(Xᵢ)ᵀg(Xᵢ) − k − (ln h(Xᵢ))ᵀ1)    (13)

where the first sum is the reconstruction loss, the second sum is the KL-divergence loss, and where we have omitted the factor 1/2 and the constant (since they do not affect the optimization). Note that we only use one zᵢ|Xᵢ to estimate E_{z∼Q}[−‖Xᵢ − f(z)‖₂²/(2σ²)]. This is done in order to speed up computations, but we are then forced to take more and smaller gradient steps.
A critical step is to simulate zᵢ|Xᵢ as

zᵢ|Xᵢ = g(Xᵢ) + √(h(Xᵢ)) ⊙ ε, where ε ∼ N(0, I_k),

since this allows us to backpropagate through the simulation step. Backpropagation cannot handle stochastic operations but can take stochastic inputs.
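A minimal numerical sketch of this sampling step (numpy; the g(X) and h(X) values are arbitrary placeholders of our own): the draws are deterministic functions of g(X), h(X) and ε, and their empirical mean and variance match the intended N(g(X), diag(h(X))):

```python
import numpy as np

def sample_z_given_x(g_x, h_x, eps):
    """Reparameterization trick: z|X = g(X) + sqrt(h(X)) * eps with
    eps ~ N(0, I_k). All randomness enters through eps, so the sample
    is a deterministic (hence differentiable) function of g(X), h(X)."""
    return g_x + np.sqrt(h_x) * eps

rng = np.random.default_rng(0)
g_x = np.array([1.0, -2.0])    # placeholder encoder mean g(X)
h_x = np.array([0.25, 4.0])    # placeholder encoder variance h(X)
eps = rng.standard_normal((200_000, 2))
draws = sample_z_given_x(g_x, h_x, eps)   # broadcasts over 200,000 draws
print(draws.mean(axis=0))  # close to g(X)
print(draws.var(axis=0))   # close to h(X)
```

In an actual VAE, g_x and h_x would be outputs of the encoder network, and gradients flow through the affine map back to the encoder parameters.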
We should also note that σ² acts as a way of balancing the trade-off between minimizing the reconstruction loss and minimizing the KL-divergence loss. One can understand the trade-off like this: When σ is close to zero, the VAE will link z|X directly to X to ensure the best possible reconstruction. However, this will render simulation from z ∼ N(0, I_k) useless as a way to simulate X-lookalikes. On the other hand, if σ is large, then simulations become too normally distributed, and the model will struggle to separate different Xs in the latent space.
Simulating X after optimization
Once we have found the maps f, g and h that maximize equation (13), we can sample X in the following way:
1. Sample the latent variable z ∼ N(0, I_k).
2. Sample X ∼ N(f(z), σ²Iₙ).
However, it is pretty common to sample X by setting X = f(z) and thus assume σ² = 0 (but only for the simulation step). Of course, if σ² is small, then these two techniques will not be too dissimilar.
Figure 6.2: CVAE structure. The blue boxes represent terms from the right-hand side of equation (14).
One could also argue that we introduced the last simulation step to enable simulation of Xs which are not exactly like our samples (otherwise, the problem would collapse). However, after training our model, we might not care about the extra variance introduced by σ². This would be the case in image generation, where σ² would be unnecessary noise in the simulation. In this thesis, we choose not to skip the last simulation step, since the additional noise might be necessary for the distribution of the returns.
6.2 Conditional VAEs
In this brief section, we wish to explain an extension to VAEs called conditional VAEs (CVAEs). Imagine that, along with a number of samples {X₁, …, X_N}, we have some conditional samples {Y₁, …, Y_N} with Yᵢ ∈ ℝᵐ. We assume that the X|Ys are independent, but that their distribution depends on Y. This is relevant for us when sampling return-paths that are not stationary. Our samples of return-paths might come from different periods of volatility, and it is commonly accepted that the future volatility of returns depends on previous volatility. Therefore, it is crucial that we can sample return-paths conditioned on some market state (or the previous returns).
Extending VAEs to cope with conditional variables is not difficult: First, we change our goal to being able to sample X|Y such that it looks like {X₁|Y₁, …, X_N|Y_N}. We, therefore, wish to approximate p(X|Y), not just p(X). Following the idea from before, we assume that we sample X|(Y, z) ∼ N(f(z, Y), σ²Iₙ). Now f : ℝᵏ × ℝᵐ → ℝⁿ is a function that takes both the latent variable and the conditional variable. As before, we assume that we sample z|Y ∼ N(0, I_k), meaning that the latent variable is unaffected by the conditional variable Y.
To perform the approximation of p(X|Y), we again want to sample the zs such that they have a higher probability of producing X given Y. We do this by sampling z from Q(z|X, Y), where we assume that Q(z|X, Y) = N(g(X, Y), diag(h(X, Y))), which is similar to before except that the maps g and h take both X and Y as inputs. So Y now enters as an input for the maps f, g and h. With these changes, equation (11) can be derived again, yielding

ln p(X|Y) − D_KL(Q(z|X, Y) ∥ p(z|X, Y)) = E_{z∼Q}[ln p(X|z, Y)] − D_KL(Q(z|X, Y) ∥ p(z|Y))    (14)

and remember that we assumed p(z|Y) = N(0, I_k). This is the core equation behind CVAEs, and it has a similar interpretation as equation (11) (which was the relation for VAEs). In figure 6.2, we have visualized the structure of CVAEs and equation (14).
One can then again derive an objective function similar to equation (13)

−(1/N) Σ_{i=1}^{N} ‖Xᵢ − f(zᵢ|Xᵢ, Yᵢ)‖₂²/σ² − (1/N) Σ_{i=1}^{N} (h(Xᵢ, Yᵢ)ᵀ1 + g(Xᵢ, Yᵢ)ᵀg(Xᵢ, Yᵢ) − k − (ln h(Xᵢ, Yᵢ))ᵀ1).    (15)

Maximizing the above w.r.t. f, g and h should enable us to sample X|Y by first sampling z ∼ N(0, I_k) and then sampling X|(Y, z) ∼ N(f(z, Y), σ²Iₙ).
6.3 Connecting ANNs to VAEs and CVAEs
In the previous sections, we have derived an objective function which we can maximize w.r.t. three maps f, g and h to enable simulation of Xs with similar distributional properties to our samples {X₁, …, X_N}, even when conditioning on some variable Y.
In this section, we explain how the idea of VAEs and CVAEs relates to artificial neural networks (ANNs). As we know, ANNs are useful as approximators of continuous maps over which we want to minimize some loss function using gradient descent. For VAEs and CVAEs, we can use ANNs to represent the maps f, g and h. We first remember that

f : ℝᵏ (× ℝᵐ) → ℝⁿ
g, h : ℝⁿ (× ℝᵐ) → ℝᵏ

for VAEs (and CVAEs). First, we let f be represented by an ANN, D, with input dimension k (× m) and output dimension n. For g and h, we choose a single ANN, E, to represent both. E must then have input dimension n (× m) and output dimension 2k. So for VAEs and CVAEs, we have

VAE: D(z) = f(z), E(x) = (g(x), h(x))    CVAE: D(z, y) = f(z, y), E(x, y) = (g(x, y), h(x, y)).
We can now introduce θ₀ and θ₁ as the trainable parameters in E and D, respectively. For VAEs, we can state the maximization problem using the objective function in equation (13) as

max_{θ₀,θ₁} [ −(α/N) Σ_{i=1}^{N} ‖Xᵢ − D_{θ₁}(zᵢ|Xᵢ)‖₂² − ((1−α)/N) Σ_{i=1}^{N} (E_{θ₀}^{[1]}(Xᵢ)ᵀ1 + E_{θ₀}^{[0]}(Xᵢ)ᵀE_{θ₀}^{[0]}(Xᵢ) − k − (ln E_{θ₀}^{[1]}(Xᵢ))ᵀ1) ]    (16)

where we have used the notation E_{θ₀}(x) = (E_{θ₀}^{[0]}(x), E_{θ₀}^{[1]}(x)), and we sample zᵢ|Xᵢ by first sampling εᵢ ∼ N(0, I_k) and then setting zᵢ|Xᵢ = E_{θ₀}^{[0]}(Xᵢ) + √(E_{θ₀}^{[1]}(Xᵢ)) ⊙ εᵢ. A similar maximization problem can also be derived for CVAEs using equation (15).
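To make the pieces concrete, here is a small numpy sketch of our own that evaluates the objective in equation (16) once, with arbitrary linear maps standing in for E and D (these stand-ins and all names are assumptions; in practice E and D are the trainable ANNs, and the expression is maximized by gradient ascent):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, N = 4, 2, 256
alpha = 0.9                     # alpha = (1 + sigma^2)^-1, as in the text

X = rng.standard_normal((N, n))

# toy linear stand-ins for the encoder E = (g, h) and decoder D
W_g = rng.standard_normal((n, k)) * 0.1
W_h = rng.standard_normal((n, k)) * 0.1
W_d = rng.standard_normal((k, n)) * 0.1

g = X @ W_g                     # E^[0](X): latent means
h = np.exp(X @ W_h)             # E^[1](X): latent variances, kept positive
eps = rng.standard_normal((N, k))
z = g + np.sqrt(h) * eps        # reparameterized sample z_i | X_i
X_hat = z @ W_d                 # decoder output D(z_i | X_i)

recon_loss = (alpha / N) * np.sum((X - X_hat) ** 2)
kl_loss = ((1 - alpha) / N) * (np.sum(h + g * g - np.log(h)) - N * k)
objective = -(recon_loss + kl_loss)   # the quantity maximized in eq. (16)
```

Note that the KL term is non-negative by construction (each summand h − 1 − ln h + g² is ≥ 0), which the sketch makes easy to verify.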
As we have alluded to previously, this maximization problem can be solved using a gradient ascent algorithm. The crucial detail that allows us to find the derivative of the above expression w.r.t. the trainable parameters in E and D is that we sample zᵢ|Xᵢ as an affine transformation of a multivariate standard normal random variable εᵢ. This makes it possible to find the derivative of zᵢ|Xᵢ w.r.t. θ₀. This is often referred to as the reparameterization trick.
We should also note that we have scaled the entire objective function by ασ², where α = (1 + σ²)⁻¹. We do this to emphasize the balance between minimizing the reconstruction loss (first term) and minimizing the KL-divergence loss (second term).
6.4 Experiments with VAEs and performance evaluation (in a simple Black-Scholes model)
In this section, we wish to use VAEs to create generative models based on a Black-Scholes model (and a Heston model later), which we refer to as the base model(s). Specifically, our goal is to simulate daily return-paths of length 20 that resemble those from the base model.
The section has two purposes: 1. Explain how we set up and create VAEs based on simulated return-paths from the base model. 2. Analyze the generative models with light statistical tools to investigate whether the generated return-paths have similar characteristics to those from the base model. We will not do an extensive analysis of the VAEs, since the real test for the VAEs will be their use as data-generators for deep hedging models (see section 7).
A simple Black Scholes model
We start by assuming that our base model is a simple Black-Scholes model with the following parameters: S(0) = 1, drift μ = 0.05 and volatility σ = 0.3. We do this since returns in the Black-Scholes model are iid; hence return-paths are stationary. This is not realistic, but it allows us to use VAEs instead of CVAEs, since all return-paths will have the same distribution independent of any market state.
Data-preparation
For this experiment, we wish to train a VAE capable of producing Black-Scholes-like paths of length n = 20 (plus the initial spot value S(0)) with dt = T/n and T = 1/12, i.e. daily returns over one month. Note that we could train a VAE model to generate single daily returns from the Black-Scholes model (i.e. having n = 1) and connect multiple simulated returns to create a longer path. This is possible without extra modifications, since the Black-Scholes model has iid returns. However, this problem is too easy to be interesting. We, therefore, choose n = 20, even though less would be enough.
We wish to train the VAE on M independent/non-overlapping return-paths (for now). To do this, we can sample a single Black-Scholes return-path of length M·n and divide it into M return-paths of length n. We can then scale the returns by subtracting their means and dividing by their standard deviations. This scaling is done independently for every time point (dt, …, n·dt = T). We then have M normalized return-paths of length n. These will be the samples we utilize to train the VAE.
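The per-time-point normalization can be sketched as follows (our own numpy sketch; returning the means and standard deviations, so that generated paths can later be transformed back to the return scale, is a convenience we add, not thesis code):

```python
import numpy as np

def normalize_return_paths(R):
    """Normalize M return-paths of length n independently per time point:
    for each column (time point), subtract the cross-sectional mean and
    divide by the cross-sectional standard deviation."""
    mu = R.mean(axis=0)
    sd = R.std(axis=0)
    return (R - mu) / sd, mu, sd

rng = np.random.default_rng(0)
R = rng.normal(0.0002, 0.02, size=(500, 20))   # M = 500 paths, n = 20 returns
R_norm, mu, sd = normalize_return_paths(R)
```

After training, a generated normalized path r̃ would be mapped back to returns as r̃ * sd + mu before being compounded into a price path.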
Creating the VAE
We are now ready to create the VAE. In our experiments, Xᵢ ∈ ℝⁿ corresponds to a normalized return-path. This dictates the dimensions of the ANNs and influences our choice of latent space. To start, we choose the size of the latent space. We choose a latent space with dimension n, i.e. z ∈ ℝⁿ. This may seem like an arbitrary choice (and it might be). However, we would normally expect return-paths to require as many random variables to simulate as there are time points (in this case n). This is unusual for VAE problems, where we often choose latent spaces much smaller than the size of X. This is the case in image generation, where it is safe to assume that an image depends on fewer components than the number of pixels in the image.
We are now ready to create the ANNs E (encoder, representing g and h) and D (decoder, representing f). E has input dimension n and output dimension 2n (n for g and n for h). D has input dimension n (the latent space size) and output dimension n. We create E and D as shallow ANNs with 40 units in the hidden layer and elu activation.
We choose shallow ANNs as VAEs can be challenging to train with deep ANNs.
Lastly, we need to choose the parameter σ used to construct X as N(f(z), σ²I_n). Remember that σ can also be represented as α = 1/(1 + σ²) from the objective function in equation (16). To start, we choose α = 0.9, but we wish to test different choices later.
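A minimal sketch of the dimensions involved, assuming the encoder outputs the n-dimensional mean and log-variance concatenated (the NumPy stand-in for the ANNs and all names are ours; a real implementation would use a deep learning framework):

```python
import numpy as np

def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1)

def make_shallow(in_dim, hidden, out_dim, rng):
    """One-hidden-layer ANN with elu activation, randomly initialized."""
    W1 = rng.standard_normal((in_dim, hidden)) * 0.1
    b1 = np.zeros(hidden)
    W2 = rng.standard_normal((hidden, out_dim)) * 0.1
    b2 = np.zeros(out_dim)
    return lambda x: elu(x @ W1 + b1) @ W2 + b2

n = 20
rng = np.random.default_rng(0)
E = make_shallow(n, 40, 2 * n, rng)   # encoder: mean g(x) and log-variance h(x)
D = make_shallow(n, 40, n, rng)       # decoder: latent z -> reconstruction f(z)

x = rng.standard_normal(n)
enc = E(x)
g, h = enc[:n], enc[n:]               # split into latent mean and log-variance
print(g.shape, D(g).shape)            # (20,) (20,)
```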
Training the VAE
The VAE is trained by maximizing the objective function in equation (16) (or minimizing the negative objective function, i.e. the loss), where the X_i's are sampled return-paths. This is easily done using gradient ascent (descent).
We train the VAE for 1,500 epochs with a batch size of 128 utilizing the ADAM algorithm with a learning rate of
0.01, 0.0001 and 0.00001 for 500 epochs each.
We do not use learning rate decay or early stopping since each batch and epoch produces quite varying losses
due to the simulation steps included in the objective function. A simple training procedure is, therefore, easier to use
and possibly more robust.
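The staged schedule and minibatch loop could look as follows (a skeleton only; the actual ELBO gradient and ADAM step are left abstract, and all names are ours):

```python
import numpy as np

def learning_rate(epoch):
    # Staged schedule from the text: 0.01, 0.0001, 0.00001 for 500 epochs each.
    return 1e-2 if epoch < 500 else (1e-4 if epoch < 1000 else 1e-5)

def train(X, n_epochs=1500, batch_size=128, seed=0):
    """Skeleton training loop: shuffle, take minibatches, and perform one
    gradient step per batch at the scheduled learning rate."""
    rng = np.random.default_rng(seed)
    steps = 0
    for epoch in range(n_epochs):
        lr = learning_rate(epoch)
        idx = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = X[idx[start:start + batch_size]]
            # grads = elbo_gradient(params, batch)      (abstract)
            # params = adam_update(params, grads, lr)   (abstract)
            steps += 1
    return steps

n_steps = train(np.zeros((1000, 20)), n_epochs=3)
print(n_steps)   # 3 epochs * ceil(1000/128) = 24
```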
As with the deep hedging models, we do not worry too much about training times. However, with this training procedure and proposed model architecture, we train a VAE with 1,000 training samples in approximately 50 seconds. This is relatively fast compared to the deep hedging models, and quite reasonable when considering the number of training samples (1,000 vs 2^18). Note, however, that the training time depends significantly on the number of training samples, the number of returns in a path (n), the architecture, computational resources, epochs and batch size.
Simulating new paths
Once training is done, we can generate new samples from the VAE by sampling z ~ N(0, I_n) and then sampling X|z ~ N(D(z), σ²I_n), where D is the trained decoder ANN. These samples represent new normalized return-paths. Therefore, all that is left is denormalizing the VAE samples and converting the return-paths to S-paths using S(0). Note that simulating new return-paths is virtually instant, since it only requires a simulation of normal random variables, a forward pass of the decoder ANN D, another simulation of normal random variables and a denormalization transformation.
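A sketch of this sampling step, assuming a trained decoder and the per-time-point normalization constants (the identity decoder below is only a stand-in for a trained D, and all names are ours):

```python
import numpy as np

def sample_paths(decoder, means, stds, S0=1.0, n=20, n_samples=5,
                 sigma=0.1, seed=1):
    """Generate S-paths from a trained decoder: sample z ~ N(0, I_n),
    then X|z ~ N(D(z), sigma^2 I_n), denormalize, and cumulate log-returns."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, n))
    X = decoder(z) + sigma * rng.standard_normal((n_samples, n))  # normalized
    returns = X * stds + means                                    # denormalize
    S = S0 * np.exp(np.cumsum(returns, axis=1))                   # to spot
    return np.hstack([np.full((n_samples, 1), S0), S])            # prepend S(0)

# Identity decoder as a stand-in for the trained ANN D.
paths = sample_paths(lambda z: z, means=np.zeros(20), stds=np.full(20, 0.01))
print(paths.shape)   # (5, 21)
```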
Evaluating paths sampled by the VAE
We should now (hopefully) have a working VAE model capable of generating new paths similar to the training samples. We wish to test this to ensure that the model and setup are functioning correctly and to analyze the effects of changes to σ (or α) and the number of samples used for training.
Analyzing whether the VAE samples are similar in distribution to the training samples is not trivial, as we work with random variables in dimension n = 20. This is not helped by the fact that the number of training samples may not be large (100 to 1,000). To simplify the analysis, we can analyze the empirical marginal distribution of each of the n = 20 time points from the S-paths. After analyzing the marginal distributions, we can look at the autocorrelations of the returns, including the absolute returns, as this is especially relevant for returns of financial assets.
One could perform more detailed analyses of the VAEs. However, (as we will see later) analyzing empirical
marginal distributions and autocorrelations can get us a long way. We should also bear in mind that our goal is to use
the VAEs as data-generators for deep hedging models. We can, therefore, view the upcoming hedge experiments as
the most significant tests of the VAEs.
We now wish to briefly explain how we analyze the empirical marginal distributions. For completeness, we also wish to explain how we compute the autocorrelations of the samples (as they are not actual autocorrelations).
To analyze the empirical marginal distributions of the S-paths, we perform a two-sample Kolmogorov-Smirnov test for all time points. This test is designed to test whether two samples come from the same distribution. Let us assume that we have two sets of iid samples {X_1, ..., X_N} and {Y_1, ..., Y_M}, both in R, with corresponding empirical distribution functions F_{X,N}(x) and F_{Y,M}(y). The two-sample Kolmogorov-Smirnov test considers the maximum distance between the two empirical CDFs. This distance, which is referred to as the Kolmogorov-Smirnov statistic, is defined as

D_{N,M} = sup_x |F_{X,N}(x) − F_{Y,M}(x)|.
The null hypothesis for this statistical test is that {X_1, ..., X_N} and {Y_1, ..., Y_M} are sampled from the same underlying distribution. Without going into the theoretical details, we can reject the null hypothesis (and be convinced that the Xs and Ys come from different distributions) at some confidence level α if

D_{N,M} > sqrt( −ln(α/2) · (N + M) / (2NM) ).

Solving for 1 − α, we can therefore express the p-value as

p := 1 − α = 1 − 2 exp( −(2NM / (N + M)) · D²_{N,M} ),

and we can reject the null hypothesis if the p-value is found to be suitably low. In our analyses, we do not specify an exact confidence level, as it should be clear from observing the p-value. Still, we generally will not be satisfied with the VAE if the p-value across multiple time points is below 10%.
We should note that the p-value, of course, depends on the size of our two samples. That is, if our two samples
are large, then we require their empirical distributions to be closer to obtain similar p-values. We, therefore, have to
choose the size of the VAE samples and be careful to keep it consistent. In our analyses, we sample 20,000 paths from the trained VAEs to compare to the training paths (which are typically of size 1,000). We choose 20,000 to reduce the uncertainty from the VAE sampling.
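The statistic and the one-term p-value approximation above can be computed directly (a NumPy sketch with our own names; in practice one could also use scipy.stats.ks_2samp, which implements the full series):

```python
import numpy as np

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum distance
    between the two empirical CDFs, evaluated on the pooled sample."""
    grid = np.sort(np.concatenate([x, y]))
    Fx = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    Fy = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return np.max(np.abs(Fx - Fy))

def ks_p_value(x, y):
    """One-term asymptotic p-value as in the text:
    p = 1 - 2 exp(-2NM/(N+M) * D^2).
    Note this approximation is only accurate for large D (it can go
    below 0 for small D); the full Kolmogorov series avoids this."""
    N, M, D = len(x), len(y), ks_statistic(x, y)
    return 1.0 - 2.0 * np.exp(-2.0 * N * M / (N + M) * D**2)

x = np.random.default_rng(0).standard_normal(1000)
print(round(ks_statistic(x, x.copy()), 6))   # 0.0 for identical samples
```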
We can now move on to the autocorrelations. In our experiments, we have two sets of samples of return-paths (and absolute return-paths) of length n. The classical autocorrelation does not really make sense in this setting, as we have multiple paths covering the same time points. In this case, we choose to calculate the average of the empirical correlations measured on lagged returns. For lack of a better term, we refer to this as autocorrelation. To be more specific, we choose a lag l and calculate the correlation between returns at time t_i and t_{i+l} for all possible i's. We then average all correlations to get the autocorrelation with lag l. We do this for l = 1, ..., n to produce a graph of autocorrelations across lags.
We should note that this method of evaluating the correlation structure of return-paths can hide unwanted
correlation since we average correlations across time points. We will, however, not be too worried about this as we
would expect perfect offset to be quite unlikely.
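This averaged lagged correlation can be sketched as follows (names are ours):

```python
import numpy as np

def lagged_autocorrelation(paths, lag):
    """Average, over all valid start times i, of the cross-sectional
    correlation between returns at t_i and t_{i+lag} (computed across paths)."""
    n = paths.shape[1]
    corrs = [np.corrcoef(paths[:, i], paths[:, i + lag])[0, 1]
             for i in range(n - lag)]
    return float(np.mean(corrs))

rng = np.random.default_rng(0)
iid_paths = rng.standard_normal((5000, 20))
acf1 = lagged_autocorrelation(iid_paths, lag=1)
print(abs(acf1) < 0.05)   # True: near zero for iid returns
```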
An actual VAE-experiment
We are now ready to run an actual VAE experiment in a Black-Scholes model. In this experiment, we train a VAE on 1,000 independent return-paths of length n = 20 with α = 0.9. We then simulate 20,000 return-paths / S-paths from the VAE and compare them to the 1,000 training paths. The results of this experiment can be seen in figure 6.3 (a)-(f).
In figure 6.3 (a) and (b), we see 100 S-paths generated by the Black-Scholes model and the VAE model, respectively. We cannot conclude anything about the distributional characteristics, but we can confirm that the VAE has indeed been able to create paths that look like proper paths from a price process.
Figure 6.3: Results from comparison of 20,000 VAE paths and 1,000 Black-Scholes paths, which were used for training. (a) 100 Black-Scholes paths (used for training); (b) 100 VAE paths; (c) empirical distributions of 1,000 training paths and 20,000 VAE paths at time T; (d) Kolmogorov-Smirnov p-values across time; (e) autocorrelation of training and VAE returns; (f) autocorrelation of training and VAE absolute returns.
Figure 6.4: Average Kolmogorov-Smirnov p-values across time for varying α and number of training paths, including 95% confidence bands. (a) α ∈ {0.8, 0.9, 0.99} and 1,000 training paths; (b) number of training paths in {100, 250, 1000} and α = 0.9. The p-values are calculated from 20,000 paths sampled from 100 VAEs, which are trained on independent samples. When varying the number of training paths (in (b)), we test the VAE paths against 1,000 independent Black-Scholes paths.
In figure 6.3 (c), we see the empirical distribution functions for the training paths and the VAE samples at time T = 1/12. The empirical distributions are visually quite close, suggesting that the VAE has captured some of the distributional characteristics of the Black-Scholes samples (note/remember that we look at distributions of S(T) and not returns). This is confirmed in figure 6.3 (d), where we see the p-values from a two-sample Kolmogorov-Smirnov test at each time point. Remember again that the test is done with 1,000 training paths and 20,000 VAE paths. We observe that the p-values for all time points are above 0.5 (roughly), implying that we cannot reject any of the 20 null hypotheses (one for each time point). The marginal distributions of the VAE paths are thus close to those of the Black-Scholes training samples.
To end this first experiment, we look at the autocorrelations in figure 6.3 (e) and (f). From these, we see that the VAE does not, on average, have a significantly non-zero correlation at any lag, also when including absolute returns. These results suggest that the VAE is indeed able to produce paths of S (or return-paths) that have the same distributional characteristics as the Black-Scholes model, which we used for training.
Varying α and the number of training paths
We have just seen that a VAE can produce new paths with similar characteristics to a Black-Scholes model. In this section, we wish to investigate the influence of the parameter α (or σ) and the number of training paths. To be specific, we wish to test the VAE for α ∈ {0.8, 0.9, 0.99} and with the number of training paths in {100, 250, 1000}. When varying α, we use 1,000 training paths, and when varying the number of training paths, we use α = 0.9.
For this analysis, we train 100 VAEs on independent sets of training samples. We can then compute the
Kolmogorov-Smirnov p-values for all time points and all VAEs. This allows us to display the average p-value across
time for the different VAE setups. Our rationale is that a higher average p-value implies a better average fit of the
model, i.e. VAE paths are closer to the training paths. Of course, it would be misleading to use the training paths for
the Kolmogorov-Smirnov test when varying the number of training samples since the test depends on the number
of samples. When varying the number of training paths, we, therefore, choose to test the VAE paths against 1,000
Black-Scholes paths that are independent of the training paths.
We should note that an analysis of the marginal distributions alone is not sufficient to ensure that the VAE paths
have similar multivariate distributions as the training paths. In the previous experiment, we looked at autocorrelations
of returns and absolute returns. However, we skip this analysis as it does not yield any interesting results.
The results of this experiment can be seen in figure 6.4 (a) and (b).
In figure 6.4 (a), we see the results of varying α. Here we observe that α = 0.8 and α = 0.9 both have average Kolmogorov-Smirnov p-values in the range 0.7 to 0.8 for all time points, whereas α = 0.99 has average p-values between 0.5 and 0.6. This shows that α should not be chosen too high (i.e. σ too low), because that emphasizes the reconstruction loss over the KL divergence loss, i.e. there is too little regularization. It cannot be seen in the figure, but we do not want to choose α too low either, as this would introduce too much noise in the VAE samples and shift the marginal distributions towards a normal distribution. Our choice of α = 0.9 therefore seems quite reasonable.
In figure 6.4 (b), we see the results of varying the number of training samples. We observe that the VAE generally
does a great job of capturing the marginal distributions of the Black-Scholes paths, even when only being trained on
100 paths. It is also clear from the 95% confidence bands that the average Kolmogorov-Smirnov p-value is higher
when the VAE is trained on more paths, which is not surprising. One could also argue that it is not very surprising
that the VAE performs well even when trained on 100 paths since the Black-Scholes model is incredibly simple
(especially regarding stationarity).
All in all, we are pretty pleased with the capabilities of the VAEs to capture characteristics of Black-Scholes
paths. However, we withhold final judgement until we have tested the VAE’s abilities as a data-generator for a deep
hedging model.
6.5 Cheating in the Heston model - capturing path dependency
We now move on to a more complicated problem (but not too complicated yet). Here, we choose to fit a VAE to paths of length n = 20 from a Heston model. The returns in a Heston model are not iid and depend on the instantaneous variance. In this experiment, we do not wish to generate paths with different initial instantaneous variances, but rather to capture the returns' interdependencies. We therefore choose to train the VAE on independent return-paths that all have the same initial instantaneous variance. Note that this is not realistic, as we usually only observe a single realized path with varying volatility across time. However, it is an interesting challenge for the VAE.
For this experiment, we choose a Heston model with high long-term and starting variance, high vol-of-vol and high mean-reversion. The parameters we use are
S(0) = 1
drift: µ = 0.05
ν(0) = 0.1
mean-reversion: κ = 5
long-term variance: θ = 0.1
vol-of-vol: σ = 1
correlation: ρ = −0.9.
We perform the experiment in a similar manner to the Black-Scholes experiment in the previous section:
- Sample 1,000 independent return-paths from the Heston model.
- Create a VAE with encoder and decoder with one hidden layer of 40 units, and choose the latent space to be in R^20. Remember n = 20.
- Train the VAE on the normalized return-paths for 1,500 epochs with a batch size of 128 and learning rates of 0.01, 0.0001 and 0.00001 for 500 epochs each.
- Sample 20,000 normalized return-paths from the trained VAE and convert them to return-paths and S-paths.
- Compare the VAE-generated paths to the training paths using the two-sample Kolmogorov-Smirnov test and autocorrelations.
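As an illustration of the first step, Heston return-path simulation under a full-truncation Euler scheme might look as follows (a sketch with our own names; we write xi for the vol-of-vol to avoid clashing with the σ of the VAE, and the scheme is our choice, not necessarily the one used in the thesis):

```python
import numpy as np

def simulate_heston_returns(M=1000, n=20, T=1/12, mu=0.05, v0=0.1,
                            kappa=5.0, theta=0.1, xi=1.0, rho=-0.9, seed=0):
    """Full-truncation Euler scheme for the Heston model; returns M
    independent log-return paths of length n."""
    dt = T / n
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((M, n))
    z2 = rho * z1 + np.sqrt(1 - rho**2) * rng.standard_normal((M, n))
    v = np.full(M, v0)
    log_returns = np.empty((M, n))
    for i in range(n):
        v_pos = np.maximum(v, 0.0)   # truncate negative variance at zero
        log_returns[:, i] = (mu - 0.5 * v_pos) * dt \
            + np.sqrt(v_pos * dt) * z1[:, i]
        v = v + kappa * (theta - v_pos) * dt + xi * np.sqrt(v_pos * dt) * z2[:, i]
    return log_returns

R = simulate_heston_returns(M=200)
print(R.shape)   # (200, 20)
```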
Figure 6.5: Standard deviation across time, Kolmogorov-Smirnov p-values and autocorrelation of paths from a VAE trained on Heston paths. (a) Kolmogorov-Smirnov p-values across time; (b) standard deviation of VAE and training paths across time; (c) autocorrelation of training and VAE absolute returns. The VAE is trained on 1,000 independent paths with α = 0.9 and no moment-regularization.
The results of this experiment can be seen in figure 6.5 (a)-(c).
In figure 6.5 (a), we see the Kolmogorov-Smirnov p-values across all time points. The VAE seems to perform well for the first few time points, but the quality quickly drops as time progresses. We get a hint of the reason in figure 6.5 (b). Here we see the standard deviations of the generated spots across all time points. It is obvious that the VAE has misjudged the variance of the returns, which amplifies the deviation in the standard deviation of the spots across time.
In figure 6.5 (c), we see the autocorrelations of the absolute returns. We are disappointed to see that the VAE has not captured any of the correlation in the absolute returns present in the Heston model. One could suspect that the ANNs are too small to represent the returns. However, the encoder E has 3,760 parameters and the decoder D has 2,540 parameters, which should be enough to represent more complicated functions.
Overall, we are pretty disappointed in the VAE’s performance. In the next section, we propose a possible solution
to these issues by introducing moment-regularization during training.
Fitting Heston with moment-regularization
In this section, we propose the use of moment-regularization to obtain better performance from the VAE in the Heston model (or, generally, in models with non-stationary returns). Our idea is to introduce regularization terms to the objective function in equation (16) that guide the VAE to produce samples that are more similar to the training paths when looking at moments and correlation. To do this, we introduce four regularization terms: one for the mean of the returns, one for the standard deviation of the returns, one for the correlation of returns and, lastly, one for the correlation of absolute returns.
The regularization terms are all based on samples from the VAE. That is, we first sample z ~ N(0, I_n) and then sample X|z ~ N(D(z), σ²I_n), where D is the decoder. Assume we have N samples in R^n from the VAE, {X̃_1, ..., X̃_N}, and training samples {X_1, ..., X_N}. The four terms are then computed as
1. Mean regularization:

(1/n) Σ_{i=1}^{n} | (1/N) Σ_{j=1}^{N} X_{j,i} − (1/N) Σ_{j=1}^{N} X̃_{j,i} |.
2. Standard deviation regularization:

(1/n) Σ_{i=1}^{n} | s_{X̃_i} − s_{X_i} |,

where s_{X_i} = sqrt( (1/(N−1)) Σ_{j=1}^{N} (X_{j,i} − X̄_i)² ) with X̄_i = (1/N) Σ_{j=1}^{N} X_{j,i}, and s_{X̃_i} is defined analogously for the X̃s.
3. Correlation of returns. We first define the sample correlation between dimension i and i + l (l is some lag) for the Xs:

ρ_X(i, l) := ( Σ_{j=1}^{N} (X_{j,i} − X̄_i)(X_{j,i+l} − X̄_{i+l}) ) / ( (N−1) s_{X_i} s_{X_{i+l}} ),

where s_{X_i} = sqrt( (1/(N−1)) Σ_{j=1}^{N} (X_{j,i} − X̄_i)² ) is the sample standard deviation of the i'th dimension of the Xs. The correlation regularization is then calculated as

Σ_{l=1}^{n} (1/l) · ( 1 / #{i : i ≥ 1, i + l ≤ n} ) Σ_{i : i ≥ 1, i + l ≤ n} | ρ_{X̃}(i, l) − ρ_X(i, l) |.

First, notice that we divide by l inside the first sum. This is done to place more emphasis on the correlations between returns with smaller lags. Also, notice that we apply the absolute operator inside the second sum. This is done to avoid the possibility of the model creating paths with correlations that perfectly offset each other.
4. The regularization term for the correlation of absolute returns is calculated like the above, but on {|X̃_1|, ..., |X̃_N|} and {|X_1|, ..., |X_N|}.
These four terms can then be subtracted from the objective function in equation (16). To adjust the emphasis on the moment-regularization terms, we scale them by a factor β (like we use α for balancing between the reconstruction and KL loss). The four moment-regularization terms should not add too much bias to the VAE, but simply guide the VAE during training to match the moments and correlations of the training samples.
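The four terms might be computed as follows (a NumPy sketch with our own names; note that the text sums lags l = 1, ..., n, but lag n has no valid index pairs, so the loop stops at n − 1):

```python
import numpy as np

def lagged_corr(P, i, l):
    # Sample correlation between dimensions i and i+l across the N paths.
    return np.corrcoef(P[:, i], P[:, i + l])[0, 1]

def moment_regularization(X, X_tilde):
    """The four moment-regularization terms: mean, standard deviation,
    correlation of returns, correlation of absolute returns; each compares
    VAE samples X_tilde to training samples X."""
    n = X.shape[1]
    mean_term = float(np.mean(np.abs(X.mean(axis=0) - X_tilde.mean(axis=0))))
    std_term = float(np.mean(np.abs(X_tilde.std(axis=0, ddof=1)
                                    - X.std(axis=0, ddof=1))))

    def corr_term(A, B):
        total = 0.0
        for l in range(1, n):   # lag n has no valid pairs
            diffs = [abs(lagged_corr(B, i, l) - lagged_corr(A, i, l))
                     for i in range(n - l)]
            total += np.mean(diffs) / l   # weight 1/l: emphasize small lags
        return float(total)

    return (mean_term, std_term, corr_term(X, X_tilde),
            corr_term(np.abs(X), np.abs(X_tilde)))

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))
terms = moment_regularization(X, X.copy())
print(terms[:2])   # (0.0, 0.0) for identical samples
```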
To test this regularization, we train 100 VAEs for each β in {0, 10, 100} with the same procedure as in the previous experiment. Note that β = 0 corresponds to no moment regularization. We then sample 20,000 paths from each VAE and calculate the average Kolmogorov-Smirnov p-value and autocorrelations over all VAEs for each β.
The results can be seen in figure 6.6 (a)-(c).
In figure 6.6 (a), we see the average Kolmogorov-Smirnov p-values across time for different values of β. It is clear that the superior choice of β, when it comes to marginal distributions, is β = 10. Both β = 0 (no regularization) and β = 100 give rise to decreasing p-values across time, which was also the case in the previous experiment without moment regularization. Moment regularization seems to work. However, it is clear that too much regularization can also ruin the marginal distributions, which makes sense since the distributions are not described by the mean, standard deviation and correlations alone.
Figure 6.6: Average Kolmogorov-Smirnov p-values and autocorrelations on returns and absolute returns for 100 VAEs trained on Heston paths with varying moment-regularization β, including 95% confidence bands in (a) and (b). (a) Average Kolmogorov-Smirnov p-values across time; (b) average autocorrelation of absolute returns; (c) average autocorrelation of returns. The VAEs are trained on 1,000 independent Heston paths with α = 0.9. The Kolmogorov-Smirnov p-values are calculated on 20,000 VAE paths and the training paths.
In figure 6.6 (b), we see the autocorrelations of the absolute returns. We observe that no level of β could produce models which, on average, have correlations matching the Heston model. However, we can observe that models trained with β = 100 produce paths that, on average, have correlations closest to the Heston model. Nevertheless, even though we have not reached a perfect match in correlation, we still see that moment regularization provides a significant improvement.
Lastly, we observe from figure 6.6 (c) that moment regularization has not introduced any unwanted correlation in the non-absolute returns. This is great considering the improvement in correlation of the absolute returns.
Overall, we conclude that moment regularization provides significant improvements to the fit of the VAE, but the high correlation of absolute returns is still quite difficult to capture, even with β = 100. VAEs should, in theory, be arbitrarily flexible. However, this experiment illustrates that training VAEs to capture complex structures can be difficult.
6.6 Conditioning on instantaneous variance in the Heston model
Up until now, we have only used VAEs. Remember that VAEs produce paths that are not conditioned on market states. Realistically, the volatility of a path will depend on the volatility before that path. This is also the case in the Heston model. In this section, we wish to utilize CVAEs to capture this dependency. This also implies that we can train our model on a single long observed path.
To make this experiment easier, we choose a Heston model with lower long-term variance and vol-of-vol. The parameters we use are
S(0) = 1
drift: µ = 0.05
ν(0) = 0.05 (not very important)
mean reversion: κ = 4
long term variance: θ = 0.05
vol-of-vol: σ = 0.25
correlation: ρ = 0.
We wish to train a CVAE to generate paths similar to those from the Heston model, but conditioned on an initial instantaneous variance ν(0), which is not necessarily the same as ν(0) = 0.05. To do this, we sample a single Heston path and divide it into paths of length n = 20 without overlap. Each path has a corresponding ν(t) (or ν(0)) that is the instantaneous variance at the beginning of the path. One might question whether it is realistic/sensible to assume that ν is observable. However, estimating ν, or letting the CVAE estimate ν, removes focus from the conditional part of the CVAE. We therefore assume that ν is known for this analysis.
The first experiment we perform has the following procedure:
- Sample one path from the Heston model with 1,000 · 20 = 20,000 returns. Divide the path into 1,000 non-overlapping return-paths with corresponding ν(0)s, which are the instantaneous variances at the beginning of each path.
- Create a CVAE with encoder and decoder with one hidden layer of 60 units, and choose the latent space to be in R^20. We also choose α = 0.9.
- Train the CVAE on the normalized return-paths, conditioned on the normalized instantaneous variances, for 1,500 epochs with a batch size of 128 and with learning rates of 0.01, 0.0001 and 0.00001 for 500 epochs each. The training is performed without any moment regularization, i.e. β = 0.
- Sample 20,000 CVAE paths for different instantaneous variances. We start with ν(0) = 0.05 and ν(0) = 0.0276, which corresponds to the 10% quantile of the training data.
Figure 6.7: Standard deviations and Kolmogorov-Smirnov p-values across time from a single CVAE trained on 1,000 Heston paths. (a) Kolmogorov-Smirnov p-values across time, ν(0) = 0.05; (b) standard deviation of VAE and training paths across time, ν(0) = 0.05; (c) Kolmogorov-Smirnov p-values across time, ν(0) = 0.0276; (d) standard deviation of VAE and training paths across time, ν(0) = 0.0276. For the simulated paths, we condition on ν(0) = 0.05 in (a)-(b) and ν(0) = 0.0276 (the 10% quantile) in (c)-(d). The Kolmogorov-Smirnov p-values are calculated on 20,000 sampled CVAE paths (for each ν(0)) and 1,000 Heston paths with corresponding ν(0), independent of the training paths.
- Compare the CVAE paths to 1,000 independent Heston paths sampled with corresponding ν(0). The comparison is performed with a two-sample Kolmogorov-Smirnov test for each time point and by looking at the standard deviations of the simulated S-paths across time. We will not analyze correlations, since fitting the marginal distributions is our primary concern, and it is not trivial in this case.
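The conditional sampling step can be sketched as follows, assuming the decoder takes the latent z concatenated with the (normalized) conditioning variable (the toy decoder is only a stand-in for a trained ANN, and all names are ours):

```python
import numpy as np

def sample_cvae_paths(decoder, y, n=20, n_samples=5, sigma=0.1, seed=0):
    """Conditional sampling sketch: sample z ~ N(0, I_n) and feed the
    decoder [z, y], where y is the conditioning variable, here nu(0)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, n))
    y_col = np.full((n_samples, 1), y)
    X = decoder(np.hstack([z, y_col])) \
        + sigma * rng.standard_normal((n_samples, n))
    return X   # normalized return-paths, to be denormalized as before

# Stand-in decoder: keeps only the latent part of its input.
toy_decoder = lambda inp: inp[:, :20]
X = sample_cvae_paths(toy_decoder, y=0.05)
print(X.shape)   # (5, 20)
```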
The results of the experiment can be seen in figure 6.7 (a)-(d).
In figure 6.7 (a), we see the Kolmogorov-Smirnov p-values across time for the CVAE paths generated with ν(0) = 0.05. We observe that the p-value is decreasing in time. This is a sign of accumulated mismatch between returns from the CVAE and those from a Heston model with ν(0) = 0.05. We get a clue to the poor fit in figure 6.7 (b), where we see the standard deviation of the S-paths across time. It is clear that the CVAE does not capture the correct volatility.
In figure 6.7 (c) and (d), we have repeated the same experiment with the same CVAE, but with ν(0) = 0.0276 (corresponding to the 10% quantile of the training data). The Kolmogorov-Smirnov p-values in figure 6.7 (c) are virtually all zero. This is also reflected in figure 6.7 (d), where we see the standard deviations of the simulated paths. It is clear that the CVAE does not capture the volatility at all when conditioned on a low initial instantaneous variance.
Overall, we are pretty disappointed with the results. The hope was that the CVAE could capture the volatility of the paths when starting from different instantaneous variances. This is also why we have not even considered the autocorrelation structure, which is less relevant given that the CVAE failed to capture the marginal distributions.
Like before, we wish to apply some regularization to the (conditional) moments. This is what we introduce/propose in the next section.
Regularization for conditional moments
In this section, we propose a way to apply regularization for conditional moments. We cannot use the moment regularization we used for the VAEs, since it does not guide the CVAE towards the correct conditional moments (only the unconditional moments). It is, however, not obvious how one should create such regularization.
Assume we have training samples {X_1, ..., X_N} and corresponding conditional variables {Y_1, ..., Y_N}. Assume now that we sample {X̃_1, ..., X̃_N} from a CVAE based on the Ys. The issue now is that we do not know the moments of the Xs and X̃s conditioned on the Ys. There are, of course, different ways of estimating the conditional moments. For the X̃s (the CVAE samples), we could simply create a larger sample and average the results. However, this would be slow, as it would have to be done at every gradient step. For the Xs (our training samples), we could perform a regression to estimate the conditional moments. This only has to be done once, but it would introduce more complexity than we would like. The simplest solution is to estimate the conditional moments E[X̃_{j,i}|Y_j] and E[X_{j,i}|Y_j] as X̃_{j,i} and X_{j,i}, respectively. We can do something similar to estimate the conditional second moments.
The idea is now to introduce a regularization term that guides the CVAE towards a solution where E[X̃_{j,i}|Y_j] is close to E[X_{j,i}|Y_j] for all j, i, and similarly for the second moment. We do this by introducing two regularization terms:
1. Conditional mean regularization loss:

(1/N) Σ_{j=1}^{N} Σ_{i=1}^{n} (X̃_{j,i} − X_{j,i})².

2. Conditional second moment regularization loss:

(1/N) Σ_{j=1}^{N} Σ_{i=1}^{n} (X̃²_{j,i} − X²_{j,i})².
These regularization terms are similar to those from the previous section. However, in this case, we apply the square operator inside the second sum (note that we used the absolute operator in the previous section; this is a somewhat arbitrary choice). The regularization terms are subtracted from the objective function for the CVAEs, scaled by a factor γ, which will allow us to test their effectiveness.
The fear with the above regularization is that it guides the CVAE towards producing deterministic paths given Y, due to the regression-like nature of the terms. Still, we hope that combining the first- and second-moment terms with the original objective function counteracts this.
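The two terms are straightforward to compute (a NumPy sketch with our own names; the pairing of X̃_j with X_j reflects that both are conditioned on the same Y_j):

```python
import numpy as np

def conditional_moment_regularization(X, X_tilde):
    """The two conditional-moment terms from the text: squared differences
    of paired first and second moments, averaged over the N sample pairs."""
    mean_term = float(np.mean(np.sum((X_tilde - X)**2, axis=1)))
    second_term = float(np.mean(np.sum((X_tilde**2 - X**2)**2, axis=1)))
    return mean_term, second_term

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
print(conditional_moment_regularization(X, X.copy()))   # (0.0, 0.0)
```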
To test these regularization terms, we train 50 CVAEs with the same setup as in the experiment in figure 6.7 for each γ in {0, 1, 10}. Note that γ = 0 corresponds to no regularization. Once we have trained 50 CVAEs for each of the different γs, we wish to test the CVAEs conditioned on different instantaneous variances. To do this, we find, for each CVAE, the instantaneous variances corresponding to the 10%, 15%, ..., 90% quantiles of the training data. Note that the 10% and 90% quantiles of ν(0) are approximately 0.0270 and 0.0765 on average, respectively. For each quantile of ν(0), we sample 20,000 CVAE paths and 1,000 independent Heston paths conditioned on the specific quantile of ν(0). This is done for every CVAE. We are then able to find the average Kolmogorov-Smirnov p-values over CVAEs and time for the different quantiles of ν(0) and choices of γ.
Figure 6.8: Results from 50 CVAEs, each trained on 1,000 Heston paths conditioned on instantaneous variance, for varying regularization γ. (a) displays average Kolmogorov-Smirnov p-values over different quantiles of ν(0) (calculated from the training paths); the 10% and 90% quantiles of ν(0) are approximately 0.027 and 0.077, respectively. The Kolmogorov-Smirnov p-values are calculated on 20,000 paths sampled from the CVAEs and 1,000 independent Heston paths with corresponding ν(0). (b) and (c) show average autocorrelations of absolute returns and returns, respectively, calculated on samples from the CVAEs and the Heston model with ν(0) at its 60% quantile (ν(0) ≈ 0.0525 on average).
As a sanity check, we also look at the autocorrelations for the 60% quantile of ν(0). The results of this experiment can be seen in figure 6.8 (a)-(c).
In figure 6.8 (a), we see the average Kolmogorov-Smirnov p-values across different values of ν(0) and choices of γ. We observe that without moment regularization (γ = 0), we obtain a good fit for ν(0)-quantiles between 40% and 80%. However, the models are not capable of producing Heston-like samples with extreme initial instantaneous variances. For γ = 1, we see significantly better performance over a wider range of initial instantaneous variances. However, we see that too much regularization (γ = 10) hurts performance. Focusing on marginal distributions, it is clear that some regularization is beneficial, especially if we want the CVAE to produce samples for a wider range of ν(0).
In figure 6.8 (b), we see the autocorrelations for different choices of γ when sampling conditioned on the 60% quantile of ν(0) (ν(0) ≈ 0.0525 on average). We observe that no CVAEs have, on average, managed to reproduce the positive, downward-sloping autocorrelations from the Heston model. This is again disappointing, but from figure 6.8 (b) and (c), we see that the CVAEs generally produce uncorrelated return-paths (also in absolute values). One could argue that this is the next best result if the CVAEs cannot match the Heston model.
One could try adding more regularization or optimizing training to produce CVAEs capable of producing
absolute returns with positive correlation. We have tried various methods and regularization terms. However, the
problem is unfortunately not easily solved. This is an obvious downside of VAEs/CVAEs (at least with our setup).
All in all, we managed to find a regularization method that improved the training of CVAEs, such that the
CVAEs could match the marginal distributions of Heston paths conditioned on some initial instantaneous variance.
Unfortunately, we could not replicate the positive absolute correlation.
It is worth remembering that we obtained the above results using a combination of parameters, which should be
relatively easy for the model to handle. This shows that sampling 20-dimensional return-paths is a complex problem,
and using our setup in practice might not be feasible (without clever modifications).
6.7 Overlapping training paths (is it possible?)
This section addresses the elephant in the room, which is the amount of data we use to train the VAEs/CVAEs. For
the CVAEs in the previous experiments, we utilized 1,000 non-overlapping return-paths of length
n= 20
. If we
assume (for simplicity) that there are precisely 20 days in a month, then 1,000 non-overlapping return-paths amount
to more than 83 years of data. That is a staggering amount of data for only 1,000 return-paths. The idea of this
section is to investigate the possibility of training the CVAEs on overlapping return-paths.
If we have N returns in total and we wish to create return-paths of length n, it is possible to create ⌊N/n⌋ non-overlapping return-paths. If we choose to overlap the paths, however, we get N − n + 1 return-paths. This is tempting, as it almost increases the size of our training data by a factor of n.
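As a small illustration (our own sketch, not code from the thesis), the two ways of cutting a single return series into return-paths can be written as a sliding window; the function name and interface are our own:

```python
import numpy as np

def make_return_paths(returns, n, overlapping=True):
    """Cut a single series of N returns into return-paths of length n.

    Overlapping windows yield N - n + 1 paths; non-overlapping
    windows yield floor(N / n) paths.
    """
    returns = np.asarray(returns)
    N = len(returns)
    if overlapping:
        starts = range(N - n + 1)           # every possible window
    else:
        starts = range(0, N - n + 1, n)     # disjoint windows
    return np.array([returns[s:s + n] for s in starts])

rets = np.random.default_rng(0).normal(size=1000)
print(make_return_paths(rets, 20, overlapping=False).shape)  # (50, 20)
print(make_return_paths(rets, 20, overlapping=True).shape)   # (981, 20)
```

With N = 1,000 daily returns and n = 20, this reproduces the counts used later in section 7.1: 50 non-overlapping versus 981 overlapping return-paths.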
The fear of using overlapping paths is that it would yield some unwanted dependencies in our CVAE or a weird
misfit to the base model. Remember that we assumed that the sample return-paths are independent. In machine
learning, it is, however, quite common to train models in this way. When training on images, it is common to create
dozens of variations of every image to increase the size of the available training set (through rotation, flips etc.).
The same logic could also work for return-paths. For example, suppose a large negative return was experienced on
the first day of a month. In that case, it might be beneficial to show the CVAE multiple scenarios where that event
occurred at different times of the month. This makes sense since we would typically assume that the exact day of the
month does not influence returns.
We test CVAEs that are trained on overlapping paths by creating a similar experiment to that in the previous
section. We limit the CVAEs to 20 years of data, which corresponds to 240 and 4,781 return-paths for the
non-overlapping data sets and overlapping data sets, respectively. We then train 50 CVAEs on 240 non-overlapping
return-paths and 50 CVAEs on 4,781 overlapping return-paths conditioned on the initial instantaneous variances. All
paths are sampled from a Heston model with similar parameters to the one used in the previous section. Moreover,
all CVAEs are trained with α = 0.9 and conditional moment regularization (γ = 1) for 1,500 epochs with batch size 128 and learning rates 0.01, 0.0001 and 0.00001 for 500 epochs each. We then sample 20,000 paths from each CVAE and for each instantaneous variance quantile (based on the training data) in {10%, 15%, ..., 90%}, corresponding to ν(0) from 0.0274 to 0.0751 on average. Each set of 20,000 CVAE paths is then compared, using a two-sample Kolmogorov-Smirnov test, to 1,000 independent Heston paths sampled with the corresponding ν(0). The results can be seen in figure 6.9 (a)-(c).
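For concreteness, the marginal comparison can be sketched with SciPy's two-sample Kolmogorov-Smirnov test, applied per time step and averaged; this is our own minimal reconstruction of the procedure, shown here on synthetic stand-in data:

```python
import numpy as np
from scipy.stats import ks_2samp

def avg_ks_pvalue(generated, reference):
    """Average the two-sample KS p-value over the marginal at each time step.

    generated: (20000, 20) sampled return-paths; reference: (1000, 20)
    model return-paths. Each column is one time step.
    """
    pvals = [ks_2samp(generated[:, t], reference[:, t]).pvalue
             for t in range(generated.shape[1])]
    return float(np.mean(pvals))

rng = np.random.default_rng(1)
gen = rng.normal(0.0, 0.02, size=(20_000, 20))  # stand-in for CVAE samples
ref = rng.normal(0.0, 0.02, size=(1_000, 20))   # stand-in for Heston returns
print(avg_ks_pvalue(gen, ref))
```

A high average p-value indicates that the marginal distributions of the generated paths are statistically indistinguishable from the reference model at each time step.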
In figure 6.9 (a), we see the average Kolmogorov-Smirnov p-values (averaged over CVAEs and time) across quantiles of ν(0). It is clear that the CVAEs trained on overlapping return-paths on average match the marginal distributions of the Heston model significantly better, across all ν(0)-quantiles, than CVAEs trained on non-overlapping return-paths. Remember again that the CVAEs trained on non-overlapping paths are trained on significantly fewer paths.
(a) Avg. Kolmogorov-Smirnov p-values across quantiles of ν(0). (b) Avg. autocorrelation of absolute returns for ν(0) at its 60% quantile. (c) Avg. autocorrelation of returns for ν(0) at its 60% quantile.

Figure 6.9: Results from 50 CVAEs trained on overlapping and 50 CVAEs trained on non-overlapping Heston paths, conditioned on instantaneous variance. (a) displays average Kolmogorov-Smirnov p-values over different quantiles of ν(0). The 10% and 90% quantiles of ν(0) are approximately 0.027 and 0.077, respectively. The Kolmogorov-Smirnov p-values are calculated on 20,000 paths sampled from the CVAEs and 1,000 independent Heston paths with corresponding ν(0). (b) and (c) show average autocorrelations calculated on samples from the CVAEs and Heston model with ν(0) at its 60% quantile (ν(0) ≈ 0.0525).
In figure 6.9 (b), we see that for the 60% ν(0)-quantile, neither CVAEs trained on overlapping paths nor CVAEs trained on non-overlapping paths have captured the positive dependence of the Heston paths. However, we do notice that CVAEs trained on overlapping paths have, on average, a slightly negative autocorrelation of absolute returns, which could be concerning. However, the magnitude is on the smaller side.
Lastly, in figure 6.9 (c), we see that neither CVAEs trained on overlapping paths nor CVAEs trained on non-overlapping paths show noticeable correlations between non-absolute returns, which is in line with the Heston model.
Overall, we are pretty impressed with the performance of the CVAEs trained on overlapping paths, even though the CVAEs, on average, had a slight negative autocorrelation of absolute returns. It is, however, clear that non-overlapping paths require too much data when using return-paths of length n = 20.
We should mention that these issues could possibly be resolved by training the CVAEs to sample shorter
return-paths. Longer paths could then be obtained by connecting multiple shorter paths. The issue with this strategy
is that we have to choose the conditional variables to allow this sort of sampling. However, this is far from trivial,
which is why we choose not to explore that option further.
7 Data-Driven Hedge Experiments
This section aims to combine the deep hedging approach from section 3 and the market generators from section
6. We hope to find acceptable hedging strategies by only observing a (relatively) small number of paths from the
underlying asset. The ideal experiment would be to learn hedging strategies from actual stock prices. In this thesis,
however, we stick to the assumption that the underlying assets follow either a Black-Scholes or Heston model. We do
this since it is easier to evaluate our methods and because the market generators might not be advanced enough to
deal with actual data yet.
Another purpose of these experiments is to test the VAEs and CVAEs further. In section 6, we only tested the marginal distributions and looked at the correlations of the sampled S-paths and returns. Testing the VAEs and CVAEs as market generators for deep hedging models will, therefore, also act as a test of the joint distributions learned by the VAEs and CVAEs.
7.1 VAE powered hedge experiments - Black-Scholes
In this section, we wish to perform hedge experiments involving hedging a simple ATM call option with maturity T = 1/12 (one month). One might think that call options are too simple, as the price of a call option only depends on the distribution of S(T) under Q, but remember that hedging depends on the entire joint distribution of the path under the P-measure.
We assume daily rebalancing with 20 hedge points including t = 0, but not t = T. For simplicity, in these first experiments we assume that the underlying asset follows a Black-Scholes model with parameters S(0) = 1, µ = 0.05, σ = 0.3, and we assume a fixed interest rate r = 0.01 and no transaction costs.
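Purely as an illustrative sketch (not the thesis code), the path simulation and the delta hedging benchmark under these parameters can be written as follows; the helper names are ours, and the grid matches the 20 hedge points plus maturity:

```python
import numpy as np
from math import log, sqrt, exp
from statistics import NormalDist

S0, mu, sigma, r, K, T, n = 1.0, 0.05, 0.3, 0.01, 1.0, 1 / 12, 20

def bs_delta(S, tau):
    """Black-Scholes delta of a call with time to maturity tau."""
    d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * sqrt(tau))
    return NormalDist().cdf(d1)

def delta_hedge_pnl(paths, price0):
    """Terminal hedge error from daily delta hedging along simulated paths."""
    m = paths.shape[0]
    dt = T / n
    value = np.full(m, price0)   # self-financing portfolio value
    holdings = np.zeros(m)
    for i in range(n):           # rebalance at t_0, ..., t_{n-1}, not at T
        tau = T - i * dt
        new = np.array([bs_delta(s, tau) for s in paths[:, i]])
        cash = value - holdings * paths[:, i]     # current bank account
        cash -= (new - holdings) * paths[:, i]    # pay for the rebalancing
        holdings = new
        value = cash * exp(r * dt) + holdings * paths[:, i + 1]
    return value - np.maximum(paths[:, -1] - K, 0.0)  # PnL vs call payoff

# simulate P-measure Black-Scholes paths on the 21 grid points
rng = np.random.default_rng(0)
dt = T / n
z = rng.standard_normal((10_000, n))
logS = np.cumsum((mu - 0.5 * sigma ** 2) * dt + sigma * sqrt(dt) * z, axis=1)
paths = S0 * np.exp(np.hstack([np.zeros((10_000, 1)), logS]))

pnl = delta_hedge_pnl(paths, 0.03494)  # initialized at the true option price
print(round(float(np.mean(pnl)), 4), round(float(np.std(pnl)), 4))
```

With these parameters, the PnL mean should be close to zero and the PnL standard deviation should land in the same ballpark as the analytical column of table 7.1 below (around 0.0065).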
In the first experiment, we wish to test deep hedging models that utilize VAEs trained on 100, 250 and 1,000 paths. Each deep hedging model is trained with 2^18 independent VAE paths. We use two benchmarks for this experiment. The first is a deep hedging model that is trained on 2^18 sample paths from the actual underlying model. The second is a model that utilizes standard delta hedging. The experiment is described in detail below.

For each choice of the number of training paths {100, 250, 1,000}, we train 10 VAEs. The training paths are all sampled independently from the actual Black-Scholes model. All VAEs are trained with α = 0.9 and β = 10. The encoders and decoders, in each VAE, have one hidden layer with 40 units.

We train a deep hedging model for each of the 30 VAEs (10 for each number of training samples). Each deep hedging model is trained on 2^18 VAE paths. The ANNs, in the deep hedging models, have four layers with five units each. Remember that a deep hedging model contains an ANN for every trading decision.

We also train 10 deep hedging models on 2^18 samples from the actual underlying model.
                         VAE Deep Hedge               Deep Hedge   Analytical
#VAE training samples    100       250       1,000
Avg. Abs PnL     Mean    0.0064    0.0056    0.0054   0.0058       0.0049
                 Std     0.00042   0.00006   0.00006  0.00002      0
PnL Std          Mean    0.0082    0.0072    0.0070   0.0073       0.0065
                 Std     0.00058   0.00007   0.00005  0.00002      0
CVaR_0.95        Mean    0.0176    0.0148    0.0141   0.0136       0.0153
                 Std     0.00174   0.00023   0.00013  0.00001      0
Turnover         Mean    2.2157    1.9261    1.8542   1.7635       1.9014
                 Std     0.10306   0.04393   0.02071  0.00269      0

Table 7.1: Results from a hedge experiment with an ATM call option in a Black-Scholes model without transaction costs. Each column represents the average results (including standard deviation) of 10 hedging models. In the first three columns, deep hedging models are trained with 2^18 VAE paths. The VAEs are trained on a varying number of training paths. The fourth column represents deep hedging models trained on actual Black-Scholes paths, and the fifth column represents the delta hedging strategy. The results are derived from a hedge experiment with 100,000 independent paths from the actual Black-Scholes model.
We then perform a hedge experiment on 100,000 independent paths. In this experiment, we test all deep
hedging models and a delta hedging approach. Furthermore, all portfolios are initialized with a portfolio
value of 0.03494, corresponding to the actual option price.
The results of the experiment can be seen in table 7.1 and figure 7.1.
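The performance measures reported in the tables (average absolute PnL, PnL standard deviation, empirical CVaR at the 95% level, and turnover) can be computed from simulated hedge results along these lines; this is our own sketch, and in particular the turnover definition (total absolute change in holdings, including the initial purchase) is an assumption:

```python
import numpy as np

def hedge_metrics(pnl, holdings, alpha=0.95):
    """Summary statistics for a hedge experiment.

    pnl:      (m,) terminal hedge errors
    holdings: (m, n) holdings in the underlying at the n hedge points
    """
    loss = -np.asarray(pnl)
    var = np.quantile(loss, alpha)
    cvar = loss[loss >= var].mean()  # expected loss in the worst (1-alpha) tail
    trades = np.diff(holdings, axis=1, prepend=0.0)  # includes initial purchase
    turnover = np.abs(trades).sum(axis=1).mean()
    return {"avg_abs_pnl": float(np.abs(pnl).mean()),
            "pnl_std": float(np.std(pnl)),
            "cvar": float(cvar),
            "turnover": float(turnover)}

# synthetic stand-in data, purely to show the interface
rng = np.random.default_rng(2)
metrics = hedge_metrics(rng.normal(0.0, 0.007, 100_000),
                        rng.uniform(0.0, 1.0, (100_000, 20)))
print(metrics)
```

Note that for roughly Gaussian PnL, the empirical CVaR at 95% comes out at about twice the PnL standard deviation, which is consistent with the ratios seen in the tables.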
In table 7.1, we observe that the deep hedging models trained on actual Black-Scholes paths generally performed the best in terms of CVaR. This is not surprising, as these models are trained on perfectly distributed data. We
also observe that the deep hedging models trained on actual Black-Scholes paths are very similar as the standard
deviation across all performance measures is relatively low. This suggests that the deep hedging model is relatively
stable, at least in this simple experiment.
Looking at the VAE powered models, we observe that the performance increases with the number of training
paths. When looking at empirical CVaR, we see that even deep hedging models using VAEs trained on 250 samples
(with an average CVaR of 0.0148) outperformed, on average, the analytical (delta hedging) approach (with a CVaR of
0.0153). However, it is also clear that the deep hedging models using VAEs trained on 100 sample paths (with an average CVaR of 0.0176) did not perform as well as the analytical approach. They also exhibited higher variance in performance, which is expected, as the VAEs may differ significantly due to the few training samples.
It is also interesting that the turnover is significantly higher for deep hedging models using VAEs trained on fewer
paths. The issue is visualized in figure 7.1. Here we see a comparison of holdings in the underlying asset between
the analytical approach, a deep hedging model trained on actual Black-Scholes paths and a deep hedging model
trained on VAE paths (from a VAE trained on 100 Black-Scholes paths). The deep hedging model trained on VAE
paths approximately captures the optimal strategy in the two test samples, but it lacks stability and precision. The
issue could stem from overfitting from the deep hedging model. However, it is hard to pinpoint the reason without
further analysis. One possible test would be to introduce some regularization on the deep hedging models that discourages overfitting. One might also suspect that these stability issues would be partially alleviated by introducing transaction costs.
All this suggests that one must be extremely careful when using VAEs combined with the deep hedging approach
as instabilities may arise in the hedging strategy.
Stability of VAEs vs deep hedging models
In the next experiment, we would like to get a sense of the stability of the deep hedging model compared to the VAE.
To do this, we repeat the previous experiment, but this time we train 11 VAEs on the same 250 training samples. For each of the first 10 VAEs, we train a single deep hedging model, and for the last VAE, we train 10 separate deep hedging models. We then test the deep hedging models on 100,000 independent Black-Scholes paths. The results of this experiment can be seen in table 7.2.

(a) Sample 1. (b) Sample 2.

Figure 7.1: Holdings in S across time for two different test sample paths. The experiment involves hedging an ATM call option in a Black-Scholes model without transaction costs. The VAE behind ANN CVaR MG is trained on 100 independent Black-Scholes paths.

                     Same VAE   Different VAEs (same data)
Avg. Abs PnL  Mean   0.0057     0.0057
              Std    0.00001    0.00006
PnL Std       Mean   0.0073     0.0073
              Std    0.00001    0.00006
CVaR_0.95     Mean   0.0150     0.0152
              Std    0.00004    0.00016
Turnover      Mean   1.9831     1.9857
              Std    0.00555    0.03961

Table 7.2: Results from a hedge experiment with an ATM call option in a Black-Scholes model without transaction costs. Each column represents the average results (including standard deviation) of 10 deep hedging models. In the first column, deep hedging models are trained with 2^18 samples from the same VAE trained on 250 paths. The second column represents deep hedging models trained on paths from different VAEs (also trained on 250 paths). The results are derived from a hedge experiment with 100,000 independent paths from the actual Black-Scholes model.
In table 7.2, we observe that deep hedging models trained on the same VAE have a significantly lower variance across all performance measures (avg. abs. PnL, CVaR etc.). This difference suggests that the deep hedging models are pretty stable and that performance depends heavily on the samples generated by the VAEs (at least in this experiment). Comparing the results to those from table 7.1 (with 250 training paths), we see that training VAEs on different sets of sample paths does not add much extra variance compared to VAEs trained on the same set of sample paths. To be specific, we see this in the second column of both table 7.1 and 7.2: the standard deviation of CVaRs is 0.00016 vs 0.00023, and the standard deviation of turnover is 0.03961 vs 0.04393. Remember that the standard deviation is calculated over the performance of 10 deep hedging models. Note that these results may depend heavily on the number of paths used to train the VAEs (250 in this case). Still, this suggests that the VAEs' training procedure could be improved. However, we must remember that training the VAEs includes simulation in each gradient step, which naturally introduces variance in the models.
#VAE                    50                 981             250                4,981
training samples        (non-overlapping)  (overlapping)   (non-overlapping)  (overlapping)
Avg. Abs PnL    Mean    0.0154             0.0055          0.0056             0.0054
                Std     0.00748            0.00019         0.00006            0.00006
PnL Std         Mean    0.0201             0.0071          0.0072             0.0069
                Std     0.00918            0.00026         0.00007            0.00006
CVaR_0.95       Mean    0.0473             0.0145          0.0148             0.0139
                Std     0.01981            0.00074         0.00023            0.00008
Turnover        Mean    4.8062             1.8258          1.9261             1.8205
                Std     2.07667            0.02965         0.04393            0.01277

Table 7.3: Results from a hedge experiment with an ATM call option in a Black-Scholes model without transaction costs. Each column represents the average results (including standard deviation) of 10 deep hedging models. In the first and third columns, deep hedging models are trained with 2^18 samples from VAEs trained on 50 and 250 non-overlapping paths, respectively. The second and fourth columns represent deep hedging models trained on paths from VAEs trained on 981 and 4,981 overlapping paths, respectively. The results are derived from a hedge experiment with 100,000 independent paths from the actual Black-Scholes model.
Using overlapping paths to train the VAE
In this section, we wish to test the effectiveness of using overlapping paths to train the VAEs. In a previous experiment
(see section 6.7), we noticed that using overlapping paths was quite effective even in the Heston model, at least when
testing marginal distributions and comparing correlations of returns.
In another experiment (see table 7.1), we saw that deep hedging models, based on VAEs trained on 250 independent (i.e. non-overlapping) Black-Scholes paths, were comparable, in terms of CVaR, to the delta hedging strategy. However, 250 non-overlapping return-paths require 250 months of data, which is more than 20 years. If we had used overlapping return-paths, we could get 4,981 (= 250 · 20 − 20 + 1) return-paths. Another example: if we use only 50 non-overlapping return-paths, corresponding to just over four years of data, we could get 981 overlapping return-paths instead. We, therefore, wish to investigate whether it is worth using the overlapping return-paths.
For this experiment, we train 10 VAEs on 981 overlapping return-paths, 10 VAEs on 4,981 overlapping
return-paths, 10 VAEs on 50 non-overlapping return-paths and 10 VAEs on 250 non-overlapping return-paths.
We then train a deep hedging model for each VAE using 2^18 paths. The results of this experiment can be seen in table 7.3.
From table 7.3, we observe that using overlapping return-paths to train the VAEs provides a significant performance boost. It is clear that 50 non-overlapping training paths are not enough to train a useful VAE. The average CVaR of 0.0473, obtained using VAEs trained on 50 non-overlapping training paths, is more than three times larger than the average CVaR of 0.0145 obtained using VAEs trained on 981 overlapping paths. A similar difference exists for the turnover. Moreover, we observe that deep hedging models using VAEs trained on 981 overlapping paths are comparable to deep hedging models using VAEs trained on 250 non-overlapping paths (which use five times more data). Lastly, we observe that the deep hedging models using VAEs trained on 4,981 overlapping paths perform the best (with an average CVaR of 0.0139) and come pretty close to the deep hedging models trained on actual Black-Scholes paths from table 7.1.
All in all, we conclude that using overlapping paths is much preferred, at least for the current hedging problem.
Hedging a down-and-out call option
We have now seen that deep hedging models using VAEs can successfully learn hedging strategies for call options. It, therefore, seems like the VAEs are (reasonably) proficient in capturing the distributional properties of Black-Scholes paths. To test this further, we wish to conduct a hedge experiment with an ATM down-and-out call option, like in section 5.3. We assume that the down-and-out call option has maturity T = 1/12 and a barrier at L = 0.95. Like with the simple call option, we assume daily rebalancing (n = 20), and we also assume that the barrier is only monitored at the 20 hedge points.

                         MG Hedge                                             BS ANN    Analytical
#VAE                     250                1,000              4,981
training samples         (non-overlapping)  (non-overlapping)  (overlapping)
Avg. abs. PnL    Mean    0.0062             0.0061             0.0061         0.0064    0.0050
                 Std     0.00018            0.00013            0.000143       0.00019   0
PnL Std          Mean    0.0094             0.0093             0.0093         0.0091    0.0062
                 Std     0.00037            0.00023            0.000215       0.00013   0
CVaR_0.95        Mean    0.0162             0.0151             0.0148         0.0144    0.0164
                 Std     0.00042            0.00012            0.000080       0.00003   0
Turnover         Mean    1.8645             1.7458             1.6795         1.6388    1.6194
                 Std     0.03595            0.02072            0.011673       0.00334   0

Table 7.4: Results from a hedge experiment with an ATM down-and-out call option in a Black-Scholes model without transaction costs. Each column represents the average results (including standard deviation) of 10 hedging models. In the first three columns, deep hedging models are trained with 2^18 samples from VAEs that are trained on a varying number of training paths (in the third column, all VAEs are trained on overlapping paths). The fourth column represents deep hedging models trained on actual Black-Scholes paths, and the fifth column represents the delta hedging strategy. The results are derived from a hedge experiment with 100,000 independent paths from the actual Black-Scholes model.
In section 5.3, we saw that multiple different architectures allowed a deep hedging model to hedge a down-and-out
call option. The best method required giving the ANN (representing the current trading decision) the minimum
value of the underlying up until that point. We choose to use this method since it focuses on the distribution of the
minimum of the underlying asset. This is desirable, as we wish to test the joint distribution of the VAEs' paths.
To perform this experiment, we train 10 VAEs on 250 non-overlapping Black-Scholes paths, 10 VAEs on 1,000 non-overlapping independent Black-Scholes paths and 10 VAEs on 4,981 overlapping Black-Scholes paths (corresponding to 250 non-overlapping paths). For each VAE, we train a deep hedging model on 2^18 paths from that VAE. For comparison, we also train 10 deep hedging models, each on 2^18 paths from the actual Black-Scholes model. To test the models, we perform a hedge experiment with 100,000 independent paths from the actual Black-Scholes model. The experiment also includes a strategy using delta hedging for the continuous down-and-out call option. All hedging strategies have an initial portfolio value of 0.02989, which corresponds to the price of a continuous down-and-out call option. The results of this experiment can be seen in table 7.4.
In table 7.4, we observe that the deep hedging models trained on VAE paths perform pretty well. Looking at average empirical CVaR, we see that performance increases with the number of paths used for training the VAEs. This is expected. However, we notice that the deep hedging models using VAEs trained on overlapping paths performed the best, with an average CVaR of 0.0148. For comparison, deep hedging models trained on actual Black-Scholes paths obtained an average CVaR of 0.0144. This is (again) a clear vindication of training VAEs on overlapping paths. It would be reasonable to fear that the VAEs would inherit some undesired dependence from training on overlapping paths. However, this does not seem to be the case, even when hedging down-and-out call options.
We also observe that deep hedging models using VAEs trained on 250 non-overlapping paths are (with an average CVaR of 0.0162) comparable to the delta hedging approach (with an average CVaR of 0.0164). This is in line with previous experiments.
One might be concerned about the difference between the deep hedging and delta hedging models when it comes to the average absolute PnL and PnL standard deviation. However, in section 5.3, we saw that the deep hedging models' optimization over CVaR likely explains this difference.
Overall, we are pretty pleased with the results, as they confirm that combining VAEs and deep hedging models works even for mildly path-dependent options, at least in the Black-Scholes model.
Subconclusion for hedging in Black-Scholes
Overall, we are pretty pleased with the results. From the previous experiments, we have seen that it is (indeed)
possible to learn a hedging strategy for a call option only by observing a relatively small number of paths from the
actual model. This is especially impressive since our models (VAE and deep hedging model) do not depend on
any knowledge of the model for the underlying asset and its relation to a call option. However, we do exploit some
simplicities of the Black-Scholes model. We also saw that most of the variation in the hedging strategies came from
the training of the VAEs, which was pretty reasonable, as training of the VAEs required random simulation in each
gradient step. Lastly, we saw that using overlapping paths could massively boost the performance of the hedging
strategies (despite concerns regarding dependence) as it allowed a 20-fold increase in training data.
However, before we celebrate these results too much, we have to remember that all these results assumed a
simple Black-Scholes model as the underlying model for the asset. We do, therefore, not automatically expect this
method to work with more complicated models or real data.
7.2 CVAE powered hedge experiments - Heston
In this section, we wish to perform hedge experiments in a Heston model. The idea is to train CVAEs on Heston
paths conditioned on the corresponding instantaneous variances. The hope is that the CVAEs can produce new paths
conditioned on some arbitrary instantaneous variance. This is a more realistic framework since it allows us to train
the CVAE on a single observed path (with varying volatility across time) and simulate new paths based on the current
level of volatility.
However, we saw in section 6.5 and 6.6 that training VAEs and CVAEs on Heston paths was difficult. We noticed
that VAEs and CVAEs struggled to capture the positive correlation between absolute returns. We also observed that
the performance of the CVAEs decreased when conditioned on extreme instantaneous variances, compared to those
used for training. We saw that moment-regularization could improve performance quite effectively. However, the
VAEs and CVAEs still struggled compared to the Black-Scholes model. It is, however, not obvious how performance,
as measured in section 6, translates to hedge performance. This is what we wish to address in this section.
For this experiment, we assume that the underlying asset follows a Heston model with parameters similar to those used in section 6.6:

S(0) = 1
drift: µ = 0.05
ν(0) = 0.05
mean reversion: κ = 4
long-term variance: θ = 0.05
vol-of-vol: σ = 0.25
correlation: ρ = 0.
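For reference, Heston paths with parameters of this kind can be sampled with a full-truncation Euler scheme along the following lines; this is our own sketch (the thesis may use a different discretization), and the correlation `rho` is a placeholder set to zero for illustration:

```python
import numpy as np

def simulate_heston(m, n=20, T=1/12, S0=1.0, mu=0.05, v0=0.05,
                    kappa=4.0, theta=0.05, xi=0.25, rho=0.0, seed=0):
    """Sample m Heston paths on n steps with a full-truncation Euler scheme.

    dS = mu*S dt + sqrt(v)*S dW1,  dv = kappa*(theta - v) dt + xi*sqrt(v) dW2,
    corr(dW1, dW2) = rho.  rho = 0 here is purely illustrative.
    """
    rng = np.random.default_rng(seed)
    dt = T / n
    S = np.full(m, S0)
    v = np.full(m, v0)
    out = [S.copy()]
    for _ in range(n):
        z1 = rng.standard_normal(m)
        z2 = rho * z1 + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(m)
        vp = np.maximum(v, 0.0)                       # full truncation
        S = S * np.exp((mu - 0.5 * vp) * dt + np.sqrt(vp * dt) * z1)
        v = v + kappa * (theta - vp) * dt + xi * np.sqrt(vp * dt) * z2
        out.append(S.copy())
    return np.stack(out, axis=1)                      # shape (m, n + 1)

paths = simulate_heston(1000)
print(paths.shape)  # (1000, 21)
```

The full truncation (clipping the variance at zero inside the drift and diffusion terms) keeps the scheme well defined even when the Euler step pushes the variance negative.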
Recall from section 5.4 that there were different architectures and approaches to consider when creating a deep hedging model in the Heston model. We saw that providing the ANNs (representing each trading decision) with the current instantaneous variance slightly improved hedge performance. However, in our setup, we cannot create CVAEs that generate joint paths for the underlying asset and its variance process. We, therefore, choose to utilize deep hedging models without knowledge of the current instantaneous variance. Remember that the focus is to analyze the CVAEs' ability to generate paths conditioned on the current instantaneous variance.
                         CVAE Deep Hedge                                      Deep Hedge  Analytical
#CVAE                    250                1,000              4,981
training samples         (non-overlapping)  (non-overlapping)  (overlapping)
Avg. abs. PnL    Mean    0.0044             0.0044             0.0044         0.0049      0.0041
                 Std     0.00004            0.00005            0.00012        0.00002     0
PnL Std          Mean    0.0058             0.0057             0.0057         0.0061      0.0054
                 Std     0.00004            0.00003            0.00009        0.00002     0
CVaR_0.95        Mean    0.0137             0.0124             0.0121         0.0116      0.0127
                 Std     0.00050            0.00021            0.00025        0.00001     0
Turnover         Mean    2.0201             1.8562             1.8087         1.7199      1.9002
                 Std     0.05035            0.02108            0.02647        0.00340     0

Table 7.5: Results from a hedge experiment with a call option in a Heston model with ν(0) = 0.05 and without transaction costs. The price of the option is 0.02607. Each column represents the average results (including standard deviation) of 10 hedging models. In the first three columns, deep hedging models are trained with 2^18 samples from CVAEs that are themselves trained on a varying number of training paths (in the third column, all CVAEs are trained on overlapping paths). The fourth column represents deep hedging models trained on actual Heston paths, and the fifth column represents the delta hedging strategy. The results are derived from a hedge experiment with 100,000 independent paths from the actual Heston model.
                         CVAE Deep Hedge                                      Deep Hedge  Analytical
#CVAE                    250                1,000              4,981
training samples         (non-overlapping)  (non-overlapping)  (overlapping)
Avg. abs. PnL    Mean    0.0040             0.0041             0.0042         0.0040      0.0034
                 Std     0.00010            0.00004            0.00006        0.00001     0
PnL Std          Mean    0.0051             0.0051             0.0052         0.0051      0.0044
                 Std     0.00009            0.00004            0.00007        0.00001     0
CVaR_0.95        Mean    0.0105             0.0098             0.0097         0.0096      0.0105
                 Std     0.00019            0.00006            0.00005        0.00000     0
Turnover         Mean    1.8811             1.7146             1.6472         1.6903      1.8963
                 Std     0.07725            0.01259            0.01778        0.00385     0

Table 7.6: Similar to table 7.5, but with ν(0) = 0.0270. Option price: 0.02040.
For this experiment, we train 10 CVAEs on 250 non-overlapping Heston paths, 10 CVAEs on 1,000 non-overlapping Heston paths and 10 CVAEs on 4,981 overlapping Heston paths (corresponding to 250 non-overlapping paths). All CVAEs are trained with α = 0.9, β = 0 and γ = 1. Remember that β was for VAEs, not CVAEs. We then perform three hedge experiments with ν(0) in {0.0270, 0.05, 0.0765}, where 0.0270 and 0.0765 approximately correspond to the 10% and 90% quantiles from the training data. Note that these ν(0)s are the instantaneous variances utilized for the hedge experiments; when we create training paths for the CVAEs, we still use ν(0) = 0.05.
In each of the three experiments with different conditioned ν(0)s, we train one deep hedging model for each of the 10 CVAEs. For the deep hedging models, we utilize ANNs that have four layers with six units each. The deep hedging models are trained on 2^18 paths, which are conditioned on the chosen ν(0). We also train 10 deep hedging models on actual independent Heston paths with the chosen ν(0). Lastly, we also include a delta hedging model, which we refer to as the analytical approach. All hedging models are tested on 100,000 independent Heston paths with the chosen ν(0). For the tests, we set the initial portfolio values for all strategies equal to the option price, which varies with ν(0). The results can be seen in tables 7.5, 7.6 and 7.7, corresponding to ν(0) = 0.05, ν(0) = 0.0270 and ν(0) = 0.0765, respectively.
In table 7.5 (ν(0) = 0.05), we observe that the deep hedging models trained on CVAE paths perform reasonably well. In terms of CVaR, we notice that only the deep hedging models using CVAEs trained on 250 non-overlapping paths (with an average CVaR of 0.0137) did not outperform the delta hedging approach (with an average CVaR of 0.0127). The deep hedging models using CVAEs trained on 4,981 overlapping paths performed the best among the CVAE-based models, with an average CVaR of 0.0121. This is encouraging, since it shows that learning hedging strategies from data with varying volatility is possible, even though it is more challenging than with constant volatility. This is especially true considering that the CVAEs in section 6.6 failed to replicate the positive correlation between absolute returns. We are also pleased to see that CVAEs trained on overlapping paths, once again, performed well.

                         CVAE Deep Hedge                                      Deep Hedge  Analytical
#CVAE                    250                1,000              4,981
training samples         (non-overlapping)  (non-overlapping)  (overlapping)
Avg. abs. PnL    Mean    0.0055             0.0049             0.0048         0.0057      0.0048
                 Std     0.00019            0.00004            0.00003        0.00001     0
PnL Std          Mean    0.0073             0.0066             0.0065         0.0071      0.0063
                 Std     0.00024            0.00004            0.00002        0.00001     0
CVaR_0.95        Mean    0.0187             0.0157             0.0153         0.0135      0.0149
                 Std     0.00074            0.00019            0.00028        0.00001     0
Turnover         Mean    2.2806             1.9804             1.9125         1.7308      1.9023
                 Std     0.09590            0.01676            0.02586        0.00347     0

Table 7.7: Similar to table 7.5, but with ν(0) = 0.0765. Option price: 0.03134.
In tables 7.6 and 7.7 (ν(0) = 0.0270 and ν(0) = 0.0765), we have performed the hedge experiment with more challenging initial instantaneous variances, corresponding approximately to the 10% and 90% quantiles of the training data. As we saw in section 6.6, the CVAEs find it harder to generate Heston-like paths with values of ν(0) that are extreme compared to the training data. In table 7.6 (ν(0) = 0.0270), we observe that all deep hedging models using CVAEs perform as well as or better than the delta hedging approach. However, this is not true in table 7.7 (ν(0) = 0.0765), where all deep hedging models using CVAEs performed worse than the delta hedging approach. This suggests that it is harder for the CVAEs to generate useful paths (for training the deep hedging models) when conditioned on high instantaneous variances than on low instantaneous variances. This may not seem surprising, but we did not see it in section 6.6, where we analyzed the marginal distributions with the two-sample Kolmogorov-Smirnov test.
It is also interesting to observe that the turnover (and average absolute PnL) does not follow the same pattern across tables 7.5, 7.6 and 7.7. This might suggest that the CVAEs have a bias when conditioning on different instantaneous variances.
Overall, we are pleased with the results, as they show that it is possible to learn usable hedging strategies conditioned on the current level of volatility. Moreover, this further vindicates the use of overlapping paths for training CVAEs. It is also a relief to see that the deep hedging models could learn these hedging strategies even though the CVAEs struggled to capture dependency in returns. These results could, of course, vary substantially if the Heston model had more extreme parameters or if we tried hedging options that are more sensitive to dependency in the returns.
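The risk statistics reported in tables 7.5-7.7 can be computed directly from simulated hedge experiments. The sketch below shows one plausible set of definitions; the exact conventions (for example, whether the initial trade from a zero position counts toward turnover) are assumptions made here for illustration, not necessarily those used in our experiments.

```python
import numpy as np

# A sketch (under assumed definitions) of the statistics in tables 7.5-7.7,
# computed from the terminal PnL per path and the holdings process per path.

def hedge_statistics(pnl, positions, level=0.95):
    """pnl: (n_paths,) terminal hedge errors; positions: (n_paths, n_steps) holdings."""
    avg_abs_pnl = np.mean(np.abs(pnl))
    pnl_std = np.std(pnl)
    # Empirical CVaR: average of the worst (1 - level) fraction of losses -PnL
    losses = np.sort(-pnl)
    k = max(1, int(np.ceil((1 - level) * len(losses))))
    cvar = losses[-k:].mean()
    # Turnover: total absolute change in holdings, averaged over paths
    # (here the initial position is assumed to count as a trade from zero)
    padded = np.hstack([np.zeros((positions.shape[0], 1)), positions])
    turnover = np.abs(np.diff(padded, axis=1)).sum(axis=1).mean()
    return avg_abs_pnl, pnl_std, cvar, turnover
```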
8 Conclusion
This thesis shows that it is possible, using deep learning, to learn hedging strategies from a single long observed path of the underlying asset under the real-world measure P. Our approach is quite general and model-free, and it was obtained by combining ideas and techniques suggested in [1] and [2].
In section 3, we formulated a general hedging problem that could be represented as a risk minimization problem over
trading decisions represented by ANNs. Critically, we observed that the optimal trading strategy was independent of
the option price when utilizing a coherent risk measure. The only input required to solve the minimization problem
was information about the current and previous market states (and possibly previous trades) at each trading decision.
In practice, this amounts to a large set of simulated paths under the P-measure of the underlying asset(s).
In section 5, we performed several hedge experiments where deep hedging models learned optimal hedging
strategies for various options and models. We observed that our framework was able to learn complex trading
strategies. However, it was clear that architecture and available information played a crucial part in stability and
performance. When hedging down-and-out call options, we saw that the performance improved significantly if the
deep hedging models could observe the current minimum value of the underlying asset.
We also observed that learning trading strategies in a multi-asset Black-Scholes model had stability issues, especially when the models were tested on asset correlations different from those used during training. We proposed training the deep hedging model on paths simulated with dampened asset correlation, and we found that this technique did improve the stability of the deep hedging model.
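As a minimal sketch of the dampening idea (the shrinkage form below is an assumption for illustration, not necessarily the exact transformation used in our experiments), one can shrink the training correlation matrix toward the identity, which keeps it a valid correlation matrix:

```python
import numpy as np

# Hypothetical dampening of asset correlations before simulating training
# paths: a convex combination of the correlation matrix and the identity.
# This preserves the unit diagonal and positive semidefiniteness.

def dampen_correlation(corr, damp=0.5):
    """Convex combination of corr and the identity; damp=0 returns corr unchanged."""
    corr = np.asarray(corr, dtype=float)
    return (1.0 - damp) * corr + damp * np.eye(corr.shape[0])

corr = np.array([[1.0, 0.8, 0.3],
                 [0.8, 1.0, 0.5],
                 [0.3, 0.5, 1.0]])
damped = dampen_correlation(corr, damp=0.5)
# Off-diagonal entries are halved; eigenvalues remain nonnegative.
```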
At that point, we had developed a framework that could learn optimal trading strategies given enough training samples. The goal was then to create a framework that could generate an arbitrarily large training set of paths from an underlying asset based on only a modest set of observed paths.
Section 6 introduced the theory behind VAEs and CVAEs and how one could utilize these models to create generative models capable of generating return paths with distributional characteristics similar to those of an observed set of paths (or one long single path) from the underlying asset. We saw that training VAEs and CVAEs amounted to minimizing a reconstruction loss between generated paths and training paths plus a regularization term, which was crucial for the generative qualities of the model after training.
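The loss described above can be sketched numerically. For a diagonal Gaussian encoder and a standard normal prior, the KL regularization term has the well-known closed form below (see, e.g., [16]); the squared-error reconstruction term and the weight beta are illustrative choices, not the exact ones used in our experiments.

```python
import numpy as np

# A minimal numerical sketch of the VAE objective: reconstruction loss plus
# KL regularization. For q(z|x) = N(mu, diag(exp(log_var))) and prior N(0, I),
# the KL term has the standard closed form.

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """Per-sample squared-error reconstruction plus beta-weighted KL term."""
    recon = np.sum((x - x_recon)**2, axis=-1)
    return recon + beta * kl_to_standard_normal(mu, log_var)
```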
We then experimented with the VAEs and CVAEs abilities to generate paths, consisting of 20 daily returns, from
a Black-Scholes and Heston model. We saw that the VAEs could produce paths with marginal distributions and
correlations that aligned with those from a Black-Scholes model, even when trained on only 100 paths. However, we
observed that the VAEs struggled to capture the positive correlation between absolute returns in a Heston model. We
proposed
a series of moment regularizing terms that guide the VAEs to generate paths with moments and correlations
that align with the training paths. The VAEs were, with regularization, partially able to capture correlations in the
Heston model.
We also experimented with the CVAEs’ ability to generate paths with distributional properties of a Heston model,
conditioned on the current instantaneous variance. It was evident that the CVAE struggled with this challenge. Again,
we
proposed
two regularization terms to guide the CVAEs toward the right conditional moments of the training
paths. The proposed regularization had a significant positive effect on the marginal distributions. However, we failed
to capture the positive correlation structure of absolute returns from the Heston model when conditioning on the
current instantaneous variance.
In section 7, we combined the two frameworks (VAEs/CVAEs and deep hedging models) to create a single
framework capable of learning efficient hedging strategies only by observing a modest number of paths (potentially
coming from a single long realized path).
In a Black-Scholes model, we observed that the framework worked as intended and could learn hedging strategies
for both ordinary and down-and-out call options based on a modest number of paths. As expected, the performance
increased with the number of training paths. However, we also found that using overlapping training paths could
increase performance significantly. We also analyzed the stability of the framework; the analysis suggested that the VAEs were the greatest source of variation. This did not surprise us, as training VAEs involves random simulation inside each gradient step. Consequently, improvements to VAE training seemed like the most straightforward way of improving the framework.
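The construction of overlapping versus non-overlapping training paths from one long return series can be sketched as follows; note that a series of 5,000 daily returns yields exactly the path counts (250 non-overlapping, 4,981 overlapping) appearing in our Heston experiments.

```python
import numpy as np

# A sketch of how a single long return path is cut into training paths of 20
# daily returns. Non-overlapping windows give floor(n/20) paths, while
# overlapping windows (shifting one day at a time) give n - 20 + 1 paths,
# roughly 20 times as many, at the cost of strongly dependent samples.

def non_overlapping_paths(returns, length=20):
    n = len(returns) // length
    return returns[:n * length].reshape(n, length)

def overlapping_paths(returns, length=20):
    return np.lib.stride_tricks.sliding_window_view(returns, length)

returns = np.random.default_rng(0).standard_normal(5_000)
print(non_overlapping_paths(returns).shape)  # (250, 20)
print(overlapping_paths(returns).shape)      # (4981, 20)
```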
Finally, we tested the framework in a Heston model. Our goal was to train CVAEs on Heston paths conditioned on
instantaneous variance and then create hedging strategies based on some fixed level of initial volatility (instantaneous
variance). We found that the framework worked as intended. We were able to find decent hedging strategies for a call
option based on different initial instantaneous variances, even though we failed to capture the positive correlation of
absolute returns with the CVAEs. However, we did observe that the model struggled to find hedging strategies that outperformed simple delta hedging when the initial instantaneous variance was large. This was not reflected in the analysis from section 6, which shows that a good fit of marginal distributions and correlations does not translate directly into hedge performance.
Overall, we are impressed with the models and frameworks used in this thesis. We have shown that it is possible to create hedging strategies from a single observed path of an underlying asset without any knowledge of Q-measures and Greeks. The framework is intended to be general and model-free. However, we have seen that the VAE/CVAE models may require substantial regularization and that the deep hedging models are particularly sensitive to the architecture and processing of inputs. One may, therefore, not need any knowledge about exact Greeks, but understanding price dynamics and options is vital to utilizing market generators and deep hedging models effectively. It would therefore still be sensible to favour classic methods. However, this framework might be helpful as a co-pilot to classic techniques, as an alternative method with few shared assumptions.
A Appendix
A.1 Outperforming delta hedging using MSE (in theory)
From the experiments in section 5.1.1, we saw that the ANN model using MSE learned a strategy that was remarkably close to that of the analytical delta hedging approach assuming no transaction costs. However, we would expect that a trading strategy optimized for trading in discrete time would outperform delta hedging, which is meant for continuous trading.
In this short section, we wish to show analytically that strategies optimizing MSE in discrete time are different from delta hedging. To do this, we imagine being in the simplest setting possible: we are only allowed to trade at $t_0 = 0$, and we evaluate the MSE at time $t_1 = T$. The goal is to trade a single asset $S$ to minimize the MSE between the portfolio value $PF_1$ and the option payoff $g(S_1)$. We also assume that we start with portfolio value $p_0$ and that there are no transaction costs.
If we choose to hold $\delta$ units of $S$ at time $t_0 = 0$, then the portfolio value $PF_1$ is
$$PF_1 = S_1 \delta + (p_0 - \delta S_0) e^{rT} = \delta (S_1 - S_0 e^{rT}) + p_0 e^{rT}.$$
To formalize the analysis, we wish to find $\delta$ to minimize
$$\mathbb{E}^P_0\left[(g(S_1) - PF_1)^2\right].$$
Using the representation of $PF_1$ and applying the first-order condition yields
$$0 = \frac{\partial}{\partial \delta}\, \mathbb{E}^P_0\left[(g(S_1) - PF_1)^2\right]
= -2\, \mathbb{E}^P_0\left[(g(S_1) - PF_1)(S_1 - S_0 e^{rT})\right]
= -2\, \mathbb{E}^P_0\left[\left(g(S_1) - \delta (S_1 - S_0 e^{rT}) - p_0 e^{rT}\right)(S_1 - S_0 e^{rT})\right]
= -2\, \mathbb{E}^P_0\left[-\delta (S_1 - S_0 e^{rT})^2 + (g(S_1) - p_0 e^{rT})(S_1 - S_0 e^{rT})\right].$$
Solving for $\delta$ yields
$$\delta = \frac{\mathbb{E}^P_0\left[(g(S_1) - p_0 e^{rT})(S_1 - S_0 e^{rT})\right]}{\mathbb{E}^P_0\left[(S_1 - S_0 e^{rT})^2\right]}.$$
Notice that if $P = Q$ and $p_0 = e^{-rT}\, \mathbb{E}^Q_0[g(S_1)]$, then
$$\delta = \frac{\operatorname{cov}^Q_0(g(S_1), S_1)}{\mathbb{V}^Q_0[S_1]}.$$
As a sanity check, we see that if $g(S_1) = S_1$ and $p_0 = S_0$, then $\delta = 1$, which we expect since we can perfectly replicate the option by holding 1 unit of $S$. However, the interesting observation is that $\delta$ does not equal $\Delta = \frac{\partial}{\partial s}\, e^{-rT}\, \mathbb{E}^Q[g(S_1) \mid S_0 = s]$, among other things because $\delta$ depends on both $P$ and $p_0$ (which might not necessarily be the price). Of course, we know that if we can trade continuously, then trading with the option delta will be optimal. Nevertheless, when trading in discrete time (and in this case only trading once), it is possible to choose a trading strategy that outperforms delta hedging.
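The result above is easy to check numerically. The sketch below (a Black-Scholes example with illustrative parameters, taking P = Q and p0 equal to the discounted expected payoff) estimates the one-period MSE-optimal ratio by Monte Carlo and compares it with the Black-Scholes delta:

```python
import math
import numpy as np

# A numerical check of the derivation above: with P = Q and p0 the discounted
# expected payoff, the one-period MSE-optimal hedge ratio is
# delta* = Cov_Q(g(S1), S1) / Var_Q(S1), which differs from the
# Black-Scholes delta N(d1). Parameters are illustrative.

S0, K, r, sigma, T = 100.0, 100.0, 0.02, 0.2, 1.0
rng = np.random.default_rng(0)
Z = rng.standard_normal(1_000_000)

# Simulate S1 under Q (risk-neutral drift r)
S1 = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * math.sqrt(T) * Z)
payoff = np.maximum(S1 - K, 0.0)

# One-period MSE-optimal hedge ratio from the derivation above
delta_star = np.cov(payoff, S1)[0, 1] / S1.var()

# Continuous-time Black-Scholes delta for comparison
d1 = (math.log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
bs_delta = 0.5 * (1.0 + math.erf(d1 / math.sqrt(2.0)))

print(delta_star, bs_delta)  # roughly 0.62 vs 0.58: the two strategies differ
```

With more frequent trading the gap shrinks; with a single trading date, as here, the difference is clearly visible.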
Our experiments find that the ANN model using MSE dominates when there are few hedge points, as it can factor this into its minimization. However, in experiments with a fair number of hedge points, we have not found that the ANN model using MSE significantly outperforms delta hedging, even when the ANN model also has the current portfolio value as input.
A.2 Price and delta of a call option in the Heston model
In this thesis, when pricing call options in the Heston model, we utilize a simple (and relatively stable) semi-closed formulation of the price referred to as the Lipton-Lewis reformulation (see [15]):
$$C_H(S(t), \nu(t), t) = S(t) - \frac{K e^{-r(T-t)}}{2\pi} \int_{-\infty}^{\infty} \frac{\exp\left\{\left(ik + \tfrac{1}{2}\right) X + \alpha - \left(k^2 + \tfrac{1}{4}\right) \beta\, \nu(t)\right\}}{k^2 + \tfrac{1}{4}}\, dk,$$
where
$$X = \ln \frac{S(t)}{K} + r(T-t),$$
$$\alpha = -\frac{\kappa \theta}{\sigma^2} \left( \psi_+ (T-t) + 2 \ln \frac{\psi_- + \psi_+ e^{-\zeta (T-t)}}{2 \zeta} \right),$$
$$\beta = \frac{1 - e^{-\zeta (T-t)}}{\psi_- + \psi_+ e^{-\zeta (T-t)}},$$
and where
$$\psi_\pm = \mp (ik\rho\sigma + \hat{\kappa}) + \zeta, \qquad
\zeta = \sqrt{k^2 \sigma^2 (1 - \rho^2) + 2ik\sigma\rho\hat{\kappa} + \hat{\kappa}^2 + \sigma^2/4}, \qquad
\hat{\kappa} = \kappa - \frac{\rho\sigma}{2}.$$
Note that all parameters referenced are the parameters under the $Q$-measure. From this, we can also easily derive a semi-closed formula for the option delta:
$$\Delta_H = \frac{\partial C_H(S(t), \nu(t), t)}{\partial S(t)} = 1 - \frac{K e^{-r(T-t)}}{2\pi} \int_{-\infty}^{\infty} \frac{\left(ik + \tfrac{1}{2}\right) \exp\left\{\left(ik + \tfrac{1}{2}\right) X + \alpha - \left(k^2 + \tfrac{1}{4}\right) \beta\, \nu(t)\right\}}{S(t) \left(k^2 + \tfrac{1}{4}\right)}\, dk.$$
References
[1] Buehler, H., Gonon, L., Teichmann, J., & Wood, B. (2019). Deep hedging. Quantitative Finance, 19(8), 1271-1291.
[2] Buehler, H., Horvath, B., Lyons, T., Perez Arribas, I., & Wood, B. (2020). A data-driven market simulator for small data environments. Available at SSRN 3632431.
[3] McNeil, A. J., Frey, R., & Embrechts, P. (2015). Quantitative risk management: concepts, techniques and tools - revised edition. Princeton University Press.
[4] Föllmer, H., & Schied, A. (2016). Stochastic finance: an introduction in discrete time. Walter de Gruyter.
[5] Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[6] Pedersen, T. C., & Frandsen, M. G. (2020). Applications of Deep Learning in Option Pricing and Calibration.
[7] Leshno, M., Lin, V. Y., Pinkus, A., & Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6), 861-867.
[8] Telgarsky, M. (2015). Representation benefits of deep feedforward networks. arXiv preprint arXiv:1509.08101.
[9] Björk, T. (2009). Arbitrage theory in continuous time. Oxford University Press.
[10] Poulsen, R. (2018). Fundamental Views. Wilmott, 2018(97), 44-47.
[11] Higham, N. J. (2002). Computing the nearest correlation matrix - a problem from finance. IMA Journal of Numerical Analysis, 22(3), 329-343.
[12] Yin, C., Perchet, R., & Soupé, F. (2021). A practical guide to robust portfolio optimization. Quantitative Finance, 1-18.
[13] Savine, A. (2018). Modern computational finance: AAD and parallel simulations. John Wiley & Sons.
[14] Lord, R., Koekkoek, R., & Dijk, D. V. (2010). A comparison of biased simulation schemes for stochastic volatility models. Quantitative Finance, 10(2), 177-194.
[15] Lipton, A. (2002). The vol smile problem. Risk, February 2002.
[16] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[17] Duchi, J. (2007). Derivations for linear algebra and optimization. Berkeley, California, 3(1), 2325-5870.