UNIVERSITY OF COPENHAGEN
DEPARTMENT OF MATHEMATICAL SCIENCES
Master Thesis
Magnus Grønnegaard Frandsen
Greeks Need Not Apply
Using Market Generators and Deep Hedging for Model-Free Data-Driven Hedging
Advisor: Rolf Poulsen
Submitted on: June 15, 2021
Abstract
In recent years, research in deep learning has intensified, and deep learning methods have been developed in all areas of finance. This thesis aims to combine deep learning-based market generators with deep hedging methods to create a model- and Greek-free framework that finds risk-optimal hedging strategies from observed price paths.
First, we explain and test the deep hedging method using virtually unrestricted amounts of synthetic data from Black-Scholes and Heston models. We show that the deep hedging method can find reasonable hedging strategies for simple claims and for path-dependent options, both with and without transaction costs. However, architecture and inputs (including their processing) have significant impacts on performance. We also observe that the deep hedging method can yield unstable hedging strategies in multivariate models.
Next, we explain and test data-driven market generators based on variational autoencoders. We observe that the market generators can produce paths with marginal distributions and correlations similar to those of a Black-Scholes model. The market generators struggle in the Heston model and when conditioning on initial instantaneous variance. We propose several moment regularization terms that partly alleviate these issues.
Finally, we combine and test the market generators with the deep hedging methods when assuming that a Black-Scholes or Heston model drives the market. In a Black-Scholes model, we observe that the combined framework can learn reasonable hedging strategies for call and down-and-out call options from a modest number of observations. In a Heston model, we also observe that the framework can learn hedging strategies (conditioned on initial instantaneous variance) from a single path. Still, the framework struggles with high initial instantaneous variance. We also show that it is possible to improve performance by utilizing overlapping paths, which increase the number of training paths. The results are promising. However, the techniques still require refining and further development before they are feasible for commercial use.
Keywords— Hedging, Deep Learning, Model Free, Market Generator, Variational Autoencoder, Conditional Variational Autoencoder, Stochastic Volatility, Transaction Costs, Barrier Option.
Contents

1 Introduction ........ 1
2 Simple Hedging Experiment ........ 2
3 General Hedging Problem ........ 6
  3.1 Market Setup ........ 6
  3.2 Risk measures and optimal trading strategies ........ 7
  3.3 CVaR and its practicalities ........ 11
  3.4 Practicalities of minimizing over δ ........ 11
4 Artificial Neural Networks ........ 12
  4.1 Architecture ........ 13
  4.2 Universal Representation and Representation Benefits of deep ANNs ........ 14
  4.3 Backpropagation ........ 15
  4.4 Implementation and Training Neural Networks ........ 16
5 Deep Hedging Experiments ........ 17
  5.1 Black-Scholes with 1 asset and a simple claim ........ 17
    5.1.1 No transaction costs ........ 17
    5.1.2 Training on wrong volatility and The Fundamental Theorem of Derivatives Trading ........ 22
    5.1.3 0.5% transaction costs (Black-Scholes 1 asset) ........ 23
  5.2 Black-Scholes with multiple assets and no transaction costs ........ 27
    5.2.1 Training with the wrong correlation ........ 29
  5.3 Black-Scholes model with path-dependent options ........ 31
  5.4 Heston model with one tradable asset (incomplete and non-Markovian model) ........ 35
  5.5 Subconclusion ........ 38
6 Market Generators ........ 38
  6.1 Variational Autoencoders ........ 39
  6.2 Conditional VAEs ........ 43
  6.3 Connecting ANNs to VAEs and CVAEs ........ 44
  6.4 Experiments with VAEs and performance evaluation (in a simple Black-Scholes model) ........ 45
  6.5 Cheating in the Heston model - capturing path dependency ........ 50
  6.6 Conditioning on instantaneous variance in the Heston model ........ 54
  6.7 Overlapping training paths (is it possible?) ........ 58
7 Data Driven Hedge Experiments ........ 60
  7.1 VAE powered hedge experiments - Black-Scholes ........ 60
  7.2 CVAE powered hedge experiments - Heston ........ 65
8 Conclusion ........ 67
A Appendix ........ 70
  A.1 Outperforming delta hedging using MSE (in theory) ........ 70
  A.2 Price and delta of a call option in the Heston model ........ 71
1 Introduction
Pricing and hedging derivatives are among the cornerstones of mathematical finance and play a vital role for derivatives dealers and traders. In classical theory, prices are determined as expected discounted payoffs under a pricing measure $Q$, and derivatives can be hedged (locally) by constructing a portfolio with matching Greeks (derivatives of the pricing function). This classical approach works perfectly in an idealized complete market without transaction costs, where assets follow simple Itô processes. However, in real-world scenarios, expert traders must combine other techniques and market knowledge to set prices and efficiently hedge portfolios against unwanted risks.
In recent years, all scientific fields have been overwhelmed with interest and research in machine learning. This
also includes finance, where machine learning techniques have been proposed in virtually all areas. The strength of
machine learning (especially deep learning) is that it offers solutions to complex and general problems with minimal
assumptions. The disadvantage of machine learning is that the results can be sensitive to the problem formulation, network architecture (for deep learning) and data quality. Results from machine learning methods are also difficult to interpret, and their errors are poorly understood.
In this thesis, we aim to combine and test the ideas of [1] and [2]. In [1], Buehler et al. propose a deep hedging framework that can find optimal hedging strategies for option portfolios in a market with multiple assets and market frictions. The approach is entirely model-free and does not depend on the model dynamics. Hence, the approach does not depend on any $Q$-measures or Greeks. The framework optimizes a hedging portfolio under a coherent risk measure using synthetic price data and artificial neural networks (ANNs). The framework will depend heavily on the synthetic data. If we choose a classic model (like Black-Scholes or Heston) to drive the market, then the deep hedging framework loses its inherent model-free status (although it would still be $Q$- and Greek-free). In [2], Buehler et al. propose a data-driven, flexible and non-parametric generative model based on variational autoencoders (VAEs), which can generate new asset paths from a small/modest number of observations. This generative model (also called a market generator) can (hopefully) be utilized to create synthetic data for a deep hedging model. Therefore, combining the generative model and the deep hedging model would yield a data-driven and model-free hedging framework. This thesis aims to explain, test, and analyze both the deep hedging approach and the VAE-based generative models.
The thesis is organized as follows: We motivate our formulation of a hedging problem in section 2 with a
simple example in a Binomial Model. We do this by approaching the hedging problem as a regression problem.
Section 3 introduces a general hedging problem with coherent risk measures. We aim to formulate the problem to
enable optimal hedging strategies to be found by minimizing over possible hedging strategies. In this thesis, we
utilize ANNs to represent trading decisions. For completeness, we introduce ANNs in section 4, including universal
approximation, backpropagation and training practicalities. In section 5, we perform various hedge experiments to
evaluate the deep hedging approach. For these experiments, we assume that the market is driven by a standard model
like Black-Scholes or Heston. This enables better testing of the deep hedging models since we can compare the deep
hedging models’ performance to classic delta hedging. To test the deep hedging model (looking at performance
and stability), we perform experiments with call options, down-and-out call options, multiple assets and stochastic
volatility (using a Heston model).
At this point, we should understand (some of) the possibilities and limitations of the deep hedging approach.
We, therefore, wish to step back from hedging to introduce a generative model that can create synthetic data for
the deep hedging models. In section 6, we introduce the theory behind variational and conditional variational
autoencoders (VAEs and CVAEs). We then test the VAEs' and CVAEs' abilities to create new paths of an asset,
based on a modest number of observed paths. The market will again be assumed to be driven by a Black-Scholes
or Heston model. To evaluate the performance of the generative models, we analyze the marginal distributions
and time correlations of the simulated paths/returns. To improve the performance of the generative models,
we propose various moment regularization terms, which guide the generative models towards distributions with
similar moments and correlations to the utilized training data. Section 7 combines the generative models and
the deep hedging approach to create and test a framework that learns optimal hedging strategies from a modest
number of observations. For simplicity, we again assume that the market is driven by either a Black-Scholes or
Heston model. However, in these experiments, we train the deep hedging model on synthetic samples created
by generative models that are trained on a set of observed paths from the actual model, i.e. Black-Scholes or
Heston. We aim to analyze the performance of the developed framework when varying the number of training samples and when utilizing overlapping sample paths. Finally, we investigate if it is possible to learn the
optimal hedging strategy from training samples with varying levels of underlying volatility (e.g. from a Heston model).
All implementations for this thesis are written in Python 3.8 with TensorFlow 2.4.1 and are available on GitHub in the following repo: https://github.com/jnr494/MasterThesis
2 Simple Hedging Experiment
In this section, we present a simple hedging problem in a binomial model, which we can formulate as a regression problem.
The binomial model and hedge setup

We assume that we are in a binomial model $(S(t), B(t))$ where

$$S(t+\Delta t) = S(t)\cdot\begin{cases} u = \exp\!\big(\alpha\Delta t + \sigma\sqrt{\Delta t}\big) & \text{w.p. } p \\ d = \exp\!\big(\alpha\Delta t - \sigma\sqrt{\Delta t}\big) & \text{w.p. } 1-p \end{cases}$$

$$B(t+\Delta t) = B(t)e^{r\Delta t}, \qquad B(0) = 1,$$

where we assume $\Delta t = T/N$, $T$ is the time of maturity for some option (that we wish to hedge) and $N$ is the number of steps from $t=0$ to $t=T$. For notational simplicity, we write $S_i = S(i\Delta t)$, $B_i = B(i\Delta t)$ and so on.
We assume that we wish to hedge an option on $S$ with payoff $C_N = g(S_N)$. Note that at time $T$ the underlying asset $S$ can take on $N+1$ different values

$$S_0 u^N,\; S_0 u^{N-1}d,\; \ldots,\; S_0 u d^{N-1},\; S_0 d^N.$$
It is quite easy to hedge options in a binomial model. Standing at time $t$, we can hedge the option value at time $t+\Delta t$ by solving the system

$$aS(t)u + bB(t)e^{r\Delta t} = C_u(t+\Delta t)$$
$$aS(t)d + bB(t)e^{r\Delta t} = C_d(t+\Delta t)$$

where $(a, b)$ is a portfolio of $(S(t), B(t))$ and $C_u(t+\Delta t)$, $C_d(t+\Delta t)$ are the values of the option in the two possible future states at time $t+\Delta t$. The solution

$$a = \frac{C_u(t+\Delta t) - C_d(t+\Delta t)}{S(t)(u-d)}, \qquad b = \frac{uC_d(t+\Delta t) - dC_u(t+\Delta t)}{B(t+\Delta t)(u-d)}$$

is the optimal hedging strategy, and the price at time $t$ is $C(t) = aS(t) + bB(t)$, i.e. the value of the hedging portfolio. Solving the above problem recursively backwards in time from $T$ to $t=0$ gives us the optimal hedging portfolio and the fair/arbitrage-free price of the option.
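The backward recursion is straightforward to implement. The following sketch is our own illustration (not from the thesis repository), using the binomial discretisation stated above and the parameters of the experiment later in this section; it prices the ATM call and records the replicating portfolio $(a, b)$ at the root node:

```python
import numpy as np

# Backward recursion in the binomial model; parameters match the experiment
# later in this section (S0=100, alpha=0.03, sigma=0.2, r=0, T=1, N=10).
S0, alpha, sigma, r, T, N, K = 100.0, 0.03, 0.2, 0.0, 1.0, 10, 100.0
dt = T / N
u = np.exp(alpha * dt + sigma * np.sqrt(dt))
d = np.exp(alpha * dt - sigma * np.sqrt(dt))
q = (np.exp(r * dt) - d) / (u - d)      # risk-neutral up-probability (not p)

# Terminal values S0 * u^j * d^(N-j) for j = 0..N up-moves, and call payoffs.
j = np.arange(N + 1)
C = np.maximum(S0 * u**j * d**(N - j) - K, 0.0)

a_root = b_root = None
for _ in range(N):
    if C.size == 2:                     # last step back: record root hedge (a, b)
        a_root = (C[1] - C[0]) / (S0 * (u - d))
        b_root = (u * C[0] - d * C[1]) / (np.exp(r * dt) * (u - d))
    C = np.exp(-r * dt) * (q * C[1:] + (1 - q) * C[:-1])

price = C[0]                            # fair/arbitrage-free price at t = 0
```

With these parameters, the recursion should reproduce an option price close to the 8.04567 reported in table 2.1, and the root portfolio value $aS_0 + b$ equals the price, as it should.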
We now turn our attention to solving the hedging problem by learning $a$ and $b$ for all the states. We do this by framing the problem as a regression problem.
Imagine that $p_0$ is the initial value of our hedging portfolio ($p_0$ should be the option's price, but we do not know this). We represent our investment strategy by functions $(f_0(s), \ldots, f_{N-1}(s))$ which represent the investments in $S$. The function $f_i$ could be a polynomial, an ANN or any other parametric function that can represent the optimal hedging strategy.
The value of our portfolio $PF_i$ can be recursively determined by

$$PF_0 = p_0$$

$$PF_i = f_{i-1}(S_{i-1})S_i + \overbrace{\big(PF_{i-1} - f_{i-1}(S_{i-1})S_{i-1}\big)}^{\text{amount invested in } B_{i-1}}\frac{B_i}{B_{i-1}} \qquad \text{for } i = 1, \ldots, N.$$
Rewriting $PF_N$, we get

$$PF_N = p_0 B_N + \sum_{k=0}^{N} S_k\big(f_{k-1}(S_{k-1}) - f_k(S_k)\big)\frac{B_N}{B_k},$$

where $f_{-1}, f_N := 0$. We can, therefore, think of $PF_N$ as a parametric function/algorithm that takes a path of $(S, B)$ and returns the portfolio value at time $T$ when starting at $p_0$ and following the trading strategy given by $(f_0, \ldots, f_{N-1})$. Since we are looking for a hedge portfolio, we want to find $p_0, f_0, \ldots, f_{N-1}$ s.t.

$$PF_N = g(S_N) \quad \text{almost surely.}$$
We can now think of the problem as a problem of finding $p_0, f_0, \ldots, f_{N-1}$ given $M$ sample paths to minimize the mean squared error (MSE)

$$\frac{1}{M}\sum_{i=1}^{M}\Big(C_N^{(i)} - PF_N^{(i)}\Big)^2.$$

If we, given all possible sample paths (of which there are $2^N$), can find $p_0, f_0, \ldots, f_{N-1}$ s.t. the mean squared error is zero, then we have found the desired optimal strategy since the optimal strategy is unique due to the completeness of the model.
The practical issues of the loss function and choosing the functional form of $f_i$ will be covered in the later chapters. For this experiment, we choose to represent the $f_i$s with ANNs and apply a standard gradient descent algorithm (ADAM) to update $p_0$ and the parameters of the $f_i$s. Note, however, that the choice of ANNs to represent the $f_i$s is not essential from a theoretical point of view. Any parametric function capable of representing the optimal hedging strategy would work. The choice of ANNs is made solely from a practical standpoint.
                      Value (Standard Error)    % of option price
Option Price          8.04567
Model p_0             8.04524                   99.995%
Avg. PnL              0.00111 (0.00011)         0.014%
Avg. abs. PnL         0.01507 (0.00008)         0.187%
Avg. squared PnL      0.00052 (0.00001)

Table 2.1: Average PnL, absolute PnL and squared PnL for the trained model in a test of 50,000 samples.
[Figure 2.1: Hedge accuracy (a) and PnLs (b) across terminal values of S over 50,000 samples for the ANN model. Panel (a) shows the option payoff and PF_N; panel (b) shows the PnL.]
A simple experiment in a Binomial model

For this simple experiment, we wish to hedge an ATM call option in a binomial model over one year. The binomial model will have step size $1/10$, implying that we have 10 trading decisions over the lifetime of the option. We assume the following model parameters and option characteristics:

Model parameters: $S_0 = 100$, $\alpha = 0.03$, $\sigma = 0.2$, $r = 0$ and $\Delta t = 1/10$.
Option: Type: European call, Maturity: $T = 1$, Strike: $K = 100$.
For this experiment, we do not focus on the details of the ANN-based model and training (see section 4). However, to quickly summarize: We generate $2^{18}$ samples of $S$ and estimate the option payoff on each sample. We then train the model (with an ANN representing each trading decision, each with three layers of four units) on these samples. During training, we seek to minimize the MSE of the total Profit-and-Loss (PnL) of selling the call option and trading in $(S, B)$, i.e. $PnL = -C_N + PF_N$.
To test our model, we perform a hedge experiment where the model has to hedge the call option on 50,000 new paths of $(S, B)$ (it will already have seen the vast majority). The result of this test can be seen in table 2.1 and figures 2.1, 2.2 and 2.3.
In figure 2.1, we see that the model quite successfully hedges the call option (at least visually), with absolute PnLs below 0.2 corresponding to 2.5% of the option price. From table 2.1, we see that the average PnL and even the average absolute PnL are pretty low, with values of 0.00111 and 0.01507 corresponding to 0.014% and 0.187% of the option price, respectively. The learned option price $p_0$ is 8.04524, corresponding to 99.995% of the actual option price, which is also quite impressive.
In figure 2.2, we see two examples of holdings across time. As the black dashed line indicates, the ANN model can mimic the analytical trading strategy. This is also seen in figure 2.3 where we see the learned strategy for holdings
[Figure 2.2: Holdings in S across time for two different sample paths; panels (a) Sample 1 and (b) Sample 2.]
[Figure 2.3: Holdings in S (units of S) at time t = 0.6.]
in $S$ at time $t = 0.6$ across values of $S$. The strategy learned by the ANN model seems to be close to the analytical strategy for the possible values of $S$ (of which there are only seven). However, the ANN model is not super-smooth in between the possible values of $S$, which makes sense since nothing is gained from this.
Whether or not this is impressive and/or good enough is debatable. Still, one should remember that the model
learned the option price and hedging strategy by
only
observing the MSE between the option payoffs and the values
of the hedging portfolios. The ANN model has no information on model dynamics or how it relates to the option.
There is, in principle, nothing stopping us from applying the same model/method to an entire option portfolio in a
stochastic volatility model with transaction costs. The power of this method comes from its flexibility. In the next
section, we formalize the problem in a more general setting, allowing us to investigate these tools further.
3 General Hedging Problem
In this section, we wish to introduce a more general hedging problem, which involves multiple tradable assets, stochastic interest rates and transaction costs. This section is based on the work by Buehler et al. in [1].
3.1 Market Setup
We imagine a market consisting of $d$ risky tradable assets with price processes represented by $S(t) \in \mathbb{R}^d$ which are adapted to $\mathcal{F}_t$, where $(\mathcal{F}_t)_{t \in \mathbb{R}_+}$ is a filtration based on all relevant market information (prices, interest rates, news etc.). The market also contains a tradable locally risk-free asset with price process $B(t) \in \mathbb{R}$, which is a predictable process. We observe and interact with these assets under the real-world probability measure $P$, and it is under this measure that we wish to find the optimal hedging strategy.
We assume that we are selling (and wish to hedge) a portfolio of claims $Z \in \mathcal{F}_T$ in the described market. For simplicity, we assume that all claims in this portfolio have maturity $T$. Being the seller of $Z$ implies that $Z$ is a liability at time $T$, but it may contain both long and short positions in the underlying assets (which are not necessarily $S$ or $B$). As compensation for selling the portfolio $Z$, we imagine being compensated $p_0 \in \mathbb{R}$ (which may be negative if we effectively hold long positions).
Our goal is to trade $(S, B)$ to optimally minimize the risk of our combined portfolio (containing both $Z$ and our hedging portfolio).
We assume that we are allowed to trade in $(S, B)$ at $N$ time-points, $0 = t_0 < t_1 < \ldots < t_{N-1} < T$, at prices $S_k = S(t_k)$ and $B_k = B(t_k)$, with $t_N = T$ being the time-point at which our combined portfolio is evaluated. However, these trades might be subject to transaction costs. We denote our trading strategy $(\delta_k)_{k=0,\ldots,N-1}$ with $\delta_k \in \mathbb{R}^d$ representing holdings of $S$ at time $t_k$. We assume that the strategy is kept self-financing using $B$.
For simplicity, we assume that trades in $S$ are subject to proportional transaction costs, implying that trading $\delta^i_k - \delta^i_{k-1}$ units of $S^i$ at time $t_k$ costs

$$c^i_k S^i_k\,|\delta^i_k - \delta^i_{k-1}|$$

where $c^i_k > 0$ (e.g. $c^i_k = 0.001$ representing transaction costs of 0.1%). We also assume that liquidation of the portfolio is free at time $T$.
Representing the terminal portfolio value and PnL

The value of the tradable portfolio (excluding $Z$), $PF$, obtained by following $(\delta_k)_{k=0,\ldots,N-1}$ and having initial capital $p_0$, is

$$PF_0 = p_0 - \sum_{i=1}^{d} c^i_0 S^i_0 |\delta^i_0|$$

$$PF_k = \left(\sum_{i=1}^{d} S^i_k \delta^i_{k-1}\right) + \left(PF_{k-1} - \sum_{i=1}^{d}\delta^i_{k-1} S^i_{k-1}\right)\frac{B_k}{B_{k-1}} - \sum_{i=1}^{d} c^i_k S^i_k\,|\delta^i_k - \delta^i_{k-1}| \quad \text{for } k = 1, \ldots, N-1$$

$$PF_N = \left(\sum_{i=1}^{d} S^i_N \delta^i_{N-1}\right) + \left(PF_{N-1} - \sum_{i=1}^{d}\delta^i_{N-1} S^i_{N-1}\right)\frac{B_N}{B_{N-1}}$$

where $PF_{k-1} - \sum_{i=1}^{d}\delta^i_{k-1} S^i_{k-1}$ is the amount invested in $B$ in order to keep the portfolio self-financing. Also, notice that we have applied our assumption of zero transaction costs at time $T$. The value of the tradable portfolio at time $T$, $PF_N$, can be written as

$$PF_N = p_0 B_N + \sum_{k=0}^{N}\sum_{i=1}^{d} S^i_k\big(\delta^i_{k-1} - \delta^i_k\big)\frac{B_N}{B_k} - \sum_{k=0}^{N-1}\sum_{i=1}^{d} c^i_k S^i_k\,|\delta^i_k - \delta^i_{k-1}|\,\frac{B_N}{B_k} = B_N \tilde{PF}_N \tag{1}$$

with

$$\tilde{PF}_N := p_0 + \sum_{k=0}^{N}\sum_{i=1}^{d} \tilde{S}^i_k\big(\delta^i_{k-1} - \delta^i_k\big) - \sum_{k=0}^{N-1}\sum_{i=1}^{d} c^i_k \tilde{S}^i_k\,|\delta^i_k - \delta^i_{k-1}|,$$

where $\delta^i_{-1} = 0$ and $\delta^i_N := 0$ for all $i$ (free liquidation at time $T$), and where $\tilde{PF}_N$ is the value of the tradable portfolio in the discounted market $(\tilde{S}, 1)$ with $\tilde{S}^i(t) = S^i(t)/B(t)$. The discounted portfolio value $\tilde{PF}_N$ will be justified later. To simplify notation, we define

$$(\tilde{S}\cdot\delta)_T := \sum_{k=0}^{N}\sum_{i=1}^{d} \tilde{S}^i_k\big(\delta^i_{k-1} - \delta^i_k\big) \tag{2}$$

$$C_T(\delta) := \sum_{k=0}^{N-1}\sum_{i=1}^{d} c^i_k \tilde{S}^i_k\,|\delta^i_k - \delta^i_{k-1}|, \tag{3}$$

implying that $\tilde{PF}_N = p_0 + (\tilde{S}\cdot\delta)_T - C_T(\delta)$. We can now move on to consider our Profit-and-Loss (PnL). At time $T$ the PnL of our combined portfolio is

$$PnL_T(Z, p_0, \delta) := -Z + PF_N = B_N\underbrace{\big(-\tilde{Z} + \tilde{PF}_N\big)}_{\tilde{PnL}_T}$$

where $\tilde{Z} = Z/B_N$.
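As a sanity check on the closed-form rewriting (1), the following snippet (with made-up path, cost and rate parameters, and using the conventions $\delta_{-1} = \delta_N = 0$ for free liquidation at $T$) computes the portfolio value once through the forward self-financing recursion and once through the discounted closed form:

```python
import numpy as np

rng = np.random.default_rng(1)
d_assets, N = 2, 6
r, c = 0.01, 0.002                     # flat short rate and proportional cost (made up)
B = np.exp(r * np.arange(N + 1))       # B_k with B_0 = 1
S = 100 * np.exp(np.cumsum(rng.normal(0, 0.02, (N + 1, d_assets)), axis=0))
delta = rng.normal(0, 1, (N, d_assets))  # holdings delta_0, ..., delta_{N-1}
p0 = 5.0

# Forward recursion: trade at t_0..t_{N-1}, free liquidation at t_N = T.
pf = p0 - np.sum(c * S[0] * np.abs(delta[0]))
for k in range(1, N):
    pf = (S[k] @ delta[k - 1]
          + (pf - S[k - 1] @ delta[k - 1]) * B[k] / B[k - 1]
          - np.sum(c * S[k] * np.abs(delta[k] - delta[k - 1])))
pf_N = S[N] @ delta[N - 1] + (pf - S[N - 1] @ delta[N - 1]) * B[N] / B[N - 1]

# Closed form (1): discounted prices, conventions delta_{-1} = delta_N = 0.
S_t = S / B[:, None]                   # tilde S
dpad = np.vstack([np.zeros(d_assets), delta, np.zeros(d_assets)])  # k = -1..N
gains = sum(S_t[k] @ (dpad[k] - dpad[k + 1]) for k in range(N + 1))
costs = sum(np.sum(c * S_t[k] * np.abs(dpad[k + 1] - dpad[k])) for k in range(N))
pf_N_closed = B[N] * (p0 + gains - costs)
```

The two values agree to machine precision, since the closed form is just summation by parts applied to the self-financing recursion.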
3.2 Risk measures and optimal trading strategies
If the portfolio of claims $Z$ is reachable with $(S, B)$ (only traded at $t_0, \ldots, t_{N-1}$) and we were not subject to transaction costs, then a reasonable objective would be to find a trading strategy $(\delta_k)_{k=0,\ldots,N-1}$ and initial portfolio value $p_0$ s.t.

$$PnL_T(Z, p_0, \delta) = 0 \quad \text{almost surely.}$$

This would then be a perfect hedging strategy for $Z$. This is essentially what we did in the experiment with a binomial model in section 2.
Generally, we do not think that $Z$ is reachable with our available trading strategies. We, therefore, choose to find $p_0$ and $(\delta_k)_{k=0,\ldots,N-1}$ s.t. they are optimal when considering the risk of $\tilde{PnL}_T(Z, p_0, \delta)$. To do this, we focus on measuring the risk with coherent risk measures.
Definition 3.1. Let $L, L_1, L_2 \in \mathcal{L}$ be loss random variables (liabilities); then $\rho : \mathcal{L} \to \mathbb{R}$ is a coherent risk measure if it satisfies the following properties (axioms of risk measures):
- Translation (cash) invariance: $\rho(L + c) = \rho(L) + c$ for $c \in \mathbb{R}$.
- Subadditivity: $\rho(L_1 + L_2) \le \rho(L_1) + \rho(L_2)$.
- Positive homogeneity: $\rho(\lambda L) = \lambda\rho(L)$ for $\lambda > 0$.
- Monotonicity: If $L_1 \le L_2$ then $\rho(L_1) \le \rho(L_2)$.
In this thesis, we always assume that a coherent risk measure is normalized, meaning that $\rho(0) = 0$. Given a coherent risk measure $\rho$, we can quantifiably measure the risk of $PnL_T$ by evaluating $\rho(-\tilde{PnL}_T(Z, p_0, \delta))$. Note that we choose to evaluate the risk of the discounted PnL. We can now formulate our objective as
1. For a given $p_0$, we wish to find a trading strategy $(\delta_k)_{k=0,\ldots,N-1}$ that minimizes $\rho(-\tilde{PnL}_T(Z, p_0, \delta))$.
2. Find a fair value for $p_0$ given $Z$, $\rho$ and the optimal $(\delta_k)_{k=0,\ldots,N-1}$.
Note that solving both objectives 1 and 2 might be difficult if the optimal trading strategy depends on $p_0$, since the fair value for $p_0$ might, in turn, depend on the optimal trading strategy. We will see later that this is not a problem when we measure the risk with a coherent risk measure on the discounted PnL.
We start by considering the first objective. We assume that $p_0$ is given and consider the minimization problem

$$\inf_\delta \rho\big(-\tilde{PnL}_T(Z, p_0, \delta)\big).$$

Using the definition of $PnL_T$ and the shorthand expression for $\tilde{PF}_N$, we can express the optimization problem as

$$\inf_\delta \rho\big(-\tilde{PnL}_T(Z, p_0, \delta)\big) = \inf_\delta \rho\big(\tilde{Z} - p_0 - (\tilde{S}\cdot\delta)_T + C_T(\delta)\big).$$

Using the cash invariance of $\rho$, we obtain

$$\inf_\delta \rho\big(\tilde{Z} - p_0 - (\tilde{S}\cdot\delta)_T + C_T(\delta)\big) = -p_0 + \inf_\delta \rho\big(\tilde{Z} - (\tilde{S}\cdot\delta)_T + C_T(\delta)\big),$$

showing us that the optimization problem (and therefore the optimal trading strategy) is independent of $p_0$. We, therefore, define the following relevant optimization problem

$$\pi(\tilde{Z}) := \inf_\delta \rho\big(-\tilde{PnL}_T(Z, 0, \delta)\big) = \inf_\delta \rho\big(\tilde{Z} - (\tilde{S}\cdot\delta)_T + C_T(\delta)\big). \tag{4}$$

We think of the optimal trading strategy $\delta^*$ associated with $\pi(\tilde{Z})$ as the optimal trading strategy for a trader with an option portfolio $Z$ given risk measure $\rho$.
We wish to understand $\pi$ before we discuss how to determine $p_0$. One can show that $\pi$ is itself a coherent risk measure.
Proposition 3.2. $\pi$ is a coherent risk measure.
Proof. That $\pi$ satisfies the axioms of cash invariance, positive homogeneity and monotonicity follows directly from cash invariance, positive homogeneity and monotonicity of $\rho$. We, therefore, focus on showing subadditivity.
To prove subadditivity, assume we have two loss random variables $L_1$ and $L_2$. Then by the definition of $\pi$ (see equation (4))

$$\pi(L_1 + L_2) = \inf_\delta \rho\big(L_1 + L_2 - (\tilde{S}\cdot\delta)_T + C_T(\delta)\big).$$

As we have no restrictions on $\delta$, we can reformulate the problem with $\delta = \delta_1 + \delta_2$ (1st step), then utilize linearity of $(\tilde{S}\cdot\delta)_T$ together with subadditivity of $C_T(\delta)$ (the triangle inequality) and monotonicity of $\rho$ (2nd step), utilize subadditivity of $\rho$ (3rd step) and finally use the definition of $\pi$ (4th step):

$$\inf_\delta \rho\big(L_1 + L_2 - (\tilde{S}\cdot\delta)_T + C_T(\delta)\big) = \inf_{\delta_1, \delta_2} \rho\big(L_1 + L_2 - (\tilde{S}\cdot(\delta_1 + \delta_2))_T + C_T(\delta_1 + \delta_2)\big)$$
$$\le \inf_{\delta_1, \delta_2} \rho\Big(\big(L_1 - (\tilde{S}\cdot\delta_1)_T + C_T(\delta_1)\big) + \big(L_2 - (\tilde{S}\cdot\delta_2)_T + C_T(\delta_2)\big)\Big)$$
$$\le \inf_{\delta_1} \rho\big(L_1 - (\tilde{S}\cdot\delta_1)_T + C_T(\delta_1)\big) + \inf_{\delta_2} \rho\big(L_2 - (\tilde{S}\cdot\delta_2)_T + C_T(\delta_2)\big)$$
$$= \pi(L_1) + \pi(L_2).$$

This shows that $\pi$ is subadditive, which proves that $\pi$ is a coherent risk measure.
Finding a fair price $p_0$

We can now consider finding a fair compensation $p_0$ for selling the option portfolio $Z$. We first notice that, because of cash invariance of $\pi$,

$$\pi\big(\tilde{Z} - \pi(\tilde{Z})\big) = \pi(\tilde{Z}) - \pi(\tilde{Z}) = 0,$$

which justifies thinking of $\pi(\tilde{Z})$ as the smallest amount added to position $Z$ to make the position acceptable in terms of $\rho$, i.e. the smallest amount $c$ satisfying $\pi(\tilde{Z} - c) \le 0$. This might seem like a reasonable price. However, it does not consider the fact that the trader could obtain better portfolio risk by not selling $Z$. This is the case if $\pi(0) < 0$, i.e. the trader can obtain better than 0 risk by trading $(S, B)$. This might happen if the tradable assets have high expected returns or if we use a risk measure $\rho$ that focuses less on large losses (e.g. CVaR with a low confidence level). We, therefore, consider the so-called indifference price $p(Z)$ satisfying

$$\pi\big(\tilde{Z} - p(Z)\big) = \pi(0),$$

i.e. the price making the trader indifferent between selling $Z$ and getting $p(Z)$ or not selling $Z$ (assuming the current portfolio is 0). Note that because of the cash invariance of $\pi$, the indifference price $p(Z)$ has solution

$$p(Z) = \pi(\tilde{Z}) - \pi(0). \tag{5}$$

The indifference price is a more sensible choice as a fair value for $p_0$ given a risk measure $\rho$.
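To see the indifference price (5) in action, consider a deliberately tiny example of our own (all numbers made up for illustration): a one-period market with four terminal states, no transaction costs, $r = 0$, and $\rho = CVaR_{0.95}$, a risk measure formally introduced in section 3.3. Because each scenario here carries more than 5% probability, $CVaR_{0.95}$ reduces to the worst-case loss, so the indifference price of a call becomes its scenario superhedging price:

```python
import numpy as np

# One-period scenario market (made-up numbers): r = 0, no transaction costs.
S0 = 100.0
S1 = np.array([90.0, 100.0, 110.0, 120.0])     # terminal asset prices
prob = np.array([0.3, 0.3, 0.3, 0.1])          # real-world probabilities
Z = np.maximum(S1 - 100.0, 0.0)                # call option we are selling
alpha = 0.95

def cvar(loss, prob, alpha):
    # Rockafellar-Uryasev form; the optimal w is one of the scenario losses.
    cand = [w + prob @ np.maximum(loss - w, 0.0) / (1 - alpha) for w in loss]
    return min(cand)

def pi(claim):
    # pi(claim) = inf_delta rho(claim - delta * (S1 - S0)), one tradable asset,
    # with delta searched over a simple grid.
    deltas = np.linspace(-2.0, 2.0, 401)
    return min(cvar(claim - dl * (S1 - S0), prob, alpha) for dl in deltas)

p_indiff = pi(Z) - pi(np.zeros_like(S1))       # indifference price, eq. (5)
```

Here $\pi(0) = 0$ (doing nothing is optimal), and $p(Z) \approx 6.67$, the scenario superhedging price; a lower confidence level would typically give a lower price.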
Be aware that $p(Z)$ might not exist if $\pi(0) = -\infty$, which happens if the market exhibits a tradable arbitrage or if $\rho$ is chosen in an unfortunate way. We will, however, not worry about this as we (mostly) wish to work with conditional value-at-risk with high confidence levels (and, of course, in settings with no tradable arbitrages).
To show that $p(Z)$ does not cause issues with standard pricing, we can show that if $Z$ is reachable and there are no transaction costs, then $p(Z)$ equals the unique hedging price.
Proposition 3.3. Assume $C_T(\delta) = 0$ for all $\delta$. If $Z$ is reachable, i.e. if $\exists\,\delta^*$ and $p_0^* \in \mathbb{R}$ s.t. $Z = B_N\big(p_0^* + (\tilde{S}\cdot\delta^*)_T\big)$, then $p(Z) = p_0^*$.

Proof. We wish to start by reassuring ourselves that $Z = B_N\big(p_0^* + (\tilde{S}\cdot\delta^*)_T\big)$ is actually what we think of as reachable assuming no transaction costs. The right-hand side should be the value of a portfolio starting at $p_0^*$ and
trading with $\delta^*$ until time $T$. Using equations (1) and (2), we know that the portfolio value $PF_N$ is

$$PF_N = B_N p_0^* + \sum_{k=0}^{N}\sum_{i=1}^{d} B_N \tilde{S}^i_k\big(\delta^{*,i}_{k-1} - \delta^{*,i}_k\big) = B_N\big(p_0^* + (\tilde{S}\cdot\delta^*)_T\big),$$

which is exactly what we proposed to represent the fact that $Z$ is reachable.
We know that $p(Z) = \pi(\tilde{Z}) - \pi(0)$, so naturally, to show $p(Z) = p_0^*$, we consider $\pi(\tilde{Z})$. We start by using our current assumptions of reachability and no transaction costs (1st step); note that we divide by $B_N$. Then we utilize cash invariance and the definition of $\pi$ (2nd step) and the definition of $(\tilde{S}\cdot\delta)_T$ (3rd step). In the final step, we utilize the definition of $\pi$ and the fact that the minimizing trading strategy $\delta$ can completely offset $\delta^*$ since we assume no restrictions on $\delta$:

$$\pi(\tilde{Z}) = \pi\big(p_0^* + (\tilde{S}\cdot\delta^*)_T\big)$$
$$= p_0^* + \inf_\delta \rho\big((\tilde{S}\cdot\delta^*)_T - (\tilde{S}\cdot\delta)_T\big)$$
$$= p_0^* + \inf_\delta \rho\big((\tilde{S}\cdot[\delta^* - \delta])_T\big)$$
$$= p_0^* + \pi(0),$$

showing us that $p_0^* = \pi(\tilde{Z}) - \pi(0)$, which proves that $p(Z) = p_0^*$.
This shows that
p(Z)
is not arbitrary and is still tied to classical theory. We now have a better understanding of the
optimization problem faced by a trader. However, to better understand the effect of working under risk measures, we consider another example.
Finding a fair price (of a new option portfolio) in the case of a preexisting option portfolio

Assume now that we are selling option portfolio $Z$ at a price $p_0$, and we consider selling $Z_1$ (in addition to $Z$). What would then be a fair price for $Z_1$?
Using the same principle as before, the price for $Z_1$, $p_1$, should make the trader indifferent between selling $Z$ and $Z + Z_1$, where the price for $Z$ is $p_0$. This implies that $p_1$ should satisfy

$$\pi\big(\tilde{Z} + \tilde{Z}_1 - p_0 - p_1\big) = \pi\big(\tilde{Z} - p_0\big).$$
Using cash invariance of $\pi$, we obtain

$$p_1 = \pi\big(\tilde{Z} + \tilde{Z}_1\big) - p_0 - \pi(\tilde{Z}) + p_0 = \pi\big(\tilde{Z} + \tilde{Z}_1\big) - \pi(\tilde{Z}),$$

which is equal to $p(Z_1)$ in the case where $Z = 0$, but in general might be different from $p(Z_1)$. By subadditivity of $\pi$, we can show that the fair value for $p_1$ is less than or equal to $p(Z_1)$. To see this, we first notice that by subadditivity of $\pi$

$$\pi\big(\tilde{Z} + \tilde{Z}_1\big) \le \pi(\tilde{Z}) + \pi(\tilde{Z}_1),$$

which easily shows that

$$p_1 = \pi\big(\tilde{Z} + \tilde{Z}_1\big) - \pi(\tilde{Z}) \le \pi(\tilde{Z}_1) \le p(Z_1),$$

since $p(Z_1) = \pi(\tilde{Z}_1) - \pi(0)$ and $\pi(0) \le 0$.
This makes intuitive sense, since the correlation between our current option portfolio $Z$ and the new one $Z_1$ might be advantageous due to the subadditivity of $\rho$ (and hence of $\pi$). However, this result does not hold if we assume a general convex risk measure (instead of a coherent risk measure). In this case, $\pi$ might not be subadditive, which could result in the indifference price for $Z_1$ being larger than $p(Z_1)$. We will, however, only be working with coherent risk measures, so we are not concerned about this.
3.3 CVaR and its practicalities
In this thesis, we consider optimizing the PnL under the risk measure conditional value-at-risk (CVaR or expected
shortfall). To define CVaR, we must first define value-at-risk (VaR).
Definition 3.4. The VaR of a loss random variable $L$ at confidence level $\alpha \in (0,1)$ is defined as
$$\mathrm{VaR}_\alpha(L) = \inf\{x \in \mathbb{R} : P(L > x) \le 1 - \alpha\}.$$
Note that $\mathrm{VaR}_\alpha(L)$ is simply the $\alpha$-quantile of $L$. We can now define CVaR.
Definition 3.5. Given a loss random variable $L$ with $E(|L|) < \infty$, we define the CVaR of $L$ at confidence level $\alpha \in (0,1)$ as
$$\mathrm{CVaR}_\alpha(L) := \frac{1}{1-\alpha}\int_\alpha^1 \mathrm{VaR}_u(L)\,du.$$
One can easily show that if the loss random variable $L$ is integrable with a continuous distribution function, then
$$\mathrm{CVaR}_\alpha(L) = E[L \mid L \ge \mathrm{VaR}_\alpha(L)]$$
(see lemma 2.13 in [3]). We can, therefore, think of CVaR as the average loss exceeding the corresponding value-at-risk. This is a widespread and useful risk measure, and it satisfies the conditions of a coherent risk measure.
Proposition 3.6. $\mathrm{CVaR}_\alpha$ is a coherent risk measure.
Proof. See example 2.26 in [3].
For practical purposes, CVaR can be quite tough to evaluate. We, therefore, utilize the alternative form given below.
Proposition 3.7. For a loss random variable $L$ and confidence level $\alpha \in (0,1)$, we have
$$\mathrm{CVaR}_\alpha(L) = \inf_w \left\{ w + \frac{1}{1-\alpha} E\big[(L - w)^+\big] \right\}$$
where $(x)^+ = \max(x, 0)$, and the optimal $w$ is $\mathrm{VaR}_\alpha(L)$.
Proof. See proposition 4.51 in [4].
If we choose our risk measure $\rho$ to be $\mathrm{CVaR}_\alpha$, then our optimization problem becomes
$$\pi(\tilde Z) = \inf_\delta \inf_w \left\{ w + \frac{1}{1-\alpha} E\big[(-\widetilde{PnL}_T(Z, 0, \delta) - w)^+\big] \right\} = \inf_{\delta, w} \left\{ w + \frac{1}{1-\alpha} E\big[(-\widetilde{PnL}_T(Z, 0, \delta) - w)^+\big] \right\} \quad (6)$$
where, of course, $\widetilde{PnL}_T(Z, 0, \delta) = -\tilde Z + (\tilde S\cdot\delta)_T - C_T(\delta)$. So when finding the optimal trading strategy $\delta$ under CVaR, we have to solve the above minimization problem jointly over $(\delta, w)$, which fortunately does not pose any extra issues (other than slightly more computational work).
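For an empirical PnL sample, the representation in proposition 3.7 can be evaluated directly, and since the infimum over $w$ is attained at the empirical VaR, it suffices to search over the sample points themselves. A minimal NumPy sketch (the function name is our own):

```python
import numpy as np

def cvar_rockafellar_uryasev(losses, alpha):
    """Estimate CVaR_alpha(L) via inf_w { w + E[(L - w)^+] / (1 - alpha) }.

    The infimum is attained at w = VaR_alpha(L), so for an empirical
    sample it suffices to search over the sample points themselves.
    """
    losses = np.asarray(losses, dtype=float)
    w = np.sort(losses)                                  # candidate w values
    excess = np.maximum(losses[None, :] - w[:, None], 0.0)
    objective = w + excess.mean(axis=1) / (1.0 - alpha)
    return objective.min()

# Sanity check against the direct tail-average form E[L | L >= VaR_alpha(L)].
rng = np.random.default_rng(0)
losses = rng.standard_normal(2_000)
alpha = 0.95
direct = losses[losses >= np.quantile(losses, alpha)].mean()
ru = cvar_rockafellar_uryasev(losses, alpha)
assert abs(ru - direct) < 1e-9
```

The same representation is what makes CVaR usable as a training loss: the extra scalar $w$ is simply treated as one more trainable parameter.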
3.4 Practicalities of minimizing over δ
At this point, we can express the trader's problem as the minimization problem in (4) (see (6) under CVaR), and we have seen that a fair price can be expressed as the indifference price found in (5). However, solving these minimization problems is easier said than done. Up until now, all minimization problems have minimized the risk of a portfolio's PnL w.r.t. the trading strategy $\delta$. In practice, we have to assume some structure and/or functional form of $\delta$.
To do this, we first have to figure out how much information is needed or available to trade at time $t_k$ for all $k$. When we are in a Markovian model with deterministic interest rate and zero transaction costs, and we are only selling simple claims, then an optimal trading decision is likely only dependent on the current value of the risky assets. However, once we are in a more general setting, we might need more information. Examples of this could be:

Non-Markovian model and/or path-dependent options: More information about the path of the risky assets is needed for optimal trading. For a down-and-out claim, this could be knowledge of the minimum of the corresponding risky asset.

Non-zero transaction costs: Information about current holdings might be needed to trade optimally, since the current holdings affect transaction costs.

We assume that all relevant and available information is contained in the $\mathcal{F}_t$-measurable process $I(t) \in \mathbb{R}^q$. Note that previous trading decisions might also be included in $I(t)$.
Assume now that we have chosen some parametric family of functions $f_{\theta_k}: \mathbb{R}^q \to \mathbb{R}^d$ to represent the trading decision at time $t_k$ (i.e. $\delta_k$) for all $k = 0, \dots, N-1$. At time $t_k$, these are functions of the available information $I_k$ and represent the holdings in the $d$ risky assets from time $t_k$ to $t_{k+1}$. By parametric family, we mean functions whose differences depend only on the parameters $\theta_k$.
With these structural decisions, we can represent the trader's problem as
$$\pi^f(\tilde Z) := \inf_\theta \rho\!\left( \tilde Z - \sum_{k=1}^{N}\sum_{i=1}^{d} \tilde S^i_k\big(f_{\theta_k}(I_k)^{(i)} - f_{\theta_{k-1}}(I_{k-1})^{(i)}\big) + \sum_{k=0}^{N-1}\sum_{i=1}^{d} c^i_k \tilde S^i_k \big|f_{\theta_k}(I_k)^{(i)} - f_{\theta_{k-1}}(I_{k-1})^{(i)}\big| \right)$$
where $\theta = (\theta_0, \dots, \theta_{N-1})$. With this formulation of the problem, we should now be able to utilize some optimization algorithm to (approximately) solve the problem and hence find the (approximately) optimal trading strategy represented by $(f_{\theta^*_k}(I_k))_k$, where $\theta^*_k$ corresponds to the solution to the problem above. Note that one could (possibly) reformulate this problem to solve it with the Bellman equation. However, we will not pursue that approach.

In this thesis, we choose ANNs as the parametric family of functions to represent the trading decisions. We should then be able to train the ANNs simultaneously to minimize the above expression. The construction of the ANNs, their ability as universal approximators and the practicalities of training are discussed in section 4.
4 Artificial Neural Networks
In section 3, we saw that the optimal hedging strategies could be derived by solving problems of the form
$$\inf_\theta\, l\big(f_{\theta_1}(X_1), \dots, f_{\theta_n}(X_n)\big)$$
where the $X_i$ are random variables in $\mathbb{R}^q$ (possibly affected by $f_{\theta_j}$ for $j < i$), the $f_{\theta_i}$ are parametric functions into $\mathbb{R}^d$ representing trading decisions, and $l$ is some kind of loss function (in our case CVaR or MSE). Note that when we say $X_i$ might depend on $f_{\theta_j}$ for $j < i$ (i.e. previous holdings/trades), we mean that $X_i$ can be written as a function of the general market information $Y_i$ and previous holdings, i.e. $X_i = g_i\big(Y_i, \{f_{\theta_j}(X_j)\}_{j<i}\big)$ where $g_i$ is some map.
In practice, we have to solve the problem based on samples and an empirical version $\hat l$ of $l$. If we assume $m$ samples $X = [x^{(1)}, \dots, x^{(m)}]$ (where $x^{(i)} \in \mathbb{R}^{q\times n}$), then the problem can be represented as
$$\inf_\theta\, \hat l\big(f_{\theta_1}(x_1), \dots, f_{\theta_n}(x_n)\big)$$
where $x_i := (x^{(1)}_i, \dots, x^{(m)}_i) \in \mathbb{R}^{q\times m}$ and $f_{\theta_i}(x_i) = \big(f_{\theta_i}(x^{(1)}_i), \dots, f_{\theta_i}(x^{(m)}_i)\big) \in \mathbb{R}^{d\times m}$. Note again that samples might depend on previous trading decisions.
Artificial Neural Networks (ANNs) provide a framework for solving such optimization problems. In this section, we wish to explain and discuss different aspects of ANNs in relation to our optimization problem:
1. Explain the architecture of feed-forward ANNs.
2. Briefly explain the universal approximation theorem and the representation benefits of deep ANNs.
3. Explain the backpropagation algorithm.
4. Discuss practicalities of training and working with ANNs (especially concerning the upcoming experiments).
4.1 Architecture
An ANN (sometimes just called an NN) is a parametric family of functions $\{f_\theta \mid \theta \in \Theta\}$, and among many properties, $f_\theta$ can represent any continuous function.

In this thesis, we work with multilayered feed-forward ANNs. We define these ANNs as functions of the form
$$f(x) = h_L \circ h_{L-1} \circ \dots \circ h_1(x) \quad (7)$$
where $h_i: \mathbb{R}^{n_{i-1}} \to \mathbb{R}^{n_i}$ with $h_i(z) = \sigma_i(A_i z + b_i)$. Here we assume $A_i \in \mathbb{R}^{n_i\times n_{i-1}}$, $b_i \in \mathbb{R}^{n_i}$ and $\sigma_i: \mathbb{R} \to \mathbb{R}$, but the $\sigma_i$ are applied element-wise to vectors (and matrices).
Using machine learning lingo, we say that $x \in \mathbb{R}^{n_0}$ is the input layer, $A_i\, h_{i-1}\circ\dots\circ h_1(x) + b_i \in \mathbb{R}^{n_i}$ is the $i$'th hidden layer with $n_i$ units for $i = 1, \dots, L-1$, and $f(x) \in \mathbb{R}^{n_L}$ is the output layer. We furthermore refer to $\sigma_i$ as the activation function in the $i$'th layer.
As a parametric family of functions, we view
θ
as representing all the
A
s and
b
s, which we refer to as trainable
parameters. In contrast, the number of layers, the number of units in each layer and the activation functions are
referred to as untrainable parameters. One could, therefore, view the ANNs as infinitely many parametric families,
but we do not wish to complicate matters unnecessarily.
With this architecture, we use the term shallow ANN for ANNs with only one hidden layer, and we use the term
deep ANN for ANNs with multiple hidden layers.
What is the idea behind using ANNs? For one, ANNs are universal approximators and can (in theory) approximate any continuous function arbitrarily well given enough layers and units (as we will see later). ANNs are also very efficient to evaluate (i.e. calculate $f(x)$ for some $x$) due to their construction as layered activated affine transformations (activated coming from the $\sigma_i$s, which enable non-linear behavior). The simple construction and affine nature also enable efficient evaluation of the ANN for multiple inputs $(x^{(1)}, \dots, x^{(m)}) \in \mathbb{R}^{n_0\times m}$. After defining the collection of inputs as $X := (x^{(1)}, \dots, x^{(m)})$, it is only natural to consider/define
$$h_1(X) := \big(h_1(x^{(1)}), \dots, h_1(x^{(m)})\big) = \sigma_1(A_1X + b_1) \in \mathbb{R}^{n_1\times m}$$
where $b_1$ is added to each column of $A_1X$, and $\sigma_1$ is applied element-wise to the entire matrix. From this we easily obtain
$$h_i \circ \dots \circ h_1(X) = \sigma_i\big(A_i\, h_{i-1}\circ\dots\circ h_1(X) + b_i\big) \in \mathbb{R}^{n_i\times m}, \qquad f_\theta(X) = h_L \circ \dots \circ h_1(X),$$
which are all extremely efficient to compute due to the affine nature of the ANNs.
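As an illustration, the batched forward pass amounts to $L$ matrix products. A minimal NumPy sketch (the layer sizes and tanh activations are our own illustrative choices; the weight scaling anticipates the initialization heuristic of section 4.4):

```python
import numpy as np

def forward(X, params, activations):
    """Batched forward pass of (7): X has shape (n_0, m), one column per sample."""
    Z = X
    for (A, b), sigma in zip(params, activations):
        Z = sigma(A @ Z + b)            # b broadcasts across the m columns
    return Z

rng = np.random.default_rng(1)
sizes = [3, 5, 5, 1]                    # n_0, n_1, n_2, n_L
params = [(rng.standard_normal((n_out, n_in)) / np.sqrt(n_in),
           np.zeros((n_out, 1)))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
activations = [np.tanh, np.tanh, lambda z: z]   # identity in the output layer

X = rng.standard_normal((3, 10))        # batch of m = 10 inputs
out = forward(X, params, activations)
assert out.shape == (1, 10)
```

All $m$ samples pass through the network in one call, which is exactly the $\mathbb{R}^{n_i \times m}$ matrix view above.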
The simple structure and efficiency of evaluations are also crucial for efficiently computing the derivatives of the
loss function w.r.t. the inputs
X
and the parameters
θ
of the ANN. This is enabled by ingenious use of the chain rule,
referred to as backpropagation (or the backpropagation algorithm).
4.2 Universal Representation and Representation Benefits of deep ANNs
In this section, we wish to explain and discuss the aforementioned universal approximation theorem. First, we should
note that there is no single universal approximation theorem, but many theorems that describe neural networks
theoretical ability to approximate different functions. For our explanation, we stick to the one presented by Leshno et
al. [7], which is general enough for our purposes (regarding the applicable activation functions).
As a starting point, we wish to approximate continuous mappings from $\mathbb{R}^{n_0}$ to $\mathbb{R}^{n_L}$. Note, however, that any continuous map $f: \mathbb{R}^{n_0} \to \mathbb{R}^{n_L}$ can be decomposed into $n_L$ functions $f_i: \mathbb{R}^{n_0} \to \mathbb{R}$. This implies that approximating mappings into $\mathbb{R}^{n_L}$ can be broken down into approximating functions that take values in $\mathbb{R}$.

To reduce complexity even further, we wish to study only simple shallow ANNs without activation and bias in the output layer. Such ANNs can be described as functions $f: \mathbb{R}^{n_0} \to \mathbb{R}$ of the form
$$f(x) = A_2\,\sigma(A_1x + b_1)$$
where $A_1 \in \mathbb{R}^{n_1\times n_0}$, $b_1 \in \mathbb{R}^{n_1}$ and $A_2 \in \mathbb{R}^{1\times n_1}$. We can then define the set of shallow ANNs of this form as
$$\Sigma_{n_0} = \left\{ f: \mathbb{R}^{n_0} \to \mathbb{R} \,\middle|\, f(x) = A_2\,\sigma(A_1x + b_1),\ n_1 \in \mathbb{N},\ A_1 \in \mathbb{R}^{n_1\times n_0},\ b_1 \in \mathbb{R}^{n_1},\ A_2 \in \mathbb{R}^{1\times n_1} \right\}.$$
Leshno et al. [7] showed that if $\sigma$ is locally essentially bounded on $\mathbb{R}$ (with the property that the closure of the set of discontinuity points has Lebesgue measure zero), then $\Sigma_{n_0}$ is dense in $C(\mathbb{R}^{n_0})$ if and only if $\sigma$ is not a polynomial almost everywhere.
That $\sigma$ is locally essentially bounded on $\mathbb{R}$ means that $|\sigma(x)|$ is bounded a.e. on $K$ for all compact sets $K \subset \mathbb{R}$. By $\Sigma_{n_0}$ being dense in $C(\mathbb{R}^{n_0})$, we mean that for any continuous function $g \in C(\mathbb{R}^{n_0})$ and every compact set $K \subset \mathbb{R}^{n_0}$, there exists a sequence of functions $\{f_j\}$ in $\Sigma_{n_0}$ such that
$$\lim_{j\to\infty} \inf\big\{ c \,\big|\, \lambda\{x \in K : |g(x) - f_j(x)| \ge c\} = 0 \big\} = 0$$
where $\lambda$ is the Lebesgue measure. This implies that for every continuous function $g$ and every compact set $K$, there exist functions/ANNs in $\Sigma_{n_0}$ that are arbitrarily close (in absolute terms) to $g$ on $K$ (except maybe on a null set).
This result shows that for reasonable choices of activation functions, shallow ANNs can approximate contin-
uous functions arbitrarily well
if
the ANNs have enough units. Of course, the result also holds for deep ANNs since
they are a generalization of shallow ANNs. We are interested in deep ANNs because they have certain representation
benefits over shallow ANNs. If computational resources are limited and/or we cannot use infinitely many units and
layers, then deep ANNs have been empirically shown to be superior to shallow ANNs. This is likely because the
composite structure of the deep ANNs can produce relatively more complexity than the additive structure of shallow
ANNs.
In [
8
], Matus Telgarsky provides a simple example of a continuous function that deep ANNs can approximate
better with limited units than shallow ANNs. Together with the empirical evidence, it is clear that the use of deep
ANNs is justified. However, one should also consider the training procedures’ ability to find optimal trainable
parameters for the ANNs. This is usually harder for deep ANNs than shallow ANNs, which highlights a trade-off:
Precise approximation can more easily be obtained with many layers, but shallow networks are more manageable.
Another practical remark is that even though ANNs can represent any continuous function, we should still
consider helping the ANN by transforming the inputs into more relevant features (if possible). The ANN can (in
theory) do all necessary feature extraction. Still, if we can transform the inputs into something more closely related to
the outputs, then that transformation might decrease the complexity of training the ANN. We will see this in section
5.3.
4.3 Backpropagation
We start by assuming a simplified minimization problem, which depends on only one ANN:
$$\inf_\theta\, \underbrace{\hat l(f_\theta(X))}_{=:\, \hat c} \in \mathbb{R}$$
where we assume that the empirical loss $\hat l$ is based on observations $X = [x^{(1)}, \dots, x^{(m)}] \in \mathbb{R}^{n_0\times m}$.
To solve the minimization problem using ANNs (as our approximator), we need a method to find the optimal trainable parameters/weights of the ANNs. To do this, it is common to deploy a gradient descent algorithm, which updates the weights according to the derivative of the loss w.r.t. the trainable weights. Starting with a small non-zero guess of $A_i$ and $b_i$ for all $i$, we iteratively update the weights according to
$$A_i \leftarrow A_i - \gamma \frac{\partial \hat c}{\partial A_i}, \qquad b_i \leftarrow b_i - \gamma \frac{\partial \hat c}{\partial b_i}$$
where $\gamma > 0$ is the so-called learning rate. Choosing an appropriate sequence of learning rates guarantees (in theory) that we converge to a local minimum. Several improvements can be made to this algorithm; in this thesis, we utilize the ADAM algorithm [5]. However, common to all gradient descent algorithms is the necessity of fast computation of the derivatives of the loss w.r.t. the trainable parameters. This is where the backpropagation algorithm enters the picture.
Our goal is to determine $\frac{\partial \hat c}{\partial \theta_j}$ for all $\theta_j$ in $\theta$, which corresponds to determining $\frac{\partial \hat c}{\partial b_i}$ and $\frac{\partial \hat c}{\partial A_i}$ for all the $A$s and $b$s, which make up the trainable parameters of the ANN.

First, we assume that we know $\frac{\partial \hat c}{\partial f_\theta(X)} \in \mathbb{R}^{n_L\times m}$, which should come naturally from the definition of the empirical loss function $\hat l$.
The ingenious idea behind the backpropagation algorithm is that, given $\frac{\partial \hat c}{\partial f_\theta(X)} \in \mathbb{R}^{n_L\times m}$, we can find the derivatives of $\hat c$ w.r.t. all intermediate values from the computation of $f_\theta$ by utilizing the chain rule. That is, using the chain rule to compute $\frac{\partial \hat c}{\partial h_{i,1}(X)} \in \mathbb{R}^{n_i\times m}$ for all $i$ (where we introduce the notation $h_{i,1}(X) := h_i \circ \dots \circ h_1(X)$). From there, we can again apply the chain rule to find the derivatives of $\hat c$ w.r.t. $A_i$ and $b_i$ for all $i$. These applications of the chain rule can be particularly tedious. Pedersen and Frandsen [6] derive these derivatives in detail, which (in our notation) can be boiled down to the following equations:
$$\begin{aligned}
\frac{\partial \hat c}{\partial (A_i h_{i-1,1}(X) + b_i)} &= \sigma_i'\big(A_i h_{i-1,1}(X) + b_i\big) \odot \frac{\partial \hat c}{\partial h_{i,1}(X)} \in \mathbb{R}^{n_i\times m}\\
\frac{\partial \hat c}{\partial A_i} &= \frac{\partial \hat c}{\partial (A_i h_{i-1,1}(X) + b_i)}\, \big(h_{i-1,1}(X)\big)^\top \in \mathbb{R}^{n_i\times n_{i-1}}\\
\frac{\partial \hat c}{\partial b_i} &= \frac{\partial \hat c}{\partial (A_i h_{i-1,1}(X) + b_i)}\, \mathbf{1}_m \in \mathbb{R}^{n_i}\\
\frac{\partial \hat c}{\partial h_{i-1,1}(X)} &= A_i^\top\, \frac{\partial \hat c}{\partial (A_i h_{i-1,1}(X) + b_i)} \in \mathbb{R}^{n_{i-1}\times m}
\end{aligned}$$
where $\mathbf{1}_m \in \mathbb{R}^m$ is a vector of $1$s and $\odot$ denotes the element-wise product. Looking at the first and last equations, we see that given $\frac{\partial \hat c}{\partial f_\theta(X)} \in \mathbb{R}^{n_L\times m}$ it is possible to iterate backwards through the intermediate calculations of $f_\theta(X)$ to find the derivatives of $\hat c$ w.r.t. the trainable weights. An important observation is that this algorithm requires knowledge of all intermediate values from the calculation of $f_\theta(X)$. To use the algorithm, it is, therefore, necessary to first evaluate $f_\theta(X)$ (which we call a forward pass of the ANN), storing all intermediate values. After that, we do a so-called backward pass of the ANN using the above equations. This is, in its essence, the backpropagation algorithm.
We should note that this algorithm becomes too simplistic when the loss $\hat c$ depends on multiple ANNs that affect each other. This is the case when we wish to find the optimal trading strategy, where each trading decision is represented by a separate ANN and each trading decision (ANN) affects future trading decisions. However, this does not change the backbone of the backpropagation algorithm.
First, assume that we wish to backpropagate through
$$\inf_\theta\, \underbrace{\hat l\big(f_{\theta_1}(x_1), \dots, f_{\theta_n}(x_n)\big)}_{=:\, \hat c}$$
to determine $\frac{\partial \hat c}{\partial \theta_i}$ for all $i$. Remember that $x_i$ might depend on $f_{\theta_j}(x_j)$ for $j < i$. However, we can directly apply the simple backpropagation technique to derive $\frac{\partial \hat c}{\partial \theta_i}$ given $\frac{\partial \hat c}{\partial f_{\theta_i}}$. The challenge is, therefore, to determine $\frac{\partial \hat c}{\partial f_{\theta_i}}$ for all $i$, which might not be straightforward. Note that as $f_{\theta_n}$ is the last trading decision (meaning no further hidden dependencies in the $x$s), it should be possible to determine $\frac{\partial \hat c}{\partial f_{\theta_n}}$ from the definition of the empirical loss function $\hat l$. This enables calculation of $\frac{\partial \hat c}{\partial \theta_n}$ using the simple backpropagation technique. From here, it should be possible to determine $\frac{\partial f_{\theta_n}}{\partial f_{\theta_{n-1}}}$ from the defined connection between $x_n$ and $f_{\theta_{n-1}}(x_{n-1})$. This enables calculation of $\frac{\partial \hat c}{\partial f_{\theta_{n-1}}}$ and then $\frac{\partial \hat c}{\partial \theta_{n-1}}$. At this point, we can continue to iteratively determine $\frac{\partial \hat c}{\partial f_{\theta_i}}$ and then $\frac{\partial \hat c}{\partial \theta_i}$ from $i = n-2$ down to $i = 1$. As a result, we can determine $\frac{\partial \hat c}{\partial \theta_i}$ for all $i$, which is exactly what we wanted. This may seem difficult; luckily, it is easily handled by general automatic adjoint differentiation (AAD) algorithms (as implemented in TensorFlow).
4.4 Implementation and Training Neural Networks
In this thesis, all implementations and training of ANNs are done in Python with TensorFlow. There exists an endless number of training procedures and methods for obtaining faster and more accurate convergence of ANNs. Below is a brief summary of some of the techniques that we utilize in this thesis.
Normalization and Batch normalization
Normalization of input and output data can significantly improve convergence speed and accuracy. The ANN could, in theory, scale all inputs and outputs itself, but pre-processing the data (if possible) can help ensure that some inputs and/or outputs are not overly prioritized over others. Many activation functions also work best if the data lies between -1 and 1. For this reason, we choose to normalize input and output data by subtracting the mean and dividing by the standard deviation of the data.
However, in our implementations, it is not always feasible/practical to pre-process the inputs and outputs since
we are not performing regression but rather solving a minimization problem over hedging strategies. For this reason,
we may choose to use batch normalization. Batch normalization can be viewed as a layer in the ANN, which scales
the data according to the mean and standard deviation. However, the batch normalization layer does not utilize the
mean and standard deviation for the current data passed through the layer but rather a moving average of the mean
and standard deviation of current and previous data.
Batch normalization can even be done between every ordinary layer of the ANN to help increase the stability of
the ANN, which in turn helps convergence of the gradient descent algorithm.
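A minimal sketch of what such a layer computes (our own simplified version: the learnable scale and shift parameters of standard implementations are omitted, and the momentum value is an illustrative choice):

```python
import numpy as np

class BatchNorm:
    """Normalizes each feature across the batch; keeps moving statistics."""

    def __init__(self, n_features, momentum=0.99, eps=1e-5):
        self.mu = np.zeros((n_features, 1))
        self.var = np.ones((n_features, 1))
        self.momentum, self.eps = momentum, eps

    def __call__(self, X, training=True):
        if training:
            mu = X.mean(axis=1, keepdims=True)
            var = X.var(axis=1, keepdims=True)
            # Update the moving averages with the current batch statistics.
            self.mu = self.momentum * self.mu + (1 - self.momentum) * mu
            self.var = self.momentum * self.var + (1 - self.momentum) * var
        else:                      # at test time, use the moving averages
            mu, var = self.mu, self.var
        return (X - mu) / np.sqrt(var + self.eps)

rng = np.random.default_rng(3)
bn = BatchNorm(2)
X = 5.0 + 2.0 * rng.standard_normal((2, 1000))
Y = bn(X, training=True)
assert abs(Y.mean()) < 1e-8 and abs(Y.std() - 1.0) < 1e-2
```

Calling the layer with `training=False` reproduces the behavior described above: the current batch no longer drives the scaling, only the accumulated moving averages do.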
Initialization of trainable parameters
In practice, the resulting optimal trainable parameters and the convergence speed depend heavily on the initialization of the trainable parameters. A common heuristic is to use variance scaling, where we initialize $b_i = 0$ and draw the entries of $A_i$ from independent normal random variables with mean zero and variance $1/n_{i-1}$.
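As a sketch, the heuristic amounts to the following (the helper name is our own):

```python
import numpy as np

def variance_scaling_init(sizes, rng=None):
    """b_i = 0 and A_i ~ N(0, 1/n_{i-1}), independently entrywise."""
    rng = rng or np.random.default_rng(0)
    return [(rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in)),
             np.zeros((n_out, 1)))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

params = variance_scaling_init([100, 50, 1])
A1, b1 = params[0]
assert A1.shape == (50, 100) and np.allclose(b1, 0)
assert abs(A1.var() - 1 / 100) < 2e-3   # empirical variance near 1/n_0
```

Scaling the variance by $1/n_{i-1}$ keeps the typical magnitude of each pre-activation $A_i z + b_i$ roughly independent of the layer width.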
Gradient descent algorithm
We utilize the ADAM algorithm, which has heuristically been shown to provide excellent convergence in terms of speed and accuracy. The ADAM algorithm uses adaptive learning rates for each parameter and momentum to ensure faster convergence, for example, by accelerating learning on a plateau and decelerating learning close to a minimum.

However, we might still decrease the learning rate manually when the algorithm plateaus, and stop the algorithm when it plateaus despite the lowered learning rate. Typically, plateauing is monitored on a validation set, which is separate from the training set. This is done to reduce (the probability of) overfitting. However, we are not too concerned with this since we generate our own training data.
Mini batches
Even though evaluation of an ANN and backpropagation is efficient, we still want to improve convergence speed if
possible. It is, therefore, common to update the parameters based on the gradient calculated on a small batch of the
original data set. This is called mini-batch gradient descent, and it is heuristically shown to improve the gradient
descent algorithm’s convergence significantly. It is common to use a mini-batch size of 32. However, in this thesis,
we choose to use batch sizes of 128-1024 since we utilize the CVaR risk measure as the loss function.
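A sketch of the mini-batch update loop on a toy scalar problem (plain gradient descent for illustration; in our setting the ADAM update and the hedging loss would take the place of the update line and `grad_fn`, whose names are our own):

```python
import numpy as np

def minibatch_sgd(X, grad_fn, theta, batch_size=128, epochs=10, lr=1e-2, seed=0):
    """Plain mini-batch gradient descent (ADAM would replace the update line)."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    for _ in range(epochs):
        # Shuffle the samples and sweep over them in small batches.
        for idx in np.array_split(rng.permutation(m), max(1, m // batch_size)):
            theta = theta - lr * grad_fn(theta, X[:, idx])
    return theta

# Toy problem: minimize E[(theta - x)^2] over samples x; the optimum is the mean.
rng = np.random.default_rng(4)
X = rng.normal(3.0, 1.0, size=(1, 4096))
theta = minibatch_sgd(X, lambda t, xb: 2 * (t - xb).mean(), np.array(0.0),
                      batch_size=128, epochs=50, lr=0.05)
assert abs(theta - X.mean()) < 0.1
```

The larger batch sizes we use for CVaR follow the same loop; a bigger batch simply gives a less noisy estimate of the tail of the loss distribution per update.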
5 Deep Hedging Experiments
In this section, we run multiple hedging experiments to evaluate the performance of the deep hedging approach and
to better understand its pros and cons. For these experiments, we assume that the risky assets have dynamics of
well-established models such as the Black-Scholes model and the Heston model. We also assume a fixed interest rate $r$ in all experiments.
When working in these established frameworks, we can compare the ANN model’s performance to standard
delta hedging. We measure performance by looking at discounted PnLs calculated on out-of-sample hedges.
In all of our experiments, we have $\pi(0) = 0$, i.e. the trader cannot gain extra risk performance by trading the market with no initial portfolio value. This is a result of our choice of risk measure, $\mathrm{CVaR}_{0.95}$, and the fact that all assets have moderate drift rates. This implies that the indifference price is $p(Z) = \pi(\tilde Z)$. As we will see later, the interesting results come from the trading strategies and not the indifference prices. For this reason, we are not too concerned with indifference prices.
The different experiments:
1.
Hedging a European call option in a one dimensional Black-Scholes model with and without transaction costs.
We include an analysis of the stability of the deep hedging models when training with the wrong volatility.
2.
Hedging a portfolio of put and call options in a four-dimensional Black-Scholes model without transaction
costs. We include an analysis of the stability of the deep hedging model when training with the wrong
correlation matrix.
3. Hedging a down-and-out call option in a one-dimensional Black-Scholes model.
4. Hedging a call option in a Heston model by only trading the underlying asset.
5.1 Black-Scholes with 1 asset and a simple claim
5.1.1 No transaction costs
In this experiment, we assume that we can trade one risky asset $S$ that is driven by a Black-Scholes model. This implies that $S$ has the following dynamics under the real-world measure $P$:
$$dS(t)/S(t) = \mu\, dt + \sigma\, dW^P(t).$$
The price of the risky asset $S$ is a geometric Brownian motion and has, for $t > s$, the conditional solution
$$S(t) = S(s)\, e^{\left(\mu - \frac{\sigma^2}{2}\right)(t-s) + \sigma\left(W^P(t) - W^P(s)\right)} \overset{d}{=} S(s)\, e^{\left(\mu - \frac{\sigma^2}{2}\right)(t-s) + \sigma\sqrt{t-s}\, Z}$$
where $Z \sim N(0,1)$. Hence, we can sample from the distribution of $S$ without error.
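Because the conditional law is known in closed form, paths can be simulated exactly on the hedge grid with no discretization bias. A sketch (the function name is our own; the parameter values match the experimental setup in this section):

```python
import numpy as np

def simulate_gbm_paths(s0, mu, sigma, T, n_steps, n_paths, seed=0):
    """Exact GBM sampling on an equidistant grid via the closed-form solution."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    Z = rng.standard_normal((n_paths, n_steps))
    increments = (mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * Z
    log_paths = np.log(s0) + np.concatenate(
        [np.zeros((n_paths, 1)), np.cumsum(increments, axis=1)], axis=1)
    return np.exp(log_paths)

paths = simulate_gbm_paths(s0=1.0, mu=0.05, sigma=0.3, T=3 / 12, n_steps=60,
                           n_paths=100_000)
# Monte Carlo check of the known first moment E[S(T)] = s0 * exp(mu * T).
assert abs(paths[:, -1].mean() - np.exp(0.05 * 3 / 12)) < 5e-3
```

The returned array has one row per path and $n_{\text{steps}} + 1$ columns, matching the 60 equidistant hedge points plus the initial time.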
For this experiment, we wish to hedge a single call option over 60 hedge points and with no transaction costs. For completeness, we include the price and delta of a European call option below.

Proposition 5.1 (Prop. 7.10 in [9]). The price of a European call option at time $t$ with maturity $T$ and strike $K$ is given by
$$C(S(t), t) = S(t)\Phi(d_1) - e^{-r(T-t)}K\Phi(d_2)$$
where $\Phi$ is the cumulative distribution function of a standard normal random variable and
$$d_1 = \frac{1}{\sigma\sqrt{T-t}}\left[\ln\frac{S(t)}{K} + \left(r + \frac{\sigma^2}{2}\right)(T-t)\right], \qquad d_2 = d_1 - \sigma\sqrt{T-t}. \quad (8)$$
Proposition 5.2 (Prop. 9.5 in [9]). The delta of a European call option at time $t$ with maturity $T$ and strike $K$ is given by
$$\Delta^{BS} = \frac{\partial C(S(t), t)}{\partial S(t)} = \Phi(d_1)$$
where $\Phi$ is the cumulative distribution function of a standard normal random variable and $d_1$ is as in equation (8).

In this experiment, we refer to delta hedging as the analytical approach. Note that the delta hedging strategy invests $\Delta^{BS}$ in $S$ at every trading opportunity. In this simple experiment, we know that delta hedging can yield an arbitrarily low hedge error given enough hedge points.
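Both propositions translate directly into code; the sketch below (helper names are our own) reproduces the initial portfolio value 0.06216 used in the hedge test of this section:

```python
from math import erf, exp, log, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call_price_delta(s, k, r, sigma, tau):
    """Black-Scholes price and delta of a European call, tau = T - t."""
    d1 = (log(s / k) + (r + 0.5 * sigma ** 2) * tau) / (sigma * sqrt(tau))
    d2 = d1 - sigma * sqrt(tau)
    price = s * norm_cdf(d1) - exp(-r * tau) * k * norm_cdf(d2)
    return price, norm_cdf(d1)

# The ATM call of this experiment: S(0) = 1, K = 1, r = 0.02, sigma = 0.3, T = 3/12.
price, delta = bs_call_price_delta(1.0, 1.0, 0.02, 0.3, 3 / 12)
assert abs(price - 0.06216) < 1e-4
assert 0.0 < delta < 1.0
```

This is the benchmark the deep hedging models are measured against: the analytical strategy simply holds `delta` units of $S$ at each hedge point.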
The setup for the experiment is:

Model: One-dimensional Black-Scholes.
Model parameters: $S(0) = 1$, $\mu = 0.05$, $\sigma = 0.3$, $r = 0.02$.
Option: Type: Call, Strike: $K = 1$ (ATM), Maturity: $T = 3/12$.
Hedging: Hedge points: 60 (equidistant), Transaction costs: 0.
Hedge strategies: Standard delta hedging (analytical), deep hedging with MSE loss (ANN-MSE) and deep hedging with risk measure $\mathrm{CVaR}_{0.95}$ (ANN-CVaR).
ANN architecture (for each trading decision): Layers: 4, Units: 5, Input: $\tilde S_k \in \mathbb{R}$ at time $t_k$, Output: $\delta_k \in \mathbb{R}$ (holdings in $S$) from time $t_k$ to $t_{k+1}$.
ANN training: Training samples: $2^{18}$, Batch size: 1024, Epochs: 100 with learning rate reduction and early stopping.
Hedge test: Test samples: 50,000 (independent of training), Initial portfolio value: 0.06216 (the actual option value).

Performance measurement: To measure performance, we utilize the following measures:

Average absolute PnL: Calculated as the average of the absolute value of the PnLs from the 50,000 test samples. A low value is preferable as it indicates precise replication of the option.
PnL standard deviation: Calculated as the standard deviation of the 50,000 PnLs from the test samples. A low value is preferable as it indicates stability and more precise replication.
$\mathrm{CVaR}_{0.95}$: Calculated as the empirical $\mathrm{CVaR}_{0.95}$ based on the 50,000 PnLs from the test samples. A low value is preferable as it indicates lower downside tail risk.
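The three measures are simple sample statistics of the out-of-sample PnLs. A sketch (our own helper; the empirical CVaR is computed as the average of the worst 5% of PnLs, i.e. the tail of the loss $-\mathrm{PnL}$):

```python
import numpy as np

def hedge_metrics(pnls, alpha=0.95):
    """Average |PnL|, PnL standard deviation and empirical CVaR_alpha of -PnL."""
    pnls = np.asarray(pnls, dtype=float)
    losses = np.sort(-pnls)                       # losses, ascending
    n_tail = max(1, int(np.ceil((1 - alpha) * len(losses))))
    return {
        "avg_abs_pnl": np.abs(pnls).mean(),
        "pnl_std": pnls.std(ddof=1),
        "cvar": losses[-n_tail:].mean(),          # mean of the worst tail
    }

# Illustration on synthetic Gaussian PnLs; the thesis' PnLs come from hedging.
rng = np.random.default_rng(5)
m = hedge_metrics(rng.normal(0.0, 0.00665, size=50_000))
assert m["cvar"] > m["pnl_std"] > 0               # tail mean exceeds one sigma
```

Note that for a roughly Gaussian PnL, the empirical $\mathrm{CVaR}_{0.95}$ lands near twice the PnL standard deviation, which is consistent with the magnitudes reported in table 5.1.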
Metric          Strategy          Value (Standard Error)    % of option price
Option price                      0.06216
p_0             ANN-MSE           FIXED
p_0             ANN-CVaR_0.95     0.07560                   121.615%
Avg. abs. PnL   Analytical        0.00498 (0.000020)        8.017%
Avg. abs. PnL   ANN-MSE           0.00502 (0.000020)        8.072%
Avg. abs. PnL   ANN-CVaR_0.95     0.00589 (0.000021)        9.482%
PnL Std.        Analytical        0.00665                   10.695%
PnL Std.        ANN-MSE           0.00668                   10.745%
PnL Std.        ANN-CVaR_0.95     0.00758                   12.188%
CVaR_0.95       Analytical        0.01545                   24.847%
CVaR_0.95       ANN-MSE           0.01539                   24.753%
CVaR_0.95       ANN-CVaR_0.95     0.01347                   21.675%

Table 5.1: Results of the hedge experiment over 50,000 out-of-sample trials. The experiment involves hedging an ATM call option over 60 hedge points in a single-asset Black-Scholes model without transaction costs.
This experiment aims to showcase the performance of the deep hedging approach compared to standard delta hedging
and to illustrate the effect of minimizing tail risk over mean square error.
To ensure a fair performance measurement, we choose to fix
p0
to the actual option price for the ANN-MSE
model during training. We do this since we fix the initial portfolio value for all strategies to the actual option price
when testing, and the ANN-MSE model is particularly sensitive to its initial portfolio value. From section 3.2, we
know that optimizing the ANN-CVaR model does not depend on the initial portfolio value. When training the
ANN-CVaR model, we, therefore, set the initial portfolio value to zero and calculate
p0
using the indifference price
as in equation (5). However, to reiterate, when
testing
the three different strategies, all strategies start with an initial
portfolio value equal to the actual option price.
In this experiment, we do not focus on training times, as we are more concerned with the models’ capabilities.
However, for this particular experiment, we train a deep hedging model for approximately 15 minutes. This does, of
course, depend heavily on the number of hedge points, the number of training samples, the architecture, computational
resources (CPU vs GPU and available RAM), batch size, epochs, learning rate schedule and settings for early
stopping. We should also note that our experienced training times will be relatively slow since our implemented
framework contains many redundancies in its architecture from other experiments.
The results of the experiment can be seen in table 5.1 and figures 5.1, 5.2, 5.3 and 5.4. In figure 5.1, the PnL and portfolio values are illustrated for the 50,000 hedge trials and for all three approaches: analytical, ANN-MSE and ANN-CVaR. We notice that even the analytical delta hedging approach is not perfect, which is expected since the experiment uses only 60 hedge points. At first glance, the PnLs for the analytical approach and the ANN-MSE model look quite similar. In comparison, the PnLs for the ANN-CVaR model are slightly different, yet promising. A visual assessment of the PnLs therefore indicates that the ANN models have effectively learned how to hedge the call option.
Looking at table 5.1, we get a better sense of the quantitative performance of the different models. It is clear that the ANN-MSE model is very close to the analytical strategy. On average absolute PnL, PnL standard deviation and empirical CVaR alike, the ANN-MSE model's performance is only slightly worse than the analytical approach's. For example, the average absolute PnL is 0.00498 for the analytical approach and 0.00502 for the ANN-MSE model, a difference of only 0.8%.
Looking at the performance of the ANN-CVaR model, it is clear that the model is outperformed by the two
other approaches, based on the metrics average absolute PnL and PnL standard deviation. For example, the average
absolute PnL is 0.00589 for the ANN-CVaR model, which is 18.3% more than the analytical approach. However,
when we look at the empirical CVaR on the test set, it is evident that the ANN-CVaR model outperforms the two
other approaches. Using this metric, the CVaR of the analytical approach (0.01545) is 14.7% higher than that of the ANN-CVaR model (0.01347). It is not surprising that the ANN-CVaR model performs well when measured on CVaR, since the model was trained to be optimal under this metric. However, this shows that there exist alternative, learnable trading strategies that focus on reducing the tail risk of the PnL.

Figure 5.1: Portfolio values vs option payoff ((a), (c), (e)) and PnLs ((b), (d), (f)) for the analytical strategy, the ANN optimized with MSE and the ANN optimized with CVaR_0.95. Model: Black-Scholes model without transaction costs. Option: ATM call option.

Figure 5.2: Holdings in S across time for two different sample paths ((a) sample 1, (b) sample 2). Model: Black-Scholes model without transaction costs. Option: ATM call option.

Figure 5.3: Holdings in S at time t = 0.125. Model: Black-Scholes model without transaction costs. Option: ATM call option.

Figure 5.4: Out-of-sample PnL distribution, ANN models vs the analytical strategy ((a) ANN optimized with MSE, (b) ANN optimized with CVaR_0.95). Model: Black-Scholes model without transaction costs. Option: ATM call option.
In figure 5.2, we can see the holdings in the underlying risky asset over time for two test samples. In these two
samples, we observe that the analytical approach and the ANN-MSE model employ virtually the same trading strategy.
This is to be expected since their performance metrics were similar. However, we should note that delta hedging and
an MSE-optimal strategy will generally not produce the same trading strategy. See for example appendix A.1. Still,
they might be pretty close, as we have observed. We also observe that the ANN-CVaR model follows a different
trading strategy compared to the two other approaches. It seems like the ANN-CVaR model is more conservative
than the two other models. This is also partially confirmed in figure 5.3, which shows that the ANN-CVaR model
employs a flatter trading strategy compared to the option delta at time t= 0.125.
In figure 5.4, we see the empirical PnL distribution of the two ANN models compared to the analytical approach.
The empirical PnL distributions strengthen the conclusions from our previous analysis. We again see that the ANN-MSE model manages to imitate the PnL distribution of the analytical approach, whereas the PnL distribution of the ANN-CVaR model is skewed, resulting in less downside tail risk.
Going back to table 5.1, we observe that the indifference price p_0 for the ANN-CVaR model is 0.07560, which is 21.615% higher than the actual option price. This price reflects the fair price needed to obtain an empirical CVaR of 0. Remember that π(0) = 0. Note that this is not in conflict with proposition 3.3 since the option is not reachable with only 60 hedge points.
One might criticize the experiment because we have only trained one ANN model for each approach, one
for MSE and one for CVaR. However, we find that the ANN models are pretty stable with the current training
procedure, at least with only one tradable asset. We would also expect to see significantly higher PnL standard
deviations if the models were unstable. On that note, we have seen that the PnL standard deviation is comparable to
the analytical approach for both ANN models. Therefore, for most experiments, we will not worry about ANN stability (the exceptions are sections 5.2 and 7.1).
5.1.2 Training on wrong volatility and the Fundamental Theorem of Derivative Trading
We have showcased that the ANN models can solve a simple trading problem in the Black-Scholes model without
transactions costs. Before moving on to experiments with non-zero transaction costs, it would be interesting to
evaluate the stability of the learned trading strategies. In general, we (traders) cannot accept huge unexpected drops in hedging performance if the actual dynamics of S deviate from the ones used during training.
In our analysis, we consider deviations in volatility. This is a well-researched topic, and a well-known result is
the so-called Fundamental Theorem of Derivative Trading (see [10]). The theorem is discussed below:
Under both the P- and Q-measure, we assume that S follows the dynamics

dS(t)/S(t) = σ(t) dW(t)

where (σ(t)) is some random process, i.e. we assume no interest rate, no drift under P and no dividends. We assume to be selling an option with expiry T at implied volatility σ_H. We also choose to Black-Scholes delta hedge the option using hedge volatility σ_H. The Fundamental Theorem of Derivative Trading (FTDT) then states that our PnL at expiry is

PnL_T = (1/2) ∫_0^T Γ(t) S(t)² (σ_H² − σ²(t)) dt. (9)
If we assume that the option payoff is convex (i.e. positive gamma), then the FTDT shows that we make money when realized volatility σ(t) is low relative to σ_H and lose money when realized volatility is high. Note that the sign is flipped compared to the usual version of the theorem since we are selling the option. The FTDT also shows that hedging with the wrong delta has a bleeding effect on our PnL since equation (9) is not a stochastic integral. This is (somewhat) reassuring, but this result does not hold when hedging with our ANN models.
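The sign prediction of the FTDT is easy to verify numerically. The sketch below sells a call at implied volatility σ_H = 0.3 and delta hedges it with that same (wrong) volatility while the simulated paths realize σ = 0.2, with µ = r = 0 as in this section. With positive gamma and σ_H above the realized volatility, the seller should profit on average. The script is our own illustrative reconstruction, not the thesis code.

```python
import numpy as np
from statistics import NormalDist

ncdf = np.vectorize(NormalDist().cdf)  # vectorized standard normal cdf

def bs_call(S, K, tau, sigma):
    # Black-Scholes call price with r = 0, matching the section's setup
    d1 = (np.log(S / K) + 0.5 * sigma**2 * tau) / (sigma * np.sqrt(tau))
    return S * ncdf(d1) - K * ncdf(d1 - sigma * np.sqrt(tau))

def bs_delta(S, K, tau, sigma):
    return ncdf((np.log(S / K) + 0.5 * sigma**2 * tau) / (sigma * np.sqrt(tau)))

rng = np.random.default_rng(1)
S0, K, T, n_steps, n_paths = 1.0, 1.0, 0.25, 60, 2000
sigma_real, sigma_hedge = 0.2, 0.3   # realized vol below the hedge vol

# simulate GBM paths under the realized volatility (mu = r = 0)
dt = T / n_steps
Z = rng.standard_normal((n_paths, n_steps))
S = np.empty((n_paths, n_steps + 1))
S[:, 0] = S0
for k in range(n_steps):
    S[:, k + 1] = S[:, k] * np.exp(-0.5 * sigma_real**2 * dt
                                   + sigma_real * np.sqrt(dt) * Z[:, k])

# sell the call at implied vol sigma_hedge, then delta hedge with sigma_hedge
cash = np.full(n_paths, bs_call(S0, K, T, sigma_hedge))
delta = np.zeros(n_paths)
for k in range(n_steps):
    new_delta = bs_delta(S[:, k], K, T - k * dt, sigma_hedge)
    cash -= (new_delta - delta) * S[:, k]   # rebalance holdings
    delta = new_delta
pnl = cash + delta * S[:, -1] - np.maximum(S[:, -1] - K, 0.0)
```

With these parameters the average PnL comes out clearly positive, as equation (9) predicts for a seller hedging above the realized volatility.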
The idea of this section and experiment is to analyze the effect on PnL (in terms of average PnL, average absolute PnL and empirical CVaR) when the ANN models are trained with the wrong volatility. To see a potential pitfall for the ANN models, we can look at figure 5.3, which shows the trading strategies of the ANNs compared to the delta of the option at a fixed time point. We see that for extreme values of the current spot, the ANN models diverge or exhibit erratic behaviour. This is likely because the models have not seen many extreme examples during training. Therefore, we would expect to see abnormal behaviour in our experiment if the ANN models are not trained on enough samples and/or if the underlying asset's volatility is significantly different from the one used in training.
The experiment is as follows:
- We hedge a call option with strike K = 1 and maturity T = 3/12 over 60 equidistant trading days without transaction costs.
- We train two ANN models (one on MSE and one on CVaR, as in section 5.1.1) to hedge the call option on 2^18 training samples from a Black-Scholes model with S(0) = 1, µ = r = 0 and σ = 0.3.
- We run an out-of-sample hedging test on 10,000 samples where a similar Black-Scholes model governs S but with σ varying from 0.2 to 0.6. In this test, the initial portfolio value is chosen as the option price in the model used for training the ANNs (i.e. with σ = 0.3).
- As a benchmark, we also run the test with a Black-Scholes delta hedging strategy with volatility σ = 0.3, i.e. the same model as the one used for training the ANNs.
The results of the experiment can be seen in figure 5.5.
In figure 5.5 (a), we see the average PnL for the three strategies across different test volatilities. All models/strategies are seemingly identical on average PnL, and they all decrease linearly. In figure 5.5 (b), we have subtracted the average PnL of the analytical approach (delta hedging with σ_H = 0.3) to magnify the differences. We see from the scale that the differences are indeed small (on the order of 0.0001 compared to values of 0.02 to −0.06). However, it is noteworthy that the ANN-MSE model seems to diverge around σ = 0.4. We have previously seen that the hedging strategy produced by the ANN-MSE model is close to the analytical approach. Therefore, divergence in the average PnL signifies that the ANN-MSE model is struggling, likely because of extreme and unseen values of S.
In figure 5.5 (c), we see the average absolute PnL of the three strategies. As expected, the average absolute PnL is minimized for σ = σ_H = 0.3, which makes sense since the average PnL was linear and decreasing. Again, all three strategies are quite close, but the ANN-CVaR model lags behind when σ is close to σ = 0.3. This is confirmed by looking at the differences to the average absolute PnL of the analytical strategy in figure 5.5 (d). We postulate that this is because strategies become relatively more similar when viewed from a model with much higher or lower volatility. Another interesting observation is that the ANN-MSE model again diverges around σ = 0.4, which we still suspect is due to extreme values of S.
In figure 5.5 (e), we see the empirical CVaR of the PnLs, and in figure 5.5 (f), we see the differences to the analytical approach. The interesting observation is that the ANN-CVaR model gets relatively better than the other strategies when σ increases. However, when σ is low (compared to σ_H = 0.3), the ANN-CVaR model loses out to the analytical approach and the ANN-MSE model. This is quite concerning because it shows that a strategy that is CVaR optimal does not retain its (relative) CVaR optimality when the real volatility is lower than expected, even when compared to the MSE-optimal strategy.
5.1.3 0.5% transaction costs (Black-Scholes 1 asset)
In this experiment, we wish to analyze the effects of non-zero transaction costs. To do this, we repeat the previous
experiment but now add proportional transaction costs of 0.5%.
23
(a) Average PnL (b) Average PnL difference from analytical.
(c) Average absolute PnL (d) Average absolute PnL difference from analytical.
(low = good)
(e) Empirical CV aR0.95
(f) Empirical C V aR0.95 difference from analytical.
(low = good)
Figure 5.5: Average PnL, average absolute PnL and
CV aR0.95
across varying volatilities for ANN models
trained on
σ= 0.3
and Black-Scholes delta hedging with
σ= 0.3
(analytical). (a), (c) and (e) show average
PnL, average absolute PnL and
CV aR0.95
, respectively. (b), (d), (f) show the respective differences to that
of the analytical approach. No transaction costs.
24
Metric                   Strategy       Value (Standard Error)   % of option price
Option Price                            0.06216
Model p0                 ANN MSE        FIXED
                         ANN CVaR0.95   0.08852                  125.275%
Avg. abs. PnL            Analytical     0.01493 (3.73e-05)       24.022%
                         ANN MSE        0.01104 (3.57e-05)       17.764%
                         ANN CVaR0.95   0.01199 (3.02e-05)       19.284%
PnL Std.                 Analytical     0.00843                  13.561%
                         ANN MSE        0.00965                  15.522%
                         ANN CVaR0.95   0.00863                  13.888%
CVaR0.95                 Analytical     0.03672                  59.063%
                         ANN MSE        0.03183                  51.209%
                         ANN CVaR0.95   0.02641                  42.482%
Avg. turnover            Analytical     2.98644 (0.00402)
                         ANN MSE        1.94404 (0.00212)
                         ANN CVaR0.95   2.16006 (0.00229)
Avg. transaction costs   Analytical     0.01493 (2.00e-05)       24.014%
                         ANN MSE        0.00973 (1.06e-05)       15.647%
                         ANN CVaR0.95   0.01080 (1.14e-05)       17.375%

Table 5.2: Results of hedge experiment over 50,000 out-of-sample trials. The experiment involves hedging an ATM call option with 60 hedge points in a single-asset Black-Scholes model with 0.5% transaction costs.
Important change: To fully leverage the ANN models, we augment the input to the ANN models with the current holdings in the tradable asset. Hopefully, the ANN models will use this information to adjust their trading strategies to take current holdings and transaction costs into account.
We still consider the standard delta hedging strategy and the two ANN models, one using MSE and the other using CVaR. The two ANN models are again trained on 2^18 samples, and all three models are tested on 50,000 independent samples. For the test, all strategies are initialized with an initial portfolio value equal to the actual option price when not considering transaction costs. To further analyze the test results, we record the average accumulated transaction costs and average turnover. To calculate the turnover, we sum the absolute number of units traded at every trading opportunity, not including T, i.e.

Turnover_i = Σ_{k=0}^{N−1} |δ^i_k − δ^i_{k−1}|

is the turnover for asset i. The results of this experiment can be seen in table 5.2 and in figures 5.6 and 5.7.
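A minimal implementation of this turnover definition, assuming the position before the first trade is zero (so the initial purchase counts as traded units; that boundary convention is our assumption):

```python
import numpy as np

def turnover(deltas, initial_holding=0.0):
    """Sum of absolute units traded at trading dates t_0, ..., t_{N-1}.

    `deltas` holds the positions chosen at each trading date; the first
    trade is measured against `initial_holding` (assumed zero here).
    """
    trades = np.diff(np.asarray(deltas, dtype=float), prepend=initial_holding)
    return float(np.abs(trades).sum())
```

For example, the position path 0.5 → 0.7 → 0.4 trades 0.5 + 0.2 + 0.3 = 1.0 units in total.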
In table 5.2, we observe that both ANN models outperform the analytical delta hedging approach. On average absolute PnL, the ANN-MSE model is the best: it obtains 0.01104, which is 26.05% and 7.92% lower than the analytical approach and the ANN-CVaR model, respectively. However, the ANN-CVaR model is, once again, the best-performing model when considering empirical CVaR. The ANN-CVaR model obtains an empirical CVaR of 0.02641, which is 28.08% and 17.03% lower than the analytical approach and the ANN-MSE model, respectively. From table 5.2, we also observe that both ANN models have lower average turnover and average transaction costs, which shows that the ANN models trade less aggressively than the analytical approach and explains the increased performance. This is also what we observe in figure 5.6, where we see the holdings in the risky asset S over time for two samples. It is clear that both ANN models deploy less aggressive trading strategies to avoid unnecessarily large transaction costs. However, in table 5.2, we can also observe that both ANN models have a higher PnL standard deviation than the analytical approach.
Figure 5.6: Holdings in S across time for two different sample paths ((a) Sample 1, (b) Sample 2). Model: Black-Scholes model with 0.5% proportional transaction costs. Option: ATM call option.

Figure 5.7: Out-of-sample PnL distribution, ANN models vs. analytical strategy ((a) ANN optimized with MSE, (b) ANN optimized with CVaR0.95). Model: Black-Scholes model with 0.5% proportional transaction costs. Option: ATM call option.
In figure 5.7, we see the PnL distributions for the two ANN models compared to that of the analytical approach.
We observe that both ANN models have managed (through less aggressive trading) to shift the PnL distributions in
the favourable direction. Once again, the ANN-CVaR model has adopted a trading strategy that creates a skewed PnL
distribution, which decreases the downside tail risk.
In conclusion, it is clear that the ANN models can find trading strategies superior (on every metric) to classic delta hedging. To reiterate, the ANN models have no information about the model, the transaction costs or the option. This is a promising sign.
5.2 Black Scholes with multiple assets and no transaction costs
For this experiment, we step up the complexity of the model by introducing more tradable assets. The idea is to see if
our ANN model setup can trade multiple assets by only observing the total PnL of the entire portfolio.
In this experiment, we assume that we are selling an option portfolio of 10 put and call options on four different assets. We assume that the four assets come from a four-dimensional Black-Scholes model. The multidimensional Black-Scholes model with n assets has dynamics

dS(t)/S(t) = µ dt + σ ∘ C dW^P(t)

where µ, σ ∈ R^n are the vectors of individual drifts and volatilities, C ∈ R^{n×n} is a correlation matrix, W^P(t) ∈ R^n is a vector of n independent Brownian motions, and where "∘" denotes element-wise multiplication between vectors. We still assume that we can trade in a locally risk-free asset with constant interest rate, i.e. dB(t)/B(t) = r dt.
Since our portfolio consists of put and call options on individual assets, the price of the option portfolio is simply the weighted sum of individual option prices calculated using the Black-Scholes formula (see proposition 5.1) and put-call parity in their respective one-dimensional models. Similarly, the analytical delta hedging strategy is the collection of individual delta hedging strategies for all the options.
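Sample paths from this multidimensional model can be generated by correlating the Brownian increments through a Cholesky factor of the correlation matrix. The sketch below uses the parameters of the experiment listed next; the function name and vectorized layout are our own choices, not the thesis implementation.

```python
import numpy as np

def simulate_multi_bs(S0, mu, sigma, C, T, n_steps, n_paths, seed=0):
    """Simulate the multivariate Black-Scholes model on a time grid.
    Correlated Brownian increments are obtained from the Cholesky factor
    of the correlation matrix C; each asset then follows its own GBM."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(C)          # C = L @ L.T
    dt = T / n_steps
    n = len(S0)
    S = np.empty((n_paths, n_steps + 1, n))
    S[:, 0] = S0
    for k in range(n_steps):
        dW = np.sqrt(dt) * rng.standard_normal((n_paths, n)) @ L.T
        S[:, k + 1] = S[:, k] * np.exp((mu - 0.5 * sigma**2) * dt + sigma * dW)
    return S

# parameters of the experiment below
S0 = np.ones(4)
mu = np.array([0.03, 0.08, 0.04, 0.08])
sigma = np.array([0.25, 0.14, 0.09, 0.07])
C = np.array([[1.0, 0.673292, 0.69783, 0.732783],
              [0.673292, 1.0, 0.787874, 0.0763763],
              [0.69783, 0.787874, 1.0, 0.317681],
              [0.732783, 0.0763763, 0.317681, 1.0]])
paths = simulate_multi_bs(S0, mu, sigma, C, T=3/12, n_steps=60, n_paths=500)
```

The same generator (with 2^18 paths) would produce the training set for the deep hedging models.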
The setup for the experiment is:
- Model: Four-dimensional Black-Scholes.
- Model parameters: S(0) = (1, 1, 1, 1)^T, µ = (0.03, 0.08, 0.04, 0.08)^T, σ = (0.25, 0.14, 0.09, 0.07)^T, r = 0.02 and correlation matrix C as in equation (10).
- Option: Portfolio of 10 call and put options with maturity T = 3/12. See table 5.3.
- Hedging: Hedge points: 60 (equidistant), Transaction costs: 0.
- Hedge strategies: Standard delta hedging and deep hedging with risk measure CVaR0.95.
- ANN architecture (for each trading decision): Layers: 4, Units: 10, Input: S̃_k ∈ R^4 at time t_k, Output: δ_k ∈ R^4 (holdings in S) from time t_k to t_{k+1}.
- ANN training: Training samples: 2^18, Batch size: 128, Epochs: 100 with learning rate reduction and early stopping.
- Hedge test: Test samples: 50,000 (independent of training), Initial portfolio value: 0.50967 (actual option value).
The correlation matrix we utilize for this experiment is

C = ( 1          0.673292   0.69783    0.732783
      0.673292   1          0.787874   0.0763763
      0.69783    0.787874   1          0.317681
      0.732783   0.0763763  0.317681   1         ).  (10)
This experiment aims to showcase the performance of the deep hedging approach compared to standard delta hedging in the case of multiple tradable assets and multiple options. Even though there are no transaction costs, we expect/hope that the deep hedging model finds a trading strategy that utilizes the correlation between assets.
Type Underlying Strike Units
call 4 1.1 0.43
call 3 1.06 -0.77
put 1 0.98 -0.9
call 4 1.03 -0.54
call 4 1.04 -0.23
call 3 0.99 -0.79
put 1 0.78 1.78
call 2 0.81 0.64
call 4 0.8 1.9
put 1 0.97 1.69
Table 5.3: Option portfolio for the experiment in the multidimensional Black-Scholes model. Underlying refers to the index (starting at 1) of the risky asset. Units refers to the number of options in the portfolio, which we imagine ourselves to be selling.
Metric          Strategy       Value (Standard Error)                 % of option price
Option Price                   0.50967
Model p0        ANN CVaR0.95   0.51839
Avg. abs. PnL   Analytical     0.00360 (1.49e-05)                     0.707%
                ANN CVaR0.95   0.01564 (5.22e-05)                     3.069%
PnL std.        Analytical     0.00491                                0.963%
                ANN CVaR0.95   0.01318                                2.585%
CVaR0.95        Analytical     0.01162                                2.280%
                ANN CVaR0.95   0.00952                                1.868%
Avg. turnover   Analytical     (2.65478, 0.64926, 3.37903, 3.37182)
pr. asset       ANN CVaR0.95   (4.33312, 3.18385, 5.22738, 8.41255)

Table 5.4: Results of hedge experiment over 50,000 out-of-sample trials. The experiment involves hedging an option portfolio with 10 put and call options over 60 hedge points in a multidimensional Black-Scholes model without transaction costs.
The results of the experiment can be seen in table 5.4 and figure 5.8.
In figure 5.8, we see that the PnL distribution of the ANN-CVaR model is significantly different from that
of the analytical strategy. The PnL distribution of the ANN-CVaR model is shifted towards higher PnLs, implying that
the ANN model has learned a strategy that for a large portion of samples obtains a higher PnL. This is encouraging.
However, we also observe that the PnL distribution of the ANN-CVaR model has a much higher variance compared
to the analytical approach, which is undesirable.
In table 5.4, we observe that the ANN-CVaR model performs best on empirical CVaR, where it obtains 0.00952, which is 18.07% lower than the analytical approach. However, we can also see that the ANN-CVaR model performs significantly worse on average absolute PnL and PnL standard deviation than the analytical approach. On average absolute PnL, the ANN-CVaR model obtained 0.01564, which is 334.44% higher than the analytical approach. On PnL standard deviation, the ANN-CVaR model obtained 0.01318, which is 168.43% higher than the analytical approach. This suggests that the ANN-CVaR model has learned a significantly different hedging strategy from the analytical approach. This is confirmed when we look at the turnover, which is between 1.5 and 5 times higher for the ANN-CVaR model than for the analytical approach. This suggests that the ANN-CVaR model is unstable and exploits the correlation of the assets, and it might even suggest that the model is poorly trained.
Figure 5.8: Out-of-sample PnL distribution. ANN model using CVaR vs. analytical strategy. Model: four-dimensional Black-Scholes model without transaction costs. Option portfolio: see table 5.3.
It might seem odd to include an unstable and poorly trained model. However, we found that the ANN model was significantly harder to train in this experiment. When repeating the experiment, we have also found that some ANN models have turnovers up to 10 times that of the analytical approach. We have chosen to include these strange results to showcase the difficulties of utilizing deep hedging methods in a multi-asset scenario.
As mentioned, we suspect that the instability is a product of no transaction costs and the ANN model trying to
exploit the correlation matrix (that is close to being singular). We, therefore, test the stability of the ANN models
under shocks to the correlation matrix in the next section.
5.2.1 Training with the wrong correlation
Although the ANN-CVaR model outperformed the analytical delta hedging strategy, on empirical CVaR in the
previous experiment with a multidimensional Black-Scholes model, one might suspect that the ANN model relies too
heavily on the correlation matrix. This is a problem in practice where estimating correlation accurately is virtually
impossible.
This is the exact problem an investor faces when trying to utilize mean-variance analysis (Markowitz). The
problem is that the obtained optimal portfolio is empirically unstable when considering minor changes to the return
and/or covariance matrix since the covariance matrix is often close to being singular. We suspect that the trading
strategy learned by the ANN model might face similar difficulties due to the performance and exaggerated turnover
seen in the previous experiment (see table 5.4).
To test the stability of the ANN models when faced with changes in the correlation, we repeat the previous
experiment with different shocks to the correlation matrix. The way we choose to change the correlation matrix is
simple: we add a normally distributed number with a predetermined variance to each off-diagonal entry of the correlation matrix. The method we employ for doing this is:
- Denote the new correlation matrix C̄. Then for all i, j with i ≠ j and i < j,
  C̄_ij = C̄_ji = C_ij + k · Z_ij,
  where the Z_ij are independent N(0, 1) and k is the scale of the shock.
- If any C̄_ij is outside (−1, 1), then we set it to −1 or 1, respectively.
Strategy             Scale of shock, k
                     0        0.05     0.1      0.15
Analytical           0.01157  0.01156  0.01159  0.01159
ANN model η = 1      0.00949  0.01016  0.02266  0.04588
ANN model η = 0      0.01284  0.01283  0.01283  0.01288
ANN model η = 0.75   0.01045  0.01043  0.01061  0.01076

Table 5.5: Average empirical CVaR0.95 over 500 hedge experiments for each shock k, each including 2,000 samples. The standard errors range from 0.00002 to 0.00005, except for the ANN model with η = 1, where the standard errors range from 0.00004 to 0.005. The experiment involves hedging an option portfolio of puts and calls in a four-dimensional Black-Scholes model without transaction costs, but the correlation matrix is not the same as during training.
Most likely C̄ will not be positive definite, so we employ an algorithm from [11]¹ to find the closest positive definite matrix to C̄. This yields positive definite correlation matrices that are slightly different from the original correlation matrix C. The size of the shock is easily controlled by changing k.
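The shock-and-repair procedure can be sketched as follows. Note that, instead of the algorithm from [11], we use simple eigenvalue clipping followed by diagonal rescaling as a stand-in projection; this substitution is our own and only approximates the nearest correlation matrix.

```python
import numpy as np

def shock_correlation(C, k, seed=0):
    """Shock the off-diagonal entries with i.i.d. N(0, k^2) noise (applied
    symmetrically), clip entries to (-1, 1), then repair: clip negative
    eigenvalues to (near) zero and rescale the diagonal back to one."""
    rng = np.random.default_rng(seed)
    n = C.shape[0]
    noise = k * np.triu(rng.standard_normal((n, n)), 1)  # strictly upper part
    Cbar = np.clip(C + noise + noise.T, -1.0, 1.0)
    np.fill_diagonal(Cbar, 1.0)
    w, V = np.linalg.eigh(Cbar)                  # symmetric eigendecomposition
    Cpsd = (V * np.maximum(w, 1e-8)) @ V.T       # clip negative eigenvalues
    d = np.sqrt(np.diag(Cpsd))
    return Cpsd / np.outer(d, d)                 # restore unit diagonal

C = np.array([[1.0, 0.673292, 0.69783, 0.732783],
              [0.673292, 1.0, 0.787874, 0.0763763],
              [0.69783, 0.787874, 1.0, 0.317681],
              [0.732783, 0.0763763, 0.317681, 1.0]])
C_shocked = shock_correlation(C, k=0.15, seed=3)
```

The diagonal rescaling is a congruence transform, so it preserves positive semi-definiteness of the clipped matrix.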
Before we move on to the experiment, we wish to add another trading strategy, inspired by Robust Portfolio Optimization (see [12]). The idea (in our case) is to solve the problem in a way in which the trading strategy is not as dependent on the correlation matrix C. We do this by training our ANN models on samples coming from a model with correlation

C_RO(η) = (1 − η) I_n + η C

where I_n is the n-dimensional identity matrix. When η = 1, we have C_RO(1) = C, which is the correlation matrix in our original problem. If η = 0, then C_RO = I_n is the identity matrix, implying that all underlying assets will be independent. Note that no matter the choice of η, the marginal distribution of each tradable asset is the same, since all we change is the correlation between the assets.
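The blend is a one-liner; since C_RO(η) is a convex combination of two positive definite correlation matrices for η ∈ [0, 1], it is automatically a valid correlation matrix with unit diagonal:

```python
import numpy as np

def robust_correlation(C, eta):
    """C_RO(eta) = (1 - eta) * I_n + eta * C. For eta in [0, 1] this is a
    convex combination of positive definite correlation matrices, hence
    again a valid correlation matrix (off-diagonals scaled by eta)."""
    return (1.0 - eta) * np.eye(C.shape[0]) + eta * C
```

For example, a pairwise correlation of 0.8 is dampened to 0.6 at η = 0.75.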
Our idea is to include two extra ANN models: one trained on η = 0 (independent assets) and one trained on η = 0.75 (assets that have a similar dependence structure but scaled down).
Our experiment can now be described in the following way:
- Train the relevant ANN models on 2^18 samples generated with correlation matrix C from equation (10).
- For each shock size k in (0, 0.05, 0.1, 0.15), run 500 hedge experiments with a new positive definite correlation matrix C̄ (found using the above method). Each of the 500 hedge experiments for each k includes 2,000 sample paths. The result is an empirical average CVaR for each of the four shock sizes and for each model/strategy.
The results can be seen in table 5.5, where we have listed the average empirical CVaR for the four different models for each of the four shock sizes k. We observe that all models except the ANN model with η = 1 (i.e. the one trained on the true correlation) are remarkably stable under shocks to the correlation matrix. This is to be expected of the analytical approach and the ANN model with η = 0: the analytical approach is a collection of delta hedging strategies for the individual options, which do not depend on correlation, and the ANN model with η = 0 is trained on samples with no correlation between the different assets. However, it is surprising that the ANN model with η = 0.75 (dampened correlation) is stable under shocks to the correlation matrix when the ANN model with η = 1 is quite unstable. The ANN model with η = 1 obtains an average CVaR twice that of the other models when k = 0.1 and does even worse for k = 0.15. This clearly shows that the ANN model with η = 1 has learned a trading strategy that relies too heavily on the correlation between the assets, as suspected in the previous section.
¹The implementation is borrowed from: https://stackoverflow.com/questions/10939213/how-can-i-calculate-the-nearest-positive-semi-definite-matrix
Strategy             Scale of shock, k
                     0        0.05     0.1      0.15
Analytical           0.06844  0.06844  0.06855  0.06862
ANN model η = 1      0.04289  0.04312  0.04433  0.04519
ANN model η = 0      0.04929  0.04925  0.04925  0.04921
ANN model η = 0.75   0.04517  0.04517  0.04524  0.04530

Table 5.6: Same experiment as table 5.5, but with 0.5% transaction costs. Standard deviations range from 0.00003 to 0.0002.
In table 5.5, we see that the ANN model with η = 0 (independent samples) obtains a higher empirical CVaR than the analytical approach. In theory, and with proper training (and enough samples), all ANN models should beat the analytical approach on CVaR. The fact that the ANN model with η = 0 fails to do so shows that training the ANN models is difficult, especially with the increased complexity of multiple assets. However, it is interesting that the ANN model with η = 0.75 (dampened correlation) performs better than the analytical approach and is stable under correlation shocks. As in Robust Portfolio Optimization, dampening the correlation has a stabilizing effect on the resulting strategy. This is very encouraging since it shows that it is possible to find effective trading strategies that do not suffer from a high degree of instability in the utilized correlation matrix.
One might suspect that the instability would disappear if we introduced transaction costs since they would
discourage high trading volumes, which was a significant characteristic of the unstable ANN trading strategy. We,
therefore, perform the same experiment but with
0.5%
transaction costs. The results of which can be seen in table
5.6. In table 5.6, we observe that all strategies are quite stable under shocks to the correlation. This is as suspected
since transaction costs have a regularizing effect on trading volumes, which counteracts the benefits of exploiting the
correlation matrix.
To round off, one should not read too much into the exact level of stability/instability of the deep hedging models, since that may depend heavily on the initial correlation matrix and the specific model structure/design and training. However, one can conclude that stability is not a given with these deep hedging models, and that there are natural ways of discouraging exploitation of the correlation between assets.
5.3 Black Scholes model with path-dependent options
For this experiment, we wish to hedge a path-dependent option. We do this to test the performance of the ANN
model under increased complexity and because hedging a path-dependent option introduces interesting architectural
decisions. To simplify the problem, we choose to hedge a down-and-out call option in a one-dimensional Black-Scholes
model. The down-and-out call option is among the more straightforward barrier options, with payoff

(S(T) − K)⁺ · 1{min_{u≤T} S(u) > L}

where K is the strike and L is the knock-out barrier. The down-and-out call also has a simple analytical price when working in a Black-Scholes model, which can be seen below.
Proposition 5.3 (Prop. 18.17 in [9]). The price of a European down-and-out call option at time t (issued at time 0) with maturity T, strike K and barrier L < K is given by

C_LO(S(t), t) = C(S(t), t) − (L/S(t))^{2r̃/σ²} C(L²/S(t), t)   if min_{u≤t} S(u) > L,
C_LO(S(t), t) = 0                                              if min_{u≤t} S(u) ≤ L,

where C is the price of a regular European call option with strike K (see proposition 5.1) and r̃ = r − ½σ².
This makes it easy to compare the hedging strategies learned by the ANN models with an analytical strategy. The
analytical strategy will utilize the delta hedging approach, which can hedge the option arbitrarily well, given enough
hedge points. The delta of the barrier option can be found analytically, but for convenience, we utilize algorithmic
adjoint differentiation (see [13, Chapter 9]), which is easily implemented using Tensorflow in Python.
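A cheap sanity check for any algorithmically computed delta (via AAD or otherwise) is bump-and-revalue: a central finite difference of the pricer should agree with the analytical delta to high accuracy. The sketch below checks this for a vanilla Black-Scholes call; it is our own illustration, not the thesis code.

```python
from math import exp, log, sqrt
from statistics import NormalDist

N = NormalDist().cdf  # standard normal cdf

def bs_call(S, K, T, r, sigma):
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    return S * N(d1) - K * exp(-r * T) * N(d1 - sigma * sqrt(T))

def bump_delta(price, S, h=1e-4):
    # central-difference "bump and revalue" delta of an arbitrary pricer
    return (price(S + h) - price(S - h)) / (2.0 * h)

S, K, T, r, sigma = 1.0, 1.0, 0.25, 0.02, 0.3
fd_delta = bump_delta(lambda s: bs_call(s, K, T, r, sigma), S)
d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
analytic_delta = N(d1)
```

The same check can be run against the barrier pricer of proposition 5.3 away from the barrier.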
To reduce complexity even further, we choose only to calculate the minimum of
S
over the discrete set of time
points where trading is allowed. The actual price of the option should, therefore, be higher than the price found in
proposition 5.3, which assumes continuous-time monitoring of the barrier. However, since prices only affect the
initial portfolio value and CVaR is cash invariant, we are not overly concerned with the discrepancy between discrete
and continuous time.
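The sketch below implements the pricing formula of proposition 5.3 (as we read it) with the parameters used later in this section and compares it against a Monte Carlo price in which the barrier is monitored only at the 20 trading dates. Consistent with the remark above, discrete monitoring knocks out less often and should give a slightly higher price. This is our own illustrative reconstruction, not the thesis code.

```python
import numpy as np
from math import exp, log, sqrt
from statistics import NormalDist

N = NormalDist().cdf

def bs_call(S, K, T, r, sigma):
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    return S * N(d1) - K * exp(-r * T) * N(d1 - sigma * sqrt(T))

def down_and_out_call(S, K, L, T, r, sigma):
    """Proposition 5.3 (continuous monitoring), valid for L < K and S > L."""
    r_tilde = r - 0.5 * sigma**2
    return bs_call(S, K, T, r, sigma) \
        - (L / S) ** (2 * r_tilde / sigma**2) * bs_call(L**2 / S, K, T, r, sigma)

S0, K, L, T, r, sigma = 1.0, 1.0, 0.95, 1 / 12, 0.02, 0.3
p_cont = down_and_out_call(S0, K, L, T, r, sigma)

# Monte Carlo price with the barrier monitored only at the 20 trading dates
rng = np.random.default_rng(0)
n_steps, n_paths = 20, 50_000
dt = T / n_steps
Z = rng.standard_normal((n_paths, n_steps))
logS = np.cumsum((r - 0.5 * sigma**2) * dt + sigma * sqrt(dt) * Z, axis=1)
S = S0 * np.exp(logS)
alive = S.min(axis=1) > L                     # barrier never breached
p_disc = exp(-r * T) * (np.maximum(S[:, -1] - K, 0.0) * alive).mean()
```

With these parameters the continuous-monitoring price lands close to the 0.0295 initial portfolio value quoted below, and the discretely monitored Monte Carlo price comes out above it.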
Suppose we wish to hedge a down-and-out call option optimally. Then, at each trading decision, we need to know the current value of the underlying asset S and whether or not the underlying asset has previously crossed the barrier, rendering the option worthless. Previously, we have only provided the ANN models with information on the current value of the underlying asset S. We therefore need to decide which extra input to give the ANN models. The additional information should provide knowledge of the state of the option (regarding the barrier). There are three ways of handling this issue that we wish to investigate.
The first is to provide the actual minimum of the underlying S as a separate input to the model. This is a sensible idea, but it puts a burden on the programmer/trader (read "us") to keep track of more information.
The second is to provide the ANN model only with the current value of the underlying asset S but allow the model to send information to its future self for the next trading decision. We say that the ANN model has memory. The idea is that the ANN model should itself be capable of keeping track of the minimum of the underlying asset S, and hence of whether the barrier has been crossed.
The third is not to provide the ANN model with any information other than the current value of the underlying asset. This strategy will not be able to trade optimally, but we include it as a benchmark for the other two methods.
The three different ANN models will be referred to as ANN w. min. info, ANN w. memory and ANN raw, respectively. Note that when we say min. info, we mean information about the minimum of S and not (virtually) no information.
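The memory mechanism can be sketched structurally as follows: at each trading date, a small feedforward network receives the current spot together with the memory vector m_{k−1} produced at the previous date, and outputs both the holding δ_k and the next memory vector m_k. The network below is untrained and purely illustrative; the weights, sizes and function names are our own assumptions, not the thesis architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(n_in, n_hidden, n_out):
    return {"W1": rng.normal(0.0, 0.5, (n_in, n_hidden)), "b1": np.zeros(n_hidden),
            "W2": rng.normal(0.0, 0.5, (n_hidden, n_out)), "b2": np.zeros(n_out)}

def mlp_step(x, params):
    # one small feedforward network applied at a single trading date
    h = np.tanh(x @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]

def rollout_with_memory(S_path, params, mem_dim=2):
    """Structural sketch: at each step the net sees (S_k, m_{k-1}) and
    outputs (delta_k, m_k); m is forwarded to the next trading decision."""
    m = np.zeros(mem_dim)
    deltas = []
    for S_k in S_path[:-1]:                    # no trade at maturity
        out = mlp_step(np.concatenate(([S_k], m)), params)
        deltas.append(out[0])                  # holding in S until next step
        m = out[1:]                            # memory for the next step
    return np.array(deltas)

params = init_params(n_in=3, n_hidden=6, n_out=3)   # (S_k, m) -> (delta, m)
deltas = rollout_with_memory(np.linspace(1.0, 1.1, 21), params)
```

In training, the whole rollout would sit inside the loss so that gradients flow backwards through the memory channel.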
The setup for the experiment is:
- Model: One-dimensional Black-Scholes.
- Model parameters: S(0) = 1, µ = 0.05, σ = 0.3, r = 0.02.
- Option: Type: Down-and-out call, Strike: K = 1 (ATM), Barrier: L = 0.95, Maturity: T = 1/12.
- Hedging: Hedge points: 20 (equidistant), Transaction costs: 0.
- Hedge strategies: Standard delta hedging and three ANN-CVaR models. The differences among the ANN models lie in their architecture and available information (see below).
- ANN architecture (for each trading decision): Layers: 4, Units: 6, Common input: S̃_k ∈ R at time t_k, Common output: δ_k ∈ R (holdings in S) from time t_k to t_{k+1}.
  - ANN model w. min. info: Additional input at time t_k: min_{i≤k} S(t_i).
  - ANN model w. memory: Additional output at time t_k: 2-dimensional vector m_k ∈ R², and additional input at time t_k: the 2-dimensional vector m_{k−1} ∈ R² from the previous trading decision.
  - ANN model raw: No additional input or output.
- ANN training: Training samples: 2^18, Batch size: 1024, Epochs: 100 with learning rate reduction and early stopping.
- Hedge test: Test samples: 50,000 (independent of training), Initial portfolio value: 0.029523 (true continuous-time option value).
This experiment aims to see if the ANN models can learn how to optimally hedge a path-dependent option, and to compare different methods of handling the extra information that hedging a path-dependent option requires. The results of the experiment can be seen in table 5.7 and figures 5.9 and 5.10.
Metric          Strategy                    Value (Standard Error)   % of option price
Option Price                                0.02953
Model p0        ANN CVaR0.95 w. min. info   0.04392
                ANN CVaR0.95 w. memory      0.04528
                ANN CVaR0.95 raw            0.04737
Avg. abs. PnL   Analytical                  0.00496 (0.00430)        16.793%
                ANN CVaR0.95 w. min. info   0.00603 (0.00639)        20.412%
                ANN CVaR0.95 w. memory      0.00712 (0.00928)        24.101%
                ANN CVaR0.95 raw            0.00902 (0.01194)        30.550%
PnL std.        Analytical                  0.00628                  20.754%
                ANN CVaR0.95 w. min. info   0.00900                  29.743%
                ANN CVaR0.95 w. memory      0.01155                  38.170%
                ANN CVaR0.95 raw            0.01477                  48.811%
CVaR0.95        Analytical                  0.01646                  55.749%
                ANN CVaR0.95 w. min. info   0.01440                  48.767%
                ANN CVaR0.95 w. memory      0.01579                  53.483%
                ANN CVaR0.95 raw            0.01780                  60.292%

Table 5.7: Results of a hedge experiment over 50,000 out-of-sample trials. The experiment involves hedging a down-and-out call option with 20 hedge points in a single-asset Black-Scholes model without transaction costs.
Figure 5.9: Two samples of holdings for different trading strategies ((a) Sample 1, (b) Sample 2). The goal is to hedge a down-and-out call option in a one-dimensional Black-Scholes model without transaction costs.
(a) Analytical (b) ANN model with min-info (c) ANN model with memory (d) ANN model without min-info or memory
Figure 5.10: PnLs over the terminal value of the underlying asset for four different hedging strategies. The goal is to hedge a down-and-out call option in a one-dimensional Black-Scholes model without transaction costs.
In table 5.7, we see that the ANN model with min. info performed the best. It obtained an empirical CVaR of 0.01440, which was 12.45% lower than the analytical approach and 8.80% and 19.10% lower than that of the ANN model with memory and the raw ANN model, respectively. This is expected, as the ANN model with min. info has the best available information. We should also note that the ANN model with memory performed better than the analytical approach and the raw ANN model. This shows that the model was partially successful in utilizing its memory to more accurately hedge the down-and-out call option. We say partially, since it is advantageous for the ANN model to receive the minimum of S as input instead of keeping track of it itself.
To illustrate the performance of the ANNs, we have chosen two test samples where the price of the underlying asset hit the barrier, which can be seen in figure 5.9 (a) and (b). From these two samples, we observe that the raw ANN model does not capture the crossing of the barrier, which makes sense since it has no information about the minimum of S and no memory. We also see that the ANN model with min. info and the ANN model with memory can capture the crossing of the barrier. However, the ANN model with memory seems to find this more challenging, as shown in figure 5.9 (b). It is also worth noting that the ANN models find it difficult to stop trading altogether. This may seem counter-intuitive, as taking no action would seem like an easy action to learn (for a human trader, at least). However, this is not the case for ANNs, which struggle to learn not to take action.
Lastly, we observe from table 5.7 that the average absolute PnL and the standard deviation of the PnL are significantly higher for the ANN models than for the analytical approach. The explanation for this can partly be seen in figure 5.10 (a)-(d). Here we see the PnLs of the four different methods across terminal values of S. It is interesting to see that the ANN models produced large positive PnLs for some test samples. These likely come from paths where the option was knocked out, but the portfolio still had positive value. This may seem odd. However, we must remember that the ANN models only care about downside risk (CVaR) and do not care about large hedge errors that yield a positive PnL.
All in all, we find that the deep hedging models are capable of hedging down-and-out call options. However, we find it advantageous to provide the ANN models with helpful but redundant information (in this case, about the minimum of S) instead of expecting the ANN models to keep track of it themselves.
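As an illustration of how such a redundant input can be precomputed, the running minimum min_{i≤k} S(t_i) for a whole batch of simulated paths reduces to a cumulative minimum along the time axis. This is a sketch in numpy (our own code and naming, not the thesis implementation):

```python
import numpy as np

def running_min_feature(paths):
    """Running minimum min_{i<=k} S(t_i) along the time axis.

    paths: array of shape (n_samples, n_steps + 1) of price paths.
    Returns an array of the same shape; column k holds the minimum of
    S(t_0), ..., S(t_k), which is the extra input fed to the
    "min. info" network at trading decision k.
    """
    return np.minimum.accumulate(paths, axis=1)

paths = np.array([[1.00, 0.95, 0.97, 0.90, 1.10]])
print(running_min_feature(paths))  # [[1.   0.95 0.95 0.9  0.9 ]]
```

Precomputing the feature once per batch keeps the per-decision networks small, since they never have to learn the minimum from the raw path.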
5.4 Heston model with one tradable asset (incomplete and non-Markovian model)
For this final deep hedging experiment, we wish to hedge a call option with maturity T = 3/12 using 60 hedge points, in a model with increased complexity. We do this by assuming that a Heston model governs the underlying asset. The model consists of the pair (S(t), ν(t)) (the price of the underlying and a variance process), which has dynamics

dS(t)/S(t) = μ dt + √(ν(t)) dW₁^P(t)
dν(t) = κ(θ − ν(t)) dt + σ √(ν(t)) dW₂^P(t)

where W₁ and W₂ are Brownian motions with correlation ρ. For this experiment, we use the parameters: S(0) = 1, drift μ = 0.05, ν(0) = 0.1, mean reversion κ = 5, long-term variance θ = 0.1, vol-of-vol σ = 1 and correlation ρ = −0.9.
In this model, we cannot simulate paths from (S, ν) without error. Hence, we have to consider which discretization scheme we wish to employ. In this thesis, we use the full truncation scheme (introduced in [14])

S(t + Δt) = S(t) exp((μ − ν(t)⁺/2) Δt + √(ν(t)⁺ Δt) Z₁)
ν(t + Δt) = ν(t) + κ(θ − ν(t)⁺) Δt + σ √(ν(t)⁺ Δt) (ρ Z₁ + √(1 − ρ²) Z₂)

where Z₁ and Z₂ are iid standard normal random variables and (·)⁺ = max(0, ·). We utilize this scheme as it is found to introduce relatively little discretization bias. For simplicity, we choose to perform daily discretization, which corresponds to one discretization step between each hedge point. This implies that Δt = (12 · 20)⁻¹.
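The full truncation scheme with the stated parameters can be sketched as follows (a minimal numpy implementation of our own; function and variable names are assumptions, not the thesis code):

```python
import numpy as np

def simulate_heston_full_truncation(S0=1.0, v0=0.1, mu=0.05, kappa=5.0,
                                    theta=0.1, sigma=1.0, rho=-0.9,
                                    T=3 / 12, n_steps=60, n_paths=10_000,
                                    seed=0):
    """Full truncation Euler scheme for the Heston model: the variance is
    floored at zero, v+ = max(v, 0), wherever it enters a drift or a
    diffusion term, so the variance process itself may go negative but
    never feeds a negative value into a square root."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    S = np.full(n_paths, S0)
    v = np.full(n_paths, v0)
    paths = np.empty((n_paths, n_steps + 1))
    paths[:, 0] = S0
    for k in range(n_steps):
        Z1 = rng.standard_normal(n_paths)
        Z2 = rng.standard_normal(n_paths)
        vp = np.maximum(v, 0.0)                      # v(t)^+
        S = S * np.exp((mu - vp / 2) * dt + np.sqrt(vp * dt) * Z1)
        v = v + kappa * (theta - vp) * dt \
              + sigma * np.sqrt(vp * dt) * (rho * Z1
                                            + np.sqrt(1 - rho ** 2) * Z2)
        paths[:, k + 1] = S
    return paths

paths = simulate_heston_full_truncation(n_paths=1_000)
```

Because the asset update exponentiates a real number, the simulated prices stay strictly positive, while the variance is only truncated where it enters the dynamics.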
As the Heston model is incomplete, it is essential to consider how we hedge the option. For this experiment, we
assume that we can only utilize the underlying asset to hedge the option. This is, of course, not enough to hedge the
option perfectly (even in continuous time) as the model is incomplete. Still, it poses an interesting challenge for
our deep hedging model. Another crucial consideration is the information available at each trading decision. We
suspect that for optimal hedging, the deep hedging model needs to know the instantaneous variance at each trading
decision. To test this, we wish to perform this experiment with and without this extra information. Note that the
model becomes non-Markovian when we do not know the instantaneous variance.
As in the previous experiments, we wish to compare the deep hedging models to the ordinary delta hedging approach. However, in the Heston model there exists no unique price and delta for the option without further assumptions. For simplicity, we assume a zero market price of volatility risk, implying that the dynamics of (S, ν) under the Q-measure are

dS(t)/S(t) = r dt + √(ν(t)) dW₁^Q(t)
dν(t) = κ(θ − ν(t)) dt + σ √(ν(t)) dW₂^Q(t)

where again W₁^Q and W₂^Q are Brownian motions with correlation ρ. With this assumption, we can compute a unique price and delta for the call option. Note that one could perform an entire analysis based on the choice of Q-measure. However, for simplicity, we choose to ignore this aspect.
In practice, we utilize the simple and relatively stable representation of the price, also referred to as the Lipton-Lewis representation (see [15]). This formulation of the price also lends itself to an easy and efficient computation of the option delta. For completeness, we have included the option price and delta in appendix A.2.
We are now ready to perform the experiment. For completeness, the experiment setup is:
Model: Heston model with one asset.
Model parameters: S(0) = 1, μ = 0.05, ν(0) = 0.1, κ = 5, θ = 0.1, σ = 1, ρ = −0.9 and r = 0.02.
Option: Type: Call, Strike: K = 1 (ATM), Maturity: T = 3/12.
Hedging: Hedge-points: 60 (equidistant), Transaction costs: 0
Hedge strategies: Standard delta hedging and two ANN-CVaR models. The difference between the ANN models lies in their architecture and available information (see below).
ANN architecture (for each trading decision): Layers: 4, Units: 6, Common input: S̃_k ∈ ℝ at time t_k, Common output: δ_k ∈ ℝ (holdings in S) from time t_k to t_{k+1}.
ANN model raw: No additional input or output.
ANN model w. ν: Additional input at time t_k: ν_k ∈ ℝ (current level of the variance process). No additional output.
ANN training: Training samples: 2^18, Batch size: 1,024, Epochs: 100 with learning rate reduction and early stopping.
Hedge-test: Test samples: 50,000 (independent of training), Initial portfolio value: 0.05868 (actual option value with zero market price of volatility risk)
Strategy                                   Value (Standard Error)   % of option price
Option Price                               0.05868
Model p0:
  ANN CVaR0.95 raw                         0.08366
  ANN CVaR0.95 w. ν                        0.08320
Avg. abs. PnL:
  Analytical                               0.01708 (3.24e-05)       29.099%
  ANN CVaR0.95 raw                         0.01013 (2.40e-05)       17.260%
  ANN CVaR0.95 w. ν                        0.00969 (2.30e-05)       16.504%
PnL std.:
  Analytical                               0.01991                  33.928%
  ANN CVaR0.95 raw                         0.01257                  21.420%
  ANN CVaR0.95 w. ν                        0.01200                  20.449%
CVaR0.95:
  Analytical                               0.04158                  70.854%
  ANN CVaR0.95 raw                         0.02498                  42.565%
  ANN CVaR0.95 w. ν                        0.02470                  42.095%
Table 5.8: Results of a hedge experiment over 50,000 out-of-sample trials. The experiment involves hedging an ATM call option with 60 hedge points in a single asset Heston model without transaction costs.
(a) Sample 1 (b) Sample 2
Figure 5.11: Two samples of holdings for different trading strategies. The goal is to hedge a call option in a one-dimensional Heston model without transaction costs.
This experiment aims to show the ability of the deep hedging model to hedge a simple call option in an incomplete
model and to analyze the importance of knowing the exact instantaneous variance at each trading decision. The
results of the experiment can be seen in table 5.8 and figure 5.11 (a) and (b).
In table 5.8, we observe that the two ANN models (one without extra information and one with ν) performed very similarly. Both ANN models significantly outperformed the delta hedging approach. However, we notice that the ANN model with ν performed slightly better (with a CVaR of 0.02470) than the raw ANN model (with a CVaR of 0.02498). This suggests that the ANN model can learn a slightly superior hedging strategy when it also knows the exact current level of volatility. This is also evident in figure 5.11, where we see the holdings in S across time for two test samples. Here we observe that the two ANN models choose slightly different trading strategies, more different than we would expect from two strategies learned by ANN models with identical architecture.
Overall, we are not surprised that the ANN models outperform the delta hedging approach, even in the Heston model. However, it might be surprising that there was only a tiny gap in performance between the two ANN models. This could, of course, vary with the parameters and the option type. Still, it suggests that the ANN model can learn a close-to-optimal hedging strategy even without knowing the exact level of volatility (in certain situations).
5.5 Subconclusion
We have seen that deep hedging models are indeed capable of learning optimal trading strategies just by observing a large number of simulated paths from the underlying asset under the P-measure. This was even the case when including transaction costs, multiple assets, path-dependent options and stochastic volatility. However, even when avoiding Q-measures and utilizing deep learning, we still had to make many critical choices regarding the information available to the ANN models, overfitting and more:
- We saw that the naive implementation could lead to unstable trading strategies in multi-asset models. Here we saw that training on paths with less correlation had a stabilizing effect.
- When hedging path-dependent options, we observed that the architecture of the ANNs and the available information had a significant impact on performance. It was (not surprisingly) clear that providing the deep hedging models with more relevant and processed information greatly improved their performance.
- In the Heston model, we also saw that information on the variance process did improve performance slightly. We did, however, argue that this may depend on the model parameters and the option.
Overall, it is possible to learn complex trading strategies from simulated paths alone. However, it is clear that no single hedging model/setup works in all situations. Knowledge of the options and underlying assets is still crucial for maximizing results (as in classical mathematical finance).
6 Market Generators
An issue with the deep hedging approach for learning optimal trading strategies is that the training procedure requires many paths under the P-measure to converge properly. This implies that historical data cannot be used directly to train the model. Accurate time-series simulation is, therefore, a necessary tool for practical implementations of deep hedging models. Up until now, we have assumed that the underlying assets follow the dynamics of a well-known model such as Black-Scholes or Heston. Fitting actual time-series data to these models can, however, be difficult due to model inflexibilities.
In this section, we wish to explore how we can utilize ANNs (specifically variational autoencoders) to create model-free market generators. Our goal is to train a model to produce new paths (of some length n) based on a single historical path. The method that we employ in this thesis is inspired by the methods presented by Buehler et al. in [2].
Imagine that we have N + 1 equidistant observations of the price of an asset, S₀, …, S_N, with corresponding rates of return r₁, …, r_N where, of course, rᵢ = Sᵢ/Sᵢ₋₁ − 1 (which we for simplicity refer to as returns). We can split this time series into M = ⌊N/n⌋ return-paths of length n

r₁⁽¹⁾, …, r_n⁽¹⁾, …, r₁⁽ᴹ⁾, …, r_n⁽ᴹ⁾.

We can now reformulate our goal as creating a model capable of producing synthetic return-paths (r̃₁, …, r̃_n) with a similar distribution as our observed return-paths. These synthetic paths could then hopefully be used to train a deep hedging model. We should note that we do not assume stationarity of the underlying price process, implying that the return-paths r₁⁽ⁱ⁾, …, r_n⁽ⁱ⁾ and r₁⁽ʲ⁾, …, r_n⁽ʲ⁾ need not be identically distributed; their distributions might be influenced by some market state.
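The splitting into return-paths can be sketched as follows (our own numpy sketch; names are assumptions, and any leftover returns beyond M·n are simply dropped):

```python
import numpy as np

def to_return_paths(prices, n):
    """Turn N+1 prices S_0, ..., S_N into M = floor(N/n) return-paths
    of length n, where r_i = S_i / S_{i-1} - 1."""
    returns = prices[1:] / prices[:-1] - 1.0  # r_1, ..., r_N
    N = len(returns)
    M = N // n
    return returns[:M * n].reshape(M, n)

prices = np.array([1.0, 1.1, 0.99, 1.05, 1.05, 0.9])  # N = 5 returns
print(to_return_paths(prices, n=2).shape)  # (2, 2); the 5th return is dropped
```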
To create such a model, we use variational autoencoders (VAEs) and conditional variational autoencoders (CVAEs). Other frameworks are capable of creating generative models, such as generative adversarial networks (GANs). However, we stick to VAEs and CVAEs, in line with [2], since VAEs and CVAEs are more stable during training and require less data to converge (GANs are known to be quite data hungry).
It is important to note that we choose to generate returns directly, which deviates from [2]. In [2], the authors are big proponents of generating truncated log-signatures of lead-lag transformed paths (and then transforming them back to paths). We will not go into detail about the log-signature transformation in this thesis. However, the idea behind the transformation is that it should be a more efficient and robust encoding of the information in the path. We do not employ this transformation because there is (to our knowledge) no tractable and fast inverse transformation from log-signatures to paths. In a notebook² on Github for [2], the authors show an example of an inverse signature transform for a path with 20 returns. In this example, it takes 51 seconds to perform the inverse transformation (on unknown machinery). The method for transforming log-signatures to paths is slow because it semi-randomly searches for a path with a close enough log-signature. Remember that we need 2^18 paths to train a deep hedging model. In our opinion, this renders the log-signature approach impractical (for now).
6.1 Variational Autoencoders
In this section, we introduce the concept of variational autoencoders (VAEs). This introduction is partly based on a
brilliant tutorial on VAEs by Carl Doersch [16].
To introduce VAEs, we step away from the world of returns and financial time series. We start by assuming that we wish to simulate some random variable X ∈ ℝⁿ based on iid samples {X₁, …, X_N}. We are not interested in simply sampling from {X₁, …, X_N}, but rather in creating a model capable of generating new and unseen samples which have the same (or similar) distributional properties as {X₁, …, X_N}. Note that we let X with density p(X) refer to the simulated random variable.
We will not sample X directly, but rather through a latent variable z ∈ ℝᵏ with some distribution p(z). We, therefore, imagine sampling z from p(z) and then sampling X|z from some conditional distribution p(X|z). Our goal can be thought of as finding p(X|z) such that p(X) is maximized for our data, as in maximum likelihood. The connection between p(X) and p(X|z) comes from the law of total probability

p(X) = ∫_{ℝᵏ} p(X|z) p(z) dz.
When working with VAEs it is common to assume that z ∼ N(0, I_k) and X|z ∼ N(f(z), σ²Iₙ), where I_k ∈ ℝ^{k×k} is the k×k identity matrix. Note that we do not assume that X is normally distributed, since f(z) is some (possibly complicated) transformation of a normal random variable. We simply assume that z is a multivariate standard normal random variable and that X|z (i.e. X conditioned on z) is normally distributed.
We could maximize p(X) by approximation. If we have M samples of z, then p(X) is approximately

p(X) ≈ (1/M) Σ_{i=1}^{M} p(X|zᵢ).

It would then be possible to find f(z) s.t. p(X) was maximized for our data. However, this is likely not feasible. In practice, p(X|zᵢ) is likely to be very small for each zᵢ due to a potentially high dimensionality of X. This implies that we would have to draw an impractically large sample of zs to accurately estimate p(X), and then try to perform maximum likelihood estimation to find the optimal f.
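A small toy computation illustrates why the naive estimate fails in high dimension. Here the decoder is the identity (k = n) and σ = 0.1; both choices, and all names, are illustrative assumptions, not the thesis setup:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50                        # dimension of X
sigma = 0.1
X = rng.standard_normal(n)    # one observation

# naive estimate p(X) ~ (1/M) sum_i p(X | z_i) with z_i ~ N(0, I_n)
# and a toy identity decoder f(z) = z, so X|z ~ N(z, sigma^2 I_n)
M = 100_000
z = rng.standard_normal((M, n))
log_p_X_given_z = (-0.5 * np.sum((X - z) ** 2, axis=1) / sigma ** 2
                   - n / 2 * np.log(2 * np.pi * sigma ** 2))
print(log_p_X_given_z.max())  # far below 0: every sampled z "misses" X
```

Even the best of 100,000 prior samples yields a log-density in the minus hundreds, so the Monte Carlo sum is dominated by terms that are numerically zero.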
VAEs solve this by sampling zs which are likely to have produced X. To do this, we need a function which supplies a distribution over z that is more likely to produce X. We name this distribution Q(z|X). A sensible choice would be to choose Q close to p(z|X), i.e. the distribution of z given X.
2See https://github.com/imanolperez/market_simulator/blob/master/notebooks/logsig_inversion.ipynb
Imagine that we now sample z from Q(z|X) based on our observations. How does this help us calculate p(X)? We somehow need to relate p(X) to E_{z∼Q}[p(X|z)]. To do this, we make the (maybe) odd choice of considering the relationship between Q(z|X) and p(z|X), specifically the Kullback-Leibler divergence (KL-divergence), which is defined as

Definition 6.1. For two distributions (densities) p, q on ℝⁿ, the KL-divergence is defined as

D_KL(p ∥ q) = ∫_{ℝⁿ} p(x) ln(p(x)/q(x)) dx.

The KL-divergence between Q(z|X) and p(z|X) is, therefore,

D_KL(Q(z|X) ∥ p(z|X)) = E_{z∼Q}[ln Q(z|X) − ln p(z|X)].
We can now utilize Bayes' rule on p(z|X), i.e. p(z|X) = p(X|z)p(z)/p(X). We can, therefore, express the KL-divergence between Q(z|X) and p(z|X) as

D_KL(Q(z|X) ∥ p(z|X)) = E_{z∼Q}[ln Q(z|X) − ln p(X|z) − ln p(z)] + ln p(X).

We can now see that p(X) and p(X|z) enter the picture, which is exactly what we were looking for. Rearranging the terms and using the definition of KL-divergence again yields

ln p(X) − D_KL(Q(z|X) ∥ p(z|X)) = E_{z∼Q}[ln p(X|z)] − D_KL(Q(z|X) ∥ p(z)).    (11)
This is one of the defining relationships for VAEs. The left-hand side is what we wish to maximize, i.e. (the log of) p(X), minus an error term that depends on the difference between Q(z|X) and p(z|X). Note that we postulated earlier that a sensible choice for Q(z|X) would be p(z|X). So if we choose the framework of Q(z|X) in a way that is capable of representing p(z|X), then maximizing the left-hand side (w.r.t. f and Q) would also maximize p(X), since the KL-divergence would also be minimized, i.e. the term would be zero if Q(z|X) = p(z|X). This is exactly what we want! To maximize the left-hand side of equation (11), we can use the shown relation, since the right-hand side can be maximized with gradient ascent (which we will explain later).
Remember that we want to maximize p(X), or equation (11), over our dataset {X₁, …, X_N}. To do this, we can sum (or average) equation (11) over our dataset and maximize that expression. This will (because of the iid assumption and the log) exactly correspond to maximizing the log-likelihood if Q(z|X) is capable of representing p(z|X).
Let us reiterate what we have found: We want to sample X by first sampling a random variable z ∼ N(0, I_k) from a latent space and then sampling X|z ∼ N(f(z), σ²Iₙ). Therefore, we need to find a function f s.t. we maximize the probability p(X) of observing a specific dataset (as in maximum likelihood).
In practice, we wish to maximize over f by sampling z from a distribution Q(z|X), which is more likely to produce X. We do not know Q or p(z|X), so we also have to maximize over Q. Equation (11) links all this together by providing us with an expression that we can maximize over f and Q (the right-hand side), which will in return maximize p(X), if Q is capable of representing p(z|X).
The structure of VAEs and equation (11) can be seen visually in figure 6.1. Note that this includes assumptions on Q that will be specified in the upcoming section.
Specifying Q(z|X) and finding an expression for the RHS of equation (11)
We now wish to be more concrete about our assumptions on Q and how we can maximize the right-hand side of equation (11). We start by assuming that

Q(z|X) = N(g(X), diag(h(X)))
Figure 6.1: VAE structure when assuming Q(z|X) = N(g(X), diag(h(X))). The blue boxes represent terms from the right-hand side of equation (11).
for some maps g, h : ℝⁿ → ℝᵏ. We now wish to simplify the right-hand side of equation (11) based on this assumption. We start with E_{z∼Q}[ln p(X|z)]. Remember that we assumed X|z ∼ N(f(z), σ²Iₙ), so

p(X|z) = (2π)^{−n/2} (σ²)^{−n/2} exp(−½ (X − f(z))ᵀ (1/σ²) Iₙ (X − f(z))) = (2πσ²)^{−n/2} exp(−‖X − f(z)‖₂² / (2σ²)),

which implies that

E_{z∼Q}[ln p(X|z)] = −E_{z∼Q}[‖X − f(z)‖₂² / (2σ²)] + c

where c is some constant that does not depend on X or f(z) (and so is irrelevant for the optimization).
We can now move on to the second term of the right-hand side of equation (11), i.e. D_KL(Q(z|X) ∥ p(z)). Since we have assumed that Q(z|X) = N(g(X), diag(h(X))), the KL-divergence is between two multivariate normal random variables (remember z ∼ N(0, I_k)). The KL-divergence between two multivariate normal distributions is well-known.

Proposition 6.2. Given two multivariate normals N₁ ∼ N(μ₁, Σ₁) and N₂ ∼ N(μ₂, Σ₂), the KL-divergence between the two distributions is

D_KL(N(μ₁, Σ₁) ∥ N(μ₂, Σ₂)) = ½ (tr(Σ₂⁻¹Σ₁) + (μ₂ − μ₁)ᵀ Σ₂⁻¹ (μ₂ − μ₁) − k + ln(det Σ₂ / det Σ₁)).

Proof. See [17] page 13.
In our case, μ₁ = g(X), μ₂ = 0, Σ₁ = diag(h(X)) and Σ₂ = I_k. We see that

1. Σ₂⁻¹Σ₁ = I_k diag(h(X)) = diag(h(X)), so we have tr(Σ₂⁻¹Σ₁) = h(X)ᵀ1.
2. (μ₂ − μ₁)ᵀ Σ₂⁻¹ (μ₂ − μ₁) = g(X)ᵀ I_k g(X) = g(X)ᵀ g(X).
3. det Σ₂ = det I_k = 1 and det Σ₁ = det(diag(h(X))) = Π_{i=1}^{k} h(X)ᵢ, which implies that

ln(det Σ₂ / det Σ₁) = ln(1 / Π_{i=1}^{k} h(X)ᵢ) = −(ln h(X))ᵀ1

where the log is applied element-wise to h(X).
We, therefore, get that the KL-divergence between Q(z|X) and p(z) is

D_KL(Q(z|X) ∥ p(z)) = ½ (h(X)ᵀ1 + g(X)ᵀg(X) − k − (ln h(X))ᵀ1)

which is easy to compute for some X given h and g.
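The closed form can be checked numerically against the general formula from proposition 6.2 (a numpy sketch of our own; the test vectors are arbitrary):

```python
import numpy as np

def kl_diag_gaussian_to_std_normal(g, h):
    """KL( N(g, diag(h)) || N(0, I_k) ) via the closed form:
    0.5 * ( sum(h) + g.g - k - sum(log h) )."""
    k = len(g)
    return 0.5 * (h.sum() + g @ g - k - np.log(h).sum())

def kl_general(mu1, S1, mu2, S2):
    """General multivariate-normal KL from proposition 6.2."""
    k = len(mu1)
    S2inv = np.linalg.inv(S2)
    return 0.5 * (np.trace(S2inv @ S1)
                  + (mu2 - mu1) @ S2inv @ (mu2 - mu1) - k
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))

g = np.array([0.5, -1.0, 0.2])   # arbitrary g(X)
h = np.array([0.8, 1.5, 0.3])    # arbitrary h(X), strictly positive
a = kl_diag_gaussian_to_std_normal(g, h)
b = kl_general(g, np.diag(h), np.zeros(3), np.eye(3))
print(np.isclose(a, b))  # True
```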
Putting it all together, we can express the right-hand side of equation (11) as

E_{z∼Q}[ln p(X|z)] − D_KL(Q(z|X) ∥ p(z)) = −E_{z∼Q}[‖X − f(z)‖₂² / (2σ²)] − ½ (h(X)ᵀ1 + g(X)ᵀg(X) − k − (ln h(X))ᵀ1) + c.    (12)
Deriving the empirical objective function
Assume again that we have a dataset {X₁, …, X_N} and corresponding N realizations of z|X from Q(z|X). We can then create the empirical objective function, to which we can apply gradient ascent by differentiating w.r.t. the parameters in f, g, h. To emulate the maximum likelihood method, we create the empirical objective function as a sum (or actually a mean) of equation (12) over the dataset (as argued previously). Hence, we can express the empirical objective function as

−(1/N) Σ_{i=1}^{N} ‖Xᵢ − f(zᵢ|Xᵢ)‖₂²/σ² − (1/N) Σ_{i=1}^{N} (h(Xᵢ)ᵀ1 + g(Xᵢ)ᵀg(Xᵢ) − k − (ln h(Xᵢ))ᵀ1)    (13)

where the first sum is the reconstruction loss, the second sum is the KL-divergence loss, and where we have omitted the factor 1/2 and the constant (since they do not affect the optimization). Note that we only use one zᵢ|Xᵢ to estimate E_{z∼Q}[−‖Xᵢ − f(z)‖₂²/(2σ²)]. This is done in order to speed up computations, but we are then forced to take more and smaller gradient steps.
A critical step is to simulate zᵢ|Xᵢ as

zᵢ|Xᵢ = g(Xᵢ) + √(h(Xᵢ)) ⊙ ε, where ε ∼ N(0, I_k),

since this allows us to backpropagate through the simulation step. Backpropagation cannot handle stochastic operations but can take stochastic inputs.
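A minimal numerical sketch of this sampling step (numpy; the g(X) and h(X) values are arbitrary placeholders of our own): the draws are deterministic functions of g(X), h(X) and ε, and their empirical mean and variance match the intended N(g(X), diag(h(X))):

```python
import numpy as np

def sample_z_given_x(g_x, h_x, eps):
    """Reparameterization trick: z|X = g(X) + sqrt(h(X)) * eps with
    eps ~ N(0, I_k). All randomness enters through eps, so the sample
    is a deterministic (hence differentiable) function of g(X), h(X)."""
    return g_x + np.sqrt(h_x) * eps

rng = np.random.default_rng(0)
g_x = np.array([1.0, -2.0])    # placeholder encoder mean g(X)
h_x = np.array([0.25, 4.0])    # placeholder encoder variance h(X)
eps = rng.standard_normal((200_000, 2))
draws = sample_z_given_x(g_x, h_x, eps)   # broadcasts over 200,000 draws
print(draws.mean(axis=0))  # close to g(X)
print(draws.var(axis=0))   # close to h(X)
```

In an actual VAE, g_x and h_x would be outputs of the encoder network, and gradients flow through the affine map back to the encoder parameters.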
We should also note that σ² acts as a way of balancing the trade-off between minimizing the reconstruction loss and minimizing the KL-divergence loss. One can understand the trade-off like this: When σ is close to zero, the VAE will link z|X directly to X to ensure the best possible reconstruction. However, this will render simulation from z ∼ N(0, I_k) useless as a way to simulate X-lookalikes. On the other hand, if σ is large, then simulations become too normally distributed, and the model will struggle to separate different Xs in the latent space.
Simulating X after optimization
Once we have found the maps f, g and h that maximize equation (13), we can sample X in the following way:
1. Sample the latent variable z ∼ N(0, I_k).
2. Sample X ∼ N(f(z), σ²Iₙ).
However, it is pretty common to sample X by setting X = f(z) and thus assume σ² = 0 (but only for the simulation step). Of course, if σ² is small, then these two techniques will not be too dissimilar.
Figure 6.2: CVAE structure. The blue boxes represent terms from the right-hand side of equation (14).
One could also argue that we introduced the last simulation step to enable simulation of Xs which are not exactly like our samples (otherwise, the problem would collapse). However, after training our model, we might not care about the extra variance introduced by σ². This would be the case in image generation, where σ² would be unnecessary noise in the simulation. In this thesis, we choose not to skip the last simulation step, since the additional noise might be necessary for the distribution of the returns.
6.2 Conditional VAEs
In this brief section, we wish to explain an extension to VAEs called conditional VAEs (CVAEs). Imagine that, along with a number of samples {X₁, …, X_N}, we have some conditional samples {Y₁, …, Y_N} with Yᵢ ∈ ℝᵐ. We assume that the X|Ys are independent, but that their distribution depends on Y. This is relevant for us when sampling return-paths that are not stationary. Our samples of return-paths might come from different periods of volatility, and it is commonly accepted that the future volatility of returns depends on previous volatility. Therefore, it is crucial that we can sample return-paths conditioned on some market state (or the previous returns).
Extending VAEs to cope with conditional variables is not difficult: First, we change our goal to being able to sample X|Y such that it looks like {X₁|Y₁, …, X_N|Y_N}. We, therefore, wish to approximate p(X|Y), not just p(X). Following the idea from before, we assume that we sample X|(Y, z) ∼ N(f(z, Y), σ²Iₙ). Now f : ℝᵏ × ℝᵐ → ℝⁿ is a function that takes both the latent variable and the conditional variable. As before, we assume that we sample z|Y ∼ N(0, I_k), meaning that the latent variable is unaffected by the conditional variable Y.
To perform the approximation of p(X|Y), we again want to sample the zs such that they have a higher probability of producing X given Y. We do this by sampling z from Q(z|X, Y), where we assume that Q(z|X, Y) = N(g(X, Y), diag(h(X, Y))), which is similar to before except that the maps g and h take both X and Y as inputs. So Y now enters as an input for the maps f, g and h. With these changes, equation (11) can be derived again, yielding

ln p(X|Y) − D_KL(Q(z|X, Y) ∥ p(z|X, Y)) = E_{z∼Q}[ln p(X|z, Y)] − D_KL(Q(z|X, Y) ∥ p(z|Y))    (14)

and remember that we assumed p(z|Y) = N(0, I_k). This is the core equation behind CVAEs, and it has a similar interpretation as equation (11) (which was the relation for VAEs). In figure 6.2, we have visualized the structure of CVAEs and equation (14).
One can then again derive an objective function similar to equation (13)

−(1/N) Σ_{i=1}^{N} ‖Xᵢ − f(zᵢ|Xᵢ, Yᵢ)‖₂²/σ² − (1/N) Σ_{i=1}^{N} (h(Xᵢ, Yᵢ)ᵀ1 + g(Xᵢ, Yᵢ)ᵀg(Xᵢ, Yᵢ) − k − (ln h(Xᵢ, Yᵢ))ᵀ1).    (15)

Maximizing the above w.r.t. f, g and h should enable us to sample X|Y by first sampling z ∼ N(0, I_k) and then sampling X|(Y, z) ∼ N(f(z, Y), σ²Iₙ).
6.3 Connecting ANNs to VAEs and CVAEs
In the previous sections, we have derived an objective function which we can maximize w.r.t. three maps f, g and h to enable simulation of Xs with similar distributional properties to our samples {X₁, …, X_N}, even when conditioning on some variable Y.
In this section, we explain how the idea of VAEs and CVAEs relates to artificial neural networks (ANNs). As we know, ANNs are useful as approximators of continuous maps over which we want to minimize some loss function using gradient descent. For VAEs and CVAEs, we can use ANNs to represent the maps f, g and h. We first remember that

f : ℝᵏ (× ℝᵐ) → ℝⁿ
g, h : ℝⁿ (× ℝᵐ) → ℝᵏ

for VAEs (and CVAEs). First, we let f be represented by an ANN, D, with input dimension k (× m) and output dimension n. For g and h, we choose a single ANN, E, to represent both. E must then have input dimension n (× m) and output dimension 2k. So for VAEs and CVAEs, we have

VAE: D(z) = f(z), E(x) = (g(x), h(x))    CVAE: D(z, y) = f(z, y), E(x, y) = (g(x, y), h(x, y)).
We can now introduce θ₀ and θ₁ as the trainable parameters in E and D, respectively. For VAEs, we can state the maximization problem using the objective function in equation (13) as

max_{θ₀,θ₁} [ −(α/N) Σ_{i=1}^{N} ‖Xᵢ − D_{θ₁}(zᵢ|Xᵢ)‖₂² − ((1−α)/N) Σ_{i=1}^{N} (E_{θ₀}^{[1]}(Xᵢ)ᵀ1 + E_{θ₀}^{[0]}(Xᵢ)ᵀE_{θ₀}^{[0]}(Xᵢ) − k − (ln E_{θ₀}^{[1]}(Xᵢ))ᵀ1) ]    (16)

where we have used the notation E_{θ₀}(x) = (E_{θ₀}^{[0]}(x), E_{θ₀}^{[1]}(x)), and we sample zᵢ|Xᵢ by first sampling εᵢ ∼ N(0, I_k) and then setting zᵢ|Xᵢ = E_{θ₀}^{[0]}(Xᵢ) + √(E_{θ₀}^{[1]}(Xᵢ)) ⊙ εᵢ. A similar maximization problem can also be derived for CVAEs using equation (15).
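To make the pieces concrete, here is a small numpy sketch of our own that evaluates the objective in equation (16) once, with arbitrary linear maps standing in for E and D (these stand-ins and all names are assumptions; in practice E and D are the trainable ANNs, and the expression is maximized by gradient ascent):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, N = 4, 2, 256
alpha = 0.9                     # alpha = (1 + sigma^2)^-1, as in the text

X = rng.standard_normal((N, n))

# toy linear stand-ins for the encoder E = (g, h) and decoder D
W_g = rng.standard_normal((n, k)) * 0.1
W_h = rng.standard_normal((n, k)) * 0.1
W_d = rng.standard_normal((k, n)) * 0.1

g = X @ W_g                     # E^[0](X): latent means
h = np.exp(X @ W_h)             # E^[1](X): latent variances, kept positive
eps = rng.standard_normal((N, k))
z = g + np.sqrt(h) * eps        # reparameterized sample z_i | X_i
X_hat = z @ W_d                 # decoder output D(z_i | X_i)

recon_loss = (alpha / N) * np.sum((X - X_hat) ** 2)
kl_loss = ((1 - alpha) / N) * (np.sum(h + g * g - np.log(h)) - N * k)
objective = -(recon_loss + kl_loss)   # the quantity maximized in eq. (16)
```

Note that the KL term is non-negative by construction (each summand h − 1 − ln h + g² is ≥ 0), which the sketch makes easy to verify.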
As we have alluded to previously, this maximization problem can be solved using a gradient ascent algorithm. The crucial detail that allows us to find the derivative of the above expression w.r.t. the trainable parameters in E and D is that we sample zᵢ|Xᵢ as an affine transformation of a multivariate standard normal random variable εᵢ. This makes it possible to find the derivative of zᵢ|Xᵢ w.r.t. θ₀. This is often referred to as the reparameterization trick.
We should also note that we have scaled the entire objective function by ασ², where α = (1 + σ²)⁻¹. We do this to emphasize the balance between minimizing the reconstruction loss (first term) and minimizing the KL-divergence loss (second term).
6.4 Experiments with VAEs and performance evaluation (in a simple Black-Scholes model)
In this section, we wish to use VAEs to create generative models based on a Black-Scholes model (and a Heston model later), which we refer to as the base model(s). Specifically, our goal is to simulate daily return-paths of length 20 that resemble those from the base model.
The section has two purposes: 1. Explain how we set up and create VAEs based on simulated return-paths from the base model. 2. Analyze the generative models with light statistical tools to investigate whether the generated return-paths have similar characteristics to those from the base model. We will not do an extensive analysis of the VAEs, since the real test for the VAEs will be their use as data-generators for deep hedging models (see section 7).
A simple Black Scholes model
We start by assuming that our base model is a simple Black-Scholes model with the following parameters: S(0) = 1, drift μ = 0.05 and volatility σ = 0.3. We do this since returns in the Black-Scholes model are iid; hence return-paths are stationary. This is not realistic, but it allows us to use VAEs instead of CVAEs, since all return-paths will have the same distribution independent of any market state.
Data-preparation
For this experiment, we wish to train a VAE capable of producing Black-Scholes-like paths of length n = 20 (plus the initial spot value S(0)) with dt = T/n and T = 1/12, i.e. daily returns over one month. Note that we could train a VAE model to generate single daily returns from the Black-Scholes model (i.e. having n = 1) and connect multiple simulated returns to create a longer path. This is possible without extra modifications, since the Black-Scholes model has iid returns. However, this problem is too easy to be interesting. We, therefore, choose n = 20, even though less would be enough.
We wish to train the VAE on M independent/non-overlapping return-paths (for now). To do this, we can sample a single Black-Scholes return-path of length M·n and divide it into M return-paths of length n. We can then scale the returns by subtracting their means and dividing by their standard deviations. This scaling is done independently for every time point (dt, …, n·dt = T). We then have M normalized return-paths of length n. These will be the samples we utilize to train the VAE.
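The per-time-point normalization can be sketched as follows (our own numpy sketch; returning the means and standard deviations, so that generated paths can later be transformed back to the return scale, is a convenience we add, not thesis code):

```python
import numpy as np

def normalize_return_paths(R):
    """Normalize M return-paths of length n independently per time point:
    for each column (time point), subtract the cross-sectional mean and
    divide by the cross-sectional standard deviation."""
    mu = R.mean(axis=0)
    sd = R.std(axis=0)
    return (R - mu) / sd, mu, sd

rng = np.random.default_rng(0)
R = rng.normal(0.0002, 0.02, size=(500, 20))   # M = 500 paths, n = 20 returns
R_norm, mu, sd = normalize_return_paths(R)
```

After training, a generated normalized path r̃ would be mapped back to returns as r̃ * sd + mu before being compounded into a price path.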
Creating the VAE
We are now ready to create the VAE. In our experiments, Xᵢ ∈ ℝⁿ corresponds to a normalized return-path. This dictates the dimensions of the ANNs and influences our choice of latent space. To start, we choose the size of the latent space. We choose a latent space with dimension n, i.e. z ∈ ℝⁿ. This may seem like an arbitrary choice (and it might be). However, we would normally expect return-paths to require as many random variables to simulate as there are time points (in this case n). This is unusual for VAE problems, where we often choose latent spaces much smaller than the size of X. This is the case in image generation, where it is safe to assume that an image depends on fewer components than the number of pixels in the image.
We are now ready to create the ANNs E (encoder, representing g and h) and D (decoder, representing f). E has input dimension n and output dimension 2n (n for g and n for h). D has input dimension n (the latent space size) and output dimension n. We create E and D as shallow ANNs with 40 units in the hidden layer and elu activation.
We choose shallow ANNs as VAEs can be challenging to train with deep ANNs.
Lastly, we need to choose the parameter σ used to construct X as N(f(z), σ²I_n). Remember that σ can also be represented as α = 1/(1 + σ²) from the objective function in equation (16). To start, we choose α = 0.9, but we wish to test different choices later.
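A minimal sketch of the dimensions involved, assuming the encoder outputs the n-dimensional mean and log-variance concatenated (the NumPy stand-in for the ANNs and all names are ours; a real implementation would use a deep learning framework):

```python
import numpy as np

def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1)

def make_shallow(in_dim, hidden, out_dim, rng):
    """One-hidden-layer ANN with elu activation, randomly initialized."""
    W1 = rng.standard_normal((in_dim, hidden)) * 0.1
    b1 = np.zeros(hidden)
    W2 = rng.standard_normal((hidden, out_dim)) * 0.1
    b2 = np.zeros(out_dim)
    return lambda x: elu(x @ W1 + b1) @ W2 + b2

n = 20
rng = np.random.default_rng(0)
E = make_shallow(n, 40, 2 * n, rng)   # encoder: mean g(x) and log-variance h(x)
D = make_shallow(n, 40, n, rng)       # decoder: latent z -> reconstruction f(z)

x = rng.standard_normal(n)
enc = E(x)
g, h = enc[:n], enc[n:]               # split into latent mean and log-variance
print(g.shape, D(g).shape)            # (20,) (20,)
```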
Training the VAE
The VAE is trained by maximizing the objective function in equation (16) (or minimizing the negative objective function, i.e. the loss), where the X_i's are sampled return-paths. This is easily done using gradient ascent (descent).
We train the VAE for 1,500 epochs with a batch size of 128 utilizing the ADAM algorithm with a learning rate of
0.01, 0.0001 and 0.00001 for 500 epochs each.
We do not use learning rate decay or early stopping since each batch and epoch produces quite varying losses
due to the simulation steps included in the objective function. A simple training procedure is, therefore, easier to use
and possibly more robust.
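The staged schedule and minibatch loop could look as follows (a skeleton only; the actual ELBO gradient and ADAM step are left abstract, and all names are ours):

```python
import numpy as np

def learning_rate(epoch):
    # Staged schedule from the text: 0.01, 0.0001, 0.00001 for 500 epochs each.
    return 1e-2 if epoch < 500 else (1e-4 if epoch < 1000 else 1e-5)

def train(X, n_epochs=1500, batch_size=128, seed=0):
    """Skeleton training loop: shuffle, take minibatches, and perform one
    gradient step per batch at the scheduled learning rate."""
    rng = np.random.default_rng(seed)
    steps = 0
    for epoch in range(n_epochs):
        lr = learning_rate(epoch)
        idx = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = X[idx[start:start + batch_size]]
            # grads = elbo_gradient(params, batch)      (abstract)
            # params = adam_update(params, grads, lr)   (abstract)
            steps += 1
    return steps

n_steps = train(np.zeros((1000, 20)), n_epochs=3)
print(n_steps)   # 3 epochs * ceil(1000/128) = 24
```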
As with the deep hedging models, we do not worry too much about training times. However, with this training procedure and proposed model architecture, we train a VAE with 1,000 training samples in approximately 50 seconds. This is relatively fast compared to the deep hedging models, and quite reasonable when considering the number of training samples (1,000 vs 2^18). Note, however, that the training time depends significantly on the number of training samples, the number of returns in a path (n), the architecture, computational resources, epochs and batch size.
Simulating new paths
Once training is done, we can generate new samples from the VAE by sampling z ~ N(0, I_n) and then sampling X|z ~ N(D(z), σ²I_n), where D is the trained decoder ANN. These samples represent new normalized return-paths. Therefore, all that is left is denormalizing the VAE samples and converting the return-paths to S-paths using S(0). Note that simulating new return-paths is virtually instant, since it only requires a simulation of normal random variables, a forward pass of the decoder ANN D, another simulation of normal random variables and a denormalization transformation.
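A sketch of this sampling step, assuming a trained decoder and the per-time-point normalization constants (the identity decoder below is only a stand-in for a trained D, and all names are ours):

```python
import numpy as np

def sample_paths(decoder, means, stds, S0=1.0, n=20, n_samples=5,
                 sigma=0.1, seed=1):
    """Generate S-paths from a trained decoder: sample z ~ N(0, I_n),
    then X|z ~ N(D(z), sigma^2 I_n), denormalize, and cumulate log-returns."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, n))
    X = decoder(z) + sigma * rng.standard_normal((n_samples, n))  # normalized
    returns = X * stds + means                                    # denormalize
    S = S0 * np.exp(np.cumsum(returns, axis=1))                   # to spot
    return np.hstack([np.full((n_samples, 1), S0), S])            # prepend S(0)

# Identity decoder as a stand-in for the trained ANN D.
paths = sample_paths(lambda z: z, means=np.zeros(20), stds=np.full(20, 0.01))
print(paths.shape)   # (5, 21)
```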
Evaluating paths sampled by the VAE
We should now (hopefully) have a working VAE model capable of generating new paths similar to the training samples. We wish to test this to ensure that the model and setup are functioning correctly and to analyze the effects of changes to σ (or α) and the number of samples used for training.
Analyzing whether the VAE samples are similar in distribution to the training samples is not trivial, as we work with random variables in dimension n = 20. This is not helped by the fact that the number of training samples may not be large (100 to 1,000). To simplify the analysis, we can analyze the empirical marginal distribution of each of the n = 20 time points from the S-paths. After analyzing the marginal distributions, we can look at the autocorrelations of the returns, including the absolute returns, as this is especially relevant for returns of financial assets.
One could perform more detailed analyses of the VAEs. However, (as we will see later) analyzing empirical
marginal distributions and autocorrelations can get us a long way. We should also bear in mind that our goal is to use
the VAEs as data-generators for deep hedging models. We can, therefore, view the upcoming hedge experiments as
the most significant tests of the VAEs.
We now wish to briefly explain how we analyze the empirical marginal distributions. For completeness, we also wish to explain how we compute the autocorrelations of the samples (as they are not actual autocorrelations).
To analyze the empirical marginal distributions of the S-paths, we perform a two-sample Kolmogorov-Smirnov test for all time points. This test is designed to test whether two samples come from the same distribution. Let us assume that we have two sets of iid samples {X_1, ..., X_N} and {Y_1, ..., Y_M}, both in R, with corresponding empirical distribution functions F_{X,N}(x) and F_{Y,M}(y). The two-sample Kolmogorov-Smirnov test considers the maximum distance between the two empirical CDFs. This distance, which is referred to as the Kolmogorov-Smirnov statistic, is defined as

D_{N,M} = sup_x |F_{X,N}(x) − F_{Y,M}(x)|.
The null hypothesis for this statistical test is that {X_1, ..., X_N} and {Y_1, ..., Y_M} are sampled from the same underlying distribution. Without going into the theoretical details, we can reject the null hypothesis (and be convinced that the Xs and Ys come from different distributions) at some confidence level α if

D_{N,M} > sqrt( −ln(α/2) · (N + M) / (2NM) ).

Solving for 1 − α, we can therefore express the p-value as

p := 1 − α = 1 − 2 exp( −(2NM / (N + M)) · D²_{N,M} ),

and we can reject the null hypothesis if the p-value is found to be suitably low. In our analyses, we do not specify an exact confidence level, as it should be clear from observing the p-value. Still, we generally will not be satisfied with the VAE if the p-value across multiple time points is below 10%.
We should note that the p-value, of course, depends on the size of our two samples. That is, if our two samples
are large, then we require their empirical distributions to be closer to obtain similar p-values. We, therefore, have to
choose the size of the VAE samples and be careful to keep it consistent. In our analyses, we sample 20,000 paths from the trained VAEs to compare to the training paths (which are typically of size 1,000). We choose 20,000 to reduce the uncertainty from the VAE sampling.
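The statistic and the one-term p-value approximation above can be computed directly (a NumPy sketch with our own names; in practice one could also use scipy.stats.ks_2samp, which implements the full series):

```python
import numpy as np

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum distance
    between the two empirical CDFs, evaluated on the pooled sample."""
    grid = np.sort(np.concatenate([x, y]))
    Fx = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    Fy = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return np.max(np.abs(Fx - Fy))

def ks_p_value(x, y):
    """One-term asymptotic p-value as in the text:
    p = 1 - 2 exp(-2NM/(N+M) * D^2).
    Note this approximation is only accurate for large D (it can go
    below 0 for small D); the full Kolmogorov series avoids this."""
    N, M, D = len(x), len(y), ks_statistic(x, y)
    return 1.0 - 2.0 * np.exp(-2.0 * N * M / (N + M) * D**2)

x = np.random.default_rng(0).standard_normal(1000)
print(round(ks_statistic(x, x.copy()), 6))   # 0.0 for identical samples
```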
We can now move on to the autocorrelations. In our experiments, we have two sets of samples of return-paths (and absolute return-paths) of length n. The classical autocorrelation does not really make sense in this setting, as we have multiple paths covering the same time points. In this case, we choose to calculate the average of the empirical correlations measured on lagged returns. For lack of a better term, we refer to this as autocorrelation. To be more specific, we choose a lag l and calculate the correlation between returns at time t_i and t_{i+l} for all possible i's. We then average all correlations to get the autocorrelation with lag l. We do this for l = 1, ..., n to produce a graph of autocorrelations across lags.
We should note that this method of evaluating the correlation structure of return-paths can hide unwanted
correlation since we average correlations across time points. We will, however, not be too worried about this as we
would expect perfect offset to be quite unlikely.
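This averaged lagged correlation can be sketched as follows (names are ours):

```python
import numpy as np

def lagged_autocorrelation(paths, lag):
    """Average, over all valid start times i, of the cross-sectional
    correlation between returns at t_i and t_{i+lag} (computed across paths)."""
    n = paths.shape[1]
    corrs = [np.corrcoef(paths[:, i], paths[:, i + lag])[0, 1]
             for i in range(n - lag)]
    return float(np.mean(corrs))

rng = np.random.default_rng(0)
iid_paths = rng.standard_normal((5000, 20))
acf1 = lagged_autocorrelation(iid_paths, lag=1)
print(abs(acf1) < 0.05)   # True: near zero for iid returns
```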
An actual VAE-experiment
We are now ready to run an actual VAE experiment in a Black-Scholes model. In this experiment, we train a VAE on 1,000 independent return-paths of length n = 20 with α = 0.9. We then simulate 20,000 return-paths / S-paths from the VAE and compare them to the 1,000 training paths. The results of this experiment can be seen in figure 6.3 (a)-(f).
In figure 6.3 (a) and (b), we see 100 S-paths generated by the Black-Scholes model and the VAE model, respectively. We cannot conclude anything about the distributional characteristics, but we can confirm that the VAE has indeed been able to create paths that look like proper paths from a price process.
Figure 6.3: Results from comparison of 20,000 VAE paths and 1,000 Black-Scholes paths, which were used for training. (a) 100 Black-Scholes paths (used for training); (b) 100 VAE paths; (c) empirical distributions of 1,000 training paths and 20,000 VAE paths at time T; (d) Kolmogorov-Smirnov p-values across time; (e) autocorrelation of training and VAE returns; (f) autocorrelation of training and VAE absolute returns.
Figure 6.4: Average Kolmogorov-Smirnov p-values across time for varying α and number of training paths, including 95% confidence bands. (a) α ∈ {0.8, 0.9, 0.99} and 1,000 training paths; (b) number of training paths in {100, 250, 1000} and α = 0.9. The p-values are calculated from 20,000 paths sampled from 100 VAEs, which are trained on independent samples. When varying the number of training paths (in (b)), we test the VAE paths against 1,000 independent Black-Scholes paths.
In figure 6.3 (c), we see the empirical distribution functions for the training paths and the VAE samples at time T = 1/12. The empirical distributions are visually quite close, suggesting that the VAE has captured some of the distributional characteristics of the Black-Scholes samples (note/remember that we look at distributions of S(T) and not returns). This is confirmed in figure 6.3 (d), where we see the p-values from a two-sample Kolmogorov-Smirnov test at each time point. Remember again that the test is done with 1,000 training paths and 20,000 VAE paths. We observe that the p-values for all time points are above 0.5 (roughly), implying that we cannot reject any of the 20 null hypotheses (one for each time point). The marginal distributions of the VAE paths are thus close to those of the Black-Scholes training samples.
To end this first experiment, we look at the autocorrelations in figure 6.3 (e) and (f). From these, we see that the VAE does not, on average, have a significantly non-zero correlation at any lag, also when including absolute returns. These results suggest that the VAE is indeed able to produce paths of S (or return-paths) that have the same distributional characteristics as the Black-Scholes model, which we used for training.
Varying α and the number of training paths
We have just seen that a VAE can produce new paths with similar characteristics to a Black-Scholes model. In this section, we wish to investigate the influence of the parameter α (or σ) and the number of training paths. To be specific, we wish to test the VAE for α ∈ {0.8, 0.9, 0.99} and with the number of training paths in {100, 250, 1000}. When varying α, we use 1,000 training paths, and when varying the number of training paths, we use α = 0.9.
For this analysis, we train 100 VAEs on independent sets of training samples. We can then compute the
Kolmogorov-Smirnov p-values for all time points and all VAEs. This allows us to display the average p-value across
time for the different VAE setups. Our rationale is that a higher average p-value implies a better average fit of the
model, i.e. VAE paths are closer to the training paths. Of course, it would be misleading to use the training paths for
the Kolmogorov-Smirnov test when varying the number of training samples since the test depends on the number
of samples. When varying the number of training paths, we, therefore, choose to test the VAE paths against 1,000
Black-Scholes paths that are independent of the training paths.
We should note that an analysis of the marginal distributions alone is not sufficient to ensure that the VAE paths
have similar multivariate distributions as the training paths. In the previous experiment, we looked at autocorrelations
of returns and absolute returns. However, we skip this analysis as it does not yield any interesting results.
The results of this experiment can be seen in figure 6.4 (a) and (b).
In figure 6.4 (a), we see the results of varying α. Here we observe that α = 0.8 and α = 0.9 both have average Kolmogorov-Smirnov p-values in the range 0.7 to 0.8 for all time points, whereas α = 0.99 has average p-values between 0.5 and 0.6. This shows that α should not be chosen too high (i.e. σ too low), because that emphasizes the reconstruction loss over the KL divergence loss, i.e. there is too little regularization. It cannot be seen in the figure, but we do not want to choose α too low either, as this would introduce too much noise in the VAE samples and shift the marginal distributions towards a normal distribution. Our choice of α = 0.9 therefore seems quite reasonable.
In figure 6.4 (b), we see the results of varying the number of training samples. We observe that the VAE generally
does a great job of capturing the marginal distributions of the Black-Scholes paths, even when only being trained on
100 paths. It is also clear from the 95% confidence bands that the average Kolmogorov-Smirnov p-value is higher
when the VAE is trained on more paths, which is not surprising. One could also argue that it is not very surprising
that the VAE performs well even when trained on 100 paths since the Black-Scholes model is incredibly simple
(especially regarding stationarity).
All in all, we are pretty pleased with the capabilities of the VAEs to capture characteristics of Black-Scholes
paths. However, we withhold final judgement until we have tested the VAE’s abilities as a data-generator for a deep
hedging model.
6.5 Cheating in the Heston model - capturing path dependency
We now move on to a more complicated problem (but not too complicated yet). Here, we choose to fit a VAE to paths of length n = 20 from a Heston model. The returns in a Heston model are not iid and depend on the instantaneous variance. In this experiment, we do not wish to generate paths with different initial instantaneous variances, but rather to capture the returns' interdependencies. We therefore choose to train the VAE on independent return-paths that all have the same initial instantaneous variance. Note that this is not realistic, as we usually only observe a single realized path with varying volatility across time. However, it is an interesting challenge for the VAE.
For this experiment, we choose a Heston model with high long-term and starting variance, high vol-of-vol and high mean-reversion. The parameters we use are
S(0) = 1
drift: µ = 0.05
ν(0) = 0.1
mean-reversion: κ = 5
long-term variance: θ = 0.1
vol-of-vol: σ = 1
correlation: ρ = −0.9.
We perform the experiment in a similar manner to the Black-Scholes experiment in the previous section:
- Sample 1,000 independent return-paths from the Heston model.
- Create a VAE with encoder and decoder with one hidden layer of 40 units, and choose the latent space to be in R^20. Remember n = 20.
- Train the VAE on the normalized return-paths for 1,500 epochs with a batch size of 128 and learning rates of 0.01, 0.0001 and 0.00001 for 500 epochs each.
- Sample 20,000 normalized return-paths from the trained VAE and convert them to return-paths and S-paths.
- Compare the VAE-generated paths to the training paths using the two-sample Kolmogorov-Smirnov test and autocorrelations.
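As an illustration of the first step, Heston return-path simulation under a full-truncation Euler scheme might look as follows (a sketch with our own names; we write xi for the vol-of-vol to avoid clashing with the σ of the VAE, and the scheme is our choice, not necessarily the one used in the thesis):

```python
import numpy as np

def simulate_heston_returns(M=1000, n=20, T=1/12, mu=0.05, v0=0.1,
                            kappa=5.0, theta=0.1, xi=1.0, rho=-0.9, seed=0):
    """Full-truncation Euler scheme for the Heston model; returns M
    independent log-return paths of length n."""
    dt = T / n
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((M, n))
    z2 = rho * z1 + np.sqrt(1 - rho**2) * rng.standard_normal((M, n))
    v = np.full(M, v0)
    log_returns = np.empty((M, n))
    for i in range(n):
        v_pos = np.maximum(v, 0.0)   # truncate negative variance at zero
        log_returns[:, i] = (mu - 0.5 * v_pos) * dt \
            + np.sqrt(v_pos * dt) * z1[:, i]
        v = v + kappa * (theta - v_pos) * dt + xi * np.sqrt(v_pos * dt) * z2[:, i]
    return log_returns

R = simulate_heston_returns(M=200)
print(R.shape)   # (200, 20)
```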
Figure 6.5: Standard deviation across time, Kolmogorov-Smirnov p-values and autocorrelation of paths from a VAE trained on Heston paths. (a) Kolmogorov-Smirnov p-values across time; (b) standard deviation of VAE and training paths across time; (c) autocorrelation of training and VAE absolute returns. The VAE is trained on 1,000 independent paths with α = 0.9 and no moment-regularization.
The results of this experiment can be seen in figure 6.5 (a)-(c).
In figure 6.5 (a), we see the Kolmogorov-Smirnov p-values across all time points. The VAE seems to perform well for the first few time points, but the quality quickly drops as time progresses. We get a hint of the reason in figure 6.5 (b). Here we see the standard deviations of the generated spots across all time points. It is obvious that the VAE has misjudged the variance of the returns, which amplifies the deviation in the standard deviation of the spots across time.
In figure 6.5 (c), we see the autocorrelations of the absolute returns. We are disappointed to see that the VAE has not captured any of the correlation in the absolute returns present in the Heston model. One could suspect that the ANNs are too small to represent the returns. However, the encoder E has 3,760 parameters and the decoder D has 2,540 parameters, which should be enough to represent more complicated functions.
Overall, we are pretty disappointed in the VAE’s performance. In the next section, we propose a possible solution
to these issues by introducing moment-regularization during training.
Fitting Heston with moment-regularization
In this section, we propose the use of moment-regularization to obtain better performance from the VAE in the Heston model (or, generally, in models with non-stationary returns). Our idea is to introduce regularization terms to the objective function in equation (16) that guide the VAE to produce samples that are more similar to the training paths when looking at moments and correlation. To do this, we introduce four regularization terms: one for the mean of the returns, one for the standard deviation of the returns, one for the correlation of returns and, lastly, one for the correlation of absolute returns.
The regularization terms are all based on samples from the VAE. That is, we first sample z ~ N(0, I_n) and then sample X|z ~ N(D(z), σ²I_n), where D is the decoder. Assume we have N samples in R^n from the VAE, {X̃_1, ..., X̃_N}, and training samples {X_1, ..., X_N}. The four terms are then computed as
1. Mean regularization:

(1/n) Σ_{i=1}^{n} | (1/N) Σ_{j=1}^{N} X_{j,i} − (1/N) Σ_{j=1}^{N} X̃_{j,i} |.
2. Standard deviation regularization:

(1/n) Σ_{i=1}^{n} | s_{X̃_i} − s_{X_i} |,

where s_{X_i} = sqrt( (1/(N−1)) Σ_{j=1}^{N} (X_{j,i} − X̄_i)² ) with X̄_i = (1/N) Σ_{j=1}^{N} X_{j,i}, and s_{X̃_i} is defined analogously for the X̃s.
3. Correlation of returns. We first define the sample correlation between dimension i and i + l (l is some lag) for the Xs:

ρ_X(i, l) := ( Σ_{j=1}^{N} (X_{j,i} − X̄_i)(X_{j,i+l} − X̄_{i+l}) ) / ( (N−1) s_{X_i} s_{X_{i+l}} ),

where s_{X_i} = sqrt( (1/(N−1)) Σ_{j=1}^{N} (X_{j,i} − X̄_i)² ) is the sample standard deviation of the i'th dimension of the Xs. The correlation regularization is then calculated as

Σ_{l=1}^{n} (1/l) · ( 1 / #{i : i ≥ 1, i + l ≤ n} ) Σ_{i : i ≥ 1, i + l ≤ n} | ρ_{X̃}(i, l) − ρ_X(i, l) |.

First, notice that we divide by l inside the first sum. This is done to place more emphasis on the correlations between returns with smaller lags. Also, notice that we apply the absolute operator inside the second sum. This is done to avoid the possibility of the model creating paths with correlations that perfectly offset each other.
4. The regularization term for the correlation of absolute returns is calculated like the above, but on {|X̃_1|, ..., |X̃_N|} and {|X_1|, ..., |X_N|}.
These four terms can then be subtracted from the objective function in equation (16). To adjust the emphasis on the moment-regularization terms, we scale them by a factor β (like we use α for balancing between the reconstruction and KL loss). The four moment-regularization terms should not add too much bias to the VAE, but simply guide the VAE during training to match the moments and correlations of the training samples.
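The four terms might be computed as follows (a NumPy sketch with our own names; note that the text sums lags l = 1, ..., n, but lag n has no valid index pairs, so the loop stops at n − 1):

```python
import numpy as np

def lagged_corr(P, i, l):
    # Sample correlation between dimensions i and i+l across the N paths.
    return np.corrcoef(P[:, i], P[:, i + l])[0, 1]

def moment_regularization(X, X_tilde):
    """The four moment-regularization terms: mean, standard deviation,
    correlation of returns, correlation of absolute returns; each compares
    VAE samples X_tilde to training samples X."""
    n = X.shape[1]
    mean_term = float(np.mean(np.abs(X.mean(axis=0) - X_tilde.mean(axis=0))))
    std_term = float(np.mean(np.abs(X_tilde.std(axis=0, ddof=1)
                                    - X.std(axis=0, ddof=1))))

    def corr_term(A, B):
        total = 0.0
        for l in range(1, n):   # lag n has no valid pairs
            diffs = [abs(lagged_corr(B, i, l) - lagged_corr(A, i, l))
                     for i in range(n - l)]
            total += np.mean(diffs) / l   # weight 1/l: emphasize small lags
        return float(total)

    return (mean_term, std_term, corr_term(X, X_tilde),
            corr_term(np.abs(X), np.abs(X_tilde)))

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))
terms = moment_regularization(X, X.copy())
print(terms[:2])   # (0.0, 0.0) for identical samples
```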
To test this regularization, we train 100 VAEs for each β in {0, 10, 100} with the same procedure as in the previous experiment. Note that β = 0 corresponds to no moment regularization. We then sample 20,000 paths from each VAE and calculate the average Kolmogorov-Smirnov p-value and autocorrelations over all VAEs for each β.
The results can be seen in figure 6.6 (a)-(c).
In figure 6.6 (a), we see the average Kolmogorov-Smirnov p-values across time for different values of β. It is clear that the superior choice of β, when it comes to marginal distributions, is β = 10. Both β = 0 (no regularization) and β = 100 give rise to decreasing p-values across time, which was also the case in the previous experiment without moment regularization. Moment regularization seems to work. However, it is clear that too much regularization can also ruin the marginal distributions, which makes sense since the distributions are not described by the mean, standard deviation and correlations alone.
Figure 6.6: Average Kolmogorov-Smirnov p-values and autocorrelations on returns and absolute returns for 100 VAEs trained on Heston paths with varying moment-regularization β, including 95% confidence bands in (a) and (b). (a) Average Kolmogorov-Smirnov p-values across time; (b) average autocorrelation of absolute returns; (c) average autocorrelation of returns. The VAEs are trained on 1,000 independent Heston paths with α = 0.9. The Kolmogorov-Smirnov p-values are calculated on 20,000 VAE paths and the training paths.
In figure 6.6 (b), we see the autocorrelations of the absolute returns. We observe that no level of β could produce models which, on average, have correlations matching the Heston model. However, we can observe that models trained with β = 100 produce paths that, on average, have correlations closest to the Heston model. Nevertheless, even though we have not reached a perfect match in correlation, we still see that moment regularization provides a significant improvement.
Lastly, we observe from figure 6.6 (c) that moment regularization has not introduced any unwanted correlation in the non-absolute returns. This is great considering the improvement in correlation of the absolute returns.
Overall, we conclude that moment regularization provides significant improvements to the fit of the VAE, but the high correlation of absolute returns is still quite difficult to capture, even with β = 100. VAEs should, in theory, be arbitrarily flexible. However, this experiment illustrates that training VAEs to capture complex structures can be difficult.
6.6 Conditioning on instantaneous variance in the Heston model
Up until now, we have only used VAEs. Remember that VAEs produce paths that are not conditioned on market states. Realistically, the volatility of a path will depend on the volatility before that path. This is also the case in the Heston model. In this section, we wish to utilize CVAEs to capture this dependency. This also implies that we can train our model on a single long observed path.
To make this experiment easier, we choose a Heston model with lower long-term variance and vol-of-vol. The parameters we use are
S(0) = 1
drift: µ = 0.05
ν(0) = 0.05 (not very important)
mean reversion: κ = 4
long term variance: θ = 0.05
vol-of-vol: σ = 0.25
correlation: ρ = 0.
We wish to train a CVAE to generate paths similar to those from the Heston model, but conditioned on an initial instantaneous variance ν(0), which is not necessarily the same as ν(0) = 0.05. To do this, we sample a single Heston path and divide it into paths of length n = 20 without overlap. Each path has a corresponding ν(t) (or ν(0)) that is the instantaneous variance at the beginning of the path. One might question whether it is realistic/sensible to assume that ν is observable. However, estimating ν, or letting the CVAE estimate ν, removes focus from the conditional part of the CVAE. We therefore assume that ν is known for this analysis.
The first experiment we perform has the following procedure:
- Sample one path from the Heston model with 1,000 · 20 = 20,000 returns. Divide the path into 1,000 non-overlapping return-paths with corresponding ν(0)s, which are the instantaneous variances at the beginning of each path.
- Create a CVAE with encoder and decoder with one hidden layer of 60 units, and choose the latent space to be in R^20. We also choose α = 0.9.
- Train the CVAE on the normalized return-paths, conditioned on the normalized instantaneous variances, for 1,500 epochs with a batch size of 128 and with learning rates of 0.01, 0.0001 and 0.00001 for 500 epochs each. The training is performed without any moment regularization, i.e. β = 0.
- Sample 20,000 CVAE paths for different instantaneous variances. We start with ν(0) = 0.05 and ν(0) = 0.0276, which corresponds to the 10% quantile of the training data.
Figure 6.7: Standard deviations and Kolmogorov-Smirnov p-values across time from a single CVAE trained on 1,000 Heston paths. (a) Kolmogorov-Smirnov p-values across time, ν(0) = 0.05; (b) standard deviation of VAE and training paths across time, ν(0) = 0.05; (c) Kolmogorov-Smirnov p-values across time, ν(0) = 0.0276; (d) standard deviation of VAE and training paths across time, ν(0) = 0.0276. For the simulated paths, we condition on ν(0) = 0.05 in (a)-(b) and ν(0) = 0.0276 (the 10% quantile) in (c)-(d). The Kolmogorov-Smirnov p-values are calculated on 20,000 sampled CVAE paths (for each ν(0)) and 1,000 Heston paths with corresponding ν(0), independent of the training paths.
- Compare the CVAE paths to 1,000 independent Heston paths sampled with corresponding ν(0). The comparison is performed with a two-sample Kolmogorov-Smirnov test for each time point and by looking at the standard deviations of the simulated S-paths across time. We will not analyze correlations, since fitting the marginal distributions is our primary concern, and it is not trivial in this case.
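The conditional sampling step can be sketched as follows, assuming the decoder takes the latent z concatenated with the (normalized) conditioning variable (the toy decoder is only a stand-in for a trained ANN, and all names are ours):

```python
import numpy as np

def sample_cvae_paths(decoder, y, n=20, n_samples=5, sigma=0.1, seed=0):
    """Conditional sampling sketch: sample z ~ N(0, I_n) and feed the
    decoder [z, y], where y is the conditioning variable, here nu(0)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, n))
    y_col = np.full((n_samples, 1), y)
    X = decoder(np.hstack([z, y_col])) \
        + sigma * rng.standard_normal((n_samples, n))
    return X   # normalized return-paths, to be denormalized as before

# Stand-in decoder: keeps only the latent part of its input.
toy_decoder = lambda inp: inp[:, :20]
X = sample_cvae_paths(toy_decoder, y=0.05)
print(X.shape)   # (5, 20)
```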
The results of the experiment can be seen in figure 6.7 (a)-(d).
In figure 6.7 (a), we see the Kolmogorov-Smirnov p-values across time for the CVAE paths generated with ν(0) = 0.05. We observe that the p-value is decreasing in time. This is a sign of accumulated mismatch between returns from the CVAE and those from a Heston model with ν(0) = 0.05. We get a clue to the poor fit in figure 6.7 (b), where we see the standard deviation of the S-paths across time. It is clear that the CVAE does not capture the correct volatility.
In figure 6.7 (c) and (d), we have repeated the same experiment with the same CVAE, but with ν(0) = 0.0276 (corresponding to the 10% quantile of the training data). The Kolmogorov-Smirnov p-values in figure 6.7 (c) are virtually all zero. This is also reflected in figure 6.7 (d), where we see the standard deviations of the simulated paths. It is clear that the CVAE does not capture the volatility at all when conditioned on a low initial instantaneous variance.
Overall, we are pretty disappointed with the results. The hope was that the CVAE could capture the volatility of the paths when starting from different instantaneous variances. This is also why we have not even considered the autocorrelation structure, which is less relevant given that the CVAE failed to capture the marginal distributions.
Like before, we wish to apply some regularization to the (conditional) moments. This is what we introduce/propose in the next section.
Regularization for conditional moments
In this section, we propose a way to apply regularization for conditional moments. We cannot use the moment regularization we used for the VAEs, since it does not guide the CVAE towards the correct conditional moments (only the unconditional moments). It is, however, not obvious how one should create such regularization.
Assume we have training samples {X_1, ..., X_N} and corresponding conditional variables {Y_1, ..., Y_N}. Assume now that we sample {X̃_1, ..., X̃_N} from a CVAE based on the Ys. The issue now is that we do not know the moments of the Xs and X̃s conditioned on the Ys. There are, of course, different ways of estimating the conditional moments. For the X̃s (the CVAE samples), we could simply create a larger sample and average the results. However, this would be slow, as it would have to be done at every gradient step. For the Xs (our training samples), we could perform a regression to estimate the conditional moments. This only has to be done once, but it would introduce more complexity than we would like. The simplest solution is to estimate the conditional moments E[X̃_{j,i}|Y_j] and E[X_{j,i}|Y_j] as X̃_{j,i} and X_{j,i}, respectively. We can do something similar to estimate the conditional second moments.
The idea is now to introduce a regularization term that guides the CVAE towards a solution where E[X̃_{j,i}|Y_j] is close to E[X_{j,i}|Y_j] for all j, i, and similarly for the second moment. We do this by introducing two regularization terms:
1. Conditional mean regularization loss:

(1/N) Σ_{j=1}^{N} Σ_{i=1}^{n} (X̃_{j,i} − X_{j,i})².

2. Conditional second moment regularization loss:

(1/N) Σ_{j=1}^{N} Σ_{i=1}^{n} (X̃²_{j,i} − X²_{j,i})².
These regularization terms are similar to those from the previous section. However, in this case, we apply the square operator inside the second sum (note that we used the absolute operator in the previous section; this is a somewhat arbitrary choice). The regularization terms are subtracted from the objective function for the CVAEs, scaled by a factor γ, which will allow us to test their effectiveness.
The fear with the above regularization is that it guides the CVAE towards producing deterministic paths given Y, due to the regression-like nature of the terms. Still, we hope that combining the first- and second-moment terms with the original objective function counteracts this.
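The two terms are straightforward to compute (a NumPy sketch with our own names; the pairing of X̃_j with X_j reflects that both are conditioned on the same Y_j):

```python
import numpy as np

def conditional_moment_regularization(X, X_tilde):
    """The two conditional-moment terms from the text: squared differences
    of paired first and second moments, averaged over the N sample pairs."""
    mean_term = float(np.mean(np.sum((X_tilde - X)**2, axis=1)))
    second_term = float(np.mean(np.sum((X_tilde**2 - X**2)**2, axis=1)))
    return mean_term, second_term

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
print(conditional_moment_regularization(X, X.copy()))   # (0.0, 0.0)
```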
To test these regularization terms, we train 50 CVAEs with the same setup as in the experiment in figure 6.7 for each γ in {0, 1, 10}. Note that γ = 0 corresponds to no regularization. Once we have trained 50 CVAEs for each of the different γs, we wish to test the CVAEs conditioned on different instantaneous variances. To do this, we find, for each CVAE, the instantaneous variances corresponding to the 10%, 15%, ..., 90% quantiles of the training data. Note that the 10% and 90% quantiles of ν(0) are approximately 0.0270 and 0.0765 on average, respectively. For each quantile of ν(0), we sample 20,000 CVAE paths and 1,000 independent Heston paths conditioned on the specific quantile of ν(0). This is done for every CVAE. We are then able to find the average Kolmogorov-Smirnov p-values over CVAEs and time for the different quantiles of ν(0) and choices of γ.
Figure 6.8: Results from 50 CVAEs, each trained on 1,000 Heston paths conditioned on instantaneous variance, for varying regularization γ. (a) displays average Kolmogorov-Smirnov p-values over different quantiles of ν(0) (calculated from the training paths); the 10% and 90% quantiles of ν(0) are approximately 0.027 and 0.077, respectively. The Kolmogorov-Smirnov p-values are calculated on 20,000 paths sampled from the CVAEs and 1,000 independent Heston paths with corresponding ν(0). (b) and (c) show average autocorrelations of absolute returns and returns, respectively, calculated on samples from the CVAEs and the Heston model with ν(0) at its 60% quantile (ν(0) ≈ 0.0525 on average).
As a sanity check, we also look at the autocorrelations for the 60% quantile of ν(0). The results of this experiment can be seen in figure 6.8 (a)-(c).
In figure 6.8 (a), we see the average Kolmogorov-Smirnov p-values across different values of ν(0) and choices of γ. We observe that without moment regularization (γ = 0), we obtain a good fit for ν(0)-quantiles between 40% and 80%. However, the models are not capable of producing Heston-like samples with extreme initial instantaneous variances. For γ = 1, we see significantly better performance over a wider range of initial instantaneous variances. However, we see that too much regularization (γ = 10) hurts performance. Focusing on marginal distributions, it is clear that some regularization is beneficial, especially if we want the CVAE to produce samples for a wider range of ν(0).
In figure 6.8 (b), we see the autocorrelations for different choices of γ when sampling conditioned on the 60% quantile of ν(0) (ν(0) ≈ 0.0525 on average). We observe that no CVAEs have, on average, managed to reproduce the positive, downward-sloping autocorrelations from the Heston model. This is again disappointing, but from figure 6.8 (b) and (c), we see that the CVAEs generally produce uncorrelated return-paths (also in absolute values). One could argue that this is the next best result if the CVAEs cannot match the Heston model.
One could try adding more regularization or optimizing training to produce CVAEs capable of producing
absolute returns with positive correlation. We have tried various methods and regularization terms. However, the
problem is unfortunately not easily solved. This is an obvious downside of VAEs/CVAEs (at least with our setup).
All in all, we managed to find a regularization method that improved the training of CVAEs, such that the
CVAEs could match the marginal distributions of Heston paths conditioned on some initial instantaneous variance.
Unfortunately, we could not replicate the positive absolute correlation.
It is worth remembering that we obtained the above results using a combination of parameters, which should be
relatively easy for the model to handle. This shows that sampling 20-dimensional return-paths is a complex problem,
and using our setup in practice might not be feasible (without clever modifications).
6.7 Overlapping training paths (is it possible?)
This section addresses the elephant in the room, which is the amount of data we use to train the VAEs/CVAEs. For
the CVAEs in the previous experiments, we utilized 1,000 non-overlapping return-paths of length
n= 20
. If we
assume (for simplicity) that there are precisely 20 days in a month, then 1,000 non-overlapping return-paths amount
to more than 83 years of data. That is a staggering amount of data for only 1,000 return-paths. The idea of this
section is to investigate the possibility of training the CVAEs on overlapping return-paths.
If we have N returns in total and we wish to create return-paths of length n, it is possible to create ⌊N/n⌋ non-overlapping return-paths. If we choose to overlap the paths, however, we get N − n + 1 return-paths. This is tempting, as it almost increases the size of our training data by a factor of n.
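As a small illustration (our own sketch, not code from the thesis), the two ways of cutting a single return series into return-paths can be written as a sliding window; the function name and interface are our own:

```python
import numpy as np

def make_return_paths(returns, n, overlapping=True):
    """Cut a single series of N returns into return-paths of length n.

    Overlapping windows yield N - n + 1 paths; non-overlapping
    windows yield floor(N / n) paths.
    """
    returns = np.asarray(returns)
    N = len(returns)
    if overlapping:
        starts = range(N - n + 1)           # every possible window
    else:
        starts = range(0, N - n + 1, n)     # disjoint windows
    return np.array([returns[s:s + n] for s in starts])

rets = np.random.default_rng(0).normal(size=1000)
print(make_return_paths(rets, 20, overlapping=False).shape)  # (50, 20)
print(make_return_paths(rets, 20, overlapping=True).shape)   # (981, 20)
```

With N = 1,000 daily returns and n = 20, this reproduces the counts used later in section 7.1: 50 non-overlapping versus 981 overlapping return-paths.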
The fear of using overlapping paths is that it would yield some unwanted dependencies in our CVAE or a weird
misfit to the base model. Remember that we assumed that the sample return-paths are independent. In machine
learning, it is, however, quite common to train models in this way. When training on images, it is common to create
dozens of variations of every image to increase the size of the available training set (through rotation, flips etc.).
The same logic could also work for return-paths. For example, suppose a large negative return was experienced on
the first day of a month. In that case, it might be beneficial to show the CVAE multiple scenarios where that event
occurred at different times of the month. This makes sense since we would typically assume that the exact day of the
month does not influence returns.
We test CVAEs that are trained on overlapping paths by creating a similar experiment to that in the previous
section. We limit the CVAEs to 20 years of data, which corresponds to 240 and 4,781 return-paths for the
non-overlapping data sets and overlapping data sets, respectively. We then train 50 CVAEs on 240 non-overlapping
return-paths and 50 CVAEs on 4,781 overlapping return-paths conditioned on the initial instantaneous variances. All
paths are sampled from a Heston model with similar parameters to the one used in the previous section. Moreover,
all CVAEs are trained with α = 0.9 and conditional moment regularization (γ = 1) for 1,500 epochs with batch size 128 and learning rates 0.01, 0.0001 and 0.00001 for 500 epochs each. We then sample 20,000 paths from each CVAE and for each instantaneous variance quantile (based on the training data) in {10%, 15%, ..., 90%}, corresponding to ν(0) from 0.0274 to 0.0751 on average. Each set of 20,000 CVAE paths is then compared, using a two-sample Kolmogorov-Smirnov test, to 1,000 independent Heston paths sampled with the corresponding ν(0). The results can be seen in figure 6.9 (a)-(c).
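For concreteness, the marginal comparison can be sketched with SciPy's two-sample Kolmogorov-Smirnov test, applied per time step and averaged; this is our own minimal reconstruction of the procedure, shown here on synthetic stand-in data:

```python
import numpy as np
from scipy.stats import ks_2samp

def avg_ks_pvalue(generated, reference):
    """Average the two-sample KS p-value over the marginal at each time step.

    generated: (20000, 20) sampled return-paths; reference: (1000, 20)
    model return-paths. Each column is one time step.
    """
    pvals = [ks_2samp(generated[:, t], reference[:, t]).pvalue
             for t in range(generated.shape[1])]
    return float(np.mean(pvals))

rng = np.random.default_rng(1)
gen = rng.normal(0.0, 0.02, size=(20_000, 20))  # stand-in for CVAE samples
ref = rng.normal(0.0, 0.02, size=(1_000, 20))   # stand-in for Heston returns
print(avg_ks_pvalue(gen, ref))
```

A high average p-value indicates that the marginal distributions of the generated paths are statistically indistinguishable from the reference model at each time step.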
In figure 6.9 (a), we see the average Kolmogorov-Smirnov p-values (averaged over CVAEs and time) across quantiles of ν(0). It is clear that the CVAEs trained on overlapping return-paths on average match the marginal distributions of the Heston model significantly better, across all ν(0)-quantiles, than CVAEs trained on non-overlapping return-paths. Remember again that the CVAEs trained on non-overlapping paths are trained on significantly fewer paths.
(a) Avg. Kolmogorov-Smirnov p-values across quantiles of ν(0). (b) Avg. autocorrelation of absolute returns for ν(0) at its 60% quantile. (c) Avg. autocorrelation of returns for ν(0) at its 60% quantile.

Figure 6.9: Results from 50 CVAEs trained on overlapping and 50 CVAEs trained on non-overlapping Heston paths, conditioned on instantaneous variance. (a) displays average Kolmogorov-Smirnov p-values over different quantiles of ν(0). The 10% and 90% quantiles of ν(0) are approximately 0.027 and 0.077, respectively. The Kolmogorov-Smirnov p-values are calculated on 20,000 paths sampled from the CVAEs and 1,000 independent Heston paths with corresponding ν(0). (b) and (c) show average autocorrelations calculated on samples from the CVAEs and Heston model with ν(0) at its 60% quantile (ν(0) ≈ 0.0525).
In figure 6.9 (b), we see that for the 60% ν(0)-quantile, neither CVAEs trained on overlapping paths nor CVAEs trained on non-overlapping paths have captured the positive dependence of the Heston paths. However, we do notice that CVAEs trained on overlapping paths have, on average, a slightly negative autocorrelation of absolute returns, which could be concerning. However, the magnitude is on the smaller side.
Lastly, in figure 6.9 (c), we see that neither CVAEs trained on overlapping paths nor CVAEs trained on non-overlapping paths show noticeable correlations between non-absolute returns, which is in line with the Heston model.
Overall, we are pretty impressed with the performance of the CVAEs trained on overlapping paths, even though the CVAEs, on average, had a slight negative autocorrelation of absolute returns. It is, however, clear that non-overlapping paths require too much data when using return-paths of length n = 20.
We should mention that these issues could possibly be resolved by training the CVAEs to sample shorter
return-paths. Longer paths could then be obtained by connecting multiple shorter paths. The issue with this strategy
is that we have to choose the conditional variables to allow this sort of sampling. However, this is far from trivial,
which is why we choose not to explore that option further.
7 Data-Driven Hedge Experiments
This section aims to combine the deep hedging approach from section 3 and the market generators from section
6. We hope to find acceptable hedging strategies by only observing a (relatively) small number of paths from the
underlying asset. The ideal experiment would be to learn hedging strategies from actual stock prices. In this thesis,
however, we stick to the assumption that the underlying assets follow either a Black-Scholes or Heston model. We do
this since it is easier to evaluate our methods and because the market generators might not be advanced enough to
deal with actual data yet.
Another purpose of these experiments is to test the VAEs and CVAEs further. In section 6, we only tested the marginal distributions and looked at the correlations of the sampled S-paths and returns. Testing the VAEs and CVAEs as market generators for deep hedging models will, therefore, also act as a test of the joint distributions learned by the VAEs and CVAEs.
7.1 VAE powered hedge experiments - Black-Scholes
In this section, we wish to perform hedge experiments involving hedging a simple ATM call option with maturity T = 1/12 (one month). One might think that call options are too simple, as the price of a call option only depends on the distribution of S(T) under Q, but remember that hedging depends on the entire joint distribution of the path under the P-measure.
We assume daily rebalancing with 20 hedge points including t = 0, but not t = T. For simplicity, in these first experiments we assume that the underlying asset follows a Black-Scholes model with parameters S(0) = 1, µ = 0.05, σ = 0.3, and we assume a fixed interest rate r = 0.01 and no transaction costs.
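Purely as an illustrative sketch (not the thesis code), the path simulation and the delta hedging benchmark under these parameters can be written as follows; the helper names are ours, and the grid matches the 20 hedge points plus maturity:

```python
import numpy as np
from math import log, sqrt, exp
from statistics import NormalDist

S0, mu, sigma, r, K, T, n = 1.0, 0.05, 0.3, 0.01, 1.0, 1 / 12, 20

def bs_delta(S, tau):
    """Black-Scholes delta of a call with time to maturity tau."""
    d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * sqrt(tau))
    return NormalDist().cdf(d1)

def delta_hedge_pnl(paths, price0):
    """Terminal hedge error from daily delta hedging along simulated paths."""
    m = paths.shape[0]
    dt = T / n
    value = np.full(m, price0)   # self-financing portfolio value
    holdings = np.zeros(m)
    for i in range(n):           # rebalance at t_0, ..., t_{n-1}, not at T
        tau = T - i * dt
        new = np.array([bs_delta(s, tau) for s in paths[:, i]])
        cash = value - holdings * paths[:, i]     # current bank account
        cash -= (new - holdings) * paths[:, i]    # pay for the rebalancing
        holdings = new
        value = cash * exp(r * dt) + holdings * paths[:, i + 1]
    return value - np.maximum(paths[:, -1] - K, 0.0)  # PnL vs call payoff

# simulate P-measure Black-Scholes paths on the 21 grid points
rng = np.random.default_rng(0)
dt = T / n
z = rng.standard_normal((10_000, n))
logS = np.cumsum((mu - 0.5 * sigma ** 2) * dt + sigma * sqrt(dt) * z, axis=1)
paths = S0 * np.exp(np.hstack([np.zeros((10_000, 1)), logS]))

pnl = delta_hedge_pnl(paths, 0.03494)  # initialized at the true option price
print(round(float(np.mean(pnl)), 4), round(float(np.std(pnl)), 4))
```

With these parameters, the PnL mean should be close to zero and the PnL standard deviation should land in the same ballpark as the analytical column of table 7.1 below (around 0.0065).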
In the first experiment, we wish to test deep hedging models that utilize VAEs trained on 100, 250 and 1,000 paths. Each deep hedging model is trained with 2^18 independent VAE paths. We use two benchmarks for this experiment. The first is a deep hedging model that is trained on 2^18 sample paths from the actual underlying model. The second is a model that utilizes standard delta hedging. The experiment is described in detail below.

For each choice of the number of training paths {100, 250, 1,000}, we train 10 VAEs. The training paths are all sampled independently from the actual Black-Scholes model. All VAEs are trained with α = 0.9 and β = 10. The encoders and decoders, in each VAE, have one hidden layer with 40 units.

We train a deep hedging model for each of the 30 VAEs (10 for each number of training samples). Each deep hedging model is trained on 2^18 VAE paths. The ANNs, in the deep hedging models, have four layers with five units each. Remember that a deep hedging model contains an ANN for every trading decision.

We also train 10 deep hedging models on 2^18 samples from the actual underlying model.
                         VAE Deep Hedge               Deep Hedge   Analytical
#VAE training samples    100       250       1,000
Avg. Abs PnL     Mean    0.0064    0.0056    0.0054   0.0058       0.0049
                 Std     0.00042   0.00006   0.00006  0.00002      0
PnL Std          Mean    0.0082    0.0072    0.0070   0.0073       0.0065
                 Std     0.00058   0.00007   0.00005  0.00002      0
CVaR_0.95        Mean    0.0176    0.0148    0.0141   0.0136       0.0153
                 Std     0.00174   0.00023   0.00013  0.00001      0
Turnover         Mean    2.2157    1.9261    1.8542   1.7635       1.9014
                 Std     0.10306   0.04393   0.02071  0.00269      0

Table 7.1: Results from a hedge experiment with an ATM call option in a Black-Scholes model without transaction costs. Each column represents the average results (including standard deviation) of 10 hedging models. In the first three columns, deep hedging models are trained with 2^18 VAE paths. The VAEs are trained on a varying number of training paths. The fourth column represents deep hedging models trained on actual Black-Scholes paths, and the fifth column represents the delta hedging strategy. The results are derived from a hedge experiment with 100,000 independent paths from the actual Black-Scholes model.
We then perform a hedge experiment on 100,000 independent paths. In this experiment, we test all deep
hedging models and a delta hedging approach. Furthermore, all portfolios are initialized with a portfolio
value of 0.03494, corresponding to the actual option price.
The results of the experiment can be seen in table 7.1 and figure 7.1.
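The performance measures reported in the tables (average absolute PnL, PnL standard deviation, empirical CVaR at the 95% level, and turnover) can be computed from simulated hedge results along these lines; this is our own sketch, and in particular the turnover definition (total absolute change in holdings, including the initial purchase) is an assumption:

```python
import numpy as np

def hedge_metrics(pnl, holdings, alpha=0.95):
    """Summary statistics for a hedge experiment.

    pnl:      (m,) terminal hedge errors
    holdings: (m, n) holdings in the underlying at the n hedge points
    """
    loss = -np.asarray(pnl)
    var = np.quantile(loss, alpha)
    cvar = loss[loss >= var].mean()  # expected loss in the worst (1-alpha) tail
    trades = np.diff(holdings, axis=1, prepend=0.0)  # includes initial purchase
    turnover = np.abs(trades).sum(axis=1).mean()
    return {"avg_abs_pnl": float(np.abs(pnl).mean()),
            "pnl_std": float(np.std(pnl)),
            "cvar": float(cvar),
            "turnover": float(turnover)}

# synthetic stand-in data, purely to show the interface
rng = np.random.default_rng(2)
metrics = hedge_metrics(rng.normal(0.0, 0.007, 100_000),
                        rng.uniform(0.0, 1.0, (100_000, 20)))
print(metrics)
```

Note that for roughly Gaussian PnL, the empirical CVaR at 95% comes out at about twice the PnL standard deviation, which is consistent with the ratios seen in the tables.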
In table 7.1, we observe that the deep hedging models trained on actual Black-Scholes paths generally performed the best in terms of CVaR. This is not surprising, as these models are trained on perfectly distributed data. We
also observe that the deep hedging models trained on actual Black-Scholes paths are very similar as the standard
deviation across all performance measures is relatively low. This suggests that the deep hedging model is relatively
stable, at least in this simple experiment.
Looking at the VAE powered models, we observe that the performance increases with the number of training
paths. When looking at empirical CVaR, we see that even deep hedging models using VAEs trained on 250 samples
(with an average CVaR of 0.0148) outperformed, on average, the analytical (delta hedging) approach (with a CVaR of
0.0153). However, it is also clear that the deep hedging models using VAEs trained on 100 sample paths (with an average CVaR of 0.0176) did not perform as well as the analytical approach. They also exhibited higher variance in performance, which is expected, as the VAEs may differ significantly due to the few training samples.
It is also interesting that the turnover is significantly higher for deep hedging models using VAEs trained on fewer
paths. The issue is visualized in figure 7.1. Here we see a comparison of holdings in the underlying asset between
the analytical approach, a deep hedging model trained on actual Black-Scholes paths and a deep hedging model
trained on VAE paths (from a VAE trained on 100 Black-Scholes paths). The deep hedging model trained on VAE
paths approximately captures the optimal strategy in the two test samples, but it lacks stability and precision. The
issue could stem from overfitting from the deep hedging model. However, it is hard to pinpoint the reason without
further analysis. One possible test would be to introduce some regularization on the deep hedging models that discourages overfitting. One might also suspect that these stability issues would be partially alleviated by introducing transaction costs.
All this suggests that one must be extremely careful when using VAEs combined with the deep hedging approach
as instabilities may arise in the hedging strategy.
Stability of VAEs vs deep hedging models
In the next experiment, we would like to get a sense of the stability of the deep hedging model compared to the VAE.
To do this, we repeat the previous experiment, but this time we train 11 VAEs on the same 250 training samples. For each of the first 10 VAEs, we train a single deep hedging model, and for the last VAE, we train 10 separate deep hedging models. We then test the deep hedging models on 100,000 independent Black-Scholes paths. The results of this experiment can be seen in table 7.2.

(a) Sample 1. (b) Sample 2.

Figure 7.1: Holdings in S across time for two different test sample paths. The experiment involves hedging an ATM call option in a Black-Scholes model without transaction costs. The VAE behind ANN CVaR MG is trained on 100 independent Black-Scholes paths.

                     Same VAE   Different VAEs (same data)
Avg. Abs PnL  Mean   0.0057     0.0057
              Std    0.00001    0.00006
PnL Std       Mean   0.0073     0.0073
              Std    0.00001    0.00006
CVaR_0.95     Mean   0.0150     0.0152
              Std    0.00004    0.00016
Turnover      Mean   1.9831     1.9857
              Std    0.00555    0.03961

Table 7.2: Results from a hedge experiment with an ATM call option in a Black-Scholes model without transaction costs. Each column represents the average results (including standard deviation) of 10 deep hedging models. In the first column, deep hedging models are trained with 2^18 samples from the same VAE trained on 250 paths. The second column represents deep hedging models trained on paths from different VAEs (also trained on 250 paths). The results are derived from a hedge experiment with 100,000 independent paths from the actual Black-Scholes model.
In table 7.2, we observe that deep hedging models trained on the same VAE have a significantly lower variance across all performance measures (avg. abs. PnL, CVaR etc.). This difference suggests that the deep hedging models are pretty stable and that performance depends heavily on the samples generated by the VAEs (at least in this experiment). Comparing the results to those from table 7.1 (with 250 training paths), we see that training VAEs on different sets of sample paths does not add much extra variance compared to VAEs trained on the same set of sample paths. To be specific, we see this in the second column of both table 7.1 and 7.2: the standard deviation of CVaRs is 0.00016 vs 0.00023, and the standard deviation of turnover is 0.03961 vs 0.04393. Remember that the standard deviation is calculated over the performance of 10 deep hedging models. Note that these results may depend heavily on the number of paths used to train the VAEs (250 in this case). Still, this suggests that the VAEs' training procedure could be improved. However, we must remember that training the VAEs includes simulation in each gradient step, which naturally introduces variance in the models.
#VAE                    50                 981             250                4,981
training samples        (non-overlapping)  (overlapping)   (non-overlapping)  (overlapping)
Avg. Abs PnL    Mean    0.0154             0.0055          0.0056             0.0054
                Std     0.00748            0.00019         0.00006            0.00006
PnL Std         Mean    0.0201             0.0071          0.0072             0.0069
                Std     0.00918            0.00026         0.00007            0.00006
CVaR_0.95       Mean    0.0473             0.0145          0.0148             0.0139
                Std     0.01981            0.00074         0.00023            0.00008
Turnover        Mean    4.8062             1.8258          1.9261             1.8205
                Std     2.07667            0.02965         0.04393            0.01277

Table 7.3: Results from a hedge experiment with an ATM call option in a Black-Scholes model without transaction costs. Each column represents the average results (including standard deviation) of 10 deep hedging models. In the first and third columns, deep hedging models are trained with 2^18 samples from VAEs trained on 50 and 250 non-overlapping paths, respectively. The second and fourth columns represent deep hedging models trained on paths from VAEs trained on 981 and 4,981 overlapping paths, respectively. The results are derived from a hedge experiment with 100,000 independent paths from the actual Black-Scholes model.
Using overlapping paths to train the VAE
In this section, we wish to test the effectiveness of using overlapping paths to train the VAEs. In a previous experiment
(see section 6.7), we noticed that using overlapping paths was quite effective even in the Heston model, at least when
testing marginal distributions and comparing correlations of returns.
In another experiment (see table 7.1), we saw that deep hedging models, based on VAEs trained on 250 independent (i.e. non-overlapping) Black-Scholes paths, were comparable, in terms of CVaR, to the delta hedging strategy. However, 250 non-overlapping return-paths require 250 months of data, which is more than 20 years. If we had used overlapping return-paths, we could get 4,981 (= 250 · 20 − 20 + 1) return-paths. Another example: if we use only 50 non-overlapping return-paths, corresponding to just over four years of data, we could get 981 overlapping return-paths instead. We, therefore, wish to investigate whether it is worth using the overlapping return-paths.
For this experiment, we train 10 VAEs on 981 overlapping return-paths, 10 VAEs on 4,981 overlapping
return-paths, 10 VAEs on 50 non-overlapping return-paths and 10 VAEs on 250 non-overlapping return-paths.
We then train a deep hedging model for each VAE using 2^18 paths. The results of this experiment can be seen in table 7.3.
From table 7.3, we observe that using overlapping return-paths to train the VAEs provides a significant performance boost. It is clear that 50 non-overlapping training paths are not enough to train a useful VAE. The average CVaR of 0.0473, obtained using VAEs trained on 50 non-overlapping training paths, is more than three times larger than the average CVaR of 0.0145 obtained using VAEs trained on 981 overlapping paths. A similar difference exists for the turnover. Moreover, we observe that deep hedging models using VAEs trained on 981 overlapping paths are comparable to deep hedging models using VAEs trained on 250 non-overlapping paths (which use five times more data). Lastly, we observe that the deep hedging models using VAEs trained on 4,981 overlapping paths perform the best (with an average CVaR of 0.0139) and come pretty close to the deep hedging models trained on actual Black-Scholes paths from table 7.1.
All in all, we conclude that using overlapping paths is much preferred, at least for the current hedging problem.
Hedging a down-and-out call option
We have now seen that deep hedging models using VAEs can successfully learn hedging strategies for call options. It, therefore, seems like the VAEs are (reasonably) proficient in capturing the distributional properties of Black-Scholes paths. To test this further, we wish to conduct a hedge experiment with an ATM down-and-out call option, like in section 5.3. We assume that the down-and-out call option has maturity T = 1/12 and a barrier at L = 0.95. Like with the simple call option, we assume daily rebalancing (n = 20), and we also assume that the barrier is only monitored at the 20 hedge points.

                         MG Hedge                                             BS ANN    Analytical
#VAE                     250                1,000              4,981
training samples         (non-overlapping)  (non-overlapping)  (overlapping)
Avg. abs. PnL    Mean    0.0062             0.0061             0.0061         0.0064    0.0050
                 Std     0.00018            0.00013            0.000143       0.00019   0
PnL Std          Mean    0.0094             0.0093             0.0093         0.0091    0.0062
                 Std     0.00037            0.00023            0.000215       0.00013   0
CVaR_0.95        Mean    0.0162             0.0151             0.0148         0.0144    0.0164
                 Std     0.00042            0.00012            0.000080       0.00003   0
Turnover         Mean    1.8645             1.7458             1.6795         1.6388    1.6194
                 Std     0.03595            0.02072            0.011673       0.00334   0

Table 7.4: Results from a hedge experiment with an ATM down-and-out call option in a Black-Scholes model without transaction costs. Each column represents the average results (including standard deviation) of 10 hedging models. In the first three columns, deep hedging models are trained with 2^18 samples from VAEs that are trained on a varying number of training paths (in the third column, all VAEs are trained on overlapping paths). The fourth column represents deep hedging models trained on actual Black-Scholes paths, and the fifth column represents the delta hedging strategy. The results are derived from a hedge experiment with 100,000 independent paths from the actual Black-Scholes model.
In section 5.3, we saw that multiple different architectures allowed a deep hedging model to hedge a down-and-out
call option. The best method required giving the ANN (representing the current trading decision) the minimum
value of the underlying up until that point. We choose to use this method since it focuses on the distribution of the
minimum of the underlying asset. This is desirable, as we wish to test the joint distribution of the VAEs' paths.
To perform this experiment, we train 10 VAEs on 250 non-overlapping Black-Scholes paths, 10 VAEs on 1,000 non-overlapping independent Black-Scholes paths and 10 VAEs on 4,981 overlapping Black-Scholes paths (corresponding to 250 non-overlapping paths). For each VAE, we train a deep hedging model on 2^18 paths from that VAE. For comparison, we also train 10 deep hedging models, each on 2^18 paths from the actual Black-Scholes model. To test the models, we perform a hedge experiment with 100,000 independent paths from the actual Black-Scholes model. The experiment also includes a strategy using delta hedging for the continuous down-and-out call option. All hedging strategies have an initial portfolio value of 0.02989, which corresponds to the price of a continuous down-and-out call option. The results of this experiment can be seen in table 7.4.
In table 7.4, we observe that the deep hedging models trained on VAE paths perform pretty well. Looking at average empirical CVaR, we see that performance increases with the number of paths used for training the VAEs. This is expected. However, we notice that the deep hedging models using VAEs trained on overlapping paths performed the best, with an average CVaR of 0.0148. For comparison, deep hedging models trained on actual Black-Scholes paths obtained an average CVaR of 0.0144. This is (again) a clear vindication of training VAEs on overlapping paths. It would be reasonable to fear that the VAEs would inherit some undesired dependence from training on overlapping paths. However, this does not seem to be the case, even when hedging down-and-out call options.
We also observe that deep hedging models using VAEs trained on 250 non-overlapping paths are (with an average CVaR of 0.0162) comparable to the delta hedging approach (with an average CVaR of 0.0164). This is in line with previous experiments.
One might be concerned about the difference between the deep hedging and delta hedging models when it comes to the average absolute PnL and PnL standard deviation. However, in section 5.3, we saw that the deep hedging models' optimization over CVaR likely explains this difference.
Overall, we are pretty pleased with the results, as they confirm that combining VAEs and deep hedging models works even for mildly path-dependent options, at least in the Black-Scholes model.
Subconclusion for hedging in Black-Scholes
Overall, we are pretty pleased with the results. From the previous experiments, we have seen that it is (indeed)
possible to learn a hedging strategy for a call option only by observing a relatively small number of paths from the
actual model. This is especially impressive since our models (VAE and deep hedging model) do not depend on
any knowledge of the model for the underlying asset and its relation to a call option. However, we do exploit some
simplicities of the Black-Scholes model. We also saw that most of the variation in the hedging strategies came from
the training of the VAEs, which was pretty reasonable, as training of the VAEs required random simulation in each
gradient step. Lastly, we saw that using overlapping paths could massively boost the performance of the hedging
strategies (despite concerns regarding dependence) as it allowed a 20-fold increase in training data.
However, before we celebrate these results too much, we have to remember that all these results assumed a
simple Black-Scholes model as the underlying model for the asset. We do, therefore, not automatically expect this
method to work with more complicated models or real data.
7.2 CVAE powered hedge experiments - Heston
In this section, we wish to perform hedge experiments in a Heston model. The idea is to train CVAEs on Heston
paths conditioned on the corresponding instantaneous variances. The hope is that the CVAEs can produce new paths
conditioned on some arbitrary instantaneous variance. This is a more realistic framework since it allows us to train
the CVAE on a single observed path (with varying volatility across time) and simulate new paths based on the current
level of volatility.
However, we saw in section 6.5 and 6.6 that training VAEs and CVAEs on Heston paths was difficult. We noticed
that VAEs and CVAEs struggled to capture the positive correlation between absolute returns. We also observed that
the performance of the CVAEs decreased when conditioned on extreme instantaneous variances, compared to those
used for training. We saw that moment-regularization could improve performance quite effectively. However, the
VAEs and CVAEs still struggled compared to the Black-Scholes model. It is, however, not obvious how performance,
as measured in section 6, translates to hedge performance. This is what we wish to address in this section.
For this experiment, we assume that the underlying asset follows a Heston model with parameters similar to those used in section 6.6:

S(0) = 1
drift: µ = 0.05
ν(0) = 0.05
mean reversion: κ = 4
long-term variance: θ = 0.05
vol-of-vol: σ = 0.25
correlation: ρ = 0.
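For reference, Heston paths with parameters of this kind can be sampled with a full-truncation Euler scheme along the following lines; this is our own sketch (the thesis may use a different discretization), and the correlation `rho` is a placeholder set to zero for illustration:

```python
import numpy as np

def simulate_heston(m, n=20, T=1/12, S0=1.0, mu=0.05, v0=0.05,
                    kappa=4.0, theta=0.05, xi=0.25, rho=0.0, seed=0):
    """Sample m Heston paths on n steps with a full-truncation Euler scheme.

    dS = mu*S dt + sqrt(v)*S dW1,  dv = kappa*(theta - v) dt + xi*sqrt(v) dW2,
    corr(dW1, dW2) = rho.  rho = 0 here is purely illustrative.
    """
    rng = np.random.default_rng(seed)
    dt = T / n
    S = np.full(m, S0)
    v = np.full(m, v0)
    out = [S.copy()]
    for _ in range(n):
        z1 = rng.standard_normal(m)
        z2 = rho * z1 + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(m)
        vp = np.maximum(v, 0.0)                       # full truncation
        S = S * np.exp((mu - 0.5 * vp) * dt + np.sqrt(vp * dt) * z1)
        v = v + kappa * (theta - vp) * dt + xi * np.sqrt(vp * dt) * z2
        out.append(S.copy())
    return np.stack(out, axis=1)                      # shape (m, n + 1)

paths = simulate_heston(1000)
print(paths.shape)  # (1000, 21)
```

The full truncation (clipping the variance at zero inside the drift and diffusion terms) keeps the scheme well defined even when the Euler step pushes the variance negative.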
Recall from section 5.4 that there were different architectures and approaches to consider when creating a deep hedging model in the Heston model. We saw that providing the ANNs (representing each trading decision) with the current instantaneous variance slightly improved hedge performance. However, in our setup, we cannot create CVAEs that generate joint paths for the underlying asset and its variance process. We, therefore, choose to utilize deep hedging models without knowledge of the current instantaneous variance. Remember that the focus is to analyze the CVAEs' ability to generate paths conditioned on the current instantaneous variance.
                         CVAE Deep Hedge                                      Deep Hedge  Analytical
#CVAE                    250                1,000              4,981
training samples         (non-overlapping)  (non-overlapping)  (overlapping)
Avg. abs. PnL    Mean    0.0044             0.0044             0.0044         0.0049      0.0041
                 Std     0.00004            0.00005            0.00012        0.00002     0
PnL Std          Mean    0.0058             0.0057             0.0057         0.0061      0.0054
                 Std     0.00004            0.00003            0.00009        0.00002     0
CVaR_0.95        Mean    0.0137             0.0124             0.0121         0.0116      0.0127
                 Std     0.00050            0.00021            0.00025        0.00001     0
Turnover         Mean    2.0201             1.8562             1.8087         1.7199      1.9002
                 Std     0.05035            0.02108            0.02647        0.00340     0

Table 7.5: Results from a hedge experiment with a call option in a Heston model with ν(0) = 0.05 and without transaction costs. The price of the option is 0.02607. Each column represents the average results (including standard deviation) of 10 hedging models. In the first three columns, deep hedging models are trained with 2^18 samples from CVAEs that are themselves trained on a varying number of training paths (in the third column, all CVAEs are trained on overlapping paths). The fourth column represents deep hedging models trained on actual Heston paths, and the fifth column represents the delta hedging strategy. The results are derived from a hedge experiment with 100,000 independent paths from the actual Heston model.
                         CVAE Deep Hedge                                      Deep Hedge  Analytical
#CVAE                    250                1,000              4,981
training samples         (non-overlapping)  (non-overlapping)  (overlapping)
Avg. abs. PnL    Mean    0.0040             0.0041             0.0042         0.0040      0.0034
                 Std     0.00010            0.00004            0.00006        0.00001     0
PnL Std          Mean    0.0051             0.0051             0.0052         0.0051      0.0044
                 Std     0.00009            0.00004            0.00007        0.00001     0
CVaR_0.95        Mean    0.0105             0.0098             0.0097         0.0096      0.0105
                 Std     0.00019            0.00006            0.00005        0.00000     0
Turnover         Mean    1.8811             1.7146             1.6472         1.6903      1.8963
                 Std     0.07725            0.01259            0.01778        0.00385     0

Table 7.6: Similar to table 7.5, but with ν(0) = 0.0270. Option price: 0.02040.
For this experiment, we train 10 CVAEs on 250 non-overlapping Heston paths, 10 CVAEs on 1,000 non-overlapping Heston paths and 10 CVAEs on 4,981 overlapping Heston paths (corresponding to 250 non-overlapping paths). All CVAEs are trained with α = 0.9, β = 0 and γ = 1. Remember that β was for VAEs, not CVAEs. We then perform three hedge experiments with ν(0) in {0.0270, 0.05, 0.0765}, where 0.0270 and 0.0765 approximately correspond to the 10% and 90% quantiles from the training data. Note that these ν(0)s are the instantaneous variances utilized for the hedge experiments; when we create training paths for the CVAEs, we still use ν(0) = 0.05.
In each of the three experiments with different conditioned ν(0)s, we train one deep hedging model for each of the 10 CVAEs. For the deep hedging models, we utilize ANNs that have four layers with six units each. The deep hedging models are trained on 2^18 paths, which are conditioned on the chosen ν(0). We also train 10 deep hedging models on actual independent Heston paths with the chosen ν(0). Lastly, we also include a delta hedging model, which we refer to as the analytical approach. All hedging models are tested on 100,000 independent Heston paths with the chosen ν(0). For the tests, we set the initial portfolio values for all strategies equal to the option price, which varies with ν(0). The results can be seen in tables 7.5, 7.6 and 7.7, corresponding to ν(0) = 0.05, ν(0) = 0.0270 and ν(0) = 0.0765, respectively.
In table 7.5 (ν(0) = 0.05), we observe that the deep hedging models trained on CVAE paths perform reasonably well. In terms of CVaR, we notice that only the deep hedging models using CVAEs trained on 250 non-overlapping paths (with an average CVaR of 0.0137) did not outperform the delta hedging approach (with an average CVaR of 0.0127). The deep hedging models using CVAEs trained on 4,981 overlapping paths performed the best among the CVAE-based models, with an average CVaR of 0.0121. This is encouraging, since it shows that learning hedging strategies from data with varying volatility is possible, even though it is more challenging than with constant volatility. This is especially true considering that the CVAEs in section 6.6 failed to replicate the positive correlation between absolute returns. We are also pleased to see that CVAEs trained on overlapping paths, once again, performed well.

                         CVAE Deep Hedge                                      Deep Hedge  Analytical
#CVAE                    250                1,000              4,981
training samples         (non-overlapping)  (non-overlapping)  (overlapping)
Avg. abs. PnL    Mean    0.0055             0.0049             0.0048         0.0057      0.0048
                 Std     0.00019            0.00004            0.00003        0.00001     0
PnL Std          Mean    0.0073             0.0066             0.0065         0.0071      0.0063
                 Std     0.00024            0.00004            0.00002        0.00001     0
CVaR_0.95        Mean    0.0187             0.0157             0.0153         0.0135      0.0149
                 Std     0.00074            0.00019            0.00028        0.00001     0
Turnover         Mean    2.2806             1.9804             1.9125         1.7308      1.9023
                 Std     0.09590            0.01676            0.02586        0.00347     0

Table 7.7: Similar to table 7.5, but with ν(0) = 0.0765. Option price: 0.03134.
In tables 7.6 and 7.7 (ν(0) = 0.0270 and ν(0) = 0.0765), we have performed the hedge experiment with more challenging initial instantaneous variances, corresponding approximately to the 10% and 90% quantiles of the training data. As we saw in section 6.6, the CVAEs find it harder to generate Heston-like paths with values of ν(0) that are extreme compared to the training data. In table 7.6 (ν(0) = 0.0270), we observe that all deep hedging models using CVAEs perform as well as or better than the delta hedging approach. However, this is not true in table 7.7 (ν(0) = 0.0765), where all deep hedging models using CVAEs performed worse than the delta hedging approach. This suggests that it is harder for the CVAEs to generate useful paths (for training the deep hedging models) when conditioned on high instantaneous variances than on low instantaneous variances. This may not seem surprising, but we did not see it in section 6.6, where we analyzed the marginal distributions with the two-sample Kolmogorov-Smirnov test.
It is also interesting to observe that the turnover (and average absolute PnL) does not follow the same pattern across tables 7.5, 7.6 and 7.7. This might suggest that the CVAEs have a bias when conditioning on different instantaneous variances.
Overall, we are pleased with the results, as they show that it is possible to learn usable hedging strategies conditioned on the current level of volatility. Moreover, this further vindicates the use of overlapping paths for training CVAEs. It is also a relief to see that the deep hedging models could learn these hedging strategies even though the CVAEs struggled to capture dependency in returns. These results could, of course, vary substantially if the Heston model had more extreme parameters or if we tried hedging options that are more sensitive to dependency in the returns.
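The risk statistics reported in tables 7.5-7.7 can be computed directly from simulated hedge experiments. The sketch below shows one plausible set of definitions; the exact conventions (for example, whether the initial trade from a zero position counts toward turnover) are assumptions made here for illustration, not necessarily those used in our experiments.

```python
import numpy as np

# A sketch (under assumed definitions) of the statistics in tables 7.5-7.7,
# computed from the terminal PnL per path and the holdings process per path.

def hedge_statistics(pnl, positions, level=0.95):
    """pnl: (n_paths,) terminal hedge errors; positions: (n_paths, n_steps) holdings."""
    avg_abs_pnl = np.mean(np.abs(pnl))
    pnl_std = np.std(pnl)
    # Empirical CVaR: average of the worst (1 - level) fraction of losses -PnL
    losses = np.sort(-pnl)
    k = max(1, int(np.ceil((1 - level) * len(losses))))
    cvar = losses[-k:].mean()
    # Turnover: total absolute change in holdings, averaged over paths
    # (here the initial position is assumed to count as a trade from zero)
    padded = np.hstack([np.zeros((positions.shape[0], 1)), positions])
    turnover = np.abs(np.diff(padded, axis=1)).sum(axis=1).mean()
    return avg_abs_pnl, pnl_std, cvar, turnover
```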
8 Conclusion
This thesis shows that it is possible, using deep learning, to learn hedging strategies from a single long observed path of the underlying asset under the real-world measure P. Our approach is quite general and model-free, and it was obtained by combining ideas and techniques suggested in [1] and [2].
In section 3, we formulated a general hedging problem that could be represented as a risk minimization problem over
trading decisions represented by ANNs. Critically, we observed that the optimal trading strategy was independent of
the option price when utilizing a coherent risk measure. The only input required to solve the minimization problem
was information about the current and previous market states (and possibly previous trades) at each trading decision.
In practice, this amounts to a large set of simulated paths under the P-measure of the underlying asset(s).
In section 5, we performed several hedge experiments where deep hedging models learned optimal hedging
strategies for various options and models. We observed that our framework was able to learn complex trading
strategies. However, it was clear that architecture and available information played a crucial part in stability and
performance. When hedging down-and-out call options, we saw that the performance improved significantly if the
deep hedging models could observe the current minimum value of the underlying asset.
We also observed that learning trading strategies in a multi-asset Black-Scholes model had stability issues, especially when the models were tested on asset correlations different from those used during training. We proposed training the deep hedging model on paths simulated with dampened asset correlation, and we found that this technique did improve the stability of the deep hedging model.
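As a minimal sketch of the dampening idea (the shrinkage form below is an assumption for illustration, not necessarily the exact transformation used in our experiments), one can shrink the training correlation matrix toward the identity, which keeps it a valid correlation matrix:

```python
import numpy as np

# Hypothetical dampening of asset correlations before simulating training
# paths: a convex combination of the correlation matrix and the identity.
# This preserves the unit diagonal and positive semidefiniteness.

def dampen_correlation(corr, damp=0.5):
    """Convex combination of corr and the identity; damp=0 returns corr unchanged."""
    corr = np.asarray(corr, dtype=float)
    return (1.0 - damp) * corr + damp * np.eye(corr.shape[0])

corr = np.array([[1.0, 0.8, 0.3],
                 [0.8, 1.0, 0.5],
                 [0.3, 0.5, 1.0]])
damped = dampen_correlation(corr, damp=0.5)
# Off-diagonal entries are halved; eigenvalues remain nonnegative.
```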
At that point, we had developed a framework that could learn optimal trading strategies given enough training samples. The goal was then to create a framework that could generate an arbitrarily large training set of paths from an underlying asset based on only a modest set of observed paths.
Section 6 introduced the theory behind VAEs and CVAEs and how one could utilize these models to create generative models capable of generating return paths with distributional characteristics similar to those of an observed set of paths (or one long single path) from the underlying asset. We saw that training VAEs and CVAEs amounted to minimizing a reconstruction loss between generated paths and training paths plus a regularization term, which was crucial for the generative qualities of the model after training.
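The loss described above can be sketched numerically. For a diagonal Gaussian encoder and a standard normal prior, the KL regularization term has the well-known closed form below (see, e.g., [16]); the squared-error reconstruction term and the weight beta are illustrative choices, not the exact ones used in our experiments.

```python
import numpy as np

# A minimal numerical sketch of the VAE objective: reconstruction loss plus
# KL regularization. For q(z|x) = N(mu, diag(exp(log_var))) and prior N(0, I),
# the KL term has the standard closed form.

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """Per-sample squared-error reconstruction plus beta-weighted KL term."""
    recon = np.sum((x - x_recon)**2, axis=-1)
    return recon + beta * kl_to_standard_normal(mu, log_var)
```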
We then experimented with the VAEs and CVAEs abilities to generate paths, consisting of 20 daily returns, from
a Black-Scholes and Heston model. We saw that the VAEs could produce paths with marginal distributions and
correlations that aligned with those from a Black-Scholes model, even when trained on only 100 paths. However, we
observed that the VAEs struggled to capture the positive correlation between absolute returns in a Heston model. We
proposed
a series of moment regularizing terms that guide the VAEs to generate paths with moments and correlations
that align with the training paths. The VAEs were, with regularization, partially able to capture correlations in the
Heston model.
We also experimented with the CVAEs’ ability to generate paths with distributional properties of a Heston model,
conditioned on the current instantaneous variance. It was evident that the CVAE struggled with this challenge. Again,
we
proposed
two regularization terms to guide the CVAEs toward the right conditional moments of the training
paths. The proposed regularization had a significant positive effect on the marginal distributions. However, we failed
to capture the positive correlation structure of absolute returns from the Heston model when conditioning on the
current instantaneous variance.
In section 7, we combined the two frameworks (VAEs/CVAEs and deep hedging models) to create a single
framework capable of learning efficient hedging strategies only by observing a modest number of paths (potentially
coming from a single long realized path).
In a Black-Scholes model, we observed that the framework worked as intended and could learn hedging strategies
for both ordinary and down-and-out call options based on a modest number of paths. As expected, the performance
increased with the number of training paths. However, we also found that using overlapping training paths could
increase performance significantly. We also analyzed the stability of the framework; the analysis suggested that the VAEs were the greatest source of variation. This did not surprise us, as training VAEs involves random simulation inside each gradient step. Consequently, improvements to VAE training seemed like the most straightforward way of improving the framework.
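The construction of overlapping versus non-overlapping training paths from one long return series can be sketched as follows; note that a series of 5,000 daily returns yields exactly the path counts (250 non-overlapping, 4,981 overlapping) appearing in our Heston experiments.

```python
import numpy as np

# A sketch of how a single long return path is cut into training paths of 20
# daily returns. Non-overlapping windows give floor(n/20) paths, while
# overlapping windows (shifting one day at a time) give n - 20 + 1 paths,
# roughly 20 times as many, at the cost of strongly dependent samples.

def non_overlapping_paths(returns, length=20):
    n = len(returns) // length
    return returns[:n * length].reshape(n, length)

def overlapping_paths(returns, length=20):
    return np.lib.stride_tricks.sliding_window_view(returns, length)

returns = np.random.default_rng(0).standard_normal(5_000)
print(non_overlapping_paths(returns).shape)  # (250, 20)
print(overlapping_paths(returns).shape)      # (4981, 20)
```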
Finally, we tested the framework in a Heston model. Our goal was to train CVAEs on Heston paths conditioned on
instantaneous variance and then create hedging strategies based on some fixed level of initial volatility (instantaneous
variance). We found that the framework worked as intended. We were able to find decent hedging strategies for a call
option based on different initial instantaneous variances, even though we failed to capture the positive correlation of
absolute returns with the CVAEs. However, we did observe that the model struggled to find hedging strategies that outperformed simple delta hedging when the initial instantaneous variance was large. This was not reflected in the analysis from section 6, which shows that a good fit of marginal distributions and correlations does not translate directly into hedge performance.
Overall, we are impressed with the models and frameworks used in this thesis. We have shown that it is possible to create hedging strategies from a single observed path of an underlying asset without any knowledge of Q-measures and Greeks. The framework is intended to be general and model-free. However, we have seen that the VAE/CVAE models may require substantial regularization and that the deep hedging models are particularly sensitive to the architecture and processing of inputs. One may, therefore, not need any knowledge about exact Greeks, but understanding price dynamics and options is vital to utilizing market generators and deep hedging models effectively. It would therefore still be sensible to favour classic methods. However, this framework might be helpful as a co-pilot to classic techniques, as an alternative method with few shared assumptions.
A Appendix
A.1 Outperforming delta hedging using MSE (in theory)
From the experiments in section 5.1.1, we saw that the ANN model using MSE learned a strategy that was remarkably close to that of the analytical delta hedging approach assuming no transaction costs. However, we would expect that a trading strategy optimized for trading in discrete time would outperform delta hedging, which is meant for continuous trading.
In this short section, we wish to show analytically that strategies optimizing MSE in discrete time are different from delta hedging. To do this, we imagine being in the simplest setting possible: we are only allowed to trade at $t_0 = 0$, and we evaluate the MSE at time $t_1 = T$. The goal is to trade a single asset $S$ to minimize the MSE between the portfolio value $PF_1$ and the option payoff $g(S_1)$. We also assume that we start with portfolio value $p_0$ and that there are no transaction costs.
If we choose to hold $\delta$ units of $S$ at time $t_0 = 0$, then the portfolio value $PF_1$ is
$$PF_1 = S_1 \delta + (p_0 - \delta S_0) e^{rT} = \delta (S_1 - S_0 e^{rT}) + p_0 e^{rT}.$$
To formalize the analysis, we wish to find $\delta$ to minimize
$$\mathbb{E}^P_0\left[(g(S_1) - PF_1)^2\right].$$
Using the representation of $PF_1$ and applying the first-order condition yields
$$0 = \frac{\partial}{\partial \delta}\, \mathbb{E}^P_0\left[(g(S_1) - PF_1)^2\right]
= -2\, \mathbb{E}^P_0\left[(g(S_1) - PF_1)(S_1 - S_0 e^{rT})\right]
= -2\, \mathbb{E}^P_0\left[\left(g(S_1) - \delta (S_1 - S_0 e^{rT}) - p_0 e^{rT}\right)(S_1 - S_0 e^{rT})\right]
= -2\, \mathbb{E}^P_0\left[-\delta (S_1 - S_0 e^{rT})^2 + (g(S_1) - p_0 e^{rT})(S_1 - S_0 e^{rT})\right].$$
Solving for $\delta$ yields
$$\delta = \frac{\mathbb{E}^P_0\left[(g(S_1) - p_0 e^{rT})(S_1 - S_0 e^{rT})\right]}{\mathbb{E}^P_0\left[(S_1 - S_0 e^{rT})^2\right]}.$$
Notice that if $P = Q$ and $p_0 = e^{-rT}\, \mathbb{E}^Q_0[g(S_1)]$, then
$$\delta = \frac{\operatorname{cov}^Q_0(g(S_1), S_1)}{\mathbb{V}^Q_0[S_1]}.$$
As a sanity check, we see that if $g(S_1) = S_1$ and $p_0 = S_0$, then $\delta = 1$, which we expect since we can perfectly replicate the option by holding 1 unit of $S$. However, the interesting observation is that $\delta$ does not equal $\Delta = \frac{\partial}{\partial s}\, e^{-rT}\, \mathbb{E}^Q[g(S_1) \mid S_0 = s]$, among other things because $\delta$ depends on both $P$ and $p_0$ (which might not necessarily be the price). Of course, we know that if we can trade continuously, then trading with the option delta will be optimal. Nevertheless, when trading in discrete time (and in this case only trading once), it is possible to choose a trading strategy that outperforms delta hedging.
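The result above is easy to check numerically. The sketch below (a Black-Scholes example with illustrative parameters, taking P = Q and p0 equal to the discounted expected payoff) estimates the one-period MSE-optimal ratio by Monte Carlo and compares it with the Black-Scholes delta:

```python
import math
import numpy as np

# A numerical check of the derivation above: with P = Q and p0 the discounted
# expected payoff, the one-period MSE-optimal hedge ratio is
# delta* = Cov_Q(g(S1), S1) / Var_Q(S1), which differs from the
# Black-Scholes delta N(d1). Parameters are illustrative.

S0, K, r, sigma, T = 100.0, 100.0, 0.02, 0.2, 1.0
rng = np.random.default_rng(0)
Z = rng.standard_normal(1_000_000)

# Simulate S1 under Q (risk-neutral drift r)
S1 = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * math.sqrt(T) * Z)
payoff = np.maximum(S1 - K, 0.0)

# One-period MSE-optimal hedge ratio from the derivation above
delta_star = np.cov(payoff, S1)[0, 1] / S1.var()

# Continuous-time Black-Scholes delta for comparison
d1 = (math.log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
bs_delta = 0.5 * (1.0 + math.erf(d1 / math.sqrt(2.0)))

print(delta_star, bs_delta)  # roughly 0.62 vs 0.58: the two strategies differ
```

With more frequent trading the gap shrinks; with a single trading date, as here, the difference is clearly visible.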
Our experiments find that the ANN model using MSE dominates when there are few hedge points, as it can factor this into its minimization. However, in experiments with a fair number of hedge points, we have not found that the ANN model using MSE significantly outperforms delta hedging, even when the ANN model also has the current portfolio value as input.
A.2 Price and delta of a call option in the Heston model
In this thesis, when pricing call options in the Heston model, we utilize a simple (and relatively stable) semi-closed formulation of the price referred to as the Lipton-Lewis reformulation (see [15]):
$$C_H(S(t), \nu(t), t) = S(t) - \frac{K e^{-r(T-t)}}{2\pi} \int_{-\infty}^{\infty} \frac{\exp\left\{\left(ik + \tfrac{1}{2}\right) X + \alpha - \left(k^2 + \tfrac{1}{4}\right) \beta\, \nu(t)\right\}}{k^2 + \tfrac{1}{4}}\, dk,$$
where
$$X = \ln \frac{S(t)}{K} + r(T-t),$$
$$\alpha = -\frac{\kappa \theta}{\sigma^2} \left( \psi_+ (T-t) + 2 \ln \frac{\psi_- + \psi_+ e^{-\zeta (T-t)}}{2 \zeta} \right),$$
$$\beta = \frac{1 - e^{-\zeta (T-t)}}{\psi_- + \psi_+ e^{-\zeta (T-t)}},$$
and where
$$\psi_\pm = \mp (ik\rho\sigma + \hat{\kappa}) + \zeta, \qquad
\zeta = \sqrt{k^2 \sigma^2 (1 - \rho^2) + 2ik\sigma\rho\hat{\kappa} + \hat{\kappa}^2 + \sigma^2/4}, \qquad
\hat{\kappa} = \kappa - \frac{\rho\sigma}{2}.$$
Note that all parameters referenced are the parameters under the $Q$-measure. From this, we can also easily derive a semi-closed formula for the option delta:
$$\Delta_H = \frac{\partial C_H(S(t), \nu(t), t)}{\partial S(t)} = 1 - \frac{K e^{-r(T-t)}}{2\pi} \int_{-\infty}^{\infty} \frac{\left(ik + \tfrac{1}{2}\right) \exp\left\{\left(ik + \tfrac{1}{2}\right) X + \alpha - \left(k^2 + \tfrac{1}{4}\right) \beta\, \nu(t)\right\}}{S(t) \left(k^2 + \tfrac{1}{4}\right)}\, dk.$$
References
[1] Buehler, H., Gonon, L., Teichmann, J., & Wood, B. (2019). Deep hedging. Quantitative Finance, 19(8), 1271-1291.
[2] Buehler, H., Horvath, B., Lyons, T., Perez Arribas, I., & Wood, B. (2020). A data-driven market simulator for small data environments. Available at SSRN 3632431.
[3] McNeil, A. J., Frey, R., & Embrechts, P. (2015). Quantitative risk management: concepts, techniques and tools - revised edition. Princeton University Press.
[4] Föllmer, H., & Schied, A. (2016). Stochastic finance: an introduction in discrete time. Walter de Gruyter.
[5] Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[6] Pedersen, T. C., & Frandsen, M. G. (2020). Applications of Deep Learning in Option Pricing and Calibration.
[7] Leshno, M., Lin, V. Y., Pinkus, A., & Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6), 861-867.
[8] Telgarsky, M. (2015). Representation benefits of deep feedforward networks. arXiv preprint arXiv:1509.08101.
[9] Björk, T. (2009). Arbitrage theory in continuous time. Oxford University Press.
[10] Poulsen, R. (2018). Fundamental Views. Wilmott, 2018(97), 44-47.
[11] Higham, N. J. (2002). Computing the nearest correlation matrix - a problem from finance. IMA Journal of Numerical Analysis, 22(3), 329-343.
[12] Yin, C., Perchet, R., & Soupé, F. (2021). A practical guide to robust portfolio optimization. Quantitative Finance, 1-18.
[13] Savine, A. (2018). Modern computational finance: AAD and parallel simulations. John Wiley & Sons.
[14] Lord, R., Koekkoek, R., & Dijk, D. V. (2010). A comparison of biased simulation schemes for stochastic volatility models. Quantitative Finance, 10(2), 177-194.
[15] Lipton, A. (2002). The vol smile problem. Risk, February 2002.
[16] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[17] Duchi, J. (2007). Derivations for linear algebra and optimization. Berkeley, California, 3(1), 2325-5870.