
Limit theorems for the site frequency spectrum of neutral mutations in an exponentially growing population


Abstract

The site frequency spectrum (SFS) is a widely used summary statistic of genomic data, offering a simple means of inferring the evolutionary history of a population. Motivated by recent evidence for the role of neutral evolution in cancer, we examine the SFS of neutral mutations in an exponentially growing population. Whereas recent work has focused on the mean behavior of the SFS in this scenario, here, we investigate the first-order asymptotics of the underlying stochastic process. Using branching process techniques, we show that the SFS of a Galton-Watson process evaluated at a fixed time converges almost surely to a random limit. We also show that the SFS evaluated at the stochastic time at which the population first reaches a certain size converges in probability to a constant. Finally, we illustrate how our results can be used to construct consistent estimators for the extinction probability and the effective mutation rate of a birth-death process.
arXiv:2307.03346v1 [math.PR] 7 Jul 2023
Einar Bjarki Gunnarsson$^1$, Kevin Leder$^2$, Xuanming Zhang$^2$
$^1$ School of Mathematics, University of Minnesota, Twin Cities, MN 55455, USA.
$^2$ Department of Industrial and Systems Engineering, University of Minnesota, Twin Cities, MN 55455, USA.
Keywords: Site frequency spectrum; Neutral evolution; Infinite sites model; Branching processes; Convergence of stochastic processes.
MSC2020 Classification: 60J85, 60F15, 92D25, 92B05.
1 Introduction
The site frequency spectrum (SFS) is a popular summary statistic of genomic data, record-
ing the frequencies of mutations within a given population or population sample. For the
case of a large constant-sized population and selectively neutral mutations, the SFS has
given rise to several estimators of the rate of mutation accumulation within the popu-
lation, and these estimators have formed the basis of many statistical tests of neutral
evolution vs. evolution under selection [1, 2]. In this way, the SFS has provided a simple
means of understanding the rate and mode of evolution in a population using genomic
data.
Motivated by the uncontrolled growth of cancer cell populations, and the mounting
evidence for the role of neutral evolution in cancer [3, 4, 5, 6, 7], several authors have
recently studied the SFS of neutral mutations in an exponentially growing population.
Durrett [8, 9] considered a supercritical birth-death process, in which cells live for an
exponentially distributed time and then divide or die. He showed that in the large-time
limit, the expected number of mutations found at a frequency $\geq f$ amongst cells
with infinite lineage follows a $1/f$ power law with $0 < f < 1$. Similar results were
obtained by Bozic et al. [10] and in a deterministic setting by Williams et al. [5]. In the
aforementioned work, Durrett also derived an approximation for the expected SFS of a
small random sample taken from the population [8, 9]. Further small sample results have
been derived using both branching process and coalescence techniques and they have been
compared with Durrett’s result in [11, 12]. In [13], we derived exact expressions for the
SFS of neutral mutations in a supercritical birth-death process, both for cells with infinite
lineage and for the total cell population, evaluated either at a fixed time (fixed-time SFS)
or at the stochastic time at which the population first reaches a given size (fixed-size SFS).
More recently, the effect of selective mutations on the expected SFS has been investigated
by Tung and Durrett [14] and Bonnet and Leman [15]. The latter work considers the
setting of a drug-sensitive tumor which decays exponentially under treatment, with cells
randomly acquiring resistance which enables them to grow exponentially under treatment.
Whereas the aforementioned works have focused on the mean behavior of the SFS,
here, we are interested in the asymptotic behavior of the underlying stochastic process.
Using the framework of coalescent point processes, Lambert [16] derived a strong law
of large numbers for the SFS of neutral mutations in a population sample, where the
sample is ranked in such a way that coalescence times between consecutive individuals
are i.i.d. Later works by Lambert [17], Johnston [18] and Harris et al. [19] characterized
the joint distribution of coalescence times for a uniformly drawn sample from a continuous-
time Galton-Watson process. Building on these works, Johnson et al. [20] derived limit
distributions for the total lengths of internal and external branches in the genealogical tree
of a birth-death process. Schweinsberg and Shuai [21] extended this analysis to branches
supporting exactly $k$ leaves, which under a constant mutation rate characterizes the SFS of
a uniformly drawn sample. For a supercritical birth-death process, the authors established
both a weak law of large numbers and the asymptotic normality of branch lengths in the
limit of a large sample, assuming that the sample is sufficiently small compared to the
expected population size at the sampling time.
In this work, instead of considering a sample from the population using coalescence
techniques, we will investigate the first-order asymptotics for the SFS of the total pop-
ulation using branching process techniques. We establish results both for the fixed-time
and fixed-size SFS under the infinite sites model of mutation, where each new mutation is
assumed to be unique [22]. Cheek and Antal recently studied a finite sites model in [23]
(see also [24]), where each genetic site is allowed to mutate back and forth between the
four nucleotides A, C, G, T. With the understanding that a site is mutated if its nucleotide
differs from the nucleotide of the initial individual, the authors investigated the SFS of
a birth-death process stopped at a certain size, both for mutations observed in a certain
number and in a certain fraction of individuals. They used a limiting regime where the
population size is sent to infinity, mutation rate is sent to 0, and the number of genetic
sites is sent to infinity. In contrast, we will assume a constant mutation rate under the
infinite sites model (with no back mutations), and send either the fixed time or the fixed
size at which the population is observed to infinity.
Our results are derived for a supercritical Galton-Watson process in continuous time, where each individual acquires neutral mutations at a constant rate $\nu > 0$. Let $Z_0(t)$ denote the size of the population at time $t$, $\lambda > 0$ denote the net growth rate of the population, $\tau_N$ denote the time at which the population first reaches size $N$, and $S_j(t)$ denote the number of mutations found in $j \geq 1$ individuals at time $t$. Our main result, Theorem 1, characterizes the first-order behavior of $e^{-\lambda t} S_j(t)$ as $t \to \infty$ (fixed-time result) and $N^{-1} S_j(\tau_N)$ as $N \to \infty$ (fixed-size result). To prove the fixed-time result, the key idea is to decompose $(S_j(t))_{t \geq 0}$ into a difference of two increasing processes $(S_{j,+}(t))_{t \geq 0}$ and $(S_{j,-}(t))_{t \geq 0}$. These processes count the total number of instances that a mutation reaches and leaves frequency $j$, respectively, up until time $t$. Using the limiting behavior of $Z_0(t)$ as $t \to \infty$, we construct large-time approximations for the two processes $(S_{j,+}(t))_{t \geq 0}$ and $(S_{j,-}(t))_{t \geq 0}$. We then establish exponential $L^1$ error bounds on these approximations, which imply convergence in probability. Finally, by adapting an argument of Harris (Theorem 21.1 of [25]), we use the exponential error bounds and the fact that $(S_{j,+}(t))_{t \geq 0}$ and $(S_{j,-}(t))_{t \geq 0}$ are increasing processes to show that $e^{-\lambda t} S_{j,+}(t)$ and $e^{-\lambda t} S_{j,-}(t)$ converge almost surely to their approximations. This in turn gives almost sure convergence for $e^{-\lambda t} S_j(t)$ as $t \to \infty$. The fixed-size result is obtained by combining the fixed-time result with an approximation result for $\tau_N$, given by Proposition 1. Since we are only able to establish the approximation for $\tau_N$ in probability, the result for $N^{-1} S_j(\tau_N)$ as $N \to \infty$ is given in probability. Finally, we establish analogous fixed-time and fixed-size convergence results for $M(t) = \sum_{j=1}^{\infty} S_j(t)$, the total number of mutations present at time $t$, in Proposition 2. All results are given conditional on nonextinction of the population.
The rest of the paper is organized as follows. Section 2 introduces our branching pro-
cess model and establishes the relevant notation. Section 3 presents our results, including
explicit expressions for the birth-death process. Section 4 outlines the proof of the main
result, Theorem 1. Section 5 constructs consistent estimators for the extinction proba-
bility and effective mutation rate of the birth-death process. Finally, the proofs of the
remaining results can be found in Section 6.
2 Model
2.1 Branching process model with neutral mutations
We consider a Galton-Watson branching process $(Z_0(t))_{t \geq 0}$, started with a single individual at time 0, $Z_0(0) = 1$, where the lifetimes of individuals are exponentially distributed with mean $1/a > 0$. At the end of an individual’s lifetime, it produces offspring according to the distribution $(u_k)_{k \geq 0}$, where $u_k$ is the probability that $k$ offspring are produced. We define $m := \sum_{k=0}^{\infty} k u_k$ as the mean number of offspring per death event and assume that the offspring distribution has a finite third moment, $\sum_{k=0}^{\infty} k^3 u_k < \infty$. Each individual, over its lifetime, accumulates neutral mutations at (exponential) rate $\nu > 0$. We assume the infinite sites model of mutation, where each new mutation is assumed to be unique.
Throughout, we consider the case $m > 1$ of a supercritical process. The net growth rate of the population is then $\lambda = a(m-1) > 0$, with $E[Z_0(t)] = e^{\lambda t}$ for $t \geq 0$.
We will be primarily interested in analyzing the process conditional on long-term survival of the population. We define the event of nonextinction of the population as
\[ \Omega_\infty := \{Z_0(t) > 0 \text{ for all } t > 0\}. \]
We also define the probability of eventual extinction as
\[ p := P(\Omega_\infty^c) = P(Z_0(t) = 0 \text{ for some } t > 0), \tag{1} \]
and the corresponding survival probability as $q := P(\Omega_\infty)$. For $N \geq 1$, we define $\tau_N$ as the time at which the population first reaches size $N$,
\[ \tau_N := \inf\{t \geq 0 : Z_0(t) \geq N\}, \tag{2} \]
with the convention that $\inf \emptyset = \infty$. Note that on $\Omega_\infty$, $\tau_N < \infty$ almost surely. Also note that if $u_k > 0$ for some $k > 2$, it is possible that $Z_0(\tau_N) > N$. We finally define
\[ p_{i,j}(t) := P(Z_0(t) = j \mid Z_0(0) = i) \]
as the probability of transitioning from $i$ to $j$ individuals in $t$ time units. For the baseline case $Z_0(0) = 1$, we simplify the notation to $p_j(t) := p_{1,j}(t)$.
2.2 Special case: Birth-death process
An important special case is that of the birth-death process, where $u_2 > u_0 \geq 0$ and $u_0 + u_2 = 1$. In this process, an individual at the end of its lifetime either dies without producing offspring or produces two offspring. At each death event, the population therefore either reduces or increases in size by one individual. The birth-death process is relevant, for example, to the population dynamics of cancer cell populations (tumors) and bacteria. In this case, the probability of eventual extinction can be computed explicitly as $p = u_0/u_2$ and the survival probability as $q = 1 - u_0/u_2$ [9]. Furthermore, the probability mass function $j \mapsto p_j(t)$ has an explicit expression for each $t \geq 0$, which is given by expression (64) in Section 6.8. This will enable us to derive explicit limits for the site frequency spectrum of the birth-death process, see Corollary 1 in Section 3.2.
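As a small illustration of these formulas, the following Python sketch maps the model parameters of a birth-death process to the derived quantities used throughout. The function name `birth_death_parameters` and its argument names are hypothetical and do not appear in the paper; the sketch only restates $m = 2u_2$, $\lambda = a(m-1)$, $p = u_0/u_2$ and $q = 1 - p$.

```python
def birth_death_parameters(a, u0):
    """Return (lam, p, q) for a birth-death process.

    a  : rate of the exponential lifetime (mean lifetime is 1/a)
    u0 : probability that a death event produces no offspring
    Assumes u2 = 1 - u0 and the supercritical condition u2 > u0 >= 0.
    """
    u2 = 1.0 - u0
    if not (0.0 <= u0 < u2):
        raise ValueError("need u2 > u0 >= 0 (supercritical birth-death process)")
    m = 2.0 * u2            # mean number of offspring per death event
    lam = a * (m - 1.0)     # net growth rate lambda = a(m - 1)
    p = u0 / u2             # extinction probability
    q = 1.0 - p             # survival probability
    return lam, p, q
```

For example, `birth_death_parameters(1.0, 0.25)` corresponds to $u_2 = 0.75$, a net growth rate of $0.5$ and an extinction probability of $1/3$.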
2.3 Asymptotic behavior
We note that $(e^{-\lambda t} Z_0(t))_{t \geq 0}$ is a nonnegative martingale with respect to the natural filtration $\mathcal{F}_t := \sigma(Z_0(s); s \leq t)$. Thus, there exists a random variable $Y$ such that $e^{-\lambda t} Z_0(t) \to Y$ almost surely as $t \to \infty$. By Theorem 2 in Section III.7 of [26],
\[ Y \overset{D}{=} p\delta_0 + q\xi, \tag{3} \]
where $p$ and $q$ are the extinction and survival probabilities of the population, respectively, $\delta_0$ is a point mass at 0, and $\xi$ is a random variable on $(0,\infty)$ with a strictly positive continuous density function and mean $1/q$. Since the offspring distribution has a finite second moment (implied by the finite third moment assumed in Section 2.1), we know that $E[(Z_0(t))^2] = O(e^{2\lambda t})$ by Chapter III.4 of [27] or Lemma 5 of [28], hence $(e^{-\lambda t} Z_0(t))_{t \geq 0}$ is uniformly integrable and $E[Y \mid \mathcal{F}_t] = e^{-\lambda t} Z_0(t)$.
Based on the large-time approximation $Z_0(t) \approx Y e^{\lambda t}$, for $N \geq 1$, we define an approximation to the hitting time $\tau_N$ defined in (2) as follows:
\[ t_N := \inf\{t \geq 0 : Y e^{\lambda t} = N\}. \tag{4} \]
In Proposition 1, we show that conditional on $\Omega_\infty$, $\tau_N - t_N \to 0$ in probability as $N \to \infty$.
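As a worked step (not spelled out in the text), the definition (4) can be solved in closed form. On the event of nonextinction, $Y > 0$, and $t \mapsto Y e^{\lambda t}$ is continuous and strictly increasing, so whenever $Y \leq N$,
\[ t_N = \frac{1}{\lambda} \log\left(\frac{N}{Y}\right), \qquad \text{equivalently} \qquad N e^{-\lambda t_N} = Y. \]
If $Y > N$, the set in (4) is empty and $t_N = \infty$ by the convention $\inf \emptyset = \infty$. Since $Y$ is almost surely finite, this case does not occur once $N$ is large enough.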
2.4 Site frequency spectrum
In the model, each individual accumulates neutral mutations at rate $\nu > 0$. For $t > 0$, enumerate the mutations that occur up until time $t$ as $1, \ldots, N_t$, and define $\mathcal{M}_t := \{1, \ldots, N_t\}$ as the set of mutations generated up until time $t$. For $i \in \mathcal{M}_t$ and $s \leq t$, let $C_i(s)$ denote the number of individuals at time $s$ that carry mutation $i$, with $C_i(s) = 0$ before mutation $i$ occurs. The number of mutations present in $j$ individuals at time $t$ is then given by
\[ S_j(t) := \sum_{i \in \mathcal{M}_t} 1_{\{C_i(t) = j\}}. \]
The vector $(S_j(t))_{j \geq 1}$ is the site frequency spectrum (SFS) of the neutral mutations at time $t$. We also define the total number of mutations present at time $t$ as
\[ M(t) := \sum_{j=1}^{\infty} S_j(t). \]
The goal of this paper is to establish first-order limit theorems for $S_j(t)$ and $M(t)$, evaluated either at the fixed time $t$ as $t \to \infty$ or at the random time $\tau_N$ as $N \to \infty$.
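To make the definitions concrete, here is a minimal Python sketch that computes the SFS and the total mutation count from a list of clone sizes $C_i(t)$. The function name `site_frequency_spectrum` and the input format (one integer per mutation generated so far) are assumptions made for illustration only.

```python
from collections import Counter

def site_frequency_spectrum(clone_sizes):
    """Compute (S, M) from clone sizes C_i(t), one entry per mutation in M_t.

    S maps j >= 1 to S_j(t), the number of mutations carried by exactly j
    individuals. Mutations with clone size 0 (extinct clones) are not counted.
    M is the total number of mutations present, M(t) = sum_j S_j(t).
    """
    counts = Counter(c for c in clone_sizes if c >= 1)
    total = sum(counts.values())
    return dict(counts), total
```

For instance, clone sizes `[1, 1, 3, 0, 2, 1]` give $S_1 = 3$, $S_2 = 1$, $S_3 = 1$ and $M = 5$, since the extinct clone is excluded.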
3 Results
3.1 General case
Our main result, Theorem 1, provides large-time and large-size first-order asymptotics for
the SFS conditional on nonextinction. For the fixed-time SFS, we establish almost sure
convergence, while for the fixed-size SFS, we establish convergence in probability. A proof
sketch is given in Section 4 and the proof details are carried out in Sections 6.1 to 6.5.
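Theorem 1. (1) Conditional on $\Omega_\infty$,
\[ \lim_{t \to \infty} e^{-\lambda t} S_j(t) = \nu Y \int_0^\infty e^{-\lambda s} p_j(s)\,ds, \quad j \geq 1, \tag{5} \]
almost surely. Equivalently, with $r_N := (1/\lambda) \log(qN)$, $X := qY$ and $E[X \mid \Omega_\infty] = 1$,
\[ \lim_{N \to \infty} N^{-1} S_j(r_N) = \nu X \int_0^\infty e^{-\lambda s} p_j(s)\,ds, \quad j \geq 1, \tag{6} \]
almost surely.
(2) Conditional on $\Omega_\infty$,
\[ \lim_{N \to \infty} N^{-1} S_j(\tau_N) = \nu \int_0^\infty e^{-\lambda s} p_j(s)\,ds, \quad j \geq 1, \tag{7} \]
in probability.
Proof. Section 4 and Sections 6.1 to 6.5.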
The main difference between the fixed-time result (5) and the fixed-size result (7) is that the limit in (5) is a random variable while it is constant in (7). The reason is that the population size at a large, fixed time $t$ is dependent on the limiting random variable $Y$ in $e^{-\lambda t} Z_0(t) \to Y$, while the population size at time $\tau_N$ is always approximately $N$. In expression (6), the fixed-time result is viewed at the time $r_N$ defined so that
\[ \lim_{N \to \infty} N^{-1} E[Z_0(r_N) \mid \Omega_\infty] = 1. \]
The point is to show that when the result in (5) is viewed at a fixed time comparable to $\tau_N$, the mean of the limiting random variable becomes equal to the fixed-size limit in (7).
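The equivalence of (5) and (6) can be checked in one line; the following worked step is a sketch added for the reader and is not part of the original argument. Since $e^{\lambda r_N} = qN$, we have $N^{-1} = q e^{-\lambda r_N}$, and $r_N \to \infty$ as $N \to \infty$, so by (5),
\[ N^{-1} S_j(r_N) = q\, e^{-\lambda r_N} S_j(r_N) \longrightarrow q \nu Y \int_0^\infty e^{-\lambda s} p_j(s)\,ds = \nu X \int_0^\infty e^{-\lambda s} p_j(s)\,ds. \]
Moreover, by (3), $Y$ is distributed as $\xi$ conditional on $\Omega_\infty$, so $E[Y \mid \Omega_\infty] = 1/q$ and $E[X \mid \Omega_\infty] = 1$. Taking the conditional mean of the limit in (6) therefore recovers the constant limit in (7).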
To establish the fixed-size result (7), we prove a secondary approximation result for the hitting time $\tau_N$ defined in (2). The result, stated as Proposition 1, shows that conditional on $\Omega_\infty$, $\tau_N$ is equal to the approximation $t_N$ defined in (4) up to an $O(1)$ error. The proof involves relatively simple calculations, given in Section 6.6.
Proposition 1. For any $\varepsilon > 0$,
\[ \lim_{N \to \infty} P(|\tau_N - t_N| > \varepsilon \mid \Omega_\infty) = 0. \tag{8} \]
Proof. Section 6.6.
The proof of the fixed-size result (7) combines the fixed-time result (5) with Proposition
1, as is discussed in Section 4.5. Since we are only able to establish the approximation for
$\tau_N$ in probability, the fixed-size result (7) is given in probability. An almost sure version
of Proposition 1 would immediately imply an almost sure version of (7).
Finally, a simpler version of the argument used to prove Theorem 1 can be used to
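prove analogous limit theorems for the total number of mutations at time $t$, $M(t)$.
Proposition 2. (1) Conditional on $\Omega_\infty$,
\[ \lim_{t \to \infty} e^{-\lambda t} M(t) = \nu Y \int_0^\infty e^{-\lambda s} (1 - p_0(s))\,ds, \tag{9} \]
almost surely.
(2) Conditional on $\Omega_\infty$,
\[ \lim_{N \to \infty} N^{-1} M(\tau_N) = \nu \int_0^\infty e^{-\lambda s} (1 - p_0(s))\,ds, \tag{10} \]
in probability.
Proof. Section 6.7.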
By combining the results of Theorem 1 and Proposition 2, we obtain the following limits for the proportion of mutations found in $j \geq 1$ individuals:
\[ \lim_{t \to \infty} \frac{S_j(t)}{M(t)} = \lim_{N \to \infty} \frac{S_j(\tau_N)}{M(\tau_N)} = \frac{\int_0^\infty e^{-\lambda s} p_j(s)\,ds}{\int_0^\infty e^{-\lambda s} (1 - p_0(s))\,ds}, \quad j \geq 1, \tag{11} \]
where the fixed-time limit applies almost surely and the fixed-size limit in probability. In the application Section 5, we will also be interested in the proportion of mutations found in $j \geq 1$ individuals out of all mutations found in $\geq j$ individuals. If we define
\[ M_j(t) := \sum_{k \geq j} S_k(t), \quad j \geq 1,\ t \geq 0, \]
as the total number of mutations found in $\geq j$ individuals, this proportion is given by
\[ \lim_{t \to \infty} \frac{S_j(t)}{M_j(t)} = \lim_{N \to \infty} \frac{S_j(\tau_N)}{M_j(\tau_N)} = \frac{\int_0^\infty e^{-\lambda s} p_j(s)\,ds}{\int_0^\infty e^{-\lambda s} \sum_{k=j}^{\infty} p_k(s)\,ds}, \quad j \geq 1, \tag{12} \]
since limit theorems for $M_j(t)$ follow from Theorem 1 and Proposition 2 by writing $M_j(t) = M(t) - \sum_{k=1}^{j-1} S_k(t)$. Note that for both proportions, the fixed-time and fixed-size limits are the same, as the variability in population size at a fixed time has been removed. Also note that both proportions are independent of the mutation rate $\nu$. In Section 5, we show that for the birth-death process, these properties enable us to define a consistent estimator for the extinction probability $p$ which applies both to the fixed-time and fixed-size SFS.
3.2 Special case: Birth-death process
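For the special case of the birth-death process, we are able to derive explicit expressions for the limits in Theorem 1 and Proposition 2, as we demonstrate in the following corollary.
Corollary 1. For the birth-death process, conditional on $\Omega_\infty$,
(1) the random variable $Y$ in Theorem 1 has the exponential distribution with mean $1/q$, and the fixed-time result (5) can be written explicitly as
\[ \begin{aligned} \lim_{t \to \infty} e^{-\lambda t} S_j(t) &= \frac{\nu q Y}{\lambda} \int_0^1 (1 - py)^{-1} (1 - y) y^{j-1}\,dy \\ &= \frac{\nu q Y}{\lambda} \sum_{k=0}^{\infty} \frac{p^k}{(j+k)(j+k+1)}, \quad j \geq 1. \end{aligned} \tag{13} \]
For the special case $p = 0$ of a pure-birth or Yule process,
\[ \lim_{t \to \infty} e^{-\lambda t} S_j(t) = \frac{\nu Y}{\lambda} \cdot \frac{1}{j(j+1)}. \]
(2) the fixed-size result (7) can be written explicitly as
\[ \begin{aligned} \lim_{N \to \infty} N^{-1} S_j(\tau_N) &= \frac{\nu q}{\lambda} \int_0^1 (1 - py)^{-1} (1 - y) y^{j-1}\,dy \\ &= \frac{\nu q}{\lambda} \sum_{k=0}^{\infty} \frac{p^k}{(j+k)(j+k+1)}, \quad j \geq 1. \end{aligned} \tag{14} \]
For the pure-birth or Yule process,
\[ \lim_{N \to \infty} N^{-1} S_j(\tau_N) = \frac{\nu}{\lambda} \cdot \frac{1}{j(j+1)}. \tag{15} \]
(3) the fixed-time result (9) can be written explicitly as
\[ \lim_{t \to \infty} e^{-\lambda t} M(t) = \begin{cases} \dfrac{\nu Y}{\lambda}, & p = 0, \\[2ex] -\dfrac{\nu q \log(q) Y}{\lambda p}, & 0 < p < 1. \end{cases} \tag{16} \]
(4) the fixed-size result (10) can be written explicitly as
\[ \lim_{N \to \infty} N^{-1} M(\tau_N) = \begin{cases} \dfrac{\nu}{\lambda}, & p = 0, \\[2ex] -\dfrac{\nu q \log(q)}{\lambda p}, & 0 < p < 1. \end{cases} \tag{17} \]
Proof. Section 6.8.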
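The agreement between the integral and series forms in (13) and (14) can be checked numerically. The Python sketch below is an illustration only: the function names, the midpoint quadrature rule and the truncation levels are arbitrary choices, not taken from the paper.

```python
def sfs_integral(j, p, n=200_000):
    """Midpoint-rule approximation of int_0^1 (1 - p*y)^(-1) (1 - y) y^(j-1) dy."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        y = (i + 0.5) * h
        total += (1.0 - y) * y ** (j - 1) / (1.0 - p * y)
    return total * h

def sfs_series(j, p, terms=5_000):
    """Partial sum of sum_{k>=0} p^k / ((j + k)(j + k + 1))."""
    return sum(p ** k / ((j + k) * (j + k + 1)) for k in range(terms))

def fixed_size_limit(j, nu, lam, p):
    """Constant limit of S_j(tau_N) / N in the series form of (14)."""
    q = 1.0 - p
    return nu * q / lam * sfs_series(j, p)
```

For $p = 0$ only the $k = 0$ term of the series survives, which recovers the Yule value $\nu/(\lambda j(j+1))$ in (15).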
Similarly, the proportion of mutations found in $j \geq 1$ individuals, appearing in expression (11), can be written explicitly as
\[ \frac{\int_0^\infty e^{-\lambda s} p_j(s)\,ds}{\int_0^\infty e^{-\lambda s} (1 - p_0(s))\,ds} = \begin{cases} \dfrac{1}{j(j+1)}, & p = 0, \\[2ex] -\dfrac{p}{\log(q)} \displaystyle\int_0^1 (1 - py)^{-1} (1 - y) y^{j-1}\,dy, & 0 < p < 1, \end{cases} \tag{18} \]
and the proportion of mutations in $j$ individuals out of all mutations in $\geq j$ individuals, appearing in expression (12), can be written as
\[ \varphi_j(p) := \frac{\int_0^\infty e^{-\lambda s} p_j(s)\,ds}{\int_0^\infty e^{-\lambda s} \sum_{k=j}^{\infty} p_k(s)\,ds} = \begin{cases} \dfrac{1}{j+1}, & p = 0, \\[2ex] 1 - \dfrac{\int_0^1 (1 - py)^{-1} y^{j}\,dy}{\int_0^1 (1 - py)^{-1} y^{j-1}\,dy}, & 0 < p < 1, \end{cases} \tag{19} \]
see Section 6.9. Note that expressions (18) and (19) give the same proportion for $j = 1$. It can be shown that for any $j \geq 1$, $\varphi_j(p)$ is strictly decreasing in $p$ (Section 6.10). In Section 5, we use this fact to develop an estimator for the extinction probability $p$.
We showed in expression (C.1) of [13] that for $p = 0$,
\[ E[S_j(\tau_N)] = \frac{\nu N}{\lambda} \cdot \frac{1}{j(j+1)}, \quad j = 2, \ldots, N-1. \]
In other words, the fixed-size result (15) holds in the mean even for finite values of $N$, excluding boundary effects at $j = 1$ and $j = N$.
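As an informal sanity check of the Yule case, the following Python sketch simulates the fixed-size SFS and can be compared against $\nu N / (\lambda j(j+1))$. It is not taken from the paper: the function name `simulate_yule_sfs`, the genotype-tree bookkeeping and the parameter defaults are assumptions made for illustration. The sketch uses the fact that, for a Yule process, the SFS at $\tau_N$ depends only on the order of events, where each event is a mutation with probability $\nu/(a+\nu)$ or a division otherwise, occurring in a uniformly chosen cell.

```python
import random
from collections import Counter

def simulate_yule_sfs(N, a=1.0, nu=1.0, seed=0):
    """Simulate a Yule process (u_2 = 1, so lambda = a) until it has N cells.

    Returns a Counter mapping clone size j to the number of mutations
    carried by exactly j cells at the stopping time.
    """
    rng = random.Random(seed)
    parent = [-1]   # parent[g] = genotype from which genotype g arose; 0 is ancestral
    cells = [0]     # cells[c] = genotype currently carried by living cell c
    while len(cells) < N:
        c = rng.randrange(len(cells))          # all cells are exchangeable
        if rng.random() < nu / (a + nu):
            parent.append(cells[c])            # new unique mutation (infinite sites)
            cells[c] = len(parent) - 1
        else:
            cells.append(cells[c])             # division: both daughters inherit
    # The clone size of mutation g is the number of cells whose genotype descends from g.
    size = [0] * len(parent)
    for g in cells:
        size[g] += 1
    for g in range(len(parent) - 1, 0, -1):    # children always have larger indices
        size[parent[g]] += size[g]
    return Counter(size[1:])
```

With $a = \nu = 1$ and $N = 5000$, the count `simulate_yule_sfs(5000)[2]` should fluctuate around $5000/6 \approx 833$, in line with the mean formula above.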
4 Proof of Theorem 1
In this section, we sketch the proof of the main result, Theorem 1. Proving the fixed-time result (5) represents most of the work, which is discussed in Sections 4.1 to 4.4. The main idea is to write the site frequency spectrum process $(S_j(t))_{t \geq 0}$ as a difference of two increasing processes in time, and to prove limit theorems for the increasing processes. The fixed-size result (7) follows easily from the fixed-time result (5) and Proposition 1 via the continuous mapping theorem, as is discussed in Section 4.5.
4.1 Decomposition into increasing processes $S_{j,+}(t)$ and $S_{j,-}(t)$
Fix $j \geq 1$. The key idea of the proof of the fixed-time result (5) is to decompose the process $(S_j(t))_{t \geq 0}$ into a difference of two increasing processes $(S_{j,+}(t))_{t \geq 0}$ and $(S_{j,-}(t))_{t \geq 0}$. To describe these processes, we first need to establish some notation.
Recall that for mutation $i \in \mathcal{M}_t$ and $s \leq t$, $C_i(s)$ is the size of the clone containing mutation $i$ at time $s$, meaning the number of individuals carrying mutation $i$ at time $s$. Set $\tau^i_{j,-}(0) := 0$ and define recursively for $k \geq 1$,
\[ \begin{aligned} \tau^i_{j,+}(k) &:= \inf\{s > \tau^i_{j,-}(k-1) : C_i(s) = j\}, \\ \tau^i_{j,-}(k) &:= \inf\{s > \tau^i_{j,+}(k) : C_i(s) \neq j\}. \end{aligned} \]
Note that $\tau^i_{j,+}(k)$ is the $k$-th time at which the clone containing mutation $i$ reaches or “enters” size $j$, and $\tau^i_{j,-}(k)$ is the $k$-th time at which it leaves or “exits” size $j$. Next, define
\[ I^i_{j,+}(t) := \sum_{\ell=1}^{\infty} 1_{\{\tau^i_{j,+}(\ell) \leq t\}}, \quad I^i_{j,-}(t) := \sum_{\ell=1}^{\infty} 1_{\{\tau^i_{j,-}(\ell) \leq t\}}, \tag{20} \]
as the number of times the clone containing mutation $i$ enters and exits size $j$, respectively, up until time $t$. Then, for each $k \geq 1$, define the increasing processes $(S^k_{j,+}(t))_{t \geq 0}$ and $(S^k_{j,-}(t))_{t \geq 0}$ by
\[ S^k_{j,+}(t) := \sum_{i \in \mathcal{M}_t} 1_{\{I^i_{j,+}(t) \geq k\}}, \quad S^k_{j,-}(t) := \sum_{i \in \mathcal{M}_t} 1_{\{I^i_{j,-}(t) \geq k\}}. \tag{21} \]
These processes keep track of the number of mutations in $\mathcal{M}_t$ whose clones enter and exit size $j$, respectively, at least $k$ times up until time $t$. We can now finally define the increasing processes $(S_{j,+}(t))_{t \geq 0}$ and $(S_{j,-}(t))_{t \geq 0}$ as
\[ S_{j,+}(t) := \sum_{k=1}^{\infty} S^k_{j,+}(t), \quad S_{j,-}(t) := \sum_{k=1}^{\infty} S^k_{j,-}(t). \]
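A key observation is that these processes count the total number of instances that a mutation enters and exits size $j$, respectively, up until time $t$. To see why, note that
\[ \begin{aligned} \sum_{k=1}^{\infty} S^k_{j,+}(t) &= \sum_{i \in \mathcal{M}_t} \sum_{k=1}^{\infty} 1_{\{I^i_{j,+}(t) \geq k\}} = \sum_{i \in \mathcal{M}_t} \sum_{k=1}^{\infty} \sum_{\ell=k}^{\infty} 1_{\{I^i_{j,+}(t) = \ell\}} \\ &= \sum_{i \in \mathcal{M}_t} \sum_{\ell=1}^{\infty} \sum_{k=1}^{\ell} 1_{\{I^i_{j,+}(t) = \ell\}} = \sum_{i \in \mathcal{M}_t} \sum_{\ell=1}^{\infty} \ell\, 1_{\{I^i_{j,+}(t) = \ell\}} \\ &= \sum_{i \in \mathcal{M}_t} I^i_{j,+}(t). \end{aligned} \]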
Similar calculations hold for $\sum_{k=1}^{\infty} S^k_{j,-}(t)$. Note that $I^i_{j,+}(t) - I^i_{j,-}(t) = 1$ if and only if $C_i(t) = j$, and $I^i_{j,+}(t) - I^i_{j,-}(t) = 0$ otherwise. It follows that
\[ S_j(t) = S_{j,+}(t) - S_{j,-}(t). \tag{22} \]
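The bookkeeping behind (22) can be illustrated on a single clone. The Python sketch below counts entries into and exits from size $j$ along a clone-size trajectory; the function name `entries_and_exits` and the representation of the trajectory as a list of successive clone sizes (starting from size 1 at the mutation time, with size 0 before it) are assumptions made for illustration.

```python
def entries_and_exits(path, j):
    """Count how often a clone enters and exits size j >= 1.

    path : successive clone sizes after the mutation occurs, e.g. [1, 2, 1, 2, 3, 2]
    Returns (entries, exits), the analogues of I_{j,+}(t) and I_{j,-}(t) at the
    time of the last recorded size.
    """
    entries = exits = 0
    prev = 0                      # clone size before the mutation occurs
    for size in path:
        if size == j and prev != j:
            entries += 1          # clone reaches size j
        if size != j and prev == j:
            exits += 1            # clone leaves size j
        prev = size
    return entries, exits
```

For the path `[1, 2, 1, 2, 3, 2]` and $j = 2$, the clone enters size 2 three times and exits twice, so the difference is 1, matching the fact that the clone currently has size 2. For $j = 1$ the two counts agree and the difference is 0.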
The fixed-time result (5) will follow from limit theorems for $S_{j,+}(t)$ and $S_{j,-}(t)$, which in turn follow from approximation results for the subprocesses $S^k_{j,+}(t)$ and $S^k_{j,-}(t)$ for $k \geq 1$.
4.2 Approximation results for $S^k_{j,+}(t)$ and $S^k_{j,-}(t)$
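We begin by establishing approximation results for $S^k_{j,+}(t)$ and $S^k_{j,-}(t)$ for each $k \geq 1$. First, for the branching process $(Z_0(t))_{t \geq 0}$ with $Z_0(0) = 1$, set $\tau^-_j(0) := 0$ and define recursively
\[ \begin{aligned} \tau^+_j(k) &:= \inf\{s > \tau^-_j(k-1) : Z_0(s) = j\}, \\ \tau^-_j(k) &:= \inf\{s > \tau^+_j(k) : Z_0(s) \neq j\}, \quad k \geq 1. \end{aligned} \tag{23} \]
Set
\[ p^k_{j,+}(t) := P(\tau^+_j(k) \leq t), \quad p^k_{j,-}(t) := P(\tau^-_j(k) \leq t), \tag{24} \]
which are the probabilities that the branching process enters and exits size $j$, respectively, at least $k$ times up until time $t$. A key observation is that
\[ p_j(t) = P(Z_0(t) = j) = \sum_{k=1}^{\infty} \left( p^k_{j,+}(t) - p^k_{j,-}(t) \right), \tag{25} \]
which follows from the fact that
\[ \begin{aligned} \{Z_0(t) = j\} &= \bigcup_{k \geq 1} \{\tau^+_j(k) \leq t,\ \tau^-_j(k) > t\} \\ &= \bigcup_{k \geq 1} \{\tau^+_j(k) \leq t\} \setminus \{\tau^-_j(k) \leq t\}. \end{aligned} \]
In addition, we note that since almost surely, $Z_0(t) \to 0$ or $Z_0(t) \to \infty$ as $t \to \infty$, there exists $0 < \theta < 1$ so that for each $t \geq 0$,
\[ p^k_{j,-}(t) \leq p^k_{j,+}(t) \leq P(\tau^+_j(k) < \infty) \leq \theta^k. \tag{26} \]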
The approximation results for $S^k_{j,+}(t)$ and $S^k_{j,-}(t)$ can be established using almost identical arguments, so it suffices to analyze $S^k_{j,+}(t)$. Recall that $S^k_{j,+}(t)$ is the number of mutations whose clones enter size $j$ at least $k$ times up until time $t$. At any time $s \leq t$, a mutation occurs at rate $\nu Z_0(s)$, and with probability $p^k_{j,+}(t-s)$, its clone enters size $j$ at least $k$ times up until time $t$. This suggests the approximation
\[ S^k_{j,+}(t) \approx \nu \int_0^t Z_0(s)\, p^k_{j,+}(t-s)\,ds =: \bar{S}^k_{j,+}(t). \tag{27} \]
Since $e^{-\lambda t} Z_0(t) \to Y$ as $t \to \infty$, we can further approximate for large $t$,
\[ \bar{S}^k_{j,+}(t) \approx \nu \int_0^t Y e^{\lambda s}\, p^k_{j,+}(t-s)\,ds =: \hat{S}^k_{j,+}(t). \tag{28} \]
For the remainder of the section, our goal is to establish bounds on the $L^1$-error associated with the approximations $S^k_{j,+}(t) \approx \bar{S}^k_{j,+}(t) \approx \hat{S}^k_{j,+}(t)$.
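We first consider the approximation (27). For $\Delta > 0$, define the Riemann sum
\[ \bar{S}^k_{j,+,\Delta}(t) := \nu \Delta \sum_{\ell=0}^{\lfloor t/\Delta \rfloor} Z_0(\ell\Delta)\, p^k_{j,+}(t - \ell\Delta). \tag{29} \]
Clearly, $\lim_{\Delta \to 0} \bar{S}^k_{j,+,\Delta}(t) = \bar{S}^k_{j,+}(t)$ almost surely. In addition, for some $C > 0$,
\[ \bar{S}^k_{j,+,\Delta}(t) \leq C t \max_{s \leq t} Z_0(s). \]
Since $(Z_0(s))_{s \geq 0}$ is a nonnegative submartingale, we can use Doob’s inequality to show that $C t\, E\left[\max_{s \leq t} Z_0(s)\right] < \infty$ for each $t \geq 0$. Therefore, by dominated convergence,
\[ \lim_{\Delta \to 0} E\left|\bar{S}^k_{j,+,\Delta}(t) - \bar{S}^k_{j,+}(t)\right| = 0, \quad t \geq 0. \]
It then follows from the triangle inequality that
\[ E\left|S^k_{j,+}(t) - \bar{S}^k_{j,+}(t)\right| \leq \lim_{\Delta \to 0} E\left|S^k_{j,+}(t) - \bar{S}^k_{j,+,\Delta}(t)\right|, \quad t \geq 0. \tag{30} \]
To bound the $L^1$-error of the approximation (27), it therefore suffices to bound the right-hand side of (30). We accomplish this in the following lemma.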
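Lemma 1. Let $t > 0$ and $\Delta > 0$. There exist constants $C_1 > 0$ and $C_2 > 0$ independent of $t$, $\Delta$ and $k$ such that
\[ E\left[\left(S^k_{j,+}(t) - \bar{S}^k_{j,+,\Delta}(t)\right)^2\right] \leq C_1 \theta^k t^2 e^{\lambda t} + C_2 \Delta e^{3\lambda t}. \tag{31} \]
Proof. Section 6.1.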
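We next turn to the approximation (28). By the triangle inequality and the Cauchy-Schwarz inequality, we can write
\[ \begin{aligned} E\left|\bar{S}^k_{j,+}(t) - \hat{S}^k_{j,+}(t)\right| &\leq \nu \int_0^t E\left|Y e^{\lambda s} - Z_0(s)\right| p^k_{j,+}(t-s)\,ds \\ &\leq \nu \int_0^t \left( E\left[\left(Y e^{\lambda s} - Z_0(s)\right)^2\right] \right)^{1/2} p^k_{j,+}(t-s)\,ds. \end{aligned} \]
By showing that $E\left[\left(Y e^{\lambda s} - Z_0(s)\right)^2\right] = C e^{\lambda s}$ for some $C > 0$ and applying (26), we can obtain the following bound on the $L^1$-error of the approximation (28).
Lemma 2.
\[ E\left|\bar{S}^k_{j,+}(t) - \hat{S}^k_{j,+}(t)\right| = O(\theta^k e^{\lambda t/2}). \tag{32} \]
Proof. Section 6.2.
Finally, from (30), (31) and (32), it is straightforward to obtain a bound on the $L^1$-error of the approximation $S^k_{j,+}(t) \approx \hat{S}^k_{j,+}(t)$, which we state as Proposition 3.
Proposition 3.
\[ E\left|S^k_{j,+}(t) - \hat{S}^k_{j,+}(t)\right| = O(\theta^{k/2} t e^{\lambda t/2}). \tag{33} \]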
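For the reader's convenience, here is one way the three bounds might combine. This is a sketch only: it assumes that the right-hand side of (31) has the form $C_1 \theta^k t^2 e^{\lambda t} + C_2 \Delta e^{3\lambda t}$, so that the second term vanishes as $\Delta \to 0$, and it uses Jensen's inequality, which the text does not spell out. By (30) and (31),
\[ E\left|S^k_{j,+}(t) - \bar{S}^k_{j,+}(t)\right| \leq \lim_{\Delta \to 0} \left( E\left[\left(S^k_{j,+}(t) - \bar{S}^k_{j,+,\Delta}(t)\right)^2\right] \right)^{1/2} \leq \sqrt{C_1}\, \theta^{k/2}\, t\, e^{\lambda t/2}. \]
Adding the bound (32) through the triangle inequality, and using $\theta^k \leq \theta^{k/2}$ for $0 < \theta < 1$, gives for $t \geq 1$
\[ E\left|S^k_{j,+}(t) - \hat{S}^k_{j,+}(t)\right| \leq E\left|S^k_{j,+}(t) - \bar{S}^k_{j,+}(t)\right| + E\left|\bar{S}^k_{j,+}(t) - \hat{S}^k_{j,+}(t)\right| = O(\theta^{k/2} t e^{\lambda t/2}). \]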
4.3 Limit theorems for $S_{j,+}(t)$ and $S_{j,-}(t)$
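To establish limit theorems for $S_{j,+}(t)$ and $S_{j,-}(t)$, we define the approximations
\[ \hat{S}_{j,+}(t) := \sum_{k=1}^{\infty} \hat{S}^k_{j,+}(t), \quad \hat{S}_{j,-}(t) := \sum_{k=1}^{\infty} \hat{S}^k_{j,-}(t). \]
Focusing on the former approximation, we first argue that $\lim_{t \to \infty} e^{-\lambda t} \hat{S}_{j,+}(t)$ exists. Indeed, consider the following calculations for $k \geq 1$ and $t \geq 0$, where we use (26):
\[ \begin{aligned} e^{-\lambda t} \hat{S}^k_{j,+}(t) &= \nu e^{-\lambda t} \int_0^t Y e^{\lambda s}\, p^k_{j,+}(t-s)\,ds \\ &= \nu Y \int_0^t e^{-\lambda s}\, p^k_{j,+}(s)\,ds \\ &\leq \frac{\nu Y}{\lambda}\, \theta^k. \end{aligned} \]
The second equality shows that $t \mapsto e^{-\lambda t} \hat{S}^k_{j,+}(t)$ is an increasing function, and the inequality shows that the function is bounded above by the summable sequence $(\nu Y/\lambda)\theta^k$. Therefore, $t \mapsto e^{-\lambda t} \hat{S}_{j,+}(t)$ is increasing and bounded above, which implies that $\lim_{t \to \infty} e^{-\lambda t} \hat{S}_{j,+}(t)$ exists. The limit is given by
\[ \lim_{t \to \infty} e^{-\lambda t} \hat{S}_{j,+}(t) = \nu Y \int_0^\infty e^{-\lambda s} \left( \sum_{k=1}^{\infty} p^k_{j,+}(s) \right) ds. \tag{34} \]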
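The two steps in the calculation preceding (34) can be made explicit as follows; this worked step is added as a sketch and is not part of the original text. The second equality is the substitution $u = t - s$,
\[ e^{-\lambda t} \int_0^t e^{\lambda s}\, p^k_{j,+}(t-s)\,ds = \int_0^t e^{-\lambda u}\, p^k_{j,+}(u)\,du, \]
and the inequality follows from the uniform bound $p^k_{j,+}(u) \leq \theta^k$ in (26),
\[ \int_0^t e^{-\lambda u}\, p^k_{j,+}(u)\,du \leq \theta^k \int_0^\infty e^{-\lambda u}\,du = \frac{\theta^k}{\lambda}. \]
Summing over $k \geq 1$ gives the bound $e^{-\lambda t} \hat{S}_{j,+}(t) \leq \nu Y \theta / (\lambda (1 - \theta))$, which is finite since $0 < \theta < 1$.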
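We next note that by the triangle inequality and Proposition 3,
\[ E\left|S_{j,+}(t) - \hat{S}_{j,+}(t)\right| \leq \sum_{k=1}^{\infty} E\left|S^k_{j,+}(t) - \hat{S}^k_{j,+}(t)\right| = O\left(t e^{\lambda t/2}\right), \]
which implies that
\[ \int_0^\infty e^{-\lambda t} E\left|S_{j,+}(t) - \hat{S}_{j,+}(t)\right| dt < \infty. \tag{35} \]
Combining (35) with the fact that $(S_{j,+}(t))_{t \geq 0}$ and $(S_{j,-}(t))_{t \geq 0}$ are increasing processes, we can establish almost sure convergence results for $e^{-\lambda t} S_{j,+}(t)$ and $e^{-\lambda t} S_{j,-}(t)$. In the proof, we adapt an argument of Harris (Theorem 21.1 of [25]), with the $L^1$ condition (35) replacing an analogous $L^2$ condition used by Harris.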
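Proposition 4. Conditional on $\Omega_\infty$,
\[ \begin{aligned} \lim_{t \to \infty} e^{-\lambda t} S_{j,+}(t) &= \nu Y \int_0^\infty e^{-\lambda s} \left( \sum_{k=1}^{\infty} p^k_{j,+}(s) \right) ds, \\ \lim_{t \to \infty} e^{-\lambda t} S_{j,-}(t) &= \nu Y \int_0^\infty e^{-\lambda s} \left( \sum_{k=1}^{\infty} p^k_{j,-}(s) \right) ds, \end{aligned} \]
almost surely.
Proof. Section 6.3.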
4.4 Proof of the fixed-time result (5)
To finish the proof of the fixed-time result (5), it suffices to note that by (25) and Proposition 4,
\[ \lim_{t \to \infty} e^{-\lambda t} \left( S_{j,+}(t) - S_{j,-}(t) \right) = \nu Y \int_0^\infty e^{-\lambda s} p_j(s)\,ds. \]
Since $S_j(t) = S_{j,+}(t) - S_{j,-}(t)$ by (22), the result follows.
4.5 Proof of the fixed-size result (7)
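To prove the fixed-size result (7), we note that by (5), conditional on $\Omega_\infty$,
\[ \lim_{N \to \infty} e^{-\lambda \tau_N} S_j(\tau_N) = \nu Y \int_0^\infty e^{-\lambda s} p_j(s)\,ds, \]
almost surely. Since $N e^{-\lambda t_N} = Y$ by (4), we also have
\[ \begin{aligned} \lim_{N \to \infty} e^{-\lambda(\tau_N - t_N)} \cdot N^{-1} S_j(\tau_N) &= Y^{-1} \lim_{N \to \infty} e^{-\lambda \tau_N} S_j(\tau_N) \\ &= \nu \int_0^\infty e^{-\lambda s} p_j(s)\,ds, \end{aligned} \]
almost surely. By Proposition 1 and the continuous mapping theorem, conditional on $\Omega_\infty$,
\[ \lim_{N \to \infty} e^{-\lambda(\tau_N - t_N)} = 1, \]
in probability. We can therefore conclude that conditional on $\Omega_\infty$,
\[ \lim_{N \to \infty} N^{-1} S_j(\tau_N) = \nu \int_0^\infty e^{-\lambda s} p_j(s)\,ds, \]
in probability, which is the desired result.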
5 Application: Estimation of extinction probability and effective mutation rate for birth-death process
We conclude by briefly discussing how for the birth-death process, our results imply consistent estimators for the extinction probability $p$ and the effective mutation rate $\nu/\lambda$, given data on the SFS of all mutations found in the population. The estimator for $p$ is based on the long-run proportion of mutations found in one individual. Recall that by (12), this proportion is the same for the fixed-time and fixed-size SFS. By setting $j = 1$ in (18), the proportion can be written explicitly as (Section 6.11)
\[ \varphi_1(p) = \begin{cases} \dfrac{1}{2}, & p = 0, \\[2ex] -\dfrac{p + q \log(q)}{p \log(q)}, & 0 < p < 1, \end{cases} \tag{36} \]
where we recall that $q = 1 - p$. The function $\varphi_1(p)$ is strictly decreasing in $p$ and it takes values in $(0, 1/2]$. If in a given population, the proportion of mutations found in one individual is observed to be $x$, we define an estimator for $p$ by applying the inverse function of $\varphi_1$:
\[ \hat{p} = \hat{p}(x) := \varphi_1^{-1}(x). \tag{37} \]
Technically, $\varphi_1^{-1}$ is only defined on $(0, 1/2]$, whereas the random number $x$ may take any value in $[0, 1]$. This can be addressed by extending the definition of $\varphi_1^{-1}$ so that $\varphi_1^{-1}(x) := \varphi_1^{-1}(1/2) = 0$ for $x > 1/2$ and $\varphi_1^{-1}(0) := \lim_{x \to 0^+} \varphi_1^{-1}(x) = 1$. Since $\varphi_1^{-1}$ so defined is continuous, we can combine (11) and (18) with the continuous mapping theorem to see that whether the SFS is observed at a fixed time or a fixed size, the estimator in (37) is consistent in the sense that $\hat{p} \to p$ in probability as $t \to \infty$ or $N \to \infty$. In other words, if the population is sufficiently large, its site frequency spectrum can be used to obtain an arbitrarily accurate estimate of $p$. Then, using the total number of mutations and the current size of the population, an estimate for $\nu/\lambda$ can be derived from (16) or (17). We refer to Section 5 of [13] for a more detailed discussion of this estimator, which includes an application of the estimator to simulated data.
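A minimal Python sketch of this estimation procedure is given below. It is an illustration under stated assumptions rather than the authors' implementation: the function names, the bisection scheme used to invert $\varphi_1$ and the small-$p$ cutoff are all choices made here. The second function rearranges (17), reading it as $M(\tau_N)/N \approx -(\nu/\lambda)\, q \log(q)/p$ for $0 < p < 1$.

```python
import math

def phi1(p):
    """Long-run proportion of mutations found in one individual, as in (36)."""
    if p < 1e-8:                       # numerical guard near p = 0, where phi1 -> 1/2
        return 0.5
    q = 1.0 - p
    log_q = math.log1p(-p)             # log(q), accurate for small p
    return -(p + q * log_q) / (p * log_q)

def estimate_p(x, tol=1e-12):
    """Invert phi1 by bisection, with the extension to x > 1/2 and x = 0 from the text."""
    if x >= 0.5:
        return 0.0
    if x <= 0.0:
        return 1.0
    lo, hi = 0.0, 1.0                  # phi1 is strictly decreasing on [0, 1)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if phi1(mid) > x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def estimate_effective_mutation_rate(total_mutations, population_size, p):
    """Estimate nu/lambda from M and N by rearranging the fixed-size limit (17)."""
    ratio = total_mutations / population_size
    if p < 1e-8:
        return ratio
    q = 1.0 - p
    return -ratio * p / (q * math.log1p(-p))
```

In use, one would compute the observed proportion $x = S_1 / M$ from data, set `p_hat = estimate_p(x)`, and then pass `p_hat` together with the observed mutation count and population size to `estimate_effective_mutation_rate`.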
In the preceding discussion, we focused on the proportion of mutations found in one individual for illustration purposes. The point was to show that it is possible to define a consistent estimator for $p$ and $\nu/\lambda$ using the SFS. If it is difficult to measure the number of mutations found in one individual, one can instead focus on the proportion of mutations found in $j$ cells out of all mutations found in $\geq j$ cells for some $j > 1$, denoted by $\varphi_j(p)$ in (19). As noted in Section 3.2, $\varphi_j(p)$ is strictly decreasing in $p$ for any $j \geq 1$, and it takes values in $(0, 1/(j+1)]$. We can therefore define a consistent estimator for $p$ using the inverse function $\varphi_j^{-1}$. However, it should be noted that the range of $\varphi_j(p)$ becomes narrower as $j$ increases, which will likely affect the standard deviation of the estimator.
6 Proofs
6.1 Proof of Lemma 1
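Proof. Before considering the quantity of interest $E\left[\left(S^k_{j,+}(t) - \bar{S}^k_{j,+,\Delta}(t)\right)^2\right]$, we perform some preliminary calculations. Recall that $\mathcal{M}_t$ is the set of mutations generated up until time $t$. For $\Delta > 0$ and any non-negative integer $\ell$ with $\ell\Delta < t$, define $A_{\ell,\Delta}$ to be the set of mutations created in the time interval $[\ell\Delta, \min\{(\ell+1)\Delta, t\})$, and note that
\[ \mathcal{M}_t = \bigcup_{\ell=0}^{\lfloor t/\Delta \rfloor} A_{\ell,\Delta}. \]
Define $X_{\ell,\Delta} := |A_{\ell,\Delta}|$ as the number of mutations created in $[\ell\Delta, \min\{(\ell+1)\Delta, t\})$. Note that conditional on $\mathcal{F}_{(\ell+1)\Delta} = \sigma(Z_0(s); s \leq (\ell+1)\Delta)$,
\[ X_{\ell,\Delta} \sim \mathrm{Pois}\left( \nu \int_{\ell\Delta}^{(\ell+1)\Delta} Z_0(s)\,ds \right). \]
Using this fact, it is easy to see that
\[ E[X_{\ell,\Delta} \mid \mathcal{F}_{(\ell+1)\Delta}] = \nu \int_{\ell\Delta}^{(\ell+1)\Delta} Z_0(s)\,ds = \Delta \nu Z_0(\ell\Delta)(1 + O(\Delta)) \tag{38} \]
and
\[ E[X_{\ell,\Delta}^2 \mid \mathcal{F}_{(\ell+1)\Delta}] - E[X_{\ell,\Delta} \mid \mathcal{F}_{(\ell+1)\Delta}] = E[X_{\ell,\Delta} \mid \mathcal{F}_{(\ell+1)\Delta}]^2, \tag{39} \]
which implies
\[ E[X_{\ell,\Delta}^2] - E[X_{\ell,\Delta}] = \Delta^2 \nu^2 E[Z_0(\ell\Delta)^2](1 + O(\Delta)). \tag{40} \]
For ease of presentation, we will for the remainder of the proof drop $1 + O(\Delta)$ multiplicative factors in calculations, as they will not affect the final result.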
j,+(t) is the number of times the clone containing
mutation ireaches size jup until time t, see (20). Define
Wk
,t(j) := X
iAℓ,
1{Ii
j,+(t)k}
15
as the number of mutations in Aℓ,whose clone reaches size jat least ktimes up until
time t. Note that by the definition of Sk
j,+(t) in (21),
Sk
j,+(t) =
t/
X
=0
Wk
,t(j).(41)
For iAℓ,,P(Ii
j,+(t)k) = pk
j,+(t) + O(∆), where pk
j,+(t) is defined as in (24).
Therefore, conditional on Xℓ,,Wk
,t(j) is a binomial random variable with parameters
Xℓ,and pk
j,+(t∆) + O(∆). Dropping 1 + O(∆) factors, this implies by (38),
E[Wk
,t(j)|F(+1)∆] = EEWk
,t(j)|Xℓ,,F(+1)∆|F(+1)∆
=pk
j,+(t∆)EXℓ,|F(+1)∆
= νpk
j,+(t∆)Z0(∆),(42)
and by (40) and (38),
EWk
,t(j)2
=pk
j,+(t∆)2EX2
ℓ,+pk
j,+(t∆) 1pk
j,+(t∆)E[Xℓ,]
=pk
j,+(t∆)2(EX2
ℓ,E[Xℓ,]) + pk
j,+(t∆)E[Xℓ,]
=pk
j,+(t∆)22ν2EZ0(∆)2+pk
j,+(t∆)∆νE [Z0(∆)] .(43)
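We are now ready to begin the main calculations. First, note that by (29) and (41),
\begin{align*}
E\big[(S^k_{j,+}(t) - \bar{S}^k_{j,+,\Delta}(t))^2\big]
&= E\Bigg[\Bigg(\sum_{\ell=0}^{\lfloor t/\Delta \rfloor} \big(\nu\Delta Z_0(\ell\Delta) p^k_{j,+}(t - \ell\Delta) - W^k_{\ell\Delta,t}(j)\big)\Bigg)^2\Bigg] \\
&= \sum_{\ell_2=0}^{\lfloor t/\Delta \rfloor} \sum_{\ell_1=0}^{\lfloor t/\Delta \rfloor} E\Big[\big(\nu\Delta Z_0(\ell_2\Delta) p^k_{j,+}(t - \ell_2\Delta) - W^k_{\ell_2\Delta,t}(j)\big) \\
&\qquad\qquad \cdot \big(\nu\Delta Z_0(\ell_1\Delta) p^k_{j,+}(t - \ell_1\Delta) - W^k_{\ell_1\Delta,t}(j)\big)\Big]. \tag{44}
\end{align*}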
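We first consider the diagonal terms in the double sum. Note first that by (42),
\[
E[Z_0(\ell\Delta) W^k_{\ell\Delta,t}(j)] = \Delta\nu\, p^k_{j,+}(t - \ell\Delta) E[Z_0(\ell\Delta)^2],
\]
which implies by (43),
\begin{align*}
&E\Big[\big(\nu\Delta Z_0(\ell\Delta) p^k_{j,+}(t - \ell\Delta) - W^k_{\ell\Delta,t}(j)\big)^2\Big] \\
&\quad= \nu^2\Delta^2 p^k_{j,+}(t - \ell\Delta)^2 E[Z_0(\ell\Delta)^2] - 2\nu\Delta\, p^k_{j,+}(t - \ell\Delta) E[Z_0(\ell\Delta) W^k_{\ell\Delta,t}(j)] + E[W^k_{\ell\Delta,t}(j)^2] \\
&\quad= E\big[W^k_{\ell\Delta,t}(j)^2\big] - \nu^2\Delta^2 p^k_{j,+}(t - \ell\Delta)^2 E[Z_0(\ell\Delta)^2] \\
&\quad= \Delta\nu\, p^k_{j,+}(t - \ell\Delta) E[Z_0(\ell\Delta)].
\end{align*}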
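Next, we consider the cross terms for $\ell_1 < \ell_2$:
\begin{align*}
&E\Big[\big(\nu\Delta Z_0(\ell_2\Delta) p^k_{j,+}(t - \ell_2\Delta) - W^k_{\ell_2\Delta,t}(j)\big)\big(\nu\Delta Z_0(\ell_1\Delta) p^k_{j,+}(t - \ell_1\Delta) - W^k_{\ell_1\Delta,t}(j)\big)\Big] \\
&\quad= \nu\Delta\, p^k_{j,+}(t - \ell_1\Delta) E\Big[Z_0(\ell_1\Delta)\big(\nu\Delta Z_0(\ell_2\Delta) p^k_{j,+}(t - \ell_2\Delta) - W^k_{\ell_2\Delta,t}(j)\big)\Big] \\
&\qquad - E\Big[W^k_{\ell_1\Delta,t}(j)\big(\nu\Delta Z_0(\ell_2\Delta) p^k_{j,+}(t - \ell_2\Delta) - W^k_{\ell_2\Delta,t}(j)\big)\Big] \\
&\quad= E\Big[W^k_{\ell_1\Delta,t}(j)\big(W^k_{\ell_2\Delta,t}(j) - \nu\Delta Z_0(\ell_2\Delta) p^k_{j,+}(t - \ell_2\Delta)\big)\Big],
\end{align*}
where the final equality follows by combining (42) with the fact that
\begin{align*}
&E\Big[Z_0(\ell_1\Delta)\big(\nu\Delta Z_0(\ell_2\Delta) p^k_{j,+}(t - \ell_2\Delta) - W^k_{\ell_2\Delta,t}(j)\big)\Big] \\
&\quad= E\Big[E\Big[Z_0(\ell_1\Delta)\big(\nu\Delta Z_0(\ell_2\Delta) p^k_{j,+}(t - \ell_2\Delta) - E[W^k_{\ell_2\Delta,t}(j) \mid \mathcal{F}_{(\ell_2+1)\Delta}]\big) \,\Big|\, \mathcal{F}_{(\ell_1+1)\Delta}\Big]\Big].
\end{align*}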
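We can now rewrite (44) as
\begin{align*}
E\big[(S^k_{j,+}(t) - \bar{S}^k_{j,+,\Delta}(t))^2\big]
&= \Delta\nu \sum_{\ell=0}^{\lfloor t/\Delta \rfloor} p^k_{j,+}(t - \ell\Delta) E[Z_0(\ell\Delta)] \\
&\quad + 2 \sum_{\ell_1 < \ell_2} E\Big[W^k_{\ell_1\Delta,t}(j)\big(W^k_{\ell_2\Delta,t}(j) - \nu\Delta Z_0(\ell_2\Delta) p^k_{j,+}(t - \ell_2\Delta)\big)\Big]. \tag{45}
\end{align*}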
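The remainder of the proof will focus on bounding the off-diagonal terms
\[
E\Big[W^k_{\ell_1\Delta,t}(j)\big(W^k_{\ell_2\Delta,t}(j) - \nu\Delta Z_0(\ell_2\Delta) p^k_{j,+}(t - \ell_2\Delta)\big)\Big]. \tag{46}
\]
We begin with the following lemma, which shows that in the limit as $\Delta \to 0$, we can
ignore the possibility of multiple mutations in time intervals of length $\Delta$.
Lemma 3. For $\ell_1 < \ell_2$, $\Delta > 0$ and $t > 0$,
\begin{align*}
E[W^k_{\ell_2\Delta,t}(j) W^k_{\ell_1\Delta,t}(j)] &= P(W^k_{\ell_2\Delta,t}(j) = 1, W^k_{\ell_1\Delta,t}(j) = 1) + O\big(e^{\lambda\ell_1\Delta} e^{2\lambda\ell_2\Delta} \Delta^3\big), \\
E[Z_0(\ell_2\Delta) W^k_{\ell_1\Delta,t}(j)] &= E[Z_0(\ell_2\Delta);\, W^k_{\ell_1\Delta,t}(j) = 1] + O\big(e^{\lambda\ell_1\Delta} e^{2\lambda\ell_2\Delta} \Delta^2\big).
\end{align*}
Proof. Section 6.4.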
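By Lemma 3, instead of (46) we can study the simpler difference
\[
P(W^k_{\ell_1\Delta,t}(j) = 1, W^k_{\ell_2\Delta,t}(j) = 1) - \nu\Delta\, p^k_{j,+}(t - \ell_2\Delta) E[Z_0(\ell_2\Delta);\, W^k_{\ell_1\Delta,t}(j) = 1]. \tag{47}
\]
For ease of notation, define
\begin{align*}
I_1(\ell_1, \ell_2) &:= P(W^k_{\ell_1\Delta,t}(j) = 1, W^k_{\ell_2\Delta,t}(j) = 1), \\
I_2(\ell_1, \ell_2) &:= \nu\Delta\, p^k_{j,+}(t - \ell_2\Delta) E[Z_0(\ell_2\Delta);\, W^k_{\ell_1\Delta,t}(j) = 1].
\end{align*}
In the following calculations, we will use twice that, up to $1 + O(\Delta)$ factors,
\[
P(W^k_{\ell_1\Delta,t}(j) = 1 \mid Z_0(\ell_1\Delta) = n) = \nu\Delta\, n\, p^k_{j,+}(t - \ell_1\Delta).
\]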
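First consider the $I_2(\ell_1, \ell_2)$ term,
\begin{align*}
\frac{I_2(\ell_1, \ell_2)}{\nu\Delta\, p^k_{j,+}(t - \ell_2\Delta)} &= E\big[Z_0(\ell_2\Delta);\, W^k_{\ell_1\Delta,t}(j) = 1\big] \\
&= \sum_{m=1}^{\infty} m P\big(Z_0(\ell_2\Delta) = m, W^k_{\ell_1\Delta,t}(j) = 1\big) \\
&= \sum_{m=1}^{\infty} \sum_{n=1}^{\infty} m P\big(Z_0(\ell_2\Delta) = m, W^k_{\ell_1\Delta,t}(j) = 1, Z_0(\ell_1\Delta) = n\big) \\
&= \sum_{m=1}^{\infty} \sum_{n=1}^{\infty} m P\big(Z_0(\ell_2\Delta) = m \mid W^k_{\ell_1\Delta,t}(j) = 1, Z_0(\ell_1\Delta) = n\big) \\
&\qquad \cdot P(W^k_{\ell_1\Delta,t}(j) = 1 \mid Z_0(\ell_1\Delta) = n) P(Z_0(\ell_1\Delta) = n) \\
&= \nu\Delta\, p^k_{j,+}(t - \ell_1\Delta) \sum_{n=1}^{\infty} n P(Z_0(\ell_1\Delta) = n) \\
&\qquad \cdot \sum_{m=1}^{\infty} m P\big(Z_0(\ell_2\Delta) = m \mid W^k_{\ell_1\Delta,t}(j) = 1, Z_0(\ell_1\Delta) = n\big).
\end{align*}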
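Next we consider the $I_1(\ell_1, \ell_2)$ term,
\begin{align*}
I_1(\ell_1, \ell_2) &= P(W^k_{\ell_1\Delta,t}(j) = 1, W^k_{\ell_2\Delta,t}(j) = 1) \\
&= \sum_{n=1}^{\infty} P(W^k_{\ell_2\Delta,t}(j) = 1 \mid Z_0(\ell_1\Delta) = n, W^k_{\ell_1\Delta,t}(j) = 1) \\
&\qquad \cdot P(W^k_{\ell_1\Delta,t}(j) = 1 \mid Z_0(\ell_1\Delta) = n) P(Z_0(\ell_1\Delta) = n) \\
&= \nu\Delta\, p^k_{j,+}(t - \ell_1\Delta) \sum_{n=1}^{\infty} n P(Z_0(\ell_1\Delta) = n) P(W^k_{\ell_2\Delta,t}(j) = 1 \mid Z_0(\ell_1\Delta) = n, W^k_{\ell_1\Delta,t}(j) = 1) \\
&= \nu\Delta\, p^k_{j,+}(t - \ell_1\Delta) \sum_{n=1}^{\infty} n P(Z_0(\ell_1\Delta) = n) \\
&\qquad \cdot \sum_{m=1}^{\infty} P(W^k_{\ell_2\Delta,t}(j) = 1 \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, W^k_{\ell_1\Delta,t}(j) = 1) \\
&\qquad\qquad \cdot P(Z_0(\ell_2\Delta) = m \mid W^k_{\ell_1\Delta,t}(j) = 1, Z_0(\ell_1\Delta) = n).
\end{align*}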
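We can therefore write
\begin{align*}
I_1(\ell_1, \ell_2) - I_2(\ell_1, \ell_2)
&= \nu\Delta\, p^k_{j,+}(t - \ell_1\Delta) \sum_{n=1}^{\infty} n P(Z_0(\ell_1\Delta) = n) \\
&\quad \cdot \sum_{m=1}^{\infty} P(Z_0(\ell_2\Delta) = m \mid W^k_{\ell_1\Delta,t}(j) = 1, Z_0(\ell_1\Delta) = n) \\
&\quad \cdot \Big(P(W^k_{\ell_2\Delta,t}(j) = 1 \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, W^k_{\ell_1\Delta,t}(j) = 1) \\
&\qquad\quad - \nu\Delta\, m\, p^k_{j,+}(t - \ell_2\Delta)\Big). \tag{48}
\end{align*}
We can use (48) to show that there exists a constant $C > 0$ so that
\[
I_1(\ell_1, \ell_2) - I_2(\ell_1, \ell_2) \le C\Delta^2\theta^k e^{\lambda\ell_2\Delta}, \tag{49}
\]
where $\theta$ is obtained from (26). The proof is deferred to the following lemma.
Lemma 4. For $\ell_1 < \ell_2$, $\Delta > 0$ and $t > 0$, (49) holds.
Proof. Section 6.5.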
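Returning to (45), we can finally use Lemmas 3 and 4 to conclude that there exist
positive constants $C_1$, $C_2$ and $C_3$ such that
\begin{align*}
E\big[(S^k_{j,+}(t) - \bar{S}^k_{j,+,\Delta}(t))^2\big]
&\le \Delta\nu \sum_{\ell=0}^{\lfloor t/\Delta \rfloor} p^k_{j,+}(t - \ell\Delta) E[Z_0(\ell\Delta)] \\
&\quad + 2 \sum_{\ell_1 < \ell_2} \big(I_1(\ell_1, \ell_2) - I_2(\ell_1, \ell_2)\big) + C_3 \Delta e^{3\lambda t} \\
&\le C_1 \theta^k t e^{\lambda t} + C_2 \theta^k t^2 e^{\lambda t} + C_3 \Delta e^{3\lambda t}.
\end{align*}
This concludes the proof.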
6.2 Proof of Lemma 2
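Proof. Using that $E[Y \mid \mathcal{F}_s] = e^{-\lambda s} Z_0(s)$, see Section 2.3, we begin by writing
\begin{align*}
E\big[(Y e^{\lambda s} - Z_0(s))^2\big]
&= E[Z_0(s)^2] - 2e^{\lambda s} E[Y Z_0(s)] + e^{2\lambda s} E[Y^2] \\
&= e^{2\lambda s} E[Y^2] - E[Z_0(s)^2].
\end{align*}
From expression (5) of Chapter III.4 of [26], we know there exist positive constants $c_1$ and
$c_2$ such that
\[
E[Z_0(s)^2] = c_1 e^{2\lambda s} - c_2 e^{\lambda s}. \tag{50}
\]
If we establish that $E[Y^2] = c_1$, then it will follow that
\[
E\big[(Y e^{\lambda t} - Z_0(t))^2\big] = c_2 e^{\lambda t}, \tag{51}
\]
which is what we need to prove Lemma 2. To this end, note that Theorem 1 of IV.11 in
[26] implies that $E[(Z_0(t) e^{-\lambda t})^2] \to E[Y^2]$ as $t \to \infty$. And from (50), we know that
\[
\lim_{t \to \infty} e^{-2\lambda t} E[Z_0(t)^2] = c_1.
\]
Therefore, $E[Y^2] = c_1$, which concludes the proof.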
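As a concrete check of (50) in a special case, consider a linear birth-death process with birth rate $b$ and death rate $d < b$ (symbols introduced here only for illustration), so that $\lambda = b - d$. The standard formulas $E[Z_0(t)] = e^{\lambda t}$ and $\mathrm{Var}(Z_0(t)) = \frac{b+d}{\lambda} e^{\lambda t}(e^{\lambda t} - 1)$ give $c_1 = 2b/\lambda$ and $c_2 = (b+d)/\lambda$. The sketch below verifies this numerically and compares $c_1$ with $E[Y^2]$, assuming the known birth-death limit law in which $Y = 0$ with probability $p = d/b$ and $Y$ is exponential with rate $q = 1 - p$ otherwise. All function names are hypothetical.

```python
import math


def second_moment(b, d, t):
    """E[Z_0(t)^2] for a linear birth-death process started from one individual."""
    lam = b - d
    mean = math.exp(lam * t)
    var = (b + d) / lam * mean * (mean - 1.0)
    return var + mean ** 2


def constants(b, d):
    """Constants (c1, c2) in E[Z_0(t)^2] = c1*exp(2*lam*t) - c2*exp(lam*t)."""
    lam = b - d
    return 2.0 * b / lam, (b + d) / lam


def limit_second_moment(b, d):
    """E[Y^2] when Y = 0 w.p. d/b and Y ~ Exp(rate q) w.p. q = 1 - d/b (assumed law)."""
    q = 1.0 - d / b
    return q * 2.0 / q ** 2  # P(Y > 0) times the second moment of Exp(q)
```

For instance, with `b = 1.0` and `d = 0.5`, `constants` returns `(4.0, 3.0)` and `limit_second_moment` returns `4.0`, in line with $E[Y^2] = c_1$.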
6.3 Proof of Proposition 4
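Proof. Since $S_{j,+}(t)$ is increasing in $t$,
\[
e^{-\lambda(t+\tau)} S_{j,+}(t+\tau) \ge e^{-\lambda\tau} e^{-\lambda t} S_{j,+}(t), \quad t, \tau \ge 0.
\]
In Section 4.3, it is shown that $\hat{S} := \lim_{t \to \infty} e^{-\lambda t} \hat{S}_{j,+}(t)$ exists, and the limit is positive on
$\Omega_\infty$ since $Y > 0$, see (34). Suppose there is an $\omega \in \Omega_\infty$ such that
\[
\limsup_{t \to \infty} e^{-\lambda t} S_{j,+}(t, \omega) > \hat{S}(\omega). \tag{52}
\]
For notational convenience, we will drop the $\omega$ in what follows. If (52) is true, there is a
$\delta > 0$ and a sequence of real numbers $t_1 < t_2 < \ldots$ such that $t_{i+1} - t_i > \frac{\delta}{\lambda(2+2\delta)}$ and
$e^{-\lambda t_i} S_{j,+}(t_i) > \hat{S}(1 + \delta)$ for $i = 1, 2, \ldots$. Then
\[
e^{-\lambda(t_i+\tau)} S_{j,+}(t_i+\tau) \ge e^{-\lambda\tau} e^{-\lambda t_i} S_{j,+}(t_i) \ge (1 - \lambda\tau) \hat{S}(1 + \delta). \tag{53}
\]
Also, there exists $t_0$ so that for $t > t_0$,
\[
e^{-\lambda t} \hat{S}_{j,+}(t) < \hat{S}(1 + \delta/2).
\]
Therefore, for $t_i > t_0$,
\begin{align*}
\int_{t_i}^{t_{i+1}} \big|e^{-\lambda t} S_{j,+}(t) - e^{-\lambda t} \hat{S}_{j,+}(t)\big|\,dt
&\ge \int_{t_i}^{t_i + \frac{\delta}{\lambda(2+2\delta)}} \big|e^{-\lambda t} S_{j,+}(t) - e^{-\lambda t} \hat{S}_{j,+}(t)\big|\,dt \\
&\ge \int_{t_i}^{t_i + \frac{\delta}{\lambda(2+2\delta)}} \big(e^{-\lambda t} S_{j,+}(t) - e^{-\lambda t} \hat{S}_{j,+}(t)\big)\,dt \\
&\ge \hat{S} \int_0^{\frac{\delta}{\lambda(2+2\delta)}} \big((1 - \lambda\tau)(1 + \delta) - (1 + \delta/2)\big)\,d\tau \\
&= \hat{S} \cdot \frac{\delta^2}{8\lambda(1 + \delta)},
\end{align*}
from which it follows that
\[
\int_0^{\infty} \big|e^{-\lambda t} S_{j,+}(t) - e^{-\lambda t} \hat{S}_{j,+}(t)\big|\,dt = \infty.
\]
By (35), we see that the inequality (52) cannot hold on a set of positive probability.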
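Now suppose that
\[
\liminf_{t \to \infty} e^{-\lambda t} S_{j,+}(t, \omega) < \hat{S}(\omega) \tag{54}
\]
for some $\omega \in \Omega_\infty$. Then there is a sequence of real numbers $t_1 < t_2 < \ldots$ with $t_{i+1} - t_i > \frac{\delta}{\lambda(2-\delta)}$
and a real number $0 < \delta < 1$ such that $e^{-\lambda t_i} S_{j,+}(t_i) < (1 - \delta)\hat{S}$. Therefore,
\[
e^{-\lambda(t_i-\tau)} S_{j,+}(t_i-\tau) \le (1 - \delta)\hat{S} e^{\lambda\tau} \le \frac{(1 - \delta)\hat{S}}{1 - \lambda\tau}, \quad 0 \le \tau < 1/\lambda. \tag{55}
\]
Also, there exists $t_0$ so that for $t > t_0$,
\[
e^{-\lambda t} \hat{S}_{j,+}(t) > (1 - \delta/2)\hat{S}.
\]
Therefore,
\begin{align*}
\int_{t_i}^{t_{i+1}} \big|e^{-\lambda t} S_{j,+}(t) - e^{-\lambda t} \hat{S}_{j,+}(t)\big|\,dt
&\ge \int_{t_{i+1} - \frac{\delta}{\lambda(2-\delta)}}^{t_{i+1}} \big|e^{-\lambda t} S_{j,+}(t) - e^{-\lambda t} \hat{S}_{j,+}(t)\big|\,dt \\
&\ge \int_{t_{i+1} - \frac{\delta}{\lambda(2-\delta)}}^{t_{i+1}} \big(e^{-\lambda t} \hat{S}_{j,+}(t) - e^{-\lambda t} S_{j,+}(t)\big)\,dt \\
&\ge \hat{S} \int_0^{\frac{\delta}{\lambda(2-\delta)}} \left((1 - \delta/2) - \frac{1 - \delta}{1 - \lambda\tau}\right) d\tau \\
&= \hat{S} \left(\frac{\delta}{2\lambda} + \frac{1 - \delta}{\lambda} \log\left(\frac{2 - 2\delta}{2 - \delta}\right)\right),
\end{align*}
where we can verify that $\frac{\delta}{2\lambda} + \frac{1 - \delta}{\lambda} \log\left(\frac{2 - 2\delta}{2 - \delta}\right) > 0$ when $\delta < 1$. Hence
\[
\int_0^{\infty} \big|e^{-\lambda t} S_{j,+}(t) - e^{-\lambda t} \hat{S}_{j,+}(t)\big|\,dt = \infty,
\]
which allows us to conclude that (54) cannot hold on a set of positive probability.
We can now conclude that on $\Omega_\infty$,
\[
\lim_{t \to \infty} e^{-\lambda t} S_{j,+}(t) = \hat{S}
\]
almost surely, which is the desired result.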
6.4 Proof of Lemma 3
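Proof. We will only prove the first statement, the proof of the second statement being
largely the same. To that end, it suffices to show that
\begin{align*}
&E[W^k_{\ell_2\Delta,t}(j) W^k_{\ell_1\Delta,t}(j);\, W^k_{\ell_2\Delta,t}(j) > 1] + E[W^k_{\ell_2\Delta,t}(j) W^k_{\ell_1\Delta,t}(j);\, W^k_{\ell_1\Delta,t}(j) > 1] \\
&\quad= O\big(e^{\lambda\ell_1\Delta} e^{2\lambda\ell_2\Delta} \Delta^3\big),
\end{align*}
with $\ell_1 < \ell_2$. Again, we will only show that the first term satisfies the bound, the proof
for the second term being largely the same. We first note that since $W^k_{\ell_1\Delta,t}(j) \le X_{\ell_1,\Delta}$,
\begin{align*}
E\big[W^k_{\ell_2\Delta,t}(j) W^k_{\ell_1\Delta,t}(j) 1_{\{W^k_{\ell_2\Delta,t}(j) > 1\}}\big]
&= E\Big[E\big[W^k_{\ell_2\Delta,t}(j) W^k_{\ell_1\Delta,t}(j) 1_{\{W^k_{\ell_2\Delta,t}(j) > 1\}} \mid \mathcal{F}_{\Delta(\ell_1+1)}\big]\Big] \\
&\le E\Big[E\big[X_{\ell_1,\Delta} W^k_{\ell_2\Delta,t}(j) 1_{\{W^k_{\ell_2\Delta,t}(j) > 1\}} \mid \mathcal{F}_{\Delta(\ell_1+1)}\big]\Big] \\
&= E\Big[E\big[X_{\ell_1,\Delta} \mid \mathcal{F}_{\Delta(\ell_1+1)}\big] E\big[W^k_{\ell_2\Delta,t}(j) 1_{\{W^k_{\ell_2\Delta,t}(j) > 1\}} \mid \mathcal{F}_{\Delta(\ell_1+1)}\big]\Big].
\end{align*}
The final equality follows because the number of mutations created in the interval $[\Delta\ell_1, \Delta\ell_1 + \Delta)$
is independent of the number of mutations created in $[\Delta\ell_2, \Delta\ell_2 + \Delta)$ and their fate,
given the population size up until time $\Delta(\ell_1 + 1)$. Therefore, using (38), $W^k_{\ell_2\Delta,t}(j) \le X_{\ell_2,\Delta}$
and (39),
\begin{align*}
E\big[W^k_{\ell_2\Delta,t}(j) W^k_{\ell_1\Delta,t}(j) 1_{\{W^k_{\ell_2\Delta,t}(j) > 1\}}\big]
&\le \nu\Delta E\Big[Z_0(\Delta\ell_1) E\Big[E\big[W^k_{\ell_2\Delta,t}(j) 1_{\{W^k_{\ell_2\Delta,t}(j) > 1\}} \mid \mathcal{F}_{\Delta(\ell_2+1)}\big] \,\Big|\, \mathcal{F}_{\Delta(\ell_1+1)}\Big]\Big] \\
&\le \nu\Delta E\Big[Z_0(\Delta\ell_1) E\Big[E\big[X_{\ell_2,\Delta} 1_{\{X_{\ell_2,\Delta} > 1\}} \mid \mathcal{F}_{\Delta(\ell_2+1)}\big] \,\Big|\, \mathcal{F}_{\Delta(\ell_1+1)}\Big]\Big] \\
&\le \nu\Delta E\Big[Z_0(\Delta\ell_1) E\Big[E\big[X_{\ell_2,\Delta}(X_{\ell_2,\Delta} - 1) \mid \mathcal{F}_{\Delta(\ell_2+1)}\big] \,\Big|\, \mathcal{F}_{\Delta(\ell_1+1)}\Big]\Big] \\
&= \nu^3\Delta^3 E\Big[Z_0(\Delta\ell_1) E\big[Z_0(\Delta\ell_2)^2 \mid \mathcal{F}_{\Delta(\ell_1+1)}\big]\Big].
\end{align*}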
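We then use that for $s \le t$,
\[
E[Z_0(t)^2 \mid \mathcal{F}_s] = e^{2\lambda(t-s)} Z_0(s)^2 + \mathrm{Var}(Z_0(t-s))\, Z_0(s),
\]
to conclude that
\begin{align*}
E\big[W^k_{\ell_2\Delta,t}(j) W^k_{\ell_1\Delta,t}(j) 1_{\{W^k_{\ell_2\Delta,t}(j) > 1\}}\big]
&\le \nu^3\Delta^3 e^{2\lambda\Delta(\ell_2 - \ell_1 - 1)} E\big[Z_0(\Delta\ell_1) Z_0(\Delta(\ell_1 + 1))^2\big] \\
&\quad + \nu^3\Delta^3\, \mathrm{Var}\big(Z_0(\Delta(\ell_2 - \ell_1 - 1))\big) E\big[Z_0(\Delta\ell_1) Z_0(\Delta(\ell_1 + 1))\big] \\
&= \nu^3\Delta^3 e^{2\lambda\Delta(\ell_2 - \ell_1)} E[Z_0(\Delta\ell_1)^3] \\
&\quad + \nu^3\Delta^3 e^{2\lambda\Delta(\ell_2 - \ell_1 - 1)}\, \mathrm{Var}(Z_0(\Delta)) E[Z_0(\Delta\ell_1)^2] \\
&\quad + \nu^3\Delta^3\, \mathrm{Var}\big(Z_0(\Delta(\ell_2 - \ell_1 - 1))\big) e^{\lambda\Delta} E[Z_0(\Delta\ell_1)^2].
\end{align*}
The desired result now follows from the assumption that the offspring distribution has a
finite third moment and thus $E[Z_0(t)^3] = O(e^{3\lambda t})$ by Lemma 5 of [28].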
6.5 Proof of Lemma 4
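Proof. Let $\ell$ be a positive integer and let $s > 0$ be such that $\ell\Delta + \Delta < s$. On the event
$\{X_{\ell,\Delta} = 1\}$, define $D^j_{\ell\Delta}(s)$ to be the number of disjoint intervals in $[0, s]$ that the mutation
at time $\ell\Delta$ is present in $j$ individuals, and let $B_{\ell\Delta}(s)$ be the number of individuals alive
at time $s$ descended from the mutation at time $\ell\Delta$. Note that
\[
P(W^k_{\ell\Delta,t}(j) = 1) = P(X_{\ell,\Delta} = 1, D^j_{\ell\Delta}(t) \ge k)(1 + O(\Delta)).
\]
On $\{X_{\ell_1,\Delta} = 1, X_{\ell_2,\Delta} = 1\}$ with $\ell_1 < \ell_2$, let $A$ denote the event that the mutation at time
$\ell_2\Delta$ occurs in the clone started by the mutation at time $\ell_1\Delta$.
We now consider the first term inside the parenthesis in (48), and break it up based on
the value of $B_{\ell_1\Delta}(\ell_2\Delta)$ and whether $A$ occurs or not. Once again, we refrain from writing
$1 + O(\Delta)$ multiplicative factors.
\begin{align*}
&P(W^k_{\ell_2\Delta,t}(j) = 1 \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, W^k_{\ell_1\Delta,t}(j) = 1) \\
&\quad= \sum_{i=1}^{m} P(W^k_{\ell_2\Delta,t}(j) = 1, B_{\ell_1\Delta}(\ell_2\Delta) = i \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, W^k_{\ell_1\Delta,t}(j) = 1) \\
&\quad= \sum_{i=1}^{m} P(W^k_{\ell_2\Delta,t}(j) = 1, B_{\ell_1\Delta}(\ell_2\Delta) = i, A \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, W^k_{\ell_1\Delta,t}(j) = 1) \\
&\qquad + \sum_{i=1}^{m} P(W^k_{\ell_2\Delta,t}(j) = 1, B_{\ell_1\Delta}(\ell_2\Delta) = i, A^c \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, W^k_{\ell_1\Delta,t}(j) = 1).
\end{align*}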
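Note that
\begin{align*}
&P(W^k_{\ell_2\Delta,t}(j) = 1, B_{\ell_1\Delta}(\ell_2\Delta) = i, A \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, W^k_{\ell_1\Delta,t}(j) = 1) \\
&\quad= P(W^k_{\ell_2\Delta,t}(j) = 1, A \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, W^k_{\ell_1\Delta,t}(j) = 1, B_{\ell_1\Delta}(\ell_2\Delta) = i) \\
&\qquad \cdot P(B_{\ell_1\Delta}(\ell_2\Delta) = i \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, W^k_{\ell_1\Delta,t}(j) = 1) \\
&\quad= \frac{P(W^k_{\ell_2\Delta,t}(j) = 1, A, D^j_{\ell_1\Delta}(t) \ge k \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, X_{\ell_1,\Delta} = 1, B_{\ell_1\Delta}(\ell_2\Delta) = i)}{P(D^j_{\ell_1\Delta}(t) \ge k \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, X_{\ell_1,\Delta} = 1, B_{\ell_1\Delta}(\ell_2\Delta) = i)} \\
&\qquad \cdot P(B_{\ell_1\Delta}(\ell_2\Delta) = i \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, W^k_{\ell_1\Delta,t}(j) = 1) \\
&\quad\le \frac{P(W^k_{\ell_2\Delta,t}(j) = 1, A \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, X_{\ell_1,\Delta} = 1, B_{\ell_1\Delta}(\ell_2\Delta) = i)}{P(D^j_{\ell_1\Delta}(t) \ge k \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, X_{\ell_1,\Delta} = 1, B_{\ell_1\Delta}(\ell_2\Delta) = i)} \\
&\qquad \cdot P(B_{\ell_1\Delta}(\ell_2\Delta) = i \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, W^k_{\ell_1\Delta,t}(j) = 1),
\end{align*}
and
\[
P(W^k_{\ell_2\Delta,t}(j) = 1, A \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, X_{\ell_1,\Delta} = 1, B_{\ell_1\Delta}(\ell_2\Delta) = i)
= \nu\Delta\, i\, p^k_{j,+}(t - \ell_2\Delta).
\]
Also note that
\begin{align*}
&P(W^k_{\ell_2\Delta,t}(j) = 1, B_{\ell_1\Delta}(\ell_2\Delta) = i, A^c \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, W^k_{\ell_1\Delta,t}(j) = 1) \\
&\quad= P(W^k_{\ell_2\Delta,t}(j) = 1, A^c \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, W^k_{\ell_1\Delta,t}(j) = 1, B_{\ell_1\Delta}(\ell_2\Delta) = i) \\
&\qquad \cdot P(B_{\ell_1\Delta}(\ell_2\Delta) = i \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, W^k_{\ell_1\Delta,t}(j) = 1) \\
&\quad= (m - i)\Delta\nu\, p^k_{j,+}(t - \ell_2\Delta) \\
&\qquad \cdot P(B_{\ell_1\Delta}(\ell_2\Delta) = i \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, W^k_{\ell_1\Delta,t}(j) = 1).
\end{align*}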
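It follows that
\begin{align*}
&P(W^k_{\ell_2\Delta,t}(j) = 1 \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, W^k_{\ell_1\Delta,t}(j) = 1) \\
&\quad\le \nu\Delta\, p^k_{j,+}(t - \ell_2\Delta) \\
&\qquad \cdot \Bigg(\sum_{i=1}^{m} i\, \frac{P(B_{\ell_1\Delta}(\ell_2\Delta) = i \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, W^k_{\ell_1\Delta,t}(j) = 1)}{P(D^j_{\ell_1\Delta}(t) \ge k \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, X_{\ell_1,\Delta} = 1, B_{\ell_1\Delta}(\ell_2\Delta) = i)} \\
&\qquad\quad + \sum_{i=1}^{m} (m - i) P(B_{\ell_1\Delta}(\ell_2\Delta) = i \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, W^k_{\ell_1\Delta,t}(j) = 1)\Bigg) \\
&\quad\le \nu\Delta\, p^k_{j,+}(t - \ell_2\Delta) \\
&\qquad \cdot \Bigg(m + \sum_{i=1}^{m} i\, \frac{P(B_{\ell_1\Delta}(\ell_2\Delta) = i \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, W^k_{\ell_1\Delta,t}(j) = 1)}{P(D^j_{\ell_1\Delta}(t) \ge k \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, X_{\ell_1,\Delta} = 1, B_{\ell_1\Delta}(\ell_2\Delta) = i)}\Bigg).
\end{align*}
Going back to (48), we can then derive the upper bound
\begin{align*}
I_1(\ell_1, \ell_2) - I_2(\ell_1, \ell_2)
&\le \nu^2\Delta^2 p^k_{j,+}(t - \ell_1\Delta) p^k_{j,+}(t - \ell_2\Delta) \sum_{n=1}^{\infty} n P(Z_0(\ell_1\Delta) = n) \\
&\quad \cdot \sum_{m=1}^{\infty} P(Z_0(\ell_2\Delta) = m \mid W^k_{\ell_1\Delta,t}(j) = 1, Z_0(\ell_1\Delta) = n) \\
&\quad \cdot \sum_{i=1}^{m} i\, \frac{P(B_{\ell_1\Delta}(\ell_2\Delta) = i \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, W^k_{\ell_1\Delta,t}(j) = 1)}{P(D^j_{\ell_1\Delta}(t) \ge k \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, X_{\ell_1,\Delta} = 1, B_{\ell_1\Delta}(\ell_2\Delta) = i)}.
\end{align*}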
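Note that
\begin{align*}
&P(Z_0(\ell_2\Delta) = m \mid W^k_{\ell_1\Delta,t}(j) = 1, Z_0(\ell_1\Delta) = n) \\
&\quad \cdot P(B_{\ell_1\Delta}(\ell_2\Delta) = i \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, W^k_{\ell_1\Delta,t}(j) = 1) \\
&\quad= \frac{P(B_{\ell_1\Delta}(\ell_2\Delta) = i, Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, W^k_{\ell_1\Delta,t}(j) = 1)}{P(W^k_{\ell_1\Delta,t}(j) = 1, Z_0(\ell_1\Delta) = n)}
\end{align*}
and
\begin{align*}
&\frac{P(B_{\ell_1\Delta}(\ell_2\Delta) = i, Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, W^k_{\ell_1\Delta,t}(j) = 1)}{P(D^j_{\ell_1\Delta}(t) \ge k \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, X_{\ell_1,\Delta} = 1, B_{\ell_1\Delta}(\ell_2\Delta) = i)} \\
&\quad= \frac{P(B_{\ell_1\Delta}(\ell_2\Delta) = i, Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, X_{\ell_1,\Delta} = 1, D^j_{\ell_1\Delta}(t) \ge k)}{P(D^j_{\ell_1\Delta}(t) \ge k \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, X_{\ell_1,\Delta} = 1, B_{\ell_1\Delta}(\ell_2\Delta) = i)} \\
&\quad= P(Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, X_{\ell_1,\Delta} = 1, B_{\ell_1\Delta}(\ell_2\Delta) = i).
\end{align*}
It follows that
\begin{align*}
&P(Z_0(\ell_2\Delta) = m \mid W^k_{\ell_1\Delta,t}(j) = 1, Z_0(\ell_1\Delta) = n) \\
&\quad \cdot \frac{P(B_{\ell_1\Delta}(\ell_2\Delta) = i \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, W^k_{\ell_1\Delta,t}(j) = 1)}{P(D^j_{\ell_1\Delta}(t) \ge k \mid Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, X_{\ell_1,\Delta} = 1, B_{\ell_1\Delta}(\ell_2\Delta) = i)} \\
&\quad= \frac{P(Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, X_{\ell_1,\Delta} = 1, B_{\ell_1\Delta}(\ell_2\Delta) = i)}{P(W^k_{\ell_1\Delta,t}(j) = 1, Z_0(\ell_1\Delta) = n)}.
\end{align*}
Since
\[
\frac{P(Z_0(\ell_1\Delta) = n)}{P(W^k_{\ell_1\Delta,t}(j) = 1, Z_0(\ell_1\Delta) = n)} = \frac{1}{P(W^k_{\ell_1\Delta,t}(j) = 1 \mid Z_0(\ell_1\Delta) = n)} = \frac{1}{\nu\Delta\, n\, p^k_{j,+}(t - \ell_1\Delta)},
\]
we can write
\begin{align*}
I_1(\ell_1, \ell_2) - I_2(\ell_1, \ell_2)
&\le \nu\Delta\, p^k_{j,+}(t - \ell_2\Delta) \\
&\quad \cdot \sum_{n=1}^{\infty} \sum_{m=1}^{\infty} \sum_{i=1}^{m} i P(Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, X_{\ell_1,\Delta} = 1, B_{\ell_1\Delta}(\ell_2\Delta) = i).
\end{align*}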
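Now,
\begin{align*}
&P(Z_0(\ell_2\Delta) = m, Z_0(\ell_1\Delta) = n, X_{\ell_1,\Delta} = 1, B_{\ell_1\Delta}(\ell_2\Delta) = i) \\
&\quad= P(Z_0(\ell_2\Delta) = m, B_{\ell_1\Delta}(\ell_2\Delta) = i \mid Z_0(\ell_1\Delta) = n, X_{\ell_1,\Delta} = 1) \\
&\qquad \cdot P(X_{\ell_1,\Delta} = 1 \mid Z_0(\ell_1\Delta) = n) P(Z_0(\ell_1\Delta) = n) \\
&\quad= \nu\Delta\, n\, P(Z_0(\ell_1\Delta) = n)\, p_i(\Delta(\ell_2 - \ell_1))\, p_{n-1,m-i}(\Delta(\ell_2 - \ell_1)),
\end{align*}
where we recall that $p_{n,m}(t) = P(Z_0(t) = m \mid Z_0(0) = n)$ and $p_m(t) = p_{1,m}(t)$. It follows
that
\begin{align*}
I_1(\ell_1, \ell_2) - I_2(\ell_1, \ell_2)
&\le \nu^2\Delta^2 p^k_{j,+}(t - \ell_2\Delta) \sum_{n=1}^{\infty} n P(Z_0(\ell_1\Delta) = n) \sum_{m=1}^{\infty} \sum_{i=1}^{m} i\, p_{n-1,m-i}(\Delta(\ell_2 - \ell_1))\, p_i(\Delta(\ell_2 - \ell_1)) \\
&= \nu^2\Delta^2 p^k_{j,+}(t - \ell_2\Delta) \sum_{n=1}^{\infty} n P(Z_0(\ell_1\Delta) = n) \sum_{i=1}^{\infty} i\, p_i(\Delta(\ell_2 - \ell_1)) \sum_{m=i}^{\infty} p_{n-1,m-i}(\Delta(\ell_2 - \ell_1)) \\
&= \nu^2\Delta^2 p^k_{j,+}(t - \ell_2\Delta) \sum_{n=1}^{\infty} n P(Z_0(\ell_1\Delta) = n) \sum_{i=1}^{\infty} i\, p_i(\Delta(\ell_2 - \ell_1)) \\
&= \nu^2\Delta^2 p^k_{j,+}(t - \ell_2\Delta) e^{\lambda\ell_2\Delta} \le \nu^2\Delta^2 \theta^k e^{\lambda\ell_2\Delta}.
\end{align*}
This is the desired result.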
6.6 Proof of Proposition 1
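Proof. To begin with, define the extinction time of the branching process with $Z_0(0) = 1$
as
\[
\tau_0 = \inf\{t > 0 : Z_0(t) = 0\},
\]
and note that the extinction probability $p = P(\tau_0 < \infty)$ satisfies $p \in [0, 1)$ by the
assumption $m > 1$. We want to prove that for any $\varepsilon > 0$,
\[
\lim_{N \to \infty} P(|\tau_N - t_N| > \varepsilon \mid \Omega_\infty) = 0,
\]
where $\tau_N$ and $t_N$ are defined by (2) and (4), respectively. We begin by establishing a
simple lower bound on $\tau_N$ for large $N$.
Lemma 5. For $\rho \in (0, 1)$ define $s_N(\rho) := \frac{\rho}{\lambda} \log(N)$. Then
\[
P(\tau_N < s_N(\rho)) = O\big(N^{2(\rho - 1)}\big).
\]
Proof. Since $m > 1$, we know that $(Z_0(t))_{t \ge 0}$ is a submartingale. Therefore,
\[
P(\tau_N < s_N(\rho)) = P\left(\sup_{t \le s_N(\rho)} Z_0(t) \ge N\right) \le \frac{1}{N^2} E\big[Z_0(s_N(\rho))^2\big] = O\big(N^{2(\rho - 1)}\big).
\]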
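We next establish a simple result about the rate of convergence of $e^{-\lambda t} Z_0(t) \to Y$.
Lemma 6. For $z > 0$,
\[
\lim_{a \to \infty} P\left(\sup_{t \ge a} |Z_0(t) e^{-\lambda t} - Y| \ge zY,\, \Omega_\infty\right) = 0.
\]
Proof. Fix $a > 0$ and $\delta > 0$. On the event $\Omega_\infty$, $Y$ is a random variable on $(0, \infty)$ with a
strictly positive continuous density function, see (3). Thus, there exists $\eta > 0$ such that
$P(Y < \eta, \Omega_\infty) \le \delta$. We can therefore write
\[
P\left(\sup_{t \ge a} |Z_0(t) e^{-\lambda t} - Y| \ge zY,\, \Omega_\infty\right) \le \delta + P\left(\sup_{t \ge a} |Z_0(t) e^{-\lambda t} - Y| \ge z\eta,\, \Omega_\infty\right).
\]
For arbitrary $b > a$, we see from the triangle inequality that
\[
\sup_{a \le t \le b} |Z_0(t) e^{-\lambda t} - Y| \le |Z_0(a) e^{-\lambda a} - Y| + \sup_{a \le t \le b} |Z_0(t) e^{-\lambda t} - Z_0(a) e^{-\lambda a}|.
\]
Thus, for $z > 0$,
\begin{align*}
&P\left(\sup_{a \le t \le b} |Z_0(t) e^{-\lambda t} - Y| \ge z\eta,\, \Omega_\infty\right) \\
&\quad\le P\big(|Z_0(a) e^{-\lambda a} - Y| \ge z\eta/2\big) + P\left(\sup_{a \le t \le b} |Z_0(t) e^{-\lambda t} - Z_0(a) e^{-\lambda a}| \ge z\eta/2\right). \tag{56}
\end{align*}
Since by (51),
\begin{align*}
E\big[(Z_0(a) e^{-\lambda a} - Y)^2\big] &= O\big(e^{-\lambda a}\big), \\
E\big[(Z_0(a) e^{-\lambda a} - Z_0(b) e^{-\lambda b})^2\big] &= O\big(e^{-\lambda a}\big),
\end{align*}
Markov's and Doob's inequalities can be applied to (56) to see that
\[
P\left(\sup_{a \le t \le b} |Z_0(t) e^{-\lambda t} - Y| \ge z\eta,\, \Omega_\infty\right) = O\big(e^{-\lambda a} \eta^{-2} z^{-2}\big).
\]
Since
\[
P\left(\sup_{t \ge a} |Z_0(t) e^{-\lambda t} - Y| \ge z\eta,\, \Omega_\infty\right) = \lim_{b \to \infty} P\left(\sup_{a \le t \le b} |Z_0(t) e^{-\lambda t} - Y| \ge z\eta,\, \Omega_\infty\right),
\]
it follows that
\[
\limsup_{a \to \infty} P\left(\sup_{t \ge a} |Z_0(t) e^{-\lambda t} - Y| \ge zY,\, \Omega_\infty\right) \le \delta,
\]
and because $\delta$ is arbitrary the desired result follows.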
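We are now ready to analyze the difference $\tau_N - t_N$ on $\Omega_\infty$. We first consider the case
$\tau_N < t_N - \varepsilon$. Define the difference function
\[
\omega_0(t) = Z_0(t) - Y e^{\lambda t}.
\]
On $\Omega_\infty$, by the definition of $t_N$ in (4),
\begin{align*}
Z_0(\tau_N) &= Y e^{\lambda\tau_N} + \omega_0(\tau_N) \\
&= N e^{\lambda(\tau_N - t_N)} + \omega_0(\tau_N),
\end{align*}
which implies for $\tau_N < t_N - \varepsilon$,
\[
\omega_0(\tau_N) = N\big(1 - e^{\lambda(\tau_N - t_N)}\big) + (Z_0(\tau_N) - N) \ge N\big(1 - e^{\lambda(\tau_N - t_N)}\big) \ge N\big(1 - e^{-\lambda\varepsilon}\big).
\]
Take $0 < \rho < 1$. Applying Lemma 5,
\begin{align*}
P(\tau_N < t_N - \varepsilon,\, \Omega_\infty)
&\le P\big(\omega_0(\tau_N) \ge N(1 - e^{-\lambda\varepsilon}),\, \tau_N < t_N - \varepsilon,\, \Omega_\infty\big) \\
&\le P(\tau_N \le s_N(\rho)) + P\big(\omega_0(\tau_N) \ge N(1 - e^{-\lambda\varepsilon}),\, s_N(\rho) < \tau_N < t_N - \varepsilon,\, \Omega_\infty\big) \\
&= O\big(N^{2(\rho - 1)}\big) + P\big(\omega_0(\tau_N) \ge N(1 - e^{-\lambda\varepsilon}),\, s_N(\rho) < \tau_N < t_N - \varepsilon,\, \Omega_\infty\big).
\end{align*}
Thus we consider
\begin{align*}
&P\big(\omega_0(\tau_N) \ge N(1 - e^{-\lambda\varepsilon}),\, s_N(\rho) < \tau_N < t_N - \varepsilon,\, \Omega_\infty\big) \\
&\quad\le P\left(\sup_{s_N(\rho) < t < t_N - \varepsilon} (Z_0(t) - Y e^{\lambda t}) \ge N(1 - e^{-\lambda\varepsilon}),\, \Omega_\infty\right) \\
&\quad\le P\left(\sup_{s_N(\rho) < t < t_N - \varepsilon} (Z_0(t) e^{-\lambda t} - Y) e^{\lambda(t_N - \varepsilon)} \ge N(1 - e^{-\lambda\varepsilon}),\, \Omega_\infty\right) \\
&\quad\le P\left(\sup_{s_N(\rho) < t} (Z_0(t) e^{-\lambda t} - Y) \ge Y(e^{\lambda\varepsilon} - 1),\, \Omega_\infty\right),
\end{align*}
where in the last step, we use the definition of $t_N$. We can now apply Lemma 6 to get
\[
\lim_{N \to \infty} P(\tau_N < t_N - \varepsilon,\, \Omega_\infty) = 0.
\]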
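We next consider $\tau_N > t_N + \varepsilon$. Note that on the event $\{\tau_N > t_N + \varepsilon\} \cap \Omega_\infty$,
\[
-\omega_0(t_N + \varepsilon) = Y e^{\lambda(t_N + \varepsilon)} - Z_0(t_N + \varepsilon) = N e^{\lambda\varepsilon} - Z_0(t_N + \varepsilon) \ge N\big(e^{\lambda\varepsilon} - 1\big).
\]
Therefore,
\begin{align*}
P(\tau_N > t_N + \varepsilon,\, \Omega_\infty)
&\le P\big(Y e^{\lambda(t_N + \varepsilon)} - Z_0(t_N + \varepsilon) \ge N(e^{\lambda\varepsilon} - 1),\, \Omega_\infty\big) \\
&= P\big(Y - Z_0(t_N + \varepsilon) e^{-\lambda(t_N + \varepsilon)} \ge Y(1 - e^{-\lambda\varepsilon}),\, \Omega_\infty\big).
\end{align*}
Since $P\big(t_N \le \frac{1}{2\lambda} \log(N),\, \Omega_\infty\big) = P(Y \ge \sqrt{N},\, \Omega_\infty) \to 0$ as $N \to \infty$, we can write
\begin{align*}
&P\big(Y - Z_0(t_N + \varepsilon) e^{-\lambda(t_N + \varepsilon)} \ge Y(1 - e^{-\lambda\varepsilon}),\, \Omega_\infty\big) \\
&\quad\le P(Y \ge \sqrt{N}) + P\left(\sup_{t > \frac{1}{2\lambda} \log(N)} \big(Y - e^{-\lambda t} Z_0(t)\big) \ge Y(1 - e^{-\lambda\varepsilon}),\, \Omega_\infty\right).
\end{align*}
We can then apply Lemma 6 to get
\[
\lim_{N \to \infty} P(\tau_N > t_N + \varepsilon,\, \Omega_\infty) = 0,
\]
which concludes the proof.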
6.7 Proof of Proposition 2
Proof. We use a similar argument to the proof of Theorem 1. First, we break the total
number of mutations $M(t)$ into
\[
M(t) = M_+(t) - M_-(t),
\]
where $M_+(t)$ represents the total number of mutations generated up until time $t$, and
$M_-(t)$ represents the number of mutations which belong to $M_+(t)$ but die out before time
$t$. Obviously, these two processes are increasing in time. The limit theorems for $M(t)$
will follow from limit theorems for $M_+(t)$ and $M_-(t)$. Because of the almost identical
arguments, we will focus on the analysis of $M_+(t)$.
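As in the proof of Theorem 1, we define the approximations
\[
\hat{M}_+(t) := \nu \int_0^t Y e^{\lambda s}\,ds \tag{57}
\]
and
\[
\bar{M}_+(t) := \nu \int_0^t Z_0(s)\,ds, \tag{58}
\]
as well as the Riemann sum approximation
\[
\bar{M}_{+,\Delta}(t) := \nu\Delta \sum_{\ell=0}^{\lfloor t/\Delta \rfloor} Z_0(\ell\Delta). \tag{59}
\]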
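Note that the only difference between (28) and (57) is the probability $p^k_{j,+}(t - s)$ which
does not appear in (57). Therefore, we can simply follow the proofs of Lemmas 1 and 2
by replacing $S^k_{j,+}(t)$, $\hat{S}^k_{j,+}(t)$, $\bar{S}^k_{j,+}(t)$, $\bar{S}^k_{j,+,\Delta}(t)$ and $\theta$ with $M_+(t)$, $\hat{M}_+(t)$, $\bar{M}_+(t)$, $\bar{M}_{+,\Delta}(t)$
and 1, respectively, and we will get
\[
E\big|M_+(t) - \hat{M}_+(t)\big| = O\big(t e^{\lambda t/2}\big), \tag{60}
\]
which implies
\[
\int_0^{\infty} e^{-\lambda t} E\big|M_+(t) - \hat{M}_+(t)\big|\,dt < \infty. \tag{61}
\]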
Note that $\lim_{t \to \infty} e^{-\lambda t} \hat{M}_+(t) = \nu Y/\lambda$ exists and $M_+(t)$ is an increasing process. By replacing
the corresponding terms in the proof of Proposition 4, we can get
\[
\lim_{t \to \infty} e^{-\lambda t} M_+(t) = \nu Y \int_0^{\infty} e^{-\lambda s}\,ds = \nu Y/\lambda, \tag{62}
\]
almost surely. Similarly,
\[
\lim_{t \to \infty} e^{-\lambda t} M_-(t) = \nu Y \int_0^{\infty} e^{-\lambda s} p_0(s)\,ds, \tag{63}
\]
almost surely. The fixed-time result (9) follows immediately from (62) and (63).
Then, by following the proof in Section 4.5, we can get the fixed-size result (10) for
the total number of mutations,
\[
\lim_{N \to \infty} N^{-1} M(\tau_N) = \nu \int_0^{\infty} e^{-\lambda s}(1 - p_0(s))\,ds,
\]
in probability.
6.8 Proof of Corollary 1
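Proof. (1) For the birth-death process, we can write
\begin{align*}
p_0(t) &= \frac{p(e^{\lambda t} - 1)}{e^{\lambda t} - p}, \\
p_j(t) &= \frac{q^2 e^{\lambda t}}{(e^{\lambda t} - p)^2} \cdot \left(\frac{e^{\lambda t} - 1}{e^{\lambda t} - p}\right)^{j-1}, \quad j \ge 1, \tag{64}
\end{align*}
see expression (B.1) in [13]. Therefore, for $j \ge 1$,
\[
\int_0^{\infty} e^{-\lambda s} p_j(s)\,ds = \frac{1}{\lambda} \int_0^{\infty} \frac{q^2 e^{-\lambda s}}{(1 - p e^{-\lambda s})^2} \cdot \left(\frac{1 - e^{-\lambda s}}{1 - p e^{-\lambda s}}\right)^{j-1} \cdot \lambda e^{-\lambda s}\,ds.
\]
Using the substitution $x := e^{-\lambda s}$, $dx = -\lambda e^{-\lambda s}\,ds$, we obtain
\[
\int_0^{\infty} e^{-\lambda s} p_j(s)\,ds = \frac{q^2}{\lambda} \int_0^1 \frac{x}{(1 - px)^2} \cdot \left(\frac{1 - x}{1 - px}\right)^{j-1} dx.
\]
We again change variables, this time $y := (1 - x)/(1 - px)$, in which case
\begin{align*}
x &= (1 - y)/(1 - py), \\
dx &= -q/(1 - py)^2\,dy, \\
1 - px &= q/(1 - py).
\end{align*}
In addition, $y = 1$ for $x = 0$ and $y = 0$ for $x = 1$, which implies
\[
\int_0^{\infty} e^{-\lambda s} p_j(s)\,ds = \frac{q}{\lambda} \int_0^1 (1 - py)^{-1}(1 - y) y^{j-1}\,dy. \tag{65}
\]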
To get the sum representation in (13), it suffices to note that
\begin{align*}
\int_0^1 (1 - py)^{-1}(1 - y) y^{j-1}\,dy &= \sum_{k=0}^{\infty} p^k \int_0^1 (1 - y) y^{j+k-1}\,dy \\
&= \sum_{k=0}^{\infty} \frac{p^k}{(j + k)(j + k + 1)}.
\end{align*}
To get the pure-birth process result, it suffices to note that $p = 0$, $q = 1$ and
\[
\int_0^1 (1 - y) y^{j-1}\,dy = \frac{1}{j(j + 1)}.
\]
(2) Follows from the same calculations as in (1).
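(3) By (64), for the birth-death process,
\[
1 - p_0(t) = \frac{(1 - p) e^{\lambda t}}{e^{\lambda t} - p} = \frac{q e^{\lambda t}}{e^{\lambda t} - p}.
\]
Therefore,
\[
\int_0^{\infty} e^{-\lambda s}(1 - p_0(s))\,ds = \frac{1}{\lambda} \int_0^{\infty} \frac{q}{1 - p e^{-\lambda s}} \cdot \lambda e^{-\lambda s}\,ds.
\]
Using the substitution $x := e^{-\lambda s}$, $dx = -\lambda e^{-\lambda s}\,ds$, we obtain
\[
\int_0^{\infty} e^{-\lambda s}(1 - p_0(s))\,ds = \frac{1}{\lambda} \int_0^1 \frac{q}{1 - px}\,dx =
\begin{cases}
\dfrac{1}{\lambda}, & p = 0, \\[2mm]
-\dfrac{q \log(q)}{\lambda p}, & 0 < p < 1.
\end{cases} \tag{66}
\]
(4) Follows from the same calculations as in (3).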
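The closed forms in parts (1) and (3) can be sanity-checked numerically. The following Python sketch compares a composite Simpson approximation of the $y$-integral in (65) with the series in (13), and of the $x$-integral in (66) with its logarithmic closed form, in both cases omitting the common prefactors $q/\lambda$ and $1/\lambda$. The helper names and the grid size are illustrative assumptions.

```python
import math


def simpson(f, a, b, n=2000):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


def integral_65(j, p):
    """Numerical value of the integral of (1-py)^(-1) (1-y) y^(j-1) over [0, 1]."""
    return simpson(lambda y: (1.0 - y) * y ** (j - 1) / (1.0 - p * y), 0.0, 1.0)


def series_13(j, p, terms=2000):
    """Truncated series: sum over k of p^k / ((j+k)(j+k+1))."""
    return sum(p ** k / ((j + k) * (j + k + 1)) for k in range(terms))


def integral_66(p):
    """Numerical value of the integral of q / (1 - p x) over [0, 1]."""
    q = 1.0 - p
    return simpson(lambda x: q / (1.0 - p * x), 0.0, 1.0)


def closed_form_66(p):
    """Closed form of the same integral: 1 if p = 0, else -q log(q) / p."""
    q = 1.0 - p
    return 1.0 if p == 0 else -q * math.log(q) / p
```

For moderate $p$ the two sides should agree to many digits. For $p$ close to 1 the integrands become steep near the right endpoint, so a finer grid would be needed.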
6.9 Derivation of expression (19)
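By writing $M_j(t) = M(t) - \sum_{k=0}^{j-1} S_k(t)$, it follows from Corollary 1 that conditional on
$\Omega_\infty$,
\begin{align*}
\lim_{t \to \infty} e^{-\lambda t} M_j(t) &= \frac{\nu q Y}{\lambda} \int_0^1 (1 - py)^{-1}(1 - y) \sum_{k=j}^{\infty} y^{k-1}\,dy \\
&= \frac{\nu q Y}{\lambda} \int_0^1 (1 - py)^{-1} y^{j-1}\,dy.
\end{align*}
Similarly,
\[
\lim_{N \to \infty} N^{-1} M_j(\tau_N) = \frac{\nu q}{\lambda} \int_0^1 (1 - py)^{-1} y^{j-1}\,dy.
\]
It follows that
\begin{align*}
\lim_{t \to \infty} \frac{S_j(t)}{M_j(t)} = \lim_{N \to \infty} \frac{S_j(\tau_N)}{M_j(\tau_N)} &= \frac{\int_0^1 (1 - py)^{-1}(1 - y) y^{j-1}\,dy}{\int_0^1 (1 - py)^{-1} y^{j-1}\,dy} \\
&= 1 - \frac{\int_0^1 (1 - py)^{-1} y^{j}\,dy}{\int_0^1 (1 - py)^{-1} y^{j-1}\,dy} =: \phi_j(p).
\end{align*}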
6.10 Proof that $\phi_j(p)$ is strictly decreasing
Here, we show that for each $j \ge 1$, $\phi_j(p)$ given by the last expression in Section 6.9 is
strictly decreasing in $p$. Set
\begin{align*}
a &:= \int_0^1 (1 - py)^{-2} y^{j+1}\,dy \int_0^1 (1 - py)^{-1} y^{j-1}\,dy, \\
b &:= \int_0^1 (1 - py)^{-2} y^{j}\,dy \int_0^1 (1 - py)^{-1} y^{j}\,dy.
\end{align*}
It suffices to show that $a > b$ for each $p \in (0, 1)$. First, note that we can write
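\[
a = \int_0^1 \int_0^1 (1 - py)^{-2} y^{j+1} (1 - px)^{-1} x^{j-1}\,dy\,dx
\]
and
\[
b = \int_0^1 \int_0^1 (1 - py)^{-2} y^{j} (1 - px)^{-1} x^{j}\,dy\,dx,
\]
which implies
\begin{align*}
a - b &= \int_0^1 \int_0^1 (1 - py)^{-2}(1 - px)^{-1} y^{j} x^{j-1}(y - x)\,dy\,dx \\
&= \int_0^1 \int_0^x (1 - py)^{-2}(1 - px)^{-1} y^{j} x^{j-1}(y - x)\,dy\,dx \\
&\quad + \int_0^1 \int_x^1 (1 - py)^{-2}(1 - px)^{-1} y^{j} x^{j-1}(y - x)\,dy\,dx.
\end{align*}
The latter integral can be rewritten as follows:
\begin{align*}
&\int_0^1 \int_x^1 (1 - py)^{-2}(1 - px)^{-1} y^{j} x^{j-1}(y - x)\,dy\,dx \\
&\quad= \int_0^1 \int_0^y (1 - py)^{-2}(1 - px)^{-1} y^{j} x^{j-1}(y - x)\,dx\,dy \\
&\quad= -\int_0^1 \int_0^x (1 - px)^{-2}(1 - py)^{-1} x^{j} y^{j-1}(y - x)\,dy\,dx,
\end{align*}
which implies
\[
a - b = \int_0^1 \int_0^x (1 - py)^{-1}(1 - px)^{-1} y^{j-1} x^{j-1}(y - x)\big((1 - py)^{-1} y - (1 - px)^{-1} x\big)\,dy\,dx.
\]
Since
\[
\frac{y}{1 - py} - \frac{x}{1 - px} = \frac{y - x}{(1 - py)(1 - px)},
\]
we can finally conclude that
\[
a - b = \int_0^1 \int_0^x (1 - py)^{-2}(1 - px)^{-2} y^{j-1} x^{j-1}(y - x)^2\,dy\,dx > 0
\]
for each $p \in (0, 1)$.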
6.11 Derivation of expression (36)
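To derive expression (36) in the main text, we note that $(1 - py)^{-1} = \sum_{k=0}^{\infty} (py)^k$ for
$0 < p < 1$ and $0 \le y \le 1$, which implies
\begin{align*}
\int_0^1 (1 - py)^{-1}(1 - y)\,dy &= \sum_{k=0}^{\infty} p^k \int_0^1 y^k (1 - y)\,dy \\
&= \sum_{k=0}^{\infty} \frac{p^k}{k + 1} - \sum_{k=0}^{\infty} \frac{p^k}{k + 2}.
\end{align*}
Since $\sum_{k=1}^{\infty} \frac{x^k}{k} = -\log(1 - x)$, we obtain
\begin{align*}
\int_0^1 (1 - py)^{-1}(1 - y)\,dy &= -\frac{\log(q)}{p} - \frac{1}{p^2}\big(-\log(q) - p\big) \\
&= \frac{q}{p^2} \log(q) + \frac{1}{p}.
\end{align*}
Therefore, applying expression (18), we can write for $0 < p < 1$,
\[
\phi_1(p) = -\frac{p}{\log(q)} \int_0^1 (1 - py)^{-1}(1 - y)\,dy = -\frac{p + q \log(q)}{p \log(q)}.
\]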
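As a quick check of the closed form, the Python sketch below compares it with the ratio of the two power series that define $\phi_1(p)$, namely $\sum_k p^k/((k+1)(k+2))$ over $\sum_k p^k/(k+1)$. The function names and the number of series terms are illustrative assumptions.

```python
import math


def phi1_closed(p):
    """Closed form -(p + q log q) / (p log q) for 0 < p < 1, with q = 1 - p."""
    q = 1.0 - p
    return -(p + q * math.log(q)) / (p * math.log(q))


def phi1_series(p, terms=5000):
    """Ratio of truncated power series for the numerator and denominator integrals."""
    num = sum(p ** k / ((k + 1) * (k + 2)) for k in range(terms))
    den = sum(p ** k / (k + 1) for k in range(terms))
    return num / den
```

Evaluating both functions at a few values of $p$ in $(0, 1)$ should give matching results, with values that decrease in $p$ from just below $1/2$. For $p$ very close to 0 the closed form suffers from cancellation, so the series is the safer choice there.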
Acknowledgments
EBG was supported in part by NSF grant CMMI-1552764, NIH grant R01 CA241137,
funds from the Norwegian Centennial Chair grant and the Doctoral Dissertation Fellowship
from the University of Minnesota. K. Leder was supported in part with funds from
NSF award CMMI 2228034 and Research Council of Norway Grant 309273.
References
[1] K. Zeng, Y.-X. Fu, S. Shi, and C.-I. Wu, “Statistical tests for detecting positive selection by utilizing high-frequency variants,” Genetics, vol. 174, no. 3, pp. 1431–1439, 2006.
[2] G. Achaz, “Frequency spectrum neutrality tests: one for all and all for one,” Genetics, vol. 183, no. 1, pp. 249–258, 2009.
[3] A. Sottoriva, H. Kang, Z. Ma, T. A. Graham, M. P. Salomon, J. Zhao, P. Marjoram, K. Siegmund, M. F. Press, D. Shibata, et al., “A big bang model of human colorectal tumor growth,” Nat. Genet., vol. 47, no. 3, pp. 209–216, 2015.
[4] S. Ling, Z. Hu, Z. Yang, F. Yang, Y. Li, P. Lin, K. Chen, L. Dong, L. Cao, Y. Tao, et al., “Extremely high genetic diversity in a single tumor points to prevalence of non-darwinian cell evolution,” Proc. Natl. Acad. Sci. USA, vol. 112, no. 47, pp. E6496–E6505, 2015.
[5] M. J. Williams, B. Werner, C. P. Barnes, T. A. Graham, and A. Sottoriva, “Identification of neutral tumor evolution across cancer types,” Nat. Genet., vol. 48, no. 3, p. 238, 2016.
[6] S. Venkatesan and C. Swanton, “Tumor evolutionary principles: how intratumor heterogeneity influences cancer treatment and outcome,” Am. Soc. Clin. Oncol. Educ. Book, vol. 36, pp. e141–e149, 2016.
[7] A. Davis, R. Gao, and N. Navin, “Tumor evolution: Linear, branching, neutral or punctuated?,” Biochim. Biophys. Acta Rev. Cancer, vol. 1867, no. 2, pp. 151–161, 2017.
[8] R. Durrett, “Population genetics of neutral mutations in exponentially growing cancer cell populations,” Ann. Appl. Probab., vol. 23, no. 1, p. 230, 2013.
[9] R. Durrett, “Branching process models of cancer,” in Branching Process Models of Cancer, pp. 1–63, Springer, 2015.
[10] I. Bozic, J. M. Gerold, and M. A. Nowak, “Quantifying clonal and subclonal passenger mutations in cancer evolution,” PLoS Comput. Biol., vol. 12, no. 2, p. e1004731, 2016.
[11] H. Ohtsuki and H. Innan, “Forward and backward evolutionary processes and allele frequency spectrum in a cancer cell population,” Theor. Popul. Biol., vol. 117, pp. 43–50, 2017.
[12] K. N. Dinh, R. Jaksik, M. Kimmel, A. Lambert, S. Tavaré, et al., “Statistical inference for the evolutionary history of cancer genomes,” Stat. Sci., vol. 35, no. 1, pp. 129–144, 2020.
[13] E. B. Gunnarsson, K. Leder, and J. Foo, “Exact site frequency spectra of neutrally evolving tumors: A transition between power laws reveals a signature of cell viability,” Theoretical Population Biology, vol. 142, pp. 67–90, 2021.
[14] H.-R. Tung and R. Durrett, “Signatures of neutral evolution in exponentially growing tumors: A theoretical perspective,” PLOS Computational Biology, vol. 17, no. 2, p. e1008701, 2021.
[15] C. Bonnet and H. Leman, “Site frequency spectrum of a rescued population under rare resistant mutations,” arXiv preprint arXiv:2303.04069, 2023.
[16] A. Lambert, “The allelic partition for coalescent point processes,” Markov Process. Relat. Fields, vol. 15, no. 3, pp. 359–386, 2009.
[17] A. Lambert, “The coalescent of a sample from a binary branching process,” Theoretical Population Biology, vol. 122, pp. 30–35, 2018.
[18] S. G. Johnston, “The genealogy of Galton-Watson trees,” 2019.
[19] S. C. Harris, S. G. G. Johnston, and M. I. Roberts, “The coalescent structure of continuous-time Galton–Watson trees,” The Annals of Applied Probability, vol. 30, no. 3, pp. 1368–1414, 2020.
[20] B. Johnson, Y. Shuai, J. Schweinsberg, and K. Curtius, “Estimating single cell clonal dynamics in human blood using coalescent theory,” bioRxiv, pp. 2023–02, 2023.
[21] J. Schweinsberg and Y. Shuai, “Asymptotics for the site frequency spectrum associated with the genealogy of a birth and death process,” arXiv preprint arXiv:2304.13851, 2023.
[22] R. Durrett, Probability models for DNA sequence evolution. Springer Science & Business Media, 2008.
[23] D. Cheek and T. Antal, “Genetic composition of an exponentially growing cell population,” Stochastic Processes and their Applications, 2020.
[24] D. Cheek and T. Antal, “Mutation frequencies in a birth–death branching process,” Ann. Appl. Probab., vol. 28, no. 6, pp. 3922–3947, 2018.
[25] T. E. Harris, “The theory of branching processes,” 1964.
[26] K. B. Athreya and P. E. Ney, Branching processes. Courier Corporation, 2004.
[27] K. Athreya and P. Ney, Branching Processes. New York: Springer-Verlag, 1972.
[28] J. Foo, K. Leder, and J. Zhu, “Escape times for branching processes with random mutational fitness effects,” Stochastic Processes and Their Applications, vol. 124, no. 11, pp. 3661–3697, 2014.
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
Recent work of Sottoriva, Graham, and collaborators have led to the controversial claim that exponentially growing tumors have a site frequency spectrum that follows the 1/f law consistent with neutral evolution. This conclusion has been criticized based on data quality issues, statistical considerations, and simulation results. Here, we use rigorous mathematical arguments to investigate the site frequency spectrum in the two-type model of clonal evolution. If the fitnesses of the two types are λ0 < λ1, then the site frequency spectrum is c/fα where α = λ0/λ1. This is due to the advantageous mutations that produce the founders of the type 1 population. Mutations within the growing type 0 and type 1 populations follow the 1/f law. Our results show that, in contrast to published criticisms, neutral evolution in an exponentially growing tumor can be distinguished from the two-type model using the site frequency spectrum.
Article
Full-text available
Article
The site frequency spectrum (SFS) is a popular summary statistic of genomic data. While the SFS of a constant-sized population undergoing neutral mutations has been extensively studied in population genetics, the rapidly growing amount of cancer genomic data has attracted interest in the spectrum of an exponentially growing population. Recent theoretical results have generally dealt with special or limiting cases, such as considering only cells with an infinite line of descent, assuming deterministic tumor growth, or taking large-time or large-population limits. In this work, we derive exact expressions for the expected SFS of a cell population that evolves according to a stochastic branching process, first for cells with an infinite line of descent and then for the total population, evaluated either at a fixed time (fixed-time spectrum) or at the stochastic time at which the population reaches a certain size (fixed-size spectrum). We find that while the rate of mutation scales the SFS of the total population linearly, the rates of cell birth and cell death change the shape of the spectrum at the small-frequency end, inducing a transition between a 1/j2 power-law spectrum and a 1/j spectrum as cell viability decreases. We show that this insight can in principle be used to estimate the ratio between the rate of cell death and cell birth, as well as the mutation rate, using the site frequency spectrum alone. Although the discussion is framed in terms of tumor dynamics, our results apply to any exponentially growing population of individuals undergoing neutral mutations.
Article
We study a simple model of DNA evolution in a growing population of cells. Each cell contains a nucleotide sequence which randomly mutates at cell division. Cells divide according to a branching process. Following typical parameter values in bacteria and cancer cell populations, we take the mutation rate to zero and the final number of cells to infinity. We prove that almost every site (entry of the nucleotide sequence) is mutated in only a finite number of cells, and these numbers are independent across sites. However independence breaks down for the rare sites which are mutated in a positive fraction of the population. The model is free from the popular but disputed infinite sites assumption. Violations of the infinite sites assumption are widespread while their impact on mutation frequencies is negligible at the scale of population fractions. Some results are generalised to allow for cell death, selection, and site-specific mutation rates. For illustration we estimate mutation rates in a lung adenocarcinoma.
Article
Recent years have seen considerable work on inference about cancer evolution from mutations identified in cancer samples. Much of the modeling work has been based on classical models of population genetics, generalized to accommodate time-varying cell population size. Reverse-time, genealogical views of such models, commonly known as coalescents, have been used to infer aspects of the past of growing populations. Another approach is to use branching processes, the simplest scenario being the classical linear birth-death process. Inference from evolutionary models of DNA often exploits summary statistics of the sequence data, a common one being the so-called Site Frequency Spectrum (SFS). In a bulk tumor sequencing experiment, we can estimate for each site at which a novel somatic point mutation has arisen, the proportion of cells that carry that mutation. These numbers are then grouped into collections of sites which have similar mutant fractions. We examine how the SFS based on birth-death processes differs from those based on the coalescent model. This may stem from the different sampling mechanisms in the two approaches. However, we also show that despite this, they are quantitatively comparable for the range of parameters typical for tumor cell populations. We also present a model of tumor evolution with selective sweeps, and demonstrate how it may help in understanding the history of a tumor as well as the influence of data pre-processing. We illustrate the theory with applications to several examples from The Cancer Genome Atlas tumors.
Article
At time 0, start a time-continuous binary branching process, where particles give birth to a single particle independently (at a possibly time-dependent rate) and die independently (at a possibly time-dependent and age-dependent rate). A particular case is the classical birth-death process. Stop this process at time T>0. It is known that the tree spanned by the N tips alive at time T of the tree thus obtained (called a reduced tree or coalescent tree) is a coalescent point process (CPP), which basically means that the depths of interior nodes are independent and identically distributed (iid). Now select each of the N tips independently with probability y (Bernoulli sample). It is known that the tree generated by the selected tips, which we will call the Bernoulli sampled CPP, is again a CPP. Now instead, select exactly k tips uniformly at random among the N tips (a k-sample). We show that the tree generated by the selected tips is a mixture of Bernoulli sampled CPPs with the same parent CPP, over some explicit distribution of the sampling probability y. An immediate consequence is that the genealogy of a k-sample can be obtained by the realization of k random variables, first the random sampling probability Y and then the k-1 node depths which are iid conditional on Y=y.
Article
How is genetic variability shaped by natural selection, demographic factors, and random genetic drift? To approach this question, we introduce and analyze a number of probability models beginning with the basics, and ending at the frontiers of current research. Throughout the book, the theory is developed in close connection with examples from the biology literature that illustrate the use of these results. Along the way, there are many numerical examples and graphs to illustrate the conclusions. This is the second edition and is twice the size of the first one. The material on recombination and the stepping stone model have been greatly expanded, there are many results form the last five years, and two new chapters on diffusion processes develop that viewpoint. This book is written for mathematicians and for biologists alike. No previous knowledge of concepts from biology is assumed, and only a basic knowledge of probability, including some familiarity with Markov chains and Poisson processes. The book has been restructured into a large number of subsections and written in a theorem-proof style, to more clearly highlight the main results and allow readers to find the results they need and to skip the proofs if they desire. Rick Durrett received his Ph.D. in operations research from Stanford University in 1976. He taught in the UCLA mathematics department before coming to Cornell in 1985. He is the author of eight books and 160 research papers, most of which concern the use of probability models in genetics and ecology. He is the academic father of 39 Ph.D. students and was recently elected to the National Academy of Science.
Article
This volume develops results on continuous time branching processes and applies them to study rates of tumor growth, extending classic work on the Luria-Delbruck distribution. As a consequence, the authors calculate the probability that mutations that confer resistance to treatment are present at detection and quantify the extent of tumor heterogeneity. As applications, the authors evaluate ovarian cancer screening strategies and give rigorous proofs for results of Haeno and Michor concerning tumor metastasis. These notes should be accessible to students who are familiar with Poisson processes and continuous time Markov chains. Richard Durrett is mathematics professor at Duke University, USA. He is the author of 8 books, over 200 journal articles, and has supervised more than 40 Ph.D. students. Most of his current research concerns the applications of probability to biology: ecology, genetics, and most recently cancer.
Article
First, we revisit a classic two-type branching process which describes cell proliferation and mutation; widespread application has been seen in cancer and microbial modelling. As the mutation rate tends to zero and the population size to infinity, the mutation times converge to a Poisson process. This yields the number of mutants and clone sizes. Other limits and exact results are also explored. Second, we extend the model to consider mutations at multiple sites on the genome. The number of mutants in the two-type model characterises the mean site frequency spectrum in the multiple-site model. Our predictions are consistent with genomic data from tumours.
Article
A cancer grows from a single cell, thereby constituting a large cell population. In this work, we are interested in how mutations accumulate in a cancer cell population. We provide a theoretical framework for the stochastic process in a cancer cell population and obtain near-exact expressions for the allele frequency spectrum, or AFS (only a continuous approximation is involved), from both forward and backward treatments under a simple setting: all cells undergo cell divisions and die at constant rates, b and d, respectively, such that the entire population grows exponentially. This setting means that once a parental cancer cell is established, in the following growth phase, all mutations are assumed to have no effect on b or d (i.e., they are neutral, or passengers). Our theoretical results show that the difference from organismal population genetics is mainly in the coalescent time scale, and that the mutation rate is defined per cell division, not per time unit (e.g., generation). Except for these two factors, the basic logic is very similar between organismal and cancer population genetics, indicating that a number of well-established theories of organismal population genetics could be translated to cancer population genetics with simple modifications.
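The setting described above (constant division rate b, constant death rate d, neutral mutations arising at cell division) can be explored with a small simulation. The sketch below is illustrative only and is not the method of the abstract: the function name `simulate_sfs`, the parameter `w`, and the mutation model (each daughter cell independently acquires one new, unique mutation with probability `w` at division, i.e. an infinite-sites assumption) are all assumptions introduced here. Because the population is stopped when it first reaches a given size, only the order of events matters, so event times are not simulated.

```python
import random
from collections import Counter


def simulate_sfs(b, d, w, n_target, rng):
    """Grow a birth-death population from one cell until it has n_target cells.

    Returns a Counter mapping j -> number of mutations carried by exactly
    j cells (the site frequency spectrum), or None if the population dies out.
    Assumption: at each division, each daughter gains one new unique
    mutation with probability w.
    """
    cells = [frozenset()]  # each cell is represented by its set of mutation ids
    next_id = 0
    while 0 < len(cells) < n_target:
        i = rng.randrange(len(cells))  # all cells are equally likely to act next
        if rng.random() < b / (b + d):  # the next event is a division
            parent = cells[i]
            daughters = []
            for _ in range(2):
                muts = parent
                if rng.random() < w:
                    muts = parent | {next_id}
                    next_id += 1
                daughters.append(muts)
            cells[i] = daughters[0]
            cells.append(daughters[1])
        else:  # the next event is a death: swap-remove cell i
            cells[i] = cells[-1]
            cells.pop()
    if not cells:
        return None
    carriers = Counter(m for cell in cells for m in cell)
    return Counter(carriers.values())


sfs = simulate_sfs(b=1.0, d=0.5, w=0.5, n_target=200, rng=random.Random(1))
if sfs is not None:
    for j in sorted(sfs)[:5]:
        print(j, sfs[j])
```

Storing a full mutation set per cell keeps the sketch short but is memory-hungry; for large populations, tracking the genealogy and placing mutations afterwards would scale better.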