Content available from Nature Communications
ARTICLE
A Bayesian approach to infer recombination
patterns in coronaviruses
Nicola F. Müller 1✉, Kathryn E. Kistler1,2 & Trevor Bedford1,2,3
As shown during the SARS-CoV-2 pandemic, phylogenetic and phylodynamic methods are
essential tools to study the spread and evolution of pathogens. One of the central
assumptions of these methods is that the shared history of pathogens isolated from different
hosts can be described by a branching phylogenetic tree. Recombination breaks this
assumption. This makes it problematic to apply phylogenetic methods to study recombining
pathogens, including, for example, coronaviruses. Here, we introduce a Markov chain Monte
Carlo approach that allows inference of recombination networks from genetic sequence data
under a template switching model of recombination. Using this method, we first show that
recombination is extremely common in the evolutionary history of SARS-like coronaviruses.
We then show how recombination rates across the genome of the human seasonal cor-
onaviruses 229E, OC43 and NL63 vary with rates of adaptation. This suggests that recom-
bination could be beneficial to fitness of human seasonal coronaviruses. Additionally, this
work sets the stage for Bayesian phylogenetic tracking of the spread and evolution of SARS-
CoV-2 in the future, even as recombinant viruses become prevalent.
https://doi.org/10.1038/s41467-022-31749-8 OPEN
1Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA. 2Molecular and Cellular Biology Program, University
of Washington, Seattle, WA, USA. 3Howard Hughes Medical Institute, Seattle, WA, USA. ✉email: nicola.felix.mueller@gmail.com
NATURE COMMUNICATIONS | (2022) 13:4186 | https://doi.org/10.1038/s41467-022-31749-8 | www.nature.com/naturecommunications 1
Content courtesy of Springer Nature, terms of use apply. Rights reserved
Since the emergence of SARS-CoV-2, genetic sequence data
has been used to study its evolution and spread. Genetic
sequences have, for example, been used to investigate natural
versus lab origins of SARS-CoV-2 (ref. 1), when SARS-CoV-2 was
introduced into the US (ref. 2), as well as whether genetic variants differ
in growth rate (ref. 3). These analyses often rely on phylogenetic and
phylodynamic approaches, at the heart of which are phylogenetic
trees. Such trees denote how viruses isolated from different
individuals are related and contain information about the trans-
mission dynamics connecting these infections4.
Along with mutations introduced by errors during replication
or by anti-viral molecules (for example ref. 5), different recom-
bination processes contribute to genetic diversity in RNA viruses
(reviewed by Simon-Loriere and Holmes6). Reassortment in
segmented viruses (generally negative-sense RNA viruses), such
as influenza or rotaviruses, can produce offspring that carry
segments from different parental lineages7. In other RNA viruses
(generally positive-sense RNA viruses), such as flaviviruses and
coronaviruses, homologous recombination can combine different
parts of a genome from different parental lineages in absence of
physically separate segments on the genome of those viruses8.
The main mechanism of this process is thought to be via template
switching9, where the template for replication is switched during
the replication process. Recombination breakpoints in experiments
appear to be largely random, with selection favoring
recombination breakpoints in some areas of the genome10. Recent
work shows that recombination breakpoints occur more fre-
quently in the spike region of betacoronaviruses, such as SARS-
CoV-211. While the reason why the recombination process
evolved in RNA viruses is not completely understood6, there are
different explanations of why recombination may be beneficial. In
general, recombination is selected if breaking up the linkage
disequilibrium is beneficial12. Recombination can help purge
deleterious mutations from the genome, such as proposed by
the mutational-deterministic hypothesis13. It can also increase the
rate at which a fit combination of mutations occurs, such as stated
by the Hill–Robertson effect14. Alternatively, recombination in
RNA viruses may also just be a by-product of the processivity of
the viral polymerase6.
Recombination poses a unique challenge to phylogenetic
methods, as it violates the very central assumption that the evo-
lutionary history of individuals can be denoted by branching
phylogenetic trees. Recombination breaks this assumption and
requires representation of the shared ancestry of a set of
sequences as a network. Not accounting for this can lead to biased
phylogenetic and phylodynamic inferences15,16. An analytic
description of recombination is provided by the coalescent with
recombination, which models a backward in time process where
lineages can coalesce and recombine17. When recombination is
considered backward in time, a single lineage results in two
parent lineages, with one parent lineage carrying the genetic
material from one side of a random recombination breakpoint
and the other parent lineage carrying the genetic material from
the other side of this breakpoint. This equates to the backward in
time equivalent of template switching where there is one
recombination breakpoint per recombination event.
Currently, some Bayesian phylogenetic approaches exist that
infer recombination networks, or ancestral recombination graphs
(ARG), but are either approximate or do not directly allow for
efficient model-based inference. Some approaches consider tree-
based networks18,19, where the networks consist of a base tree
with recombination edges that always attach to edges on the base
tree. Alternative approaches rely on approximations to the coa-
lescent with recombination20,21, consider a different model of
recombination16, or seek to infer recombination networks absent
an explicit recombination model22. Bayesian and maximum
likelihood methods have also been used to account for gene
transfer events when describing the evolutionary history of spe-
cies from multiple loci (for example, see refs. 23,24). Additionally,
methods have been used to describe non-tree-like evolution using
split trees25,26. There is, however, a gap for Bayesian inference of
recombination networks under the coalescent with recombination
that can be applied to study pathogens, such as coronaviruses.
In order to fill this gap, we here develop a Markov chain Monte
Carlo (MCMC) approach to efficiently infer recombination net-
works under the coalescent with recombination for sequences
sampled over time. This framework allows joint estimation of
recombination networks, effective population sizes, recombina-
tion rates, and parameters describing mutations over time from
genetic sequence data sampled through time. We explicitly do not
make additional approximations to characterize the recombination
process, other than those of the coalescent with
recombination17, such as, for example, the approximation of tree-based
networks. We implemented this approach as an open-source
software package for BEAST2 (ref. 27), allowing us to use the
various evolutionary models already implemented in BEAST2.
We then use the coalescent with recombination to study the
recombination patterns of SARS-like, MERS, and 3 seasonal
human coronaviruses.
Results
Widespread recombination in SARS-like coronaviruses.
Recombination has been implicated at the beginning of the SARS-
CoV-1 outbreak28 and has been suggested as the origin of the
receptor-binding domain in SARS-CoV-229, though Boni et al.30
report that recombination is unlikely to be the origin of SARS-
CoV-2. While this strongly suggests non-tree-like evolution, the
evolutionary history of SARS-like viruses has, out of necessity,
mainly been denoted using phylogenetic trees.
We here reconstruct the recombination history of SARS-like
viruses, which includes SARS-CoV-1 and SARS-CoV-2 as well as
related bat31–33 and pangolin34 coronaviruses. To do so, we infer
the recombination network of SARS-like viruses under the
coalescent with recombination. We assumed that the rates of
recombination and effective population sizes were constant over
time and that the genomes evolved under a GTR+Γ4 model.
Similar to the estimate in ref. 30, we used a fixed evolutionary rate
of 5 × 10^-4 mutations per nucleotide per year. We fixed the
evolutionary rate since the time interval of sampling between
individual isolates is relatively short compared to the time scale of
the evolutionary history of SARS-like viruses. This means that the
sampling times themselves offer little insight into the evolu-
tionary rates and, in absence of other calibration points, there is
little information about the evolutionary rate in this dataset. This,
in turn, means that if the evolutionary rate we used here is
inaccurate then the timings of common ancestors will also be
inaccurate. Therefore, exact timings and calendar dates in this
analysis should be taken as guideposts rather than formal
estimates.
As shown in Fig. 1A and Fig. S1A, the evolutionary history of
SARS-like viruses is characterized by frequent recombination
events, including events ancestral to SARS-CoV-2 (see Fig. S2). This
means that only relatively short segments of the genomes code for
the same tree (see Figs. S3 and S1B). Consequently, characterizing
the evolutionary history of SARS-like viruses by a single, genome-
wide phylogeny is bound to be inaccurate and potentially
misleading. We infer the recombination rate in SARS-like viruses
to be approximately 2 × 10^-6 recombination events per site per
year, which is about 200 times lower than the evolutionary rate.
This rate translates to about 0.06 recombination events per
lineage per year, which is slightly lower than the estimated rate of
recombination for the seasonal human coronaviruses and the
reassortment rates for pandemic 1918-like influenza A/H1N1 and
influenza B viruses, which are all around 0.1−0.2 reassortment
events per lineage per year16. This recombination rate is a
function of co-infection rates, probability of recombination
occurring upon co-infection, and selection. As such, the
recombination rate we infer here will be (possibly substantially)
lower than the within-host rate of recombination.
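The conversion between the per-site and per-lineage rates quoted above is a single multiplication by the genome length. The following is a minimal sketch of that arithmetic; the genome length of 30,000 nucleotides is an assumed round figure for a coronavirus genome, not a value stated in the text, and the function name is hypothetical.

```python
# Convert a per-site recombination rate (events per site per year)
# into a per-lineage rate (events per lineage per year).
GENOME_LENGTH = 30_000  # ASSUMPTION: approximate coronavirus genome length in nucleotides


def per_lineage_rate(per_site_rate: float, genome_length: int = GENOME_LENGTH) -> float:
    """Per-lineage rate for a lineage that carries the full genome."""
    return per_site_rate * genome_length


# SARS-like viruses: roughly 2 x 10^-6 events per site per year.
print(f"{per_lineage_rate(2e-6):.2f}")  # prints 0.06
```

Under the same assumed genome length, a per-site rate of 1 × 10^-5 corresponds to roughly 0.3 events per lineage per year.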
These recombination events were not evenly distributed across
the genome and, instead, were relatively more frequent in areas outside
those coding for ORF1ab (Figs. S4 and S5). Additionally, our
inference suggests that rates of recombination are slightly elevated
on spike subunit S1 compared to subunit S2 (Fig. S4). If we track
recombination events ancestral to the SARS-CoV-2 lineage that
are inferred to have happened in the last 100 years, we find
evidence for recombination breakpoints occurring close to the 5’
end of the spike, just outside the coding region (see Fig. S5).
Additionally, we find support for recombination breakpoints
toward the 3’ end of the spike, near the nucleocapsid gene (see
Fig. S5). If we assume that during genome replication in
coronaviruses template shifts occur randomly on the genome10,
differences in observed recombination rates could be explained by
selection favoring recombinant lineages with breakpoints 3’ to
ORF1ab relative to elsewhere on the genome.
We next investigate when different viruses last shared a
common ancestor (MRCA) along the genome (see Figs. 1B and
S6). RmYN02 (ref. 33) shares the MRCA with SARS-CoV-2 on the part
of the genome that codes for ORF1ab (Fig. 1B). We additionally
find strong evidence for one or more recombination events in the
ancestry of RmYN02 at the beginning of spike (Fig. 1B). This
recent recombination event is unlikely to have occurred with a
recent ancestor of any of the coronaviruses included in this
dataset since the common ancestor of RmYN02 with any other
virus in the dataset is approximately the same (Fig. S6A). In other
words, large parts of the spike protein of RmYN02 are as related
to SARS-CoV-2 as SARS-CoV-2 is to SARS-CoV-1. The common
ancestor timings of P2S across the genome are equal between
RaTG13 and SARS-CoV-2 (Fig. S6C). RaTG13 on the other hand
is more closely related to SARS-CoV-2 than P2S (Fig. S6B) across
the entire genome.
When looking at when different viruses last shared a common
ancestor anywhere on the genome (in other words: when the
ancestral lineages of two viruses last crossed paths), we find that
RmYN02 has the most recent MRCA with SARS-CoV-2
(Fig. S6C). The median estimate of the most recent MRCA
between SARS-CoV-2 and RmYN02 is 1986 (95% CI:
1973–2005). For RaTG13 it is 1975 (95% CI: 1964–1988), for
P2S it is 1949 (95% CI: 1907–1973) and with SARS-CoV-1 it is
1834 (95% CI: 1707–1935). These estimates are contingent on a
fixed evolutionary rate of 5 × 10^-4 per nucleotide per year.
Rates of recombination are associated with rates of adaptation
in human seasonal coronaviruses. We next investigate recom-
bination patterns in MERS-CoV, which has over 2500 confirmed
cases in humans, as well as in human seasonal coronaviruses
229E, OC43, and NL63, which have widespread seasonal circu-
lation in humans. As for the SARS-like viruses, we jointly infer
recombination networks, rates of recombination, and population
sizes for these viruses. We assumed that the genomes evolved
under a GTR+Γ4 model and, in contrast to the analysis of
SARS-like viruses, inferred the evolutionary rates. We observe frequent
recombination in the history of all four viruses, wherein genetic
ancestry is described by a network rather than a strictly branching
phylogeny (Fig. 2A–D and Fig. S6A).
The human seasonal coronaviruses all have recombination
rates around 1 × 10^-5 per site per year (Fig. S7). This is around
10–20 times lower than the evolutionary rate (Fig. S8). In contrast
to the recombination rates, the evolutionary rates vary greatly
across the human seasonal coronaviruses, with rates between a
median of 1.3 × 10^-4 (95% highest posterior density interval
(HPD) 1.1−1.5 × 10^-4) for NL63 and median rates of 2.5 × 10^-4
(95% HPD 2.2−2.7 × 10^-4) and 2.1 × 10^-4 (95% HPD
1.9−2.3 × 10^-4) for 229E and OC43, respectively (Fig. S8). These evolutionary
Fig. 1 Evolutionary history of SARS-like viruses. A Maximum clade credibility network of SARS-like viruses. Blue dots denote samples and green dots
recombination events. B Common ancestor times of Wuhan-Hu1 (SARS-CoV-2) with different SARS-like viruses on different positions of the genome. The
y-axis denotes common ancestor times on the log scale. The line denotes the median common ancestor time, while the colored area denotes the 95%
highest posterior density interval. C Most recent time anywhere on the genome that Wuhan-Hu1 shared a common ancestor with different SARS-like
viruses. The error bars denote the upper and lower bound of the 95% highest posterior density interval. The MCC network and common ancestor times are
provided as a Source Data file.
rates are substantially lower than those estimated for SARS-CoV-2
(1.1 × 10^-3 substitutions per site per year; ref. 35), which are more in
line with our estimates for the evolutionary rates of MERS with a
median rate of 6.9 × 10^-4 (95% HPD 6.0−7.9 × 10^-4).
Evolutionary rate estimates can be time-dependent, with datasets
spanning more time estimating lower rates of evolution than
those spanning less time36. In turn, this means that the
evolutionary rate estimates for SARS-CoV-2 will likely be lower
the more time passes. It is unclear though if it will approximate
the evolutionary rates of other seasonal coronaviruses in the
long run.
On a per-lineage basis, the estimated recombination rate for
seasonal coronaviruses translates into around 0.1–0.3 recombina-
tion events per lineage and year (Fig. 2E). Recombination events
defined here are a product of co-infection, recombination, and
selection of recombinant viruses. Interestingly, the rate at which
recombination events occur is highly similar to the rate at which
reassortment events occur in human influenza viruses (Fig. 2E and
ref. 16). If we assume similar selection pressures for recombinant
coronaviruses compared to reassortant influenza viruses, this would
indicate similar co-infection rates in influenza and coronaviruses.
The incidence of coronaviruses in patients with respiratory illness
over 12 seasons in western Scotland has been found to be
lower (7–17%) than for influenza viruses (13–34%) but to be of the
same order of magnitude37. Considering that seasonal corona-
viruses typically are less symptomatic than influenza viruses, it is
not unreasonable to assume that annual incidence, and therefore
likely the annual co-infection rates, are comparable between
influenza and coronaviruses.
Compared to human seasonal coronaviruses, recombination
occurs around 3 times more often for MERS-CoV (Fig. 2E).
MERS-CoV mainly circulates in camels and occasionally spills
over into humans38. MERS-CoV infections are highly prevalent
in camels, with close to 100% of adult camels showing antibodies
against MERS-CoV39. Higher incidence, and thus higher rates of
co-infection, could therefore account for higher rates of
recombination in MERS-CoV compared to the human seasonal
coronaviruses.
We next tested whether parts of the genome with higher rates
of recombination are also associated with higher rates of
adaptation. To do so, we allowed for different relative rates of
recombination within the region 5’ of the spike (i.e. mostly
ORF1ab), spike itself, and everything 3’ of the spike. We
computed recombination rate ratios on each of these three
sections of the genome as the recombination rate on that section
divided by the mean rate on the other two sections. We infer that
recombination rates are elevated in the spike protein of all human
seasonal coronaviruses considered here (Fig. 3 and Figs. S9 and S10).
This is consistent with other work estimating higher rates of
recombination on the spike protein of betacoronaviruses11.
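The rate ratio defined above (the rate on one section divided by the mean rate on the other two) can be sketched as follows. The function name and the example rates are illustrative assumptions, not estimates from this analysis.

```python
def rate_ratios(rates: dict[str, float]) -> dict[str, float]:
    """For each genome section, divide its rate by the mean rate of all other sections."""
    ratios = {}
    for name, rate in rates.items():
        others = [r for other, r in rates.items() if other != name]
        ratios[name] = rate / (sum(others) / len(others))
    return ratios


# Made-up per-site recombination rates for the three genome sections.
example = {"5' of spike": 1.0e-5, "spike": 3.0e-5, "3' of spike": 2.0e-5}
for section, ratio in rate_ratios(example).items():
    print(section, round(ratio, 2))
```

A ratio above 1 indicates that recombination on that section is elevated relative to the rest of the genome. In a Bayesian analysis the ratio would be computed for every posterior sample so that its uncertainty can be summarized as an HPD interval.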
We then computed the rates of adaptation on different parts of
the genomes of the seasonal human coronaviruses using the
approach described in refs. 40,41. This approach does not
explicitly consider trees to compute the rates of adaptation on
different parts of the genomes and is not affected by
recombination41. We find that sections of the genome with
relatively higher rates of adaptation correspond to sections of the
genome with relatively higher rates of recombination (Fig. 3). In
particular, recombination and adaptation are elevated on the
section of the genome that codes for the spike protein and are
lower elsewhere.
We next investigated whether these trends hold when looking
only at the spike. The spike protein is made up of two subunits: S1
and S2. S1 binds to the host cell receptor, while S2 facilitates
fusion of the viral and cellular membrane42. Rates of adaptation
have been shown to be high in S1, but not S2, for 229E and
OC4341. While the rates of adaptation are relatively low overall
for NL63, there is still some evidence that they are elevated in S1
compared to S241.
To test whether recombination rates vary with rates of
adaptation on the subunits of the spike as well, we inferred the
recombination rates from the spike only, allowing for different
rates of recombination on S1 versus the rest of the spike. We find
that the rates of recombination are elevated on S1 for 229E and
OC43 compared to the rest of the spike gene (Fig. 3). This is
consistent with strong absolute rates of adaptation on S1 on these
Fig. 2 Recombination networks and rates for coronaviruses MERS, 229E, OC43, and NL63. Recombination networks for MERS (A) and seasonal human
coronaviruses 229E (B), OC43 (C), and NL63 (D). E Recombination rates (per lineage and year) for the different coronaviruses compared to reassortment
rates in seasonal human influenza A/H3N2 and influenza B viruses as estimated under the coalescent with reassortment using whole-genome influenza
sequences sampled over multiple decades16. For OC43 and NL63, the parts of the recombination networks that stretch beyond 1950 are not shown to
increase the readability of more recent parts of the networks. The error bars denote the upper and lower bound of the 95% highest posterior density
interval. All MCC networks are provided as a Source Data file.
two viruses. For NL63, we find weak evidence for the rate on S2 to
be slightly higher than on S1 (Fig. 3), even though the rates of
adaptation are inferred to be higher on S1. The absolute rate of
adaptation in S1 of NL63 is, however, substantially lower than for
229E or OC43. Additionally, the uncertainty around the estimates
on adaptation rate ratios between the two subunits for NL63 is
rather large and includes no difference at all. Overall, these results
suggest that particular recombination events that have resulted in
recombinant viruses are either positively or negatively selected.
Elevated rates of recombination in areas where adaptation is
stronger have been described for other organisms (reviewed in
ref. 43). Alternatively, higher rates of recombination could also be
due to mechanistic reasons, as has been suggested in the case of
SARS-CoV-2 (ref. 44).
To further investigate this, we next computed the rates of
recombination on fitter and less fit parts of the recombination
networks of 229E, OC43, and NL63. To do so, we first classify
each edge of the inferred posterior distribution of the recombina-
tion networks into fit and unfit based on how long a lineage
survives into the future. Fit edges are those that have descendants
at least 1, 2, 5, or 10 years into the future, and unfit edges are
those that do not. We then computed the rates of recombination
on both types of edges for the entire posterior distribution of
networks. Overall, we do not find that fit edges show relatively
higher rates of recombination (see Fig. S11). The simplest
explanation is that we do not have enough data points to measure
recombination rates on unfit edges, that is, on the parts of the
recombination network where
selection had too little time to shape which lineages survive and
which go extinct. An alternative explanation for why we see
elevated rates of recombination in the spike protein, but do not
observe a population level fitness benefit could be that most
(outside of spike) recombinants could be detrimental to fitness
with few (within spike) having little fitness effect at all.
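The fit versus unfit comparison described above can be sketched for a single network as follows. This is a minimal illustration under stated assumptions: the `Edge` record, its fields, and the way recombination events are assigned to edges are hypothetical simplifications, and the actual analysis repeats the calculation over the whole posterior distribution of networks.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    length_years: float    # duration of the edge
    survival_years: float  # ASSUMPTION: time from the edge to its latest sampled descendant
    recombinations: int    # recombination events assigned to this edge


def class_rates(edges: list[Edge], threshold_years: float) -> dict[str, float]:
    """Recombination events per lineage-year on fit and unfit edges."""
    events = {"fit": 0, "unfit": 0}
    lengths = {"fit": 0.0, "unfit": 0.0}
    for edge in edges:
        # An edge is "fit" if it has descendants at least `threshold_years` into the future.
        key = "fit" if edge.survival_years >= threshold_years else "unfit"
        events[key] += edge.recombinations
        lengths[key] += edge.length_years
    return {k: (events[k] / lengths[k] if lengths[k] > 0 else float("nan")) for k in events}


# Toy network with four edges and a 5-year threshold.
toy = [Edge(2.0, 6.0, 1), Edge(3.0, 0.5, 0), Edge(1.0, 12.0, 0), Edge(1.0, 1.5, 1)]
print(class_rates(toy, threshold_years=5.0))
```

Repeating the calculation for thresholds of 1, 2, 5, and 10 years gives one pair of rates per threshold, which can then be compared across posterior samples.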
Discussion
Though not yet highly prevalent, evidence for recombination in
SARS-CoV-2 has started to appear (refs. 45–48). As such, it is crucial to
know the extent to which recombination is expected to shape
SARS-CoV-2 in the coming years, to have methods to identify
recombination, and to perform phylogenetic reconstruction in the
presence of recombination. The results shown here indicate that
some recombinant viruses are either positively or negatively
selected. Estimating the deleterious load of viruses before and
after recombination using ancestral sequence reconstruction49
could help shed light on which sequences are favored during
recombination. Furthermore, having additional sequences to
reconstruct recombination patterns in the seasonal coronaviruses
should clarify the role recombination plays in the long-term
evolution of these viruses.
While their impact on the evolutionary dynamics of SARS-
CoV-2 remains unclear, the likely rise of future SARS-CoV-2
recombinants will further necessitate methods that allow phylo-
genetic and phylodynamic inferences to be performed in the
presence of recombination50. In absence of that, recombination
has to be either ignored, leading to biased phylogenetic and
phylodynamic reconstruction15, or non-recombinant parts of the
Fig. 3 Comparison of recombination rates with rates of adaptation on different parts of the genomes of seasonal human coronaviruses 229E, OC43,
and NL63. Association between estimated relative recombination rate (x-axis) and relative adaptation rate (y-axis) for three different seasonal human
coronaviruses: 229E, OC43, and NL63. These estimates are shown for different parts of the genome, indicated by the different colors. These results come from
two different types of analysis: one using spike only (subunit 1 over subunit 2, shown in yellow) and one using the full genome (shown in orange, blue, and
green). The rate ratios denote the rate on a part of the genome divided by the average rate on the two other parts of the genome. The error bars of the
recombination rates (x-axis) denote the upper and lower bounds of the 95% HPD intervals of the estimates of relative recombination rates. The error bars
of the rates of adaptation are computed using 100 bootstrapped outgroups and alignments when computing the rates of adaptation. Source data are
provided as a Source Data file.
[Fig. 4 diagram labels: time axis running from present to past; coalescent events; recombination events; lineages with and without sampled descendants.]
Fig. 4 Example recombination network. Events that can occur on a
recombination network as considered here. We consider events to occur
from the present backward in time to the past (as is the norm when looking
at coalescent processes). Lineages can be added upon sampling events,
which occur at predefined points in time and are conditioned on.
Recombination events split the path of a lineage in two, with everything on
one side of a recombination breakpoint going in one direction and
everything on the other side of a breakpoint going in the other direction.
genome have to be used for analyses, reducing the precision of
these methods. Our approach addresses this gap by providing a
Bayesian framework to infer recombination networks. To facilitate
easy adoption, we implemented the method so that analyses
can be set up following the same workflow as regular
BEAST2 (ref. 27) analyses. Extending the current suite of population
dynamic models, such as birth–death models51 or models that
account for population structure52,53, will further increase the
applicability of recombination models to study the spread of
pathogens.
Methods
Coalescent with recombination. The coalescent with recombination models a
backward in time coalescent and recombination process17. In this process, three
different events are possible: sampling, coalescence, and recombination. Sampling
events happen at predefined points in time. Recombination events happen at a rate
proportional to the number of coexisting lineages at any point in time. Recom-
bination events split the path of a lineage in two, with everything on one side of a
recombination breakpoint going in one ancestral direction and everything on the
other side of a breakpoint going in the other direction. As shown in Fig. 4, the
two parental lineages after a recombination event each carry a subset of the gen-
ome. In reality, the viruses corresponding to those two lineages still carry the full
genome, but only a part of it will have sampled descendants. In other words, only a
part of the genome carried by a lineage at any time may impact the genome of a
future lineage that is sampled. The probability of actually observing a recombination
event on lineage l is proportional to how much genetic material that lineage
carries. This can be computed as the difference between the last and first nucleotide
position that is carried by l, which we denote as $\mathcal{L}(l)$. Coalescent events happen
between co-existing lineages at a rate proportional to the number of pairs of
coexisting lineages at any point in time and inversely proportional to the effective
population size. The parent lineage at each coalescent event will carry genetic
material corresponding to the union of the genetic material of the two child
lineages.
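The bookkeeping of which genetic material a lineage carries can be sketched with lists of inclusive position intervals. This is an illustrative sketch, not the BEAST2 implementation; the names `span`, `split`, and `merge` and the interval representation are assumptions made for this example.

```python
Loci = list[tuple[int, int]]  # sorted, non-overlapping, inclusive position intervals


def span(loci: Loci) -> int:
    """L(l): difference between the last and first nucleotide position carried."""
    return loci[-1][1] - loci[0][0]


def split(loci: Loci, breakpoint: int) -> tuple[Loci, Loci]:
    """Backward-in-time recombination: positions <= breakpoint go to one parent,
    positions > breakpoint go to the other parent."""
    left, right = [], []
    for start, end in loci:
        if end <= breakpoint:
            left.append((start, end))
        elif start > breakpoint:
            right.append((start, end))
        else:  # the breakpoint falls inside this interval
            left.append((start, breakpoint))
            right.append((breakpoint + 1, end))
    return left, right


def merge(a: Loci, b: Loci) -> Loci:
    """Coalescence: the parent carries the union of both children's material."""
    merged: Loci = []
    for start, end in sorted(a + b):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


genome = [(0, 29_999)]  # assumed genome of 30,000 positions
left, right = split(genome, 9_999)
print(span(left), span(right), merge(left, right))
```

With this representation, a breakpoint is only observable if it falls between the first and last carried position, which is why the number of such positions, `span(loci)`, scales the recombination rate on a lineage.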
Posterior probability. In order to perform joint Bayesian inference of recombi-
nation networks together with the parameters of the associated models, we use a
MCMC algorithm to characterize the joint posterior density. The posterior density
is denoted as:
$$P(N, \mu, \theta, \rho \mid D) = \frac{P(D \mid N, \mu)\, P(N \mid \theta, \rho)\, P(\mu, \theta, \rho)}{P(D)}, \tag{1}$$
where N denotes the recombination network, μ the evolutionary model, θ the
effective population size and ρ the recombination rate. The multiple sequence
alignment, that is the data, is denoted D. P(D∣N,μ) denotes the network likelihood,
P(N∣θ,ρ) the network prior and P(μ,θ,ρ) the parameter priors. As is usually done
in Bayesian phylogenetics, we assume that P(μ,θ,ρ) = P(μ)P(θ)P(ρ).
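In an MCMC setting the normalizing constant P(D) never needs to be evaluated, because it cancels in the acceptance ratio. The following sketch shows the posterior of Eq. 1 on the log scale together with a generic Metropolis-Hastings acceptance step. The function names are hypothetical, and the acceptance rule is the textbook one rather than a description of any specific operator used here.

```python
import math
import random


def log_unnormalized_posterior(log_likelihood: float,
                               log_network_prior: float,
                               log_param_priors: list[float]) -> float:
    """log P(D|N,mu) + log P(N|theta,rho) + log P(mu) + log P(theta) + log P(rho).
    The constant log P(D) is omitted because it cancels between states."""
    return log_likelihood + log_network_prior + sum(log_param_priors)


def accept(log_post_current: float, log_post_proposed: float,
           log_hastings_ratio: float = 0.0, rng=random) -> bool:
    """Generic Metropolis-Hastings acceptance decision on the log scale."""
    log_alpha = log_post_proposed - log_post_current + log_hastings_ratio
    return rng.random() < math.exp(min(0.0, log_alpha))


current = log_unnormalized_posterior(-100.0, -20.0, [-1.0, -2.0, -3.0])
proposed = log_unnormalized_posterior(-95.0, -21.0, [-1.0, -2.0, -3.0])
print(current, proposed, accept(current, proposed))
```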
Using a Bayesian approach has several advantages. In particular, it allows us to
account for uncertainty in the parameter and network estimates. Additionally, it
allows balancing different sources of information against each other. The
coalescent with recombination model, for example, will tend to favor networks
with fewer recombination events. The cost of adding more recombination events
depends on the recombination rate. At lower rates of recombination, adding new
recombination events is more costly and the information coming from the
sequence alignment in support of a recombination event needs to be greater.
Network likelihood. While the evolutionary history of the entire genome is a net-
work, the evolutionary history of each individual position in the genome can be
described as a tree. We can therefore denote the likelihood of observing a sequence
alignment (the data denoted D) given a network N and evolutionary model μ as
$$P(D \mid N, \mu) = \prod_{i=1}^{\text{sequence length}} P(D_i \mid T_i, \mu), \tag{2}$$
with $D_i$ denoting the nucleotides at position i in the sequence alignment and $T_i$
denoting the tree at position i. The likelihood at each individual position in the
alignment can then be computed using the standard pruning algorithm54. We
implemented the network likelihood calculation $P(D_i \mid T_i, \mu)$ such that it allows
making use of all the standard site models in BEAST2. Currently, we only consider
strict clock models and therefore do not allow for rate variations across different
branches of the network. This is because the number of edges in the network
changes over the course of the MCMC, making relaxed clock models more com-
plex to implement. We implemented the network likelihood such that it can make
use of caching of intermediate results and use unique patterns in the multiple
sequence alignment, similar to what is done for tree likelihood computations.
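The structure of Eq. 2 can be illustrated with a small pruning implementation in which each site is evaluated on its own tree. This sketch makes several simplifying assumptions that differ from the analyses reported here: it uses the JC69 substitution model instead of GTR+Γ4, it accepts only unambiguous A/C/G/T characters, it takes branch lengths already in expected substitutions per site, and it performs no caching. The tree encoding (a leaf is a sequence name, an internal node is a tuple of `(child, branch_length)` pairs) is hypothetical.

```python
import math

NUCS = "ACGT"


def jc69(t: float) -> list[list[float]]:
    """JC69 transition probabilities for a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def partial(node, site_data: dict[str, str]) -> list[float]:
    """Pruning step: likelihood of the data below `node` for each state at `node`."""
    if isinstance(node, str):  # leaf, identified by its sequence name
        return [1.0 if NUCS[i] == site_data[node] else 0.0 for i in range(4)]
    out = [1.0] * 4
    for child, length in node:  # internal node: tuple of (child, branch_length)
        p, below = jc69(length), partial(child, site_data)
        for i in range(4):
            out[i] *= sum(p[i][j] * below[j] for j in range(4))
    return out


def site_log_likelihood(tree, site_data: dict[str, str]) -> float:
    """log P(D_i | T_i, mu) with uniform base frequencies at the root."""
    return math.log(sum(0.25 * x for x in partial(tree, site_data)))


def network_log_likelihood(trees_by_site: list, alignment: dict[str, str]) -> float:
    """Sum over sites of log P(D_i | T_i, mu), where trees_by_site[i] is the
    tree embedded in the network at position i."""
    n_sites = len(next(iter(alignment.values())))
    total = 0.0
    for i in range(n_sites):
        site_data = {name: seq[i] for name, seq in alignment.items()}
        total += site_log_likelihood(trees_by_site[i], site_data)
    return total


tree_1 = (("a", 0.1), ("b", 0.1))  # tree at site 0
tree_2 = (("a", 0.3), ("b", 0.3))  # tree at site 1, after a recombination breakpoint
print(network_log_likelihood([tree_1, tree_2], {"a": "AC", "b": "AG"}))
```

Because neighbouring sites usually share the same tree, a practical implementation would evaluate each distinct tree once per unique site pattern rather than looping over every position.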
Network prior. The network prior is denoted by P(N∣θ,ρ), which is the probability
of observing a network and the embedding of segment trees under the coalescent
with recombination model, with effective population size θand per-lineage
recombination rate ρ. It plays essentially the same role that the tree prior plays in
phylodynamic analyses on trees.
We can calculate P(N∣θ,ρ) by expressing it as the product of exponential
waiting times between events (i.e., recombination, coalescent, and sampling
events):
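$$P(N \mid \theta, \rho) = \prod_{i=1}^{\#\text{events}} P(\text{event}_i \mid L_i, \theta, \rho) \times P(\text{interval}_i \mid L_i, \theta, \rho), \tag{3}$$
where we define $t_i$ to be the time of the ith event and $L_i$ to be the set of lineages
extant immediately prior to this event. (That is, $L_i = L_t$ for $t \in [t_{i-1}, t_i)$.)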
Given that the coalescent process is a constant size coalescent and given the ith
event is a coalescent event, the event contribution is denoted as
PðeventijLi;θ;ρÞ¼1
θ:ð4Þ
If the ith event is a recombination event and assuming constant rates of
recombination over time, the event contribution is denoted as
PðeventijLi;θ;ρÞ¼ρLðlÞ:ð5Þ
The interval contribution denotes the probability of not observing any event in
a given interval. It can be computed as the product of not observing any coalescent,
nor any recombination events in interval i. We can therefore write:
PðintervalijLi;θ;ρÞ¼exp½ðλcþλrÞðtiti1Þ;ð6Þ
where λcdenotes the rate of coalescence and can be expressed as
λc¼jLij
2
1
θ;ð7Þ
and λrdenotes the rate of observing a recombination event on any co-existing
lineage and can be expressed as
λr¼ρ∑
l2Li
LðlÞ:ð8Þ
In order to allow the recombination rates to vary across ssections Sson the
genome, we modify λrto differ in each section Ss, such that:
λr¼∑
s2S
ρs∑
l2Li
LðlÞ\Ss;ð9Þ
with LðlÞ\Ssdenoting the amount of overlap between LðlÞand Ss. The
recombination rate in each section sis denoted as ρ
s
.
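The product over event and interval contributions in Eqs. (3)–(8) can be sketched as follows. This is a simplified illustration under stated assumptions and not the package's code: time runs backwards from the most recent sample at time 0, sampling events are assumed to contribute a factor of 1, a single recombination rate is used (Eq. 8 rather than Eq. 9), and the event records with their field names are hypothetical.

```python
import math

def log_network_prior(events, theta, rho):
    """Log of Eq. (3) for a time-ordered list of events.

    Each event is a dict with
      'time'         : time of the event (backwards in time),
      'type'         : 'sample', 'coalescent' or 'recombination',
      'lengths'      : the value L(l) for every lineage extant
                       immediately prior to the event,
      'event_length' : L(l) of the recombining lineage
                       (recombination events only).
    """
    log_p = 0.0
    prev_time = 0.0
    for ev in events:
        lengths = ev["lengths"]
        k = len(lengths)
        lam_c = k * (k - 1) / 2.0 / theta                      # Eq. (7)
        lam_r = rho * sum(lengths)                             # Eq. (8)
        log_p -= (lam_c + lam_r) * (ev["time"] - prev_time)    # Eq. (6)
        if ev["type"] == "coalescent":
            log_p += math.log(1.0 / theta)                     # Eq. (4)
        elif ev["type"] == "recombination":
            log_p += math.log(rho * ev["event_length"])        # Eq. (5)
        # Sampling events: assumed to contribute a factor of 1.
        prev_time = ev["time"]
    return log_p

# Hypothetical example: two samples taken at time 0 that coalesce at time 2,
# each lineage carrying L(l) = 100.
events = [
    {"time": 0.0, "type": "sample", "lengths": []},
    {"time": 0.0, "type": "sample", "lengths": [100.0]},
    {"time": 2.0, "type": "coalescent", "lengths": [100.0, 100.0]},
]
print(log_network_prior(events, theta=1.0, rho=0.001))
```

Section-specific rates as in Eq. (9) would replace the single product `rho * sum(lengths)` with a sum over sections of each section's rate times the overlap of every lineage with that section.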
MCMC algorithm for recombination networks. In order to explore the posterior
space of recombination networks, we implemented a series of MCMC operators.
These operators often have analogs in operators used to explore different
phylogenetic trees and are similar to the ones used to explore reassortment networks16.
Here, we briefly summarize each of these operators.
Add/remove operator: The add/remove operator adds and removes
recombination events. It is an extension of the subtree prune and regraft move for
networks55 that jointly operates on segment trees as well. We additionally
implemented an adapted version that samples the re-attachment under a coalescent
distribution to increase acceptance probabilities.
Loci diversion operator: The loci diversion operator randomly changes the
location of recombination breakpoints of a recombination event.
Exchange operator: The exchange operator changes the attachment of edges in
the network while keeping the network length constant.
Subnetwork slide operator: The subnetwork slide operator changes the height of
nodes in the network while allowing changes in the topology.
Scale operator: The scale operator scales the heights of the root node or the
whole network without changing the network topology.
Gibbs operator: The Gibbs operator efficiently samples any part of the network
that is older than the root of any segment of the alignment and is thus not
informed by any genetic data. It is the analog of the Gibbs operator in ref. 16 for
reassortment networks.
Empty loci preoperator: The empty loci preoperator augments the network with
edges that do not carry any loci for the duration of one of the above moves, to allow
for larger jumps in network space.
One of the issues when inferring these recombination networks is that the root
height can be substantially larger than when not allowing for recombination events.
This can cause a computational issue when performing inferences. To circumvent
this, we truncate the recombination networks by reducing the recombination rate
sometime after all positions of the sequence alignment have reached their common
ancestor height.
Validation and testing. We validate the implementation of the coalescent with
recombination network prior as well as all operators in Fig. S12. We also show that
truncating the recombination networks does not affect the sampling of
recombination networks prior to reaching the common ancestor height of all positions in
the sequence alignment.
We then tested whether we are able to infer recombination networks,
recombination rates, effective population sizes, and evolutionary parameters from
simulated data. To do so, we randomly simulated recombination networks under
the coalescent with recombination. On top of these, we then simulated multiple
sequence alignments. We then re-infer the parameters used to simulate using our
MCMC approach. As shown in Fig. S13, these parameters are retrieved well from
simulated data with little bias and accurate coverage of simulated parameters by
credible intervals.
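The notion of coverage used in this kind of simulation study can be sketched as below. This is an illustration only: it assumes equal-tailed 95% intervals computed from posterior samples, whereas the intervals behind Fig. S13 may be computed differently (for example as highest posterior density intervals), and the function names are hypothetical.

```python
import math

def equal_tailed_interval(samples, level=0.95):
    """Approximate equal-tailed credible interval from posterior samples."""
    s = sorted(samples)
    n = len(s)
    lo_idx = int(math.floor((1.0 - level) / 2.0 * (n - 1)))
    hi_idx = int(math.ceil((1.0 + level) / 2.0 * (n - 1)))
    return s[lo_idx], s[hi_idx]

def coverage(true_values, posterior_samples, level=0.95):
    """Fraction of simulations whose true value lies inside the interval."""
    hits = 0
    for truth, samples in zip(true_values, posterior_samples):
        lo, hi = equal_tailed_interval(samples, level)
        hits += lo <= truth <= hi
    return hits / len(true_values)
```

For a well-calibrated sampler, the coverage of 95% intervals across many simulations is expected to be close to 0.95.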
We next tested how well we can retrieve individual recombination events. To do
so, we plot the location and timings of simulated recombination events for the first
9 out of 100 simulations. We then plot the density of recombination events in the
posterior distribution of networks, based on the timing and location of the inferred
breakpoint on the genome. As shown in Fig. S14, we are able to retrieve the true
(simulated) recombination events well.
We next tested how the speed of inference scales with the number of
recombination events, the number of samples in the dataset, and the
evolutionary rate. To do so, we simulated 300 recombination networks and
sequence alignments of length 10,000 under a Jukes–Cantor model with between
10 and 200 leaves and a recombination rate between 1 × 10⁻⁵ and 2 × 10⁻⁵
recombination events per site per year. This means that for each simulation,
there were between 0 and 100 recombination events, allowing us to investigate
how the inference scales in different settings. As shown in Fig. S15, the ESS per
hour decreases with the number of recombination events and samples, but not
with the evolutionary rate. In particular, the ESS per hour decreases much faster
with the number of recombination events in a dataset than with the number of
samples. This suggests that the method can currently be applied more easily to
datasets with a large number of samples than to datasets with a large number of
recombination events.
We next tested how the choice of the prior distribution on the recombination
rate impacts the recombination rate estimate. To do so, we simulate 20
recombination networks and sequence alignments of length 10,000 under a
Jukes–Cantor model with 100 leaves and a recombination rate drawn randomly
from a log-normal distribution. We then infer the recombination rates using 5
different recombination rate priors as shown in Fig. 5F that put some or a lot of
weight on the wrong parameters. As shown in Fig. 5A–E, we are able to infer
recombination rates, even with the wrong priors.
Additionally, we compared the effective sample size values from MCMC runs
inferring recombination networks for the MERS spike protein to treating the
evolutionary histories as trees. We find that although the effective sample size
values are lower when inferring recombination networks, they are not orders of
magnitude lower (see Fig. S16).
Recombination network summary. We implemented an algorithm to summarize
distributions of recombination networks similar to the maximum clade credibility
framework typically used to summarize trees in BEAST56. In short, the algorithm
summarizes individual trees at each position in the alignment. To do so, we first
compute how often we encountered the same coalescent event at every position in
the alignment during the MCMC. We then choose the network that maximizes the
clade support over each position as the maximum clade credibility (MCC) network.
The MCC networks are logged in the extended Newick format57 and can be
visualized in icytree.org58. We here plotted the MCC networks using an adapted
version of baltic (https://github.com/evogytis/baltic).
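The selection step of such a summary can be sketched as follows. This is a hedged simplification and not the package's algorithm: each coalescent event at a position is represented only by the set of leaves below it (node heights are ignored), the per-position clade sets are assumed to have been extracted from each posterior network beforehand, and a summed log support is used as the score. All names are hypothetical.

```python
import math
from collections import Counter

def mcc_index(samples):
    """Pick the posterior sample with the highest summed log clade support.

    samples[k][i] is the set of clades (frozensets of leaf names) in the
    local tree at alignment position i of posterior sample k.
    """
    n_samples = len(samples)
    n_pos = len(samples[0])
    # Count how often each clade was seen at each position.
    counts = [Counter() for _ in range(n_pos)]
    for sample in samples:
        for i, clades in enumerate(sample):
            counts[i].update(clades)

    def score(sample):
        return sum(
            math.log(counts[i][clade] / n_samples)
            for i, clades in enumerate(sample)
            for clade in clades
        )

    scores = [score(s) for s in samples]
    best = max(range(n_samples), key=scores.__getitem__)
    return best, scores

# Hypothetical example with three posterior samples and two positions.
ab = frozenset({"A", "B"})
bc = frozenset({"B", "C"})
samples = [[{ab}, {ab}], [{ab}, {bc}], [{ab}, {ab}]]
best, scores = mcc_index(samples)
```

In this toy example the second sample contains a clade at position 1 that is seen only once, so it scores lower than the other two.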
Sequence data. The genetic sequence data for OC43, NL63, and 229E were
obtained from ViPR (http://www.viprbrc.org) and were the same as used in ref. 41. All
these sequences were isolated from a human host and downsampled from the
dataset used in ref. 41 to 100 sequences (for OC43 and NL63). As there were only
54 229E sequences, we did not downsample this dataset. The sequence
data for the MERS analyses were the same as described in ref. 38, but using a
randomly downsampled dataset of 100 sequences. For the SARS-like analyses, we
used 40 different deposited SARS-like genomes, mostly originating from bats, as
well as from humans, and one pangolin-derived sequence.
Rates of adaptation. The rates of adaptation were calculated using a modification
of the McDonald–Kreitman method, as designed by Bhatt et al.40 and
implemented in ref. 41. Briefly, for each virus, we aligned the sequence of each gene or
genomic region. Then, we split the alignment into 3-year sliding windows, each
containing a minimum of 3 sequenced isolates. We used the consensus sequence at
the first time point as the outgroup. A comparison of the outgroup to the alignment
of each subsequent temporal window yielded a measure of synonymous and
non-synonymous fixations and polymorphisms at each position in the alignment. This
approach requires having sequence data gathered over relatively long time periods
where the consensus genome allows for an accurate description of the long-term
evolutionary patterns and, as such, would not be adequate for a pathogen with a
relatively short evolutionary history, such as SARS-CoV-2. We used
proportional site counting for these estimations59. We assumed that selectively neutral
sites are all silent mutations as well as replacement polymorphisms occurring at
frequencies between 0.15 and 0.7540. We identified adaptive substitutions as
non-synonymous fixations and high-frequency polymorphisms that exceed the neutral
expectation. We then estimated the rate of adaptation (per codon per year) using
linear regression of the number of adaptive substitutions inferred at each time
point. In order to compute the 5′ spike and 3′ spike rates of adaptation, we used the
weighted average of all coding regions to the left (upstream) or right (downstream)
of the spike gene, respectively, using the length of the individual sections as
weights. We estimated the uncertainty by running the same analysis on 100
bootstrapped outgroups and alignments.
Fig. 5 Impact of the recombination rate prior distribution on the inferred recombination rates. Here, we compare the inferred recombination rates
when using different prior distributions that differed from the distributions from which the rates for simulations were sampled. The rates for simulations
were sampled from a log-normal distribution with μ=−11.12 and σ=0.5. In A, we show the inferred rates when using a prior distribution with μ=−12.74
and σ=0.5 (leading to a 5 times lower mean in real space than the correct prior). In B, we show the inferred rates when using a prior distribution with
μ=−12.74 and σ=2. In C, we show the inferred rates when using the same prior distribution as was sampled under. In D, we show the inferred rates when
using a prior distribution with μ=−9.72 and σ=2. In E, we show the inferred rates when using a prior distribution with μ=−9.72 and σ=0.5 (leading to
5 times higher mean in real space than the correct prior). F shows the corresponding density plots for all log-normal distributions used as prior distributions
on the recombination rates.
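For the rates of adaptation described in the Methods, the two arithmetic steps (a regression slope over time and a length-weighted average across coding regions) can be sketched as follows. This is a minimal illustration with hypothetical function names. It does not reproduce the site counting, the neutral-expectation correction, or the bootstrapping of refs. 40 and 41.

```python
def regression_slope(times, values):
    """Ordinary least-squares slope of `values` against `times`."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    return sxy / sxx

def length_weighted_rate(rates, lengths):
    """Average of per-region rates, weighted by region length."""
    return sum(r * l for r, l in zip(rates, lengths)) / sum(lengths)
```

Here, `times` would hold the time points of the sliding windows and `values` the number of adaptive substitutions inferred at each of them, while `rates` and `lengths` would hold the per-region rates and the lengths of the coding regions upstream or downstream of the spike gene.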
Reporting summary. Further information on research design is available in the Nature
Research Reporting Summary linked to this article.
Data availability
The BEAST2 input XML files for all coronavirus analyses in this manuscript, as well as the
files used to post-process these analyses, are available from https://github.com/nicfel/
Recombination-Material and in ref. 60. The XML files include the sequence data and
exact input specification of the coronavirus analyses performed in this manuscript,
except for the sequences published on GISAID. The acknowledgment table for the four
GISAID sequences used for the SARS-like analyses is provided in Supplementary Note 1.
The GenBank accession numbers for the 229E, OC43, NL63, SARS-like, and MERS
analyses are provided as separate tables in Supplementary Data 1. The MERS sequences
without accession numbers are taken from ref. 38. Source data are provided with
this paper.
Code availability
The Recombination package is implemented as an add-on to the Bayesian phylogenetics
software platform BEAST227. All MCMC analyses performed here were run using
adaptive parallel tempering61. The source code is available at https://github.com/nicfel/
Recombination and in ref. 62. We additionally provide a tutorial on how to set up and
post-process analyses at https://github.com/nicfel/Recombination-Tutorial. The MCC
networks are plotted using an adapted version of baltic (https://github.com/evogytis/
baltic). All other plots are done in R using ggplot263 and gggenes64.
Received: 5 May 2021; Accepted: 30 June 2022;
References
1. Andersen, K. G., Rambaut, A., Lipkin, W. I., Holmes, E. C. & Garry, R. F. The
proximal origin of SARS-CoV-2. Nat. Med. 26, 450–452 (2020).
2. Bedford, T. et al. Cryptic transmission of SARS-COV-2 in washington state.
Science 370, 571–575 (2020).
3. Volz, E. et al. Evaluating the effects of SARS-COV-2 spike mutation d614g on
transmissibility and pathogenicity. Cell 184, 64–75 (2021).
4. Grenfell, B. T. et al. Unifying the epidemiological and evolutionary dynamics
of pathogens. Science 303, 327–332 (2004).
5. Kim, E.-Y. et al. Human apobec3 induced mutation of human
immunodeficiency virus type-1 contributes to adaptation and evolution in
natural infection. PLoS Pathog. 10, e1004281 (2014).
6. Simon-Loriere, E. & Holmes, E. C. Why do rna viruses recombine? Nat. Rev.
Microbiol. 9, 617–626 (2011).
7. McDonald, S. M., Nelson, M. I., Turner, P. E. & Patton, J. T. Reassortment in
segmented rna viruses: mechanisms and outcomes. Nat. Rev. Microbiol. 14,
448 (2016).
8. Su, S. et al. Epidemiology, genetic recombination, and pathogenesis of
coronaviruses. Trends Microbiol. 24, 490–502 (2016).
9. Lai, M. RNA recombination in animal and plant viruses. Microbiol. Mol. Biol.
Rev. 56, 61–79 (1992).
10. Banner, L. R. & Mc Lai, M. Random nature of coronavirus rna recombination
in the absence of selection pressure. Virology 185, 441–445 (1991).
11. Bobay, L.-M., O’Donnell, A. C. & Ochman, H. Recombination events are
concentrated in the spike protein region of betacoronaviruses. PLoS Genet. 16,
e1009272 (2020).
12. Barton, N. A general model for the evolution of recombination. Genet. Res. 65,
123–144 (1995).
13. Feldman, M. W., Christiansen, F. B. & Brooks, L. D. Evolution of recombination
in a constant environment. Proc. Natl Acad. Sci. USA 77, 4838–4841 (1980).
14. Hill, W. G. & Robertson, A. The effect of linkage on limits to artificial
selection. Genet. Res. 8, 269–294 (1966).
15. Posada, D. & Crandall, K. A. The effect of recombination on the accuracy of
phylogeny estimation. J. Mol. Evol. 54, 396–402 (2002).
16. Müller, N. F., Stolz, U., Dudas, G., Stadler, T. & Vaughan, T. G. Bayesian
inference of reassortment networks reveals fitness benefits of reassortment in
human influenza viruses. Proc. Natl Acad. Sci. USA 117, 17104–17111 (2020).
17. Hudson, R. R. Properties of a neutral allele model with intragenic
recombination. Theor. Popul. Biol. 23, 183–201 (1983).
18. Didelot, X., Lawson, D., Darling, A. & Falush, D. Inference of homologous
recombination in bacteria using whole-genome sequences. Genetics 186,
1435–1449 (2010).
19. Vaughan, T. G. et al. Inferring ancestral recombination graphs from bacterial
genomic data. Genetics 205, 857–870 (2017).
20. Rasmussen, M. D., Hubisz, M. J., Gronau, I. & Siepel, A. Genome-wide
inference of ancestral recombination graphs. PLoS Genet. 10, e1004342 (2014).
21. McVean, G. A. & Cardin, N. J. Approximating the coalescent with
recombination. Philos. Trans. R. Soc. B: Biol. Sci. 360, 1387–1393 (2005).
22. Bloomquist, E. W. & Suchard, M. A. Unifying vertical and nonvertical
evolution: a stochastic arg-based framework. Syst. Biol. 59, 27–41 (2010).
23. Meng, C. & Kubatko, L. S. Detecting hybrid speciation in the presence of
incomplete lineage sorting using gene tree incongruence: a model. Theor.
Popul. Biol. 75, 35–45 (2009).
24. Yu, Y., Dong, J., Liu, K. J. & Nakhleh, L. Maximum likelihood inference of
reticulate evolutionary histories. Proc. Natl Acad. Sci. USA 111, 16448–16453
(2014).
25. Bryant, D. & Moulton, V. Neighbor-net: an agglomerative method for the
construction of phylogenetic networks. Mol. Biol. Evol. 21, 255–265 (2004).
26. Huson, D. H. & Bryant, D. Application of phylogenetic networks in
evolutionary studies. Mol. Biol. Evol. 23, 254–267 (2006).
27. Bouckaert, R. et al. BEAST 2.5: an advanced software platform for Bayesian
evolutionary analysis. PLoS Comput. Biol. 15, e1006650
https://doi.org/10.1371/journal.pcbi.1006650 (2019).
28. Hon, C.-C. et al. Evidence of the recombinant origin of a bat severe acute
respiratory syndrome (sars)-like coronavirus and its implications on the direct
ancestor of sars coronavirus. J. Virol. 82, 1819–1826 (2008).
29. Li, X. et al. Emergence of SARS-COV-2 through recombination and strong
purifying selection. Sci. Adv. 6, eabb9153 (2020).
30. Boni, M. F. et al. Evolutionary origins of the SARS-COV-2 sarbecovirus
lineage responsible for the covid-19 pandemic. Nat. Microbiol. 5, 1408–1417
(2020).
31. Ge, X.-Y. et al. Isolation and characterization of a bat sars-like coronavirus
that uses the ace2 receptor. Nature 503, 535–538 (2013).
32. Ge, X.-Y. et al. Coexistence of multiple coronaviruses in several bat colonies in
an abandoned mineshaft. Virol. Sin. 31, 31–40 (2016).
33. Zhou, H. et al. A novel bat coronavirus closely related to sars-cov-2 contains
natural insertions at the s1/s2 cleavage site of the spike protein. Curr. Biol. 30,
2196–2203 (2020).
34. Lam, T. T.-Y. et al. Identifying sars-cov-2-related coronaviruses in malayan
pangolins. Nature 583, 282–285 (2020).
35. Duchene, S. et al. Temporal signal and the phylodynamic threshold of sars-
cov-2. Virus Evol. 6, veaa061 (2020).
36. Duchêne, S., Holmes, E. C. & Ho, S. Y. Analyses of evolutionary dynamics in
viruses are hindered by a time-dependent bias in rate estimates. Proc. R. Soc.
B: Biol. Sci. 281, 20140732 (2014).
37. Nickbakhsh, S. et al. Epidemiology of seasonal coronaviruses: establishing the
context for the emergence of coronavirus disease 2019. J. Infect. Dis. 222,
17–25 (2020).
38. Dudas, G., Carvalho, L. M., Rambaut, A. & Bedford, T. Mers-cov spillover at
the camel-human interface. Elife 7, e31257 (2018).
39. Reusken, C. B. et al. Geographic distribution of mers coronavirus among
dromedary camels, africa. Emerg. Infect. Dis. 20, 1370 (2014).
40. Bhatt, S., Holmes, E. C. & Pybus, O. G. The genomic rate of molecular adaptation
of the human influenza A virus. Mol. Biol. Evol. 28, 2443–2451 (2011).
41. Kistler, K. E. & Bedford, T. Evidence for adaptive evolution in the receptor-
binding domain of seasonal coronaviruses oc43 and 229e. Elife 10, e64509
(2021).
42. Walls, A. C. et al. Structure, function, and antigenicity of the sars-cov-2 spike
glycoprotein. Cell 181, 281–292 (2020).
43. Nachman, M. W. Variation in recombination rate across the genome:
evidence and implications. Curr. Opin. Genet. Dev. 12, 657–663 (2002).
44. Turakhia, Y. et al. Pandemic-scale phylogenomics reveals elevated
recombination rates in the sars-cov-2 spike region. Preprint at https://doi.org/
10.1101/2021.08.04.455157 (2021).
45. VanInsberghe, D., Neish, A. S., Lowen, A. C. & Koelle, K. Recombinant
SARS-CoV-2 genomes circulated at low levels over the first year of the pandemic.
Virus Evol. 7, veab059 https://doi.org/10.1093/ve/veab059 (2021).
46. Jackson, B. et al. Generation and transmission of interlineage recombinants in
the SARS-CoV-2 pandemic. Cell 184, 5179–5188 (2021).
47. Varabyou, A., Pockrandt, C., Salzberg, S. L. & Pertea, M. Rapid detection of inter-
clade recombination in sars-cov-2 with bolotie. Genetics 218, iyab074 (2021).
48. Ignatieva, A., Hein, J. & Jenkins, P. A. Ongoing recombination in
SARS-COV-2 revealed through genealogical reconstruction. Mol. Biol. Evol. 39, msac028
https://doi.org/10.1093/molbev/msac028 (2022).
49. Yang, Z., Kumar, S. & Nei, M. A new method of inference of ancestral
nucleotide and amino acid sequences. Genetics 141, 1641–1650 (1995).
50. Neches, R. Y., McGee, M. D. & Kyrpides, N. C. Recombination should not be
an afterthought. Nat. Rev. Microbiol. 18, 606–606 (2020).
51. Stadler, T. On incomplete sampling under birth–death models and connections to
the sampling-based coalescent. J. Theor. Biol. 261, 58–66 (2009).
52. Hudson, R. R. et al. Gene genealogies and the coalescent process. Oxf. Surv.
Evol. Biol. 7, 44 (1990).
53. Lemey, P., Rambaut, A., Drummond, A. J. & Suchard, M. A. Bayesian
phylogeography finds its roots. PLoS Comput. Biol. 5, e1000520 (2009).
54. Felsenstein, J. Evolutionary trees from dna sequences: a maximum likelihood
approach. J. Mol. Evol. 17, 368–376 (1981).
55. Bordewich, M., Linz, S. & Semple, C. Lost in space? generalising subtree prune
and regraft to spaces of phylogenetic networks. J. Theor. Biol. 423, 1–12 (2017).
56. Heled, J. & Bouckaert, R. R. Looking for trees in the forest: summary tree from
posterior samples. BMC Evol. Biol. 13, 1–11 (2013).
57. Cardona, G., Rosselló, F. & Valiente, G. ExtendedNewick: it is time for a standard
representation of phylogenetic networks. BMC Bioinform. 9, 1–8 (2008).
58. Vaughan, T. G. Icytree: rapid browser-based visualization for phylogenetic
trees and networks. Bioinformatics 33, 2392–2394 (2017).
59. Bhatt, S., Katzourakis, A. & Pybus, O. G. Detecting natural selection in RNA
virus populations using sequence summary statistics. Infect. Genet. Evol. 10,
421–430 (2010).
60. Müller, N. F. nicfel/Recombination-Material: Release for Nat. comm.
recombination manuscript. https://doi.org/10.5281/zenodo.6600818 (2022).
61. Müller, N. F. & Bouckaert, R. R. Adaptive metropolis-coupled mcmc for beast
2. PeerJ 8, e9473 (2020).
62. Müller, N. F. nicfel/Recombination: adds common ancestor heights logger to
beauti. https://doi.org/10.5281/zenodo.5076684 (2021).
63. Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer, 2016).
64. Wilkins, D. gggenes: draw gene arrow maps in ‘ggplot2’. R package version
0.4.0 (2019).
Acknowledgements
We would like to thank Timothy G. Vaughan for his helpful insights into the
implementation of the software. N.F.M. is funded by the Swiss National Science Foundation
(P2EZP3_191891). K.E.K. is an NSF GRFP Fellow (DGE-1762114). T.B. is a Pew
Biomedical Scholar and is supported by NIH R35 GM119774. The Scientific Computing
Infrastructure at Fred Hutch is supported by NIH ORIP S10OD028685.
Author contributions
N.F.M. and T.B. conceived and designed the experiments. N.F.M. and K.E.K. performed
the statistical analysis and analyzed the data. N.F.M. implemented the software. N.F.M.,
K.E.K., and T.B. wrote the paper.
Competing interests
The authors declare no competing interests.
Additional information
Supplementary information The online version contains supplementary material
available at https://doi.org/10.1038/s41467-022-31749-8.
Correspondence and requests for materials should be addressed to Nicola F. Müller.
Peer review information Nature Communications thanks the anonymous reviewers for
their contribution to the peer review of this work. Peer reviewer reports are available.
Reprints and permission information is available at http://www.nature.com/reprints
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons
Attribution 4.0 International License, which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative
Commons license, and indicate if changes were made. The images or other third party
material in this article are included in the article’s Creative Commons license, unless
indicated otherwise in a credit line to the material. If material is not included in the
article’s Creative Commons license and your intended use is not permitted by statutory
regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder. To view a copy of this license, visit http://creativecommons.org/
licenses/by/4.0/.
© The Author(s) 2022