A Bayesian approach to infer recombination patterns in coronaviruses

As shown during the SARS-CoV-2 pandemic, phylogenetic and phylodynamic methods are essential tools to study the spread and evolution of pathogens. One of the central assumptions of these methods is that the shared history of pathogens isolated from different hosts can be described by a branching phylogenetic tree. Recombination breaks this assumption. This makes it problematic to apply phylogenetic methods to study recombining pathogens, including, for example, coronaviruses. Here, we introduce a Markov chain Monte Carlo approach that allows inference of recombination networks from genetic sequence data under a template switching model of recombination. Using this method, we first show that recombination is extremely common in the evolutionary history of SARS-like coronaviruses. We then show how recombination rates across the genome of the human seasonal coronaviruses 229E, OC43 and NL63 vary with rates of adaptation. This suggests that recombination could be beneficial to fitness of human seasonal coronaviruses. Additionally, this work sets the stage for Bayesian phylogenetic tracking of the spread and evolution of SARS-CoV-2 in the future, even as recombinant viruses become prevalent. Genetic recombination can confound standard phylogenetic approaches. Here, the authors present a method to reconstruct virus recombination networks, and show the importance of recombination in shaping the ongoing evolution of SARS-like, MERS and 3 human seasonal coronaviruses.
Nicola F. Müller1, Kathryn E. Kistler1,2 & Trevor Bedford1,2,3
https://doi.org/10.1038/s41467-022-31749-8
1Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA. 2Molecular and Cellular Biology Program, University
of Washington, Seattle, WA, USA. 3Howard Hughes Medical Institute, Seattle, WA, USA. email: nicola.felix.mueller@gmail.com
NATURE COMMUNICATIONS | (2022) 13:4186 | https://doi.org/10.1038/s41467-022-31749-8 | www.nature.com/naturecommunications 1
Content courtesy of Springer Nature, terms of use apply. Rights reserved
Since the emergence of SARS-CoV-2, genetic sequence data
has been used to study its evolution and spread. Genetic
sequences have, for example, been used to investigate natural
versus lab origins of SARS-CoV-2 (ref. 1), when SARS-CoV-2 was
introduced into the US (ref. 2), as well as whether genetic variants differ
in growth rate3. These analyses often rely on phylogenetic and
phylodynamic approaches, at the heart of which are phylogenetic
trees. Such trees denote how viruses isolated from different
individuals are related and contain information about the trans-
mission dynamics connecting these infections4.
Along with mutations introduced by errors during replication
or by anti-viral molecules (for example ref. 5), different recom-
bination processes contribute to genetic diversity in RNA viruses
(reviewed by Simon-Loriere and Holmes6). Reassortment in
segmented viruses (generally negative-sense RNA viruses), such
as influenza or rotaviruses, can produce offspring that carry
segments from different parental lineages7. In other RNA viruses
(generally positive-sense RNA viruses), such as flaviviruses and
coronaviruses, homologous recombination can combine different
parts of a genome from different parental lineages in absence of
physically separate segments on the genome of those viruses8.
The main mechanism of this process is thought to be via template
switching9, where the template for replication is switched during
the replication process. Recombination breakpoints in experiments
appear to be largely random, with selection favoring
recombination breakpoints in some areas of the genome10. Recent
work shows that recombination breakpoints occur more frequently
in the spike region of betacoronaviruses, such as SARS-CoV-2 (ref. 11).
While the reason why the recombination process
evolved in RNA viruses is not completely understood6, there are
different explanations of why recombination may be beneficial. In
general, recombination is selected for if breaking up linkage
disequilibrium is beneficial12. Recombination can help purge
deleterious mutations from the genome, as proposed by
the mutational-deterministic hypothesis13. It can also increase the
rate at which a fit combination of mutations occurs, as stated
by the Robertson–Hill effect14. Alternatively, recombination in
RNA viruses may also just be a by-product of the processivity of
the viral polymerase6.
Recombination poses a unique challenge to phylogenetic
methods, as it violates the very central assumption that the evo-
lutionary history of individuals can be denoted by branching
phylogenetic trees. Recombination breaks this assumption and
requires representation of the shared ancestry of a set of
sequences as a network. Not accounting for this can lead to biased
phylogenetic and phylodynamic inferences15,16. An analytic
description of recombination is provided by the coalescent with
recombination, which models a backward in time process where
lineages can coalesce and recombine17. When recombination is
considered backward in time, a single lineage results in two
parent lineages, with one parent lineage carrying the genetic
material from one side of a random recombination breakpoint
and the other parent lineage carrying the genetic material from
the other side of this breakpoint. This equates to the backward in
time equivalent of template switching where there is one
recombination breakpoint per recombination event.
Currently, some Bayesian phylogenetic approaches exist that
infer recombination networks, or ancestral recombination graphs
(ARG), but are either approximate or do not directly allow for
efficient model-based inference. Some approaches consider
tree-based networks18,19, where the networks consist of a base tree
with recombination edges that always attach to edges on the base
tree. Alternative approaches rely on approximations to the coa-
lescent with recombination20,21, consider a different model of
recombination16, or seek to infer recombination networks absent
an explicit recombination model22. Bayesian and maximum
likelihood methods have also been used to account for gene
transfer events when describing the evolutionary history of spe-
cies from multiple loci (for example, see refs. 23,24). Additionally,
methods have been used to describe non-tree-like evolution using
split trees25,26. There is, however, a gap for Bayesian inference of
recombination networks under the coalescent with recombination
that can be applied to study pathogens, such as coronaviruses.
To fill this gap, we here develop a Markov chain Monte
Carlo (MCMC) approach to efficiently infer recombination networks
under the coalescent with recombination for sequences
sampled over time. This framework allows joint estimation of
recombination networks, effective population sizes, recombina-
tion rates, and parameters describing mutations over time from
genetic sequence data sampled through time. We explicitly do not
make additional approximations to characterize the recombina-
tion process, other than those of the coalescent with
recombination17, such as, for example, the approximation of
tree-based networks. We implemented this approach as an open-source
software package for BEAST2 (ref. 27), allowing us to use the
various evolutionary models already implemented in BEAST2.
We then use the coalescent with recombination to study the
recombination patterns of SARS-like, MERS, and 3 seasonal
human coronaviruses.
Results
Widespread recombination in SARS-like coronaviruses.
Recombination has been implicated at the beginning of the SARS-
CoV-1 outbreak28 and has been suggested as the origin of the
receptor-binding domain in SARS-CoV-2 (ref. 29), though Boni et al.30
report that recombination is unlikely to be the origin of SARS-
CoV-2. While this strongly suggests non-tree-like evolution, the
evolutionary history of SARS-like viruses has, out of necessity,
mainly been denoted using phylogenetic trees.
We here reconstruct the recombination history of SARS-like
viruses, which includes SARS-CoV-1 and SARS-CoV-2 as well as
related bat31–33 and pangolin34 coronaviruses. To do so, we infer
the recombination network of SARS-like viruses under the
coalescent with recombination. We assumed that the rates of
recombination and effective population sizes were constant over
time and that the genomes evolved under a GTR+Γ₄ model.
Similar to the estimate in ref. 30, we used a fixed evolutionary rate
of 5 × 10⁻⁴ mutations per nucleotide and year. We fixed the
evolutionary rate since the time interval of sampling between
individual isolates is relatively short compared to the time scale of
the evolutionary history of SARS-like viruses. This means that the
sampling times themselves offer little insight into the evolu-
tionary rates and, in absence of other calibration points, there is
little information about the evolutionary rate in this dataset. This,
in turn, means that if the evolutionary rate we used here is
inaccurate then the timings of common ancestors will also be
inaccurate. Therefore, exact timings and calendar dates in this
analysis should be taken as guideposts rather than formal
estimates.
As shown in Fig. 1A and Fig. S1A, the evolutionary history of
SARS-like viruses is characterized by frequent recombination
events, including ancestral to SARS-CoV-2 (see Fig. S2). This
means that only relatively short segments of the genomes code for
the same tree (see Figs. S3 and S1B). Consequently, characterizing
the evolutionary history of SARS-like viruses by a single, genome-
wide phylogeny is bound to be inaccurate and potentially
misleading. We infer the recombination rate in SARS-like viruses
to be approximately 2 × 10⁻⁶ recombination events per site per
year, which is about 200 times lower than the evolutionary rate.
This rate translates to about 0.06 recombination events per
lineage per year, which is slightly lower than the estimated rate of
recombination for the seasonal human coronaviruses and the
reassortment rates for pandemic 1918-like influenza A/H1N1 and
influenza B viruses, which are all around 0.1–0.2 reassortment
events per lineage per year16. This recombination rate is a
function of co-infection rates, probability of recombination
occurring upon co-infection, and selection. As such, the
recombination rate we infer here will be (possibly substantially)
lower than the within-host rate of recombination.
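As a rough check of the unit conversion above, a per-site rate can be multiplied by the genome length to obtain a per-lineage rate. The sketch below assumes a genome length of 30,000 nucleotides, a round figure typical of coronaviruses that is not stated in the text; the function name is hypothetical.

```python
# Assumed genome length in nucleotides: a round figure typical of
# coronaviruses, not a value taken from the text.
GENOME_LENGTH = 30_000


def per_lineage_rate(per_site_rate: float, genome_length: int = GENOME_LENGTH) -> float:
    """Convert a per-site, per-year rate into a per-lineage, per-year rate."""
    return per_site_rate * genome_length


recombination_per_site = 2e-6  # inferred for SARS-like viruses (from the text)
evolutionary_rate = 5e-4       # fixed evolutionary rate (from the text)

# Per-lineage recombination rate: close to the 0.06 quoted in the text.
print(per_lineage_rate(recombination_per_site))

# Ratio of evolutionary rate to recombination rate with these rounded inputs.
print(evolutionary_rate / recombination_per_site)
```

With these rounded inputs the ratio comes out near 250, the same order as the "about 200 times" quoted above, which was presumably computed from unrounded estimates.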
These recombination events were not evenly distributed across
the genome and, instead, were relatively more frequent in areas outside
those coding for ORF1ab (Figs. S4 and S5). Additionally, our
inference suggests that rates of recombination are slightly elevated
on spike subunit S1 compared to subunit S2 (Fig. S4). If we track
recombination events ancestral to the SARS-CoV-2 lineage that
are inferred to have happened in the last 100 years, we find
evidence for recombination breakpoints occurring close to the 5′
end of the spike, just outside the coding region (see Fig. S5).
Additionally, we find support for recombination breakpoints
toward the 3′ end of the spike, near the nucleocapsid gene (see
Fig. S5). If we assume that during genome replication in
coronaviruses template shifts occur randomly on the genome10,
differences in observed recombination rates could be explained by
selection favoring recombinant lineages with breakpoints 3′ to
ORF1ab relative to elsewhere on the genome.
We next investigate when different viruses last shared a
common ancestor (MRCA) along the genome (see Figs. 1B and
S6). RmYN02 (ref. 33) shares the MRCA with SARS-CoV-2 on the part
of the genome that codes for ORF1ab (Fig. 1B). We additionally
find strong evidence for one or more recombination events in the
ancestry of RmYN02 at the beginning of spike (Fig. 1B). This
recent recombination event is unlikely to have occurred with a
recent ancestor of any of the coronaviruses included in this
dataset since the common ancestor of RmYN02 with any other
virus in the dataset is approximately the same (Fig. S6A). In other
words, large parts of the spike protein of RmYN02 are as related
to SARS-CoV-2 as SARS-CoV-2 is to SARS-CoV-1. The common
ancestor timings of P2S across the genome are equal between
RaTG13 and SARS-CoV-2 (Fig. S6C). RaTG13 on the other hand
is more closely related to SARS-CoV-2 than P2S (Fig. S6B) across
the entire genome.
When looking at when different viruses last shared a common
ancestor anywhere on the genome (in other words: when the
ancestral lineages of two viruses last crossed paths), we find that
RmYN02 has the most recent MRCA with SARS-CoV-2
(Fig. S6C). The median estimate of the most recent MRCA
between SARS-CoV-2 and RmYN02 is 1986 (95% CI:
1973–2005). For RaTG13 it is 1975 (95% CI: 1964–1988), for
P2S it is 1949 (95% CI: 1907–1973) and for SARS-CoV-1 it is
1834 (95% CI: 1707–1935). These estimates are contingent on a
fixed evolutionary rate of 5 × 10⁻⁴ per nucleotide per year.
Rates of recombination are associated with rates of adaptation
in human seasonal coronaviruses. We next investigate recombination
patterns in MERS-CoV, which has over 2500 confirmed
cases in humans, as well as in human seasonal coronaviruses
229E, OC43, and NL63, which have widespread seasonal circu-
lation in humans. As for the SARS-like viruses, we jointly infer
recombination networks, rates of recombination, and population
sizes for these viruses. We assumed that the genomes evolved
under a GTR+Γ₄ model and, in contrast to the analysis of SARS-like
viruses, inferred the evolutionary rates. We observe frequent
recombination in the history of all four viruses, wherein genetic
ancestry is described by a network rather than a strictly branching
phylogeny (Fig. 2A–D and Fig. S6A).
The human seasonal coronaviruses all have recombination
rates around 1 × 10⁻⁵ per site and year (Fig. S7). This is around
10–20 times lower than the evolutionary rate (Fig. S8). In contrast
to the recombination rates, the evolutionary rates vary greatly
across the human seasonal coronaviruses, with rates between a
median of 1.3 × 10⁻⁴ (95% highest posterior density interval
(HPD) 1.1–1.5 × 10⁻⁴) for NL63 and a median rate of 2.5 × 10⁻⁴
(95% HPD 2.2–2.7 × 10⁻⁴) and 2.1 × 10⁻⁴ (95% HPD
1.9–2.3 × 10⁻⁴) for 229E and OC43 (Fig. S8). These evolutionary
Fig. 1 Evolutionary history of SARS-like viruses. A Maximum clade credibility network of SARS-like viruses. Blue dots denote samples and green dots
recombination events. B Common ancestor times of Wuhan-Hu1 (SARS-CoV-2) with different SARS-like viruses on different positions of the genome. The
y-axis denotes common ancestor times on the log scale. The line denotes the median common ancestor time, while the colored area denotes the 95%
highest posterior density interval. C Most recent time anywhere on the genome that Wuhan-Hu1 shared a common ancestor with different SARS-like
viruses. The error bars denote the upper and lower bound of the 95% highest posterior density interval. The MCC network and common ancestor times are
provided as a Source Data file.
rates are substantially lower than those estimated for SARS-CoV-2
(1.1 × 10⁻³ substitutions per site and year; ref. 35), which are more in
line with our estimates for the evolutionary rates of MERS with a
median rate of 6.9 × 10⁻⁴ (95% HPD 6.0–7.9 × 10⁻⁴). Evolutionary
rate estimates can be time-dependent, with datasets
spanning more time estimating lower rates of evolution than
those spanning less time36. In turn, this means that the
evolutionary rate estimates for SARS-CoV-2 will likely be lower
the more time passes. It is unclear though if it will approximate
the evolutionary rates of other seasonal coronaviruses in the
long run.
On a per-lineage basis, the estimated recombination rate for
seasonal coronaviruses translates into around 0.1–0.3 recombination
events per lineage and year (Fig. 2E). Recombination events
as defined here are a product of co-infection, recombination, and
selection of recombinant viruses. Interestingly, the rate at which
recombination events occur is highly similar to the rate at which
reassortment events occur in human influenza viruses (Fig. 2E, and
ref. 16). If we assume similar selection pressures for recombinant
coronaviruses compared to reassortant influenza viruses, this would
indicate similar co-infection rates in influenza and coronaviruses.
The incidence of coronaviruses in patients with respiratory illness
cases over 12 seasons in western Scotland has been found to be
lower (7–17%) than for influenza viruses (13–34%) but to be of the
same order of magnitude37. Considering that seasonal coronaviruses
typically are less symptomatic than influenza viruses, it is
not unreasonable to assume that annual incidence, and therefore
likely the annual co-infection rates, are comparable between
influenza and coronaviruses.
Compared to human seasonal coronaviruses, recombination
occurs around 3 times more often for MERS-CoV (Fig. 2E).
MERS-CoV mainly circulates in camels and occasionally spills
over into humans38. MERS-CoV infections are highly prevalent
in camels, with close to 100% of adult camels showing antibodies
against MERS-CoV39. Higher incidence, and thus higher rates of
co-infection, could therefore account for higher rates of
recombination in MERS-CoV compared to the human seasonal
coronaviruses.
We next tested whether parts of the genome with higher rates
of recombination are also associated with higher rates of
adaptation. To do so, we allowed for different relative rates of
recombination within the region 5′ of the spike (i.e. mostly
ORF1ab), spike itself, and everything 3′ of the spike. We
computed recombination rate ratios on each of these three
sections of the genome as the recombination rate on that section
divided by the mean rate on the other two sections. We infer that
recombination rates are elevated in the spike protein of all human
seasonal coronaviruses considered here (Fig. 3, Figs. S9, and S10).
This is consistent with other work estimating higher rates of
recombination on the spike protein of betacoronaviruses11.
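The rate ratio defined above (the rate on one section divided by the mean rate on the other two) can be sketched as follows. The function name and the input rates are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
def rate_ratios(rates: dict[str, float]) -> dict[str, float]:
    """Rate on each section divided by the mean rate on the other sections."""
    ratios = {}
    for section, rate in rates.items():
        others = [r for name, r in rates.items() if name != section]
        ratios[section] = rate / (sum(others) / len(others))
    return ratios


# Hypothetical per-site recombination rates, for illustration only.
example_rates = {
    "5' of spike": 1.0e-5,
    "spike": 3.0e-5,
    "3' of spike": 2.0e-5,
}

for section, ratio in rate_ratios(example_rates).items():
    print(f"{section}: {ratio:.2f}")
```

A ratio above 1 indicates that recombination on that section is elevated relative to the rest of the genome.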
We then computed the rates of adaptation on different parts of
the genomes of the seasonal human coronaviruses using the
approach described in refs. 40,41. This approach does not
explicitly consider trees to compute the rates of adaptation on
different parts of the genomes and is not affected by
recombination41. We find that sections of the genome with
relatively higher rates of adaptation correspond to sections of the
genome with relatively higher rates of recombination (Fig. 3). In
particular, recombination and adaptation are elevated on the
section of the genome that codes for the spike protein and are
lower elsewhere.
We next investigated whether these trends hold when looking
only at spikes. The spike protein is made up of two subunits: S1
and S2. S1 binds to the host cell receptor, while S2 facilitates
fusion of the viral and cellular membrane42. Rates of adaptation
have been shown to be high in S1, but not S2, for 229E and
OC43 (ref. 41). While the rates of adaptation are relatively low overall
for NL63, there is still some evidence that they are elevated in S1
compared to S241.
To test whether recombination rates vary with rates of
adaptation on the subunits of the spike as well, we inferred the
recombination rates from the spike only, allowing for different
rates of recombination on S1 versus the rest of the spike. We find
that the rates of recombination are elevated on S1 for 229E and
OC43 compared to the rest of the spike gene (Fig. 3). This is
consistent with strong absolute rates of adaptation on S1 on these
Fig. 2 Recombination networks and rates for coronaviruses MERS, 229E, OC43, and NL63. Recombination networks for MERS (A) and seasonal human
coronaviruses 229E (B), OC43 (C), and NL63 (D). E Recombination rates (per lineage and year) for the different coronaviruses compared to reassortment
rates in seasonal human influenza A/H3N2 and influenza B viruses as estimated under the coalescent with reassortment using whole-genome influenza
sequences sampled over multiple decades16. For OC43 and NL63, the parts of the recombination networks that stretch beyond 1950 are not shown to
increase the readability of more recent parts of the networks. The error bars denote the upper and lower bound of the 95% highest posterior density
interval. All MCC networks are provided as a Source Data file.
two viruses. For NL63, we find weak evidence for the rate on S2 to
be slightly higher than on S1 (Fig. 3), even though the rates of
adaptation are inferred to be higher on S1. The absolute rate of
adaptation in S1 of NL63 is, however, substantially lower than for
229E or OC43. Additionally, the uncertainty around the estimates
on adaptation rate ratios between the two subunits for NL63 is
rather large and includes no difference at all. Overall, these results
suggest that particular recombination events that have resulted in
recombinant viruses are either positively or negatively selected.
Elevated rates of recombination in areas where adaptation is
stronger have been described for other organisms (reviewed
here43). Alternatively, higher rates of recombination could also be
due to mechanistic reasons, as has been suggested in the case of
SARS-CoV-2 (ref. 44).
To further investigate this, we next computed the rates of
recombination on fitter and less fit parts of the recombination
networks of 229E, OC43, and NL63. To do so, we first classify
each edge of the inferred posterior distribution of the recombination
networks into fit and unfit based on how long a lineage
survives into the future. Fit edges are those that have descendants
at least 1, 2, 5, or 10 years into the future, and unfit edges are
those that do not. We then computed the rates of recombination
on both types of edges for the entire posterior distribution of
networks. Overall, we do not find that fit edges show relatively
higher rates of recombination (see Fig. S11). The simplest
explanation is that we do not have enough data points to measure
recombination rates on unfit edges, that is, to measure
recombination rates on parts of the recombination network where
selection has had too little time to shape which lineages survive and
which go extinct. An alternative explanation for why we see
elevated rates of recombination in the spike protein, but do not
observe a population-level fitness benefit, could be that most
recombinants (outside of spike) are detrimental to fitness,
with a few (within spike) having little fitness effect at all.
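The edge classification described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the analysis code used for Fig. S11: the `Edge` fields are hypothetical, survival is measured from the more recent end of an edge (the text does not specify this), and the example edges are made up.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    end_year: float              # calendar year at the more recent end of the edge
    length_years: float          # duration of the edge in years
    last_descendant_year: float  # sampling year of the latest sampled descendant
    n_recombinations: int        # recombination events assigned to this edge


def recombination_rates(edges: list[Edge], threshold_years: float) -> dict[str, float]:
    """Recombination events per year of edge length, for fit and unfit edges."""
    totals = {"fit": [0, 0.0], "unfit": [0, 0.0]}
    for edge in edges:
        survival = edge.last_descendant_year - edge.end_year
        label = "fit" if survival >= threshold_years else "unfit"
        totals[label][0] += edge.n_recombinations
        totals[label][1] += edge.length_years
    return {
        label: (events / length if length > 0 else float("nan"))
        for label, (events, length) in totals.items()
    }


# Made-up edges, for illustration only.
edges = [
    Edge(end_year=2000.0, length_years=2.0, last_descendant_year=2012.0, n_recombinations=1),
    Edge(end_year=2005.0, length_years=4.0, last_descendant_year=2006.5, n_recombinations=0),
    Edge(end_year=2010.0, length_years=1.0, last_descendant_year=2010.5, n_recombinations=1),
]

for threshold in (1, 2, 5, 10):
    print(threshold, recombination_rates(edges, threshold))
```

In a full analysis this calculation would be repeated for every network in the posterior sample, giving a distribution of rates for each class rather than a single value.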
Discussion
Though not yet highly prevalent, evidence for recombination in
SARS-CoV-2 has started to appear45–48. As such, it is crucial to
know the extent to which recombination is expected to shape
SARS-CoV-2 in the coming years, to have methods to identify
recombination, and to perform phylogenetic reconstruction in the
presence of recombination. The results shown here indicate that
some recombinant viruses are either positively or negatively
selected. Estimating the deleterious load of viruses before and
after recombination using ancestral sequence reconstruction49
could help shed light on which sequences are favored during
recombination. Furthermore, having additional sequences to
reconstruct recombination patterns in the seasonal coronaviruses
should clarify the role recombination plays in the long-term
evolution of these viruses.
While their impact on the evolutionary dynamics of SARS-
CoV-2 remains unclear, the likely rise of future SARS-CoV-2
recombinants will further necessitate methods that allow phylo-
genetic and phylodynamic inferences to be performed in the
presence of recombination50. In absence of that, recombination
has to be either ignored, leading to biased phylogenetic and
phylodynamic reconstruction15, or non-recombinant parts of the
Fig. 3 Comparison of recombination rates with rates of adaptation on different parts of the genomes of seasonal human coronaviruses 229E, OC43,
and NL63. Association between estimated relative recombination rate (x-axis) and relative adaptation rate (y-axis) for three different seasonal human
coronaviruses: 229E, OC43, and NL63. These estimates are shown for different parts of the genome, indicated by the different colors. These results are from
two different types of analysis: one using spike only (subunit 1 over subunit 2, shown in yellow) and one using the full genome (shown in orange, blue, and
green). The rate ratios denote the rate on a part of the genome divided by the average rate on the two other parts of the genome. The error bars of the
recombination rates (x-axis) denote the upper and lower bounds of the 95% HPD intervals of the estimates of relative recombination rates. The error bars
of the rates of adaptation are computed using 100 bootstrapped outgroups and alignments when computing the rates of adaptation. Source data are
provided as a Source Data file.
[Fig. 4 diagram labels: present, past; coalescent event, recombination event; has sampled descendants, no sampled descendants]
Fig. 4 Example recombination network. Events that can occur on a
recombination network as considered here. We consider events to occur
from the present backward in time to the past (as is the norm when looking
at coalescent processes). Lineages can be added upon sampling events,
which occur at predefined points in time and are conditioned on.
Recombination events split the path of a lineage in two, with everything on
one side of a recombination breakpoint going in one direction and
everything on the other side of a breakpoint going in the other direction.
genome have to be used for analyses, reducing the precision of
these methods. Our approach addresses this gap by providing a
Bayesian framework to infer recombination networks. To facilitate
easy adaptation, we implemented the method so that analyses
can be set up following the same workflow as regular
BEAST2 (ref. 27) analyses. Extending the current suite of population
dynamic models, such as birth–death models51 or models that
account for population structure52,53, will further increase the
applicability of recombination models to study the spread of
pathogens.
Methods
Coalescent with recombination. The coalescent with recombination models a
backward in time coalescent and recombination process17. In this process, three
different events are possible: sampling, coalescence, and recombination. Sampling
events happen at predefined points in time. Recombination events happen at a rate
proportional to the number of coexisting lineages at any point in time. Recom-
bination events split the path of a lineage in two, with everything on one side of a
recombination breakpoint going in one ancestral direction and everything on the
other side of a breakpoint going in the other direction. As shown in Fig. 4, the
two parental lineages after a recombination event each carry a subset of the gen-
ome. In reality, the viruses corresponding to those two lineages still carry the full
genome, but only a part of it will have sampled descendants. In other words, only a
part of the genome carried by a lineage at any time may impact the genome of a
future lineage that is sampled. The probability of actually observing a recombination
event on lineage l is proportional to how much genetic material that lineage
carries. This can be computed as the difference between the last and first nucleotide
position that is carried by l, which we denote as L(l). Coalescent events happen
between co-existing lineages at a rate proportional to the number of pairs of
coexisting lineages at any point in time and inversely proportional to the effective
population size. The parent lineage at each coalescent event will carry genetic
material corresponding to the union of the genetic material of the two child
lineages.
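The bookkeeping of genetic material described above can be illustrated with a small sketch. It is not the BEAST2 implementation: the representation of carried material as a list of inclusive position ranges and all function names are assumptions made for illustration.

```python
Segments = list[tuple[int, int]]  # inclusive (first, last) position ranges


def carried_length(segments: Segments) -> int:
    """L(l): difference between the last and first position carried by a lineage."""
    return max(last for _, last in segments) - min(first for first, _ in segments)


def recombine(segments: Segments, breakpoint: int) -> tuple[Segments, Segments]:
    """Split carried material between positions `breakpoint` and `breakpoint + 1`.

    Backward in time, one parent receives everything up to and including
    `breakpoint`, the other parent receives everything after it.
    """
    left: Segments = []
    right: Segments = []
    for first, last in segments:
        if last <= breakpoint:
            left.append((first, last))
        elif first > breakpoint:
            right.append((first, last))
        else:
            left.append((first, breakpoint))
            right.append((breakpoint + 1, last))
    return left, right


def coalesce(a: Segments, b: Segments) -> Segments:
    """Parent lineage carries the union of the material of the two child lineages."""
    merged: Segments = []
    for first, last in sorted(a + b):
        if merged and first <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged


genome: Segments = [(0, 29_999)]  # assumed genome of 30,000 positions
left, right = recombine(genome, 9_999)
print(left, carried_length(left))
print(right, carried_length(right))
print(coalesce(left, right))
```

Because L(l) is the span between the first and last carried position, a lineage carrying two distant fragments still counts every breakpoint location in between, which matches the definition given in the text.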
Posterior probability. In order to perform joint Bayesian inference of recombi-
nation networks together with the parameters of the associated models, we use a
MCMC algorithm to characterize the joint posterior density. The posterior density
is denoted as:
$$P(N, \mu, \theta, \rho \mid D) = \frac{P(D \mid N, \mu)\, P(N \mid \theta, \rho)\, P(\mu, \theta, \rho)}{P(D)}, \tag{1}$$
where N denotes the recombination network, μ the evolutionary model, θ the
effective population size and ρ the recombination rate. The multiple sequence
alignment, that is the data, is denoted D. P(D|N,μ) denotes the network likelihood,
P(N|θ,ρ) the network prior and P(μ,θ,ρ) the parameter priors. As is usually done
in Bayesian phylogenetics, we assume that P(μ,θ,ρ)=P(μ)P(θ)P(ρ).
Using a Bayesian approach has several advantages. In particular, it allows us to
account for uncertainty in the parameter and network estimates. Additionally, it
allows balancing different sources of information against each other. The
coalescent with recombination model, for example, will tend to favor networks
with fewer recombination events. The cost of adding more recombination events
depends on the recombination rate. At lower rates of recombination, adding new
recombination events is more costly and the information coming from the
sequence alignment in support of a recombination event needs to be greater.
Network likelihood. While the evolutionary history of the entire genome is a net-
work, the evolutionary history of each individual position in the genome can be
described as a tree. We can therefore denote the likelihood of observing a sequence
alignment (the data denoted D) given a network N and evolutionary model μ as
$$P(D \mid N, \mu) = \prod_{i=1}^{\text{sequence length}} P(D_i \mid T_i, \mu), \tag{2}$$
with $D_i$ denoting the nucleotides at position i in the sequence alignment and $T_i$
denoting the tree at position i. The likelihood at each individual position in the
alignment can then be computed using the standard pruning algorithm54. We
implemented the network likelihood calculation $P(D_i \mid T_i, \mu)$ such that it allows
making use of all the standard site models in BEAST2. Currently, we only consider
strict clock models and therefore do not allow for rate variations across different
branches of the network. This is because the number of edges in the network
changes over the course of the MCMC, making relaxed clock models more com-
plex to implement. We implemented the network likelihood such that it can make
use of caching of intermediate results and use unique patterns in the multiple
sequence alignment, similar to what is done for tree likelihood computations.
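The per-position factorization in Eq. (2) can be illustrated with a minimal sketch of the pruning algorithm. It assumes a Jukes–Cantor substitution model with branch lengths in substitutions per site and a hypothetical nested-tuple tree encoding; none of the function names come from BEAST2 or the Recombination package, and the sketch omits the caching and site-pattern compression described above.

```python
import math

NUCLEOTIDES = "ACGT"

def jc69_matrix(d):
    """Jukes-Cantor transition probabilities for a branch of length d (substitutions per site)."""
    e = math.exp(-4.0 * d / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def partials(node, site):
    """Probability of the tip data below `node`, conditional on each state at `node`.

    A leaf is a sequence name (str); an internal node is a tuple of
    (child, branch_length) pairs. This tree encoding is hypothetical.
    """
    if isinstance(node, str):
        return [1.0 if site[node] == n else 0.0 for n in NUCLEOTIDES]
    result = [1.0] * 4
    for child, branch_length in node:
        p = jc69_matrix(branch_length)
        below = partials(child, site)
        for i in range(4):
            result[i] *= sum(p[i][j] * below[j] for j in range(4))
    return result

def site_log_likelihood(tree, site):
    """log P(D_i | T_i, mu) with uniform base frequencies at the root."""
    return math.log(sum(0.25 * x for x in partials(tree, site)))

def network_log_likelihood(local_trees, alignment):
    """Sum of per-position log-likelihoods, i.e. the log of the product in Eq. (2)."""
    total = 0.0
    for i, tree in enumerate(local_trees):
        site = {name: seq[i] for name, seq in alignment.items()}
        total += site_log_likelihood(tree, site)
    return total

# Toy input: two positions whose local trees differ in their branch lengths
alignment = {"a": "AC", "b": "AG"}
local_trees = [(("a", 0.1), ("b", 0.1)), (("a", 0.3), ("b", 0.3))]
print(network_log_likelihood(local_trees, alignment))
```

Working on the log scale avoids numerical underflow when the product runs over thousands of positions.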
Network prior. The network prior is denoted by $P(N \mid \theta, \rho)$, which is the probability
of observing a network and the embedding of segment trees under the coalescent
with recombination model, with effective population size $\theta$ and per-lineage
recombination rate $\rho$. It plays essentially the same role that the tree prior plays in
phylodynamic analyses on trees.
We can calculate $P(N \mid \theta, \rho)$ by expressing it as the product of exponential
waiting times between events (i.e., recombination, coalescent, and sampling
events):
$$P(N \mid \theta, \rho) = \prod_{i=1}^{\#\text{events}} P(\text{event}_i \mid L_i, \theta, \rho) \times P(\text{interval}_i \mid L_i, \theta, \rho), \tag{3}$$
where we define $t_i$ to be the time of the $i$th event and $L_i$ to be the set of lineages
extant immediately prior to this event. (That is, $L_i = L_t$ for $t \in [t_{i-1}, t_i)$.)
Given that the coalescent process is a constant size coalescent and given the $i$th
event is a coalescent event, the event contribution is denoted as
$$P(\text{event}_i \mid L_i, \theta, \rho) = \frac{1}{\theta}. \tag{4}$$
If the $i$th event is a recombination event and assuming constant rates of
recombination over time, the event contribution is denoted as
$$P(\text{event}_i \mid L_i, \theta, \rho) = \rho\, \mathcal{L}(l). \tag{5}$$
The interval contribution denotes the probability of not observing any event in
a given interval. It can be computed as the product of the probabilities of observing
neither a coalescent nor a recombination event in interval $i$. We can therefore write:
$$P(\text{interval}_i \mid L_i, \theta, \rho) = \exp\left[-(\lambda_c + \lambda_r)(t_i - t_{i-1})\right], \tag{6}$$
where $\lambda_c$ denotes the rate of coalescence and can be expressed as
$$\lambda_c = \binom{|L_i|}{2} \frac{1}{\theta}, \tag{7}$$
and $\lambda_r$ denotes the rate of observing a recombination event on any co-existing
lineage and can be expressed as
$$\lambda_r = \rho \sum_{l \in L_i} \mathcal{L}(l). \tag{8}$$
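The event and interval contributions of Eqs. (3) to (8) can be combined on the log scale as in the sketch below. The `Interval` record and its field names are hypothetical, and the sketch assumes that sampling events contribute a factor of 1 because no sampling contribution is specified above. It is not the Recombination package's implementation.

```python
import math
from dataclasses import dataclass

@dataclass
class Interval:
    dt: float                  # t_i - t_{i-1}
    n_lineages: int            # |L_i|
    total_length: float        # sum of L(l) over all lineages in L_i
    event: str                 # "coalescent", "recombination" or "sampling"
    event_length: float = 0.0  # L(l) of the recombining lineage, if any

def log_network_prior(intervals, theta, rho):
    """log P(N | theta, rho) as a sum of interval and event contributions."""
    logp = 0.0
    for iv in intervals:
        lam_c = math.comb(iv.n_lineages, 2) / theta   # Eq. (7)
        lam_r = rho * iv.total_length                 # Eq. (8)
        logp -= (lam_c + lam_r) * iv.dt               # log of Eq. (6)
        if iv.event == "coalescent":
            logp += math.log(1.0 / theta)             # Eq. (4)
        elif iv.event == "recombination":
            logp += math.log(rho * iv.event_length)   # Eq. (5)
        # Assumption: sampling events add no event term
    return logp

# Toy history: two lineages that coalesce after one time unit, no recombination
history = [Interval(dt=1.0, n_lineages=2, total_length=0.0, event="coalescent")]
print(log_network_prior(history, theta=2.0, rho=0.0))
```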
In order to allow the recombination rates to vary across $s$ sections $S_s$ on the
genome, we modify $\lambda_r$ to differ in each section $S_s$, such that:
$$\lambda_r = \sum_{s \in S} \rho_s \sum_{l \in L_i} \mathcal{L}(l) \cap S_s, \tag{9}$$
with $\mathcal{L}(l) \cap S_s$ denoting the amount of overlap between $\mathcal{L}(l)$ and $S_s$. The
recombination rate in each section $s$ is denoted as $\rho_s$.
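A minimal sketch of the section-specific rate in Eq. (9) follows. It assumes, purely for illustration, that both the genome span carried by a lineage and each section can be represented as half-open `(start, end)` ranges of positions; the function names are hypothetical.

```python
def overlap(a, b):
    """Length of the overlap between two half-open ranges (start, end)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def recombination_rate(lineage_ranges, sections):
    """lambda_r = sum over sections s of rho_s * sum over lineages l of |L(l) ∩ S_s|.

    lineage_ranges: one (start, end) range per lineage, standing in for L(l)
    sections: list of ((start, end), rho_s) pairs
    """
    return sum(
        rho_s * sum(overlap(rng, sec) for rng in lineage_ranges)
        for sec, rho_s in sections
    )

# Toy genome of 100 positions split into two sections with different rates
sections = [((0, 50), 1e-5), ((50, 100), 2e-5)]
print(recombination_rate([(0, 100), (20, 60)], sections))
```

With a single section covering the whole genome, this reduces to the constant-rate expression in Eq. (8).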
MCMC algorithm for recombination networks. In order to explore the posterior
space of recombination networks, we implemented a series of MCMC operators.
These operators often have analogs in operators used to explore different phylogenetic
trees and are similar to the ones used to explore reassortment networks (ref. 16).
Here, we briefly summarize each of these operators.
Add/remove operator: The add/remove operator adds and removes
recombination events. It is an extension of the subtree prune and regraft move for
networks (ref. 55) that jointly operates on segment trees as well. We additionally
implemented an adapted version that samples re-attachment under a coalescent
distribution to increase acceptance probabilities.
Loci diversion operator: The loci diversion operator randomly changes the
location of recombination breakpoints of a recombination event.
Exchange operator: The exchange operator changes the attachment of edges in
the network while keeping the network length constant.
Subnetwork slide operator: The subnetwork slide operator changes the height of
nodes in the network while allowing the topology to change.
Scale operator: The scale operator scales the heights of the root node or the
whole network without changing the network topology.
Gibbs operator: The Gibbs operator efficiently samples any part of the network
that is older than the root of any segment of the alignment and is thus not
informed by any genetic data. It is the analog of the Gibbs operator in ref. 16 for
reassortment networks.
Empty loci preoperator: The empty loci preoperator augments the network with
edges that do not carry any loci for the duration of one of the above moves, to allow
for larger jumps in network space.
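To make the role of an operator concrete, the sketch below shows one Metropolis–Hastings step with a multiplicative (scale) proposal. It is a deliberately simplified stand-in: it acts on a single positive number rather than on network heights, the target density is a toy example, and all names are hypothetical rather than taken from BEAST2 or the Recombination package.

```python
import math
import random

def scale_move(x, log_posterior, rng, window=3.0):
    """One Metropolis-Hastings step with a multiplicative (scale) proposal."""
    m = math.exp(window * (rng.random() - 0.5))  # scale factor, symmetric on the log scale
    proposal = x * m
    # For a multiplier proposal on a single value, the Hastings ratio equals m
    log_alpha = log_posterior(proposal) - log_posterior(x) + math.log(m)
    if rng.random() < math.exp(min(0.0, log_alpha)):
        return proposal
    return x

def run_chain(log_posterior, x0, n_steps, seed=1):
    """Repeatedly apply the scale move and record the visited states."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_steps):
        x = scale_move(x, log_posterior, rng)
        samples.append(x)
    return samples

# Toy target: exponential density with mean 1, so the log density is -x for x > 0
samples = run_chain(lambda x: -x, x0=1.0, n_steps=50000)
print(sum(samples) / len(samples))
```

Operators that change the network topology follow the same accept or reject logic, but their Hastings ratios depend on how the move is constructed.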
One of the issues when inferring these recombination networks is that the root
height can be substantially larger than when not allowing for recombination events.
This can cause a computational issue when performing inferences. To circumvent
this, we truncate the recombination networks by reducing the recombination rate
some time after all positions of the sequence alignment have reached their common
ancestor height.
Validation and testing. We validate the implementation of the coalescent with
recombination network prior as well as all operators in Fig. S12. We also show that
truncating the recombination networks does not affect the sampling of recombination
networks prior to reaching the common ancestor height of all positions in
the sequence alignment.
We then tested whether we are able to infer recombination networks,
recombination rates, effective population sizes, and evolutionary parameters from
simulated data. To do so, we randomly simulated recombination networks under
the coalescent with recombination. On top of these, we then simulated multiple
sequence alignments. We then re-inferred the parameters used for simulation with our
MCMC approach. As shown in Fig. S13, these parameters are retrieved well from
simulated data, with little bias and accurate coverage of simulated parameters by
credible intervals.
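Bias and coverage in a simulation study of this kind can be summarized as in the following sketch. The function names are hypothetical and the numbers are made-up illustration values, not results from this study.

```python
def coverage(true_values, intervals):
    """Fraction of simulations whose true value lies inside its credible interval."""
    hits = sum(lo <= truth <= hi for truth, (lo, hi) in zip(true_values, intervals))
    return hits / len(true_values)

def relative_bias(true_values, estimates):
    """Mean of (estimate - truth) / truth across simulations."""
    return sum((e - t) / t for t, e in zip(true_values, estimates)) / len(true_values)

# Made-up example: three simulated rates with posterior medians and 95% intervals
truth = [1.0e-5, 1.5e-5, 2.0e-5]
medians = [1.1e-5, 1.4e-5, 2.1e-5]
intervals = [(0.8e-5, 1.3e-5), (1.0e-5, 1.45e-5), (1.7e-5, 2.6e-5)]
print(coverage(truth, intervals), relative_bias(truth, medians))
```

For well-calibrated 95% credible intervals, the coverage across many simulations is expected to be close to 0.95.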
We next tested how well we can retrieve individual recombination events. To do
so, we plot the location and timings of simulated recombination events for the first
9 out of 100 simulations. We then plot the density of recombination events in the
posterior distribution of networks, based on the timing and location of the inferred
breakpoint on the genome. As shown in Fig. S14, we are able to retrieve the true
(simulated) recombination events well.
We next tested how the speed of inference scales with the number of
recombination events, the number of samples in the dataset, and the
evolutionary rate. To do so, we simulated 300 recombination networks and
sequence alignments of length 10,000 under a Jukes–Cantor model with between
10 and 200 leaves and a recombination rate between $1 \times 10^{-5}$ and $2 \times 10^{-5}$
recombination events per site per year. This means that for each simulation,
there were between 0 and 100 recombination events, allowing us to investigate
how the inference scales in different settings. As shown in Fig. S15, the ESS per
hour decreases with the number of recombination events and samples, but not
with the evolutionary rate. In particular, the ESS per hour decreases much faster
with the number of recombination events in a dataset than with the number of
samples. This suggests that the method can currently be used more easily to
analyze a dataset with a large number of samples than one with a large number of
recombination events.
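The ESS per hour metric can be illustrated with the sketch below. It assumes a simple estimator that sums autocorrelations up to the first non-positive lag; this is an assumption made for illustration and may differ from the estimator used to produce Fig. S15. The function names are hypothetical.

```python
def effective_sample_size(trace):
    """ESS = n / (1 + 2 * sum of autocorrelations), truncated at the first non-positive lag."""
    n = len(trace)
    mean = sum(trace) / n
    centered = [x - mean for x in trace]
    var = sum(c * c for c in centered) / n
    if var == 0.0:
        raise ValueError("ESS is undefined for a constant trace")
    tau = 1.0
    for lag in range(1, n):
        acf = sum(centered[t] * centered[t + lag] for t in range(n - lag)) / (n * var)
        if acf <= 0.0:
            break
        tau += 2.0 * acf
    return n / tau

def ess_per_hour(trace, runtime_seconds):
    """Effective samples gained per hour of wall-clock time."""
    return effective_sample_size(trace) * 3600.0 / runtime_seconds

# A strongly autocorrelated toy trace yields far fewer effective samples than its length
sticky_trace = [0.0] * 50 + [1.0] * 50
print(effective_sample_size(sticky_trace), ess_per_hour(sticky_trace, 1800.0))
```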
We next tested how the choice of the prior distribution on the recombination
rate impacts the recombination rate estimate. To do so, we simulated 20
recombination networks and sequence alignments of length 10,000 under a
Jukes–Cantor model with 100 leaves and a recombination rate drawn randomly
from a log-normal distribution. We then inferred the recombination rates using 5
different recombination rate priors, shown in Fig. 5F, that put some or a lot of
weight on the wrong parameter values. As shown in Fig. 5A–E, we are able to infer
recombination rates, even with the wrong priors.
Additionally, we compared the effective sample size values from MCMC runs
inferring recombination networks for the MERS spike protein to treating the
evolutionary histories as trees. We find that although the effective sample size
values are lower when inferring recombination networks, they are not orders of
magnitude lower (see Fig. S16).
Recombination network summary. We implemented an algorithm to summarize
distributions of recombination networks similar to the maximum clade credibility
framework typically used to summarize trees in BEAST (ref. 56). In short, the algorithm
summarizes individual trees at each position in the alignment. To do so, we first
compute how often we encountered the same coalescent event at every position in
the alignment during the MCMC. We then choose the network that maximizes the
clade support over each position as the maximum clade credibility (MCC) network.
The MCC networks are logged in the extended Newick format (ref. 57) and can be
visualized in icytree.org (ref. 58). We here plotted the MCC networks using an adapted
version of baltic (https://github.com/evogytis/baltic).
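The selection step described above can be sketched as follows. The sketch makes two assumptions that are not stated in the text: each sampled network is reduced to one set of clades per alignment position, and the score of a network is the sum of log clade supports over all positions, as is common for MCC trees. All names are hypothetical.

```python
import math
from collections import Counter

def clade_support(samples):
    """Posterior frequency of each (position, clade) pair across sampled networks."""
    counts = Counter()
    for local_trees in samples:
        for pos, clades in enumerate(local_trees):
            for clade in clades:
                counts[(pos, clade)] += 1
    return {key: c / len(samples) for key, c in counts.items()}

def mcc_index(samples):
    """Index of the sampled network with the highest summed log clade support."""
    support = clade_support(samples)

    def score(local_trees):
        return sum(
            math.log(support[(pos, clade)])
            for pos, clades in enumerate(local_trees)
            for clade in clades
        )

    return max(range(len(samples)), key=lambda k: score(samples[k]))

# Toy posterior: three sampled networks, two alignment positions, tips a, b and c
ab, bc, abc = frozenset("ab"), frozenset("bc"), frozenset("abc")
samples = [
    [{ab, abc}, {ab, abc}],
    [{ab, abc}, {bc, abc}],
    [{bc, abc}, {bc, abc}],
]
print(mcc_index(samples))
```

In the toy posterior, the second network (index 1) is selected because its local trees carry the best-supported clade at both positions.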
Sequence data. The genetic sequence data for OC43, NL63, and 229E were
obtained from ViPR (http://www.viprbrc.org) and were the same as used in ref. 41. All
these sequences were isolated from a human host and downsampled from the
dataset used in ref. 41 to 100 sequences (for OC43 and NL63). As there were only
54 229E sequences, we did not do any downsampling on this data. The sequence
data for the MERS analyses were the same as described in ref. 38, but using a
randomly downsampled dataset of 100 sequences. For the SARS-like analyses, we
used 40 different deposited SARS-like genomes, mostly originating from bats, as
well as humans, and one pangolin-derived sequence.
Fig. 5 Impact of the recombination rate prior distribution on the inferred recombination rates. Here, we compare the inferred recombination rates
when using different prior distributions that differed from the distribution from which the rates for simulations were sampled. The rates for simulations
were sampled from a log-normal distribution with $\mu = -11.12$ and $\sigma = 0.5$. In A, we show the inferred rates when using a prior distribution with $\mu = -12.74$
and $\sigma = 0.5$ (leading to a 5 times lower mean in real space than the correct prior). In B, we show the inferred rates when using a prior distribution with
$\mu = -12.74$ and $\sigma = 2$. In C, we show the inferred rates when using the same prior distribution as was sampled under. In D, we show the inferred rates when
using a prior distribution with $\mu = -9.72$ and $\sigma = 2$. In E, we show the inferred rates when using a prior distribution with $\mu = -9.72$ and $\sigma = 0.5$ (leading to
a 5 times higher mean in real space than the correct prior). F shows the corresponding density plots for all log-normal distributions used as prior distributions
on the recombination rates.
Rates of adaptation. The rates of adaptation were calculated using a modification
of the McDonald–Kreitman method, as designed by Bhatt et al. (ref. 40), and implemented
in ref. 41. Briefly, for each virus, we aligned the sequence of each gene or
genomic region. Then, we split the alignment into 3-year sliding windows, each
containing a minimum of 3 sequenced isolates. We used the consensus sequence at
the first time point as the outgroup. A comparison of the outgroup to the alignment
of each subsequent temporal window yielded a measure of synonymous and non-synonymous
fixations and polymorphisms at each position in the alignment. This
approach requires having sequence data gathered over relatively long time periods
where the consensus genome allows for an accurate description of the long-term
evolutionary patterns and, as such, would not be adequate for a pathogen with a
relatively short evolutionary history, such as SARS-CoV-2. We used proportional
site counting for these estimations (ref. 59). We assumed that selectively neutral
sites are all silent mutations as well as replacement polymorphisms occurring at
frequencies between 0.15 and 0.75 (ref. 40). We identified adaptive substitutions as non-synonymous
fixations and high-frequency polymorphisms that exceed the neutral
expectation. We then estimated the rate of adaptation (per codon per year) using
linear regression of the number of adaptive substitutions inferred at each time
point. In order to compute the 5′ spike and 3′ spike rates of adaptation, we used the
weighted average of all coding regions to the left (upstream) or right (downstream)
of the spike gene, respectively, using the length of the individual sections as
weights. We estimated the uncertainty by running the same analysis on 100
bootstrapped outgroups and alignments.
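The last two steps, the regression slope and the length-weighted average, can be sketched as follows. This is an illustration under simple assumptions (ordinary least squares, adaptive substitutions already expressed per codon); the function names and all numbers are hypothetical and are not taken from the analysis in ref. 41.

```python
def least_squares_slope(times, counts):
    """Slope of an ordinary least-squares line: adaptive substitutions gained per year."""
    n = len(times)
    mean_t = sum(times) / n
    mean_c = sum(counts) / n
    sxy = sum((t - mean_t) * (c - mean_c) for t, c in zip(times, counts))
    sxx = sum((t - mean_t) ** 2 for t in times)
    return sxy / sxx

def weighted_average(rates, lengths):
    """Average of per-region rates, weighted by the length of each region."""
    return sum(r * w for r, w in zip(rates, lengths)) / sum(lengths)

# Made-up example: adaptive substitutions per codon at four time points (years)
years = [0.0, 3.0, 6.0, 9.0]
adaptive_per_codon = [0.0, 0.003, 0.005, 0.009]
print(least_squares_slope(years, adaptive_per_codon))

# Made-up per-region rates upstream of spike, weighted by region length
print(weighted_average([1.0e-4, 3.0e-4], [900, 2700]))
```

Repeating the slope estimate on bootstrapped outgroups and alignments, as described above, gives a distribution of rates from which the uncertainty can be read off.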
Reporting summary. Further information on research design is available in the Nature
Research Reporting Summary linked to this article.
Data availability
The BEAST2 input xml files for all coronavirus analyses in this manuscript, as well as the
files used to post-process these analyses, are available from https://github.com/nicfel/Recombination-Material
and in ref. 60. The xml files include the sequence data and
exact input specification of the coronavirus analyses performed in this manuscript,
except for the sequences published on GISAID. The acknowledgment table for the four
GISAID sequences used for the SARS-like analyses is provided in Supplementary Note 1.
The GenBank accession numbers for the 229E, OC43, NL63, SARS-like, and MERS
analyses are provided as separate tables in Supplementary Data 1. The MERS sequences
without accession numbers are used from ref. 38. Source data are provided with
this paper.
Code availability
The Recombination package is implemented as an add-on to the Bayesian phylogenetics
software platform BEAST2 (ref. 27). All MCMC analyses performed here were run using
adaptive parallel tempering (ref. 61). The source code is available at https://github.com/nicfel/Recombination
and in ref. 62. We additionally provide a tutorial on how to set up and
post-process an analysis at https://github.com/nicfel/Recombination-Tutorial. The MCC
networks are plotted using an adapted version of baltic (https://github.com/evogytis/baltic).
All other plots are done in R using ggplot2 (ref. 63) and gggenes (ref. 64).
Received: 5 May 2021; Accepted: 30 June 2022;
References
1. Andersen, K. G., Rambaut, A., Lipkin, W. I., Holmes, E. C. & Garry, R. F. The proximal origin of SARS-CoV-2. Nat. Med. 26, 450–452 (2020).
2. Bedford, T. et al. Cryptic transmission of SARS-CoV-2 in Washington State. Science 370, 571–575 (2020).
3. Volz, E. et al. Evaluating the effects of SARS-CoV-2 spike mutation D614G on transmissibility and pathogenicity. Cell 184, 64–75 (2021).
4. Grenfell, B. T. et al. Unifying the epidemiological and evolutionary dynamics of pathogens. Science 303, 327–332 (2004).
5. Kim, E.-Y. et al. Human APOBEC3 induced mutation of human immunodeficiency virus type-1 contributes to adaptation and evolution in natural infection. PLoS Pathog. 10, e1004281 (2014).
6. Simon-Loriere, E. & Holmes, E. C. Why do RNA viruses recombine? Nat. Rev. Microbiol. 9, 617–626 (2011).
7. McDonald, S. M., Nelson, M. I., Turner, P. E. & Patton, J. T. Reassortment in segmented RNA viruses: mechanisms and outcomes. Nat. Rev. Microbiol. 14, 448 (2016).
8. Su, S. et al. Epidemiology, genetic recombination, and pathogenesis of coronaviruses. Trends Microbiol. 24, 490–502 (2016).
9. Lai, M. RNA recombination in animal and plant viruses. Microbiol. Mol. Biol. Rev. 56, 61–79 (1992).
10. Banner, L. R. & Mc Lai, M. Random nature of coronavirus RNA recombination in the absence of selection pressure. Virology 185, 441–445 (1991).
11. Bobay, L.-M., O'Donnell, A. C. & Ochman, H. Recombination events are concentrated in the spike protein region of betacoronaviruses. PLoS Genet. 16, e1009272 (2020).
12. Barton, N. A general model for the evolution of recombination. Genet. Res. 65, 123–144 (1995).
13. Feldman, M. W., Christiansen, F. B. & Brooks, L. D. Evolution of recombination in a constant environment. Proc. Natl Acad. Sci. USA 77, 4838–4841 (1980).
14. Hill, W. G. & Robertson, A. The effect of linkage on limits to artificial selection. Genet. Res. 8, 269–294 (1966).
15. Posada, D. & Crandall, K. A. The effect of recombination on the accuracy of phylogeny estimation. J. Mol. Evol. 54, 396–402 (2002).
16. Müller, N. F., Stolz, U., Dudas, G., Stadler, T. & Vaughan, T. G. Bayesian inference of reassortment networks reveals fitness benefits of reassortment in human influenza viruses. Proc. Natl Acad. Sci. USA 117, 17104–17111 (2020).
17. Hudson, R. R. Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol. 23, 183–201 (1983).
18. Didelot, X., Lawson, D., Darling, A. & Falush, D. Inference of homologous recombination in bacteria using whole-genome sequences. Genetics 186, 1435–1449 (2010).
19. Vaughan, T. G. et al. Inferring ancestral recombination graphs from bacterial genomic data. Genetics 205, 857–870 (2017).
20. Rasmussen, M. D., Hubisz, M. J., Gronau, I. & Siepel, A. Genome-wide inference of ancestral recombination graphs. PLoS Genet. 10, e1004342 (2014).
21. McVean, G. A. & Cardin, N. J. Approximating the coalescent with recombination. Philos. Trans. R. Soc. B: Biol. Sci. 360, 1387–1393 (2005).
22. Bloomquist, E. W. & Suchard, M. A. Unifying vertical and nonvertical evolution: a stochastic ARG-based framework. Syst. Biol. 59, 27–41 (2010).
23. Meng, C. & Kubatko, L. S. Detecting hybrid speciation in the presence of incomplete lineage sorting using gene tree incongruence: a model. Theor. Popul. Biol. 75, 35–45 (2009).
24. Yu, Y., Dong, J., Liu, K. J. & Nakhleh, L. Maximum likelihood inference of reticulate evolutionary histories. Proc. Natl Acad. Sci. USA 111, 16448–16453 (2014).
25. Bryant, D. & Moulton, V. Neighbor-net: an agglomerative method for the construction of phylogenetic networks. Mol. Biol. Evol. 21, 255–265 (2004).
26. Huson, D. H. & Bryant, D. Application of phylogenetic networks in evolutionary studies. Mol. Biol. Evol. 23, 254–267 (2006).
27. Bouckaert, R., Vaughan, T. G., Barido-Sottani, J., Duchêne, S., Fourment, M. et al. BEAST 2.5: an advanced software platform for Bayesian evolutionary analysis. PLoS Comput. Biol. 15, e1006650 https://doi.org/10.1371/journal.pcbi.1006650 (2019).
28. Hon, C.-C. et al. Evidence of the recombinant origin of a bat severe acute respiratory syndrome (SARS)-like coronavirus and its implications on the direct ancestor of SARS coronavirus. J. Virol. 82, 1819–1826 (2008).
29. Li, X. et al. Emergence of SARS-CoV-2 through recombination and strong purifying selection. Sci. Adv. 6, eabb9153 (2020).
30. Boni, M. F. et al. Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic. Nat. Microbiol. 5, 1408–1417 (2020).
31. Ge, X.-Y. et al. Isolation and characterization of a bat SARS-like coronavirus that uses the ACE2 receptor. Nature 503, 535–538 (2013).
32. Ge, X.-Y. et al. Coexistence of multiple coronaviruses in several bat colonies in an abandoned mineshaft. Virol. Sin. 31, 31–40 (2016).
33. Zhou, H. et al. A novel bat coronavirus closely related to SARS-CoV-2 contains natural insertions at the S1/S2 cleavage site of the spike protein. Curr. Biol. 30, 2196–2203 (2020).
34. Lam, T. T.-Y. et al. Identifying SARS-CoV-2-related coronaviruses in Malayan pangolins. Nature 583, 282–285 (2020).
35. Duchene, S. et al. Temporal signal and the phylodynamic threshold of SARS-CoV-2. Virus Evol. 6, veaa061 (2020).
36. Duchêne, S., Holmes, E. C. & Ho, S. Y. Analyses of evolutionary dynamics in viruses are hindered by a time-dependent bias in rate estimates. Proc. R. Soc. B: Biol. Sci. 281, 20140732 (2014).
37. Nickbakhsh, S. et al. Epidemiology of seasonal coronaviruses: establishing the context for the emergence of coronavirus disease 2019. J. Infect. Dis. 222, 17–25 (2020).
38. Dudas, G., Carvalho, L. M., Rambaut, A. & Bedford, T. MERS-CoV spillover at the camel-human interface. Elife 7, e31257 (2018).
39. Reusken, C. B. et al. Geographic distribution of MERS coronavirus among dromedary camels, Africa. Emerg. Infect. Dis. 20, 1370 (2014).
40. Bhatt, S., Holmes, E. C. & Pybus, O. G. The genomic rate of molecular adaptation of the human influenza A virus. Mol. Biol. Evol. 28, 2443–2451 (2011).
41. Kistler, K. E. & Bedford, T. Evidence for adaptive evolution in the receptor-binding domain of seasonal coronaviruses OC43 and 229E. Elife 10, e64509 (2021).
42. Walls, A. C. et al. Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein. Cell 181, 281–292 (2020).
43. Nachman, M. W. Variation in recombination rate across the genome: evidence and implications. Curr. Opin. Genet. Dev. 12, 657–663 (2002).
44. Turakhia, Y. et al. Pandemic-scale phylogenomics reveals elevated recombination rates in the SARS-CoV-2 spike region. Preprint at https://doi.org/10.1101/2021.08.04.455157 (2021).
45. VanInsberghe, D., Neish, A. S., Lowen, A. C. & Koelle, K. Recombinant SARS-CoV-2 genomes circulated at low levels over the first year of the pandemic. Virus Evol. 7, veab059 https://doi.org/10.1093/ve/veab059 (2021).
46. Jackson, B. et al. Generation and transmission of interlineage recombinants in the SARS-CoV-2 pandemic. Cell 184, 5179–5188 (2021).
47. Varabyou, A., Pockrandt, C., Salzberg, S. L. & Pertea, M. Rapid detection of inter-clade recombination in SARS-CoV-2 with Bolotie. Genetics 218, iyab074 (2021).
48. Ignatieva, A., Hein, J. & Jenkins, P. A. Ongoing recombination in SARS-CoV-2 revealed through genealogical reconstruction. Mol. Biol. Evol. 39, msac028 https://doi.org/10.1093/molbev/msac028 (2022).
49. Yang, Z., Kumar, S. & Nei, M. A new method of inference of ancestral nucleotide and amino acid sequences. Genetics 141, 1641–1650 (1995).
50. Neches, R. Y., McGee, M. D. & Kyrpides, N. C. Recombination should not be an afterthought. Nat. Rev. Microbiol. 18, 606–606 (2020).
51. Stadler, T. On incomplete sampling under birth–death models and connections to the sampling-based coalescent. J. Theor. Biol. 261, 58–66 (2009).
52. Hudson, R. R. et al. Gene genealogies and the coalescent process. Oxf. Surv. Evol. Biol. 7, 44 (1990).
53. Lemey, P., Rambaut, A., Drummond, A. J. & Suchard, M. A. Bayesian phylogeography finds its roots. PLoS Comput. Biol. 5, e1000520 (2009).
54. Felsenstein, J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17, 368–376 (1981).
55. Bordewich, M., Linz, S. & Semple, C. Lost in space? Generalising subtree prune and regraft to spaces of phylogenetic networks. J. Theor. Biol. 423, 1–12 (2017).
56. Heled, J. & Bouckaert, R. R. Looking for trees in the forest: summary tree from posterior samples. BMC Evol. Biol. 13, 1–11 (2013).
57. Cardona, G., Rosselló, F. & Valiente, G. Extended Newick: it is time for a standard representation of phylogenetic networks. BMC Bioinform. 9, 1–8 (2008).
58. Vaughan, T. G. IcyTree: rapid browser-based visualization for phylogenetic trees and networks. Bioinformatics 33, 2392–2394 (2017).
59. Bhatt, S., Katzourakis, A. & Pybus, O. G. Detecting natural selection in RNA virus populations using sequence summary statistics. Infect. Genet. Evol. 10, 421–430 (2010).
60. Müller, N. F. nicfel/Recombination-Material: Release for Nat. comm. recombination manuscript. https://doi.org/10.5281/zenodo.6600818 (2022).
61. Müller, N. F. & Bouckaert, R. R. Adaptive Metropolis-coupled MCMC for BEAST 2. PeerJ 8, e9473 (2020).
62. Müller, N. F. nicfel/Recombination: adds common ancestor heights logger to beauti. https://doi.org/10.5281/zenodo.5076684 (2021).
63. Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer, 2016).
64. Wilkins, D. gggenes: draw gene arrow maps in ggplot2. R package version 0.4.0 (2019).
Acknowledgements
We would like to thank Timothy G. Vaughan for his helpful insights into the implementation
of the software. N.F.M. is funded by the Swiss National Science Foundation
(P2EZP3_191891). K.E.K. is an NSF GRFP Fellow (DGE-1762114). T.B. is a Pew Biomedical
Scholar and is supported by NIH R35 GM119774. The Scientific Computing
Infrastructure at Fred Hutch is supported by NIH ORIP S10OD028685.
Author contributions
N.F.M. and T.B. conceived and designed the experiments. N.F.M. and K.E.K. performed
the statistical analysis and analyzed the data. N.F.M. implemented the software. N.F.M.,
K.E.K., and T.B. wrote the paper.
Competing interests
The authors declare no competing interests.
Additional information
Supplementary information The online version contains supplementary material
available at https://doi.org/10.1038/s41467-022-31749-8.
Correspondence and requests for materials should be addressed to Nicola F. Müller.
Peer review information Nature Communications thanks the anonymous reviewers for
their contribution to the peer review of this work. Peer reviewer reports are available.
Reprints and permission information is available at http://www.nature.com/reprints
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons
Attribution 4.0 International License, which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative
Commons license, and indicate if changes were made. The images or other third party
material in this article are included in the article's Creative Commons license, unless
indicated otherwise in a credit line to the material. If material is not included in the
article's Creative Commons license and your intended use is not permitted by statutory
regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2022
NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-022-31749-8 ARTICLE
NATURE COMMUNICATIONS | (2022) 13:4186 | https://doi.org/10.1038/s41467-022-31749-8 | www.nature.com/naturecommunications 9
Content courtesy of Springer Nature, terms of use apply. Rights reserved
1.
2.
3.
4.
5.
6.
Terms and Conditions
Springer Nature journal content, brought to you courtesy of Springer Nature Customer Service Center GmbH (“Springer Nature”).
Springer Nature supports a reasonable amount of sharing of research papers by authors, subscribers and authorised users (“Users”), for small-
scale personal, non-commercial use provided that all copyright, trade and service marks and other proprietary notices are maintained. By
accessing, sharing, receiving or otherwise using the Springer Nature journal content you agree to these terms of use (“Terms”). For these
purposes, Springer Nature considers academic use (by researchers and students) to be non-commercial.
These Terms are supplementary and will apply in addition to any applicable website terms and conditions, a relevant site licence or a personal
subscription. These Terms will prevail over any conflict or ambiguity with regards to the relevant terms, a site licence or a personal subscription
(to the extent of the conflict or ambiguity only). For Creative Commons-licensed articles, the terms of the Creative Commons license used will
apply.
We collect and use personal data to provide access to the Springer Nature journal content. We may also use these personal data internally within
ResearchGate and Springer Nature and as agreed share it, in an anonymised way, for purposes of tracking, analysis and reporting. We will not
otherwise disclose your personal data outside the ResearchGate or the Springer Nature group of companies unless we have your permission as
detailed in the Privacy Policy.
While Users may use the Springer Nature journal content for small scale, personal non-commercial use, it is important to note that Users may
not:
use such content for the purpose of providing other users with access on a regular or large scale basis or as a means to circumvent access
control;
use such content where to do so would be considered a criminal or statutory offence in any jurisdiction, or gives rise to civil liability, or is
otherwise unlawful;
falsely or misleadingly imply or suggest endorsement, approval , sponsorship, or association unless explicitly agreed to by Springer Nature in
writing;
use bots or other automated methods to access the content or redirect messages
override any security feature or exclusionary protocol; or
share the content in order to create substitute for Springer Nature products or services or a systematic database of Springer Nature journal
content.
In line with the restriction against commercial use, Springer Nature does not permit the creation of a product or service that creates revenue,
royalties, rent or income from our content or its inclusion as part of a paid for service or for other commercial gain. Springer Nature journal
content cannot be used for inter-library loans and librarians may not upload Springer Nature journal content on a large scale into their, or any
other, institutional repository.
These terms of use are reviewed regularly and may be amended at any time. Springer Nature is not obligated to publish any information or
content on this website and may remove it or features or functionality at our sole discretion, at any time with or without notice. Springer Nature
may revoke this licence to you at any time and remove access to any copies of the Springer Nature journal content which have been saved.
To the fullest extent permitted by law, Springer Nature makes no warranties, representations or guarantees to Users, either express or implied
with respect to the Springer nature journal content and all parties disclaim and waive any implied warranties or warranties imposed by law,
including merchantability or fitness for any particular purpose.
Please note that these rights do not automatically extend to content, data or other material published by Springer Nature that may be licensed
from third parties.
If you would like to use or distribute our Springer Nature journal content to a wider audience or on a regular basis or in any other manner not
expressly permitted by these Terms, please contact Springer Nature at
onlineservice@springernature.com
... The empirical networks were sampled from[Maier et al., 2023, Figs. 3a-c (left), 4a-c (left)] (reported as estimated byBergström et al. [2020],Librado et al. [2021],Hajdinjak et al. [2021],,,Sikora et al. [2019]),[Lazaridis et al., 2014, Fig. 3],[Nielsen et al., 2023, Fig. 3 (left)],[Sun et al., 2023, Fig. 4c],[Müller et al., 2022, Fig. 1a],[Neureiter et al., 2022, Fig. 5a]; fit by these authors using ADMIXTOOLS Patterson et al. [2012], Maier et al. [2023], admixturegraph Leppälä et al. [2017], OrientAGraph Molloy et al. [2021], contacTrees Neureiter et al. [2022], Recombination Müller et al. [2022], AdmixtureBayesNielsen et al. [2023]. The simulated networks were obtained by subsampling 10 networks per parameter scenario simulated byJustison and Heath [2024], then filtering out networks of treewidth 1 (trees, possibly with parallel hybrid edges). ...
... al. (2020b): n = 12, ℓ = 12, h = 12, k = 6, k * = 3Müller et al. (2022): n = 40, ℓ = 358, h = 361, k = 54, k * = Accuracy of loopy BP. Approximation of the conditional distribution of the root state X ρ (left and center) and log-likelihood (right) using a greedy minimum-fill clique tree U and a join-graph structuring cluster graph U * for two networks of varying complexityMüller et al. [2022], as measured by their number of tips (n), level (ℓ), number of hybrids (h), maximum clique size (k), and maximum cluster size (k * ). ...
... al. (2020b): n = 12, ℓ = 12, h = 12, k = 6, k* = 3. Müller et al. (2022): n = 40, ℓ = 358, h = 361, k = 54, k* = ... Accuracy of loopy BP. Approximation of the conditional distribution of the root state X_ρ (left and center) and log-likelihood (right) using a greedy minimum-fill clique tree U and a join-graph structuring cluster graph U* for two networks of varying complexity Müller et al. [2022], as measured by their number of tips (n), level (ℓ), number of hybrids (h), maximum clique size (k), and maximum cluster size (k*). For U, estimates are exact after one iteration and shown as horizontal red lines. ...
Preprint
The evolution of molecular and phenotypic traits is commonly modelled using Markov processes along a rooted phylogeny. This phylogeny can be a tree, or a network if it includes reticulations, representing events such as hybridization or admixture. Computing the likelihood of data observed at the leaves is costly as the size and complexity of the phylogeny grow. Efficient algorithms exist for trees, but cannot be applied to networks. We show that a vast array of models for trait evolution along phylogenetic networks can be reformulated as graphical models, for which efficient belief propagation algorithms exist. We provide a brief review of belief propagation on general graphical models, then focus on linear Gaussian models for continuous traits. We show how belief propagation techniques can be applied for exact or approximate (but more scalable) likelihood and gradient calculations, and prove novel results for efficient parameter inference of some models. We highlight the possible fruitful interactions between graphical models and phylogenetic methods. For example, approximate likelihood approaches have the potential to greatly reduce computational costs for phylogenies with reticulations.
... In practice, the evolutionary histories of many human pathogenic viruses violate this assumption through processes of reassortment or recombination, as seen in seasonal influenza [5,6] and seasonal coronaviruses [7], respectively. Researchers account for these evolutionary mechanisms by limiting their analyses to individual genes [8,9], combining multiple genes despite their different evolutionary histories [10], or developing more sophisticated models to represent the joint likelihoods of multiple co-evolving lineages with ancestral reassortment or recombination graphs [11,12]. However, several key questions in genomic epidemiology do not require inference of ancestral relationships and states, and therefore may be amenable to non-phylogenetic approaches for summarizing genetic relationships. ...
... As in that previous study, we scaled the number of ... We simulated coronavirus-like populations as previously described for human seasonal coronaviruses with genomes of 21,285 bp [12]. For the current study, we assigned 30 generations per real year to obtain mutation rates similar to the 8 × 10^−4 substitutions per site per year estimated for SARS-CoV-2 [66]. ...
... For the current study, we assigned 30 generations per real year to obtain mutation rates similar to the 8 × 10^−4 substitutions per site per year estimated for SARS-CoV-2 [66]. To account for the effect of recombination on optimal method parameters, we simulated populations with a recombination rate of 10^−5 events per site per year based on human seasonal coronaviruses for which recombination rates are well-studied [12,67]. We calibrated the overall recombination probability in SANTA-SIM such that the number of observed ... Optimization of embedding method parameters. We identified optimal parameter values for each embedding method with time series cross-validation of embeddings based on simulated populations [68]. ...
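The rates quoted in this excerpt can be put on a per-generation and per-genome scale with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the genome length of 21,285 bp is taken from the preceding excerpt, and dividing a yearly rate evenly across 30 generations is an assumption made here, not a statement of how the cited simulation was parameterized.

```python
# Values quoted in the excerpts above.
SUBS_PER_SITE_PER_YEAR = 8e-4      # substitution rate estimated for SARS-CoV-2
RECOMB_PER_SITE_PER_YEAR = 1e-5    # recombination rate used in the simulations
GENERATIONS_PER_YEAR = 30          # generations assigned per real year
GENOME_LENGTH_BP = 21_285          # simulated genome length


def per_generation(rate_per_year, generations_per_year=GENERATIONS_PER_YEAR):
    """Convert a per-year rate to a per-generation rate.

    ASSUMPTION: the yearly rate is spread evenly over all generations.
    """
    return rate_per_year / generations_per_year


def per_genome(rate_per_site, genome_length=GENOME_LENGTH_BP):
    """Scale a per-site rate to a whole-genome rate."""
    return rate_per_site * genome_length


if __name__ == "__main__":
    print(per_generation(SUBS_PER_SITE_PER_YEAR))
    print(per_genome(SUBS_PER_SITE_PER_YEAR))
    print(per_genome(RECOMB_PER_SITE_PER_YEAR))
```

Under these assumptions the substitution rate is roughly 2.7 × 10^−5 per site per generation, or about 17 substitutions per genome per year, and the recombination rate corresponds to about 0.21 events per genome per year.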
Preprint
Public health researchers and practitioners commonly infer phylogenies from viral genome sequences to understand transmission dynamics and identify clusters of genetically-related samples. However, viruses that reassort or recombine violate phylogenetic assumptions and require more sophisticated methods. Even when phylogenies are appropriate, they can be unnecessary or difficult to interpret without specialty knowledge. For example, pairwise distances between sequences can be enough to identify clusters of related samples or assign new samples to existing phylogenetic clusters. In this work, we tested whether dimensionality reduction methods could capture known genetic groups within two human pathogenic viruses that cause substantial human morbidity and mortality and frequently reassort or recombine, respectively: seasonal influenza A/H3N2 and SARS-CoV-2. We applied principal component analysis (PCA), multidimensional scaling (MDS), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP) to sequences with well-defined phylogenetic clades and either reassortment (H3N2) or recombination (SARS-CoV-2). For each low-dimensional embedding of sequences, we calculated the correlation between pairwise genetic and Euclidean distances in the embedding and applied a hierarchical clustering method to identify clusters in the embedding. We measured the accuracy of clusters compared to previously defined phylogenetic clades, reassortment clusters, or recombinant lineages. We found that MDS maintained the strongest correlation between pairwise genetic and Euclidean distances between sequences and best captured the intermediate placement of recombinant lineages between parental lineages. Clusters from t-SNE most accurately recapitulated known phylogenetic clades and recombinant lineages. Both MDS and t-SNE accurately identified reassortment groups. 
We show that simple statistical methods without a biological model can accurately represent known genetic relationships for relevant human pathogenic viruses. Our open source implementation of these methods for analysis of viral genome sequences can be easily applied when phylogenetic methods are either unnecessary or inappropriate. Author summary To track the progress of viral epidemics, public health researchers often need to identify groups of genetically-related samples. A common approach to find these groups involves inferring the complete evolutionary history of virus samples using phylogenetic methods. However, these methods assume that new viruses descend from a single parent, while many viruses including seasonal influenza and SARS-CoV-2 produce offspring through a form of sexual reproduction that violates this assumption. Additionally, phylogenies may be unnecessarily complex or unintuitive when researchers only need to find and visualize clusters of related samples. We tested an alternative approach by applying widely-used statistical methods (PCA, MDS, t-SNE, and UMAP) to create 2- or 3-dimensional maps of virus samples from their pairwise genetic distances and identify clusters of samples that place close together in these maps. We found that these statistical methods without an underlying biological model could accurately capture known genetic relationships in populations of seasonal influenza and SARS-CoV-2 even in the presence of sexual reproduction. The conceptual and practical simplicity of our open source implementation of these methods enables researchers to visualize and compare human pathogenic virus samples when phylogenetic methods are unnecessary or inappropriate.
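The evaluation described in this abstract, correlating pairwise genetic distances with Euclidean distances in a low-dimensional embedding, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes aligned sequences of equal length, uses Hamming distance as the genetic distance, and takes the embedding coordinates as given (for example from MDS or t-SNE computed elsewhere). All function names are hypothetical.

```python
from itertools import combinations
from math import dist, sqrt


def hamming(a, b):
    """Count mismatched positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def genetic_vs_embedding_correlation(seqs, coords):
    """Correlate pairwise Hamming distances with pairwise Euclidean
    distances between the corresponding embedding coordinates."""
    pairs = list(combinations(range(len(seqs)), 2))
    genetic = [hamming(seqs[i], seqs[j]) for i, j in pairs]
    euclidean = [dist(coords[i], coords[j]) for i, j in pairs]
    return pearson(genetic, euclidean)


if __name__ == "__main__":
    # Toy data: four short aligned sequences and a made-up 2-D embedding.
    toy_seqs = ["AAAA", "AAAT", "AATT", "ATTT"]
    toy_coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
    print(genetic_vs_embedding_correlation(toy_seqs, toy_coords))
```

A correlation close to 1 indicates that the embedding preserves the ordering and relative size of genetic distances; the toy coordinates here were chosen so that the two sets of distances match exactly.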
... The phylogenetic network is a powerful tool in evolutionary studies that considers recombinant and horizontal gene transfer among populations (51). Recently, this approach has been widely applied in the evolution of, for example, SARS-CoV2 and influenza virus (52)(53)(54)(55)(56). In the influenza virus, a segmented negative-strand virus, there is a high rate of reassortment during the evolution of all circulating influenza A virus serotypes H1N1, pH1N1, H2N2, H3N2, and influenza B virus (55). ...
... Recently, this approach has been widely applied in the evolution of, for example, SARS-CoV2 and influenza virus (52)(53)(54)(55)(56). In the influenza virus, a segmented negative-strand virus, there is a high rate of reassortment during the evolution of all circulating influenza A virus serotypes H1N1, pH1N1, H2N2, H3N2, and influenza B virus (55). In this study, we applied this approach to identify the reassortment events that occurred during the evolution of this strain. ...
Article
Mammalian orthoreoviruses (MRVs) infect a wide range of hosts, including humans, livestock, and wildlife. In the present study, we isolated a novel Mammalian orthoreovirus from the intestine of a microbat ( Myotis aurascens ) and investigated its biological and pathological characteristics. Phylogenetic analysis indicated that the new isolate was serotype 2, sharing the segments with those from different hosts. Our results showed that it can infect a wide range of cell lines from different mammalian species, including human, swine, and non-human primate cell lines. Additionally, media containing trypsin, yeast extract, and tryptose phosphate broth promoted virus propagation in primate cell lines and most human cell lines, but not in A549 and porcine cell lines. Mice infected with this strain via the intranasal route, but not via the oral route, exhibited weight loss and respiratory distress. The virus is distributed in a broad range of organs and causes lung damage. In vitro and in vivo experiments also suggested that the new virus could be a neurotropic infectious strain that can infect a neuroblastoma cell line and replicate in the brains of infected mice. Additionally, it caused a delayed immune response, as indicated by the high expression levels of cytokines and chemokines only at 14 days post-infection (dpi). These data provide an important understanding of the genetics and pathogenicity of mammalian orthoreoviruses in bats at risk of spillover infections. IMPORTANCE Mammalian orthoreoviruses (MRVs) have a broad range of hosts and can cause serious respiratory and gastroenteritis diseases in humans and livestock. Some strains infect the central nervous system, causing severe encephalitis. In this study, we identified BatMRV2/SNU1/Korea/2021, a reassortment of MRV serotype 2, isolated from bats with broad tissue tropism, including the neurological system. In addition, it has been shown to cause respiratory syndrome in mouse models. 
The given data will provide more evidence of the risk of mammalian orthoreovirus transmission from wildlife to various animal species and the sources of spillover infections.
... Recombination requires co-circulation and co-infection in the same host; the clinical and epidemiological relevance is substantial since recombinant viral strains have been associated with altered viral host tropism, enhanced virulence, host immune evasion, and the development of resistance to antivirals [1,2]. In light of these considerations, and in hindsight from the recent global scale COVID-19 epidemic, the need for the development of novel and rapid methods to identify recombination has been increasingly recognized by international health authorities and researchers [3,4]. Phylogenetic analyses are essential to monitoring the spread and evolution of viruses [5]. ...
Article
Recombination is a key molecular mechanism for the evolution and adaptation of viruses. The first recombinant SARS-CoV-2 genomes were recognized in 2021; as of today, more than ninety SARS-CoV-2 lineages are designated as recombinant. In the wake of the COVID-19 pandemic, several methods for detecting recombination in SARS-CoV-2 have been proposed; however, none could faithfully confirm manual analyses by experts in the field. We hereby present RecombinHunt, an original data-driven method for the identification of recombinant genomes, capable of recognizing recombinant SARS-CoV-2 genomes (or lineages) with one or two breakpoints with high accuracy and within reduced turn-around times. RecombinHunt shows high specificity and sensitivity, compares favorably with other state-of-the-art methods, and faithfully confirms manual analyses by experts. RecombinHunt identifies recombinant viral genomes from the recent monkeypox epidemic in high concordance with manually curated analyses by experts, suggesting that our approach is robust and can be applied to any epidemic/pandemic virus.
... First, the phylogenies inferred using the Bayesian approach cannot be used to represent horizontal gene transfer and recombination events which have occurred during the evolutionary timeline under study. Such recombination events have been shown to affect the tree topologies, and thus could influence the TMRCA estimates as well [9,10]. Their inclusion in a future study could offer a more nuanced and complex explanation of the evolution of SARS-CoV-2 and the related betacoronaviruses. ...
Article
Understanding the evolution of Severe Acute Respiratory Syndrome Coronavirus (SARS-CoV-2) and its relationship to other coronaviruses in the wild is crucial for preventing future virus outbreaks. While the origin of the SARS-CoV-2 pandemic remains uncertain, mounting evidence suggests the direct involvement of the bat and pangolin coronaviruses in the evolution of the SARS-CoV-2 genome. To unravel the early days of a probable zoonotic spillover event, we analyzed genomic data from various coronavirus strains from both human and wild hosts. Bayesian phylogenetic analysis was performed using multiple datasets, using strict and relaxed clock evolutionary models to estimate the occurrence times of key speciation, gene transfer, and recombination events affecting the evolution of SARS-CoV-2 and its closest relatives. We found strong evidence supporting the presence of temporal structure in datasets containing SARS-CoV-2 variants, enabling us to estimate the time of SARS-CoV-2 zoonotic spillover between August and early October 2019. In contrast, datasets without SARS-CoV-2 variants provided mixed results in terms of temporal structure. However, they allowed us to establish that the presence of a statistically robust clade in the phylogenies of gene S and its receptor-binding (RBD) domain, including two bat (BANAL) and two Guangdong pangolin coronaviruses (CoVs), is due to the horizontal gene transfer of this gene from the bat CoV to the pangolin CoV that occurred in the middle of 2018. Importantly, this clade is closely located to SARS-CoV-2 in both phylogenies. This phylogenetic proximity had been explained by an RBD gene transfer from the Guangdong pangolin CoV to a very recent ancestor of SARS-CoV-2 in some earlier works in the field before the BANAL coronaviruses were discovered. Overall, our study provides valuable insights into the timeline and evolutionary dynamics of the SARS-CoV-2 pandemic.
... For SARS-CoV-2, studies have shown that recombination breakpoints occur disproportionately in the region encoding the spike protein or close to transcriptional regulatory sequence (TRS) sites [221][222][223]. ...
Article
The zoonotic emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and the ensuing coronavirus disease 2019 (COVID-19) pandemic have profoundly affected our society. The rapid spread and continuous evolution of new SARS-CoV-2 variants continue to threaten global public health. Recent scientific advances have dissected many of the molecular and cellular mechanisms involved in coronavirus infections, and large-scale screens have uncovered novel host-cell factors that are vitally important for the virus life cycle. In this Review, we provide an updated summary of the SARS-CoV-2 life cycle, gene function and virus-host interactions, including recent landmark findings on general aspects of coronavirus biology and newly discovered host factors necessary for virus replication.
... Thus, recombination could still have been an important factor in the emergence of SARS-CoV-2, involving regions other than the variable loop of the RBD. In this regard, signals of recombination events involving the SARS-CoV-2 lineage have been detected on the 5′ and 3′ ends of the S gene [27]. In a recent study [18], a recombination event was detected at the beginning of the SARS-CoV-2 RBD, associated with these viruses, however, it extended beyond the region included in our analysis. ...
Article
SARS-CoV-2 can infect human cells through the recognition of the human angiotensin-converting enzyme 2 receptor. This affinity is given by six amino acid residues located in the variable loop of the receptor binding domain (RBD) within the Spike protein. Genetic recombination involving bat and pangolin Sarbecoviruses, and natural selection have been proposed as possible explanations for the acquisition of the variable loop and these amino acid residues. In this study we employed Bayesian phylogenetics to jointly reconstruct the phylogeny of the RBD among human, bat and pangolin Sarbecoviruses and detect recombination events affecting this region of the genome. A recombination event involving RaTG13, the closest relative of SARS-CoV-2 that lacks five of the six residues, and an unsampled Sarbecovirus lineage was detected. This result suggests that the variable loop of the RBD didn’t have a recombinant origin and the key amino acid residues were likely present in the common ancestor of SARS-CoV-2 and RaTG13, with the latter losing five of them probably as the result of recombination.
Article
The emergence of the COVID-19 pandemic prompted an increased interest in seasonal human coronaviruses. OC43, 229E, NL63, and HKU1 are endemic seasonal coronaviruses that cause the common cold and are associated with generally mild respiratory symptoms. In this study, we identified cell lines that exhibited cytopathic effects (CPE) upon infection by three of these coronaviruses and characterized their viral replication kinetics and the effect of infection on host surface receptor expression. We found that NL63 produced CPE in LLC-MK2 cells, while OC43 produced CPE in MRC-5, HCT-8, and WI-38 cell lines, while 229E produced CPE in MRC-5 and WI-38 by day 3 post-infection. We observed a sharp increase in nucleocapsid and spike viral RNA (vRNA) from day 3 to day 5 post-infection for all viruses; however, the abundance and the proportion of vRNA copies measured in the supernatants and cell lysates of infected cells varied considerably depending on the virus-host cell pair. Importantly, we observed modulation of coronavirus entry and attachment receptors upon infection. Infection with 229E and OC43 led to a downregulation of CD13 and GD3, respectively. In contrast, infection with NL63 and OC43 leads to an increase in ACE2 expression. Attempts to block entry of NL63 using either soluble ACE2 or anti-ACE2 monoclonal antibodies demonstrated the potential of these strategies to greatly reduce infection. Overall, our results enable a better understanding of seasonal coronaviruses infection kinetics in permissive cell lines and reveal entry receptor modulation that may have implications in facilitating co-infections with multiple coronaviruses in humans. IMPORTANCE Seasonal human coronavirus is an important cause of the common cold associated with generally mild upper respiratory tract infections that can result in respiratory complications for some individuals. 
There are no vaccines available for these viruses, with only limited antiviral therapeutic options to treat the most severe cases. A better understanding of how these viruses interact with host cells is essential to identify new strategies to prevent infection-related complications. By analyzing viral replication kinetics in different permissive cell lines, we find that cell-dependent host factors influence how viral genes are expressed and virus particles released. We also analyzed entry receptor expression on infected cells and found that these can be up- or down-modulated depending on the infecting coronavirus. Our findings raise concerns over the possibility of infection enhancement upon co-infection by some coronaviruses, which may facilitate genetic recombination and the emergence of new variants and strains.
Preprint
Cross-species transmission of coronaviruses (CoVs) poses a serious threat to both animal and human health [1-3]. Whilst the large RNA genome of CoVs shows relatively low mutation rates, recombination within genera is frequently observed and demonstrated [4-7]. Companion animals are often overlooked in the transmission cycle of viral diseases; however, the close relationship of feline (FCoV) and canine CoV (CCoV) to human hCoV-229E [5,8], as well as their susceptibility to SARS-CoV-2 [9] highlight their importance in potential transmission cycles. Whilst recombination between CCoV and FCoV of a large fragment spanning orf1b to M has been previously described [5,10], here we report the emergence of a novel, highly pathogenic FCoV-CCoV recombinant responsible for a rapidly spreading outbreak of feline infectious peritonitis (FIP), originating in Cyprus [11]. The recombination, spanning spike, shows 97% sequence identity to the pantropic canine coronavirus CB/05. Infection is spreading fast and infecting cats of all ages. Development of FIP appears rapid and likely non-reliant on biotype switch [12]. High sequence identity of isolates from cats in different districts of the island is strongly supportive of direct transmission. A deletion and several amino acid changes in spike, particularly the receptor binding domain, compared to other FCoV-2s, indicate changes to receptor binding and likely cell tropism.
Article
The evolutionary process of genetic recombination has the potential to rapidly change the properties of a viral pathogen, and its presence is a crucial factor to consider in the development of treatments and vaccines. It can also significantly affect the results of phylogenetic analyses and the inference of evolutionary rates. The detection of recombination from samples of sequencing data is a very challenging problem, and is further complicated for SARS-CoV-2 by its relatively slow accumulation of genetic diversity. The extent to which recombination is ongoing for SARS-CoV-2 is not yet resolved. To address this, we use a parsimony-based method to reconstruct possible genealogical histories for samples of SARS-CoV-2 sequences, which enables us to pinpoint specific recombination events that could have generated the data. We propose a statistical framework for disentangling the effects of recurrent mutation from recombination in the history of a sample, and hence provide a way of estimating the probability that ongoing recombination is present. We apply this to samples of sequencing data collected in England and South Africa, and find evidence of ongoing recombination.
Article
We present evidence for multiple independent origins of recombinant SARS-CoV-2 viruses sampled from late 2020 and early 2021 in the United Kingdom. Their genomes carry single nucleotide polymorphisms and deletions that are characteristic of the B.1.1.7 variant of concern, but lack the full complement of lineage-defining mutations. Instead, the remainder of their genomes share contiguous genetic variation with non-B.1.1.7 viruses circulating in the same geographic area at the same time as the recombinants. In four instances there was evidence for onward transmission of a recombinant-origin virus, including one transmission cluster of 45 sequenced cases over the course of two months. The inferred genomic locations of recombination breakpoints suggest that every community-transmitted recombinant virus inherited its spike region from a B.1.1.7 parental virus, consistent with a transmission advantage for B.1.1.7’s set of mutations.
Preprint
Accurate and timely detection of recombinant lineages is crucial for interpreting genetic variation, reconstructing epidemic spread, identifying selection and variants of interest, and accurately performing phylogenetic analyses. During the SARS-CoV-2 pandemic, genomic data generation has exceeded the capacities of existing analysis platforms, thereby crippling real-time analysis of viral recombination. Low SARS-CoV-2 mutation rates make detecting recombination difficult. Here, we develop and apply a novel phylogenomic method to exhaustively search a nearly comprehensive SARS-CoV-2 phylogeny for recombinant lineages. We investigate a 1.6M sample tree, and identify 606 recombination events. Approximately 2.7% of sequenced SARS-CoV-2 genomes have recombinant ancestry. Recombination breakpoints occur disproportionately in the Spike protein region. Our method empowers comprehensive real time tracking of viral recombination during the SARS-CoV-2 pandemic and beyond.
Article
Viral recombination can generate novel genotypes with unique phenotypic characteristics, including transmissibility and virulence. Although the capacity for recombination among betacoronaviruses is well documented, recombination between strains of SARS-CoV-2 has not been characterized in detail. Here, we present a lightweight approach for detecting genomes that are potentially recombinant. This approach relies on identifying the mutations that primarily determine SARS-CoV-2 clade structure and then screening genomes for ones that contain multiple mutational markers from distinct clades. Among the over 537,000 genomes queried that were deposited on GISAID.org prior to February 16, 2021, we detected 1175 potential recombinant sequences. Using highly conservative criteria to exclude sequences that may have originated through de novo mutation, we find that at least 30% (n = 358) are likely of recombinant origin. An analysis of deep-sequencing data for these putative recombinants, where available, indicated that the majority are high quality. Additional phylogenetic analysis and the observed co-circulation of predicted parent clades in the geographic regions of exposure further support the feasibility of recombination in this subset of potential recombinants. An analysis of these genomes did not reveal evidence for recombination hotspots in the SARS-CoV-2 genome. While most of the putative recombinant sequences we detected were genetic singletons, a small number of genetically identical or highly similar recombinant sequences were identified in the same geographic region, indicative of locally circulating lineages. Recombinant genomes were also found to have originated from parental lineages with substitutions of concern, including D614G, N501Y, E484K, and L452R. 
Adjusting for an unequal probability of detecting recombinants derived from different parent clades and for geographic variation in clade abundance, we estimate that at most 0.2-2.5% of circulating viruses in the US and UK are recombinant. Our identification of a small number of putative recombinants within the first year of SARS-CoV-2 circulation underscores the need to sustain efforts to monitor the emergence of new genotypes generated through recombination.
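The screening idea in this abstract, flagging genomes that carry multiple mutational markers from distinct clades, can be illustrated with a small sketch. This is not the published method: the clade names, marker labels, and the threshold of two markers per clade are hypothetical placeholders, and the sketch omits the filtering the abstract describes for excluding de novo mutation.

```python
def clades_supported(genome_mutations, clade_markers, min_markers=2):
    """Return the clades for which the genome carries at least
    `min_markers` of that clade's defining markers.

    genome_mutations: set of mutation labels observed in one genome.
    clade_markers: dict mapping clade name -> set of marker labels.
    """
    return sorted(
        clade
        for clade, markers in clade_markers.items()
        if len(genome_mutations & markers) >= min_markers
    )


def is_candidate_recombinant(genome_mutations, clade_markers, min_markers=2):
    """Flag a genome that shows marker support for two or more clades."""
    supported = clades_supported(genome_mutations, clade_markers, min_markers)
    return len(supported) >= 2


if __name__ == "__main__":
    # Hypothetical marker sets; real markers would come from the clade
    # definitions of the virus being screened.
    markers = {
        "cladeA": {"a1", "a2", "a3"},
        "cladeB": {"b1", "b2", "b3"},
    }
    mixed = {"a1", "a2", "b2", "b3"}
    print(clades_supported(mixed, markers))
    print(is_candidate_recombinant(mixed, markers))
```

Requiring more than one marker per clade is a simple guard against a single recurrent mutation producing a false positive; a genome flagged this way is only a candidate and would still need the follow-up checks described in the abstract.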
Article
Global dispersal and increasing frequency of the SARS-CoV-2 spike protein variant D614G are suggestive of a selective advantage but may also be due to a random founder effect. We investigate the hypothesis for positive selection of spike D614G in the United Kingdom using more than 25,000 whole genome SARS-CoV-2 sequences. Despite the availability of a large dataset, well represented by both spike 614 variants, not all approaches showed a conclusive signal of positive selection. Population genetic analysis indicates that 614G increases in frequency relative to 614D in a manner consistent with a selective advantage. We do not find any indication that patients infected with the spike 614G variant have higher COVID-19 mortality or clinical severity, but 614G is associated with higher viral load and younger age of patients. Significant differences in growth and size of 614G phylogenetic clusters indicate a need for continued study of this variant.
Article
Seasonal coronaviruses (OC43, 229E, NL63 and HKU1) are endemic to the human population, regularly infecting and reinfecting humans while typically causing asymptomatic to mild respiratory infections. It is not known to what extent reinfection by these viruses is due to waning immune memory or antigenic drift of the viruses. Here, we address the influence of antigenic drift on immune evasion of seasonal coronaviruses. We provide evidence that at least two of these viruses, OC43 and 229E, are undergoing adaptive evolution in regions of the viral spike protein that are exposed to human humoral immunity. This suggests that reinfection may be due, in part, to positively-selected genetic changes in these viruses that enable them to escape recognition by the immune system. It is possible that, as with seasonal influenza, these adaptive changes in antigenic regions of the virus would necessitate continual reformulation of a vaccine made against them.
Article
The Betacoronaviruses comprise multiple subgenera whose members have been implicated in human disease. As with SARS, MERS and now SARS-CoV-2, the origin and emergence of new variants are often attributed to events of recombination that alter host tropism or disease severity. In most cases, recombination has been detected by searches for excessively similar genomic regions in divergent strains; however, such analyses are complicated by the high mutation rates of RNA viruses, which can produce sequence similarities in distant strains by convergent mutations. By applying a genome-wide approach that examines the source of individual polymorphisms and that can be tested against null models in which recombination is absent and homoplasies can arise only by convergent mutations, we examine the extent and limits of recombination in Betacoronaviruses. We find that recombination accounts for nearly 40% of the polymorphisms circulating in populations and that gene exchange occurs almost exclusively among strains belonging to the same subgenus. Although experimental studies have shown that recombinational exchanges occur at random along the coronaviral genome, in nature, they are vastly overrepresented in regions controlling viral interaction with host cells.
Article
The ability to detect recombination in pathogen genomes is crucial to the accuracy of phylogenetic analysis and consequently to forecasting the spread of infectious diseases and to developing therapeutics and public health policies. However, in the case of SARS-CoV-2, the low divergence of near-identical genomes sequenced over a short period of time makes conventional analysis infeasible. Using a novel method, we identified 225 anomalous SARS-CoV-2 genomes of likely recombinant origins out of the first 87,695 genomes to be released, several of which have persisted in the population. Bolotie is specifically designed to perform a rapid search for inter-clade recombination events over extremely large datasets, facilitating analysis of novel isolates in seconds. In cases where raw sequencing data was available, we were able to rule out the possibility that these samples represented co-infections by analyzing the underlying sequence reads. The Bolotie software and other data from our study are available at https://github.com/salzberg-lab/bolotie.