A Bayesian approach to infer recombination patterns in coronaviruses

July 2022
Nature Communications 13(1):4186

License
CC BY 4.0

As shown during the SARS-CoV-2 pandemic, phylogenetic and phylodynamic methods are essential tools to study the spread and evolution of pathogens. One of the central assumptions of these methods is that the shared history of pathogens isolated from different hosts can be described by a branching phylogenetic tree. Recombination breaks this assumption. This makes it problematic to apply phylogenetic methods to study recombining pathogens, including, for example, coronaviruses. Here, we introduce a Markov chain Monte Carlo approach that allows inference of recombination networks from genetic sequence data under a template switching model of recombination. Using this method, we first show that recombination is extremely common in the evolutionary history of SARS-like coronaviruses. We then show how recombination rates across the genome of the human seasonal coronaviruses 229E, OC43 and NL63 vary with rates of adaptation. This suggests that recombination could be beneficial to fitness of human seasonal coronaviruses. Additionally, this work sets the stage for Bayesian phylogenetic tracking of the spread and evolution of SARS-CoV-2 in the future, even as recombinant viruses become prevalent. Genetic recombination can confound standard phylogenetic approaches. Here, the authors present a method to reconstruct virus recombination networks, and show the importance of recombination in shaping the ongoing evolution of SARS-like, MERS and 3 human seasonal coronaviruses.

Impact of the recombination rate prior distribution on the inferred recombination rates. Here, we compare the inferred recombination rates when using prior distributions that differ from the distribution from which the rates for simulations were sampled. The rates for simulations were sampled from a log-normal distribution with μ = −11.12 and σ = 0.5. In A, we show the inferred rates when using a prior distribution with μ = −12.74 and σ = 0.5 (leading to a 5 times lower mean in real space than the correct prior). In B, we show the inferred rates when using a prior distribution with μ = −12.74 and σ = 2. In C, we show the inferred rates when using the same prior distribution as the one the simulation rates were sampled from. In D, we show the inferred rates when using a prior distribution with μ = −9.72 and σ = 2. In E, we show the inferred rates when using a prior distribution with μ = −9.72 and σ = 0.5 (leading to a 5 times higher mean in real space than the correct prior). F shows the corresponding density plots for all log-normal distributions used as prior distributions on the recombination rates.

ARTICLE

A Bayesian approach to infer recombination patterns in coronaviruses

Nicola F. Müller¹ ✉, Kathryn E. Kistler¹,² & Trevor Bedford¹,²,³

https://doi.org/10.1038/s41467-022-31749-8 OPEN

¹Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA. ²Molecular and Cellular Biology Program, University of Washington, Seattle, WA, USA. ³Howard Hughes Medical Institute, Seattle, WA, USA. ✉email: nicola.felix.mueller@gmail.com

NATURE COMMUNICATIONS | (2022) 13:4186 | https://doi.org/10.1038/s41467-022-31749-8 | www.nature.com/naturecommunications 1

Content courtesy of Springer Nature, terms of use apply. Rights reserved

Since the emergence of SARS-CoV-2, genetic sequence data has been used to study its evolution and spread. Genetic sequences have, for example, been used to investigate natural versus lab origins of SARS-CoV-2¹, when SARS-CoV-2 was introduced into the US², and whether genetic variants differ in growth rate³. These analyses often rely on phylogenetic and phylodynamic approaches, at the heart of which are phylogenetic trees. Such trees denote how viruses isolated from different individuals are related and contain information about the transmission dynamics connecting these infections⁴.

Along with mutations introduced by errors during replication or by anti-viral molecules (for example ref. 5), different recombination processes contribute to genetic diversity in RNA viruses (reviewed by Simon-Loriere and Holmes⁶). Reassortment in segmented viruses (generally negative-sense RNA viruses), such as influenza or rotaviruses, can produce offspring that carry segments from different parental lineages⁷. In other RNA viruses (generally positive-sense RNA viruses), such as flaviviruses and coronaviruses, homologous recombination can combine different parts of a genome from different parental lineages in the absence of physically separate segments on the genome of those viruses⁸. The main mechanism of this process is thought to be template switching⁹, where the template for replication is switched during the replication process. Recombination breakpoints in experiments appear to be largely random, with selection favoring breakpoints in some areas of the genome¹⁰. Recent work shows that recombination breakpoints occur more frequently in the spike region of betacoronaviruses, such as SARS-CoV-2¹¹. While the reason why the recombination process evolved in RNA viruses is not completely understood⁶, there are different explanations of why recombination may be beneficial. In general, recombination is selected for if breaking up linkage disequilibrium is beneficial¹². Recombination can help purge deleterious mutations from the genome, as proposed by the mutational-deterministic hypothesis¹³. It can also increase the rate at which a fit combination of mutations occurs, as stated by the Hill–Robertson effect¹⁴. Alternatively, recombination in RNA viruses may also just be a by-product of the processivity of the viral polymerase⁶.

Recombination poses a unique challenge to phylogenetic methods, as it violates the central assumption that the evolutionary history of individuals can be denoted by branching phylogenetic trees. Instead, the shared ancestry of a set of sequences has to be represented as a network. Not accounting for this can lead to biased phylogenetic and phylodynamic inferences¹⁵,¹⁶. An analytic description of recombination is provided by the coalescent with recombination, which models a backward-in-time process where lineages can coalesce and recombine¹⁷. When recombination is considered backward in time, a single lineage results in two parent lineages, with one parent lineage carrying the genetic material from one side of a random recombination breakpoint and the other parent lineage carrying the genetic material from the other side of this breakpoint. This is the backward-in-time equivalent of template switching with one recombination breakpoint per recombination event.
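The backward-in-time recombination event described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the BEAST2 implementation: the function names, the half-open interval used to represent the genetic material carried by a lineage, and the uniform choice of breakpoint are assumptions made here.

```python
import random


def split_at_breakpoint(segment, breakpoint):
    """Split the material [start, end) carried by a lineage at a breakpoint.

    Returns (left, right): the material passed to each of the two parent
    lineages. A side is None if the breakpoint falls outside the segment,
    i.e. that parent carries no material ancestral to the sample.
    """
    start, end = segment
    left = (start, min(end, breakpoint)) if breakpoint > start else None
    right = (max(start, breakpoint), end) if breakpoint < end else None
    return left, right


def sample_recombination(segment, genome_length, rng=random):
    """Draw one breakpoint uniformly between adjacent sites and split the lineage."""
    # Breakpoints lie between two neighbouring sites: 1 .. genome_length - 1.
    breakpoint = rng.randint(1, genome_length - 1)
    return breakpoint, split_at_breakpoint(segment, breakpoint)


# Hypothetical example: a lineage carrying sites 0..99 recombines at position 40.
print(split_at_breakpoint((0, 100), 40))
```

A breakpoint outside the carried segment leaves all material on one parent, which is why one side can be `None`.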

Currently, some Bayesian phylogenetic approaches exist that infer recombination networks, or ancestral recombination graphs (ARGs), but they are either approximate or do not directly allow for efficient model-based inference. Some approaches consider tree-based networks¹⁸,¹⁹, where the networks consist of a base tree with recombination edges that always attach to edges on the base tree. Alternative approaches rely on approximations to the coalescent with recombination²⁰,²¹, consider a different model of recombination¹⁶, or seek to infer recombination networks absent an explicit recombination model²². Bayesian and maximum likelihood methods have also been used to account for gene transfer events when describing the evolutionary history of species from multiple loci (for example, see refs. 23,24). Additionally, methods have been used to describe non-tree-like evolution using split trees²⁵,²⁶. There is, however, a gap for Bayesian inference of recombination networks under the coalescent with recombination that can be applied to study pathogens, such as coronaviruses.

In order to fill this gap, we here develop a Markov chain Monte Carlo (MCMC) approach to efficiently infer recombination networks under the coalescent with recombination for sequences sampled over time. This framework allows joint estimation of recombination networks, effective population sizes, recombination rates, and parameters describing mutations over time from genetic sequence data sampled through time. Beyond the assumptions of the coalescent with recombination¹⁷ itself, we explicitly make no additional approximations to characterize the recombination process, such as restricting the networks to be tree-based. We implemented this approach as an open-source software package for BEAST2²⁷, allowing us to use the various evolutionary models already implemented in BEAST2. We then use the coalescent with recombination to study the recombination patterns of SARS-like, MERS, and 3 seasonal human coronaviruses.

Results

Widespread recombination in SARS-like coronaviruses. Recombination has been implicated at the beginning of the SARS-CoV-1 outbreak²⁸ and has been suggested as the origin of the receptor-binding domain in SARS-CoV-2²⁹, though Boni et al.³⁰ report that recombination is unlikely to be the origin of SARS-CoV-2. While this strongly suggests non-tree-like evolution, the evolutionary history of SARS-like viruses has, out of necessity, mainly been denoted using phylogenetic trees.

We here reconstruct the recombination history of SARS-like viruses, which include SARS-CoV-1 and SARS-CoV-2 as well as related bat³¹⁻³³ and pangolin³⁴ coronaviruses. To do so, we infer the recombination network of SARS-like viruses under the coalescent with recombination. We assumed that the rates of recombination and effective population sizes were constant over time and that the genomes evolved under a GTR+Γ model. Similar to the estimate in ref. 30, we used a fixed evolutionary rate of 5 × 10⁻⁴ mutations per nucleotide and year. We fixed the evolutionary rate since the time interval of sampling between individual isolates is relatively short compared to the time scale of the evolutionary history of SARS-like viruses. This means that the sampling times themselves offer little insight into the evolutionary rates and, in the absence of other calibration points, there is little information about the evolutionary rate in this dataset. This, in turn, means that if the evolutionary rate we used here is inaccurate, then the timings of common ancestors will also be inaccurate. Therefore, exact timings and calendar dates in this analysis should be taken as guideposts rather than formal estimates.
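To make the dependence on the fixed rate concrete: under a strict clock, the genetic distance separating two sequences is the product of rate and time, so node ages scale inversely with the assumed rate. The sketch below rescales a node date under an alternative rate. It is a back-of-the-envelope approximation, not part of the inference itself, and it assumes that all samples were taken at roughly the same time (plausible here because the sampling window is short relative to the depth of the network). The function name and the example numbers are hypothetical.

```python
def rescale_node_date(node_date, sampling_date, assumed_rate, alternative_rate):
    """Approximate date of a node if the true clock rate were `alternative_rate`.

    The node's age (time before sampling) is scaled by
    assumed_rate / alternative_rate, keeping the genetic distance fixed.
    """
    if assumed_rate <= 0 or alternative_rate <= 0:
        raise ValueError("rates must be positive")
    age = sampling_date - node_date
    return sampling_date - age * (assumed_rate / alternative_rate)


# Hypothetical example: a node dated to 1920 under a rate of 5e-4, with samples
# taken around 2020, moves forward in time if the true rate were twice as high.
print(rescale_node_date(1920.0, 2020.0, 5e-4, 1e-3))
```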

As shown in Fig. 1A and Fig. S1A, the evolutionary history of SARS-like viruses is characterized by frequent recombination events, including events ancestral to SARS-CoV-2 (see Fig. S2). This means that only relatively short segments of the genomes code for the same tree (see Figs. S3 and S1B). Consequently, characterizing the evolutionary history of SARS-like viruses by a single, genome-wide phylogeny is bound to be inaccurate and potentially misleading. We infer the recombination rate in SARS-like viruses to be approximately 2 × 10⁻⁶ recombination events per site per year, which is about 200 times slower than the evolutionary rate. This rate translates to about 0.06 recombination events per lineage per year, which is slightly lower than the estimated rate of recombination for the seasonal human coronaviruses and the reassortment rates for pandemic 1918-like influenza A/H1N1 and influenza B viruses, which are all around 0.1–0.2 reassortment events per lineage per year¹⁶. This recombination rate is a function of co-infection rates, the probability of recombination occurring upon co-infection, and selection. As such, the recombination rate we infer here will be (possibly substantially) lower than the within-host rate of recombination.
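The conversion from a per-site to a per-lineage rate is a multiplication by the number of sites at which a breakpoint can fall. The short check below assumes a genome length of 29,903 nucleotides, the length of the Wuhan-Hu-1 reference genome; the alignment length used in the analysis may differ, and the function name is chosen here for illustration.

```python
def per_lineage_rate(per_site_rate, genome_length):
    """Recombination events per lineage per year, given a per-site, per-year rate."""
    return per_site_rate * genome_length


# Assumed genome length: 29,903 nt (Wuhan-Hu-1); per-site rate as reported above.
print(round(per_lineage_rate(2e-6, 29_903), 3))  # 0.06
```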

These recombination events were not evenly distributed across the genome and, instead, were relatively more frequent in areas outside those coding for ORF1ab (Figs. S4 and S5). Additionally, our inference suggests that rates of recombination are slightly elevated on spike subunit S1 compared to subunit S2 (Fig. S4). If we track recombination events ancestral to the SARS-CoV-2 lineage that are inferred to have happened in the last 100 years, we find evidence for recombination breakpoints occurring close to the 5′ end of the spike, just outside the coding region (see Fig. S5). Additionally, we find support for recombination breakpoints toward the 3′ end of the spike, near the nucleocapsid gene (see Fig. S5). If we assume that during genome replication in coronaviruses template shifts occur randomly on the genome¹⁰, differences in observed recombination rates could be explained by selection favoring recombinant lineages with breakpoints 3′ to ORF1ab relative to elsewhere on the genome.

We next investigate when different viruses last shared a common ancestor, that is, the time of their most recent common ancestor (MRCA), along the genome (see Figs. 1B and S6). RmYN02³³ shares the MRCA with SARS-CoV-2 on the part of the genome that codes for ORF1ab (Fig. 1B). We additionally find strong evidence for one or more recombination events in the ancestry of RmYN02 at the beginning of spike (Fig. 1B). This recent recombination event is unlikely to have occurred with a recent ancestor of any of the coronaviruses included in this dataset, since the common ancestor times of RmYN02 with all other viruses in the dataset are approximately the same (Fig. S6A). In other words, large parts of the spike protein of RmYN02 are as related to SARS-CoV-2 as SARS-CoV-2 is to SARS-CoV-1. The common ancestor timings of P2S across the genome are equal between RaTG13 and SARS-CoV-2 (Fig. S6C). RaTG13, on the other hand, is more closely related to SARS-CoV-2 than P2S (Fig. S6B) across the entire genome.

When looking at when different viruses last shared a common ancestor anywhere on the genome (in other words, when the ancestral lineages of two viruses last crossed paths), we find that RmYN02 has the most recent MRCA with SARS-CoV-2 (Fig. S6C). The median estimate of the most recent MRCA between SARS-CoV-2 and RmYN02 is 1986 (95% CI: 1973–2005). For RaTG13 it is 1975 (95% CI: 1964–1988), for P2S it is 1949 (95% CI: 1907–1973), and for SARS-CoV-1 it is 1834 (95% CI: 1707–1935). These estimates are contingent on a fixed evolutionary rate of 5 × 10⁻⁴ per nucleotide per year.

Rates of recombination are associated with rates of adaptation in human seasonal coronaviruses. We next investigate recombination patterns in MERS-CoV, which has over 2500 confirmed cases in humans, as well as in human seasonal coronaviruses 229E, OC43, and NL63, which have widespread seasonal circulation in humans. As for the SARS-like viruses, we jointly infer recombination networks, rates of recombination, and population sizes for these viruses. We assumed that the genomes evolved under a GTR+Γ model and, in contrast to the analysis of SARS-like viruses, inferred the evolutionary rates. We observe frequent recombination in the history of all four viruses, wherein genetic ancestry is described by a network rather than a strictly branching phylogeny (Fig. 2A–D and Fig. S6A).

Fig. 1 Evolutionary history of SARS-like viruses. A Maximum clade credibility (MCC) network of SARS-like viruses. Blue dots denote samples and green dots recombination events. B Common ancestor times of Wuhan-Hu1 (SARS-CoV-2) with different SARS-like viruses on different positions of the genome. The y-axis denotes common ancestor times on the log scale. The line denotes the median common ancestor time, while the colored area denotes the 95% highest posterior density interval. C Most recent time anywhere on the genome that Wuhan-Hu1 shared a common ancestor with different SARS-like viruses. The error bars denote the upper and lower bound of the 95% highest posterior density interval. The MCC network and common ancestor times are provided as a Source Data file.

The human seasonal coronaviruses all have recombination rates around 1 × 10⁻⁵ per site and year (Fig. S7). This is around 10–20 times lower than the evolutionary rate (Fig. S8). In contrast to the recombination rates, the evolutionary rates vary greatly across the human seasonal coronaviruses, ranging from a median of 1.3 × 10⁻⁴ (95% highest posterior density interval (HPD) 1.1–1.5 × 10⁻⁴) for NL63 to medians of 2.5 × 10⁻⁴ (95% HPD 2.2–2.7 × 10⁻⁴) and 2.1 × 10⁻⁴ (95% HPD 1.9–2.3 × 10⁻⁴) for 229E and OC43, respectively (Fig. S8). These evolutionary rates are substantially lower than those estimated for SARS-CoV-2 (1.1 × 10⁻³ substitutions per site and year³⁵), which are more in line with our estimates for the evolutionary rates of MERS, with a median rate of 6.9 × 10⁻⁴ (95% HPD 6.0–7.9 × 10⁻⁴). Evolutionary rate estimates can be time-dependent, with datasets spanning more time estimating lower rates of evolution than those spanning less time³⁶. In turn, this means that the evolutionary rate estimates for SARS-CoV-2 will likely be lower the more time passes. It is unclear, though, whether they will approach the evolutionary rates of other seasonal coronaviruses in the long run.
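The 95% HPD intervals reported here summarize posterior samples from the MCMC. As a reminder of what such an interval is, the sketch below uses a common empirical estimator: the shortest interval that contains the requested fraction of the sorted samples. It is a generic illustration that assumes a unimodal posterior, not the code used to produce the reported values, and the function name is chosen here.

```python
import math


def hpd_interval(samples, mass=0.95):
    """Shortest interval containing at least `mass` of the posterior samples."""
    xs = sorted(samples)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one sample")
    # Number of samples that must fall inside the interval.
    k = max(1, math.ceil(mass * n))
    # Slide a window of k consecutive sorted samples and keep the narrowest one.
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]


# Hypothetical skewed sample: the interval hugs the dense region, not the tail.
print(hpd_interval([10, 1, 2, 3], mass=0.75))
```

Unlike an equal-tailed credible interval, the HPD interval of a skewed posterior is not symmetric around the median, which is why medians and HPD bounds are reported together.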

On a per-lineage basis, the estimated recombination rate for seasonal coronaviruses translates into around 0.1–0.3 recombination events per lineage and year (Fig. 2E). Recombination events as defined here are a product of co-infection, recombination, and selection of recombinant viruses. Interestingly, the rate at which recombination events occur is highly similar to the rate at which reassortment events occur in human influenza viruses (Fig. 2E and ref. 16). If we assume similar selection pressures for recombinant coronaviruses compared to reassortant influenza viruses, this would indicate similar co-infection rates in influenza and coronaviruses. The incidence of coronaviruses in patients with respiratory illness over 12 seasons in western Scotland has been found to be lower (7–17%) than for influenza viruses (13–34%) but of the same order of magnitude³⁷. Considering that seasonal coronaviruses typically are less symptomatic than influenza viruses, it is not unreasonable to assume that annual incidence, and therefore likely the annual co-infection rates, are comparable between influenza and coronaviruses.

Compared to human seasonal coronaviruses, recombination occurs around 3 times more often for MERS-CoV (Fig. 2E). MERS-CoV mainly circulates in camels and occasionally spills over into humans³⁸. MERS-CoV infections are highly prevalent in camels, with close to 100% of adult camels showing antibodies against MERS-CoV³⁹. Higher incidence, and thus higher rates of co-infection, could therefore account for higher rates of recombination in MERS-CoV compared to the human seasonal coronaviruses.

We next tested whether parts of the genome with higher rates of recombination are also associated with higher rates of adaptation. To do so, we allowed for different relative rates of recombination within the region 5′ of the spike (i.e., mostly ORF1ab), spike itself, and everything 3′ of the spike. We computed recombination rate ratios on each of these three sections of the genome as the recombination rate on that section divided by the mean rate on the other two sections. We infer that recombination rates are elevated in the spike protein of all human seasonal coronaviruses considered here (Fig. 3, Figs. S9 and S10). This is consistent with other work estimating higher rates of recombination on the spike protein of betacoronaviruses¹¹.
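The rate ratio defined above is simple to state in code. The following sketch computes, for each genome section, its rate divided by the mean rate of the remaining sections. The section names and the example rates are hypothetical and serve only to illustrate the definition.

```python
def rate_ratios(rates):
    """Ratio of each section's rate to the mean rate of all other sections."""
    if len(rates) < 2:
        raise ValueError("need at least two sections to form a ratio")
    ratios = {}
    for name, rate in rates.items():
        others = [r for other, r in rates.items() if other != name]
        ratios[name] = rate / (sum(others) / len(others))
    return ratios


# Hypothetical relative recombination rates on the three genome sections.
example = {"5prime_of_spike": 1.0, "spike": 4.0, "3prime_of_spike": 3.0}
print(rate_ratios(example))
```

A ratio above 1 means the section recombines faster than the average of the other two; applied to every posterior sample, the same calculation would give a posterior distribution of ratios.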

We then computed the rates of adaptation on different parts of the genomes of the seasonal human coronaviruses using the approach described in refs. 40,41. This approach does not explicitly consider trees to compute the rates of adaptation on different parts of the genomes and is not affected by recombination⁴¹. We find that sections of the genome with relatively higher rates of adaptation correspond to sections of the genome with relatively higher rates of recombination (Fig. 3). In particular, recombination and adaptation are elevated on the section of the genome that codes for the spike protein and are lower elsewhere.

We next investigated whether these trends hold when looking only at spike. The spike protein is made up of two subunits: S1 and S2. S1 binds to the host cell receptor, while S2 facilitates fusion of the viral and cellular membranes⁴². Rates of adaptation have been shown to be high in S1, but not S2, for 229E and OC43⁴¹. While the rates of adaptation are relatively low overall for NL63, there is still some evidence that they are elevated in S1 compared to S2⁴¹.

To test whether recombination rates vary with rates of adaptation on the subunits of the spike as well, we inferred the recombination rates from the spike only, allowing for different rates of recombination on S1 versus the rest of the spike. We find that the rates of recombination are elevated on S1 for 229E and OC43 compared to the rest of the spike gene (Fig. 3). This is consistent with strong absolute rates of adaptation on S1 in these two viruses. For NL63, we find weak evidence for the rate on S2 to be slightly higher than on S1 (Fig. 3), even though the rates of adaptation are inferred to be higher on S1. The absolute rate of adaptation in S1 of NL63 is, however, substantially lower than for 229E or OC43. Additionally, the uncertainty around the estimates of adaptation rate ratios between the two subunits for NL63 is rather large and includes no difference at all. Overall, these results suggest that particular recombination events that have resulted in recombinant viruses are either positively or negatively selected. Elevated rates of recombination in areas where adaptation is stronger have been described for other organisms (reviewed in ref. 43). Alternatively, higher rates of recombination could also be due to mechanistic reasons, as has been suggested in the case of SARS-CoV-2⁴⁴.

Fig. 2 Recombination networks and rates for coronaviruses MERS, 229E, OC43, and NL63. Recombination networks for MERS (A) and seasonal human coronaviruses 229E (B), OC43 (C), and NL63 (D). E Recombination rates (per lineage and year) for the different coronaviruses compared to reassortment rates in seasonal human influenza A/H3N2 and influenza B viruses, as estimated under the coalescent with reassortment using whole-genome influenza sequences sampled over multiple decades¹⁶. For OC43 and NL63, the parts of the recombination networks that stretch beyond 1950 are not shown to increase the readability of more recent parts of the networks. The error bars denote the upper and lower bound of the 95% highest posterior density interval. All MCC networks are provided as a Source Data file.

To further investigate this, we next computed the rates of recombination on fitter and less fit parts of the recombination networks of 229E, OC43, and NL63. To do so, we first classify each edge of the inferred posterior distribution of the recombination networks as fit or unfit based on how long a lineage survives into the future. Fit edges are those that have descendants at least 1, 2, 5, or 10 years into the future, and unfit edges are those that do not. We then computed the rates of recombination on both types of edges for the entire posterior distribution of networks. Overall, we do not find that fit edges show relatively higher rates of recombination (see Fig. S11). The simplest explanation is that we do not have enough data points to measure recombination rates on unfit edges, that is, on parts of the recombination network where selection had too little time to shape which lineages survive and which go extinct. An alternative explanation for why we see elevated rates of recombination in the spike protein but do not observe a population-level fitness benefit could be that most recombinants (outside of spike) are detrimental to fitness, with few (within spike) having little fitness effect at all.
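The edge classification described above can be sketched as follows. This is a simplified illustration for a single network, not the analysis code. Several details are assumptions made here: survival is measured from the more recent end of an edge, each edge is stored as a dictionary with hypothetical field names, and the rate on a class of edges is taken as the number of recombination events divided by the summed edge length.

```python
def recombination_rate_by_fitness(edges, threshold_years):
    """Recombination events per unit of edge length on fit and unfit edges.

    Each edge is a dict with (hypothetical) keys:
      child_time            calendar time of the edge's more recent end
      length                edge length in years
      latest_descendant     calendar time of its most recent sampled descendant
      recombination_events  number of recombination events assigned to the edge
    """
    totals = {"fit": [0, 0.0], "unfit": [0, 0.0]}
    for edge in edges:
        survival = edge["latest_descendant"] - edge["child_time"]
        label = "fit" if survival >= threshold_years else "unfit"
        totals[label][0] += edge["recombination_events"]
        totals[label][1] += edge["length"]
    return {
        label: (events / length if length > 0 else float("nan"))
        for label, (events, length) in totals.items()
    }


# Hypothetical edges from one network, classified with a 5-year threshold.
edges = [
    {"child_time": 2000.0, "length": 4.0, "latest_descendant": 2012.0, "recombination_events": 1},
    {"child_time": 2005.0, "length": 6.0, "latest_descendant": 2006.0, "recombination_events": 0},
    {"child_time": 2010.0, "length": 2.0, "latest_descendant": 2010.5, "recombination_events": 1},
]
print(recombination_rate_by_fitness(edges, threshold_years=5))
```

Repeating the calculation for thresholds of 1, 2, 5, and 10 years, and over all sampled networks, would mirror the comparison described in the text.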

Discussion

Though not yet highly prevalent, evidence for recombination in SARS-CoV-2 has started to appear⁴⁵⁻⁴⁸. As such, it is crucial to know the extent to which recombination is expected to shape SARS-CoV-2 in the coming years, to have methods to identify recombination, and to perform phylogenetic reconstruction in the presence of recombination. The results shown here indicate that some recombinant viruses are either positively or negatively selected. Estimating the deleterious load of viruses before and after recombination using ancestral sequence reconstruction⁴⁹ could help shed light on which sequences are favored during recombination. Furthermore, having additional sequences to reconstruct recombination patterns in the seasonal coronaviruses should clarify the role recombination plays in the long-term evolution of these viruses.

While their impact on the evolutionary dynamics of SARS-CoV-2 remains unclear, the likely rise of future SARS-CoV-2 recombinants will further necessitate methods that allow phylogenetic and phylodynamic inferences to be performed in the presence of recombination⁵⁰. In the absence of such methods, either recombination has to be ignored, leading to biased phylogenetic and phylodynamic reconstruction¹⁵, or only non-recombinant parts of the genome can be used for analyses, reducing the precision of these methods. Our approach addresses this gap by providing a Bayesian framework to infer recombination networks. To facilitate easy adoption, we implemented the method so that analyses can be set up following the same workflow as regular BEAST2²⁷ analyses. Extending the current suite of population dynamic models, such as birth–death models⁵¹ or models that account for population structure⁵²,⁵³, will further increase the applicability of recombination models to study the spread of pathogens.

Fig. 3 Comparison of recombination rates with rates of adaptation on different parts of the genomes of seasonal human coronaviruses 229E, OC43, and NL63. Association between estimated relative recombination rate (x-axis) and relative adaptation rate (y-axis) for three different seasonal human coronaviruses: 229E, OC43, and NL63. These estimates are shown for different parts of the genome, indicated by the different colors. These results are from two different types of analysis: one using spike only (subunit 1 over subunit 2, shown in yellow) and one using the full genome (shown in orange, blue, and green). The rate ratios denote the rate on a part of the genome divided by the average rate on the two other parts of the genome. The error bars of the recombination rates (x-axis) denote the upper and lower bounds of the 95% HPD intervals of the estimates of relative recombination rates. The error bars of the rates of adaptation are computed using 100 bootstrapped outgroups and alignments. Source data are provided as a Source Data file.

Fig. 4 Example recombination network. Events that can occur on a recombination network as considered here. We consider events to occur from the present backward in time to the past (as is the norm when looking at coalescent processes). Lineages can be added upon sampling events, which occur at predefined points in time and are conditioned on. Recombination events split the path of a lineage in two, with everything on one side of a recombination breakpoint going in one direction and everything on the other side of a breakpoint going in the other direction. Figure labels: a time axis running from present to past, coalescent events, recombination events, and a legend distinguishing lineages with and without sampled descendants.

Methods

Coalescent with recombination. The coalescent with recombination models a backward-in-time coalescent and recombination process (ref. 17). In this process, three different events are possible: sampling, coalescence, and recombination. Sampling events happen at predefined points in time. Recombination events happen at a rate proportional to the number of coexisting lineages at any point in time. Recombination events split the path of a lineage in two, with everything on one side of a recombination breakpoint going in one ancestral direction and everything on the other side of the breakpoint going in the other direction. As shown in Fig. 4, the two parental lineages after a recombination event each carry a subset of the genome. In reality, the viruses corresponding to those two lineages still carry the full genome, but only a part of it will have sampled descendants. In other words, only a part of the genome carried by a lineage at any time may impact the genome of a future lineage that is sampled. The probability of actually observing a recombination event on lineage $l$ is proportional to how much genetic material that lineage carries. This can be computed as the difference between the last and first nucleotide position that is carried by $l$, which we denote as $\mathcal{L}(l)$. Coalescent events happen between coexisting lineages at a rate proportional to the number of pairs of coexisting lineages at any point in time and inversely proportional to the effective population size. The parent lineage at each coalescent event will carry genetic material corresponding to the union of the genetic material of the two child lineages.
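The event rates described above can be turned into a small simulation step. The following Python sketch draws the waiting time to the next coalescent or recombination event, backward in time, and decides which of the two occurs. It assumes the rate expressions given under "Network prior" (number of lineage pairs divided by θ for coalescence, and ρ times the summed genome length carried by the lineages for recombination). The function and argument names are hypothetical, and this is an illustration rather than the BEAST2 implementation.

```python
import random

def next_event(lengths, theta, rho, rng):
    """Draw (waiting_time, kind) for the next event, going backward in time.

    lengths: genome length carried by each coexisting lineage
    theta:   effective population size
    rho:     per-site, per-lineage recombination rate
    rng:     a random.Random instance
    """
    k = len(lengths)
    rate_coal = k * (k - 1) / 2.0 / theta   # pairs of lineages / theta
    rate_rec = rho * sum(lengths)           # rho * total carried length
    total = rate_coal + rate_rec
    if total <= 0.0:
        raise ValueError("no event can occur with these lineages and rates")
    wait = rng.expovariate(total)           # exponential waiting time
    # Choose the event type in proportion to its share of the total rate.
    kind = "coalescence" if rng.random() < rate_coal / total else "recombination"
    return wait, kind

# Illustrative values only: three lineages, each carrying 1000 sites.
rng = random.Random(1)
wait, kind = next_event([1000.0, 1000.0, 1000.0], theta=5.0, rho=1e-5, rng=rng)
```

Sampling events are not drawn here because, as stated above, they occur at predefined times and are conditioned on.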

Posterior probability. In order to perform joint Bayesian inference of recombination networks together with the parameters of the associated models, we use an MCMC algorithm to characterize the joint posterior density. The posterior density is denoted as:

$$P(N, \mu, \theta, \rho \mid D) = \frac{P(D \mid N, \mu)\, P(N \mid \theta, \rho)\, P(\mu, \theta, \rho)}{P(D)}, \tag{1}$$

where $N$ denotes the recombination network, $\mu$ the evolutionary model, $\theta$ the effective population size, and $\rho$ the recombination rate. The multiple sequence alignment, that is, the data, is denoted $D$. $P(D \mid N, \mu)$ denotes the network likelihood, $P(N \mid \theta, \rho)$ the network prior, and $P(\mu, \theta, \rho)$ the parameter priors. As is usually done in Bayesian phylogenetics, we assume that $P(\mu, \theta, \rho) = P(\mu)P(\theta)P(\rho)$.

Using a Bayesian approach has several advantages. In particular, it allows us to

account for uncertainty in the parameter and network estimates. Additionally, it

allows balancing different sources of information against each other. The

coalescent with recombination model, for example, will tend to favor networks

with fewer recombination events. The cost of adding more recombination events

depends on the recombination rate. At lower rates of recombination, adding new

recombination events is more costly and the information coming from the

sequence alignment in support of a recombination event needs to be greater.

Network likelihood. While the evolutionary history of the entire genome is a network, the evolutionary history of each individual position in the genome can be described as a tree. We can therefore denote the likelihood of observing a sequence alignment (the data, denoted $D$) given a network $N$ and evolutionary model $\mu$ as

$$P(D \mid N, \mu) = \prod_{i=1}^{\text{sequence length}} P(D_i \mid T_i, \mu), \tag{2}$$

with $D_i$ denoting the nucleotides at position $i$ in the sequence alignment and $T_i$ denoting the tree at position $i$. The likelihood at each individual position in the alignment can then be computed using the standard pruning algorithm (ref. 54). We implemented the network likelihood calculation $P(D_i \mid T_i, \mu)$ such that it can make use of all the standard site models in BEAST2. Currently, we only consider strict clock models and therefore do not allow for rate variation across different branches of the network. This is because the number of edges in the network changes over the course of the MCMC, making relaxed clock models more complex to implement. We implemented the network likelihood such that it can cache intermediate results and use unique patterns in the multiple sequence alignment, similar to what is done for tree likelihood computations.
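As a rough illustration of the per-position product in Eq. 2, the following Python sketch computes each site likelihood with the pruning algorithm and sums the log-likelihoods over sites, where every site may have its own tree. It is not the BEAST2 implementation: it assumes a Jukes–Cantor model with equal base frequencies and a strict clock, uses a hypothetical nested-tuple tree encoding and hypothetical function names, and omits the caching and site-pattern compression mentioned above.

```python
import math

NUC = "ACGT"

def jc_prob(same, t, mu=1.0):
    """Jukes-Cantor transition probability over a branch of duration t at rate mu."""
    e = math.exp(-4.0 * mu * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e

def partials(node, site_data):
    """Partial likelihoods P(data below node | state at node) for A, C, G, T.

    A leaf is a sequence name (str); an internal node is a tuple of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return [1.0 if NUC[s] == site_data[node] else 0.0 for s in range(4)]
    out = [1.0] * 4
    for child, t in node:
        child_partials = partials(child, site_data)
        for s in range(4):
            out[s] *= sum(jc_prob(s == c, t) * child_partials[c] for c in range(4))
    return out

def site_log_likelihood(tree, site_data):
    """Log-likelihood of one alignment column given its tree (uniform root frequencies)."""
    return math.log(sum(0.25 * p for p in partials(tree, site_data)))

def network_log_likelihood(site_trees, alignment):
    """Sum of per-site log-likelihoods; site_trees[i] is the tree at position i."""
    total = 0.0
    for i, tree in enumerate(site_trees):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += site_log_likelihood(tree, column)
    return total

# Illustrative input: two sites that happen to share the same two-leaf tree.
tree = (("a", 0.1), ("b", 0.1))
alignment = {"a": "AC", "b": "AG"}
log_lik = network_log_likelihood([tree, tree], alignment)
```

In a recombination network, neighboring positions share the same tree until a breakpoint is crossed, which is why caching per-tree results pays off in practice.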

Network prior. The network prior is denoted by $P(N \mid \theta, \rho)$, which is the probability of observing a network and the embedding of segment trees under the coalescent with recombination model, with effective population size $\theta$ and per-lineage recombination rate $\rho$. It plays essentially the same role that the tree prior plays in phylodynamic analyses on trees.

We can calculate $P(N \mid \theta, \rho)$ by expressing it as the product of exponential waiting times between events (i.e., recombination, coalescent, and sampling events):

$$P(N \mid \theta, \rho) = \prod_{i=1}^{\#\text{events}} P(\text{event}_i \mid L_i, \theta, \rho) \times P(\text{interval}_i \mid L_i, \theta, \rho), \tag{3}$$

where we define $t_i$ to be the time of the $i$th event and $L_i$ to be the set of lineages extant immediately prior to this event. (That is, $L_i = L_t$ for $t \in [t_{i-1}, t_i)$, with $L_t$ the set of lineages extant at time $t$.)

Given that the coalescent process is a constant size coalescent and given the $i$th event is a coalescent event, the event contribution is denoted as

$$P(\text{event}_i \mid L_i, \theta, \rho) = \frac{1}{\theta}. \tag{4}$$

If the $i$th event is a recombination event on lineage $l$ and assuming constant rates of recombination over time, the event contribution is denoted as

$$P(\text{event}_i \mid L_i, \theta, \rho) = \rho\, \mathcal{L}(l). \tag{5}$$

The interval contribution denotes the probability of not observing any event in a given interval. It can be computed as the product of the probabilities of observing neither a coalescent nor a recombination event in interval $i$. We can therefore write:

$$P(\text{interval}_i \mid L_i, \theta, \rho) = \exp\left[-(\lambda^c + \lambda^r)(t_i - t_{i-1})\right], \tag{6}$$

where $\lambda^c$ denotes the rate of coalescence and can be expressed as

$$\lambda^c = \binom{|L_i|}{2} \frac{1}{\theta}, \tag{7}$$

and $\lambda^r$ denotes the rate of observing a recombination event on any coexisting lineage and can be expressed as

$$\lambda^r = \rho \sum_{l \in L_i} \mathcal{L}(l). \tag{8}$$

In order to allow the recombination rates to vary across sections $S_s$ of the genome, indexed by $s$, we modify $\lambda^r$ to differ in each section $S_s$, such that:

$$\lambda^r = \sum_{s \in S} \rho_s \sum_{l \in L_i} \mathcal{L}(l) \cap S_s, \tag{9}$$

with $\mathcal{L}(l) \cap S_s$ denoting the amount of overlap between $\mathcal{L}(l)$ and $S_s$. The recombination rate in each section $s$ is denoted as $\rho_s$.
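To make the product of event and interval contributions concrete, the following Python sketch evaluates the log of the network prior for a time-ordered list of events under a single recombination rate (Eqs. 3 to 8). It is a minimal sketch, not the BEAST2 code: the `Event` record, its field names, and the assumption that the first event occurs at time 0 are hypothetical choices made for illustration, and sampling events contribute no event term because they are conditioned on.

```python
import math
from dataclasses import dataclass

@dataclass
class Event:
    time: float                 # event time t_i, measured backward from the present
    kind: str                   # "sample", "coalescence" or "recombination"
    lengths: list               # carried length of every lineage extant just before the event
    event_length: float = 0.0   # carried length of the recombining lineage (recombination only)

def log_network_prior(events, theta, rho):
    """Log of Eq. 3 for events sorted by increasing time."""
    log_p = 0.0
    t_prev = 0.0
    for ev in events:
        k = len(ev.lengths)
        rate_coal = k * (k - 1) / 2.0 / theta      # Eq. 7
        rate_rec = rho * sum(ev.lengths)           # Eq. 8
        # Interval contribution (Eq. 6): no event between t_prev and ev.time.
        log_p -= (rate_coal + rate_rec) * (ev.time - t_prev)
        # Event contribution (Eqs. 4 and 5).
        if ev.kind == "coalescence":
            log_p += math.log(1.0 / theta)
        elif ev.kind == "recombination":
            log_p += math.log(rho * ev.event_length)
        t_prev = ev.time
    return log_p

# Illustrative history: two samples at time 0, each carrying 100 sites, coalescing at time 1.
events = [
    Event(0.0, "sample", []),
    Event(0.0, "sample", [100.0]),
    Event(1.0, "coalescence", [100.0, 100.0]),
]
log_prior = log_network_prior(events, theta=2.0, rho=0.001)
```

Section-specific rates as in Eq. 9 could be handled by replacing `rho * sum(ev.lengths)` with a sum over sections of the section rate times the overlap of each lineage with that section.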

MCMC algorithm for recombination networks. In order to explore the posterior space of recombination networks, we implemented a series of MCMC operators. These operators often have analogs in operators used to explore different phylogenetic trees and are similar to the ones used to explore reassortment networks (ref. 16). Here, we briefly summarize each of these operators.

Add/remove operator: The add/remove operator adds and removes recombination events. It is an extension of the subtree prune and regraft move for networks (ref. 55) that jointly operates on segment trees as well. We additionally implemented an adapted version that samples the re-attachment under a coalescent distribution to increase acceptance probabilities.

Loci diversion operator: The loci diversion operator randomly changes the location of recombination breakpoints of a recombination event.

Exchange operator: The exchange operator changes the attachment of edges in the network while keeping the network length constant.

Subnetwork slide operator: The subnetwork slide operator changes the height of nodes in the network while allowing changes in the topology.

Scale operator: The scale operator scales the heights of the root node or the whole network without changing the network topology.

Gibbs operator: The Gibbs operator efficiently samples any part of the network that is older than the root of any segment of the alignment and is thus not informed by any genetic data. It is the analog of the Gibbs operator for reassortment networks in ref. 16.

Empty loci preoperator: The empty loci preoperator augments the network with edges that do not carry any loci for the duration of one of the above moves, to allow for larger jumps in network space.

One of the issues when inferring these recombination networks is that the root

height can be substantially larger than when not allowing for recombination events.

This can cause a computational issue when performing inferences. To circumvent

this, we truncate the recombination networks by reducing the recombination rate

sometime after all positions of the sequence alignment have reached their common

ancestor height.

Validation and testing. We validate the implementation of the coalescent with recombination network prior as well as all operators in Fig. S12. We also show that truncating the recombination networks does not affect the sampling of recombination networks prior to reaching the common ancestor height of all positions in the sequence alignment.

We then tested whether we are able to infer recombination networks,

recombination rates, effective population sizes, and evolutionary parameters from


simulated data. To do so, we randomly simulated recombination networks under

the coalescent with recombination. On top of these, we then simulated multiple

sequence alignments. We then re-inferred the parameters used for the simulations using our

MCMC approach. As shown in Fig. S13, these parameters are retrieved well from

simulated data with little bias and accurate coverage of simulated parameters by

credible intervals.

We next tested how well we can retrieve individual recombination events. To do

so, we plot the location and timings of simulated recombination events for the ﬁrst

9 out of 100 simulations. We then plot the density of recombination events in the

posterior distribution of networks, based on the timing and location of the inferred

breakpoint on the genome. As shown in Fig. S14, we are able to retrieve the true

(simulated) recombination events well.

We next tested how the speed of inference scales with the number of

recombination events, the number of samples in the dataset, and the

evolutionary rate. To do so, we simulated 300 recombination networks and

sequence alignments of length 10,000 under a Jukes–Cantor model with between 10 and 200 leaves and a recombination rate between $1 \times 10^{-5}$ and $2 \times 10^{-5}$

recombination events per site per year. This means that for each simulation,

there were between 0 and 100 recombination events, allowing us to investigate

how the inference scales in different settings. As shown in Fig. S15, the ESS per

hour decreases with the number of recombination events and samples, but not

the evolutionary rates. In particular, the ESS per hour decreases much faster

with the number of recombination events in a dataset than the number of

samples. This suggests that the method can currently be applied more easily to datasets with a large number of samples than to datasets with a large number of recombination events.

We next tested how the choice of the prior distribution on the recombination rate impacts the recombination rate estimate. To do so, we simulated 20 recombination networks and sequence alignments of length 10,000 under a Jukes–Cantor model with 100 leaves and a recombination rate drawn randomly from a log-normal distribution. We then inferred the recombination rates using 5 different recombination rate priors, shown in Fig. 5F, including priors that put some or a lot of weight on incorrect parameter values. As shown in Fig. 5A–E, we are able to infer recombination rates, even with the wrong priors.

Additionally, we compared the effective sample size (ESS) values from MCMC runs inferring recombination networks for the MERS spike protein to those from runs that treat the evolutionary histories as trees. We find that although the ESS values are lower when inferring recombination networks, they are not orders of magnitude lower (see Fig. S16).

Recombination network summary. We implemented an algorithm to summarize distributions of recombination networks similar to the maximum clade credibility framework typically used to summarize trees in BEAST (ref. 56). In short, the algorithm summarizes individual trees at each position in the alignment. To do so, we first compute how often we encountered the same coalescent event at every position in the alignment during the MCMC. We then choose the network that maximizes the clade support over each position as the maximum clade credibility (MCC) network.
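The selection step can be sketched in a few lines of Python. The sketch below is a simplified reading of the summary procedure, not the implemented algorithm: it assumes each posterior sample has already been reduced to a mapping from alignment position to the set of clades in the tree at that position, and it scores a sample by the sum of its clade supports, which is an assumption made for illustration (tree-based MCC summaries often use a product of clade credibilities instead). All names are hypothetical.

```python
from collections import Counter

def clade_support(samples):
    """Fraction of posterior samples in which each (position, clade) pair occurs."""
    counts = Counter()
    for sample in samples:
        for pos, clades in sample.items():
            for clade in clades:
                counts[(pos, clade)] += 1
    n = len(samples)
    return {key: c / n for key, c in counts.items()}

def mcc_index(samples):
    """Index of the sample whose clades have the highest total support over all positions."""
    support = clade_support(samples)

    def score(sample):
        return sum(support[(pos, clade)]
                   for pos, clades in sample.items()
                   for clade in clades)

    return max(range(len(samples)), key=lambda i: score(samples[i]))

# Illustrative posterior: three samples, two positions, leaves a, b, c.
ab, ac, bc = frozenset("ab"), frozenset("ac"), frozenset("bc")
samples = [
    {0: {ab}, 1: {ab}},
    {0: {ab}, 1: {bc}},
    {0: {ac}, 1: {bc}},
]
best = mcc_index(samples)
```

In this toy posterior the second sample is selected, because it contains the most frequently observed clade at both positions.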

The MCC networks are logged in the extended Newick format (ref. 57) and can be visualized in icytree.org (ref. 58). Here, we plotted the MCC networks using an adapted version of baltic (https://github.com/evogytis/baltic).

Sequence data. The genetic sequence data for OC43, NL63, and 229E were obtained from ViPR (http://www.viprbrc.org) and were the same as used in ref. 41. All these sequences were isolated from a human host. For OC43 and NL63, we downsampled the dataset used in ref. 41 to 100 sequences. As there were only 54 229E sequences, we did not downsample these data. The sequence data for the MERS analyses were the same as described in ref. 38, but using a randomly downsampled dataset of 100 sequences. For the SARS-like analyses, we used 40 different deposited SARS-like genomes, mostly originating from bats, as well as humans, and one pangolin-derived sequence.

Fig. 5 Impact of the recombination rate prior distribution on the inferred recombination rates. Here, we compare the inferred recombination rates when using different prior distributions that differed from the distribution from which the rates for simulations were sampled. The rates for simulations were sampled from a log-normal distribution with μ = −11.12 and σ = 0.5. In A, we show the inferred rates when using a prior distribution with μ = −12.74 and σ = 0.5 (leading to a 5 times lower mean in real space than the correct prior). In B, we show the inferred rates when using a prior distribution with μ = −12.74 and σ = 2. In C, we show the inferred rates when using the same prior distribution as was sampled under. In D, we show the inferred rates when using a prior distribution with μ = −9.72 and σ = 2. In E, we show the inferred rates when using a prior distribution with μ = −9.72 and σ = 0.5 (leading to a 5 times higher mean in real space than the correct prior). F shows the corresponding density plots for all log-normal distributions used as prior distributions on the recombination rates.

Rates of adaptation. The rates of adaptation were calculated using a modification of the McDonald–Kreitman method, as designed by Bhatt et al. (ref. 40), and implemented in ref. 41. Briefly, for each virus, we aligned the sequence of each gene or genomic region. Then, we split the alignment into 3-year sliding windows, each containing a minimum of 3 sequenced isolates. We used the consensus sequence at the first time point as the outgroup. A comparison of the outgroup to the alignment of each subsequent temporal window yielded a measure of synonymous and non-synonymous fixations and polymorphisms at each position in the alignment. This approach requires sequence data gathered over relatively long time periods, where the consensus genome allows for an accurate description of the long-term evolutionary patterns and, as such, would not be adequate for a pathogen with a relatively short evolutionary history, such as SARS-CoV-2. We used proportional site counting for these estimations (ref. 59). We assumed that selectively neutral sites are all silent mutations as well as replacement polymorphisms occurring at frequencies between 0.15 and 0.75 (ref. 40). We identified adaptive substitutions as non-synonymous fixations and high-frequency polymorphisms that exceed the neutral expectation. We then estimated the rate of adaptation (per codon per year) using linear regression of the number of adaptive substitutions inferred at each time point. In order to compute the 5′ spike and 3′ spike rates of adaptation, we used the weighted average of all coding regions to the left (upstream) or right (downstream) of the spike gene, respectively, using the length of the individual sections as weights. We estimated the uncertainty by running the same analysis on 100 bootstrapped outgroups and alignments.
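The last two steps of the rate-of-adaptation calculation, the regression over time points and the length-weighted average across regions, can be sketched as follows. This is only an illustration in Python under simple assumptions: it uses an ordinary least-squares slope, the function names are hypothetical, and the numbers are made up for the example rather than taken from the analyses. The upstream steps (site counting, neutral-frequency thresholds, bootstrapping) are not reproduced.

```python
def regression_slope(times, adaptive_counts):
    """Ordinary least-squares slope of adaptive substitutions per codon against time (years)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_a = sum(adaptive_counts) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (a - mean_a) for t, a in zip(times, adaptive_counts))
    return sxy / sxx

def length_weighted_rate(rates, lengths):
    """Average of per-region rates, weighted by the length of each region."""
    return sum(r * w for r, w in zip(rates, lengths)) / sum(lengths)

# Made-up example values: adaptive substitutions per codon at four time points.
times = [0.0, 3.0, 6.0, 9.0]
adaptive = [0.0, 0.006, 0.012, 0.018]
rate = regression_slope(times, adaptive)          # per codon per year

# Made-up example: two coding regions upstream of spike with different lengths.
upstream_rate = length_weighted_rate([0.001, 0.003], [300, 100])
```

Repeating both steps on each bootstrapped outgroup and alignment would give a distribution of rates from which error bars can be read off.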

Reporting summary. Further information on research design is available in the Nature

Research Reporting Summary linked to this article.

Data availability

The BEAST2 input XML files for all coronavirus analyses in this manuscript, as well as the files used to post-process these analyses, are available from https://github.com/nicfel/Recombination-Material and in ref. 60. The XML files include the sequence data and exact input specification of the coronavirus analyses performed in this manuscript, except for the sequences published on GISAID. The acknowledgment table for the four GISAID sequences used for the SARS-like analyses is provided in Supplementary Note 1. The GenBank accession numbers for the 229E, OC43, NL63, SARS-like, and MERS analyses are provided as separate tables in Supplementary Data 1. The MERS sequences without accession numbers are taken from ref. 38. Source data are provided with this paper.

Code availability

The Recombination package is implemented as an add-on to the Bayesian phylogenetics software platform BEAST2 (ref. 27). All MCMC analyses performed here were run using adaptive parallel tempering (ref. 61). The source code is available at https://github.com/nicfel/Recombination and in ref. 62. We additionally provide a tutorial on how to set up and post-process analyses at https://github.com/nicfel/Recombination-Tutorial. The MCC networks are plotted using an adapted version of baltic (https://github.com/evogytis/baltic). All other plots are done in R using ggplot2 (ref. 63) and gggenes (ref. 64).

Received: 5 May 2021; Accepted: 30 June 2022;

References

1. Andersen, K. G., Rambaut, A., Lipkin, W. I., Holmes, E. C. & Garry, R. F. The

proximal origin of SARS-CoV-2. Nat. Med. 26, 450–452 (2020).

2. Bedford, T. et al. Cryptic transmission of SARS-COV-2 in washington state.

Science 370, 571–575 (2020).

3. Volz, E. et al. Evaluating the effects of SARS-COV-2 spike mutation d614g on

transmissibility and pathogenicity. Cell 184, 64–75 (2021).

4. Grenfell, B. T. et al. Unifying the epidemiological and evolutionary dynamics

of pathogens. Science 303, 327–332 (2004).

5. Kim, E.-Y. et al. Human apobec3 induced mutation of human

immunodeﬁciency virus type-1 contributes to adaptation and evolution in

natural infection. PLoS Pathog. 10, e1004281 (2014).

6. Simon-Loriere, E. & Holmes, E. C. Why do rna viruses recombine? Nat. Rev.

Microbiol. 9, 617–626 (2011).

7. McDonald, S. M., Nelson, M. I., Turner, P. E. & Patton, J. T. Reassortment in

segmented rna viruses: mechanisms and outcomes. Nat. Rev. Microbiol. 14,

448 (2016).

8. Su, S. et al. Epidemiology, genetic recombination, and pathogenesis of

coronaviruses. Trends Microbiol. 24, 490–502 (2016).

9. Lai, M. RNA recombination in animal and plant viruses. Microbiol. Mol. Biol. Rev. 56, 61–79 (1992).

10. Banner, L. R. & Lai, M. M. C. Random nature of coronavirus rna recombination in the absence of selection pressure. Virology 185, 441–445 (1991).

11. Bobay, L.-M., O’Donnell, A. C. & Ochman, H. Recombination events are

concentrated in the spike protein region of betacoronaviruses. PLoS Genet. 16,

e1009272 (2020).

12. Barton, N. A general model for the evolution of recombination. Genet. Res. 65,

123–144 (1995).

13. Feldman, M. W., Christiansen, F. B. & Brooks, L. D. Evolution of recombination

in a constant environment. Proc. Natl Acad. Sci. USA 77, 4838–4841 (1980).

14. Hill, W. G. & Robertson, A. The effect of linkage on limits to artiﬁcial

selection. Genet. Res. 8, 269–294 (1966).

15. Posada, D. & Crandall, K. A. The effect of recombination on the accuracy of

phylogeny estimation. J. Mol. Evol. 54, 396–402 (2002).

16. Müller, N. F., Stolz, U., Dudas, G., Stadler, T. & Vaughan, T. G. Bayesian

inference of reassortment networks reveals ﬁtness beneﬁts of reassortment in

human inﬂuenza viruses. Proc. Natl Acad. Sci. USA 117, 17104–17111 (2020).

17. Hudson, R. R. Properties of a neutral allele model with intragenic

recombination. Theor. Popul. Biol. 23, 183–201 (1983).

18. Didelot, X., Lawson, D., Darling, A. & Falush, D. Inference of homologous

recombination in bacteria using whole-genome sequences. Genetics 186,

1435–1449 (2010).

19. Vaughan, T. G. et al. Inferring ancestral recombination graphs from bacterial

genomic data. Genetics 205, 857–870 (2017).

20. Rasmussen, M. D., Hubisz, M. J., Gronau, I. & Siepel, A. Genome-wide

inference of ancestral recombination graphs. PLoS Genet. 10, e1004342 (2014).

21. McVean, G. A. & Cardin, N. J. Approximating the coalescent with

recombination. Philos. Trans. R. Soc. B: Biol. Sci. 360, 1387–1393 (2005).

22. Bloomquist, E. W. & Suchard, M. A. Unifying vertical and nonvertical evolution: a stochastic arg-based framework. Syst. Biol. 59, 27–41 (2010).

23. Meng, C. & Kubatko, L. S. Detecting hybrid speciation in the presence of incomplete lineage sorting using gene tree incongruence: a model. Theor. Popul. Biol. 75, 35–45 (2009).

24. Yu, Y., Dong, J., Liu, K. J. & Nakhleh, L. Maximum likelihood inference of

reticulate evolutionary histories. Proc. Natl Acad. Sci. USA 111, 16448–16453

(2014).

25. Bryant, D. & Moulton, V. Neighbor-net: an agglomerative method for the

construction of phylogenetic networks. Mol. Biol. Evol. 21, 255–265 (2004).

26. Huson, D. H. & Bryant, D. Application of phylogenetic networks in

evolutionary studies. Mol. Biol. Evol. 23, 254–267 (2006).

27. Bouckaert, R. et al. BEAST 2.5: an advanced software platform for Bayesian evolutionary analysis. PLoS Comput. Biol. 15, e1006650 https://doi.org/10.1371/journal.pcbi.1006650 (2019).

28. Hon, C.-C. et al. Evidence of the recombinant origin of a bat severe acute

respiratory syndrome (sars)-like coronavirus and its implications on the direct

ancestor of sars coronavirus. J. Virol. 82, 1819–1826 (2008).

29. Li, X. et al. Emergence of SARS-COV-2 through recombination and strong

purifying selection. Sci. Adv. 6, eabb9153 (2020).

30. Boni, M. F. et al. Evolutionary origins of the SARS-COV-2 sarbecovirus

lineage responsible for the covid-19 pandemic. Nat. Microbiol. 5, 1408–1417

(2020).

31. Ge, X.-Y. et al. Isolation and characterization of a bat sars-like coronavirus

that uses the ace2 receptor. Nature 503, 535–538 (2013).

32. Ge, X.-Y. et al. Coexistence of multiple coronaviruses in several bat colonies in

an abandoned mineshaft. Virol. Sin. 31, 31–40 (2016).

33. Zhou, H. et al. A novel bat coronavirus closely related to sars-cov-2 contains

natural insertions at the s1/s2 cleavage site of the spike protein. Curr. Biol. 30,

2196–2203 (2020).

34. Lam, T. T.-Y. et al. Identifying sars-cov-2-related coronaviruses in malayan

pangolins. Nature 583, 282–285 (2020).

35. Duchene, S. et al. Temporal signal and the phylodynamic threshold of sars-cov-2. Virus Evol. 6, veaa061 (2020).

36. Duchêne, S., Holmes, E. C. & Ho, S. Y. Analyses of evolutionary dynamics in

viruses are hindered by a time-dependent bias in rate estimates. Proc. R. Soc.

B: Biol. Sci. 281, 20140732 (2014).

37. Nickbakhsh, S. et al. Epidemiology of seasonal coronaviruses: establishing the

context for the emergence of coronavirus disease 2019. J. Infect. Dis. 222,

17–25 (2020).

38. Dudas, G., Carvalho, L. M., Rambaut, A. & Bedford, T. Mers-cov spillover at

the camel-human interface. Elife 7, e31257 (2018).

39. Reusken, C. B. et al. Geographic distribution of mers coronavirus among

dromedary camels, africa. Emerg. Infect. Dis. 20, 1370 (2014).

40. Bhatt, S., Holmes, E. C. & Pybus, O. G. The genomic rate of molecular adaptation of the human influenza A virus. Mol. Biol. Evol. 28, 2443–2451 (2011).

41. Kistler, K. E. & Bedford, T. Evidence for adaptive evolution in the receptor-

binding domain of seasonal coronaviruses oc43 and 229e. Elife 10, e64509

(2021).

42. Walls, A. C. et al. Structure, function, and antigenicity of the sars-cov-2 spike

glycoprotein. Cell 181, 281–292 (2020).

43. Nachman, M. W. Variation in recombination rate across the genome:

evidence and implications. Curr. Opin. Genet. Dev. 12, 657–663 (2002).

44. Turakhia, Y. et al. Pandemic-scale phylogenomics reveals elevated recombination rates in the sars-cov-2 spike region. Preprint at https://doi.org/10.1101/2021.08.04.455157 (2021).

45. VanInsberghe, D., Neish, A. S., Lowen, A. C. & Koelle, K. Recombinant SARS-CoV-2 genomes circulated at low levels over the first year of the pandemic. Virus Evol. 7, veab059 https://doi.org/10.1093/ve/veab059 (2021).

46. Jackson, B. et al. Generation and transmission of interlineage recombinants in the SARS-CoV-2 pandemic. Cell 184, 5179–5188 (2021).

47. Varabyou, A., Pockrandt, C., Salzberg, S. L. & Pertea, M. Rapid detection of inter-clade recombination in sars-cov-2 with bolotie. Genetics 218, iyab074 (2021).

48. Ignatieva, A., Hein, J. & Jenkins, P. A. Ongoing recombination in SARS-COV-2 revealed through genealogical reconstruction. Mol. Biol. Evol. 39, msac028 https://doi.org/10.1093/molbev/msac028 (2022).

49. Yang, Z., Kumar, S. & Nei, M. A new method of inference of ancestral

nucleotide and amino acid sequences. Genetics 141, 1641–1650 (1995).


50. Neches, R. Y., McGee, M. D. & Kyrpides, N. C. Recombination should not be

an afterthought. Nat. Rev. Microbiol. 18, 606–606 (2020).

51. Stadler, T. On incomplete sampling under birth–death models and connections to

the sampling-based coalescent. J. Theor. Biol. 261, 58–66 (2009).

52. Hudson, R. R. et al. Gene genealogies and the coalescent process. Oxf. Surv.

Evol. Biol. 7, 44 (1990).

53. Lemey, P., Rambaut, A., Drummond, A. J. & Suchard, M. A. Bayesian

phylogeography ﬁnds its roots. PLoS Comput. Biol. 5, e1000520 (2009).

54. Felsenstein, J. Evolutionary trees from dna sequences: a maximum likelihood

approach. J. Mol. Evol. 17, 368–376 (1981).

55. Bordewich, M., Linz, S. & Semple, C. Lost in space? generalising subtree prune and regraft to spaces of phylogenetic networks. J. Theor. Biol. 423, 1–12 (2017).

56. Heled, J. & Bouckaert, R. R. Looking for trees in the forest: summary tree from posterior samples. BMC Evol. Biol. 13, 1–11 (2013).

57. Cardona, G., Rosselló, F. & Valiente, G. ExtendedNewick: it is time for a standard representation of phylogenetic networks. BMC Bioinform. 9, 1–8 (2008).

58. Vaughan, T. G. Icytree: rapid browser-based visualization for phylogenetic

trees and networks. Bioinformatics 33, 2392–2394 (2017).

59. Bhatt, S., Katzourakis, A. & Pybus, O. G. Detecting natural selection in RNA

virus populations using sequence summary statistics. Infect. Genet. Evol. 10,

421–430 (2010).

60. Müller, N. F. nicfel/Recombination-Material: Release for Nat. comm.

recombination manuscript. https://doi.org/10.5281/zenodo.6600818 (2022).

61. Müller, N. F. & Bouckaert, R. R. Adaptive metropolis-coupled mcmc for beast

2. PeerJ 8, e9473 (2020).

62. Müller, N. F. nicfel/Recombination: adds common ancestor heights logger to

beauti. https://doi.org/10.5281/zenodo.5076684 (2021).

63. Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer, 2016).

64. Wilkins, D. gggenes: draw gene arrow maps in ‘ggplot2’. R package version 0.4.0 (2019).

Acknowledgements

We would like to thank Timothy G. Vaughan for his helpful insights into the implementation of the software. N.F.M. is funded by the Swiss National Science Foundation (P2EZP3_191891). K.E.K. is an NSF GRFP Fellow (DGE-1762114). T.B. is a Pew Biomedical Scholar and is supported by NIH R35 GM119774. The Scientific Computing Infrastructure at Fred Hutch is supported by NIH ORIP S10OD028685.

Author contributions

N.F.M. and T.B. conceived and designed the experiments. N.F.M. and K.E.K. performed

the statistical analysis and analyzed the data. N.F.M. implemented the software. N.F.M.,

K.E.K., and T.B. wrote the paper.

Competing interests

The authors declare no competing interests.

Additional information

Supplementary information The online version contains supplementary material

available at https://doi.org/10.1038/s41467-022-31749-8.

Correspondence and requests for materials should be addressed to Nicola F. Müller.

Peer review information Nature Communications thanks the anonymous reviewers for

their contribution to the peer review of this work. Peer reviewer reports are available.

Reprints and permission information is available at http://www.nature.com/reprints

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in

published maps and institutional afﬁliations.

Open Access This article is licensed under a Creative Commons

Attribution 4.0 International License, which permits use, sharing,

adaptation, distribution and reproduction in any medium or format, as long as you give

appropriate credit to the original author(s) and the source, provide a link to the Creative

Commons license, and indicate if changes were made. The images or other third party

material in this article are included in the article’s Creative Commons license, unless

indicated otherwise in a credit line to the material. If material is not included in the

article’s Creative Commons license and your intended use is not permitted by statutory

regulation or exceeds the permitted use, you will need to obtain permission directly from

the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
