Literature Review

Use of phylogenetics in the molecular epidemiology and evolutionary studies of viral infections

Abstract and Figures

Since DNA sequencing techniques first became available almost 30 years ago, the amount of nucleic acid sequence data has increased enormously. Phylogenetics, which is widely applied to compare and analyze such data, is particularly useful for the analysis of genes from rapidly evolving viruses. It has been used extensively to describe the molecular epidemiology and transmission of the human immunodeficiency virus (HIV), the origins and subsequent evolution of the severe acute respiratory syndrome (SARS)-associated coronavirus (SCoV), and, more recently, the evolving epidemiology of avian influenza as well as seasonal and pandemic human influenza viruses. Recent advances in phylogenetic methods can infer more in-depth information about the patterns of virus emergence, adding to the conventional approaches in viral epidemiology. Examples of this information include estimations (with confidence limits) of the actual time of the origin of a new viral strain or its emergence in a new species, viral recombination and reassortment events, the rate of population size change in a viral epidemic, and how the virus spreads and evolves within a specific population and geographical region. Such sequence-derived information obtained from the phylogenetic tree can assist in the design and implementation of public health and therapeutic interventions. However, application of many of these advanced phylogenetic methods is currently limited to specialized phylogeneticists and statisticians, mainly because of their mathematical basis and their dependence on the use of a large number of computer programs. This review attempts to bridge this gap by presenting conceptual, technical, and practical aspects of applying phylogenetic methods in studies of influenza, HIV, and SCoV. It aims to provide, with minimal mathematics and statistics, a practical overview of how phylogenetic methods can be incorporated into virological studies by clinical and laboratory specialists.
... When this virion subsequently infects a new cell, template switching occurs during reverse transcription, resulting in a recombinant virion. The diagram was adapted from Lam et al., 2010, with modification for clarity 110 . ...
... Phylogenetic tree-building methods are generally classified as distance methods or character-based methods (based on how the sequences are compared to infer the tree topology, e.g., a matrix of pairwise genetic distances or discrete character states) 110,282 . Several distance-based methods for inferring phylogenies have been developed, including the Fitch-Margoliash method 283 , the unweighted pair group method with arithmetic mean (UPGMA) 284 , the Minimum Evolution method (ME) 285 , and the Neighbour Joining method (NJ) 286 . ...
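The distance methods named above all start from a matrix of pairwise genetic distances. As a rough illustration only, the sketch below computes such a matrix from a toy alignment, using the proportion of differing sites (p-distance) with the Jukes-Cantor correction. The function names and the toy sequences are hypothetical and are not taken from any of the cited works.

```python
from __future__ import annotations

import math
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jc69_distance(p: float) -> float:
    """Jukes-Cantor correction: d = -3/4 * ln(1 - 4p/3)."""
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(seqs: dict[str, str]) -> dict[tuple[str, str], float]:
    """Corrected distance for every unordered pair of taxa."""
    return {
        (i, j): jc69_distance(p_distance(seqs[i], seqs[j]))
        for i, j in combinations(seqs, 2)
    }


# Toy alignment (made up for illustration).
toy = {"A": "ACGTACGTAC", "B": "ACGTACGTTC", "C": "ACGAACGTTC"}
matrix = distance_matrix(toy)
```

A matrix like this is the input that UPGMA, Minimum Evolution, or Neighbour Joining would then cluster into a tree; real analyses would normally rely on an established package and a substitution model chosen for the data.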
Thesis
... Phylogenetic trees not only describe the similarities and distinctions between species but also aid scientists in understanding how species have developed (Lam et al., 2010;Bbole et al., 2018). If the visual traits required for the identification of a species are not accessible, the DNA sequence from the mtDNA has been proven handy as an effective method of fish identification (Victor et al., 2009;Sogbesan et al., 2017). ...
... The phylogenetic tree portrays a relationship between sets of species and represents a model of evolution. The current forms of species retain many of their ancestral features, some of which gradually change to help these species adjust to their environment (Geetika et al., 2018;Lam et al., 2010). Phylogenetics also relies on information extracted from genetic material such as deoxyribonucleic acid (DNA), ribonucleic acid (RNA) or protein sequences (Bbole et al., 2018). ...
Article
The species of cichlids are easily misidentified due to their morphological similarities. This study, therefore, was designed to discriminate the cichlid species inhabiting Zobe reservoir using a molecular approach. Samples of different cichlid species were collected from the reservoir between August and November 2022 and identified using their morphological features following the field guide to Nigerian freshwater fish. Fin clips of two samples from each of the species were fixed in 100% ethanol and the DNA was extracted. The mitochondrial cytochrome oxidase (CO1) gene area was amplified (using the FishF1 and FishR1 primer pair), purified and sequenced to reveal the identity of the species. Morphologically, the result revealed the presence of four species, namely: Oreochromis niloticus, Oreochromis mossambicus, Sarotherodon galilaeus, and Coptodon zilli. The genetic identity of the samples agreed with the earlier attempt made at morphological discrimination except for O. mossambicus which matches the partial and complete genome of Oreochromis aureus, hence, solving misidentification. The phylogenetic tree of the CO1 genes constructed using MEGA 11 software reveals the species were grouped independently into the three different genera of the Cichlidae family (i.e., the Coptodon, Oreochromis and Sarotherodon). This study affirms the need for molecular confirmation of morphologically identified cichlids in Nigeria.
... The quality of these sequence data varies and requires multistep quality control before finally being used for analysis. 26 Subsequently, evolutionary analysis of consensus DENV genomes has typically been relegated to phylogenetics and phylogenomics, 27,28 where phylogenetic trees are extrapolated to explain evolution as a function of interviral relationships and genetic distance. 28 As DENV evolution occurs on a scale which often matches that of DENV transmission and epidemics, genetic variances among common DENV strains can often be detected within a few weeks, especially when genome-wide data are used. [29][30][31] In recent years, Bayesian phylogenetic methods have been widely applied using software such as BEAST (Bayesian Evolutionary Analysis of Sampling Trees). ...
Article
Dengue virus (DENV) is one of the most important arboviral pathogens in the tropics and subtropics, and nearly one-third of the world's population is at risk of infection. The transmission of DENV involves a sylvatic cycle between nonhuman primates (NHP) and Aedes genus mosquitoes, and an endemic cycle between human hosts and predominantly Aedes aegypti. DENV belongs to the genus Flavivirus of the family Flaviviridae and consists of four antigenically distinct serotypes (DENV-1-4). Phylogenetic analyses of DENV have revealed its origin, epidemiology, and the drivers that determine its molecular evolution in nature. This review discusses how phylogenetic research has improved our understanding of DENV evolution and how it affects viral ecology and improved our ability to analyze and predict future DENV emergence.
... The patterns of shared history between species are revealed by sequence comparison, which aids in the estimation of ancestral states. Finding similarities and relationships between species requires an understanding of biology, which is aided by sequence comparisons. We have two options for sequence comparison: alignment-based and alignment-free (Lam et al., 2010). ...
Article
Phylogenetics is a potent method for determining how modern species have evolved. Scientists can explain the similarities and differences between species and learn more about how species have developed by looking at phylogenetic trees. This study investigates computer approaches for reconstructing species phylogenies and emphasizes the advantages of alignment-free phylogenetic methods. The application of phylogenetic analysis has also been covered in the work.
... Evolutionary analyses of DENV genes and genomes fall under the fields of phylogenetics and phylogenomics in which phylogenetic trees (phylograms) are used to infer the relationship between viruses and how they evolve over time [35] . As DENV evolution occurs on a scale, which often matches that of their transmission, the genetic differences between DENV strains transmitted during epidemics are generally detectable within weeks, especially when using whole virus genomes [36] . ...
Article
The circulation of the four dengue virus (DENV) serotypes has significantly increased in recent years, accompanied by an increase in viral genetic diversity. In order to conduct disease surveillance and understand DENV evolution and its effects on virus transmission and disease, efficient and accurate methods for phylogenetic classification are required. Phylogenetic analysis of different viral gene sequences is the most widely used method, with the envelope gene (E) being the most frequently selected target. We explored the genetic variability of the four DENV serotypes throughout the complete coding sequence (CDS) of sequences available in GenBank and used genomic regions with different rates of variability to recapitulate the phylogeny obtained with the DENV CDS. Our results indicate that the use of regions of high or low variability accurately recapitulates the phylogeny obtained with the CDS of sequences from different DENV genotypes. However, when analyzing the phylogeny of a single genotype, highly variable regions performed better in recapitulating the distance branch length, topology, and support of the CDS phylogeny. The use of three concatenated highly variable regions was not statistically different in distance branch length and support from that obtained in the CDS phylogeny.
• This study demonstrated the ability of highly variable regions of the DENV genome to recapitulate the phylogeny obtained with the full coding sequence (CDS).
• The use of genomic regions of high or low variability did not affect the performance in recapitulating the phylogeny obtained with the CDS from different genotypes. However, when phylogeny was analyzed for sequences from a single genotype, highly variable regions performed better in recapitulating the distance branch length, topology, and support of the CDS phylogeny.
• The use of concatenated highly variable genome regions represents a useful option for recapitulating genome-wide phylogenies in analyses of sequences belonging to the same DENV genotype.
... The SARS-CoV-2 early infections phylogeny as an example of the misinterpreting of phylogenies in epidemiology. Dear Editor, Taking into consideration the underlying concept of molecular epidemiology as the axis of epidemiology, uniting insights at the molecular level and the understanding of diseases at the population and geographic levels, it is not surprising that the use of phylogenetic approaches to understand and to determine the geographic origin and traceability of animal cases, including humans, has increased over the years (1). However, methodological misconceptions, arising from an indiscriminate use of phylogenetic approaches, have in many cases led to wrong conclusions regarding the ancestor-descendant relationships that can be extrapolated to geographic scales (2). ...
Article
The use of phylogenies to study the geographic origin and traceability of infections caused by outbreaks of diseases has increased. However, the use of these phylogenetic tools is limited by the information available, as well as the results obtained by phylogenies. As an example, we analyzed the SARS-CoV-2 genomes available in GenBank up to August 2020: we performed a global alignment and, using Bayesian phylogenetic inference, compared the information on the origin of the first COVID-19 cases in the Latin American countries represented at that date by genomes from COVID-19-positive patients. The results show that, of the six Latin American countries represented, only for Puerto Rico does the information obtained through phylogenetic analysis correspond to the clinical case study carried out by that country's Ministry of Health. This shows that, for outbreaks similar to the one caused by SARS-CoV-2 and with incomplete information, the results of these phylogenetic analyses should be treated with caution in geographic epidemiology studies.
... Phylogenetics is widely used to provide insight into the evolutionary history and relationships between sequenced viruses. In epidemiology, phylogenetics is used to identify the biological determinants and population dynamics that underpin the successful transmission of pathogens (Eybpoosh et al., 2017;Grenfell et al., 2004;Lam et al., 2010). Coupled with high-throughput sequencing technologies, phylogenetics has proven to be an invaluable tool in monitoring global viral outbreaks such as the 2014 Ebola epidemic (Quick et al., 2016) and the ongoing COVID-19 pandemic (Hadfield et al., 2018). ...
Thesis
Recombination detection is a critical step when analysing viral sequencing data. When unaccounted for, recombination has the potential to mislead estimations of the evolutionary history and relationships between viruses. A repertoire of recombination detection methods have been developed over the past two decades, but their ability to process increasingly large viral datasets is unclear. In this thesis, five recombination detection methods (PhiPack (Profile), 3SEQ, GENECONV, VSEARCH (UCHIME), and gmos) are evaluated to determine if any are suitable for the analysis of bulk next-generation sequencing data. Analysis of datasets simulated across a wide range of mutation and recombination rates, and three empirical datasets, revealed that the assessed recombination detection methods may not be scalable, nor robust, for the analysis of bulk next-generation sequencing data. In particular, the most scalable methods VSEARCH (UCHIME) and gmos may not be suitable due to respective technical limitations. Overall, no single recombination detection method is suited for the analysis of all types of viral sequencing data, and the critical trade-offs between the methods are outlined. Recombination detection remains a complex problem. Continual evaluation of detection methods, particularly novel approaches, should be conducted to identify both scalable and robust methods to meet the need for the rapid analysis of bulk viral sequencing data.
Article
Emerging infectious diseases (EIDs) pose an increasingly significant global burden, driven by urbanization, population explosion, global travel, changes in human behavior, and inadequate public health systems. The recent SARS-CoV-2 pandemic highlights the urgent need for innovative and robust technologies to effectively monitor newly emerging pathogens. Rapid identification, epidemiological surveillance, and transmission mitigation are crucial challenges for ensuring public health safety. Genomics has emerged as a pivotal tool in public health during pandemics, enabling the diagnosis, management, and prediction of infections, as well as the analysis and identification of cross-species interactions and the categorization of infectious agents. Recent advancements in high-throughput DNA sequencing tools have facilitated rapid and precise identification and characterization of emerging pathogens. This review article provides insights into the latest advances in various genomic techniques for pathogen detection and tracking and their applications in global outbreak surveillance. We assess methods that leverage pathogen sequences and explore the role of genomic analysis in understanding the epidemiology of newly emerged infectious diseases. Additionally, we address technical challenges and limitations, ethical and legal considerations, and highlight opportunities for integrating genomics with other surveillance approaches. By delving into the prospects and obstacles of genomics, we can gain valuable insights into its role in mitigating the threats posed by emerging pathogens and improving global preparedness in the face of future outbreaks.
Article
The field of biological sciences enumerates various important aspects of life forms, some of which remain mysterious to explain or validate. Though there has been a revolutionary advancement of tools and techniques to study different biological phenomena under laboratory conditions, there are significant limitations with implementing each concept in the laboratory. Sometimes, it is practically impossible to simulate the actual environmental conditions of living systems under in-vitro settings, or sometimes the requirements of life are discordant with various analytical techniques, or the study of complex evolutionary processes becomes technically difficult using wet-lab methods. Thus, these experimental challenges confine our boundary to explore the real world of biological systems, which may lead to the acquisition of knowledge with few lacunae. Bioinformatics as a discipline tries to fill those spaces to a major extent using in silico analysis and provides a deeper theoretical foundation and validation of existing biological principles. The importance and potential of bioinformatics have been witnessed in the pandemic times to fight against Covid-19. From developing new drugs and vaccines to crop improvement and space and environment studies, the field of bioinformatics has many prospects. This article aims to provide a bird’s-eye view of different applications and cutting-edge bioinformatics approaches to understand biological systems and the emerging need to integrate this course into the education framework.
Article
Disaggregated computer architectures eliminate resource fragmentation in next-generation datacenters by enabling virtual machines to employ resources such as CPUs, memory, and accelerators that are physically located on different servers. While this paves the way for highly compute- and/or memory-intensive applications to potentially deploy all CPUs and/or memory resources in a datacenter, it poses a major challenge to the efficient deployment of hardware accelerators: input/output data can reside on different servers than the ones hosting accelerator resources, thereby requiring time- and energy-consuming remote data transfers that diminish the gains of hardware acceleration. Targeting a disaggregated datacenter architecture similar to the IBM dReDBox disaggregated datacenter prototype, the present work explores the potential of deploying custom acceleration units adjacently to the disaggregated-memory controller on memory bricks (in dReDBox terminology), which is implemented on FPGA technology, to reduce data movement and improve performance and energy efficiency when reconstructing large phylogenies (evolutionary relationships among organisms). A fundamental computational kernel is the Phylogenetic Likelihood Function (PLF), which dominates the total execution time (up to 95%) of widely used maximum-likelihood methods. Numerous efforts to boost PLF performance over the years focused on accelerating computation; since the PLF is a data-intensive, memory-bound operation, performance remains limited by data movement, and memory disaggregation only exacerbates the problem. 
We describe two near-memory processing models, one that addresses the problem of workload distribution to memory bricks, which is particularly tailored toward larger genomes (e.g., plants and mammals), and one that reduces overall memory requirements through memory-side data interpolation transparently to the application, thereby allowing the phylogeny size to scale to a larger number of organisms without requiring additional memory.
Article
The controversy over the evolutionary advantage of recombination initially discovered by Fisher and by Muller is reviewed. Those authors whose models had finite-population effects found an advantage of recombination, and those whose models had infinite populations found none. The advantage of recombination is that it breaks down random linkage disequilibrium generated by genetic drift. Hill and Robertson found that the average effect of this randomly-generated linkage disequilibrium was to cause linked loci to interfere with each other's response to selection, even where there was no gene interaction between the loci. This effect is shown to be identical to the original argument of Fisher and Muller. It also predicts the "ratchet mechanism" discovered by Muller, who pointed out that deleterious mutants would more readily increase in a population without recombination. Computer simulations of substitution of favorable mutants and of the long-term increase of deleterious mutants verified the essential correctness of the original Fisher-Muller argument and the reality of the Muller ratchet mechanism. It is argued that these constitute an intrinsic advantage of recombination capable of accounting for its persistence in the face of selection for tighter linkage between interacting polymorphisms, and possibly capable of accounting for its origin.
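The ratchet mechanism described above can be illustrated with a very small simulation. This is a minimal sketch under simplifying assumptions that are not taken from the paper: a fixed-size asexual population, no back mutation, at most one new deleterious mutation per offspring per generation, and multiplicative fitness. All parameter values and the function name are illustrative.

```python
import random


def ratchet(pop_size=50, generations=200, u=0.3, s=0.02, seed=0):
    """Track the mutation count of the least-loaded class over time.

    pop_size: number of asexual individuals (held constant)
    u: probability that an offspring gains one new deleterious mutation
    s: fitness cost per mutation (fitness = (1 - s) ** k)
    """
    rng = random.Random(seed)
    pop = [0] * pop_size  # deleterious mutation count per individual
    least_loaded = []
    for _ in range(generations):
        # Selection: parents are drawn in proportion to their fitness.
        weights = [(1 - s) ** k for k in pop]
        parents = rng.choices(pop, weights=weights, k=pop_size)
        # Mutation: offspring inherit the parental load and may add to it.
        pop = [k + (1 if rng.random() < u else 0) for k in parents]
        least_loaded.append(min(pop))
    return least_loaded


trace = ratchet()
```

Because offspring can never carry fewer mutations than their parent in this model, the minimum load recorded in `trace` can only stay the same or increase; each increase is one "click" of the ratchet. Recombination, which is deliberately left out here, is what would allow less-loaded genotypes to be regenerated.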
Article
The recently-developed statistical method known as the "bootstrap" can be used to place confidence intervals on phylogenies. It involves resampling points from one's own data, with replacement, to create a series of bootstrap samples of the same size as the original data. Each of these is analyzed, and the variation among the resulting estimates taken to indicate the size of the error involved in making estimates from the original data. In the case of phylogenies, it is argued that the proper method of resampling is to keep all of the original species while sampling characters with replacement, under the assumption that the characters have been independently drawn by the systematist and have evolved independently. Majority-rule consensus trees can be used to construct a phylogeny showing all of the inferred monophyletic groups that occurred in a majority of the bootstrap samples. If a group shows up 95% of the time or more, the evidence for it is taken to be statistically significant. Existing computer programs can be used to analyze different bootstrap samples by using weights on the characters, the weight of a character being how many times it was drawn in bootstrap sampling. When all characters are perfectly compatible, as envisioned by Hennig, bootstrap sampling becomes unnecessary; the bootstrap method would show significant evidence for a group if it is defined by three or more characters.
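The resampling scheme described above (keep every species, sample characters with replacement) can be sketched as follows. To keep the example short, tree inference is replaced by a deliberately crude stand-in: the pair of taxa with the fewest differences is treated as the inferred group. That stand-in, the function names, and the toy alignment are assumptions for illustration and not part of the original method.

```python
import random
from itertools import combinations


def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement, keeping every taxon."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def closest_pair(alignment):
    """Stand-in for tree inference: the pair of taxa with fewest differences.

    Ties are broken in favour of the alphabetically first pair.
    """
    def diffs(a, b):
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))

    return min(combinations(sorted(alignment), 2), key=lambda p: diffs(*p))


def support(alignment, pair, replicates=200, seed=1):
    """Fraction of bootstrap replicates in which `pair` is recovered."""
    rng = random.Random(seed)
    hits = sum(
        closest_pair(bootstrap_columns(alignment, rng)) == pair
        for _ in range(replicates)
    )
    return hits / replicates


# Toy alignment (made up): A and B are identical, C and D diverge.
toy_alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAC",
    "C": "TTGTACCTAC",
    "D": "TTGAACCTGG",
}
ab_support = support(toy_alignment, ("A", "B"))
```

In a real analysis each replicate alignment would be passed to the same tree-building method as the original data, and the support for a group would be the fraction of replicate trees containing it.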
Article
Episodes of population growth and decline leave characteristic signatures in the distribution of nucleotide (or restriction) site differences between pairs of individuals. These signatures appear in histograms showing the relative frequencies of pairs of individuals who differ by i sites, where i = 0, 1, .... In this distribution an episode of growth generates a wave that travels to the right, traversing 1 unit of the horizontal axis in each 1/2u generations, where u is the mutation rate. The smaller the initial population, the steeper will be the leading face of the wave. The larger the increase in population size, the smaller will be the distribution's vertical intercept. The implications of continued exponential growth are indistinguishable from those of a sudden burst of population growth. Bottlenecks in population size also generate waves similar to those produced by a sudden expansion, but with elevated upper-tail probabilities. Reductions in population size initially generate L-shaped distributions with high probability of identity, but these converge rapidly to a new equilibrium. In equilibrium populations the theoretical curves are free of waves. However, computer simulations of such populations generate empirical distributions with many peaks and little resemblance to the theory. On the other hand, agreement is better in the transient (nonequilibrium) case, where simulated empirical distributions typically exhibit waves very similar to those predicted by theory. Thus, waves in empirical distributions may be rich in information about the history of population dynamics.
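The histogram described above, the relative frequency of pairs of individuals differing at i sites, is straightforward to compute from aligned sequences. The sketch below shows only that empirical step; it does not fit any demographic model. The function name and toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mismatch_distribution(seqs):
    """Relative frequency of sequence pairs that differ at i sites.

    Returns a list f where f[i] is the fraction of pairs with exactly
    i differences, for i = 0 .. the largest observed difference.
    """
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    counts = Counter(
        sum(x != y for x, y in zip(a, b)) for a, b in combinations(seqs, 2)
    )
    total = sum(counts.values())
    return [counts.get(i, 0) / total for i in range(max(counts) + 1)]


# Toy sample of aligned sequences (made up for illustration).
sample = ["AAAA", "AAAT", "AATT"]
freqs = mismatch_distribution(sample)
```

Plotting `freqs` against i gives the empirical curve whose waves, according to the abstract, may carry information about past changes in population size.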
Article
We explored the epidemic history of HIV-1 subtype B in the United Kingdom by using statistical methods that infer the population history of pathogens from sampled gene sequence data. Phylogenetic analysis of HIV-1 pol gene sequences from Britain showed at least six large transmission chains, indicating a genetically variable, but epidemiologically homogeneous, epidemic among men having sex with men. Through coalescent-based analysis, we showed that these chains arose through separate introductions of subtype B strains into the United Kingdom in the early to mid-1980s. After an initial period of exponential growth, the rate of spread generally slowed in the early 1990s, which is more likely to correlate with behavior change than with reduced infectiousness resulting from highly active antiretroviral therapy. Our results provide insights into the complexity of HIV-1 epidemics that must be considered when developing HIV monitoring and prevention initiatives.
Article
A new method called the neighbor-joining method is proposed for reconstructing phylogenetic trees from evolutionary distance data. The principle of this method is to find pairs of operational taxonomic units (OTUs [= neighbors]) that minimize the total branch length at each stage of clustering of OTUs starting with a starlike tree. The branch lengths as well as the topology of a parsimonious tree can quickly be obtained by using this method. Using computer simulation, we studied the efficiency of this method in obtaining the correct unrooted tree in comparison with that of five other tree-making methods: the unweighted pair group method of analysis, Farris's method, Sattath and Tversky's method, Li's method, and Tateno et al.'s modified Farris method. The new, neighbor-joining method and Sattath and Tversky's method are shown to be generally better than the other methods.
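The pair-selection principle can be illustrated with a single clustering step. The sketch below uses the commonly presented Q-criterion formulation of neighbor-joining, Q(i, j) = (n - 2) d(i, j) - r_i - r_j, where r_i is the sum of distances from i to all other taxa; this formulation, the function name, and the toy distances are assumptions for illustration, not the paper's own notation or code.

```python
from itertools import combinations


def nj_step(names, d):
    """One neighbor-joining step on a distance table.

    names: list of taxon labels
    d: dict mapping frozenset({i, j}) to the distance between i and j
    Returns the chosen pair, their branch lengths to the new internal
    node, and the full Q table.
    """
    n = len(names)
    if n < 3:
        raise ValueError("need at least three taxa")

    def dist(i, j):
        return 0.0 if i == j else d[frozenset((i, j))]

    # r[i]: total distance from taxon i to every other taxon.
    r = {i: sum(dist(i, k) for k in names) for i in names}
    q = {
        (i, j): (n - 2) * dist(i, j) - r[i] - r[j]
        for i, j in combinations(names, 2)
    }
    i, j = min(q, key=q.get)  # the pair of neighbors to join
    len_i = 0.5 * dist(i, j) + (r[i] - r[j]) / (2 * (n - 2))
    len_j = dist(i, j) - len_i
    return (i, j), (len_i, len_j), q


# Toy distances between five taxa (made up for illustration).
taxa = ["a", "b", "c", "d", "e"]
raw = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
distances = {frozenset(pair): value for pair, value in raw.items()}
pair, lengths, q_table = nj_step(taxa, distances)
```

For these toy distances the step joins a and b, with branch lengths 2 and 3. A full implementation would replace the joined pair with the new internal node, recompute distances to it, and repeat until the tree is resolved.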
Article
We determined the origin and evolutionary pathways of the PB1 genes of influenza A viruses responsible for the 1957 and 1968 human pandemics and obtained information on the variable or conserved region of the PB1 protein. The evolutionary tree constructed from nucleotide sequences suggested the following: (i) the PB1 gene of the 1957 human pandemic strain, A/Singapore/1/57 (H2N2), was probably introduced from avian species and was maintained in humans until 1968; (ii) in the 1968 pandemic strain, A/NT/60/68 (H3N2), the PB1 gene was not derived from the previously circulating virus in humans but probably from another avian virus; and (iii) a current human H3N2 virus inherited the PB1 gene from an A/NT/60/68-like virus. Nucleotide sequence analysis also showed that the avian PB1 gene was introduced into pigs. Hence, transmission of the PB1 gene from avian to mammalian species is a relatively frequent event. Comparative analysis of deduced amino acid sequences disclosed highly conserved regions in PB1 proteins, which may be key structures required for PB1 activities.