Distribution of the average allele frequency change (AFC) of the rising allele for the top 2000 candidates. AFC was calculated for each SNP as the average difference between the base and end populations across replicates. (a-b) AFC of the top 2000 candidates of the simulated data with 5 replicates, with GP performed on 4 (a) and 6 (b) time points, respectively. (c) AFC of the top 2000 candidates of the simulated data with 3 replicates, with GP performed on 4 time points. (d) AFC of the top 2000 candidates of the real data. We observed a significant location shift between the AFC distributions of the top 2000 candidate SNPs of the CMH test and the BBGP (Mann-Whitney U, p-value < 2.2e-16 for all panels). This shift indicates that the CMH test mostly captures large AFC, while the GP-based methods are also sensitive to consistent signals coming from intermediate time points.
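For readers who want to reproduce this kind of comparison, here is a minimal sketch of the two steps the caption describes: averaging the per-SNP allele frequency change across replicates and applying a two-sided Mann-Whitney U test for a location shift between two candidate sets. The data layout and variable names (freq_base, freq_end, cmh_afc, bbgp_afc) are illustrative assumptions, not the published pipeline.

```python
# Minimal sketch; the data layout and values are illustrative assumptions.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
n_snps, n_reps = 2000, 5

# Hypothetical rising-allele frequencies in the base and end populations (SNPs x replicates).
freq_base = rng.uniform(0.05, 0.5, size=(n_snps, n_reps))
freq_end = np.clip(freq_base + rng.normal(0.1, 0.1, size=(n_snps, n_reps)), 0.0, 1.0)

# Average allele frequency change (AFC) per SNP across replicates.
afc = (freq_end - freq_base).mean(axis=1)

# Stand-ins for the AFC values of the top candidates of two methods (here: an arbitrary split).
cmh_afc, bbgp_afc = afc[:1000], afc[1000:]

# Two-sided Mann-Whitney U test for a location shift between the two candidate sets.
stat, pval = mannwhitneyu(cmh_afc, bbgp_afc, alternative="two-sided")
print(f"U = {stat:.1f}, p = {pval:.3g}")
```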


Source publication
Article
Full-text available
Recent advances in high-throughput sequencing (HTS) have made it possible to monitor genomes in great detail. New experiments not only use HTS to measure genomic features at one time point but to monitor them changing over time with the aim of identifying significant changes in their abundance. In population genetics, for example, allele frequencie...

Similar publications

Article
Full-text available
Experimental evolution is a powerful tool to investigate complex traits. Artificial selection can be applied for a specific trait and the resulting phenotypically divergent populations pool-sequenced to identify alleles that occur at substantially different frequencies in the extreme populations. In order to maximize the proportion of loci that are...
Article
Full-text available
Population genetics predicts that tight linkage between new and/or pre-existing beneficial and deleterious alleles should decrease the efficiency of natural selection in finite populations. By decoupling beneficial and deleterious alleles and facilitating the combination of beneficial alleles, recombination accelerates the formation of high-fitness...
Article
Full-text available
The molecular underpinnings of pigmentation diversity in Drosophila have recently emerged as a model for understanding how the evolution of different cis-regulatory variants results in common adaptive phenotypes within species. We compared sequence variation in a 5′ regulatory region harboring a modular enhancer containing a ∼0.7-kb core element co...

Citations

... Genomewide sequencing can provide extensive catalogues of changes in the frequency spectrum of genetic variants in lab evolved populations, as well as a portrait of their expression levels (Remolina et al. 2012;Schlotterer et al. 2015;Mallard et al. 2018). Yet despite recent attempts (Burke et al. 2016;Turner et al. 2011;Schlotterer et al. 2015;Graves et al. 2017;Hsu et al. 2020), the study of the genomewide architecture of adaptation is still in its infancy (Braendle et al. 2011;Topa et al. 2015;Taus et al. 2017;Kelly and Hughes 2018;Vlachos et al. 2019). Multiple questions about the molecular basis of such laboratory adaptation remain unanswered. ...
... Selection can lead to a wide array of reproducible changes in the genome (Topa et al. 2015;Graves et al. 2017;Taus et al. 2017;Hsu et al. 2020). In addition, there have been a handful of studies that focus on reproducible changes in the transcriptome between populations that have differing selection regimes (Remolina et al. 2012;Mallard et al. 2018;Barter et al. 2019). ...
Article
Dissecting the molecular basis of adaptation remains elusive despite our ability to sequence genomes and transcriptomes. At present, most genomic research on selection focusses on signatures of selective sweeps in patterns of heterozygosity. Other research has studied changes in patterns of gene expression in evolving populations but has not usually identified the genetic changes causing these shifts in expression. Here we attempt to go beyond these approaches by using machine learning tools to explore interactions between the genome, transcriptome, and life-history phenotypes in two groups of 10 experimentally evolved Drosophila populations subjected to selection for opposing life history patterns. Our findings indicate that genomic and transcriptomic data have comparable power for predicting phenotypic characters. Looking at the relationships between the genome and the transcriptome, we find that the expression of individual transcripts is influenced by many sites across the genome that are differentiated between the two types of populations. We find that single-nucleotide polymorphisms (SNPs), transposable elements, and indels are powerful predictors of gene expression. Collectively, our results suggest that the genomic architecture of adaptation is highly polygenic with extensive pleiotropy.
... They serve to reduce the dimensionality of the original data while trying to capture as much information as possible. Previously proposed estimates and test statistics [29], [54], [49], [14], [45], [13], [31], [38], [17], [30], [36], [15] for the selection coefficient s seem to be the natural summary statistics in this case. Other quantities that are influenced by s, such as estimates of the effective population size (N e ) have also been used as summary statistics before, see [18]. ...
Preprint
Full-text available
With the exact likelihood often intractable, likelihood-free inference plays an important role in the field of population genetics. Indeed, several methodological developments in the context of Approximate Bayesian Computation (ABC) were inspired by population genetic applications. Here we explore a novel combination of recently proposed ABC tools that can deal with high dimensional summary statistics and apply it to infer selection strength and the number of selected loci for data from experimental evolution. While there are several methods to infer selection strength that operate on a single SNP level, our window based approach provides additional information about the selective architecture in terms of the number of selected positions. This is not trivial, since the spatial correlation introduced by genomic linkage leads to signals of selection also at neighboring SNPs. A further advantage of our approach is that we can easily provide an uncertainty quantification using the ABC posterior. Both on simulated and real data, we demonstrate a promising performance. This suggests that our ABC variant could also be interesting in other applications.
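As a rough, hedged illustration of the likelihood-free idea described in this abstract (not the authors' actual ABC variant or summary statistics), the sketch below runs rejection ABC for a selection coefficient on a toy haploid Wright-Fisher trajectory; the simulator, prior range, summary statistic and tolerance are all assumptions.

```python
# Minimal rejection-ABC sketch; simulator, prior, summary statistic and tolerance are assumptions.
import numpy as np

rng = np.random.default_rng(1)
N, generations = 500, 60


def simulate_trajectory(s, p0=0.1):
    """Haploid Wright-Fisher trajectory of an allele under selection coefficient s."""
    p, traj = p0, [p0]
    for _ in range(generations):
        p_sel = p * (1 + s) / (p * (1 + s) + (1 - p))  # deterministic selection step
        p = rng.binomial(N, p_sel) / N                 # binomial drift step
        traj.append(p)
    return np.array(traj)


observed = simulate_trajectory(s=0.05)                 # pseudo-observed data

# Rejection step: keep draws whose final allele frequency (the summary statistic)
# lands close to the observed one.
accepted = []
for _ in range(10000):
    s_prop = rng.uniform(-0.1, 0.2)                    # uniform prior on s
    if abs(simulate_trajectory(s_prop)[-1] - observed[-1]) < 0.05:
        accepted.append(s_prop)

if accepted:
    print(f"approximate posterior mean of s: {np.mean(accepted):.3f} "
          f"({len(accepted)} accepted draws)")
else:
    print("no draws accepted; loosen the tolerance")
```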
... Hence they do not estimate epistasis, nor do they account for epistatic effects when estimating the fitness advantage of an allele. Most existing methods are based on single-locus models which assume independent evolution of loci (Bollback et al. 2008;Malaspinas et al. 2012;Mathieson and McVean 2013;Feder et al. 2014;Lacerda and Seoighe 2014;Steinrücken et al. 2014;Foll et al. 2015;Topa et al. 2015;Ferrer-Admetlla et al. 2016;Gompert 2016;Schraiber et al. 2016;Iranmehr et al. 2017;Taus et al. 2017;Zinger et al. 2019), thus they are unable to directly account for genetic linkage or epistasis. A few methods (Illingworth and Mustonen 2011;Terhorst et al. 2015;Sohail et al. 2021) have been developed that consider the joint evolution of multiple loci, but these assume additive fitness models. ...
Article
Full-text available
Epistasis refers to fitness or functional effects of mutations that depend on the sequence background in which these mutations arise. Epistasis is prevalent in nature, including populations of viruses, bacteria, and cancers, and can contribute to the evolution of drug resistance and immune escape. However, it is difficult to directly estimate epistatic effects from sampled observations of a population. At present, there are very few methods that can disentangle the effects of selection (including epistasis), mutation, recombination, genetic drift, and genetic linkage in evolving populations. Here we develop a method to infer epistasis, along with the fitness effects of individual mutations, from observed evolutionary histories. Simulations show that we can accurately infer pairwise epistatic interactions provided that there is sufficient genetic diversity in the data. Our method also allows us to identify which fitness parameters can be reliably inferred from a particular data set and which ones are unidentifiable. Our approach therefore allows for the inference of more complex models of selection from time series genetic data, while also quantifying uncertainty in the inferred parameters.
... The covariance function encodes our assumptions about the underlying signal in the data [12]. Gaussian processes are typically used in complex statistical models consisting of observed variables, unknown parameters and latent variables, with many sorts of relationships among these types of random variables. As is typical in Bayesian inference, the unobserved variables are the parameters and latent variables [11], [13]. ...
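To make the remark about the covariance function concrete, here is a small generic sketch of a squared-exponential (RBF) kernel and the GP prior it induces over a set of time points; the kernel choice, lengthscale and variance are arbitrary illustrations rather than values from the cited work.

```python
# Illustrative RBF covariance for a GP prior over time points; values are arbitrary.
import numpy as np


def rbf_kernel(t1, t2, variance=1.0, lengthscale=15.0):
    """Squared-exponential covariance: smooth trajectories, correlation decays with |t1 - t2|."""
    sqdist = (t1[:, None] - t2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)


timepoints = np.array([0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
K = rbf_kernel(timepoints, timepoints)

# Draw a few latent trajectories from the GP prior N(0, K) to see what the kernel assumes.
rng = np.random.default_rng(2)
samples = rng.multivariate_normal(np.zeros(len(timepoints)), K, size=3)
print(np.round(samples, 2))
```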
Article
Full-text available
Time-series gene expression analysis is used in many biological studies; here it is applied to differences in transcriptional regulation between two mutant mouse strains whose phenotypes differ, with one strain succumbing to ALS far more quickly than the other. The aim of the work is to determine a candidate list of genes or pathways that would give insight into the mechanism behind this difference in phenotype. Gaussian processes are efficient and practical for analysing such series, so Gaussian process (GP) regression with a coregionalization model was built to identify these candidates. The model accounts for additional structure within the time series by treating the data as correlated outputs for the ALS mouse model. The results show that the model detects gene expression differences associated with the phenotypic difference: in four cases the genes alter their behaviour, and the analysis also reveals genes that behave the same across both mutations and both strains.
... Gaussian process approach to analysing functional data in biomedical applications is extensive [205, 276, 433, 235, 195, 458]. We assume each quantitative protein profile can be described by some unknown function, with the uncertainty in this function captured using a Gaussian process (GP) prior. Each sub-cellular niche is described by distinct density-gradient profiles, which display a non-linear structure with no particular parametric assumption being suitable. ...
Thesis
Proteins are biomolecules that govern the biochemical processes of the cell. Correct cellular function, therefore, depends on correct protein function. For a protein to function as intended, there need to be sufficient copies of that protein, it should be correctly folded into its tertiary structure and ought to be in proximity of its interaction partners, amongst many other requirements. For a protein to be in the proximity of its interaction partners, whether those be other proteins, RNA or metabolites, it needs to be localised to the required compartment. Cells from all organisms display sub-cellular compartmentalisation, though to vastly differing degrees. E. coli, for example, has remarkably simple sub-cellular organisation, whilst the apicomplexan Toxoplasma gondii has a vast number of specialised organelles. In seminal experiments, Christian De Duve showed that upon biochemical fractionation of the cell, proteins co-fractionated if they were localised to the same organelle. These experiments led to the discovery of two organelles: the lysosome and the peroxisome, for which Christian De Duve was awarded the Nobel prize. Upon the advent of mass-spectrometry, these experiments were refashioned into high-throughput techniques with the development of Localisation of Organelle Proteins by Isotope Tagging (LOPIT) and Protein Correlation Profiling (PCP). These techniques have since been redeveloped, and a typical experiment can accurately measure thousands of proteins per experiment, whilst also providing information on (at least) a dozen subcellular compartments. To analyse spatial proteomics data, they are first annotated with marker proteins, which are proteins with a priori known unambiguous localisations. Typical analysis proceeds by training a machine learning classifier to assign proteins with unknown localisations to one of the compartments based on the spatial proteomics data. However, this framework holds back spatial proteomics from answering more complex questions. The first challenge is that proteins are not necessarily localised to a single compartment and so there is uncertainty associating a protein with an organelle. There is also uncertainty associated with the experiment itself, for example, reproducing the biochemical fractionation and the stochastic nature of mass spectrometric quantitation. Two chapters of my thesis are dedicated to alleviating this problem by developing a Bayesian model for spatial proteomics data, with dedicated software. These approaches perform competitively with state-of-the-art classification algorithms whilst Markov-chain Monte Carlo algorithms are employed to sample from the posterior distribution of localisation probabilities. This is the basis for quantifying uncertainty in protein-organelle associations. This Bayesian approach has several limitations, for example, it still relies on marker proteins. This precludes analysis of poorly annotated non-model organisms using spatial proteomics techniques. A chapter of my thesis is dedicated to this challenge, with a motivating application to the T. gondii sub-proteome. Following on from this, in a separate chapter, I develop a semi-supervised Bayesian model that reduces the reliance on marker proteins. The application to T. gondii constitutes a massive knowledge expansion revealing localisation of thousands of proteins to complex specialised niches.
I also analyse the relative redundancy of the organelle sub-proteomes and the selective pressure of the host-adaptive response, revealing previously unknown insights. The semi-supervised Bayesian approach makes use of the principle of over-fitted mixtures, currently used for data clustering, by extending it to model spatial proteomics data. Reanalysis of spatial proteomics data reveals new annotations in all datasets and allows interrogation of previously overlooked organelles. Another limitation of the approaches thus far is the parametric assumptions made by the Bayesian approach. One chapter is dedicated to placing the analysis of spatial proteomics in the semi-supervised Bayesian non-parametric context. In the final chapters of the thesis, I summarise the modern questions that spatial proteomics seeks to answer, including deciphering multi-localisation, change in localisation and the effect of post-translational modifications on subcellular localisation. I carefully define these problems and motivate further Bayesian models. I develop a Bayesian model to analyse differential localisation experiments; that is, spatial proteomics concerned with changes in localisation. This approach improves on the ad hoc methods currently applied to such data. I conclude with the limitations of our approach and potential solutions to the other methods.
... to characterize the underlying evolutionary processes at unprecedented detail [7][8][9][10]. The great potential of E&R studies in combination with the continuously growing data sets of powerful experiments has driven the development of a diverse set of methods to detect selected SNPs, which change in allele frequency more than expected under neutrality [11][12][13][14][15][16][17][18][19]. Some of the published methods use this information to estimate the underlying selection coefficient and dominance [11,14,19,20]. ...
... Beta-binomial Gaussian process (BBGP). BBGP employs a beta-binomial Gaussian process to detect significant allele frequency changes over time [17]. The beta-binomial model corrects for the uncertainty arising from finite sequencing depth. ...
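As a hedged sketch of the observation-model idea behind BBGP (not the BBGP implementation itself), the snippet below contrasts a plain binomial likelihood with a beta-binomial likelihood for pooled allele counts at finite sequencing depth; the coverage, read count and overdispersion values are invented.

```python
# Sketch of the beta-binomial observation model for pooled sequencing counts.
# Coverage, read count and overdispersion below are invented values.
from scipy.stats import betabinom, binom

depth = 80     # sequencing coverage at a SNP
count = 30     # reads carrying the rising allele
p = 0.35       # latent allele frequency

# Plain binomial likelihood: only accounts for sampling noise from finite coverage.
print("binomial     :", binom.pmf(count, depth, p))

# Beta-binomial likelihood: the per-pool frequency is itself Beta-distributed around p,
# with concentration rho controlling the extra (over)dispersion.
rho = 50.0
print("beta-binomial:", betabinom.pmf(count, depth, p * rho, (1 - p) * rho))
```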
Article
Full-text available
Background: The combination of experimental evolution with whole-genome resequencing of pooled individuals, also called evolve and resequence (E&R) is a powerful approach to study the selection processes and to infer the architecture of adaptive variation. Given the large potential of this method, a range of software tools were developed to identify selected SNPs and to measure their selection coefficients. Results: In this benchmarking study, we compare 15 test statistics implemented in 10 software tools using three different scenarios. We demonstrate that the power of the methods differs among the scenarios, but some consistently outperform others. LRT-1, CLEAR, and the CMH test perform best despite LRT-1 and the CMH test not requiring time series data. CLEAR provides the most accurate estimates of selection coefficients. Conclusion: This benchmark study will not only facilitate the analysis of already existing data, but also affect the design of future data collections.
... For these reasons the CMH test is widely used in E&R studies (Orozco-Terwengel et al. 2012; Martins et al. 2014; Tobler et al. 2014; Phillips et al. 2018; Barghi et al. 2019; Kelly and Hughes 2019). Recently, however, several test statistics have become available that utilize time-series data, that is, allele frequency estimates for multiple time points (>2) during the experiment (Topa et al. 2015; Iranmehr et al. 2017; Spitzer et al. 2019). We were interested in whether our conclusion, that an increasing regime enhances the power to identify QTNs, also holds when a time-series-based test statistic is used. ...
Article
Full-text available
Evolve and Resequence (E&R) studies are frequently used to dissect the genetic basis of quantitative traits. By subjecting a population to truncating selection for several generations and estimating the allele frequency differences between selected and non-selected populations using Next Generation Sequencing, the loci contributing to the selected trait may be identified. The role of different parameters, such as the population size or the number of replicate populations, has been examined in previous works. However, the influence of the selection regime, i.e. the strength of truncating selection during the experiment, remains little explored. Using whole-genome, individual-based forward simulations of E&R studies, we found that the power to identify the causative alleles may be maximized by gradually increasing the strength of truncating selection during the experiment. Notably, such an optimal selection regime comes at no or little additional cost in terms of sequencing effort and experimental time. Interestingly, we also found that a selection regime which optimizes the power to identify the causative loci is not necessarily identical to a regime that maximizes the phenotypic response. Finally, our simulations suggest that an E&R study with an optimized selection regime may have a higher power to identify the genetic basis of quantitative traits than a GWAS, highlighting that E&R is a powerful approach for finding the loci underlying complex traits. E&R studies are, however, riskier than GWAS, as suboptimal selection regimes lead to weak performance.
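As a toy illustration of the kind of forward simulation this abstract refers to (far simpler than whole-genome, individual-based simulation), the sketch below applies one round of truncating selection to an additive quantitative trait and reports the shift in mean allele frequency; the population size, trait architecture and truncation fraction are assumptions.

```python
# Toy truncating-selection step on an additive trait; all parameter values are assumptions.
import numpy as np

rng = np.random.default_rng(3)
n_ind, n_loci, truncation_fraction = 1000, 100, 0.2

# Diploid genotypes coded as 0/1/2 copies of the "+" allele at each locus.
genotypes = rng.binomial(2, 0.3, size=(n_ind, n_loci))
effects = rng.normal(0.0, 1.0, size=n_loci)

# Additive trait value plus environmental noise.
trait = genotypes @ effects + rng.normal(0.0, 2.0, size=n_ind)

# Truncating selection: only the top fraction of individuals contribute to the next generation.
cutoff = np.quantile(trait, 1.0 - truncation_fraction)
selected = genotypes[trait >= cutoff]

before, after = genotypes.mean() / 2, selected.mean() / 2
print(f"mean '+' allele frequency: {before:.3f} -> {after:.3f} after one round of truncation")
```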
... On the other hand, many studies have focused only on the detection of SNPs selected in evolving populations. For this detection, researchers have used general-purpose statistical tests and machine learning methods such as the Cochran-Mantel-Haenszel (CMH) test (Orozco-Terwengel et al., 2012) and the BBGP-based test (Topa et al., 2015), which quantify only the statistical significance of selection. The CMH test detects significant allele frequency changes occurring over the course of artificial selection from sequenced allele counts. ...
... Following a previous E&R study (Iranmehr et al., 2017; Terhorst et al., 2015; Topa et al., 2015), we adopted a hidden Markov model (HMM)-based model as the probabilistic model for the E&R data. Suppose we have M replicates of breeding populations, and population genomes are sequenced at (L_m + 1) (m = 1, . . . ...
... Since we are generally interested in a small number of highly significant SNPs among a large number of SNPs throughout the genome, we mainly used the true positive rate (TPR) at the significance level for which the false positive rate (FPR) is 0.01 in the Receiver Operating Characteristic (ROC) curve. We compared the detection accuracy of our method to those of the CMH test (Agresti, 2002), the BBGP-based test (Topa et al., 2015) and the method of Iranmehr et al. (2017). While the CMH test is available in the R statistical software environment as the 'mantelhaen.test' ...
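For readers unfamiliar with the CMH statistic discussed in these excerpts, here is a generic from-scratch sketch of the Cochran-Mantel-Haenszel chi-square computed over replicate 2x2 allele-count tables (base vs. end population against rising vs. other allele); the counts are invented and this is not the R 'mantelhaen.test' implementation.

```python
# Generic CMH chi-square over replicate 2x2 tables; the counts below are invented.
import numpy as np
from scipy.stats import chi2


def cmh_test(tables):
    """tables: array of shape (n_replicates, 2, 2) of allele counts."""
    tables = np.asarray(tables, dtype=float)
    a = tables[:, 0, 0]                       # top-left cell of each table
    row1 = tables[:, 0, :].sum(axis=1)
    row2 = tables[:, 1, :].sum(axis=1)
    col1 = tables[:, :, 0].sum(axis=1)
    col2 = tables[:, :, 1].sum(axis=1)
    n = tables.sum(axis=(1, 2))

    expected = row1 * col1 / n
    variance = row1 * row2 * col1 * col2 / (n**2 * (n - 1))

    stat = (a.sum() - expected.sum()) ** 2 / variance.sum()
    return stat, chi2.sf(stat, df=1)          # chi-square with 1 degree of freedom


# Rows: base vs. end population; columns: rising vs. other allele, one table per replicate.
replicate_tables = [
    [[30, 70], [55, 45]],
    [[28, 72], [60, 40]],
    [[35, 65], [58, 42]],
]
stat, pval = cmh_test(replicate_tables)
print(f"CMH chi-square = {stat:.2f}, p = {pval:.3g}")
```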
Article
Motivation: Evolve and resequence (E&R) experiments show promise in capturing real-time evolution at genome-wide scales, enabling the assessment of allele frequency changes of SNPs in evolving populations and thus the estimation of population genetic parameters in the Wright-Fisher (WF) model that quantify the selection on SNPs. Currently, these analyses face two key difficulties: the numerous SNPs in E&R data and the frequent unreliability of estimates. Hence, a methodology for efficiently estimating WF parameters is needed to understand the evolutionary processes that shape genomes. Results: We developed a novel method for estimating WF parameters (EMWER) by applying an expectation maximization algorithm to the Kolmogorov forward equation associated with the WF model diffusion approximation. EMWER was used to infer the effective population size, selection coefficients and dominance parameters from E&R data. Of the methods examined, EMWER was the most efficient method for selection strength estimation in multi-core computing environments, estimating both selection and dominance with accurate confidence intervals. We applied EMWER to E&R data from experimental Drosophila populations adapting to thermally fluctuating environments and found a common selection affecting the allele frequencies of many SNPs within the cosmopolitan In(3R)P inversion. Furthermore, this application indicated that many of the beneficial alleles in this experiment are dominant. Availability: Our C++ implementation of EMWER is available at https://github.com/kojikoji/EMWER. Supplementary information: Supplementary data are available at Bioinformatics online.
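To make the Wright-Fisher parameters that EMWER estimates concrete, here is a minimal generic forward simulation of an allele frequency under selection coefficient s and dominance h with binomial drift; it sketches only the underlying WF model, not the diffusion/EM machinery of EMWER, and all parameter values are assumptions.

```python
# Minimal Wright-Fisher forward simulation with selection s and dominance h; values are assumptions.
import numpy as np

rng = np.random.default_rng(4)


def wright_fisher(p0, s, h, n_e, generations):
    """Diploid Wright-Fisher allele-frequency trajectory with selection and drift."""
    p, traj = p0, [p0]
    for _ in range(generations):
        # Genotype fitnesses AA, Aa, aa: 1+s, 1+hs, 1.
        w_bar = p**2 * (1 + s) + 2 * p * (1 - p) * (1 + h * s) + (1 - p) ** 2
        p_sel = (p**2 * (1 + s) + p * (1 - p) * (1 + h * s)) / w_bar
        p = rng.binomial(2 * n_e, p_sel) / (2 * n_e)   # binomial drift among 2*N_e gametes
        traj.append(p)
    return np.array(traj)


trajectory = wright_fisher(p0=0.1, s=0.08, h=0.5, n_e=300, generations=60)
print(np.round(trajectory[::10], 3))   # allele frequencies every 10 generations
```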
... There have been many notable advances in the development of approaches to infer selection and/or population size from time-series data (Bollback et al. 2008; Illingworth et al. 2012; Acevedo et al. 2014; Renzette et al. 2014; Feder et al. 2014; Foll et al. 2014, 2015; Ferrer-Admetlla et al. 2016; Jónás et al. 2016; Khatri 2016; Steinrücken et al. 2014; Schraiber et al. 2016; Terhorst et al. 2015; Topa et al. 2015). However, many of these methods either ignore genetic drift, are not designed for very low frequency alleles, or allow for only two alleles per locus. ...
Article
Full-text available
With the advent of deep sequencing techniques, it is now possible to track the evolution of viruses with ever-increasing detail. Here, we present Flexible Inference from Time-Series (FITS), a computational tool that allows inference of one of three parameters: the fitness of a specific mutation, the mutation rate or the population size from genomic time-series sequencing data. FITS was designed first and foremost for analysis of either short-term Evolve & Resequence (E&R) experiments or rapidly recombining populations of viruses. We thoroughly explore the performance of FITS on simulated data and highlight its ability to infer the fitness/mutation rate/population size. We further show that FITS can infer meaningful information even when the input parameters are inexact. In particular, FITS is able to successfully categorize a mutation as advantageous or deleterious. We next apply FITS to empirical data from an E&R experiment on poliovirus where parameters were determined experimentally and demonstrate high accuracy in inference.
... The great potential of E&R studies in combination with the continuously growing data sets of powerful experiments has driven the development of a diverse set of methods to detect selected SNPs, which change in allele frequency more than expected under neutrality (Iranmehr et al., 2017;Spitzer et al., 2019;Kofler et al., 2011;Taus et al., 2017;Kelly and Hughes, 2019;Wiberg et al., 2017;Topa et al., 2015;Feder et al., 2014;Mathieson and McVean, 2013). Some of the published methods use this information to estimate the underlying selection coefficient and dominance (Iranmehr et al., 2017;Taus et al., 2017;Foll et al., 2015;Mathieson and McVean, 2013). ...
... Beta-Binomial Gaussian Process (BBGP). BBGP employs a beta-binomial Gaussian process to detect significant allele frequency changes over time (Topa et al., 2015). The beta-binomial model corrects for the uncertainty arising from finite sequencing depth. ...
Preprint
Full-text available
The combination of experimental evolution with whole genome resequencing of pooled individuals, also called Evolve and Resequence (E&R), is a powerful approach to study selection processes and to infer the architecture of adaptive variation. Given the large potential of this method, a range of software tools were developed to identify selected SNPs and to measure their selection coefficients. In this benchmarking study, we compare 15 test statistics implemented in 10 software tools using three different scenarios. We demonstrate that the power of the methods differs among the scenarios, but some consistently outperform others. LRT-1, which takes advantage of time-series data, consistently performed best for all three scenarios. Nevertheless, the CMH test, which requires only two time points, had almost the same performance. This benchmark study will not only facilitate the analysis of already existing data, but also affect the design of future data collections.