Fig 1 - uploaded by Dominik G Grimm
Small examples of the three types of networks considered

Source publication
Article
Full-text available
As an increasing number of genome-wide association studies reveal the limitations of the attempt to explain phenotypic heritability by single genetic loci, there is a recent focus on associating complex phenotypes with sets of genetic loci. Although several methods for multi-locus mapping have been proposed, it is often unclear how to relate the de...

Context in source publication

Context 1
... filter out SNPs with a minor allele frequency lower than 10%, as is typical in A. thaliana GWAS studies. We use the first principal components of the genotypic data as covariates to correct for population structure (Price et al., 2006): the number of principal components is chosen by adding them one by one until the genomic control is close to 1 (see Supplementary Figure S1). The direct competitors of SConES on this problem are the methods that also impose graph constraints on the SNPs they select, namely, graphLasso and ncLasso. ...
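The preprocessing described in this excerpt can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes genotypes coded as 0/1/2 allele counts, takes the per-SNP chi-square statistics (1 degree of freedom) as already computed for each candidate number of principal components, and uses a hypothetical `tolerance` to decide when the genomic control factor is "close to 1". All function names are illustrative.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def minor_allele_frequency(genotypes):
    """MAF of one SNP, with genotypes coded as 0/1/2 allele counts (assumption)."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1 - freq)


def filter_by_maf(snps, threshold=0.10):
    """Drop SNPs whose MAF is lower than `threshold`; `snps` maps id -> genotypes."""
    return {
        name: genotypes
        for name, genotypes in snps.items()
        if minor_allele_frequency(genotypes) >= threshold
    }


def genomic_inflation(chi2_stats):
    """Genomic control factor: median observed statistic over the expected median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def choose_num_pcs(chi2_by_num_pcs, tolerance=0.05):
    """Return the smallest number of PCs whose inflation factor is within
    `tolerance` of 1. `chi2_by_num_pcs[k]` holds the association statistics
    obtained with k principal components as covariates."""
    for k, stats in enumerate(chi2_by_num_pcs):
        if abs(genomic_inflation(stats) - 1.0) <= tolerance:
            return k
    return len(chi2_by_num_pcs) - 1  # fall back to the largest model tried
```

In practice the statistics for each `k` would come from rerunning the association test with `k` covariates; here they are simply passed in so that the selection rule itself stays visible.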

Similar publications

Article
Full-text available
Adaptation of soybean cultivars to the photoperiod in which they are grown is critical for optimizing plant yield. However, despite its importance, only the major loci conferring variation in flowering time and maturity of US soybean have been isolated. By contrast, over 200 genes contributing to floral induction in the model organism Arabidopsis t...

Citations

... The problem of limited power in GWAS is generally rooted in both a large marker-to-sample ratio and low heritability of complex traits. In order to mitigate that, two strategies have been pursued: (i) to group genetic markers and test sets of markers at once, thereby reducing the multiplicity of markers tested (Holden et al., 2008; Li and Leal, 2008; Listgarten et al., 2013; Schwender et al., 2011), or (ii) to employ biological networks in order to conduct a post hoc aggregation of association (Akula et al., 2011; Azencott et al., 2013; Carlin et al., 2019; Greene et al., 2015; Ideker et al., 2002; Shim et al., 2017; Shim and Lee, 2015; Wang et al., 2015). Both approaches amplify the signal of SNPs or genes which are collectively phenotype-related but would not pass the significance threshold on their own. ...
Article
Full-text available
Motivation: While the search for associations between genetic markers and complex traits has led to the discovery of tens of thousands of trait-related genetic variants, the vast majority of these only explain a small fraction of the observed phenotypic variation. One possible strategy to overcome this while leveraging biological prior is to aggregate the effects of several genetic markers and to test entire genes, pathways or (sub)networks of genes for association to a phenotype. The latter, network-based genome-wide association studies, in particular suffers from a vast search space and an inherent multiple testing problem. As a consequence, current approaches are either based on greedy feature selection, thereby risking that they miss relevant associations, or neglect doing a multiple testing correction, which can lead to an abundance of false positive findings. Results: To address the shortcomings of current approaches of network-based genome-wide association studies, we propose networkGWAS, a computationally efficient and statistically sound approach to network-based genome-wide association studies using mixed models and neighborhood aggregation. It allows for population structure correction and for well-calibrated p-values, which are obtained through circular and degree-preserving network permutations. networkGWAS successfully detects known associations on diverse synthetic phenotypes, as well as known and novel genes in phenotypes from S. cerevisiae and H. sapiens. It thereby enables the systematic combination of gene-based genome-wide association studies with biological network information. Availability: https://github.com/BorgwardtLab/networkGWAS.git. Supplementary information: Supplementary data are available at Bioinformatics online.
... For guidance, we refer to Climente-González et al. (2021) and to the methods' respective publications (refs. 1-6, 13). Problem 5: I faced a problem not described here. ...
... Different licensing terms might apply to the used tools, as you should verify. If you use the results of these tools in your publication, please cite the relevant articles as well (refs. 2-6, 13). Supplemental information can be found online at https://doi.org/10.1016/j.xpro.2022.101998. ...
Article
Full-text available
We present a network-based protocol to discover susceptibility genes in case-control genome-wide association studies (GWASs). In short, this protocol looks for biomarkers that are informative of disease status and interconnected in an underlying biological network. This boosts discovery and interpretability. Moreover, the protocol tackles the instability of network methods, producing a stable set of genes most likely to replicate in external cohorts. To apply the procedure to a provided GWAS dataset, install the required software and execute our command-line tool. For complete details on the use and execution of this protocol, please refer to Climente-González et al.¹
... Methods to identify altered subnetworks employ a diverse collection of techniques, but they can be grouped into two major classes. The first class of methods relies on the specification of a subnetwork family, or a family of subnetworks with a topological constraint. Sometimes the family is stated explicitly: for example, the early approaches such as jActiveModules (Ideker et al, 2001) or heinz (Dittrich et al, 2008) identify connected subnetworks. In other methods the subnetwork family is specified implicitly: for example, the optimization problems of Azencott et al (2013) and Liu et al (2017) penalize subnetworks with large cut-size and small edge-density, respectively. ...
... We use this flexibility to investigate the topology of subnetworks identified by network propagation methods. We show empirically that network propagation does not correspond to standard topological constraints on altered subnetworks such as connectivity (Dittrich et al, 2008; Ideker et al, 2002; Reyna et al, 2021), cut-size (Azencott et al, 2013), or edge-density (Liu et al, 2017). Instead, we derive the propagation family, a subnetwork family that we show "approximates" the sets of vertices that are ranked highly by network propagation approaches and thereby unifies the two major network approaches in the literature: network propagation and subnetwork family approaches. ...
... $S\}|$ is the number of edges between vertices in $S$. The edge-dense family $\mathcal{E}_{G,p}$ formalizes the topological constraints made by Guo et al (2007), Liu et al (2017), Vanunu et al (2010), which identify altered subnetworks that have large edge-density. $\mathcal{S} = \mathcal{T}_{G,q}$, the cut family, or the set of all subgraphs $S$ of $G$ with $\mathrm{cut}(S)/|S| \le q$, where $\mathrm{cut}(S) = |\{(u,v) \in E : u \in S, v \notin S\}|$ is the number of edges with exactly one endpoint in $S$. The cut family $\mathcal{T}_{G,q}$ formalizes the topological constraints made by Azencott et al (2013), which identifies altered subnetworks that have small cut. $\mathcal{S} = \mathcal{Q}_{G,q}$, the modularity family, or the set of all subgraphs $S$ of $G$ with modularity $Q(S) \ge$ ...
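The two counts this excerpt describes (edges between vertices in S, and edges with exactly one endpoint in S) are easy to compute on a small graph. The sketch below is illustrative only: the function names are hypothetical, the graph is an undirected edge list, and the edge-density normalization (edges inside S divided by the number of vertex pairs in S) is an assumed common definition, since the excerpt elides the one the authors use.

```python
def induced_edge_count(edges, subset):
    """Number of edges with both endpoints in `subset`."""
    s = set(subset)
    return sum(1 for u, v in edges if u in s and v in s)


def cut_size(edges, subset):
    """Number of edges with exactly one endpoint in `subset`."""
    s = set(subset)
    return sum(1 for u, v in edges if (u in s) != (v in s))


def edge_density(edges, subset):
    """Edges inside `subset` over the number of vertex pairs (assumed definition)."""
    k = len(set(subset))
    if k < 2:
        return 0.0
    return induced_edge_count(edges, subset) / (k * (k - 1) / 2)


# A triangle {1, 2, 3} attached to a tail 3-4-5.
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5)]
triangle = {1, 2, 3}
print(induced_edge_count(edges, triangle), cut_size(edges, triangle))
```

A subnetwork such as the triangle scores well under both constraints: it is fully connected internally and only one edge leaves it.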
Article
Full-text available
A standard paradigm in computational biology is to leverage interaction networks as prior knowledge in analyzing high-throughput biological data, where the data give a score for each vertex in the network. One classical approach is the identification of altered subnetworks, or subnetworks of the interaction network that have both outlier vertex scores and a defined network topology. One class of algorithms for identifying altered subnetworks searches for high-scoring subnetworks in subnetwork families with simple topological constraints, such as connected subnetworks, and has sound statistical guarantees. A second class of algorithms employs network propagation (the smoothing of vertex scores over the network using a random walk or diffusion process) and utilizes the global structure of the network. However, network propagation algorithms often rely on ad hoc heuristics that lack a rigorous statistical foundation. In this work, we unify the subnetwork family and network propagation approaches by deriving the propagation family, a subnetwork family that approximates the sets of vertices ranked highly by network propagation approaches. We introduce NetMix2, a principled algorithm for identifying altered subnetworks from a wide range of subnetwork families. When using the propagation family, NetMix2 combines the advantages of the subnetwork family and network propagation approaches. NetMix2 outperforms other methods, including network propagation, on simulated data, pan-cancer somatic mutation data, and genome-wide association data from multiple human diseases.
... One way to improve feature selection for GWAS analysis is to add prior knowledge about the biological environment in the form of a graph or a group structure. On the one hand, feature selection models can incorporate connectivity through graph constraints from biological networks, in addition to regularization terms [Azencott et al.(2013)]. On the other hand, Linkage Disequilibrium (LD), which manifests as high correlation between nearby SNPs, can also be incorporated in feature selection models based on group structure, such as the group Lasso [Yuan and Lin(2006), Jacob et al.(2009)]. ...
... Indeed, adding prior knowledge about biological mechanisms in feature selection methods provides a realistic design of the problem. In this setting, feature selection models provide connectivity and association constraints, in addition to regularization terms in order to design biological interactions [Azencott et al.(2013)]. ...
Thesis
Genome-Wide Association Studies, or GWAS, aim at finding Single Nucleotide Polymorphisms (SNPs) that are associated with a phenotype of interest. GWAS are known to suffer from the large dimensionality of the data with respect to the number of available samples. Many challenges limit the identification of causal SNPs, such as dependency between SNPs due to linkage disequilibrium (LD), population stratification, and the low statistical power of univariate analysis. Machine learning models based on multivariate analysis contribute to advancing research in GWAS. In particular, feature selection models reduce the dimensionality of data by keeping only the relevant features associated with disease. However, these methods lack stability, that is to say, robustness to slight variations in the input dataset. This major issue can lead to false biological interpretation. Hence, we focus in this thesis on evaluating and improving stability, as it is an important indicator of whether feature selection discoveries can be trusted. In this thesis, we develop two efficient novel methods (multitask group lasso and sparse multitask group lasso) for the multivariate analysis of multi-population GWAS data, based on two multitask group Lasso formulations. Each task corresponds to a subpopulation of the data, and each group to an LD-block. This formulation alleviates the curse of dimensionality, and makes it possible to identify disease LD-blocks shared across populations/tasks, as well as some that are specific to one population/task. In addition, we use stability selection to increase the robustness of our approach. Finally, gap safe screening rules speed up computations enough that our method can run at a genome-wide scale. Analyses of several datasets, including a breast cancer dataset, demonstrated the efficiency of the developed models in discovering new risk genes related to disease.
... To achieve a scalable solution for all known variants in the genome while considering the dependencies between them, alternative SNP selection algorithms have been proposed (Azencott et al., 2013;Yilmaz et al., 2019). Such algorithms simplify the problem by focusing on a linear combination of individual phenotype associations of SNPs while using some a priori information encoded in the form of a biological network to improve the overall predictivity of the selected subset. ...
... Such algorithms simplify the problem by focusing on a linear combination of individual phenotype associations of SNPs while using some a priori information encoded in the form of a biological network to improve the overall predictivity of the selected subset. In particular, SConES (Azencott et al., 2013) uses a minimum-cut solution under sparsity and connectivity constraints on a SNP-SNP network. More recently, SPADIS (Yilmaz et al., 2019) selects a diverse set of SNPs using the SNP-SNP network. ...
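To make the idea in this excerpt concrete, the sketch below scores a SNP subset by its summed individual association, minus a penalty on the number of network edges cut by the selection (connectivity) and a penalty on the subset size (sparsity). This is an assumed, simplified form for illustration: the parameter names `lam` and `eta` are hypothetical, the network is unweighted, and an exhaustive search over a four-node toy graph stands in for the minimum-cut solver that the excerpt attributes to SConES.

```python
from itertools import product


def penalized_score(selected, scores, edges, lam, eta):
    """Summed association minus connectivity and sparsity penalties."""
    association = sum(scores[v] for v in selected)
    # Edges with exactly one selected endpoint: the cut induced by the selection.
    cut = sum(1 for u, v in edges if (u in selected) != (v in selected))
    return association - lam * cut - eta * len(selected)


def best_subset(scores, edges, lam, eta):
    """Exhaustive search over all subsets; only feasible for toy examples."""
    nodes = sorted(scores)
    best, best_value = frozenset(), 0.0  # the empty selection scores 0
    for mask in product((False, True), repeat=len(nodes)):
        subset = frozenset(n for n, keep in zip(nodes, mask) if keep)
        value = penalized_score(subset, scores, edges, lam, eta)
        if value > best_value:
            best, best_value = subset, value
    return best, best_value


# Path network a - b - c - d; b is weakly associated but links a and c.
scores = {"a": 3.0, "b": 0.5, "c": 3.0, "d": 0.1}
edges = [("a", "b"), ("b", "c"), ("c", "d")]
subset, value = best_subset(scores, edges, lam=1.0, eta=1.5)
print(sorted(subset), value)
```

In this toy case the weakly associated SNP `b` is selected because including it removes two cut edges, which illustrates how a connectivity penalty can pull in SNPs that would not be chosen on their individual score alone.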
... For consistency with the previous results (Azencott et al., 2013;Yilmaz et al., 2019), we use SKAT (Wu et al., 2011) to score the individual phenotype association of each SNP, unless otherwise ...
Article
Motivation Genome-wide association studies show that variants in individual genomic loci alone are not sufficient to explain the heritability of complex, quantitative phenotypes. Many computational methods have been developed to address this issue by considering subsets of loci that can collectively predict the phenotype. This problem can be considered a challenging instance of feature selection in which the number of dimensions (loci that are screened) is much larger than the number of samples. While currently available methods can achieve decent phenotype prediction performance, they either do not scale to large datasets or have parameters that require extensive tuning. Results We propose a fast and simple algorithm, Macarons, to select a small, complementary subset of variants by avoiding redundant pairs that are likely to be in linkage disequilibrium. Our method features two interpretable parameters that control the time/performance trade-off without requiring parameter tuning. In our computational experiments, we show that Macarons consistently achieves similar or better prediction performance than state-of-the-art selection methods while having a simpler premise and being at least 2 orders of magnitude faster. Overall, Macarons can seamlessly scale to the human genome with ∼10⁷ variants in a matter of minutes while taking the dependencies between the variants into account. Availability Macarons is available in Matlab and Python at https://github.com/serhan-yilmaz/macarons.
... The problem of limited power in GWAS is generally rooted in both a large marker-to-sample ratio and low heritability of complex traits. In order to mitigate that, two strategies have been pursued: (i) to group genetic markers and test them at once, thereby reducing the multiplicity of markers tested [13,18,21,28,40], or (ii) to employ biological networks in order to conduct a post hoc aggregation of association [5,14,2]. Both approaches amplify the signal of SNPs or genes which are collectively phenotype-related but would not pass the significance threshold on their own. ...
... with $X$, $\beta$, and the noise term defined as in Equation (2), $X^{(2)}$ representing the $n \times p(p-1)/2$ second-order design matrix of all SNP interactions, $\beta^{(2)}$ comprising the fixed effects of all SNP interactions, and the coefficient $c$ tuning the ratio of linear to non-linear signal. The noise term is drawn from a multivariate normal distribution $\mathcal{N}(0, I)$. ...
Preprint
Full-text available
While the search for associations between genetic markers and complex traits has discovered tens of thousands of trait-related genetic variants, the vast majority of these only explain a tiny fraction of observed phenotypic variation. One possible strategy to detect stronger associations is to aggregate the effects of several genetic markers and to test entire genes, pathways or (sub)networks of genes for association to a phenotype. The latter, network-based genome-wide association studies, in particular suffers from a huge search space and an inherent multiple testing problem. As a consequence, current approaches are either based on greedy feature selection, thereby risking that they miss relevant associations, and/or neglect doing a multiple testing correction, which can lead to an abundance of false positive findings. To address the shortcomings of current approaches of network-based genome-wide association studies, we propose networkGWAS, a computationally efficient and statistically sound approach to gene-based genome-wide association studies based on mixed models and neighborhood aggregation. It allows for population structure correction and for well-calibrated p-values, which we obtain through a block permutation scheme. networkGWAS successfully detects known or plausible associations on simulated rare variants from H. sapiens data as well as semi-simulated and real data with common variants from A. thaliana and enables the systematic combination of gene-based genome-wide association studies with biological network information. Availability: https://github.com/BorgwardtLab/networkGWAS.git
... It allows the user to perform various analyses on SNP data, such as univariate GWAS using two-sample tests and linear regression models, as well as set-based tests and epistasis screenings. In addition to PLINK, there are many other toolboxes that implement different association tests with linear mixed models, such as GCTA (Yang et al. 2011), FaST-LMM (Lippert et al. 2011), EMMAX (Kang et al. 2010), GEMMA (Zhou and Stephens 2012), and with network-based approaches for the joint test of multiple variants, such as SConES (Azencott et al. 2013), dmGWAS (Jia et al. 2011) and DAPPLE (Rossin et al. 2011). Apart from these downloadable software packages, some web-based GWAS tools have been developed, including Matapax (Childs et al. 2012), GWAPP (Seren et al. 2012) and easyGWAS (Grimm et al. 2017). ...
Article
Full-text available
Large-scale, case-control genome-wide association studies (GWASs) have revealed genetic variations associated with diverse neurological and psychiatric disorders. Recent advances in neuroimaging and genomic databases of large healthy and diseased cohorts have empowered studies to characterize effects of the discovered genetic factors on brain structure and function, implicating neural pathways and genetic mechanisms in the underlying biology. However, the unprecedented scale and complexity of the imaging and genomic data requires new advanced biomedical data science tools to manage, process and analyze the data. In this work, we introduce Neuroimaging PheWAS (phenome-wide association study): a web-based system for searching over a wide variety of brain-wide imaging phenotypes to discover true system-level gene-brain relationships using a unified genotype-to-phenotype strategy. This design features a user-friendly graphical user interface (GUI) for anonymous data uploading, study definition and management, and interactive result visualizations as well as a cloud-based computational infrastructure and multiple state-of-art methods for statistical association analysis and multiple comparison correction. We demonstrated the potential of Neuroimaging PheWAS with a case study analyzing the influences of the apolipoprotein E (APOE) gene on various brain morphological properties across the brain in the Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort. Benchmark tests were performed to evaluate the system’s performance using data from UK Biobank. The Neuroimaging PheWAS system is freely available. It simplifies the execution of PheWAS on neuroimaging data and provides an opportunity for imaging genetics studies to elucidate routes at play for specific genetic variants on diseases in the context of detailed imaging phenotypic data.
... However, they differ in their tolerance to the inclusion of low-scoring nodes and the topology of the solution. Lastly, other methods also consider the topology of the network, favoring groups of nodes that are not only high-scoring but also densely interconnected; such is the case of HotNet2 [16], SConES [17], and SigMod [18]. ...
... SNP networks: SConES [17] was the only network method designed to handle SNP networks. As in gene networks, two SNPs were connected in a SNP network when there was evidence of shared functionality between them. ...
... As in gene networks, two SNPs were connected in a SNP network when there was evidence of shared functionality between them. Azencott et al. [17] proposed three ways of building such networks: connecting the SNPs consecutive in the genomic sequence ("GS network"); interconnecting all the SNPs mapped to the same gene, on top of GS ("GM network"); and interconnecting all SNPs mapped to two genes for which a protein-protein interaction exists, on top of GM ("GI network"). We focused on the GI network using the PPIN described above, as it fitted the scope of this work better. ...
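The three nested constructions described in this excerpt (GS, then GM on top of GS, then GI on top of GM) can be sketched as below. The function names and input shapes are assumptions made for illustration: `snp_order` is the list of SNPs along a single chromosome, `snp_to_genes` maps a SNP to the genes it is mapped to, and `ppi_edges` lists interacting gene pairs. It is not the code used by the cited authors.

```python
from itertools import combinations


def _gene_to_snps(snp_to_genes):
    """Invert the SNP -> genes mapping."""
    gene_to_snps = {}
    for snp, genes in snp_to_genes.items():
        for gene in genes:
            gene_to_snps.setdefault(gene, set()).add(snp)
    return gene_to_snps


def gs_network(snp_order):
    """GS: connect SNPs that are consecutive in the genomic sequence."""
    return {frozenset(pair) for pair in zip(snp_order, snp_order[1:])}


def gm_network(snp_order, snp_to_genes):
    """GM: GS plus all pairs of SNPs mapped to the same gene."""
    edges = gs_network(snp_order)
    for snps in _gene_to_snps(snp_to_genes).values():
        edges.update(frozenset(pair) for pair in combinations(sorted(snps), 2))
    return edges


def gi_network(snp_order, snp_to_genes, ppi_edges):
    """GI: GM plus all pairs of SNPs mapped to two interacting genes."""
    edges = gm_network(snp_order, snp_to_genes)
    gene_to_snps = _gene_to_snps(snp_to_genes)
    for gene_a, gene_b in ppi_edges:
        for snp_a in gene_to_snps.get(gene_a, ()):
            for snp_b in gene_to_snps.get(gene_b, ()):
                if snp_a != snp_b:
                    edges.add(frozenset((snp_a, snp_b)))
    return edges
```

Each edge is stored as a `frozenset` so that the networks are undirected and duplicate edges collapse automatically when the layers are stacked.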
Article
Full-text available
Genome-wide association studies (GWAS) explore the genetic causes of complex diseases. However, classical approaches ignore the biological context of the genetic variants and genes under study. To address this shortcoming, one can use biological networks, which model functional relationships, to search for functionally related susceptibility loci. Many such network methods exist, each arising from different mathematical frameworks, pre-processing steps, and assumptions about the network properties of the susceptibility mechanism. Unsurprisingly, this results in disparate solutions. To explore how to exploit these heterogeneous approaches, we selected six network methods and applied them to GENESIS, a nationwide French study on familial breast cancer. First, we verified that network methods recovered more interpretable results than a standard GWAS. We addressed the heterogeneity of their solutions by studying their overlap, computing what we called the consensus. The key gene in this consensus solution was COPS5, a gene related to multiple cancer hallmarks. Another issue we observed was that network methods were unstable, selecting very different genes on different subsamples of GENESIS. Therefore, we proposed a stable consensus solution formed by the 68 genes most consistently selected across multiple subsamples. This solution was also enriched in genes known to be associated with breast cancer susceptibility (BLM, CASP8, CASP10, DNAJC1, FGFR2, MRPS30, and SLC4A7, P-value = 3 × 10⁻⁴). The most connected gene was CUL3, a regulator of several genes linked to cancer progression. Lastly, we evaluated the biases of each method and the impact of their parameters on the outcome. In general, network methods preferred highly connected genes, even after random rewirings that stripped the connections of any biological meaning. In conclusion, we present the advantages of network-guided GWAS, characterize their shortcomings, and provide strategies to address them.
To compute the consensus networks, implementations of all six methods are available at https://github.com/hclimente/gwas-tools.
... To achieve a scalable solution for all known variants in the genome, a new body of SNP selection algorithms have emerged that are designed with computational efficiency in mind (Azencott et al., 2013;Yilmaz et al., 2019). Such algorithms, instead of considering the combinatorial effect of the SNPs on a phenotype, simplify the problem by focusing on a linear combination of individual phenotype associations of SNPs, together with other a priori information encoded in the form of a biological network to improve the overall predictivity. ...
... Such algorithms, instead of considering the combinatorial effect of the SNPs on a phenotype, simplify the problem by focusing on a linear combination of individual phenotype associations of SNPs, together with other a priori information encoded in the form of a biological network to improve the overall predictivity. In particular, SConES (Azencott et al., 2013) uses a minimum-cut solution under sparsity and connectivity constraints on a SNP-SNP network, and more recently, SPADIS (Yilmaz et al., 2019) uses a SNP-SNP network to enforce the diversity of the selected SNPs. ...
... • SConES: A SNP selection method that poses the SNP selection problem as a minimum-cut problem (Azencott et al., 2013). ...
Preprint
Full-text available
Motivation Genome-wide association studies show that variants in individual genomic loci alone are not sufficient to explain the heritability of complex, quantitative phenotypes. Many computational methods have been developed to address this issue by considering subsets of loci that can collectively predict the phenotype. This problem can be considered a challenging instance of feature selection in which the number of dimensions (loci that are screened) is much larger than the number of samples. While currently available methods can achieve decent phenotype prediction performance, they either do not scale to large datasets or have parameters that require extensive tuning. Results We propose a fast and simple algorithm, Macarons, to select a small, complementary subset of variants by avoiding redundant pairs that are in linkage disequilibrium. Our method features two interpretable parameters that control the time/performance trade-off without requiring parameter tuning. In our computational experiments, we show that Macarons consistently achieves similar or better prediction performance than state-of-the-art selection methods while having a simpler premise and being at least 2 orders of magnitude faster. Overall, Macarons can seamlessly scale to the human genome with ~10⁷ variants in a matter of minutes while taking the dependencies between the variants into account. Conclusion Macarons can offer a reasonable trade-off between phenotype predictivity, runtime and the complementarity of the selected subsets. The framework we present can be generalized to other high-dimensional feature selection problems within and beyond biomedical applications. Availability Macarons is implemented in Matlab and the source code is available at: https://github.com/serhan-yilmaz/macarons
... Potpourri operates on an SNP-SNP interaction network. In this study, we used the genomic sequence network as defined in Azencott et al. (2013). In this network, SNPs are connected if they are adjacent on the genome. ...
... We follow the experimental procedure in Yilmaz et al. (2019) and Azencott et al. (2013) for the SNP selection step. The parameters were selected by using a nested 10-fold cross-validation. ...
Article
Genome-wide association studies (GWAS) explain a fraction of the underlying heritability of genetic diseases. Investigating epistatic interactions between two or more loci helps to close this gap. Unfortunately, the sheer number of loci combinations and hypotheses to process makes this prohibitive both computationally and statistically. Epistasis test prioritization algorithms rank likely epistatic single nucleotide polymorphism (SNP) pairs to limit the number of tests. However, they still suffer from very low precision. It was shown in the literature that selecting SNPs that are individually correlated with the phenotype and also diverse with respect to genomic location leads to better phenotype prediction due to genetic complementation. Here, we propose that an algorithm that pairs SNPs from such diverse regions and ranks them can improve prediction power. We propose an epistasis test prioritization algorithm that optimizes a submodular set function to select a diverse and complementary set of genomic regions that span the underlying genome. The SNP pairs from these regions are then further ranked w.r.t. their co-coverage of the case cohort. We compare our algorithm with the state of the art on three GWAS and show that we (1) substantially improve precision (from 0.003 to 0.652) while maintaining the significance of selected pairs, (2) decrease the number of tests by 25-fold, and (3) decrease the runtime by 4-fold. We also show that promoting SNPs from regulatory/coding regions improves the performance (up to 0.8). Potpourri is available at http:/ciceklab.cs.bilkent.edu.tr/potpourri.