Conference PaperPDF Available

Comprehensive-GWAS: a pipeline for genome-wide association studies utilizing cross-validation to assess the predictivity of genetic variations

December 2020

December 2020

DOI:10.1109/BIBM49941.2020.9313355

Conference: 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

Authors:

Gabrielle Dagasso

The University of Calgary

University of Saskatchewan

University of Saskatchewan

Show all 7 authorsHide

The flowchart depicts the overall workflow of the pipeline

The file structure chart depicts the overall data flow of the pipeline

The likelihood distribution of k results using the pophelper package in R (Step 1 in Figure 1). The graph shows the mean of ( )( ) L k SD  over 10 runs for each k value. It's plateau at k = 2, which supports the usage of its corresponding Q-matrix

k  plot results using the Evanno method in the pophelper package in R. k  is calculated from the absolute value of the second order rate of change of the likelihood distribution divided by the standard deviation of the likelihood distribution (Step 1 in Figure 1). The modal value of this distribution is the true k or the uppermost level of structure. Here = 2 k clusters because the graph plateau's at = 2 k . This supports the usage of the Q -matrix of = 2 k

Multiline barplots from Q -matrices from = 2 k to = 7 k for step 1 in Figure 1. Each group is individually labelled

Figures - uploaded by Lipu Wang

Content may be subject to copyright.

Content uploaded by Lipu Wang

Content may be subject to copyright.

I

nt. J. Data Mining and Bioinformatics, Vol. 25, Nos. 1

/

2, 2021 17

Copyright © 2021 Inderscience Enterprises Ltd.

Leveraging machine learning to advance

genome-wide association studies

Gabrielle Dagasso

Department of Mathematics and Statistics,

Thompson Rivers University,

Kamloops, British Columbia, Canada

Email: gdagasso@gmail.com

Yan Yan

Department of Computing Science,

Thompson Rivers University,

Kamloops, British Columbia, Canada

Email: yyan@tru.ca

Lipu Wang

Department of Plant Sciences,

University of Saskatchewan,

Saskatoon, Saskatchewan, Canada

Email: lipu.wang@usask.ca

Longhai Li

Department of Mathematics and Statistics,

University of Saskatchewan,

Saskatoon, Saskatchewan, Canada

Email: longhai.li@usask.ca

Randy Kutcher

Department of Plant Sciences,

University of Saskatchewan,

Saskatoon, Saskatchewan, Canada

Email: randy.kutcher@usask.ca

Wentao Zhang

National Research Council of Canada,

Saskatoon, Saskatchewan, Canada

Email: Wentao.Zhang@nrc-cnrc.gc.ca

18 G. Dagasso et al.

Lingling Jin*

Department of Computer Science,

University of Saskatchewan,

Saskatoon, Saskatchewan, Canada

Email: lingling.jin@cs.usask.ca

*Corresponding author

Abstract: Genome-Wide Association Studies (GWAS) has demonstrated its

power in discovering genetic variations to particular traits related to

agronomically important features in crops. The typical output of a GWAS

program includes a series of Single Nucleotide Polymorphisms (SNPs) and

their significance. Currently, there is no standard way to compare results across

different programs or to select the most ‘significant’ results uniformly and

consistently. To obtain a comprehensive and accurate set of SNPs associated

with a trait of interest, we present a novel automated pipeline that leverages

machine learning for GWAS discoveries. The pipeline first performs

population structure analysis, then executes multiple GWAS software and

combines their results into a single SNP set. After that, it selects SNPs from the

set with high individual and/or joint effects with the Least Absolute Shrinkage

and Selection Operator analysis. Finally, the predictivity of the model is

assessed using cross-validation.

Keywords: genome-wide association studies; machine learning; population

structure analysis; cross-validation; LASSO; fusarium head blight.

Reference to this paper should be made as follows: Dagasso, G., Yan, Y.,

Wang, L., Li, L., Kutcher, R., Zhang, W. and Jin, L. (2021) ‘Leveraging

machine learning to advance genome-wide association studies’, Int. J. Data

Mining and Bioinformatics, Vol. 25, Nos. 1/2, pp.17–36.

Biographical notes: Gabrielle Dagasso is an undergraduate student at

Thompson Rivers University. She is finishing her Bachelor of Science with a

major in Mathematics and a minor in Computing Science in 2021. She has

been working as an Undergraduate Research Assistant with Lingling Jin since

the 3rd year of her program. She is the recipient of several research

scholarships, such as NSERC USRA and TRU Undergraduate Research

Experience Award Program.

Yan Yan is an Assistant Professor in the Department of Computing Science of

Thompson Rivers University (TRU). She received her PhD degree in

Biomedical Engineering from the University of Saskatchewan (UofS). Before

joining TRU, she worked as a Postdoctoral Fellow at the Department of

Computer Science in the University of Western Ontario and the UofS. Her

primary research interests include proteomics with a focus on peptide

sequencing, population genomics on large scale genome-wide data analysis and

machine learning in computational biology.

Lipu Wang received his PhD degree in the Department of Biology from the

University of Saskatchewan. Now, he is a Research Officer at the Crop

Development Centre, Department of Plant Sciences, University of

Saskatchewan. Previously he was a Research Associate in National Research

Council of Canada. He has extensive experience and expertise in Plant

Molecular Biology and Plant Pathology. Currently, he focuses on

characterising the molecular mechanism of Fusarium head blight (FHB)

resistance in wheat by using molecular and genomic tools. He developed an

Levera

g

ing machine learning to advance genome-wide association studies 19

LC-MS/MS-based mycotoxin test methods for FHB breeding programs and

investigates FHB resistance components through genome-wide association

study (GWAS).

Longhai Li received his PhD degree in Statistics from the University of

Toronto in 2007 and BSc degree in Statistics from the University of Science

and Technology of China in 2002. Currently, he is a Full Professor at the

University of Saskatchewan. His research activities focus on developing and

applying statistical machine learning methods for high-throughput data and

complex-structured data.

Randy Kutcher is a Professor of Plant Pathology at the Crop Development

Centre, Department of Plant Sciences, University of Saskatchewan. Previously,

he was a Research Scientist with Agriculture and Agri-Food Canada in

Saskatchewan focusing on Canola Pathology and Integrated Pest Management.

He received his BSc and MSc degrees from the University of Manitoba and

PhD degree from the University of Saskatchewan. He leads the Cereal and Flax

Pathology program; his wheat research is focused on applied strategies to

mitigate stripe rust, fusarium head blight and leaf spot diseases. He teaches an

undergraduate course and conducts extension activities with growers and

industry on field crop disease management.

Wentao Zhang is a Quantitative Geneticist from National Research Council of

Canada (NRC), is curious about key questions: how does genetic variation give

rise to phenotypic variation and how can better predict the performance of

these complex traits. His research focuses on understanding the genetic

architecture underlying complex traits include disease resistance, high yield,

better quality and good agronomics package in crops like wheat, Canola and

recently pulses by applying: (1) quantitative genetics, genomics, statistical

modelling and computational approaches for complex traits (2) Genomics

selection, machine learning and deep learning to predict complex traits.

Lingling Jin received her PhD degree in Computer Science specialised in

Bioinformatics from the University of Saskatchewan in 2017. Currently, she is

an Assistant Professor of Computer Science at the University of Saskatchewan

and an Adjunct Professor at Thompson Rivers University. Previously, she was

a Postdoctoral Research Scientist with Agriculture and Agri-Food Canada. Her

primary research interest is in computational modelling of polyploid genome

evolution and various aspects of plant bioinformatics.

This paper is a revised and expanded version of a paper entitled

‘Comprehensive GWAS: a pipeline for genome-wide association studies

utilising cross-validation to assess the predictivity of genetic variations’

presented at the ‘IEEE International Conference on Bioinformatics and

Biomedicine 2020 (BIBM 2020)’, 16–19 December 2020.

1 Introduction

With the development of next generation sequencing technology and drastically

decreasing of sequencing cost, vast amount of whole-genome data becomes available

nowadays. Genome-Wide Association Studies (GWAS) becomes powerful tools for

identifying associations between genetic variants within a species and phenotype

differences among individual samples. Typically, the genetic variants are Single

20 G. Dagasso et al.

Nucleotide Polymorphisms, also known as SNPs, which are changes of single DNA

base-pairs. The genetic data is referred to as genotypes and the phenotypic data is called

phenotypes. In most cases, the number of SNPs is much more than the number of

samples as the SNPs are across the whole genome. This makes the genotype “small-n-

large-p” data.

GWAS do not rely on previous knowledge to locate potential causal genes regions;

instead, they scan the whole genome to identify causal regions. Therefore, they have the

ability to investigate low-frequency and rare variants across the whole genome and

identify genes that could be missed by other methods. Since last decade, GWAS have

served as primary methods to reveal the relationships between genotypes and phenotypes

and made significant advancement in health, agriculture and many other fields. In

agriculture, they are widely used to dissect driving genes and biological architecture of

complex traits.

GWAS usually rely on statistical models to evaluate associations and find significant

SNPs that are related to the phenotype. Traditional GWAS perform a series of single-

SNP statistical hypothesis tests. These tests are independent for each SNP and the null

hypothesis is that there is no association between the SNP and the phenotype of interests.

If the null hypothesis is rejected, the tested SNP is reported to be associated with the

phenotype and a p-value is usually output along with the SNP to indicate the

significance. Out of all SNPs, the ones correlated with phenotypes are just a small subset,

which means that the genotype data is sparse. This property could bring complex

challenges to current GWAS methods as many of reported SNPs may not be truly related

to the phenotype. Phenotype can be binary (e.g., 1/0 representing having a particular

disease or not) or quantitative (e.g., height of individuals). When dealing with

quantitative phenotypes, linear regression is usually applied. The most used models

include Generalised Linear Models (GLMs) and Mixed Linear Models (MLMs). MLMs

are also known as Linear Mixed Models (LMMs). In this study, we will refer to them

as LMMs.

GWAS packages commonly used include PLINK (Purcell et al., 2010), BOLT-LMM

(Loh et al., 2015), FaST-LMM (Lippert et al., 2011), GCTA (Yang et al., 2011),

TASSEL (Bradbury et al., 2007), GAPIT (Lipka et al., 2012; Tang et al., 2016) and

others. All of them are based on traditional statistical models and many of them have the

options to choose between LMM, GLM and other statistical models. PLINK is a free,

open-source toolset for GWAS and population-based linkage analyses. It provides a

range of basic yet rapid computation for large biological data sets. It was first developed

for human genomes and is one of the earliest GWAS packages. Nowadays, PLINK is

considered as a standard GWAS method in many fields of study. BOLT-LMM utilises an

efficient Bayesian mixed-model and it is reported to increase the statistical power and

computational speed for expanded data sets. FaST-LMM applies a factorised log-

likelihood function in the LMM and claims to scale linearly with data size in both run

time and memory. It can be ideal for large data to achieve satisfying results. GCTA is a

tool for genome-wide complex trait analysis. By utilising LMM, it fits the contribution of

all SNPs as random effects and addresses the “missing heritability” problem of human

genomes.

TASSEL and GAPIT were developed specifically for GWAS of agricultural plants.

TASSEL was first developed and tested for maize with the purpose to handle a wide

range of insertions and deletions. Many other existing software did not consider these

types of polymorphisms before. GAPIT was developed after TASSEL and used statistical

Levera

g

ing machine learning to advance genome-wide association studies 21

methods that were similar as TASSEL. It is capable to perform both GWAS and

Genomic Selection (GS) (Goddard and Hayes, 2007). Genomic selection is a marker-

assisted selection in which genetic markers across the whole genome (typically SNPs)

are used and breeding values are predicted. GAPIT has been updated frequently to

incorporate the state-of-the-art methods. It is therefore reported to achieve the most

accurate and computationally efficient results.

2 Motivations

GWAS typically output a series of SNPs along with p-values, which are used to evaluate

the significance of SNPs associated with phenotypes of interest. A pre-defined threshold

is set to select significant SNPs for further analysis, e.g., 10–7 is a commonly used cut-off

in the literature. However, output SNPs from these programs could vary largely and their

p-values may not be comparable to each other. For example, one study investigated

different GWAS packages (Yan et al., 2018) reported that for the same input plant data,

p-values of the same SNPs output by PLINK were dramatically smaller than the ones

output by TASSEL and GAPIT. Their differences were by orders of magnitude.

Moreover, none of these programs utilises cross-validation which ensures that a model is

generalised and measures the accuracy of the models, such as GLM and LMM. Cross

validation is an important step in statistical analysis to ensure the results found are

validate and not false positives due to over estimations of the model’s accuracy (Oetting

et al., 2017). A false positive occurs when something assumed to be true is false, likewise

a false negative occurs when something assumed to be false is true. For example, when

someone tests positive for a disease but do not have it, this would be a false positive. A

common fault of some programs when dealing with false positives is to utilise either the

Bonferroni or Benjamini-Hochberg methods which may lead to false negatives and loss

of true significant SNPs (Waldmann et al., 2013).

In addition, traditional GWAS methods are criticised as causing the “missing

heritability”. It means that a large number of SNPs identified by GWAS may only

account for a proportion of the heritability of complex traits. This is because p-values

from GWAS only indicate whether SNPs are related to a phenotype or not; they cannot

tell how strongly they are related. Therefore, some individual causal SNPs fail to pass the

stringent significance thresholds because of their small effects to the phenotype (Manolio

et al., 2009; Tam et al., 2019). Thus, we cannot completely rely on GWAS results as

predictors for casual SNPs. Moreover, GWAS can be sensitive to the degree of match

between the assumed statistical model (such as GLM or LMM) and the real data

distribution. Last but not least, the single SNP analysis performed in these methods

ignores the potential correlations and joint effects among SNPs.

To overcome these limitations, machine learning can be integrated in the statistical

modelling of GWAS. Machine learning algorithms can identify relevant features from a

complex data set or make accurate predictions from such data. They have been widely

applied to whole-genome data analysis of various bioinformatics problems. When

incorporating with GWAS, it could identify high orders of interactions between SNPs

based on biological pathways and gene modules (epistasis), identify and prioritise SNPs

with small effects by evaluating simultaneously pan-genomic SNPs (Sun et al., 2020),

and serve as alternative to find casual SNPs and even predict phenotypes. Instead of

22 G. Dagasso et al.

using machine learning or traditional statistical modelling alone, combining the two

methods into a two-step model will integrate the strength of the separate methods.

The Least Absolute Shrinkage and Selection Operator (LASSO) is one of such

machine learning methods that is suitable for “big-n-small-p” data just as the genotype.

LASSO and its variations have been used to whole-genome data for feature selection and

produce a subset of SNPs with significantly large effects to the phenotype (Waldmann

et al., 2019). LASSO selects a subset of SNPs with the best joint effect explaining the

phenotype of interests. These selected SNPs can also be used in the phenotype prediction

and viewed as an alternative for genomic selection. Since LASSO can produce a reduced

set of SNPs, it could be extremely useful as a “pre-processing” step for conducting

GWAS on large genomes or polyploid crops, or as a “post-processing” step to select

relevant SNPs with high individual effects and/or high joint effects from the “significant”

SNPs reported by statistical methods like GWAS. GWAS methods often have limitations

on the scale of SNPs and could fail to find the correlated SNPs on large complex data.

The predictive model in LASSO can be validated using an independent test data set to

measure its effectiveness. In order to utilise a machine learning method as LASSO, the

whole data set is often arbitrarily split into two parts, a training data set and a test data

set. When applying cross-validation, it splits the whole data set into training and testing

repeatedly to increase the testing sample size. Cross-validation is thus usually applied to

data sets with small sample size just like the genotype data.

In this paper, we present a two-step wrapper model that integrates LASSO into

GWAS workflow and develop an automatic pipeline. It first uses traditional GWAS with

statistical models to select SNPs with their significant values, and then uses LASSO to

further select SNPs with high individual effects and/or high joint effects to enhance the

model’s results for relating SNPs to the phenotypes. In addition, the pipeline

automatically integrates population structure calculation into the model to improve the

prediction accuracy. Users do not need to generate the population structure matrix

separately and manually feed it into any GWAS program in the pipeline. Since different

GWAS programs often produce dissimilar association results, the proposed pipeline

includes multiple GWAS packages to mitigate these effects. The pipeline is publicly

available online at https://github.com/notTrivial/Comprehensive-GWAS.

3 Methods

In this section, the methods and software used in the proposed two-step wrapper model

are introduced.

The typical statistical models used in GWAS include the GLM, LMM and the

Multiple Locus Mixed Effects Model (MLMM). In this study, we select GLM and LMM.

Typically, GLM models assumes the data is independent and LMM can deal with

non-independence by adding random effects. The statistical model of a GLM can be

described as

=

y

Xe





and a LMM model can be described as

=

y

XZue







Levera

g

ing machine learning to advance genome-wide association studies 23

where y is the vector of observations,



is an unknown vector containing fixed effects

including genetic marker and population structure





Q, u is an unknown vector of

random additive effects,

X

and

Z

are the known design matrices and e is the residuals

(Bradbury et al., 2007). When the covariance between individuals is considered in the

LMM model, it is viewed as a random additive genetic effect.

Our proposed model combines several open-source software to achieve a clear and

simple GWAS workflow for end users. It conducts GWAS analysis utilising multiple

packages. The current version includes TASSEL and GAPIT as the GWAS programs.

TASSEL utilises GLM and LMM in determining associations; GAPIT uses similar

statistical models as TASSEL and claims to be 7-fold faster than it. They are particularly

popular in plant genome analysis. Since our study focuses on the plant genomes and

agriculture, the two programs are included in the pipeline by default. However, it can be

easily extended to include additional GWAS programs such as PLINK.

One important feature of the proposed model is that it integrates the automatic

calculation of population structure of the input data. Including population structure in

GWAS helps reduce false positive rate of output associations that are not related to the

phenotype (Sul et al., 2018). To determine the most likely population structure, a widely

used program, Structure, is integrated to calculate population structure matrix Q. This

enables an accurate determination of the number of sub-populations (denoted as k) in a

data set (Pritchard et al., 2000). To further verify these results, the Evanno method is

included to perform an assessment and visualise the likelihood across all tested sub-

population numbers in the data set (Evanno et al., 2005). It is implemented using the

method from the pophelper v2.3.0 package (Francis, 2019) in R. The optimal population

structure is then integrated in the model as a co-variance. Both TASSEL and GAPIT

could take into consideration of this and use the population matrix as a co-variance.

After the GWAS analysis, SNPs along with p-values reported from each program go

through a False Discovery Rate (FDR) adjustment, unless already done. The adjusted

p-values are also known as q-values. Only those SNPs with q-values less than the cut-off

are viewed as significant SNPs. In this study, the q-value=0.05 is set as the default

cut-off.

The combined significant SNPs and their q-values are then verified by LASSO with

the use of glmnet package in R (Friedman et al., 2010). This package is responsible for

fitting a GLM via penalised maximum likelihood. It is computed for both the elastic-net

and LASSO penalty paths (Friedman et al., 2010). LASSO penalises non-zero

coefficients by taking the sum of their absolute values, referred to as 1L and 2L

penalties from Ridge Regression, where it penalises the model with the sum of squared

individuals. Elastic-net combines both 1L and 2L penalties. Ridge regression is to

shrink regression coefficients for variables with minor contribution to the model

(Friedman et al., 2010). When combined with GWAS, the application of LASSO further

narrows down the SNPs found by these programs and builds a predictive model for the

phenotype. The most significant SNPs passed by LASSO are the final output from the

model.

24 G. Dagasso et al.

4 Pipeline overview

In this section, the details of the pipeline are introduced with clear explanations about

input and output data of each step.

The pipeline takes genotype and phenotype data as input and produces association

results along with visualisations and verification of association with the integration of

LASSO. The entire workflow of the pipeline is shown as a flowchart in Figure 1.

Figure 1 The flowchart depicts the overall workflow of the pipeline

The pipeline is written in R v3.6.1. To execute the pipeline, TASSEL and Structure need

to be pre-installed in the appropriate locations as described on https://github.com/

notTrivial/Comprehensive-GWAS. The rest will be installed by the pipeline

automatically at run time. It is designed to be as seamless as possible to end users to

utilise different GWAS software and adjust output from each software to be the input for

the next one. The steps taken in the pipeline are further described as follows.

Levera

g

ing machine learning to advance genome-wide association studies 25

Step 0: data pre-processing

The input data required for this pipeline is numeric genotype, non-numeric genotype,

phenotype data and a kinship matrix. The numeric genotype is used for determining the

population structure by Structure and validation by LASSO. To be specific, numeric

genotype is a matrix with rows as individuals and columns as the SNP loci. When used in

Structure, marker names and headings are removed as per its requirement. When used in

LASSO, marker names and headings are present in the file. The non-numeric genotype is

used for GWAS along with phenotypes and the kinship matrix. The phenotype data and

kinship are in TASSEL required format and both need to have two files, one with the

header and one without.

Step 1: determine optimal population structure

The Structure program (Pritchard et al., 2000) is a Bayesian clustering program that

characterises population clusters to discriminate populations, to determine population

structure and to reveal the genetic composition of individuals using molecular markers. It

uses multilocus allele frequencies to assign individuals to a pre-defined number of

populations (k). The assignment is usually run for a range of k such as 2 to 10. Multiple

repeats are carried out for each k. Each output file for each repeat of k showing the

assignment probabilities of all individuals is referred to as the Q-matrix.

The Evanno method (Evanno et al., 2005) is used to estimate the optimal number of

populations, k, based on the results generated from the Structure program (Pritchard

et al., 2000). The pophelper package (Francis, 2019) in R is the analysis tool of the

Evanno method to analyse and visualise population structure. First, Structure examines

possible numbers of sub-populations from 2 to k, where k is specified by the user, i.e.

=10k is specified in the results section. This process is repeated E times for each

possible number of sub-populations to ensure that the Evanno method is able to run. In

this study, =10E is used as default. Results are then presented to the users and the

optimal number of populations (k) is chosen. The resultant Q-matrix is the output of this

step and then will be automatically fed into the follow-up GWAS analysis.

Step 2: GWAS with traditional statistical methods

The non-numeric genotype data, phenotype data, kinship matrix K, and population matrix

Q are used as input to run GAPIT and TASSEL. The pipeline has the flexibility to choose

between GLM and LMM models as both programs support them (Bradbury et al., 2007;

Tang et al., 2016). Output of this step includes Manhattan Plots, QQ-Plots and statistical

association results from TASSL and GAPIT. They are then output to their respective

directories as shown in Figure 2.

Step 3: merge data set from statistical methods

Statistical association results from GAPIT and TASSEL are read back into the pipeline as

the input of this step. Reported p-values of each program are adjusted using FDR. Since

GAPIT can report both adjusted and un-adjusted p-values, in the current version, only

TASSEL needs this adjustment. They are then merged together and the appropriate

Manhattan plots of each statistical model (GLM or LMM) per phenotype are generated.

Only the SNPs that meet the cut-off and are found by both programs will be kept as the

output of this step and used in further verification steps.

26 G. Dagasso et al.

Figure 2 The file structure chart depicts the overall data flow of the pipeline

Step 4: LASSO and cross-validation on merged data set

The significant SNPs output from above Step are verified using the LASSO method from

the glmnet package in R to identify joint correlations. Cross-validation with 10 folds are

applied to the significant SNPs of each statistical method per phenotype. The phenotype

data is converted to discrete variables from numerical, anything over 50% is converted

to 1 and less than 50% is converted to 0. The binomial method is applied in the

validation. The coefficient plot and cross-validation plot are the output of this step and

stored under the LASSO directory.

Step 5: generate final results

The final significant SNPs passed from above step of each statistical method per

phenotype are output to respective directories. Combined with other output from

previous steps, all the results are stored under the directories as described in Figure 2.

5 Experiments and results

In this section, we describe the experiments and report the results of using the proposed

pipeline on a plant genomeic and phenotypic data set.

Levera

g

ing machine learning to advance genome-wide association studies 27

5.1 Experimental data

Fusarium Head Blight (FHB) is a fungal disease that affects both wheat and barley

worldwide. It is a serious disease that causes reduced crop yields and produces

mycotoxins which pose serious health threats to both humans and livestock (Walter et al.,

2010; Hilton et al., 1999). Therefore, detecting FHB resistance has significant meaning in

plant breeding. GWAS has been widely used to provide important insights into

the linkages of complex traits to causal SNPs and its related genes in plant breeding

(Yan et al., 2019).

In this study, 199 bread wheat varieties were selected with the lowest FHB severity

rate among a poll of 4000 varieties. 2235 SNPs were detected using wheat 90 k SNP

array (Wang et al., 2014). Three phenotypes including flowering date, plant height and

disease severity have been screened in a FHB disease greenhouse in Saskatchewan,

Canada in 2019.

5.2 Population structure determination

The pipeline’s built-in function identifies the most likely population structure of the

experimental data. Figures 3 and 4 show the number of population verified by Evanno

method. As noted in both figures, the graph plateau’s at k = 2, which supports that k = 2

is the optimal number. The Q-matrix produced by Structure with k = 2 is used in the

subsequent GWAS runs (Evanno et al., 2005). Figure 5 is a stacked barplot that shows

the population group assignment of each individual in 100 repetitions for k ranging from

2 to 7 groups of populations.

Figure 3 The likelihood distribution of k results using the pophelper package in R (Step 1 in

Figure 1). The graph shows the mean of ()( )

L

kSD



over 10 runs for each k value.

It’s plateau at k = 2, which supports the usage of its corresponding Q-matrix

28 G. Dagasso et al.

Figure 4 k plot results using the Evanno method in the pophelper package in R. k is

calculated from the absolute value of the second order rate of change of the likelihood

distribution divided by the standard deviation of the likelihood distribution (Step 1 in

Figure 1). The modal value of this distribution is the true k or the uppermost level of

structure. Here = 2k clusters because the graph plateau’s at = 2k. This supports the

usage of the Q-matrix of = 2k

5.3 GWAS results

From all three tested phenotypes, two of them, plant height and disease severity, did not

get significant results by TASSEL and GAPIT. The resultant QQ plots from them

showed that the model did not fit the data well. However, this is also likely due to the

fact that the 90 k SNP array only detects SNPs in “preselected” genomic regions that

have a higher likelihood to be associated with traits of interest without including any

irrelevant positions (Tang et al., 2016; Bradbury et al., 2007). In the following, results on

flowering date are shown and explained in details.

Figures 6 and 7 show the Manhattan plots of TASSEL and GAPIT’s GWAS results.

The SNP’s p-values are with a scale of log 10 ( )pvalue



. They are both produced by

the GLM of the two programs as The LMM found no significant results for either of

them. Both programs identified the most significant SNPs located on chromosomes 2B

and 6B.

Levera

g

ing machine learning to advance genome-wide association studies 29

Figure 5 Multiline barplots from Q-matrices from = 2k to = 7k for step 1 in Figure 1. Each

group is individually labelled

30 G. Dagasso et al.

Figure 6 Manhattan plot for flowering date by GLM of GAPIT. The most significant SNPs are

above the green line and located on chromosomes 2B and 6B (Step 2 in Figure 1)

Figure 7 Manhattan plot for flowering date by GLM of TASSEL. The most significant SNPs are

located on chromosomes 2B and 6B (Step 2 in Figure 1)

There were a total of 65 significant (q-values



0.05) SNPs predicted by the pipeline

using GLM of TASSEL and GAPIT. The q-value is adjusted by the FDR (Benjamini and

Hochberg, 1995) of the p-value. They are the merged results from the two programs.

Among the whole genome, chromosome 2B and chromosome 6B have the largest

number of SNPs which the two programs agree on. This indicates that the two regions

may have significant influence to the flowering time phenotype. The pipeline output is

summarised in the merged Manhattan plot in Figure 8. It shows the merged significant

SNPs Manhattan Plot with a scale of log 10 ( )q value



. The coloured columns help

distinguish different chromosomes. Each dot on this Manhattan plot corresponds to one

significant SNP (q-values  0.05), though some overlap to appear as a larger circle.

Levera

g

ing machine learning to advance genome-wide association studies 31

Figure 8 Manhattan plot for flowering date using the merged SNPs from both TASSEL and

GAPIT. All 65 significant SNPs (whose q-values



0.05) are illustrated in this graph.

The colours indicate different chromosomes from its neighbouring chromosomes

(Step 3 in Figure 1)

The Manhattan plot also highlights the top 3 significant merged SNPs with their names

displayed. These top 3 significant SNPs are also listed in Table 1 with their chromosome

and location information. The results show that our proposed pipeline captured

significant SNPs from both GWAS software and they could have joint effects to the

phenotype.

Table 1 Top 3 Significant SNPs after merging (Step 5 in Figure 1)

Name of SNP Chromosome Position

wsnp_RFL_Contig4668_5551137 6B 71

wsnp_Ex_rep_c71064_69904031 2B 100

wsnp_Ra_c3955_7262354 2B 116

5.4 LASSO and cross-validation

Top significant SNPs found by the glmnet of R package are shown in Table 2. They

played an important role in fitting the model and therefore were concluded to be the top

significant SNPs from the glmnet analysis. SNPs used in glmnet are only the 65

significant SNPs from the merged results of GAPIT and TASSEL. We notice that these

top SNPs are not the same as the SNPs with the lowest q-values. But they could pinpoint

new regions that researchers could focus on to find potential casual relationships to the

phenotype. As we stated in Section 2, GWAS results only reflect associations but not

how strong they are. SNPs reported by LASSO may influence the phenotype but failed to

pass the q-value threshold.

32 G. Dagasso et al.

Table 2 Top 5 significant SNPs found by glmnet in step 5 of Figure 1. These SNPs

contributed more to the development of the model fitting than the other 65 significant

SNPs from the merged results

Name of SNP Chromosome Position

Excalibur_rep_c114833_645 3D 41

D_contig25474_188 7D 161

RAC875_c5243_479 1B 31

BS00065005_51 1A 55

wsnp_Ex_c56629_58677561 5B 12

The coefficients plot from the model fitting of SNPs is shown in Figure 9, which includes

all 65 significant SNPs. Each curved line in Figure 9 corresponds to one SNP and it

shows the path of each SNPs’ coefficient against the 1L-norm. The number of non-zero

coefficients is indicated as the numbers in the x-axis on the top.

Figure 10 shows the corresponding cross-validation result. It shows the validation of

the model, the choice of two

s



, and the LASSO regularisation parameter, as indicated

by the vertical dotted lines. The x-axis on the top is the number of non-zero coefficients

just the same as the ones in Figure 9. The red dotted line is the cross-validation curve and

the error bars indicate the upper and lower standard deviation of the



. One



gives the

minimum mean cross-validated error and the other gives the most regularised model such

that the error is within one standard error of the minimum (Friedman et al., 2010).

Figure 10 also shows that the misclassification error rate is about 0.20 , which indicates

the predictivity of the selected SNPs. This highlights the significance of the SNPs from

the merged results.

Figure 9 Coefficients from LASSO for the 65 merged significant SNPs (Step 4 in Figure 1)

Levera

g

ing machine learning to advance genome-wide association studies 33

Figure 10 Cross-validation from LASSO for the 65 merged significant SNPs (Step 4 in Figure 1)

6 Conclusions and discussions

In this study, we proposed a computational method for seamless GWAS analyses with

statistically modelling and machine learning. It uses an ensemble approach that allows

easy analysis of associations between genotype and phenotype data to narrow down areas

and SNPs of “true” significance while being efficient and easy to handle. Combining

results from two well-known GWAS programs yields more promising results, which

allows overlapping analysis of multiple programs across multiple statistical models.

Automatically including the population structure analysis in these GWAS programs helps

to reduce a high rate of false positive associations for SNPs that are not linked with

phenotypes. The final step of the pipeline further verifies the significant SNPs by using

LASSO to provide better accuracy in detecting “true” significant SNPs to help narrow

down the number of resultant SNPs (Friedman et al., 2010).

The pipeline is created for GWAS analysis in one place while utilising various

software. It also has the capacity to extend and include more GWAS programs and

statistical models to be utilised in other analyses. We demonstrated the utility of this

34 G. Dagasso et al.

pipeline in associating FHB resistance related phenotypes with genomic regions in bread

wheat varieties. Significant SNPs were found in the flowering date phenotype by GLM,

whereas the other studied phenotypes had no significant SNPs found with the q-value

threshold of 0.05 using TASSEL and GAPIT. The significant SNPs in the flowering date

phenotype was verified by FDR adjusted p-values and were further verified by cross-

validation using LASSO (Friedman et al., 2010). Having this result verified by both

GAPIT and TASSEL lends credibility to the finding of these significant SNPs.

There are a few directions where the pipeline will be extended to obtain better and

more comprehensive results. For example, instead of detecting associated SNPs for each

phenotype independently, the pipeline can be extended to combine results from multiple

related phenotypes. It is known that flowering date, plant height and disease severity

rates are phenotypes related to FHB disease; therefore, genetic variants that are common

across multiple related phenotypes or unique to any specific phenotype can be aggregated

and associated with the FHB disease. Further, given that results from different GWAS

programs could vary largely in terms of output SNPs as well as their p-values, in addition

to the current q-value threshold, the pipeline can include other selection criterion such as

choosing the top significant SNPs without meeting the q-value threshold. Finally, more

GWAS packages such as PLINK can be included in the pipeline to complement with

current two software, GAPIT and TASSEL. In this case, additional file format

conversion could also be added into the pipeline, for example, to convert PLINK ped/bed

files to the VCF files. This will enable the pipeline to be further extended to synthesise

and report significant SNP sets in a customised but uniform way.

References

Benjamini, Y. and Hochberg, Y. (1995) ‘Controlling the false discovery rate: a practical and

powerful approach to multiple testing’, Journal of the Royal Statistical Society: Series B

(Methodological), Vol. 57, No. 1, pp.289–300.

Bradbury, P.J. and Zhang, Z., Kroon, D.E., Casstevens, T.M. (2007) ‘TASSEL: software for

association mapping of complex traits in diverse samples’, Bioinformatics, Vol. 23, No. 19,

pp.2633–2635.

Evanno, G., Regnaut, S. and Goudet, J. (2005) ‘Detecting the number of clusters of individuals

using the software STRUCTURE: a simulation study’, Mol. Ecol., Vol. 14, No. 8,

pp.2611–2620.

Francis, R.M. (2019) Tabulate, Analyse and Visualise Admixture Proportions from STRUCTURE,

TESS, BAPS, ADMIXTURE and Tab-Delimited q-matrices Files, R package version 2.3.0.

Friedman, J., Hastie, T. and Tibshirani, R. (2010) ‘Regularization paths for generalized linear

models via coordinate descent’, Journal of statistical software, Vol. 33, No. 1, pp.1–22.

Goddard, M.E. and Hayes, B.J. (2007) ‘Genomic selection’, Journal of Animal breeding and

Genetics, Vol. 124, No. 6, pp.323–330.

Hilton, A.J., Jenkinson, P., Hollins, T.W. and Parry, D.W. (1999) ‘Relationship between cultivar

height and severity of fusarium ear blight in wheat’, Plant Pathology, Vol. 48, No. 2,

pp.202–208.

Levera

g

ing machine learning to advance genome-wide association studies 35

Lipka, A.E., Tian, F., Wang, Q., Peiffer, J., Li, M., Bradbury, P.J., Gore, M.A., Buckler, E.S. and

Zhang, Z. (2012) ‘GAPIT: genome association and prediction integrated tool’, Bioinformatics,

Vol. 28, No. 18, pp.2397–2399.

Lippert, C., Listgarten, J., Liu, Y., Kadie, C.M., Davidson, R.I. and Heckerman, D. (2011) ‘Fast

linear mixed models for genome-wide association studies’, Nature Methods, Vol. 8, No. 10,

pp.833–835.

Loh, P-R., Tucker, G., Bulik-Sullivan, B-K., Vilhjalmsson, B.J., Finucane, H.K., Salem, R.M.,

Chasman, D.I., Ridker, P.M., Neale, B.M. and Berger, B. et al. (2015) ‘Efficient Bayesian

mixed-model analysis increases association power in large cohorts’, Nature Genetics, Vol. 47,

No. 3, pp.284–290.

Manolio, T.A., Collins, F.S., Cox, N.J., Goldstein, D.B., Hindorff, L.A., Hunter, D.J., McCarthy,

M.I., Ramos, E.M., Cardon, L.R. and Chakravarti, A. et al. (2009) ‘Finding the missing

heritability of complex diseases’, Nature, pp.747–753.

Oetting, W.S., Jacobson, P.A. and Israni, A.K. (2017) ‘Validation is critical for genome-wide

association study-based associations’, American Journal of Transplantation, Vol. 17, No. 2,

pp.318–319.

Pritchard, J.K., Stephens, M. and Donnelly, P. (2000) ‘Inference of population structure using

multilocus genotype data’, Genetics, Vol. 155, No. 2, pp.945–959.

Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A.R., Bender, D., Maller, J., Sklar,

P., de Bakker, P.I.W. and Daly, M.J. et al. (2007) ‘PLINK: a tool set for whole-genome

association and population-based linkage analyses’, The American Journal of Human

Genetics, Vol. 81, pp.559–575.

Sul, J.H., Martin, L.S. and Eskin, E. (2018) ‘Population structure in genetic studies:

confounding factors and mixed models’, PLoS Genetics, Vol. 14, No. 12. Doi:

10.1371/journal.pgen.1007309.

Sun, S., Dong, B. and Zou, Q. (2020) (2020) ‘Revisiting genome-wide association studies from

statistical modelling to machine learning’, Briefings in Bioinformatics. Doi:

10.1093/bib/bbaa263.

Tam, V., Patel, N., Turcotte, M., Bossé, Y., Paré, G. and Meyre, D. (2019) ‘Benefits and

limitations of genome-wide association studies’, Nature Reviews Genetics, Vol. 20, No. 8,

pp.467–484.

Tang, Y., Liu, X., Wang, J., Li, M., Wang, Q., Tian, F., Su, Z., Pan, Y., Liu, D. and Lipka,

A.E. et al. (2016) ‘GAPIT version 2: an enhanced integrated tool for genomic association and

prediction’, The plant genome, Vol. 9, No. 2, pp.1–9.

Waldmann, P., Ferenčaković, M., Mészáros, G., Khayatzadeh, N., Curik, I. and Sölkner, J. (2019)

‘Autalasso: an automatic adaptive lasso for genome-wide prediction’, BMC Bioinformatics,

Vol. 20, No. 1, pp.1–10.

Waldmann, P., Mészáros, G., Gredler, B., Fuerst, C. and Sölkner, J. (2013) ‘Evaluation of the lasso

and the elastic net in genome-wide association studies’, Frontiers in genetics, Vol. 4. Doi:

10.3389/fgene.2013.00270.

Walter, S., Nicholson, P. and Doohan, F.M. (2010) ‘Action and reaction of host and pathogen

during Fusarium head blight disease’, New Phytologist, Vol. 185, No. 1, pp.54–66.

Wang, S., Wong, D., Forrest, K., Allen, A., Chao, S., Huang, B.E., Maccaferri, M., Salvi, S.,

Milner, S.G. and Cattivelli, L. et al. (2014) ‘Characterization of polyploid wheat genomic

diversity using a high-density 90 000 single nucleotide polymorphism array’, Plant

Biotechnology Journal, Vol. 12, No. 6, pp.787–796.

36 G. Dagasso et al.

Yan, Y., Burbridge, C., Shi, J., Liu, J. and Kusalik, A. (2018) ‘Comparing four genome-wide

association study (GWAS) programs with varied input data quantity’, Proceedings of the

IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp.1802–1809.

Yan, Y., Burbridge, C., Shi, J., Liu, J. and Kusalik, A. (2019) ‘Effects of input data quantity on

genome-wide association studies (GWAS)’, International Journal of Data Mining and

Bioinformatics, Vol. 22, No. 1, pp.19–43.

Yang, J., Lee, S.H., Goddard, M.E. and Visscher, P.M. (2011) ‘GCTA: a tool for genome-wide

complex trait analysis’, The American Journal of Human Genetics, Vol. 88, No. 1, pp.76–82.

Performance Comparison of LASSO Variants with Genome-Wide Association Studies (GWAS)

Conference Paper

Dec 2021

Effects of input data quantity on genome-wide association studies (GWAS)

Article

Full-text available

Jan 2019

AUTALASSO: an automatic adaptive LASSO for genome-wide prediction

Article

Full-text available

Apr 2019
BMC BIOINFORMATICS

Background Genome-wide prediction has become the method of choice in animal and plant breeding. Prediction of breeding values and phenotypes are routinely performed using large genomic data sets with number of markers on the order of several thousands to millions. The number of evaluated individuals is usually smaller which results in problems where model sparsity is of major concern. The LASSO technique has proven to be very well-suited for sparse problems often providing excellent prediction accuracy. Several computationally efficient LASSO algorithms have been developed, but optimization of hyper-parameters can be demanding. Results We have developed a novel automatic adaptive LASSO (AUTALASSO) based on the alternating direction method of multipliers (ADMM) optimization algorithm. The two major hyper-parameters of ADMM are the learning rate and the regularization factor. The learning rate is automatically tuned with line search and the regularization factor optimized using Golden section search. Results show that AUTALASSO provides superior prediction accuracy when evaluated on simulated and real bull data compared to the adaptive LASSO, LASSO and ridge regression implemented in the popular glmnet software. Conclusions The AUTALASSO provides a very flexible and computationally efficient approach to GWP, especially when it is important to obtain high prediction accuracy and genetic gain. The AUTALASSO also has the capability to perform GWAS of both additive and dominance effects with smaller prediction error than the ordinary LASSO.

Population structure in genetic studies: Confounding factors and mixed models

Article

Full-text available

Dec 2018
PLOS GENET

A genome-wide association study (GWAS) seeks to identify genetic variants that contribute to the development and progression of a specific disease. Over the past 10 years, new approaches using mixed models have emerged to mitigate the deleterious effects of population structure and relatedness in association studies. However, developing GWAS techniques to accurately test for association while correcting for population structure is a computational and statistical challenge. Using laboratory mouse strains as an example, our review characterizes the problem of population structure in association studies and describes how it can cause false positive associations. We then motivate mixed models in the context of unmodeled factors.

GAPIT Version 2: An Enhanced Integrated Tool for Genomic Association and Prediction

Article

Full-text available

Jul 2016

Most human diseases and agriculturally important traits are complex. Dissecting their genetic architecture requires continued development of innovative and powerful statistical methods. Corresponding advances in computing tools are critical to efficiently use these statistical innovations and to enhance and accelerate biomedical and agricultural research and applications. The genome association and prediction integrated tool (GAPIT) was first released in 2012 and became widely used for genome-wide association studies (GWAS) and genomic prediction. The GAPIT implemented computationally efficient statistical methods, including the compressed mixed linear model (CMLM) and genomic prediction by using genomic best linear unbiased prediction (gBLUP). New state-of-the-art statistical methods have now been implemented in a new, enhanced version of GAPIT. These methods include factored spectrally transformed linear mixed models (FaST-LMM), enriched CMLM (ECMLM), FaST-LMM-Select, and settlement of mixed linear models under progressively exclusive relationship (SUPER). The genomic prediction methods implemented in this new release of the GAPIT include gBLUP based on CMLM, ECMLM, and SUPER. Additionally, the GAPIT was updated to improve its existing output display features and to add new data display and evaluation functions, including new graphing options and capabilities, phenotype simulation, power analysis, and cross-validation. These enhancements make the GAPIT a valuable resource for determining appropriate experimental designs and performing GWAS and genomic prediction. The enhanced R-based GAPIT software package uses state-of-the-art methods to conduct GWAS and genomic prediction. The GAPIT also provides new functions for developing experimental designs and creating publication-ready tabular summaries and graphs to improve the efficiency and application of genomic research.

Revisiting genome-wide association studies from statistical modelling to machine learning

Article

Oct 2020
BRIEF BIOINFORM

Over the last decade, genome-wide association studies (GWAS) have discovered thousands of genetic variants underlying complex human diseases and agriculturally important traits. These findings have been utilized to dissect the biological basis of diseases, to develop new drugs, to advance precision medicine and to boost breeding. However, the potential of GWAS is still underexploited due to methodological limitations. Many challenges have emerged, including detecting epistasis and single-nucleotide polymorphisms (SNPs) with small effects and distinguishing causal variants from other SNPs associated through linkage disequilibrium. These issues have motivated advancements in GWAS analyses in two contrasting cultures-statistical modelling and machine learning. In this review, we systematically present the basic concepts and the benefits and limitations in both methods. We further discuss recent efforts to mitigate their weaknesses. Additionally, we summarize the state-of-the-art tools for detecting the missed signals, ultrarare mutations and gene-gene interactions and for prioritizing SNPs. Our work can offer both theoretical and practical guidelines for performing GWAS analyses and for developing further new robust methods to fully exploit the potential of GWAS.

Characterization of polyploid wheat genomic diversity using a high-density 90 000 single nucleotide polymorphism array

Article

Aug 2014

Benefits and limitations of genome-wide association studies

Article

May 2019

Genome-wide association studies (GWAS) involve testing genetic variants across the genomes of many individuals to identify genotype–phenotype associations. GWAS have revolutionized the field of complex disease genetics over the past decade, providing numerous compelling associations for human complex traits and diseases. Despite clear successes in identifying novel disease susceptibility genes and biological pathways and in translating these findings into clinical care, GWAS have not been without controversy. Prominent criticisms include concerns that GWAS will eventually implicate the entire genome in disease predisposition and that most association signals reflect variants and genes with no direct biological relevance to disease. In this Review, we comprehensively assess the benefits and limitations of GWAS in human populations and discuss the relevance of performing more GWAS.

Comparing Four Genome-Wide Association Study (GWAS) Programs with Varied Input Data Quantity

Conference Paper

Dec 2018

Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing

Article

Jan 1995

The common approach to the multiplicity problem calls for controlling the familywise error rate (FWER). This approach, though, has faults, and we point out a few. A different approach to problems of multiple significance testing is presented. It calls for controlling the expected proportion of falsely rejected hypotheses — the false discovery rate. This error rate is equivalent to the FWER when all hypotheses are true but is smaller otherwise. Therefore, in problems where the control of the false discovery rate rather than that of the FWER is desired, there is potential for a gain in power. A simple sequential Bonferronitype procedure is proved to control the false discovery rate for independent test statistics, and a simulation study shows that the gain in power is substantial. The use of the new procedure and the appropriateness of the criterion are illustrated with examples.

Inference of Population Structure Using Multilocus Genotype Data

Article

Jun 2000
GENETICS

We describe a model-based clustering method for using multilocus genotype data to infer population structure and assign individuals to populations. We assume a model in which there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more populations if their genotypes indicate that they are admixed. Our model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers, provided that they are not closely linked. Applications of our method include demonstrating the presence of population structure, assigning individuals to populations, studying hybrid zones, and identifying migrants and admixed individuals. We show that the method can produce highly accurate assignments using modest numbers of loci—e.g., seven microsatellite loci in an example using genotype data from an endangered bird species. The software used for this article is available from http://www.stats.ox.ac.uk/~pritch/home.html.

Article

A Platform for the Description, Distribution and Analysis

April 2003

In this paper we suggest the requirements for an open platform designed for the description, distribution and analysis of genetic polymorphism data. This platform is discussed in terms of our implementation of a phenotypic prediction pipeline with general application to the understanding of genetic variation.

Article

Leveraging machine learning to advance genome-wide association studies

January 2021 · International Journal of Data Mining and Bioinformatics

Genome-Wide Association Studies (GWAS) has demonstrated its power in discovering genetic variations to particular traits related to agronomically important features in crops. The typical output of a GWAS program includes a series of Single Nucleotide Polymorphisms (SNPs) and their significance. Currently, there is no standard way to compare results across different programs or to select the most ... [Show full abstract] 'significant' results uniformly and consistently. To obtain a comprehensive and accurate set of SNPs associated with a trait of interest, we present a novel automated pipeline that leverages machine learning for GWAS discoveries. The pipeline first performs population structure analysis, then executes multiple GWAS software and combines their results into a single SNP set. After that, it selects SNPs from the set with high individual and/or joint effects with the Least Absolute Shrinkage and Selection Operator analysis. Finally, the predictivity of the model is assessed using cross-validation.

Article

Leveraging machine learning to advance genome-wide association studies

January 2021 · International Journal of Data Mining and Bioinformatics

Article

Full-text available

Effects of input data quantity on genome-wide association studies (GWAS)

January 2019 · International Journal of Data Mining and Bioinformatics

Preprint

Full-text available

GAPIT Version 3: An Interactive Analytical Tool for Genomic Association and Prediction

December 2018

As the two essential enterprises of genomic research, genome-wide association studies and genomic prediction have benefited, in terms of accuracy and computational efficiency, from continuously improved statistical methods and software packages. To address both types of analyses simultaneously, GAPIT (Genomic Association and Prediction Integrated Tool) has implemented an expanded series of ... [Show full abstract] statistical methods for gene mapping and prediction of phenotypes from genotypes. To assimilate prior knowledge of users, the updated GAPIT also provides the capability to interact with both model selection and result interpretation graphics.