Comparative Study of Statistical Models for Genomic Prediction
Sayanti Guha Majumdar, Anil Rai and Dwijesh Chandra Mishra
ICAR-Indian Agricultural Statistics Research Institute, New Delhi
Received 21 December 2019; Revised 08 January 2020; Accepted 10 January 2020
Corresponding author: Sayanti Guha Majumdar, E-mail: sayanti23gm@gmail.com
SUMMARY
Genomic prediction has been used for breeding of animals and plants with complex quantitative traits by predicting Genomic Estimated Breeding
Values (GEBVs) of target population. The accuracy of genomic prediction depends on various factors including sampling population, genetic
architecture of target species, statistical models, etc. There are large numbers of statistical models for genomic prediction available in the literature.
These models perform differently due to different genetic architecture of the datasets. In this article, the performances of linear least squares regression,
BLUP, LASSO, ridge regression, SpAM, HSIC LASSO, SVM, ANN along with our newly developed integrated model framework have been
evaluated in wheat dataset containing 599 wheat lines and 1279 SNP markers. In general, the performances of SVM, ridge regression and integrated
model framework were found to be superior for genomic prediction. This study will help researchers in the selection of an appropriate statistical method to
predict phenotypic values.
Keywords: ANN, Genomic prediction, Integrated model framework, LASSO, Ridge regression, SVM.
Journal of the Indian Society of Agricultural Statistics 74(2) 2020 91-98; available online at www.isas.org.in/jisas
1. INTRODUCTION
Genomic prediction or genomic selection (GS)
is an emerging eld of genomic-assisted breeding
methodology where whole genome marker data is used
to predict genomic estimated breeding value (GEBV).
The aim of this method is to increase genetic gain by
shortening breeding cycles and increasing accuracy
of prediction of GEBV. There are several statistical
models available in literature for genomic prediction,
viz. least squares regression (LSR), best linear unbiased
prediction (BLUP) (Henderson, 1975), least absolute
shrinkage and selection operator (LASSO) (Tibshirani,
1996), ridge regression (Hoerl and Kennard, 1970),
Sparse Additive Models (SpAM) (Ravikumar et al.,
2009), Hilbert-Schmidt Independence Criterion
LASSO (HSIC LASSO) (Gretton et al., 2005 and
Yamada et al., 2014), support vector machine (SVM)
(Vapnik, 1995), articial neural network (ANN) (Bain,
1873 and James, 1890). Another statistical model
framework has been developed in our previous study
by combining one additive model, i.e. SpAM and one
non-additive model, i.e. HSIC LASSO. The newly
developed model is referred to as the integrated model framework (Guha Majumdar et al., 2019). The prediction accuracy of different models varies with the underlying statistical method: the models differ in their assumptions about the distribution of, and the variance among, the genetic markers used. In the present study, we have compared the performance of all the above-mentioned models for genomic prediction in wheat. These models are described briefly below.
1.1 Linear Least-Squares Regression
In GS, the main goal is to predict the individual’s
breeding value by modeling the relationship between
the individual’s genotype and phenotype. Linear
least-squares regression is the simplest model, which
can be written as

$$y_i = \mu + \sum_{j=1}^{p} x_{ij}\, g_j + e_i, \qquad i = 1, \ldots, n; \; j = 1, \ldots, p,$$

where $i$ indexes individuals and $j$ indexes marker positions/segments, $y_i$ is the phenotypic value for individual $i$, $\mu$ is the overall mean, $x_{ij}$ is an element of the incidence matrix corresponding to marker $j$ and individual $i$, $g_j$ is a random effect associated with marker $j$, and $e_i$ is a random residual which follows $N(0, \sigma_e^2)$.
The basic problem with this model is that it does not work well if the available number of markers (explanatory variables) is greater than the number of individuals available (observations). To overcome this problem, Meuwissen et al. (2001) adopted a stepwise least squares regression procedure for GS. First, least squares regression analysis was performed on each segment (marker) separately using the above model. Then the likelihood of every segment was plotted against the position of the segment, which helped in identifying the segments having significant effects. Finally, the segments having significant effects were used simultaneously by the model to estimate their individual effects. But this approach also has a drawback: it does not fully exploit the whole-genome marker information, as only markers with a significant effect are included in the final model.
1.2 Best Linear Unbiased Prediction (BLUP)
The BLUP theory and the mixed model formulation
were rst described by Henderson (1949), and BLUP
was recommended as a method of GS by Meuwissen
et al. (2001). The random effects model of BLUP (Henderson, 1975) can be written as

$$\boldsymbol{y} = \boldsymbol{1}\mu + \sum_{j=1}^{m} \boldsymbol{Z}_j u_j + \boldsymbol{e},$$

where $\boldsymbol{y}$ is the vector of phenotypic data, $\boldsymbol{1}\mu$ is the overall mean vector, $\boldsymbol{Z}_j$ is the $j$th column of the design matrix $\boldsymbol{Z}$, $u_j$ is the genetic effect associated with the $j$th marker, and $m$ is the number of markers. $\mu$ is the intercept, which is fixed, and $\boldsymbol{u} = (u_1, \ldots, u_m)'$ is the vector of random effects with $E(\boldsymbol{u}) = \boldsymbol{0}$, $Var(\boldsymbol{u}) = \boldsymbol{G}$, and $Var(\boldsymbol{e}) = \boldsymbol{R}$. The intercept $\boldsymbol{1}\mu$ can be replaced by $\boldsymbol{X\beta}$ to include all the fixed effects if other covariates are also available. Then, the model can be written as

$$\boldsymbol{y} = \boldsymbol{X\beta} + \boldsymbol{Zu} + \boldsymbol{e},$$

where $\boldsymbol{\beta}$ is a vector of unknown fixed effects, whose first element is considered the population mean, and $\boldsymbol{X}$ is the incidence matrix which relates $\boldsymbol{\beta}$ to $\boldsymbol{y}$. The above equation is usually known as the mixed model or mixed effects model. The fixed effect vector $\boldsymbol{\beta}$ is estimated by BLUE, whereas BLUP is the predictor of the random effects $\boldsymbol{u}$.

Henderson (1953) proposed that $(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{u}})$ can be obtained by maximizing the joint likelihood of $(\boldsymbol{y}, \boldsymbol{u})$ given by

$$f(\boldsymbol{y}, \boldsymbol{u}) \propto \exp\left\{-\tfrac{1}{2}(\boldsymbol{y} - \boldsymbol{X\beta} - \boldsymbol{Zu})'\boldsymbol{R}^{-1}(\boldsymbol{y} - \boldsymbol{X\beta} - \boldsymbol{Zu}) - \tfrac{1}{2}\boldsymbol{u}'\boldsymbol{G}^{-1}\boldsymbol{u}\right\}.$$

A set of linear equations [Henderson's Mixed Model Equations (MME)] can be obtained by maximizing this likelihood with respect to $(\boldsymbol{\beta}, \boldsymbol{u})$ and equating the derivatives to zero:

$$\begin{bmatrix} \boldsymbol{X}'\boldsymbol{R}^{-1}\boldsymbol{X} & \boldsymbol{X}'\boldsymbol{R}^{-1}\boldsymbol{Z} \\ \boldsymbol{Z}'\boldsymbol{R}^{-1}\boldsymbol{X} & \boldsymbol{Z}'\boldsymbol{R}^{-1}\boldsymbol{Z} + \boldsymbol{G}^{-1} \end{bmatrix} \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\boldsymbol{u}} \end{bmatrix} = \begin{bmatrix} \boldsymbol{X}'\boldsymbol{R}^{-1}\boldsymbol{y} \\ \boldsymbol{Z}'\boldsymbol{R}^{-1}\boldsymbol{y} \end{bmatrix},$$

where $\boldsymbol{G} = \boldsymbol{I}\sigma_u^2$ and $\boldsymbol{R} = \boldsymbol{I}\sigma_e^2$. The BLUE of $\boldsymbol{\beta}$ and the BLUP of $\boldsymbol{u}$ can be obtained by solving the MME. The assumption of Henderson's derivation is that $\boldsymbol{u}$ and $\boldsymbol{e}$ are normally distributed, and it maximizes the joint likelihood of $(\boldsymbol{y}, \boldsymbol{u})$ over the unknowns $\boldsymbol{\beta}$ and $\boldsymbol{u}$.
1.3 Least Absolute Shrinkage and Selection
Operator (LASSO)
The LASSO technique (Tibshirani, 1996) is used for efficient feature selection based on the assumption of linear dependency between input features and output values. In the case of LASSO, the optimization problem is given as

$$\min_{\boldsymbol{\beta}} \; \frac{1}{2}\left\|\boldsymbol{y} - \boldsymbol{X\beta}\right\|_2^2 + \lambda\left\|\boldsymbol{\beta}\right\|_1,$$

where $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)'$ is a regression coefficient vector, $\beta_k$ denotes the regression coefficient of the $k$-th feature, $\|\cdot\|_2$ and $\|\cdot\|_1$ are the $\ell_2$- and $\ell_1$-norms, and $\lambda > 0$ is the regularization parameter. The $\ell_1$-regularizer in LASSO tends to produce a sparse solution, which means that the regression coefficients of non-significant features become zero.
LASSO is specically suitable, when the number
of features is larger than the number of training
samples (Tibshirani, 1996). So, by using LASSO we
can overcome the limitations of linear least-square
regression method. However, this method performs
well for additive eect data only.
1.4 Ridge Regression
In the case of multicollinear marker data, the performance of variable selection methods is generally very poor. To address this problem, we can use a penalized regression model, i.e. the ridge regression of Hoerl and Kennard (1970). Ridge regression minimizes the penalized sum of squares

$$(\boldsymbol{y} - \boldsymbol{X\beta})'(\boldsymbol{y} - \boldsymbol{X\beta}) + \lambda\,\boldsymbol{\beta}'\boldsymbol{\beta},$$

where $\lambda \ge 0$ is the penalty parameter, and the estimate of the regression coefficient vector is given by

$$\hat{\boldsymbol{\beta}}_{ridge} = (\boldsymbol{X}'\boldsymbol{X} + \lambda\boldsymbol{I})^{-1}\boldsymbol{X}'\boldsymbol{y},$$

where $\boldsymbol{I}$ is a $p \times p$ identity matrix. The penalty parameter $\lambda$ can be chosen by several different methods, for example, by plotting $\hat{\boldsymbol{\beta}}_{ridge}$ as a function of $\lambda$ and choosing the smallest $\lambda$ that results in a stable estimate of $\boldsymbol{\beta}$. Hoerl et al. (1975) have proposed another way to choose $\lambda$ using an automated procedure, with the estimate of $\lambda$ given by

$$\hat{\lambda} = \frac{p\,\hat{\sigma}^2}{\hat{\boldsymbol{\beta}}'\hat{\boldsymbol{\beta}}},$$

where $p$ is the number of parameters in the model except the intercept, $\hat{\sigma}^2$ is the residual mean square obtained by linear least squares estimation, and $\hat{\boldsymbol{\beta}}$ is the vector of least squares estimates of the regression coefficients.
The ridge regression estimator of $\boldsymbol{\beta}$ is biased, and this increase in bias is compensated by a decrease in variance; as a result, we get an estimator with a smaller MSE. Another advantage of ridge regression is that it can be used when the available markers outnumber the sample size.
Meuwissen et al. (2001) employed ridge regression
in GS. It was assumed that the marker effects $g_j$ were random and drawn from a normal distribution $g_j \sim N(0, \sigma_g^2)$, where the additive genetic variance among individuals is expressed as $\sigma_a^2 = m\,\sigma_g^2$, with $\sigma_a^2$ the additive genetic variance among individuals and $m$ the number of marker loci (Habier et al., 2007). This method is suitable for data with additive effects, i.e. for linear features only.
1.5 Sparse Additive Models (SpAM)
High-dimensional feature selection can be performed with sparse additive models (SpAM) (Ravikumar et al., 2009). The SpAM optimization problem can be defined as

$$\min_{\boldsymbol{\alpha}^{(1)}, \ldots, \boldsymbol{\alpha}^{(d)}} \; \frac{1}{2}\Big\|\boldsymbol{y} - \sum_{k=1}^{d}\boldsymbol{K}^{(k)}\boldsymbol{\alpha}^{(k)}\Big\|_2^2 + \lambda\sum_{k=1}^{d}\big\|\boldsymbol{K}^{(k)}\boldsymbol{\alpha}^{(k)}\big\|_2,$$

where $\boldsymbol{\alpha}^{(1)}, \ldots, \boldsymbol{\alpha}^{(d)} \in \mathbb{R}^n$ are regression coefficient vectors, $\boldsymbol{K}^{(k)}$ is the Gram matrix with $[\boldsymbol{K}^{(k)}]_{i,j} = K(x_i^{(k)}, x_j^{(k)})$, $K(\cdot,\cdot)$ is a kernel function, $\alpha_i^{(k)}$ is a coefficient for the basis function $K(\cdot, x_i^{(k)})$, and $\lambda > 0$ is a regularization parameter. SpAM is a convex method which can be efficiently optimized by the back-fitting algorithm.
A disadvantage of SpAM is that it can only deal with additive effects; in the presence of epistatic effects in the data, SpAM may fail to select significant markers. Also, SpAM is a computationally expensive procedure.
1.6 HSIC LASSO
A kernelized non-linear LASSO was proposed by Yamada et al. (2014), which is also called HSIC (Hilbert-Schmidt Independence Criterion; Gretton et al., 2005) LASSO. The optimization problem can be expressed as

$$\min_{\boldsymbol{\alpha}} \; \frac{1}{2}\Big\|\bar{\boldsymbol{L}} - \sum_{k=1}^{d}\alpha_k\bar{\boldsymbol{K}}^{(k)}\Big\|_{Frob}^2 + \lambda\|\boldsymbol{\alpha}\|_1 \quad \text{s.t.} \quad \alpha_1, \ldots, \alpha_d \ge 0,$$

where $\|\cdot\|_{Frob}$ is the Frobenius norm, $\bar{\boldsymbol{K}}^{(k)} = \boldsymbol{\Gamma}\boldsymbol{K}^{(k)}\boldsymbol{\Gamma}$ and $\bar{\boldsymbol{L}} = \boldsymbol{\Gamma}\boldsymbol{L}\boldsymbol{\Gamma}$ are centered Gram matrices, $K^{(k)}_{i,j} = K(x_i^{(k)}, x_j^{(k)})$ and $L_{i,j} = L(y_i, y_j)$ are Gram matrices, $K(\cdot,\cdot)$ and $L(\cdot,\cdot)$ are kernel functions, $\boldsymbol{\Gamma} = \boldsymbol{I}_n - \frac{1}{n}\boldsymbol{1}_n\boldsymbol{1}_n'$ is the centering matrix, $\boldsymbol{I}_n$ is the $n$-dimensional identity matrix, and $\boldsymbol{1}_n$ is the $n$-dimensional vector with all ones. A non-negativity constraint is employed in this model so that meaningful features are selected. This model differs from the original formulation of LASSO in that the kernel functions $K$ and $L$ are used and a non-negativity constraint is imposed. The first term in this equation denotes that we are regressing the output kernel matrix $\bar{\boldsymbol{L}}$ by a linear combination of the feature-wise input kernel matrices $\{\bar{\boldsymbol{K}}^{(k)}\}_{k=1}^{d}$.
1.7 Support Vector Machine (SVM)
SVM was proposed by Vapnik in 1995. SVM is
a supervised machine learning technique, originally
used as a classier. To train the SVM, training dataset
is used and the classier produces maximum margin
separation between two classes of observations. The
idea can be used for estimating unknown regression
function. Maenhout et al. (2007) and Long et al. (2011)
have implemented SVM regression for GS in plant
breeding. SVM regression can model the relationship
between the marker genotypes and the phenotypes with
a linear as well as nonlinear mapping function.
Let us consider a training sample $\{(\boldsymbol{x}_i, y_i)\}_{i=1}^{n}$, where $\boldsymbol{x}_i$ is a vector of genotypic values of the markers for individual $i$, and $y_i$ is the phenotypic value for individual $i$. The SVM model describing the relationship between the phenotype and the genotype of an individual can be written as

$$y_i = f(\boldsymbol{x}_i) + e_i, \qquad f(\boldsymbol{x}) = \boldsymbol{w}'\phi(\boldsymbol{x}) + b, \qquad |e_i| \le \varepsilon,$$

where $\varepsilon$ is a constant which reflects the maximum error tolerated while estimating $f$, $\boldsymbol{w}$ is a vector of unknown weights, $\phi(\cdot)$ is a (possibly nonlinear) mapping of the marker genotypes, and $b$ is a constant. The function $f$ can be obtained by minimizing the expression

$$\frac{1}{n}\sum_{i=1}^{n} L\big(y_i, f(\boldsymbol{x}_i)\big) + \lambda\,\|\boldsymbol{w}\|^2,$$

where $L$ denotes the loss function, which may be the squared loss function, the absolute loss function or the $\varepsilon$-insensitive loss function measuring the quality of the estimation, and $\lambda$ is the regularization parameter responsible for the trade-off between the sparsity and the complexity of the model. The norm of the vector $\boldsymbol{w}$ is inversely associated with model complexity. A support vector satisfies $|y_i - f(\boldsymbol{x}_i)| \ge \varepsilon$ by definition.
1.8 Articial Neural networks (ANN)
Neural network (NN) is a nonparametric statistical
method which can model relationship between
genotypes and phenotype with both linear and complex
nonlinear functions. NN mimics the idea of how
neurons in the human brain work and interact, and
conducts computations. NN was first introduced by
Bain (1873) and James (1890). Every unit in NN is
analogous to a brain neuron and they connect among
themselves with several functions which are analogous
to synapses (Hastie et al. 2009). The NN is composed
of three types of layers, viz. an input layer, a hidden
layer and an output layer. This model is known as the
feed-forward NN. The NN which is used to estimate a
regression function usually consists of only one output
layer unit. The hidden layer units are functions of linear
combinations of the inputs, whereas, the output layer
units are functions of the hidden layer units. The output
function of a feed-forward NN can be defined as

$$\hat{y}(\boldsymbol{x}) = b + \sum_{l=1}^{L} w_l\, f\Big(b_l + \sum_{m=1}^{M} v_{lm}\, x_m\Big),$$

where $M$ is the number of units in the input layer, $K$ is the number of output layer units (here $K = 1$), $L$ is the number of hidden layer units, $x_m$ is the $m$th input, $b$ is the intercept, the $w_l$ are the output layer weights connecting the $l$th hidden layer unit to the output layer unit, $f$ is the activation function modeling the connection between the hidden layer and the output layer, and $v_{lm}$ and $b_l$ are the unknown learning parameters of the $l$th hidden layer unit connecting the $m$th neuron in the input layer.
In GS, the marker genotypes are represented by $\boldsymbol{x}_i$, $i = 1, \ldots, n$, where $n$ is the number of individuals in the analysis. The activation function can be chosen as the sigmoid or the Gaussian radial basis function. Gianola et al. (2011) implemented NNs for GS.
1.9 Integrated Model Framework
The integrated model framework (Guha Majumdar
et al., 2019) for estimation of GEBV has been developed
by combining SpAM and HSIC LASSO and can be
used to capture both linear and non-linear effects of the
genetic markers on the phenotypic data. The model can
be expressed as
$$\hat{y} = w\,\hat{y}_S + (1 - w)\,\hat{y}_H,$$

where $\hat{y}$ is the predicted phenotype of the integrated model framework, $w$ is $\sigma_1^2/(\sigma_1^2 + \sigma_2^2)$, where $\sigma_1^2$ and $\sigma_2^2$ are the error variances of the HSIC LASSO and SpAM models respectively, $\hat{y}_S$ is the predicted phenotype from the sparse additive model, and $\hat{y}_H$ is the predicted phenotype from HSIC LASSO. The estimation of $\sigma_1^2$ and $\sigma_2^2$ can be performed by following the refitted cross-validation approach of Fan et al. (2012).

2. MATERIALS AND METHODS

2.1 Data Description

A real wheat dataset was used for implementing the genomic prediction models (Crossa et al., 2010). Genotyping of the wheat lines was done using 1447 Diversity Array Technology (DArT) markers generated by Triticarte Pty. Ltd. (Canberra, Australia; http://www.triticarte.com.au). The dataset includes 599 lines observed for the trait grain yield (GY) in four mega-environments. For the convenience of our study, the GY of the first mega-environment has been considered. After editing, the final number of DArT markers in the dataset was 1279, and these markers have been used in this study.

After implementation of the statistical models, prediction accuracy (PA) and mean squared error (MSE) were estimated for all the models. PA is defined as the correlation between the actual phenotypic values $y$ and the predicted phenotypic values $\hat{y}$ (Howard et al., 2014). MSE can be expressed as

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2,$$

where $\hat{y}_i$ is the predicted value of the phenotype, $y_i$ is the actual value of the phenotype, and $n$ is the number of individuals in the dataset. The statistical software R was used to implement the models. Before the models were implemented, the dataset was split into training and testing data: 80% of the observations were chosen randomly for training, and the remaining 20% were kept as testing data. This splitting procedure was repeated 500 times to obtain 500 training and testing datasets, which were used to implement all the statistical models.
2.2 Implementation in R
Least squares regression (LSR)
To implement least squares regression, the lm function from the stats package (R Core Team, 2019) in R was used. In the first step, a simple linear regression model was fitted for each individual marker, and the 100 most significant markers were selected according to their p-values. Then, in a final model, those 100 markers were included to simultaneously fit a linear regression model. This two-step method was applied because there are more markers than individuals. Finally, phenotypic values were predicted for the testing dataset using the selected marker data and the estimated regression coefficients of the marker effects.
BLUP
BLUP was implemented using the mixed.solve function from the rrBLUP package (Endelman, 2011) in R. The model was fitted using the training data. Then the phenotypic values were predicted using the testing dataset and the predicted marker effects from the fitted model.
LASSO
The glmnet function of the glmnet package (Friedman et al., 2010) in R was used with default parameter values to implement LASSO. The prediction was performed with the predict function of the same package, choosing the penalty that minimizes the cross-validation error.
Ridge Regression
Ridge regression can also be implemented through the glmnet function of the glmnet package (Friedman et al., 2010) in R by setting the value of alpha equal to zero. The prediction on the testing set was then performed using the predict function.
SpAM
The sparse additive model was implemented using the samQL function of the SAM package (Zhao et al., 2014) in R with default parameter values. The predict function of the same package was used to predict the phenotypic values of the testing dataset.
HSIC LASSO
An in-house R function was developed to implement HSIC LASSO, or kernelized LASSO. The penalized function of the penalized package (Goeman, 2010)
has been used to t this kernelized LASSO model.
Then predict function of the same package is used to
predict the phenotypic value of the testing dataset.
SVM
The ksvm function of the kernlab package
(Karatzoglou et al., 2004) in R with the default
parameters was used to perform SVM regression on
the training dataset. After fitting the model, the predict
function was used to obtain the predicted phenotypic
values for the testing set.
Neural network
The NN model was implemented using the brnn function of the brnn package (Pérez-Rodriguez and Gianola, 2013) in R. This function uses a two-layer NN and maps the input information through a set of basis functions. The number of neurons was set to three and the number of epochs used to train the model was 30. The predict.brnn function of the same package was then used to predict the phenotypes of the testing dataset.
Integrated Model Framework
The GSelection package (Guha Majumdar et al., 2019) in R was developed by us to implement the integrated model framework in GS. To fit the model to the training data, the feature.selection function was used. The error variances of the SpAM and HSIC LASSO models were estimated with the spam.var.rcv and hsic.var.rcv functions of the same package. The prediction of the phenotypic values in the testing dataset was then performed using the genomic.prediction function.
3. RESULTS AND DISCUSSION
In this study, various statistical models have been implemented on a real wheat dataset for genomic prediction. LSR, BLUP, LASSO, ridge regression, SpAM, HSIC LASSO, SVM, ANN and the integrated model framework are compared on the basis of their performance in genomic prediction of breeding values. The results are shown in Table 1.
Table 1. Statistical comparison of various models for genomic prediction

Models                       Prediction Accuracy (PA)   Standard Error of PA   MSE
LSR                          0.0476                     0.0142                 2.5199
BLUP                         0.1941                     0.0076                 2.0420
LASSO                        0.4299                     0.0070                 1.8036
Ridge                        0.5253                     0.0058                 1.1740
SpAM                         0.4941                     0.0056                 1.4436
HSIC LASSO                   0.1490                     0.0023                 0.0730
SVM                          0.5784                     0.0053                 1.1667
ANN                          0.4822                     0.0065                 1.4194
Integrated Model Framework   0.4950                     0.0056                 1.3211
It is evident from Table 1 that the newly developed integrated model performed better than LSR, BLUP, LASSO, SpAM, HSIC LASSO and ANN in terms of prediction accuracy. Only one parametric statistical model, i.e. ridge regression, and one non-parametric statistical model, i.e. SVM, have better prediction accuracy than the integrated model. The mean squared error is also lower for SVM and ridge regression than for the integrated model. Another observation is that the mean squared error of HSIC LASSO is much lower than that of the other models. This is because HSIC LASSO is a nonlinear parametric model for which MSE may not be a highly desirable criterion for the evaluation of model performance. The prediction accuracies of the different models are shown with the help of boxplots in Fig. 1.

Fig. 1. Boxplots of prediction accuracy corresponding to different statistical models
It is known from the literature that genetic architecture is responsible for the differences in accuracy of breeding value prediction among the GS methods. The genetic architecture of a population depends on the presence of additive and epistatic genetic effects. The parametric models assume that the markers are independent, i.e. additive in nature. But in practical situations, both additive and epistatic effects are present in the genetic architecture of the population. For this reason, the parametric models, viz. LSR and BLUP, do not perform well in genomic prediction. Ridge regression, however, which is a biased parametric estimator, performs very well in this study. It is also observed that the non-parametric models (viz. SVM and ANN) and the newly developed integrated model perform very well in genomic prediction of breeding values. This is because these models can capture both linear (additive) and non-linear (epistatic) effects of the markers in the dataset.
4. CONCLUSION
The performances of various statistical models for genomic prediction have been compared in the present study. The study was conducted on a real wheat dataset, so this article gives a clear idea of how several statistical models behave in a practical genomic prediction setting, which will help in choosing an appropriate model for a given dataset. The superiority of models like ridge regression, SVM and the integrated model framework has been demonstrated in the above study. The accuracy of these models depends on several factors, including the trait of interest, the extent of additive and epistatic effects present in the dataset, the heritability of the trait, etc. The performances of these models can be improved further if dominance effects and genotype-by-environment interaction are considered in the study.
REFERENCES
Bain, A. (1873). Mind and Body: The Theories of Their Relation. D.
Appleton and Company, New York.
Crossa, J., de los Campos, G., Perez, P., Gianola, D., Burgueño, J.,
et al. (2010). Prediction of Genetic Values of Quantitative Traits in
Plant Breeding Using Pedigree and Molecular Markers, Genetics,
186, 713-724. https://doi.org/10.1534/genetics.110.118521
Endelman, J.B. (2011). Ridge regression and other kernels for genomic
selection with R package rrBLUP, Plant Genome, 4, 250-255.
Fan, J., Guo, S. and Hao, N. (2012). Variance estimation using refitted cross-validation in ultrahigh dimensional regression, Journal of the Royal Statistical Society: Series B, 74(1), 37-65.
Friedman, J., Hastie, T. and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent, Journal of Statistical Software, 33, 1-22. URL http://www.jstatsoft.org/v33/i01/
Gianola, D., Okut, H., Weigel, K.A. and Rosa, G.J.M. (2011). Predicting complex quantitative traits with Bayesian neural networks: a case study with Jersey cows and wheat, BMC Genetics, 12, 87-100.
Goeman, J.J. (2010). L1 penalized estimation in the Cox proportional hazards model, Biometrical Journal, 52(1), 70-84.
Gretton, A., Bousquet, O., Smola, A. and Scholkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms, Algorithmic Learning Theory (ALT), pp. 63-77. Springer.
Guha Majumdar, S., Rai, A. and Mishra, D.C. (2019). Integrated framework for selection of additive and non-additive genetic markers for genomic selection, Journal of Computational Biology. http://doi.org/10.1089/cmb.2019.0223
Guha Majumdar, S., Rai, A. and Mishra, D.C. (2019). GSelection: Genomic Selection. R package version 0.1.0. https://CRAN.R-project.org/package=GSelection
Habier, D., Fernando, R.L. and Dekkers, J.C.M. (2007). The impact
of genetic relationship information on genome-assisted breeding
values, Genetics, 177, 2389-2397.
Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.
Henderson, C.R. (1949). Estimates of changes in herd environment.
Journal of Dairy Science, 32, 706.
Henderson, C.R. (1953). Estimation of Variance and Covariance
Components, Biometrics, 9(2), 226-252.
Henderson, C.R. (1975). Best linear unbiased estimation and prediction
under a selection model, Biometrics, 31(2), 423-447.
Hoerl, A.E. and Kennard, R.W. (1970). Ridge regression: biased estimation for nonorthogonal problems, Technometrics, 12, 55-67.
Hoerl, A.E. and Kennard, R.W. (1970). Ridge regression: applications to nonorthogonal problems, Technometrics, 12, 69-82.
Hoerl, A.E., Kennard, R.W. and Baldwin, K.F. (1975). Ridge regression: some simulations, Communications in Statistics: Theory and Methods, 4(2), 105-123.
Howard, R., Carriquiry, A.L. and Beavis, W.D. (2014). Parametric and nonparametric statistical methods for genomic selection of traits with additive and epistatic genetic architectures, G3 (Bethesda), 4(6), 1027-1046.
James, W. (1890). The Principles of Psychology. H. Holt and Company, New York.
Karatzoglou, A., Smola, A., Hornik, K. and Zeileis, A. (2004). kernlab - An S4 package for kernel methods in R, Journal of Statistical Software, 11, 1-20. URL http://www.jstatsoft.org/v11/i09/
Long, N., Gianola, D., Rosa, G.J.M. and Weigel, K.A. (2011).
Application of support vector regression to genome-assisted
prediction of quantitative traits, Theoretical and Applied Genetics,
123, 1065-1074.
Maenhout, S., Baets, B.D., Haesaert, G. and Bockstaele, E.V. (2007).
Support vector machine regression for the prediction of maize
hybrid performance, Theoretical and applied genetics, 115(7),
1003-1013. doi: 10.1007/s00122-007-0627-9.
Meuwissen, T.H.E., Hayes, B.J. and Goddard, M.E. (2001). Prediction of total genetic value using genome-wide dense marker maps, Genetics, 157, 1819-1829.
Pérez-Rodriguez, P. and Gianola, D. (2013). brnn: Bayesian regularization for feed-forward neural networks, R package version 0.3. http://CRAN.R-project.org/package=brnn
R Core Team (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/
Ravikumar, P., Lafferty, J., Liu, H. and Wasserman, L. (2009). Sparse additive models, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5), 1009-1030.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso, Journal of the Royal Statistical Society: Series B (Methodological), 58, 267-288.
Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer, New York.
Yamada, M., Jitkrittum, W., Sigal, L., Xing, E.P. and Sugiyama, M.
(2014). High-Dimensional Feature Selection by Feature-Wise
Kernelized Lasso, Neural Computation, 26, 185-207.
Zhao, T., Li, X., Liu, H. and Roeder, K. (2014). SAM: Sparse Additive Modelling, R package version 1.0.5. https://CRAN.R-project.org/package=SAM