Comparative Study of Statistical Models for Genomic Prediction
Sayanti Guha Majumdar, Anil Rai and Dwijesh Chandra Mishra
ICAR-Indian Agricultural Statistics Research Institute, New Delhi
Received 21 December 2019; Revised 08 January 2020; Accepted 10 January 2020
Corresponding author: Sayanti Guha Majumdar, E-mail: sayanti23gm@gmail.com
SUMMARY
Genomic prediction has been used for breeding of animals and plants with complex quantitative traits by predicting Genomic Estimated Breeding
Values (GEBVs) of target population. The accuracy of genomic prediction depends on various factors including sampling population, genetic
architecture of target species, statistical models, etc. There are large numbers of statistical models for genomic prediction available in the literature.
These models perform differently due to different genetic architecture of the datasets. In this article, the performances of linear least squares regression,
BLUP, LASSO, ridge regression, SpAM, HSIC LASSO, SVM, ANN along with our newly developed integrated model framework have been
evaluated in wheat dataset containing 599 wheat lines and 1279 SNP markers. In general, the performances of SVM, ridge regression and integrated
model framework were found to be superior for genomic prediction. This study will help researchers in the selection of an appropriate statistical method to
predict phenotypic values.
Keywords: ANN, Genomic prediction, Integrated model framework, LASSO, Ridge regression, SVM.
Journal of the Indian Society of Agricultural Statistics 74(2) 2020 91-98; available online at www.isas.org.in/jisas
1. INTRODUCTION
Genomic prediction or genomic selection (GS)
is an emerging eld of genomic-assisted breeding
methodology where whole genome marker data is used
to predict genomic estimated breeding value (GEBV).
The aim of this method is to increase genetic gain by
shortening breeding cycles and increasing accuracy
of prediction of GEBV. There are several statistical
models available in literature for genomic prediction,
viz. least squares regression (LSR), best linear unbiased
prediction (BLUP) (Henderson, 1975), least absolute
shrinkage and selection operator (LASSO) (Tibshirani,
1996), ridge regression (Hoerl and Kennard, 1970),
Sparse Additive Models (SpAM) (Ravikumar et al.,
2009), Hilbert-Schmidt Independence Criterion
LASSO (HSIC LASSO) (Gretton et al., 2005 and
Yamada et al., 2014), support vector machine (SVM)
(Vapnik, 1995), articial neural network (ANN) (Bain,
1873 and James, 1890). Another statistical model
framework has been developed in our previous study
by combining one additive model, i.e. SpAM and one
non-additive model, i.e. HSIC LASSO. The newly
developed model is referred to as the integrated model framework (Guha Majumdar et al., 2019). The prediction accuracy of different models varies with the underlying statistical method: the models differ in their assumptions about the distribution of, and the variance among, the genetic markers used. In the present study, we have compared the performance of all the above-mentioned models for genomic prediction in wheat. These models are described briefly below.
1.1 Linear Least-Squares Regression
In GS, the main goal is to predict the individual’s
breeding value by modeling the relationship between
the individual’s genotype and phenotype. Linear
least-squares regression is the simplest model, which
can be written as

$$y_i = \mu + \sum_{j=1}^{p} x_{ij}\, g_j + e_i, \qquad i = 1, \ldots, n; \; j = 1, \ldots, p,$$

where $i$ indexes individuals and $j$ indexes marker positions/segments, $y_i$ is the phenotypic value for individual $i$, $\mu$ is the overall mean, $x_{ij}$ is an element of the incidence matrix corresponding to marker $j$ and individual $i$, $g_j$ is a random effect associated with marker $j$, and $e_i$ is a random residual which follows $N(0, \sigma_e^2)$.
The basic problem with this model is that it does not work well if the available number of markers (explanatory variables) is greater than the number of individuals available (observations). To overcome this problem, Meuwissen et al. (2001) adopted a stepwise least squares regression procedure for GS. First, least squares regression analysis was performed on each segment (marker) separately using the above model. Then the likelihood of every segment was plotted against the position of the segment, which helped in identifying the segments having significant effects. Finally, the segments having significant effects were used simultaneously by the model to estimate their individual effects. But this approach also has a drawback: it does not fully exploit the whole-genome marker information, as only markers with a significant effect are included in the final model.
1.2 Best Linear Unbiased Prediction (BLUP)
The BLUP theory and the mixed model formulation
were rst described by Henderson (1949), and BLUP
was recommended as a method of GS by Meuwissen
et al. (2001). The random effects model of BLUP (Henderson, 1975) can be written as

$$\boldsymbol{y} = \boldsymbol{1}\mu + \sum_{j=1}^{m} \boldsymbol{Z}_j u_j + \boldsymbol{e},$$

where $\boldsymbol{y}$ is the vector of phenotypic data, $\boldsymbol{1}\mu$ is the overall mean vector, $\boldsymbol{Z}_j$ is the $j$th column of the design matrix $\boldsymbol{Z}$, $u_j$ is the genetic effect associated with the $j$th marker, and $m$ is the number of markers. $\mu$ is the intercept, which is fixed, and $\boldsymbol{u} = (u_1, \ldots, u_m)'$ is the vector of random effects with $E(\boldsymbol{u}) = \boldsymbol{0}$, $Var(\boldsymbol{u}) = \boldsymbol{G}$, and $Var(\boldsymbol{e}) = \boldsymbol{R}$. The intercept $\boldsymbol{1}\mu$ can be replaced by $\boldsymbol{X\beta}$ to include all the fixed effects if other covariates are also available. Then, the model can be written as

$$\boldsymbol{y} = \boldsymbol{X\beta} + \boldsymbol{Zu} + \boldsymbol{e},$$

where $\boldsymbol{\beta}$ is a vector of unknown fixed effects, whose first element is considered the population mean, and $\boldsymbol{X}$ is the incidence matrix which relates $\boldsymbol{\beta}$ to $\boldsymbol{y}$. The above equation is usually known as the mixed model or mixed effects model. The fixed effect vector $\boldsymbol{\beta}$ is estimated by BLUE, whereas BLUP is the predictor of the random effects $\boldsymbol{u}$.

Henderson (1953) proposed that $(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{u}})$ can be obtained by maximizing the joint likelihood of $(\boldsymbol{y}, \boldsymbol{u})$ given by

$$f(\boldsymbol{y}, \boldsymbol{u}) \propto \exp\left\{-\tfrac{1}{2}(\boldsymbol{y} - \boldsymbol{X\beta} - \boldsymbol{Zu})'\boldsymbol{R}^{-1}(\boldsymbol{y} - \boldsymbol{X\beta} - \boldsymbol{Zu}) - \tfrac{1}{2}\boldsymbol{u}'\boldsymbol{G}^{-1}\boldsymbol{u}\right\}.$$

A set of linear equations [Henderson's Mixed Model Equations (MME)] can be obtained by maximizing this likelihood with respect to $(\boldsymbol{\beta}, \boldsymbol{u})$ and equating the derivatives to zero:

$$\begin{bmatrix} \boldsymbol{X}'\boldsymbol{R}^{-1}\boldsymbol{X} & \boldsymbol{X}'\boldsymbol{R}^{-1}\boldsymbol{Z} \\ \boldsymbol{Z}'\boldsymbol{R}^{-1}\boldsymbol{X} & \boldsymbol{Z}'\boldsymbol{R}^{-1}\boldsymbol{Z} + \boldsymbol{G}^{-1} \end{bmatrix} \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\boldsymbol{u}} \end{bmatrix} = \begin{bmatrix} \boldsymbol{X}'\boldsymbol{R}^{-1}\boldsymbol{y} \\ \boldsymbol{Z}'\boldsymbol{R}^{-1}\boldsymbol{y} \end{bmatrix},$$

where $\boldsymbol{G} = \boldsymbol{I}\sigma_u^2$ and $\boldsymbol{R} = \boldsymbol{I}\sigma_e^2$. The BLUE of $\boldsymbol{\beta}$ and the BLUP of $\boldsymbol{u}$ can be obtained by solving the MME. The assumption of Henderson's derivation is that $\boldsymbol{u}$ and $\boldsymbol{e}$ are normally distributed, and it maximizes the joint likelihood of $(\boldsymbol{y}, \boldsymbol{u})$ over the unknowns $\boldsymbol{\beta}$ and $\boldsymbol{u}$.
1.3 Least Absolute Shrinkage and Selection
Operator (LASSO)
The LASSO technique (Tibshirani, 1996) is used for efficient feature selection based on the assumption of linear dependency between input features and output values. In the case of LASSO, the optimization problem is given as

$$\min_{\boldsymbol{\beta}} \; \frac{1}{2}\left\|\boldsymbol{y} - \boldsymbol{X\beta}\right\|_2^2 + \lambda\left\|\boldsymbol{\beta}\right\|_1,$$

where $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)'$ is a regression coefficient vector, $\beta_k$ denotes the regression coefficient of the $k$-th feature, $\|\cdot\|_2$ and $\|\cdot\|_1$ are the $\ell_2$- and $\ell_1$-norms, and $\lambda > 0$ is the regularization parameter. The $\ell_1$-regularizer in LASSO tends to produce a sparse solution, which means that the regression coefficients of non-significant features become zero.
LASSO is specically suitable, when the number
of features is larger than the number of training
samples (Tibshirani, 1996). So, by using LASSO we
can overcome the limitations of linear least-square
regression method. However, this method performs
well for additive eect data only.
1.4 Ridge Regression
In the case of multicollinear marker data, the performance of variable selection methods is generally very poor. To address this problem, we can use a penalized regression model, i.e. the ridge regression of Hoerl and Kennard (1970). Ridge regression minimizes the penalized sum of squares

$$(\boldsymbol{y} - \boldsymbol{X\beta})'(\boldsymbol{y} - \boldsymbol{X\beta}) + \lambda\,\boldsymbol{\beta}'\boldsymbol{\beta},$$

where $\lambda \ge 0$ is the penalty parameter, and the estimate of the regression coefficient vector is given by

$$\hat{\boldsymbol{\beta}}_{ridge} = (\boldsymbol{X}'\boldsymbol{X} + \lambda\boldsymbol{I})^{-1}\boldsymbol{X}'\boldsymbol{y},$$

where $\boldsymbol{I}$ is a $p \times p$ identity matrix. The penalty parameter $\lambda$ can be chosen by several different methods, for example, by plotting $\hat{\boldsymbol{\beta}}_{ridge}$ as a function of $\lambda$ and choosing the smallest $\lambda$ that results in a stable estimate of $\boldsymbol{\beta}$. Hoerl et al. (1975) have proposed another way to choose $\lambda$ using an automated procedure, with the estimate of $\lambda$ given by

$$\hat{\lambda} = \frac{p\,\hat{\sigma}^2}{\hat{\boldsymbol{\beta}}'\hat{\boldsymbol{\beta}}},$$

where $p$ is the number of parameters in the model except the intercept, $\hat{\sigma}^2$ is the residual mean square obtained by linear least squares estimation, and $\hat{\boldsymbol{\beta}}$ is the vector of least squares estimates of the regression coefficients.
The ridge regression estimator of $\boldsymbol{\beta}$ is biased, and this increase in bias is compensated by a decrease in variance; as a result, we get an estimator with a smaller MSE. Another advantage of ridge regression is that it can be used when the available markers outnumber the sample size.
Meuwissen et al. (2001) employed ridge regression
in GS. It was assumed that the marker effects $g_j$ were random and drawn from a normal distribution $g_j \sim N(0, \sigma_g^2)$, where the additive genetic variance among individuals is expressed as $\sigma_a^2 = m\,\sigma_g^2$, with $\sigma_a^2$ the additive genetic variance among individuals and $m$ the number of marker loci (Habier et al., 2007). This method is suitable for data with additive effects, i.e. for linear features only.
1.5 Sparse Additive Models (SpAM)
High-dimensional feature selection can be performed with sparse additive models (SpAM) (Ravikumar et al., 2009). The SpAM optimization problem can be defined as

$$\min_{\boldsymbol{\alpha}^{(1)}, \ldots, \boldsymbol{\alpha}^{(d)}} \; \frac{1}{2}\Big\|\boldsymbol{y} - \sum_{k=1}^{d}\boldsymbol{K}^{(k)}\boldsymbol{\alpha}^{(k)}\Big\|_2^2 + \lambda\sum_{k=1}^{d}\big\|\boldsymbol{K}^{(k)}\boldsymbol{\alpha}^{(k)}\big\|_2,$$

where $\boldsymbol{\alpha}^{(1)}, \ldots, \boldsymbol{\alpha}^{(d)} \in \mathbb{R}^n$ are regression coefficient vectors, $\boldsymbol{K}^{(k)}$ is the Gram matrix with $[\boldsymbol{K}^{(k)}]_{i,j} = K(x_i^{(k)}, x_j^{(k)})$, $K(\cdot,\cdot)$ is a kernel function, $\alpha_i^{(k)}$ is a coefficient for the basis function $K(\cdot, x_i^{(k)})$, and $\lambda > 0$ is a regularization parameter. SpAM is a convex method which can be efficiently optimized by the back-fitting algorithm.
A disadvantage of SpAM is that it can only deal with additive effects; in the presence of epistatic effects in the data, SpAM may fail to select significant markers. Also, SpAM is a computationally expensive procedure.
1.6 HSIC LASSO
A kernelized non-linear LASSO was proposed by Yamada et al. (2014), which is also called HSIC (Hilbert-Schmidt Independence Criterion; Gretton et al., 2005) LASSO. The optimization problem can be expressed as

$$\min_{\boldsymbol{\alpha}} \; \frac{1}{2}\Big\|\bar{\boldsymbol{L}} - \sum_{k=1}^{d}\alpha_k\bar{\boldsymbol{K}}^{(k)}\Big\|_{Frob}^2 + \lambda\|\boldsymbol{\alpha}\|_1 \quad \text{s.t.} \quad \alpha_1, \ldots, \alpha_d \ge 0,$$

where $\|\cdot\|_{Frob}$ is the Frobenius norm, $\bar{\boldsymbol{K}}^{(k)} = \boldsymbol{\Gamma}\boldsymbol{K}^{(k)}\boldsymbol{\Gamma}$ and $\bar{\boldsymbol{L}} = \boldsymbol{\Gamma}\boldsymbol{L}\boldsymbol{\Gamma}$ are centered Gram matrices, $K^{(k)}_{i,j} = K(x_i^{(k)}, x_j^{(k)})$ and $L_{i,j} = L(y_i, y_j)$ are Gram matrices, $K(\cdot,\cdot)$ and $L(\cdot,\cdot)$ are kernel functions, $\boldsymbol{\Gamma} = \boldsymbol{I}_n - \frac{1}{n}\boldsymbol{1}_n\boldsymbol{1}_n'$ is the centering matrix, $\boldsymbol{I}_n$ is the $n$-dimensional identity matrix, and $\boldsymbol{1}_n$ is the $n$-dimensional vector with all ones. A non-negativity constraint is employed in this model so that meaningful features are selected. This model differs from the original formulation of LASSO in that the kernel functions $K$ and $L$ are used and a non-negativity constraint is imposed. The first term in this equation denotes that we are regressing the output kernel matrix $\bar{\boldsymbol{L}}$ by a linear combination of the feature-wise input kernel matrices $\{\bar{\boldsymbol{K}}^{(k)}\}_{k=1}^{d}$.
1.7 Support Vector Machine (SVM)
SVM was proposed by Vapnik in 1995. SVM is
a supervised machine learning technique, originally
used as a classier. To train the SVM, training dataset
is used and the classier produces maximum margin
separation between two classes of observations. The
idea can be used for estimating unknown regression
function. Maenhout et al. (2007) and Long et al. (2011)
have implemented SVM regression for GS in plant
breeding. SVM regression can model the relationship
between the marker genotypes and the phenotypes with
a linear as well as nonlinear mapping function.
Let us consider a training sample $\{(\boldsymbol{x}_i, y_i)\}_{i=1}^{n}$, where $\boldsymbol{x}_i$ is a vector of genotypic values of the markers for individual $i$, and $y_i$ is the phenotypic value for individual $i$. The SVM model describing the relationship between the phenotype and the genotype of an individual can be written as

$$y_i = f(\boldsymbol{x}_i) + e_i, \qquad f(\boldsymbol{x}) = \boldsymbol{w}'\phi(\boldsymbol{x}) + b, \qquad |e_i| \le \varepsilon,$$

where $\varepsilon$ is a constant which reflects the maximum error tolerated while estimating $f$, $\boldsymbol{w}$ is a vector of unknown weights, $\phi(\cdot)$ is a (possibly nonlinear) mapping of the marker genotypes, and $b$ is a constant. The function $f$ can be obtained by minimizing the expression

$$\frac{1}{n}\sum_{i=1}^{n} L\big(y_i, f(\boldsymbol{x}_i)\big) + \lambda\,\|\boldsymbol{w}\|^2,$$

where $L$ denotes the loss function, which may be the squared loss function, the absolute loss function or the $\varepsilon$-insensitive loss function measuring the quality of the estimation, and $\lambda$ is the regularization parameter responsible for the trade-off between the sparsity and the complexity of the model. The norm of the vector $\boldsymbol{w}$ is inversely associated with model complexity. A support vector satisfies $|y_i - f(\boldsymbol{x}_i)| \ge \varepsilon$ by definition.
1.8 Articial Neural networks (ANN)
Neural network (NN) is a nonparametric statistical
method which can model relationship between
genotypes and phenotype with both linear and complex
nonlinear functions. NN mimics the idea of how
neurons in the human brain work and interact, and
conducts computations. NN was first introduced by
Bain (1873) and James (1890). Every unit in NN is
analogous to a brain neuron and they connect among
themselves with several functions which are analogous
to synapses (Hastie et al. 2009). The NN is composed
of three types of layers, viz. an input layer, a hidden
layer and an output layer. This model is known as the
feed-forward NN. The NN which is used to estimate a
regression function usually consists of only one output
layer unit. The hidden layer units are functions of linear
combinations of the inputs, whereas, the output layer
units are functions of the hidden layer units. The output
function of a feed-forward NN can be defined as

$$\hat{y}(\boldsymbol{x}) = b + \sum_{l=1}^{L} w_l\, f\Big(b_l + \sum_{m=1}^{M} v_{lm}\, x_m\Big),$$

where $M$ is the number of units in the input layer, $K$ is the number of output layer units (here $K = 1$), $L$ is the number of hidden layer units, $x_m$ is the $m$th input, $b$ is the intercept, the $w_l$ are the output layer weights connecting the $l$th hidden layer unit to the output layer unit, $f$ is the activation function modeling the connection between the hidden layer and the output layer, and $v_{lm}$ and $b_l$ are the unknown learning parameters of the $l$th hidden layer unit connecting the $m$th neuron in the input layer.
In GS, the marker genotypes are represented by $\boldsymbol{x}_i$, $i = 1, \ldots, n$, where $n$ is the number of individuals in the analysis. The activation function can be chosen as the sigmoid or the Gaussian radial basis function. Gianola et al. (2011) implemented NNs for GS.
1.9 Integrated Model Framework
The integrated model framework (Guha Majumdar
et al., 2019) for estimation of GEBV has been developed
by combining SpAM and HSIC LASSO and can be
used to capture both linear and non-linear effects of the
genetic markers on the phenotypic data. The model can
be expressed as
$$\hat{y} = w\,\hat{y}_S + (1 - w)\,\hat{y}_H,$$

where $\hat{y}$ is the predicted phenotype of the integrated model framework, $w$ is $\sigma_1^2/(\sigma_1^2 + \sigma_2^2)$, where $\sigma_1^2$ and $\sigma_2^2$ are the error variances of the HSIC LASSO and SpAM models respectively, $\hat{y}_S$ is the predicted phenotype from the sparse additive model, and $\hat{y}_H$ is the predicted phenotype from HSIC LASSO. The estimation of $\sigma_1^2$ and $\sigma_2^2$ can be performed by following the refitted cross-validation approach of Fan et al. (2012).

2. MATERIALS AND METHODS

2.1 Data Description

A real wheat dataset was used for implementing the genomic prediction models (Crossa et al., 2010). Genotyping of the wheat lines was done using 1447 Diversity Array Technology (DArT) markers generated by Triticarte Pty. Ltd. (Canberra, Australia; http://www.triticarte.com.au). The dataset includes 599 lines observed for the trait grain yield (GY) in four mega-environments. For the convenience of our study, the GY of the first mega-environment has been considered. After editing, the final number of DArT markers in the dataset was 1279, and these markers have been used in this study.

After implementation of the statistical models, prediction accuracy (PA) and mean squared error (MSE) were estimated for all the models. PA is defined as the correlation between the actual phenotypic values $y$ and the predicted phenotypic values $\hat{y}$ (Howard et al., 2014). MSE can be expressed as

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2,$$

where $\hat{y}_i$ is the predicted value of the phenotype, $y_i$ is the actual value of the phenotype, and $n$ is the number of individuals in the dataset. The statistical software R was used to implement the models. Before the models were implemented, the dataset was split into training and testing data: 80% of the observations were chosen randomly for training, and the remaining 20% were kept as testing data. This splitting procedure was repeated 500 times to obtain 500 training and testing datasets, which were used to implement all the statistical models.
2.2 Implementation in R
Least squares regression (LSR)
To implement least squares regression, the lm function from the stats package (R Core Team, 2019) in R was used. In the first step, a simple linear regression model was fitted for each individual marker, and the 100 most significant markers were selected according to their p-values. Then, in a final model, those 100 markers were included to simultaneously fit a linear regression model. This two-step method was applied because there are more markers than individuals. Finally, phenotypic values were predicted for the testing dataset using the selected marker data and the estimated regression coefficients of the marker effects.
BLUP
BLUP was implemented using the mixed.solve function from the rrBLUP package (Endelman, 2011) in R. The model was fitted using the training data. Then the phenotypic values were predicted using the testing dataset and the predicted marker effects from the fitted model.
LASSO
The glmnet function of the glmnet package (Friedman et al., 2010) in R was used with default parameter values to implement LASSO. The prediction was performed with the predict function of the same package, choosing the penalty that minimizes the cross-validation error.
Ridge Regression
Ridge regression can also be implemented through the glmnet function of the glmnet package (Friedman et al., 2010) in R by setting the value of alpha equal to zero. The prediction on the testing set was then performed using the predict function.
SpAM
The sparse additive model was implemented using the samQL function of the SAM package (Zhao et al., 2014) in R with default parameter values. The predict function of the same package was used to predict the phenotypic values of the testing dataset.
HSIC LASSO
An in-house R function was developed to implement HSIC LASSO, or kernelized LASSO. The penalized function of the penalized package (Goeman, 2010)
has been used to t this kernelized LASSO model.
Then predict function of the same package is used to
predict the phenotypic value of the testing dataset.
SVM
The ksvm function of the kernlab package
(Karatzoglou et al., 2004) in R with the default
parameters was used to perform SVM regression on
the training dataset. After fitting the model, the predict
function was used to obtain the predicted phenotypic
values for the testing set.
Neural network
The NN model was implemented using the brnn function of the brnn package (Pérez-Rodriguez and Gianola, 2013) in R. This function uses a two-layer NN and maps the input information through a set of basis functions. The number of neurons was set to three and the number of epochs used to train the model was 30. The predict.brnn function of the same package was then used to predict the phenotypes of the testing dataset.
Integrated Model Framework
The GSelection package (Guha Majumdar et al., 2019) in R was developed by us to implement the integrated model framework in GS. To fit the model to the training data, the feature.selection function was used. The error variances of the SpAM and HSIC LASSO models were estimated with the spam.var.rcv and hsic.var.rcv functions of the same package. The prediction of the phenotypic values in the testing dataset was then performed using the genomic.prediction function.
3. RESULTS AND DISCUSSION
In this study, various statistical models have been implemented on a real wheat dataset for genomic prediction. LSR, BLUP, LASSO, ridge regression, SpAM, HSIC LASSO, SVM, ANN and the integrated model framework are compared on the basis of their performance in genomic prediction of breeding values. The results are shown in Table 1.
Table 1. Statistical comparison of various models for genomic prediction

Models                       Prediction Accuracy (PA)   Standard Error of PA   MSE
LSR                          0.0476                     0.0142                 2.5199
BLUP                         0.1941                     0.0076                 2.0420
LASSO                        0.4299                     0.0070                 1.8036
Ridge                        0.5253                     0.0058                 1.1740
SpAM                         0.4941                     0.0056                 1.4436
HSIC LASSO                   0.1490                     0.0023                 0.0730
SVM                          0.5784                     0.0053                 1.1667
ANN                          0.4822                     0.0065                 1.4194
Integrated Model Framework   0.4950                     0.0056                 1.3211
It is evident from Table 1 that the newly developed integrated model performed better than LSR, BLUP, LASSO, SpAM, HSIC LASSO and ANN in terms of prediction accuracy. Only one parametric statistical model, i.e. ridge regression, and one non-parametric statistical model, i.e. SVM, have better prediction accuracy than the integrated model. The mean squared error is also lower for SVM and ridge regression than for the integrated model. Another observation is that the mean squared error of HSIC LASSO is much lower than that of the other models. This is because HSIC LASSO is a nonlinear parametric model for which MSE may not be a highly desirable criterion for the evaluation of model performance. The prediction accuracies of the different models are shown with the help of boxplots in Fig. 1.

Fig. 1. Boxplots of prediction accuracy corresponding to different statistical models
It is known from the literature that genetic architecture is responsible for the differences in accuracy of breeding value prediction among the GS methods. The genetic architecture of a population depends on the presence of additive and epistatic genetic effects. The parametric models assume that the markers are independent, i.e. additive in nature. But in practical situations, both additive and epistatic effects are present in the genetic architecture of the population. For this reason, the parametric models, viz. LSR and BLUP, do not perform well in genomic prediction. Ridge regression, however, which is a biased parametric estimator, performs very well in this study. It is also observed that the non-parametric models (viz. SVM and ANN) and the newly developed integrated model perform very well in genomic prediction of breeding values. This is because these models can capture both linear (additive) and non-linear (epistatic) effects of the markers in the dataset.
4. CONCLUSION
The performances of various statistical models for genomic prediction have been compared in the present study. The study was conducted on a real wheat dataset, so this article gives a clear idea of how several statistical models behave in a practical genomic prediction setting, which will help in choosing an appropriate model for a given dataset. The superiority of models like ridge regression, SVM and the integrated model framework has been demonstrated in the above study. The accuracy of these models depends on several factors, including the trait of interest, the extent of additive and epistatic effects present in the dataset, the heritability of the trait, etc. The performances of these models can be improved further if dominance effects and genotype-by-environment interaction are considered in the study.
REFERENCES
Bain, A. (1873). Mind and Body: The Theories of Their Relation. D.
Appleton and Company, New York.
Crossa, J., de los Campos, G., Perez, P., Gianola, D., Burgueño, J.,
et al. (2010). Prediction of Genetic Values of Quantitative Traits in
Plant Breeding Using Pedigree and Molecular Markers, Genetics,
186, 713-724. https://doi.org/10.1534/genetics.110.118521
Endelman, J.B. (2011). Ridge regression and other kernels for genomic
selection with R package rrBLUP, Plant Genome, 4, 250-255.
Fan, J., Guo, S. and Hao, N. (2012). Variance estimation using refitted cross-validation in ultrahigh dimensional regression, Journal of the Royal Statistical Society: Series B, 74(1), 37-65.
Friedman, J., Hastie, T. and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent, Journal of Statistical Software, 33, 1-22. URL http://www.jstatsoft.org/v33/i01/
Gianola, D., Okut, H., Weigel, K.A. and Rosa, G.J.M. (2011). Predicting complex quantitative traits with Bayesian neural networks: a case study with Jersey cows and wheat, BMC Genetics, 12, 87-100.
Goeman, J.J. (2010). L1 penalized estimation in the Cox proportional hazards model, Biometrical Journal, 52(1), 70-84.
Gretton, A., Bousquet, O., Smola, A. and Scholkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms, Algorithmic Learning Theory (ALT), pp. 63-77. Springer.
Guha Majumdar, S., Rai, A. and Mishra, D.C. (2019). Integrated framework for selection of additive and non-additive genetic markers for genomic selection, Journal of Computational Biology. http://doi.org/10.1089/cmb.2019.0223
Guha Majumdar, S., Rai, A. and Mishra, D.C. (2019). GSelection: Genomic Selection. R package version 0.1.0. https://CRAN.R-project.org/package=GSelection
Habier, D., Fernando, R.L. and Dekkers, J.C.M. (2007). The impact
of genetic relationship information on genome-assisted breeding
values, Genetics, 177, 2389-2397.
Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.
Henderson, C.R. (1949). Estimates of changes in herd environment.
Journal of Dairy Science, 32, 706.
Henderson, C.R. (1953). Estimation of Variance and Covariance
Components, Biometrics, 9(2), 226-252.
Henderson, C.R. (1975). Best linear unbiased estimation and prediction
under a selection model, Biometrics, 31(2), 423-447.
Hoerl, A.E. and Kennard, R.W. (1970). Ridge regression: biased estimation for nonorthogonal problems, Technometrics, 12, 55-67.
Hoerl, A.E. and Kennard, R.W. (1970). Ridge regression: applications to nonorthogonal problems, Technometrics, 12, 69-82.
Hoerl, A.E., Kennard, R.W. and Baldwin, K.F. (1975). Ridge regression: some simulations, Communications in Statistics: Theory and Methods, 4(2), 105-123.
Howard, R., Carriquiry, A.L. and Beavis, W.D. (2014). Parametric and nonparametric statistical methods for genomic selection of traits with additive and epistatic genetic architectures, G3 (Bethesda), 4(6), 1027-1046.
James, W. (1890). The Principles of Psychology. H. Holt and Company, New York.
Karatzoglou, A., Smola, A., Hornik, K. and Zeileis, A. (2004). kernlab - An S4 package for kernel methods in R, Journal of Statistical Software, 11, 1-20. URL http://www.jstatsoft.org/v11/i09/
Long, N., Gianola, D., Rosa, G.J.M. and Weigel, K.A. (2011).
Application of support vector regression to genome-assisted
prediction of quantitative traits, Theoretical and Applied Genetics,
123, 1065-1074.
Maenhout, S., Baets, B.D., Haesaert, G. and Bockstaele, E.V. (2007).
Support vector machine regression for the prediction of maize
hybrid performance, Theoretical and applied genetics, 115(7),
1003-1013. doi: 10.1007/s00122-007-0627-9.
Meuwissen, T.H.E., Hayes, B.J. and Goddard, M.E. (2001). Prediction of total genetic value using genome-wide dense marker maps, Genetics, 157, 1819-1829.
Pérez-Rodriguez, P. and Gianola, D. (2013). brnn: Bayesian regularization for feed-forward neural networks, R package version 0.3. http://CRAN.R-project.org/package=brnn
R Core Team (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/
Ravikumar, P., Lafferty, J., Liu, H. and Wasserman, L. (2009). Sparse additive models, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5), 1009-1030.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso, Journal of the Royal Statistical Society: Series B (Methodological), 58, 267-288.
Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer, New York.
Yamada, M., Jitkrittum, W., Sigal, L., Xing, E.P. and Sugiyama, M.
(2014). High-Dimensional Feature Selection by Feature-Wise
Kernelized Lasso, Neural Computation, 26, 185-207.
Zhao, T., Li, X., Liu, H. and Roeder, K. (2014). SAM: Sparse Additive Modelling, R package version 1.0.5. https://CRAN.R-project.org/package=SAM