Ridge Regression and Extensions for Genomewide Selection in Maize

H. P. Piepho*

ABSTRACT
This paper reviews properties of ridge regression for genomewide (genomic) selection and establishes close relationships with other methods to model genetic correlation among relatives, including use of a kinship matrix and the simple matching coefficient as computed from marker data. A number of alternative models are then proposed exploiting ties between genetic correlation based on marker data and geostatistical concepts. A simple method for automatic marker selection is proposed. The methods are exemplified using a series of experiments with test-cross hybrids of maize (Zea mays L.) conducted in five environments. Results underline the need to appropriately model genotype–environment interaction and to employ an independent estimate of error. It is also shown that accounting for genetic effects not captured by markers may be important.

Institute for Crop Production and Grassland Science, Universität Hohenheim, Fruwirthstrasse 23, 70599 Stuttgart, Germany. Received 13 Oct. 2008. *Corresponding author (piepho@uni-hohenheim.de).

Abbreviations: AIC, Akaike information criterion; BLUP, best linear unbiased prediction; DH, doubled haploids; EXP, exponential model; FA, factor-analytic; GAU, Gaussian model; GCA, general combining ability; GS, genome-wide (genomic) selection; LS, least squares; LV, linear variance; POW, power model; REML, restricted maximum likelihood; RRhet, ridge regression with heterogeneous variance among markers; RRhom, ordinary ridge regression; RRhom2, ridge regression with reduced set of markers; SCA, specific combining ability; SPH, spherical model; SVM, support vector machine.

Published in Crop Sci. 49:1165–1176 (2009). doi: 10.2135/cropsci2008.10.0595. © Crop Science Society of America, 677 S. Segoe Rd., Madison, WI 53711 USA.
Genomewide selection or genomic selection (GS) is a marker-based method for estimating genotypic values without prescreening of markers by significance testing or other subset selection procedures (Whittaker et al., 2000; Meuwissen et al., 2001). Most applications so far have been in animal breeding (Meuwissen et al., 2001; Goddard and Hayes, 2007), but the method is rapidly becoming popular in plant breeding (Bernardo and Yu, 2007). With the development of high-throughput marker technologies, interest in such statistical methods is expected to increase further in the near future.

The key idea is to predict the genotypic value of the ith genotype (i = 1, ..., G), denoted as g_i, using all available markers. One option is to use a regression model of the form

g_i = \sum_{k=1}^{M} u_k z_ik    [1]

where z_ik is a regressor variable for the ith genotype and kth marker, while u_k (k = 1, ..., M) are regression coefficients.
Typically, for a biallelic marker with alleles A_1 and A_2, we define z_ik = 1 for A_1A_1, z_ik = -1 for A_2A_2, and z_ik = 0 for A_1A_2 or when the marker genotype is missing. The linear model from Eq. [1] can be written in matrix form as
g = Zu [2]
where g = (g_1, g_2, ..., g_G)', Z = {z_ik}, and u = (u_1, u_2, ..., u_M)'.

When there is a single (mean-centered) observation y_i per genotype with independent residual errors e_i having zero mean and variance σ²_e, the model for the observed data is y = Zu + e, where y = (y_1, y_2, ..., y_G)' and e = (e_1, e_2, ..., e_G)'. The classical least squares estimator, û = (Z'Z)^{-1} Z'y, minimizes the sum of squares ||y - Zu||², with ||·|| denoting the length of a vector.
This estimator is well known to perform poorly when the number of markers (M) is large relative to the number of genotypes (G), and it cannot be computed at all when M > G, which is expected to be increasingly the case with high-density marker systems. Selecting a subset of markers by one of the common variable selection methods (forward selection, stepwise regression, etc.; Miller, 2002) is a possible alternative often used in marker-assisted selection programs (Bernardo and Yu, 2007), but the performance of these methods is likely to deteriorate when there are many, possibly highly correlated markers (Whittaker et al., 2000).
There are many regularization methods addressing the problem of large M, which avoid the selection problem essentially by keeping all markers in the model. One of these methods is ridge regression (Hoerl and Kennard, 1970), which was first used for GS by Whittaker et al. (2000). Ridge regression minimizes the penalized sum of squares ||y - Zu||² + λ² u'u, where λ² is a penalty parameter, yielding the estimator

û = (Z'Z + λ² I_M)^{-1} Z'y    [3]

where I_M is the M-dimensional identity matrix. The penalty term overcomes the problem of ill-conditioning when multicollinearity among columns in Z causes Z'Z to be singular, or nearly so. The penalized estimator in Eq. [3] involves shrinkage, thus avoiding overfitting, and it stabilizes estimation relative to least squares. The penalty parameter λ², which determines the amount of shrinkage, may be chosen in a number of ways (Draper and Smith, 1998), including cross-validation (Ruppert et al., 2003).
One particular method, which has been used by Meuwissen et al. (2001) for GS, assumes that regression coefficients are independent random draws from a common normal distribution, that is,

u_k ~ N(0, σ²_u), (k = 1, ..., M)    [4]

Under this model, we have λ² = σ²_e / σ²_u, where σ²_e is the residual variance (Draper and Smith, 1998), and the penalized estimator in Eq. [3] turns out to be equivalent to best linear unbiased prediction (BLUP) of u (Ruppert et al., 2003). This method has also been used by Bernardo and Yu (2007), who found it, based on a simulation study, to perform well compared to subset selection, in which markers were selected per chromosome by backward elimination with relaxed significance thresholds. One advantage of the mixed model formulation of ridge regression is that we can estimate the variance components, and hence the penalty, in a straightforward way by restricted maximum likelihood (REML) (Ruppert et al., 2003). Furthermore, it is possible to account for other sources of variation by adding fixed and random effects.
The present paper briefly reviews some of the features of ridge regression as performed in a mixed model framework using REML, giving particular emphasis to its similarity with spatial models. I then outline some alternative models for GS, including spatial models and ridge regression with heterogeneous variances. The exposition emphasizes non-Bayesian implementations of methods that are mostly treated in a Bayesian framework, mainly in the animal breeding literature. Equivalence relations of perhaps seemingly different methods are discussed. I give some hints on how these models can be fitted using standard mixed model software and illustrate them using a dataset from a breeding program in maize (Zea mays L.). The importance of accounting for both genotype–environment interaction and for polygenic effects not captured by markers is highlighted.
MATERIAL AND METHODS
The total genotypic effect will be partitioned into a component explained by the markers (g_i) and a polygenic component (v_i) not captured by the markers. Thus, the total genotypic effect h_i is

h_i = g_i + v_i    [5]

Our main objective is to estimate h_i. It is assumed throughout that g_i and v_i are independent of one another. It is important to account for residual polygenic effects v_i to avoid overfitting (Goddard and Hayes, 2007). In case of a single unstructured population, for example a population of doubled haploid (DH) lines generated from a single cross, we have

var(v) = σ²_v I_G    [6]

where v = (v_1, v_2, ..., v_G)'. For structured populations, var(v) may involve covariances among relatives (Piepho et al., 2008a).

We will consider different models for g = (g_1, g_2, ..., g_G)', conditionally on the markers Z = {z_ik}. All conditional models will be of the form

var(g|Z) = σ²_u Γ    [7]

for some matrix Γ that is a function of Z. In Eq. [7] and later in the paper, the expression on which we condition (Z in this case) is given following a vertical bar.
Ridge Regression and Related Models
Under the mixed model formulation of ridge regression in Eq. [4], we have Γ = ZZ', so that the genotypic variance–covariance structure is linear, with the covariance of two genotypes depending on similarity in their marker profiles. Another well-known model, in which covariance is a linear function of genetic similarity, is given by Γ = 2A, where A is the numerator relationship matrix computed either from pedigree records or from marker data (Henderson, 1985). A further model is Γ = 2K, where K is the kinship matrix estimated from the markers (Yu et al., 2006). When A or K are estimated from marker data, the covariance of two genotypes is again a linear function of similarity between marker profiles, so in this sense use of the A or K matrix estimated from markers is an early form of GS. In fact, under some circumstances, use of the kinship matrix K and ridge regression are equivalent, as will be shown below.

As ridge regression implies a genetic covariance among genotypes, the question may be posed whether this model is commensurate with an analysis that ignores marker information and is based on the assumption of independent genotypes with constant variance. This question is relevant for the two-stage analysis of multi-environment data to be discussed later in this paper. The answer is yes in a simple unstructured population, for example a population of DH or recombinant inbred lines originating from a single cross of inbred lines (Bernardo and Yu, 2007). To see this, we may evaluate Eq. [2] in two ways: (i) conditioning on the markers (ridge regression) and (ii) not conditioning on markers. The variance conditioning on markers is var(g|Z) = σ²_u ZZ'. The unconditional variance can be derived from a general result on moments of joint random variables (Searle et al., 1992, p. 461):

E(g) = E_Z E(g|Z) = 0    [8]

var(g) = E_Z var(g|Z) + var_Z E(g|Z) = σ²_u E_Z(ZZ')    [9]

with E_Z and var_Z representing the expectation and variance over Z. In a DH population derived from a single cross we have

E_Z(ZZ') = M [p I_G + (1 - p) J_G]    [10]

where p is the probability that a marker is segregating in the underlying cross and J_G is a G × G matrix of ones (for i ≠ i', the product z_ik z_i'k has expectation 0 for a segregating marker and 1 for a fixed marker). The term in J_G is confounded with the fixed intercept and so may be dropped from the model (Piepho et al., 2008b). Hence, when ignoring marker information, it is valid to assume independent and identically distributed genotypic effects; that is, var(g) = σ²_g I_G. Equation [10] shows that σ̂²_u = M^{-1} σ̂²_g provides a reasonable estimate of σ²_u when p = 1 and σ²_v = 0 (Bernardo and Yu, 2007). It may be preferable, however, to estimate σ²_u directly by REML based on the ridge regression model in Eq. [4], because this allows accounting for σ²_v, as will be shown further below, and because this caters for the case p < 1.

In structured populations, Eq. [10] does not usually hold, but when the structure is simple, a parsimonious unconditional random effects model may be obtained for var(g). For example, in a diallel crossing scheme of several inbred lines from the same population, where each cross produces a family of DH lines, the model for var(g) comprises random effects for general combining ability (GCA) and specific combining ability (SCA) as well as for lines within crosses (Piepho et al., 2008b). Generally, p in Eq. [10] will vary between crosses, resulting in heterogeneity of variance between crosses. This heterogeneity can be accounted for, in principle, in a nonmarker model by assigning a different variance to each cross. The practical difficulty with this approach is that often the number of progeny per cross is rather limited, making variance estimates rather unreliable. This problem can be overcome by ridge regression, where heterogeneity among crosses is modeled by a single variance component σ²_u, because heterogeneity between crosses is represented by the structure of ZZ'.
In animal breeding, very accurate BLUPs are often available for sires and dams due to extensive records on progeny, accounting for half the additive genetic variance, so focusing GS on Mendelian within-family sampling has been suggested (H. Simianer, personal communication, 2008). This approach is applicable in plant breeding as well, though BLUPs of parents are usually less reliable than in animal breeding programs. In a diallel crossing scheme, this means replacing Z with Z̃ = Z - Z_F1, where Z_F1 is the marker data of the F_1 generation corresponding to the underlying crosses, and assuming conditional independence between crosses for the marker-dependent part of the model; that is, Γ is block-diagonal with blocks Z̃_c Z̃_c', where Z̃_c is the submatrix of Z̃ corresponding to the cth cross.
Simple Spatial Mixed Models
The property of a genetic covariance depending on similarity of marker profiles brings to mind a host of alternatives such as geostatistical methods, where covariance depends on spatial proximity. Thus, replacing spatial distance with genetic distance, spatial methods can be used to model genetic correlation (Piepho et al., 2008a). Ridge regression may, in fact, be regarded as one type of spatial model, as will be elaborated after a brief outline of spatial models as applied to marker data.
If marker scores z_ik are regarded as coordinates of genotypes in M-dimensional marker space, covariance can be modeled as a linear or nonlinear function of distance in that space. Thus, the covariance is expressed as

Γ = {f(d_ii')}    [11]

where d_ii' is the Euclidean distance of genotypes i and i', defined as d_ii' = ||z_i - z_i'||, with z_i equal to the ith row of Z, and f(d) is some monotonically decreasing function of d. There are different options for the function f(d), including those shown in Table 1 (Schabenberger and Gotway, 2005).

Table 1. Genotypic covariance models of the form Γ = {f(d_ii')}, where d is the Euclidean distance computed from marker data and θ is a parameter.

Name                  Equation
Gaussian              f(d) = exp(-d²/θ)
Power (exponential)   f(d) = θ^d
Exponential           f(d) = exp(-d/θ)
Spherical             f(d) = 1 - 3d/(2θ) + d³/(2θ³), (d < θ)
Linear                f(d) = 1 - θd
Quadratic             f(d) = 1 - θd²
The first four models in Table 1 are commonly used as spatial covariance structures in mixed model packages (Littell et al., 2006). The power model (POW) is just a re-parameterization of the exponential model (EXP). It is worth fitting both models, however, because convergence behavior may differ. The linear variance (LV) model was proposed by Williams (1986) in the context of blocked field experiments. In fitting this model, precautions must be taken to ensure that the resulting variance–covariance matrix remains positive definite (Piepho et al., 2008b). An advantage of the model compared to the first four (nonlinear) spatial models shown in Table 1 is parsimony, because σ²_u Γ = σ²_u J_G - σ²_u θ {d_ii'}, where the first term on the right-hand side is confounded with the fixed intercept and θ is a parameter as defined in Table 1. Thus, the only free parameter to be estimated is φ = σ²_u θ.
The quadratic model is not commonly used in spatial statistics, but is considered here to illustrate a close relation between ridge regression and spatial models. It is worth pointing out that the quadratic model can be regarded as a first-order
Taylor approximation of the Gaussian model (GAU), because exp(-d²/θ) ≈ 1 - d²/θ when d²/θ is not far from zero. To study the properties of the quadratic model, note that the squared Euclidean distance of genotypes i and i' can be expressed as

d²_ii' = ||z_i - z_i'||² = \sum_{k=1}^{M} (z_ik - z_i'k)² = \sum_{k=1}^{M} (z²_ik + z²_i'k - 2 z_ik z_i'k)    [12]
If all genotypes and all markers have either z_ik = 1 or z_ik = -1, which happens, for example, for recombinant inbred lines or DH lines, we have (z_ik - z_i'k)² = 0 when z_ik = z_i'k (genotypes have identical alleles) and (z_ik - z_i'k)² = 4 when z_ik = -z_i'k (genotypes have opposite alleles). Also, z²_ik = 1 for all i and k. Thus, if s_ii' denotes the simple matching coefficient of genotypes i and i', that is, the proportion of markers identical in state, we have

d²_ii' = 4M(1 - s_ii') = 2(M - \sum_{k=1}^{M} z_ik z_i'k)    [13]
so that the squared distance matrix has the form

D_sq = {d²_ii'} = 4M(J_G - S) = 2(M J_G - ZZ')    [14]

where S = {s_ii'}. The key observation here is that the matrices D_sq, S, and ZZ' are all linear shift-scale transformations of one another; that is, either one can be obtained from any of the other two by (i) multiplication with a constant and (ii) addition of a constant term times the matrix J_G. For example, if we fit a quadratic spatial model of the form f(d) = 1 - θd², the variance–covariance matrix is

var(g|Z) = σ²_u (J_G - θ D_sq) = σ²_u [J_G - θ(2M J_G - 2ZZ')] = α_1 J_G + α_2 ZZ'    [15]

where α_1 = (1 - 2Mθ)σ²_u and α_2 = 2σ²_u θ. The term in J_G is confounded with the (fixed) intercept and can therefore be dropped from the model (Piepho et al., 2008b). It emerges that the quadratic spatial model is equivalent to ridge regression. In other words, ridge regression can be seen as a special type of spatial model, in which covariance is a quadratic function of Euclidean distance. In the same vein, we may obtain an equivalent fit by just using var(g|Z) = σ²_u S, which has also been proposed in a plant breeding context to model genetic correlation among relatives (Bauer et al., 2006).
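As a small numerical check of Eq. [13] and [14] (hypothetical numbers for illustration only, not taken from the maize data), consider M = 4 markers and two DH lines with scores z_i = (1, 1, 1, 1) and z_i' = (1, 1, 1, -1). Then s_ii' = 3/4 and

d²_ii' = 0 + 0 + 0 + 4 = 4,   4M(1 - s_ii') = 4 × 4 × 1/4 = 4,   2(M - \sum_k z_ik z_i'k) = 2(4 - 2) = 4,

so the three expressions in Eq. [13] agree, as they must.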
On a similar note, in association mapping, the residual genotypic effect not accounted for by regression on a candidate marker can be modeled as var(g|Z) = 2σ²_u K, where K is a kinship matrix of the form

K = a S + b J_G    [16]

with a = (1 - c)^{-1}, b = -c(1 - c)^{-1}, and c equal to the average probability of identity in state for genes coming from random individuals in the population (Yu et al., 2006; Stich et al., 2008). Again, this is just a shift-scale transformation of S, and the term b J_G may be dropped as it is confounded with the intercept. So the use of K for inbred lines is equivalent to ridge regression and to a quadratic spatial model. Also, using some other similarity measure such as the Jaccard or Dice coefficient (for dominant marker systems) in place of simple matching (Yu et al., 2006) is seen to be very similar to use of the kinship matrix.
All spatial models considered so far employ the Euclidean distance. Mixed model software usually requires the coordinates (markers) as input to compute the Euclidean distance. Other distances (Reif et al., 2005) may be fitted within a mixed model if a representation in Euclidean space, or some approximation thereof, is found by principal coordinate analysis (Gower, 1966), but this is not elaborated here (Piepho et al., 2008a).
Based on spatial models, BLUPs can be computed for any point in the space spanned by the markers. This includes the genotypes tested as well as potentially other genotypes, which have been genotyped but for which no phenotypic data are available. BLUP based on spatial models is equivalent to kriging in spatial statistics (Ruppert et al., 2003). Thus, the BLUP for a genotype constitutes an interpolation of the genotypic value based on the genotype's own data and that of the other genotypes, with the impact of a tested genotype on the prediction of another genotype depending on genetic proximity. It is also worth noting that adding a polygenic component σ²_v I_G is equivalent to adding a nugget effect that accounts for residual measurement error in spatial models (Schabenberger and Gotway, 2005).
BLUP based on GAU also bears an intimate relationship with least squares support vector machine (LS-SVM) regression (Suykens et al., 2002, p. 106–107) when a Gaussian kernel is used, as is common in chemometric applications such as near-infrared spectroscopy (Cogdill and Dardenne, 2004). For details on the analogies of LS-SVM regression and Gaussian processes see Suykens et al. (2002). Thus, in as much as ridge regression may be seen as an approximation to GAU, it is also an approximation of LS-SVM regression. SVM is another class of regularization methods, which have recently been applied to the problem of hybrid prediction in plant breeding (Maenhout et al., 2007, 2008). Finally, GAU as applied to markers is essentially equivalent to reproducing kernel Hilbert spaces regression as proposed by Gianola and van Kaam (2008).
Mixed Models with Heterogeneous Variance
The BayesA and BayesB methods of Meuwissen et al. (2001) assume that the ridge regression model is extended such that each marker has its own variance. Thus, under both approaches the regression model is

g_i = \sum_{k=1}^{M} \sqrt{σ²_k} t_k z_ik    [17]

where t_k ~ N(0, 1) and σ²_k is the variance for the kth marker. Under both the BayesA and BayesB models, a prior distribution is assumed for the variances σ²_k. The regression coefficient under these models has the form u_k = \sqrt{σ²_k} t_k. To relate these models to ridge regression, it is important to recognize that the
variance σ²_k essentially is just another marker-specific random effect. As t_k is standard normal, the random regression coefficient u_k = \sqrt{σ²_k} t_k will have a symmetric nonnormal marginal distribution whose specific form depends on the assumed prior for σ²_k. Clearly, this marginal distribution for u_k (not conditioning on σ²_k) has a constant variance, and so the only difference to ridge regression is that nonnormality holds for u_k.
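To spell out the constant-variance claim (a one-line argument using only the assumptions stated above), note that t_k and σ²_k are independent and E(t²_k) = 1, so

var(u_k) = E(σ²_k t²_k) = E(σ²_k) E(t²_k) = E(σ²_k),

which is the same for every marker k; only the shape of the marginal distribution of u_k, not its variance, distinguishes these priors from the ridge regression model [4].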
These properties of Bayesian models suggest that in a non-Bayesian framework we may consider fitting nonnormal distributions to u_k, which closely mimic the BayesA and BayesB model fits. A nonnormal distribution with stronger peaks and heavier tails may be more realistic than a normal distribution when most of the markers have a very small effect. One convenient class of distributions is available via the Johnson S_U system of transformed normal random variables (Johnson et al., 1994; Piepho and McCulloch, 2004). I found this type of nonlinear mixed model difficult to fit, however, using adaptive Gaussian quadrature (Pinheiro and Bates, 1995). When fitting these models using a first-order method, corresponding to Gaussian quadrature with a single quadrature point, I obtained the same log-likelihoods with normal and nonnormal u. This similarity is not unexpected, because under either model the genotypic value is Zu, and based on the central limit theorem this linear combination is expected to be nearly normal even when u is nonnormal. In light of these considerations, the good performance of BayesB relative to ridge regression in Meuwissen et al. (2001) is probably at least partly due to the strong impact of the assumed prior distribution, which was derived based on the model used to simulate the data (though it did not exactly match that model; Goddard and Hayes, 2007).
A simple alternative is to fit Eq. [17] in a frequentist setting, regarding σ²_k as fixed parameters (RRhet). A REML fit will typically yield many zero estimates, which essentially implies an automatic selection of markers. Thus, we may simply drop markers with zero variance estimates. This is similar in spirit to BayesB, where the prior for σ²_k has a peak at σ²_k = 0, which induces an automatic marker selection. For the remaining markers, we can perform a likelihood ratio test for homogeneity of variance. In case of homogeneity, the ordinary ridge regression model with homogeneous variance can be fitted with the selected markers (RRhom2). In case of heterogeneity, we may stick with RRhet.
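A minimal SAS sketch of this two-step procedure is given below. It uses the RANDOM statements listed in the Appendix, but the dataset and variable names (means, y, z1-z136) and the list of retained markers are illustrative assumptions, not part of the original paper; in practice the second step would list the markers whose variance components were not estimated as zero in the first step.

/* Step 1 (RRhet): a separate variance component for each marker (default TYPE=VC). */
proc mixed data=means method=reml covtest;
   model y = / solution;
   random z1-z136;
run;

/* Step 2 (RRhom2): ordinary ridge regression with a single common variance,
   refitted with only the markers retained in step 1 (marker list illustrative). */
proc mixed data=means method=reml;
   model y = / solution;
   random z5 z17 z23 / subject=intercept type=toep(1);
run;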
Extension of Models to Genotype–Environment Data
When data from multiple environments are available, modeling of genotype–environment interaction requires special attention. In particular, genetic correlation between environments needs to be modeled (Piepho, 2000). In analogy to the case of a single environment, it will be assumed that the effect of the ith genotype in the jth environment (j = 1, 2, ..., E), denoted as h_ij, can be partitioned into a marker-based effect g_ij and a polygenic effect v_ij:

h_ij = g_ij + v_ij    [18]

Each of the two component effects is partitioned into main effect and interaction, that is,

g_ij = g_i + f_ij    [19]

and

v_ij = v_i + w_ij    [20]

where g_i and v_i are the marker-based and polygenic main effects, while f_ij and w_ij are the corresponding interaction terms with the environment. Both g_i and f_ij are modeled as functions of the marker data. Our main objective is to estimate the genotypic main effect h_i = g_i + v_i.
Let f_j = (f_1j, f_2j, ..., f_Gj)' and f = (f_1', f_2', ..., f_E')', and let w be similarly defined. For the marker-based effects it is assumed that

var(g|Z) = σ²_u Γ    [21]

and

var(f|Z) = Σ_f ⊗ Γ    [22]

where ⊗ denotes the Kronecker (or direct) product (Searle et al., 1992). For example, with ridge regression (RRhom) we have Γ = ZZ'. Some choices for the E × E variance–covariance matrix Σ_f are given in Table 2, including the factor-analytic (FA) model (Piepho, 1997, 1998). For the polygenic effects we may assume

var(v) = σ²_v I_G    [23]

and

var(w) = Σ_w ⊗ I_G    [24]

where Σ_w is also chosen from the options in Table 2.

Table 2. Models for variance–covariance among genotypes in different environments (Σ_q; q = f, w).

Model            Short-hand   Equation
Independent      ID           σ²_1 I_E
Diagonal         DIAG         D = diag(σ²_1, σ²_2, ..., σ²_E)
Factor-analytic  FA(P)        D + \sum_{p=1}^{P} λ_p λ_p', with λ_p = (λ_p1, λ_p2, ..., λ_pE)'
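To make the Kronecker structure in Eq. [22] concrete (an illustrative expansion only, not an additional model), consider E = 2 environments with Σ_f = {σ_f,jj'}. Then

var(f|Z) = Σ_f ⊗ Γ = \begin{pmatrix} σ_f,11 Γ & σ_f,12 Γ \\ σ_f,12 Γ & σ_f,22 Γ \end{pmatrix},

so the marker-based covariance between interaction effects of genotypes i and i' in environments j and j' is the (j, j')th element of Σ_f times the (i, i')th element of Γ.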
Two-Stage Analysis for Genotype–Environment Data
It is generally desirable to fit a suitable model directly to genotype–environment data. Estimation requires, however, that replicate data are available to separate genotypic from environmental effects. When no replicate data are available, as is the case for the example considered in this paper, separation of these effects is not possible. In this case, one may compute genotype means over environments. I use a two-stage approach in which genotype means are computed based on a model with fixed genotype main effects and random interactions. The marker-based component in the interaction is set to zero at this stage, because genotype–environment effects cannot be separated from residual error; it will be absorbed into the polygenic effect w_ij according to Eq. [9] and [10]. Thus, I fitted the model

y_ij = µ_j + h_i + w_ij    [25]

where y_ij is the adjusted mean of the ith genotype in the jth environment and µ_j is the main effect of the jth environment. Note that the effect w_ij in Eq. [25] subsumes the residual error of the adjusted mean. Based on this model I estimated adjusted genotype means y_i, taking both µ_j and h_i as fixed, and then fitted the model

y_i = µ + h_i + e_i    [26]
where var(h|Z) = var(g|Z) + var(v) with h = (h_1, h_2, ..., h_G)', and var(e_i) is fixed at the squared standard error of the adjusted mean y_i. For comparison, I also fitted Eq. [26] merging e_i with the polygenic effect v_i (contained in h_i) into an independent residual with constant variance. Thus, in this analysis, var(e_i) was not fixed.
One could try to directly fit the model y_ij = µ_j + h_i + f_ij + w_ij in a single step, in which case the interaction effect would comprise the marker-dependent term f_ij. With no independent estimate of error, however, this is prone to overfitting, because the fit for f_ij would then be confounded with any correlation among adjusted means that is due to the trial design in the different environments.
The Maize Data
Two hundred eight DH lines originating from a single cross of inbred parental lines in maize were tested in three series of trials over five locations. In four locations (LOC), a lattice design with block size 10 was employed, while in one location a complete block design was used. In four locations, only a single replicate was planted, while in one location there were two replicates planted according to a lattice design. Trials, replicates, and incomplete blocks were coded as TRIAL, REP, and BLOCK, respectively. For each location, adjusted entry means y_ij were computed. For unreplicated trials, the model was ENTRY + TRIAL. For the location with replicated trials, the model was ENTRY + TRIAL.REP.BLOCK. Adjusted means for entries with marker data were subjected to mixed model analysis. The trait evaluated was kernel dry weight per plot.

There were seven check genotypes. For some of the DH lines marker information was missing, so these were treated as additional checks. Adjusted means of all checks were excluded from mixed model analysis. A total of 136 simple sequence repeat and single nucleotide polymorphism markers evenly distributed across the genome were scored for the DH lines. The two alleles of a marker were coded Z = -1 and Z = +1, while missing data were coded as Z = 0.
Software and Model Evaluation
All models were fitted by the REML method. For each model, we report both the deviance (minus twice the restricted log-likelihood) and the Akaike information criterion (AIC), defined as the deviance plus twice the number of variance parameters. Small values of AIC indicate a preferable model. The AIC is closely related to cross-validation criteria (McQuarrie and Tsai, 1998; Piepho and Gauch, 2001). Some code for SAS PROC MIXED (Littell et al., 2006) is given in the Appendix.
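As a quick arithmetic check of this definition against the results reported below, the ID structure in Table 3 has a single variance parameter, so AIC = 2843.1 + 2 × 1 = 2845.1, and the DIAG structure has E = 5 parameters, giving AIC = 2753.7 + 2 × 5 = 2763.7.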
RESULTS
Computation of Genotype Means
To compute genotype means across environments, we fitted the two-way model from Eq. [25] with fixed main effects for environments and genotypes and various structures for Σ_w. Based on the AIC values in Table 3, it was decided to compute genotype means y_i using the FA(1) model. The average variance of an adjusted mean based on this analysis was 0.174.

Table 3. AICs for various variance–covariance structures Σ_w fitted to the phenotypic data (genotype–environment means). Models had fixed main effects for genotypes and environments.

Model (Σ_w)   Deviance   AIC†
ID            2843.1     2845.1
DIAG          2753.7     2763.7
FA(1)         2744.4     2762.4
FA(2)         2743.5     2771.5
†AIC, Akaike information criterion.
Analysis of Genotype Means Not Fixing the Error Variance
We first fitted models for h based on Eq. [26] without fixing the error variance var(e_i) at the squared standard error of a mean, such that error could not be separated from var(h|Z). Thus, the residual variance comprised the variance of both the polygenic effect v_i and the error associated with the mean (e_i). The model fits are shown in Table 4. There was no significant heterogeneity among the 38 markers with nonzero variance under model RRhet. Thus, RRhom2 was fitted. The example shows that ridge regression and spatial models give better fits than a model with independent genotypic effects. Also, the spatial models provide a fit similar to ridge regression in terms of AIC. Strikingly, some of the spatial models have a rather smaller residual variance than ridge regression, in which case BLUP comes very close to the adjusted means, which explains the high correlation of adjusted means with BLUPs under spatial models (POW, EXP, GAU, SPH). By contrast, the correlation of adjusted means with BLUPs is quite low for ridge regression and the quadratic model (Table 5), so selection decisions by these GS methods would be quite different than by adjusted means. The finding that some of the spatial models have a residual variance rather smaller than the average variance of an adjusted mean (0.174) is indicative of overfitting. In terms of AIC, differences are minor between spatial models and ridge regression. Overall, the LV model is marginally better than other marker-based spatial models. RRhom2 has by far the best AIC value of all models, showing that preselection of markers is an important consideration.

Table 4. Model fits of different genetic covariance models with the maize data. Error variance var(e_i) not fixed.

Model for g_i                     Deviance   AIC†   Residual variance‡
Independent                       372.8      374.8  0.3454
Ridge regression§
  RRhom                           336.9      340.9  0.2272
  RRhet (38 markers selected)¶    289.5      367.5  0.1635
  RRhom2 (38 markers)             303.3      307.3  0.1773
Spatial models
  Linear                          335.6      339.6  0.1139
  Quadratic                       336.9      340.9  0.2272
  Power                           334.8      340.8  0.0020
  Exponential                     334.8      340.8  0.0018
  Gaussian                        333.9      339.9  0.0002
  Spherical                       334.3      340.3  <0.0001
†AIC, Akaike information criterion.
‡Residual subsumes v_i and e_i, because var(e_i) was not fixed.
§RRhet, ridge regression with heterogeneous variance among markers; RRhom, ordinary ridge regression; RRhom2, ridge regression with reduced set of markers.
¶Heterogeneity of variance among selected markers was not significant according to a likelihood ratio test (α = 5%).

Table 5. Pearson correlation (above diagonal) and Spearman rank correlation (below diagonal) of different estimators of genotypic value for the maize data. Error variance var(e_i) not fixed.

Model/estimator     AM     RRhom  RRhet  RRhom2  LV     QUAD   POW    EXP    GAU    SPH
Adj. mean (AM)      1      0.703  0.774  0.756   0.966  0.703  1      1      1      1
RRhom               0.670  1      0.920  0.942   0.862  1      0.705  0.705  0.703  0.703
RRhet               0.759  0.916  1      0.974   0.884  0.920  0.776  0.776  0.774  0.774
RRhom2              0.745  0.935  0.974  1       0.880  0.942  0.758  0.758  0.756  0.756
Linear (LV)         0.960  0.862  0.880  0.976   1      0.862  0.967  0.967  0.966  0.966
Quadratic (QUAD)    0.697  1      0.916  0.935   0.862  1      0.705  0.705  0.703  0.703
Power (POW)         1      0.698  0.761  0.746   0.961  0.698  1      1      1      1
Exponential (EXP)   1      0.698  0.760  0.746   0.960  0.698  1      1      1      1
Gaussian (GAU)      1      0.697  0.759  0.745   0.960  0.697  1      1      1      1
Spherical (SPH)     1      0.697  0.759  0.745   0.960  0.697  1      1      1      1
RRhet, ridge regression with heterogeneous variance among markers; RRhom, ordinary ridge regression; RRhom2, ridge regression with reduced set of markers.
Analysis of Genotype Means Fixing the Error Variance
The fact that some of the models yielded very tiny residual variances when var(e_i) was not fixed, thus rendering BLUP essentially the same as adjusted means, is reason for concern. Adjusted means are typically correlated, though the correlation may not be large. It is therefore possible that in a model for adjusted means, the genetic covariance model captures part of the correlation among adjusted means that is purely nongenetic, thus yielding an upward bias in genetic variance. For this reason it is advisable to generally obtain an independent estimate of error (as in Bernardo and Yu, 2007). Thus, we fixed var(e_i) at the squared standard error of adjusted genotype means based on the FA(1) model for genotype–environment means. The resulting fits are shown in Table 6. Most models leave rather little polygenic variance σ²_v. Again, RRhom2 has by far the best fit in terms of AIC. Among spatial models, LV and GAU are best. None of the GS methods is perfectly correlated with the adjusted mean (Table 7), while several of the spatial models (LV, POW, EXP, SPH) are virtually identical.

Table 6. Model fits of different genetic covariance models with the maize data. Error variance var(e_i) fixed at the value of the squared standard error of a mean based on the FA(1) model fitted to genotype–environment data.

Model for g_i                     Deviance   AIC†   Polygenic genetic variance (σ²_v)
Independent                       372.8      374.8  0.1712
Ridge regression‡
  RRhom                           336.9      340.9  0.0528
  RRhet (37 markers selected)     289.7      363.7  0
  RRhom2 (37 markers)§            301.9      305.9  0.0045
Spatial models
  Linear                          337.1      339.1  0
  Quadratic                       336.9      340.9  0.0528
  Power¶                          337.2      341.2  0
  Exponential                     337.1      341.1  0
  Gaussian                        335.2      339.2  0
  Spherical                       337.1      341.1  0
†AIC, Akaike information criterion.
‡RRhet, ridge regression with heterogeneous variance among markers; RRhom, ordinary ridge regression; RRhom2, ridge regression with reduced set of markers.
§Heterogeneity of variance among selected markers was not significant according to a likelihood ratio test (α = 5%); variance estimates were shrunken to the overall mean (for details see text).
¶Autocorrelation converged to a value close to unity.

Table 7. Pearson correlation (above diagonal) and Spearman rank correlation (below diagonal) of different estimators of genotypic value for the maize data. Error variance var(e_i) fixed at the value of the squared standard error of a mean based on the FA(1) model fitted to genotype–environment data.

Model/estimator     AM     RRhom  RRhet  RRhom2  LV     QUAD   POW    EXP    GAU    SPH
Adj. mean (AM)      1      0.881  0.768  0.769   0.920  0.881  0.920  0.920  0.887  0.920
RRhom               0.871  1      0.933  0.946   0.995  1      0.995  0.995  0.997  0.995
RRhet               0.753  0.929  1      0.975   0.915  0.933  0.914  0.914  0.927  0.914
RRhom2              0.760  0.941  0.975  1       0.926  0.769  0.926  0.926  0.942  0.926
Linear (LV)         0.912  0.994  0.912  0.923   1      0.920  1      1      0.996  1
Quadratic (QUAD)    0.871  1      0.929  0.941   0.994  1      0.995  0.995  0.997  0.995
Power (POW)         0.912  0.994  0.912  0.923   1      0.994  1      1      0.996  1
Exponential (EXP)   0.912  0.994  0.912  0.923   1      0.994  1      1      0.996  1
Gaussian (GAU)      0.879  0.996  0.926  0.939   0.995  0.996  0.995  0.995  1      0.996
Spherical (SPH)     0.912  0.994  0.912  0.923   1      0.994  1      1      0.995  1
RRhet, ridge regression with heterogeneous variance among markers; RRhom, ordinary ridge regression; RRhom2, ridge regression with reduced set of markers.
DISCUSSION
This paper has discussed some models for GS that are readily implemented with a mixed model package. Results for a maize dataset indicate that the spatial models are an interesting alternative to ridge regression. The LV model is particularly attractive because it involves only a single parameter. Automatic marker selection by a preliminary fit of a model with heterogeneous variance between markers is a promising method. A thorough comparison with other methods of subset selection would be worthwhile.

In the analysis of across-environment genotype means without an independent estimate of error, the residual variance estimator often was close to zero, indicating that the marker-based component captured substantial noise. This stresses the need to provide for an independent estimate of error in GS projects. Also, it is desirable to explicitly account for genetic variance not captured by the markers (Calus and Veerkamp, 2007). This polygenic variance should be separated from residual error, which requires independent estimates of error for individual trials. It is quite common in breeding programs to perform unreplicated trials, as was the case in the example, where there was not sufficient information to estimate within-trial errors for all environments. Thus, genotype–environment interaction could not be separated from error in a mixed model. We could have fitted a marker-based model to the genotype–environment effect, but this would have entailed the risk of overfitting, because nongenetic correlations due to field trend could have been captured by the marker-based terms. For this reason a two-stage approach was employed, computing genotype means across environments based on an unconditional mixed model for genotype–environment interaction that did not exploit marker information. While the unconditional model is valid, as shown in this paper, using a conditional model of genotype–environment interaction for given marker information is expected to be more efficient. This is forthcoming only with sufficient replication in all trials, stressing the need for good individual trial design.

When markers are mapped, the ridge regression model can be extended to allow spatial correlation of regression coefficients pertaining to markers on the same chromosome (Gianola et al., 2003), using the same types of spatial model discussed here. Unfortunately, such models are currently not conveniently fitted using mixed model software such
as PROC MIXED. If the spatial model has only a single parameter, a profile likelihood approach may be used.

If one is prepared to work within a fully Bayesian framework, more options are available (Meuwissen et al., 2001; Gianola et al., 2003; Xu, 2003; Habier et al., 2007; Gianola and van Kaam, 2008). A general problem with Bayesian methods is coming up with a choice for the prior distribution. Meuwissen et al. (2001) use prior distributions for BayesA and BayesB that were derived from their simulation program. Thus, the prior distributions favorably matched the data generation mechanism, putting the Bayesian methods somewhat at an advantage that would be hard to realize in most plant breeding applications, where prior information may be much vaguer.

One can investigate the merits of different models by simulation (Meuwissen et al., 2001; Bernardo and Yu, 2007). The main difficulty is that a model needs to be chosen for simulating the data, and naturally, a model close to the one used for simulation is more likely to perform well in the analysis. Simulations must necessarily make a number of assumptions, the validity of which is hard to verify in practice. Thus, a more reliable assessment of performance is with real data from breeding programs. Ideally, parallel programs using different models for prediction would be compared based on the realized genetic gain. The less-than-perfect correlation among BLUPs by different GS methods found in the present study suggests that a thorough comparison of different methods in current breeding programs would be useful, preferably by cross-validation reflecting the breeder's selection decision process (Schrag et al., 2009). When devising a cross-validation scheme, it must be realized that a family structure as studied in the present paper induces genetic correlation. Optimality of cross-validation methods often rests on independence assumptions, and generalizing to the case of dependent data is not straightforward (Lahiri, 2003). Developing suitable cross-validation schemes for plant breeding programs therefore is an interesting topic for future research.
A very promising application of GS is for hybrid prediction. Bernardo (1993, 1994) proposed a BLUP approach for hybrid prediction, which is closely related to ridge regression. He suggested estimating the coefficient of coancestry (f_ii') for two maize inbred lines i and i' from the same heterotic pool X by a linear function of the simple matching coefficient s_ii', that is, by f_ii' = a_ii' s_ii' + b_ii', where a_ii' = (1 - c_ii')^{-1}, b_ii' = -c_ii'(1 - c_ii')^{-1}, c_ii' = 0.5(s_iY + s_i'Y), and s_iY is the average simple matching coefficient between inbred i and a sample of inbreds from the opposite heterotic pool Y. This estimate of the coefficient of coancestry is then used in the LV model σ²_u {f_ii'} to predict GCA effects within a mixed model. This approach is
seen to be quite similar to ridge regression for the GCA effects, but not equivalent, because the scale and shift parameters (a_ii' and b_ii') depend on the pair of genotypes, while with ridge regression a_ii' = a and b_ii' = b for all pairs (i, i') (see Eq. [16]).

Ridge regression and spatial models discussed in this paper can be used as alternative methods to model GCA and SCA effects in hybrid prediction. For example, a ridge regression model for prediction of hybrid performance in a complete factorial is

g = (Z_1 ⊗ 1_P2) u_1 + (1_P1 ⊗ Z_2) u_2 + Z_3 u_3    [27]

where g is the vector of genotypic values of G hybrids, P_1 and P_2 are the numbers of inbred parents in the two heterotic pools, Z_1 and Z_2 are the marker-based design matrices of parents in the two pools, 1_P is a P-dimensional vector of ones, and Z_3 = (Z_1 ⊗ 1_P2) • (1_P1 ⊗ Z_2), where • denotes the elementwise (Hadamard or Schur) product, u_1 = (u_11, ..., u_1M)' and u_2 = (u_21, ..., u_2M)' are vectors of the GCA effects at the markers of the two pools, and u_3 = (u_31, ..., u_3M)' is the corresponding vector of SCA effects. Coding of the design matrices for GCA and SCA effects has a standard two-way ANOVA form, as shown in Table 8. When the factorial is not complete, the corresponding lines need to be deleted in the design matrices.

Table 8. Coding of design matrices for general combining ability (GCA) and specific combining ability (SCA) effects in Eq. [27].

Parental marker genotype     Covariates in design matrices for one marker
Pool 1      Pool 2           GCA: Z_1    GCA: Z_2    SCA: Z_3
A_1         A_1              -1          -1          +1
A_1         A_2              -1          +1          -1
A_2         A_1              +1          -1          -1
A_2         A_2              +1          +1          +1
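As a quick illustration of this coding (hypothetical values, not data from the paper), the SCA covariate in Z_3 is simply the product of the two GCA covariates for the same marker: a hybrid whose pool-1 parent is A_1A_1 (covariate -1) and whose pool-2 parent is A_2A_2 (covariate +1) has SCA covariate (-1)(+1) = -1, in agreement with the second row of Table 8.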
Apart from variable selection, this model is essentially the factorial regression model proposed by Charcosset et al. (1998), who take regression coefficients as fixed. If instead we assume independent sampling from normal distributions according to u_rk ~ N(0, σ²_ur) (r = 1, 2, 3), we have a ridge regression equivalent of factorial regression. The resulting variance–covariance structure is

var(g|Z) = σ²_u1 Z_1Z_1' ⊗ J_P2 + σ²_u2 J_P1 ⊗ Z_2Z_2' + σ²_u3 Z_3Z_3'    [28]
An alternative variance–covariance model, more akin to Bernardo's (1993, 1994) approach, is

var(g|Z) = σ²_u1 Z_1Z_1' ⊗ J_P2 + σ²_u2 J_P1 ⊗ Z_2Z_2' + σ²_u3 Z_1Z_1' ⊗ Z_2Z_2'    [29]
The model is easily generalized as

var(g|Z) = σ²_u1 Γ_1 ⊗ J_P2 + σ²_u2 J_P1 ⊗ Γ_2 + σ²_u3 Γ_1 ⊗ Γ_2    [30]

where Γ_r (r = 1, 2) is chosen according to some spatial model in terms of Z_r. The term Γ_1 ⊗ Γ_2 is equivalent to a separable two-dimensional spatial process (Martin, 1979), the dimensions corresponding to genetic distance of hybrid parents in the two pools.
When Γ_1 and Γ_2 are computed from coefficients of coancestry of each hybrid's inbred parents in the two pools, we have Bernardo's (1993, 1994) approach. Alternatively, Γ_1 and Γ_2 can have any of the spatial structures proposed in the present paper, based on the genetic distance of parents in both heterotic pools, giving rise to a host of alternative methods. Note that when we apply ridge regression, Γ_r (r = 1, 2) may be any positive definite linear function Γ_r = a_r J_Pr + b_r S_r, where S_r is the matrix of simple matching coefficients of hybrid parents in the rth pool. This is because the variance for the SCA effects, (a_1 J_P1 + b_1 S_1) ⊗ (a_2 J_P2 + b_2 S_2), equals a_1 a_2 J_P1 ⊗ J_P2 + a_1 b_2 J_P1 ⊗ S_2 + a_2 b_1 S_1 ⊗ J_P2 + b_1 b_2 S_1 ⊗ S_2, where the first term on the right-hand side is confounded with the intercept and the second and third terms are confounded with the GCA effects. Finally, it should be stressed that the variance terms for GCA and SCA effects may be extended by polygenic terms to account for residual effects not captured by markers.
In the case of multi-allelic markers, or when haplotypes are used (Calus et al., 2008), there are different, essentially equivalent options for extending the model. The following discussion is restricted to additive effects, but the same principles apply to the coding of effects for dominance and epistasis (Xu and Jia, 2007). The starting point is to assume that each allele has an additive effect drawn from the same normal distribution. Let v_qk denote the additive effect of the qth allele (q = 1, ..., Q_k) of the kth marker and x_iqk the corresponding dummy variable counting the number of copies of the qth allele of the kth marker for the ith genotype. Let v_k = (v_1k, v_2k, ..., v_Q_k k)'. The contribution of the kth marker to the genotypic value is X_k v_k, where X_k = {x_iqk}. Assuming that entries in v_k are identically and independently normally distributed with zero mean and variance σ²_v, we have

var(X_k v_k) = σ²_v X_k X_k'    [31]
We might impose a sum-to-zero restriction, replacing v_k with w_k = (I_Qk - Q_k^{-1} J_Qk) v_k. It is found that

var(X_k w_k) = σ²_v X_k (I_Qk - Q_k^{-1} J_Qk) X_k' = σ²_v (X_k X_k' - 4 Q_k^{-1} J_G)    [32]

using that each row of X_k sums to 2, the number of allele copies per genotype at a marker. The second term, involving the matrix J_G, is confounded with the intercept and so can be dropped, showing that the sum-to-zero constraint is not needed.
In the case of two alleles and inbred lines, marker k may be represented by a single covariate z_k = X_k c, where z_k = (z_1k, z_2k, ..., z_Gk)' and c = (1/2, -1/2)', such that z_ik = 1 or z_ik = -1 for inbred lines, as in Eq. [1]. In this case
var(z_k u_k) = σ²_u X_k c c' X_k' = (1/4) σ²_u X_k (2 I_2 - J_2) X_k' = (1/2) σ²_u (X_k X_k' - 2 J_G)    [33]

Again, the term in J_G may be dropped, so ridge regression as per Eq. [1] is equivalent to the parameterization with v_k. In the biallelic case, the parameterization with a single column per marker in Z is most parsimonious, but this option is not available in the multi-allelic case.
This paper has focused on marker data for predicting genotypic values. Instead of markers, or in addition to markers, expression or metabolic profile data may be used for the same purpose. In this case, for ridge regression it is important to standardize the different expression products to justify the assumption of a common variance for the regression coefficients. Similar considerations apply for any of the spatial methods proposed in this paper. If different sources are used simultaneously (markers, expression data, metabolite data), it may be prudent to fit a separate covariance model for each component in the joint model.
APPENDIX
This appendix shows how to fit the models discussed in this paper using PROC MIXED of the SAS System (Littell et al., 2006). It is assumed that markers are coded z1 to zM, while genotypes are coded by gen. The relevant RANDOM statement for the genotypic effect under the different models is given.

Ridge Regression
The model may be fitted by

random z1-zM / subject=intercept type=toep(1);

By this code each marker generates a column in the design matrix for the random effects. When the number of markers is very large, solving the mixed model equations may become computationally quite demanding. In this case, it is useful to specify the model differently. Noting that var(g) = σ²_u ZZ' is linear in Γ = ZZ', it may be advantageous to compute Γ = ZZ' explicitly before running PROC MIXED and then specify a linear structure as follows:

random gen / subject=intercept type=lin(1) ldata=gamma;

Savings in storage space and computing time required to solve the mixed model equations may be considerable when M >> G. This code requires that Γ = ZZ' be stored in a SAS dataset "gamma" according to one of two possible formats (for details see the manual). One option for a hypothetical 3 × 4 Z matrix is as given in Fig. 1 (this assumes that a SAS dataset "w" contains variables z1 to zM).

Figure 1. SAS code generating the matrix Γ = ZZ' for fitting the ridge regression model using the LIN structure in PROC MIXED.
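The original Figure 1 (apparently a DATA step with a DO loop, judging from the LV instructions below) is not reproduced in this extraction. Purely as a hedged alternative sketch, and not the author's figure, the same "gamma" dataset in the dense LDATA format (variables PARM, ROW, COL1-COLn) could be built with PROC IML for the hypothetical 3 × 4 case; the dataset and variable names follow the text above.

/* Illustrative sketch: build gamma = Z*Z' in the LDATA format expected by
   TYPE=LIN(1), assuming dataset "w" holds the 3 x 4 marker matrix in z1-z4. */
proc iml;
   use w;
   read all var {"z1" "z2" "z3" "z4"} into Z;   /* Z is G x M */
   close w;
   Gamma = Z * Z`;                              /* G x G matrix Z Z' */
   G = nrow(Gamma);
   out = j(G, 1, 1) || T(1:G) || Gamma;         /* columns: PARM, ROW, COL1-COLG */
   create gamma from out[colname={"parm" "row" "col1" "col2" "col3"}];
   append from out;
quit;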
Spatial Models
The POW, EXP (equivalent to POW), GAU, and SPH models can be fitted by these RANDOM statements:

random gen / subject=intercept type=sp(pow)(z1-zM);
random gen / subject=intercept type=sp(exp)(z1-zM);
random gen / subject=intercept type=sp(gau)(z1-zM);
random gen / subject=intercept type=sp(sph)(z1-zM);

The spatial models may have convergence problems, so it is advisable to try a number of starting values for the spatial parameters using the PARMS statement. If the residual variance var(e_i) is fixed as described in this paper, and a polygenic effect v_i is fitted in addition to a marker-dependent effect g_i, a typical call of PROC MIXED is as shown in Fig. 2. The weighting variable w contains the inverse of var(e_i), that is, of the squared standard errors of adjusted means. For background on the method of fixing var(e_i), see Piepho (1999).

Figure 2. MIXED code to fit the power model with fixed var(e_i).
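The original Figure 2 is not reproduced in this extraction. The following is only a hedged sketch of such a call, assembled from the description above; the dataset name means, trait y, 136 markers, and weight variable w are assumptions, the PARMS starting values are arbitrary, and the residual variance is held at 1 so that the weighted residual variance equals var(e_i), as in Piepho (1999).

/* Sketch only: power model for g_i plus polygenic v_i, with var(e_i) fixed
   through WEIGHT w (w = 1/var(e_i)) and the residual variance held at 1. */
proc mixed data=means method=reml;
   class gen;
   model y = / solution;
   random gen / subject=intercept type=sp(pow)(z1-z136);  /* marker-based effect g_i */
   random gen;                                             /* polygenic effect v_i */
   weight w;
   parms (0.1) (0.5) (0.1) (1) / hold=4;                   /* last parameter: residual = 1 */
run;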
Also, at times the log-likelihood changes only marginally between iterations and yet the default convergence criterion is not met. In such instances it may be useful to slightly relax the convergence criterion relative to the default value. Additionally, rescaling Z such that σ²_u is of the same order of magnitude as σ²_v may be beneficial.

For LV we use the same code to generate a linear variance–covariance matrix as for the ridge regression model, except that a*b is replaced by (a-b)**2/&m/2 and the square root is taken of col[i] after the do loop. The relevant portion that needs to be replaced in Fig. 1 is given in Fig. 3.

Figure 3. Portion of SAS code that needs to be replaced for the corresponding part in Fig. 1 to generate a matrix for fitting the linear variance model using the LIN structure in PROC MIXED.

Then, the code in Fig. 4 may be used to fit the linear variance model, assuming the variance–covariance matrix has been stored in a SAS dataset "lv." The coding of entries in "lv" ensures that the resulting variance–covariance matrix will have only nonnegative entries (Piepho et al., 2008b).

Figure 4. MIXED code to fit the linear variance model with fixed var(e_i).
Mixed Models with Heterogeneous Variance
The heterogeneous variance ridge regression model RRhet is fitted by

random z1-zM;

Acknowledgments
KWS SAAT AG is thanked for providing the maize data. Jens Möhring and Bettina Müller are thanked for carefully reading an earlier version of this paper. I am also grateful for helpful comments by two anonymous referees.
References
Bauer, A.M., T.C. Reetz, and J. Léon. 2006. Estimation of breed-
ing values of inbred lines using best linear unbiased prediction
(BLUP) and genetic similarities. Crop Sci. 46:2685–2691.
Bernardo, R. 1993. Estimation of coe cient of coancestry
using molecular markers in maize. Theor. Appl. Genet.
85:1055–1062.
Bernardo, R. 1994. Prediction of maize single-cross performance
using RFLPs and information from related hybrids. Crop Sci.
34:20–25.
Bernardo, R., and J. Yu. 2007. Prospects for genomewide selection
for quantitative traits in maize. Crop Sci. 47:1082–1090.
Calus, M.P.L., T.H.E. Meuwissen, A.P.W. deRoos, and R.F.
Veerkamp. 2008. Accuracy of genomic selection using di er-
ent methods to de ne haplotypes. Genetics 178:553561.
Calus, M.P.L., and R.F. Veerkamp. 2007. Accuracy of breed-
ing values when using and ignoring the polygenic e ect in
genomic breeding value estimation with a marker density of
one SNP per cM. J. Anim. Breed. Genet. 124:362–368.
Charcosset, A., B. Bonnisseau, O. Touchebeuf, J. Burstin, P.
Dubreuil, Y. Barriere, A. Gallais, and J.B. Denis. 1998. Pre-
diction of maize hybrid silage performance using marker data:
Comparison of several models for speci c combining ability.
Crop Sci. 38:3844.
Cogdill, R.P., and P. Dardenne. 2004. Least-squares support vec-
tor machines for chemometrics: An introduction and evalua-
tion. J. Near Infrared Spetrosc. 12:93–100.
Draper, N.R., and H. Smith. 1998. Applied regression analysis.
3rd ed. John Wiley & Sons, New York.
Gianola, D., M. Perez-Enciso, and M.E. Toro. 2003. On marker-
assisted prediction of genetic value: Beyond the
ridge. Genetics 163:347365.
Gianola, D., and J.B.C.H.M. van Kaam. 2008. Repro-
ducing kernel Hilbert spaces regression methods
for genomic assisted prediction of quantitative
traits. Genetics 178:2305–2313.
Goddard, M.E., and B.J. Hayes. 2007. Genomic selec-
tion. J. Anim. Breed. Genet. 124:323–330.
Gower, J.C. 1966. Some distance properties of latent
roots and vector methods used in multivariate
analysis. Biometrika 53:325–338.
Habier, D., R.L. Fernando, and J.C.M. Dekkers. 2007. The impact
of genetic relationship information on genome-assisted breed-
ing values. Genetics 177:2389–2397.
Henderson, C.R. 1985. Best linear unbiased prediction of non-
additive genetic merits in non-inbred populations. J. Anim.
Sci. 60:111–117.
Hoerl, A.E., and R.W. Kennard. 1970. Ridge regression: Biased esti-
mation for nonorthogonal problems. Technometrics 12:5567.
Johnson, N.L., S. Kotz, and N. Balakrishnan. 1994. Continuous
univariate distributions. Vol. 1. 2nd ed. John Wiley & Sons,
New York.
Lahiri, S.N. 2003. Resampling methods for dependent data.
Springer, New York.
Littell, R.C., G.A. Milliken, W.W. Stroup, R. Wol nger, and O.
Schabenberger. 2006. SAS for mixed models. 2nd ed. SAS
Inst., Cary, NC.
Maenhout, S., B. de Baets, G. Haesaert, and E. van Bockstaele. 2007.
Support vector machine regression for the prediction of maize
hybrid performance. Theor. Appl. Genet. 115:1003–1013.
Maenhout, S., B. de Baets, G. Haesaert, and E. van Bockstaele.
2008. Marker-based screening of maize inbred lines using
support vector machine regression. Euphytica 161:123–131.
Martin, R.J. 1979. A subclass of lattice processes applied to a prob-
lem of planar sampling. Biometrika 66:209–217.
McQuarrie, A.D.R., and C.L. Tsai. 1998. Regression and time
series model selection. World Scienti c, Singapore.
Meuwissen, T.H.E., B.J. Hayes, and M.E. Goddard. 2001. Predic-
tion of total genetic value using genome-wide dense marker
maps. Genetics 157:1819–1829.
Figure 2. MIXED code to fit the power model with fixed var(e_i).
Figure 3. Portion of SAS code that needs to be replaced for the
corresponding part in Fig. 1 to generate a matrix for fitting the
linear variance model using the LIN structure in PROC MIXED.
Figure 4. MIXED code to fit the linear variance model with fixed var(e_i).
Miller, A. 2002. Subset selection in regression. Chapman and Hall,
London.
Piepho, H.P. 1997. Analyzing genotype–environment data by mixed
models with multiplicative effects. Biometrics 53:761–766.
Piepho, H.P. 1998. Empirical best linear unbiased prediction in
cultivar trials using factor analytic variance–covariance struc-
tures. Theor. Appl. Genet. 97:195–201.
Piepho, H.P. 1999. Stability analysis using the SAS system. Agron.
J. 91:154–160.
Piepho, H.P. 2000. A mixed model approach to mapping quantita-
tive trait loci in barley on the basis of multiple environment
data. Genetics 15:253–260.
Piepho, H.P., and H.G. Gauch. 2001. Marker pair selection for
QTL detection. Genetics 157:433–444.
Piepho, H.P., and C.E. McCulloch. 2004. Transformations in
mixed models: Application to risk analysis for a multienviron-
ment trial. J. Agric. Biol. Environ. Stat. 9:123–137.
Piepho, H.P., J. Möhring, A.E. Melchinger, and A. Büchse. 2008a.
BLUP for phenotypic selection in plant breeding and variety
testing. Euphytica 161:209–228.
Piepho, H.P., C. Richter, and E. Williams. 2008b. Nearest neigh-
bour adjustment and linear variance models in plant breeding
trials. Biometrical J. 50:164–189.
Pinheiro, J.C., and D.M. Bates. 1995. Approximations to the log-
likelihood function in the nonlinear mixed effects model. J.
Comput. Graph. Stat. 4:12–35.
Reif, J.C., A.E. Melchinger, and M. Frisch. 2005. Genetical and
mathematical properties of similarity and dissimilarity coef-
ficients applied in plant breeding and seed bank management.
Crop Sci. 45:1–7.
Ruppert, D., M.P. Wand, and R.J. Carroll. 2003. Semiparametric
regression. Cambridge Univ. Press, Cambridge, UK.
Schabenberger, O., and C.A. Gotway. 2005. Statistical methods
for spatial data analysis, CRC Press, Boca Raton, FL.
Schrag, T.A., J. Möhring, H.P. Maurer, B.S. Dhillon, A.E. Melch-
inger, H.P. Piepho, A.P. Sørensen, and M. Frisch. 2009.
Molecular marker-based prediction of hybrid performance in
maize using unbalanced data from multiple experiments with
factorial crosses. Theor. Appl. Genet. 118:741–751.
Searle, S.R., G. Casella, and C.E. McCulloch. 1992. Variance
components. John Wiley & Sons, New York.
Stich, B., J. Möhring, H.P. Piepho, M. Heckenberger, E.S. Buck-
ler, and A.E. Melchinger. 2008. Comparison of mixed-model
approaches for association mapping. Genetics 178:1745–1754.
Suykens, J.A.K., T.V. Gestel, J. de Brabanter, B. de Moor, and
J. Vandewalle. 2002. Least squares support vector machines.
World Scienti c, Singapore.
Whittaker, J.C., R. Thompson, and M.C. Denham. 2000.
Marker-assisted selection using ridge regression. Genet. Res.
75:249–252.
Williams, E.R. 1986. A neighbour model for field experiments.
Biometrika 73:279–287.
Xu, S. 2003. Estimating polygenic effects using markers of the
entire genome. Genetics 163:789–801.
Xu, S., and Z. Jia. 2007. Genomewide analysis of epistatic effects
for quantitative traits in barley. Genetics 175:1955–1963.
Yu, J.M., G. Pressoir, W.H. Briggs, I.V. Bi, M. Yamasaki, J.F. Doe-
bley, M.D. McMullen, B.S. Gaut, D.M. Nielsen, J.B. Holland,
S. Kresovich, and E.S. Buckler. 2006. A unified mixed-model
method for association mapping that accounts for multiple
levels of relatedness. Nat. Genet. 38:203–208.