CROP SCIENCE, VOL. 49, JULY–AUGUST 2009 1165
RESEARCH
Ridge Regression and Extensions for Genomewide Selection in Maize
H. P. Piepho*

ABSTRACT
This paper reviews properties of ridge regression for genomewide (genomic) selection and establishes close relationships with other methods to model genetic correlation among relatives, including use of a kinship matrix and the simple matching coefficient as computed from marker data. A number of alternative models are then proposed exploiting ties between genetic correlation based on marker data and geostatistical concepts. A simple method for automatic marker selection is proposed. The methods are exemplified using a series of experiments with test-cross hybrids of maize (Zea mays L.) conducted in five environments. Results underline the need to appropriately model genotype–environment interaction and to employ an independent estimate of error. It is also shown that accounting for genetic effects not captured by markers may be important.

Institute for Crop Production and Grassland Science, Universität Hohenheim, Fruwirthstrasse 23, 70599 Stuttgart, Germany. Received 13 Oct. 2008. *Corresponding author (piepho@uni-hohenheim.de).

Abbreviations: AIC, Akaike information criterion; BLUP, best linear unbiased prediction; DH, doubled haploids; EXP, exponential model; FA, factor-analytic; GAU, Gaussian model; GCA, general combining ability; GS, genome-wide (genomic) selection; LS, least squares; LV, linear variance; POW, power model; REML, restricted maximum likelihood; $\mathrm{RR_{het}}$, ridge regression with heterogeneous variance among markers; $\mathrm{RR_{hom}}$, ordinary ridge regression; $\mathrm{RR_{hom2}}$, ridge regression with reduced set of markers; SCA, specific combining ability; SPH, spherical model; SVM, support vector machine.

Published in Crop Sci. 49:1165–1176 (2009).
doi: 10.2135/cropsci2008.10.0595
© Crop Science Society of America
677 S. Segoe Rd., Madison, WI 53711 USA
All rights reserved. No part of this periodical may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Permission for printing and for reprinting the material contained herein has been obtained by the publisher.

Genomewide or genomic selection (GS) is a marker-based method for estimating genotypic values without prescreening of markers by significance testing or other subset selection procedures (Whittaker et al., 2000; Meuwissen et al., 2001). Most applications so far have been in animal breeding (Meuwissen et al., 2001; Goddard and Hayes, 2007), but the method is rapidly becoming popular in plant breeding (Bernardo and Yu, 2007). With the development of high-throughput marker technologies, interest in such statistical methods is expected to increase further in the near future.
The key idea is to predict the genotypic value of the ith genotype (i = 1, …, G), denoted as $g_i$, using all available markers. One option is to use a regression model with the form
$$g_i = \sum_{k=1}^{M} u_k z_{ik} \qquad [1]$$
where $z_{ik}$ is a regressor variable for the ith genotype and kth marker, while $u_k$ (k = 1, …, M) are regression coefficients. Typically, for a biallelic marker with alleles $A_1$ and $A_2$, we define $z_{ik} = 1$ for $A_1A_1$, $z_{ik} = -1$ for $A_2A_2$, and $z_{ik} = 0$ for $A_1A_2$ or when the marker genotype is missing. The linear model from Eq. [1] can be written in matrix form as
$$\mathbf{g} = \mathbf{Z}\mathbf{u} \qquad [2]$$
where $\mathbf{g}' = (g_1, g_2, \ldots, g_G)$, $\mathbf{Z} = \{z_{ik}\}$, and $\mathbf{u}' = (u_1, u_2, \ldots, u_M)$.
When there is a single (mean-centered) observation $y_i$ per genotype with independent residual errors $e_i$ having zero mean and variance $\sigma_e^2$, the model for observed data is $\mathbf{y} = \mathbf{Z}\mathbf{u} + \mathbf{e}$, where $\mathbf{y}' = (y_1, y_2, \ldots, y_G)$ and $\mathbf{e}' = (e_1, e_2, \ldots, e_G)$. The classical least squares estimator, $\hat{\mathbf{u}} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y}$, minimizes the sum of squares given by $\|\mathbf{y} - \mathbf{Z}\mathbf{u}\|^2$, with $\|\cdot\|$ denoting the length of a vector. This estimator is well known to perform poorly when the number of markers (M) is large relative to the number of genotypes (G), and using all markers is impossible when M > G, which is expected to be increasingly the case with high-density marker systems. Selecting a subset of markers by one of the common variable selection methods (forward selection, stepwise regression, etc.; Miller, 2002) is a possible alternative often used in marker-assisted selection programs (Bernardo and Yu, 2007), but the performance of these methods is likely to deteriorate when there are many, possibly highly correlated, markers (Whittaker et al., 2000).
There are many regularization methods addressing the problem of large M, which avoid the selection problem essentially by keeping all markers in the model. One of these methods is ridge regression (Hoerl and Kennard, 1970), which was first used for GS by Whittaker et al. (2000). Ridge regression minimizes the penalized sum of squares $\|\mathbf{y} - \mathbf{Z}\mathbf{u}\|^2 + \lambda^2\mathbf{u}'\mathbf{u}$, where $\lambda^2$ is a penalty parameter, yielding the estimator
$$\hat{\mathbf{u}} = (\mathbf{Z}'\mathbf{Z} + \lambda^2\mathbf{I}_M)^{-1}\mathbf{Z}'\mathbf{y}, \qquad [3]$$
where $\mathbf{I}_M$ is the M-dimensional identity matrix, matching the M × M dimension of $\mathbf{Z}'\mathbf{Z}$. The penalty term overcomes the problem of ill-conditioning when multicollinearity among columns in $\mathbf{Z}$ causes $\mathbf{Z}'\mathbf{Z}$ to be singular, or nearly so. The penalized estimator in Eq. [3] involves shrinkage, thus avoiding over-fitting, and it stabilizes estimation relative to least squares. The penalty parameter $\lambda^2$, which determines the amount of shrinkage, may be chosen in a number of ways (Draper and Smith, 1998), including cross-validation (Ruppert et al., 2003).
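The penalized estimator in Eq. [3] can be illustrated with a minimal NumPy sketch. The marker matrix, the observations, and the two penalty values below are simulated, hypothetical inputs chosen only to show the shrinkage effect when M > G; the identity matrix is sized to match Z′Z.

```python
import numpy as np

def ridge_estimate(Z, y, lam2):
    """Ridge estimate of marker effects: (Z'Z + lam2 * I)^{-1} Z'y."""
    M = Z.shape[1]                      # number of markers (columns of Z)
    return np.linalg.solve(Z.T @ Z + lam2 * np.eye(M), Z.T @ y)

rng = np.random.default_rng(1)
G, M = 20, 50                           # more markers than genotypes
Z = rng.choice([-1.0, 1.0], size=(G, M))  # hypothetical marker scores
y = rng.normal(size=G)
y = y - y.mean()                        # mean-centered observations

# Z'Z is singular here (rank <= G < M), so least squares is not available,
# but the penalized system can still be solved for any lam2 > 0.
u_small = ridge_estimate(Z, y, lam2=1.0)
u_large = ridge_estimate(Z, y, lam2=100.0)

# A larger penalty shrinks the estimated effects more strongly toward zero.
print(np.linalg.norm(u_small), np.linalg.norm(u_large))
```

In practice the penalty would be chosen by cross-validation or, as discussed next, from estimated variance components rather than fixed by hand.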
One particular method, which has been used by Meuwissen et al. (2001) for GS, assumes that regression coefficients are independent random draws from a common normal distribution, that is,
$$u_k \sim N(0, \sigma_u^2), \quad (k = 1, \ldots, M) \qquad [4]$$
Under this model, we have $\lambda^2 = \sigma_e^2/\sigma_u^2$, where $\sigma_e^2$ is the residual variance (Draper and Smith, 1998), and the penalized estimator in Eq. [3] turns out to be equivalent to best linear unbiased prediction (BLUP) of $\mathbf{u}$ (Ruppert et al., 2003). This method has also been used by Bernardo and Yu (2007), who found it, based on a simulation study, to perform well compared to subset selection, in which markers were selected per chromosome by backward elimination with relaxed significance thresholds. One advantage of the mixed model formulation of ridge regression is that we can estimate the variance components, and hence the penalty, in a straightforward way by restricted maximum likelihood (REML) (Ruppert et al., 2003). Furthermore, it is possible to account for other sources of variation by adding fixed and random effects.
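The equivalence between the ridge estimator and BLUP can be checked numerically. The sketch below assumes known variance components (hypothetical values; in practice REML would estimate them) and mean-centered data, and compares the ridge form with the mixed model form $\hat{\mathbf{u}} = \sigma_u^2\mathbf{Z}'\mathbf{V}^{-1}\mathbf{y}$, where $\mathbf{V} = \sigma_u^2\mathbf{Z}\mathbf{Z}' + \sigma_e^2\mathbf{I}$.

```python
import numpy as np

rng = np.random.default_rng(2)
G, M = 15, 40
Z = rng.choice([-1.0, 1.0], size=(G, M))  # hypothetical marker scores
y = rng.normal(size=G)                    # treated as mean-centered data

sigma2_u, sigma2_e = 0.5, 2.0             # assumed known for illustration
lam2 = sigma2_e / sigma2_u                # penalty implied by the mixed model

# Ridge form: solve an M x M system.
u_ridge = np.linalg.solve(Z.T @ Z + lam2 * np.eye(M), Z.T @ y)

# BLUP form: solve a G x G system involving the marginal variance of y.
V = sigma2_u * Z @ Z.T + sigma2_e * np.eye(G)
u_blup = sigma2_u * Z.T @ np.linalg.solve(V, y)

print(np.max(np.abs(u_ridge - u_blup)))   # agreement up to rounding error
```

The BLUP form only requires a G × G solve, which is convenient when the number of markers greatly exceeds the number of genotypes.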
The present paper briefly reviews some of the features of ridge regression as performed in a mixed model framework using REML, giving particular emphasis to its similarity with spatial models. I then outline some alternative models for GS, including spatial models and ridge regression with heterogeneous variances. The exposition emphasizes non-Bayesian implementations of methods that are mostly treated in a Bayesian framework, mainly in the animal breeding literature. Equivalence relations of perhaps seemingly different methods are discussed. I give some hints on how these models can be fitted using standard mixed model software and illustrate them using a dataset from a breeding program in maize (Zea mays L.). The importance of accounting for both genotype–environment interaction and for polygenic effects not captured by markers is highlighted.
MATERIAL AND METHODS
The total genotypic effect will be partitioned into a component explained by the markers ($g_i$) and a polygenic component ($v_i$) not captured by the markers. Thus, the total genotypic effect $h_i$ is
$$h_i = g_i + v_i \qquad [5]$$
Our main objective is to estimate $h_i$. It is assumed throughout that $g_i$ and $v_i$ are independent of one another. It is important to account for residual polygenic effects $v_i$ to avoid over-fitting (Goddard and Hayes, 2007). In case of a single unstructured population, for example a population of doubled haploid (DH) lines generated from a single cross, we have
$$\mathrm{var}(\mathbf{v}) = \sigma_v^2\mathbf{I}_G \qquad [6]$$
where $\mathbf{v}' = (v_1, v_2, \ldots, v_G)$. For structured populations, $\mathrm{var}(\mathbf{v})$ may involve covariances among relatives (Piepho et al., 2008a).
We will consider different models for $\mathbf{g}' = (g_1, g_2, \ldots, g_G)$, conditionally on the markers $\mathbf{Z} = (z_{ik})$. All conditional models will be of the form
$$\mathrm{var}(\mathbf{g} \mid \mathbf{Z}) = \sigma_u^2\Gamma \qquad [7]$$
for some matrix $\Gamma$ that is a function of $\mathbf{Z}$. In Eq. [7] and later in the paper, the expression on which we condition ($\mathbf{Z}$ in this case) is given following a vertical bar.
Ridge Regression and Related Models
Under the mixed model formulation of ridge regression in Eq.
[4], we have Γ = ZZ′, so that the genotypic variance–covariance
difficulty with this approach is that often the number of progeny per cross is rather limited, making variance estimates rather unreliable. This problem can be overcome by ridge regression, where heterogeneity among crosses is modeled by a single variance component $\sigma_u^2$, because heterogeneity between crosses is represented by the structure of $\mathbf{Z}\mathbf{Z}'$.
In animal breeding very accurate BLUPs are often available for sires and dams due to extensive records on progeny, accounting for half the additive genetic variance, so focusing GS on Mendelian within-family sampling has been suggested (H. Simianer, personal communication, 2008). This approach is applicable in plant breeding as well, though BLUPs of parents are usually less reliable than in animal breeding programs. In a diallel crossing scheme, this means replacing $\mathbf{Z}$ with $\tilde{\mathbf{Z}} = \mathbf{Z} - \mathbf{Z}_{F1}$, where $\mathbf{Z}_{F1}$ is the marker data of the $F_1$ generation corresponding to the underlying crosses, and assuming conditional independence between crosses for the marker-dependent part of the model; that is, $\Gamma$ is block-diagonal with blocks $\tilde{\mathbf{Z}}_c\tilde{\mathbf{Z}}_c'$, where $\tilde{\mathbf{Z}}_c$ is the submatrix of $\tilde{\mathbf{Z}}$ corresponding to the cth cross.
Simple Spatial Mixed Models
The property of a genetic covariance depending on similarity of marker profiles brings to mind a host of alternatives such as geostatistical methods, where covariance depends on spatial proximity. Thus, replacing spatial with genetic distance, spatial methods can be used to model genetic correlation (Piepho et al., 2008a). Ridge regression may, in fact, be regarded as one type of spatial model, as will be elaborated after a brief outline of spatial models as applied to marker data.
If marker scores $z_{ik}$ are regarded as coordinates of genotypes in M-dimensional marker space, covariance can be modeled as a linear or nonlinear function of distance in that space. Thus, the covariance is expressed as
$$\Gamma = [f(d_{ii'})] \qquad [11]$$
where $d_{ii'}$ is the Euclidean distance of genotypes i and i′, defined as $d_{ii'} = \|\mathbf{z}_i - \mathbf{z}_{i'}\|$, with $\mathbf{z}_i'$ equal to the ith row of $\mathbf{Z}$, and f(d) is some monotonically decreasing function of d. There are different options for the function f(d), including those shown in Table 1 (Schabenberger and Gotway, 2005).
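As a rough illustration of Eq. [11], the following sketch computes the Euclidean distance matrix from a hypothetical marker matrix and applies the covariance functions listed in Table 1. The function names and the short model codes are illustrative choices, and setting the spherical model to zero for d ≥ θ follows the usual geostatistical convention rather than anything stated in the table.

```python
import numpy as np

def euclidean_distances(Z):
    """Pairwise Euclidean distances between rows (genotypes) of Z."""
    sq = np.sum(Z**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.sqrt(np.maximum(d2, 0.0))   # guard against tiny negative values

def covariance(d, model, theta):
    """Covariance functions f(d) of Table 1, applied elementwise."""
    if model == "GAU":
        return np.exp(-d**2 / theta)
    if model == "POW":
        return theta**d
    if model == "EXP":
        return np.exp(-d / theta)
    if model == "SPH":
        # Assumed convention: zero covariance beyond the range theta.
        inside = 1.0 - 1.5 * d / theta + 0.5 * (d / theta)**3
        return np.where(d < theta, inside, 0.0)
    if model == "LV":
        return 1.0 - theta * d
    if model == "QUAD":
        return 1.0 - theta * d**2
    raise ValueError(f"unknown model: {model}")

rng = np.random.default_rng(3)
Z = rng.choice([-1.0, 1.0], size=(6, 10))  # hypothetical marker scores
d = euclidean_distances(Z)
Gamma_gau = covariance(d, "GAU", theta=20.0)

# POW is a reparameterization of EXP: theta_pow = exp(-1 / theta_exp).
Gamma_exp = covariance(d, "EXP", theta=5.0)
Gamma_pow = covariance(d, "POW", theta=np.exp(-1.0 / 5.0))
```

Whether a given θ yields a positive definite Γ for the linear and quadratic forms depends on the data, so a fitted model would need that check.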
The first four models in Table 1 are commonly used as spatial covariance structures in mixed model packages (Littell et al., 2006). The power model (POW) is just a re-parameterization of the exponential model (EXP). It is worth fitting both models, however, because convergence behavior may differ. The linear variance (LV) model was proposed by Williams (1986) in the context of blocked field experiments. In fitting this model, precautions must be taken to ensure that the resulting variance–covariance matrix remains positive definite (Piepho et al., 2008b). An advantage of the model compared to the first four nonlinear spatial models shown in Table 1 is parsimony, because $\sigma_u^2\Gamma = \sigma_u^2\mathbf{J}_G - \sigma_u^2\theta\{d_{ii'}\}$, where the first term on the right-hand side is confounded with the fixed intercept and $\theta$ is a parameter as defined in Table 1. Thus, the only free parameter to be estimated is $\phi = \sigma_u^2\theta$.
The quadratic model is not commonly used in spatial statistics, but is considered here to illustrate a close relation between ridge regression and spatial models. It is worth pointing out that the quadratic model can be regarded as a first order
structure is linear with covariance of two genotypes depending on similarity in their marker profiles. Another well-known model, in which covariance is a linear function of genetic similarity, is given by $\Gamma = 2\mathbf{A}$, where $\mathbf{A}$ is the numerator relationship matrix computed either from pedigree records or from marker data (Henderson, 1985). A further model is $\Gamma = 2\mathbf{K}$, where $\mathbf{K}$ is the kinship matrix estimated from the markers (Yu et al., 2006). When $\mathbf{A}$ or $\mathbf{K}$ is estimated from marker data, covariance of two genotypes is again a linear function of similarity between marker profiles, so in this sense use of the $\mathbf{A}$ or $\mathbf{K}$ matrix estimated from markers is an early form of GS. In fact, under some circumstances, use of the kinship matrix $\mathbf{K}$ and ridge regression are equivalent, as will be shown below.
As ridge regression implies a genetic covariance among genotypes, the question may be posed whether this model is commensurate with an analysis that ignores marker information and is based on the assumption of independent genotypes with constant variance. This question is relevant for the two-stage analysis of multi-environment data to be discussed later in this paper. The answer is yes in a simple unstructured population, for example a population of DH or recombinant inbred lines originating from a single cross of inbred lines (Bernardo and Yu, 2007). To see this, we may evaluate Eq. [2] in two ways: (i) conditioning on the markers (ridge regression) and (ii) not conditioning on markers. The variance conditioning on markers is $\mathrm{var}(\mathbf{g} \mid \mathbf{Z}) = \sigma_u^2\mathbf{Z}\mathbf{Z}'$. The unconditional variance can be derived from a general result on moments of joint random variables (Searle et al., 1992:461):
$$E(\mathbf{g}) = E_Z E(\mathbf{g} \mid \mathbf{Z}) = \mathbf{0} \qquad [8]$$
$$\mathrm{var}(\mathbf{g}) = E_Z \mathrm{var}(\mathbf{g} \mid \mathbf{Z}) + \mathrm{var}_Z E(\mathbf{g} \mid \mathbf{Z}) = \sigma_u^2 E_Z(\mathbf{Z}\mathbf{Z}') \qquad [9]$$
with $E_Z$ and $\mathrm{var}_Z$ representing the expectation and variance over $\mathbf{Z}$; the second term in Eq. [9] vanishes because $E(\mathbf{g} \mid \mathbf{Z}) = \mathbf{0}$. In a DH population derived from a single cross we have
$$E_Z(\mathbf{Z}\mathbf{Z}') = M[p\,\mathbf{I}_G + (1 - p)\mathbf{J}_G] \qquad [10]$$
where p is the probability that a marker is segregating in the underlying cross and $\mathbf{J}_G$ is a G × G matrix of ones. The term in $\mathbf{J}_G$ is confounded with the fixed intercept and so may be dropped from the model (Piepho et al., 2008b). Hence, when ignoring marker information, it is valid to assume independent and identically distributed genotypic effects; that is, $\mathrm{var}(\mathbf{g}) = \sigma_g^2\mathbf{I}_G$.
Equation [10] shows that $\hat{\sigma}_u^2 = M^{-1}\hat{\sigma}_g^2$ provides a reasonable estimate of $\sigma_u^2$ when p = 1 and $\sigma_v^2 = 0$ (Bernardo and Yu, 2007). It may be preferable, however, to estimate $\sigma_u^2$ directly by REML based on the ridge regression model in Eq. [4], because this allows accounting for $\sigma_v^2$, as will be shown further below, and because this caters for the case p < 1.
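Equation [10] can be checked by simulation under a simple, assumed sampling model for a DH population: each marker segregates with probability p; a segregating marker takes the value +1 or −1 with equal probability independently for each line, whereas a non-segregating marker takes the same value in all lines. The function name and all settings below are hypothetical.

```python
import numpy as np

def simulate_dh_markers(G, M, p, rng):
    """Simulate marker scores for G DH lines from one cross (illustrative model)."""
    segregating = rng.random(M) < p
    Z = rng.choice([-1.0, 1.0], size=(G, M))      # independent alleles per line
    common = rng.choice([-1.0, 1.0], size=M)      # shared allele if not segregating
    Z[:, ~segregating] = common[~segregating]
    return Z

rng = np.random.default_rng(4)
G, M, p, n_rep = 4, 50, 0.7, 4000

# Monte Carlo estimate of E_Z(ZZ').
mean_ZZt = np.zeros((G, G))
for _ in range(n_rep):
    Z = simulate_dh_markers(G, M, p, rng)
    mean_ZZt += Z @ Z.T / n_rep

# Right-hand side of Eq. [10].
expected = M * (p * np.eye(G) + (1 - p) * np.ones((G, G)))
max_abs_error = np.max(np.abs(mean_ZZt - expected))
print(max_abs_error)   # small relative to M, up to Monte Carlo error
```

The diagonal of ZZ′ equals M exactly because every score is ±1, while the off-diagonal elements average to M(1 − p).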
In structured populations, Eq. [10] does not usually hold, but when the structure is simple, a parsimonious unconditional random effects model may be obtained for $\mathrm{var}(\mathbf{g})$. For example, in a diallel crossing scheme of several inbred lines from the same population, where each cross produces a family of DH lines, the model for $\mathrm{var}(\mathbf{g})$ comprises random effects for general combining ability (GCA) and specific combining ability (SCA) as well as for lines within crosses (Piepho et al., 2008b). Generally, p in Eq. [10] will vary between crosses, resulting in heterogeneity of variance between crosses. This heterogeneity can be accounted for, in principle, in a nonmarker model by assigning a different variance to each cross. The practical
Taylor approximation of the Gaussian model (GAU), because $\exp(-d^2/\theta) \approx 1 - d^2/\theta$ when $d^2/\theta$ is not far from zero. To study the properties of the quadratic model, note that the squared Euclidean distance of genotypes i and i′ can be expressed as
$$d_{ii'}^2 = \|\mathbf{z}_i - \mathbf{z}_{i'}\|^2 = \sum_{k=1}^{M}(z_{ik} - z_{i'k})^2 = \sum_{k=1}^{M}\left(z_{ik}^2 - 2z_{ik}z_{i'k} + z_{i'k}^2\right) \qquad [12]$$
If all genotypes and all markers have either $z_{ik} = 1$ or $z_{ik} = -1$, which happens, for example, for recombinant inbred lines or DH lines, we have $(z_{ik} - z_{i'k})^2 = 0$ when $z_{ik} = z_{i'k}$ (genotypes have identical alleles) and $(z_{ik} - z_{i'k})^2 = 4$ when $z_{ik} = -z_{i'k}$ (genotypes have opposite alleles). Also, $z_{ik}^2 = 1$ for all i and k. Thus, if $s_{ii'}$ denotes the simple matching coefficient of genotypes i and i′, that is, the proportion of markers identical in state, we have
$$d_{ii'}^2 = 4M(1 - s_{ii'}) = 2\left(M - \sum_{k=1}^{M} z_{ik}z_{i'k}\right) \qquad [13]$$
so that the squared distance matrix has the form
$$\mathbf{D}_{sq} = \{d_{ii'}^2\} = 4M(\mathbf{J}_G - \mathbf{S}) = 2(M\mathbf{J}_G - \mathbf{Z}\mathbf{Z}') \qquad [14]$$
where $\mathbf{S} = \{s_{ii'}\}$. The key observation here is that the matrices $\mathbf{D}_{sq}$, $\mathbf{S}$, and $\mathbf{Z}\mathbf{Z}'$ are all linear shift-scale transformations of one another; that is, either one can be obtained from any of the other two by (i) multiplication with a constant and (ii) addition of a constant term times the matrix $\mathbf{J}_G$. For example, if we fit a quadratic spatial model of the form $f(d) = 1 - \theta d^2$, the variance–covariance matrix is
$$\mathrm{var}(\mathbf{g} \mid \mathbf{Z}) = \sigma_u^2(\mathbf{J}_G - \theta\mathbf{D}_{sq}) = \sigma_u^2[\mathbf{J}_G - \theta(2M\mathbf{J}_G - 2\mathbf{Z}\mathbf{Z}')] = \alpha_1\mathbf{J}_G + \alpha_2\mathbf{Z}\mathbf{Z}' \qquad [15]$$
where $\alpha_1 = (1 - 2M\theta)\sigma_u^2$ and $\alpha_2 = 2\sigma_u^2\theta$. The term in $\mathbf{J}_G$ is confounded with the (fixed) intercept and can therefore be dropped from the model (Piepho et al., 2008b). It emerges that the quadratic spatial model is equivalent to ridge regression. In other words, ridge regression can be seen as a special type of spatial model, in which covariance is a quadratic function of Euclidean distance. In the same vein, we may obtain an equivalent fit by just using $\mathrm{var}(\mathbf{g} \mid \mathbf{Z}) = \sigma_u^2\mathbf{S}$, which has also been proposed in a plant breeding context to model genetic correlation among relatives (Bauer et al., 2006).
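The shift-scale relations in Eq. [14] and [15] are easy to verify numerically for markers coded ±1. The sketch below uses a small simulated marker matrix and arbitrary, hypothetical values for $\sigma_u^2$ and θ.

```python
import numpy as np

rng = np.random.default_rng(5)
G, M = 8, 30
Z = rng.choice([-1.0, 1.0], size=(G, M))   # hypothetical DH marker scores
J = np.ones((G, G))

# Squared Euclidean distances computed directly from the definition.
diff = Z[:, None, :] - Z[None, :, :]
D_sq = np.sum(diff**2, axis=2)

# Simple matching coefficient: proportion of markers identical in state.
S = np.mean(Z[:, None, :] == Z[None, :, :], axis=2)

ZZt = Z @ Z.T

# Eq. [14]: D_sq = 4M(J - S) = 2(MJ - ZZ').
lhs_14a = 4 * M * (J - S)
lhs_14b = 2 * (M * J - ZZt)

# Eq. [15]: the quadratic spatial model is a shift-scale transform of ZZ'.
sigma2_u, theta = 1.5, 0.01                # arbitrary illustrative values
var_quadratic = sigma2_u * (J - theta * D_sq)
alpha1 = (1 - 2 * M * theta) * sigma2_u
alpha2 = 2 * sigma2_u * theta
var_ridge_form = alpha1 * J + alpha2 * ZZt

print(np.allclose(D_sq, lhs_14a), np.allclose(var_quadratic, var_ridge_form))
```

Because the J term is absorbed by the intercept, only the coefficient of ZZ′ matters for the fitted genotypic effects.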
On a similar note, in association mapping, the residual genotypic effect not accounted for by regression on a candidate marker can be modeled as $\mathrm{var}(\mathbf{g} \mid \mathbf{Z}) = 2\sigma_u^2\mathbf{K}$, where $\mathbf{K}$ is a kinship matrix of the form
$$\mathbf{K} = a\mathbf{S} + b\mathbf{J}_G \qquad [16]$$
with $a = (1 - c)^{-1}$, $b = -c(1 - c)^{-1}$, and c equal to the average probability of identity in state for genes coming from random individuals in the population (Yu et al., 2006; Stich et al., 2008). Again, this is just a shift-scale transformation of $\mathbf{S}$, and the term $b\mathbf{J}_G$ may be dropped as it is confounded with the intercept. So the use of $\mathbf{K}$ for inbred lines is equivalent to ridge regression and to a quadratic spatial model. Also, using some other similarity measure such as the Jaccard or Dice coefficient (for dominant marker systems) in place of simple matching (Yu et al., 2006) is seen to be very similar to use of the kinship matrix.
All spatial models considered so far employ the Euclidean distance. Mixed model software usually requires the coordinates (markers) as input to compute the Euclidean distance. Other distances (Reif et al., 2005) may be fitted within a mixed model, if a representation in Euclidean space, or some approximation thereof, is found by principal coordinate analysis (Gower, 1966), but this is not elaborated here (Piepho et al., 2008a).
Based on spatial models, BLUPs can be computed for any point in the space spanned by the markers. This includes the genotypes tested as well as potentially other genotypes, which have been genotyped, but for which no phenotypic data are available. BLUP based on spatial models is equivalent to Kriging in spatial statistics (Ruppert et al., 2003). Thus, the BLUP for a genotype constitutes an interpolation of the genotypic value based on the genotype's own data and that of the other genotypes, with the impact of a tested genotype on the prediction of another genotype depending on genetic proximity. It is also worth noting that adding a polygenic component $\sigma_v^2\mathbf{I}_G$ is equivalent to adding a nugget effect that accounts for residual measurement error in spatial models (Schabenberger and Gotway, 2005).
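The Kriging interpretation can be sketched as follows for the Gaussian model, assuming known variance parameters and a common intercept estimated by generalized least squares. All names and parameter values are hypothetical; a real analysis would estimate the parameters by REML.

```python
import numpy as np

def gaussian_cov(Za, Zb, theta):
    """Gaussian covariance exp(-d^2/theta) between rows of Za and rows of Zb."""
    d2 = (np.sum(Za**2, axis=1)[:, None] + np.sum(Zb**2, axis=1)[None, :]
          - 2.0 * Za @ Zb.T)
    return np.exp(-np.maximum(d2, 0.0) / theta)

def blup_predict(Z_obs, y, Z_new, sigma2_u, sigma2_e, theta):
    """Predict intercept + genotypic value at marker profiles Z_new (Kriging)."""
    n = len(y)
    # Marginal variance of the data; sigma2_e plays the role of a nugget.
    V = sigma2_u * gaussian_cov(Z_obs, Z_obs, theta) + sigma2_e * np.eye(n)
    ones = np.ones(n)
    # Generalized least squares estimate of the intercept.
    mu = (ones @ np.linalg.solve(V, y)) / (ones @ np.linalg.solve(V, ones))
    # Covariance between new and tested genotypes drives the interpolation.
    C = sigma2_u * gaussian_cov(Z_new, Z_obs, theta)
    return mu + C @ np.linalg.solve(V, y - mu * ones)

rng = np.random.default_rng(6)
Z_obs = rng.choice([-1.0, 1.0], size=(10, 20))   # tested genotypes
Z_new = rng.choice([-1.0, 1.0], size=(3, 20))    # genotyped but untested
y = rng.normal(loc=5.0, size=10)

pred_new = blup_predict(Z_obs, y, Z_new, sigma2_u=1.0, sigma2_e=0.5, theta=40.0)

# With a negligible nugget, predictions at tested profiles reproduce the data.
pred_interp = blup_predict(Z_obs, y, Z_obs, sigma2_u=1.0, sigma2_e=1e-10, theta=40.0)
```

With a non-negligible nugget the predictions at tested genotypes are shrunken versions of the observations rather than exact interpolations.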
BLUP based on GAU also bears an intimate relationship with least squares support vector machine (LS-SVM) regression (Suykens et al., 2002, p. 106–107), when a Gaussian kernel is used, as is common in chemometric applications such as near infrared spectroscopy (Cogdill and Dardenne, 2004). For details on the analogies of LS-SVM regression and Gaussian processes see Suykens et al. (2002). Thus, inasmuch as ridge regression may be seen as an approximation to GAU, it is also an approximation of LS-SVM regression. SVM is another class of regularization methods, which have recently been applied to the problem of hybrid prediction in plant breeding (Maenhout et al., 2007, 2008). Finally, GAU as applied to markers is essentially equivalent to reproducing kernel Hilbert spaces regression as proposed by Gianola and van Kaam (2008).
Table 1. Genotypic covariance models of the form $\Gamma = \{f(d_{ii'})\}$, where d is the Euclidean distance computed from marker data and $\theta$ is a parameter.

Name                  Equation
Gaussian              $f(d) = \exp(-d^2/\theta)$
Power (exponential)   $f(d) = \theta^d$
Exponential           $f(d) = \exp(-d/\theta)$
Spherical             $f(d) = 1 - \frac{3d}{2\theta} + \frac{d^3}{2\theta^3}$, $(d < \theta)$
Linear                $f(d) = 1 - \theta d$
Quadratic             $f(d) = 1 - \theta d^2$

Mixed Models with Heterogeneous Variance
The BayesA and BayesB methods of Meuwissen et al. (2001) assume that the ridge regression model is extended such that each marker has its own variance. Thus, under both approaches the regression model is
$$g_i = \sum_{k=1}^{M}\sqrt{\sigma_k^2}\,t_k z_{ik} \qquad [17]$$
where $t_k \sim N(0, 1)$ and $\sigma_k^2$ is the variance for the kth marker. Under both the BayesA and BayesB models, a prior distribution is assumed for the variances $\sigma_k^2$. The regression coefficient under these models has the form $u_k = \sqrt{\sigma_k^2}\,t_k$. To relate these models to ridge regression, it is important to recognize that the variance $\sigma_k^2$ essentially is just another marker-specific random effect. As $t_k$ is standard normal, the random regression coefficient $u_k = \sqrt{\sigma_k^2}\,t_k$ will have a symmetric nonnormal marginal distribution whose specific form depends on the assumed prior for $\sigma_k^2$. Clearly, this marginal distribution for $u_k$ (not conditioning on $\sigma_k^2$) has a constant variance, and so the only difference to ridge regression is that nonnormality holds for $u_k$.
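The point that the marginal distribution of $u_k$ has constant variance but is nonnormal can be illustrated by simulation. The sketch below draws the marker variances from an exponential distribution with mean 1; this prior is a hypothetical stand-in chosen for numerical convenience and is not the prior used in BayesA or BayesB.

```python
import numpy as np

def kurtosis(x):
    """Sample kurtosis (equals 3 for a normal distribution)."""
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2)**2

rng = np.random.default_rng(7)
n = 200_000

# Hypothetical prior for the marker variances (assumption): exponential, mean 1.
sigma2_k = rng.exponential(scale=1.0, size=n)
t_k = rng.standard_normal(n)

# Marginal draws of the regression coefficient u_k = sqrt(sigma2_k) * t_k.
u_k = np.sqrt(sigma2_k) * t_k

# The marginal variance is constant (the prior mean of sigma2_k), but the
# distribution is more peaked and heavier-tailed than the normal.
print(u_k.var(), kurtosis(u_k), kurtosis(t_k))
```

Under this particular prior the marginal distribution is a Laplace distribution, which makes the excess kurtosis easy to see; other priors give different nonnormal shapes.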
These properties of Bayesian models suggest that in a non-Bayesian framework we may consider fitting nonnormal distributions to $u_k$, which closely mimic the BayesA and BayesB model fits. A nonnormal distribution with stronger peaks and heavier tails may be more realistic than a normal distribution when most of the markers have a very small effect. One convenient class of distributions is available via the Johnson $S_U$ system of transformed normal random variables (Johnson et al., 1994; Piepho and McCulloch, 2004). I found this type of nonlinear mixed model difficult to fit, however, using adaptive Gaussian quadrature (Pinheiro and Bates, 1995). When fitting these models using a first-order method, corresponding to Gaussian quadrature with a single quadrature point, I obtained the same log-likelihoods with normal and nonnormal $\mathbf{u}$. This similarity is not unexpected because under either model the genotypic value is $\mathbf{Z}\mathbf{u}$, and based on the central limit theorem this linear combination is expected to be nearly normal even when $\mathbf{u}$ is nonnormal. In light of these considerations, the good performance of BayesB relative to ridge regression in Meuwissen et al. (2001) is probably at least partly due to the strong impact of the assumed prior distribution, which was derived based on the model used to simulate the data (though it did not exactly match that model; Goddard and Hayes, 2007).
A simple alternative is to fit Eq. [17] in a frequentist setting, regarding $\sigma_k^2$ as fixed parameters ($\mathrm{RR_{het}}$). A REML fit will typically yield many zero estimates, which essentially implies an automatic selection of markers. Thus, we may simply drop markers with zero variance estimates. This is similar in spirit to BayesB, where the prior for $\sigma_k^2$ has a peak at $\sigma_k^2 = 0$, which induces an automatic marker selection. For the remaining markers, we can perform a likelihood ratio test for homogeneity of variance. In case of homogeneity, the ordinary ridge regression model with homogeneous variances can be fitted with the selected markers ($\mathrm{RR_{hom2}}$). In case of heterogeneity, we may stick with $\mathrm{RR_{het}}$.
Extension of Models
to Genotype–Environment Data
When data from multiple environments are available, modeling of genotype–environment interaction requires special attention. In particular, genetic correlation between environments needs to be modeled (Piepho, 2000). In analogy to the case of a single environment, it will be assumed that the effect of the ith genotype in the jth environment (j = 1, 2, …, E), denoted as $h_{ij}$, can be partitioned into marker-based effect $g_{ij}$ and polygenic effect $v_{ij}$:
$$h_{ij} = g_{ij} + v_{ij} \qquad [18]$$
Each of the two component effects is partitioned into main effect and interaction, that is,
$$g_{ij} = g_i + f_{ij} \qquad [19]$$
and
$$v_{ij} = v_i + w_{ij} \qquad [20]$$
where $g_i$ and $v_i$ are the marker-based and polygenic main effects, while $f_{ij}$ and $w_{ij}$ are the corresponding interaction terms with the environment. Both $g_i$ and $f_{ij}$ are modeled as functions of the marker data. Our main objective is to estimate the genotypic main effect $h_i = g_i + v_i$.
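Let $\mathbf{f}_j' = (f_{1j}, f_{2j}, \ldots, f_{Gj})$ and $\mathbf{f}' = (\mathbf{f}_1', \mathbf{f}_2', \ldots, \mathbf{f}_E')$ and let $\mathbf{w}$ be similarly defined. For the marker-based effects it is assumed that
$$\mathrm{var}(\mathbf{g} \mid \mathbf{Z}) = \sigma_u^2\Gamma \qquad [21]$$
and
$$\mathrm{var}(\mathbf{f} \mid \mathbf{Z}) = \Sigma_f \otimes \Gamma \qquad [22]$$
where $\otimes$ denotes the Kronecker (or direct) product (Searle et al., 1992). For example, with ridge regression ($\mathrm{RR_{hom}}$) we have $\Gamma = \mathbf{Z}\mathbf{Z}'$. Some choices for the E × E variance–covariance matrix $\Sigma_f$ are given in Table 2, including the factor-analytic (FA) model (Piepho, 1997, 1998). For the polygenic effects we may assume
$$\mathrm{var}(\mathbf{v}) = \sigma_v^2\mathbf{I}_G \qquad [23]$$
and
$$\mathrm{var}(\mathbf{w}) = \Sigma_w \otimes \mathbf{I}_G \qquad [24]$$
where $\Sigma_w$ is also chosen from the options in Table 2.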
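The Kronecker structure in Eq. [22] can be made concrete with a small numerical sketch. It assumes a one-factor FA model for the environments with hypothetical loadings and specific variances, and the ridge regression choice Γ = ZZ′; the ordering of the stacked vector (all genotypes within the first environment, then the second, and so on) follows the definition of f given above.

```python
import numpy as np

rng = np.random.default_rng(8)
G, M, E = 5, 12, 3
Z = rng.choice([-1.0, 1.0], size=(G, M))   # hypothetical marker scores
Gamma = Z @ Z.T                            # ridge regression choice for Gamma

# FA(1) structure across environments (all values hypothetical).
lam = np.array([0.8, 0.5, 1.1])            # loadings, one per environment
D = np.diag([0.2, 0.3, 0.1])               # environment-specific variances
Sigma_f = np.outer(lam, lam) + D

# var(f | Z) = Sigma_f (x) Gamma, an (E*G) x (E*G) matrix.
var_f = np.kron(Sigma_f, Gamma)

# The block for environments j and j' equals Sigma_f[j, j'] * Gamma.
block_01 = var_f[0:G, G:2 * G]
print(var_f.shape, np.allclose(block_01, Sigma_f[0, 1] * Gamma))
```

The same construction with an identity matrix in place of Γ gives the polygenic interaction covariance of Eq. [24].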
Table 2. Models for variance–covariance among genotypes in different environments ($\Sigma_q$; q = f, w).

Model              Short-hand   Equation
Independent        ID           $\sigma_1^2\mathbf{I}_E$
Diagonal           DIAG         $\mathbf{D} = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_E^2)$
Factor-analytic†   FA(P)        $\sum_{p=1}^{P}\boldsymbol{\lambda}_p\boldsymbol{\lambda}_p' + \mathbf{D}$

† $\boldsymbol{\lambda}_p' = (\lambda_{p1}, \lambda_{p2}, \ldots, \lambda_{pE})$.

Two-Stage Analysis for
Genotype–Environment Data
It is generally desirable to fit a suitable model directly to genotype–environment data. Estimation requires, however, that replicate data are available to separate genotypic from environmental effects. When no replicate data are available, as is the case for the example considered in this paper, separation of these effects is not possible. In this case, one may compute genotype means over environments. I use a two-stage approach in which genotype means are computed based on a model with fixed genotype main effects and random interactions. The marker-based component in the interaction is set to zero at this stage, because genotype–environment effects cannot be separated from residual error; it will be absorbed into the polygenic effect $w_{ij}$ according to Eq. [9] and [10]. Thus, I fitted the model
$$y_{ij} = \mu_j + h_i + w_{ij} \qquad [25]$$
where $y_{ij}$ is the adjusted mean of the ith genotype in the jth environment and $\mu_j$ is the main effect of the jth environment. Note that the effect $w_{ij}$ in Eq. [25] subsumes residual error of the adjusted mean. Based on this model I estimated adjusted genotype means $\bar{y}_i$ taking both $\mu_j$ and $h_i$ as fixed, and then fitted the model
$$\bar{y}_i = \mu + h_i + e_i \qquad [26]$$
where $\mathrm{var}(\mathbf{h} \mid \mathbf{Z}) = \mathrm{var}(\mathbf{g} \mid \mathbf{Z}) + \mathrm{var}(\mathbf{v})$ with $\mathbf{h}' = (h_1, h_2, \ldots, h_G)$ and $\mathrm{var}(e_i)$ is fixed at the squared standard error of the adjusted mean $\bar{y}_i$. For comparison, I also fitted Eq. [26] merging $e_i$ with the polygenic effect $v_i$ (contained in $h_i$) into an independent residual with constant variance. Thus, in this analysis, $\mathrm{var}(e_i)$ was not fixed.
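The second-stage model in Eq. [26] with error variances fixed at the squared standard errors can be sketched as below. The sketch assumes the ridge regression structure for the marker-based part, treats the variance components as known (hypothetical values; REML would estimate them), and ignores any covariance among adjusted means.

```python
import numpy as np

def stage_two_blup(ybar, se2, Z, sigma2_u, sigma2_v):
    """BLUP of h in ybar_i = mu + h_i + e_i with var(e_i) fixed at se2[i]."""
    G = len(ybar)
    var_h = sigma2_u * Z @ Z.T + sigma2_v * np.eye(G)   # var(g|Z) + var(v)
    V = var_h + np.diag(se2)                            # add fixed error variances
    ones = np.ones(G)
    # Generalized least squares estimate of the intercept.
    mu = (ones @ np.linalg.solve(V, ybar)) / (ones @ np.linalg.solve(V, ones))
    h_hat = var_h @ np.linalg.solve(V, ybar - mu)
    return mu, h_hat

rng = np.random.default_rng(9)
G, M = 12, 30
Z = rng.choice([-1.0, 1.0], size=(G, M))   # hypothetical marker scores
ybar = rng.normal(loc=8.0, size=G)         # hypothetical adjusted means
se2 = rng.uniform(0.1, 0.3, size=G)        # hypothetical squared standard errors

mu, h_hat = stage_two_blup(ybar, se2, Z, sigma2_u=0.02, sigma2_v=0.1)

# Limiting cases: very precise means are reproduced, very imprecise ones
# are shrunk completely toward the intercept.
mu_small, h_small = stage_two_blup(ybar, np.full(G, 1e-12), Z, 0.02, 0.1)
mu_big, h_big = stage_two_blup(ybar, np.full(G, 1e12), Z, 0.02, 0.1)
```

The two limiting cases show why fixing the error variance matters: it determines how far the BLUPs are allowed to move away from the adjusted means.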
One could try to directly fit the model $y_{ij} = \mu_j + h_i + f_{ij} + w_{ij}$ in a single step, in which case the interaction effect would comprise the marker-dependent term $f_{ij}$. With no independent estimate of error, however, this is prone to overfitting, because the fit for $f_{ij}$ would then be confounded with any correlation among adjusted means that is due to the trial design in the different environments.
The Maize Data
Two hundred eight DH lines originating from a single cross of inbred parental lines in maize were tested in three series of trials over five locations. In four locations (LOC), a lattice design with block size 10 was employed, while in one location a complete block design was used. In four locations, only a single replicate was planted, while in one location there were two replicates planted according to a lattice design. Trials, replicates, and incomplete blocks were coded as TRIAL, REP, and BLOCK, respectively. For each location adjusted entry means $y_{ij}$ were computed. For unreplicated trials, the model was ENTRY + TRIAL. For the location with replicated trials, the model was ENTRY + TRIAL.REP.BLOCK. Adjusted means for entries with marker data were subjected to mixed model analysis. The trait evaluated was kernel dry weight per plot.
There were seven check genotypes. For some of the DH lines marker information was missing, so these were treated as additional checks. Adjusted means of all checks were excluded from mixed model analysis. A total of 136 simple sequence repeat and single nucleotide polymorphism markers evenly distributed across the genome were scored for the DH lines. The two alleles of a marker were coded Z = –1 and Z = +1, while missing data were coded as Z = 0.
Software and Model Evaluation
All models were fitted by the REML method. For each model, we report both the deviance (minus twice the restricted log likelihood) and the Akaike Information Criterion (AIC), defined as the deviance plus twice the number of variance parameters. Small values of AIC indicate a preferable model. The AIC is closely related to cross-validation criteria (McQuarrie and Tsai, 1998; Piepho and Gauch, 2001). Some code for SAS PROC MIXED (Littell et al., 2006) is given in the Appendix.
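For readers without access to SAS, the deviance and AIC can be sketched in NumPy as follows. This is not the Appendix code; it evaluates the restricted log likelihood at given (hypothetical) variance parameters for a model with a fixed intercept, and it uses the usual smaller-is-better convention in which the AIC adds twice the number of variance parameters to the deviance. Different packages may differ in additive constants.

```python
import numpy as np

def reml_deviance(y, X, V):
    """Minus twice the restricted log likelihood of y ~ N(X beta, V)."""
    n, p = X.shape
    Vinv_X = np.linalg.solve(V, X)
    Vinv_y = np.linalg.solve(V, y)
    XtVinvX = X.T @ Vinv_X
    beta = np.linalg.solve(XtVinvX, X.T @ Vinv_y)    # GLS estimate of fixed effects
    r = y - X @ beta
    quad = r @ np.linalg.solve(V, r)
    _, logdet_V = np.linalg.slogdet(V)
    _, logdet_XtVinvX = np.linalg.slogdet(XtVinvX)
    return (n - p) * np.log(2 * np.pi) + logdet_V + logdet_XtVinvX + quad

def aic(deviance, n_var_params):
    """AIC in smaller-is-better form: deviance + 2 * number of variance parameters."""
    return deviance + 2 * n_var_params

rng = np.random.default_rng(10)
G, M = 12, 30
Z = rng.choice([-1.0, 1.0], size=(G, M))   # hypothetical marker scores
y = rng.normal(size=G)
X = np.ones((G, 1))                        # fixed intercept only

# Ridge regression covariance with hypothetical variance components.
V = 0.05 * Z @ Z.T + 1.0 * np.eye(G)
dev = reml_deviance(y, X, V)
aic_value = aic(dev, n_var_params=2)       # sigma2_u and residual variance
print(dev, aic_value)
```

A REML fit would minimize this deviance over the variance parameters; comparing AIC values across covariance models then requires that all models share the same fixed effects.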
RESULTS
Computation of Genotype Means
To compute genotype means across environments, we fitted the two-way model from Eq. [25] with fixed main effects for environments and genotypes and various structures for $\Sigma_w$. Based on the AIC values in Table 3, it was decided to compute genotype means $\bar{y}_i$ using the FA(1) model. The average variance of an adjusted mean based on this analysis was 0.174.
Analysis of Genotype Means
Not Fixing the Error Variance
We first fitted models for h based on Eq. [26] without fixing the error variance $\mathrm{var}(e_i)$ at the squared standard error of a mean, such that error could not be separated from var(h|Z). Thus, the residual variance comprised the variance of both the polygenic effect $v_i$ and the error associated with the mean ($e_i$). The model fits are shown in Table 4. There was no significant heterogeneity among the 38 markers with nonzero variance under model $\mathrm{RR}_{\mathrm{het}}$. Thus, $\mathrm{RR}_{\mathrm{hom2}}$ was fitted. The example shows that ridge regression and spatial models give better fits than a model with independent genotypic effects. Also, the spatial models provide a fit similar to that of ridge regression in terms of AIC. Strikingly, some of the spatial models have a much smaller residual variance than ridge regression, in which case BLUP comes very close to the adjusted means, which explains the high correlation of adjusted means with BLUPs under spatial models (POW, EXP, GAU, SPH). By contrast, correlation of adjusted means with BLUPs is quite low for ridge regression and the quadratic model (Table 5), so selection decisions by these GS methods would be quite different from those based on adjusted means. The finding that some of the spatial models have a residual variance much smaller than the average variance of an adjusted mean (0.174) is indicative of over-fitting. In terms of AIC, differences are minor between spatial models and ridge regression. Overall, the LV model is marginally better than other marker-based spatial models. $\mathrm{RR}_{\mathrm{hom2}}$ has by far the best AIC value of all models, showing that preselection of markers is an important consideration.
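Why a vanishing residual variance makes BLUP collapse onto the adjusted means can be seen from the generic form $\hat{g} = G(G + R)^{-1}(y - \mu)$. The sketch below is a toy illustration with two genotypes only, not the analysis of the maize data; the function name, the correlation `rho`, the known mean `mu`, and all numerical values are assumptions made for the example.

```python
def blup_two_genotypes(y, mu, sigma_g2, rho, sigma_e2):
    """BLUP g_hat = G (G + R)^{-1} (y - mu) for two genotypes.

    G = sigma_g2 * [[1, rho], [rho, 1]] is an assumed genetic covariance,
    R = sigma_e2 * I is the residual covariance, and mu is treated as known.
    """
    d = [y[0] - mu, y[1] - mu]
    # V = G + R has diagonal a and off-diagonal b; invert the 2 x 2 matrix in closed form
    a = sigma_g2 + sigma_e2
    b = sigma_g2 * rho
    det = a * a - b * b
    v_inv = [[a / det, -b / det], [-b / det, a / det]]
    t = [v_inv[0][0] * d[0] + v_inv[0][1] * d[1],
         v_inv[1][0] * d[0] + v_inv[1][1] * d[1]]
    # Pre-multiply by G
    return [sigma_g2 * t[0] + b * t[1],
            b * t[0] + sigma_g2 * t[1]]


y, mu = [10.4, 9.1], 9.8  # hypothetical adjusted means and overall mean
shrunk = blup_two_genotypes(y, mu, sigma_g2=0.2, rho=0.5, sigma_e2=0.2)
unshrunk = blup_two_genotypes(y, mu, sigma_g2=0.2, rho=0.5, sigma_e2=1e-8)
print(shrunk, unshrunk)
```

With a sizeable residual variance the predictions are shrunken toward zero, whereas with a residual variance near zero they essentially reproduce the centered means $y - \mu$.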
Analysis of Genotype Means
Fixing the Error Variance
The fact that some of the models yielded very small residual variances when $\mathrm{var}(e_i)$ was not fixed, thus rendering BLUP essentially the same as adjusted means, is a reason for concern. Adjusted means are typically correlated, though the correlation may not be large. It is therefore possible that in a model for adjusted means, the genetic covariance model captures part of the correlation among adjusted means, which is purely nongenetic, thus yielding an upward bias in genetic variance. For this reason it is generally advisable to obtain an independent estimate of error (as in Bernardo and Yu, 2007). Thus, we fixed $\mathrm{var}(e_i)$ at the squared standard error of adjusted genotype means based on the FA(1) model for genotype–environment means. The resulting fits are shown in Table 6. Most models leave rather little polygenic variance $\sigma^2_v$. Again, $\mathrm{RR}_{\mathrm{hom2}}$ has by far the best fit in terms of AIC. Among spatial models, LV and GAU are best. None of the GS methods is perfectly correlated with the adjusted mean (Table 7), while several of the spatial models (LV, POW, EXP, SPH) are virtually identical.
DISCUSSION
This paper has discussed some models for GS that are readily implemented with a mixed model package. Results for a maize dataset indicate that the spatial models are an interesting alternative to ridge regression. The LV model is particularly attractive because it involves only a single parameter. Automatic marker selection by a preliminary fit of a model with heterogeneous variance between markers is a promising method. A thorough comparison with other methods of subset selection would be worthwhile.
In the analysis of across-environment genotype means without an independent estimate of error, the residual variance estimator often was close to zero, indicating that the marker-based component captured substantial noise. This stresses the need to provide for an independent estimate of error in GS projects. Also, it is desirable to explicitly account for genetic variance not captured by the markers (Calus and Veerkamp, 2007). This polygenic variance should be separated from residual error, which requires independent estimates of error for individual trials. It is quite common in breeding programs to perform unreplicated trials, as was the case in the example, where there was not sufficient information to estimate within-trial errors for all environments. Thus, genotype–environment interaction could not be separated from error in a mixed model. We could have fitted a marker-based model to the genotype–environment effect, but this would have entailed the risk of over-fitting, because nongenetic correlations due to field trend could have been captured by the marker-based terms. For this reason a two-stage approach was employed, computing genotype means across environments based on an unconditional mixed model for genotype–environment interaction that did not exploit marker information. While the unconditional model is valid as shown in this paper, using a conditional model of genotype–environment interaction for given marker information is expected to be more efficient. This is feasible only with sufficient replication in all trials, stressing the need for good individual trial design.
When markers are mapped, the ridge regression model can be extended to allow spatial correlation of regression coefficients pertaining to markers on the same chromosome (Gianola et al., 2003), using the same types of spatial model discussed here. Unfortunately, such models are currently not conveniently fitted using mixed model software such
Table 3. AICs for various variance–covariance structures $\Sigma_w$ fitted to the phenotypic data (genotype–environment means). Models had fixed main effects for genotypes and environments.

| Model ($\Sigma_w$) | Deviance | AIC† |
|---|---|---|
| ID | 2843.1 | 2845.1 |
| DIAG | 2753.7 | 2763.7 |
| FA(1) | 2744.4 | 2762.4 |
| FA(2) | 2743.5 | 2771.5 |

† AIC, Akaike information criterion.
Table 4. Model fits of different genetic covariance models with the maize data. Error variance $\mathrm{var}(e_i)$ not fixed.

| Model for $g_i$† | Deviance | AIC‡ | Residual variance§ |
|---|---|---|---|
| Independent¶ | 372.8 | 374.8 | 0.3454 |
| Ridge regression | | | |
| $\mathrm{RR}_{\mathrm{hom}}$ | 336.9 | 340.9 | 0.2272 |
| $\mathrm{RR}_{\mathrm{het}}$ (38 markers selected) | 289.5 | 367.5 | 0.1635 |
| $\mathrm{RR}_{\mathrm{hom2}}$ (38 markers) | 303.3 | 307.3 | 0.1773 |
| Spatial models | | | |
| Linear | 335.6 | 339.6 | 0.1139 |
| Quadratic | 336.9 | 340.9 | 0.2272 |
| Power | 334.8 | 340.8 | 0.0020 |
| Exponential | 334.8 | 340.8 | 0.0018 |
| Gaussian | 333.9 | 339.9 | 0.0002 |
| Spherical | 334.3 | 340.3 | <0.0001 |

† $\mathrm{RR}_{\mathrm{het}}$, ridge regression with heterogeneous variance among markers; $\mathrm{RR}_{\mathrm{hom}}$, ordinary ridge regression; $\mathrm{RR}_{\mathrm{hom2}}$, ridge regression with reduced set of markers.
‡ AIC, Akaike information criterion.
§ Residual subsumes $v_i$ and $e_i$, because $\mathrm{var}(e_i)$ was not fixed.
¶ Heterogeneity of variance among selected markers was not significant according to a likelihood ratio test (α = 5%).
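Table 5. Pearson correlation (above diagonal) and Spearman rank correlation (below diagonal) of different estimators of genotypic value for the maize data. Error variance $\mathrm{var}(e_i)$ is not fixed.

| Model/estimator† | AM | $\mathrm{RR}_{\mathrm{hom}}$ | $\mathrm{RR}_{\mathrm{het}}$ | $\mathrm{RR}_{\mathrm{hom2}}$ | LV | QUAD | POW | EXP | GAU | SPH |
|---|---|---|---|---|---|---|---|---|---|---|
| Adj. mean (AM) | 1 | 0.703 | 0.774 | 0.756 | 0.966 | 0.703 | 1 | 1 | 1 | 1 |
| $\mathrm{RR}_{\mathrm{hom}}$ | 0.670 | 1 | 0.920 | 0.942 | 0.862 | 1 | 0.705 | 0.705 | 0.703 | 0.703 |
| $\mathrm{RR}_{\mathrm{het}}$ | 0.759 | 0.916 | 1 | 0.974 | 0.884 | 0.920 | 0.776 | 0.776 | 0.774 | 0.774 |
| $\mathrm{RR}_{\mathrm{hom2}}$ | 0.745 | 0.935 | 0.974 | 1 | 0.880 | 0.942 | 0.758 | 0.758 | 0.756 | 0.756 |
| Linear (LV) | 0.960 | 0.862 | 0.880 | 0.976 | 1 | 0.862 | 0.967 | 0.967 | 0.966 | 0.966 |
| Quadratic (QUAD) | 0.697 | 1 | 0.916 | 0.935 | 0.862 | 1 | 0.705 | 0.705 | 0.703 | 0.703 |
| Power (POW) | 1 | 0.698 | 0.761 | 0.746 | 0.961 | 0.698 | 1 | 1 | 1 | 1 |
| Exponential (EXP) | 1 | 0.698 | 0.760 | 0.746 | 0.960 | 0.698 | 1 | 1 | 1 | 1 |
| Gaussian (GAU) | 1 | 0.697 | 0.759 | 0.745 | 0.960 | 0.697 | 1 | 1 | 1 | 1 |
| Spherical (SPH) | 1 | 0.697 | 0.759 | 0.745 | 0.960 | 0.697 | 1 | 1 | 1 | 1 |

† $\mathrm{RR}_{\mathrm{het}}$, ridge regression with heterogeneous variance among markers; $\mathrm{RR}_{\mathrm{hom}}$, ordinary ridge regression; $\mathrm{RR}_{\mathrm{hom2}}$, ridge regression with reduced set of markers.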
as PROC MIXED. If the spatial model only has a single parameter, a profile likelihood approach may be used.
If one is prepared to work within a fully Bayesian framework, more options are available (Meuwissen et al., 2001; Gianola et al., 2003; Xu, 2003; Habier et al., 2007; Gianola and van Kaam, 2008). A general problem with Bayesian methods is coming up with a choice for the prior distribution. Meuwissen et al. (2001) use prior distributions for BayesA and BayesB that were derived from their simulation program. Thus, the prior distributions favorably matched the data generation mechanism, putting the Bayesian methods somewhat at an advantage that would be hard to realize in most plant breeding applications, where prior information may be much vaguer.
One can investigate the merits of different models by simulation (Meuwissen et al., 2001; Bernardo and Yu, 2007). The main difficulty is that a model needs to be chosen for simulating the data, and naturally, a model close to the one used for simulation is more likely to perform well in the analysis. Simulations must necessarily make a number of assumptions, the validity of which is hard to verify in practice. Thus, a more reliable assessment of performance is with real data from breeding programs. Ideally, parallel programs using different models for prediction would be compared based on the realized genetic gain. The less-than-perfect correlation among BLUPs by different GS methods found in the present study suggests that a thorough comparison of different methods in current breeding programs would be useful, preferably by cross-validation reflecting the breeder’s selection decision process (Schrag et al., 2009). When devising a cross-validation scheme, it must be realized that a family structure as studied in the present paper induces genetic correlation. Optimality of cross-validation methods often rests on independence assumptions, and generalizing to the case of dependent data is not straightforward (Lahiri, 2003). Developing suitable cross-validation schemes for plant breeding programs therefore is an interesting topic for future research.
A very promising application of GS is for hybrid prediction. Bernardo (1993, 1994) proposed a BLUP approach for hybrid prediction, which is closely related to ridge regression. He suggested estimating the coefficient of coancestry ($f_{ii'}$) for two maize inbred lines $i$ and $i'$ from the same heterotic pool X by a linear function of the simple matching coefficient $s_{ii'}$, that is, by $f_{ii'} = a_{ii'} s_{ii'} + b_{ii'}$, where $a_{ii'} = (1 - c_{ii'})^{-1}$, $b_{ii'} = -c_{ii'}(1 - c_{ii'})^{-1}$, $c_{ii'} = 0.5(s_{iY} + s_{i'Y})$, and $s_{iY}$ is the average simple matching coefficient between inbred $i$ and a sample of inbreds from the opposite heterotic pool Y. This estimate of the coefficient of coancestry is then used in the LV model $\sigma^2_u\{f_{ii'}\}$ to predict GCA effects within a mixed model. This approach is
Table 6. Model fits of different genetic covariance models with the maize data. Error variance $\mathrm{var}(e_i)$ fixed at value of squared standard error of a mean based on FA(1) model fitted to genotype–environment data.

| Model for $g_i$ | Deviance | AIC† | Polygenic genetic variance ($\sigma^2_v$) |
|---|---|---|---|
| Independent | 372.8 | 374.8 | 0.1712 |
| Ridge regression‡ | | | |
| $\mathrm{RR}_{\mathrm{hom}}$ | 336.9 | 340.9 | 0.0528 |
| $\mathrm{RR}_{\mathrm{het}}$ (37 markers selected) | 289.7 | 363.7 | 0 |
| $\mathrm{RR}_{\mathrm{hom2}}$ (37 markers)§ | 301.9 | 305.9 | 0.0045 |
| Spatial models | | | |
| Linear | 337.1 | 339.1 | 0 |
| Quadratic | 336.9 | 340.9 | 0.0528 |
| Power¶ | 337.2 | 341.2 | 0 |
| Exponential | 337.1 | 341.1 | 0 |
| Gaussian | 335.2 | 339.2 | 0 |
| Spherical | 337.1 | 341.1 | 0 |

† AIC, Akaike information criterion.
‡ $\mathrm{RR}_{\mathrm{het}}$, ridge regression with heterogeneous variance among markers; $\mathrm{RR}_{\mathrm{hom}}$, ordinary ridge regression; $\mathrm{RR}_{\mathrm{hom2}}$, ridge regression with reduced set of markers.
§ Heterogeneity of variance among selected markers was not significant according to a likelihood ratio test (α = 5%); variance estimates were shrunken to the overall mean (for details see text).
¶ Autocorrelation converged to value close to unity.
Table 7. Pearson correlation (above diagonal) and Spearman rank correlation (below diagonal) of different estimators of genotypic value for the maize data. Error variance $\mathrm{var}(e_i)$ fixed at value of squared standard error of a mean based on FA(1) model fitted to genotype–environment data.

| Model/estimator† | AM | $\mathrm{RR}_{\mathrm{hom}}$ | $\mathrm{RR}_{\mathrm{het}}$ | $\mathrm{RR}_{\mathrm{hom2}}$ | LV | QUAD | POW | EXP | GAU | SPH |
|---|---|---|---|---|---|---|---|---|---|---|
| Adj. mean (AM) | 1 | 0.881 | 0.768 | 0.769 | 0.920 | 0.881 | 0.920 | 0.920 | 0.887 | 0.920 |
| $\mathrm{RR}_{\mathrm{hom}}$ | 0.871 | 1 | 0.933 | 0.946 | 0.995 | 1 | 0.995 | 0.995 | 0.997 | 0.995 |
| $\mathrm{RR}_{\mathrm{het}}$ | 0.753 | 0.929 | 1 | 0.975 | 0.915 | 0.933 | 0.914 | 0.914 | 0.927 | 0.914 |
| $\mathrm{RR}_{\mathrm{hom2}}$ | 0.760 | 0.941 | 0.975 | 1 | 0.926 | 0.769 | 0.926 | 0.926 | 0.942 | 0.926 |
| Linear (LV) | 0.912 | 0.994 | 0.912 | 0.923 | 1 | 0.920 | 1 | 1 | 0.996 | 1 |
| Quadratic (QUAD) | 0.871 | 1 | 0.929 | 0.941 | 0.994 | 1 | 0.995 | 0.995 | 0.997 | 0.995 |
| Power (POW) | 0.912 | 0.994 | 0.912 | 0.923 | 1 | 0.994 | 1 | 1 | 0.996 | 1 |
| Exponential (EXP) | 0.912 | 0.994 | 0.912 | 0.923 | 1 | 0.994 | 1 | 1 | 0.996 | 1 |
| Gaussian (GAU) | 0.879 | 0.996 | 0.926 | 0.939 | 0.995 | 0.996 | 0.995 | 0.995 | 1 | 0.996 |
| Spherical (SPH) | 0.912 | 0.994 | 0.912 | 0.923 | 1 | 0.994 | 1 | 1 | 0.995 | 1 |

† $\mathrm{RR}_{\mathrm{het}}$, ridge regression with heterogeneous variance among markers; $\mathrm{RR}_{\mathrm{hom}}$, ordinary ridge regression; $\mathrm{RR}_{\mathrm{hom2}}$, ridge regression with reduced set of markers.
seen to be quite similar to ridge regression for the GCA effects, but not equivalent because scale and shift parameters ($a_{ii'}$ and $b_{ii'}$) depend on the pair of genotypes, while with ridge regression $a_{ii'} = a$ and $b_{ii'} = b$ for all pairs $(i, i')$ (see Eq. [16]).
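The pair-specific rescaling in Bernardo's estimator can be made concrete with a short sketch. The function name and the numerical values below are hypothetical; only the formula $f_{ii'} = a_{ii'} s_{ii'} + b_{ii'}$ with $a_{ii'} = (1 - c_{ii'})^{-1}$, $b_{ii'} = -c_{ii'}(1 - c_{ii'})^{-1}$ and $c_{ii'} = 0.5(s_{iY} + s_{i'Y})$ is taken from the text.

```python
def coancestry_from_matching(s_pair, s_i_to_Y, s_j_to_Y):
    """Coancestry estimate f = a * s + b for two inbreds of the same pool.

    s_pair   : simple matching coefficient between the two inbreds
    s_i_to_Y : average simple matching coefficient of the first inbred
               with a sample of inbreds from the opposite pool Y
    s_j_to_Y : the same quantity for the second inbred
    """
    c = 0.5 * (s_i_to_Y + s_j_to_Y)
    a = 1.0 / (1.0 - c)   # scale parameter, depends on the pair through c
    b = -c / (1.0 - c)    # shift parameter, depends on the pair through c
    return a * s_pair + b


# Hypothetical values for illustration only
f_example = coancestry_from_matching(0.8, 0.4, 0.6)
f_identical = coancestry_from_matching(1.0, 0.4, 0.6)
f_baseline = coancestry_from_matching(0.5, 0.4, 0.6)
print(f_example, f_identical, f_baseline)
```

Because $c_{ii'}$ changes from pair to pair, so do the scale and shift, which is what distinguishes this estimator from the common $a$ and $b$ implied by ridge regression.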
Ridge regression and spatial models discussed in this paper can be used as alternative methods to model GCA and SCA effects in hybrid prediction. For example, a ridge regression model for prediction of hybrid performance in a complete factorial is

$$g = (Z_1 \otimes 1_{P_2})\,u_1 + (1_{P_1} \otimes Z_2)\,u_2 + Z_3 u_3 \qquad [27]$$

where $g$ is the vector of genotypic values of G hybrids, $P_1$ and $P_2$ are the number of inbred parents in the two heterotic pools, $Z_1$ and $Z_2$ are the marker-based design matrices of parents in the two pools, $1_P$ is a P-dimensional vector of ones, $Z_3 = (Z_1 \otimes 1_{P_2}) \bullet (1_{P_1} \otimes Z_2)$, where $\bullet$ denotes the elementwise (Hadamard or Schur) product, $u_1' = (u_{11}, \ldots, u_{1M})$ and $u_2' = (u_{21}, \ldots, u_{2M})$ are vectors of the GCA effects at the markers of the two pools and $u_3' = (u_{31}, \ldots, u_{3M})$ is the corresponding vector of SCA effects. Coding of the design matrices for GCA and SCA effects has a standard two-way ANOVA form as shown in Table 8. When the factorial is not complete, the corresponding lines need to be deleted in the design matrices.
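The construction of the three design matrices in Eq. [27] can be sketched for a tiny complete factorial. The marker scores, effect values, and helper names below are hypothetical and chosen so that the first marker reproduces the coding of Table 8; the code is a plain-Python sketch rather than anything used in the analysis.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]


def ones(p):
    """p-dimensional column vector of ones."""
    return [[1] for _ in range(p)]


def hadamard(a, b):
    """Elementwise (Hadamard) product of two equally sized matrices."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def matvec(a, v):
    """Matrix-vector product."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


# Hypothetical marker scores (-1/+1) for 2 parents per pool and M = 2 markers
Z1 = [[-1, 1], [1, 1]]    # pool 1
Z2 = [[-1, -1], [1, -1]]  # pool 2
P1, P2 = len(Z1), len(Z2)

W1 = kron(Z1, ones(P2))   # Z1 (x) 1_{P2}: GCA design, pool 1
W2 = kron(ones(P1), Z2)   # 1_{P1} (x) Z2: GCA design, pool 2
Z3 = hadamard(W1, W2)     # SCA design

# Hypothetical marker effects
u1, u2, u3 = [0.5, -0.2], [0.3, 0.1], [0.05, 0.0]
g = [a + b + c for a, b, c in zip(matvec(W1, u1), matvec(W2, u2), matvec(Z3, u3))]
print(W1, W2, Z3, g)
```

Hybrids are ordered with the pool 1 parent varying slowest, which is the ordering implied by placing $Z_1$ on the left of the Kronecker product.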
Apart from variable selection, this model is essentially the factorial regression model proposed by Charcosset et al. (1998), who take regression coefficients as fixed. If instead we assume independent sampling from normal distributions according to $u_{rk} \sim N(0, \sigma^2_{ur})$ (r = 1, 2, 3), we have a ridge regression equivalent of factorial regression. The resulting variance–covariance structure is

$$\mathrm{var}(g \mid Z) = \sigma^2_{u1}\, Z_1 Z_1' \otimes J_{P_2} + \sigma^2_{u2}\, J_{P_1} \otimes Z_2 Z_2' + \sigma^2_{u3}\, Z_3 Z_3' \qquad [28]$$
An alternative variance–covariance model, more akin to Bernardo’s (1993, 1994) approach, is

$$\mathrm{var}(g \mid Z) = \sigma^2_{u1}\, Z_1 Z_1' \otimes J_{P_2} + \sigma^2_{u2}\, J_{P_1} \otimes Z_2 Z_2' + \sigma^2_{u3}\, Z_1 Z_1' \otimes Z_2 Z_2' \qquad [29]$$
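The model is easily generalized as

$$\mathrm{var}(g \mid Z) = \sigma^2_{u1}\, \Gamma_1 \otimes J_{P_2} + \sigma^2_{u2}\, J_{P_1} \otimes \Gamma_2 + \sigma^2_{u3}\, \Gamma_1 \otimes \Gamma_2 \qquad [30]$$

where $\Gamma_r$ (r = 1, 2) is chosen according to some spatial model in terms of $Z_r$.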
The term $\Gamma_1 \otimes \Gamma_2$ is equivalent to a separable two-dimensional spatial process (Martin, 1979), the dimensions corresponding to genetic distance of hybrid parents in the two pools.
When $\Gamma_1$ and $\Gamma_2$ are computed from coefficients of coancestry of each hybrid’s inbred parents in the two pools, we have Bernardo’s (1993, 1994) approach. Alternatively, $\Gamma_1$ and $\Gamma_2$ can have any of the spatial structures proposed in the present paper, based on the genetic distance of parents in both heterotic pools, giving rise to a host of alternative methods. Note that when we apply ridge regression, $\Gamma_r$ (r = 1, 2) may be any positive definite linear function $\Gamma_r = a_r J_{P_r} + b_r S_r$, where $S_r$ is the matrix of simple matching coefficients of hybrids’ parents in the rth pool. This is because the variance for the SCA effects $(a_1 J_{P_1} + b_1 S_1) \otimes (a_2 J_{P_2} + b_2 S_2)$ equals $a_1 a_2\, J_{P_1} \otimes J_{P_2} + a_1 b_2\, J_{P_1} \otimes S_2 + a_2 b_1\, S_1 \otimes J_{P_2} + b_1 b_2\, S_1 \otimes S_2$, where the first term on the right-hand side is confounded with the intercept and the second and third terms are confounded with the GCA effects. Finally, it should be stressed that the variance terms for GCA and SCA effects may be extended by polygenic terms to account for residual effects not captured by markers.
In case of multi-allelic markers, or when haplotypes are used (Calus et al., 2008), there are different, essentially equivalent options of extending the model. The following discussion is restricted to additive effects, but the same principles apply to coding of effects for dominance and epistasis (Xu and Jia, 2007). The starting point is to assume that each allele has an additive effect drawn from the same normal distribution. Let $v_{qk}$ denote the additive effect of the qth allele ($q = 1, \ldots, Q_k$) of the kth marker and $x_{iqk}$ the corresponding dummy variable counting the number of copies of the qth allele of the kth marker for the ith genotype. Let $v_k' = (v_{1k}, v_{2k}, \ldots, v_{Q_k k})$. The contribution of the kth marker to the genotypic value is $X_k v_k$, where $X_k = \{x_{iqk}\}$. Assuming that entries in $v_k$ are identically and independently normally distributed with zero mean and variance $\sigma^2_v$, we have

$$\mathrm{var}(X_k v_k) = \sigma^2_v X_k X_k' \qquad [31]$$

We might impose a sum-to-zero restriction, replacing $v_k$ with $w_k = (I_{Q_k} - Q_k^{-1} J_{Q_k}) v_k$. It is found that

$$\mathrm{var}(X_k w_k) = \sigma^2_v X_k \left(I_{Q_k} - Q_k^{-1} J_{Q_k}\right) X_k' = \sigma^2_v \left(X_k X_k' - 4 Q_k^{-1} J_G\right) \qquad [32]$$

The second term involving the matrix $J_G$ is confounded with the intercept and so can be dropped, showing that the sum-to-zero constraint is not needed.
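The second equality in Eq. [32] rests on one step that may be worth spelling out. The short derivation below assumes that every genotype carries exactly two allele copies at marker k (no missing data), so that each row of $X_k$ sums to 2; this assumption is implied by the allele-count coding but is not stated explicitly above.

```latex
\begin{aligned}
X_k J_{Q_k} X_k'
  &= (X_k 1_{Q_k})(X_k 1_{Q_k})'
   = (2 \cdot 1_G)(2 \cdot 1_G)'
   = 4 J_G, \\
X_k \left(I_{Q_k} - Q_k^{-1} J_{Q_k}\right) X_k'
  &= X_k X_k' - Q_k^{-1} X_k J_{Q_k} X_k'
   = X_k X_k' - 4 Q_k^{-1} J_G .
\end{aligned}
```

The first equality in Eq. [32] uses the fact that $I_{Q_k} - Q_k^{-1} J_{Q_k}$ is symmetric and idempotent.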
In case of two alleles and inbred lines, marker k may be represented by a single covariate $z_k = X_k c$, where $z_k' = (z_{1k}, z_{2k}, \ldots, z_{Gk})$ and $c' = (1/2, -1/2)$ such that $z_{ik} = 1$ or $z_{ik} = -1$ for inbred lines, as in Eq. [1]. In this case
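Table 8. Coding of design matrices for general combining ability (GCA) and specific combining ability (SCA) effects in Eq. [27]. Covariates are shown for one marker.

| Parental marker genotype, pool 1 | Parental marker genotype, pool 2 | GCA covariate in $Z_1$ | GCA covariate in $Z_2$ | SCA covariate in $Z_3$ |
|---|---|---|---|---|
| $A_1$ | $A_1$ | –1 | –1 | +1 |
| $A_1$ | $A_2$ | –1 | +1 | –1 |
| $A_2$ | $A_1$ | +1 | –1 | –1 |
| $A_2$ | $A_2$ | +1 | +1 | +1 |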
$$\mathrm{var}(z_k u_k) = \sigma^2_u X_k c c' X_k' = \tfrac{1}{4}\sigma^2_u X_k \left(2 I_2 - J_2\right) X_k' = \tfrac{1}{2}\sigma^2_u \left(X_k X_k' - 2 J_G\right) \qquad [33]$$

Again, the term in $J_G$ may be dropped, so ridge regression as per Eq. [1] is equivalent to the parameterization with $v_k$. In the biallelic case, the parameterization with a single column per marker in Z is most parsimonious, but this option is not available in the multi-allelic case.
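Equation [33] can be checked numerically for a toy example. The sketch below uses three hypothetical inbred lines at one biallelic marker and sets $\sigma^2_u = 1$; the matrix helpers are ad hoc and only serve the illustration.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


# Allele counts for three hypothetical inbred lines at one biallelic marker:
# each row holds the number of copies of allele 1 and allele 2
X = [[2, 0], [0, 2], [2, 0]]
c = [[0.5], [-0.5]]

z = matmul(X, c)                 # single covariate per marker, entries +1 or -1
lhs = matmul(z, transpose(z))    # z z' = X c c' X'  (sigma_u^2 set to 1)

XXt = matmul(X, transpose(X))
G = len(X)
# Right-hand side of Eq. [33]: (1/2) (X X' - 2 J_G)
rhs = [[0.5 * (XXt[i][j] - 2) for j in range(G)] for i in range(G)]
print(z, lhs, rhs)
```

Both sides agree element by element, so the single-column coding and the allele-count coding differ only by the $J_G$ term and a scaling of the variance.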
This paper has focused on marker data for predicting genotypic values. Instead of markers, or in addition to markers, expression or metabolic profile data may be used for the same purpose. In this case for ridge regression it is important to standardize the different expression products to justify the assumption of a common variance for the regression coefficients. Similar considerations apply for any of the spatial methods proposed in this paper. If different sources are used simultaneously (markers, expression data, metabolite data), it may be prudent to fit a separate covariance model for each component in the joint model.
APPENDIX
This appendix shows how to fit the models discussed in this paper using PROC MIXED of the SAS System (Littell et al., 2006). It is assumed that markers are coded z1 to zM, while genotypes are coded by gen. The relevant RANDOM statement for the genotypic effect under the different models is given.
Ridge Regression
The model may be fitted by
random z1-zM / subject=intercept type=toep(1);
By this code each marker generates a column in the design matrix for the random effects. When the number of markers is very large, solving the mixed model equations may become computationally quite demanding. In this case, it is useful to specify the model differently. Noting that $\mathrm{var}(g) = \sigma^2_u ZZ'$ is linear in $\Gamma = ZZ'$, it may be advantageous to compute $\Gamma = ZZ'$ explicitly before running PROC MIXED and then specify a linear structure as follows:
random gen / subject=intercept type=lin(1) ldata=gamma;
Savings in storage space and computing time required to solve the mixed model equations may be considerable when M >> G. This code requires that $\Gamma = ZZ'$ be stored in a SAS dataset “gamma” according to one of two possible formats (for details see manual). One option for a hypothetical 3 × 4 Z matrix is as given in Fig. 1 (this assumes that a SAS dataset “w” contains variables z1 to zM).
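For readers who prepare the matrix outside SAS, the computation of $\Gamma = ZZ'$ can be sketched in a few lines of Python. The 3 × 4 marker matrix below is hypothetical, and the record layout with fields `parm`, `row`, and `col1` to `colG` is an assumption about the dense LDATA= format that should be checked against the SAS manual.

```python
# Hypothetical 3 x 4 marker matrix Z: 3 genotypes (rows), 4 markers (columns)
Z = [
    [1, -1, 1, 1],
    [-1, -1, 1, -1],
    [1, 1, -1, 1],
]

# Gamma = Z Z': entry (i, j) is the sum over markers of z_ik * z_jk
gamma = [[sum(a * b for a, b in zip(zi, zj)) for zj in Z] for zi in Z]

# One record per row of Gamma; the field names parm, row, col1..colG are an
# assumed layout for the dense LDATA= format (to be verified in the manual)
records = [
    {"parm": 1, "row": i + 1, **{f"col{j + 1}": v for j, v in enumerate(r)}}
    for i, r in enumerate(gamma)
]
for rec in records:
    print(rec)
```

Only the G × G matrix needs to be passed to the mixed model procedure, which is where the savings arise when the number of markers greatly exceeds the number of genotypes.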
Spatial Models
The POW, EXP (equivalent to POW), GAU, and SPH models can be fitted by these RANDOM statements:
random gen / subject=intercept type=sp(pow)(z1-zM);
random gen / subject=intercept type=sp(exp)(z1-zM);
random gen / subject=intercept type=sp(gau)(z1-zM);
random gen / subject=intercept type=sp(sph)(z1-zM);
The spatial models may have convergence problems, so it is advisable to try a number of starting values for the spatial parameters using the PARMS statement. If the residual variance $\mathrm{var}(e_i)$ is fixed as described in this paper, and a polygenic effect $v_i$ is fitted in addition to a marker-dependent effect $g_i$, a typical call of PROC MIXED is as shown in Fig. 2. The weighting variable w contains the inverse of $\mathrm{var}(e_i)$, that is, of the squared standard errors of adjusted means. For background on the method of fixing $\mathrm{var}(e_i)$ see Piepho (1999).
Also, at times the log likelihood changes only marginally in iterations and yet the default convergence criterion is not met. In such instances it may be useful to slightly relax the convergence criterion relative to the default value. Additionally, rescaling Z such that $\sigma^2_u$ is of the same order of magnitude as $\sigma^2_v$ may be beneficial.
For LV we use the same code to generate a linear variance-covariance matrix as for the ridge regression model, except that a*b is replaced by (a-b)**2/&m/2 and the square root is taken of col[i] after the do loop. The relevant portion that needs to be replaced in Fig. 1 is given in Fig. 3.
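The modification described for LV can be mirrored in Python as follows. This is a sketch of the stated transformation only (sum of `(a-b)**2/m/2` over markers, followed by a square root); the marker matrix is hypothetical, and how PROC MIXED combines the resulting matrix with the other model terms is not reproduced here.

```python
import math

# Hypothetical 3 x 4 marker matrix Z: 3 genotypes (rows), 4 markers (columns)
Z = [
    [1, -1, 1, 1],
    [-1, -1, 1, -1],
    [1, 1, -1, 1],
]
m = len(Z[0])  # number of markers


def lv_entry(zi, zj, m):
    """Square root of the sum over markers of (a - b)**2 / m / 2."""
    return math.sqrt(sum((a - b) ** 2 / m / 2 for a, b in zip(zi, zj)))


# Matrix of entries that replace Gamma = ZZ' when fitting the LV model
lv = [[lv_entry(zi, zj, m) for zj in Z] for zi in Z]
for row in lv:
    print(row)
```

With the –1/+1 coding each mismatching marker contributes 4 to the sum of squared differences, so the entries depend on the marker data only through the proportion of mismatches between two genotypes.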
Figure 1. SAS code generating the matrix $\Gamma = ZZ'$ for fitting the ridge regression model using the LIN structure in PROC MIXED.
Then, the code in Fig. 4 may be used to fit the linear variance model, assuming the variance-covariance matrix has been stored in a SAS dataset “lv.” The coding of entries in “lv” ensures that the resulting variance–covariance matrix will have only nonnegative entries (Piepho et al., 2008b).
Mixed Models with
Heterogeneous Variance
The heterogeneous variance ridge regression model $\mathrm{RR}_{\mathrm{het}}$ is fitted by
random z1-zM;
Acknowledgments
KWS SAAT AG is thanked for providing the maize data. Jens
Möhring and Bettina Müller are thanked for carefully reading
an earlier version of this paper. I am also grateful for helpful
comments by two anonymous referees.
References
Bauer, A.M., T.C. Reetz, and J. Léon. 2006. Estimation of breed-
ing values of inbred lines using best linear unbiased prediction
(BLUP) and genetic similarities. Crop Sci. 46:2685–2691.
Bernardo, R. 1993. Estimation of coefficient of coancestry
using molecular markers in maize. Theor. Appl. Genet.
85:1055–1062.
Bernardo, R. 1994. Prediction of maize single-cross performance
using RFLPs and information from related hybrids. Crop Sci.
34:20–25.
Bernardo, R., and J. Yu. 2007. Prospects for genomewide selection
for quantitative traits in maize. Crop Sci. 47:1082–1090.
Calus, M.P.L., T.H.E. Meuwissen, A.P.W. deRoos, and R.F.
Veerkamp. 2008. Accuracy of genomic selection using different
methods to define haplotypes. Genetics 178:553–561.
Calus, M.P.L., and R.F. Veerkamp. 2007. Accuracy of breeding
values when using and ignoring the polygenic effect in
genomic breeding value estimation with a marker density of
one SNP per cM. J. Anim. Breed. Genet. 124:362–368.
Charcosset, A., B. Bonnisseau, O. Touchebeuf, J. Burstin, P.
Dubreuil, Y. Barriere, A. Gallais, and J.B. Denis. 1998. Prediction
of maize hybrid silage performance using marker data:
Comparison of several models for specific combining ability.
Crop Sci. 38:38–44.
Cogdill, R.P., and P. Dardenne. 2004. Least-squares support vector
machines for chemometrics: An introduction and evaluation.
J. Near Infrared Spectrosc. 12:93–100.
Draper, N.R., and H. Smith. 1998. Applied regression analysis.
3rd ed. John Wiley & Sons, New York.
Gianola, D., M. Perez-Enciso, and M.E. Toro. 2003. On marker-
assisted prediction of genetic value: Beyond the
ridge. Genetics 163:347–365.
Gianola, D., and J.B.C.H.M. van Kaam. 2008. Repro-
ducing kernel Hilbert spaces regression methods
for genomic assisted prediction of quantitative
traits. Genetics 178:2305–2313.
Goddard, M.E., and B.J. Hayes. 2007. Genomic selec-
tion. J. Anim. Breed. Genet. 124:323–330.
Gower, J.C. 1966. Some distance properties of latent
roots and vector methods used in multivariate
analysis. Biometrika 53:325–338.
Habier, D., R.L. Fernando, and J.C.M. Dekkers. 2007. The impact
of genetic relationship information on genome-assisted breed-
ing values. Genetics 177:2389–2397.
Henderson, C.R. 1985. Best linear unbiased prediction of non-
additive genetic merits in non-inbred populations. J. Anim.
Sci. 60:111–117.
Hoerl, A.E., and R.W. Kennard. 1970. Ridge regression: Biased esti-
mation for nonorthogonal problems. Technometrics 12:55–67.
Johnson, N.L., S. Kotz, and N. Balakrishnan. 1994. Continuous
univariate distributions. Vol. 1. 2nd ed. John Wiley & Sons,
New York.
Lahiri, S.N. 2003. Resampling methods for dependent data.
Springer, New York.
Littell, R.C., G.A. Milliken, W.W. Stroup, R. Wolfinger, and O.
Schabenberger. 2006. SAS for mixed models. 2nd ed. SAS
Inst., Cary, NC.
Maenhout, S., B. de Baets, G. Haesaert, and E. van Bockstaele. 2007.
Support vector machine regression for the prediction of maize
hybrid performance. Theor. Appl. Genet. 115:1003–1013.
Maenhout, S., B. de Baets, G. Haesaert, and E. van Bockstaele.
2008. Marker-based screening of maize inbred lines using
support vector machine regression. Euphytica 161:123–131.
Martin, R.J. 1979. A subclass of lattice processes applied to a prob-
lem of planar sampling. Biometrika 66:209–217.
McQuarrie, A.D.R., and C.L. Tsai. 1998. Regression and time
series model selection. World Scientific, Singapore.
Meuwissen, T.H.E., B.J. Hayes, and M.E. Goddard. 2001. Predic-
tion of total genetic value using genome-wide dense marker
maps. Genetics 157:1819–1829.
Figure 2. MIXED code to fit the power model with fixed $\mathrm{var}(e_i)$.
Figure 3. Portion of SAS code that needs to be replaced for the corresponding part in Fig. 1 to generate a matrix for fitting the linear variance model using the LIN structure in PROC MIXED.
Figure 4. MIXED code to fit the linear variance model with fixed $\mathrm{var}(e_i)$.
Miller, A. 2002. Subset selection in regression. Chapman and Hall,
London.
Piepho, H.P. 1997. Analyzing genotype–environment data by mixed
models with multiplicative effects. Biometrics 53:761–766.
Piepho, H.P. 1998. Empirical best linear unbiased prediction in
cultivar trials using factor analytic variance–covariance struc-
tures. Theor. Appl. Genet. 97:195–201.
Piepho, H.P. 1999. Stability analysis using the SAS system. Agron.
J. 91:154–160.
Piepho, H.P. 2000. A mixed model approach to mapping quantita-
tive trait loci in barley on the basis of multiple environment
data. Genetics 15:253–260.
Piepho, H.P., and H.G. Gauch. 2001. Marker pair selection for
QTL detection. Genetics 157:433–444.
Piepho, H.P., and C.E. McCulloch. 2004. Transformations in
mixed models: Application to risk analysis for a multienviron-
ment trial. J. Agric. Biol. Environ. Stat. 9:123–137.
Piepho, H.P., J. Möhring, A.E. Melchinger, and A. Büchse. 2008a.
BLUP for phenotypic selection in plant breeding and variety
testing. Euphytica 161:209–228.
Piepho, H.P., C. Richter, and E. Williams. 2008b. Nearest neigh-
bour adjustment and linear variance models in plant breeding
trials. Biometrical J. 50:164–189.
Pinheiro, J.C., and D.M. Bates. 1995. Approximations to the log-
likelihood function in the nonlinear mixed effects model. J.
Comput. Graph. Stat. 4:12–35.
Reif, J.C., A.E. Melchinger, and M. Frisch. 2005. Genetical and
mathematical properties of similarity and dissimilarity coefficients
applied in plant breeding and seed bank management.
Crop Sci. 45:1–7.
Ruppert, D., M.P. Wand, and R.J. Carroll. 2003. Semiparametric
regression. Cambridge Univ. Press, Cambridge, UK.
Schabenberger, O., and C.A. Gotway. 2005. Statistical methods
for spatial data analysis. CRC Press, Boca Raton, FL.
Schrag, T.A., J. Möhring, H.P. Maurer, B.S. Dhillon, A.E. Melch-
inger, H.P. Piepho, A.P. Sørensen, and M. Frisch. 2009.
Molecular marker-based prediction of hybrid performance in
maize using unbalanced data from multiple experiments with
factorial crosses. Theor. Appl. Genet. 118:741–751.
Searle, S.R., G. Casella, and C.E. McCulloch. 1992. Variance
components. John Wiley & Sons, New York.
Stich, B., J. Möhring, H.P. Piepho, M. Heckenberger, E.S. Buck-
ler, and A.E. Melchinger. 2008. Comparison of mixed-model
approaches for association mapping. Genetics 178:1745–1754.
Suykens, J.A.K., T.V. Gestel, J. de Brabanter, B. de Moor, and
J. Vandewalle. 2002. Least squares support vector machines.
World Scientific, Singapore.
Whittaker, J.C., R. Thompson, and M.C. Denham. 2000.
Marker-assisted selection using ridge regression. Genet. Res.
75:249–252.
Williams, E.R. 1986. A neighbour model for field experiments.
Biometrika 73:279–287.
Xu, S. 2003. Estimating polygenic effects using markers of the
entire genome. Genetics 163:789–801.
Xu, S., and Z. Jia. 2007. Genomewide analysis of epistatic effects
for quantitative traits in barley. Genetics 175:1955–1963.
Yu, J.M., G. Pressoir, W.H. Briggs, I.V. Bi, M. Yamasaki, J.F. Doebley,
M.D. McMullen, B.S. Gaut, D.M. Nielsen, J.B. Holland,
S. Kresovich, and E.S. Buckler. 2006. A unified mixed-model
method for association mapping that accounts for multiple
levels of relatedness. Nat. Genet. 38:203–208.