Ridge Regression and Extensions for Genomewide Selection in Maize

H. P. Piepho*

ABSTRACT
This paper reviews properties of ridge regression for genomewide (genomic) selection and establishes close relationships with other methods to model genetic correlation among relatives, including use of a kinship matrix and the simple matching coefficient as computed from marker data. A number of alternative models are then proposed exploiting ties between genetic correlation based on marker data and geostatistical concepts. A simple method for automatic marker selection is proposed. The methods are exemplified using a series of experiments with test-cross hybrids of maize (Zea mays L.) conducted in five environments. Results underline the need to appropriately model genotype–environment interaction and to employ an independent estimate of error. It is also shown that accounting for genetic effects not captured by markers may be important.

Institute for Crop Production and Grassland Science, Universität Hohenheim, Fruwirthstrasse 23, 70599 Stuttgart, Germany. Received 13 Oct. 2008. *Corresponding author (piepho@uni-hohenheim.de).

Abbreviations: AIC, Akaike information criterion; BLUP, best linear unbiased prediction; DH, doubled haploids; EXP, exponential model; FA, factor-analytic; GAU, Gaussian model; GCA, general combining ability; GS, genome-wide (genomic) selection; LS, least squares; LV, linear variance; POW, power model; REML, restricted maximum likelihood; RRhet, ridge regression with heterogeneous variance among markers; RRhom, ordinary ridge regression; RRhom2, ridge regression with reduced set of markers; SCA, specific combining ability; SPH, spherical model; SVM, support vector machine.

Published in Crop Sci. 49:1165–1176 (2009). doi: 10.2135/cropsci2008.10.0595. © Crop Science Society of America, 677 S. Segoe Rd., Madison, WI 53711 USA.
Genomewide selection or genomic selection (GS) is a marker-based method for estimating genotypic values without prescreening of markers by significance testing or other subset selection procedures (Whittaker et al., 2000; Meuwissen et al., 2001). Most applications so far have been in animal breeding (Meuwissen et al., 2001; Goddard and Hayes, 2007), but the method is rapidly becoming popular in plant breeding (Bernardo and Yu, 2007). With the development of high-throughput marker technologies, interest in such statistical methods is expected to increase further in the near future.

The key idea is to predict the genotypic value of the ith genotype (i = 1, ..., G), denoted as g_i, using all available markers. One option is to use a regression model of the form

g_i = \sum_{k=1}^{M} u_k z_ik    [1]

where z_ik is a regressor variable for the ith genotype and kth marker, while u_k (k = 1, ..., M) are regression coefficients.
Typically, for a biallelic marker with alleles A_1 and A_2, we define z_ik = 1 for A_1A_1, z_ik = -1 for A_2A_2, and z_ik = 0 for A_1A_2 or when the marker genotype is missing. The linear model from Eq. [1] can be written in matrix form as
g = Zu [2]
where g = (g_1, g_2, ..., g_G)', Z = {z_ik}, and u = (u_1, u_2, ..., u_M)'.

When there is a single (mean-centered) observation y_i per genotype with independent residual errors e_i having zero mean and variance σ²_e, the model for the observed data is y = Zu + e, where y = (y_1, y_2, ..., y_G)' and e = (e_1, e_2, ..., e_G)'. The classical least squares estimator, û = (Z'Z)^{-1} Z'y, minimizes the sum of squares ||y - Zu||², with ||·|| denoting the length of a vector.
This estimator is well known to perform poorly when the number of markers (M) is large relative to the number of genotypes (G), and it cannot be computed at all when M > G, which is expected to be increasingly the case with high-density marker systems. Selecting a subset of markers by one of the common variable selection methods (forward selection, stepwise regression, etc.; Miller, 2002) is a possible alternative often used in marker-assisted selection programs (Bernardo and Yu, 2007), but the performance of these methods is likely to deteriorate when there are many, possibly highly correlated markers (Whittaker et al., 2000).
There are many regularization methods addressing the problem of large M, which avoid the selection problem essentially by keeping all markers in the model. One of these methods is ridge regression (Hoerl and Kennard, 1970), which was first used for GS by Whittaker et al. (2000). Ridge regression minimizes the penalized sum of squares ||y - Zu||² + λ² u'u, where λ² is a penalty parameter, yielding the estimator

û = (Z'Z + λ² I_M)^{-1} Z'y    [3]

where I_M is the M-dimensional identity matrix. The penalty term overcomes the problem of ill-conditioning when multicollinearity among columns in Z causes Z'Z to be singular, or nearly so. The penalized estimator in Eq. [3] involves shrinkage, thus avoiding overfitting, and it stabilizes estimation relative to least squares. The penalty parameter λ², which determines the amount of shrinkage, may be chosen in a number of ways (Draper and Smith, 1998), including cross-validation (Ruppert et al., 2003).
One particular method, which has been used by Meuwissen et al. (2001) for GS, assumes that regression coefficients are independent random draws from a common normal distribution, that is,

u_k ~ N(0, σ²_u), (k = 1, ..., M)    [4]

Under this model, we have λ² = σ²_e / σ²_u, where σ²_e is the residual variance (Draper and Smith, 1998), and the penalized estimator in Eq. [3] turns out to be equivalent to best linear unbiased prediction (BLUP) of u (Ruppert et al., 2003). This method has also been used by Bernardo and Yu (2007), who found it, based on a simulation study, to perform well compared to subset selection, in which markers were selected per chromosome by backward elimination with relaxed significance thresholds. One advantage of the mixed model formulation of ridge regression is that we can estimate the variance components, and hence the penalty, in a straightforward way by restricted maximum likelihood (REML) (Ruppert et al., 2003). Furthermore, it is possible to account for other sources of variation by adding fixed and random effects.
The present paper briefly reviews some of the features of ridge regression as performed in a mixed model framework using REML, giving particular emphasis to its similarity with spatial models. I then outline some alternative models for GS, including spatial models and ridge regression with heterogeneous variances. The exposition emphasizes non-Bayesian implementations of methods that are mostly treated in a Bayesian framework, mainly in the animal breeding literature. Equivalence relations of perhaps seemingly different methods are discussed. I give some hints on how these models can be fitted using standard mixed model software and illustrate them using a dataset from a breeding program in maize (Zea mays L.). The importance of accounting for both genotype–environment interaction and for polygenic effects not captured by markers is highlighted.
MATERIAL AND METHODS
The total genotypic effect will be partitioned into a component explained by the markers (g_i) and a polygenic component (v_i) not captured by the markers. Thus, the total genotypic effect h_i is

h_i = g_i + v_i    [5]

Our main objective is to estimate h_i. It is assumed throughout that g_i and v_i are independent of one another. It is important to account for residual polygenic effects v_i to avoid overfitting (Goddard and Hayes, 2007). In case of a single unstructured population, for example a population of doubled haploid (DH) lines generated from a single cross, we have

var(v) = σ²_v I_G    [6]

where v = (v_1, v_2, ..., v_G)'. For structured populations, var(v) may involve covariances among relatives (Piepho et al., 2008a).

We will consider different models for g = (g_1, g_2, ..., g_G)', conditionally on the markers Z = {z_ik}. All conditional models will be of the form

var(g|Z) = σ²_u Γ    [7]

for some matrix Γ that is a function of Z. In Eq. [7] and later in the paper, the expression on which we condition (Z in this case) is given following a vertical bar.
Ridge Regression and Related Models
Under the mixed model formulation of ridge regression in Eq. [4], we have Γ = ZZ', so that the genotypic variance–covariance structure is linear, with the covariance of two genotypes depending on similarity in their marker profiles. Another well-known model, in which covariance is a linear function of genetic similarity, is given by Γ = 2A, where A is the numerator relationship matrix computed either from pedigree records or from marker data (Henderson, 1985). A further model is Γ = 2K, where K is the kinship matrix estimated from the markers (Yu et al., 2006). When A or K are estimated from marker data, the covariance of two genotypes is again a linear function of similarity between marker profiles, so in this sense use of the A or K matrix estimated from markers is an early form of GS. In fact, under some circumstances, use of the kinship matrix K and ridge regression are equivalent, as will be shown below.

As ridge regression implies a genetic covariance among genotypes, the question may be posed whether this model is commensurate with an analysis that ignores marker information and is based on the assumption of independent genotypes with constant variance. This question is relevant for the two-stage analysis of multi-environment data to be discussed later in this paper. The answer is yes in a simple unstructured population, for example a population of DH or recombinant inbred lines originating from a single cross of inbred lines (Bernardo and Yu, 2007). To see this, we may evaluate Eq. [2] in two ways: (i) conditioning on the markers (ridge regression) and (ii) not conditioning on markers. The variance conditioning on markers is var(g|Z) = σ²_u ZZ'. The unconditional variance can be derived from a general result on moments of joint random variables (Searle et al., 1992, p. 461):

E(g) = E_Z E(g|Z) = 0    [8]

var(g) = E_Z var(g|Z) + var_Z E(g|Z) = σ²_u E_Z(ZZ')    [9]

with E_Z and var_Z representing the expectation and variance over Z. In a DH population derived from a single cross we have

E_Z(ZZ') = M [p I_G + (1 - p) J_G]    [10]

where p is the probability that a marker is segregating in the underlying cross and J_G is a G × G matrix of ones (for i ≠ i', the product z_ik z_i'k has expectation 0 for a segregating marker and 1 for a fixed marker). The term in J_G is confounded with the fixed intercept and so may be dropped from the model (Piepho et al., 2008b). Hence, when ignoring marker information, it is valid to assume independent and identically distributed genotypic effects; that is, var(g) = σ²_g I_G. Equation [10] shows that σ̂²_u = M^{-1} σ̂²_g provides a reasonable estimate of σ²_u when p = 1 and σ²_v = 0 (Bernardo and Yu, 2007). It may be preferable, however, to estimate σ²_u directly by REML based on the ridge regression model in Eq. [4], because this allows accounting for σ²_v, as will be shown further below, and because this caters for the case p < 1.

In structured populations, Eq. [10] does not usually hold, but when the structure is simple, a parsimonious unconditional random effects model may be obtained for var(g). For example, in a diallel crossing scheme of several inbred lines from the same population, where each cross produces a family of DH lines, the model for var(g) comprises random effects for general combining ability (GCA) and specific combining ability (SCA) as well as for lines within crosses (Piepho et al., 2008b). Generally, p in Eq. [10] will vary between crosses, resulting in heterogeneity of variance between crosses. This heterogeneity can be accounted for, in principle, in a nonmarker model by assigning a different variance to each cross. The practical difficulty with this approach is that often the number of progeny per cross is rather limited, making variance estimates rather unreliable. This problem can be overcome by ridge regression, where heterogeneity among crosses is modeled by a single variance component σ²_u, because heterogeneity between crosses is represented by the structure of ZZ'.
In animal breeding, very accurate BLUPs are often available for sires and dams due to extensive records on progeny, accounting for half the additive genetic variance, so focusing GS on Mendelian within-family sampling has been suggested (H. Simianer, personal communication, 2008). This approach is applicable in plant breeding as well, though BLUPs of parents are usually less reliable than in animal breeding programs. In a diallel crossing scheme, this means replacing Z with Z̃ = Z - Z_F1, where Z_F1 is the marker data of the F_1 generation corresponding to the underlying crosses, and assuming conditional independence between crosses for the marker-dependent part of the model; that is, Γ is block-diagonal with blocks Z̃_c Z̃_c', where Z̃_c is the submatrix of Z̃ corresponding to the cth cross.
Simple Spatial Mixed Models
The property of a genetic covariance depending on similarity of marker profiles brings to mind a host of alternatives such as geostatistical methods, where covariance depends on spatial proximity. Thus, replacing spatial distance with genetic distance, spatial methods can be used to model genetic correlation (Piepho et al., 2008a). Ridge regression may, in fact, be regarded as one type of spatial model, as will be elaborated after a brief outline of spatial models as applied to marker data.
If marker scores z_ik are regarded as coordinates of genotypes in M-dimensional marker space, covariance can be modeled as a linear or nonlinear function of distance in that space. Thus, the covariance is expressed as

Γ = {f(d_ii')}    [11]

where d_ii' is the Euclidean distance of genotypes i and i', defined as d_ii' = ||z_i - z_i'||, with z_i equal to the ith row of Z, and f(d) is some monotonically decreasing function of d. There are different options for the function f(d), including those shown in Table 1 (Schabenberger and Gotway, 2005).

Table 1. Genotypic covariance models of the form Γ = {f(d_ii')}, where d is the Euclidean distance computed from marker data and θ is a parameter.

Name                  Equation
Gaussian              f(d) = exp(-d²/θ)
Power (exponential)   f(d) = θ^d
Exponential           f(d) = exp(-d/θ)
Spherical             f(d) = 1 - 3d/(2θ) + d³/(2θ³), (d < θ)
Linear                f(d) = 1 - θd
Quadratic             f(d) = 1 - θd²
The first four models in Table 1 are commonly used as spatial covariance structures in mixed model packages (Littell et al., 2006). The power model (POW) is just a re-parameterization of the exponential model (EXP). It is worth fitting both models, however, because convergence behavior may differ. The linear variance (LV) model was proposed by Williams (1986) in the context of blocked field experiments. In fitting this model, precautions must be taken to ensure that the resulting variance–covariance matrix remains positive definite (Piepho et al., 2008b). An advantage of the model compared to the first four (nonlinear) spatial models shown in Table 1 is parsimony, because σ²_u Γ = σ²_u J_G - σ²_u θ {d_ii'}, where the first term on the right-hand side is confounded with the fixed intercept and θ is a parameter as defined in Table 1. Thus, the only free parameter to be estimated is φ = σ²_u θ.
The quadratic model is not commonly used in spatial statistics, but is considered here to illustrate a close relation between ridge regression and spatial models. It is worth pointing out that the quadratic model can be regarded as a first-order
Taylor approximation of the Gaussian model (GAU), because exp(-d²/θ) ≈ 1 - d²/θ when d²/θ is not far from zero. To study the properties of the quadratic model, note that the squared Euclidean distance of genotypes i and i' can be expressed as

d²_ii' = ||z_i - z_i'||² = \sum_{k=1}^{M} (z_ik - z_i'k)² = \sum_{k=1}^{M} (z²_ik + z²_i'k - 2 z_ik z_i'k)    [12]
If all genotypes and all markers have either z_ik = 1 or z_ik = -1, which happens, for example, for recombinant inbred lines or DH lines, we have (z_ik - z_i'k)² = 0 when z_ik = z_i'k (genotypes have identical alleles) and (z_ik - z_i'k)² = 4 when z_ik = -z_i'k (genotypes have opposite alleles). Also, z²_ik = 1 for all i and k. Thus, if s_ii' denotes the simple matching coefficient of genotypes i and i', that is, the proportion of markers identical in state, we have

d²_ii' = 4M(1 - s_ii') = 2(M - \sum_{k=1}^{M} z_ik z_i'k)    [13]
so that the squared distance matrix has the form

D_sq = {d²_ii'} = 4M(J_G - S) = 2(M J_G - ZZ')    [14]

where S = {s_ii'}. The key observation here is that the matrices D_sq, S, and ZZ' are all linear shift-scale transformations of one another; that is, either one can be obtained from any of the other two by (i) multiplication with a constant and (ii) addition of a constant term times the matrix J_G. For example, if we fit a quadratic spatial model of the form f(d) = 1 - θd², the variance–covariance matrix is

var(g|Z) = σ²_u (J_G - θ D_sq) = σ²_u [J_G - θ(2M J_G - 2ZZ')] = α_1 J_G + α_2 ZZ'    [15]

where α_1 = (1 - 2Mθ)σ²_u and α_2 = 2σ²_u θ. The term in J_G is confounded with the (fixed) intercept and can therefore be dropped from the model (Piepho et al., 2008b). It emerges that the quadratic spatial model is equivalent to ridge regression. In other words, ridge regression can be seen as a special type of spatial model, in which covariance is a quadratic function of Euclidean distance. In the same vein, we may obtain an equivalent fit by just using var(g|Z) = σ²_u S, which has also been proposed in a plant breeding context to model genetic correlation among relatives (Bauer et al., 2006).
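As a small numerical check of Eq. [13] and [14] (hypothetical numbers for illustration only, not taken from the maize data), consider M = 4 markers and two DH lines with scores z_i = (1, 1, 1, 1) and z_i' = (1, 1, 1, -1). Then s_ii' = 3/4 and

d²_ii' = 0 + 0 + 0 + 4 = 4,   4M(1 - s_ii') = 4 × 4 × 1/4 = 4,   2(M - \sum_k z_ik z_i'k) = 2(4 - 2) = 4,

so the three expressions in Eq. [13] agree, as they must.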
On a similar note, in association mapping, the residual genotypic effect not accounted for by regression on a candidate marker can be modeled as var(g|Z) = 2σ²_u K, where K is a kinship matrix of the form

K = a S + b J_G    [16]

with a = (1 - c)^{-1}, b = -c(1 - c)^{-1}, and c equal to the average probability of identity in state for genes coming from random individuals in the population (Yu et al., 2006; Stich et al., 2008). Again, this is just a shift-scale transformation of S, and the term b J_G may be dropped as it is confounded with the intercept. So the use of K for inbred lines is equivalent to ridge regression and to a quadratic spatial model. Also, using some other similarity measure such as the Jaccard or Dice coefficient (for dominant marker systems) in place of simple matching (Yu et al., 2006) is seen to be very similar to use of the kinship matrix.
All spatial models considered so far employ the Euclidean distance. Mixed model software usually requires the coordinates (markers) as input to compute the Euclidean distance. Other distances (Reif et al., 2005) may be fitted within a mixed model if a representation in Euclidean space, or some approximation thereof, is found by principal coordinate analysis (Gower, 1966), but this is not elaborated here (Piepho et al., 2008a).
Based on spatial models, BLUPs can be computed for any point in the space spanned by the markers. This includes the genotypes tested as well as potentially other genotypes, which have been genotyped but for which no phenotypic data are available. BLUP based on spatial models is equivalent to kriging in spatial statistics (Ruppert et al., 2003). Thus, the BLUP for a genotype constitutes an interpolation of the genotypic value based on the genotype's own data and that of the other genotypes, with the impact of a tested genotype on the prediction of another genotype depending on genetic proximity. It is also worth noting that adding a polygenic component σ²_v I_G is equivalent to adding a nugget effect that accounts for residual measurement error in spatial models (Schabenberger and Gotway, 2005).
BLUP based on GAU also bears an intimate relationship with least squares support vector machine (LS-SVM) regression (Suykens et al., 2002, p. 106–107) when a Gaussian kernel is used, as is common in chemometric applications such as near-infrared spectroscopy (Cogdill and Dardenne, 2004). For details on the analogies of LS-SVM regression and Gaussian processes see Suykens et al. (2002). Thus, in as much as ridge regression may be seen as an approximation to GAU, it is also an approximation of LS-SVM regression. SVM is another class of regularization methods, which have recently been applied to the problem of hybrid prediction in plant breeding (Maenhout et al., 2007, 2008). Finally, GAU as applied to markers is essentially equivalent to reproducing kernel Hilbert spaces regression as proposed by Gianola and van Kaam (2008).
Mixed Models with Heterogeneous Variance
The BayesA and BayesB methods of Meuwissen et al. (2001) assume that the ridge regression model is extended such that each marker has its own variance. Thus, under both approaches the regression model is

g_i = \sum_{k=1}^{M} \sqrt{σ²_k} t_k z_ik    [17]

where t_k ~ N(0, 1) and σ²_k is the variance for the kth marker. Under both the BayesA and BayesB models, a prior distribution is assumed for the variances σ²_k. The regression coefficient under these models has the form u_k = \sqrt{σ²_k} t_k. To relate these models to ridge regression, it is important to recognize that the
variance σ²_k essentially is just another marker-specific random effect. As t_k is standard normal, the random regression coefficient u_k = \sqrt{σ²_k} t_k will have a symmetric nonnormal marginal distribution whose specific form depends on the assumed prior for σ²_k. Clearly, this marginal distribution for u_k (not conditioning on σ²_k) has a constant variance, and so the only difference to ridge regression is that nonnormality holds for u_k.
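To spell out the constant-variance claim (a one-line argument using only the assumptions stated above), note that t_k and σ²_k are independent and E(t²_k) = 1, so

var(u_k) = E(σ²_k t²_k) = E(σ²_k) E(t²_k) = E(σ²_k),

which is the same for every marker k; only the shape of the marginal distribution of u_k, not its variance, distinguishes these priors from the ridge regression model [4].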
These properties of Bayesian models suggest that in a non-Bayesian framework we may consider fitting nonnormal distributions to u_k, which closely mimic the BayesA and BayesB model fits. A nonnormal distribution with stronger peaks and heavier tails may be more realistic than a normal distribution when most of the markers have a very small effect. One convenient class of distributions is available via the Johnson S_U system of transformed normal random variables (Johnson et al., 1994; Piepho and McCulloch, 2004). I found this type of nonlinear mixed model difficult to fit, however, using adaptive Gaussian quadrature (Pinheiro and Bates, 1995). When fitting these models using a first-order method, corresponding to Gaussian quadrature with a single quadrature point, I obtained the same log-likelihoods with normal and nonnormal u. This similarity is not unexpected, because under either model the genotypic value is Zu, and based on the central limit theorem this linear combination is expected to be nearly normal even when u is nonnormal. In light of these considerations, the good performance of BayesB relative to ridge regression in Meuwissen et al. (2001) is probably at least partly due to the strong impact of the assumed prior distribution, which was derived based on the model used to simulate the data (though it did not exactly match that model; Goddard and Hayes, 2007).
A simple alternative is to fit Eq. [17] in a frequentist setting, regarding σ²_k as fixed parameters (RRhet). A REML fit will typically yield many zero estimates, which essentially implies an automatic selection of markers. Thus, we may simply drop markers with zero variance estimates. This is similar in spirit to BayesB, where the prior for σ²_k has a peak at σ²_k = 0, which induces an automatic marker selection. For the remaining markers, we can perform a likelihood ratio test for homogeneity of variance. In case of homogeneity, the ordinary ridge regression model with homogeneous variance can be fitted with the selected markers (RRhom2). In case of heterogeneity, we may stick with RRhet.
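A minimal SAS sketch of this two-step procedure is given below. It uses the RANDOM statements listed in the Appendix, but the dataset and variable names (means, y, z1-z136) and the list of retained markers are illustrative assumptions, not part of the original paper; in practice the second step would list the markers whose variance components were not estimated as zero in the first step.

/* Step 1 (RRhet): a separate variance component for each marker (default TYPE=VC). */
proc mixed data=means method=reml covtest;
   model y = / solution;
   random z1-z136;
run;

/* Step 2 (RRhom2): ordinary ridge regression with a single common variance,
   refitted with only the markers retained in step 1 (marker list illustrative). */
proc mixed data=means method=reml;
   model y = / solution;
   random z5 z17 z23 / subject=intercept type=toep(1);
run;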
Extension of Models to Genotype–Environment Data
When data from multiple environments are available, modeling of genotype–environment interaction requires special attention. In particular, genetic correlation between environments needs to be modeled (Piepho, 2000). In analogy to the case of a single environment, it will be assumed that the effect of the ith genotype in the jth environment (j = 1, 2, ..., E), denoted as h_ij, can be partitioned into a marker-based effect g_ij and a polygenic effect v_ij:

h_ij = g_ij + v_ij    [18]

Each of the two component effects is partitioned into main effect and interaction, that is,

g_ij = g_i + f_ij    [19]

and

v_ij = v_i + w_ij    [20]

where g_i and v_i are the marker-based and polygenic main effects, while f_ij and w_ij are the corresponding interaction terms with the environment. Both g_i and f_ij are modeled as functions of the marker data. Our main objective is to estimate the genotypic main effect h_i = g_i + v_i.
Let f_j = (f_1j, f_2j, ..., f_Gj)' and f = (f_1', f_2', ..., f_E')', and let w be similarly defined. For the marker-based effects it is assumed that

var(g|Z) = σ²_u Γ    [21]

and

var(f|Z) = Σ_f ⊗ Γ    [22]

where ⊗ denotes the Kronecker (or direct) product (Searle et al., 1992). For example, with ridge regression (RRhom) we have Γ = ZZ'. Some choices for the E × E variance–covariance matrix Σ_f are given in Table 2, including the factor-analytic (FA) model (Piepho, 1997, 1998). For the polygenic effects we may assume

var(v) = σ²_v I_G    [23]

and

var(w) = Σ_w ⊗ I_G    [24]

where Σ_w is also chosen from the options in Table 2.

Table 2. Models for variance–covariance among genotypes in different environments (Σ_q; q = f, w).

Model            Short-hand   Equation
Independent      ID           σ²_1 I_E
Diagonal         DIAG         D = diag(σ²_1, σ²_2, ..., σ²_E)
Factor-analytic  FA(P)        D + \sum_{p=1}^{P} λ_p λ_p', with λ_p = (λ_p1, λ_p2, ..., λ_pE)'
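To make the Kronecker structure in Eq. [22] concrete (an illustrative expansion only, not an additional model), consider E = 2 environments with Σ_f = {σ_f,jj'}. Then

var(f|Z) = Σ_f ⊗ Γ = \begin{pmatrix} σ_f,11 Γ & σ_f,12 Γ \\ σ_f,12 Γ & σ_f,22 Γ \end{pmatrix},

so the marker-based covariance between interaction effects of genotypes i and i' in environments j and j' is the (j, j')th element of Σ_f times the (i, i')th element of Γ.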
Two-Stage Analysis for Genotype–Environment Data
It is generally desirable to fit a suitable model directly to genotype–environment data. Estimation requires, however, that replicate data are available to separate genotypic from environmental effects. When no replicate data are available, as is the case for the example considered in this paper, separation of these effects is not possible. In this case, one may compute genotype means over environments. I use a two-stage approach in which genotype means are computed based on a model with fixed genotype main effects and random interactions. The marker-based component in the interaction is set to zero at this stage, because genotype–environment effects cannot be separated from residual error; it will be absorbed into the polygenic effect w_ij according to Eq. [9] and [10]. Thus, I fitted the model

y_ij = µ_j + h_i + w_ij    [25]

where y_ij is the adjusted mean of the ith genotype in the jth environment and µ_j is the main effect of the jth environment. Note that the effect w_ij in Eq. [25] subsumes the residual error of the adjusted mean. Based on this model I estimated adjusted genotype means y_i, taking both µ_j and h_i as fixed, and then fitted the model

y_i = µ + h_i + e_i    [26]
where var(h|Z) = var(g|Z) + var(v) with h = (h_1, h_2, ..., h_G)', and var(e_i) is fixed at the squared standard error of the adjusted mean y_i. For comparison, I also fitted Eq. [26] merging e_i with the polygenic effect v_i (contained in h_i) into an independent residual with constant variance. Thus, in this analysis, var(e_i) was not fixed.
One could try to directly fit the model y_ij = µ_j + h_i + f_ij + w_ij in a single step, in which case the interaction effect would comprise the marker-dependent term f_ij. With no independent estimate of error, however, this is prone to overfitting, because the fit for f_ij would then be confounded with any correlation among adjusted means that is due to the trial design in the different environments.
The Maize Data
Two hundred eight DH lines originating from a single cross of inbred parental lines in maize were tested in three series of trials over five locations. In four locations (LOC), a lattice design with block size 10 was employed, while in one location a complete block design was used. In four locations, only a single replicate was planted, while in one location there were two replicates planted according to a lattice design. Trials, replicates, and incomplete blocks were coded as TRIAL, REP, and BLOCK, respectively. For each location, adjusted entry means y_ij were computed. For unreplicated trials, the model was ENTRY + TRIAL. For the location with replicated trials, the model was ENTRY + TRIAL.REP.BLOCK. Adjusted means for entries with marker data were subjected to mixed model analysis. The trait evaluated was kernel dry weight per plot.

There were seven check genotypes. For some of the DH lines marker information was missing, so these were treated as additional checks. Adjusted means of all checks were excluded from mixed model analysis. A total of 136 simple sequence repeat and single nucleotide polymorphism markers evenly distributed across the genome were scored for the DH lines. The two alleles of a marker were coded Z = -1 and Z = +1, while missing data were coded as Z = 0.
Software and Model Evaluation
All models were fitted by the REML method. For each model, we report both the deviance (minus twice the restricted log-likelihood) and the Akaike information criterion (AIC), defined as the deviance plus twice the number of variance parameters. Small values of AIC indicate a preferable model. The AIC is closely related to cross-validation criteria (McQuarrie and Tsai, 1998; Piepho and Gauch, 2001). Some code for SAS PROC MIXED (Littell et al., 2006) is given in the Appendix.
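As a quick arithmetic check of this definition against the results reported below, the ID structure in Table 3 has a single variance parameter, so AIC = 2843.1 + 2 × 1 = 2845.1, and the DIAG structure has E = 5 parameters, giving AIC = 2753.7 + 2 × 5 = 2763.7.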
RESULTS
Computation of Genotype Means
To compute genotype means across environments, we fitted the two-way model from Eq. [25] with fixed main effects for environments and genotypes and various structures for Σ_w. Based on the AIC values in Table 3, it was decided to compute genotype means y_i using the FA(1) model. The average variance of an adjusted mean based on this analysis was 0.174.

Table 3. AICs for various variance–covariance structures Σ_w fitted to the phenotypic data (genotype–environment means). Models had fixed main effects for genotypes and environments.

Model (Σ_w)   Deviance   AIC†
ID            2843.1     2845.1
DIAG          2753.7     2763.7
FA(1)         2744.4     2762.4
FA(2)         2743.5     2771.5
†AIC, Akaike information criterion.
Analysis of Genotype Means Not Fixing the Error Variance
We first fitted models for h based on Eq. [26] without fixing the error variance var(e_i) at the squared standard error of a mean, such that error could not be separated from var(h|Z). Thus, the residual variance comprised the variance of both the polygenic effect v_i and the error associated with the mean (e_i). The model fits are shown in Table 4. There was no significant heterogeneity among the 38 markers with nonzero variance under model RRhet. Thus, RRhom2 was fitted. The example shows that ridge regression and spatial models give better fits than a model with independent genotypic effects. Also, the spatial models provide a fit similar to ridge regression in terms of AIC. Strikingly, some of the spatial models have a rather smaller residual variance than ridge regression, in which case BLUP comes very close to the adjusted means, which explains the high correlation of adjusted means with BLUPs under spatial models (POW, EXP, GAU, SPH). By contrast, the correlation of adjusted means with BLUPs is quite low for ridge regression and the quadratic model (Table 5), so selection decisions by these GS methods would be quite different than by adjusted means. The finding that some of the spatial models have a residual variance rather smaller than the average variance of an adjusted mean (0.174) is indicative of overfitting. In terms of AIC, differences are minor between spatial models and ridge regression. Overall, the LV model is marginally better than other marker-based spatial models. RRhom2 has by far the best AIC value of all models, showing that preselection of markers is an important consideration.

Table 4. Model fits of different genetic covariance models with the maize data. Error variance var(e_i) not fixed.

Model for g_i                     Deviance   AIC†   Residual variance‡
Independent                       372.8      374.8  0.3454
Ridge regression§
  RRhom                           336.9      340.9  0.2272
  RRhet (38 markers selected)¶    289.5      367.5  0.1635
  RRhom2 (38 markers)             303.3      307.3  0.1773
Spatial models
  Linear                          335.6      339.6  0.1139
  Quadratic                       336.9      340.9  0.2272
  Power                           334.8      340.8  0.0020
  Exponential                     334.8      340.8  0.0018
  Gaussian                        333.9      339.9  0.0002
  Spherical                       334.3      340.3  <0.0001
†AIC, Akaike information criterion.
‡Residual subsumes v_i and e_i, because var(e_i) was not fixed.
§RRhet, ridge regression with heterogeneous variance among markers; RRhom, ordinary ridge regression; RRhom2, ridge regression with reduced set of markers.
¶Heterogeneity of variance among selected markers was not significant according to a likelihood ratio test (α = 5%).

Table 5. Pearson correlation (above diagonal) and Spearman rank correlation (below diagonal) of different estimators of genotypic value for the maize data. Error variance var(e_i) not fixed.

Model/estimator     AM     RRhom  RRhet  RRhom2  LV     QUAD   POW    EXP    GAU    SPH
Adj. mean (AM)      1      0.703  0.774  0.756   0.966  0.703  1      1      1      1
RRhom               0.670  1      0.920  0.942   0.862  1      0.705  0.705  0.703  0.703
RRhet               0.759  0.916  1      0.974   0.884  0.920  0.776  0.776  0.774  0.774
RRhom2              0.745  0.935  0.974  1       0.880  0.942  0.758  0.758  0.756  0.756
Linear (LV)         0.960  0.862  0.880  0.976   1      0.862  0.967  0.967  0.966  0.966
Quadratic (QUAD)    0.697  1      0.916  0.935   0.862  1      0.705  0.705  0.703  0.703
Power (POW)         1      0.698  0.761  0.746   0.961  0.698  1      1      1      1
Exponential (EXP)   1      0.698  0.760  0.746   0.960  0.698  1      1      1      1
Gaussian (GAU)      1      0.697  0.759  0.745   0.960  0.697  1      1      1      1
Spherical (SPH)     1      0.697  0.759  0.745   0.960  0.697  1      1      1      1
RRhet, ridge regression with heterogeneous variance among markers; RRhom, ordinary ridge regression; RRhom2, ridge regression with reduced set of markers.
Analysis of Genotype Means Fixing the Error Variance
The fact that some of the models yielded very tiny residual variances when var(e_i) was not fixed, thus rendering BLUP essentially the same as adjusted means, is reason for concern. Adjusted means are typically correlated, though the correlation may not be large. It is therefore possible that in a model for adjusted means, the genetic covariance model captures part of the correlation among adjusted means that is purely nongenetic, thus yielding an upward bias in genetic variance. For this reason it is advisable to generally obtain an independent estimate of error (as in Bernardo and Yu, 2007). Thus, we fixed var(e_i) at the squared standard error of adjusted genotype means based on the FA(1) model for genotype–environment means. The resulting fits are shown in Table 6. Most models leave rather little polygenic variance σ²_v. Again, RRhom2 has by far the best fit in terms of AIC. Among spatial models, LV and GAU are best. None of the GS methods is perfectly correlated with the adjusted mean (Table 7), while several of the spatial models (LV, POW, EXP, SPH) are virtually identical.

Table 6. Model fits of different genetic covariance models with the maize data. Error variance var(e_i) fixed at the value of the squared standard error of a mean based on the FA(1) model fitted to genotype–environment data.

Model for g_i                     Deviance   AIC†   Polygenic genetic variance (σ²_v)
Independent                       372.8      374.8  0.1712
Ridge regression‡
  RRhom                           336.9      340.9  0.0528
  RRhet (37 markers selected)     289.7      363.7  0
  RRhom2 (37 markers)§            301.9      305.9  0.0045
Spatial models
  Linear                          337.1      339.1  0
  Quadratic                       336.9      340.9  0.0528
  Power¶                          337.2      341.2  0
  Exponential                     337.1      341.1  0
  Gaussian                        335.2      339.2  0
  Spherical                       337.1      341.1  0
†AIC, Akaike information criterion.
‡RRhet, ridge regression with heterogeneous variance among markers; RRhom, ordinary ridge regression; RRhom2, ridge regression with reduced set of markers.
§Heterogeneity of variance among selected markers was not significant according to a likelihood ratio test (α = 5%); variance estimates were shrunken to the overall mean (for details see text).
¶Autocorrelation converged to a value close to unity.

Table 7. Pearson correlation (above diagonal) and Spearman rank correlation (below diagonal) of different estimators of genotypic value for the maize data. Error variance var(e_i) fixed at the value of the squared standard error of a mean based on the FA(1) model fitted to genotype–environment data.

Model/estimator     AM     RRhom  RRhet  RRhom2  LV     QUAD   POW    EXP    GAU    SPH
Adj. mean (AM)      1      0.881  0.768  0.769   0.920  0.881  0.920  0.920  0.887  0.920
RRhom               0.871  1      0.933  0.946   0.995  1      0.995  0.995  0.997  0.995
RRhet               0.753  0.929  1      0.975   0.915  0.933  0.914  0.914  0.927  0.914
RRhom2              0.760  0.941  0.975  1       0.926  0.769  0.926  0.926  0.942  0.926
Linear (LV)         0.912  0.994  0.912  0.923   1      0.920  1      1      0.996  1
Quadratic (QUAD)    0.871  1      0.929  0.941   0.994  1      0.995  0.995  0.997  0.995
Power (POW)         0.912  0.994  0.912  0.923   1      0.994  1      1      0.996  1
Exponential (EXP)   0.912  0.994  0.912  0.923   1      0.994  1      1      0.996  1
Gaussian (GAU)      0.879  0.996  0.926  0.939   0.995  0.996  0.995  0.995  1      0.996
Spherical (SPH)     0.912  0.994  0.912  0.923   1      0.994  1      1      0.995  1
RRhet, ridge regression with heterogeneous variance among markers; RRhom, ordinary ridge regression; RRhom2, ridge regression with reduced set of markers.
DISCUSSION
This paper has discussed some models for GS that are readily implemented with a mixed model package. Results for a maize dataset indicate that the spatial models are an interesting alternative to ridge regression. The LV model is particularly attractive because it involves only a single parameter. Automatic marker selection by a preliminary fit of a model with heterogeneous variance between markers is a promising method. A thorough comparison with other methods of subset selection would be worthwhile.

In the analysis of across-environment genotype means without an independent estimate of error, the residual variance estimator often was close to zero, indicating that the marker-based component captured substantial noise. This stresses the need to provide for an independent estimate of error in GS projects. Also, it is desirable to explicitly account for genetic variance not captured by the markers (Calus and Veerkamp, 2007). This polygenic variance should be separated from residual error, which requires independent estimates of error for individual trials. It is quite common in breeding programs to perform unreplicated trials, as was the case in the example, where there was not sufficient information to estimate within-trial errors for all environments. Thus, genotype–environment interaction could not be separated from error in a mixed model. We could have fitted a marker-based model to the genotype–environment effect, but this would have entailed the risk of overfitting, because nongenetic correlations due to field trend could have been captured by the marker-based terms. For this reason a two-stage approach was employed, computing genotype means across environments based on an unconditional mixed model for genotype–environment interaction that did not exploit marker information. While the unconditional model is valid, as shown in this paper, using a conditional model of genotype–environment interaction for given marker information is expected to be more efficient. This is forthcoming only with sufficient replication in all trials, stressing the need for good individual trial design.

When markers are mapped, the ridge regression model can be extended to allow spatial correlation of regression coefficients pertaining to markers on the same chromosome (Gianola et al., 2003), using the same types of spatial model discussed here. Unfortunately, such models are currently not conveniently fitted using mixed model software such
as PROC MIXED. If the spatial model has only a single parameter, a profile likelihood approach may be used.

If one is prepared to work within a fully Bayesian framework, more options are available (Meuwissen et al., 2001; Gianola et al., 2003; Xu, 2003; Habier et al., 2007; Gianola and van Kaam, 2008). A general problem with Bayesian methods is coming up with a choice for the prior distribution. Meuwissen et al. (2001) use prior distributions for BayesA and BayesB that were derived from their simulation program. Thus, the prior distributions favorably matched the data generation mechanism, putting the Bayesian methods somewhat at an advantage that would be hard to realize in most plant breeding applications, where prior information may be much vaguer.

One can investigate the merits of different models by simulation (Meuwissen et al., 2001; Bernardo and Yu, 2007). The main difficulty is that a model needs to be chosen for simulating the data, and naturally, a model close to the one used for simulation is more likely to perform well in the analysis. Simulations must necessarily make a number of assumptions, the validity of which is hard to verify in practice. Thus, a more reliable assessment of performance is with real data from breeding programs. Ideally, parallel programs using different models for prediction would be compared based on the realized genetic gain. The less-than-perfect correlation among BLUPs by different GS methods found in the present study suggests that a thorough comparison of different methods in current breeding programs would be useful, preferably by cross-validation reflecting the breeder's selection decision process (Schrag et al., 2009). When devising a cross-validation scheme, it must be realized that a family structure as studied in the present paper induces genetic correlation. Optimality of cross-validation methods often rests on independence assumptions, and generalizing to the case of dependent data is not straightforward (Lahiri, 2003). Developing suitable cross-validation schemes for plant breeding programs therefore is an interesting topic for future research.
A very promising application of GS is for hybrid prediction. Bernardo (1993, 1994) proposed a BLUP approach for hybrid prediction, which is closely related to ridge regression. He suggested estimating the coefficient of coancestry (f_ii') for two maize inbred lines i and i' from the same heterotic pool X by a linear function of the simple matching coefficient s_ii', that is, by f_ii' = a_ii' s_ii' + b_ii', where a_ii' = (1 - c_ii')^{-1}, b_ii' = -c_ii'(1 - c_ii')^{-1}, c_ii' = 0.5(s_iY + s_i'Y), and s_iY is the average simple matching coefficient between inbred i and a sample of inbreds from the opposite heterotic pool Y. This estimate of the coefficient of coancestry is then used in the LV model σ²_u {f_ii'} to predict GCA effects within a mixed model. This approach is
seen to be quite similar to ridge regression for the GCA effects, but not equivalent, because the scale and shift parameters (a_ii' and b_ii') depend on the pair of genotypes, while with ridge regression a_ii' = a and b_ii' = b for all pairs (i, i') (see Eq. [16]).

Ridge regression and spatial models discussed in this paper can be used as alternative methods to model GCA and SCA effects in hybrid prediction. For example, a ridge regression model for prediction of hybrid performance in a complete factorial is

g = (Z_1 ⊗ 1_P2) u_1 + (1_P1 ⊗ Z_2) u_2 + Z_3 u_3    [27]

where g is the vector of genotypic values of G hybrids, P_1 and P_2 are the numbers of inbred parents in the two heterotic pools, Z_1 and Z_2 are the marker-based design matrices of parents in the two pools, 1_P is a P-dimensional vector of ones, and Z_3 = (Z_1 ⊗ 1_P2) • (1_P1 ⊗ Z_2), where • denotes the elementwise (Hadamard or Schur) product, u_1 = (u_11, ..., u_1M)' and u_2 = (u_21, ..., u_2M)' are vectors of the GCA effects at the markers of the two pools, and u_3 = (u_31, ..., u_3M)' is the corresponding vector of SCA effects. Coding of the design matrices for GCA and SCA effects has a standard two-way ANOVA form, as shown in Table 8. When the factorial is not complete, the corresponding lines need to be deleted in the design matrices.

Table 8. Coding of design matrices for general combining ability (GCA) and specific combining ability (SCA) effects in Eq. [27].

Parental marker genotype     Covariates in design matrices for one marker
Pool 1      Pool 2           GCA: Z_1    GCA: Z_2    SCA: Z_3
A_1         A_1              -1          -1          +1
A_1         A_2              -1          +1          -1
A_2         A_1              +1          -1          -1
A_2         A_2              +1          +1          +1
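As a quick illustration of this coding (hypothetical values, not data from the paper), the SCA covariate in Z_3 is simply the product of the two GCA covariates for the same marker: a hybrid whose pool-1 parent is A_1A_1 (covariate -1) and whose pool-2 parent is A_2A_2 (covariate +1) has SCA covariate (-1)(+1) = -1, in agreement with the second row of Table 8.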
Apart from variable selection, this model is essentially the factorial regression model proposed by Charcosset et al. (1998), who take regression coefficients as fixed. If instead we assume independent sampling from normal distributions according to u_rk ~ N(0, σ²_ur) (r = 1, 2, 3), we have a ridge regression equivalent of factorial regression. The resulting variance–covariance structure is

var(g|Z) = σ²_u1 Z_1Z_1' ⊗ J_P2 + σ²_u2 J_P1 ⊗ Z_2Z_2' + σ²_u3 Z_3Z_3'    [28]
An alternative variance–covariance model, more akin to Bernardo's (1993, 1994) approach, is

var(g|Z) = σ²_u1 Z_1Z_1' ⊗ J_P2 + σ²_u2 J_P1 ⊗ Z_2Z_2' + σ²_u3 Z_1Z_1' ⊗ Z_2Z_2'    [29]
The model is easily generalized as

var(g|Z) = σ²_u1 Γ_1 ⊗ J_P2 + σ²_u2 J_P1 ⊗ Γ_2 + σ²_u3 Γ_1 ⊗ Γ_2    [30]

where Γ_r (r = 1, 2) is chosen according to some spatial model in terms of Z_r. The term Γ_1 ⊗ Γ_2 is equivalent to a separable two-dimensional spatial process (Martin, 1979), the dimensions corresponding to genetic distance of hybrid parents in the two pools.
When Γ_1 and Γ_2 are computed from coefficients of coancestry of each hybrid's inbred parents in the two pools, we have Bernardo's (1993, 1994) approach. Alternatively, Γ_1 and Γ_2 can have any of the spatial structures proposed in the present paper, based on the genetic distance of parents in both heterotic pools, giving rise to a host of alternative methods. Note that when we apply ridge regression, Γ_r (r = 1, 2) may be any positive definite linear function Γ_r = a_r J_Pr + b_r S_r, where S_r is the matrix of simple matching coefficients of hybrid parents in the rth pool. This is because the variance for the SCA effects, (a_1 J_P1 + b_1 S_1) ⊗ (a_2 J_P2 + b_2 S_2), equals a_1 a_2 J_P1 ⊗ J_P2 + a_1 b_2 J_P1 ⊗ S_2 + a_2 b_1 S_1 ⊗ J_P2 + b_1 b_2 S_1 ⊗ S_2, where the first term on the right-hand side is confounded with the intercept and the second and third terms are confounded with the GCA effects. Finally, it should be stressed that the variance terms for GCA and SCA effects may be extended by polygenic terms to account for residual effects not captured by markers.
In the case of multi-allelic markers, or when haplotypes are used (Calus et al., 2008), there are different, essentially equivalent options for extending the model. The following discussion is restricted to additive effects, but the same principles apply to the coding of effects for dominance and epistasis (Xu and Jia, 2007). The starting point is to assume that each allele has an additive effect drawn from the same normal distribution. Let v_qk denote the additive effect of the qth allele (q = 1, ..., Q_k) of the kth marker and x_iqk the corresponding dummy variable counting the number of copies of the qth allele of the kth marker for the ith genotype. Let v_k = (v_1k, v_2k, ..., v_Q_k k)'. The contribution of the kth marker to the genotypic value is X_k v_k, where X_k = {x_iqk}. Assuming that entries in v_k are identically and independently normally distributed with zero mean and variance σ²_v, we have

var(X_k v_k) = σ²_v X_k X_k'    [31]
We might impose a sum-to-zero restriction, replacing v_k with w_k = (I_Qk - Q_k^{-1} J_Qk) v_k. It is found that

var(X_k w_k) = σ²_v X_k (I_Qk - Q_k^{-1} J_Qk) X_k' = σ²_v (X_k X_k' - 4 Q_k^{-1} J_G)    [32]

using that each row of X_k sums to 2, the number of allele copies per genotype at a marker. The second term, involving the matrix J_G, is confounded with the intercept and so can be dropped, showing that the sum-to-zero constraint is not needed.
In the case of two alleles and inbred lines, marker k may be represented by a single covariate z_k = X_k c, where z_k = (z_1k, z_2k, ..., z_Gk)' and c = (1/2, -1/2)', such that z_ik = 1 or z_ik = -1 for inbred lines, as in Eq. [1]. In this case
var(z_k u_k) = σ²_u X_k c c' X_k' = (1/4) σ²_u X_k (2 I_2 - J_2) X_k' = (1/2) σ²_u (X_k X_k' - 2 J_G)    [33]

Again, the term in J_G may be dropped, so ridge regression as per Eq. [1] is equivalent to the parameterization with v_k. In the biallelic case, the parameterization with a single column per marker in Z is most parsimonious, but this option is not available in the multi-allelic case.
This paper has focused on marker data for predicting genotypic values. Instead of markers, or in addition to markers, expression or metabolic profile data may be used for the same purpose. In this case, for ridge regression it is important to standardize the different expression products to justify the assumption of a common variance for the regression coefficients. Similar considerations apply for any of the spatial methods proposed in this paper. If different sources are used simultaneously (markers, expression data, metabolite data), it may be prudent to fit a separate covariance model for each component in the joint model.
APPENDIX
This appendix shows how to fit the models discussed in this paper using PROC MIXED of the SAS System (Littell et al., 2006). It is assumed that markers are coded z1 to zM, while genotypes are coded by gen. The relevant RANDOM statement for the genotypic effect under the different models is given.

Ridge Regression
The model may be fitted by

random z1-zM / subject=intercept type=toep(1);

By this code each marker generates a column in the design matrix for the random effects. When the number of markers is very large, solving the mixed model equations may become computationally quite demanding. In this case, it is useful to specify the model differently. Noting that var(g) = σ²_u ZZ' is linear in Γ = ZZ', it may be advantageous to compute Γ = ZZ' explicitly before running PROC MIXED and then specify a linear structure as follows:

random gen / subject=intercept type=lin(1) ldata=gamma;

Savings in storage space and computing time required to solve the mixed model equations may be considerable when M >> G. This code requires that Γ = ZZ' be stored in a SAS dataset "gamma" according to one of two possible formats (for details see the manual). One option for a hypothetical 3 × 4 Z matrix is as given in Fig. 1 (this assumes that a SAS dataset "w" contains variables z1 to zM).

Figure 1. SAS code generating the matrix Γ = ZZ' for fitting the ridge regression model using the LIN structure in PROC MIXED.
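The original Figure 1 (apparently a DATA step with a DO loop, judging from the LV instructions below) is not reproduced in this extraction. Purely as a hedged alternative sketch, and not the author's figure, the same "gamma" dataset in the dense LDATA format (variables PARM, ROW, COL1-COLn) could be built with PROC IML for the hypothetical 3 × 4 case; the dataset and variable names follow the text above.

/* Illustrative sketch: build gamma = Z*Z' in the LDATA format expected by
   TYPE=LIN(1), assuming dataset "w" holds the 3 x 4 marker matrix in z1-z4. */
proc iml;
   use w;
   read all var {"z1" "z2" "z3" "z4"} into Z;   /* Z is G x M */
   close w;
   Gamma = Z * Z`;                              /* G x G matrix Z Z' */
   G = nrow(Gamma);
   out = j(G, 1, 1) || T(1:G) || Gamma;         /* columns: PARM, ROW, COL1-COLG */
   create gamma from out[colname={"parm" "row" "col1" "col2" "col3"}];
   append from out;
quit;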
Spatial Models
The POW, EXP (equivalent to POW), GAU, and SPH models can be fitted by these RANDOM statements:

random gen / subject=intercept type=sp(pow)(z1-zM);
random gen / subject=intercept type=sp(exp)(z1-zM);
random gen / subject=intercept type=sp(gau)(z1-zM);
random gen / subject=intercept type=sp(sph)(z1-zM);

The spatial models may have convergence problems, so it is advisable to try a number of starting values for the spatial parameters using the PARMS statement. If the residual variance var(e_i) is fixed as described in this paper, and a polygenic effect v_i is fitted in addition to a marker-dependent effect g_i, a typical call of PROC MIXED is as shown in Fig. 2. The weighting variable w contains the inverse of var(e_i), that is, of the squared standard errors of adjusted means. For background on the method of fixing var(e_i), see Piepho (1999).

Figure 2. MIXED code to fit the power model with fixed var(e_i).
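The original Figure 2 is not reproduced in this extraction. The following is only a hedged sketch of such a call, assembled from the description above; the dataset name means, trait y, 136 markers, and weight variable w are assumptions, the PARMS starting values are arbitrary, and the residual variance is held at 1 so that the weighted residual variance equals var(e_i), as in Piepho (1999).

/* Sketch only: power model for g_i plus polygenic v_i, with var(e_i) fixed
   through WEIGHT w (w = 1/var(e_i)) and the residual variance held at 1. */
proc mixed data=means method=reml;
   class gen;
   model y = / solution;
   random gen / subject=intercept type=sp(pow)(z1-z136);  /* marker-based effect g_i */
   random gen;                                             /* polygenic effect v_i */
   weight w;
   parms (0.1) (0.5) (0.1) (1) / hold=4;                   /* last parameter: residual = 1 */
run;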
Also, at times the log-likelihood changes only marginally between iterations and yet the default convergence criterion is not met. In such instances it may be useful to slightly relax the convergence criterion relative to the default value. Additionally, rescaling Z such that σ²_u is of the same order of magnitude as σ²_v may be beneficial.

For LV we use the same code to generate a linear variance–covariance matrix as for the ridge regression model, except that a*b is replaced by (a-b)**2/&m/2 and the square root is taken of col[i] after the do loop. The relevant portion that needs to be replaced in Fig. 1 is given in Fig. 3.

Figure 3. Portion of SAS code that needs to be replaced for the corresponding part in Fig. 1 to generate a matrix for fitting the linear variance model using the LIN structure in PROC MIXED.

Then, the code in Fig. 4 may be used to fit the linear variance model, assuming the variance–covariance matrix has been stored in a SAS dataset "lv." The coding of entries in "lv" ensures that the resulting variance–covariance matrix will have only nonnegative entries (Piepho et al., 2008b).

Figure 4. MIXED code to fit the linear variance model with fixed var(e_i).
Mixed Models with Heterogeneous Variance
The heterogeneous variance ridge regression model RRhet is fitted by

random z1-zM;

Acknowledgments
KWS SAAT AG is thanked for providing the maize data. Jens Möhring and Bettina Müller are thanked for carefully reading an earlier version of this paper. I am also grateful for helpful comments by two anonymous referees.
References
Bauer, A.M., T.C. Reetz, and J. Léon. 2006. Estimation of breed-
ing values of inbred lines using best linear unbiased prediction
(BLUP) and genetic similarities. Crop Sci. 46:2685–2691.
Bernardo, R. 1993. Estimation of coe cient of coancestry
using molecular markers in maize. Theor. Appl. Genet.
85:1055–1062.
Bernardo, R. 1994. Prediction of maize single-cross performance
using RFLPs and information from related hybrids. Crop Sci.
34:20–25.
Bernardo, R., and J. Yu. 2007. Prospects for genomewide selection
for quantitative traits in maize. Crop Sci. 47:1082–1090.
Calus, M.P.L., T.H.E. Meuwissen, A.P.W. deRoos, and R.F.
Veerkamp. 2008. Accuracy of genomic selection using di er-
ent methods to de ne haplotypes. Genetics 178:553561.
Calus, M.P.L., and R.F. Veerkamp. 2007. Accuracy of breed-
ing values when using and ignoring the polygenic e ect in
genomic breeding value estimation with a marker density of
one SNP per cM. J. Anim. Breed. Genet. 124:362–368.
Charcosset, A., B. Bonnisseau, O. Touchebeuf, J. Burstin, P.
Dubreuil, Y. Barriere, A. Gallais, and J.B. Denis. 1998. Pre-
diction of maize hybrid silage performance using marker data:
Comparison of several models for speci c combining ability.
Crop Sci. 38:3844.
Cogdill, R.P., and P. Dardenne. 2004. Least-squares support vec-
tor machines for chemometrics: An introduction and evalua-
tion. J. Near Infrared Spetrosc. 12:93–100.
Draper, N.R., and H. Smith. 1998. Applied regression analysis.
3rd ed. John Wiley & Sons, New York.
Gianola, D., M. Perez-Enciso, and M.E. Toro. 2003. On marker-
assisted prediction of genetic value: Beyond the
ridge. Genetics 163:347365.
Gianola, D., and J.B.C.H.M. van Kaam. 2008. Repro-
ducing kernel Hilbert spaces regression methods
for genomic assisted prediction of quantitative
traits. Genetics 178:2305–2313.
Goddard, M.E., and B.J. Hayes. 2007. Genomic selec-
tion. J. Anim. Breed. Genet. 124:323–330.
Gower, J.C. 1966. Some distance properties of latent
roots and vector methods used in multivariate
analysis. Biometrika 53:325–338.
Habier, D., R.L. Fernando, and J.C.M. Dekkers. 2007. The impact
of genetic relationship information on genome-assisted breed-
ing values. Genetics 177:2389–2397.
Henderson, C.R. 1985. Best linear unbiased prediction of non-
additive genetic merits in non-inbred populations. J. Anim.
Sci. 60:111–117.
Hoerl, A.E., and R.W. Kennard. 1970. Ridge regression: Biased esti-
mation for nonorthogonal problems. Technometrics 12:5567.
Johnson, N.L., S. Kotz, and N. Balakrishnan. 1994. Continuous
univariate distributions. Vol. 1. 2nd ed. John Wiley & Sons,
New York.
Lahiri, S.N. 2003. Resampling methods for dependent data.
Springer, New York.
Littell, R.C., G.A. Milliken, W.W. Stroup, R. Wol nger, and O.
Schabenberger. 2006. SAS for mixed models. 2nd ed. SAS
Inst., Cary, NC.
Maenhout, S., B. de Baets, G. Haesaert, and E. van Bockstaele. 2007.
Support vector machine regression for the prediction of maize
hybrid performance. Theor. Appl. Genet. 115:1003–1013.
Maenhout, S., B. de Baets, G. Haesaert, and E. van Bockstaele.
2008. Marker-based screening of maize inbred lines using
support vector machine regression. Euphytica 161:123–131.
Martin, R.J. 1979. A subclass of lattice processes applied to a prob-
lem of planar sampling. Biometrika 66:209–217.
McQuarrie, A.D.R., and C.L. Tsai. 1998. Regression and time
series model selection. World Scienti c, Singapore.
Meuwissen, T.H.E., B.J. Hayes, and M.E. Goddard. 2001. Predic-
tion of total genetic value using genome-wide dense marker
maps. Genetics 157:1819–1829.
Figure 2. MIXED code to fit the power model with fixed var(e_i).
Figure 3. Portion of SAS code that needs to be replaced for the
corresponding part in Fig. 1 to generate a matrix for fitting the
linear variance model using the LIN structure in PROC MIXED.
Figure 4. MIXED code to fit the linear variance model with fixed var(e_i).
Miller, A. 2002. Subset selection in regression. Chapman and Hall,
London.
Piepho, H.P. 1997. Analyzing genotype–environment data by mixed
models with multiplicative effects. Biometrics 53:761–766.
Piepho, H.P. 1998. Empirical best linear unbiased prediction in
cultivar trials using factor analytic variance–covariance struc-
tures. Theor. Appl. Genet. 97:195–201.
Piepho, H.P. 1999. Stability analysis using the SAS system. Agron.
J. 91:154–160.
Piepho, H.P. 2000. A mixed model approach to mapping quantita-
tive trait loci in barley on the basis of multiple environment
data. Genetics 15:253–260.
Piepho, H.P., and H.G. Gauch. 2001. Marker pair selection for
QTL detection. Genetics 157:433–444.
Piepho, H.P., and C.E. McCulloch. 2004. Transformations in
mixed models: Application to risk analysis for a multienviron-
ment trial. J. Agric. Biol. Environ. Stat. 9:123–137.
Piepho, H.P., J. Möhring, A.E. Melchinger, and A. Büchse. 2008a.
BLUP for phenotypic selection in plant breeding and variety
testing. Euphytica 161:209–228.
Piepho, H.P., C. Richter, and E. Williams. 2008b. Nearest neigh-
bour adjustment and linear variance models in plant breeding
trials. Biometrical J. 50:164–189.
Pinheiro, J.C., and D.M. Bates. 1995. Approximations to the log-
likelihood function in the nonlinear mixed effects model. J.
Comput. Graph. Stat. 4:12–35.
Reif, J.C., A.E. Melchinger, and M. Frisch. 2005. Genetical and
mathematical properties of similarity and dissimilarity coef-
ficients applied in plant breeding and seed bank management.
Crop Sci. 45:1–7.
Ruppert, D., M.P. Wand, and R.J. Carroll. 2003. Semiparametric
regression. Cambridge Univ. Press, Cambridge, UK.
Schabenberger, O., and C.A. Gotway. 2005. Statistical methods
for spatial data analysis, CRC Press, Boca Raton, FL.
Schrag, T.A., J. Möhring, H.P. Maurer, B.S. Dhillon, A.E. Melch-
inger, H.P. Piepho, A.P. Sørensen, and M. Frisch. 2009.
Molecular marker-based prediction of hybrid performance in
maize using unbalanced data from multiple experiments with
factorial crosses. Theor. Appl. Genet. 118:741–751.
Searle, S.R., G. Casella, and C.E. McCulloch. 1992. Variance
components. John Wiley & Sons, New York.
Stich, B., J. Möhring, H.P. Piepho, M. Heckenberger, E.S. Buck-
ler, and A.E. Melchinger. 2008. Comparison of mixed-model
approaches for association mapping. Genetics 178:1745–1754.
Suykens, J.A.K., T.V. Gestel, J. de Brabanter, B. de Moor, and
J. Vandewalle. 2002. Least squares support vector machines.
World Scienti c, Singapore.
Whittaker, J.C., R. Thompson, and M.C. Denham. 2000.
Marker-assisted selection using ridge regression. Genet. Res.
75:249–252.
Williams, E.R. 1986. A neighbour model for field experiments.
Biometrika 73:279–287.
Xu, S. 2003. Estimating polygenic effects using markers of the
entire genome. Genetics 163:789–801.
Xu, S., and Z. Jia. 2007. Genomewide analysis of epistatic effects
for quantitative traits in barley. Genetics 175:1955–1963.
Yu, J.M., G. Pressoir, W.H. Briggs, I.V. Bi, M. Yamasaki, J.F. Doe-
bley, M.D. McMullen, B.S. Gaut, D.M. Nielsen, J.B. Holland,
S. Kresovich, and E.S. Buckler. 2006. A unified mixed-model
method for association mapping that accounts for multiple
levels of relatedness. Nat. Genet. 38:203–208.