In Silico Biology 2 (2002) 151–164
IOS Press
Electronic publication can be found in In Silico Biol. 2, 0013 <http://www.bioinfo.de/isb/2002/02/0013/>, 24 January 2002.
1386-6338/02/$8.00 © 2002 – IOS Press and Bioinformation Systems e.V. All rights reserved
The GP Problem: Quantifying Gene-to-Phenotype Relationships
Mark Cooper^1,2,*, Scott C. Chapman^3, Dean W. Podlich^1,2 and Graeme L. Hammer^1,4
^1 School of Land and Food Sciences, The University of Queensland, Brisbane, Queensland 4072, Australia
^2 Current Address: Pioneer Hi-Bred International Inc., 7300 N.W. 62nd Avenue, P.O. Box 1004, Johnston, Iowa 50131, USA
^3 CSIRO Plant Industry, 120 Meiers Road, Indooroopilly, Queensland 4068, Australia
^4 Agricultural and Production Systems Research Unit (APSRU), Queensland Department of Primary Industries, Tor Street, Toowoomba, Queensland, Australia
Edited by E. Wingender; received 26 September 2001; revised and accepted 21 December 2001; published
24 January 2002
ABSTRACT: In this paper we refer to the gene-to-phenotype modeling challenge as the GP problem. Integrating information across levels of organization within a genotype-environment system is a major challenge in computational biology. However, resolving the GP problem is a fundamental requirement if we are to understand and predict phenotypes given knowledge of the genome and model dynamic properties of biological systems. Organisms are consequences of this integration, and it is a major property of biological systems that underlies the responses we observe. We discuss the E(NK) model as a framework for investigation of the GP problem and the prediction of system properties at different levels of organization. We apply this quantitative framework to an investigation of the processes involved in genetic improvement of plants for agriculture. In our analysis, N genes determine the genetic variation for a set of traits that are responsible for plant adaptation to E environment-types within a target population of environments. The N genes can interact in epistatic NK gene-networks through the way that they influence plant growth and development processes within a dynamic crop growth model. We use a sorghum crop growth model, available within the APSIM agricultural production systems simulation model, to integrate the gene-environment interactions that occur during growth and development and to predict genotype-to-phenotype relationships for a given E(NK) model. Directional selection is then applied to the population of genotypes, based on their predicted phenotypes, to simulate the dynamic aspects of genetic improvement by a plant-breeding program. The outcomes of the simulated breeding are evaluated across cycles of selection in terms of the changes in allele frequencies for the N genes and the genotypic and phenotypic values of the populations of genotypes.
Links:
http://pig.ag.uq.edu.au/qu-gene/
http://www.apsru.gov.au/Products/apsim.htm
____________________________
*Corresponding author: Email: mark.cooper@pioneer.com.
M. Cooper et al. / The GP Problem
KEYWORDS: E(NK) model, epistasis, genotype-by-environment interactions, plant, crop, target population of environments, genetic space
INTRODUCTION
Today, a major research focus in the field of genetics and computational biology is developing methods to predict properties of organisms and populations of organisms at the phenotypic level from knowledge of the structure, function and diversity of genomes. We refer to the problem of determining gene-to-phenotype relationships as the GP problem. Formulating a solution to this problem for a defined natural system requires integration of information across many levels of organization within biophysical systems. An iterative modeling approach combined with strategic experimentation provides a powerful framework for tackling the GP problem. The objective of this paper is to define what we mean by an iterative modeling approach. We commence by describing a general approach to modeling natural systems and then illustrate its application in the modeling of plant breeding programs in an agricultural context.
It has often been stated that a model is a simplification of the natural system under investigation and that the level of simplification must be balanced against the complexity of the properties of the system to be studied. Therefore, what is an appropriate modeling approach to tackle the GP problem? In describing modeling strategies in general, Rosen [1985] and Casti [1989, 1997] distinguished between the "Natural System" that we are attempting to understand and the "Formal System" that is a mathematical construction of how we understand the properties of the natural system (Fig. 1). This is a useful starting point for thinking about how we might model a genotype-environment system.
Fig. 1. Concept map for modeling biophysical systems. The "Natural System" is the biophysical structure that is under investigation and the "Formal System" is the model-based investigative strategy that is in use to construct "Knowledge Structures" that represent properties of the "Natural System".
Approaches to constructing a formal system that captures the important properties of a natural system can take many forms. Here we are interested in a mathematical framework that allows us to represent the key components of the natural system and define the key relationships between these components. We intend that the mathematical relationships that we construct within the formal system will ultimately be representative of the causal relationships that are properties of the natural system, so that we can investigate the implications of these causal relationships within the formal system. More detailed formalizations might be appropriate if the objective is to investigate relationships at lower levels within the system, rather than to understand the relationships among its components. An example of this is in the modeling of processes related to the productivity of agricultural plants (Fig. 2). The genetics of different species and varieties within species determine how plants interact with the soil and aerial (radiation, temperature, rainfall) environments to develop structures to 'capture' radiation, CO₂ and water. While the enzymes and biochemistry of the primary processes involved in photosynthesis are extensively studied, it is difficult to integrate the results of these processes (assimilation of CO₂ into biomass) over various time scales, and to model the effect of this biomass as it is diverted to new tissues and organs to either store biomass or capture more resources. It has been considered that models can only simulate one or two levels of scale away from the level of their primary function. Further, at molecular and atomic scales, the requirement for information input to the system rapidly increases. By studying and experimenting within the natural system we attempt to gain knowledge about the biophysical structures and their causal relationships at appropriate scales. Some of these causal relationships may be functional, others pre-functional and in many cases non-functional but consequential of the ways in which the biophysical structures interact [Kauffman, 2000].
Fig. 2. Hierarchy and scale in modeling processes within plant systems.
In many cases the results of experiments in biology are summarized descriptively. Alternatively, we can attempt to encode, within a formal mathematical framework, our understanding of the results of the experiments (Fig. 1). The collection of these formal mathematical structures that we create is a model of the system. In many cases we may commence an experiment with a prior model or hypothesis and use the results of our experimental program to update and improve our model of the natural system [e.g. Ideker et al., 2001]. As we refine and improve our model through iterative cycles of experimentation and modeling we will be able to study properties of the natural system within the properties of the formal system. This will give us a basis for determining the level of confidence we have in decoding the structures observed within the model and making predictions about the properties that we expect to see within the natural system. Additionally, model building through iteration will enable us to acquire and interpret data structures from experimental programs as a foundation for constructing knowledge structures and queries that apply to the properties of the natural system. As we improve the quality of our model we will increasingly improve our power to predict properties of the natural system across its levels of organization.
INTEGRATION BY ASSUMPTION OR BY FOLDING OUT THE DETAIL?
If we attempt to construct an integrated model of a natural system without adequate attention to the ways that the components of the system interact across levels of organization (Fig. 2), then either we confine ourselves to working within a single level of organization or we construct a model that has limited power to provide insight into many of the properties of the natural system. In the absence of experimental evidence, attempting to integrate across levels of organization by assuming that interactions are unlikely to be important will leave the resulting model liable to deviate from the natural system whenever these interactions become important.
In classical quantitative genetics many of the complicating interactions that can impact on gene-to-phenotype relationships have been assumed to be unimportant, based on the expectation that their effects are small, and/or that their estimation is impractical. Two properties of genotype-environment systems that are often ignored are those of gene-to-gene interactions (epistasis) and gene-by-environment interactions [e.g. Clark, 2000]. For example, in defining the value of a genotype for a quantitative trait that is determined by multiple genes, the assumption that epistasis is zero implies that the effects of the alleles for the segregating genes are independent of the effects of the alleles at the other genes. In this case, for each gene, additive and dominance intra-gene effects can be defined in terms of contrasts between the homozygous and heterozygous genotypes. Hence, the value of the multi-gene genotype for an individual is then simply determined as the cumulative effects of the genes by summing the allele effects across the segregating genes. Similarly, gene-by-environment (GE) interactions have been assumed to be unimportant or a source of error that can be summed to zero by evaluating genotypes in adequately large samples of experimental environments representing the target population of environments.
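The zero-epistasis assumption described above can be illustrated with a minimal sketch. The coding of a genotype as the number of + alleles per gene (0, 1 or 2), the effect sizes, and the function name `additive_value` are all hypothetical choices made for illustration; they are not taken from the paper or from QU-GENE.

```python
# Minimal sketch of a purely additive-dominance genotype value (no epistasis).
# Assumption: each gene is coded by its number of '+' alleles (0, 1 or 2).
# The effect sizes below are hypothetical illustration values.

def additive_value(genotype, a, d):
    """Sum single-gene effects: -a (0 copies), d (1 copy), +a (2 copies)."""
    total = 0.0
    for copies, a_i, d_i in zip(genotype, a, d):
        if copies == 1:
            total += d_i                 # heterozygote: dominance deviation
        else:
            total += a_i * (copies - 1)  # -a_i for 0 copies, +a_i for 2 copies
    return total

a = [1.0, 2.0, 3.0]   # hypothetical additive effects
d = [0.5, 0.5, 0.5]   # hypothetical dominance effects

# Without epistasis, substituting alleles at gene 0 has the same effect
# regardless of the genotype at the other genes.
effect_low_background = additive_value((2, 0, 0), a, d) - additive_value((0, 0, 0), a, d)
effect_high_background = additive_value((2, 2, 2), a, d) - additive_value((0, 2, 2), a, d)
```

Under this model the two substitution effects are identical, which is exactly the independence that epistasis would break.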
Where experimental evidence demonstrates that the interactions are important it is necessary to directly evaluate their implications within the formal system. Analyses of the genetic architecture of quantitative traits in model systems indicate important sources of genetic variation attributed to epistasis and GE interactions [Mackay, 2001]. The same can be expected of economically important traits in agricultural plant species. Therefore, in tackling the GP problem for quantitative traits we seek a modeling framework that enables investigation of the impact of gene-to-gene and gene-by-environment interactions.
MODELING A GENOTYPE-ENVIRONMENT SYSTEM
To progress from a general discussion of strategies for modeling natural systems to the specifics required to model genotype-environment systems it is necessary to define both the key properties and relationships that are important in the target natural system and the methods that are to be used in constructing the formal system. Figure 3 is a concept map, based on the modeling framework described in Figure 1, which focuses on the GP problem for a genotype-environment system. Our objective is to establish a formal representation of a genotype-environment system to enable modeling gene-to-phenotype relationships as a basis for evaluating the efficiency of plant breeding strategies [Cooper et al., 1999]. Therefore, here we emphasize the quantification of allelic variation at N genes and their potential interactions within NK gene networks [Kauffman, 1993] and with E environmental conditions [Podlich and Cooper, 1998] in determining the gene-to-phenotype relationships for the traits to be improved by plant breeding.
The scope for modeling plant and animal breeding strategies has been a long-term focus of applied quantitative genetics [e.g. Falconer and Mackay, 1996; Comstock, 1996]. The use of computer simulation approaches has increased as hardware and software capability and flexibility have improved. Adopting a simulation approach to study gene-to-phenotype relationships provides greater flexibility for investigating the influences of epistasis and GE interactions than is possible within the classical statistical modeling approach [Kempthorne, 1988; Podlich and Cooper, 1998]. Kauffman [1993] gave a comprehensive discussion of the NK model and its suitability for investigating the impact of epistasis in evolutionary processes. Podlich and Cooper [1998] defined the E(NK) model as an extension of Kauffman's NK model in order to accommodate the effects of gene-by-environment interactions. In the E(NK) model gene-by-environment interactions are possible where different forms of NK gene network models can be expressed in the different environmental conditions that are possible within a target population of environments.
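To make the structure of an E(NK) model concrete, the following is a minimal sketch under simplifying assumptions that are ours rather than the paper's: genotypes are haploid binary strings (as in Kauffman's original NK formulation), each gene's contribution is drawn from a uniform distribution, and a separate, independently sampled NK table is used for each of the E environment-types. All function names are hypothetical.

```python
import random
from itertools import product

def make_nk_model(n, k, rng):
    """One NK model: each gene interacts with K other genes (assumed random)."""
    neighbours = [rng.sample([j for j in range(n) if j != i], k) for i in range(n)]
    # One random contribution per local allele combination of (gene, K neighbours).
    tables = [{state: rng.random() for state in product((0, 1), repeat=k + 1)}
              for _ in range(n)]
    return neighbours, tables

def nk_value(genotype, model):
    """Genotype value = mean of the N gene contributions in their genetic context."""
    neighbours, tables = model
    total = 0.0
    for i, allele in enumerate(genotype):
        state = (allele,) + tuple(genotype[j] for j in neighbours[i])
        total += tables[i][state]
    return total / len(genotype)

def make_enk_model(e, n, k, seed=0):
    """E(NK) sketch: one independently sampled NK model per environment-type."""
    rng = random.Random(seed)
    return [make_nk_model(n, k, rng) for _ in range(e)]

def enk_values(genotype, enk_model):
    """Value of one genotype in each of the E environment-types."""
    return [nk_value(genotype, model) for model in enk_model]

enk = make_enk_model(e=3, n=5, k=2)
values = enk_values((0, 1, 0, 1, 1), enk)
```

Because each environment-type has its own gene-network table, the same genotype can rank differently across environment-types, which is how GE interactions arise in this sketch. With K=0 each gene contributes independently and the model reduces to an additive one within each environment-type.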
Fig. 3. Concept map for modeling the key components of a genotype-environment system and the relationships to
the components of the E(NK) model and the investigative strategies applied to quantify the value of alleles of genes
within the genotype-environment system [Adapted from Cooper et al., 1999].
The relationships between the components of the E(NK) model and the biophysical components of a genotype-environment system are indicated in Figure 3. Some of the investigation strategies that can be used to provide the information necessary to build formal models of gene-to-phenotype relationships and quantify the value of allelic variation in terms of the components of the E(NK) framework are indicated. Key activities that are emphasized include: (i) environmental characterization as a basis for defining the target population of environments and causes of GE interactions, (ii) genetic analysis to study genetic variation for biochemical pathways, physiological processes and adaptive traits, (iii) genetic (recombination) and physical mapping of genes, (iv) functional genomics to study the regulation and expression of genes, and (v) crop growth models that define the relationships between genetic variation for traits, plant growth and development processes and variation in environmental resources within a target population of environments [e.g. Bidinger et al., 1996].
SORGHUM BREEDING EXAMPLE: PROBLEM AND MODEL DEFINITION
To examine the effectiveness of a breeding strategy we need to define two properties of a genotype-environment system: (1) the target population of environments, and (2) the target genotype for the gene-to-phenotype model. Within the target geographical area in which a breeding program operates, new genotypes are developed over sequences of cycles of intermating parents, evaluation and selection of progeny to identify new genotypes that have high and stable yield performance across a wide range of environmental conditions. The occurrence of environmental conditions within the geographical area has both spatial and temporal dimensions and the different conditions can occur with different frequencies in both dimensions. This results in a complex mixture of different environmental conditions that is referred to here as the target population of environments. In the presence of GE interactions, understanding the environmental factors that influence genotype performance and cause these interactions is an important step in designing an effective testing strategy for measurement of trait phenotypes as part of a breeding program. The target genotype is then defined as the genotype that results in the best trait performance across the target population of environments for the specified gene-to-phenotype model. For complex genotype-environment systems there can be multiple genotype targets. As E(NK) models become more complex, with increasing levels of E, N and K, it becomes increasingly difficult to compute and identify a single target genotype. In these situations, where it is not possible to create and evaluate all potential genotypes for a gene-to-phenotype model, alternative evaluation strategies are used. In the example we consider here the genotype-environment system is of a size that definition of a single target genotype is possible.
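One way to read the definition of the target genotype is as the genotype with the highest frequency-weighted mean performance across environment-types. The sketch below makes that reading explicit; the weighting scheme, the yield values, the genotype labels and the function names are hypothetical illustrations, not values or procedures from the study.

```python
# Sketch: target genotype as the genotype with the best frequency-weighted
# mean yield across environment-types. All numbers are hypothetical.

def tpe_mean(yields_by_env_type, frequencies):
    """Mean yield over the target population of environments (TPE)."""
    return sum(f * y for f, y in zip(frequencies, yields_by_env_type))

def target_genotype(yield_table, frequencies):
    """Return the genotype label with the highest TPE-weighted mean yield."""
    return max(yield_table, key=lambda g: tpe_mean(yield_table[g], frequencies))

# Hypothetical yields (t/ha) in two environment-types: (mild, severe).
yield_table = {
    "g1": (6.0, 2.0),
    "g2": (5.0, 3.5),
    "g3": (4.0, 3.0),
}

best_equal = target_genotype(yield_table, (0.5, 0.5))      # equal frequencies
best_mostly_mild = target_genotype(yield_table, (0.9, 0.1))  # mild stress dominates
```

In this toy table the target genotype changes with the frequencies of the environment-types, which is why characterizing the target population of environments comes before defining the target genotype.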
In this example we discuss some key results from a larger long-term study. This larger study is investigating the requirements (Fig. 3) for model development and simulation of sorghum (Sorghum bicolor (L.) Moench) adaptation and grain yield for the heterogeneous dryland agricultural system in northeastern Australia [Chapman et al., 2000a,b,c, 2002a,b].
First we provide some background and context to the complexity of this genotype-environment system. Sorghum is the major summer crop grown in the northeastern cropping region of Australia. Grain yield is the major economic product and is used mainly as animal feed. Sorghum grain yield is a complex quantitative trait and is the result of interactions and integration of many component traits that can themselves interact with variation in environmental conditions (rainfall, temperature and solar radiation) during a crop growth and developmental cycle of around 100 days. The major environmental variable that has a dominant influence on grain yield variation is water availability to the crop. Variation in water availability is a consequence of complex spatial and temporal variation in rainfall prior to and during the growth of the crop and also the spatial variation in the water holding capacity of the soil types across the geographical area. We have found that the environmental variation in incidence of drought can explain a significant component of the GE interactions for grain yield [Chapman et al., 2000a,b,c]. Research into the genetic and physiological bases of drought tolerance of sorghum has identified and examined the importance of the following four traits: (1) phenology, in particular the timing of flowering (PH) [Hammer et al., 1989], (2) stay-green (SG) [Borrell and Hammer, 2000], (3) transpiration efficiency (TE) [Hammer et al., 1997; Mortlock and Hammer, 1999], and (4) osmotic adjustment (OA) [Hammer et al., 1999]. In parallel research, genetic analysis and the construction of a molecular marker map for grain sorghum [Tao et al., 1998, 2000] has enabled trait dissection. This body of work provides working hypotheses of the number of genes or Quantitative Trait Loci (QTL) that may contribute to the genetic variation for these four traits [Chapman et al., 2002a,b].
With access to this experimental database we have used a simulation approach to investigate the efficiencies of plant breeding strategies used for genetic improvement of grain yield of sorghum under the dryland conditions in Australia. This required us to develop an interface between a genetic modeling platform (QU-GENE) [Podlich and Cooper, 1998; http://pig.ag.uq.edu.au/qu-gene/] and a cropping system model (APSIM) [McCown et al., 1996; http://www.apsru.gov.au/products/apsim.htm], which has a module for sorghum [Hammer and Muchow, 1994; Hammer et al., 2001]. This interface was constructed in a way that used information generated from our ability to characterize environments for their occurrence of drought, our understanding of the spatial and temporal distributions of drought in the target population of environments, and the data available from genetic and physiological analyses of traits considered to contribute to drought tolerance (Fig. 3). This provides a model architecture that links the alleles of genes and the plant growth and development processes that respond to variation in the environmental conditions to determine grain yield (Fig. 4). Thus, by developing an interface between the QU-GENE genetic model and the APSIM-Sorg model for sorghum there is a relationship between genes and phenotypes that enables investigation of the GP problem within a genotype-environment system context. These gene-to-phenotype relationships can be used to assess the value of genes in terms of an E(NK) model for grain yield in a target population of environments. Further, as additional experimental information becomes available it is possible to continually update the genetic and physiological models for the genotype-environment system, our assessment of the allelic variation we have identified, and any impact that this may have on the efficiency of the breeding strategies we are using for genetic improvement of sorghum.
Fig. 4. Schematic of the modular structures and linkages between QU-GENE and APSIM. In this example S1 recurrent selection was used as the breeding strategy to improve grain yield of the sorghum population of genotypes. Other plant breeding strategies are indicated (e.g. pedigree selection). Genotypes are categorized into expression-states in QU-GENE and these expression-states map to trait values modeled in APSIM-Sorg for different combinations of soil and weather data. Output from APSIM is processed to define both the yield of all possible genotypes (expression-state combinations) and the frequency of drought environment types (ETs) encountered in the target population of environments (TPE).
The E(NK) model can be parameterized in a number of ways, including: (1) Constructing Boolean gene networks and sampling genotype values for the components of the networks from underlying distributions of gene effects, a procedure pioneered by Kauffman [1993]; (2) Defining inheritance models using empirical estimates for classical quantitative genetic parameters [Podlich and Cooper, 1998]; and (3) Specifying gene networks to represent the properties of biochemical pathways. For the sorghum genotype-environment system in our example the resulting E(NK) model is a consequence of the number of genes specified to control variation for traits, the number of environment-types identified for the target population of environments and the physiological relationships that determine crop growth and development with the APSIM-Sorg sorghum model. This is a novel approach for determining the parameters for an E(NK) model and it is made feasible by developing the interface between QU-GENE and APSIM (Fig. 4).
Here we consider an E(NK) model where the number of environment-types E=3 and the total number of genes N=15. Each of the 15 genes has two alleles segregating within a base population of genotypes. The level of epistasis for grain yield, as defined by the K parameter, is not explicitly defined here and is an emergent property of the extent of trait interconnectedness within the APSIM-Sorg crop growth model.
The three environment-types represent different levels of severity of drought: (1) mild terminal stress, (2) moderate terminal stress, and (3) severe terminal stress. These drought environment-types, together with their frequencies of occurrence in the target population of environments, were determined from an analysis of the timing and severity of water deficits during crop growth and development by running the APSIM-Sorg model for a standard genotype with approximately 100 years of weather data across a number of locations in northeastern Australia. The locations represented different soil types from the target geographical area. The APSIM-Sorg simulations were then summarized by cluster analysis to identify the three key drought environment-types (Fig. 4) [Chapman et al., 2000b,c]. While there are three environment-types in the target population of environments, to be concise we will mostly concentrate on only two of these in this paper: (1) the mild terminal stress environment-type, and (2) the severe terminal stress environment-type. The 15 genes determine the genetic variation for grain yield in the environment-types by specifying the extent of genetic variation for the four traits PH (3 genes), SG (5 genes), TE (5 genes) and OA (2 genes). Thus, the genetic variation for grain yield is an emergent property of the variation for the physiologically defined growth and development processes in the APSIM-Sorg model impacted by the four traits. The process we have used here to specify the genetic variation for grain yield differs from the classical quantitative genetics approach where effects of "yield-genes" are specified in ways that are unrelated to or unconstrained by the biophysical properties of plant growth and development processes. The resultant genetic variation for grain yield in the base population of genotypes is then subjected to a series of recurrent cycles of directional selection for increased levels of grain yield. The breeding strategy we evaluate in this example is S1 recurrent selection [Hallauer and Miranda, 1988] and selection is based on the yield phenotypes of genotypes when they are evaluated in samples of environments taken from the target population of environments.
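The general idea of recurrent cycles of directional selection can be sketched with a deliberately simplified scheme. This is not S1 recurrent selection and not the QU-GENE implementation: it assumes simple truncation selection on a purely additive value with no environmental noise, unlinked genes, and random mating among the selected parents. Population size, selected fraction and all function names are hypothetical.

```python
import random

def allele_frequency(population):
    """Frequency of the '+' allele (coded 1) over all genes and individuals."""
    n_alleles = 2 * len(population) * len(population[0])
    return sum(a + b for individual in population for a, b in individual) / n_alleles

def genotype_value(individual):
    """Assumed additive value: count of '+' alleles (no epistasis, no GE)."""
    return sum(a + b for a, b in individual)

def make_offspring(parent1, parent2, rng):
    """Each parent passes one randomly chosen allele per (unlinked) gene."""
    return [(rng.choice(g1), rng.choice(g2)) for g1, g2 in zip(parent1, parent2)]

def selection_cycle(population, fraction, rng):
    """Truncation selection: keep the top fraction, then mate them at random."""
    ranked = sorted(population, key=genotype_value, reverse=True)
    parents = ranked[:max(2, int(len(population) * fraction))]
    return [make_offspring(rng.choice(parents), rng.choice(parents), rng)
            for _ in population]

def simulate(n_genes=5, size=200, cycles=10, fraction=0.2, seed=1):
    """Return the '+' allele frequency before selection and after each cycle."""
    rng = random.Random(seed)
    population = [[(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(n_genes)]
                  for _ in range(size)]
    frequencies = [allele_frequency(population)]
    for _ in range(cycles):
        population = selection_cycle(population, fraction, rng)
        frequencies.append(allele_frequency(population))
    return frequencies

frequencies = simulate()
```

In the study itself, the additive value used here would be replaced by yield phenotypes obtained from the crop growth model in sampled environments, so allele frequency trajectories need not be as simple as in this sketch.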
The genetic changes in the population of genotypes in response to the selection imposed by the breeding strategy are examined in terms of: (1) the changes in frequencies of the alternative alleles for the 15 genes (referred to as changes in gene frequencies) on a trait basis, and (2) the changes in grain yield performance of the genotypes created and selected during the course of the simulation experiment. We examine these changes due to selection at both genetic and phenotypic levels by constructing response surfaces that relate genetic distances between genotypes to the phenotypic values for the four traits PH, SG, TE, OA and also grain yield. Genetic distances are calculated as Hamming Distances, which give a measure of the number of alleles that differ between any pair of genotypes.
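A minimal sketch of such a distance is given below. It assumes, for illustration only, that each diploid genotype is coded as the number of + alleles at each gene (0, 1 or 2), so that the number of differing alleles is the sum of absolute differences in + allele counts; the paper does not specify its internal coding, and the function name is hypothetical.

```python
def allele_distance(genotype1, genotype2):
    """Number of alleles that differ between two genotypes.

    Assumption: genotypes are sequences of '+' allele counts (0, 1 or 2)
    per gene, listed in the same gene order.
    """
    if len(genotype1) != len(genotype2):
        raise ValueError("genotypes must cover the same genes")
    return sum(abs(x - y) for x, y in zip(genotype1, genotype2))

# Two-gene example: AABB versus aabb differ at all four alleles,
# Aabb versus AaBB differ at the two alleles of gene B.
distance_extremes = allele_distance((2, 2), (0, 0))
distance_partial = allele_distance((1, 0), (1, 2))
```

For a trait controlled by N genes the distance from the target genotype then ranges from 0 to 2N.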
For 15 genes, each segregating for two alleles, there are $3^{15}$ = 14,348,907 possible genotypes from all combinations of alleles. The frequency of occurrence of these genotypes in the reference population is dependent on the gene frequencies for the 15 genes. Running the APSIM-Sorg crop growth model 14,348,907 times for each environmental condition was not feasible. Therefore, in this example we reduced the number of simulations necessary by allocating genotypes to classes based on defining "expression states" for each trait. An expression state was defined for a trait by the total number of + or - alleles summed across the genes influencing the trait, where the + allele increased trait value and the - allele decreased trait value. Adopting this approach, for N genes determining genetic variation for a trait, with two alleles per gene, there are $2N+1$ expression states for the trait. For example, for the trait OA with N=2, individuals can have 0, 1, 2, 3 or 4 + alleles, representing the 5 states of expression for OA. There are different numbers of genotypes in each of the expression state classes. If we label the two genes A (A,a) and B (B,b) such that the alternative alleles are A(+), a(-) and B(+), b(-) then the genotype membership of the expression state classes is: 0 = aabb; 1 = Aabb, aaBb; 2 = AAbb, AaBb, aaBB; 3 = AABb, AaBB; 4 = AABB. We then divided the range of phenotypic values for the traits into equal increments on a linear scale, with genotype aabb defined as the lowest expression state and AABB the highest expression state for OA. The same process was applied to the other three traits. Following this procedure, we have 5 expression states for OA, 7 expression states for PH, and 11 expression states for both SG and TE. With the four traits we have 5×7×11×11 = 4,235 combinations of expression states. Thus, the genotype space of 14,348,907 genotypes is condensed and mapped onto an expression state space of 4,235 classes. Running 4,235 APSIM-Sorg simulations for the 600 environments used to represent the target population of environments was manageable with our computer cluster resources [Micallef et al., 2001; http://pig.ag.uq.edu.au/qu-gene]. The deterministic relationship between genotypes and trait expression states used in this example is only one of many ways in which a gene-to-phenotype relationship can be constructed within our modeling framework (Figs. 3 and 4).
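The counting argument above can be checked with a short sketch. The gene numbers per trait are those stated in the text; the coding of genotypes as + allele counts, the trait range used in `trait_value`, and all function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import prod

# Genes per trait as stated in the text.
GENES_PER_TRAIT = {"PH": 3, "SG": 5, "TE": 5, "OA": 2}

def count_genotypes(n_genes):
    """Three genotypes per biallelic gene (e.g. AA, Aa, aa)."""
    return 3 ** n_genes

def count_states(n_genes):
    """Expression states = possible totals of '+' alleles: 0 .. 2N."""
    return 2 * n_genes + 1

def expression_state(plus_counts):
    """Expression state of a trait: total '+' alleles across its genes."""
    return sum(plus_counts)

def trait_value(state, n_genes, low, high):
    """Equal increments on a linear scale between assumed low and high values."""
    return low + (high - low) * state / (2 * n_genes)

total_genes = sum(GENES_PER_TRAIT.values())
n_genotypes = count_genotypes(total_genes)
n_classes = prod(count_states(n) for n in GENES_PER_TRAIT.values())

# Genotypes per expression state for OA (two genes, each with 0, 1 or 2 '+' alleles).
oa_membership = Counter(expression_state(g) for g in product((0, 1, 2), repeat=2))
```

The `oa_membership` counts of 1, 2, 3, 2 and 1 genotypes for states 0 to 4 match the class membership listed in the text for genes A and B.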
SORGHUM BREEDING EXAMPLE: RESULTS
For the three environment-types the APSIM-Sorg model was used to estimate a grain yield value for each of the 4,235 trait expression states, referred to hereafter as genotype classes. These estimates were averages from ca. 200 runs of the model, using as inputs daily weather data and soils data from location-year combinations chosen to represent the target population of environments. Some appreciation of the genetic variation for yield that exists among the genotype classes for each of the four traits in the mild terminal stress and severe terminal stress environment-types is given in Figure 5. For both environment-types a series of grain yield frequency distributions is shown for each trait. The genotypic classes are ordered on their genetic distance (measured as a Hamming distance) from the allele combination of the target genotype in the target population of environments. As expected, lower grain yields are achieved under severe terminal stress (colored red) than in the mild terminal stress (colored blue) environment-type. For any genotype class for the four traits there is considerable genetic variation for grain yield, which results from genotypic variation for the other three traits.
Fig. 5. Grain yield distribution of the genotype classes for the Mild Terminal Stress (colored blue) and Severe Terminal Stress (colored red) environment-types, for representations where the genotype classes are distributed according to their genetic distance from the target genotype (based on grain yield) for each of the four traits: (a) Transpiration Efficiency, (b) Osmotic Adjustment, (c) Phenology and (d) Stay-green. The vertical axis indicates the percentage of the 4,235 genotype classes present at each yield/Hamming distance combination. The horizontal left axis indicates the level of grain yield (t/ha). The horizontal right axis indicates the number of alleles different from the target genotype in the target population of environments (referred to as Hamming distance).
To evaluate the consequences of the effects of GE interactions between the mild terminal stress and severe terminal stress environment-types at the level of grain yield we need to examine the relationship between grain yield performance in both environment-types. To do this we construct a scatter plot of the yield values in both environment-types (Fig. 6). If there were no GE interactions there would be a perfect correlation of the grain yield values between the two environment-types. From the shape of the distribution of the yield values it can be seen that there are GE interactions and that the genotypes with highest grain yield differ between the two environment-types.
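The reasoning about correlation between environment-types can be sketched numerically. The yields below are hypothetical illustration values for four genotype classes, not results from the study, and the helper names are ours.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def best_index(values):
    """Index of the highest-yielding genotype class."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical yields (t/ha) for the same four genotype classes.
mild = [6.0, 5.5, 5.0, 4.0]
severe = [2.0, 3.0, 3.5, 2.5]

r_between_env_types = pearson(mild, severe)
rank_change = best_index(mild) != best_index(severe)

# Without GE interaction, yields in one environment-type would be a
# linear function of yields in the other, giving a correlation of 1.
no_ge_severe = [0.5 * y - 0.5 for y in mild]
r_no_ge = pearson(mild, no_ge_severe)
```

A correlation below 1 together with a change in the best genotype class is the pattern the text describes for Figure 6; a correlation below 1 without any rank change would indicate a weaker, non-crossover form of interaction.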
Fig. 6. Grain yield values (t/ha) for the 4,235 genotype classes in the Mild Terminal Stress and Severe Terminal Stress environment-types for color coded representations of each of the four traits: (a) Transpiration Efficiency, (b) Osmotic Adjustment, (c) Phenology and (d) Stay-green. Genotype classes are color coded according to their genetic distance from the target genotype in the target population of environments (Hamming distance), extending from yellow (all alleles different from the target genotype) to blue (no alleles different from the target genotype).
In Figure 6, each of the 4235 genotype classes is color coded by trait, extending from light (yellow) to
dark (blue), to depict for each trait the genetic distance between the genotype class and the target
genotype. As the colors get darker the genotypes in the classes have more alleles in common (giving a lower
Hamming distance) with the target genotype. For both TE (Fig. 6a) and OA (Fig. 6b), genotypes with
high yield in the severe terminal and mild terminal stress environment-types generally have a large pro-
portion of genes in common with genotypes that yield well in the target population of environments. The
situation is different for PH (Fig. 6c). For the PH trait, genotypes that have a high yield in the mild termi-
nal stress environment-type have many genes in common with the target genotype, whereas genotypes
that have high yield in the severe terminal stress environment-type are genetically distant from the target
genotype. Thus, we have strong GE interactions for grain yield that can impact on selection outcomes for
the PH trait and yield in the different environment-types and in the target population of environments. For
SG (Fig. 6d) there is a strong association between high yield in the mild terminal stress environment-type
and having genes in common with the target genotype. However, this relationship is much weaker in the
severe terminal stress environment-type, in part because the other traits have a stronger influence on yield
in this environment-type.
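The Hamming distance used for the color coding can be sketched as follows. This is a simplified illustration only: it assumes each genotype is written as a string of `+` and `-` alleles at six loci for one trait, and the names `hamming_distance`, `target` and `candidates` are hypothetical rather than taken from the software used in this study.

```python
def hamming_distance(genotype, target):
    """Number of loci at which `genotype` carries a different allele from `target`."""
    if len(genotype) != len(target):
        raise ValueError("genotypes must have the same number of loci")
    return sum(a != b for a, b in zip(genotype, target))

# Hypothetical allele strings for one trait; illustrative only.
target = "++++++"
candidates = {"g1": "++++++", "g2": "++-+-+", "g3": "------"}

# Distance 0 corresponds to the darkest (blue) class, the maximum to yellow.
distances = {name: hamming_distance(g, target) for name, g in candidates.items()}
```

In this coding a genotype identical to the target has distance 0, and a genotype differing at every locus has a distance equal to the number of loci.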
Since there are strong epistatic and GE interactions for the four traits in determining grain yield in the
genotype-environment system represented in this example, it is important to consider the influence of se-
lection environment on the expected changes in the genetic structure of the population. Here we examine
genetic responses over recurrent cycles of selection on yield phenotypes in either the severe terminal
stress or mild terminal stress environment-types. These responses to selection are examined in terms of
changes in the gene frequencies of alleles for increasing levels of trait expression for each trait (Fig. 7)
and finally in terms of trajectories through genetic space for yield (Fig. 8).
Fig. 7. Change in gene frequency of the + alleles for increasing level of the four traits (TE=Transpiration Efficiency,
OA=Osmotic Adjustment, PH=Phenology, SG=Stay-green) over cycles of selection, when selection is conducted in
the Severe Terminal Stress (a) and Mild Terminal Stress (b) environment-types.
Selection for increased grain yield within the severe terminal stress environment-type (Fig. 7a) had the
effect of rapidly increasing the frequencies of alleles that enhanced expression of the two traits OA and
TE, gradually increasing the frequencies of alleles for enhanced SG, and decreasing the frequencies of
alleles for later flowering, thus selecting early flowering genotypes that could developmentally escape
from the severe terminal stress conditions. After selection cycles 5 and 6, once the alleles for greater ex-
pression of OA and TE were fixed, the rate of increase in frequency of alleles for enhanced levels of SG
was greater than in the previous selection cycles. Selection for higher grain yield under the mild terminal
stress environment-type (Fig. 7b) resulted in a pattern of changes in allele frequencies different from
that observed for the severe terminal stress environment-type (Fig. 7a). Under the mild terminal stress
environment-type, selection for greater yield favored an increase in the frequencies of alleles for higher
expression levels of all four traits (Fig. 7b). Thus, in contrast to the severe terminal stress environment-type,
where early flowering genotypes were favored, selection in the mild terminal stress environment-type
favored late flowering genotypes. Therefore, as we expect in the presence of these interactions, if we plot
the trajectories through genetic space followed by the populations over cycles of selection for yield, these
trajectories contrast depending on whether we select under a severe terminal stress environment-type
(Fig. 8a) or a mild terminal stress environment-type (Fig. 8b).
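The contrast between selection environments can be mimicked with a toy simulation. The sketch below is not the simulation system used in this study. It assumes haploid individuals, five loci per trait, free recombination, purely additive effects and no environmental noise. The effect sizes in `EFFECTS` are hypothetical values chosen only so that the + alleles for PH reduce yield under severe terminal stress and increase it under mild terminal stress.

```python
import random

TRAITS = ["TE", "OA", "PH", "SG"]
LOCI_PER_TRAIT = 5  # assumption for illustration only

# Hypothetical additive effect of one '+' allele on yield in each
# environment-type. The sign change for PH mimics a crossover GE interaction.
EFFECTS = {
    "severe": {"TE": 1.0, "OA": 1.0, "PH": -0.6, "SG": 0.3},
    "mild": {"TE": 0.6, "OA": 0.6, "PH": 0.6, "SG": 0.6},
}

def yield_value(ind, env):
    """Toy additive yield: weighted count of '+' alleles (coded 1) per trait."""
    return sum(EFFECTS[env][t] * sum(ind[t]) for t in TRAITS)

def plus_frequency(pop, trait):
    """Frequency of '+' alleles for one trait across the population."""
    total = sum(sum(ind[trait]) for ind in pop)
    return total / (len(pop) * LOCI_PER_TRAIT)

def select(env, cycles=15, size=200, keep=0.2, seed=1):
    """Recurrent truncation selection on yield in a single environment-type."""
    rng = random.Random(seed)
    pop = [{t: [rng.randint(0, 1) for _ in range(LOCI_PER_TRAIT)] for t in TRAITS}
           for _ in range(size)]
    history = [{t: plus_frequency(pop, t) for t in TRAITS}]
    for _ in range(cycles):
        ranked = sorted(pop, key=lambda ind: yield_value(ind, env), reverse=True)
        parents = ranked[: int(size * keep)]
        # Each offspring allele is copied from a randomly chosen selected
        # parent at the same locus (free recombination, haploid individuals).
        pop = [{t: [rng.choice(parents)[t][i] for i in range(LOCI_PER_TRAIT)]
                for t in TRAITS} for _ in range(size)]
        history.append({t: plus_frequency(pop, t) for t in TRAITS})
    return history

final_severe = select("severe")[-1]
final_mild = select("mild")[-1]
```

Under these assumptions the frequency of + alleles for PH is expected to fall when selecting in the severe environment-type and rise when selecting in the mild environment-type, while `select(env)` returns the full per-cycle history that could be plotted in the manner of Figure 7.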
Fig. 8. Grain yield values (t/ha) for the 4235-genotype classes and the average trajectory of a population of geno-
types (red line) over cycles of selection, when selection is conducted in the Severe Terminal Stress (a) and Mild
Terminal Stress (b) environment-types. Genotype classes are color coded according to their genetic distance from
the target genotype in either the Severe Terminal Stress (a) or the Mild Terminal Stress (b) environment-types, ex-
tending from yellow (all alleles different from the target genotype) to blue (no alleles different from the target geno-
type).
SORGHUM BREEDING EXAMPLE: DISCUSSION
The purposes for considering the sorghum breeding example we have described in this paper were
threefold: (1) to demonstrate some aspects of the approaches we are developing and using to investigate
and deal with the GP problem for complex traits in plant breeding applications (Fig. 3), (2) to emphasize
the importance that both epistatic and GE interactions can have in gene-to-phenotype relationships, and
(3) to show how the E(NK) model can be used as a framework for many approaches to investigating the GP
problem. Given a suitable experimental information base, an equally valid case study could be the study
of human health issues such as heart disease, with influences from the genetics of individuals and the
lifestyle environment they choose.
To date our investigation of sorghum genetic improvement in Australia has synthesized a large body of
information that previously existed as a series of less well connected studies. The modeling framework
we now have has highlighted many previously unappreciated implications of interactions between breed-
ing strategies, the genetic architecture of traits and the environments in which we select for higher grain
yield. Also, and perhaps most importantly, the results of these studies have provided testable hypotheses
and focal points for further experimentation to test our current understanding of the ways in which these
traits interact with each other and environmental conditions to determine grain yield. Thus, we are enter-
ing another cycle of the iterative modeling approach described in Figure 3.
The GP problem has always been, and will continue to be, a major challenge in biology. With the increasing
availability of the complete genome sequences of a number of prokaryotic and eukaryotic organisms, our
improving ability to define the locations of genes in these sequences, and our growing knowledge of the
functional relationships between these genes and the biochemical and metabolic pathways they influence
[Karp, 2001], we are beginning to understand the dynamical nature of the GP problem. We see that an
iterative modeling approach, as described in this paper, is a logical quantitative framework for exploring
the growing experimental databases and creating knowledge structures for genotype-environment systems
(Fig. 1). This provides a foundation for defining priorities in the model development process and in decid-
ing when development of practical applications is feasible. In our case the practical applications we seek
are efficient plant breeding strategies that contribute to sustainable agricultural systems.
ACKNOWLEDGMENTS
We thank Professor John Casti for his permission to create a modification of his original modeling
concept map in Figure 1 and also Research Trends, Trivandrum, India, for permission to reproduce com-
ponents of Figure 3.
REFERENCES
[1] Bidinger, F. R., Hammer, G. L. and Muchow, R. C. (1996). The physiological basis of genotype by
environment interaction in crop adaptation. In: Plant Adaptation and Crop Improvement, Cooper, M. and
Hammer, G. L. (eds). CAB International, Wallingford, pp. 329-347.
[2] Borrell, A. K. and Hammer, G. L. (2000). Nitrogen dynamics and the physiological basis of stay-green in
sorghum. Crop Sci. 40, 1295-1307.
[3] Casti, J. L. (1989). Paradigms Lost: Images of Man in the Mirror of Science. Cardinal, London.
[4] Casti, J. L. (1997). Would-be-Worlds: How Simulation is Changing the Frontiers of Science. John Wiley &
Sons, Inc., New York.
[5] Clark, A. G. (2000). Limits to prediction of phenotypes from knowledge of genotypes. Evol. Biol. 32, 205-
224.
[6] Chapman, S. C., Cooper, M., Butler, D. G. and Henzell, R. G. (2000a). Genotype by environment
interactions affecting grain sorghum. I. Characteristics that confound interpretation of hybrid yield.
Aust. J. Agric. Res. 51, 197-207.
[7] Chapman, S. C., Cooper, M., Hammer, G. L. and Butler, D. G. (2000b). Genotype by environment
interactions affecting grain sorghum. II. Frequencies of different seasonal patterns of drought stress are
related to location effects on hybrid yields. Aust. J. Agric. Res. 51, 209-221.
[8] Chapman, S. C., Hammer, G. L., Butler, D. G. and Cooper, M. (2000c). Genotype by environment
interactions affecting grain sorghum. III. Temporal sequences and spatial patterns in the target population
of environments. Aust. J. Agric. Res. 51, 223-233.
[9] Chapman, S. C., Cooper, M. and Hammer, G. L. (2002a). Using crop simulation to generate genotype by
environment interaction effects for sorghum in water-limited environments. Aust. J. Agric. Res., in press.
[10] Chapman, S. C., Cooper, M., Podlich, D. W. and Hammer, G. L. (2002b). Evaluating plant breeding strate-
gies by simulating gene action and environmental effects to predict phenotypes for dryland adaptation.
Agron. J., submitted.
[11] Comstock, R. E. (1996). Quantitative Genetics with Special Reference to Plant and Animal Breeding. Iowa
State University Press, Ames.
[12] Cooper, M., Podlich, D. W., Jensen, N. M., Chapman, S. C. and Hammer, G. L. (1999). Modelling plant
breeding programs. Trends Agron. 2, 33-64.
[13] Falconer, D. S. and Mackay, T. F. C. (1996). Introduction to Quantitative Genetics. 4th edn. Longman,
Essex.
[14] Hallauer, A. R. and Miranda, J. B. F. (1988). Quantitative Genetics in Maize Breeding. 2nd edn. Iowa
State University Press, Ames.
[15] Hammer, G. L., Chapman, S. C. and Snell, P. (1999). Crop simulation modelling to improve selection
efficiency in plant breeding programs. Proc. Ninth Assembly Wheat Breeding Society of Australia,
Toowoomba, pp. 79-85.
[16] Hammer, G. L., Farquhar, G. D. and Broad, I. J. (1997). On the extent of genetic variation for
transpiration efficiency in sorghum. Aust. J. Agric. Res. 48, 649-655.
[17] Hammer, G. L. and Muchow, R. C. (1994). Assessing climatic risk to sorghum production in water-limited
subtropical environments. I. Development and testing of a simulation model. Field Crops Res. 36, 221-234.
[18] Hammer, G. L., Vanderlip, R. L., Gibson, G., Wade, L. J., Henzell, R. G., Younger, D. R., Warren, J.
and Dale, A. B. (1989). Genotype by environment interaction in grain sorghum. II. Effects of temperature
and photoperiod on ontogeny. Crop Sci. 29, 376-384.
[19] Hammer, G. L., van Oosterom, E. J., Chapman, S. C. and McLean, G. (2001). The economic theory of water
and nitrogen dynamics in field crops. In: Proceedings of the Fourth Australian Sorghum Conference, Kooral-
byn, Queensland, 5-8 Feb 2001, A. K. Borrell and R. G. Henzell (eds). CD-Rom Format. Range Media Pty
Ltd. (ISBN: 0-7242-2163-8).
[20] Ideker, T., Thorsson, V., Ranish, J. A., Christmas, R., Buhler, J., Eng, J. K., Bumgarner, R., Goodlett, D. R.,
Aebersold, R. and Hood, L. (2001). Integrated genomic and proteomic analyses of a systematically perturbed
metabolic network. Science 292, 929-934.
[21] Karp, P. D. (2001). Pathway databases: A case study in computational symbolic theories. Science 293, 2040-
2044.
[22] Kauffman, S. A. (1993). The Origins of Order: Self-Organization and Selection in Evolution. Oxford Univer-
sity Press, New York.
[23] Kauffman, S. A. (2000). Investigations. Oxford University Press, Oxford.
[24] Kempthorne, O. (1988). An overview of the field of quantitative genetics. In: Proceedings of the Second
International Conference on Quantitative Genetics, Weir, B. S., Eisen, E. J., Goodman, M. M. and
Namkoong, G. (eds). Sinauer Associates, Inc., Sunderland, pp. 47-56.
[25] Mackay, T. F. C. (2001). Quantitative trait loci in Drosophila. Nat. Rev. Genet. 2, 11-20.
[26] McCown, R. L., Hammer, G. L., Hargreaves, J. N. G., Holzworth, D. P. and Freebairn, D. M. (1996).
APSIM: A novel software system for model development, model testing, and simulation in agricultural
systems research. Agric. Syst. 50, 255-271.
[27] Micallef, K. P., Cooper, M. and Podlich, D. W. (2001). Using clusters of computers for large QU-GENE
simulation experiments. Bioinformatics 17, 194-195.
[28] Mortlock, M. Y. and Hammer, G. L. (1999). Genotype and water limitation effects on transpiration
efficiency in sorghum. J. Crop Prod. 2, 265-286.
[29] Podlich, D. W. and Cooper, M. (1998). QU-GENE: a platform for quantitative analysis of genetic models.
Bioinformatics 14, 632-653.
[30] Rosen, R. (1985). Anticipatory Systems: Philosophical, Mathematical and Methodological Foundations.
Pergamon Press, Oxford.
[31] Tao, Y. Z., Jordan, D. R., Henzell, R. G. and McIntyre, C. L. (1998). Construction of a genetic map in a sor-
ghum RIL population using probes from different sources and its alignment with other sorghum maps. Aust.
J. Agric. Res. 49, 729-736.
[32] Tao, Y. Z., Henzell, R. G., Jordan, D. R., Butler, D. G., Kelly, A. M. and McIntyre, C. L. (2000).
Identification of genomic regions associated with stay green in sorghum by testing RILs in multiple
environments. Theor. Appl. Genet. 100, 1225-1232.