Genotyping-by-Sequencing for Plant Breeding and Genetics

Wiley
The Plant Genome
92 THE PLANT GENOME NOVEMBER 2012 VOL. 5, NO. 3
REVIEW & INTERPRETATION
Genotyping-by-Sequencing for Plant
Breeding and Genetics
Jesse A. Poland* and Trevor W. Rife
Abstract
Rapid advances in “next-generation” DNA sequencing
technology have brought the US$1000 human (Homo sapiens)
genome within reach while providing the raw sequencing
output for researchers to revolutionize the way populations are
genotyped. To capitalize on these advancements, genotyping-
by-sequencing (GBS) has been developed as a rapid and robust
approach for reduced-representation sequencing of multiplexed
samples that combines genome-wide molecular marker discovery
and genotyping. The flexibility and low cost of GBS make this
an excellent tool for many applications and research questions
in plant genetics and breeding. Here we address some of the
new research opportunities that are becoming more feasible with
GBS. Furthermore, we highlight areas in which GBS will become
more powerful with the continued increase of sequencing
output, development of reference genomes, and improvement of
bioinformatics. The ultimate goal of plant biology scientists is to
connect phenotype to genotype. In plant breeding, the genotype
can then be used to predict phenotypes and select improved
cultivars. Furthering our understanding of the connection between
heritable genetic factors and the resulting phenotypes will enable
genomics-assisted breeding to exist on the scale needed to
increase global food supplies in the face of decreasing arable
land and climate change.
Next-Generation Genotyping
DRIVEN BY THE QUEST for a $1000 human genome, rapid
advances in next-generation sequencing (NGS) output
have provided technology with the ability to greatly trans-
form the way we think about plant genomics and breeding.
With the introduction of massively parallel sequencing,
raw sequencing output is doubling roughly every 6 mo (Fig.
1). The availability of inexpensive sequencing technology
has transformed the way genomes are sequenced (Xu et
al., 2011; Wang et al., 2011), polymorphisms are discovered
(Mardis, 2008; Futschik and Schlötterer, 2010; You et al.,
2011; Nielsen et al., 2011), gene expression is analyzed (Ger-
aldes et al., 2011; Harper et al., 2012), and populations are
genotyped (Baird et al., 2008; Elshire et al., 2011; Davey et
al., 2011; Truong et al., 2012; Poland et al., 2012a; Wang et
al., 2012). Sequencing is rapidly becoming so inexpensive
that it will soon be reasonable to use it for every genetic
study. Next-generation sequencing applications have the
potential to revolutionize the field of plant genomics and the
practice of applied plant breeding.
One of the primary objectives of functional genomics in agricultural species is to connect phenotype to genotype and use this knowledge to make phenotypic predictions and select improved plant types. To do this on a genome-wide scale requires large populations with dense molecular markers across the genome. To put the power of NGS to work for plant breeding and genomics, new approaches for sequence-based genotyping have been developed. One promising approach is genotyping-by-sequencing (GBS), which uses enzyme-based complexity reduction (using restriction endonucleases to target only a small portion of the genome) coupled with DNA barcoded adapters to produce multiplex libraries of samples ready for NGS. This approach has been demonstrated to be robust across a range of species and capable of producing tens of thousands to hundreds of thousands of molecular markers (Elshire et al., 2011; Poland et al., 2012a). The flexibility of GBS with regard to species, populations, and research objectives makes this an ideal tool for plant genetics studies. As the phenomenal increase in NGS output continues, many research questions that were once out of reach will be resolved through the application of these approaches.
Published in The Plant Genome 5:92–102.
doi: 10.3835/plantgenome2012.05.0005
© Crop Science Society of America
5585 Guilford Rd., Madison, WI 53711 USA
An open-access publication
All rights reserved. No part of this periodical may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Permission for printing and for reprinting the material contained herein has been obtained by the publisher.
J.A. Poland, USDA-ARS, Hard Winter Wheat Genetics Research Unit and Dep. of Agronomy, Kansas State Univ., 4008 Throckmorton Hall, Manhattan KS, 66506; T.W. Rife, Interdepartmental Genetics, Kansas State Univ., 4024 Throckmorton Hall, Manhattan KS, 66506. Received 29 May 2012. *Corresponding author (jesse.poland@ars.usda.gov).
Abbreviations: AM, association mapping; GBS, genotyping-by-sequencing; GS, genomic selection; HMM, hidden Markov model; MSG, multiplexed shotgun genotyping; NGS, next-generation sequencing; PAV, presence–absence variation; RAD, restriction association DNA; SNP, single nucleotide polymorphism.
POLAND AND RIFE: GENOTYPING-BY-SEQUENCING 93
All-in-One
The two key components for genotyping germplasm are finding DNA sequence polymorphisms and assaying the markers across a full set of material. Classically, this has been a two-step process involving marker discovery followed by assay design and genotyping. An important strength of sequence-based genotyping approaches is that the marker discovery and genotyping are completed at the same time. This facilitates exploration of new germplasm sets or even new species without the upfront effort of discovering and characterizing polymorphisms. Another key component of GBS datasets is that the raw data is dynamic. The raw sequences obtained from GBS can be reanalyzed, uncovering further information (e.g., new polymorphisms, annotated genes, etc.) as bioinformatics techniques improve, reference genomes develop, and the collection of sequence data increases. Each of these factors adds value to the same raw dataset.
One of the first broadly adopted applications of NGS was single nucleotide polymorphism (SNP) and presence–absence variation (PAV) discovery in diverse populations with and without reference genomes (Baird et al., 2008; Wiedmann et al., 2008; Gore et al., 2009a, 2009b; Huang et al., 2009; Deschamps et al., 2010; Hyten et al., 2010; You et al., 2011; Nelson et al., 2011; Hohenlohe et al., 2011; Byers et al., 2012). These studies have focused on assaying a few key genotypes with a reduced-representation approach (Baird et al., 2008) or with whole-genome resequencing (Huang et al., 2009). While highly effective for SNP discovery, this approach is limited in the number of lines assayed and does not simultaneously assay the markers across the full population of interest.
The key objective of the GBS approach, therefore, is not merely to discover polymorphisms and then transfer these to a fixed assay, but to simultaneously discover polymorphisms and obtain genotypic information across the whole population of interest. It is this combined one-step approach that makes GBS a truly rapid and flexible platform for a range of species and germplasm sets and perfectly suited for genomic selection (GS) in plant breeding programs. As sequencing output continues to increase, GBS will evolve first to lower levels of complexity reduction (to capture more sequence variants) and then to whole-genome resequencing (to capture all variants). Whole-genome resequencing has been applied in Arabidopsis thaliana (L.) Heynh., rice (Oryza sativa L.), and maize (Zea mays L.) (Huang et al., 2009; Ashelford et al., 2011; Gan et al., 2011; Chia et al., 2012; Jiao et al., 2012; Xu et al., 2012), although it quickly becomes less manageable with larger, more complex genomes that lack a solid reference genome (Morrell et al., 2011). The level of multiplexing has also been limited in this approach, increasing per-sample cost.
As GBS can be readily used for de novo discovery and application of new molecular polymorphisms, it is particularly powerful for new sets of germplasm and uncharacterized species. In many ways the greatest advantage of sequence-based genotyping approaches is the reduction of ascertainment bias associated with marker discovery in panels differing from the target population. This is an obvious advantage for association studies in which differing allele frequencies greatly influence the power and precision of the study (Myles et al., 2009; Hamblin et al., 2010). For breeding applications, informative polymorphisms can be discovered as novel germplasm is introduced into the breeding pool. The use of an unrepresentative marker panel in surveying molecular diversity is highly problematic for getting a true representation of the molecular diversity present in a target population. Most GBS approaches use methylation-sensitive enzymes. If these enzymes target differentially methylated regions of the genome, ascertainment bias could potentially be introduced in different sets of germplasm, but evidence for this has yet to be seen. While markers discovered with GBS should have little bias across sets of germplasm, it is also unknown how uniformly they are spaced across the genome. Evidence from Poland et al. (2012a), however, indicated that GBS markers were uniformly spaced across the chromosomes of both wheat (Triticum aestivum L.) and barley (Hordeum vulgare L.).
Figure 1. A comparison of actual sequencing capacity (orange) to what would be expected if sequencing technology was following Moore’s Law (blue). The significant decrease in 2007 coincides roughly with the introduction of next-generation sequencing technology. Data is from the National Human Genome Research Institute (Wetterstrand, 2012).
Many Flavors
The use of reduced-representation sequencing for targeting small portions of the genome was first demonstrated by Altshuler et al. (2000). This approach was later combined with NGS and DNA barcoded adapters to sequence multiplex libraries in parallel. There are many variations of this approach, and GBS is one specific method for genotyping using NGS of multiplex DNA-barcoded reduced-representation libraries (Table 1). Furthermore, the combination of enzymes that can be used for complexity reduction is almost endless. Davey et al. (2011) have thoroughly reviewed several approaches to complexity reduction, including complexity reduction of polymorphic sequences (van Orsouw et al., 2007) and deep sequencing of reduced representation libraries (van Tassell et al., 2008).
The use of restriction enzymes for targeted reduction of genome complexity combined with NGS was first described by Baird et al. (2008) and termed restriction association DNA (RAD). Restriction association DNA methods use a restriction enzyme to generate genomic fragments, which are then ligated to an adaptor containing a forward primer for amplification, sequencing platform primer sites, and a unique DNA barcode that enables sample multiplexing (Baird et al., 2008; Craig et al., 2008; Cronn et al., 2008). The samples are pooled, randomly sheared, and size selected to create a uniform collection of similarly sized DNA fragments (Baird et al., 2008). The fragments are then ligated to a Y adaptor that ensures only fragments containing the first adaptor will be amplified (Baird et al., 2008). Restriction association DNA markers provided a robust method to discover polymorphisms and map variation in a population (Miller et al., 2007).
First-generation RAD analysis had drawbacks similar to older restriction enzyme-based marker technologies: the requirement of species-specific arrays, a hybridization for every comparison, and limitations for assaying presence–absence variation (Baird et al., 2008). Combining the progressive features of RAD with NGS, however, resulted in the discovery of new markers at a significantly decreased cost (Baird et al., 2008). The simultaneous discovery of SNP markers during RAD sequencing facilitated robust mapping of many polymorphisms and precise assignment of chromosomal regions to mapping parents, allowing for detection of recombination locations. The RAD approach has recently been modified to use restriction enzymes that cut upstream and downstream of a target site (Wang et al., 2012). This new methodology produces uniform length tags, allows nearly all of the restriction sites to be surveyed, and permits marker intensity adjustment (Wang et al., 2012). The next flavor of sequence-based genotyping was multiplexed shotgun genotyping (MSG), which required only one gel purification, eliminated DNA shearing, required less starting DNA, and implemented a hidden Markov model (HMM) to determine points of chromosomal recombination (Andolfatto et al., 2011). Multiplexed shotgun genotyping used a single common cutting restriction enzyme and produced a limited complexity reduction suitable for the smaller genome (approximately 130 Mb) of Drosophila simulans (Andolfatto et al., 2011). In the context of a reference genome, the HMM imputation approach was highly effective for tracing parental origin and defining recombination break points (Andolfatto et al., 2011).
Table 1. A technical comparison of current genotyping methods using next-generation sequencing of multiplex barcoded libraries. Adapted from Wang et al. (2012). Flavors of genotyping using next-generation sequencing of multiplex DNA-barcoded reduced-representation libraries.
Method | Random shearing | Size selection | Fragment size | Enzymes† | Multiplexing level‡ | Analysis tool(s) | Reference
Multiplex shotgun genotyping | No | Yes | Size selected | MseI | 96 (up to 384) | Burrows-Wheeler alignment tool | Andolfatto et al., 2011
Restriction association DNA sequencing (RAD-seq) | Yes | Yes | Size selected | SbfI; EcoRI | 96 | Custom Perl scripts | Baird et al., 2008
Double digest RAD-seq | No | Yes | Size selected | EcoRI and MspI | 48§ | MUSCLE¶ | Peterson et al., 2012
2b-restriction association DNA | No | No | 33–36 bp | BsaXI# | NA†† | Custom Perl scripts | Wang et al., 2012
Genotyping-by-sequencing | No | No | <350 bp | ApeKI‡‡ | 48 (up to 384) | TASSEL§§ | Elshire et al., 2011
Genotyping-by-sequencing – two enzyme | No | No | <350 bp | PstI and MspI | 48 (up to 384) | TASSEL | Poland et al., 2012a
Sequence-based genotyping | No | Yes | Size selected | EcoRI and MseI; PstI and TaqI | 32 | Burrows-Wheeler alignment tool and unified genotyper | Truong et al., 2012
Restriction enzyme sequence comparative analysis | No | Yes | Size selected | MseI; NlaIII | NA¶¶ | Burrows-Wheeler alignment tool and Samtools | Monson-Miller et al., 2012
† All of these approaches can use different enzymes. Shown are the enzyme(s) used in the initial study.
‡ All of these methods have the possibility to increase the number of multiplexed samples using additional unique barcodes. The multiplex level is as reported in the reference paper. Given in parentheses are subsequent increases.
§ Combinatorial barcoding is possible, placing a barcode on each end of the DNA fragment. Using a set of 48 adapter P1 barcodes and 12 polymerase chain reaction (PCR) 2 indices, it is possible to uniquely label 576 individuals (48 [adapter P1 barcodes] × 12 [PCR2 indices]). This method would require paired-end sequencing.
¶ MUSCLE, multiple sequence comparison by log-expectation.
# Uses type IIB restriction endonucleases.
†† NA, not applicable.
‡‡ Has been successfully applied using PstI and HindIII (E. Buckler and R. Elshire, personal communication, 2012).
§§ TASSEL, trait analysis by association, evolution, and linkage.
¶¶ 96-plexing reported but unpublished.
The original GBS protocol was developed to simplify and streamline the construction of RAD libraries (Elshire et al., 2011). The strength of the GBS protocol is its simplicity: using inexpensive adapters, allowing pooled library construction, and avoiding shearing and size selection (Fig. 2). The GBS approach removed the need for size selection by using a short polymerase chain reaction extension of the multiplexed library. Instead of the Y adapters used in the RAD protocol, the original GBS protocol used a single restriction enzyme, a barcoded adaptor, and a common adaptor (Elshire et al., 2011). Although all combinations of adapters can ligate to the DNA fragments, only fragments carrying one of each adapter (one barcoded and one common) can be amplified and sequenced (Davey et al., 2011).
The original GBS approach was recently extended to a two-enzyme version that combines a rare- and a common-cutting restriction enzyme to generate uniform libraries consisting of a forward (barcoded) adaptor and a reverse (Y) adaptor on alternate ends of each fragment (Poland et al., 2012a). The use of two enzymes in this GBS approach enables the capture of most fragments associated with the rare-cutting enzyme. The use of a Y adaptor on the common restriction site avoids amplification of more common fragments, a preferable situation for larger, more complex genomes. Following the original work on wheat and barley, this GBS approach has been successfully applied in several species including cotton (Gossypium hirsutum L.), oat (Avena sativa L.), sorghum [Sorghum bicolor (L.) Moench], and rice with little to no change in
protocol (Poland, unpublished data, 2012).
The options for tailoring GBS to any species or desired application are almost endless. A range of enzymes have been evaluated in maize with success in varying the level of complexity reduction (E. Buckler, personal communication, 2012). With a varied level of complexity reduction, it is possible to increase coverage of a target genome or increase the multiplexing level of a target population. The interplay of these two factors will determine the optimal approach for the species under investigation. For species with large genomes or no reference genome, the use of rare-cutting restriction enzymes (i.e., 6 bp or greater target site) with methylation sensitivity can assist in creating a higher level of complexity reduction by targeting fewer sites. This will lead to higher sampling depth of the same genomic sites and reduce the amount of missing data (Fig. 3).
Figure 2. Schematic overview of steps in genotyping-by-sequencing (GBS) library construction, sequencing, and analysis. (1) Genomic DNA is quantified using a fluorescence-based method. (2) Genomic DNA (gDNA) is normalized in a new plate. Normalization is needed to ensure equal representation of all samples and equal molarity of gDNA and adapters. (3) A master mix with restriction enzyme(s) and buffer is added to the plate and incubated. (4) The DNA barcoded adapters are added along with ligase and ligation buffers. (5) Samples are pooled and cleaned. (6) The GBS library is polymerase chain reaction (PCR) amplified. (7) The amplified library is cleaned and evaluated on a capillary sizing system. (8) Libraries are sequenced. Data analysis: Following a sequencing run, FASTQ files containing raw data from the run are used to parse sequencing reads to samples using the DNA barcode sequence. Once assigned to individual samples, the reads are aligned to a reference genome. In the case of species without a complete reference genomic sequence, reads are internally aligned (alignment of all sequence reads with all other reads from that library) and single nucleotide polymorphisms (SNPs) are identified from 1 or 2 bp sequence mismatches. Various filtering algorithms can then be used to distinguish true biallelic SNPs from sequencing errors.
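Two of the data-analysis steps named in the Fig. 2 caption, assigning reads to samples by their DNA barcode and pairing tags that differ by 1 or 2 bp when no reference genome is available, can be illustrated with a short sketch. This is not the TASSEL pipeline or any published implementation; the function names, barcodes, and reads are hypothetical, and the sketch ignores base quality scores and sequencing errors within the barcode.

```python
from collections import defaultdict
from itertools import combinations

def demultiplex(reads, barcodes):
    """Assign each read to a sample by its leading barcode and trim the barcode."""
    by_sample = defaultdict(list)
    # Try longer barcodes first so a short barcode cannot shadow a longer one.
    ordered = sorted(barcodes.items(), key=lambda kv: -len(kv[0]))
    for read in reads:
        for barcode, sample in ordered:
            if read.startswith(barcode):
                by_sample[sample].append(read[len(barcode):])
                break  # reads with no matching barcode are dropped
    return dict(by_sample)

def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def candidate_snp_pairs(tags, max_mismatch=2):
    """Pair equal-length tags that differ at 1..max_mismatch positions."""
    pairs = []
    for a, b in combinations(sorted(set(tags)), 2):
        if len(a) == len(b) and 1 <= hamming(a, b) <= max_mismatch:
            pairs.append((a, b))
    return pairs

# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ACGT": "sample_1", "TTAG": "sample_2"}
reads = ["ACGTCAGCAAT", "TTAGCAGCTAT", "GGGGCAGCAAT"]
by_sample = demultiplex(reads, barcodes)
tags = [tag for sample_tags in by_sample.values() for tag in sample_tags]
pairs = candidate_snp_pairs(tags)
```

The all-against-all comparison grows quadratically with the number of unique tags, so it is only practical for small examples. The candidate pairs would still need the filtering mentioned in the caption to separate true biallelic SNPs from sequencing errors.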
Hand in Hand with the Reference Genome
Sequence-based genotyping greatly benefits from a well-characterized (sequenced) reference genome. A reference genome makes ordering and imputing low coverage marker data generated through GBS and other sequence-based genotyping approaches straightforward. This has been seen in many of the reported uses of sequence-based genotyping. The MSG approach used by Andolfatto et al. (2011) made use of the D. simulans reference genome to first align tags to the reference and then call SNPs. Using a physical map framework, the parent-of-origin was then imputed across all SNPs segregating in the population. This approach is very robust for assigning parent-of-origin in biparental populations. Likewise, Huang et al. (2009) used the reference genome of rice to first align NGS tags and subsequently call SNPs. The physical ordering of these markers greatly enabled and simplified the imputation and assignment of parent-of-origin for segregating populations.
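To illustrate why physical marker order makes parent-of-origin imputation tractable, the following is a minimal Viterbi sketch for a two-state hidden Markov model (parent A or parent B) over ordered SNP calls with missing data. It is not the MSG implementation of Andolfatto et al. (2011); the function name and the `error` and `switch` probabilities are illustrative assumptions, and heterozygous states are omitted for brevity.

```python
import math

def impute_parent_of_origin(calls, error=0.05, switch=0.01):
    """Most likely parental state per marker (Viterbi path).

    calls:  list of "A", "B", or None (missing), in physical marker order.
    error:  assumed probability that an observed call is wrong.
    switch: assumed probability of recombination between adjacent markers.
    """
    if not calls:
        return []
    states = ("A", "B")

    def emit(state, call):
        if call is None:  # a missing call carries no information
            return 0.0
        return math.log(1 - error if call == state else error)

    def trans(prev, cur):
        return math.log(1 - switch if prev == cur else switch)

    # score[s] = best log-probability of any path ending in state s
    score = {s: math.log(0.5) + emit(s, calls[0]) for s in states}
    back = []
    for call in calls[1:]:
        new_score, pointers = {}, {}
        for cur in states:
            best_prev = max(states, key=lambda p: score[p] + trans(p, cur))
            new_score[cur] = score[best_prev] + trans(best_prev, cur) + emit(cur, call)
            pointers[cur] = best_prev
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical low-coverage calls along one chromosome.
calls = ["A", "A", None, "A", "B", "A", "A", "B", "B", None, "B"]
path = impute_parent_of_origin(calls)
```

With these assumed parameters, an isolated conflicting call is cheaper to explain as a genotyping error than as two recombination events, so the model fills missing calls and smooths over single-marker errors while still placing one crossover where the calls change consistently.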
Although GBS approaches greatly benefit from a reference genome, the rapid discovery and ordering (through genetic mapping) of sequence-based molecular markers can assist with the development and refinement of a reference genome. High-density genetic maps developed through GBS can be used to anchor and order physical maps and refine or correct unordered sequence contigs. In D. simulans, Andolfatto et al. (2011) were able to assign 8 Mb to linkage groups, which comprised 30% of the unassembled D. simulans genome or about 6% of the total genome. This is a substantial improvement of an already well-characterized genome. Likewise, in current efforts in much larger, more complex genomes including barley (5.5 Gb) and wheat (16 Gb) (Arumuganathan and Earle, 1991), high-density GBS maps are being used to assist with anchoring and ordering large numbers of assembled but unanchored and unordered contigs (International Barley Sequencing Consortium, 2012). This approach appears very promising, creating a positive feedback loop in which the development of the reference genome assisted by GBS markers leads to better SNP calling and order-based imputation for GBS datasets.
Maps Made Easy
The combination of GBS with a well-defined reference genome makes the development of genetic maps for characterizing segregating populations exceptionally straightforward. In the absence of a solid reference genome, a high-density reference genetic map can serve the same purpose. For characterizing a new population, there will no longer be any need to place markers on linkage groups, calculate recombination frequencies, or order markers. With a reference genome, markers can be ordered along the physical chromosome. This ordering can then be used to precisely place recombination break points. The power of such approaches has been highlighted in recent papers with model species including D. simulans (Andolfatto et al., 2011), rice (Huang et al., 2010), and maize (Elshire et al., 2011). Even at low coverage, the placement of sparse markers on the physical map can be used to narrow points of recombination to 100 to 200 kb intervals (Huang et al., 2009; Xie et al., 2010). This approach can be extended to populations with heterozygous chromosomal segments such as F2 or BC1 populations. Andolfatto et al. (2011) demonstrated an HMM that accurately inferred heterozygous states from low-pass sequence-based genotyping. These same approaches have successfully been applied in maize (P. Bradbury, personal communication, 2012).
Figure 3. Integration of genotyping-by-sequencing (GBS) in the context of plant breeding and genomics for a species without a completed reference genome.
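The way physically ordered markers bound a recombination break point can be shown with a small sketch. It assumes the parental calls have already been cleaned of genotyping errors; the function name, positions, and calls are hypothetical.

```python
def breakpoint_intervals(positions, calls):
    """Return (left_bp, right_bp, left_parent, right_parent) per parental switch.

    positions: physical marker positions in bp, sorted ascending.
    calls:     parent of origin per marker ("A" or "B"), or None if missing.
    """
    # Only markers with a call can bound a break point.
    informative = [(p, c) for p, c in zip(positions, calls) if c is not None]
    intervals = []
    for (p1, c1), (p2, c2) in zip(informative, informative[1:]):
        if c1 != c2:
            intervals.append((p1, p2, c1, c2))
    return intervals

# Hypothetical sparse markers on one chromosome.
positions = [10_000, 120_000, 260_000, 410_000, 530_000]
calls = ["A", "A", None, "B", "B"]
intervals = breakpoint_intervals(positions, calls)
```

Each interval is bounded by the two nearest informative markers with different parental calls, so its width depends directly on marker density and on how much data is missing near the crossover.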
In the absence of a solid reference genome, the same ease of genetic mapping can be accomplished through development of a reference genetic map for the species of interest. Genotyping-by-sequencing markers and other framework markers can be integrated to develop a high-density genetic map (Poland et al., 2012a). For new populations, GBS tags can be used to make genotype calls based on the reference map without the need to construct a de novo map. The extremely large number of markers produced with GBS allows sufficient coverage for most populations even if only a fraction of the total
markers are used.
These same approaches for developing genetic maps and graphical genotypes can be broadly applied to the characterization of populations of interest for breeding and germplasm improvement including elite breeding lines, segregating populations for selection, near-isogenic lines, and alien-introgression lines. The use of a variety of algorithms to correctly infer the heterozygous or homozygous state of chromosome regions will add value to inferences and conclusions for molecular breeding and selection (Andolfatto et al., 2011). Other algorithms can be used for phasing markers in segregating and outcrossing populations. This will generally, however, require known marker order of the GBS SNPs.
Mapping Single Genes
Genotyping-by-sequencing and other sequence-based genotyping approaches can be very powerful for mapping single genes. The de novo discovery of high-density markers in a population of interest has the potential to circumvent the cumbersome process of marker discovery and testing for fine mapping of target genes and mutations. In the absence of a reference map, RAD markers have been used in bulked segregant analysis to quickly identify linked markers (Baird et al., 2008). For single genes of interest, this can be a valuable approach to rapidly identify segregating polymorphisms. In lupin (Lupinus angustifolius L.), Yang et al. (2012) were able to identify 30 markers linked to an anthracnose resistance gene. One advantage of GBS for mapping single genes in F2 or similar populations is that the per-sample cost will be low enough that individual samples can be used rather than bulks. This will allow correction or removal of any individuals that were incorrectly phenotyped while confirming segregation of linked markers. Depending on the application, there will be a balance between finding markers linked to the gene of interest using GBS and developing single marker assays from the resulting data. Considering breeding approaches, it can still be optimal to prescreen populations with markers for known single genes (with large effects) for smaller investment in time and sample costs before conducting whole genome profiling. Selected plants carrying desired genes can then be genotyped using GBS for GS.
An Excess of Markers
While preselection of breeding populations for single markers for important genes is a viable breeding strategy, sequencing capacity is becoming so inexpensive and readily available that it will soon be reasonable to generate whole-genome profiles on any germplasm of interest. Previously, scientists spent a majority of their time developing and working with a small number of markers. Many projects today still require only a small number of markers to complete. Genotyping-by-sequencing, however, can readily generate tens of thousands of usable markers, which can be selectively filtered into the few required for a target experiment. While statistical geneticists will always prefer to have as many markers as possible, GS models have diminishing returns on additional markers once the population has reached the point of “marker saturation” (Jannink et al., 2010; Heffner et al., 2011). On the other hand, for association mapping (AM) studies, additional markers increase the likelihood of finding and tagging causal polymorphisms (Cockram et al., 2010). The current limitation for the generated data is computational. There are new algorithms and developments in cluster computing to provide the computational resources needed to make these quantitative genetics questions more manageable (Stanzione, 2011). Quantitative geneticists and bioinformatics personnel will be needed to manage breeding data and develop models. At the same time, bioinformatics training will become a more central component of any plant breeding and genetics curriculum.
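The selective filtering of a large GBS marker set can be sketched with two commonly used criteria, the proportion of missing calls and the minor allele frequency. The text does not prescribe these criteria or any thresholds; the function name, the 0/1/2 genotype coding, and the cutoffs below are assumptions for illustration.

```python
def filter_markers(genotypes, max_missing=0.2, min_maf=0.05):
    """Keep markers passing missing-data and minor-allele-frequency thresholds.

    genotypes: dict of marker name -> list of calls per individual, coded as
               0, 1, or 2 copies of the alternate allele, or None if missing.
    """
    kept = []
    for marker, calls in genotypes.items():
        observed = [c for c in calls if c is not None]
        if not observed:
            continue  # nothing to estimate a frequency from
        missing_rate = 1 - len(observed) / len(calls)
        alt_freq = sum(observed) / (2 * len(observed))
        maf = min(alt_freq, 1 - alt_freq)
        if missing_rate <= max_missing and maf >= min_maf:
            kept.append(marker)
    return kept

# Hypothetical calls for four markers scored on six individuals.
genotypes = {
    "m1": [0, 2, 2, 0, 2, None],        # one missing call, polymorphic
    "m2": [0, 0, 0, 0, 0, 0],           # monomorphic
    "m3": [None, None, None, 2, 0, 0],  # half of the calls missing
    "m4": [0, 1, 2, 1, 0, 1],           # complete, polymorphic
}
kept = filter_markers(genotypes)
```

Appropriate thresholds depend on the application: a GS model may tolerate far more missing data per marker than a single-gene mapping experiment would.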
Filling in the Blanks
The “catch” to GBS and sequence-based genotyping in general is that datasets often have a significant amount of missing data due to low coverage sequencing (Davey et al., 2011). Biologically, missing genotyping calls in GBS datasets can be the result of presence–absence variation, polymorphic restriction sites, and/or differential methylation. On the other hand, the technical issue of missing data with GBS is a combination of (i) library complexity (i.e., number of unique sequence tags) and (ii) sequence coverage of the library.
Library complexity is directly related to the species’ genome under investigation and the choice of enzyme(s) used for complexity reduction. Enzymes with a shorter recognition site will naturally produce more fragments than those with a longer recognition site. Methylation-sensitive enzymes will greatly reduce the number of fragments in species with large portions of repetitive DNA. In barley, libraries constructed using PstI and MspI generate around 500,000 to 600,000 unique tags, while in wheat around 1.5 million tags are generated (Poland, unpublished data, 2012). The actual number of sequence tags present in a raw dataset is substantially higher partly due to allelic variants but largely due to sequencing errors, many of which can be nonrandom. This can and will generate many versions of “unique” tags.
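The effect of recognition-site length on fragment number can be made concrete with a back-of-the-envelope calculation. Under the simplifying assumption of a random genome with equal base frequencies, a recognition site of k bp occurs about once every 4^k bp. The sketch below applies this to the 5.5 Gb barley genome size cited earlier; the function name is hypothetical, and the estimate ignores base composition, repeats, and methylation, all of which change real cut frequencies considerably.

```python
def expected_cut_sites(genome_size_bp, site_length_bp):
    """Expected number of recognition sites in a random genome with equal base frequencies."""
    return genome_size_bp / 4 ** site_length_bp

barley_bp = 5.5e9  # barley genome size cited in the text

# MspI recognizes a 4-bp site (CCGG); PstI recognizes a 6-bp site (CTGCAG).
four_cutter_sites = expected_cut_sites(barley_bp, 4)
six_cutter_sites = expected_cut_sites(barley_bp, 6)
ratio = four_cutter_sites / six_cutter_sites
```

Each additional base in the recognition site reduces the expected number of sites fourfold, so a 6-bp cutter is expected to cut 16 times less often than a 4-bp cutter under this model.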
The level of missing data is based on the sequencing coverage, which is a function of the library complexity, the multiplexing level, and the output of the sequencing platform (Andolfatto et al., 2011). The multiplexing level and the number of independent sequences generated from the sequencing platform will determine the average number of reads per sample. Higher multiplexing levels will reduce the data per sample while increased sequencing output (when using the same multiplexing level) will understandably increase the data per sample.
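The relationship among sequencing output, multiplexing level, and library complexity can be sketched numerically. The Poisson model for the fraction of tags receiving no reads is an assumption added here, not something stated in the text, and real GBS read counts are more variable than Poisson, so observed missing-data rates are typically higher. The function names and the read count per run are hypothetical; the tag number is within the range quoted for barley.

```python
import math

def reads_per_sample(total_reads, multiplex_level):
    """Average number of reads assigned to each sample in a pooled library."""
    return total_reads / multiplex_level

def mean_tag_depth(total_reads, multiplex_level, unique_tags):
    """Average number of reads per unique tag per sample."""
    return reads_per_sample(total_reads, multiplex_level) / unique_tags

def expected_missing_fraction(depth):
    """Fraction of tags with zero reads if counts per tag were Poisson(depth)."""
    return math.exp(-depth)

total_reads = 100e6   # hypothetical number of independent reads from one run
unique_tags = 550_000  # within the range quoted for barley PstI-MspI libraries

depth_48 = mean_tag_depth(total_reads, 48, unique_tags)
depth_96 = mean_tag_depth(total_reads, 96, unique_tags)
missing_48 = expected_missing_fraction(depth_48)
missing_96 = expected_missing_fraction(depth_96)
```

Doubling the multiplexing level halves the mean depth per tag, and under this model the expected fraction of missing calls rises faster than linearly as depth falls.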
One key component of GBS on different sequencing
platforms is the number of independent reads. Post-
Sanger sequencing platforms generally rely on a large
number of short sequence reads to produce gigabases of
sequence data (Metzker, 2009). The new platforms are
continually increasing the sequencing output, a function
of more and longer reads. For GBS, however, generating
longer reads is less advantageous than generating more
reads. More sequence reads provide more data per
sample. Alternatively, increasing read numbers allows
higher multiplexing levels with static amounts of data
per sample. For GBS, 10 Gb of sequence data generated
from 100 million reads of 100 bp would be preferable
to 10 million reads of 1000 bp. While increasing the
number of reads is clearly advantageous for GBS, longer
reads are also beneficial, leading to the discovery of more
polymorphisms (particularly in species with limited
diversity) and assisting GBS applications in polyploids
where secondary, genome-specific polymorphisms
are needed to differentiate a segregating SNP from
homeologous sequences on other genomes.
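The 10 Gb comparison above can be worked through in a few lines of Python. This is a minimal sketch: the helper names and the target of one million reads per sample are hypothetical, chosen only to show how many samples each configuration could carry at a fixed amount of data per sample.

```python
def reads_from_output(total_bases, read_length):
    """Number of independent reads for a given total output and read length."""
    return total_bases // read_length


def max_multiplex(total_reads, target_reads_per_sample):
    """Largest multiplexing level that still delivers the target reads per sample."""
    return total_reads // target_reads_per_sample


total_bases = 10_000_000_000   # 10 Gb, as in the text
target = 1_000_000             # hypothetical reads needed per sample

for read_length in (100, 1000):
    reads = reads_from_output(total_bases, read_length)
    print(f"{read_length} bp reads: {reads:,} reads, "
          f"up to {max_multiplex(reads, target)} samples per run")
```

Under these assumed numbers the short-read run supports ten times as many samples as the long-read run for the same per-sample read count.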
Missing data can be dealt with by (i) sequencing to
higher depth or (ii) imputing. The logical approach to
removing missing data is to sequence to a higher depth
by reducing the multiplexing level or sequencing the
library multiple times. This can be very effective (Fig. 4),
but has the drawback of increasing per-sample cost. For
important AM panels or parents of a breeding program,
however, the additional investment to generate higher
coverage of the tags is likely worthwhile. For breeding
applications using GBS with targeted selection, other
approaches to minimize the impact of missing data are
preferable. Since a majority of the breeding population
will be discarded, minimizing genotyping cost will take
precedence over minimizing missing data.
The second approach is imputation of missing data.
Depending on the genome, the type of GBS libraries, and
the overall size of the datasets, imputation can give very
accurate results. There are many imputation algorithms
(Marchini et al., 2007; Purcell et al., 2007; Browning and
Browning, 2007), most of which are targeted toward
haplotype reconstruction on a reference genome. Other
approaches such as a random forest model (Breiman,
2001) can be used to impute unordered markers (as is the
situation in wheat). Sequencing diverse, key individuals
in the population (parents or representatives of kinship
clusters) can greatly improve imputation accuracy by
defining known haplotypes for the population.
Finally, a matrix of realized relationships among
individuals in a breeding population can be constructed
without imputation. For very high-density genotype
data generated by GBS, the marker coverage is sufficient
to saturate the genomic linkage disequilibrium present
in most breeding programs. From this perspective,
it is only necessary to determine a pairwise identity
between individuals for the markers that are present
in both individuals. With high marker density, there
will still be tens of thousands of markers to compare
between any two individuals, well beyond the saturation
point for most elite breeding material. Imputation with
the simple marker mean can still produce accurate GS
prediction models. From a GS perspective, kinship-based
marker imputation can be used to optimize the realized
relationship matrix in the presence of a high level of
missing data (Poland et al., 2012b). This approach has
been shown to improve the relationship estimates and
give more accurate GS model predictions.
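Two of the ideas in this paragraph, pairwise identity over jointly observed markers and imputation with the marker mean, can be sketched in Python as follows. The genotype coding (0 and 2 for the two homozygotes, None for a missing call), the use of a simple matching proportion as the identity measure, and all names and data are assumptions made for illustration. This is not the kinship-based method of Poland et al. (2012b), which should be consulted directly.

```python
def pairwise_identity(a, b):
    """Proportion of matching calls over markers observed in both individuals.

    Returns (identity, n_shared); identity is None when no marker is shared.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None, 0
    matches = sum(1 for x, y in shared if x == y)
    return matches / len(shared), len(shared)


def impute_marker_mean(genotypes):
    """Fill each missing call with the mean of the observed calls at that marker."""
    names = list(genotypes)
    n_markers = len(genotypes[names[0]])
    imputed = {name: list(genotypes[name]) for name in names}
    for m in range(n_markers):
        observed = [genotypes[n][m] for n in names if genotypes[n][m] is not None]
        if not observed:
            continue  # marker never called in any line; leave it missing
        mean = sum(observed) / len(observed)
        for n in names:
            if imputed[n][m] is None:
                imputed[n][m] = mean
    return imputed


# Hypothetical inbred lines scored at five markers
lines = {
    "A": [0, 2, None, 2, 0],
    "B": [0, 2, 2, None, 2],
    "C": [None, 0, 2, 2, 0],
}

for first, second in (("A", "B"), ("A", "C"), ("B", "C")):
    identity, n_shared = pairwise_identity(lines[first], lines[second])
    print(first, second, identity, n_shared)

print(impute_marker_mean(lines))
```

Each pair is compared only at markers observed in both individuals, so no missing call has to be filled in before a relationship estimate is obtained.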
Association Mapping
Genotyping-by-sequencing has the potential to be an excellent
tool for genotyping of diverse panels for AM. One key
to applying GBS for AM is addressing the missing data
problem. As previously noted, higher coverage sequencing
will reduce the amount of missing data at the expense of
increased per-sample costs. For a high-value AM panel that
will be well characterized and extensively phenotyped and
serve as a community resource population, the additional
cost of sequencing several times to achieve high coverage is
likely worth the investment. This will produce a very well-
characterized genetic population. At a high coverage, imputation
of missing data will become a very precise exercise,
particularly on populations with extensive linkage disequilibrium.
Depending on the species under interrogation, the
GBS markers will need to be ordered via a physical reference
map or through genetic mapping.
In such populations, GBS markers also have the
advantage of being able to survey multiple haplotypes
on a fine scale. When two or more SNPs are within
the same tag, these SNP alleles are evaluated
concurrently. For PAVs, GBS also has the power to
uncover these alleles. Array-based methods, particularly
those applied to polyploid species, are limited in the
ability to accurately survey PAVs as hybridization to a
duplicated sequence will indicate an allele call (for the
ancestral allele) even if the target locus is absent. Due to
the context sequence accompanying a SNP, GBS enables
discrimination between duplicated sequences. At higher
sequencing coverage of the GBS library, PAV can then be
inferred by the absence of a given tag for a given sample
in the pool of sequenced tags.
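One way to make the inference of absence concrete is to ask how likely it is that a tag would receive zero reads in a sample purely by chance. The Python sketch below is an illustration only: the Poisson sampling model, the 0.01 threshold, the function names, and the read counts are all assumptions, not part of any published GBS pipeline.

```python
import math


def absence_probability(sample_reads, tag_fraction):
    """P(zero reads for a tag that is truly present), under Poisson sampling."""
    return math.exp(-sample_reads * tag_fraction)


def call_pav(tag_counts, sample_totals, alpha=0.01):
    """Classify one tag per sample as 'present', 'absent', or 'uncalled'.

    tag_counts: sample -> reads observed for this tag
    sample_totals: sample -> total reads for that sample
    """
    carriers = [s for s, c in tag_counts.items() if c > 0]
    if not carriers:
        return {s: "uncalled" for s in tag_counts}
    # Share of a sample's reads this tag attracts, estimated from carriers only
    tag_fraction = (sum(tag_counts[s] for s in carriers)
                    / sum(sample_totals[s] for s in carriers))
    calls = {}
    for sample, count in tag_counts.items():
        if count > 0:
            calls[sample] = "present"
        elif absence_probability(sample_totals[sample], tag_fraction) < alpha:
            calls[sample] = "absent"      # zero reads is unlikely by chance
        else:
            calls[sample] = "uncalled"    # too shallow to distinguish absence
    return calls


# Hypothetical counts for a single tag across four lines
tag_counts = {"L1": 12, "L2": 9, "L3": 0, "L4": 0}
sample_totals = {"L1": 1_200_000, "L2": 900_000, "L3": 1_000_000, "L4": 100_000}
print(call_pav(tag_counts, sample_totals))
```

In this toy case only the deeply sequenced line with no reads is called absent; the shallowly sequenced line stays uncalled, which mirrors the point that PAV inference depends on higher coverage.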
Genomic Selection
In the field of plant breeding, an important objective
in the development of GBS is to create a low-cost genotyping
platform capable of generating high-density
genotypes. For GS in crop species, breeders need a fast,
inexpensive, flexible method that will enable genotyping
of large populations of selection candidates. A majority
of the selection candidates are then discarded, creating a
situation that benefits greatly from low-cost genotyping.
Genotyping-by-sequencing is quickly expanding to
fill those requirements.
Genomic selection was proposed in 2001 by Meuwissen
et al. as an approach to capture the full complement of small-
effect loci in genomic prediction models. Genomic selection
takes advantage of dense genome-wide molecular markers
by simultaneously fitting effects to all markers and avoiding
statistical testing of individual markers. By using these GS
models, breeders are
able to predict the performance of new experimental lines
at early generations and generate suggested crosses and
selections based on the model predictions (Jannink et al.,
2010). Combined with a fast turnaround on generations,
selection based on predicted breeding values determined
by marker data provided by GBS could greatly increase
gains in plant breeding programs (Meuwissen et al., 2001;
Jannink et al., 2010).
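To make the idea of fitting effects to all markers at once more tangible, the sketch below uses ridge regression, which is assumed here as one familiar way of doing so; the text does not prescribe a particular model. The marker coding (-1/1), the shrinkage value, the function names, and the toy data are hypothetical, and the solver is a plain-Python Gauss-Jordan routine kept small for readability rather than efficiency.

```python
def fit_ridge(X, y, lam):
    """Estimate all marker effects jointly by ridge regression.

    X: one row per phenotyped line, markers coded -1/1 (assumed coding)
    y: phenotypes; lam: shrinkage parameter, must be > 0
    Returns (mean_phenotype, marker_effects).
    """
    if lam <= 0:
        raise ValueError("lam must be positive so the system is solvable")
    n, p = len(X), len(X[0])
    mean_y = sum(y) / n
    yc = [v - mean_y for v in y]
    # Augmented system [X'X + lam*I | X'yc]
    A = []
    for j in range(p):
        row = [sum(X[i][j] * X[i][k] for i in range(n)) + (lam if j == k else 0.0)
               for k in range(p)]
        row.append(sum(X[i][j] * yc[i] for i in range(n)))
        A.append(row)
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    effects = [A[j][p] / A[j][j] for j in range(p)]
    return mean_y, effects


def predict(X, mean_y, effects):
    """Predicted value for each line: mean plus the sum of its marker effects."""
    return [mean_y + sum(x * b for x, b in zip(row, effects)) for row in X]


# Toy training set: 4 phenotyped lines, 3 markers (hypothetical data)
train_X = [[1, 1, -1], [1, -1, 1], [-1, 1, 1], [-1, -1, -1]]
train_y = [5.0, 3.0, 4.0, 2.0]
mean_y, effects = fit_ridge(train_X, train_y, lam=1.0)

# Unphenotyped selection candidates are ranked on predicted values alone
candidates = [[1, 1, 1], [-1, -1, 1]]
print(effects)
print(predict(candidates, mean_y, effects))
```

No marker is tested for significance; every marker receives a shrunken effect, and candidates are ranked from genotype data alone, which is what allows selection at early generations.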
The advantage of GBS for GS in breeding programs
is the low per-sample cost needed for generating tens
of thousands to hundreds of thousands of molecular
markers. Poland et al. (2012b) have demonstrated the
suitability of GBS markers for developing GS models in
the complex wheat genome. They were able to demonstrate
prediction accuracies for yield and other agronomic
traits that are high enough to be suitable for breeding
applications. The GBS markers also showed a significant
improvement in the attained prediction accuracy over a
previously used array of hybridization-based markers. The
important finding of this work is its practical implication
for breeding. The training population was genotyped
without a priori knowledge of the population or SNPs and
per-sample cost was below $20 (Poland et al., 2012b).
Putting Genotyping-by-Sequencing to Work
Looking forward, high-density markers from NGS
will soon be applied to almost every genomic question.
These marker datasets are low cost and dynamic,
with data and genotyping results getting more robust
and economical each year. Genotyping-by-sequencing
has been shown to be a valid tool for genetic mapping
(Baird et al., 2008; Elshire et al., 2011; Poland et al.,
2012a), breeding applications (Poland et al., 2012b), and
diversity studies (Fu, 2012; Lu et al., 2012). The ability
to quickly generate robust datasets without considerable
prior effort for marker discovery is quickly dispelling
issues that have plagued researchers working with
obscure or foreign species: a lack of defined and specific
genetic tools for genome analysis (Allendorf et al., 2010).
Figure 4. Removal of missing data in genotyping-by-sequencing by increasing coverage of the library via resequencing. In a set of
international wheat breeding germplasm, several lines (samples) were replicated across two or more libraries. Replicating a sample
two times increased the coverage of single nucleotide polymorphisms (SNPs) to 60% while five replications increased the coverage to
over 90%. While very effective as a means to remove missing data, replicated sequencing increases the per-sample cost. The average
per-sample cost is $15. In this situation for wheat, the number of replications is roughly equivalent to the sequencing coverage of the
library (i.e., 5 replications give approximately 5x coverage). Data from J. Poland (unpublished data, 2012).
Genotyping-by-sequencing is an ideal platform for studies
ranging from quickly identifying single gene markers
to whole genome profiling of association panels.
Perhaps one of the most exciting applications of
GBS will be in the field of plant breeding. Theoretical
and preliminary studies on genomic selection show
great promise for accelerating the rate of developing new
improved varieties. Genotyping-by-sequencing is providing
a rapid and low-cost tool for genotyping these populations,
allowing breeders to implement genomic selection
on a large scale in their breeding programs. Current
developments in sequencing output will drive per-sample
cost below $10. Furthermore, there is no requirement for a
priori knowledge of the species as the GBS methods have
been shown to be robust across a range of species and SNP
discovery and genotyping are completed together. This
is a very important feature for moving genomics-assisted
breeding into orphan crops with understudied genomes
and commercial crops with large and complex genomes.
Challenges remaining include data management as well
as computational constraints on huge datasets, though the
future looks promising. Genomic selection via GBS stands
to be a major supplement to traditional crop development.
The potential for GBS data to improve breeding systems
through GS is enormous.
The application of sequence-based genotyping for
a whole range of diversity and genomic studies will
have an important place well into the future. Driven
by applications across the whole spectrum of human,
microbial, plant, and animal genomics, developments in
NGS and genomics platforms must be put to use for plant
breeding and genetics studies.
Acknowledgments
USDA-ARS and the USDA-NIFA funded Triticeae Coordinated
Agriculture Project (T-CAP) (2011-68002-30029) provided support for
T. Rife. This manuscript was greatly improved by the helpful comments
of two anonymous reviewers. Mention of trade names or commercial
products in this publication is solely for the purpose of providing specific
information and does not imply recommendation or endorsement by the
U.S. Department of Agriculture. USDA is an equal opportunity provider
and employer.
References
Allendorf, F.W., P.A. Hohenlohe, and G. Luikart. 2010. Genomics and
the future of conservation genetics. Nat. Rev. Genet. 11:697–709.
doi:10.1038/nrg2844
Altshuler, D., V.J. Pollara, C.R. Cowles, W.J. Van Etten, J. Baldwin, L.
Linton, and E.S. Lander. 2000. An SNP map of the human genome
generated by reduced representation shotgun sequencing. Nature
407:513–516. doi:10.1038/35035083
Andolfatto, P., D. Davison, D. Erezyilmaz, T.T. Hu, J. Mast, T. Sunayama-
Morita, and D.L. Stern. 2011. Multiplexed shotgun genotyping
for rapid and efficient genetic mapping. Genome Res. 21:610–617.
doi:10.1101/gr.115402.110
Arumuganathan, K., and E.D. Earle. 1991. Nuclear DNA content of some
important plant species. Plant Mol. Biol. Rep. 9:415–415.
Ashelford, K., M.E. Eriksson, C.M. Allen, R. D’Amore, M. Johansson,
P. Gould, S. Kay, A.J. Millar, N. Hall, and A. Hall. 2011. Full
genome re-sequencing reveals a novel circadian clock mutation in
Arabidopsis. Genome Biol. 12:R28. doi:10.1186/gb-2011-12-3-r28
Baird, N.A., P.D. Etter, T.S. Atwood, M.C. Currey, A.L. Shiver, Z.A.
Lewis, E.U. Selker, W.A. Cresko, and E.A. Johnson. 2008. Rapid
SNP discovery and genetic mapping using sequenced RAD markers.
PLoS ONE 3:e3376. doi:10.1371/journal.pone.0003376
Breiman, L. 2001. Random forests. Mach. Learn. 45:5–32.
doi:10.1023/A:1010933404324
Browning, S.R., and B.L. Browning. 2007. Rapid and accurate haplotype
phasing and missing-data inference for whole-genome association
studies by use of localized haplotype clustering. Am. J. Hum. Genet.
81:1084–1097. doi:10.1086/521987
Byers, R.L., D.B. Harker, S.M. Yourstone, P.J. Maughan, and J.A. Udall.
2012. Development and mapping of SNP assays in allotetraploid
cotton. Theor. Appl. Genet. 124:1201–1214. doi:10.1007/s00122-011-1780-8
Chia, J.-M., C. Song, P.J. Bradbury, D. Costich, N. de Leon, J. Doebley,
R.J. Elshire, B. Gaut, L. Geller, J.C. Glaubitz, M. Gore, K.E. Guill, J.
Holland, M.B. Hufford, J. Lai, M. Li, X. Liu, Y. Lu, R. McCombie,
R. Nelson, J. Poland, B.M. Prasanna, T. Pyhäjärvi, T. Rong, R.S.
Sekhon, Q. Sun, M.I. Tenaillon, F. Tian, J. Wang, X. Xu, Z. Zhang,
S.M. Kaeppler, J. Ross-Ibarra, M.D. McMullen, E.S. Buckler, G.
Zhang, Y. Xu, and D. Ware. 2012. Maize HapMap2 identifies
extant variation from a genome in flux. Nat. Genet. 44:803–807.
doi:10.1038/ng.2313
Cockram, J., J. White, D.L. Zuluaga, D. Smith, J. Comadran, M. Macaulay,
Z. Luo, M.J. Kearsey, P. Werner, D. Harrap, C. Tapsell, H. Liu, P.E.
Hedley, N. Stein, D. Schulte, B. Steuernagel, D.F. Marshall, W.T.B.
Thomas, L. Ramsay, I. Mackay, D.J. Balding, R. Waugh, and D.M.
O’Sullivan. 2010. Genome-wide association mapping to candidate
polymorphism resolution in the unsequenced barley genome. Proc.
Natl. Acad. Sci. USA 107:21611–21616. doi:10.1073/pnas.1010179107
Craig, D.W., J.V. Pearson, S. Szelinger, A. Sekar, M. Redman, J.J.
Corneveaux, T.L. Pawlowski, T. Laub, G. Nunn, D.A. Stephan,
N. Homer, and M.J. Huentelman. 2008. Identification of genetic
variants using bar-coded multiplexed sequencing. Nat. Methods
5:887–893. doi:10.1038/nmeth.1251
Cronn, R., A. Liston, M. Parks, D.S. Gernandt, R. Shen, and T. Mockler.
2008. Multiplex sequencing of plant chloroplast genomes using
Solexa sequencing-by-synthesis technology. Nucleic Acids Res.
36:e122. doi:10.1093/nar/gkn502
Davey, J.W., P.A. Hohenlohe, P.D. Etter, J.Q. Boone, J.M. Catchen, and
M.L. Blaxter. 2011. Genome-wide genetic marker discovery and
genotyping using next-generation sequencing. Nat. Rev. Genet.
12:499–510. doi:10.1038/nrg3012
Deschamps, S., M. la Rota, J.P. Ratashak, P. Biddle, D. Thureen, A. Farmer,
S. Luck, M. Beatty, N. Nagasawa, L. Michael, V. Llaca, H. Sakai, G.
May, J. Lightner, and M.A. Campbell. 2010. Rapid genome-wide
single nucleotide polymorphism discovery in soybean and rice
via deep resequencing of reduced representation libraries with
the Illumina genome analyzer. Plant Gen. 3:53–68.
doi:10.3835/plantgenome2009.09.0026
Elshire, R.J., J.C. Glaubitz, Q. Sun, J.A. Poland, K. Kawamoto, E.S.
Buckler, and S.E. Mitchell. 2011. A robust, simple genotyping-by-
sequencing (GBS) approach for high diversity species. PLoS ONE
6:e19379. doi:10.1371/journal.pone.0019379
Fu, Y.-B. 2012. Genotyping-by-sequencing: A case study in barley. Workshop
presented at: Genomics of Genebanks. Plant and Animal Genome
Conference XX, San Diego, CA. 14–18 Jan. 2012. Workshop W362.
Futschik, A., and C. Schlötterer. 2010. The next generation of molecular
markers from massively parallel sequencing of pooled DNA
samples. Genetics 186:207–218. doi:10.1534/genetics.110.114397
Gan, X., O. Stegle, J. Behr, J.G. Steffen, P. Drewe, K.L. Hildebrand,
R. Lyngsoe, S.J. Schultheiss, E.J. Osborne, V.T. Sreedharan, A.
Kahles, R. Bohnert, G. Jean, P. Derwent, P. Kersey, E.J. Belfield,
N.P. Harberd, E. Kemen, C. Toomajian, P.X. Kover, R.M. Clark,
G. Rätsch, and R. Mott. 2011. Multiple reference genomes and
transcriptomes for Arabidopsis thaliana. Nature 477:419–423.
doi:10.1038/nature10414
Geraldes, A., J. Pang, N. Thiessen, T. Cezard, R. Moore, Y. Zhao, A. Tam,
S. Wang, M. Friedmann, I. Birol, S.J.M. Jones, Q.C.B. Cronk, and
C.J. Douglas. 2011. SNP discovery in black cottonwood (Populus
trichocarpa) by population transcriptome resequencing. Mol. Ecol.
Resour. 11:81–92. doi:10.1111/j.1755-0998.2010.02960.x
Gore, M.A., J.M. Chia, R.J. Elshire, Q. Sun, E.S. Ersoz, B.L. Hurwitz, J.A.
Peiffer, M.D. McMullen, G.S. Grills, and J. Ross-Ibarra. 2009a. A
first-generation haplotype map of maize. Science 326:1115–1117.
doi:10.1126/science.1177837
Gore, M.A., M.H. Wright, E.S. Ersoz, P. Bouffard, E.S. Szekeres, T.P.
Jarvie, B.L. Hurwitz, A. Narechania, T.T. Harkins, G.S. Grills,
D.H. Ware, and E.S. Buckler. 2009b. Large-scale discovery
of gene-enriched SNPs. Plant Gen. 2:121–133.
doi:10.3835/plantgenome2009.01.0002
Hamblin, M.T., T.J. Close, P.R. Bhat, S. Chao, J.G. Kling, K.J. Abraham,
T. Blake, W.S. Brooks, B. Cooper, C.A. Griffey, P.M. Hayes, D.J.
Hole, R.D. Horsley, D.E. Obert, K.P. Smith, S.E. Ullrich, G.J.
Muehlbauer, and J.-L. Jannink. 2010. Population structure and
linkage disequilibrium in U.S. barley germplasm: Implications
for association mapping. Crop Sci. 50:556–566.
doi:10.2135/cropsci2009.04.0198
Harper, A.L., M. Trick, J. Higgins, F. Fraser, L. Clissold, R. Wells,
C. Hattori, P. Werner, and I. Bancroft. 2012. Associative
transcriptomics of traits in the polyploid crop species Brassica
napus. Nat. Biotechnol. 30:798–802. doi:10.1038/nbt.2302
Heffner, E.L., J.-L. Jannink, and M.E. Sorrells. 2011. Genomic selection
accuracy using multifamily prediction models in a wheat breeding
program. Plant Gen. 4:65–75. doi:10.3835/plantgenome.2010.12.0029
Hohenlohe, P.A., S.J. Amish, J.M. Catchen, F.W. Allendorf, and G.
Luikart. 2011. Next-generation RAD sequencing identifies
thousands of SNPs for assessing hybridization between rainbow
and westslope cutthroat trout. Mol. Ecol. Resour. 11:117–122.
doi:10.1111/j.1755-0998.2010.02967.x
Huang, X., Q. Feng, Q. Qian, Q. Zhao, L. Wang, A. Wang, J. Guan, D.
Fan, Q. Weng, T. Huang, G. Dong, T. Sang, and B. Han. 2009. High-
throughput genotyping by whole-genome resequencing. Genome
Res. 19:1068–1076. doi:10.1101/gr.089516.108
Huang, X., X. Wei, T. Sang, Q. Zhao, Q. Feng, Y. Zhao, C. Li, C. Zhu, T.
Lu, Z. Zhang, M. Li, D. Fan, Y. Guo, A. Wang, L. Wang, L. Deng, W.
Li, Y. Lu, Q. Weng, K. Liu, T. Huang, T. Zhou, Y. Jing, W. Li, Z. Lin,
E.S. Buckler, Q. Qian, Q.-F. Zhang, J. Li, and B. Han. 2010. Genome-
wide association studies of 14 agronomic traits in rice landraces.
Nat. Genet. 42:961–967. doi:10.1038/ng.695
Hyten, D.L., Q. Song, E.W. Fickus, C.V. Quigley, J.-S. Lim, I.-Y. Choi,
E.-Y. Hwang, M. Pastor-Corrales, and P.B. Cregan. 2010. High-
throughput SNP discovery and assay development in common bean.
BMC Genomics 11:475. doi:10.1186/1471-2164-11-475
International Barley Sequencing Consortium. 2012. A physical, genetic and
functional sequence assembly of the barley genome. Nature (in press).
Jannink, J.-L., A.J. Lorenz, and H. Iwata. 2010. Genomic selection in
plant breeding: From theory to practice. Briefings Funct. Genomics
9:166–177. doi:10.1093/bfgp/elq001
Jiao, Y., H. Zhao, L. Ren, W. Song, B. Zeng, J. Guo, B. Wang, Z. Liu, J.
Chen, W. Li, M. Zhang, S. Xie, and J. Lai. 2012. Genome-wide
genetic changes during modern breeding of maize. Nat. Genet.
44:812–815. doi:10.1038/ng.2312
Lu, F., A.E. Lipka, R.J. Elshire, J. Glaubitz, J. Cherney, M. Casler, E.S. Buckler,
and D. Costich. 2012. Characterization of the genetic diversity of
switchgrass using genotyping by sequencing. Poster presented at: Poster
Session – Even Numbers. Plant and Animal Genome Conference XX,
San Diego, CA. 14–18 Jan. 2012. Poster P0195.
Marchini, J., B. Howie, S. Myers, G. McVean, and P. Donnelly. 2007. A
new multipoint method for genome-wide association studies by
imputation of genotypes. Nat. Genet. 39:906–913.
doi:10.1038/ng2088
Mardis, E.R. 2008. The impact of next-generation sequencing technology
on genetics. Trends Genet. 24:133–141. doi:10.1016/j.tig.2007.12.007
Metzker, M. 2009. Sequencing technologies – The next generation. Nat.
Rev. Genet. 11:31–46. doi:10.1038/nrg2626
Meuwissen, T.H.E., B.J. Hayes, and M.E. Goddard. 2001. Prediction of
total genetic value using genome-wide dense marker maps. Genetics
157:1819–1829.
Miller, M.R., J.P. Dunham, A. Amores, W.A. Cresko, and E.A. Johnson.
2007. Rapid and cost-effective polymorphism identification and
genotyping using restriction site associated DNA (RAD) markers.
Genome Res. 17:240–248. doi:10.1101/gr.5681207
Monson-Miller, J., D.C. Sanchez-Mendez, J. Fass, I.M. Henry, T.H. Tai,
and L. Comai. 2012. Reference genome-independent assessment of
mutation density using restriction enzyme-phased sequencing. BMC
Genomics 13:72.
Morrell, P.L., E.S. Buckler, and J. Ross-Ibarra. 2011. Crop genomics:
Advances and applications. Nat. Rev. Genet. 13:85–96.
Myles, S., J. Peiffer, P.J. Brown, E.S. Ersoz, Z. Zhang, D.E. Costich, and
E.S. Buckler. 2009. Association mapping: Critical considerations
shift from genotyping to experimental design. Plant Cell 21:2194–
2202. doi:10.1105/tpc.109.068437
Nelson, J.C., S. Wang, Y. Wu, X. Li, G. Antony, F.F. White, and J. Yu. 2011.
Single-nucleotide polymorphism discovery by high-throughput
sequencing in sorghum. BMC Genomics 12:352.
doi:10.1186/1471-2164-12-352
Nielsen, R., J.S. Paul, A. Albrechtsen, and Y.S. Song. 2011. Genotype and
SNP calling from next-generation sequencing data. Nat. Rev. Genet.
12:443–451. doi:10.1038/nrg2986
Peterson, B.K., J.N. Weber, E.H. Kay, H.S. Fisher, and H.E. Hoekstra.
2012. Double digest RADseq: An inexpensive method for de novo
SNP discovery and genotyping in model and non-model species.
PLoS One 7:e37135.
Poland, J.A., P.J. Brown, M.E. Sorrells, and J.-L. Jannink. 2012a.
Development of high-density genetic maps for barley and wheat
using a novel two-enzyme genotyping-by-sequencing approach.
PLoS ONE 7:e32253. doi:10.1371/journal.pone.0032253
Poland, J., J. Endelman, J. Dawson, J. Rutkoski, S. Wu, Y. Manes, S.
Dreisigacker, J. Crossa, H. Sanchez-Villeda, M. Sorrells, and
J.-L. Jannink. 2012b. Genomic selection in wheat breeding using
genotyping-by-sequencing. Plant Gen. (in press).
doi:10.3835/plantgenome2012.06.0006
Purcell, S., B. Neale, K. Todd-Brown, L. Thomas, M.A.R. Ferreira, D.
Bender, J. Maller, P. Sklar, P.I.W. de Bakker, M.J. Daly, and P.C.
Sham. 2007. PLINK: A tool set for whole-genome association and
population-based linkage analyses. Am. J. Hum. Genet. 81:559–575.
doi:10.1086/519795
Stanzione, D. 2011. The iPlant collaborative: Cyberinfrastructure to feed
the world. Computer 44:44–52. doi:10.1109/MC.2011.297
Truong, H.T., A.M. Ramos, F. Yalcin, M. de Ruiter, H.J.A. van der
Poel, K.H.J. Huvenaars, R.C.J. Hogers, L.J.G. van Enckevort, A.
Janssen, N.J. van Orsouw, and M.J.T. van Eijk. 2012. Sequence-
based genotyping for marker discovery and co-dominant scoring
in germplasm and populations. PLoS ONE 7:e37565.
doi:10.1371/journal.pone.0037565
van Orsouw, N.J., R.C.J. Hogers, A. Janssen, F. Yalcin, S. Snoeijers,
E. Verstege, H. Schneiders, H. van der Poel, J. van Oeveren, H.
Verstegen, and M.J.T. van Eijk. 2007. Complexity reduction of
polymorphic sequences (CRoPS): A novel approach for large-scale
polymorphism discovery in complex genomes. PLoS ONE 2:e1172.
doi:10.1371/journal.pone.0001172
van Tassell, C.P., T.P.L. Smith, L.K. Matukumalli, J.F. Taylor, R.D.
Schnabel, C.T. Lawley, C.D. Haudenschild, S.S. Moore, W.C.
Warren, and T.S. Sonstegard. 2008. SNP discovery and allele
frequency estimation by deep sequencing of reduced representation
libraries. Nat. Methods 5:247–252. doi:10.1038/nmeth.1185
Wang, S., E. Meyer, J.K. McKay, and M.V. Matz. 2012. 2b-RAD: A simple
and flexible method for genome-wide genotyping. Nat. Methods
9:808–810. doi:10.1038/nmeth.2023
Wang, X., H. Wang, J. Wang, R. Sun, J. Wu, S. Liu, et al. 2011. The genome
of the mesopolyploid crop species Brassica rapa. Nat. Genet.
43:1035–1039. doi:10.1038/ng.919
Wetterstrand, K.A. 2012. DNA sequencing costs: Data from the NHGRI
large-scale genome sequencing program. National Human Genome
Research Institute, Bethesda, MD.
http://www.genome.gov/sequencingcosts (accessed 5 Mar. 2012).
Wiedmann, R.T., T.P.L. Smith, and D.J. Nonneman. 2008. SNP
discovery in swine by reduced representation and high throughput
pyrosequencing. BMC Genet. 9:81. doi:10.1186/1471-2156-9-81
Xie, W., Q. Feng, H. Yu, X. Huang, Q. Zhao, Y. Xing, S. Yu, B. Han, and
Q. Zhang. 2010. Parent-independent genotyping for constructing
an ultrahigh-density linkage map based on population sequencing.
Proc. Natl. Acad. Sci. USA 107:10578–10583.
doi:10.1073/pnas.1005931107
Xu, X., X. Liu, S. Ge, J.D. Jensen, F. Hu, X. Li, Y. Dong, R.N. Gutenkunst,
L. Fang, L. Huang, J. Li, W. He, G. Zhang, X. Zheng, F. Zhang, Y. Li,
C. Yu, K. Kristiansen, X. Zhang, J. Wang, M. Wright, S. McCouch,
R. Nielsen, J. Wang, and W. Wang. 2012. Resequencing 50
accessions of cultivated and wild rice yields markers for identifying
agronomically important genes. Nat. Biotechnol. 30:105–111.
doi:10.1038/nbt.2050
Xu, X., S. Pan, S. Cheng, B. Zhang, D. Mu, P. Ni, et al. 2011. Genome
sequence and analysis of the tuber crop potato. Nature 475:189–195.
doi:10.1038/nature10158
Yang, H., Y. Tao, Z. Zheng, C. Li, M. Sweetingham, and J. Howieson.
2012. Application of next-generation sequencing for rapid marker
development in molecular plant breeding: A case study on
anthracnose disease resistance in Lupinus angustifolius L. BMC
Genomics 13:318. doi:10.1186/1471-2164-13-318
You, F.M., N. Huo, K.R. Deal, Y.Q. Gu, M.-C. Luo, P.E. McGuire, J.
Dvorak, and O.D. Anderson. 2011. Annotation-based genome-wide
SNP discovery in the large and complex Aegilops tauschii genome
using next-generation sequencing without a reference genome
sequence. BMC Genomics 12:59. doi:10.1186/1471-2164-12-59
... The development of molecular biology methods, such as high-throughput sequencing, has made it possible to accurately comprehend T. odorum Chun's genetic diversity and population structure. Genotyping-by-sequencing (GBS) is a simplified genome sequencing technology based on second-generation high-throughput sequencing that detects a large number of single nucleotide polymorphism (SNP) loci rapidly, simply, and at a low cost [11]. This technique can analyze and type SNPs in non-model organisms to explore genomic variation without a reference genome [10][11][12][13][14]. GBS has been successfully applied in various fields, including the construction of genetic maps for plant populations [15,16], the development of molecular markers [17], the analysis of population genetic diversity and structure [18][19][20], and genome-wide association analysis [21]. ...
... Genotyping-by-sequencing (GBS) is a simplified genome sequencing technology based on second-generation high-throughput sequencing that detects a large number of single nucleotide polymorphism (SNP) loci rapidly, simply, and at a low cost [11]. This technique can analyze and type SNPs in non-model organisms to explore genomic variation without a reference genome [10][11][12][13][14]. GBS has been successfully applied in various fields, including the construction of genetic maps for plant populations [15,16], the development of molecular markers [17], the analysis of population genetic diversity and structure [18][19][20], and genome-wide association analysis [21]. ...
Article
Full-text available
Tsoongiodendron odorum Chun is a large evergreen tree in the Magnoliaceae family and an ancient relict species represented by small wild populations. It has excellent material quality, high ornamental value, and scientific significance. However, due to the complicated natural reproduction and notable habitat destruction, its wild populations must be urgently conserved. We used genotyping-by-sequencing to examine 17 natural populations of T. odorum in China, the species’ primary habitat, to better understand the genetic diversity of this species and use its germplasm resources. T. odorum had a very low level of genetic diversity; its mean values for Ho, He, Pi, and PIC were 0.175, 0.123, 0.160, and 0.053, respectively. With an average within-population Fst of 0.023 and an inter-population gene flow Nm of 10.918, population genetic variation was primarily found within populations, demonstrating minute genetic divergence between populations. The 17 natural populations of T. odorum were divided into two major categories: the Fujian populations in eastern China and the Jiangxi, Guangdong, Hunan, and Guangxi populations in central and western China. Our research contributes to the understanding of T. odorum’s genetic diversity and organization and offers a theoretical framework for the species’ conservation, breeding, and selection.
... Genotyping-by-sequence (GBS) technology represented one of the streamlined genome sequencing methodologies, aiming to reduce genome complexity through the application of restriction enzymes and the incorporation of single nucleotide polymorphisms (SNPs) [20]. This technique yielded a substantial number of SNPs, which were harnessed for exploring interspecies diversity, constructing haplotype maps, conducting genome-wide association studies, and facilitating genome selection [21]. Notably, GBS offered the advantage of requiring fewer steps for database construction and enabled the establishment of databases for a large number of samples [22]. ...
Article
Full-text available
Background Sassafras tzumu , an elegant deciduous arboreal species, belongs to the esteemed genus Sassafras within the distinguished family Lauraceae. With its immense commercial value, escalating market demands and unforeseen human activities within its natural habitat have emerged as new threats to S. tzumu in recent decades, so it is necessary to study its genetic diversity and influencing factors, to propose correlative conservation strategies. Results By utilizing genotyping-by-sequence (GBS) technology, we acquired a comprehensive database of single nucleotide polymorphisms (SNPs) from a cohort of 106 individuals sourced from 13 diverse Sassafras tzumu natural populations, scattered across various Chinese mountainous regions. Through our meticulous analysis, we aimed to unravel the intricate genetic diversity and structure within these S. tzumu populations, while simultaneously investigating the various factors that potentially shape genetic distance. Our preliminary findings unveiled a moderate level of genetic differentiation ( F ST = 0.103, p < 0.01), accompanied by a reasonably high genetic diversity among the S. tzumu populations. Encouragingly, our principal component analysis painted a vivid picture of two distinct genetic and geographical regions across China, where gene flow appeared to be somewhat restricted. Furthermore, employing the sophisticated multiple matrix regression with randomization (MMRR) analysis method, we successfully ascertained that environmental distance exerted a more pronounced impact on genetic distance when compared to geographical distance ( β E = 0.46, p < 0.01; β D = 0.16, p < 0.01). This intriguing discovery underscores the potential significance of environmental factors in shaping the genetic landscape of S. tzumu populations. Conclusions The genetic variance among populations of S. tzumu in our investigation exhibited a moderate degree of differentiation, alongside a heightened level of genetic diversity. 
The environmental distance of S. tzumu had a greater impact on its genetic diversity than geographical distance. It is of utmost significance to formulate and implement meticulous management and conservation strategies to safeguard the invaluable genetic resources of S. tzumu .
... While GS has been extensively implemented in livestock breeding, it is still in the development stage in plants" [79]. Encouraging empirical results are being increasingly made available from a wide range of crops, including maize, barley, wheat, soybean, sugar beet, etc [83][84][85][86][87]. "Regarding provitamin A biofortification, [88] used approximately 200 maize lines for prediction analysis using three statistical approaches: RR-BLUP, LASSO, and EN. ...
Article
Biofortification, the process of enhancing the nutritional content of crops, offers a promising strategy to combat hidden hunger—micronutrient deficiencies affecting over two billion people globally. This review article explores the biofortification of major crops, focusing on both conventional breeding techniques and modern biotechnological approaches. Conventional methods, such as selective breeding and crossbreeding, have been instrumental in increasing the levels of essential micronutrients like iron (Fe) and zinc (Zn) in staple crops such as wheat, rice, and maize. For instance, wild relatives of cultivated wheat, including Triticum dicoccoides and Aegilops tauschii, have been utilized to significantly enhance Fe and Zn content in modern cultivars. Advancements in biotechnological tools, including genetic engineering, marker-assisted selection (MAS), and genome editing (CRISPR/Cas9), have further accelerated the development of biofortified crops. These technologies enable precise modifications to increase the accumulation of micronutrients and improve nutrient bioavailability. For example, transgenic rice varieties enriched with β-carotene (Golden Rice) and enhanced Fe and Zn content through gene editing showcase the potential of biotechnology in addressing micronutrient deficiencies. The review also highlights ongoing efforts and challenges in the field, such as regulatory hurdles, public acceptance, and the need for comprehensive strategies integrating conventional and modern approaches. Furthermore, it discusses the role of international research organizations and collaborations in facilitating the development and dissemination of biofortified crops. In conclusion, combining conventional breeding with cutting-edge biotechnological innovations presents a robust approach to biofortify major crops, offering a sustainable solution to mitigate hidden hunger and improve global food security. 
Continued research and multi-disciplinary collaborations are essential to fully realize the potential of biofortification in enhancing human nutrition.
... The genotyping-by-sequencing (GBS) technique is based on the complexity reduction of genomic DNA by restriction enzymes and on the use of barcode DNA adapters to produce multiplexed libraries of samples that are submitted to next-generation sequencing (NGS) (Poland and Rife 2012). With this combination, the technique has demonstrated the ability to produce thousands of SNPs in several species, including fruit trees (Goonetilleke et al. 2018). ...
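The barcoding step mentioned in the excerpt above can be illustrated with a minimal sketch of demultiplexing, assuming inline barcodes at the start of each read. The function name `demultiplex`, the barcode sequences, and the sample names are hypothetical and chosen only for illustration; production GBS pipelines also check the restriction-site remnant and tolerate sequencing errors, which this sketch does not.

```python
from typing import Dict, Iterable, List


def demultiplex(reads: Iterable[str], barcode_to_sample: Dict[str, str]) -> Dict[str, List[str]]:
    """Assign each read to a sample by its leading inline barcode (exact match only)."""
    bins: Dict[str, List[str]] = {sample: [] for sample in barcode_to_sample.values()}
    bins["unassigned"] = []
    # Try longer barcodes first so that a short barcode which is a prefix
    # of a longer one cannot claim the longer barcode's reads.
    ordered = sorted(barcode_to_sample, key=len, reverse=True)
    for read in reads:
        for barcode in ordered:
            if read.startswith(barcode):
                # Strip the barcode and keep the genomic part of the read.
                bins[barcode_to_sample[barcode]].append(read[len(barcode):])
                break
        else:
            bins["unassigned"].append(read)
    return bins


# Hypothetical toy data: two barcoded samples and one read with no known barcode.
barcodes = {"ACGT": "sample_A", "TTGCA": "sample_B"}
reads = ["ACGTTGCAGGATC", "TTGCATGCAGCCAA", "GGGGTGCAGTTTT"]
result = demultiplex(reads, barcodes)
```

Barcodes of different lengths are used here on purpose, since variable-length barcodes are one reason a longest-first match order is a sensible default.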
Article
Umbu (Spondias tuberosa Arruda) is an endemic fruit tree restricted to the Brazilian seasonally dry tropical forest called Caatinga. This study aimed to evaluate the structure and genomic diversity of umbu trees from seven locations in the Caatinga biome, distributed among four Brazilian states. Using genotyping-by-sequencing (GBS), a total of 5,336 SNPs were obtained, of which 250 showed outlier behavior. Therefore, 5,086 neutral SNPs were used for population structure and genetic diversity analyses. Both discriminant analysis of principal components (DAPC) and neighbor-joining cluster analyses classified the accessions into four groups, with a genetic structure observed among groups, disagreeing with our initial hypothesis of low genetic structure between locations. Isolation by distance (r² = 0.974; p = 0.0015) was detected. Moderate to high levels of genetic diversity were found, with the average observed heterozygosity (HO = 0.221) higher than the expected heterozygosity (HE = 0.199) and with negative inbreeding coefficient (FIS) values. Most genetic variation was found within locations, although high diversity between locations (22.1%) was observed. The results obtained are important for understanding the levels and distribution of genetic variation, suggesting that most locations are priorities for conservation actions, contributing with different alleles to the species' gene pool in Brazil.
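The negative inbreeding coefficient reported above follows from observed heterozygosity exceeding expected heterozygosity. As a rough check, the sketch below applies the standard definition F_IS = 1 - H_O / H_E to the reported averages. The abstract does not state which estimator the authors used, so the result is illustrative and need not match their per-locus or per-location values; the function name is hypothetical.

```python
def inbreeding_coefficient(h_obs: float, h_exp: float) -> float:
    """Return F_IS = 1 - H_O / H_E for one pair of heterozygosity values."""
    if h_exp <= 0:
        raise ValueError("expected heterozygosity must be positive")
    return 1.0 - h_obs / h_exp


# Averages reported in the abstract: H_O = 0.221, H_E = 0.199.
f_is = inbreeding_coefficient(0.221, 0.199)
print(round(f_is, 3))  # -0.111, i.e. an excess of heterozygotes
```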
... Ligated products were PCR-amplified to construct a GBS library, selected for 200-300 bp fragments, purified, quantified, and sequenced on a NextSeq 2000 sequencer (Illumina, San Diego, CA). Sequence reads were processed as described by Poland and Rife (2012), and a reference-based SNP calling pipeline implemented in TASSEL 3 (Bradbury et al. 2007; Glaubitz et al. 2014) was used to call SNPs. The Chinese Spring IWGSC RefSeq v2.1 sequence was used as the reference genome (IWGSC 2018; Zhu et al. 2021). ...
Article
Greenbug [Schizaphis graminum (Rondani)] is a serious insect pest that not only damages cereal crops, but also transmits several destructive viruses. The emergence of new greenbug biotypes in the field makes it urgent to identify novel greenbug resistance genes in wheat. CWI 76364 (PI 703397), a synthetic hexaploid wheat (SHW) line, exhibits greenbug resistance. Evaluation of an F2:3 population from cross OK 14319 × CWI 76364 indicated that a dominant gene, designated Gb9, conditions greenbug resistance in CWI 76364. Selective genotyping of a subset of F2 plants with contrasting phenotypes by genotyping-by-sequencing identified 25 SNPs closely linked to Gb9 on chromosome arm 7DL. Ten of these SNPs were converted to Kompetitive allele-specific polymerase chain reaction (KASP) markers for genotyping the entire F2 population. Genetic analysis delimited Gb9 to a 0.6-Mb interval flanked by KASP markers located at 599,835,668 bp (Stars-KASP872) and 600,471,081 bp (Stars-KASP881) on 7DL. Gb9 was 0.5 cM distal to Stars-KASP872 and 0.5 cM proximal to Stars-KASP881. Allelism tests indicated that Gb9 is a new greenbug resistance gene which confers resistance to greenbug biotypes C, E, H, I, and TX1. TX1 is one of the most widely virulent biotypes and has overcome most known wheat greenbug resistance genes. The introgression of Gb9 into locally adapted wheat cultivars is of economic importance, and the KASP markers developed in this study can be used to tag Gb9 in cultivar development.
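The interval size quoted above can be checked directly from the two flanking marker positions. The sketch below uses only numbers from the abstract; the variable names are illustrative.

```python
# Physical positions of the flanking KASP markers on 7DL, from the abstract.
left_marker_bp = 599_835_668   # Stars-KASP872
right_marker_bp = 600_471_081  # Stars-KASP881

interval_bp = right_marker_bp - left_marker_bp
interval_mb = interval_bp / 1_000_000

# Gb9 lies 0.5 cM from each flanking marker, so the markers span 1.0 cM.
genetic_span_cm = 0.5 + 0.5

print(interval_bp)            # 635413
print(round(interval_mb, 1))  # 0.6
print(genetic_span_cm)        # 1.0
```

The 635,413 bp difference rounds to the 0.6 Mb stated in the abstract.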
... The lines from the 2021 cohort were genotyped using genotyping-by-sequencing (GBS) (Elshire et al., 2011). 96-plex libraries were prepared according to Poland et al. (2012). Single nucleotide polymorphisms (SNPs) were called using the TASSEL5 GBSv2 (Bradbury et al., 2007; Glaubitz et al., 2014) software. ...
Article
Oats (Avena sativa L.) provide unique nutritional benefits and contribute to sustainable agricultural systems. Breeding high‐value oat varieties that meet milling industry standards is crucial for satisfying the demand for oat‐based food products. Test weight, thins, and groat percentage are primary traits that define oat milling quality and the final price of food‐grade oats. Conventional selection for milling quality is costly and burdensome. Multi‐trait genomic selection (MTGS) combines information from genome‐wide markers and secondary traits genetically correlated with primary traits to predict breeding values of primary traits on candidate breeding lines. MTGS can improve prediction accuracy and significantly accelerate the rate of genetic gain. In this study, we evaluated different MTGS models that used morphometric grain traits to improve prediction accuracy for primary grain quality traits within the constraints of a breeding program. We evaluated 558 breeding lines from the University of Illinois Oat Breeding Program across 2 years for primary milling traits, test weight, thins, and groat percentage, and secondary grain morphometric traits derived from kernel and groat images. Kernel morphometric traits were genetically correlated with test weight and thins percentage but were uncorrelated with groat percentage. For test weight and thins percentage, the MTGS model that included the kernel morphometric traits in both training and candidate sets outperformed single‐trait models by 52% and 59%, respectively. In contrast, MTGS models for groat percentage were not significantly better than the single‐trait model. We found that incorporating kernel morphometric traits can improve the genomic selection for test weight and thins percentage.
... For RILs and parents of the LR-68 and LR-70 populations, library construction was performed based on a two-enzyme system of PstI and MspI to cut genomic DNA following the protocol described by Poland and Rife (2012). DNA fragments were sequenced on an Illumina HiSeq 2500 sequencer (Illumina) using the paired-end mode (2 × 100 bp) at the DNA Sequencing Laboratory, NRC-Saskatoon. ...
Article
Plant breeders are generally reluctant to cross elite crop cultivars with their wild relatives to introgress novel desirable traits due to associated negative traits such as pod shattering. This results in a genetic bottleneck that could be reduced through better understanding of the genomic locations of the gene(s) controlling this trait. We integrated information on parental genomes, pod shattering data from multiple environments, and high‐density genetic linkage maps to identify pod shattering quantitative trait loci (QTLs) in three lentil interspecific recombinant inbred line populations. The broad‐sense heritability on a multi‐environment basis varied from 0.46 (in LR‐70, Lens culinaris × Lens odemensis) to 0.77 (in LR‐68, Lens orientalis × L. culinaris). Genetic linkage maps of the interspecific populations revealed reciprocal translocations of chromosomal segments that differed among the populations, and which were associated with reduced recombination. LR‐68 had a 2–5 translocation, LR‐70 had 1–5, 2–6, and 2–7 translocations, and LR‐86 had a 2–7 translocation in one parent relative to the other. Segregation distortion was also observed for clusters of single nucleotide polymorphisms on multiple chromosomes per population, further affecting introgression. Two major QTL, on chromosomes 4 and 7, were repeatedly detected in the three populations and contain several candidate genes. These findings will be of significant value for lentil breeders to strategically access novel superior alleles while minimizing the genetic impact of pod shattering from wild parents.
Preprint
Urbanization modifies ecosystem conditions and evolutionary processes. This includes air pollution, mostly as tropospheric ozone (O3), which contributes to the decline of urban and peri-urban forests. A notable case is the fir (Abies religiosa) forests in the peripheral mountains southwest of Mexico City, which have been severely affected by O3 pollution since the 1970s. Interestingly, some young individuals exhibiting minimal O3-related damage have been observed within a zone of significant O3 exposure. Using this setting as a natural experiment, we compared asymptomatic and symptomatic individuals of similar age (≤15 years old; n = 10) using histological, metabolomic and transcriptomic approaches. Plants were sampled during days of high (170 ppb) and moderate (87 ppb) O3 concentration. Given that there have been reforestation efforts in the region, with plants from different source populations, we first confirmed that all analysed individuals clustered within the local genetic group when compared to a species-wide panel (Admixture analysis with ~1.5K SNPs). We observed thicker epidermis and more collapsed cells in the palisade parenchyma of needles from symptomatic individuals than from their asymptomatic counterparts, with differences increasing with needle age. Furthermore, symptomatic individuals exhibited lower concentrations of various terpenes (β-pinene, β-caryophyllene oxide, α-caryophyllene and β-α-cubebene) than asymptomatic trees, as evidenced through GC-MS. Finally, transcriptomic analyses revealed differential expression for thirteen genes related to carbohydrate metabolism, plant defense, and gene regulation. Our results indicate a rapid and contrasting phenotypic response among trees, likely influenced by standing genetic variation and/or plastic mechanisms. They open the door to future evolutionary studies for understanding how O3 tolerance develops in urban environments, and how this knowledge could contribute to forest restoration.
Chapter
Plants face diverse environmental challenges, including water scarcity, salinity, extreme temperatures, and nutrient deficiency, impeding their growth and productivity. Plants develop molecular mechanisms for adaptation to survive such conditions. Understanding these processes is pivotal for enhancing plant tolerance. This chapter thoroughly investigates the challenges and strategies for improving plant growth and development in stressful environments, delving into the mechanisms of plant adaptation to these stressors. Additionally, it identifies abiotic stress candidate genes for stress tolerance, which are crucial for developing stress-resistant crops. Moreover, the chapter underscores the paramount importance of implementing strategies to ensure food security amid a growing global population and increased environmental abiotic stress. It highlights the critical role of investigating plant responses to abiotic stress in addressing global food security challenges amid climate change.
Article
Climate change is one of the most pressing challenges facing the world, with profound implications for ecosystems, biodiversity, and human societies. The necessity to monitor, comprehend, and mitigate climate change impacts has spurred the emergence of biomarkers as essential tools in ecological and environmental research. This study investigates global knowledge in biomarker research within the context of climate change. Its goal is to offer valuable insights to both researchers and practitioners, thereby guiding the development of well-informed decisions. The analysis encompassed a performance evaluation aimed at scrutinizing both quantitative and qualitative indicators. Visualization techniques utilizing VOSviewer software were deployed to analyze collaboration patterns, co-citation links among prominent knowledge-sharing platforms, and key topics derived from keyword co-occurrence matrices. Globally, a total of 1045 relevant documents were identified and analyzed. The United States stands out as the top contributor (261 documents; 25.0%) with Chinese Academy of Sciences, China, as the most prolific institution (72 documents; 6.9%). Key trends were related to developing and utilizing novel biomarkers based on advancements in omics and nano-based technologies, bioinformatics and data analytics benefiting from machine learning and artificial intelligence tools, and the significance of integrative approaches that merge biomarkers with remote sensing data and ecological models. These advancements contribute to boosting predictive capabilities, precise sensing, and the effective identification of patterns within massive datasets that ultimately improve climate change monitoring and mitigation in ecosystems. Progress in this field demands interdisciplinary collaborations, international cooperation, the establishment of long-term monitoring programs, the creation of biomarker databases, and investment in emerging biomarker technologies. 
Education and outreach initiatives, accompanied by adequate funding and resources, are critical for advancing biomarker research.
Article
Genomic selection (GS) uses genome-wide molecular marker data to predict the genetic value of selection candidates in breeding programs. In plant breeding, the ability to produce large numbers of progeny per cross allows GS to be conducted within each family. However, this approach requires phenotypes of lines from each cross before conducting GS. This will prolong the selection cycle and may result in lower gains per year than approaches that estimate marker-effects with multiple families from previous selection cycles. In this study, phenotypic selection (PS), conventional marker-assisted selection (MAS), and GS prediction accuracy were compared for 13 agronomic traits in a population of 374 winter wheat (Triticum aestivum L.) advanced-cycle breeding lines. A cross-validation approach that trained and validated prediction accuracy across years was used to evaluate effects of model selection, training population size, and marker density in the presence of genotype x environment interactions (GxE). The average prediction accuracies using GS were 28% greater than with MAS and were 95% as accurate as PS. For net merit, the average accuracy across six selection indices for GS was 14% greater than for PS. These results provide empirical evidence that multifamily GS could increase genetic gain per unit time and cost in plant breeding.
Conference Paper
Next-generation DNA sequencing (NGS) technologies can survey sequence variation on a genome-wide scale, but their utility for crop genetic diversity analysis is poorly known. Many challenges remain in their applications, including sampling complex genomes, identifying single-nucleotide polymorphisms (SNPs), and analyzing missing data. This presentation will illustrate a practical application of the Roche 454 GS FLX Titanium technology in combination with genomic reduction and an advanced bioinformatics tool to analyze the genetic relationships of 16 diverse barley (Hordeum vulgare L.) landraces. A full 454 run generated roughly 1.7 million sequence reads with a total length of 612 Mbp. Application of the computational pipeline called DIAL (de novo identification of alleles) identified 2,578 contigs and 3,980 SNPs. Sanger sequencing of four barley samples confirmed 85 of the 100 selected contigs and 288 of the 620 putative SNPs, and identified 735 new SNPs and 39 new indels. Several diversity analyses revealed the eastern and western division in the barley samples. The division is compatible with those inferred with 156 microsatellite alleles of the same 16 samples and consistent with our current knowledge about cultivated barley. The NGS application not only provides a new informative set of genomic resources for barley research, but also helps to illustrate the feasibility of genotyping-by-sequencing with NGS technologies for crop diversity studies.
Article
Genomic selection (GS) uses genomewide molecular markers to predict breeding values and make selections of individuals or breeding lines prior to phenotyping. Here we show that genotyping-by-sequencing (GBS) can be used for de novo genotyping of breeding panels and to develop accurate GS models, even for the large, complex, and polyploid wheat (Triticum aestivum L.) genome. With GBS we discovered 41,371 single nucleotide polymorphisms (SNPs) in a set of 254 advanced breeding lines from CIMMYT's semiarid wheat breeding program. Four different methods were evaluated for imputing missing marker scores in this set of unmapped markers, including random forest regression and a newly developed multivariate-normal expectation-maximization algorithm, which gave more accurate imputation than heterozygous or mean imputation at the marker level, although no significant differences were observed in the accuracy of genomic-estimated breeding values (GEBVs) among imputation methods. Genomic-estimated breeding value prediction accuracies with GBS were 0.28 to 0.45 for grain yield, an improvement of 0.1 to 0.2 over an established marker platform for wheat. Genotyping-by-sequencing combines marker discovery and genotyping of large populations, making it an excellent marker platform for breeding applications even in the absence of a reference genome sequence or previous polymorphism discovery. In addition, the flexibility and low cost of GBS make this an ideal approach for genomics-assisted breeding.
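Of the imputation methods named above, mean imputation at the marker level is the simplest baseline. The sketch below shows one way it could look, assuming markers are coded -1, 0, 1 with `None` for a missing call. The coding scheme, the function name `impute_marker_means`, and the toy matrix are assumptions for illustration and are not taken from the paper.

```python
from typing import List, Optional

GenotypeMatrix = List[List[Optional[float]]]


def impute_marker_means(genotypes: GenotypeMatrix) -> List[List[float]]:
    """Replace each missing call with the mean of the observed calls at that marker.

    Rows are lines (individuals) and columns are markers.
    """
    n_markers = len(genotypes[0])
    means = []
    for j in range(n_markers):
        observed = [row[j] for row in genotypes if row[j] is not None]
        # A marker with no observed calls falls back to 0.0 (an assumption).
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [
        [means[j] if row[j] is None else float(row[j]) for j in range(n_markers)]
        for row in genotypes
    ]


# Hypothetical toy matrix: 3 lines x 3 markers, with one missing call per line.
toy = [
    [1, None, -1],
    [1, 0, None],
    [None, 1, -1],
]
imputed = impute_marker_means(toy)
```

Because each marker is imputed independently, this approach needs no map positions, which suits the unmapped markers described in the abstract.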
Article
Massively parallel sequencing platforms have allowed for the rapid discovery of single nucleotide polymorphisms (SNPs) among related genotypes within a species. We describe the creation of reduced representation libraries (RRLs) using an initial digestion of nuclear genomic DNA with a methylation-sensitive restriction endonuclease followed by a secondary digestion with the 4bp-restriction endonuclease DpnII. This strategy allows for the enrichment of hypomethylated genomic DNA, which has been shown to be rich in genic sequences, and the digestion with DpnII serves to increase the number of common loci resequenced between individuals. Deep resequencing of these RRLs performed with the Illumina Genome Analyzer led to the identification of 2618 SNPs in rice and 1682 SNPs in soybean for two representative genotypes in each of the species. A subset of these SNPs was validated via Sanger sequencing, exhibiting validation rates of 96.4 and 97.0%, in rice (Oryza sativa) and soybean (Glycine max), respectively. Comparative analysis of the read distribution relative to annotated genes in the reference genome assemblies indicated that the RRL strategy was primarily sampling within genic regions for both species. The massively parallel sequencing of methylation-sensitive RRLs for genome-wide SNP discovery can be applied across a wide range of plant species having sufficient reference genomic sequence.
Article
Whole-genome association studies of complex traits in higher eukaryotes require a high density of single nucleotide polymorphism (SNP) markers at genome-wide coverage. To design high-throughput, multiplexed SNP genotyping assays, researchers must first discover large numbers of SNPs by extensively resequencing multiple individuals or lines. For SNP discovery approaches using short read-lengths that next-generation DNA sequencing technologies offer, the highly repetitive and duplicated nature of large plant genomes presents additional challenges. Here, we describe a genomic library construction procedure that facilitates pyrosequencing of genic and low-copy regions in plant genomes, and a customized computational pipeline to analyze and assemble short reads (100–200 bp), identify allelic reference sequence comparisons, and call SNPs with a high degree of accuracy. With maize (Zea mays L.) as the test organism in a pilot experiment, the implementation of these methods resulted in the identification of 126,683 putative SNPs between two maize inbred lines at an estimated false discovery rate (FDR) of 15.1%. We estimated rates of false SNP discovery using an internal control, and we validated these FDR rates with an external SNP dataset that was generated using locus-specific PCR amplification and Sanger sequencing. These results show that this approach has wide applicability for efficiently and accurately detecting gene-enriched SNPs in large, complex plant genomes.
Article
Genomic selection (GS) uses genome-wide molecular marker data to predict the genetic value of selection candidates in breeding programs. In plant breeding, the ability to produce large numbers of progeny per cross allows GS to be conducted within each family. However, this approach requires phenotypes of lines from each cross before conducting GS. This will prolong the selection cycle and may result in lower gains per year than approaches that estimate marker-effects with multiple families from previous selection cycles. In this study, phenotypic selection (PS), conventional marker-assisted selection (MAS), and GS prediction accuracy were compared for 13 agronomic traits in a population of 374 winter wheat (Triticum aestivum L.) advanced-cycle breeding lines. A cross-validation approach that trained and validated prediction accuracy across years was used to evaluate effects of model selection, training population size, and marker density in the presence of genotype × environment interactions (G×E). The average prediction accuracies using GS were 28% greater than with MAS and were 95% as accurate as PS. For net merit, the average accuracy across six selection indices for GS was 14% greater than for PS. These results provide empirical evidence that multifamily GS could increase genetic gain per unit time and cost in plant breeding.
Article
If one accepts that the fundamental pursuit of genetics is to determine the genotypes that explain phenotypes, the meteoric increase of DNA sequence information applied toward that pursuit has nowhere to go but up. The recent introduction of instruments capable of producing millions of DNA sequence reads in a single run is rapidly changing the landscape of genetics, providing the ability to answer questions with heretofore unimaginable speed. These technologies will provide an inexpensive, genome-wide sequence readout as an endpoint to applications ranging from chromatin immunoprecipitation, mutation mapping and polymorphism discovery to noncoding RNA discovery. Here I survey next-generation sequencing technologies and consider how they can provide a more complete picture of how the genome shapes the organism.
Article
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ***, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
Article
The Methods section of the paper describes how missing genotypes are inferred through the use of a model of an individual's genotype vector Gi conditional upon a set of N known haplotypes H. A Hidden Markov Model (HMM) is used that has the form