
Genotyping-by-Sequencing for Plant Breeding and Genetics

The Plant Genome

November 2012
5(3)

DOI:10.3835/plantgenome2012.05.0005

License
CC BY-NC-ND 4.0

Authors:

Trevor W. Rife

Clemson University


Removal of missing data in genotyping-by-sequencing by increasing coverage of the library via resequencing. In a set of international wheat breeding germplasm, several lines (samples) were replicated across two or more libraries. Replicating a sample two times increased the coverage of single nucleotide polymorphisms (SNPs) to 60%, while five replications increased the coverage to over 90%. While very effective as a means to remove missing data, replicated sequencing increases the per-sample cost. The average per-sample cost is $15. In this situation for wheat, the number of replications is roughly equivalent to the sequencing coverage of the library (i.e., 5 replications give approximately 5x coverage). Data from J. Poland (unpublished data, 2012).


92 THE PLANT GENOME  NOVEMBER 2012  VOL. 5, NO. 3

REVIEW & INTERPRETATION

Genotyping-by-Sequencing for Plant Breeding and Genetics

Jesse A. Poland* and Trevor W. Rife

Abstract

Rapid advances in “next-generation” DNA sequencing technology have brought the US$1000 human (Homo sapiens) genome within reach while providing the raw sequencing output for researchers to revolutionize the way populations are genotyped. To capitalize on these advancements, genotyping-by-sequencing (GBS) has been developed as a rapid and robust approach for reduced-representation sequencing of multiplexed samples that combines genome-wide molecular marker discovery and genotyping. The flexibility and low cost of GBS make this an excellent tool for many applications and research questions in plant genetics and breeding. Here we address some of the new research opportunities that are becoming more feasible with GBS. Furthermore, we highlight areas in which GBS will become more powerful with the continued increase of sequencing output, development of reference genomes, and improvement of bioinformatics. The ultimate goal of plant biology scientists is to connect phenotype to genotype. In plant breeding, the genotype can then be used to predict phenotypes and select improved cultivars. Furthering our understanding of the connection between heritable genetic factors and the resulting phenotypes will enable genomics-assisted breeding to exist on the scale needed to increase global food supplies in the face of decreasing arable land and climate change.

Next-Generation Genotyping

Driven by the quest for a $1000 human genome, rapid advances in next-generation sequencing (NGS) output have provided technology that can greatly transform the way we think about plant genomics and breeding. With the introduction of massively parallel sequencing, raw sequencing output is doubling roughly every 6 mo (Fig. 1). The availability of inexpensive sequencing technology has transformed the way genomes are sequenced (Xu et al., 2011; Wang et al., 2011), polymorphisms are discovered (Mardis, 2008; Futschik and Schlötterer, 2010; You et al., 2011; Nielsen et al., 2011), gene expression is analyzed (Geraldes et al., 2011; Harper et al., 2012), and populations are genotyped (Baird et al., 2008; Elshire et al., 2011; Davey et al., 2011; Truong et al., 2012; Poland et al., 2012a; Wang et al., 2012). Sequencing is rapidly becoming so inexpensive that it will soon be reasonable to use it for every genetic study. Next-generation sequencing applications have the potential to revolutionize the field of plant genomics and the practice of applied plant breeding.
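To put the doubling rate cited above in perspective, the following minimal sketch compares cumulative growth under a 6-month doubling time with growth under a 24-month doubling time. The 24-month figure is an assumed reading of Moore's Law chosen here for illustration, not a value taken from Wetterstrand (2012), and the function name `fold_increase` is hypothetical.

```python
def fold_increase(years: float, doubling_time_years: float) -> float:
    """Fold increase in output after `years` given a fixed doubling time."""
    return 2.0 ** (years / doubling_time_years)


# Doubling every 6 months (the rate cited in the text) versus every
# 24 months (assumed here as a stand-in for Moore's Law), over five years.
ngs_fold = fold_increase(5, 0.5)      # 2 ** 10
moore_fold = fold_increase(5, 2.0)    # 2 ** 2.5

print(f"6-month doubling over 5 years:  {ngs_fold:,.0f}-fold")
print(f"24-month doubling over 5 years: {moore_fold:.1f}-fold")
```

Under these assumptions the gap between the two curves widens by a further factor of roughly 180 over five years, which is the divergence that Fig. 1 illustrates.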

One of the primary objectives of functional genomics in agricultural species is to connect phenotype to genotype and use this knowledge to make phenotypic predictions and select improved plant types. To do this on a genome-wide scale requires large populations with dense molecular markers across the genome. To put the power of NGS to work for plant breeding and genomics, new approaches for sequence-based genotyping have been developed. One promising approach is genotyping-by-sequencing (GBS), which uses enzyme-based complexity reduction (using restriction endonucleases to target only a small portion of the genome) coupled with DNA barcoded adapters to produce multiplex libraries of samples ready for NGS sequencing. This approach has been demonstrated to be robust across a range of species and capable of producing tens of thousands to hundreds of thousands of molecular markers (Elshire et al., 2011; Poland et al., 2012a). The flexibility of GBS in regards to species, populations, and research objectives makes this an ideal tool for plant genetics studies. As the phenomenal increase in NGS output continues, many research questions that were once out of reach will be resolved through the application of these approaches.

Published in The Plant Genome 5:92–102.

doi: 10.3835/plantgenome2012.05.0005

5585 Guilford Rd., Madison, WI 53711 USA

An open-access publication

transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Permission for printing and for reprinting the material contained herein has been obtained by the publisher.

J.A. Poland, USDA-ARS, Hard Winter Wheat Genetics Research Unit and Dep. of Agronomy, Kansas State Univ., 4008 Throckmorton Hall, Manhattan KS, 66506; T.W. Rife, Interdepartmental Genetics, Kansas State Univ., 4024 Throckmorton Hall, Manhattan KS, 66506. Received 29 May 2012. *Corresponding author (jesse.poland@ars.usda.gov).

Abbreviations: AM, association mapping; GBS, genotyping-by-sequencing; GS, genomic selection; HMM, hidden Markov model; MSG, multiplexed shotgun genotyping; NGS, next-generation sequencing; PAV, presence–absence variation; RAD, restriction association DNA; SNP, single nucleotide polymorphism.

All-in-One

The two key components for genotyping germplasm are finding DNA sequence polymorphisms and assaying the markers across a full set of material. Classically, this has been a two-step process involving marker discovery followed by assay design and genotyping. An important strength of sequence-based genotyping approaches is that the marker discovery and genotyping are completed at the same time. This facilitates exploration of new germplasm sets or even new species without the upfront effort of discovering and characterizing polymorphisms. Another key component of GBS datasets is that the raw data is dynamic. The raw sequences obtained from GBS can be reanalyzed, uncovering further information (e.g., new polymorphisms and annotated genes) as bioinformatics techniques improve, reference genomes develop, and the collection of sequence data increases. Each of these factors adds value to the same raw dataset.

One of the first and most broadly adopted applications of NGS was single nucleotide polymorphism (SNP) and presence–absence variation (PAV) discovery in diverse populations with and without reference genomes (Baird et al., 2008; Wiedmann et al., 2008; Gore et al., 2009a, 2009b; Huang et al., 2009; Deschamps et al., 2010; Hyten et al., 2010; You et al., 2011; Nelson et al., 2011; Hohenlohe et al., 2011; Byers et al., 2012). These studies have focused on assaying a few key genotypes with a reduced-representation approach (Baird et al., 2008) or with whole-genome resequencing (Huang et al., 2009). While highly effective for SNP discovery, this approach is limited in the number of lines assayed and does not simultaneously assay the markers across the full population of interest.

The key objective of the GBS approach, therefore, is not merely to discover polymorphisms and then transfer these to a fixed assay, but to simultaneously discover polymorphisms and obtain genotypic information across the whole population of interest. It is this combined one-step approach that makes GBS a truly rapid and flexible platform for a range of species and germplasm sets and perfectly suited for genomic selection (GS) in plant breeding programs. As sequencing output continues to increase, GBS will evolve first to lower levels of complexity reduction (to capture more sequence variants) and then to whole-genome resequencing (to capture all variants). Whole-genome resequencing has been applied in Arabidopsis thaliana (L.) Heynh., rice (Oryza sativa L.), and maize (Zea mays L.) (Huang et al., 2009; Ashelford et al., 2011; Gan et al., 2011; Chia et al., 2012; Jiao et al., 2012; Xu et al., 2012), although it quickly becomes less manageable with larger, more complex genomes that lack a solid reference genome (Morrell et al., 2011). The level of multiplexing has also been limited in this approach, increasing per-sample cost.

Figure 1. A comparison of actual sequencing capacity (orange) to what would be expected if sequencing technology was following Moore’s Law (blue). The significant decrease in 2007 coincides roughly with the introduction of next-generation sequencing technology. Data is from the National Human Genome Research Institute (Wetterstrand, 2012).

As GBS can be readily used for de novo discovery and application of new molecular polymorphisms, it is particularly powerful for new sets of germplasm and uncharacterized species. In many ways the greatest advantage of sequence-based genotyping approaches is the reduction of ascertainment bias associated with marker discovery in panels differing from the target population. This is an obvious advantage for association studies in which differing allele frequencies greatly influence the power and precision of the study (Myles et al., 2009; Hamblin et al., 2010). For breeding applications, informative polymorphisms can be discovered as novel germplasm is introduced into the breeding pool. The use of an unrepresentative marker panel in surveying molecular diversity is highly problematic for getting a true representation of molecular diversity present in a target population. Most GBS approaches use methylation-sensitive enzymes. If these enzymes target differentially methylated regions of the genome, ascertainment bias could potentially be introduced in different sets of germplasm, but evidence for this has yet to be seen. While markers discovered with GBS should have little bias across sets of germplasm, it is also unknown how uniformly they are spaced across the genome. Evidence from Poland et al. (2012a), however, indicated that GBS markers were uniformly spaced across the chromosomes of both wheat (Triticum aestivum L.) and barley (Hordeum vulgare L.).

Many Flavors

The use of reduced-representation sequencing for targeting small portions of the genome was first demonstrated by Altshuler et al. (2000). This approach was then later combined with NGS and DNA barcoded adapters to sequence multiplex libraries in parallel. There are many variations of this approach and GBS is one specific method for genotyping using NGS of multiplex DNA-barcoded reduced-representation libraries (Table 1). Furthermore, the combination of enzymes that can be used for complexity reduction is almost endless. Davey et al. (2011) have thoroughly reviewed several approaches of complexity reduction including complexity reduction of polymorphic sequences (van Orsouw et al., 2007) and deep sequencing of reduced representation libraries (van Tassell et al., 2008).

The use of restriction enzymes for targeted reduction of genome complexity combined with NGS was first described by Baird et al. (2008) and termed restriction association DNA (RAD). Restriction association DNA methods use a restriction enzyme to generate genomic fragments, which are then ligated to an adaptor containing a forward primer for amplification, sequencing platform primer sites, and a unique DNA barcode that enables sample multiplexing (Baird et al., 2008; Craig et al., 2008; Cronn et al., 2008). The samples are pooled, randomly sheared, and size selected to create a uniform collection of similarly-sized DNA fragments (Baird et al., 2008). The fragments are then ligated to a Y adaptor that ensures only fragments containing the first adaptor will be amplified (Baird et al., 2008). Restriction association DNA markers provided a robust method to discover polymorphisms and map variation in a population (Miller et al., 2007).

First-generation RAD analysis had drawbacks similar to older restriction enzyme-based marker technologies: the requirement of species-specific arrays, a hybridization for every comparison, and limitations for assaying presence–absence variation (Baird et al., 2008). Combining the progressive features of RAD with NGS, however, resulted in the discovery of new markers at a significantly decreased cost (Baird et al., 2008). The simultaneous discovery of SNP markers during RAD sequencing facilitated robust mapping of many polymorphisms and precise assignment of chromosomal regions to mapping parents, allowing for detection of recombination locations. The RAD approach has recently been modified to use restriction enzymes that cut upstream and downstream of a target site (Wang et al., 2012). This new methodology produces uniform length tags, allows nearly all of the restriction sites to be surveyed, and permits marker intensity adjustment (Wang et al., 2012). The next flavor of sequence-based genotyping was multiplexed shotgun genotyping (MSG), which required only one gel purification, eliminated DNA shearing, required less starting DNA, and implemented a hidden Markov model (HMM) to determine points of chromosomal recombination (Andolfatto et al., 2011). Multiplexed shotgun genotyping used a single common cutting restriction enzyme and produced a limited complexity reduction suitable for the smaller genome (approximately 130 Mb) of Drosophila simulans (Andolfatto et al., 2011). In the context of a reference genome, the HMM imputation approach was highly effective for tracing parental origin and defining recombination break points (Andolfatto et al., 2011).

Table 1. A technical comparison of current genotyping methods using next-generation sequencing of multiplex barcoded libraries. Adapted from Wang et al. (2012). Flavors of genotyping using next-generation sequencing of multiplex DNA-barcoded reduced-representation libraries.

Method | Random shearing | Size selection | Fragment size | Enzymes† | Multiplexing level‡ | Analysis tool(s) | Reference
------ | --------------- | -------------- | ------------- | -------- | ------------------- | ---------------- | ---------
Multiplex shotgun genotyping | No | Yes | Size selected | MseI | 96 (up to 384) | Burrows-Wheeler alignment tool | Andolfatto et al., 2011
Restriction association DNA sequencing (RAD-seq) | Yes | Yes | Size selected | SbfI; EcoRI | 96 | Custom Perl scripts | Baird et al., 2008
Double digest RAD-seq | No | Yes | Size selected | EcoRI and MspI | 48§ | MUSCLE¶ | Peterson et al., 2012
2b-restriction association DNA | No | No | 33–36 bp | BsaXI# | NA†† | Custom Perl scripts | Wang et al., 2012
Genotyping-by-sequencing | No | No | <350 bp | ApeKI‡‡ | 48 (up to 384) | TASSEL§§ | Elshire et al., 2011
Genotyping-by-sequencing – two enzyme | No | No | <350 bp | PstI and MspI | 48 (up to 384) | TASSEL | Poland et al., 2012a
Sequence-based genotyping | No | Yes | Size selected | EcoRI and MseI; PstI and TaqI | 32 | Burrows-Wheeler alignment tool and unified genotyper | Truong et al., 2012
Restriction enzyme sequence comparative analysis | No | Yes | Size selected | MseI; NlaIII | NA¶¶ | Burrows-Wheeler alignment tool and Samtools | Monson-Miller et al., 2012

†All of these approaches can use different enzymes. Shown are the enzyme(s) used in the initial study.

‡All of these methods have the possibility to increase the number of multiplexed samples using additional unique barcodes. The multiplex level shown is as reported in the reference paper; subsequent increases are given in parentheses.

§Combinatorial barcoding is possible, placing a barcode on each end of the DNA fragment. Using a set of 48 adapter P1 barcodes and 12 polymerase chain reaction (PCR) 2 indices it is possible to uniquely label 576 individuals (48 [adapter P1 barcodes] × 12 [PCR2 indices]). This method would require paired-end sequencing.

¶MUSCLE, multiple sequence comparison by log-expectation.

#Uses type IIB restriction endonucleases.

††NA, not applicable.

‡‡Has also been successfully applied using PstI and HindIII (E. Buckler and R. Elshire, personal communication, 2012).

§§TASSEL, trait analysis by association, evolution, and linkage.

¶¶96-plexing reported but unpublished.
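The combinatorial barcoding arithmetic noted for double digest RAD-seq in Table 1 (48 adapter P1 barcodes × 12 PCR2 indices = 576 individuals) can be illustrated with a short sketch. The labels below are hypothetical placeholders, not real barcode or index sequences, and the naming scheme is an assumption made only for this example.

```python
from itertools import product

# Hypothetical labels; real barcode and index sequences would replace these.
p1_barcodes = [f"P1_{i:02d}" for i in range(1, 49)]      # 48 adapter P1 barcodes
pcr2_indices = [f"PCR2_{j:02d}" for j in range(1, 13)]   # 12 PCR2 indices

# Every (P1 barcode, PCR2 index) pair labels one individual uniquely.
combos = list(product(p1_barcodes, pcr2_indices))
sample_labels = {
    f"sample_{n:03d}": pair for n, pair in enumerate(combos, start=1)
}

print(len(combos), "unique barcode/index combinations")
```

Because each individual is identified only by the pair of labels, both ends of a fragment must be read, which is why the table footnote states that paired-end sequencing is required.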

The original GBS protocol was developed to simplify and streamline the construction of RAD libraries (Elshire et al., 2011). The strength of the GBS protocol is its simplicity: using inexpensive adapters, allowing pooled library construction, and avoiding shearing and size selection (Fig. 2). The GBS approach removed the need for size selection by using a short polymerase chain reaction extension of the multiplexed library. Instead of the Y adapters used in the RAD protocol, the original GBS protocol used a single restriction enzyme, a barcoded adaptor, and a common adaptor (Elshire et al., 2011). Although all combinations of adapters can ligate to the DNA fragments, only those that contain one of each adapter type are able to be amplified and sequenced (Davey et al., 2011).
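The first data-analysis step summarized in Fig. 2, parsing sequencing reads to samples using the DNA barcode, can be sketched as below. This is a minimal illustration under stated assumptions and not the procedure used by TASSEL or any other published pipeline: the function name `demultiplex`, the barcode sequences, and the sample names are all hypothetical, only exact barcode matches are accepted, and the check for the restriction-site remnant after the barcode is omitted.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Assign each read to a sample by exact match of its leading barcode.

    reads: iterable of read sequences (strings).
    barcodes: dict mapping barcode sequence -> sample name.
    Returns (dict of sample -> reads with the barcode removed,
             number of reads that matched no barcode).
    """
    # Try longer barcodes first so a short barcode never shadows a longer one.
    ordered = sorted(barcodes, key=len, reverse=True)
    by_sample = defaultdict(list)
    unassigned = 0
    for read in reads:
        for bc in ordered:
            if read.startswith(bc):
                by_sample[barcodes[bc]].append(read[len(bc):])
                break
        else:
            unassigned += 1
    return dict(by_sample), unassigned


# Hypothetical barcodes of different lengths and three toy reads.
barcodes = {"ACGT": "line_A", "TTGCA": "line_B"}
reads = ["ACGTCTGCAGGATC", "TTGCACTGCAGTTA", "GGGGCTGCAGAAAA"]

by_sample, unassigned = demultiplex(reads, barcodes)
print(by_sample, unassigned)
```

In practice a production tool would also tolerate or reject barcode sequencing errors according to the barcode design; that choice is left out of this sketch.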

The original GBS approach was recently extended to a two-enzyme version that combines a rare- and a common-cutting restriction enzyme to generate uniform libraries consisting of a forward (barcoded) adaptor and a reverse (Y) adaptor on alternate ends of each fragment (Poland et al., 2012a). The use of two enzymes in this GBS approach enables the capture of most fragments associated with the rare-cutting enzyme. The use of a Y adaptor on the common restriction site avoids amplification of more common fragments, a preferable situation for larger, more complex genomes. Following the original work on wheat and barley, this GBS approach has been successfully applied in several species including cotton (Gossypium hirsutum L.), oat (Avena sativa L.), sorghum [Sorghum bicolor (L.) Moench], and rice with little to no change in protocol (Poland, unpublished data, 2012).

Figure 2. Schematic overview of steps in genotyping-by-sequencing (GBS) library construction, sequencing, and analysis. (1) Genomic DNA is quantified using a fluorescence-based method. (2) Genomic DNA (gDNA) is normalized in a new plate. Normalization is needed to ensure equal representation of all samples and equal molarity of gDNA and adapters. (3) A master mix with restriction enzyme(s) and buffer is added to the plate and incubated. (4) The DNA barcoded adapters are added along with ligase and ligation buffers. (5) Samples are pooled and cleaned. (6) The GBS library is polymerase chain reaction (PCR) amplified. (7) The amplified library is cleaned and evaluated on a capillary sizing system. (8) Libraries are sequenced. Data analysis: Following a sequencing run, FASTQ files containing raw data from the run are used to parse sequencing reads to samples using the DNA barcode sequence. Once assigned to individual samples, the reads are aligned to a reference genome. In the case of species without a complete reference genomic sequence, reads are internally aligned (alignment of all sequence reads with all other reads from that library) and single nucleotide polymorphisms (SNPs) identified from 1 or 2 bp sequence mismatches. Various filtering algorithms can then be used to distinguish true biallelic SNPs from sequencing errors.

The options for tailoring GBS to any species or desired application are almost endless. A range of enzymes have been evaluated in maize with success in varying the level of complexity reduction (E. Buckler, personal communication, 2012). With a varied level of complexity reduction, it is possible to increase coverage of a target genome or increase the multiplexing level of a target population. The interplay of these two factors will determine the optimal approach for the species under investigation. For species with large genomes or no reference genome, the use of rare-cutting restriction enzymes (i.e., 6 bp or greater target site) with methylation sensitivity can assist in creating a higher level of complexity reduction by targeting fewer sites. This will lead to higher sampling depth of the same genomic sites and reduce the amount of missing data (Fig. 3).
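The relationship between recognition-site length and the number of targeted sites can be approximated with a back-of-the-envelope calculation. The sketch below assumes a random genome with equal base frequencies, so that a site of length k is expected once every 4^k bp; this is a simplifying assumption that ignores base composition, repeats, and methylation sensitivity, so the numbers are rough upper bounds and not predictions for any real library. The function name `expected_sites` is hypothetical; the genome sizes are those cited in this review for barley and wheat.

```python
def expected_sites(genome_bp: float, site_length: int) -> float:
    """Expected recognition sites in a random, equal-base-frequency genome."""
    return genome_bp / 4 ** site_length


# Genome sizes as cited in the text (Arumuganathan and Earle, 1991).
genomes = {"barley": 5.5e9, "wheat": 16e9}

for name, size_bp in genomes.items():
    for k in (4, 6, 8):
        sites = expected_sites(size_bp, k)
        print(f"{name}: {k}-bp site -> ~{sites:,.0f} expected sites")
```

Each additional 2 bp in the recognition site reduces the expected number of sites 16-fold, which is why enzymes with sites of 6 bp or more are attractive for large genomes.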

Hand in Hand with the Reference Genome

Sequence-based genotyping greatly benefits from a well-characterized (sequenced) reference genome. A reference genome makes ordering and imputing low coverage marker data generated through GBS and other sequence-based genotyping approaches straightforward. This has been seen in many of the reported uses of sequence-based genotyping. The MSG approach used by Andolfatto et al. (2011) made use of the D. simulans reference genome to first align tags to the reference and then call SNPs. Using a physical map framework, the parent-of-origin was then imputed across all SNPs segregating in the population. This approach is very robust for assigning parent-of-origin in biparental populations. Likewise, Huang et al. (2009) used the reference genome of rice to first align NGS tags and subsequently call SNPs. The physical ordering of these markers greatly enabled and simplified the imputation and assignment of parent-of-origin for segregating populations.

Although GBS approaches greatly benefit from a reference genome, the rapid discovery and ordering (through genetic mapping) of sequence-based molecular markers can assist with the development and refinement of a reference genome. High-density genetic maps developed through GBS can be used to anchor and order physical maps and refine or correct unordered sequence contigs. In D. simulans, Andolfatto et al. (2011) were able to assign 8 Mb to linkage groups, which comprised 30% of the unassembled D. simulans genome or about 6% of the total genome. This is a substantial improvement of an already well-characterized genome. Likewise, in current efforts in much larger, more complex genomes including barley (5.5 Gb) and wheat (16 Gb) (Arumuganathan and Earle, 1991), high-density GBS maps are being used to assist with anchoring and ordering large numbers of assembled but unanchored and unordered contigs (International Barley Sequencing Consortium, 2012). This approach appears very promising, creating a positive feedback loop in which the development of the reference genome assisted by GBS markers leads to better SNP calling and order-based imputation for GBS datasets.
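The D. simulans figures quoted above can be cross-checked with simple arithmetic. The sketch below uses only the three numbers given in the text (8 Mb, 30%, and about 6%) and derives the genome sizes they imply; the variable names are chosen here for illustration.

```python
assigned_mb = 8.0             # sequence newly assigned to linkage groups
frac_of_unassembled = 0.30    # share of the unassembled genome
frac_of_total = 0.06          # approximate share of the total genome

# Sizes implied by the stated percentages.
unassembled_mb = assigned_mb / frac_of_unassembled
total_mb = assigned_mb / frac_of_total

print(f"Implied unassembled sequence: ~{unassembled_mb:.1f} Mb")
print(f"Implied total genome size:    ~{total_mb:.1f} Mb")
```

The implied total of roughly 133 Mb agrees with the approximately 130 Mb genome size given for D. simulans earlier in the review, so the two percentages are mutually consistent.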

Maps Made Easy

Figure 3. Integration of genotyping-by-sequencing (GBS) in the context of plant breeding and genomics for a species without a completed reference genome.

The combination of GBS with a well-defined reference genome makes the development of genetic maps for characterizing segregating populations exceptionally straightforward. In the absence of a solid reference genome, a high-density reference genetic map can serve the same purpose. For characterizing a new population, there will no longer be any need to place markers on linkage groups, calculate recombination frequencies, or order markers. With a reference genome, markers can be ordered along the physical chromosome. This ordering can then be used to precisely place recombination break points. The power of such approaches has been highlighted in recent papers with model species including D. simulans (Andolfatto et al., 2011), rice (Huang et al., 2010), and maize (Elshire et al., 2011). Even at low coverage, the placement of sparse markers on the physical map can be used to narrow points of recombination to 100 to 200 kb intervals (Huang et al., 2009; Xie et al., 2010). This approach can be extended to populations with heterozygous chromosomal segments such as F2 or BC1 populations. Andolfatto et al. (2011) demonstrated an HMM that accurately inferred heterozygous states from low-pass sequence-based genotyping. These same approaches have successfully been applied in maize (P. Bradbury, personal communication, 2012).
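To make the idea of HMM-based parent-of-origin inference concrete, the following is a minimal two-state Viterbi sketch for ordered markers in an inbred biparental population. It is not the model of Andolfatto et al. (2011): the function name, the fixed per-interval switch probability, and the genotyping error rate are all assumptions chosen for illustration, and a heterozygous state (needed for F2 or BC1 populations) is deliberately left out to keep the example short.

```python
import math


def viterbi_parent_of_origin(calls, switch_prob=0.01, error_rate=0.05):
    """Most likely parental state ('A' or 'B') at each ordered marker.

    calls: list of 'A', 'B', or None (missing), in physical or genetic order.
    switch_prob: assumed probability of a recombination between neighbors.
    error_rate: assumed probability that an observed call is wrong.
    """
    states = ("A", "B")
    log_stay = math.log(1 - switch_prob)
    log_switch = math.log(switch_prob)

    def log_emit(state, call):
        if call is None:              # missing data carries no information
            return 0.0
        return math.log(1 - error_rate if call == state else error_rate)

    def log_trans(prev, cur):
        return log_stay if prev == cur else log_switch

    # Initialise with equal prior probability for the two parents.
    score = {s: math.log(0.5) + log_emit(s, calls[0]) for s in states}
    back = []
    for call in calls[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Best previous state leading into state s.
            prev = max(states, key=lambda p: score[p] + log_trans(p, s))
            pointers[s] = prev
            new_score[s] = score[prev] + log_trans(prev, s) + log_emit(s, call)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy data: one clear breakpoint with missing calls, then one isolated error.
print(viterbi_parent_of_origin(["A", "A", None, "A", "B", "B", None, "B"]))
print(viterbi_parent_of_origin(["A", "A", "B", "A", "A"]))
```

With these assumed parameters, missing calls are filled in from the flanking markers and a single discordant call is treated as an error, because two recombinations in adjacent intervals are far less likely than one miscall.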

In the absence of a solid reference genome, the same ease of genetic mapping can be accomplished through development of a reference genetic map for the species of interest. Genotyping-by-sequencing markers and other framework markers can be integrated to develop a high-density genetic map (Poland et al., 2012a). For new populations, GBS tags can be used to make genotype calls based on the reference map without the need to construct a de novo map. The extremely large number of markers produced with GBS allows sufficient coverage for most populations even if only a fraction of the total markers are used.

These same approaches for developing genetic maps and graphical genotypes can be broadly applied to the characterization of populations of interest for breeding and germplasm improvement including elite breeding lines, segregating populations for selection, near-isogenic lines, and alien-introgression lines. The use of a variety of algorithms to correctly infer the heterozygous or homozygous state of chromosome regions will add value to inferences and conclusions for molecular breeding and selection (Andolfatto et al., 2011). Other algorithms can be used for phasing markers in segregating and outcrossing populations. This will generally, however, require known marker order of the GBS SNPs.

Mapping Single Genes

Genotyping-by-sequencing and other sequence-based genotyping approaches can be very powerful for mapping single genes. The de novo discovery of high-density markers in a population of interest has the potential to circumvent the cumbersome process of marker discovery and testing for fine mapping of target genes and mutations. In the absence of a reference map, RAD markers have been used in bulked segregant analysis to quickly identify linked markers (Baird et al., 2008). For single genes of interest, this can be a valuable approach to rapidly identify segregating polymorphisms. In lupin (Lupinus angustifolius L.), Yang et al. (2012) were able to identify 30 markers linked to an anthracnose resistance gene. One advantage of GBS for mapping single genes in F2 or similar populations is that the per-sample cost will be low enough that individual samples can be used rather than bulks. This will allow correction or removal of any individuals that were incorrectly phenotyped while confirming segregation of linked markers. Depending on the application, there will be a balance between finding markers linked to the gene of interest using GBS and developing single marker assays from the resulting data. Considering breeding approaches, it can still be optimal to prescreen populations with markers for known single genes (with large effects) for a smaller investment in time and sample costs before conducting whole genome profiling. Selected plants carrying desired genes can then be genotyped using GBS for GS.

An Excess of Markers

While preselection of breeding populations for single markers for important genes is a viable breeding strategy, sequencing capacity is becoming so inexpensive and readily available that it will soon be reasonable to generate whole-genome profiles on any germplasm of interest. Previously, scientists spent a majority of their time developing and working with a small number of markers. Many projects today still require only a small number of markers to complete. Genotyping-by-sequencing, however, can readily generate tens of thousands of usable markers, which can be selectively filtered into the few required for a target experiment. While statistical geneticists will always prefer to have as many markers as possible, GS models have diminishing returns on additional markers once the population has reached the point of “marker saturation” (Jannink et al., 2010; Heffner et al., 2011). On the other hand, for association mapping (AM) studies, additional markers increase the likelihood of finding and tagging causal polymorphisms (Cockram et al., 2010). The current limitation for the generated data is computational. There are new algorithms and developments in cluster computing to provide the computational resources needed to make these quantitative genetics questions more manageable (Stanzione, 2011). Quantitative geneticists and bioinformatics personnel will be needed to manage breeding data and develop models. At the same time, bioinformatics training will become a more central component to any plant breeding and genetics curriculum.

Filling in the Blanks

The “catch” to GBS and sequence-based genotyping in general is that datasets often have a significant amount of missing data due to low coverage sequencing (Davey et al., 2011). Biologically, missing genotyping calls in GBS datasets can be the result of presence–absence variation, polymorphic restriction sites, and/or differential methylation. On the other hand, the technical issue of missing data with GBS is a combination of (i) library complexity (i.e., number of unique sequence tags) and (ii) sequence coverage of the library.

Library complexity is directly related to the species’ genome under investigation and the choice of enzyme(s) used for complexity reduction. Enzymes with a shorter recognition site will naturally produce more fragments than those with a longer recognition site. Methylation-sensitive enzymes will greatly reduce the number of fragments in species with large portions of repetitive DNA. In barley, libraries constructed using PstI and MspI generate around 500,000 to 600,000 unique tags, while in wheat around 1.5 million tags are generated (Poland, unpublished data, 2012). The actual number of sequence tags present in a raw dataset is substantially higher partly due to allelic variants but largely due to sequencing errors, many of which can be nonrandom. This can and will generate many versions of “unique” tags.
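The effect of recognition-site length on fragment number can be approximated with a back-of-the-envelope calculation. The sketch below is illustrative only: it assumes a random genome sequence with equal base frequencies, ignores methylation, repeats, and fragment size selection, and uses a hypothetical 5-Gb genome. The function name `expected_cut_sites` is ours and is not part of any GBS pipeline.

```python
def expected_cut_sites(genome_size_bp: int, site_length: int) -> float:
    """Expected number of recognition sites in a random sequence.

    Assumes each of the four bases occurs with probability 1/4, so a
    specific site of length n is expected once every 4**n bases.
    """
    return genome_size_bp / 4 ** site_length


genome = 5_000_000_000  # hypothetical 5-Gb genome, for illustration only

six_cutter = expected_cut_sites(genome, 6)   # e.g., PstI (CTGCAG)
four_cutter = expected_cut_sites(genome, 4)  # e.g., MspI (CCGG)

print(f"6-bp site: about {six_cutter:,.0f} expected sites")
print(f"4-bp site: about {four_cutter:,.0f} expected sites")
print(f"ratio: {four_cutter / six_cutter:.0f}x")
```

Under these assumptions a 4-bp cutter is expected to cut 16 times more often than a 6-bp cutter. Observed tag counts will differ because real genomes are not random sequence and because methylation-sensitive enzymes skip much of the repetitive fraction.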

The level of missing data is based on the sequencing coverage, which is a function of the library complexity, the multiplexing level, and the output of the sequencing platform (Andolfatto et al., 2011). The multiplexing level and the number of independent sequences generated from the sequencing platform will determine the average number of reads per sample. Higher multiplexing levels will reduce the data per sample while increased sequencing output (when using the same multiplexing level) will understandably increase the data per sample.

One key component of GBS on different sequencing platforms is the number of independent reads. Post-Sanger sequencing platforms generally rely on a large number of short sequence reads to produce gigabases of sequence data (Metzker, 2009). The new platforms are continually increasing the sequencing output, a function of more and longer reads. For GBS, however, generating longer reads is less advantageous than generating more reads. More sequence reads provide more data per sample. Alternatively, increasing read numbers allows higher multiplexing levels with static amounts of data per sample. For GBS, 10 Gb of sequence data generated from 100 million reads of 100 bp would be preferable to 10 million reads of 1000 bp. While increasing the number of reads is clearly advantageous for GBS, longer reads are also beneficial, leading to the discovery of more polymorphisms (particularly in species with limited diversity) and assisting GBS applications in polyploids where secondary, genome-specific polymorphisms are needed to differentiate a segregating SNP from homeologous sequences on other genomes.
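The arithmetic behind this preference can be sketched as follows. The 100-plex multiplexing level and the 500,000-tag library are assumed values chosen for illustration (the tag count is the lower end of the barley range given above), and the function names are hypothetical.

```python
def reads_per_sample(total_reads: int, multiplex_level: int) -> float:
    """Average number of reads assigned to each sample in a pooled library."""
    return total_reads / multiplex_level


def mean_reads_per_tag(total_reads: int, multiplex_level: int, unique_tags: int) -> float:
    """Average depth per unique tag per sample, assuming even sampling of tags."""
    return reads_per_sample(total_reads, multiplex_level) / unique_tags


MULTIPLEX = 100        # assumed samples per lane
UNIQUE_TAGS = 500_000  # assumed library complexity

# Both runs yield 10 Gb of sequence, but very different read counts.
short_reads = mean_reads_per_tag(100_000_000, MULTIPLEX, UNIQUE_TAGS)  # 100 bp reads
long_reads = mean_reads_per_tag(10_000_000, MULTIPLEX, UNIQUE_TAGS)    # 1000 bp reads

print(f"100 M reads x 100 bp: {short_reads:.1f} reads per tag per sample")  # 2.0
print(f"10 M reads x 1000 bp: {long_reads:.1f} reads per tag per sample")   # 0.2
```

With the same total output, the run with ten times more reads gives ten times the depth per tag, which is the quantity that determines how many genotypes can be called.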

Missing data can be dealt with by (i) sequencing to higher depth or (ii) imputing. The logical approach to removing missing data is to sequence to a higher depth by reducing the multiplexing level or sequencing the library multiple times. This can be very effective (Fig. 4), but has the drawback of increasing per-sample cost. For important AM panels or parents of a breeding program, however, the additional investment to generate higher coverage of the tags is likely worthwhile. For breeding applications using GBS with targeted selection, other approaches to minimize the impact of missing data are preferable. Since a majority of the breeding population will be discarded, minimizing genotyping cost will take precedence over minimizing missing data.
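A simple way to reason about how depth affects missing data is to treat the number of reads per tag as Poisson distributed, so that the expected fraction of tags with at least one read at mean depth c is 1 − exp(−c). This is an idealized assumption of ours, not a model taken from the studies cited here. Real libraries sample tags unevenly, so observed call rates can fall below this curve.

```python
import math


def expected_call_rate(mean_depth: float) -> float:
    """Expected fraction of tags seen at least once in a sample.

    Assumes read counts per tag follow a Poisson distribution with the
    given mean, so P(zero reads) = exp(-mean_depth).
    """
    return 1.0 - math.exp(-mean_depth)


for depth in (0.5, 1.0, 2.0, 5.0):
    rate = expected_call_rate(depth)
    print(f"mean depth {depth}: expected call rate {rate:.2f}")
```

The curve flattens quickly: each additional unit of depth removes a smaller share of the remaining missing calls, which is why the cost of deeper sequencing has to be weighed against imputation.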

The second approach is imputation of missing data. Depending on the genome, the type of GBS libraries, and the overall size of the datasets, imputation can give very accurate results. There are many imputation algorithms (Marchini et al., 2007; Purcell et al., 2007; Browning and Browning, 2007), most of which are targeted toward haplotype reconstruction on a reference genome. Other approaches such as a random forest model (Breiman, 2001) can be used to impute unordered markers (as is the situation in wheat). Sequencing diverse, key individuals in the population (parents or representatives of kinship clusters) can greatly improve imputation accuracy by defining known haplotypes for the population.

Finally, a matrix of realized relationships among individuals in a breeding population can be constructed without imputation. For very high-density genotyped data generated by GBS, the marker coverage is sufficient to saturate the genomic linkage disequilibrium present in most breeding programs. From this perspective, it is only necessary to determine a pairwise identity between individuals for the markers that are present in both individuals. With high marker density, there will still be tens of thousands of pairwise comparisons between two individuals, well beyond the saturation point for most elite breeding material. Imputation with the simple marker mean can still produce accurate GS prediction models. From a GS perspective, kinship-based marker imputation can be used to optimize the realized relationship matrix in the presence of a high level of missing data (Poland et al., 2012b). This approach has been shown to improve the relationship estimates and give more accurate GS model predictions.
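The two simplest ideas in this paragraph, pairwise identity over shared markers and imputation with the marker mean, can be sketched in a few lines. This is a minimal illustration under our own assumptions: genotypes are coded as allele dosages (0, 1, 2) with `None` for a missing call, identity is scored as simple allele sharing, and the function names are hypothetical. It is not the kinship-based method of Poland et al. (2012b).

```python
from typing import List, Optional, Sequence, Tuple

Genotype = Optional[int]  # 0, 1, or 2 copies of an allele; None marks a missing call


def pairwise_identity(a: Sequence[Genotype], b: Sequence[Genotype]) -> Tuple[Optional[float], int]:
    """Mean allele sharing over markers called in both individuals.

    Returns (identity, number_of_shared_markers); identity is None when
    the two individuals have no called markers in common.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None, 0
    identity = sum(1.0 - abs(x - y) / 2.0 for x, y in shared) / len(shared)
    return identity, len(shared)


def impute_marker_mean(calls: Sequence[Genotype]) -> List[Optional[float]]:
    """Replace missing calls at one marker with the mean of the observed calls."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return list(calls)  # nothing to impute from
    mean = sum(observed) / len(observed)
    return [mean if g is None else float(g) for g in calls]


line_a = [0, 2, None, 1, 2]
line_b = [0, 0, 2, None, 2]
print(pairwise_identity(line_a, line_b))    # identity over the 3 shared markers
print(impute_marker_mean([0, None, 2, 2]))  # missing call replaced by the marker mean
```

Returning the number of shared markers alongside the identity makes it easy to check that each pair is still compared at enough loci.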

Association Mapping

Genotyping-by-sequencing has the potential to be an excellent tool for genotyping of diverse panels for AM. One key to applying GBS for AM is addressing the missing data problem. As previously noted, higher coverage sequencing will reduce the amount of missing data at the expense of increased per-sample costs. For a high-value AM panel that will be well characterized and extensively phenotyped and serve as a community resource population, the additional cost of sequencing several times to achieve high coverage is likely worth the investment. This will produce a very well-characterized genetic population. At a high coverage, imputation of missing data will become a very precise exercise, particularly on populations with extensive linkage disequilibrium. Depending on the species under interrogation, the GBS markers will need to be ordered via a physical reference map or through genetic mapping.

In such populations, GBS markers also have the advantage of being able to survey multiple haplotypes on a fine scale. When two or more SNPs are within the same tag, these SNP alleles are evaluated concurrently. For PAVs, GBS also has the power to uncover these alleles. Array-based methods, particularly those applied to polyploid species, are limited in the ability to accurately survey PAVs as hybridization to a duplicated sequence will indicate an allele call (for the ancestral allele) even if the target locus is absent. Due to the context sequence accompanying a SNP, GBS enables discrimination between duplicated sequences. At higher sequencing coverage of the GBS library, PAV can then be inferred by the absence of a given tag for a given sample in the pool of sequenced tags.

Genomic Selection

In the field of plant breeding, an important objective in the development of GBS is to create a low-cost genotyping platform capable of generating high-density genotypes. For GS in crop species, breeders need a fast, inexpensive, flexible method that will enable genotyping of large populations of selection candidates. A majority of the selection candidates are then discarded, creating a situation that benefits greatly from low-cost genotyping. Genotyping-by-sequencing is quickly expanding to fill those requirements.

Genomic selection was proposed in 2001 by Meuwissen et al. as an approach to capture the full complement of small effect loci in genomic prediction models. Genomic selection takes advantage of dense genome-wide molecular markers by simultaneously fitting effects to all markers and avoiding statistical testing. By using these GS models, breeders are able to predict the performance of new experimental lines at early generations and generate suggested crosses and selections based on the model predictions (Jannink et al., 2010). Combined with a fast turnaround on generations, selection based on predicted breeding values determined by marker data provided by GBS could greatly increase gains in plant breeding programs (Meuwissen et al., 2001; Jannink et al., 2010).
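One common way to fit effects to all markers simultaneously is ridge regression (often called RR-BLUP). It is shown here only as an illustration of the idea; the notation is ours and the studies cited above may use other models.

```latex
\begin{aligned}
\mathbf{y} &= \mathbf{1}\mu + \mathbf{Z}\mathbf{u} + \boldsymbol{\varepsilon},
\qquad \mathbf{u} \sim N(\mathbf{0}, \mathbf{I}\sigma_u^2),
\quad \boldsymbol{\varepsilon} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2) \\
\hat{\mathbf{u}} &= \left(\mathbf{Z}^{\top}\mathbf{Z} + \lambda\mathbf{I}\right)^{-1}
\mathbf{Z}^{\top}\left(\mathbf{y} - \mathbf{1}\hat{\mu}\right),
\qquad \lambda = \sigma_e^2 / \sigma_u^2
\end{aligned}
```

Here $\mathbf{y}$ holds the phenotypes of the $n$ training lines, $\mathbf{Z}$ is the $n \times m$ matrix of marker scores, and $\mathbf{u}$ holds the $m$ marker effects. Because every effect is shrunk toward zero by $\lambda$, the model can be fit with far more markers than lines, and no marker has to pass a significance test. The predicted value of a new line with marker scores $\mathbf{z}$ is $\hat{\mu} + \mathbf{z}^{\top}\hat{\mathbf{u}}$.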

The advantage of GBS for GS in breeding programs is the low per-sample cost needed for generating tens of thousands to hundreds of thousands of molecular markers. Poland et al. (2012b) have demonstrated the suitability of GBS markers for developing GS models in the complex wheat genome. They were able to demonstrate prediction accuracies for yield and other agronomic traits that are high enough to be suitable for breeding applications. The GBS markers also showed a significant improvement in the attained prediction accuracy over a previously used array of hybridization-based markers. The important finding of this work is its practical implications for breeding. The training population was genotyped without a priori knowledge of the population or SNPs and per-sample cost was below $20 (Poland et al., 2012b).

Putting Genotyping-by-Sequencing to Work

Looking forward, high-density markers from NGS will soon be applied to almost every genomic question. These marker datasets are low cost and dynamic, with data and genotyping results getting more robust and economical each year. Genotyping-by-sequencing has been shown to be a valid tool for genetic mapping (Baird et al., 2008; Elshire et al., 2011; Poland et al., 2012a), breeding applications (Poland et al., 2012b), and diversity studies (Fu, 2012; Lu et al., 2012). The ability to quickly generate robust datasets without considerable prior effort for marker discovery is quickly dispelling issues that have plagued researchers working with obscure or foreign species: a lack of defined and specific genetic tools for genome analysis (Allendorf et al., 2010). Genotyping-by-sequencing is an ideal platform for studies ranging from quickly identifying single gene markers to whole genome profiling of association panels.

Figure 4. Removal of missing data in genotyping-by-sequencing by increasing coverage of the library via resequencing. In a set of international wheat breeding germplasm, several lines (samples) were replicated across two or more libraries. Replicating a sample two times increased the coverage of single nucleotide polymorphisms (SNPs) to 60% while five replications increase the coverage to over 90%. While very effective as a means to remove missing data, replicated sequencing increases the per-sample cost. The average per-sample cost is $15. In this situation for wheat, the number of replications is roughly equivalent to the sequencing coverage of the library (i.e., 5 replications give approximately 5x coverage). Data from J. Poland (unpublished data, 2012).

Perhaps one of the most exciting applications of GBS will be in the field of plant breeding. Theoretical and preliminary studies on genomic selection show great promise for accelerating the rate of developing new improved varieties. Genotyping-by-sequencing is providing a rapid and low-cost tool for genotyping these populations, allowing breeders to implement genomic selection on a large scale in their breeding programs. Current developments in sequencing output will drive per-sample cost below $10. Furthermore, there is no requirement for a priori knowledge of the species as the GBS methods have been shown to be robust across a range of species and SNP discovery and genotyping are completed together. This is a very important feature for moving genomics-assisted breeding into orphan crops with understudied genomes and commercial crops with large and complex genomes. Challenges remaining include data management as well as computational constraints on huge datasets, though the future looks promising. Genomic selection via GBS stands to be a major supplement to traditional crop development. The potential for GBS data to improve breeding systems through GS is enormous.

The application of sequence-based genotyping for a whole range of diversity and genomic studies will have an important place well into the future. Driven by applications across the whole spectrum of human, microbial, plant, and animal genomics, developments in NGS and genomics platforms must be put to use for plant breeding and genetics studies.

Acknowledgments

USDA-ARS and the USDA-NIFA funded Triticeae Coordinated Agriculture Project (T-CAP) (2011-68002-30029) provided support for T. Rife. This manuscript was greatly improved by the helpful comments of two anonymous reviewers. Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the U.S. Department of Agriculture. USDA is an equal opportunity provider and employer.

References

Allendorf, F.W., P.A. Hohenlohe, and G. Luikart. 2010. Genomics and the future of conservation genetics. Nat. Rev. Genet. 11:697–709. doi:10.1038/nrg2844

Altshuler, D., V.J. Pollara, C.R. Cowles, W.J. Van Etten, J. Baldwin, L. Linton, and E.S. Lander. 2000. An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 407:513–516. doi:10.1038/35035083

Andolfatto, P., D. Davison, D. Erezyilmaz, T.T. Hu, J. Mast, T. Sunayama-Morita, and D.L. Stern. 2011. Multiplexed shotgun genotyping for rapid and efficient genetic mapping. Genome Res. 21:610–617. doi:10.1101/gr.115402.110

Arumuganathan, K., and E.D. Earle. 1991. Nuclear DNA content of some important plant species. Plant Mol. Biol. Rep. 9:415–415.

Ashelford, K., M.E. Eriksson, C.M. Allen, R. D’Amore, M. Johansson, P. Gould, S. Kay, A.J. Millar, N. Hall, and A. Hall. 2011. Full genome re-sequencing reveals a novel circadian clock mutation in Arabidopsis. Genome Biol. 12:R28. doi:10.1186/gb-2011-12-3-r28

Baird, N.A., P.D. Etter, T.S. Atwood, M.C. Currey, A.L. Shiver, Z.A. Lewis, E.U. Selker, W.A. Cresko, and E.A. Johnson. 2008. Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS ONE 3:e3376. doi:10.1371/journal.pone.0003376

Breiman, L. 2001. Random forests. Mach. Learn. 45:5–32. doi:10.1023/A:1010933404324

Browning, S.R., and B.L. Browning. 2007. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 81:1084–1097. doi:10.1086/521987

Byers, R.L., D.B. Harker, S.M. Yourstone, P.J. Maughan, and J.A. Udall. 2012. Development and mapping of SNP assays in allotetraploid cotton. Theor. Appl. Genet. 124:1201–1214. doi:10.1007/s00122-011-1780-8

Chia, J.-M., C. Song, P.J. Bradbury, D. Costich, N. de Leon, J. Doebley, R.J. Elshire, B. Gaut, L. Geller, J.C. Glaubitz, M. Gore, K.E. Guill, J. Holland, M.B. Hufford, J. Lai, M. Li, X. Liu, Y. Lu, R. McCombie, R. Nelson, J. Poland, B.M. Prasanna, T. Pyhäjärvi, T. Rong, R.S. Sekhon, Q. Sun, M.I. Tenaillon, F. Tian, J. Wang, X. Xu, Z. Zhang, S.M. Kaeppler, J. Ross-Ibarra, M.D. McMullen, E.S. Buckler, G. Zhang, Y. Xu, and D. Ware. 2012. Maize HapMap2 identifies extant variation from a genome in flux. Nat. Genet. 44:803–807. doi:10.1038/ng.2313

Cockram, J., J. White, D.L. Zuluaga, D. Smith, J. Comadran, M. Macaulay, Z. Luo, M.J. Kearsey, P. Werner, D. Harrap, C. Tapsell, H. Liu, P.E. Hedley, N. Stein, D. Schulte, B. Steuernagel, D.F. Marshall, W.T.B. Thomas, L. Ramsay, I. Mackay, D.J. Balding, R. Waugh, and D.M. O’Sullivan. 2010. Genome-wide association mapping to candidate polymorphism resolution in the unsequenced barley genome. Proc. Natl. Acad. Sci. USA 107:21611–21616. doi:10.1073/pnas.1010179107

Craig, D.W., J.V. Pearson, S. Szelinger, A. Sekar, M. Redman, J.J. Corneveaux, T.L. Pawlowski, T. Laub, G. Nunn, D.A. Stephan, N. Homer, and M.J. Huentelman. 2008. Identification of genetic variants using bar-coded multiplexed sequencing. Nat. Methods 5:887–893. doi:10.1038/nmeth.1251

Cronn, R., A. Liston, M. Parks, D.S. Gernandt, R. Shen, and T. Mockler. 2008. Multiplex sequencing of plant chloroplast genomes using Solexa sequencing-by-synthesis technology. Nucleic Acids Res. 36:e122. doi:10.1093/nar/gkn502

Davey, J.W., P.A. Hohenlohe, P.D. Etter, J.Q. Boone, J.M. Catchen, and M.L. Blaxter. 2011. Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nat. Rev. Genet. 12:499–510. doi:10.1038/nrg3012

Deschamps, S., M. la Rota, J.P. Ratashak, P. Biddle, D. Thureen, A. Farmer, S. Luck, M. Beatty, N. Nagasawa, L. Michael, V. Llaca, H. Sakai, G. May, J. Lightner, and M.A. Campbell. 2010. Rapid genome-wide single nucleotide polymorphism discovery in soybean and rice via deep resequencing of reduced representation libraries with the Illumina genome analyzer. Plant Gen. 3:53–68. doi:10.3835/plantgenome2009.09.0026

Elshire, R.J., J.C. Glaubitz, Q. Sun, J.A. Poland, K. Kawamoto, E.S. Buckler, and S.E. Mitchell. 2011. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS ONE 6:e19379. doi:10.1371/journal.pone.0019379

Fu, Y.-B. 2012. Genotyping-by-sequencing: A case study in barley. Workshop presented at: Genomics of Genebanks. Plant and Animal Genome Conference XX, San Diego, CA. 14–18 Jan. 2012. Workshop W362.

Futschik, A., and C. Schlötterer. 2010. The next generation of molecular markers from massively parallel sequencing of pooled DNA samples. Genetics 186:207–218. doi:10.1534/genetics.110.114397

Gan, X., O. Stegle, J. Behr, J.G. Steffen, P. Drewe, K.L. Hildebrand, R. Lyngsoe, S.J. Schultheiss, E.J. Osborne, V.T. Sreedharan, A. Kahles, R. Bohnert, G. Jean, P. Derwent, P. Kersey, E.J. Belfield, N.P. Harberd, E. Kemen, C. Toomajian, P.X. Kover, R.M. Clark, G. Rätsch, and R. Mott. 2011. Multiple reference genomes and transcriptomes for Arabidopsis thaliana. Nature 477:419–423. doi:10.1038/nature10414

Geraldes, A., J. Pang, N. Thiessen, T. Cezard, R. Moore, Y. Zhao, A. Tam, S. Wang, M. Friedmann, I. Birol, S.J.M. Jones, Q.C.B. Cronk, and C.J. Douglas. 2011. SNP discovery in black cottonwood (Populus trichocarpa) by population transcriptome resequencing. Mol. Ecol. Resour. 11:81–92. doi:10.1111/j.1755-0998.2010.02960.x


Gore, M.A., J.M. Chia, R.J. Elshire, Q. Sun, E.S. Ersoz, B.L. Hurwitz, J.A. Peiffer, M.D. McMullen, G.S. Grills, and J. Ross-Ibarra. 2009a. A first-generation haplotype map of maize. Science 326:1115–1117. doi:10.1126/science.1177837

Gore, M.A., M.H. Wright, E.S. Ersoz, P. Bouffard, E.S. Szekeres, T.P. Jarvie, B.L. Hurwitz, A. Narechania, T.T. Harkins, G.S. Grills, D.H. Ware, and E.S. Buckler. 2009b. Large-scale discovery of gene-enriched SNPs. Plant Gen. 2:121–133. doi:10.3835/plantgenome2009.01.0002

Hamblin, M.T., T.J. Close, P.R. Bhat, S. Chao, J.G. Kling, K.J. Abraham, T. Blake, W.S. Brooks, B. Cooper, C.A. Griffey, P.M. Hayes, D.J. Hole, R.D. Horsley, D.E. Obert, K.P. Smith, S.E. Ullrich, G.J. Muehlbauer, and J.-L. Jannink. 2010. Population structure and linkage disequilibrium in U.S. barley germplasm: Implications for association mapping. Crop Sci. 50:556–566. doi:10.2135/cropsci2009.04.0198

Harper, A.L., M. Trick, J. Higgins, F. Fraser, L. Clissold, R. Wells, C. Hattori, P. Werner, and I. Bancroft. 2012. Associative transcriptomics of traits in the polyploid crop species Brassica napus. Nat. Biotechnol. 30:798–802. doi:10.1038/nbt.2302

Heffner, E.L., J.-L. Jannink, and M.E. Sorrells. 2011. Genomic selection accuracy using multifamily prediction models in a wheat breeding program. Plant Gen. 4:65–75. doi:10.3835/plantgenome.2010.12.0029

Hohenlohe, P.A., S.J. Amish, J.M. Catchen, F.W. Allendorf, and G. Luikart. 2011. Next-generation RAD sequencing identifies thousands of SNPs for assessing hybridization between rainbow and westslope cutthroat trout. Mol. Ecol. Resour. 11:117–122. doi:10.1111/j.1755-0998.2010.02967.x

Huang, X., Q. Feng, Q. Qian, Q. Zhao, L. Wang, A. Wang, J. Guan, D. Fan, Q. Weng, T. Huang, G. Dong, T. Sang, and B. Han. 2009. High-throughput genotyping by whole-genome resequencing. Genome Res. 19:1068–1076. doi:10.1101/gr.089516.108

Huang, X., X. Wei, T. Sang, Q. Zhao, Q. Feng, Y. Zhao, C. Li, C. Zhu, T. Lu, Z. Zhang, M. Li, D. Fan, Y. Guo, A. Wang, L. Wang, L. Deng, W. Li, Y. Lu, Q. Weng, K. Liu, T. Huang, T. Zhou, Y. Jing, W. Li, Z. Lin, E.S. Buckler, Q. Qian, Q.-F. Zhang, J. Li, and B. Han. 2010. Genome-wide association studies of 14 agronomic traits in rice landraces. Nat. Genet. 42:961–967. doi:10.1038/ng.695

Hyten, D.L., Q. Song, E.W. Fickus, C.V. Quigley, J.-S. Lim, I.-Y. Choi, E.-Y. Hwang, M. Pastor-Corrales, and P.B. Cregan. 2010. High-throughput SNP discovery and assay development in common bean. BMC Genomics 11:475. doi:10.1186/1471-2164-11-475

International Barley Sequencing Consortium. 2012. A physical, genetic and functional sequence assembly of the barley genome. Nature (in press).

Jannink, J.-L., A.J. Lorenz, and H. Iwata. 2010. Genomic selection in plant breeding: From theory to practice. Briefings Funct. Genomics 9:166–177. doi:10.1093/bfgp/elq001

Jiao, Y., H. Zhao, L. Ren, W. Song, B. Zeng, J. Guo, B. Wang, Z. Liu, J. Chen, W. Li, M. Zhang, S. Xie, and J. Lai. 2012. Genome-wide genetic changes during modern breeding of maize. Nat. Genet. 44:812–815. doi:10.1038/ng.2312

Lu, F., A.E. Lipka, R.J. Elshire, J. Glaubitz, J. Cherney, M. Casler, E.S. Buckler, and D. Costich. 2012. Characterization of the genetic diversity of switchgrass using genotyping by sequencing. Poster presented at: Poster Session – Even Numbers. Plant and Animal Genome Conference XX, San Diego, CA. 14–18 Jan. 2012. Poster P0195.

Marchini, J., B. Howie, S. Myers, G. McVean, and P. Donnelly. 2007. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat. Genet. 39:906–913. doi:10.1038/ng2088

Mardis, E.R. 2008. The impact of next-generation sequencing technology on genetics. Trends Genet. 24:133–141. doi:10.1016/j.tig.2007.12.007

Metzker, M. 2009. Sequencing technologies – The next generation. Nat. Rev. Genet. 11:31–46. doi:10.1038/nrg2626

Meuwissen, T.H.E., B.J. Hayes, and M.E. Goddard. 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819–1829.

Miller, M.R., J.P. Dunham, A. Amores, W.A. Cresko, and E.A. Johnson. 2007. Rapid and cost-effective polymorphism identification and genotyping using restriction site associated DNA (RAD) markers. Genome Res. 17:240–248. doi:10.1101/gr.5681207

Monson-Miller, J., D.C. Sanchez-Mendez, J. Fass, I.M. Henry, T.H. Tai, and L. Comai. 2012. Reference genome-independent assessment of mutation density using restriction enzyme-phased sequencing. BMC Genomics 13:72.

Morrell, P.L., E.S. Buckler, and J. Ross-Ibarra. 2011. Crop genomics: Advances and applications. Nat. Rev. Genet. 13:85–96.

Myles, S., J. Peiffer, P.J. Brown, E.S. Ersoz, Z. Zhang, D.E. Costich, and E.S. Buckler. 2009. Association mapping: Critical considerations shift from genotyping to experimental design. Plant Cell 21:2194–2202. doi:10.1105/tpc.109.068437

Nelson, J.C., S. Wang, Y. Wu, X. Li, G. Antony, F.F. White, and J. Yu. 2011. Single-nucleotide polymorphism discovery by high-throughput sequencing in sorghum. BMC Genomics 12:352. doi:10.1186/1471-2164-12-352

Nielsen, R., J.S. Paul, A. Albrechtsen, and Y.S. Song. 2011. Genotype and SNP calling from next-generation sequencing data. Nat. Rev. Genet. 12:443–451. doi:10.1038/nrg2986

Peterson, B.K., J.N. Weber, E.H. Kay, H.S. Fisher, and H.E. Hoekstra. 2012. Double digest RADseq: An inexpensive method for de novo SNP discovery and genotyping in model and non-model species. PLoS ONE 7:e37135.

Poland, J.A., P.J. Brown, M.E. Sorrells, and J.-L. Jannink. 2012a. Development of high-density genetic maps for barley and wheat using a novel two-enzyme genotyping-by-sequencing approach. PLoS ONE 7:e32253. doi:10.1371/journal.pone.0032253

Poland, J., J. Endelman, J. Dawson, J. Rutkoski, S. Wu, Y. Manes, S. Dreisigacker, J. Crossa, H. Sanchez-Villeda, M. Sorrells, and J.-L. Jannink. 2012b. Genomic selection in wheat breeding using genotyping-by-sequencing. Plant Gen. (in press). doi:10.3835/plantgenome2012.06.0006

Purcell, S., B. Neale, K. Todd-Brown, L. Thomas, M.A.R. Ferreira, D. Bender, J. Maller, P. Sklar, P.I.W. de Bakker, M.J. Daly, and P.C. Sham. 2007. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81:559–575. doi:10.1086/519795

Stanzione, D. 2011. The iPlant collaborative: Cyberinfrastructure to feed the world. Computer 44:44–52. doi:10.1109/MC.2011.297

Truong, H.T., A.M. Ramos, F. Yalcin, M. de Ruiter, H.J.A. van der Poel, K.H.J. Huvenaars, R.C.J. Hogers, L.J.G. van Enckevort, A. Janssen, N.J. van Orsouw, and M.J.T. van Eijk. 2012. Sequence-based genotyping for marker discovery and co-dominant scoring in germplasm and populations. PLoS ONE 7:e37565. doi:10.1371/journal.pone.0037565

van Orsouw, N.J., R.C.J. Hogers, A. Janssen, F. Yalcin, S. Snoeijers, E. Verstege, H. Schneiders, H. van der Poel, J. van Oeveren, H. Verstegen, and M.J.T. van Eijk. 2007. Complexity reduction of polymorphic sequences (CRoPS): A novel approach for large-scale polymorphism discovery in complex genomes. PLoS ONE 2:e1172. doi:10.1371/journal.pone.0001172

van Tassell, C.P., T.P.L. Smith, L.K. Matukumalli, J.F. Taylor, R.D. Schnabel, C.T. Lawley, C.D. Haudenschild, S.S. Moore, W.C. Warren, and T.S. Sonstegard. 2008. SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nat. Methods 5:247–252. doi:10.1038/nmeth.1185

Wang, S., E. Meyer, J.K. McKay, and M.V. Matz. 2012. 2b-RAD: A simple and flexible method for genome-wide genotyping. Nat. Methods 9:808–810. doi:10.1038/nmeth.2023

Wang, X., H. Wang, J. Wang, R. Sun, J. Wu, S. Liu, et al. 2011. The genome of the mesopolyploid crop species Brassica rapa. Nat. Genet. 43:1035–1039. doi:10.1038/ng.919

Wetterstrand, K.A. 2012. DNA sequencing costs: Data from the NHGRI large-scale genome sequencing program. National Human Genome Research Institute, Bethesda, MD. http://www.genome.gov/sequencingcosts (accessed 5 Mar. 2012).

Wiedmann, R.T., T.P.L. Smith, and D.J. Nonneman. 2008. SNP discovery in swine by reduced representation and high throughput pyrosequencing. BMC Genet. 9:81. doi:10.1186/1471-2156-9-81


Xie, W., Q. Feng, H. Yu, X. Huang, Q. Zhao, Y. Xing, S. Yu, B. Han, and Q. Zhang. 2010. Parent-independent genotyping for constructing an ultrahigh-density linkage map based on population sequencing. Proc. Natl. Acad. Sci. USA 107:10578–10583. doi:10.1073/pnas.1005931107

Xu, X., X. Liu, S. Ge, J.D. Jensen, F. Hu, X. Li, Y. Dong, R.N. Gutenkunst, L. Fang, L. Huang, J. Li, W. He, G. Zhang, X. Zheng, F. Zhang, Y. Li, C. Yu, K. Kristiansen, X. Zhang, J. Wang, M. Wright, S. McCouch, R. Nielsen, J. Wang, and W. Wang. 2012. Resequencing 50 accessions of cultivated and wild rice yields markers for identifying agronomically important genes. Nat. Biotechnol. 30:105–111. doi:10.1038/nbt.2050

Xu, X., S. Pan, S. Cheng, B. Zhang, D. Mu, P. Ni, et al. 2011. Genome sequence and analysis of the tuber crop potato. Nature 475:189–195. doi:10.1038/nature10158

Yang, H., Y. Tao, Z. Zheng, C. Li, M. Sweetingham, and J. Howieson. 2012. Application of next-generation sequencing for rapid marker development in molecular plant breeding: A case study on anthracnose disease resistance in Lupinus angustifolius L. BMC Genomics 13:318. doi:10.1186/1471-2164-13-318

You, F.M., N. Huo, K.R. Deal, Y.Q. Gu, M.-C. Luo, P.E. McGuire, J. Dvorak, and O.D. Anderson. 2011. Annotation-based genome-wide SNP discovery in the large and complex Aegilops tauschii genome using next-generation sequencing without a reference genome sequence. BMC Genomics 12:59. doi:10.1186/1471-2164-12-59

Genotyping-by-Sequencing Study of the Genetic Diversity and Population Structure of the Endangered Plant Tsoongiodendron odorum Chun in China

Article

Full-text available

May 2024

Tsoongiodendron odorum Chun is a large evergreen tree in the Magnoliaceae family and an ancient relict species represented by small wild populations. It has excellent material quality, high ornamental value, and scientific significance. However, due to the complicated natural reproduction and notable habitat destruction, its wild populations must be urgently conserved. We used genotyping-by-sequencing to examine 17 natural populations of T. odorum in China, the species’ primary habitat, to better understand the genetic diversity of this species and use its germplasm resources. T. odorum had a very low level of genetic diversity; its mean values for Ho, He, Pi, and PIC were 0.175, 0.123, 0.160, and 0.053, respectively. With an average within-population Fst of 0.023 and an inter-population gene flow Nm of 10.918, population genetic variation was primarily found within populations, demonstrating minute genetic divergence between populations. The 17 natural populations of T. odorum were divided into two major categories: the Fujian populations in eastern China and the Jiangxi, Guangdong, Hunan, and Guangxi populations in central and western China. Our research contributes to the understanding of T. odorum’s genetic diversity and organization and offers a theoretical framework for the species’ conservation, breeding, and selection.

Environment influences the genetic structure and genetic differentiation of Sassafras tzumu (Lauraceae)

Article

Full-text available

Jun 2024

Background: Sassafras tzumu, an elegant deciduous arboreal species, belongs to the esteemed genus Sassafras within the distinguished family Lauraceae. With its immense commercial value, escalating market demands and unforeseen human activities within its natural habitat have emerged as new threats to S. tzumu in recent decades, so it is necessary to study its genetic diversity and influencing factors, to propose correlative conservation strategies. Results: By utilizing genotyping-by-sequencing (GBS) technology, we acquired a comprehensive database of single nucleotide polymorphisms (SNPs) from a cohort of 106 individuals sourced from 13 diverse Sassafras tzumu natural populations, scattered across various Chinese mountainous regions. Through our meticulous analysis, we aimed to unravel the intricate genetic diversity and structure within these S. tzumu populations, while simultaneously investigating the various factors that potentially shape genetic distance. Our preliminary findings unveiled a moderate level of genetic differentiation (FST = 0.103, p < 0.01), accompanied by a reasonably high genetic diversity among the S. tzumu populations. Encouragingly, our principal component analysis painted a vivid picture of two distinct genetic and geographical regions across China, where gene flow appeared to be somewhat restricted. Furthermore, employing the sophisticated multiple matrix regression with randomization (MMRR) analysis method, we successfully ascertained that environmental distance exerted a more pronounced impact on genetic distance when compared to geographical distance (βE = 0.46, p < 0.01; βD = 0.16, p < 0.01). This intriguing discovery underscores the potential significance of environmental factors in shaping the genetic landscape of S. tzumu populations. Conclusions: The genetic variance among populations of S. tzumu in our investigation exhibited a moderate degree of differentiation, alongside a heightened level of genetic diversity. The environmental distance of S. tzumu had a greater impact on its genetic diversity than geographical distance. It is of utmost significance to formulate and implement meticulous management and conservation strategies to safeguard the invaluable genetic resources of S. tzumu.

Biofortification of Major Crops through Conventional and Modern Biotechnological Approaches to Fight Hidden Hunger: An Overview

Article

Jun 2024

Biofortification, the process of enhancing the nutritional content of crops, offers a promising strategy to combat hidden hunger—micronutrient deficiencies affecting over two billion people globally. This review article explores the biofortification of major crops, focusing on both conventional breeding techniques and modern biotechnological approaches. Conventional methods, such as selective breeding and crossbreeding, have been instrumental in increasing the levels of essential micronutrients like iron (Fe) and zinc (Zn) in staple crops such as wheat, rice, and maize. For instance, wild relatives of cultivated wheat, including Triticum dicoccoides and Aegilops tauschii, have been utilized to significantly enhance Fe and Zn content in modern cultivars. Advancements in biotechnological tools, including genetic engineering, marker-assisted selection (MAS), and genome editing (CRISPR/Cas9), have further accelerated the development of biofortified crops. These technologies enable precise modifications to increase the accumulation of micronutrients and improve nutrient bioavailability. For example, transgenic rice varieties enriched with β-carotene (Golden Rice) and enhanced Fe and Zn content through gene editing showcase the potential of biotechnology in addressing micronutrient deficiencies. The review also highlights ongoing efforts and challenges in the field, such as regulatory hurdles, public acceptance, and the need for comprehensive strategies integrating conventional and modern approaches. Furthermore, it discusses the role of international research organizations and collaborations in facilitating the development and dissemination of biofortified crops. In conclusion, combining conventional breeding with cutting-edge biotechnological innovations presents a robust approach to biofortify major crops, offering a sustainable solution to mitigate hidden hunger and improve global food security. 
Continued research and multi-disciplinary collaborations are essential to fully realize the potential of biofortification in enhancing human nutrition.

SNP-based analysis reveals high genetic structure and diversity in umbu tree (Spondias tuberosa Arruda), a native and endemic species of the Caatinga biome

Article

May 2024
GENET RESOUR CROP EV

Umbu (Spondias tuberosa Arruda) is an endemic fruit tree restricted to the Brazilian seasonally dry tropical forest called Caatinga. This study aimed to evaluate the structure and genomic diversity of umbu trees from seven locations in the Caatinga biome, distributed among four Brazilian states. Using genotyping-by-sequencing (GBS), a total of 5,336 SNPs were obtained, of which 250 showed outlier behavior. Therefore, 5,086 neutral SNPs were used for population structure and genetic diversity analyses. Both discriminant analysis of principal components (DAPC) and neighbor-joining cluster analyses classified the accessions into four groups, with a genetic structure observed among groups, disagreeing with our initial hypothesis of low genetic structure between locations. Isolation by distance (r² = 0.974; p = 0.0015) was detected. Moderate to high levels of genetic diversity were found, with the average observed heterozygosity (HO = 0.221) higher than the expected heterozygosity (HE = 0.199) and with negative inbreeding coefficient (FIS) values. Most genetic variation was found within locations, although high diversity between locations (22.1%) was observed. The results obtained are important for understanding the levels and distribution of genetic variation, suggesting that most locations are priorities for conservation actions, contributing with different alleles to the species' gene pool in Brazil.
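
As a rough illustration of why an observed heterozygosity above the expected value goes with a negative inbreeding coefficient, the sketch below applies the standard relation FIS = 1 - HO/HE to the two averages quoted in the abstract. The function name is hypothetical, and the study presumably estimated FIS per locus or per location, so this ratio of averages is only an approximation of the reported values.

```python
def inbreeding_coefficient(h_obs: float, h_exp: float) -> float:
    """Return F_IS = 1 - H_O / H_E (hypothetical helper, not from the study)."""
    if h_exp <= 0:
        raise ValueError("expected heterozygosity must be positive")
    return 1.0 - h_obs / h_exp


# Average heterozygosity values quoted in the abstract above.
f_is = inbreeding_coefficient(h_obs=0.221, h_exp=0.199)
print(f"F_IS = {f_is:.3f}")  # negative because H_O exceeds H_E
```

A negative result indicates an excess of heterozygotes relative to Hardy-Weinberg expectations, which is consistent with the abstract's description.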

Characterization of a new greenbug resistance gene Gb9 in a synthetic hexaploid wheat

Article

May 2024
THEOR APPL GENET

Greenbug [Schizaphis graminum (Rondani)] is a serious insect pest that not only damages cereal crops, but also transmits several destructive viruses. The emergence of new greenbug biotypes in the field makes it urgent to identify novel greenbug resistance genes in wheat. CWI 76364 (PI 703397), a synthetic hexaploid wheat (SHW) line, exhibits greenbug resistance. Evaluation of an F2:3 population from cross OK 14319 × CWI 76364 indicated that a dominant gene, designated Gb9, conditions greenbug resistance in CWI 76364. Selective genotyping of a subset of F2 plants with contrasting phenotypes by genotyping-by-sequencing identified 25 SNPs closely linked to Gb9 on chromosome arm 7DL. Ten of these SNPs were converted to Kompetitive allele-specific polymerase chain reaction (KASP) markers for genotyping the entire F2 population. Genetic analysis delimited Gb9 to a 0.6-Mb interval flanked by KASP markers located at 599,835,668 bp (Stars-KASP872) and 600,471,081 bp (Stars-KASP881) on 7DL. Gb9 was 0.5 cM distal to Stars-KASP872 and 0.5 cM proximal to Stars-KASP881. Allelism tests indicated that Gb9 is a new greenbug resistance gene which confers resistance to greenbug biotypes C, E, H, I, and TX1. TX1 is one of the most widely virulent biotypes and has overcome most known wheat greenbug resistance genes. The introgression of Gb9 into locally adapted wheat cultivars is of economic importance, and the KASP markers developed in this study can be used to tag Gb9 in cultivar development.

Implementing multi‐trait genomic selection to improve grain milling quality in oats (Avena sativa L.)

Article

May 2024

Oats (Avena sativa L.) provide unique nutritional benefits and contribute to sustainable agricultural systems. Breeding high‐value oat varieties that meet milling industry standards is crucial for satisfying the demand for oat‐based food products. Test weight, thins, and groat percentage are primary traits that define oat milling quality and the final price of food‐grade oats. Conventional selection for milling quality is costly and burdensome. Multi‐trait genomic selection (MTGS) combines information from genome‐wide markers and secondary traits genetically correlated with primary traits to predict breeding values of primary traits on candidate breeding lines. MTGS can improve prediction accuracy and significantly accelerate the rate of genetic gain. In this study, we evaluated different MTGS models that used morphometric grain traits to improve prediction accuracy for primary grain quality traits within the constraints of a breeding program. We evaluated 558 breeding lines from the University of Illinois Oat Breeding Program across 2 years for primary milling traits, test weight, thins, and groat percentage, and secondary grain morphometric traits derived from kernel and groat images. Kernel morphometric traits were genetically correlated with test weight and thins percentage but were uncorrelated with groat percentage. For test weight and thins percentage, the MTGS model that included the kernel morphometric traits in both training and candidate sets outperformed single‐trait models by 52% and 59%, respectively. In contrast, MTGS models for groat percentage were not significantly better than the single‐trait model. We found that incorporating kernel morphometric traits can improve the genomic selection for test weight and thins percentage.

Understanding genome structure facilitates the use of wild lentil germplasm for breeding: A case study with shattering loci

Article

May 2024

Plant breeders are generally reluctant to cross elite crop cultivars with their wild relatives to introgress novel desirable traits due to associated negative traits such as pod shattering. This results in a genetic bottleneck that could be reduced through better understanding of the genomic locations of the gene(s) controlling this trait. We integrated information on parental genomes, pod shattering data from multiple environments, and high‐density genetic linkage maps to identify pod shattering quantitative trait loci (QTLs) in three lentil interspecific recombinant inbred line populations. The broad‐sense heritability on a multi‐environment basis varied from 0.46 (in LR‐70, Lens culinaris × Lens odemensis) to 0.77 (in LR‐68, Lens orientalis × L. culinaris). Genetic linkage maps of the interspecific populations revealed reciprocal translocations of chromosomal segments that differed among the populations, and which were associated with reduced recombination. LR‐68 had a 2–5 translocation, LR‐70 had 1–5, 2–6, and 2–7 translocations, and LR‐86 had a 2–7 translocation in one parent relative to the other. Segregation distortion was also observed for clusters of single nucleotide polymorphisms on multiple chromosomes per population, further affecting introgression. Two major QTL, on chromosomes 4 and 7, were repeatedly detected in the three populations and contain several candidate genes. These findings will be of significant value for lentil breeders to strategically access novel superior alleles while minimizing the genetic impact of pod shattering from wild parents.

Histological, metabolomic, and transcriptomic differences in fir trees from a peri-urban forest under chronic ozone exposure

Preprint

May 2024

Urbanization modifies ecosystem conditions and evolutionary processes. This includes air pollution, mostly as tropospheric ozone (O3), which contributes to the decline of urban and peri-urban forests. A notable case is that of the fir (Abies religiosa) forests in the peripheral mountains southwest of Mexico City, which have been severely affected by O3 pollution since the 1970s. Interestingly, some young individuals exhibiting minimal O3-related damage have been observed within a zone of significant O3 exposure. Using this setting as a natural experiment, we compared asymptomatic and symptomatic individuals of similar age (≤15 years old; n = 10) using histological, metabolomic and transcriptomic approaches. Plants were sampled during days of high (170 ppb) and moderate (87 ppb) O3 concentration. Given that there have been reforestation efforts in the region, with plants from different source populations, we first confirmed that all analysed individuals clustered within the local genetic group when compared to a species-wide panel (Admixture analysis with ~1.5K SNPs). We observed thicker epidermis and more collapsed cells in the palisade parenchyma of needles from symptomatic individuals than from their asymptomatic counterparts, with differences increasing with needle age. Furthermore, symptomatic individuals exhibited lower concentrations of various terpenes (β-pinene, β-caryophyllene oxide, α-caryophyllene and β-α-cubebene) than asymptomatic trees, as evidenced through GC-MS. Finally, transcriptomic analyses revealed differential expression for thirteen genes related to carbohydrate metabolism, plant defense, and gene regulation. Our results indicate a rapid and contrasting phenotypic response among trees, likely influenced by standing genetic variation and/or plastic mechanisms. They open the door to future evolutionary studies for understanding how O3 tolerance develops in urban environments, and how this knowledge could contribute to forest restoration.

Abiotic Stress in Plants: Challenges and Strategies for Enhancing Plant Growth and Development

Chapter

May 2024

Plants face diverse environmental challenges, including water scarcity, salinity, extreme temperatures, and nutrient deficiency, impeding their growth and productivity. Plants develop molecular mechanisms for adaptation to survive such conditions. Understanding these processes is pivotal for enhancing plant tolerance. This chapter thoroughly investigates the challenges and strategies for improving plant growth and development in stressful environments, delving into the mechanisms of plant adaptation to these stressors. Additionally, it identifies abiotic stress candidate genes for stress tolerance, which are crucial for developing stress-resistant crops. Moreover, the chapter underscores the paramount importance of implementing strategies to ensure food security amid a growing global population and increased environmental abiotic stress. It highlights the critical role of investigating plant responses to abiotic stress in addressing global food security challenges amid climate change.

Unveiling the potential of biomarkers in the context of climate change: analysis of knowledge landscapes, trends, and research priorities

Article

May 2024
REG ENVIRON CHANGE

Shaher Zyoud

Climate change is one of the most pressing challenges facing the world, with profound implications for ecosystems, biodiversity, and human societies. The necessity to monitor, comprehend, and mitigate climate change impacts has spurred the emergence of biomarkers as essential tools in ecological and environmental research. This study investigates global knowledge in biomarker research within the context of climate change. Its goal is to offer valuable insights to both researchers and practitioners, thereby guiding the development of well-informed decisions. The analysis encompassed a performance evaluation aimed at scrutinizing both quantitative and qualitative indicators. Visualization techniques utilizing VOSviewer software were deployed to analyze collaboration patterns, co-citation links among prominent knowledge-sharing platforms, and key topics derived from keyword co-occurrence matrices. Globally, a total of 1045 relevant documents were identified and analyzed. The United States stands out as the top contributor (261 documents; 25.0%) with Chinese Academy of Sciences, China, as the most prolific institution (72 documents; 6.9%). Key trends were related to developing and utilizing novel biomarkers based on advancements in omics and nano-based technologies, bioinformatics and data analytics benefiting from machine learning and artificial intelligence tools, and the significance of integrative approaches that merge biomarkers with remote sensing data and ecological models. These advancements contribute to boosting predictive capabilities, precise sensing, and the effective identification of patterns within massive datasets that ultimately improve climate change monitoring and mitigation in ecosystems. Progress in this field demands interdisciplinary collaborations, international cooperation, the establishment of long-term monitoring programs, the creation of biomarker databases, and investment in emerging biomarker technologies. 
Education and outreach initiatives, accompanied by adequate funding and resources, are critical for advancing biomarker research.

Double digest RADseq: An inexpensive method for de novo SNP discovery and genotyping in model and non-model species

Article

Jan 2012
PLOS ONE

Genomic Selection Accuracy using Multifamily Prediction Models in a Wheat Breeding Program

Article

Mar 2011

Genomic selection (GS) uses genome-wide molecular marker data to predict the genetic value of selection candidates in breeding programs. In plant breeding, the ability to produce large numbers of progeny per cross allows GS to be conducted within each family. However, this approach requires phenotypes of lines from each cross before conducting GS. This will prolong the selection cycle and may result in lower gains per year than approaches that estimate marker-effects with multiple families from previous selection cycles. In this study, phenotypic selection (PS), conventional marker-assisted selection (MAS), and GS prediction accuracy were compared for 13 agronomic traits in a population of 374 winter wheat (Triticum aestivum L.) advanced-cycle breeding lines. A cross-validation approach that trained and validated prediction accuracy across years was used to evaluate effects of model selection, training population size, and marker density in the presence of genotype x environment interactions (GxE). The average prediction accuracies using GS were 28% greater than with MAS and were 95% as accurate as PS. For net merit, the average accuracy across six selection indices for GS was 14% greater than for PS. These results provide empirical evidence that multifamily GS could increase genetic gain per unit time and cost in plant breeding.

Genotyping-by-sequencing: a Case Study in Barley

Conference Paper

Yong-Bi Fu (符永碧)

Next-generation DNA sequencing (NGS) technologies can survey sequence variation on a genome-wide scale, but their utility for crop genetic diversity analysis is poorly known. Many challenges remain in their applications, including sampling complex genomes, identifying single-nucleotide polymorphisms (SNPs), and analyzing missing data. This presentation will illustrate a practical application of the Roche 454 GS FLX Titanium technology in combination with genomic reduction and an advanced bioinformatics tool to analyze the genetic relationships of 16 diverse barley (Hordeum vulgare L.) landraces. A full 454 run generated roughly 1.7 million sequence reads with a total length of 612 Mbp. Application of the computational pipeline called DIAL (de novo identification of alleles) identified 2,578 contigs and 3,980 SNPs. Sanger sequencing of four barley samples confirmed 85 of the 100 selected contigs and 288 of the 620 putative SNPs, and identified 735 new SNPs and 39 new indels. Several diversity analyses revealed the eastern and western division in the barley samples. The division is compatible with those inferred with 156 microsatellite alleles of the same 16 samples and consistent with our current knowledge about cultivated barley. The NGS application not only provides a new informative set of genomic resources for barley research, but also helps to illustrate the feasibility of genotyping-by-sequencing with NGS technologies for crop diversity studies.

Genomic Selection in Wheat Breeding using Genotyping-by-Sequencing

Article

Nov 2012

Genomic selection (GS) uses genomewide molecular markers to predict breeding values and make selections of individuals or breeding lines prior to phenotyping. Here we show that genotyping-by-sequencing (GBS) can be used for de novo genotyping of breeding panels and to develop accurate GS models, even for the large, complex, and polyploid wheat (Triticum aestivum L.) genome. With GBS we discovered 41,371 single nucleotide polymorphisms (SNPs) in a set of 254 advanced breeding lines from CIMMYT's semiarid wheat breeding program. Four different methods were evaluated for imputing missing marker scores in this set of unmapped markers, including random forest regression and a newly developed multivariate-normal expectation-maximization algorithm, which gave more accurate imputation than heterozygous or mean imputation at the marker level, although no significant differences were observed in the accuracy of genomic-estimated breeding values (GEBVs) among imputation methods. Genomic-estimated breeding value prediction accuracies with GBS were 0.28 to 0.45 for grain yield, an improvement of 0.1 to 0.2 over an established marker platform for wheat. Genotyping-by-sequencing combines marker discovery and genotyping of large populations, making it an excellent marker platform for breeding applications even in the absence of a reference genome sequence or previous polymorphism discovery. In addition, the flexibility and low cost of GBS make this an ideal approach for genomics-assisted breeding.
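
To make the simplest of the imputation baselines mentioned above concrete, here is a minimal sketch of mean imputation at the marker level. It is not the authors' code: the function name, the -1/0/1 marker coding, the use of None for missing scores, and the fallback of 0.0 for a marker with no observed scores are all assumptions made for illustration.

```python
from statistics import mean
from typing import List, Optional


def impute_marker_means(
    genotypes: List[List[Optional[float]]],
) -> List[List[float]]:
    """Fill missing scores (None) with the mean of the observed scores
    for the same marker (column). Hypothetical helper for illustration."""
    n_markers = len(genotypes[0])
    column_means = []
    for j in range(n_markers):
        observed = [row[j] for row in genotypes if row[j] is not None]
        # Assumption: a marker with no observed scores falls back to 0.0.
        column_means.append(mean(observed) if observed else 0.0)
    return [
        [column_means[j] if row[j] is None else row[j] for j in range(n_markers)]
        for row in genotypes
    ]


# Toy matrix: rows are breeding lines, columns are markers coded -1/0/1.
toy = [
    [1, None, -1],
    [-1, 1, -1],
    [None, 1, None],
]
imputed = impute_marker_means(toy)
print(imputed)
```

According to the abstract, random forest regression and the multivariate-normal expectation-maximization algorithm imputed more accurately than this kind of baseline, although GEBV accuracy did not differ significantly among methods.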

Rapid Genome-wide Single Nucleotide Polymorphism Discovery in Soybean and Rice via Deep Resequencing of Reduced Representation Libraries with the Illumina Genome Analyzer

Article

Jul 2010

Massively parallel sequencing platforms have allowed for the rapid discovery of single nucleotide polymorphisms (SNPs) among related genotypes within a species. We describe the creation of reduced representation libraries (RRLs) using an initial digestion of nuclear genomic DNA with a methylation-sensitive restriction endonuclease followed by a secondary digestion with the 4bp-restriction endonuclease DpnII. This strategy allows for the enrichment of hypomethylated genomic DNA, which has been shown to be rich in genic sequences, and the digestion with DpnII serves to increase the number of common loci resequenced between individuals. Deep resequencing of these RRLs performed with the Illumina Genome Analyzer led to the identification of 2618 SNPs in rice and 1682 SNPs in soybean for two representative genotypes in each of the species. A subset of these SNPs was validated via Sanger sequencing, exhibiting validation rates of 96.4 and 97.0%, in rice (Oryza sativa) and soybean (Glycine max), respectively. Comparative analysis of the read distribution relative to annotated genes in the reference genome assemblies indicated that the RRL strategy was primarily sampling within genic regions for both species. The massively parallel sequencing of methylation-sensitive RRLs for genome-wide SNP discovery can be applied across a wide range of plant species having sufficient reference genomic sequence.

Large-Scale Discovery of Gene-Enriched SNPs

Article

Jul 2009

Whole-genome association studies of complex traits in higher eukaryotes require a high density of single nucleotide polymorphism (SNP) markers at genome-wide coverage. To design high-throughput, multiplexed SNP genotyping assays, researchers must first discover large numbers of SNPs by extensively resequencing multiple individuals or lines. For SNP discovery approaches using short read-lengths that next-generation DNA sequencing technologies offer, the highly repetitive and duplicated nature of large plant genomes presents additional challenges. Here, we describe a genomic library construction procedure that facilitates pyrosequencing of genic and low-copy regions in plant genomes, and a customized computational pipeline to analyze and assemble short reads (100–200 bp), identify allelic reference sequence comparisons, and call SNPs with a high degree of accuracy. With maize (Zea mays L.) as the test organism in a pilot experiment, the implementation of these methods resulted in the identification of 126,683 putative SNPs between two maize inbred lines at an estimated false discovery rate (FDR) of 15.1%. We estimated rates of false SNP discovery using an internal control, and we validated these FDR rates with an external SNP dataset that was generated using locus-specific PCR amplification and Sanger sequencing. These results show that this approach has wide applicability for efficiently and accurately detecting gene-enriched SNPs in large, complex plant genomes.
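
As a back-of-the-envelope reading of the figures in this abstract, the short sketch below converts the estimated false discovery rate into approximate counts of false and true SNPs. The variable names are illustrative only, and the counts are expectations implied by the reported FDR, not numbers stated in the study.

```python
# Figures quoted in the abstract above.
putative_snps = 126_683
estimated_fdr = 0.151  # 15.1%

# Expected split implied by the FDR (approximate, for illustration only).
expected_false = putative_snps * estimated_fdr
expected_true = putative_snps - expected_false

print(f"expected false discoveries: about {expected_false:,.0f}")
print(f"expected true SNPs: about {expected_true:,.0f}")
```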

The Impact of Next-Generation Sequencing Technology on Genetics

Article

Apr 2008
TRENDS GENET

Elaine Mardis

If one accepts that the fundamental pursuit of genetics is to determine the genotypes that explain phenotypes, the meteoric increase of DNA sequence information applied toward that pursuit has nowhere to go but up. The recent introduction of instruments capable of producing millions of DNA sequence reads in a single run is rapidly changing the landscape of genetics, providing the ability to answer questions with heretofore unimaginable speed. These technologies will provide an inexpensive, genome-wide sequence readout as an endpoint to applications ranging from chromatin immunoprecipitation, mutation mapping and polymorphism discovery to noncoding RNA discovery. Here I survey next-generation sequencing technologies and consider how they can provide a more complete picture of how the genome shapes the organism.

Random Forests (Machine Learning, Volume 45, Number 1)

Article

Oct 2001

Leo Breiman

Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ***, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.

A new multipoint method for genome-wide association studies via imputation of genotypes : Supplementary Methods

Article

Jan 2007

The Methods section of the paper describes how missing genotypes are inferred through the use of a model of an individual's genotype vector Gi conditional upon a set of N known haplotypes H. A Hidden Markov Model (HMM) is used that has the form
