High‐throughput sequencing‐based microsatellite genotyping for polyploids to resolve allele dosage uncertainty and improve analyses of genetic diversity, structure and differentiation: A case study of the hexaploid Camellia oleifera

July 2021
Molecular Ecology Resources 22(1)

DOI:10.1111/1755-0998.13469

Authors:

Sheng-Yuan Qin

Kunming Institute of Botany

Mol Ecol Resour. 2022;22:199–211. wileyonlinelibrary.com/journal/men

Received: 3 March 2020

Revised: 8 July 2021

Accepted: 12 July 2021

RESOURCE ARTICLE

Jiakuan Chen1,4 | Yao Zhao1,5 | Jun Rong1,5

Xiangyan Cui and Caihua Li contributed equally to this work.

1 Jiangxi Province Key Laboratory of Watershed Ecosystem Change and Biodiversity, Center for Watershed Ecology, Institute of Life Science and School of Life Sciences, Nanchang University, Nanchang, China

2 Center for Genetic & Genomic Analysis, Genesky Biotechnologies Inc., Shanghai, China

3 Genesky Diagnostics (Suzhou) Inc., Suzhou, China

4 Fudan Development Institute, Fudan University, Shanghai, China

5 Lushan Botanical Garden, Chinese Academy of Sciences, Lushan, China

Correspondence

Jun Rong and Yao Zhao, Jiangxi Province Key Laboratory of Watershed Ecosystem Change and Biodiversity, Center for Watershed Ecology, Institute of Life Science and School of Life Sciences, Nanchang University, Nanchang, China.

Emails: rong_jun@hotmail.com and yaozhao@ncu.edu.cn

Funding information

National Key Research and Development Program of China, Grant/Award Number: 2018YFD1000603; National Natural Science Foundation of China, Grant/Award Number: 31870311; “Gan-Po Talent 555” Project of Jiangxi Province, China

Abstract

Conventional microsatellite (simple sequence repeat, SSR) genotyping methods cannot accurately identify polyploid genotypes, leading to allele dosage uncertainty and introducing biases in population genetic analysis. Here, a new SSR genotyping method was developed to directly infer accurate polyploid genotypes. The frequency distribution of SSR sequences was obtained based on deep-coverage high-throughput sequencing data. Corrections were performed accounting for the “stutter peak” and amplification efficiency of SSR sequences. Perl scripts and an online SSR genotyping tool “SSRSeq” were provided to process the sequencing data and output genotypes with corrected allele dosages. Hexaploid Camellia oleifera is the dominant woody oilseed crop in China. Understanding the geographical pattern of genetic variation in wild C. oleifera is essential for the conservation and utilization of genetic resources. Six wild C. oleifera populations were sampled across geographical ranges in subtropical evergreen broadleaf forests of China. Using 35 SSR markers, the high-throughput sequencing-based SSRSeq method was applied to obtain accurate hexaploid genotypes of wild C. oleifera. The results demonstrated that the new method could resolve allele dosage uncertainty and considerably improve genetic diversity, structure and differentiation analyses for polyploids. The genetic variation patterns of wild C. oleifera across geographical ranges agree with the “central-marginal hypothesis”, stating that genetic diversity is high in the central population and declines from the central to the peripheral populations, and genetic differentiation increases from the centre to the periphery. This method and findings can facilitate the utilization of wild C. oleifera genetic resources for the breeding of cultivated C. oleifera.

KEYWORDS

allele dosage, Camellia oleifera, genetic differentiation, genetic diversity, genetic structure, polyploid

1 | INTRODUCTION

Polyploidy plays an important role in the diversification of angiosperms. Approximately 15% of angiosperm speciation events are accompanied by ploidy increases, and approximately 35% of angiosperm species are polyploids (Wood et al., 2009). Because polyploidy is often accompanied by heterosis and gene redundancy, and polyploids may grow larger, more quickly, and with higher yields than their diploid relatives, polyploidy has also facilitated the domestication and improvement of crops (Renny-Byfield & Wendel, 2014). Many important crops are polyploid. For instance, potato (Solanum tuberosum) is tetraploid (2n = 4x = 48) (The Potato Genome Sequencing Consortium, 2011), bread wheat (Triticum aestivum) is hexaploid (2n = 6x = 42) (International Wheat Genome Sequencing Consortium, 2014), and oilseed rape (Brassica napus) is tetraploid (2n = 4x = 38) (Chalhoub et al., 2014). Thus, studies on the population genetics of polyploids can not only shed light on the evolution of angiosperms but also improve our understanding of crop domestication and improvement (Renny-Byfield & Wendel, 2014).

Despite the importance of polyploids, the population genetics of polyploids is still underdeveloped compared with that of diploids (Dufresne et al., 2014). The major challenge is to develop molecular approaches for reliably resolving allele dosage uncertainty in polyploids (Dufresne et al., 2014). For instance, alleles A and B detected at a locus in a hexaploid may represent a genotype of ABBBBB, AABBBB, AAABBB, AAAABB, or AAAAAB, but conventional approaches cannot identify the exact genotype, leading to so-called allele dosage uncertainty. With allele dosage uncertainty, allele and genotype frequency estimates are unreliable, which may lead to considerable biases in subsequent analyses of genetic diversity, structure and differentiation (Dufresne et al., 2014; Meirmans et al., 2018). Microsatellites or simple sequence repeats (SSRs) are among the most popular molecular markers in population genetics (Andrew et al., 2013; Dufresne et al., 2014). However, applications of SSRs in polyploids suffer from allele dosage uncertainty (Dufresne et al., 2014). Recently, sequencing-based SSR genotyping techniques have been developed based on high-throughput sequencing (De Barba et al., 2017; Vartia et al., 2016; Yang et al., 2019). The new techniques facilitate rapid, accurate and cost-effective genotyping at a large number of SSR loci in large-scale population genetic studies (De Barba et al., 2017; Vartia et al., 2016; Yang et al., 2019). Nevertheless, the new techniques cannot resolve SSR allele dosage uncertainty when genotyping polyploids. New methods need to be developed to obtain accurate polyploid genotypes with corrected SSR allele dosages.
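The scale of allele dosage uncertainty can be illustrated with a short enumeration. The following is a minimal sketch, not part of the SSRSeq scripts; the function name `possible_genotypes` is hypothetical. It lists every genotype of a given ploidy that contains each detected allele at least once.

```python
from itertools import combinations_with_replacement


def possible_genotypes(alleles, ploidy):
    """List every genotype of size `ploidy` in which each detected allele occurs at least once."""
    detected = set(alleles)
    genotypes = []
    for combo in combinations_with_replacement(sorted(detected), ploidy):
        # Keep only combinations that use all detected alleles.
        if set(combo) == detected:
            genotypes.append("".join(combo))
    return genotypes


# Two alleles detected in a hexaploid, as in the example in the text.
hexaploid_ab = possible_genotypes(["A", "B"], ploidy=6)
print(hexaploid_ab)

# Three detected alleles leave even more candidate genotypes.
hexaploid_abc = possible_genotypes(["A", "B", "C"], ploidy=6)
print(len(hexaploid_abc))
```

For two alleles this reproduces the five candidate genotypes listed above; without dosage information, a conventional method cannot choose among them.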

Polyploids are common in the genus Camellia (Theaceae), including predominantly tetraploids and hexaploids, especially in the section Paracamellia (Ming, 2000). Camellia oleifera, the type species of the section Paracamellia, is a hexaploid evergreen broadleaf shrub or small tree (Huang et al., 2018). Cultivated C. oleifera is the dominant woody oilseed crop in China. The seed oil of C. oleifera is rich in the monounsaturated fatty acid oleic acid (up to >80%), and it is known as “oriental olive oil” (Ma et al., 2011; Zhuang, 2008). Wild C. oleifera is an essential genetic resource for cultivated C. oleifera breeding. Wild C. oleifera is widely distributed in the subtropical evergreen broadleaf forests of the Yangtze River Basin and South China (Ming, 2000). Based on size polymorphisms of 8 SSR markers determined through capillary electrophoresis, genetic structure analyses indicated clear genetic differentiation between wild C. oleifera from Lu Mountain (29.60°N, 115.98°E) and Jinggang Mountain (26.55°N, 114.17°E) (380 km between the two mountains) and less genetic differentiation among altitudes within each mountain (altitude range <700 m) (Huang et al., 2018). However, classical genetic differentiation estimates of FST showed the same very low genetic differentiation (FST = 0.007) between and within each mountain (Huang et al., 2018). Moreover, most SSRs showed significant heterozygosity excesses (Huang et al., 2018). These results may be caused by biased genotyping. Because wild C. oleifera is hexaploid, allele dosage uncertainty at the SSR loci may lead to biases in population genetic analyses. Further studies are needed to obtain accurate hexaploid genotypes of wild C. oleifera with corrected SSR allele dosages to determine the geographical pattern of genetic variation, which is the basis for the utilization of wild C. oleifera genetic resources. For the purposes of germplasm collection, special attention should be given to wild C. oleifera with high genetic diversity and differentiation.

In this study, a new high-throughput sequencing-based SSR genotyping method was developed to resolve allele dosage uncertainty in polyploids (Figure 1). Our study used hexaploid wild C. oleifera as a case for analysing genetic diversity and structure in polyploid plants. Wild C. oleifera samples were collected from six populations across geographical ranges. Using 35 microsatellite markers, the new method was applied to genotype hexaploid wild C. oleifera samples. Corrections were performed accounting for the “stutter peak” and amplification efficiency of SSR sequences to obtain hexaploid genotypes with corrected allele dosages. With corrected and uncorrected allele dosages, genetic diversity, structure and differentiation were analysed and compared. The main objectives of our study were to establish methods to resolve allele dosage uncertainty in polyploids, improve population genetic analysis in polyploids, and provide support for the evaluation and utilization of genetic resources in hexaploid C. oleifera.

2 | MATERIALS AND METHODS

2.1 | Plant materials

Wild C. oleifera samples were collected from six natural distribution sites (Table 1 and Figure 2) across its geographical range in China (mainly between the Yangtze River and the Pearl River). Wild C. oleifera is widely distributed in the subtropical mountain areas of China at altitudes ranging from approximately 200 to 2000 m. Lu Mountain is in the northern distribution range of wild C. oleifera at the border between the middle and northern subtropical regions of China. Jinggang Mountain is in the distribution centre of wild C. oleifera in the central section of the middle subtropical region of China. Nanling Mountain is at the border between the middle and southern subtropical regions of China. Luofu Mountain is in the southern distribution range of wild C. oleifera in the southern subtropical region of China. Fanjing Mountain is in the western distribution range of wild C. oleifera, and Matou Mountain is in the eastern distribution range of wild C. oleifera. Differences in climate conditions (Table 1) as well as geographical isolation between mountains may lead to genetic differentiation between wild C. oleifera populations. The number of samples was proportional to the wild C. oleifera population size at the distribution sites. Leaf samples were taken from flowering wild C. oleifera and then placed in small zip-lock plastic bags containing silica gel for dehydration. The dry leaf samples were stored at room temperature.

FIGURE 1 Flow chart of the new high-throughput sequencing-based microsatellite genotyping method for resolving allele dosage uncertainty in polyploids. A Perl script “SSRSeq count” str_count.pl (https://github.com/ccoo22/SSRseq_count) is provided to process the clean reads to generate an SSR read count table. A Perl script str_type.pl (https://github.com/ccoo22/SSRSeq) and an online SSR genotyping tool, SSRSeq V1.1 (http://bioinfo.geneskybiotech.com/software/ssrseq_type/v1.1/en/), are provided to output SSR genotypes with corrected allele dosages. Details of the method are described in the Materials and Methods.

[Flow chart steps: multiplex PCR of SSR markers; high-throughput sequencing of deep coverage; quality control with FastQC; paired-end reads merged with FLASH; reads aligned to SSR reference sequences using Blastn; read count of SSR sequences (SSRSeq count); read frequency distribution of SSR sequences; SSR allele identification; “stutter peak” correction; amplification efficiency correction; SSR genotypes with corrected allele dosages (SSRSeq V1.1)]

TABLE 1 Wild Camellia oleifera sampling sites

| Site (label) | Latitude (°N) | Longitude (°E) | Altitude (m) | Annual mean temperature (°C) | Annual precipitation (mm) | Number of samples |
|---|---|---|---|---|---|---|
| Lu Mountain, Jiangxi (LU) | 29.60 | 115.98 | 256–874 | 12.80–16.14 | 1502–1790 | 123 |
| Fanjing Mountain, Guizhou (FJ) | 27.91 | 108.63 | 1016–1389 | 11.58–13.32 | 1251–1321 | 29 |
| Matou Mountain, Jiangxi (MTS) | 27.73 | 117.16 | 526–570 | 16.08 | 1902 | 16 |
| Jinggang Mountain, Jiangxi (JG) | 26.55 | 114.17 | 412–1044 | 13.63–16.52 | 1435–1770 | 85 |
| Nanling Mountain, Guangdong (NL) | 24.90 | 113.06 | 592–932 | 15.72–16.44 | 1566–1627 | 64 |
| Luofu Mountain, Guangdong (LF) | 23.28 | 114.02 | 1125–1217 | 17.26–17.74 | 2019–2034 | 27 |

2.2 | DNA extraction and quality control

Approximately 30 mg of dry leaf tissue was taken from each sample and placed in a 2.0-ml tube with a 5-mm glass bead. Sample tubes were placed in liquid nitrogen for 30 s and transferred to a TissueLyser II (Qiagen) for grinding at 30 Hz for 1 min. Genomic DNA was extracted using the DNAsecure Plant Kit (Tiangen). DNA integrity was checked by agarose gel electrophoresis. The quality and quantity of the DNA samples were measured using a NanoDrop 2000 (Thermo Scientific).

2.3 | Single and multiplex PCR optimization of SSR markers

Polymorphic SSR markers of wild C. oleifera were selected from Cui et al. (2018). The SSR markers are single loci containing trinucleotide simple repeats, that is, (TCC)n, which were developed based on high-throughput transcriptome sequencing of wild C. oleifera from the Lu and Jinggang Mountains (Cui et al., 2018). First, single PCR amplification was performed for each SSR marker. A 10 μl mixture was prepared for each reaction and included 1× reaction buffer (TaKaRa), 2 mM Mg2+, 0.2 mM dNTPs, 0.2 μM each primer, 1 U HotStarTaq polymerase (TaKaRa) and 1 μl template DNA (10 ng/μl). The cycling programme was 95°C for 2 min; 11 cycles of 95°C for 20 s, 63–58°C (−0.5°C per cycle) for 40 s, 72°C for 1 min; 24 cycles of 95°C for 20 s, 65°C for 30 s, 72°C for 1 min; 72°C for 2 min. The amplification reactions were carried out on an AB 2720 Thermal Cycler (Life Technologies Corporation). Thirty-five SSR markers with clear bands were selected for further multiplex PCR.

The 35 SSR markers were divided into two panels (17 or 18 markers/panel) for multiplex PCR. The composition of SSR markers was adjusted based on the multiplex PCR results to achieve equal amounts of the PCR products of each marker, to optimize primer compatibility and to avoid undesirable primer pairing. A 20 μl mixture was prepared for each reaction and included 1× reaction buffer (TaKaRa), 2 mM Mg2+, 0.2 mM dNTPs, 0.1 μM each primer, 1 U HotStarTaq polymerase (TaKaRa) and 2 μl template DNA (10 ng/μl). The cycling programme was 95°C for 2 min; 11 cycles of 94°C for 20 s, 63–58°C (−0.5°C per cycle) for 40 s, 72°C for 1 min; 24 cycles of 94°C for 20 s, 65°C for 30 s, 72°C for 1 min; 72°C for 2 min. Finally, two panels of the 35 SSR markers were optimized (Table S1) for multiplex PCR. Ten DNA samples (random samples from FJ, JG and NL) were used to check the consistency of PCR results between single and multiplex PCRs of the 35 SSR markers.

FIGURE 2 Genetic structure analyses of wild Camellia oleifera populations using SSR data of corrected and uncorrected allele dosages. (a) Corrected allele dosage (K = 2): results of genetic structure analysis with SSR data of corrected allele dosage. Pie charts next to wild C. oleifera sites indicate proportions of different clusters (different colours) within local populations. (b) Uncorrected allele dosage (K = 2): results with uncorrected allele dosage data. (c) Corrected allele dosage (K = 5): results with corrected allele dosage data. (d) Corrected allele dosage: results of delta K with corrected allele dosage data. (e) Uncorrected allele dosage: results of delta K with uncorrected allele dosage data
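The touchdown stage of the cycling programme (63–58°C at −0.5°C per cycle over 11 cycles) can be checked with a few lines of arithmetic. This is only an illustrative sketch; the function name `touchdown_annealing_temps` is hypothetical and not part of any thermal cycler software.

```python
def touchdown_annealing_temps(start_c=63.0, step_c=-0.5, cycles=11):
    """Annealing temperature (°C) used in each touchdown cycle."""
    return [start_c + step_c * i for i in range(cycles)]


temps = touchdown_annealing_temps()
print(temps)  # first value is the starting temperature, last is the final touchdown temperature

# Touchdown cycles plus the 24 fixed-temperature cycles described in the text.
total_cycles = len(temps) + 24
print(total_cycles)
```

Eleven steps of −0.5°C starting at 63°C end exactly at 58°C, which agrees with the stated 63–58°C range.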

2.4 | High-throughput sequencing and data processing

The PCR products (<300 bp) of each sample were mixed and labelled with an 8-bp sample-specific barcode using index primers. All PCR products were pooled prior to library preparation. DNA libraries were constructed for Illumina paired-end sequencing following the Illumina protocol and then sequenced on the Illumina HiSeq 2500 platform (paired-end 2 × 150 bp) with mean coverage >5000× per SSR locus per sample. Raw reads were analysed with FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) for quality control. A Perl script “SSRSeq count” str_count.pl (https://github.com/ccoo22/SSRseq_count) was developed to process the clean reads to generate an SSR read count table (Figure 1). First, paired-end reads were merged with FLASH (http://ccb.jhu.edu/software/FLASH/). Then, merged reads were aligned to C. oleifera sequences where the SSR markers were located (Cui et al., 2018) using Blastn (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/). Finally, the SSR read count table was generated, showing read counts of SSR sequences with different repeat numbers at each locus for each sample.
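The idea of a read count table can be sketched in a few lines. The example below is a deliberate simplification and is not how str_count.pl works: instead of aligning merged reads to reference sequences with Blastn, it simply finds the longest uninterrupted run of the motif with a regular expression. The reads, the flanking sequences and the function names are hypothetical.

```python
import re
from collections import Counter


def repeat_number(read, motif="TCC"):
    """Number of motif units in the longest uninterrupted run of `motif` in `read`."""
    runs = re.findall(f"(?:{re.escape(motif)})+", read)
    return max((len(run) // len(motif) for run in runs), default=0)


def read_count_table(reads, motif="TCC"):
    """Count reads per repeat number for one locus in one sample."""
    return Counter(repeat_number(read, motif) for read in reads)


# Hypothetical merged reads for a single (TCC)n locus with made-up flanks.
reads = [
    "ACGT" + "TCC" * 5 + "GATC",
    "ACGT" + "TCC" * 5 + "GATC",
    "ACGT" + "TCC" * 4 + "GATC",
    "ACGT" + "TCC" * 7 + "GATC",
]
counts = read_count_table(reads)
print(dict(counts))
```

Dividing each count by the total number of reads at the locus gives the read frequency distribution used for genotyping in Section 2.5.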

2.5 | SSR genotyping

A Perl script str_type.pl (https://github.com/ccoo22/SSRSeq) was developed for SSR genotyping. A user-friendly online SSR genotyping tool, SSRSeq V1.1 (http://bioinfo.geneskybiotech.com/software/ssrseq_type/v1.1/en/), was also provided for SSR genotyping using str_type.pl. The genotyping methods are described below, including “stutter peak” correction and amplification efficiency correction (Figure 1). Using the SSR read count table as input data, the read frequency distribution of SSR sequences with different repeat numbers of an SSR motif was obtained at each locus in each sample. SSR alleles (genotyped as repeat numbers) were inferred for read frequencies higher than a given genotyping threshold. Based on a large amount of empirical genotyping in data sets of different species (ploidy ≤6, e.g., diploid Allium sativum, tetraploid Brassica napus, hexaploid Camellia oleifera) with SSR markers of at least trinucleotide simple repeats, the genotyping threshold is usually set to 0.5 × (1/ploidy), in our case 0.5 × (1/6) = 0.083, to avoid allelic dropout. Read frequencies higher than the genotyping threshold are highlighted in the output file of SSRSeq for manual checking. To further avoid allelic dropout, the genotyping threshold can be altered in the input file of SSRSeq depending on the genotyping results of a specific data set.
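The threshold rule can be written out as a short sketch. This is an illustration of the rule as described above, not the str_type.pl implementation; the function names and the read counts are hypothetical.

```python
def genotyping_threshold(ploidy, factor=0.5):
    """Genotyping threshold as described in the text: factor x (1 / ploidy)."""
    return factor * (1.0 / ploidy)


def call_alleles(read_counts, ploidy, factor=0.5):
    """Return {repeat_number: read_frequency} for sequences above the threshold."""
    total = sum(read_counts.values())
    threshold = genotyping_threshold(ploidy, factor)
    freqs = {n: count / total for n, count in read_counts.items()}
    return {n: f for n, f in freqs.items() if f > threshold}


# Hypothetical read counts at one locus in one hexaploid sample.
read_counts = {5: 3000, 6: 1900, 7: 900, 8: 200}
called = call_alleles(read_counts, ploidy=6)
print(round(genotyping_threshold(6), 3))
print(sorted(called))
```

In this made-up sample the sequence with repeat number 8 has a read frequency of about 0.033, below the hexaploid threshold, so it is not called as an allele at this stage.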

A “stutter peak” (a sequence with an SSR motif repeat number that is different from an SSR allele) with a high read frequency may be misidentified as an SSR allele. Sometimes, a stutter peak may overlap with an SSR allele, which then shows an abnormally high read frequency, causing difficulties in allele dosage estimation. To account for the stutter peak before an SSR allele, high-quality samples were selected with read numbers higher than the median read number at a locus, and a stutter peak before an SSR allele was identified for read frequencies lower than 0.6 × the genotyping threshold (0.6 × 0.083 = 0.05 in the study). The threshold is based on a large amount of empirical genotyping in data sets of different species (ploidy ≤6, e.g., diploid Allium sativum, tetraploid Brassica napus, hexaploid Camellia oleifera) with SSR markers of at least trinucleotide simple repeats. SSR markers of trinucleotide simple repeats are recommended because dinucleotide repeat markers may have a high degree of stutter peaks, causing difficulties in genotyping. The threshold can be altered in the input file of SSRSeq depending on the genotyping results of a specific data set. For each sample, the slip ratio of an SSR sequence at a locus can be calculated as:

$$SR_n = \frac{F_{n-1}}{F_n} \qquad (1)$$

where SRn is the slip ratio of the SSR sequence with repeat number n of the SSR motif at a locus, Fn−1 is the read frequency of the stutter peak with repeat number n−1 and Fn is the read frequency of the SSR sequence. The slip ratio may vary with repeat number at a locus as follows:

[Equation 2, giving SRn as a function of the repeat number n with coefficients a and b, is incomplete in this copy; the surviving fragment reads “SRn = an”.]

where a and b are locus-specific coefficients. With regression between the mean observed slip ratio across samples and the repeat number of SSR alleles at each SSR locus, coefficients a and b were estimated for each SSR locus. For an SSR sequence with no observed slip ratio, the expected slip ratio was calculated using equation 2. Thus, slip ratios were estimated based on high-quality samples and then used for the correction of all samples. For each sample, the expected read frequency of stutter peak Fn−1 before each SSR sequence with repeat number n at each SSR locus was estimated according to equation 1. For stutter peak correction, the expected read frequency of stutter peak Fn−1 was subtracted from the read frequency of the sequence with repeat number n−1 before an SSR sequence with repeat number n. If the expected read frequency of the stutter peak was higher than the observed read frequency, the corrected read frequency was set to 0. Then, all read frequencies were recalculated so that the sum of read frequencies at a locus in a sample = 1. Finally, SSR alleles were identified again with corrected read frequencies higher than the genotyping threshold (>0.083 in the study).
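The stutter peak correction can be sketched as follows. This is a simplified reading of the procedure, not the str_type.pl code: slip ratios are passed in as already-estimated values per repeat number, and the expected stutter is computed from the observed (uncorrected) read frequencies, which is an assumption on our part. The function name and all numbers are hypothetical.

```python
def correct_stutter(freqs, slip_ratios):
    """Subtract expected stutter from each n-1 sequence and renormalize.

    freqs:       {repeat_number: observed read frequency} at one locus in one sample
    slip_ratios: {repeat_number n: SR_n}, estimated beforehand for this locus
    """
    corrected = dict(freqs)
    for n, f_n in freqs.items():
        # Equation 1 rearranged: expected F_(n-1) = SR_n * F_n.
        expected_stutter = slip_ratios.get(n, 0.0) * f_n
        if n - 1 in corrected:
            # Corrected frequencies are floored at 0, as described in the text.
            corrected[n - 1] = max(0.0, corrected[n - 1] - expected_stutter)
    total = sum(corrected.values())
    if total == 0:
        return corrected
    # Rescale so that the read frequencies at the locus sum to 1.
    return {n: f / total for n, f in corrected.items()}


# Hypothetical locus: true alleles at 5 and 7 repeats, stutter at 4 and 6.
observed = {4: 0.04, 5: 0.50, 6: 0.06, 7: 0.40}
slip = {5: 0.08, 6: 0.10, 7: 0.15}
stutter_corrected = correct_stutter(observed, slip)
print({n: round(f, 3) for n, f in stutter_corrected.items()})
```

In this example the sequences with 4 and 6 repeats are fully explained by stutter from the alleles at 5 and 7 repeats, so their corrected frequencies drop to zero before alleles are called again.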

SSR alleles with higher repeat numbers may have lower amplification efficiency. Due to low amplification efficiency, some SSR alleles may have read frequencies lower than the genotyping threshold but higher than 0.6 × the genotyping threshold and are identified as potential alleles. To identify SSR alleles and infer expected dosages of SSR alleles, corrections were performed to account for differences in amplification efficiency. Again, high-quality samples were selected with read numbers higher than the median read number at a locus. For each sample, the amplification ratio of each allele at a locus can be calculated as the observed allele dosage divided by the expected allele dosage:

$$AR = \frac{D_o}{D_e} \qquad (3)$$

where AR is the amplification ratio of the allele at a locus, Do is the observed allele dosage, and De is the expected allele dosage. The observed allele dosage can be calculated as ploidy × read frequency of an SSR allele at a locus in a sample. The expected allele dosage can be estimated following the method below. In the study, ploidy = 6. For SSR genotypes with six different alleles in a sample, the expected allele dosage of each allele should be 1. For SSR genotypes with five different alleles in a sample, four alleles should each have an expected allele dosage of 1, and one allele should have an expected allele dosage of 2. Therefore, the allele with the highest observed read frequency may have the expected allele dosage of 2, and the other alleles may have the expected allele dosage of 1. For SSR genotypes with four different alleles in a sample, the most abundant allele should have an expected allele dosage of 3 or 2, and the other alleles should have an expected allele dosage of 1 (with one further allele at a dosage of 2 when the most abundant allele has a dosage of 2, so that the total remains 6). For SSR genotypes with fewer than four different alleles in a sample, the expected allele dosage should be estimated by rounding the observed allele dosage so that the total number of alleles at a locus in a sample = 6. Afterwards, the mean amplification ratio across high-quality samples was estimated for each allele at each SSR locus and used for the correction of all samples. For amplification efficiency correction, the read frequency of each allele at each SSR locus was first divided by the mean amplification ratio of the allele. SSR alleles were identified again with corrected read frequencies higher than the genotyping threshold (>0.083 in the study). Then, all read frequencies of SSR alleles were recalculated so that the sum of read frequencies at a locus in a sample = 1. Afterwards, the corrected allele dosage was calculated as ploidy × read frequency of the SSR allele. Finally, the corrected allele dosage was rounded so that the total number of alleles at a locus in a sample = 6.
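The amplification efficiency correction and the final rounding step can be sketched as below. This is a hedged illustration rather than the SSRSeq implementation. In particular, the text only states that dosages are rounded so that they total the ploidy; the greedy rule used here (every called allele gets one copy, then remaining copies go to the allele furthest below its unrounded dosage) is our assumption. Function names and numbers are hypothetical.

```python
def unrounded_dosages(freqs, amp_ratios, ploidy=6):
    """Divide read frequencies by mean amplification ratios, renormalize, scale by ploidy."""
    adjusted = {a: f / amp_ratios.get(a, 1.0) for a, f in freqs.items()}
    total = sum(adjusted.values())
    return {a: ploidy * f / total for a, f in adjusted.items()}


def round_dosages(raw, ploidy=6):
    """Integer dosages that sum to `ploidy` (assumed greedy rule, see lead-in).

    Requires that the number of called alleles does not exceed the ploidy.
    """
    dosages = {a: 1 for a in raw}  # every called allele is present at least once
    for _ in range(ploidy - len(raw)):
        most_under = max(raw, key=lambda a: raw[a] - dosages[a])
        dosages[most_under] += 1
    return dosages


# Hypothetical sample: the longer allele (8 repeats) amplifies less efficiently.
freqs = {5: 0.62, 8: 0.38}
amp_ratios = {5: 1.15, 8: 0.80}  # mean AR = Do / De from high-quality samples

uncorrected = round_dosages({a: 6 * f for a, f in freqs.items()})
raw_corrected = unrounded_dosages(freqs, amp_ratios)
corrected = round_dosages(raw_corrected)
print(uncorrected, corrected)
```

With these made-up values the uncorrected read frequencies suggest a 4:2 genotype, whereas the corrected dosages give 3:3, which shows how unequal amplification alone can shift an inferred genotype.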

The output SSR genotype data in the output Excel file of SSRSeq are in GenoDive format and can be easily modified as input data for GenoDive analysis. The output SSR genotype data include SSR genotypes with corrected and uncorrected (showing only different alleles) allele dosages. In this study, the SSR genotype data set with corrected allele dosages of C. oleifera populations was generated for GenoDive analysis (Cui et al., 2021). In addition, another data set of SSR genotypes with uncorrected allele dosages (showing only different alleles) was generated for GenoDive analysis (Cui et al., 2021) to mimic the situation for conventional SSR genotyping methods.

2.6 | Genetic diversity

The two data sets of corrected and uncorrected allele dosages (Cui et al., 2021) were used for genetic diversity analysis with GenoDive version 2.0b27 (Meirmans & Van Tienderen, 2004). The number of alleles (A), effective number of alleles (Ae), observed heterozygosity (Ho), expected heterozygosity (He) and inbreeding coefficient (Gis) were calculated for each population and each locus. The effective number of alleles (Ae) is the number of alleles in a population weighted for their frequencies. The observed heterozygosity (Ho) estimated in GenoDive is “gametic heterozygosity” for polyploids, that is, the frequency of heterozygotes among randomly sampled diploid gametes (Moody et al., 1993). The expected heterozygosity (He) estimated in GenoDive is “gene diversity”, determined by calculating He in polyploids as in diploids, including a correction for sampling bias (Meirmans et al., 2018). Tests for Hardy-Weinberg equilibrium were performed using the heterozygosity-based Gis statistic with 9999 permutations. GenoDive was used to export data into SPAGeDi format. For uneven sample sizes of different populations, rarefied allelic richness was estimated as the expected allele number of minimal sample size with SPAGeDi 1.5d (Hardy & Vekemans, 2002).
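The effect of allele dosage on diversity statistics can be illustrated with a toy calculation. The sketch below uses the standard definitions Ae = 1/Σp² and He = 1 − Σp², without the sampling-bias correction that GenoDive applies, so the values are not directly comparable to GenoDive output. The genotypes and function names are hypothetical.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Allele frequencies from genotypes written as strings of allele symbols."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}


def effective_number_of_alleles(freqs):
    """Ae = 1 / sum(p_i^2)."""
    return 1.0 / sum(p * p for p in freqs.values())


def gene_diversity(freqs):
    """He = 1 - sum(p_i^2), without sampling-bias correction."""
    return 1.0 - sum(p * p for p in freqs.values())


# Two hypothetical hexaploids, with dosage and with only the different alleles shown.
with_dosage = allele_frequencies(["AAAAAB", "AAAABB"])
alleles_only = allele_frequencies(["AB", "AB"])

print(effective_number_of_alleles(with_dosage), gene_diversity(with_dosage))
print(effective_number_of_alleles(alleles_only), gene_diversity(alleles_only))
```

Ignoring dosage makes the two alleles look equally frequent, which raises both Ae and He in this toy case.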

2.7 | Genetic structure

The two data sets of corrected and uncorrected allele dosages (Cui et al., 2021) were used for genetic structure analysis. Principal component analysis (PCA) was performed using a covariance matrix between allele frequencies for individuals with GenoDive version 2.0b27 (Meirmans & Van Tienderen, 2004). GenoDive was also used to export data into the STRUCTURE format (Pritchard et al., 2000). ParallelStructure (Besnier & Glover, 2013; Pritchard et al., 2000) from the CIPRES Science Gateway (Miller et al., 2010) was used to infer the population genetic structure. The admixture model was used to determine the ancestry of individuals. The allele frequencies were assumed to be independent among populations. The population number (K) was evaluated from 1 to 10, and five replicate runs were carried out for each K. Each run had a burn-in period of 1,000,000, and there were 1,000,000 iterations after burn-in. STRUCTURE HARVESTER (Earl & vonHoldt, 2012) was used to determine the optimal K. For the optimal K, CLUMPP (Jakobsson & Rosenberg, 2007) was used to find the optimal alignments of the five replicate runs.

2.8 | Genetic differentiation

The two data sets of corrected and uncorrected allele dosages (Cui et al., 2021) were used for genetic differentiation analysis with GenoDive version 2.0b27 (Meirmans & Van Tienderen, 2004). Estimates of genetic differentiation between populations were calculated, including FST from the analysis of molecular variance (AMOVA) between each pair of populations (Michalakis & Excoffier, 1996) and Rho (independent of the ploidy level and inheritance pattern) (Ronfort et al., 1998). A paired t test was used to compare genetic differentiation estimates with corrected and uncorrected allele dosages. Mantel tests were performed to analyse correlations of genetic differentiation matrices with corrected and uncorrected allele dosages and between FST and Rho. Significance levels were generated with Bonferroni correction for multiple comparisons. For testing isolation by distance, linear regressions were performed between FST/(1−FST) and the natural logarithm of geographical distance between populations.
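The isolation-by-distance regression can be sketched with the standard library alone. The example below is illustrative and was not used in the study: the great-circle (haversine) distance is our choice of distance measure, the FST and distance pairs used for the regression are hypothetical, and the function names are made up. Only the coordinates of Lu and Jinggang Mountains and the FST value of 0.007 (Huang et al., 2018) come from the text.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def ibd_point(fst, distance_km):
    """Return (ln distance, FST / (1 - FST)) for one population pair."""
    return math.log(distance_km), fst / (1.0 - fst)


def ols(xs, ys):
    """Ordinary least squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Distance between Lu Mountain and Jinggang Mountain from the coordinates in the text.
lu_jg_km = haversine_km(29.60, 115.98, 26.55, 114.17)
print(round(lu_jg_km))

# Hypothetical (FST, distance in km) pairs, for illustration only.
pairs = [(0.010, 150.0), (0.020, 400.0), (0.035, 900.0)]
points = [ibd_point(fst, km) for fst, km in pairs]
slope, intercept = ols([x for x, _ in points], [y for _, y in points])
print(slope, intercept)
```

The computed Lu to Jinggang distance is close to the 380 km quoted in the Introduction; a positive regression slope would be consistent with isolation by distance.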

3 | RESULTS

3.1 | High- throughput sequencing- based

microsatellite genotyping

With high- throughput sequencing, the mean coverage was 5853,

and the median was 5832 per SSR locus per sample in our study.

Such high coverages ensured high precision of SSR read count and

frequency estimations. The output file of SSRSeq involved slip ratios

estimated for each SSR sequence at each locus. The SSR sequence

slip ratio generally increased with the increase in repeat number at

each locus, matching equation 2 (Figure S1). Before stutter peak correction, stutter peaks may be misidentified as SSR alleles and cause

biases in allele dosage estimations. The expected read frequency of

the stutter peak could be estimated with the slip ratio, and stutter

peak corrections were performed by subtracting the expected read

frequency of the stutter peak from the observed read frequency. On

the other hand, the output file of SSRSeq also involved amplification

ratios estimated for each SSR allele at each locus. The amplification

ratio generally decreased with the increase in repeat number at each

locus (Figure S2). Amplification efficiency corrections were per-

formed with the amplification ratio. The output file of SSRSeq had

both unrounded and rounded data for the corrected allele dosages.

The corrected allele dosages (unrounded) were in close agreement

with the expected allele dosages (rounded) in our study, showing

high SSR genotyping accuracy with the new method (Figure S3). Ten

DNA samples were used for both single and multiplex PCR of the 35

SSR markers. The PCR products were sequenced, and the results

obtained via single or multiplex PCR were genotyped separately

TABLE 2 Genetic diversity analysis of wild Camellia oleifera using two SSR datasets with corrected and uncorrected allele dosages

Corrected allele dosages
Site  A      Ae     Allelic richness  Ho     He     Gis
LU    6.686  2.846  5.20              0.534  0.557  0.041*
FJ    7.229  3.241  6.39              0.572  0.602  0.050*
MTS   6.571  3.227  6.30              0.567  0.607  0.065*
JG    8.200  3.340  6.55              0.588  0.608  0.032*
NL    7.571  3.273  6.24              0.536  0.603  0.111*
LF    6.514  3.080  5.80              0.554  0.576  0.038*

Uncorrected allele dosages
Site  A      Ae     Allelic richness  Ho     He     Gis
LU    6.686  3.886  4.50              0.851  0.695  −0.224*
FJ    7.229  4.407  5.29              0.869  0.739  −0.175*
MTS   6.571  4.341  5.26              0.877  0.743  −0.180*
JG    8.200  4.614  5.38              0.884  0.744  −0.188*
NL    7.571  4.302  5.14              0.831  0.722  −0.151*
LF    6.514  4.142  4.89              0.857  0.718  −0.193*

*Significant heterozygosity deficit (positive Gis value) or heterozygosity excess (negative Gis value) in tests for Hardy-Weinberg equilibrium.


with the new method. The genotyping results showed that the allele number identified was not significantly different between single and multiplex PCR of the 35 SSR markers (Figure S4).
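The two corrections and the final rounding step can be illustrated with a short Python sketch. It is not the SSRSeq implementation: it assumes a simplified one-step stutter model (a fraction of the reads from an n-repeat allele appears at n − 1 repeats), treats the slip and amplification ratios as known inputs rather than estimating them, and uses hypothetical read counts and ratios throughout.

```python
def correct_allele_dosages(observed, slip, amp, ploidy=6):
    """Sketch of stutter and amplification correction for one locus.

    observed: read counts keyed by repeat number.
    slip:     assumed fraction of reads from an n-repeat allele that appear
              as n-1 repeats (one-step stutter only).
    amp:      assumed amplification ratio of each allele relative to a reference.
    Returns (unrounded, rounded) dosages keyed by repeat number.
    """
    source_reads = {}  # reads attributed to each allele before stutter loss
    for n in sorted(observed, reverse=True):  # longest allele first
        # Expected stutter reads received from the allele one repeat longer.
        stutter_in = source_reads.get(n + 1, 0.0) * slip.get(n + 1, 0.0)
        kept = max(observed[n] - stutter_in, 0.0)
        source_reads[n] = kept / (1.0 - slip[n])
    # Undo unequal amplification, then scale frequencies to the ploidy level.
    templates = {n: source_reads[n] / amp[n] for n in source_reads}
    total = sum(templates.values())
    unrounded = {n: ploidy * t / total for n, t in templates.items()}
    rounded = {n: round(d) for n, d in unrounded.items()}
    return unrounded, rounded


# Hypothetical locus: reads simulated from true dosages 3, 2 and 1 for
# alleles with 10, 11 and 12 repeats; the 9-repeat reads are pure stutter.
observed = {9: 150, 10: 3030, 11: 1780, 12: 640}
slip = {9: 0.04, 10: 0.05, 11: 0.10, 12: 0.20}
amp = {9: 1.0, 10: 1.0, 11: 0.9, 12: 0.8}
unrounded, rounded = correct_allele_dosages(observed, slip, amp)
print(rounded)
```

With real data the unrounded dosages will not be exact integers, and independently rounded values need not sum to the ploidy level, so the agreement between unrounded and rounded dosages serves as a check on genotyping accuracy.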

3.2 | Genetic diversity

The results of genetic diversity analysis are shown in Table 2. The

number of alleles (A) was the same between data sets with corrected

and uncorrected allele dosages. However, the effective number of

alleles (Ae) was higher with uncorrected allele dosages. Allelic rich-

ness was higher with corrected allele dosages. With corrected allele

dosages, observed heterozygosity values were all lower than ex-

pected heterozygosity values and inbreeding coefficient values were

all positive, showing significant heterozygosity deficits (Figure 3a).

With uncorrected allele dosages, observed heterozygosity values were all higher than expected heterozygosity values, and inbreeding coefficient values were all negative, showing significant heterozygosity excesses (Figure 3b). The observed/expected heterozygosity

values with uncorrected allele dosages were all higher than those

with corrected allele dosages (Figure 3). The genetic diversity of wild

C. oleifera at Jinggang Mountain (JG) was the highest and decreased

from the distribution centre to the northern/southern distribution

range of wild C. oleifera. The genetic diversity of wild C. oleifera was

the lowest at Lu Mountain (LU) in the northern distribution range of

wild C. oleifera.
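The sign change in the inbreeding coefficient can be checked directly from the heterozygosity values in Table 2. The Python sketch below assumes the usual relation Gis = 1 − Ho/He (an assumption about how the reported values were derived) and compares the result with the Gis values reported in the table; small differences are expected because the tabulated values are rounded.

```python
# Ho and He from Table 2:
# (corrected Ho, corrected He, uncorrected Ho, uncorrected He)
table2 = {
    "LU": (0.534, 0.557, 0.851, 0.695),
    "FJ": (0.572, 0.602, 0.869, 0.739),
    "MTS": (0.567, 0.607, 0.877, 0.743),
    "JG": (0.588, 0.608, 0.884, 0.744),
    "NL": (0.536, 0.603, 0.831, 0.722),
    "LF": (0.554, 0.576, 0.857, 0.718),
}

# Gis values reported in Table 2 (corrected, uncorrected), for comparison.
reported = {
    "LU": (0.041, -0.224),
    "FJ": (0.050, -0.175),
    "MTS": (0.065, -0.180),
    "JG": (0.032, -0.188),
    "NL": (0.111, -0.151),
    "LF": (0.038, -0.193),
}


def gis(ho, he):
    """Inbreeding coefficient under the assumed relation Gis = 1 - Ho/He."""
    return 1.0 - ho / he


computed = {
    site: (gis(ho_c, he_c), gis(ho_u, he_u))
    for site, (ho_c, he_c, ho_u, he_u) in table2.items()
}
for site, (g_c, g_u) in computed.items():
    print(f"{site}: corrected {g_c:.3f}, uncorrected {g_u:.3f}")
```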

3.3 | Genetic structure

The PCA showed that most individuals of the LU population at

the highest latitude were differentiated from individuals of other

populations with corrected or uncorrected allele dosages (Figure

S5). Only with corrected allele dosages were most individuals of

the LF population at the lowest latitude separated from individu-

als of other populations, and the latter were mixed and located

between the LU and LF populations (Figure S5). With corrected or

uncorrected allele dosages, the optimal K was 2 (Figure 2d,e). Only

with corrected allele dosage did a secondary peak occur at K = 5

(Figure 2d). The results of STRUCTURE analyses (K = 2) showed

more gradual changes in genetic structure along latitudes with

corrected allele dosages (Figure 2a) than those with uncorrected

allele dosages (Figure 2b). The LU population at the highest latitude was the most different from the others. From high to low latitudes, the genetic structure with corrected allele dosages shifted

toward the genetic structure of the LF population at the lowest

latitude (Figure 2a), similar to the results of the PCA (Figure S5).

With uncorrected allele dosages, all populations except for the LU

population showed more or less the same genetic structure despite geographical location (Figure 2b). In addition, more genetic clusters

(K = 5) were found with corrected allele dosages showing finer ge-

netic structures of wild populations (Figure 2c). In addition to the

distinguished LU population in the northern distribution range, the

genetic structures of the LF population at the lowest latitude and

the FJ population in the western distribution range were clearly

separated. The genetic structures of the MTS, JG and NL popula-

tions showed similarly mixed clusters.

3.4 | Genetic differentiation

FST estimates with corrected allele dosages (mean FST = 0.026)

were significantly higher (p = .001) than those with uncorrected al-

lele dosages (mean FST = 0.009) (Table 3). In addition, the Mantel

test indicated that the correlation was insignificant between FST

estimates with corrected and uncorrected allele dosages (r = .604,

p = .070). With corrected allele dosages, FST was the highest (0.067)

between populations at the highest (LU, 29.60°N) and the lowest (LF,

23.28°N) latitudes (Table 3). However, with uncorrected allele dosages, FST between LU and LF (0.013) was the same as that between

LU and MTS (27.73°N). With corrected/uncorrected allele dosages,

linear regression was insignificant (corrected: p = .427; uncorrected:

FIGURE 3 Observed and expected heterozygosity estimates of wild Camellia oleifera populations. (a) Estimates with corrected allele dosage. (b) Estimates with uncorrected allele dosage. Solid circles indicate observed heterozygosity (Ho) estimates. Hollow circles indicate expected heterozygosity (He) estimates. From left to right, populations are sorted from high to low latitudes (LU, FJ, MTS, JG, NL, LF). [Figure not reproduced; x-axis: Population, y-axis: Heterozygosity (0.5 to 0.9)]


p = .538) between FST/(1−FST) and the natural logarithm of geographical distance between populations (Figure 4).

Rho estimates with corrected allele dosages (mean Rho = 0.087)

were not significantly different from those with uncorrected allele

dosages (mean Rho = 0.076) (p = .103). With corrected allele dos-

ages, FST estimates were significantly correlated with Rho estimates

(r = .950, p = .004); with uncorrected allele dosages, the correlation

was insignificant (r = .440, p = .146).
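The reported means and the direction of the paired comparison can be reproduced from the rounded FST values in Table 3. The Python sketch below reads the lower triangle column by column and computes a paired t statistic; it yields roughly 4.2 on 14 degrees of freedom, which is in line with the reported p = .001. The original analysis may have used unrounded estimates, so exact agreement is not expected.

```python
from math import sqrt
from statistics import mean, stdev

# Pairwise FST from the lower triangle of Table 3, read column by column
# (LU column first, then FJ, MTS, JG and NL).
fst_corrected = [0.033, 0.030, 0.019, 0.031, 0.067,
                 0.014, 0.015, 0.015, 0.040,
                 0.004, 0.003, 0.040,
                 0.007, 0.041,
                 0.039]
fst_uncorrected = [0.010, 0.013, 0.006, 0.007, 0.013,
                   0.010, 0.006, 0.008, 0.010,
                   0.007, 0.011, 0.012,
                   0.003, 0.011,
                   0.012]


def paired_t(a, b):
    """Paired t statistic for two equally long lists of estimates."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


mean_corrected = mean(fst_corrected)
mean_uncorrected = mean(fst_uncorrected)
t_stat = paired_t(fst_corrected, fst_uncorrected)
print(f"mean FST corrected={mean_corrected:.4f}, "
      f"uncorrected={mean_uncorrected:.4f}, t={t_stat:.2f}")
```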

4 | DISCUSSION

Conventional molecular methods cannot accurately identify the

SSR genotype of polyploids. Thus, codominant SSR genotypes in

polyploids may have to be treated as dominant data losing valuable

information in subsequent analyses (Dufresne et al., 2014). On the

other hand, software such as GenoDive can handle polyploid SSR

genotypes with unknown allele dosages and perform correction of

allele dosages using a maximum likelihood method based on random

mating within populations modified from De Silva et al. (2005). Since

actual allele frequencies are unknown in the correction, biases may

be introduced to population differentiation and structure analyses.

Methods have been developed to directly infer polyploid genotypes

based on ratios between SSR allele peak areas, for example, the

microsatellite DNA allele counting-peak ratios (MAC-PR) method

(Esselink et al., 2004). However, ratios between SSR allele peak areas

of capillary electrophoresis may not represent actual ratios of SSR

alleles, especially if they do not account for the stutter peak and

amplification efficiency of SSR alleles.

If allele dosages are uncertain in polyploid SSR genotypes, SSR

allele frequency estimation is biased. Population genetic analyses

based on biased allele frequencies may also be biased. Based on

model simulation, when allele dosage information is missing, ob-

served heterozygosity estimates in tetraploid populations are much

higher than true values, while expected heterozygosity estimates

are slightly higher than true values (Meirmans et al., 2018). With

allele dosage uncertainty, statistical testing for Hardy-Weinberg

equilibrium is not possible for polyploids (Meirmans et al., 2018).

For genetic structure analysis, structure is well suited for analysing polyploids (Meirmans et al., 2018). In simulated mixed-ploidy

populations, structure is more robust than other clustering meth-

ods (Stift et al., 2019). Especially when population differentiation is

weak, structure is the only method that allows unbiased inference

with limited genotypic information of codominant markers with un-

known allele dosages or dominant markers (Stift et al., 2019). For

genetic differentiation estimates, missing dosage information leads

to overestimation of genetic diversity within populations and con-

sequently underestimation of the degree of population differenti-

ation (Meirmans et al., 2018). To estimate genetic differentiation in

polyploids, Rho may be the statistic of choice, as it is generally un-

biased with allele dosage uncertainty, independent of ploidy level

and mode of inheritance, and closely related to FST (Meirmans et al.,

2018; Meirmans & Van Tienderen, 2013).
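A small worked example shows why missing dosage information inflates within-population diversity. The Python sketch below uses a single hypothetical hexaploid genotype and the standard definition of the effective number of alleles, Ae = 1/Σp². Weighting every observed allele equally when dosage is unknown is an illustrative assumption and is not the maximum likelihood correction used by GenoDive.

```python
from collections import Counter


def effective_alleles(freqs):
    """Effective number of alleles, Ae = 1 / sum(p_i^2)."""
    return 1.0 / sum(p * p for p in freqs)


genotype = "AAAAAB"  # hypothetical hexaploid genotype at one locus

# With known dosage, allele frequencies follow the copy numbers (5/6 and 1/6).
counts = Counter(genotype)
with_dosage = [c / len(genotype) for c in counts.values()]

# Without dosage, only allele presence is known; each observed allele is
# weighted equally here (an illustrative assumption).
without_dosage = [1 / len(counts)] * len(counts)

ae_true = effective_alleles(with_dosage)
ae_naive = effective_alleles(without_dosage)
print(round(ae_true, 3), ae_naive)
```

The rare allele is over-weighted when dosage is ignored, so allele frequencies look more even and Ae rises, which matches the direction of the difference in Ae between the two data sets in Table 2.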

TABLE 3 Genetic differentiation between populations of wild Camellia oleifera using two SSR data sets with corrected and uncorrected allele dosages. FST estimates (with corrected/uncorrected allele dosages) are in the lower triangle, and Rho estimates (with corrected/uncorrected allele dosages) are in the upper triangle. The p-values are indicated in brackets (p < .01 in bold)

Site LU                           FJ                           MTS                          JG                           NL                           LF
LU   -                            0.110/0.120                  0.097/0.087                  0.069/0.093                  0.143/0.112                  0.222/0.170
FJ   0.033 (0.001)/0.010 (0.001)  -                            0.034/0.029                  0.039/0.049                  0.052/0.047                  0.122/0.089
MTS  0.030 (0.001)/0.013 (0.001)  0.014 (0.001)/0.010 (0.177)  -                            0.002/0.020                  0.025/0.025                  0.113/0.096
JG   0.019 (0.001)/0.006 (0.001)  0.015 (0.001)/0.006 (0.030)  0.004 (0.003)/0.007 (0.307)  -                            0.049/0.026                  0.132/0.081
NL   0.031 (0.001)/0.007 (0.001)  0.015 (0.001)/0.008 (0.041)  0.003 (0.113)/0.011 (0.307)  0.007 (0.001)/0.003 (0.012)  -                            0.093/0.099
LF   0.067 (0.001)/0.013 (0.001)  0.040 (0.001)/0.010 (0.001)  0.040 (0.001)/0.012 (0.001)  0.041 (0.001)/0.011 (0.001)  0.039 (0.001)/0.012 (0.001)  -


Our study has developed a new high-throughput sequencing-based microsatellite genotyping method (Figure 1) to directly resolve

allele dosage uncertainty in polyploids using hexaploid wild C. oleifera

as a case study. As an alternative to multiplex PCR, one may perform

single PCR for each SSR marker and mix the products for sequenc-

ing. However, with the increases in number of markers and sample

size, the labour needed for single PCR and post-PCR multiplexing

will dramatically increase much more than that required for multi-

plex PCR. Our study demonstrated that with optimization, the allele

number identified was not significantly different between single and

multiplex PCRs of the 35 SSR markers (Figure S4). Therefore, we

propose to use multiplex PCR with optimization in the method. For

100 SSR markers, the cost of multiplex PCR, 5000× high-throughput sequencing and data analysis is approximately 30 U.S. dollars per sample or 0.3 U.S. dollars per genotype. The typical genotyping-by-sequencing (GBS) method can generate data for many more markers, so the cost per genotype is much lower. Nevertheless, the cost

per sample of the GBS method is generally several times higher than

that of our method. Most importantly, our method feasibly provides accurate SSR genotypes for up to hundreds of SSR markers in hundreds or thousands of polyploid samples for genetic diversity analysis. Perl scripts and an online SSR genotyping tool, SSRSeq V1.1,

are provided to output accurate polyploid genotypes with the new

method. Compared with capillary electrophoresis, high-throughput

sequencing of deep coverage enables more accurate estimation of

SSR sequence amounts and frequencies. Moreover, specific correc-

tions are introduced for the stutter peak and amplification efficiency

of SSR sequences. The results of hexaploid C. oleifera showed that

SSR sequences with higher repeat numbers had a higher ratio of

stutter peaks (Figure S1) and may lead to errors and biases in SSR

allele identification and dosage estimation. The slip ratio model pro-

posed in the study nicely represented the actual SSR sequencing

data and therefore provided solid stutter peak correction of SSR sequence frequency. In addition, we found that SSR alleles with higher

repeat numbers may have lower amplification efficiency (Figure S2);

therefore, amplification efficiency corrections must be performed.

Using the new method, accurate hexaploid genotypes of C. oleifera

with corrected allele dosages were obtained in the study (Figure S3).

These enabled direct comparisons of population genetic analyses

with corrected and uncorrected allele dosages. The results of our

study demonstrated that, with corrected and uncorrected allele dosages, genetic diversity, structure and differentiation estimates and

inferences were considerably different.

Similar to the results of model simulations by Meirmans et al.

(2018), with uncorrected allele dosages, observed heterozygosity estimates were abnormally high (>0.8) and significantly higher than expected heterozygosity estimates, and both were higher

than those with corrected allele dosages (Figure 3). Using eight

highly polymorphic microsatellite markers with the traditional

capillary electrophoresis method, Huang et al. (2018) found similarly high observed heterozygosity in wild C. oleifera of the Lu and

Jinggang Mountains, and some loci had observed heterozygosity

equal to 1. The authors argued that such high observed heterozy-

gosity suggested that C. oleifera was an allopolyploid with disomic

inheritance. However, in this study, with corrected allele dosages,

observed heterozygosity estimates (<0.6) were significantly lower

than expected heterozygosity, indicating significant heterozygosity

deficits in all populations. Thus, for hexaploid wild C. oleifera, the

genetic diversity estimates with uncorrected allele dosages were

seriously overestimated, especially for observed heterozygosity,

resulting in unrealistic inferences in the previous study. Wild C.

oleifera outcrosses through insect pollination, and its seeds are dis-

persed via small rodents in forests (Huang et al., 2018; Xiao et al.,

2004). With limited gene dispersal within populations, observed

heterozygosity estimates should be significantly lower than ex-

pected heterozygosity, as indicated by the results with corrected

allele dosage. Moreover, our study supported the “central-marginal hypothesis”, which states that across geographical ranges of species, within-population genetic diversity declines from the centre to

the periphery, although the differences were small in wild C. oleifera

(Figure 3), as in most cases in previous studies (Eckert et al., 2008).

Our study demonstrates that resolving allele dosage uncertainty

FIGURE 4 Relationships between FST/(1−FST) and the natural logarithm of geographical distance between populations. (a) Estimates with corrected allele dosage (y = 0.0088x − 0.0267, R² = 0.0491). (b) Estimates with uncorrected allele dosage (y = 0.0011x + 0.0027, R² = 0.0299). Linear regression lines and equations with R² are shown. [Figure not reproduced; x-axis: Ln (geographical distance), y-axis: FST/(1−FST)]


using our new method can achieve accurate estimates of genetic

diversity for polyploids.

Although strong genetic structure could be distinguished even

with uncorrected allele dosages, subtle genetic structures could be

discovered among populations only with corrected allele dosages

(Figure 2). The wild C. oleifera population in Lu Mountain (LU) at the

highest latitude in the study was the most differentiated in genetic

structure with corrected and uncorrected allele dosages (Figure 2).

Lu Mountain is in the northern periphery of wild C. oleifera, adjacent

to the Yangtze River in the north and next to Poyang Lake in the

east and south, and isolated from other wild C. oleifera populations.

Adaptation isolation by cold climate conditions together with geo-

graphical isolation might lead to distinct genetic structures (Zhao

et al., 2013). With corrected allele dosages, the southern peripheral

population of wild C. oleifera in Luofu Mountain (LF) was distin-

guished in terms of genetic structure. Again, adaptation isolation by

warm climate conditions and geographical isolation from other pop-

ulations by Nanling Mountain might lead to distinct genetic struc-

tures. Our study indicates that resolving allele dosage uncertainty

is essential for discovering subtle genetic structures in polyploids.

As indicated in model simulations by Meirmans et al. (2018), the

classical FST estimates in our study were all very low with uncor-

rected allele dosages (Table 3), underestimating genetic differentia-

tion between wild C. oleifera populations compared to the estimates

with corrected allele dosages. With corrected allele dosages, FST was

the highest between the northern and southern peripheral popula-

tions, similar to the results of genetic structure analysis (Figure 2).

However, with uncorrected allele dosages, the FST between the

northern and southern peripheral populations was the same as that

between adjacent populations (Table 3), showing considerable bi-

ases. With corrected and uncorrected allele dosages, the patterns

of isolation-by-distance were insignificant, although with corrected

allele dosages, a slightly increased trend in FST/(1−FST) was detected

with the increase in the natural logarithm of geographical distance

between populations (Figure 4). The insignificance of isolation-by-distance may be due to the small number of populations in the study.

According to the “central-marginal hypothesis”, in addition to the declines in within-population genetic diversity from the centre of the geographical range to the periphery, among-population differentiation increases from the centre to the periphery (Eckert et al.,

2008). Again, the results of genetic structure and differentiation in

our study supported this hypothesis. Most importantly, our study

demonstrates that resolving allele dosage uncertainty can improve

FST estimates for polyploids.

Huang et al. (2018) showed that Rho could discriminate genetic

differentiation between and within hexaploid wild C. oleifera populations using the traditional microsatellite genotyping method. In our study, we confirmed that, with corrected and uncorrected allele dosages, Rho estimates showed similar genetic differentiation patterns

between wild C. oleifera populations correlated to FST estimates with

corrected allele dosages. However, the interpretation of Rho is dif-

ferent from that of FST (Meirmans & Van Tienderen, 2013). The Rho

estimate corresponds to the FST estimate for a haploid species with

the same population size and migration rate; therefore, for hexaploid

wild C. oleifera, the Rho estimates were consistently higher than the

FST estimates, as indicated by model simulations (Meirmans & Van

Tienderen, 2013).

In summary, our study demonstrated that with uncorrected al-

lele dosages, genetic diversity, structure and differentiation anal-

yses were considerably biased in hexaploid wild C. oleifera. The

new high-throughput sequencing-based microsatellite genotyping

method established in the study can resolve allele dosage uncer-

tainty and considerably improve genetic diversity, structure and

differentiation analyses for polyploids. The genetic variation pat-

terns of wild C. oleifera across geographical ranges agree with the

“central-marginal hypothesis”, stating that genetic diversity is high in the central population and declines from the central to peripheral

populations, and genetic differentiation increases from the centre

to the periphery. In future studies, more populations of wild C. oleif-

era across geographical ranges are needed to verify the findings

and discover the underlying mechanisms generating such genetic

variation patterns.

ACKNOWLEDGEMENTS

This work was supported by the National Key Research and

Development Program of China (No. 2018YFD1000603), the

National Natural Science Foundation of China (NSFC Grant No.

31870311) and the “Gan-Po Talent 555” Project of Jiangxi Province, China. We thank Jinxia Fu and colleagues at the Centre for Genetic & Genomic Analysis, Genesky Biotechnologies Inc., Shanghai for support in the development of the high-throughput sequencing-based microsatellite genotyping method. We are grateful for the valuable comments of the editors and reviewers, which helped to improve the manuscript considerably. Jun Rong would like to thank Professor Peter G. L. Klinkhamer and Dr. Klaas Vrieling of Leiden University and Dr. Patrick G. Meirmans of the University of Amsterdam for motivating him to develop such an efficient molecular method in polyploids.

AUTHOR CONTRIBUTIONS

Xiangyan Cui and Jun Rong designed and performed the experi-

ments, analysed the data and wrote the manuscript. Caihua Li, Yao

Zhao, Shengyuan Qin and Zebin Huang contributed to the experi-

ments, data analyses and writing. Bin Gan, Zhengwen Jiang, Xiaomao

Huang and Xiaoqiang Yang contributed to the experiments and data

analyses. Qin Li, Xiaoguo Xiang and Jiakuan Chen contributed to

writing the manuscript.

DATA AVAILABILITY STATEMENT

Microsatellite genotyping data with corrected and uncorrected allele

dosages of wild C. oleifera populations in the study have been made

available on Dryad (https://doi.org/10.5061/dryad.t4b8gthxd).

ORCID

Jun Rong https://orcid.org/0000-0003-1408-2898


REFERENCES

Andrew, R. L., Bernatchez, L., Bonin, A., Buerkle, C. A., Carstens, B. C., Emerson, B. C., Garant, D., Giraud, T., Kane, N. C., Rogers, S. M., Slate, J., Smith, H., Sork, V. L., Stone, G. N., Vines, T. H., Waits, L., Widmer, A., & Rieseberg, L. H. (2013). A road map for molecular ecology. Molecular Ecology, 22, 2605–2626. https://doi.org/10.1111/mec.12319

Besnier, F., & Glover, K. A. (2013). ParallelStructure: A R package to distribute parallel runs of the population genetics program STRUCTURE on multi-core computers. PLoS One, 8, e70651. https://doi.org/10.1371/journal.pone.0070651

Chalhoub, B., Denoeud, F., Liu, S., Parkin, I. A. P., Tang, H., Wang, X., Chiquet, J., Belcram, H., Tong, C., Samans, B., Correa, M., Da Silva, C., Just, J., Falentin, C., Koh, C. S., Le Clainche, I., Bernard, M., Bento, P., Noel, B., … Wincker, P. (2014). Early allopolyploid evolution in the post-Neolithic Brassica napus oilseed genome. Science, 345, 950–953. https://doi.org/10.1126/science.1253435

Cui, X., Qin, S., Huang, X., Yang, X., & Rong, J. (2021). Microsatellite genotypes of Camellia oleifera for GenoDive analysis. Dryad, Dataset, https://doi.org/10.5061/dryad.t4b8gthxd

Cui, X., Huang, X., Chen, J., Yang, X., & Rong, J. (2018). An efficient method for developing polymorphic microsatellite markers from high-throughput transcriptome sequencing: a case study of hexaploid oil-tea camellia (Camellia oleifera). Euphytica, 214, 26. https://doi.org/10.1007/s10681-018-2114-6

De Barba, M., Miquel, C., Lobréaux, S., Quenette, P. Y., Swenson, J. E., & Taberlet, P. (2017). High-throughput microsatellite genotyping in ecology: Improved accuracy, efficiency, standardization and success with low-quantity and degraded DNA. Molecular Ecology Resources, 17, 492–507. https://doi.org/10.1111/1755-0998.12594

De Silva, H. N., Hall, A. J., Rikkerink, E., McNeilage, M. A., & Fraser, L. G. (2005). Estimation of allele frequencies in polyploids under certain patterns of inheritance. Heredity, 95, 327–334. https://doi.org/10.1038/sj.hdy.6800728

Dufresne, F., Stift, M., Vergilino, R., & Mable, B. K. (2014). Recent progress and challenges in population genetics of polyploid organisms: An overview of current state-of-the-art molecular and statistical tools. Molecular Ecology, 23, 40–69. https://doi.org/10.1111/mec.12581

Earl, D. A., & vonHoldt, B. M. (2012). STRUCTURE HARVESTER: A website and program for visualizing STRUCTURE output and implementing the Evanno method. Conservation Genetics Resources, 4, 359–361. https://doi.org/10.1007/s12686-011-9548-7

Eckert, C. G., Samis, K. E., & Lougheed, S. C. (2008). Genetic variation across species’ geographical ranges: The central-marginal hypothesis and beyond. Molecular Ecology, 17, 1170–1188. https://doi.org/10.1111/j.1365-294X.2007.03659.x

Esselink, G. D., Nybom, H., & Vosman, B. (2004). Assignment of allelic configuration in polyploids using the MAC-PR (microsatellite DNA allele counting—peak ratios) method. Theoretical and Applied Genetics, 109, 402–408.

Hardy, O. J., & Vekemans, X. (2002). SPAGeDi: A versatile computer program to analyse spatial genetic structure at the individual or population levels. Molecular Ecology Notes, 2, 618–620. https://doi.org/10.1046/j.1471-8286.2002.00305.x

Huang, X., Chen, J., Yang, X., Duan, S., Long, C., Ge, G., & Rong, J. (2018). Low genetic differentiation among altitudes in wild Camellia oleifera, a subtropical evergreen hexaploid plant. Tree Genetics & Genomes, 14, 21. https://doi.org/10.1007/s11295-018-1234-4

International Wheat Genome Sequencing Consortium. (2014). A

chromosome- based draft sequence of the hexaploid bread wheat

(Triticum aestivum) genome. Science, 345, 1251788. https://doi.org/10.1126/science.1251788

Jakobsson, M., & Rosenberg, N. A. (2007). CLUMPP: A cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure. Bioinformatics, 23, 1801–1806. https://doi.org/10.1093/bioinformatics/btm233

Ma, J., Ye, H., Rui, Y., Chen, G., & Zhang, N. (2011). Fatty acid composition of Camellia oleifera oil. Journal für Verbraucherschutz und Lebensmittelsicherheit, 6, 9–12. https://doi.org/10.1007/s00003-010-0581-3

Meirmans, P. G., Liu, S., & van Tienderen, P. H. (2018). The analysis of polyploid genetic data. Journal of Heredity, 109, 283–296. https://doi.org/10.1093/jhered/esy006

Meirmans, P. G., & Van Tienderen, P. H. (2004). GENOTYPE and GENODIVE: Two programs for the analysis of genetic diversity of asexual organisms. Molecular Ecology Notes, 4, 792–794. https://doi.org/10.1111/j.1471-8286.2004.00770.x

Meirmans, P., & Van Tienderen, P. (2013). The effects of inheritance in tetraploids on genetic diversity and population divergence. Heredity, 110, 131–137. https://doi.org/10.1038/hdy.2012.80

Michalakis, Y., & Excoffier, L. (1996). A generic estimation of population subdivision using distances between alleles with special reference for microsatellite loci. Genetics, 142, 1061–1064. https://doi.org/10.1093/genetics/142.3.1061

Miller, M. A., Pfeiffer, W., & Schwartz, T. (2010). Creating the CIPRES Science Gateway for inference of large phylogenetic trees. In Proceedings of the Gateway Computing Environments Workshop (GCE) (pp. 1–8). New Orleans, LA.

Ming, T. L. (2000). Monograph of the genus Camellia. Yunnan Science and Technology Press.

Moody, M. E., Mueller, L. D., & Soltis, D. E. (1993). Genetic variation and

random drift in autotetraploid populations. Genetics, 134, 649– 657.

Pritchard, J. K., Stephens, M., & Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155, 945–959.

Renny-Byfield, S., & Wendel, J. F. (2014). Doubling down on genomes: Polyploidy and crop plants. American Journal of Botany, 101, 1711–1725. https://doi.org/10.3732/ajb.1400119

Ronfort, J., Jenczewski, E., Bataillon, T., & Rousset, F. (1998). Analysis of population structure in autotetraploid species. Genetics, 150, 921–930. https://doi.org/10.1093/genetics/150.2.921

Stift, M., Kolář, F., & Meirmans, P. G. (2019). STRUCTURE is more robust than other clustering methods in simulated mixed-ploidy populations. Heredity, 123, 429–441. https://doi.org/10.1038/s41437-019-0247-6

The Potato Genome Sequencing Consortium. (2011). Genome sequence

and analysis of the tuber crop potato. Nature, 475, 189– 195.

Vartia, S., Villanueva-Cañas, J. L., Finarelli, J., Farrell, E. D., Collins, P. C., Hughes, G. M., Carlsson, J. E. L., Gauthier, D. T., McGinnity, P., Cross, T. F., FitzGerald, R. D., Mirimin, L., Crispie, F., Cotter, P. D., & Carlsson, J. (2016). A novel method of microsatellite genotyping-by-sequencing using individual combinatorial barcoding. Royal Society Open Science, 3, 150565. https://doi.org/10.1098/rsos.150565

Wood, T. E., Takebayashi, N., Barker, M. S., Mayrose, I., Greenspoon, P. B., & Rieseberg, L. H. (2009). The frequency of polyploid speciation in vascular plants. Proceedings of the National Academy of Sciences of the United States of America, 106, 13875–13879. https://doi.org/10.1073/pnas.0811575106

Xiao, Z., Zhang, Z., & Wang, Y. (2004). Impacts of scatter-hoarding rodents on restoration of oil tea Camellia oleifera in a fragmented forest. Forest Ecology and Management, 196, 405–412. https://doi.org/10.1016/j.foreco.2004.04.001

Yang, J., Zhang, J., Han, R., Zhang, F., Mao, A., Luo, J., Dong, B., Liu, H., Tang, H., Zhang, J., & Wen, C. (2019). Target SSR-Seq: A novel SSR genotyping technology associate with perfect SSRs in genetic analysis of cucumber varieties. Frontiers in Plant Science, 10, 531. https://doi.org/10.3389/fpls.2019.00531


Zhao, Y., Vrieling, K., Liao, H., Xiao, M., Zhu, Y., Rong, J., Zhang, W., Wang, Y., Yang, J., Chen, J., & Song, Z. (2013). Are habitat fragmentation, local adaptation and isolation-by-distance driving population divergence in wild rice Oryza rufipogon? Molecular Ecology, 22, 5531–5547.

Zhuang, R. L. (2008). Oil-tea camellia in China (2nd ed.). China Forestry Publishing House.

SUPPORTING INFORMATION

Additional supporting information may be found online in the Supporting Information section.

How to cite this article: Cui, X., Li, C., Qin, S., Huang, Z., Gan, B., Jiang, Z., Huang, X., Yang, X., Li, Q., Xiang, X., Chen, J., Zhao, Y., & Rong, J. (2022). High-throughput sequencing-based microsatellite genotyping for polyploids to resolve allele dosage uncertainty and improve analyses of genetic diversity, structure and differentiation: A case study of the hexaploid Camellia oleifera. Molecular Ecology Resources, 22, 199–211. https://doi.org/10.1111/1755-0998.13469
