Mol Ecol Resour. 2022;22:199–211. wileyonlinelibrary.com/journal/men
© 2021 John Wiley & Sons Ltd
Received: 3 March 2020 | Revised: 8 July 2021 | Accepted: 12 July 2021
DOI: 10.1111/1755-0998.13469
RESOURCE ARTICLE
High-throughput sequencing-based microsatellite genotyping
for polyploids to resolve allele dosage uncertainty and improve
analyses of genetic diversity, structure and differentiation:
A case study of the hexaploid Camellia oleifera
Xiangyan Cui1 | Caihua Li2 | Shengyuan Qin1 | Zebin Huang2 | Bin Gan2 |
Zhengwen Jiang3 | Xiaomao Huang1 | Xiaoqiang Yang1 | Qin Li4 | Xiaoguo Xiang1 |
Jiakuan Chen1,4 | Yao Zhao1,5 | Jun Rong1,5
Xiangyan Cui and Caihua Li contributed equally to this work.
1Jiangxi Province Key Laboratory of Watershed Ecosystem Change and Biodiversity, Center for Watershed Ecology, Institute of Life Science and School of Life Sciences, Nanchang University, Nanchang, China
2Center for Genetic & Genomic Analysis, Genesky Biotechnologies Inc, Shanghai, China
3Genesky Diagnostics (Suzhou) Inc., Suzhou, China
4Fudan Development Institute, Fudan University, Shanghai, China
5Lushan Botanical Garden, Chinese Academy of Sciences, Lushan, China
Correspondence
Jun Rong and Yao Zhao, Jiangxi Province Key Laboratory of Watershed Ecosystem Change and Biodiversity, Center for Watershed Ecology, Institute of Life Science and School of Life Sciences, Nanchang University, Nanchang, China.
Emails: rong_jun@hotmail.com and yaozhao@ncu.edu.cn
Funding information
National Key Research and Development Program of China, Grant/Award Number: 2018YFD1000603; National Natural Science Foundation of China, Grant/Award Number: 31870311; “Gan-Po Talent 555” Project of Jiangxi Province, China
Abstract
Conventional microsatellite (simple sequence repeat, SSR) genotyping methods cannot accurately identify polyploid genotypes, leading to allele dosage uncertainty and introducing biases in population genetic analysis. Here, a new SSR genotyping method was developed to directly infer accurate polyploid genotypes. The frequency distribution of SSR sequences was obtained based on deep-coverage high-throughput sequencing data. Corrections were performed accounting for the “stutter peak” and amplification efficiency of SSR sequences. Perl scripts and an online SSR genotyping tool “SSRSeq” were provided to process the sequencing data and output genotypes with corrected allele dosages. Hexaploid Camellia oleifera is the dominant woody oilseed crop in China. Understanding the geographical pattern of genetic variation in wild C. oleifera is essential for the conservation and utilization of genetic resources. Six wild C. oleifera populations were sampled across geographical ranges in subtropical evergreen broadleaf forests of China. Using 35 SSR markers, the high-throughput sequencing-based SSRSeq method was applied to obtain accurate hexaploid genotypes of wild C. oleifera. The results demonstrated that the new method could resolve allele dosage uncertainty and considerably improve genetic diversity, structure and differentiation analyses for polyploids. The genetic variation patterns of wild C. oleifera across geographical ranges agree with the “central-marginal hypothesis”, which states that genetic diversity is high in the central population and declines from the central to the peripheral populations, and that genetic differentiation increases from the centre to the periphery. This method and these findings can facilitate the utilization of wild C. oleifera genetic resources for the breeding of cultivated C. oleifera.
KEYWORDS
allele dosage, Camellia oleifera, genetic differentiation, genetic diversity, genetic structure,
polyploid
1 | INTRODUCTION
Polyploidy plays an important role in the diversification of angio-
sperms. Approximately 15% of angiosperm speciation events are
accompanied by ploidy increases, and approximately 35% of angio-
sperm species are polyploids (Wood et al., 2009). Because polyploids
are often accompanied by heterosis and gene redundancy and may
grow larger, more quickly, and with higher yields compared with their
diploid relatives, polyploidy has also facilitated the domestication
and improvement of crops (Renny-Byfield & Wendel, 2014). Many
important crops are polyploid. For instance, potato (Solanum tubero-
sum) is tetraploid (2n = 4x = 48) (The Potato Genome Sequencing
Consortium, 2011), bread wheat (Triticum aestivum) is hexaploid (2n
= 6x = 42) (International Wheat Genome Sequencing Consortium,
2014), and oilseed rape (Brassica napus) is tetraploid (2n = 4x = 38)
(Chalhoub et al., 2014). Thus, studies on the population genetics of
polyploids can not only shed light on the evolution of angiosperms
but also improve our understanding of crop domestication and improvement (Renny-Byfield & Wendel, 2014).
Despite the importance of polyploids, the population genet-
ics of polyploids are still underdeveloped compared with diploids
(Dufresne et al., 2014). The major challenge is to develop molecular
approaches for reliably resolving allele dosage uncertainty in poly-
ploids (Dufresne et al., 2014). For instance, alleles A and B detected
at a locus in a hexaploid may represent a genotype of ABBBBB,
AABBBB, AAABBB, AAAABB, or AAAAAB, but conventional approaches cannot identify the exact genotype, leading to so-called
allele dosage uncertainty. With allele dosage uncertainty, allele
and genotype frequency estimation are unreliable, which may lead
to considerable biases in subsequent analyses of genetic diversity,
structure and differentiation (Dufresne et al., 2014; Meirmans et al.,
2018). Microsatellites or simple sequence repeats (SSRs) are among
the most popular molecular markers in population genetics (Andrew
et al., 2013; Dufresne et al., 2014). However, applications of SSRs
in polyploids suffer from allele dosage uncertainty (Dufresne et al.,
2014). Recently, sequencing-based SSR genotyping techniques have
been developed based on high-throughput sequencing (De Barba
et al., 2017; Vartia et al., 2016; Yang et al., 2019). The new techniques facilitate rapid, accurate and cost-effective genotyping at
a large number of SSR loci in large-scale population genetic studies (De Barba et al., 2017; Vartia et al., 2016; Yang et al., 2019).
Nevertheless, the new techniques cannot resolve SSR allele dosage
uncertainty when genotyping polyploids. New methods need to be
developed to obtain accurate polyploid genotypes with corrected
SSR allele dosages.
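The size of the dosage problem can be made concrete with a short enumeration. The following is a minimal sketch, not part of the published pipeline; the function name `possible_genotypes` is hypothetical. It lists every genotype that contains each detected allele at least once for a given ploidy, which is the set a size-based method cannot distinguish.

```python
from itertools import combinations


def possible_genotypes(alleles, ploidy):
    """List every genotype that contains all observed alleles at least once."""
    k = len(alleles)
    genotypes = []
    # Choose k-1 cut points among the ploidy-1 gaps between allele copies;
    # the segment lengths are the dosages (each >= 1, summing to ploidy).
    for cuts in combinations(range(1, ploidy), k - 1):
        bounds = (0,) + cuts + (ploidy,)
        dosages = [bounds[i + 1] - bounds[i] for i in range(k)]
        genotypes.append("".join(a * d for a, d in zip(alleles, dosages)))
    return genotypes


# Two alleles detected in a hexaploid: five candidate genotypes.
print(possible_genotypes(["A", "B"], 6))
```

For two alleles in a hexaploid this reproduces the five candidates listed in the text; with three detected alleles the number of candidates rises to ten.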
Polyploids are common in the genus Camellia (Theaceae), in-
cluding predominantly tetraploids and hexaploids, especially in the
section Paracamellia (Ming, 2000). Camellia oleifera, the type species of the section Paracamellia, is a hexaploid evergreen broadleaf
shrub or small tree (Huang et al., 2018). Cultivated C. oleifera is the
dominant woody oilseed crop in China. The seed oil of C. oleifera is
rich in the monounsaturated fatty acid oleic acid (up to >80%), and
it is known as “oriental olive oil” (Ma et al., 2011; Zhuang, 2008).
Wild C. oleifera is an essential genetic resource for cultivated C.
oleifera breeding. Wild C. oleifera is widely distributed in the sub-
tropical evergreen broadleaf forests of the Yangtze River Basin
and South China (Ming, 2000). Based on size polymorphisms of 8
SSR markers determined through capillary electrophoresis, genetic
structure analyses indicated clear genetic differentiation between
wild C. oleifera from Lu Mountain (29.60°N, 115.98°E) and Jinggang
Mountain (26.55°N, 114.17°E) (380 km between the two mountains)
and less genetic differentiation among altitudes within each moun-
tain (altitude range <700 m) (Huang et al., 2018). However, classical
genetic differentiation estimates of FST showed the same very low
genetic differentiation (FST = 0.007) between and within each mountain (Huang et al., 2018). Moreover, most SSRs showed significant
heterozygosity excesses (Huang et al., 2018). These results may be
caused by biased genotyping. Because wild C. oleifera is hexaploid,
allele dosage uncertainty at the SSR loci may lead to biases in population genetic analyses. Further studies are needed to obtain accurate hexaploid genotypes of wild C. oleifera with corrected SSR allele
dosages to determine the geographical pattern of genetic variation,
which is the basis for the utilization of wild C. oleifera genetic re-
sources. For the purposes of germplasm collection, special attention
should be given to wild C. oleifera with high genetic diversity and
differentiation.
In this study, a new high-throughput sequencing-based SSR genotyping method was developed to resolve allele dosage uncertainty in polyploids (Figure 1). Our study used hexaploid wild C. oleifera as a case for analysing genetic diversity and structure in polyploid plants. Wild C. oleifera samples were collected from six populations across geographical ranges. Using 35 microsatellite markers, the new method was applied to genotype hexaploid wild C. oleifera samples. Corrections were performed accounting for the “stutter peak” and amplification efficiency of SSR sequences to obtain hexaploid genotypes with corrected allele dosages. With corrected and uncorrected allele dosages, genetic diversity, structure and differentiation were analysed and compared. The main objectives of our study were to establish methods to resolve allele dosage uncertainty in polyploids, improve population genetic analysis in polyploids, and provide support for the evaluation and utilization of genetic resources in hexaploid C. oleifera.
2 | MATERIALS AND METHODS
2.1 | Plant materials
Wild C. oleifera samples were collected from six natural distribu-
tion sites (Table 1 and Figure 2), across its geographical ranges in
China (mainly between the Yangtze River and the Pearl River). Wild
C. oleifera is widely distributed in the subtropical mountain areas of
China at altitudes ranging from approximately 200 to 2000 m. Lu
Mountain is in the northern distribution range of wild C. oleifera at
the border between the middle and northern subtropical regions of
China. Jinggang Mountain is in the distribution centre of wild C. oleifera in the central section of the middle subtropical region of China.
Nanling Mountain is at the border between the middle and southern
subtropical regions of China. Luofu Mountain is in the southern dis-
tribution range of wild C. oleifera in the southern subtropical region
of China. Fanjing Mountain is in the western distribution range of
wild C. oleifera, and Matou Mountain is in the eastern distribution
range of wild C. oleifera. Differences in climate conditions (Table 1)
as well as geographical isolation between mountains may lead to genetic differentiation between wild C. oleifera populations. The number of samples was proportional to the wild C. oleifera population size
FIGURE 1 Flow chart of the new high-throughput sequencing-based microsatellite genotyping method for resolving allele dosage uncertainty in polyploids. A Perl script “SSRSeq count” str_count.pl (https://github.com/ccoo22/SSRseq_count) is provided to process the clean reads to generate an SSR read count table. A Perl script str_type.pl (https://github.com/ccoo22/SSRSeq) and an online SSR genotyping tool, SSRSeq V1.1 (http://bioinfo.geneskybiotech.com/software/ssrseq_type/v1.1/en/), are provided to output SSR genotypes with corrected allele dosages. Details of the method are described in the Materials and Methods.
Flow chart steps: Multiplex PCR of SSR markers → High throughput sequencing of deep coverage → SSRSeq count (Quality control with FastQC → Paired-end reads merged with FLASH → Reads aligned to SSR reference sequences using Blastn → Read count of SSR sequences) → SSRSeq V1.1 (Read frequency distribution of SSR sequences → SSR alleles identification → “Stutter peak” correction → Amplification efficiency correction → SSR genotypes with corrected allele dosages)
TABLE 1 Wild Camellia oleifera sampling sites
Site (label)                        Latitude (°N)  Longitude (°E)  Altitude (m)  Annual mean temperature (°C)  Annual precipitation (mm)  Number of samples
Lu Mountain, Jiangxi (LU)           29.60          115.98          256–874       12.80–16.14                   1502–1790                  123
Fanjing Mountain, Guizhou (FJ)      27.91          108.63          1016–1389     11.58–13.32                   1251–1321                  29
Matou Mountain, Jiangxi (MTS)       27.73          117.16          526–570       16.08                         1902                       16
Jinggang Mountain, Jiangxi (JG)     26.55          114.17          412–1044      13.63–16.52                   1435–1770                  85
Nanling Mountain, Guangdong (NL)    24.90          113.06          592–932       15.72–16.44                   1566–1627                  64
Luofu Mountain, Guangdong (LF)      23.28          114.02          1125–1217     17.26–17.74                   2019–2034                  27
at the distribution sites. Leaf samples were taken from flowering wild
C. oleifera and then placed in small zip-lock plastic bags containing
silica gel for dehydration. The dry leaf samples were stored at room
temperature.
2.2 | DNA extraction and quality control
Approximately 30 mg of dry leaf tissue was taken from each sample and placed in a 2.0-ml tube with a 5-mm glass bead. Sample
tubes were placed in liquid nitrogen for 30 s and transferred to a
Tissuelyser II (Qiagen) for grinding at 30 Hz for 1 min. Genomic
DNA was extracted using the DNAsecure Plant Kit (Tiangen). DNA
integrity was checked by agarose gel electrophoresis. The quality
and quantity of the DNA samples were measured using a NanoDrop
2000 (Thermo Scientific).
2.3 | Single and multiplex PCR optimization of
SSR markers
Polymorphic SSR markers of wild C. oleifera were selected from Cui
et al. (2018). The SSR markers are single loci containing trinucleotide
simple repeats, that is, (TCC)n, which were developed based on high-throughput transcriptome sequencing of wild C. oleifera from the Lu
and Jinggang Mountains (Cui et al., 2018). First, single PCR amplifi-
cation was performed for each SSR marker. A 10 μl mixture was pre-
pared for each reaction and included 1× reaction buffer (TaKaRa),
2 mM Mg2+, 0.2 mM dNTPs, 0.2 μM each primer, 1 U HotStarTaq
polymerase (TaKaRa) and 1 μl template DNA (10 ng/μl). The cycling
program was 95°C for 2 min; 11 cycles of 95°C for 20 s, 63–58°C
(−0.5°C per cycle) for 40 s, 72°C for 1 min; 24 cycles of 95°C for
20 s, 65°C for 30 s, 72°C for 1 min; 72°C for 2 min. The amplifica-
tion reactions were carried out on an AB 2720 Thermal Cycler (Life
FIGURE 2 Genetic structure analyses of wild Camellia oleifera populations using SSR data of corrected and uncorrected allele dosages. (a) Corrected allele dosage (K = 2), showing the results (K = 2) of genetic structure analysis with SSR data of corrected allele dosage. Pie charts next to wild C. oleifera sites indicate proportions of different clusters (different colours) within local populations. (b) Uncorrected allele dosage (K = 2), showing the results (K = 2) with uncorrected allele dosage data. (c) Corrected allele dosage (K = 5), showing the results (K = 5) with corrected allele dosage data. (d) Corrected allele dosage, showing the results of delta K with corrected allele dosage data. (e) Uncorrected allele dosage, showing the results of delta K with uncorrected allele dosage data. Map labels in panels (a)–(c): Yangtze River, Pearl River, East China Sea, South China Sea, Taiwan.
Technologies Corporation). Thirty-five SSR markers with clear bands
were selected for further multiplex PCR.
The 35 SSR markers were divided into two panels (17 or 18 markers/panel) for multiplex PCR. The composition of SSR markers
was adjusted based on the multiplex PCR results to achieve equal
amounts of the PCR products of each marker, to optimize primer
compatibility and to avoid undesirable primer pairing. A 20 μl mixture was prepared for each reaction and included 1× reaction buffer (TaKaRa), 2 mM Mg2+, 0.2 mM dNTPs, 0.1 μM each primer, 1 U
HotStarTaq polymerase (TaKaRa) and 2 μl template DNA (10 ng/μl).
The cycling programme was 95°C for 2 min; 11 cycles of 94°C for
20 s, 63–58°C (−0.5°C per cycle) for 40 s, 72°C for 1 min; 24 cycles
of 94°C for 20 s, 65°C for 30 s, 72°C for 1 min; 72°C for 2 min.
Finally, two panels of the 35 SSR markers were optimized (Table
S1) for multiplex PCR. Ten DNA samples (random samples from FJ,
JG and NL) were used to check the consistency of PCR results be-
tween single and multiplex PCRs of the 35 SSR markers.
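The touchdown step shared by the single and multiplex cycling programmes (11 cycles, 63–58°C, −0.5°C per cycle) can be expanded into explicit per-cycle annealing temperatures. The sketch below is only an illustration of that arithmetic; the function name `touchdown_annealing` is hypothetical and is not part of any thermal cycler software.

```python
def touchdown_annealing(start_c=63.0, step_c=-0.5, cycles=11):
    """Annealing temperature (deg C) for each cycle of the touchdown phase."""
    return [start_c + step_c * i for i in range(cycles)]


temps = touchdown_annealing()
# First cycle, last cycle and number of cycles of the touchdown phase.
print(temps[0], temps[-1], len(temps))
```

With the default values, eleven cycles at −0.5°C per cycle span exactly the stated range from 63°C down to 58°C.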
2.4 | High throughput sequencing and
data processing
The PCR products (<300 bp) of each sample were mixed and labelled with an 8-bp sample-specific barcode using index primers. All PCR products were pooled prior to library preparation. DNA libraries were constructed for Illumina paired-end sequencing following the Illumina protocol and then sequenced on the Illumina HiSeq 2500 platform (paired-end 2 × 150 bp) with mean coverage >5000× per SSR locus per sample. Raw reads were analysed with FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) for quality control. A Perl script “SSRSeq count” str_count.pl (https://github.com/ccoo22/SSRseq_count) was developed to process the clean reads to generate an SSR read count table (Figure 1). First, paired-end reads were merged with FLASH (http://ccb.jhu.edu/software/FLASH/). Then, merged reads were aligned to C. oleifera sequences where the SSR markers were located (Cui et al., 2018) using Blastn (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/). Finally, the SSR read count table was generated, showing read counts of SSR sequences with different repeat numbers at each locus for each sample.
2.5 | SSR genotyping
A Perl script str_type.pl (https://github.com/ccoo22/SSRSeq) was developed for SSR genotyping. A user-friendly online SSR genotyping tool, SSRSeq V1.1 (http://bioinfo.geneskybiotech.com/software/ssrseq_type/v1.1/en/), was also provided for SSR genotyping using str_type.pl. The genotyping methods are described below, including “stutter peak” correction and amplification efficiency correction (Figure 1). Using the SSR read count table as input data, the read frequency distribution of SSR sequences with different repeat numbers of an SSR motif was obtained at each locus in each sample. SSR alleles (genotyped as repeat numbers) were inferred for read frequencies higher than a given genotyping threshold. Based on a large amount of empirical genotyping in data sets of different species (ploidy ≤6, e.g., diploid Allium sativum, tetraploid Brassica napus, hexaploid Camellia oleifera) with SSR markers of at least trinucleotide simple repeats, the genotyping threshold is usually set to 0.5 × (1/ploidy), which in our case is 0.5 × (1/6) = 0.083, to avoid allelic dropout. Read frequencies higher than the genotyping threshold are highlighted in the output file of SSRSeq for manual checking. To further avoid allelic dropout, the genotyping threshold can be altered in the input file of SSRSeq depending on the genotyping results of a specific data set.
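The thresholding step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the str_type.pl implementation: the function names are hypothetical, the read count table is represented as a plain dictionary mapping repeat number to read count, and the counts are invented for the example rather than taken from the study.

```python
def read_frequencies(read_counts):
    """Convert {repeat_number: read_count} at one locus into read frequencies."""
    total = sum(read_counts.values())
    return {n: c / total for n, c in read_counts.items()}


def call_alleles(freqs, ploidy=6, factor=0.5):
    """Return repeat numbers whose read frequency exceeds factor * (1 / ploidy)."""
    threshold = factor / ploidy
    return sorted(n for n, f in freqs.items() if f > threshold)


# Hypothetical read counts for one locus in one sample (not data from the study).
counts = {7: 300, 8: 2900, 9: 150, 10: 1800, 11: 850}
freqs = read_frequencies(counts)
print(call_alleles(freqs))
```

The `factor` argument plays the role of the adjustable threshold setting described above; lowering it makes allele calling more permissive.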
A “stutter peak” (a sequence with an SSR motif repeat number that is different from an SSR allele) with a high read frequency may be misidentified as an SSR allele. Sometimes, a stutter peak may overlap with an SSR allele, which then shows an abnormally high read frequency, causing difficulties in allele dosage estimation. To account for the stutter peak before an SSR allele, high-quality samples were selected with read numbers higher than the median read number at a locus, and a stutter peak before an SSR allele was identified for read frequencies lower than 0.6× the genotyping threshold (0.6 × 0.083 = 0.05 in the study). The threshold is based on a large amount of empirical genotyping in data sets of different species (ploidy ≤6, e.g., diploid Allium sativum, tetraploid Brassica napus, hexaploid Camellia oleifera) with SSR markers of at least trinucleotide simple repeats. SSR markers of trinucleotide simple repeats are recommended because dinucleotide repeat markers may have a high degree of stutter peaks, causing difficulties in genotyping. The threshold can be altered in the input file of SSRSeq depending on the genotyping results of a specific data set. For each sample, the slip ratio of an SSR sequence at a locus can be calculated as:
$$SR_n = \frac{F_{n-1}}{F_n} \tag{1}$$
where SRn is the slip ratio of the SSR sequence with repeat number n of the SSR motif at a locus, Fn−1 is the read frequency of the stutter peak with repeat number n−1 and Fn is the read frequency of the SSR sequence. The slip ratio may vary with repeat number at a locus as follows:
$$SR_n = a n^2 + b n \tag{2}$$
where a and b are locus-specific coefficients. With regression between the mean observed slip ratio across samples and the repeat number of SSR alleles at each SSR locus, coefficients a and b were estimated for each SSR locus. For an SSR sequence missing the observed slip ratio, the expected slip ratio was calculated using equation 2. Thus, slip ratios were estimated based on high-quality samples and then used for the correction of all samples. For each sample, the expected read frequency of stutter peak Fn−1 before each SSR sequence with repeat number n at each SSR locus was estimated according to equation 1. For stutter peak correction, the expected read frequency of stutter peak Fn−1 was subtracted from the read frequency of the sequence with repeat number n−1 before an SSR sequence with repeat number n. If the expected read frequency of the stutter peak was higher than the observed read frequency, the corrected read frequency was set to 0. Then, all read frequencies were recalculated so that the sum of read frequencies at a locus in a sample = 1. Finally, SSR alleles were identified again with corrected read frequencies higher than the genotyping threshold (>0.083 in the study).
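The stutter peak correction described above can be sketched as follows. This is an illustration under explicit assumptions rather than the SSRSeq code: the function names are hypothetical, the regression is taken to be ordinary least squares on SR_n = a n² + b n without an intercept, and the expected stutter is computed from the observed (not previously corrected) read frequency of the sequence with repeat number n. The source does not specify these details.

```python
def fit_slip_ratio(repeat_numbers, slip_ratios):
    """Least-squares fit of SR_n = a*n^2 + b*n (no intercept); returns (a, b)."""
    s4 = sum(n ** 4 for n in repeat_numbers)
    s3 = sum(n ** 3 for n in repeat_numbers)
    s2 = sum(n ** 2 for n in repeat_numbers)
    t2 = sum(n ** 2 * y for n, y in zip(repeat_numbers, slip_ratios))
    t1 = sum(n * y for n, y in zip(repeat_numbers, slip_ratios))
    det = s4 * s2 - s3 ** 2          # determinant of the 2x2 normal equations
    a = (t2 * s2 - s3 * t1) / det
    b = (s4 * t1 - s3 * t2) / det
    return a, b


def correct_stutter(freqs, a, b):
    """Subtract the expected stutter of each sequence n from sequence n-1."""
    corrected = dict(freqs)
    for n, f_n in freqs.items():
        if n - 1 in freqs:
            expected = (a * n ** 2 + b * n) * f_n   # equation 1 rearranged
            # Clamp at zero when the expected stutter exceeds the observation.
            corrected[n - 1] = max(0.0, freqs[n - 1] - expected)
    total = sum(corrected.values())
    return {n: f / total for n, f in corrected.items()}  # frequencies sum to 1


# Hypothetical coefficients and read frequencies, for illustration only.
a, b = 0.001, 0.002
print(correct_stutter({7: 0.05, 8: 0.50, 9: 0.45}, a, b))
```

In this made-up example the sequence with 7 repeats falls well below the genotyping threshold after correction, so it would be treated as a stutter peak of the allele with 8 repeats rather than as an allele.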
SSR alleles with higher repeat numbers may have lower ampli-
fication efficiency. Due to low amplification efficiency, some SSR
alleles may have read frequencies lower than the genotyping thresh-
old but higher than the 0.6× genotyping threshold and are identi-
fied as potential alleles. To identify SSR alleles and infer expected
dosages of SSR alleles, corrections were performed to account for
differences in amplification efficiency. Again, high- quality samples
were selected with read numbers higher than the median read num-
ber at a locus. For each sample, the amplification ratio of each allele
at a locus can be calculated as the observed allele dosage divided by the expected allele dosage:
$$AR = \frac{D_o}{D_e} \tag{3}$$
where AR is the amplification ratio of the allele at a locus, Do is the observed allele dosage, and De is the expected allele dosage. The observed allele dosage can be calculated as ploidy × read frequency of an SSR allele at a locus in a sample. The expected allele dosage can be estimated following the method below. In the study, ploidy = 6. For SSR genotypes with six different alleles in a sample, the expected allele dosage of each allele should be 1. For SSR genotypes with five different alleles in a sample, four alleles should each have an expected allele dosage of 1, and one allele should have an expected allele dosage of 2. Therefore, the allele with the highest observed read frequency may have the expected allele dosage of 2, and the other alleles may have the expected allele dosage of 1. For SSR genotypes with four different alleles in a sample, the most abundant allele should have the expected allele dosage of 3 or 2, and the other alleles should have the expected allele dosage of 1. For SSR genotypes with fewer than four different alleles in a sample, the expected allele dosage should be estimated by rounding the observed allele dosage so that the total number of alleles at a locus in a sample = 6. Afterwards, the mean amplification ratio across high-quality samples was estimated for each allele at each SSR locus and used for the correction of all samples. For amplification efficiency correction, the read frequency of each allele at each SSR locus was first divided by the mean amplification ratio of the allele. SSR alleles were identified again with corrected read frequencies higher than the genotyping threshold (>0.083 in the study). Then, all read frequencies of SSR alleles were recalculated so that the sum of read frequencies at a locus in a sample = 1. Afterwards, the corrected allele dosage was calculated as ploidy × read frequency of the SSR allele. Finally, the corrected allele dosage was rounded so that the total number of alleles at a locus in a sample = 6.
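The amplification efficiency correction and the final rounding can be sketched as follows. This is a minimal illustration, not the SSRSeq implementation: the function names are hypothetical, and the rounding rule (one copy per detected allele, remaining copies assigned greedily to the allele furthest below its unrounded dosage) is an assumption, since the text states only that dosages are rounded to sum to the ploidy. The read frequencies and amplification ratios are invented for the example.

```python
def correct_amplification(freqs, mean_ar, ploidy=6, factor=0.5):
    """Divide read frequencies by the mean amplification ratio, re-call, renormalise."""
    adjusted = {n: f / mean_ar.get(n, 1.0) for n, f in freqs.items()}
    threshold = factor / ploidy
    kept = {n: f for n, f in adjusted.items() if f > threshold}
    total = sum(kept.values())
    return {n: f / total for n, f in kept.items()}


def round_to_ploidy(freqs, ploidy=6):
    """Integer dosages summing to ploidy (assumed rule: >= 1 copy per allele)."""
    if len(freqs) > ploidy:
        raise ValueError("more alleles than chromosome copies")
    raw = {n: ploidy * f for n, f in freqs.items()}   # unrounded allele dosage
    dosage = {n: 1 for n in raw}
    for _ in range(ploidy - len(raw)):
        # Give the next copy to the allele with the largest remaining deficit.
        n = max(raw, key=lambda m: raw[m] - dosage[m])
        dosage[n] += 1
    return dosage


# Hypothetical case: the 12-repeat allele amplifies less efficiently.
observed = {8: 0.625, 12: 0.375}
mean_ar = {8: 1.25, 12: 0.75}
print(round_to_ploidy(observed))                                  # without correction
print(round_to_ploidy(correct_amplification(observed, mean_ar)))  # with correction
```

In this constructed example the uncorrected read frequencies round to dosages of 4 and 2, whereas dividing by the amplification ratios restores a 3:3 genotype, which is the kind of bias the correction is meant to remove.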
The SSR genotype data in the output Excel file of SSRSeq are in GenoDive format and can be easily modified as input data for GenoDive analysis. The output SSR genotype data include SSR genotypes with corrected and uncorrected (showing only different alleles) allele dosages. In this study, the SSR genotype data set with corrected allele dosages of C. oleifera populations was generated for GenoDive analysis (Cui et al., 2021). In addition, another data set of SSR genotypes with uncorrected allele dosages (showing only different alleles) was generated for GenoDive analysis (Cui et al., 2021) to mimic the situation for conventional SSR genotyping methods.
2.6 | Genetic diversity
The two data sets of corrected and uncorrected allele dosages (Cui et al., 2021) were used for genetic diversity analysis with GenoDive version 2.0b27 (Meirmans & Van Tienderen, 2004). The number of alleles (A), effective number of alleles (Ae), observed heterozygosity (Ho), expected heterozygosity (He) and inbreeding coefficient (Gis) were calculated for each population and each locus. The effective number of alleles (Ae) is the number of alleles in a population weighted for their frequencies. The observed heterozygosity (Ho) estimated in GenoDive is “gametic heterozygosity” for polyploids, that is, the frequency of heterozygotes among randomly sampled diploid gametes (Moody et al., 1993). The expected heterozygosity (He) estimated in GenoDive is “gene diversity”, determined by calculating He in polyploids as in diploids, including a correction for sampling bias (Meirmans et al., 2018). Tests for Hardy-Weinberg equilibrium were performed using the heterozygosity-based Gis statistic with 9999 permutations. GenoDive was used to export data into SPAGeDi format. For uneven sample sizes of different populations, rarefied allelic richness was estimated as the expected number of alleles at the minimal sample size with SPAGeDi 1.5d (Hardy & Vekemans, 2002).
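Two of these statistics are simple enough to sketch directly. The code below is an illustration under assumptions, not GenoDive's implementation: the effective number of alleles is taken as the usual 1/Σp², gametic heterozygosity is computed as the probability that two allele copies drawn without replacement from one individual differ, and the function names are hypothetical. Genotypes are represented as lists of repeat numbers.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Pool allele copies across individuals and return their frequencies."""
    counts = Counter()
    for genotype in genotypes:
        counts.update(genotype)
    total = sum(counts.values())
    return {allele: c / total for allele, c in counts.items()}


def effective_alleles(freqs):
    """Effective number of alleles, assumed here to be 1 / sum(p_i^2)."""
    return 1.0 / sum(p * p for p in freqs.values())


def gametic_heterozygosity(genotype):
    """Probability that two copies sampled without replacement differ."""
    k = len(genotype)
    same = sum(c * (c - 1) for c in Counter(genotype).values())
    return 1.0 - same / (k * (k - 1))


full = [8, 8, 8, 8, 8, 10]   # hexaploid genotype with known dosage
distinct_only = [8, 10]      # the same individual, listing only distinct alleles
print(effective_alleles(allele_frequencies([full])), gametic_heterozygosity(full))
print(effective_alleles(allele_frequencies([distinct_only])),
      gametic_heterozygosity(distinct_only))
```

Treating the distinct-allele list as if it were the full genotype is a deliberately naive stand-in for unknown dosage; it does not reproduce how GenoDive handles such data, but it shows the direction of the bias, with both statistics inflated when dosage is ignored.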
2.7 | Genetic structure
The two data sets of corrected and uncorrected allele dosages
(Cui et al., 2021) were used for genetic structure analysis. Principal
component analysis (PCA) was performed using a covariance ma-
trix between allele frequencies for individuals with GenoDive ver-
sion 2.0b27 (Meirmans & Van Tienderen, 2004). GenoDive was
also used to export data into the STRUCTURE format (Pritchard et al.,
2000). ParallelStructure (Besnier & Glover, 2013; Pritchard et al.,
2000) from the CIPRES Science Gateway (Miller et al., 2010) was
used to infer the population genetic structure. The admixture model
was used to determine the ancestry of individuals. The allele fre-
quencies were assumed to be independent among populations. The
population number (K) was evaluated from 1–10, and five replicate
runs were carried out for each K. Each run had a burnin period of
1,000,000, and there were 1,000,000 iterations after burnin. STRUCTURE HARVESTER (Earl & vonHoldt, 2012) was used to determine the
optimal K. For the optimal K, CLUMPP (Jakobsson & Rosenberg, 2007)
was used to find the optimal alignments of the five replicate runs.
2.8 | Genetic differentiation
The two data sets of corrected and uncorrected allele dosages (Cui
et al., 2021) were used for genetic differentiation analysis with
GenoDive version 2.0b27 (Meirmans & Van Tienderen, 2004).
Estimates of genetic differentiation between populations were cal-
culated, including FST from the analysis of molecular variance be-
tween each pair of populations (AMOVA) (Michalakis & Excoffier,
1996) and Rho (independent of the ploidy level and inheritance
pattern) (Ronfort et al., 1998). A paired t test was used to compare
genetic differentiation estimates with corrected and uncorrected allele dosages. Mantel tests were performed to analyse correlations
of genetic differentiation matrixes with corrected and uncorrected
allele dosages and between FST and Rho. Significance levels were
generated with Bonferroni correction for multiple comparisons. For
testing isolation by distance, linear regressions were performed be-
tween FST/(1−FST) and the natural logarithm of geographical distance
between populations.
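The isolation-by-distance regression can be sketched as follows. This is an illustration under assumptions rather than the analysis code used in the study: the function names are hypothetical, and geographical distance is computed as great-circle (haversine) distance from the site coordinates, which the text does not specify. Only the coordinates of Lu Mountain and Jinggang Mountain from Table 1 are used; no FST values are supplied.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points given in decimal degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))


def ibd_point(fst, distance_km):
    """Return (ln distance, FST / (1 - FST)) for one population pair."""
    return math.log(distance_km), fst / (1.0 - fst)


def ols(xs, ys):
    """Ordinary least-squares slope and intercept of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Lu Mountain (LU) and Jinggang Mountain (JG), coordinates from Table 1.
print(round(haversine_km(29.60, 115.98, 26.55, 114.17)))
```

Applying `ibd_point` to every population pair and passing the resulting x and y values to `ols` gives the regression slope; a positive slope is the pattern expected under isolation by distance.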
3 | RESULTS
3.1 | High-throughput sequencing-based microsatellite genotyping
With high-throughput sequencing, the mean coverage was 5853,
and the median was 5832 per SSR locus per sample in our study.
Such high coverages ensured high precision of SSR read count and
frequency estimations. The output file of SSRSeq involved slip ratios
estimated for each SSR sequence at each locus. The SSR sequence
slip ratio generally increased with the increase in repeat number at
each locus, matching equation 2 (Figure S1). Before stutter peak correction, stutter peaks may be misidentified as SSR alleles and cause
biases in allele dosage estimations. The expected read frequency of
the stutter peak could be estimated with the slip ratio, and stutter
peak corrections were performed by subtracting the expected read
frequency of the stutter peak from the observed read frequency. On
the other hand, the output file of SSRSeq also involved amplification
ratios estimated for each SSR allele at each locus. The amplification
ratio generally decreased with the increase in repeat number at each
locus (Figure S2). Amplification efficiency corrections were per-
formed with the amplification ratio. The output file of SSRSeq had
both unrounded and rounded data for the corrected allele dosages.
The corrected allele dosages (unrounded) were in close agreement
with the expected allele dosages (rounded) in our study, showing
high SSR genotyping accuracy with the new method (Figure S3). Ten
DNA samples were used for both single and multiplex PCR of the 35
SSR markers. The PCR products were sequenced, and the results
obtained via single or multiplex PCR were genotyped separately
TABLE 2 Genetic diversity analysis of wild Camellia oleifera using two SSR datasets with corrected and uncorrected allele dosages
        Corrected allele dosages                                 Uncorrected allele dosages
Site    A      Ae     Allelic richness  Ho     He     Gis       A      Ae     Allelic richness  Ho     He     Gis
LU      6.686  2.846  5.20              0.534  0.557  0.041*    6.686  3.886  4.50              0.851  0.695  −0.224*
FJ      7.229  3.241  6.39              0.572  0.602  0.050*    7.229  4.407  5.29              0.869  0.739  −0.175*
MTS     6.571  3.227  6.30              0.567  0.607  0.065*    6.571  4.341  5.26              0.877  0.743  −0.180*
JG      8.200  3.340  6.55              0.588  0.608  0.032*    8.200  4.614  5.38              0.884  0.744  −0.188*
NL      7.571  3.273  6.24              0.536  0.603  0.111*    7.571  4.302  5.14              0.831  0.722  −0.151*
LF      6.514  3.080  5.80              0.554  0.576  0.038*    6.514  4.142  4.89              0.857  0.718  −0.193*
*Significant heterozygosity deficit (positive Gis value) or heterozygosity excess (negative Gis value) in tests for Hardy-Weinberg equilibrium.
with the new method. The genot yping results showed that the allele
number identified was not significantly different between single and
multiplex PCR of the 35 SSR markers (Figure S 4).
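
The stutter peak and amplification efficiency corrections described above can be illustrated with a minimal Python sketch. It is not the SSRSeq implementation: the function name, the dictionary layout, the assumption that stutter affects only the allele one repeat shorter, and the treatment of the amplification ratio as a relative efficiency by which read counts are divided are all assumptions made for illustration. The read counts and ratios in the usage lines are hypothetical.

```python
from typing import Dict


def correct_allele_dosages(
    reads: Dict[int, float],
    slip_ratio: Dict[int, float],
    amp_ratio: Dict[int, float],
    ploidy: int = 6,
) -> Dict[int, float]:
    """Return unrounded allele dosages keyed by repeat number.

    reads      -- observed read count per repeat number at one locus
    slip_ratio -- assumed stutter reads per main-peak read, per repeat number
    amp_ratio  -- assumed relative amplification efficiency, per repeat number
    """
    corrected = dict(reads)

    # Stutter correction, longest allele first: subtract the expected stutter
    # reads (slip ratio x reads of the allele one repeat longer) from the
    # observed reads of the shorter allele.
    for n in sorted(reads, reverse=True):
        if n - 1 in corrected:
            expected_stutter = slip_ratio.get(n, 0.0) * corrected[n]
            corrected[n - 1] = max(corrected[n - 1] - expected_stutter, 0.0)

    # Amplification efficiency correction: divide the remaining reads by the
    # amplification ratio of each allele, dropping alleles left with no reads.
    adjusted = {n: c / amp_ratio.get(n, 1.0) for n, c in corrected.items() if c > 0}
    total = sum(adjusted.values())
    if total == 0:
        raise ValueError("no reads left after stutter correction")

    # Scale the corrected read fractions to the ploidy level.
    return {n: ploidy * a / total for n, a in adjusted.items()}


# Hypothetical hexaploid locus with true dosages 4 x (10 repeats), 2 x (12 repeats).
dosages = correct_allele_dosages(
    reads={9: 50, 10: 400, 11: 25, 12: 100},
    slip_ratio={10: 0.125, 12: 0.25},
    amp_ratio={10: 1.0, 12: 0.5},
)
print(dosages)                                      # {10: 4.0, 12: 2.0}
print({n: round(d) for n, d in dosages.items()})    # rounded dosages
```

In this toy case the reads at 9 and 11 repeats are fully explained as stutter, and the allele with 12 repeats regains the dosage it lost to weaker amplification.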
3.2 | Genetic diversity
The results of genetic diversity analysis are shown in Table 2. The
number of alleles (A) was the same between data sets with corrected
and uncorrected allele dosages. However, the effective number of
alleles (Ae) was higher with uncorrected allele dosages. Allelic rich-
ness was higher with corrected allele dosages. With corrected allele
dosages, observed heterozygosity values were all lower than ex-
pected heterozygosity values and inbreeding coefficient values were
all positive, showing significant heterozygosity deficits (Figure 3a).
With uncorrected allele dosages, observed heterozygosity values
were all higher than expected heterozygosity values, and inbreeding
coefficient values were all negative, showing significant heterozygosity
excesses (Figure 3b). The observed/expected heterozygosity
values with uncorrected allele dosages were all higher than those
with corrected allele dosages (Figure 3). The genetic diversity of wild
C. oleifera was the highest at Jinggang Mountain (JG) and decreased
from the distribution centre toward the northern and southern margins
of the distribution range. It was the lowest at Lu Mountain (LU) in
the northern distribution range.
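
The sign of the inbreeding coefficient reported in Table 2 follows directly from the two heterozygosity estimates. The short Python sketch below assumes the usual definition Gis = 1 − Ho/He; the function name is illustrative, and the values are the LU entries from Table 2.

```python
def inbreeding_coefficient(ho: float, he: float) -> float:
    """Gis = 1 - Ho/He: positive for heterozygote deficit, negative for excess."""
    if he <= 0:
        raise ValueError("expected heterozygosity must be positive")
    return 1.0 - ho / he


# LU population, Table 2 (Ho, He)
gis_corrected = inbreeding_coefficient(0.534, 0.557)
gis_uncorrected = inbreeding_coefficient(0.851, 0.695)
print(f"{gis_corrected:.3f} {gis_uncorrected:.3f}")  # 0.041 -0.224
```

Because the tabulated Ho and He are rounded, recomputing Gis for other sites may differ from the reported value in the third decimal place.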
3.3 | Genetic structure
The PCA showed that most individuals of the LU population at
the highest latitude were differentiated from individuals of other
populations with corrected or uncorrected allele dosages (Figure
S5). Only with corrected allele dosages were most individuals of
the LF population at the lowest latitude separated from individu-
als of other populations, and the latter were mixed and located
between the LU and LF populations (Figure S5). With corrected or
uncorrected allele dosages, the optimal K was 2 (Figure 2d,e). Only
with corrected allele dosage did a secondary peak occur at K = 5
(Figure 2d). The results of STRUCTURE analyses (K = 2) showed
more gradual changes in genetic structure along latitudes with
corrected allele dosages (Figure 2a) than those with uncorrected
allele dosages (Figure 2b). The LU population at the highest latitude
was the most different from the others. From high to low latitudes,
the genetic structure with corrected allele dosages shifted
toward the genetic structure of the LF population at the lowest
latitude (Figure 2a), similar to the results of the PCA (Figure S5).
With uncorrected allele dosages, all populations except for the LU
population showed more or less the same genetic structure regardless of
geographical location (Figure 2b). In addition, more genetic clusters
(K = 5) were found with corrected allele dosages showing finer ge-
netic structures of wild populations (Figure 2c). In addition to the
distinguished LU population in the northern distribution range, the
genetic structures of the LF population at the lowest latitude and
the FJ population in the western distribution range were clearly
separated. The genetic structures of the MTS, JG and NL popula-
tions showed similarly mixed clusters.
3.4 | Genetic differentiation
FST estimates with corrected allele dosages (mean FST = 0.026)
were significantly higher (p = .001) than those with uncorrected al-
lele dosages (mean FST = 0.009) (Table 3). In addition, the Mantel
test indicated that the correlation was insignificant between FST
estimates with corrected and uncorrected allele dosages (r = .604,
p = .070). With corrected allele dosages, FST was the highest (0.067)
between populations at the highest (LU, 29.60°N) and the lowest (LF,
23.28°N) latitudes (Table 3). However, with uncorrected allele dosages,
FST between LU and LF (0.013) was the same as that between
LU and MTS (27.73°N). With corrected/uncorrected allele dosages,
linear regression was insignificant (corrected: p = .427; uncorrected:
p = .538) between FST/(1−FST) and the natural logarithm of geographical
distance between populations (Figure 4).

FIGURE 3 Observed and expected heterozygosity estimates of wild Camellia oleifera populations (y-axis: heterozygosity; x-axis: population). (a) Estimates with corrected allele dosage. (b) Estimates with uncorrected allele dosage. Solid circles indicate observed heterozygosity Ho estimates. Hollow circles indicate expected heterozygosity He estimates. From left to right, populations are sorted from high to low latitudes (LU, FJ, MTS, JG, NL, LF)

Rho estimates with corrected allele dosages (mean Rho = 0.087)
were not significantly different from those with uncorrected allele
dosages (mean Rho = 0.076) (p = .103). With corrected allele dos-
ages, FST estimates were significantly correlated with Rho estimates
(r = .950, p = .004); with uncorrected allele dosages, the correlation
was insignificant (r = .440, p = .146).
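
The isolation-by-distance regression used above relates FST/(1−FST) to the natural logarithm of geographical distance. The following Python sketch shows the transformation and an ordinary least-squares fit; the function names are illustrative, the distance unit is left to the caller, and no significance test is included.

```python
import math
from typing import Sequence, Tuple


def linearized_fst(fst: float) -> float:
    """Transform FST to FST/(1 - FST)."""
    return fst / (1.0 - fst)


def ibd_regression(
    fst_values: Sequence[float], distances: Sequence[float]
) -> Tuple[float, float, float]:
    """Least-squares fit of FST/(1-FST) on ln(distance).

    Returns (slope, intercept, R squared) for paired population comparisons.
    """
    x = [math.log(d) for d in distances]
    y = [linearized_fst(f) for f in fst_values]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("distances must not all be equal")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r_squared


# LU-LF pair with corrected allele dosages (FST = 0.067, Table 3)
print(round(linearized_fst(0.067), 4))  # 0.0718
```

The transformed LU-LF value of about 0.072 corresponds to the highest point on the vertical axis of Figure 4a.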
4 | DISCUSSION
Conventional molecular methods cannot accurately identify the
SSR genotype of polyploids. Thus, codominant SSR genotypes in
polyploids may have to be treated as dominant data losing valuable
information in subsequent analyses (Dufresne et al., 2014). On the
other hand, software such as GenoDive can handle polyploid SSR
genotypes with unknown allele dosages and perform correction of
allele dosages using a maximum likelihood method based on random
mating within populations modified from De Silva et al. (2005). Since
actual allele frequencies are unknown in the correction, biases may
be introduced to population differentiation and structure analyses.
Methods have been developed to directly infer polyploid genotypes
based on ratios between SSR allele peak areas, for example, the
microsatellite DNA allele counting-peak ratios (MAC-PR) method
(Esselink et al., 2004). However, ratios between SSR allele peak areas
of capillary electrophoresis may not represent actual ratios of SSR
alleles, especially if they do not account for the stutter peak and
amplification efficiency of SSR alleles.
If allele dosages are uncertain in polyploid SSR genotypes, SSR
allele frequency estimation is biased. Population genetic analyses
based on biased allele frequencies may also be biased. Based on
model simulation, when allele dosage information is missing, ob-
served heterozygosity estimates in tetraploid populations are much
higher than true values, while expected heterozygosity estimates
are slightly higher than true values (Meirmans et al., 2018). With
allele dosage uncertainty, statistical testing for Hardy-Weinberg
equilibrium is not possible for polyploids (Meirmans et al., 2018).
For genetic structure analysis, STRUCTURE is well suited for analysing
polyploids (Meirmans et al., 2018). In simulated mixed-ploidy
populations, STRUCTURE is more robust than other clustering methods
(Stift et al., 2019). Especially when population differentiation is
weak, STRUCTURE is the only method that allows unbiased inference
with limited genotypic information of codominant markers with un-
known allele dosages or dominant markers (Stift et al., 2019). For
genetic differentiation estimates, missing dosage information leads
to overestimation of genetic diversity within populations and con-
sequently underestimation of the degree of population differenti-
ation (Meirmans et al., 2018). To estimate genetic differentiation in
polyploids, Rho may be the statistic of choice, as it is generally un-
biased with allele dosage uncertainty, independent of ploidy level
and mode of inheritance, and closely related to FST (Meirmans et al.,
2018; Meirmans & Van Tienderen, 2013).
TABLE 3 Genetic differentiation between populations of wild Camellia oleifera using two SSR data sets with corrected and uncorrected allele dosages. FST estimates (with corrected/uncorrected allele dosages) are in the lower triangle, and Rho estimates (with corrected/uncorrected allele dosages) are in the upper triangle. The p-values are indicated in brackets (p < .01 in bold)

| | LU | FJ | MTS | JG | NL | LF |
|-----|-----|-----|-----|-----|-----|-----|
| LU | | 0.110/0.120 | 0.097/0.087 | 0.069/0.093 | 0.143/0.112 | 0.222/0.170 |
| FJ | 0.033 (**0.001**)/0.010 (**0.001**) | | 0.034/0.029 | 0.039/0.049 | 0.052/0.047 | 0.122/0.089 |
| MTS | 0.030 (**0.001**)/0.013 (**0.001**) | 0.014 (**0.001**)/0.010 (0.177) | | 0.002/0.020 | 0.025/0.025 | 0.113/0.096 |
| JG | 0.019 (**0.001**)/0.006 (**0.001**) | 0.015 (**0.001**)/0.006 (0.030) | 0.004 (**0.003**)/0.007 (0.307) | | 0.049/0.026 | 0.132/0.081 |
| NL | 0.031 (**0.001**)/0.007 (**0.001**) | 0.015 (**0.001**)/0.008 (0.041) | 0.003 (0.113)/0.011 (0.307) | 0.007 (**0.001**)/0.003 (0.012) | | 0.093/0.099 |
| LF | 0.067 (**0.001**)/0.013 (**0.001**) | 0.040 (**0.001**)/0.010 (**0.001**) | 0.040 (**0.001**)/0.012 (**0.001**) | 0.041 (**0.001**)/0.011 (**0.001**) | 0.039 (**0.001**)/0.012 (**0.001**) | |

Our study has developed a new high-throughput sequencing-based
microsatellite genotyping method (Figure 1) to directly resolve
allele dosage uncertainty in polyploids, using hexaploid wild C. oleifera
as a case study. As an alternative to multiplex PCR, one may perform
single PCR for each SSR marker and mix the products for sequencing.
However, as the number of markers and the sample size increase, the
labour needed for single PCR and post-PCR multiplexing increases
far more than that required for multiplex PCR. Our study demonstrated
that, with optimization, the allele number identified was not
significantly different between single and multiplex PCRs of the 35 SSR
markers (Figure S4). Therefore, we propose to use multiplex PCR with
optimization in the method. For 100 SSR markers, the cost of multiplex
PCR, 5000× high-throughput sequencing and data analysis is approximately
30 U.S. dollars per sample or 0.3 U.S. dollars per genotype. The typical
genotyping-by-sequencing (GBS) method can generate data for many more
markers, so the cost per genotype is much lower. Nevertheless, the cost
per sample of the GBS method is generally several times higher than
that of our method. Most importantly, our method provides accurate SSR
genotypes for up to hundreds of SSR markers in hundreds or thousands of
polyploid samples for genetic diversity analysis. Perl scripts and an
online SSR genotyping tool, SSRSeq V1.1, are provided to output accurate
polyploid genotypes with the new method. Compared with capillary
electrophoresis, high-throughput sequencing with deep coverage enables
more accurate estimation of SSR sequence amounts and frequencies.
Moreover, specific corrections are introduced for the stutter peak and
amplification efficiency of SSR sequences. The results for hexaploid
C. oleifera showed that SSR sequences with higher repeat numbers had a
higher ratio of stutter peaks (Figure S1), which may lead to errors and
biases in SSR allele identification and dosage estimation. The slip
ratio model proposed in the study represented the actual SSR sequencing
data well and therefore provided solid stutter peak correction of SSR
sequence frequency. In addition, we found that SSR alleles with higher
repeat numbers may have lower amplification efficiency (Figure S2);
therefore, amplification efficiency corrections must be performed.
Using the new method, accurate hexaploid genotypes of C. oleifera with
corrected allele dosages were obtained in the study (Figure S3). These
enabled direct comparisons of population genetic analyses with
corrected and uncorrected allele dosages. The results of our study
demonstrated that genetic diversity, structure and differentiation
estimates and inferences differed considerably between corrected and
uncorrected allele dosages.
Similar to the results of model simulations by Meirmans et al.
(2018), with uncorrected allele dosages, observed heterozygosity
estimates were abnormally high (>0.8) and significantly higher
than expected heterozygosity estimates, and both were higher
than those with corrected allele dosages (Figure 3). Using eight
highly polymorphic microsatellite markers with the traditional
capillary electrophoresis method, Huang et al. (2018) found similarly
high observed heterozygosity in wild C. oleifera of the Lu and
Jinggang Mountains, and some loci had observed heterozygosity
equal to 1. The authors argued that such high observed heterozy-
gosity suggested that C. oleifera was an allopolyploid with disomic
inheritance. However, in this study, with corrected allele dosages,
observed heterozygosity estimates (<0.6) were significantly lower
than expected heterozygosity, indicating significant heterozygosity
deficits in all populations. Thus, for hexaploid wild C. oleifera, the
genetic diversity estimates with uncorrected allele dosages were
seriously overestimated, especially for observed heterozygosity,
resulting in unrealistic inferences in the previous study. Wild C.
oleifera outcrosses through insect pollination, and its seeds are dis-
persed via small rodents in forests (Huang et al., 2018; Xiao et al.,
2004). With limited gene dispersal within populations, observed
heterozygosity estimates should be significantly lower than ex-
pected heterozygosity, as indicated by the results with corrected
allele dosage. Moreover, our study supported the “central-marginal
hypothesis”, which states that across geographical ranges of species,
within-population genetic diversity declines from the centre to
the periphery, although the differences were small in wild C. oleifera
(Figure 3), as in most cases in previous studies (Eckert et al., 2008).
Our study demonstrates that resolving allele dosage uncertainty
using our new method can achieve accurate estimates of genetic
diversity for polyploids.

FIGURE 4 Relationships between FST/(1−FST) and the natural logarithm of geographical distance between populations. (a) Estimates with corrected allele dosage (regression line: y = 0.0088x − 0.0267, R² = 0.0491). (b) Estimates with uncorrected allele dosage (regression line: y = 0.0011x + 0.0027, R² = 0.0299). Linear regression lines and equations with R² are shown

Although strong genetic structure could be distinguished even
with uncorrected allele dosages, subtle genetic structures could be
discovered among populations only with corrected allele dosages
(Figure 2). The wild C. oleifera population in Lu Mountain (LU) at the
highest latitude in the study was the most differentiated in genetic
structure with corrected and uncorrected allele dosages (Figure 2).
Lu Mountain is in the northern periphery of wild C. oleifera, adjacent
to the Yangtze River in the north and next to Poyang Lake in the
east and south, and isolated from other wild C. oleifera populations.
Adaptation isolation by cold climate conditions together with geo-
graphical isolation might lead to distinct genetic structures (Zhao
et al., 2013). With corrected allele dosages, the southern peripheral
population of wild C. oleifera in Luofu Mountain (LF) was distin-
guished in terms of genetic structure. Again, adaptation isolation by
warm climate conditions and geographical isolation from other pop-
ulations by Nanling Mountain might lead to distinct genetic struc-
tures. Our study indicates that resolving allele dosage uncertainty
is essential for discovering subtle genetic structures in polyploids.
As indicated in model simulations by Meirmans et al. (2018), the
classical FST estimates in our study were all very low with uncor-
rected allele dosages (Table 3), underestimating genetic differentia-
tion between wild C. oleifera populations compared to the estimates
with corrected allele dosages. With corrected allele dosages, FST was
the highest between the northern and southern peripheral popula-
tions, similar to the results of genetic structure analysis (Figure 2).
However, with uncorrected allele dosages, the FST between the
northern and southern peripheral populations was the same as that
between adjacent populations (Table 3), showing considerable biases.
With corrected and uncorrected allele dosages, the patterns
of isolation-by-distance were insignificant, although with corrected
allele dosages, a slightly increasing trend in FST/(1−FST) was detected
with the increase in the natural logarithm of geographical distance
between populations (Figure 4). The insignificance of isolation-by-distance
may be due to the small number of populations in the study.
According to the “central-marginal hypothesis”, in addition to the
declines in within-population genetic diversity from the centre of
the geographical range to the periphery, among-population differentiation
increases from the centre to the periphery (Eckert et al.,
2008). Again, the results of genetic structure and differentiation in
our study supported this hypothesis. Most importantly, our study
demonstrates that resolving allele dosage uncertainty can improve
FST estimates for polyploids.
Huang et al. (2018) showed that Rho could discriminate genetic
differentiation between and within hexaploid wild C. oleifera populations
using the traditional microsatellite genotyping method. In our
study, we confirmed that, with corrected and uncorrected allele dosages,
Rho estimates showed similar genetic differentiation patterns
between wild C. oleifera populations, correlated with FST estimates with
corrected allele dosages. However, the interpretation of Rho is different
from that of FST (Meirmans & Van Tienderen, 2013). The Rho
estimate corresponds to the FST estimate for a haploid species with
the same population size and migration rate; therefore, for hexaploid
wild C. oleifera, the Rho estimates were generally higher than the
FST estimates (Table 3), as indicated by model simulations (Meirmans & Van
Tienderen, 2013).
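
Why Rho exceeds FST in a polyploid can be illustrated with the standard infinite-island approximation. This is a sketch under assumptions not stated in the text: N individuals per population, migration rate m, ploidy level k, polysomic inheritance, random mating and negligible mutation. The symbols N, m and k are introduced here for illustration only.

```latex
F_{ST} \approx \frac{1}{1 + 2kNm},
\qquad
\mathit{Rho} \approx \frac{1}{1 + 2Nm}
% Example: hexaploid (k = 6) with Nm = 1
% F_{ST} \approx 1/13 \approx 0.077, \quad \mathit{Rho} \approx 1/3 \approx 0.333
```

Under these assumptions Rho equals the haploid FST (k = 1) and is larger than the FST of the same populations at any higher ploidy level, which is consistent with the generally higher Rho estimates in Table 3.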
In summary, our study demonstrated that with uncorrected al-
lele dosages, genetic diversity, structure and differentiation anal-
yses were considerably biased in hexaploid wild C. oleifera. The
new high-throughput sequencing-based microsatellite genotyping
method established in the study can resolve allele dosage uncertainty
and considerably improve genetic diversity, structure and
differentiation analyses for polyploids. The genetic variation patterns
of wild C. oleifera across geographical ranges agree with the
“central-marginal hypothesis”, stating that genetic diversity is high
in the central population and declines from the central to peripheral
populations, and genetic differentiation increases from the centre
to the periphery. In future studies, more populations of wild C. oleif-
era across geographical ranges are needed to verify the findings
and discover the underlying mechanisms generating such genetic
variation patterns.
ACKNOWLEDGEMENTS
This work was supported by the National Key Research and
Development Program of China (No. 2018YFD1000603), the
National Natural Science Foundation of China (NSFC Grant No.
31870311) and the “Gan- Po Talent 555” Project of Jiangxi Province,
China. We thank Jinxia Fu and colleagues at the Centre for Genetic
& Genomic Analysis, Genesky Biotechnologies Inc., Shanghai for
support in the development of the high-throughput sequencing-based
microsatellite genotyping method. We are grateful for the valuable
comments of the editors and reviewers, which helped to dramatically
improve the manuscript. Jun Rong would like to thank Professor Peter G.
L. Klinkhamer and Dr. Klaas Vrieling of Leiden University and Dr.
Patrick G. Meirmans of the University of Amsterdam for motivating him
to develop such an efficient molecular method for polyploids.
AUTHOR CONTRIBUTIONS
Xiangyan Cui and Jun Rong designed and performed the experi-
ments, analysed the data and wrote the manuscript. Caihua Li, Yao
Zhao, Shengyuan Qin and Zebin Huang contributed to the experi-
ments, data analyses and writing. Bin Gan, Zhengwen Jiang, Xiaomao
Huang and Xiaoqiang Yang contributed to the experiments and data
analyses. Qin Li, Xiaoguo Xiang and Jiakuan Chen contributed to
writing the manuscript.
DATA AVAILABILITY STATEMENT
Microsatellite genotyping data with corrected and uncorrected allele
dosages of wild C. oleifera populations in the study have been made
available on Dryad (https://doi.org/10.5061/dryad.t4b8gthxd).
ORCID
Jun Rong https://orcid.org/0000-0003-1408-2898
REFERENCES
Andrew, R. L., Bernatchez, L., Bonin, A., Buerkle, C. A., Carstens, B. C.,
Emerson, B. C., Garant, D., Giraud, T., Kane, N. C., Rogers, S. M.,
Slate, J., Smith, H., Sork, V. L., Stone, G. N., Vines, T. H., Waits,
L., Widmer, A., & Rieseberg, L. H. (2013). A road map for molecular
ecology. Molecular Ecology, 22, 2605–2626.
https://doi.org/10.1111/mec.12319
Besnier, F., & Glover, K. A. (2013). ParallelStructure: A R package
to distribute parallel runs of the population genetics program
STRUCTURE on multi-core computers. PLoS One, 8, e70651.
https://doi.org/10.1371/journal.pone.0070651
Chalhoub, B., Denoeud, F., Liu, S., Parkin, I. A. P., Tang, H., Wang, X.,
Chiquet, J., Belcram, H., Tong, C., Samans, B., Correa, M., Da Silva,
C., Just, J., Falentin, C., Koh, C. S., Le Clainche, I., Bernard, M.,
Bento, P., Noel, B., … Wincker, P. (2014). Early allopolyploid
evolution in the post-Neolithic Brassica napus oilseed genome. Science,
345, 950–953. https://doi.org/10.1126/science.1253435
Cui, X., Qin, S., Huang, X., Yang, X., & Rong, J. (2021). Microsatellite
genotypes of Camellia oleifera for GenoDive analysis. Dryad, Dataset,
https://doi.org/10.5061/dryad.t4b8gthxd
Cui, X., Huang, X., Chen, J., Yang, X., & Rong, J. (2018). An efficient
method for developing polymorphic microsatellite markers from
high-throughput transcriptome sequencing: a case study of hexaploid
oil-tea camellia (Camellia oleifera). Euphytica, 214, 26.
https://doi.org/10.1007/s10681-018-2114-6
De Barba, M., Miquel, C., Lobréaux, S., Quenette, P. Y., Swenson,
J. E., & Taberlet, P. (2017). High-throughput microsatellite
genotyping in ecology: Improved accuracy, efficiency, standardization
and success with low-quantity and degraded DNA. Molecular Ecology
Resources, 17, 492–507. https://doi.org/10.1111/1755-0998.12594
De Silva, H. N., Hall, A. J., Rikkerink, E., McNeilage, M. A., & Fraser, L.
G. (2005). Estimation of allele frequencies in polyploids under
certain patterns of inheritance. Heredity, 95, 327–334.
https://doi.org/10.1038/sj.hdy.6800728
Dufresne, F., Stift, M., Vergilino, R., & Mable, B. K. (2014). Recent
progress and challenges in population genetics of polyploid organisms:
An overview of current state-of-the-art molecular and statistical
tools. Molecular Ecology, 23, 40–69.
https://doi.org/10.1111/mec.12581
Earl, D. A., & vonHoldt, B. M. (2012). STRUCTURE HARVESTER: A website
and program for visualizing STRUCTURE output and implementing
the Evanno method. Conservation Genetics Resources, 4, 359–361.
https://doi.org/10.1007/s12686-011-9548-7
Eckert, C. G., Samis, K. E., & Lougheed, S. C. (2008). Genetic variation
across species’ geographical ranges: The central-marginal hypothesis
and beyond. Molecular Ecology, 17, 1170–1188.
https://doi.org/10.1111/j.1365-294X.2007.03659.x
Esselink, G. D., Nybom, H., & Vosman, B. (2004). Assignment of allelic
configuration in polyploids using the MAC-PR (microsatellite
DNA allele counting—peak ratios) method. Theoretical and Applied
Genetics, 109, 402–408.
Hardy, O. J., & Vekemans, X. (2002). SPAGeDi: A versatile computer
program to analyse spatial genetic structure at the individual or
population levels. Molecular Ecology Notes, 2, 618–620.
https://doi.org/10.1046/j.1471-8286.2002.00305.x
Huang, X., Chen, J., Yang, X., Duan, S., Long, C., Ge, G., & Rong, J. (2018).
Low genetic differentiation among altitudes in wild Camellia oleifera,
a subtropical evergreen hexaploid plant. Tree Genetics & Genomes,
14, 21. https://doi.org/10.1007/s11295-018-1234-4
International Wheat Genome Sequencing Consortium. (2014). A
chromosome-based draft sequence of the hexaploid bread wheat
(Triticum aestivum) genome. Science, 345, 1251788.
https://doi.org/10.1126/science.1251788
Jakobsson, M., & Rosenberg, N. A. (2007). CLUMPP: A cluster matching
and permutation program for dealing with label switching and
multimodality in analysis of population structure. Bioinformatics, 23,
1801–1806. https://doi.org/10.1093/bioinformatics/btm233
Ma, J., Ye, H., Rui, Y., Chen, G., & Zhang, N. (2011). Fatty acid
composition of Camellia oleifera oil. Journal für Verbraucherschutz und
Lebensmittelsicherheit, 6, 9–12.
https://doi.org/10.1007/s00003-010-0581-3
Meirmans, P. G., Liu, S., & van Tienderen, P. H. (2018). The analysis of
polyploid genetic data. Journal of Heredity, 109, 283–296.
https://doi.org/10.1093/jhered/esy006
Meirmans, P. G., & Van Tienderen, P. H. (2004). GENOTYPE and
GENODIVE: Two programs for the analysis of genetic diversity of
asexual organisms. Molecular Ecology Notes, 4, 792–794.
https://doi.org/10.1111/j.1471-8286.2004.00770.x
Meirmans, P., & Van Tienderen, P. (2013). The effects of inheritance
in tetraploids on genetic diversity and population divergence.
Heredity, 110, 131–137. https://doi.org/10.1038/hdy.2012.80
Michalakis, Y., & Excoffier, L. (1996). A generic estimation of population
subdivision using distances between alleles with special reference
for microsatellite loci. Genetics, 142, 1061–1064.
https://doi.org/10.1093/genetics/142.3.1061
Miller, M. A., Pfeiffer, W., & Schwartz, T. (2010). Creating the CIPRES
Science Gateway for inference of large phylogenetic trees. In
Proceedings of the Gateway Computing Environments Workshop (GCE)
(pp. 1–8). New Orleans, LA.
Ming, T. L. (2000). Monograph of the genus Camellia. Yunnan Science and
Technology Press.
Moody, M. E., Mueller, L. D., & Soltis, D. E. (1993). Genetic variation and
random drift in autotetraploid populations. Genetics, 134, 649–657.
Pritchard, J. K., Stephens, M., & Donnelly, P. (2000). Inference of
population structure using multilocus genotype data. Genetics, 155,
945–959.
Renny-Byfield, S., & Wendel, J. F. (2014). Doubling down on genomes:
Polyploidy and crop plants. American Journal of Botany, 101,
1711–1725. https://doi.org/10.3732/ajb.1400119
Ronfort, J., Jenczewski, E., Bataillon, T., & Rousset, F. (1998). Analysis of
population structure in autotetraploid species. Genetics, 150,
921–930. https://doi.org/10.1093/genetics/150.2.921
Stift, M., Kolář, F., & Meirmans, P. G. (2019). STRUCTURE is more robust
than other clustering methods in simulated mixed-ploidy populations.
Heredity, 123, 429–441.
https://doi.org/10.1038/s41437-019-0247-6
The Potato Genome Sequencing Consortium. (2011). Genome sequence
and analysis of the tuber crop potato. Nature, 475, 189–195.
Vartia, S., Villanueva-Cañas, J. L., Finarelli, J., Farrell, E. D., Collins, P.
C., Hughes, G. M., Carlsson, J. E. L., Gauthier, D. T., McGinnity, P.,
Cross, T. F., FitzGerald, R. D., Mirimin, L., Crispie, F., Cotter, P. D., &
Carlsson, J. (2016). A novel method of microsatellite
genotyping-by-sequencing using individual combinatorial barcoding. Royal
Society Open Science, 3, 150565. https://doi.org/10.1098/rsos.150565
Wood, T. E., Takebayashi, N., Barker, M. S., Mayrose, I., Greenspoon, P.
B., & Rieseberg, L. H. (2009). The frequency of polyploid speciation
in vascular plants. Proceedings of the National Academy of Sciences
of the United States of America, 106, 13875–13879.
https://doi.org/10.1073/pnas.0811575106
Xiao, Z., Zhang, Z., & Wang, Y. (2004). Impacts of scatter-hoarding
rodents on restoration of oil tea Camellia oleifera in a fragmented
forest. Forest Ecology and Management, 196, 405–412.
https://doi.org/10.1016/j.foreco.2004.04.001
Yang, J., Zhang, J., Han, R., Zhang, F., Mao, A., Luo, J., Dong, B., Liu, H.,
Tang, H., Zhang, J., & Wen, C. (2019). Target SSR-Seq: A novel SSR
genotyping technology associate with perfect SSRs in genetic analysis
of cucumber varieties. Frontiers in Plant Science, 10, 53.
https://doi.org/10.3389/fpls.2019.00531
Zhao, Y., Vrieling, K., Liao, H., Xiao, M., Zhu, Y., Rong, J., Zhang, W.,
Wang, Y., Yang, J., Chen, J., & Song, Z. (2013). Are habitat
fragmentation, local adaptation and isolation-by-distance driving
population divergence in wild rice Oryza rufipogon? Molecular Ecology, 22,
5531–5547.
Zhuang, R. L. (2008). Oil-tea camellia in China (2nd ed.). China Forestry
Publishing House.
SUPPORTING INFORMATION
Additional supporting information may be found online in the
Supporting Information section.
How to cite this article: Cui, X., Li, C., Qin, S., Huang, Z., Gan,
B., Jiang, Z., Huang, X., Yang, X., Li, Q., Xiang, X., Chen, J.,
Zhao, Y., & Rong, J. (2022). High-throughput sequencing-based
microsatellite genotyping for polyploids to resolve
allele dosage uncertainty and improve analyses of genetic
diversity, structure and differentiation: A case study of the
hexaploid Camellia oleifera. Molecular Ecology Resources, 22,
199–211. https://doi.org/10.1111/1755-0998.13469