High‐throughput sequencing‐based microsatellite genotyping for polyploids to resolve allele dosage uncertainty and improve analyses of genetic diversity, structure and differentiation: A case study of the hexaploid Camellia oleifera

Abstract

Conventional microsatellite (simple sequence repeat, SSR) genotyping methods cannot accurately identify polyploid genotypes leading to allele dosage uncertainty, introducing biases in population genetic analysis. Here, a new SSR genotyping method was developed to directly infer accurate polyploid genotypes. The frequency distribution of SSR sequences was obtained based on deep‐coverage high‐throughput sequencing data. Corrections were performed accounting for the ‘stutter peak’ and amplification efficiency of SSR sequences. Perl scripts and an online SSR genotyping tool ‘SSRSeq’ were provided to process the sequencing data and output genotypes with corrected allele dosages. Hexaploid Camellia oleifera is the dominant woody oilseed crop in China. Understanding the geographical pattern of genetic variation in wild C. oleifera is essential for the conservation and utilization of genetic resources. Six wild C. oleifera populations were sampled across geographical ranges in subtropical evergreen broadleaf forests of China. Using 35 SSR markers, the high‐throughput sequencing‐based SSRSeq method was applied to obtain accurate hexaploid genotypes of wild C. oleifera. The results demonstrated that the new method could resolve allele dosage uncertainty and considerably improve genetic diversity, structure and differentiation analyses for polyploids. The genetic variation patterns of wild C. oleifera across geographical ranges agree with the ‘central‐marginal hypothesis’, stating that genetic diversity is high in the central population and declines from the central to the peripheral populations, and genetic differentiation increases from the center to the periphery. This method and findings can facilitate the utilization of wild C. oleifera genetic resources for the breeding of cultivated C. oleifera.
Mol Ecol Resour. 2022;22:199–211. wileyonlinelibrary.com/journal/men
© 2021 John Wiley & Sons Ltd
Received: 3 March 2020 | Revised: 8 July 2021 | Accepted: 12 July 2021
DOI: 10.1111/1755-0998.13469
RESOURCE ARTICLE
Xiangyan Cui1 | Caihua Li2 | Shengyuan Qin1 | Zebin Huang2 | Bin Gan2 | Zhengwen Jiang3 | Xiaomao Huang1 | Xiaoqiang Yang1 | Qin Li4 | Xiaoguo Xiang1 | Jiakuan Chen1,4 | Yao Zhao1,5 | Jun Rong1,5
Xiangyan Cui and Caihua Li contributed equally to this work.
1 Jiangxi Province Key Laboratory of Watershed Ecosystem Change and Biodiversity, Center for Watershed Ecology, Institute of Life Science and School of Life Sciences, Nanchang University, Nanchang, China
2 Center for Genetic & Genomic Analysis, Genesky Biotechnologies Inc, Shanghai, China
3 Genesky Diagnostics (Suzhou) Inc., Suzhou, China
4 Fudan Development Institute, Fudan University, Shanghai, China
5 Lushan Botanical Garden, Chinese Academy of Sciences, Lushan, China
Correspondence
Jun Rong and Yao Zhao, Jiangxi Province Key Laboratory of Watershed Ecosystem Change and Biodiversity, Center for Watershed Ecology, Institute of Life Science and School of Life Sciences, Nanchang University, Nanchang, China.
Emails: rong_jun@hotmail.com and yaozhao@ncu.edu.cn
Funding information
National Key Research and Development Program of China, Grant/Award Number: 2018YFD1000603; National Natural Science Foundation of China, Grant/Award Number: 31870311; “Gan-Po Talent 555” Project of Jiangxi Province, China
KEYWORDS
allele dosage, Camellia oleifera, genetic differentiation, genetic diversity, genetic structure,
polyploid
1 | INTRODUCTION
Polyploidy plays an important role in the diversification of angiosperms. Approximately 15% of angiosperm speciation events are accompanied by ploidy increases, and approximately 35% of angiosperm species are polyploids (Wood et al., 2009). Because polyploidy is often accompanied by heterosis and gene redundancy, and polyploids may grow larger, more quickly, and with higher yields compared with their diploid relatives, polyploidy has also facilitated the domestication and improvement of crops (Renny-Byfield & Wendel, 2014). Many important crops are polyploid. For instance, potato (Solanum tuberosum) is tetraploid (2n = 4x = 48) (The Potato Genome Sequencing Consortium, 2011), bread wheat (Triticum aestivum) is hexaploid (2n = 6x = 42) (International Wheat Genome Sequencing Consortium, 2014), and oilseed rape (Brassica napus) is tetraploid (2n = 4x = 38) (Chalhoub et al., 2014). Thus, studies on the population genetics of polyploids can not only shed light on the evolution of angiosperms but also improve our understanding of crop domestication and improvement (Renny-Byfield & Wendel, 2014).
Despite the importance of polyploids, the population genetics of polyploids is still underdeveloped compared with that of diploids (Dufresne et al., 2014). The major challenge is to develop molecular approaches for reliably resolving allele dosage uncertainty in polyploids (Dufresne et al., 2014). For instance, alleles A and B detected at a locus in a hexaploid may represent a genotype of ABBBBB, AABBBB, AAABBB, AAAABB, or AAAAAB, but conventional approaches cannot identify the exact genotype, leading to so-called allele dosage uncertainty. With allele dosage uncertainty, allele and genotype frequency estimates are unreliable, which may lead to considerable biases in subsequent analyses of genetic diversity, structure and differentiation (Dufresne et al., 2014; Meirmans et al., 2018). Microsatellites or simple sequence repeats (SSRs) are among the most popular molecular markers in population genetics (Andrew et al., 2013; Dufresne et al., 2014). However, applications of SSRs in polyploids suffer from allele dosage uncertainty (Dufresne et al., 2014). Recently, sequencing-based SSR genotyping techniques have been developed based on high-throughput sequencing (De Barba et al., 2017; Vartia et al., 2016; Yang et al., 2019). The new techniques facilitate rapid, accurate and cost-effective genotyping at a large number of SSR loci in large-scale population genetic studies (De Barba et al., 2017; Vartia et al., 2016; Yang et al., 2019). Nevertheless, these techniques cannot resolve SSR allele dosage uncertainty when genotyping polyploids. New methods need to be developed to obtain accurate polyploid genotypes with corrected SSR allele dosages.
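To make the scale of this ambiguity concrete, the short Python sketch below enumerates every genotype that is compatible with a set of observed alleles at a given ploidy. It is an illustration only and is not part of SSRSeq; the function name `candidate_genotypes` is hypothetical.

```python
from itertools import combinations_with_replacement


def candidate_genotypes(alleles, ploidy):
    """Return all genotypes of size `ploidy` that carry every observed allele at least once."""
    observed = sorted(set(alleles))
    return [
        "".join(genotype)
        for genotype in combinations_with_replacement(observed, ploidy)
        if set(genotype) == set(observed)
    ]


# Two alleles detected in a hexaploid: five dosage combinations are possible.
print(candidate_genotypes(["A", "B"], ploidy=6))
# Three alleles detected in a hexaploid: ten dosage combinations are possible.
print(len(candidate_genotypes(["A", "B", "C"], ploidy=6)))
```

The count grows quickly with the number of alleles detected, which is why size-based genotyping alone leaves allele frequencies poorly determined in hexaploids.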
Polyploids are common in the genus Camellia (Theaceae), including predominantly tetraploids and hexaploids, especially in the section Paracamellia (Ming, 2000). Camellia oleifera, the type species of the section Paracamellia, is a hexaploid evergreen broadleaf shrub or small tree (Huang et al., 2018). Cultivated C. oleifera is the dominant woody oilseed crop in China. The seed oil of C. oleifera is rich in the monounsaturated fatty acid oleic acid (up to >80%), and it is known as “oriental olive oil” (Ma et al., 2011; Zhuang, 2008). Wild C. oleifera is an essential genetic resource for cultivated C. oleifera breeding. Wild C. oleifera is widely distributed in the subtropical evergreen broadleaf forests of the Yangtze River Basin and South China (Ming, 2000). Based on size polymorphisms of 8 SSR markers determined through capillary electrophoresis, genetic structure analyses indicated clear genetic differentiation between wild C. oleifera from Lu Mountain (29.60°N, 115.98°E) and Jinggang Mountain (26.55°N, 114.17°E) (380 km between the two mountains) and less genetic differentiation among altitudes within each mountain (altitude range <700 m) (Huang et al., 2018). However, classical genetic differentiation estimates of FST showed the same very low genetic differentiation (FST = 0.007) between and within each mountain (Huang et al., 2018). Moreover, the majority of SSRs showed significant heterozygosity excesses (Huang et al., 2018). These results may be caused by biased genotyping. Because wild C. oleifera is hexaploid, allele dosage uncertainty at the SSR loci may lead to biases in population genetic analyses. Further studies are needed to obtain accurate hexaploid genotypes of wild C. oleifera with corrected SSR allele dosages to determine the geographical pattern of genetic variation, which is the basis for the utilization of wild C. oleifera genetic resources. For the purposes of germplasm collection, special attention should be given to wild C. oleifera with high genetic diversity and differentiation.
In this study, a new high-throughput sequencing-based SSR genotyping method was developed to resolve allele dosage uncertainty in polyploids (Figure 1). Our study used hexaploid wild C. oleifera as a case for analysing genetic diversity and structure in polyploid plants. Wild C. oleifera samples were collected from six populations across geographical ranges. Using 35 microsatellite markers, the new method was applied to genotype hexaploid wild C. oleifera samples. Corrections were performed accounting for the “stutter peak” and amplification efficiency of SSR sequences to obtain hexaploid genotypes with corrected allele dosages. With corrected and uncorrected allele dosages, genetic diversity, structure and differentiation were analysed and compared. The main objectives of our study were to establish methods to resolve allele dosage uncertainty in polyploids, improve population genetic analysis in polyploids, and provide support for the evaluation and utilization of genetic resources in hexaploid C. oleifera.
2 | MATERIALS AND METHODS
2.1  |  Plant materials
Wild C. oleifera samples were collected from six natural distribution sites (Table 1 and Figure 2), across its geographical ranges in China (mainly between the Yangtze River and the Pearl River). Wild C. oleifera is widely distributed in the subtropical mountain areas of China at altitudes ranging from approximately 200 to 2000 m. Lu Mountain is in the northern distribution range of wild C. oleifera at the border between the middle and northern subtropical regions of China. Jinggang Mountain is in the distribution centre of wild C. oleifera in the central section of the middle subtropical region of China. Nanling Mountain is at the border between the middle and southern subtropical regions of China. Luofu Mountain is in the southern distribution range of wild C. oleifera in the southern subtropical region of China. Fanjing Mountain is in the western distribution range of wild C. oleifera, and Matou Mountain is in the eastern distribution range of wild C. oleifera. Differences in climate conditions (Table 1) as well as geographical isolation between mountains may lead to genetic differentiation between wild C. oleifera populations. The number of samples was proportional to the wild C. oleifera population size at the distribution sites. Leaf samples were taken from flowering wild C. oleifera and then placed in small zip-lock plastic bags containing silica gel for dehydration. The dry leaf samples were stored at room temperature.

FIGURE 1 Flow chart of the new high-throughput sequencing-based microsatellite genotyping method for resolving allele dosage uncertainty in polyploids. A Perl script “SSRSeq count” str_count.pl (https://github.com/ccoo22/SSRseq_count) is provided to process the clean reads to generate an SSR read count table. A Perl script str_type.pl (https://github.com/ccoo22/SSRSeq) and an online SSR genotyping tool, SSRSeq V1.1 (http://bioinfo.geneskybiotech.com/software/ssrseq_type/v1.1/en/), are provided to output SSR genotypes with corrected allele dosages. Details of the method are described in the Materials and Methods.
Flow chart steps: multiplex PCR of SSR markers; high-throughput sequencing of deep coverage; steps labelled SSRSeq count (quality control with FastQC; paired-end reads merged with FLASH; reads aligned to SSR reference sequences using Blastn; read count of SSR sequences); steps labelled SSRSeq V1.1 (read frequency distribution of SSR sequences; SSR alleles identification; “stutter peak” correction; amplification efficiency correction); SSR genotypes with corrected allele dosages.

TABLE 1 Wild Camellia oleifera sampling sites
Site (label) | Latitude (°N) | Longitude (°E) | Altitude (m) | Annual mean temperature (°C) | Annual precipitation (mm) | Number of samples
Lu Mountain, Jiangxi (LU) | 29.60 | 115.98 | 256–874 | 12.80–16.14 | 1502–1790 | 123
Fanjing Mountain, Guizhou (FJ) | 27.91 | 108.63 | 1016–1389 | 11.58–13.32 | 1251–1321 | 29
Matou Mountain, Jiangxi (MTS) | 27.73 | 117.16 | 526–570 | 16.08 | 1902 | 16
Jinggang Mountain, Jiangxi (JG) | 26.55 | 114.17 | 412–1044 | 13.63–16.52 | 1435–1770 | 85
Nanling Mountain, Guangdong (NL) | 24.90 | 113.06 | 592–932 | 15.72–16.44 | 1566–1627 | 64
Luofu Mountain, Guangdong (LF) | 23.28 | 114.02 | 1125–1217 | 17.26–17.74 | 2019–2034 | 27
2.2  |  DNA extraction and quality control
Approximately 30 mg of dry leaf tissue was taken from each sample and placed in a 2.0-ml tube with a 5-mm glass bead. Sample tubes were placed in liquid nitrogen for 30 s and transferred to a TissueLyser II (Qiagen) for grinding at 30 Hz for 1 min. Genomic DNA was extracted using the DNAsecure Plant Kit (Tiangen). DNA integrity was checked by agarose gel electrophoresis. The quality and quantity of the DNA samples were measured using a NanoDrop 2000 (Thermo Scientific).
2.3  |  Single and multiplex PCR optimization of SSR markers
Polymorphic SSR markers of wild C. oleifera were selected from Cui et al. (2018). The SSR markers are single loci containing trinucleotide simple repeats, that is, (TCC)n, which were developed based on high-throughput transcriptome sequencing of wild C. oleifera from the Lu and Jinggang Mountains (Cui et al., 2018). First, single PCR amplification was performed for each SSR marker. A 10 μl mixture was prepared for each reaction and included 1× reaction buffer (TaKaRa), 2 mM Mg2+, 0.2 mM dNTPs, 0.2 μM each primer, 1 U HotStarTaq polymerase (TaKaRa) and 1 μl template DNA (10 ng/μl). The cycling programme was 95°C for 2 min; 11 cycles of 95°C for 20 s, 63–58°C (−0.5°C per cycle) for 40 s, 72°C for 1 min; 24 cycles of 95°C for 20 s, 65°C for 30 s, 72°C for 1 min; 72°C for 2 min. The amplification reactions were carried out on an AB 2720 Thermal Cycler (Life Technologies Corporation). Thirty-five SSR markers with clear bands were selected for further multiplex PCR.

FIGURE 2 Genetic structure analyses of wild Camellia oleifera populations using SSR data of corrected and uncorrected allele dosages. (a) Corrected allele dosage (K = 2), showing the results (K = 2) of genetic structure analysis with SSR data of corrected allele dosage. Pie charts next to wild C. oleifera sites indicate proportions of different clusters (different colours) within local populations. (b) Uncorrected allele dosage (K = 2), showing the results (K = 2) with uncorrected allele dosage data. (c) Corrected allele dosage (K = 5), showing the results (K = 5) with corrected allele dosage data. (d) Corrected allele dosage, showing the results of delta K with corrected allele dosage data. (e) Uncorrected allele dosage, showing the results of delta K with uncorrected allele dosage data.
Map labels in panels (a) to (c): Yangtze River, Pearl River, East China Sea, South China Sea, Taiwan.
The 35 SSR markers were divided into two panels (17 or 18 markers/panel) for multiplex PCR. The composition of SSR markers was adjusted based on the multiplex PCR results to achieve equal amounts of the PCR products of each marker, to optimize primer compatibility and to avoid undesirable primer pairing. A 20 μl mixture was prepared for each reaction and included 1× reaction buffer (TaKaRa), 2 mM Mg2+, 0.2 mM dNTPs, 0.1 μM each primer, 1 U HotStarTaq polymerase (TaKaRa) and 2 μl template DNA (10 ng/μl). The cycling programme was 95°C for 2 min; 11 cycles of 94°C for 20 s, 63–58°C (−0.5°C per cycle) for 40 s, 72°C for 1 min; 24 cycles of 94°C for 20 s, 65°C for 30 s, 72°C for 1 min; 72°C for 2 min.
Finally, two panels of the 35 SSR markers were optimized (Table S1) for multiplex PCR. Ten DNA samples (random samples from FJ, JG and NL) were used to check the consistency of PCR results between single and multiplex PCRs of the 35 SSR markers.
2.4  |  High-throughput sequencing and data processing
The PCR products (<300 bp) of each sample were mixed and labelled with an 8-bp sample-specific barcode using index primers. All PCR products were pooled prior to library preparation. DNA libraries were constructed for Illumina paired-end sequencing following the Illumina protocol and then sequenced on the Illumina HiSeq 2500 platform (paired-end 2 × 150 bp) with mean coverage >5000× per SSR locus per sample. Raw reads were analysed with FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) for quality control. A Perl script “SSRSeq count” str_count.pl (https://github.com/ccoo22/SSRseq_count) was developed to process the clean reads to generate an SSR read count table (Figure 1). First, paired-end reads were merged with FLASH (http://ccb.jhu.edu/software/FLASH/). Then, merged reads were aligned to C. oleifera sequences where the SSR markers were located (Cui et al., 2018) using Blastn (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/). Finally, the SSR read count table was generated, showing read counts of SSR sequences with different repeat numbers at each locus for each sample.
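As a rough illustration of what an SSR read count table records, the following Python sketch counts repeat numbers directly in merged reads. It deliberately swaps in a simpler technique than the one described above: str_count.pl aligns reads to reference sequences with Blastn, whereas this sketch only looks for the longest uninterrupted run of the motif with a regular expression. The function names and the example reads are hypothetical.

```python
import re
from collections import Counter


def longest_repeat_run(read, motif):
    """Largest number of consecutive copies of `motif` found anywhere in `read`."""
    runs = re.findall(f"(?:{re.escape(motif)})+", read)
    return max((len(run) // len(motif) for run in runs), default=0)


def repeat_number_counts(reads, motif):
    """Tally how many reads carry each repeat number (one locus, one sample)."""
    return Counter(longest_repeat_run(read, motif) for read in reads)


# Hypothetical merged reads for a (TCC)n locus with made-up flanking sequence.
reads = ["ACGT" + "TCC" * 5 + "GATC"] * 2 + ["ACGT" + "TCC" * 6 + "GATC"]
counts = repeat_number_counts(reads, "TCC")
print(dict(counts))
```

A motif search of this kind ignores flanking sequence and would not separate loci that share a motif, which is one reason an alignment step against locus-specific references is the more robust choice.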
2.5  |  SSR genotyping
A Perl script str_type.pl (https://github.com/ccoo22/SSRSeq) was developed for SSR genotyping. A user-friendly online SSR genotyping tool, SSRSeq V1.1 (http://bioinfo.geneskybiotech.com/software/ssrseq_type/v1.1/en/), was also provided for SSR genotyping using str_type.pl. The genotyping methods are described below, including “stutter peak” correction and amplification efficiency correction (Figure 1). Using the SSR read count table as input data, the read frequency distribution of SSR sequences with different repeat numbers of an SSR motif was obtained at a locus in each sample. SSR alleles (genotyped as repeat numbers) were inferred for read frequencies higher than a given genotyping threshold. Based on a large amount of empirical genotyping in data sets of different species (ploidy ≤6, e.g., diploid Allium sativum, tetraploid Brassica napus, hexaploid Camellia oleifera) with SSR markers of at least trinucleotide simple repeats, the genotyping threshold is usually set to 0.5 × (1/ploidy), in our case 0.5 × (1/6) = 0.083, to avoid allelic dropout. Read frequencies higher than the genotyping threshold are highlighted in the output file of SSRSeq for manual checking. To further avoid allelic dropout, the genotyping threshold can be altered in the input file of SSRSeq depending on the genotyping results of a specific data set.
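The thresholding step can be sketched in a few lines of Python. This is a minimal illustration of the rule stated above, not the str_type.pl implementation; the function names and the read counts are hypothetical, and the read count table is assumed to be available as a mapping from repeat number to read count.

```python
def genotyping_threshold(ploidy, factor=0.5):
    """Threshold described in the text: factor x (1 / ploidy)."""
    return factor / ploidy


def read_frequencies(read_counts):
    """Convert {repeat_number: read_count} into {repeat_number: frequency}."""
    total = sum(read_counts.values())
    return {n: count / total for n, count in read_counts.items()}


def call_alleles(read_counts, ploidy):
    """Repeat numbers whose read frequency exceeds the genotyping threshold."""
    threshold = genotyping_threshold(ploidy)
    freqs = read_frequencies(read_counts)
    return sorted(n for n, f in freqs.items() if f > threshold)


# Hypothetical read counts at one locus in one hexaploid sample.
example_counts = {7: 150, 8: 1000, 9: 1850, 10: 3000}
print(round(genotyping_threshold(6), 3))
print(call_alleles(example_counts, ploidy=6))
```

With these made-up counts, repeat number 7 has a read frequency of 0.025 and falls below the hexaploid threshold, so only repeat numbers 8, 9 and 10 are called as alleles.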
A “stutter peak” (a sequence with an SSR motif repeat number that is different from an SSR allele) with a high read frequency may be misidentified as an SSR allele. Sometimes, a stutter peak may overlap with an SSR allele showing an abnormally high read frequency, causing difficulties in allele dosage estimation. To account for the stutter peak before an SSR allele, high-quality samples were selected with read numbers higher than the median read number at a locus, and a stutter peak before an SSR allele was identified for read frequencies lower than the 0.6× genotyping threshold (0.6 × 0.083 = 0.05 in the study). The threshold is based on a large amount of empirical genotyping in data sets of different species (ploidy ≤6, e.g., diploid Allium sativum, tetraploid Brassica napus, hexaploid Camellia oleifera) with SSR markers of at least trinucleotide simple repeats. SSR markers of trinucleotide simple repeats are recommended because dinucleotide repeat markers may have a high degree of stutter peaks causing difficulties in genotyping. The threshold can be altered in the input file of SSRSeq depending on the genotyping results of a specific data set. For each sample, the slip ratio of an SSR sequence at a locus can be calculated as:

$$SR_n = \frac{F_{n-1}}{F_n} \tag{1}$$

where $SR_n$ is the slip ratio of the SSR sequence with repeat number $n$ of the SSR motif at a locus, $F_{n-1}$ is the read frequency of the stutter peak with repeat number $n-1$ and $F_n$ is the read frequency of the SSR sequence. The slip ratio may vary with repeat number at a locus as follows:

$$SR_n = an^2 + bn \tag{2}$$

where $a$ and $b$ are locus-specific coefficients. With regression between the mean observed slip ratio across samples and the repeat number of SSR alleles at each SSR locus, coefficients $a$ and $b$ were estimated for each SSR locus. For an SSR sequence missing the observed slip ratio, the expected slip ratio was calculated using equation 2. Thus, slip ratios were estimated based on high-quality samples and then used for the correction of all samples. For each sample, the expected read frequency of stutter peak $F_{n-1}$ before each SSR sequence with repeat number $n$ at each SSR locus was estimated according to equation 1. For stutter peak correction, the expected read frequency of stutter peak $F_{n-1}$ was subtracted from the read frequency of the sequence with repeat number $n-1$ before an SSR sequence with repeat number $n$. If the expected read frequency of the stutter peak was higher than the observed read frequency, the corrected read frequency was set to 0. Then, all read frequencies were recalculated so that the sum of read frequencies at a locus in a sample = 1. Finally, SSR alleles were identified again with corrected read frequencies higher than the genotyping threshold (>0.083 in the study).
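The stutter peak correction can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the SSRSeq code: the regression of equation 2 is assumed to be an ordinary least-squares fit without intercept, the expected stutter is computed from the observed (uncorrected) frequency of each allele, and all function names and numbers are hypothetical.

```python
def slip_ratio(freqs, n):
    """Equation 1: observed slip ratio SR_n = F_(n-1) / F_n."""
    return freqs[n - 1] / freqs[n]


def fit_slip_coefficients(repeat_numbers, slip_ratios):
    """Least-squares fit of equation 2, SR_n = a*n**2 + b*n (no intercept)."""
    s4 = sum(n ** 4 for n in repeat_numbers)
    s3 = sum(n ** 3 for n in repeat_numbers)
    s2 = sum(n ** 2 for n in repeat_numbers)
    t2 = sum(n ** 2 * y for n, y in zip(repeat_numbers, slip_ratios))
    t1 = sum(n * y for n, y in zip(repeat_numbers, slip_ratios))
    det = s4 * s2 - s3 ** 2
    a = (t2 * s2 - t1 * s3) / det
    b = (s4 * t1 - s3 * t2) / det
    return a, b


def correct_stutter(freqs, alleles, a, b):
    """Subtract each allele's expected stutter from repeat number n-1, then renormalise."""
    corrected = dict(freqs)
    for n in alleles:
        expected = (a * n ** 2 + b * n) * freqs[n]  # F_(n-1) = SR_n * F_n
        if n - 1 in corrected:
            corrected[n - 1] = max(0.0, corrected[n - 1] - expected)
    total = sum(corrected.values())
    return {n: f / total for n, f in corrected.items()}


# Hypothetical mean slip ratios at repeat numbers 6, 8 and 10 for one locus.
a, b = fit_slip_coefficients([6, 8, 10], [0.048, 0.080, 0.120])
# Hypothetical read frequencies; repeat number 8 is only stutter from allele 9.
observed = {8: 0.05, 9: 0.45, 10: 0.50}
print(correct_stutter(observed, alleles=[9, 10], a=a, b=b))
```

In this made-up example the frequency at repeat number 8 drops to well below the genotyping threshold, and part of the frequency at repeat number 9 is reassigned as stutter from allele 10 before renormalisation.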
SSR alleles with higher repeat numbers may have lower amplification efficiency. Due to low amplification efficiency, some SSR alleles may have read frequencies lower than the genotyping threshold but higher than the 0.6× genotyping threshold and are identified as potential alleles. To identify SSR alleles and infer expected dosages of SSR alleles, corrections were performed to account for differences in amplification efficiency. Again, high-quality samples were selected with read numbers higher than the median read number at a locus. For each sample, the amplification ratio of each allele at a locus can be calculated as the observed allele dosage divided by the expected allele dosage:

$$AR = \frac{D_o}{D_e} \tag{3}$$

where $AR$ is the amplification ratio of the allele at a locus, $D_o$ is the observed allele dosage, and $D_e$ is the expected allele dosage. The observed allele dosage can be calculated as ploidy × read frequency of an SSR allele at a locus in a sample. The expected allele dosage can be estimated following the method below. In the study, ploidy = 6. For SSR genotypes with six different alleles in a sample, the expected allele dosage of each allele should be 1. For SSR genotypes with five different alleles in a sample, four alleles should have an expected allele dosage of 1 each, and one allele should have an expected allele dosage of 2. Therefore, the allele with the highest observed read frequency may have the expected allele dosage of 2, and the other alleles may have the expected allele dosage of 1. For SSR genotypes with four different alleles in a sample, the most abundant allele should have the expected allele dosage of 3 or 2, and the other alleles should have the expected allele dosage of 1. For SSR genotypes with fewer than four different alleles in a sample, the expected allele dosage should be estimated by rounding the observed allele dosage so that the total number of alleles at a locus in a sample = 6. Afterwards, the mean amplification ratio across high-quality samples was estimated for each allele at each SSR locus and used for the correction of all samples. For amplification efficiency correction, the read frequency of each allele at each SSR locus was first divided by the mean amplification ratio of the allele. SSR alleles were identified again with corrected read frequencies higher than the genotyping threshold (>0.083 in the study). Then, all read frequencies of SSR alleles were recalculated so that the sum of read frequencies at a locus in a sample = 1. Afterwards, the corrected allele dosage was calculated as ploidy × read frequency of the SSR allele. Finally, the corrected allele dosage was rounded so that the total number of alleles at a locus in a sample = 6.
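The amplification efficiency correction and the final rounding can be sketched in Python as below. The text does not specify how rounding is constrained to sum to the ploidy, so the greedy rule used here (one copy per called allele, then remaining copies to the allele furthest below its unrounded dosage) is an assumption, as are the function names and the example values.

```python
def amplification_ratio(read_freq, expected_dosage, ploidy):
    """Equation 3: AR = D_o / D_e, with D_o = ploidy x read frequency."""
    return ploidy * read_freq / expected_dosage


def correct_amplification(allele_freqs, mean_ar):
    """Divide each allele's read frequency by its mean AR, then renormalise to 1."""
    adjusted = {n: f / mean_ar[n] for n, f in allele_freqs.items()}
    total = sum(adjusted.values())
    return {n: f / total for n, f in adjusted.items()}


def round_dosages(allele_freqs, ploidy):
    """Integer dosages summing to `ploidy`, with at least one copy per called allele."""
    raw = {n: ploidy * f for n, f in allele_freqs.items()}
    dosage = {n: 1 for n in raw}
    for _ in range(ploidy - len(raw)):
        # Give the next copy to the allele that is furthest below its raw dosage.
        neediest = max(raw, key=lambda n: raw[n] - dosage[n])
        dosage[neediest] += 1
    return dosage


# Hypothetical allele read frequencies and mean amplification ratios at one locus.
observed = {9: 0.36, 10: 0.52, 12: 0.12}
mean_ar = {9: 1.05, 10: 1.05, 12: 0.70}
corrected = correct_amplification(observed, mean_ar)
print(round_dosages(corrected, ploidy=6))
```

In this example the allele with 12 repeats amplifies poorly (ratio 0.70), so its corrected read frequency rises from 0.12 to about 0.17 and its unrounded dosage moves from 0.72 to about 1.02, close to the single copy it is assigned.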
The output SSR genotype data in the output Excel file of SSRSeq are in GenoDive format and can be easily modified as input data for GenoDive analysis. The output SSR genotype data include SSR genotypes with corrected and uncorrected (showing only different alleles) allele dosages. In this study, the SSR genotype data set with corrected allele dosages of C. oleifera populations was generated for GenoDive analysis (Cui et al., 2021). In addition, another data set of SSR genotypes with uncorrected allele dosages (showing only different alleles) was generated for GenoDive analysis (Cui et al., 2021) to mimic the situation for conventional SSR genotyping methods.
2.6  |  Genetic diversity
The two data sets of corrected and uncorrected allele dosages (Cui et al., 2021) were used for genetic diversity analysis with GenoDive version 2.0b27 (Meirmans & Van Tienderen, 2004). The number of alleles (A), effective number of alleles (Ae), observed heterozygosity (Ho), expected heterozygosity (He) and inbreeding coefficient (Gis) were calculated for each population and each locus. The effective number of alleles (Ae) is the number of alleles in a population weighted for their frequencies. The observed heterozygosity (Ho) estimated in GenoDive is “gametic heterozygosity” for polyploids, that is, the frequency of heterozygotes among randomly sampled diploid gametes (Moody et al., 1993). The expected heterozygosity (He) estimated in GenoDive is “gene diversity”, determined by calculating He in polyploids as in diploids, including a correction for sampling bias (Meirmans et al., 2018). Tests for Hardy-Weinberg equilibrium were performed using the heterozygosity-based Gis statistic with 9999 permutations. GenoDive was used to export data into SPAGeDi format. To account for uneven sample sizes among populations, rarefied allelic richness was estimated as the expected number of alleles at the minimal sample size with SPAGeDi 1.5d (Hardy & Vekemans, 2002).
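The diversity statistics named above can be illustrated with a short Python sketch. It uses textbook definitions rather than GenoDive's estimators: Ae = 1/Σp², He = 1 − Σp² without the sampling-bias correction, gametic heterozygosity as the chance that two allele copies drawn without replacement from one individual differ, and Gis = 1 − Ho/He. These definitions, the function names and the example genotypes are assumptions made for illustration.

```python
def allele_frequencies(genotypes):
    """Pool allele copies over individuals; each genotype is {allele: dosage}."""
    totals = {}
    for genotype in genotypes:
        for allele, dosage in genotype.items():
            totals[allele] = totals.get(allele, 0) + dosage
    n_copies = sum(totals.values())
    return {allele: count / n_copies for allele, count in totals.items()}


def effective_alleles(freqs):
    """Ae = 1 / sum(p_i^2)."""
    return 1.0 / sum(p ** 2 for p in freqs.values())


def gene_diversity(freqs):
    """He = 1 - sum(p_i^2), without the sampling-bias correction."""
    return 1.0 - sum(p ** 2 for p in freqs.values())


def gametic_heterozygosity(genotype):
    """Chance that two allele copies drawn without replacement from one individual differ."""
    k = sum(genotype.values())
    same = sum(d * (d - 1) for d in genotype.values())
    return 1.0 - same / (k * (k - 1))


def inbreeding_coefficient(ho, he):
    """Gis = 1 - Ho / He."""
    return 1.0 - ho / he


# Hypothetical hexaploid genotypes at one locus.
population = [{"A": 5, "B": 1}, {"A": 3, "B": 3}]
freqs = allele_frequencies(population)
print(effective_alleles(freqs), gene_diversity(freqs))
print([gametic_heterozygosity(g) for g in population])
# Ho and He of the LU population with corrected dosages (Table 2).
print(round(inbreeding_coefficient(0.534, 0.557), 3))
```

Under this simplified definition, genotype AAAAAB scores 1/3 and AAABBB scores 0.6, whereas a two-allele genotype recorded without dosage, {A: 1, B: 1}, would score 1.0. This is one way to see how missing dosage information could inflate Ho.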
2.7  |  Genetic structure
The two data sets of corrected and uncorrected allele dosages (Cui et al., 2021) were used for genetic structure analysis. Principal component analysis (PCA) was performed using a covariance matrix between allele frequencies for individuals with GenoDive version 2.0b27 (Meirmans & Van Tienderen, 2004). GenoDive was also used to export data into the STRUCTURE format (Pritchard et al., 2000). ParallelStructure (Besnier & Glover, 2013; Pritchard et al., 2000) from the CIPRES Science Gateway (Miller et al., 2010) was used to infer the population genetic structure. The admixture model was used to determine the ancestry of individuals. The allele frequencies were assumed to be independent among populations. The population number (K) was evaluated from 1 to 10, and five replicate runs were carried out for each K. Each run had a burn-in period of 1,000,000 iterations, followed by 1,000,000 iterations after burn-in. STRUCTURE HARVESTER (Earl & vonHoldt, 2012) was used to determine the optimal K. For the optimal K, CLUMPP (Jakobsson & Rosenberg, 2007) was used to find the optimal alignments of the five replicate runs.
2.8  |  Genetic differentiation
The two data sets of corrected and uncorrected allele dosages (Cui et al., 2021) were used for genetic differentiation analysis with GenoDive version 2.0b27 (Meirmans & Van Tienderen, 2004). Estimates of genetic differentiation between populations were calculated, including FST from the analysis of molecular variance (AMOVA) between each pair of populations (Michalakis & Excoffier, 1996) and Rho (independent of the ploidy level and inheritance pattern) (Ronfort et al., 1998). A paired t test was used to compare genetic differentiation estimates with corrected and uncorrected allele dosages. Mantel tests were performed to analyse correlations of genetic differentiation matrices with corrected and uncorrected allele dosages and between FST and Rho. Significance levels were generated with Bonferroni correction for multiple comparisons. For testing isolation by distance, linear regressions were performed between FST/(1−FST) and the natural logarithm of geographical distance between populations.
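The isolation-by-distance regression can be sketched in Python as follows. The great-circle (haversine) distance and the plain least-squares fit are assumptions about how distances and regressions might be computed; the text does not state the procedure used. The FST values in the example are invented for illustration and are not results from this study; only the site coordinates are taken from Table 1.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points given in decimal degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))


def ols(xs, ys):
    """Slope and intercept of an ordinary least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def ibd_regression(pairs):
    """Regress FST/(1 - FST) on ln(distance); `pairs` holds (fst, distance_km)."""
    xs = [math.log(d) for _, d in pairs]
    ys = [f / (1 - f) for f, _ in pairs]
    return ols(xs, ys)


# Distance between Lu Mountain (LU) and Jinggang Mountain (JG), coordinates from Table 1.
print(round(haversine_km(29.60, 115.98, 26.55, 114.17)))
# Invented pairwise values, for illustration only.
slope, intercept = ibd_regression([(0.02, 150.0), (0.04, 400.0), (0.06, 900.0)])
print(slope, intercept)
```

The computed LU to JG distance is close to the 380 km quoted in the Introduction, which gives a quick sanity check on the coordinates. A positive slope in the regression is the pattern expected under isolation by distance.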
3 | RESULTS
3.1  |  High-throughput sequencing-based microsatellite genotyping
With high- throughput sequencing, the mean coverage was 5853,
and the median was 5832 per SSR locus per sample in our study.
Such high coverages ensured high precision of SSR read count and
frequency estimations. The output file of SSRSeq involved slip ratios
estimated for each SSR sequence at each locus. The SSR sequence
slip ratio generally increased with the increase in repeat number at
eac h locus , match ing equ ation 2 (Figur e S1). Befo re stutter pe ak cor-
rection, stutter peaks may be misidentified as SSR alleles and cause
biases in allele dosage estimations. The expected read frequency of
the stutter peak could be estimated with the slip ratio, and stutter
peak corrections were performed by subtracting the expected read
frequency of the stutter peak from the observed read frequency. On
the other hand, the output file of SSRSeq also included amplification
ratios estimated for each SSR allele at each locus. The amplification
ratio generally decreased with the increase in repeat number at each
locus (Figure S2). Amplification efficiency corrections were per-
formed with the amplification ratio. The output file of SSRSeq had
both unrounded and rounded data for the corrected allele dosages.
The corrected allele dosages (unrounded) were in close agreement
with the expected allele dosages (rounded) in our study, showing
high SSR genotyping accuracy with the new method (Figure S3). Ten
DNA samples were used for both single and multiplex PCR of the 35
SSR markers. The PCR products were sequenced, and the results
obtained via single or multiplex PCR were genotyped separately
with the new method. The genotyping results showed that the allele
number identified was not significantly different between single and
multiplex PCR of the 35 SSR markers (Figure S4).

TABLE 2 Genetic diversity analysis of wild Camellia oleifera using two SSR data sets with corrected and uncorrected allele dosages

Corrected allele dosages
Site  A      Ae     Allelic richness  Ho     He     Gis
LU    6.686  2.846  5.20              0.534  0.557  0.041*
FJ    7.229  3.241  6.39              0.572  0.602  0.050*
MTS   6.571  3.227  6.30              0.567  0.607  0.065*
JG    8.200  3.340  6.55              0.588  0.608  0.032*
NL    7.571  3.273  6.24              0.536  0.603  0.111*
LF    6.514  3.080  5.80              0.554  0.576  0.038*

Uncorrected allele dosages
Site  A      Ae     Allelic richness  Ho     He     Gis
LU    6.686  3.886  4.50              0.851  0.695  −0.224*
FJ    7.229  4.407  5.29              0.869  0.739  −0.175*
MTS   6.571  4.341  5.26              0.877  0.743  −0.180*
JG    8.200  4.614  5.38              0.884  0.744  −0.188*
NL    7.571  4.302  5.14              0.831  0.722  −0.151*
LF    6.514  4.142  4.89              0.857  0.718  −0.193*

*Significant heterozygosity deficit (positive Gis value) or heterozygosity excess (negative Gis value) in tests for Hardy-Weinberg equilibrium.
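
The two corrections described in this section (stutter peak and amplification efficiency) can be illustrated with a small sketch. This is not the SSRSeq implementation: the function name, the example read counts, the slip and amplification ratios, and the `min_fraction` threshold are all hypothetical. It assumes that the ratios are already known, that a stutter product falls one repeat unit below its parent allele, and that the amplification ratio is observed over expected dosage as in equation 3.

```python
import math


def correct_allele_dosages(reads, slip_ratio, amp_ratio, ploidy=6, min_fraction=0.05):
    """Sketch of stutter and amplification corrections for one locus.

    reads:      {repeat_number: observed read count}
    slip_ratio: {repeat_number: assumed stutter reads per parent read}
    amp_ratio:  {repeat_number: assumed observed/expected dosage ratio}
    Returns (unrounded, rounded) dosages; both sum to `ploidy`.
    """
    # Stutter correction, longest allele first: subtract the reads that
    # the (already corrected) allele one repeat longer is expected to
    # contribute through slippage.
    corrected = {}
    for n in sorted(reads, reverse=True):
        parent = n + 1
        expected_stutter = slip_ratio.get(parent, 0.0) * corrected.get(parent, 0.0)
        corrected[n] = max(0.0, reads[n] - expected_stutter)

    total = sum(corrected.values())
    if total <= 0:
        raise ValueError("no reads left after stutter correction")

    # Keep sequences that still hold a meaningful share of reads
    # (the threshold is an arbitrary choice for this sketch).
    alleles = {n: c for n, c in corrected.items() if c / total >= min_fraction}

    # Amplification correction: expected = observed / amplification ratio.
    adjusted = {n: c / amp_ratio.get(n, 1.0) for n, c in alleles.items()}

    # Scale to the ploidy level to obtain unrounded dosages.
    scale = ploidy / sum(adjusted.values())
    unrounded = {n: v * scale for n, v in adjusted.items()}

    # Round with the largest-remainder rule so that dosages sum to ploidy.
    rounded = {n: math.floor(v) for n, v in unrounded.items()}
    leftover = ploidy - sum(rounded.values())
    by_remainder = sorted(unrounded, key=lambda k: unrounded[k] - rounded[k], reverse=True)
    for n in by_remainder[:leftover]:
        rounded[n] += 1
    return unrounded, rounded


# Hypothetical hexaploid sample with alleles of 10 and 12 repeats.
reads = {9: 90, 10: 1800, 11: 400, 12: 4000}
slip = {10: 0.05, 12: 0.10}
amp = {10: 1.0, 12: 0.9}
unrounded, rounded = correct_allele_dosages(reads, slip, amp)
print(unrounded, rounded)
```

In this toy case the reads at 9 and 11 repeats are fully explained as stutter, and the longer allele is scaled up to compensate for its lower assumed amplification efficiency. A real pipeline would also need to handle alleles whose rounded dosage falls to zero.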
3.2  |  Genetic diversity
The results of genetic diversity analysis are shown in Table 2. The
number of alleles (A) was the same between data sets with corrected
and uncorrected allele dosages. However, the effective number of
alleles (Ae) was higher with uncorrected allele dosages. Allelic rich-
ness was higher with corrected allele dosages. With corrected allele
dosages, observed heterozygosity values were all lower than ex-
pected heterozygosity values and inbreeding coefficient values were
all positive, showing significant heterozygosity deficits (Figure 3a).
With uncorrected allele dosages, observed heterozygosity values
were all higher than expected heterozygosity values, and inbreeding
coefficient values were all negative, showing significant heterozy-
gosity excesses (Figure 3b). The observed/expected heterozygosity
values with uncorrected allele dosages were all higher than those
with corrected allele dosages (Figure 3). The genetic diversity of wild
C. oleifera at Jinggang Mountain (JG) was the highest and decreased
from the distribution centre to the northern/southern distribution
range of wild C. oleifera. The genetic diversity of wild C. oleifera was
the lowest at Lu Mountain (LU) in the northern distribution range of
wild C. oleifera.
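
The sign of the inbreeding coefficient follows directly from the two heterozygosity estimates. The sketch below assumes the common definition Gis = 1 − Ho/He (the exact estimator used by GenoDive is not restated in this section) and applies it to the rounded Ho and He values reported in Table 2 for two sites; the function name `gis` is hypothetical.

```python
def gis(ho, he):
    """Inbreeding coefficient under the assumed definition 1 - Ho/He."""
    return 1.0 - ho / he


# (Ho, He) pairs taken from Table 2 for the LU and NL populations.
table2 = {
    "LU": {"corrected": (0.534, 0.557), "uncorrected": (0.851, 0.695)},
    "NL": {"corrected": (0.536, 0.603), "uncorrected": (0.831, 0.722)},
}

for site, estimates in table2.items():
    for dosage, (ho, he) in estimates.items():
        print(f"{site} {dosage}: Gis = {gis(ho, he):.3f}")
```

Because the inputs are already rounded to three decimals, values recomputed in this way can differ from the published Gis in the last digit for some sites.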
3.3  |  Genetic structure
The PCA showed that most individuals of the LU population at
the highest latitude were differentiated from individuals of other
populations with corrected or uncorrected allele dosages (Figure
S5). Only with corrected allele dosages were most individuals of
the LF population at the lowest latitude separated from individu-
als of other populations, and the latter were mixed and located
between the LU and LF populations (Figure S5). With corrected or
uncorrected allele dosages, the optimal K was 2 (Figure 2d,e). Only
with corrected allele dosage did a secondary peak occur at K = 5
(Figure 2d). The results of STRUCTURE analyses (K = 2) showed
more gradual changes in genetic structure along latitudes with
corrected allele dosages (Figure 2a) than those with uncorrected
allele dosages (Figure 2b). The LU population at the highest lati-
tude was the most different from the others. From high to low lati-
tudes, the genetic structure with corrected allele dosages shifted
toward the genetic structure of the LF population at the lowest
latitude (Figure 2a), similar to the results of the PCA (Figure S5).
With uncorrected allele dosages, all populations except for the LU
population showed more or less the same genetic structure despite
geographical location (Figure 2b). In addition, more genetic clusters
(K = 5) were found with corrected allele dosages showing finer ge-
netic structures of wild populations (Figure 2c). In addition to the
distinguished LU population in the northern distribution range, the
genetic structures of the LF population at the lowest latitude and
the FJ population in the western distribution range were clearly
separated. The genetic structures of the MTS, JG and NL popula-
tions showed similarly mixed clusters.
3.4  |  Genetic differentiation
FST estimates with corrected allele dosages (mean FST = 0.026)
were significantly higher (p = .001) than those with uncorrected al-
lele dosages (mean FST = 0.009) (Table 3). In addition, the Mantel
test indicated that the correlation was insignificant between FST
estimates with corrected and uncorrected allele dosages (r = .604,
p = .070). With corrected allele dosages, FST was the highest (0.067)
between populations at the highest (LU, 29.60°N) and the lowest (LF,
23.28°N) latitudes (Table 3). However, with uncorrected allele
dosages, FST between LU and LF (0.013) was the same as that between
LU and MTS (27.73°N). With corrected/uncorrected allele dosages,
linear regression was insignificant (corrected: p = .427; uncorrected:
p = .538) between FST/(1−FST) and the natural logarithm of geographical
distance between populations (Figure 4).

FIGURE 3 Observed and expected heterozygosity estimates of wild Camellia oleifera populations. (a) Estimates with corrected allele dosage. (b) Estimates with uncorrected allele dosage. Solid circles indicate observed heterozygosity Ho estimates. Hollow circles indicate expected heterozygosity He estimates. From left to right, populations are sorted from high to low latitudes (LU, FJ, MTS, JG, NL, LF). x-axis: Population; y-axis: Heterozygosity (0.5 to 0.9)
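
The paired comparison of FST estimates can be reproduced approximately from the rounded values in Table 3. The sketch below is an illustration using only the Python standard library, not the software used in the study; the helper name `paired_t` is hypothetical, and the lists follow the lower triangle of Table 3 row by row (FJ-LU, MTS-LU, MTS-FJ, and so on).

```python
import math
import statistics

# Pairwise FST from the lower triangle of Table 3, read row by row.
fst_corrected = [0.033, 0.030, 0.014, 0.019, 0.015, 0.004, 0.031, 0.015,
                 0.003, 0.007, 0.067, 0.040, 0.040, 0.041, 0.039]
fst_uncorrected = [0.010, 0.013, 0.010, 0.006, 0.006, 0.007, 0.007, 0.008,
                   0.011, 0.003, 0.013, 0.010, 0.012, 0.011, 0.012]


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for two matched samples."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, n - 1


t_stat, df = paired_t(fst_corrected, fst_uncorrected)
print(f"mean corrected   = {statistics.mean(fst_corrected):.4f}")
print(f"mean uncorrected = {statistics.mean(fst_uncorrected):.4f}")
print(f"t = {t_stat:.2f}, df = {df}")
```

The p-value is left out on purpose, since it requires the t distribution (for example from a statistics package); the means recovered here agree with the reported 0.026 and 0.009 up to rounding of the table entries.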
Rho estimates with corrected allele dosages (mean Rho = 0.087)
were not significantly different from those with uncorrected allele
dosages (mean Rho = 0.076) (p = .103). With corrected allele dos-
ages, FST estimates were significantly correlated with Rho estimates
(r = .950, p = .004); with uncorrected allele dosages, the correlation
was insignificant (r = .440, p = .146).
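
A Mantel test of the kind reported here can be sketched as a permutation of population labels. The code below is a minimal standard-library illustration, not the GenoDive procedure: the function names, the number of permutations and the random seed are arbitrary choices, and the one-tailed p-value convention is an assumption. The input values are the corrected FST and Rho estimates from Table 3.

```python
import math
import random

# Corrected-dosage estimates from Table 3, ordered as the lower triangle
# of a matrix over LU, FJ, MTS, JG, NL, LF (FJ-LU, MTS-LU, MTS-FJ, ...).
fst_corrected = [0.033, 0.030, 0.014, 0.019, 0.015, 0.004, 0.031, 0.015,
                 0.003, 0.007, 0.067, 0.040, 0.040, 0.041, 0.039]
rho_corrected = [0.110, 0.097, 0.034, 0.069, 0.039, 0.002, 0.143, 0.052,
                 0.025, 0.049, 0.222, 0.122, 0.113, 0.132, 0.093]


def to_matrix(lower, n):
    """Build a symmetric n x n matrix from lower-triangle values."""
    m = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(1, n):
        for j in range(i):
            m[i][j] = m[j][i] = lower[k]
            k += 1
    return m


def lower_values(m, order):
    """Lower-triangle values after relabelling populations by `order`."""
    n = len(order)
    return [m[order[i]][order[j]] for i in range(1, n) for j in range(i)]


def pearson(x, y):
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel(a, b, permutations=999, seed=1):
    """Mantel correlation and one-tailed permutation p-value."""
    n = len(a)
    identity = list(range(n))
    x = lower_values(a, identity)
    r_obs = pearson(x, lower_values(b, identity))
    rng = random.Random(seed)
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)
        if pearson(x, lower_values(b, order)) >= r_obs:
            hits += 1
    return r_obs, hits / (permutations + 1)


r_obs, p_value = mantel(to_matrix(fst_corrected, 6), to_matrix(rho_corrected, 6))
print(f"r = {r_obs:.3f}, p = {p_value:.3f}")
```

Permuting whole populations rather than individual pairs keeps the dependence structure of the distance matrix intact, which is the reason a Mantel test is used instead of an ordinary correlation test. With six populations there are only 720 distinct relabellings, so the attainable p-values are coarse.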
4 | DISCUSSION
Conventional molecular methods cannot accurately identify the
SSR genotype of polyploids. Thus, codominant SSR genotypes in
polyploids may have to be treated as dominant data losing valuable
information in subsequent analyses (Dufresne et al., 2014). On the
other hand, software such as GenoDive can handle polyploid SSR
genotypes with unknown allele dosages and perform correction of
allele dosages using a maximum likelihood method based on random
mating within populations modified from De Silva et al. (2005). Since
actual allele frequencies are unknown in the correction, biases may
be introduced to population differentiation and structure analyses.
Methods have been developed to directly infer polyploid genotypes
based on ratios between SSR allele peak areas, for example, the
microsatellite DNA allele counting-peak ratios (MAC-PR) method
(Esselink et al., 2004). However, ratios between SSR allele peak areas
of capillary electrophoresis may not represent actual ratios of SSR
alleles, especially if they do not account for the stutter peak and
amplification efficiency of SSR alleles.
If allele dosages are uncertain in polyploid SSR genotypes, SSR
allele frequency estimation is biased. Population genetic analyses
based on biased allele frequencies may also be biased. Based on
model simulation, when allele dosage information is missing, ob-
served heterozygosity estimates in tetraploid populations are much
higher than true values, while expected heterozygosity estimates
are slightly higher than true values (Meirmans et al., 2018). With
allele dosage uncertainty, statistical testing for Hardy-Weinberg
equilibrium is not possible for polyploids (Meirmans et al., 2018).
For genetic structure analysis, structure is well suited for analys-
ing polyploids (Meirmans et al., 2018). In simulated mixed-ploidy
populations, structure is more robust than other clustering meth-
ods (Stift et al., 2019). Especially when population differentiation is
weak, structure is the only method that allows unbiased inference
with limited genotypic information of codominant markers with un-
known allele dosages or dominant markers (Stift et al., 2019). For
genetic differentiation estimates, missing dosage information leads
to overestimation of genetic diversity within populations and con-
sequently underestimation of the degree of population differenti-
ation (Meirmans et al., 2018). To estimate genetic differentiation in
polyploids, Rho may be the statistic of choice, as it is generally un-
biased with allele dosage uncertainty, independent of ploidy level
and mode of inheritance, and closely related to FST (Meirmans et al.,
2018; Meirmans & Van Tienderen, 2013).
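
The direction of the bias in observed heterozygosity can be illustrated with a toy hexaploid genotype. The sketch below assumes one common definition of within-individual heterozygosity in polyploids, the probability that two allele copies drawn without replacement from an individual differ; the function name is hypothetical, and the balanced assignment shown for the unknown-dosage case is a deliberately naive stand-in, not the maximum likelihood correction implemented in GenoDive.

```python
from math import comb


def individual_heterozygosity(dosages):
    """Probability that two allele copies drawn without replacement differ.

    dosages: {allele: copy number}; copy numbers sum to the ploidy level.
    """
    ploidy = sum(dosages.values())
    identical_pairs = sum(comb(d, 2) for d in dosages.values())
    return 1.0 - identical_pairs / comb(ploidy, 2)


# True hexaploid genotype AAAAAB (dosage known).
known = individual_heterozygosity({"A": 5, "B": 1})

# Same two alleles observed, dosage unknown and naively assumed balanced.
assumed_balanced = individual_heterozygosity({"A": 3, "B": 3})

print(f"known dosage:     {known:.3f}")
print(f"assumed balanced: {assumed_balanced:.3f}")
```

In this example the heterozygosity of the individual is 1/3 with the true dosage and 0.6 with the balanced assumption, which is the kind of upward bias in observed heterozygosity discussed above.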
TABLE 3 Genetic differentiation between populations of wild Camellia oleifera using two SSR data sets with corrected and uncorrected allele dosages. FST estimates (with corrected/uncorrected allele dosages) are in the lower triangle, and Rho estimates (with corrected/uncorrected allele dosages) are in the upper triangle. The p-values are indicated in brackets

     LU                           FJ                           MTS                          JG                           NL                           LF
LU   -                            0.110/0.120                  0.097/0.087                  0.069/0.093                  0.143/0.112                  0.222/0.170
FJ   0.033 (0.001)/0.010 (0.001)  -                            0.034/0.029                  0.039/0.049                  0.052/0.047                  0.122/0.089
MTS  0.030 (0.001)/0.013 (0.001)  0.014 (0.001)/0.010 (0.177)  -                            0.002/0.020                  0.025/0.025                  0.113/0.096
JG   0.019 (0.001)/0.006 (0.001)  0.015 (0.001)/0.006 (0.030)  0.004 (0.003)/0.007 (0.307)  -                            0.049/0.026                  0.132/0.081
NL   0.031 (0.001)/0.007 (0.001)  0.015 (0.001)/0.008 (0.041)  0.003 (0.113)/0.011 (0.307)  0.007 (0.001)/0.003 (0.012)  -                            0.093/0.099
LF   0.067 (0.001)/0.013 (0.001)  0.040 (0.001)/0.010 (0.001)  0.040 (0.001)/0.012 (0.001)  0.041 (0.001)/0.011 (0.001)  0.039 (0.001)/0.012 (0.001)  -
Our study has developed a new high-throughput sequencing-based
microsatellite genotyping method (Figure 1) to directly resolve
allele dosage uncertainty in polyploids using hexaploid wild C. oleifera
as a case study. As an alternative to multiplex PCR, one may perform
single PCR for each SSR marker and mix the products for sequencing.
However, as the number of markers and the sample size increase, the
labour needed for single PCR and post-PCR multiplexing will increase
far more than that required for multiplex PCR. Our study demonstrated
that with optimization, the allele number identified was not
significantly different between single and multiplex PCRs of the 35
SSR markers (Figure S4). Therefore, we propose to use multiplex PCR
with optimization in the method. For 100 SSR markers, the cost of
multiplex PCR, 5000× high-throughput sequencing and data analysis is
approximately 30 U.S. dollars per sample or 0.3 U.S. dollars per
genotype. The typical genotyping-by-sequencing (GBS) method can
generate data for many more markers, so the cost per genotype is much
lower. Nevertheless, the cost per sample of the GBS method is generally
several times higher than that of our method. Most importantly, our
method feasibly provides accurate SSR genotypes for up to hundreds of
SSR markers in hundreds or thousands of polyploid samples for genetic
diversity analysis. Perl scripts and an online SSR genotyping tool,
SSRSeq V1.1, are provided to output accurate polyploid genotypes with
the new method. Compared with capillary electrophoresis,
high-throughput sequencing of deep coverage enables more accurate
estimation of SSR sequence amounts and frequencies. Moreover, specific
corrections are introduced for the stutter peak and amplification
efficiency of SSR sequences. The results of hexaploid C. oleifera
showed that SSR sequences with higher repeat numbers had a higher ratio
of stutter peaks (Figure S1), which may lead to errors and biases in
SSR allele identification and dosage estimation. The slip ratio model
proposed in the study closely represented the actual SSR sequencing
data and therefore provided solid stutter peak correction of SSR
sequence frequency. In addition, we found that SSR alleles with higher
repeat numbers may have lower amplification efficiency (Figure S2);
therefore, amplification efficiency corrections must be performed.
Using the new method, accurate hexaploid genotypes of C. oleifera
with corrected allele dosages were obtained in the study (Figure S3).
These enabled direct comparisons of population genetic analyses
with corrected and uncorrected allele dosages. The results of our
study demonstrated that, with corrected and uncorrected allele
dosages, genetic diversity, structure and differentiation estimates and
inferences were considerably different.
Similar to the results of model simulations by Meirmans et al.
(2018), with uncorrected allele dosages, observed heterozygos-
ity estimates were abnormally high (>0.8) and significantly higher
than expected heterozygosity estimates, and both were higher
than those with corrected allele dosages (Figure 3). Using eight
highly polymorphic microsatellite markers with the traditional
capillary electrophoresis method, Huang et al. (2018) found simi-
larly high observed heterozygosity in wild C. oleifera of the Lu and
Jinggang Mountains, and some loci had observed heterozygosity
equal to 1. The authors argued that such high observed heterozy-
gosity suggested that C. oleifera was an allopolyploid with disomic
inheritance. However, in this study, with corrected allele dosages,
observed heterozygosity estimates (<0.6) were significantly lower
than expected heterozygosity, indicating significant heterozygosity
deficits in all populations. Thus, for hexaploid wild C. oleifera, the
genetic diversity estimates with uncorrected allele dosages were
seriously overestimated, especially for observed heterozygosity,
resulting in unrealistic inferences in the previous study. Wild C.
oleifera outcrosses through insect pollination, and its seeds are dis-
persed via small rodents in forests (Huang et al., 2018; Xiao et al.,
2004). With limited gene dispersal within populations, observed
heterozygosity estimates should be significantly lower than ex-
pected heterozygosity, as indicated by the results with corrected
allele dosage. Moreover, our study supported the “central-marginal
hypothesis”, which states that across geographical ranges of spe-
cies, within-population genetic diversity declines from the centre to
the periphery, although the differences were small in wild C. oleifera
(Figure 3), as in most cases in previous studies (Eckert et al., 2008).
Our study demonstrates that resolving allele dosage uncertainty
using our new method can achieve accurate estimates of genetic
diversity for polyploids.

FIGURE 4 Relationships between FST/(1−FST) and the natural logarithm of geographical distance between populations. (a) Estimates with corrected allele dosage (y = 0.0088x − 0.0267, R² = 0.0491). (b) Estimates with uncorrected allele dosage (y = 0.0011x + 0.0027, R² = 0.0299). Linear regression lines and equations with R² are shown. x-axis: Ln (geographical distance); y-axis: FST/(1−FST)
Although strong genetic structure could be distinguished even
with uncorrected allele dosages, subtle genetic structures could be
discovered among populations only with corrected allele dosages
(Figure 2). The wild C. oleifera population in Lu Mountain (LU) at the
highest latitude in the study was the most differentiated in genetic
structure with corrected and uncorrected allele dosages (Figure 2).
Lu Mountain is in the northern periphery of wild C. oleifera, adjacent
to the Yangtze River in the north and next to Poyang Lake in the
east and south, and isolated from other wild C. oleifera populations.
Adaptation isolation by cold climate conditions together with geo-
graphical isolation might lead to distinct genetic structures (Zhao
et al., 2013). With corrected allele dosages, the southern peripheral
population of wild C. oleifera in Luofu Mountain (LF) was distin-
guished in terms of genetic structure. Again, adaptation isolation by
warm climate conditions and geographical isolation from other pop-
ulations by Nanling Mountain might lead to distinct genetic struc-
tures. Our study indicates that resolving allele dosage uncertainty
is essential for discovering subtle genetic structures in polyploids.
As indicated in model simulations by Meirmans et al. (2018), the
classical FST estimates in our study were all very low with uncor-
rected allele dosages (Table 3), underestimating genetic differentia-
tion between wild C. oleifera populations compared to the estimates
with corrected allele dosages. With corrected allele dosages, FST was
the highest between the northern and southern peripheral popula-
tions, similar to the results of genetic structure analysis (Figure 2).
However, with uncorrected allele dosages, the FST between the
northern and southern peripheral populations was the same as that
between adjacent populations (Table 3), showing considerable bi-
ases. With corrected and uncorrected allele dosages, the patterns
of isolation-by-distance were insignificant, although with corrected
allele dosages, a slightly increased trend in FST/(1−FST) was detected
with the increase in the natural logarithm of geographical distance
between populations (Figure 4). The insignificance of isolation-by-
distance may be due to the small number of populations in the study.
According to the “central-marginal hypothesis”, in addition to the
declines in within-population genetic diversity from the centre of
the geographical range to the periphery, among-population differ-
entiation increases from the centre to the periphery (Eckert et al.,
2008). Again, the results of genetic structure and differentiation in
our study supported this hypothesis. Most importantly, our study
demonstrates that resolving allele dosage uncertainty can improve
FST estimates for polyploids.
Huang et al. (2018) showed that Rho could discriminate genetic
differentiation between and within hexaploid wild C. oleifera popula-
tions using the traditional microsatellite genotyping method. In our
study, we confirmed that, with corrected and uncorrected allele dos-
ages, Rho estimates showed similar genetic differentiation patterns
between wild C. oleifera populations correlated to FST estimates with
corrected allele dosages. However, the interpretation of Rho is dif-
ferent from that of FST (Meirmans & Van Tienderen, 2013). The Rho
estimate corresponds to the FST estimate for a haploid species with
the same population size and migration rate; therefore, for hexaploid
wild C. oleifera, the Rho estimates were consistently higher than the
FST estimates, as indicated by model simulations (Meirmans & Van
Tienderen, 2013).
In summary, our study demonstrated that with uncorrected al-
lele dosages, genetic diversity, structure and differentiation anal-
yses were considerably biased in hexaploid wild C. oleifera. The
new high-throughput sequencing-based microsatellite genotyping
method established in the study can resolve allele dosage uncer-
tainty and considerably improve genetic diversity, structure and
differentiation analyses for polyploids. The genetic variation pat-
terns of wild C. oleifera across geographical ranges agree with the
“central-marginal hypothesis”, stating that genetic diversity is high
in the central population and declines from the central to peripheral
populations, and genetic differentiation increases from the centre
to the periphery. In future studies, more populations of wild C. oleif-
era across geographical ranges are needed to verify the findings
and discover the underlying mechanisms generating such genetic
variation patterns.
ACKNOWLEDGEMENTS
This work was supported by the National Key Research and
Development Program of China (No. 2018YFD1000603), the
National Natural Science Foundation of China (NSFC Grant No.
31870311) and the “Gan-Po Talent 555” Project of Jiangxi Province,
China. We thank Jinxia Fu and colleagues at the Centre for Genetic
& Genomic Analysis, Genesky Biotechnologies Inc., Shanghai for
support in the development of the high-throughput sequencing-
based microsatellite genotyping method. We are grateful for the
valuable comments of the editors and reviewers, which helped to
dramatically improve the manuscript. Jun Rong would like to thank
Professor Peter G. L. Klinkhamer and Dr. Klaas Vrieling of Leiden
University and Dr. Patrick G. Meirmans of the University of Amsterdam
for motivating him to develop such an efficient molecular method in
polyploids.
AUTHOR CONTRIBUTIONS
Xiangyan Cui and Jun Rong designed and performed the experi-
ments, analysed the data and wrote the manuscript. Caihua Li, Yao
Zhao, Shengyuan Qin and Zebin Huang contributed to the experi-
ments, data analyses and writing. Bin Gan, Zhengwen Jiang, Xiaomao
Huang and Xiaoqiang Yang contributed to the experiments and data
analyses. Qin Li, Xiaoguo Xiang and Jiakuan Chen contributed to
writing the manuscript.
DATA AVAILABILITY STATEMENT
Microsatellite genotyping data with corrected and uncorrected allele
dosages of wild C. oleifera populations in the study have been made
available on Dryad (https://doi.org/10.5061/dryad.t4b8gthxd).
ORCID
Jun Rong https://orcid.org/0000-0003-1408-2898
REFERENCES
Andrew, R. L., Bernatchez, L., Bonin, A., Buerkle, C. A., Carstens, B. C., Emerson, B. C., Garant, D., Giraud, T., Kane, N. C., Rogers, S. M., Slate, J., Smith, H., Sork, V. L., Stone, G. N., Vines, T. H., Waits, L., Widmer, A., & Rieseberg, L. H. (2013). A road map for molecular ecology. Molecular Ecology, 22, 2605–2626. https://doi.org/10.1111/mec.12319
Besnier, F., & Glover, K. A. (2013). ParallelStructure: A R package to distribute parallel runs of the population genetics program STRUCTURE on multi-core computers. PLoS One, 8, e70651. https://doi.org/10.1371/journal.pone.0070651
Chalhoub, B., Denoeud, F., Liu, S., Parkin, I. A. P., Tang, H., Wang, X., Chiquet, J., Belcram, H., Tong, C., Samans, B., Correa, M., Da Silva, C., Just, J., Falentin, C., Koh, C. S., Le Clainche, I., Bernard, M., Bento, P., Noel, B., … Wincker, P. (2014). Early allopolyploid evolution in the post-Neolithic Brassica napus oilseed genome. Science, 345, 950–953. https://doi.org/10.1126/science.1253435
Cui, X., Qin, S., Huang, X., Yang, X., & Rong, J. (2021). Microsatellite genotypes of Camellia oleifera for GenoDive analysis. Dryad, Dataset, https://doi.org/10.5061/dryad.t4b8gthxd
Cui, X., Huang, X., Chen, J., Yang, X., & Rong, J. (2018). An efficient method for developing polymorphic microsatellite markers from high-throughput transcriptome sequencing: a case study of hexaploid oil-tea camellia (Camellia oleifera). Euphytica, 214, 26. https://doi.org/10.1007/s10681-018-2114-6
De Barba, M., Miquel, C., Lobréaux, S., Quenette, P. Y., Swenson, J. E., & Taberlet, P. (2017). High-throughput microsatellite genotyping in ecology: Improved accuracy, efficiency, standardization and success with low-quantity and degraded DNA. Molecular Ecology Resources, 17, 492–507. https://doi.org/10.1111/1755-0998.12594
De Silva, H. N., Hall, A. J., Rikkerink, E., McNeilage, M. A., & Fraser, L. G. (2005). Estimation of allele frequencies in polyploids under certain patterns of inheritance. Heredity, 95, 327–334. https://doi.org/10.1038/sj.hdy.6800728
Dufresne, F., Stift, M., Vergilino, R., & Mable, B. K. (2014). Recent progress and challenges in population genetics of polyploid organisms: An overview of current state-of-the-art molecular and statistical tools. Molecular Ecology, 23, 40–69. https://doi.org/10.1111/mec.12581
Earl, D. A., & vonHoldt, B. M. (2012). STRUCTURE HARVESTER: A website and program for visualizing STRUCTURE output and implementing the Evanno method. Conservation Genetics Resources, 4, 359–361. https://doi.org/10.1007/s12686-011-9548-7
Eckert, C. G., Samis, K. E., & Lougheed, S. C. (2008). Genetic variation across species’ geographical ranges: The central-marginal hypothesis and beyond. Molecular Ecology, 17, 1170–1188. https://doi.org/10.1111/j.1365-294X.2007.03659.x
Esselink, G. D., Nybom, H., & Vosman, B. (2004). Assignment of allelic configuration in polyploids using the MAC-PR (microsatellite DNA allele counting-peak ratios) method. Theoretical and Applied Genetics, 109, 402–408.
Hardy, O. J., & Vekemans, X. (2002). SPAGeDi: A versatile computer program to analyse spatial genetic structure at the individual or population levels. Molecular Ecology Notes, 2, 618–620. https://doi.org/10.1046/j.1471-8286.2002.00305.x
Huang, X., Chen, J., Yang, X., Duan, S., Long, C., Ge, G., & Rong, J. (2018). Low genetic differentiation among altitudes in wild Camellia oleifera, a subtropical evergreen hexaploid plant. Tree Genetics & Genomes, 14, 21. https://doi.org/10.1007/s11295-018-1234-4
International Wheat Genome Sequencing Consortium. (2014). A chromosome-based draft sequence of the hexaploid bread wheat (Triticum aestivum) genome. Science, 345, 1251788. https://doi.org/10.1126/science.1251788
Jakobsson, M., & Rosenberg, N. A. (2007). CLUMPP: A cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure. Bioinformatics, 23, 1801–1806. https://doi.org/10.1093/bioinformatics/btm233
Ma, J., Ye, H., Rui, Y., Chen, G., & Zhang, N. (2011). Fatty acid composition of Camellia oleifera oil. Journal für Verbraucherschutz und Lebensmittelsicherheit, 6, 9–12. https://doi.org/10.1007/s00003-010-0581-3
Meirmans, P. G., Liu, S., & van Tienderen, P. H. (2018). The analysis of polyploid genetic data. Journal of Heredity, 109, 283–296. https://doi.org/10.1093/jhered/esy006
Meirmans, P. G., & Van Tienderen, P. H. (2004). GENOTYPE and GENODIVE: Two programs for the analysis of genetic diversity of asexual organisms. Molecular Ecology Notes, 4, 792–794. https://doi.org/10.1111/j.1471-8286.2004.00770.x
Meirmans, P., & Van Tienderen, P. (2013). The effects of inheritance in tetraploids on genetic diversity and population divergence. Heredity, 110, 131–137. https://doi.org/10.1038/hdy.2012.80
Michalakis, Y., & Excoffier, L. (1996). A generic estimation of population subdivision using distances between alleles with special reference for microsatellite loci. Genetics, 142, 1061–1064. https://doi.org/10.1093/genetics/142.3.1061
Miller, M. A., Pfeiffer, W., & Schwartz, T. (2010). Creating the CIPRES Science Gateway for inference of large phylogenetic trees. In Proceedings of the Gateway Computing Environments Workshop (GCE) (pp. 1–8). New Orleans, LA.
Ming, T. L. (2000). Monograph of the genus Camellia. Yunnan Science and Technology Press.
Moody, M. E., Mueller, L. D., & Soltis, D. E. (1993). Genetic variation and random drift in autotetraploid populations. Genetics, 134, 649–657.
Pritchard, J. K., Stephens, M., & Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155, 945–959.
Renny-Byfield, S., & Wendel, J. F. (2014). Doubling down on genomes: Polyploidy and crop plants. American Journal of Botany, 101, 1711–1725. https://doi.org/10.3732/ajb.1400119
Ronfort, J., Jenczewski, E., Bataillon, T., & Rousset, F. (1998). Analysis of population structure in autotetraploid species. Genetics, 150, 921–930. https://doi.org/10.1093/genetics/150.2.921
Stift, M., Kolář, F., & Meirmans, P. G. (2019). STRUCTURE is more robust than other clustering methods in simulated mixed-ploidy populations. Heredity, 123, 429–441. https://doi.org/10.1038/s41437-019-0247-6
The Potato Genome Sequencing Consortium. (2011). Genome sequence and analysis of the tuber crop potato. Nature, 475, 189–195.
Vartia, S., Villanueva-Cañas, J. L., Finarelli, J., Farrell, E. D., Collins, P. C., Hughes, G. M., Carlsson, J. E. L., Gauthier, D. T., McGinnity, P., Cross, T. F., FitzGerald, R. D., Mirimin, L., Crispie, F., Cotter, P. D., & Carlsson, J. (2016). A novel method of microsatellite genotyping-by-sequencing using individual combinatorial barcoding. Royal Society Open Science, 3, 150565. https://doi.org/10.1098/rsos.150565
Wood, T. E., Takebayashi, N., Barker, M. S., Mayrose, I., Greenspoon, P. B., & Rieseberg, L. H. (2009). The frequency of polyploid speciation in vascular plants. Proceedings of the National Academy of Sciences of the United States of America, 106, 13875–13879. https://doi.org/10.1073/pnas.0811575106
Xiao, Z., Zhang, Z., & Wang, Y. (2004). Impacts of scatter-hoarding rodents on restoration of oil tea Camellia oleifera in a fragmented forest. Forest Ecology and Management, 196, 405–412. https://doi.org/10.1016/j.foreco.2004.04.001
Yang, J., Zhang, J., Han, R., Zhang, F., Mao, A., Luo, J., Dong, B., Liu, H., Tang, H., Zhang, J., & Wen, C. (2019). Target SSR-Seq: A novel SSR genotyping technology associate with perfect SSRs in genetic analysis of cucumber varieties. Frontiers in Plant Science, 10, 53. https://doi.org/10.3389/fpls.2019.00531
Zhao, Y., Vrieling, K., Liao, H., Xiao, M., Zhu, Y., Rong, J., Zhang, W., Wang, Y., Yang, J., Chen, J., & Song, Z. (2013). Are habitat fragmentation, local adaptation and isolation-by-distance driving population divergence in wild rice Oryza rufipogon? Molecular Ecology, 22, 5531–5547.
Zhuang, R. L. (2008). Oil-tea camellia in China (2nd ed.). China Forestry Publishing House.
SUPPORTING INFORMATION
Additional supporting information may be found online in the
Supporting Information section.
How to cite this article: Cui, X., Li, C., Qin, S., Huang, Z., Gan,
B., Jiang, Z., Huang, X., Yang, X., Li, Q., Xiang, X., Chen, J.,
Zhao, Y., & Rong, J. (2022). High-throughput sequencing-based
microsatellite genotyping for polyploids to resolve allele dosage
uncertainty and improve analyses of genetic diversity, structure and
differentiation: A case study of the hexaploid Camellia oleifera.
Molecular Ecology Resources, 22, 199–211.
https://doi.org/10.1111/1755-0998.13469
... By analyzing 25 phenotypic traits of 13 wild C. oleifera populations, it was found that wild C. oleifera varied significantly among different populations and that inter-population variation was greater than intra-population variation. Similar results were found in the genetic structure analysis of wild C. oleifera by Cui et al. [10]. This is mainly due to the fact that wild C. oleifera is widely distributed in the Wuyi Mountain Range, Luoxiao Mountain Range, Nanling Mountain Range of Guangdong Province, Huangshan Mountain Range, and other low mountainous areas, where the habitat differences between individual populations are large and gene flow is somewhat hindered [10]. ...
... Similar results were found in the genetic structure analysis of wild C. oleifera by Cui et al. [10]. This is mainly due to the fact that wild C. oleifera is widely distributed in the Wuyi Mountain Range, Luoxiao Mountain Range, Nanling Mountain Range of Guangdong Province, Huangshan Mountain Range, and other low mountainous areas, where the habitat differences between individual populations are large and gene flow is somewhat hindered [10]. Bioactive compounds such as tocopherols and squalene found in C. oleifera have antioxidant, anti-inflammatory, and other health benefits. ...
Article
Camellia oleifera is a woody oil crop with the highest oil yield and the largest cultivation area in China, and C. oleifera seed oil is a high-quality edible oil recommended by the Food and Agriculture Organization of the United Nations. The objectives of this study were to investigate the variation in fruit yield traits and seed chemical compositions of wild C. oleifera in China and to identify the differences between wild C. oleifera and cultivated varieties. In this study, we collected wild C. oleifera samples from 13 sites covering the main distribution areas of wild C. oleifera to comprehensively evaluate 25 quantitative traits of wild C. oleifera fruit and seed chemical compositions, and collected data on 10 quantitative traits from 434 cultivated varieties for a comparative analysis of the differences between wild and cultivated C. oleifera. The results showed that the coefficients of variation of the 25 quantitative traits of wild C. oleifera ranged from 2.605% to 156.641%, with an average of 38.569%. The phenotypic differentiation coefficients ranged from 25.003% to 99.911%, with an average of 77.894%. The Shannon–Wiener index (H’) ranged from 0.195 to 1.681. Based on the results of principal component analysis (PCA) and phenotypic differentiation coefficients, 10 traits differed significantly between wild C. oleifera and cultivated varieties, while the differentiation coefficients (VST) for fresh fruit weight, oleic acid, unsaturated fatty acids, stearic acid, and saturated fatty acids were more than 95%, of which fresh fruit weight and oleic acid content were potential domestication traits of C. oleifera. The results of this study can contribute to the efficient excavation and utilization of wild C. oleifera genetic resources for C. oleifera breeding.
... As one of the representative plant species in subtropical evergreen broadleaf forests, wild C. oleifera is widely distributed in the subtropical mountain and hilly areas of the Yangtze River Basin and southern China, at elevations ranging from about 200 to 2000 m (Ming, 2000; Zhuang, 2008). Wild C. oleifera showed rich genetic diversity and clear genetic differentiation among populations from different latitudes and longitudes with diverse climatic conditions (Cui et al., 2022). The wild C. oleifera population in Lu Mountain was found to have the most distinct genetic structure (Cui et al., 2022). ...
... Wild C. oleifera showed rich genetic diversity and clear genetic differentiation among populations from different latitudes and longitudes with diverse climatic conditions (Cui et al., 2022). The wild C. oleifera population in Lu Mountain was found to have the most distinct genetic structure (Cui et al., 2022). Lu Mountain is located in the north of Jiangxi Province, at the border between the middle and northern subtropical regions in China, and it is in the northern distribution periphery of wild C. oleifera. ...
Article
The molecular mechanisms of freezing tolerance are unresolved in perennial trees, which can survive much lower freezing temperatures than annual herbs. Since natural conditions involve many factors and temperature usually cannot be controlled, field experiments alone cannot directly identify the effects of freezing stress. Lab experiments are insufficient for trees to complete cold acclimation and cannot reflect natural freezing-stress responses. In this study, a new method was proposed using field plus lab experiments to identify freezing tolerance and associated genes in subtropical evergreen broadleaf trees, using Camellia oleifera as a case. Cultivated C. oleifera is the dominant woody oil crop in China. Wild C. oleifera at the high-elevation site in Lu Mountain could survive below −30°C, providing a valuable genetic resource for the breeding of freezing tolerance. In the field experiment, air temperature was monitored from autumn to winter on wild C. oleifera at the high-elevation site in Lu Mountain. Leaf samples were taken from wild C. oleifera before cold acclimation, during cold acclimation and under freezing temperature. Leaf transcriptome analyses indicated that the gene functions and expression patterns were very different during cold acclimation and under freezing temperature. In the lab experiments, leaf samples from wild C. oleifera after cold acclimation were placed under −10°C in climate chambers. A cultivated C. oleifera variety “Ganwu 1” was used as a control. According to relative conductivity changes of leaves, wild C. oleifera was more freezing-tolerant than cultivated C. oleifera. Leaf transcriptome analyses showed that the gene expression patterns were very different between wild and cultivated C. oleifera in the lab experiment. Combining transcriptome results from both the field and lab experiments, the common genes associated with freezing-stress responses were identified.
Key genes of the flg22, Ca²⁺ and gibberellin signal transduction pathways and the lignin biosynthesis pathway may be involved in the freezing-stress responses. Most of the genes had the highest expression levels under freezing temperature in the field experiment and showed higher expression in wild C. oleifera with stronger freezing tolerance in the lab experiment. Our study may help identify freezing tolerance and underlying molecular mechanisms in trees.
... 'Nanyongensis' (CON), Camellia chekiangoleosa Hu. (CCH) and Camellia lanceoleosa, 2n = 2x = 30) have also been reported, and genome sizes of 2.95 Gb, 2.73 Gb and 3.00 Gb were obtained, respectively (Lin et al., 2022; Shen et al., 2022). However, C. chekiangoleosa belongs to section Camellia, whereas C. lanceoleosa and C. oleifera belong to the same section, Oleifera, but are evolutionarily distantly related (Cui et al., 2022). The diploid CON is considered to be the ancestral species of the hexaploid C. oleifera, and they differ morphologically. ...
Article
Oil‐Camellia (Camellia oleifera), belonging to the Theaceae family Camellia, is an important woody edible oil tree species. The Camellia oil in its mature seed kernels, mainly consists of more than 90% unsaturated fatty acids, tea polyphenols, flavonoids, squalene and other active substances, which is one of the best quality edible vegetable oils in the world. However, genetic research and molecular breeding on oil‐Camellia are challenging due to its complex genetic background. Here, we successfully report a chromosome‐scale genome assembly for a hexaploid oil‐Camellia cultivar Changlin40. This assembly contains 8.80 Gb genomic sequences with scaffold N50 of 180.0 Mb and 45 pseudochromosomes comprising 15 homologous groups with three members each, which contain 135 868 genes with an average length of 3936 bp. Referring to the diploid genome, intragenomic and intergenomic comparisons of synteny indicate homologous chromosomal similarity and changes. Moreover, comparative and evolutionary analyses reveal three rounds of whole‐genome duplication (WGD) events, as well as the possible diversification of hexaploid Changlin40 with diploid occurred approximately 9.06 million years ago (MYA). Furthermore, through the combination of genomics, transcriptomics and metabolomics approaches, a complex regulatory network was constructed and allows to identify potential key structural genes (SAD, FAD2 and FAD3) and transcription factors (AP2 and C2H2) that regulate the metabolism of Camellia oil, especially for unsaturated fatty acids biosynthesis. Overall, the genomic resource generated from this study has great potential to accelerate the research for the molecular biology and genetic improvement of hexaploid oil‐Camellia, as well as to understand polyploid genome evolution.
... To create a panel of microsatellite markers intended for individual identification and for verifying the authenticity of origin of Siberian sturgeon, we tested 27 microsatellite loc- As is known, the presence of null alleles distorts statistical calculations by inflating homozygosity (27,28). Species with polyploid genomes have an increased risk of null allele formation (29). ...
... As is known, the presence of null alleles distorts statistical calculations by inflating homozygosity (27,28). Species with polyploid genomes have an increased risk of null allele formation (29). ...
... SSR molecular markers, also known as simple sequence repeats and microsatellites, are among the most widely used markers. Research on molecular markers has been conducted worldwide on woody oil plants such as olives (Mousavi et al. 2019; Vuletin Selak et al. 2021), coconuts (Bandupriya et al. 2022) and walnuts (Sütyemez et al. 2021; Guney et al. 2021), and molecular markers have been widely used in genetic diversity analysis of C. oleifera (Jia et al. 2014; Cui et al. 2022a; Dong et al. 2022). ...
Article
Camellia oleifera Abel., as one of the four major woody oilseeds, has a high economic value, and the wild C. oleifera genes, whose distribution area is located at the northern edge, are abundant and are valuable resources for C. oleifera breeding. In this study, a total of 341 wild C. oleifera populations were sampled from 11 different locations in Xinxian County, the hinterland of the Dabie Mountains in the northern margin of the distribution of C. oleifera in China, and 16 pairs of simple sequence repeat (SSR) molecular markers were used to analyse the genetic diversity. Using these 16 pairs of primers to detect the genetic diversity of the wild C. oleifera population, 174 alleles were amplified. The average number of alleles (Na) was 10.875, the average expected heterozygosity (He) was 0.739, the observed heterozygosity (Ho) was 0.718, and the average polymorphic information index (PIC) was 0.739. The 11 wild C. oleifera populations in Xinxian County had high genetic diversity, and the average expected heterozygosity (He) among populations was 0.735. The molecular variance showed that the genetic variation mainly came from within the population, accounting for 88.21% of the total variation. The genetic differentiation coefficient Fst between populations was small, with an average of only 0.04. According to the results of Structure, principal coordinate analysis (PCoA) and cluster analysis, these 11 populations could be roughly divided into two categories. The Mantel test preferentially clustered some populations close to each other, but there was no significant correlation between genetic distance and geographical distance. This study provides a theoretical basis for the rational development and utilisation of wild C. oleifera resources and a scientific and technological method for future breeding.
... Crop wild relatives, which have rich genetic variation and retain excellent characteristics, are valuable genetic resources for the improvement of various crops (Bohra et al., 2022). Cui et al. (2022) conducted a study on wild C. oleifera populations from different latitudes and longitudes in China, and revealed that the genetic structure of the wild C. oleifera population in Lu Mountain was the most unique. Lu Mountain is located at the border between the middle and northern subtropical regions in China, and it is in the northern distribution periphery of wild C. oleifera, where extreme sub-zero temperatures often occur during the winter (Xie et al., 2023). ...
Article
Camellia oleifera Abel. is the dominant woody oil crop under significant development in China. Wild C. oleifera in Lu Mountain is a valuable genetic resource with strong freezing tolerance. With high-throughput sequencing, the genome of wild C. oleifera in Lu Mountain was analysed and 700.3 Gb clean reads were obtained. The genome of wild C. oleifera was estimated as allohexaploid, and its haplotype genome size was about 2.69 Gb-2.79 Gb, with repeat content of 63.01%-73.02% and heterozygosity of 6.30%-7.43%, belonging to a very complex genome. The genomic draft was assembled that contained a total of 6,952,303 scaffolds with N50 length of 1.23 kb, and the overall length was 2.39 Gb with GC content of 40.87%. In the genomic draft, 1,104,618 SSRs were identified; scaffold1096012 and scaffold1779458 were identified as key genes associated with freezing tolerance combined with the transcriptome data of field plus lab experiments. In this study, the genomic background of hexaploid wild C. oleifera in Lu Mountain was revealed. This lays the foundation for obtaining the high-quality chromosome level reference genome of wild C. oleifera. The identification of SSRs and key genes associated with freezing tolerance may contribute to the efficient exploration and utilisation of this genetic resource.
... Kunej et al. (2020) compared the two SSR approaches and stated that, although there is room for improvement in speed, accuracy and price, HTS SSR can replace the CE approach in the years to come. Such an HTS SSR approach may be advantageous, particularly for polyploid species, because it can resolve allele dosage uncertainty: the problem that, for example in a hexaploid species, it is impossible to determine how many copies of an allele (one to five) are present (Cui et al., 2022). ...
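To make the dosage ambiguity described in this excerpt concrete, here is a minimal Python sketch; it is an illustration only and is not taken from the cited works or from SSRSeq. It enumerates every genotype consistent with a set of observed alleles when only allele presence (not copy number) is known. The function name `candidate_genotypes` and the allele labels are hypothetical, and the sketch assumes a fixed ploidy and that each observed allele is present in at least one copy.

```python
from itertools import combinations_with_replacement


def candidate_genotypes(observed_alleles, ploidy=6):
    """List all allele-dosage assignments consistent with the observed alleles.

    Assumption (illustrative): each observed allele has at least one copy,
    and the copies of all alleles sum to the ploidy level.
    """
    alleles = sorted(set(observed_alleles))
    if not alleles or len(alleles) > ploidy:
        raise ValueError("need between 1 and `ploidy` distinct alleles")

    # One copy of each allele is fixed; the remaining copies can be
    # distributed among the observed alleles in any way.
    remaining = ploidy - len(alleles)
    genotypes = []
    for filler in combinations_with_replacement(alleles, remaining):
        dosage = {allele: 1 for allele in alleles}
        for allele in filler:
            dosage[allele] += 1
        genotypes.append(dosage)
    return genotypes


if __name__ == "__main__":
    # Two alleles observed in a hexaploid: each may have one to five copies.
    for genotype in candidate_genotypes(["A", "B"]):
        print(genotype)
```

With two observed alleles in a hexaploid, the sketch lists five candidate genotypes (dosages 5:1, 4:2, 3:3, 2:4 and 1:5), while six distinct alleles leave only one. Presence-only scoring cannot choose among the candidates, which is the gap that read-frequency-based genotyping is meant to close.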
Article
The contribution of vine cultivation to human welfare as well as the stimulation of basic social and cultural features of civilization has been great. The wide temporal and regional distribution created a wide array of genetic variants that have been used as propagating material to promote cultivation. Information on the origin and relationships among cultivars is of great interest from a phylogenetics and biotechnology perspective. Fingerprinting and exploration of the complicated genetic background of varieties may contribute to future breeding programs. In this review, we present the most frequently used molecular markers, which have been used on Vitis germplasm. We discuss the scientific progress that led to the new strategies being implemented utilizing state-of-the-art next generation sequencing technologies. Additionally, we attempted to delimit the discussion on the algorithms used in phylogenetic analyses and differentiation of grape varieties. Lastly, the contribution of epigenetics is highlighted to tackle future roadmaps for breeding and exploitation of Vitis germplasm. The latter will remain in the top of the edge for future breeding and cultivation and the molecular tools presented herein, will serve as a reference point in the challenging years to come.
Article
Camellia hainanica, which is common in China’s Hainan Province, is an important woody olive tree species. Due to many years of geographic isolation, C. hainanica has not received the attention it deserves, which limits the exploitation of germplasm resources. Therefore, it is necessary to study population genetic characteristics for further utilization and conservation of C. hainanica. In this study, 96 individuals in six wild Camellia hainanica populations were used for ploidy analysis of the chromosome number, and the genetic diversity and population structure were investigated using 12 pairs of SSR primers. The results show complex ploidy differentiation in C. hainanica species. The ploidy of wild C. hainanica includes tetraploid, pentaploid, hexaploid, heptaploid, octoploid and decaploid species. Genetic analysis shows that genetic diversity and genetic differentiation among populations are low. Populations can be divided into two clusters based on their genetic structure, which matches their geographic location. Finally, to further maintain the genetic diversity of C. hainanica, ex-situ cultivation and in-situ management measures should be considered to protect it in the future.
Article
Analysis of population genetic structure has become a standard approach in population genetics. In polyploid complexes, clustering analyses can elucidate the origin of polyploid populations and patterns of admixture between different cytotypes. However, combining diploid and polyploid data can theoretically lead to biased inference with (artefactual) clustering by ploidy. We used simulated mixed-ploidy (diploid-autotetraploid) data to systematically compare the performance of k-means clustering and the model-based clustering methods implemented in Structure, Admixture, FastStructure and InStruct under different scenarios of differentiation and with different marker types. Under scenarios of strong population differentiation, the tested applications performed equally well. However, when population differentiation was weak, Structure was the only method that allowed unbiased inference with markers with limited genotypic information (co-dominant markers with unknown dosage or dominant markers). Still, since Structure was comparatively slow, the much faster but less powerful FastStructure provides a reasonable alternative for large datasets. Finally, although bias makes k-means clustering unsuitable for markers with incomplete genotype information, for large numbers of loci (>1000) with known dosage k-means clustering was superior to FastStructure in terms of power and speed. We conclude that Structure is the most robust method for the analysis of genetic structure in mixed-ploidy populations, although alternative methods should be considered under some specific conditions.
Article
Simple sequence repeats (SSR) – also known as microsatellites – have been used extensively in genetic analysis, fine mapping, quantitative trait locus (QTL) mapping, as well as marker-assisted selection (MAS) breeding and other techniques. Despite a plethora of studies reporting that perfect SSRs with stable motifs and flanking sequences are more efficient for genetic research, the lack of a high throughput technology for SSR genotyping has limited their use as genetic targets in many crops. In this study, we developed a technology called Target SSR-seq that combined the multiplexed amplification of perfect SSRs with high throughput sequencing. This method can genotype plenty of SSR loci in hundreds of samples with highly accurate results, due to the substantial coverage afforded by high throughput sequencing. We also detected 844 perfect SSRs based on 182 resequencing datasets in cucumber, of which 91 SSRs were selected for Target SSR-seq. Finally, 122 SSRs, including 31 SSRs for varieties identification, were used to genotype 382 key cucumber varieties readily available in Chinese markets using our Target SSR-seq method. Libraries of PCR products were constructed and then sequenced on the Illumina HiSeq X Ten platform. Bioinformatics analysis revealed that 111 filtered SSRs were accurately genotyped with an average coverage of 1289× at an extremely low cost; furthermore, 398 alleles were observed in 382 cucumber cultivars. Genetic analysis identified four populations: northern China type, southern China type, European type, and Xishuangbanna type. Moreover, we acquired a set of 16 core SSRs for the identification of 382 cucumber varieties, of which 42 were isolated as backbone cucumber varieties. This study demonstrated that Target SSR-seq is a novel and efficient method for genetic research.
Article
Camellia oleifera is a subtropical evergreen plant. Cultivated C. oleifera is the most important woody oil crop in China. Wild C. oleifera is an essential genetic resource for breeding. The patterns of genetic differentiation among altitudes/latitudes in wild C. oleifera are still unknown. Camellia oleifera may be predominantly hexaploid. The characteristics of polyploidy may lead to considerable biases in estimates of genetic diversity and differentiation. Our study used C. oleifera as a case study for analysing genetic diversity, structure and differentiation in polyploid plants using simple sequence repeats (SSRs). Wild C. oleifera samples were collected at different altitudes on the Jinggang and Lu mountains of China. The ploidy levels were determined with flow cytometry analysis. Eight highly polymorphic SSRs were used to genotype the samples. Genetic diversity and structure were analysed. Various estimates of genetic differentiation were compared. The flow cytometry results indicated that wild C. oleifera samples were all hexaploid at various altitudes of the Jinggang and Lu mountains. High levels of genetic diversity were found on both the Jinggang and Lu mountains. Genetic structure analyses indicated clear genetic differentiation between the Jinggang and Lu mountains and lower genetic differentiation among altitudes within each mountain. Classical genetic differentiation estimates of Fst failed to discriminate genetic differentiation between and within mountains. The Rho statistic showed a moderate level of genetic differentiation between mountains and lower levels of genetic differentiation within each mountain. Our study demonstrates that Rho is the statistic of choice for estimating genetic differentiation in polyploids.
Article
Though polyploidy is an important aspect of the evolutionary genetics of both plants and animals, the development of population genetic theory of polyploids has seriously lagged behind that of diploids. This is unfortunate since the analysis of polyploid genetic data -and the interpretation of the results- requires even more scrutiny than with diploid data. This is because of several polyploidy-specific complications in segregation and genotyping such as tetrasomy, double reduction and missing dosage information. Here, we review the theoretical and statistical aspects of the population genetics of polyploids. We discuss several widely-used types of inferences, including genetic diversity, Hardy-Weinberg equilibrium, population differentiation, genetic distance, and detecting population structure. For each, we point out how the statistical approach, expected result, and interpretation differ between different ploidy levels. We also discuss for each type of inference what biases may arise from the polyploid-specific complications and how these biases can be overcome. From our overview, it is clear that the statistical toolbox that is available for the analysis of genetic data is flexible and still expanding. Modern sequencing techniques will soon be able to overcome some of the current limitations to the analysis of polyploid data, though the techniques are lagging behind those available for diploids. Furthermore, the availability of more data may aggravate the biases that can arise, and increase the risk of false inferences. Therefore, simulations such as we used throughout this review are an important tool to verify the results of analyses of polyploid genetic data.
Article
The bottleneck of microsatellite marker development is determining the polymorphisms of microsatellite markers. A large number of microsatellites can be detected via high-throughput sequencing. However, most previous studies did not fully use the high-throughput sequencing data to predict the number of alleles at microsatellite loci. Instead, laborious experiments were performed to manually screen microsatellite loci, determine the number of alleles at each locus and select those with polymorphisms for marker development. In this study, we improved the method for efficient development of polymorphic microsatellite markers from high-throughput transcriptome sequencing, using hexaploid oil-tea camellia as a case study. Leaf transcriptomes of eight wild oil-tea camellia samples from different altitudes in Jinggang and Lu Mountains, China, were sequenced. Microsatellites were directly identified in the sequencing reads and primers were designed. Strategies were designed to filter duplicate and multi-locus markers. For each marker, the number of alleles across samples was predicted and the length of the potentially amplifiable sequence was estimated. 153 predicted polymorphic markers were selected and empirically validated in the eight samples. Sixty-five markers (42%) were polymorphic (2–12 alleles) and 31 (20%) were highly polymorphic (6–12 alleles). The empirical number of alleles was generally higher than the predicted number of alleles but they were significantly correlated. The predicted allele length was within the empirical allele length range. Compared with most previous studies, the method shows a higher efficiency for developing polymorphic markers and filtering duplicate and multi-locus markers. The polymorphic microsatellite markers developed can be used for analyzing the genetic diversity of oil-tea camellia.
Article
This study examines the potential of next-generation sequencing based 'genotyping-by-sequencing' (GBS) of microsatellite loci for rapid and cost-effective genotyping in large-scale population genetic studies. The recovery of individual genotypes from large sequence pools was achieved by PCR-incorporated combinatorial barcoding using universal primers. Three experimental conditions were employed to explore the possibility of using this approach with existing and novel multiplex marker panels and weighted amplicon mixture. The GBS approach was validated against microsatellite data generated by capillary electrophoresis. GBS allows access to the underlying nucleotide sequences that can reveal homoplasy, even in large datasets and facilitates cross laboratory transfer. GBS of microsatellites, using individual combinatorial barcoding, is potentially faster and cheaper than current microsatellite approaches and offers better and more data.
Article
Several estimators of population differentiation have been proposed in the recent past to deal with various types of genetic markers (i.e., allozymes, nucleotide sequences, restriction fragment length polymorphisms, or microsatellites). We discuss the relationships among these estimators and show how a single analysis of variance framework can accommodate these qualitatively different data types.
Article
We describe a model-based clustering method for using multilocus genotype data to infer population structure and assign individuals to populations. We assume a model in which there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more populations if their genotypes indicate that they are admixed. Our model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers, provided that they are not closely linked. Applications of our method include demonstrating the presence of population structure, assigning individuals to populations, studying hybrid zones, and identifying migrants and admixed individuals. We show that the method can produce highly accurate assignments using modest numbers of loci—e.g., seven microsatellite loci in an example using genotype data from an endangered bird species. The software used for this article is available from http://www.stats.ox.ac.uk/~pritch/home.html.
Article
Microsatellite markers have played a major role in ecological, evolutionary and conservation research during the past 20 years. However, technical constraints related to the use of capillary electrophoresis and a recent technological revolution that has impacted other marker types have brought into question the continued use of microsatellites for certain applications. We present a study for improving microsatellite genotyping in ecology using high-throughput sequencing (HTS). This approach entails selection of short markers suitable for HTS, sequencing PCR-amplified microsatellites on an Illumina platform, and bioinformatic treatment of the sequence data to obtain multilocus genotypes. It takes advantage of the fact that HTS gives direct access to microsatellite sequences, allowing unambiguous allele identification, and enabling automation of the genotyping process through bioinformatics. In addition, the massive parallel sequencing abilities expand the information content of single experimental runs far beyond capillary electrophoresis. We illustrated the method by genotyping brown bear samples amplified with a multiplex PCR of 13 new microsatellite markers and a sex marker. HTS of microsatellites provided accurate individual identification and parentage assignment, and resulted in a significant improvement in genotyping success (84%) for degraded fecal DNA and in cost reduction compared to capillary electrophoresis. The HTS approach holds vast potential for improving success, accuracy, efficiency and standardization of microsatellite genotyping in ecological and conservation applications, especially those that rely on profiling of low-quantity/quality DNA and on the construction of genetic databases. We discuss and give perspectives for the implementation of the method in light of the challenges encountered in wildlife studies.
Article
An ordered draft sequence of the 17-gigabase hexaploid bread wheat (Triticum aestivum) genome has been produced by sequencing isolated chromosome arms. We have annotated 124,201 gene loci distributed nearly evenly across the homeologous chromosomes and subgenomes. Comparative gene analysis of wheat subgenomes and extant diploid and tetraploid wheat relatives showed that high sequence similarity and structural conservation are retained, with limited gene loss, after polyploidization. However, across the genomes there was evidence of dynamic gene gain, loss, and duplication since the divergence of the wheat lineages. A high degree of transcriptional autonomy and no global dominance was found for the subgenomes. These insights into the genome biology of a polyploid crop provide a springboard for faster gene isolation, rapid genetic marker development, and precise breeding to meet the needs of increasing food demand worldwide.