Allele frequency spectrum of array data (left) and WGS data (right) for 39 populations 

Source publication
Article
Full-text available
Background: Single nucleotide polymorphism (SNP) panels have been widely used to study genomic variations within and between populations. Methods of SNP discovery have been a matter of debate for their potential of introducing ascertainment bias, and genetic diversity results obtained from the SNP genotype data can be misleading. We used a total of...

Contexts in source publication

Context 1
... assessing allele frequency calling in the pooled WGS data, high correlations were obtained between the allele frequencies estimated with the Array_all data set and pool WGS data set at each corresponding locus and very slight differences between the allele frequency spectra, we conclude that the estimation of allele frequencies from pooled sequences is sufficiently reliable. When comparing the AFS from the two datasets (not based on corresponding loci), the array dataset severely underrepresented the rare SNPs (Fig. 2). This confirmed the already known findings of other studies on ascertained SNP data, e.g. [9,14,15], and therefore suggests a risk for an ascertainment bias in array-based analysis of the chicken biodiversity ...
Context 2
... proportion of SNPs was 39.6% and 39.9% in genes and was 0.044% and 0.012% in exons for array and WGS data, respectively (see Additional file 2: Table S1). Differences in (minor) allele frequencies (in genic and non-genic regions) followed a similar pattern to that observed in Fig. 2 whereby rare variants were underrepresented in the array data. The correlations between MAF proportions in genic and non-genic regions were 0.956 and 0.999 in the array and WGS data, respectively. The minor allele frequency of SNPs differed very little between the genic and non-genic regions with the array and sequence data (Additional file 3: Figure S1). From this we concluded that the selection of SNPs in the array was not biased based on their positions in genic or non-genic regions, although differences between the two sets were found in the exonic regions whereby the array set had an overrepresentation of ...
Context 3
... allele frequency spectra showed remarkable differences for the two data types (Fig. 2). The array data had very low but increasing numbers of SNPs at allele frequencies between 0 and 0.175 while the WGS had a very high number of rare variants between 0 and 0.025 and SNP numbers decreased with increasing frequencies, with the exception of the last window (which includes the fixation of the derived allele) which was found to be slightly ...
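The binned spectrum described in Context 3 can be illustrated with a minimal sketch. It assumes a bin width of 0.025, inferred from the frequency ranges quoted above rather than stated as the authors' setting, and the function name `allele_frequency_spectrum` is hypothetical. It only shows how per-SNP allele frequencies turn into the proportions plotted in such a figure.

```python
def allele_frequency_spectrum(freqs, bin_width=0.025):
    """Return the proportion of SNPs falling in each allele-frequency bin.

    freqs: iterable of per-SNP allele frequencies in [0, 1].
    bin_width: width of each frequency window (0.025 is an assumption).
    """
    n_bins = round(1.0 / bin_width)
    counts = [0] * n_bins
    for p in freqs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"allele frequency out of range: {p}")
        # The last bin is closed on the right, so p == 1.0 (a fixed
        # derived allele) is counted in the final window.
        idx = min(int(p / bin_width), n_bins - 1)
        counts[idx] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no allele frequencies supplied")
    return [c / total for c in counts]


if __name__ == "__main__":
    # Toy frequencies, not data from the study.
    toy = [0.01, 0.02, 0.5, 1.0]
    spectrum = allele_frequency_spectrum(toy)
    print(spectrum[0], spectrum[-1])
```

Frequencies that sit exactly on an interior bin edge may land in either neighbouring bin because of floating-point rounding; a production version would need an explicit rule for that case.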
Context 4
... agreement with e.g. [3], (based on microsatellite data) both the commercial brown (BL_A and BL_D) and white Figure S2, estimated using the data with 42 populations). The commercial white egg layers, which emerged from a single parental origin, the White Leghorn breed [5,54], had very low genetic diversity. The brown layers (BL_A and BL_D) with multi-parental origins of Asian and European background had more genetic diversity compared to white layers. Noting that these commercial breeds were part of the discovery panel, we investigated whether the H_e results behaved differently than in other populations when using array data. Unlike the two brown layer lines with elevated H_e ranking when using any of the array data, the white layers didn't deviate from the WGS H_e ranking when using the array data (Additional file 3: Figure S2). So this makes it difficult to tie the effects of ascertainment bias on H_e estimation to the relatedness of the breeds to the discovery panel breeds. Furthermore, the fact that the commercial lines' individuals used in the array data are different to those used in the WGS could also be of impact in this ...
Context 6
... fitting a linear regression of the WGS-based H_e values on array-based H_e values the slope is >2 with all considered data sets (smallest with 2.150 for the LD pruned data, see Table 3 and in Additional file 3: Figure S3) reflecting not only a systematic overestimation of expected heterozygosity from array data, but also a scale effect resulting in an even more severe overestimation for highly heterozygous breeds. While the underrepresentation of low MAF SNPs in the array data compared to WGS data (cf. Fig. 2) provides a good explanation for the observed difference in the average H_e, the reason for the scale effect remains to be understood. A comparison between the estimated pairwise F_ST values of WGS and the different filtered versions of the array data is shown in Fig. 4. The black regression line shows the expected linear relationship between the F_ST of WGS and array where the pairwise F_ST values estimated from the two sets are equal. The Array_all, Array_MAF5 and the versions filtered for being polymorphic in the Gallus gallus populations (GG and GG_MAF5) underestimated the F_ST where WGS F_ST was low (0.09 to <0.15) and overestimated the F_ST where WGS F_ST was high (>0.15). The LD pruned versions (Pruned and Pruned_MAF5) and the LD pruned plus polymorphic in Gallus gallus populations (Pruned_GG and Pruned_GG_MAF5) data sets consistently underestimated the pairwise F_ST values. The regression lines for comparing WGS F_ST and F_ST estimated from the LD pruned versions didn't cross through the expected regression line, while for versions without LD pruning the regression lines crossed each other. The slopes and regression coefficients (R²) of these linear relationships are presented in Table 4. The WGS vs. Pruned data had the lowest R² (0.887), however, with a slope (1.027) closer to 1 compared to the rest of the other array sets. The WGS vs. GG and GG_MAF5 had the highest R² (0.919 for both of them) and yet the highest slope too (1.208 and 1.209 respectively), whereas in this case a better slope (close to 1) is preferred (it justifies the significance of the linear relationship between the pairwise F_ST values estimated from WGS and array data). A combination of filtering SNPs based on LD and retaining SNPs that are polymorphic in the wild populations (GG) improved the R² but compromised the slope. Table 5 shows the Frobenius (F) distances between the distance matrices of WGS and array (on the diagonal), and the different array sets among themselves. The mean F distance between WGS and Pruned data was the lowest (3.152) and highest between WGS and GG_MAF5 data (6.700). A lower F distance means the two compared distance matrices are more similar. Therefore the pairwise distance matrix of Pruned data is more related to the WGS than the rest of the sets. Among the array versions, the most distant matrices were found between the Pruned version and the GG and GG_MAF5 versions (these GG and GG_MAF5 versions had the highest distances to the matrix of WGS ...
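The Frobenius distance used in Context 6 to compare distance matrices can be sketched as follows. This is the textbook definition (the square root of the summed squared element-wise differences). The excerpt does not say which distance matrices were compared or how the "mean" F distance was averaged, so the function name `frobenius_distance` and the toy matrices are assumptions for illustration only.

```python
import math


def frobenius_distance(a, b):
    """Frobenius norm of the element-wise difference between two matrices.

    a, b: matrices given as lists of equal-length rows.
    A smaller value means the two matrices are more similar.
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    squared = sum(
        (x - y) ** 2
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )
    return math.sqrt(squared)


if __name__ == "__main__":
    # Toy pairwise distance matrices for three populations (not study data).
    wgs = [[0.0, 0.10, 0.20], [0.10, 0.0, 0.15], [0.20, 0.15, 0.0]]
    array = [[0.0, 0.08, 0.25], [0.08, 0.0, 0.15], [0.25, 0.15, 0.0]]
    print(frobenius_distance(wgs, array))
```

Because the sum runs over the full matrix, each off-diagonal difference of a symmetric distance matrix is counted twice; whether the study used the full matrix or one triangle is not stated in the excerpt.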

Similar publications

Article
Full-text available
Vrindavani is an Indian composite cattle breed developed by crossbreeding taurine dairy breeds with native indicine cattle. The constituent breeds were selected for higher milk production and adaptation to the tropical climate. However, the selection response for production and adaptation traits in the Vrindavani genome is not explored. In this stu...
Article
Full-text available
The performance and productivity of livestock have consistently improved by natural and artificial selection over the centuries. Both these selections are expected to leave patterns on the genome and lead to changes in allele frequencies, but natural selection has played the major role among indigenous populations. Detecting selective sweeps in liv...
Article
Full-text available
To conduct ex-situ creole pig conservation programs, it is essential to determine which breeding animals will be used, preferentially those with a more significant Iberian genetic component to preserve their origin. This study used a Yucatan black hairless pigs (YBHP) subpopulation to estimate its genetic diversity and population structure. One hu...
Article
Full-text available
Horses are traditionally used in Kazakhstan as a source of food and as working and saddle animals as well. Here, for the first time, microarray-based medium-density single nucleotide polymorphism (SNP) genotyping of six traditionally defined types and breeds of indigenous Kazakh horses was conducted to reveal their genetic structure and find marker...
Article
Full-text available
Genetic diversity analysis is crucial for maintaining and managing genetic resources. Several studies have examined the genetic diversity of Korean domestic chicken (KDC) populations using microsatellite markers, but it is difficult to capture the characteristics of the whole genome in this manner. Hence, this study analyzed the genetic diversity o...

Citations

... This is not an issue for most analyses that are performed within a single panmictic population. However, if the samples come from multiple ancestral populations then standard LD pruning can cause biases [14,50]. This is because LD is created between alleles with different ancestral allele frequencies: the so-called two-locus Wahlund effect [2,12,13]. ...
Preprint
Full-text available
Standard measures of linkage disequilibrium (LD) are affected by admixture and population structure, such that loci that are not in LD within each ancestral population appear linked when considered jointly. The influence of population structure on LD can cause problems for downstream analysis methods, in particular those that rely on LD pruning or clumping. To address this issue, we propose a measure of LD that accommodates population structure using the top inferred principal components. We estimate LD from the correlation of genotype residuals and prove that this LD measure remains unaffected by population structure when analyzing multiple populations jointly, even with admixed individuals. Based on this adjusted measure of LD, we can perform LD pruning to remove the correlation between markers for downstream analysis. Traditional LD pruning is more likely to remove markers with high differences in allele frequencies between populations, which biases measures for genetic differentiation and removes markers that are not in LD in the ancestral populations. Using data from moderately differentiated human populations and highly differentiated giraffe populations we show that traditional LD pruning biases Fst and PCA but that this can be alleviated with the adjusted LD measure. In addition, we show the adjusted LD leads to better PCA when pruning and that LD clumping retains more sites and the retained sites have stronger associations.
... However, we think that this is unlikely, because ascertainment bias typically affects analyses such as selection signal detection that use individual SNP locus frequency-based statistics (e.g. F_ST) substantially more than genome-wide multi-locus dimension reduction tools like ADMIXTURE and PCA [59-62]. Importantly, in this regard, we used only WGS data for the CSS analyses (figure 3 and electronic supplementary material, table S3). ...
Article
Full-text available
Criollo cattle, the descendants of animals brought by Iberian colonists to the Americas, have been the subject of natural and human-mediated selection in novel tropical agroecological zones for centuries. Consequently, these breeds have evolved distinct characteristics such as resistance to diseases and exceptional heat tolerance. In addition to European taurine (Bos taurus) ancestry, it has been proposed that gene flow from African taurine and Asian indicine (Bos indicus) cattle has shaped the ancestry of Criollo cattle. In this study, we analysed Criollo breeds from Colombia and Venezuela using whole-genome sequencing (WGS) and single-nucleotide polymorphism (SNP) array data to examine population structure and admixture at high resolution. Analysis of genetic structure and ancestry components provided evidence for African taurine and Asian indicine admixture in Criollo cattle. In addition, using WGS data, we detected selection signatures associated with a myriad of adaptive traits, revealing genes linked to thermotolerance, reproduction, fertility, immunity and distinct coat and skin coloration traits. This study underscores the remarkable adaptability of Criollo cattle and highlights the genetic richness and potential of these breeds in the face of climate change, habitat flux and disease challenges. Further research is warranted to leverage these findings for more effective and sustainable cattle breeding programmes.
... We designed our assay specifically for the Pilbara population of M. gigas, yet the species also persists in disjunct, threatened populations elsewhere in northern Australia. SNP panels are known to be impacted by ascertainment bias since SNP selection is typically made based on the allele frequencies of only a subset of individuals or populations that are available to study, rather than the global population [67,68]. Whilst we only had genomic data from eight colonies of M. gigas, these spanned the two main geographic sub-regions in the Pilbara (Chichester and Hamersley). ...
Article
Full-text available
Genetic tagging from scats is one of the minimally invasive sampling (MIS) monitoring approaches commonly used to guide management decisions and evaluate conservation efforts. Microsatellite markers have traditionally been used but are prone to genotyping errors. Here, we present a novel method for individual identification in the Threatened ghost bat Macroderma gigas using custom-designed Single Nucleotide Polymorphism (SNP) arrays on the MassARRAY system. We identified 611 informative SNPs from DArTseq data from which three SNP panels (44–50 SNPs per panel) were designed. We applied SNP genotyping and molecular sexing to 209 M. gigas scats collected from seven caves in the Pilbara, Western Australia, employing a two-step genotyping protocol and identifying unique genotypes using a custom-made R package, ScatMatch. Following data cleaning, the average amplification rate was 0.90 ± 0.01 and SNP genotyping errors were low (allelic dropout 0.003 ± 0.000) allowing clustering of scats based on one or fewer allelic mismatches. We identified 19 unique bats (9 confirmed/likely males and 10 confirmed/likely females) from a maternity and multiple transitory roosts, with two male bats detected using roosts, 9 km and 47 m apart. The accuracy of our SNP panels enabled a high level of confidence in the identification of individual bats. Targeted SNP genotyping is a valuable tool for monitoring and tracking of non-model species through a minimally invasive sampling approach.
... The SNPs utilised in this panel were originally detected in domestic cattle [3,11,28,29,34,36-39]. Though common practice [40-42], it is obvious that such a reduced number of SNPs found in a related species as well as an ascertainment bias from selecting for high polymorphism in our target species will not allow for unbiased estimates of genetic diversity [43,44]. Thus, any results regarding genetic diversity using this SNP panel need to be interpreted with caution. ...
Article
Full-text available
The European bison was saved from the brink of extinction due to considerable conservation efforts since the early twentieth century. The current global population of > 9500 individuals is the result of successful ex situ breeding based on a stock of only 12 founders, resulting in an extremely low level of genetic variability. Due to the low allelic diversity, traditional molecular tools, such as microsatellites, fail to provide sufficient resolution for accurate genetic assessments in European bison, let alone from non-invasive samples. Here, we present a SNP panel for accurate high-resolution genotyping of European bison, which is suitable for a wide variety of sample types. The panel accommodates 96 markers allowing for individual and parental assignment, sex determination, breeding line discrimination, and cross-species detection. Two applications were shown to be utilisable in further Bos species with potential conservation significance. The new SNP panel will allow to tackle crucial tasks in European bison conservation, including the genetic monitoring of reintroduced populations, and a molecular assessment of pedigree data documented in the world’s first studbook of a threatened species.
... While biological ascertainment bias between wild and cultivated subpopulations or different cultivated subpopulations (i.e., breeding programs) can lead to misinterpretation of total genome-wide diversity (Heslot et al., 2013; Malomane et al., 2018), technical ascertainment bias in the amplification of homologous allelic sequences increases the likelihood of genotyping errors at loci targeted by markers (Lighten et al., 2014; Zhang et al., 2015). This can result from non-copy specific oligo binding to both target and off-target genome sequences, or alternatively, weaker oligo binding to non-reference haplotypes within on-target regions due to interference from OTVs around the target SNP. ...
Article
Full-text available
Genomic prediction in breeding populations containing hundreds to thousands of parents and seedlings is prohibitively expensive with current high‐density genetic marker platforms designed for strawberry. We developed mid‐density panels of molecular inversion probes (MIPs) to be deployed with the “DArTag” marker platform to provide a low‐cost, high‐throughput genotyping solution for strawberry genomic prediction. In total, 7742 target single nucleotide polymorphism (SNP) regions were used to generate MIP assays that were tested with a screening panel of 376 octoploid Fragaria accessions. We evaluated the performance of DArTag assays based on genotype segregation, amplicon coverage, and their ability to produce subgenome‐specific amplicon alignments to the FaRR1 assembly and subsequent alignment‐based variant calls with strong concordance to DArT's alignment‐free, count‐based genotype reports. We used a combination of marker performance metrics and physical distribution in the FaRR1 assembly to select 3K and 5K production panels for genotyping of large strawberry populations. We show that the 3K and 5K DArTag panels are able to target and amplify homologous alleles within subgenomic sequences with low‐amplification bias between reference and alternate alleles, supporting accurate genotype calling while producing marker genotypes that can be treated as functionally diploid for quantitative genetic analysis. The 3K and 5K target SNPs show high levels of polymorphism in diverse F. × ananassa germplasm and UC Davis cultivars, with mean pairwise diversity (π) estimates of 0.40 and 0.32 and mean heterozygous genotype frequencies of 0.35 and 0.33, respectively.
... Ascertainment bias, the nonrandom analysis of loci resulting in parameter estimate biases (Nielsen 2000), has the potential to affect both genetic and genomic methods. For instance, microsatellite selection based on a single population will be biased toward detection of variation present in that population and away from detection of variation in a distinct, highly differentiated population (Malomane et al. 2018). This results in systematic deviations, such as the underrepresentation of rare alleles, which can underestimate signals of a population expansion (Nielsen 2000). ...
Article
Full-text available
Conservation genetic analyses of many endangered species have been based on genotyping of microsatellite loci and sequencing of short fragments of mtDNA. The increase in power and resolution afforded by whole genome approaches may challenge conclusions made on limited numbers of loci and maternally inherited haploid markers. Here we provide a matched comparison of whole genome sequencing versus microsatellite and control region genotyping for Eurasian otters (Lutra lutra). Previous work identified four genetically differentiated 'stronghold' populations of otter in Britain, derived from regional populations that survived the population crash of the 1950-80s. Using whole genome resequencing data from 45 samples from across the British stronghold populations we confirmed some aspects of population structure derived from previous marker-driven studies. Importantly we showed that genomic signals of the population crash bottlenecks matched evidence from otter population surveys. Unexpectedly, two strongly divergent mitochondrial lineages were identified that were undetectable using control region fragments, and otters in the east of England were genetically distinct and surprisingly variable. We hypothesise that this previously unsuspected variability may derive from past releases of Eurasian otters from other, non-British source populations in England around the time of the population bottleneck. Our work highlights that even reasonably well studied species may harbour genetic surprises, if studied using modern high-throughput sequencing methods.
... SNPs were then pruned on the basis of linkage disequilibrium (LD) using the parameter --indep 50 5 2 in PLINK. Pruning of SNPs that are in high LD has been shown to counter the effect of ascertainment bias and to generate meaningful comparisons among breeds (Malomane et al., 2018). Following QC and pruning, a set of 9,015 SNPs was used for the analyses. ...
Article
Full-text available
The traditionally bred Irish Sport Horse, known as the Traditional Irish Horse, is an important cultural asset to horse genetic resources in Ireland. We tested the hypothesis that the Irish Sport Horse, which was originally developed from the Irish Hunter, may contain a genetic background distinct from European horse populations that would be valuable to preserve. Using genome-wide single nucleotide (SNP) data, the results show that Traditional Irish Horses (with confirmed pedigrees) have lower levels of European ancestry components than other Irish Sport Horses. These results indicate that measurement of the levels of European ancestry components in the Irish Sport Horse may assist in the preservation of traditional Irish lineages.
... Given the small size of the group used for SNP discovery (the ascertainment group), the target SNPs are expected to be subject to ascertainment bias, for instance not containing rare SNPs that are monomorphic in the ascertainment group. Ascertainment bias is known to generate several artifacts with respect to non-biased methods such as whole genome sequencing (Malomane et al., 2018): an increase of expected heterozygosity, an underestimation of fixation index (F_ST) in populations with low (<0.15) F_ST values and overestimation of those with high (>0.15) ...
... F_ST. Various corrections can be applied to reduce the undesired effects of ascertainment bias, of which the most effective is linkage disequilibrium (LD)-based SNP pruning (Malomane et al., 2018). In addition, since SPET is based on the sequencing of a 110-bp region surrounding the target SNP, it allows the discovery of several additional SNPs, not represented in the ascertainment panel, which are in LD with the target SNP. ...
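The increase in expected heterozygosity mentioned above can be made concrete with a small sketch. It assumes biallelic SNPs and the standard per-locus formula H_e = 2p(1 − p) averaged over loci, and it mimics ascertainment crudely by discarding SNPs below a minor allele frequency cutoff. The function names and the toy frequencies are hypothetical, and the estimator used in the cited studies may differ.

```python
def expected_heterozygosity(freqs):
    """Mean expected heterozygosity over biallelic loci: mean of 2p(1 - p)."""
    freqs = list(freqs)
    if not freqs:
        raise ValueError("no allele frequencies supplied")
    return sum(2.0 * p * (1.0 - p) for p in freqs) / len(freqs)


def filter_by_maf(freqs, min_maf):
    """Keep loci whose minor allele frequency is at least min_maf."""
    return [p for p in freqs if min(p, 1.0 - p) >= min_maf]


if __name__ == "__main__":
    # Toy spectrum rich in rare variants, as sequencing data tends to be.
    all_snps = [0.01, 0.02, 0.05, 0.30, 0.50]
    ascertained = filter_by_maf(all_snps, min_maf=0.05)
    print(expected_heterozygosity(all_snps))
    print(expected_heterozygosity(ascertained))
```

Rare variants contribute values of 2p(1 − p) close to zero, so removing them raises the average, which is the direction of bias the excerpts describe for array data.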
Article
Eggplant (Solanum melongena) is an important Solanaceous crop, widely cultivated and consumed in Asia, the Mediterranean basin, and Southeast Europe. Its domestication centers and migration and diversification routes are still a matter of debate. We report the largest georeferenced and genotyped collection to this date for eggplant and its wild relatives, consisting of 3499 accessions from seven worldwide genebanks, originating from 105 countries in five continents. The combination of genotypic and passport data points to the existence of at least two main centers of domestication, in Southeast Asia and the Indian subcontinent, with limited genetic exchange between them. The wild and weedy eggplant ancestor S. insanum shows admixture with domesticated S. melongena, similar to what was described for other fruit-bearing Solanaceous crops such as tomato and pepper and their wild ancestors. After domestication, migration and admixture of eggplant populations from different regions have been less conspicuous with respect to tomato and pepper, thus better preserving 'local' phenotypic characteristics. The data allowed the identification of misclassified and putatively duplicated accessions, facilitating genebank management. All the genetic, phenotypic, and passport data have been deposited in the Open Access G2P-SOL database, and constitute an invaluable resource for understanding the domestication, migration and diversification of this cosmopolitan vegetable.
... The problematic case is non-outgroup ascertainment, that is ascertainment on a population that is co-analyzed with others. A series of papers explored non-outgroup ascertainment affecting measures of population divergence on simulated data and real data for humans and domestic animals [37-47]. However, D- and f-statistics, which are more robust than other allele frequency-based statistics in many cases [16], were not considered in those studies. ...
Article
Full-text available
f-statistics have emerged as a first line of analysis for making inferences about demographic history from genome-wide data. Not only are they guaranteed to allow robust tests of the fits of proposed models of population history to data when analyzing full genome sequencing data, that is, all single nucleotide polymorphisms (SNPs) in the individuals being analyzed, but they are also guaranteed to allow robust tests of models for SNPs ascertained as polymorphic in a population that is an outgroup in a phylogenetic sense to all groups being analyzed. True “outgroup ascertainment” is in practice impossible in humans because our species has arisen from a substructured ancestral population that does not descend from a homogeneous ancestral population going back many hundreds of thousands of years into the past. However, initial studies suggested that non-outgroup-ascertainment schemes might produce robust enough results using f-statistics, and that motivated widespread fitting of models to data using non-outgroup-ascertained SNP panels such as the “Affymetrix Human Origins array” which has been genotyped on thousands of modern individuals from hundreds of populations, or the “1240k” in-solution enrichment reagent which has been the source of about 70% of published genome-wide data for ancient humans. In this study, we show that while analyses of population history using such panels work well for studies of relationships among non-African populations and one African outgroup, when co-modeling more than one sub-Saharan African and/or archaic human groups (Neanderthals and Denisovans), fitting of f-statistics to such SNP sets is expected to frequently lead to false rejection of true demographic histories, and failure to reject incorrect models. Analyzing panels of SNPs polymorphic in archaic humans, which has been suggested as a solution for the ascertainment problem, has limited statistical power and retains important biases. However, by carrying out simulations of diverse demographic histories, we show that bias in inferences based on f-statistics can be minimized by ascertaining on variants common in a union of diverse African groups; such ascertainment retains high statistical power while allowing co-analysis of archaic and modern groups.
... In our research, the SNP call rate was > 0.985 and MAF was > 0.1 for all used markers. Despite the high quality of the generated genotype data, the efficient strategy proposed by Malomane et al. [45] to mitigate the ascertainment bias was applied. Namely, stringent LD-based SNP filtering was carried out using the Plink 1.9 software [46] with flag indep 50 5 2, representing a moving window variance inflation factor (VIF) based SNP pruning within 50 SNP windows and a moving step of 5, where VIF is calculated as VIF = 1/(1 − r²). ...
... To further analyze the genomic implications of population expansion observed in the MRIZP panel, genomewide scans for signatures of selection were carried out. To facilitate the robust search for signatures of selection and avoid the detection of false positives in data generated by SNP-array suffering from ascertainment bias of polymorphism states, a stringent SNP pruning mitigation strategy was applied, as suggested by [45]. The scan for sweeps was based on a SweepFinder method [34] implemented for large genomewide SNP data in SweeD software [49]. ...
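The VIF relation quoted above, VIF = 1/(1 − r²), can be worked through numerically to show what a threshold of 2 (the third value in "indep 50 5 2") implies. The sketch below only converts between r² and VIF; it does not reproduce PLINK's windowed procedure, in which the r² term comes from regressing each SNP on the other SNPs in the window. The helper names are hypothetical.

```python
def vif_from_r2(r2):
    """Variance inflation factor for a given squared correlation r2."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must be in [0, 1)")
    return 1.0 / (1.0 - r2)


def r2_from_vif(vif):
    """Squared correlation implied by a given VIF threshold."""
    if vif < 1.0:
        raise ValueError("VIF cannot be below 1")
    return 1.0 - 1.0 / vif


if __name__ == "__main__":
    # A VIF threshold of 2 corresponds to r2 = 0.5.
    print(r2_from_vif(2.0))
    # Uncorrelated SNPs have VIF = 1.
    print(vif_from_r2(0.0))
```

Under this relation, a VIF cutoff of 2 flags SNPs for which at least half of the genotype variance is explained by the other SNPs in the window.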
Article
Full-text available
Southeast Europe (SEE) is a very important maize-growing region, comparable to the Corn belt region of the United States, with similar dent germplasm (dent by dent hybrids). Historically, this region has undergone several genetic material swaps, following the trends in the US, with one of the most significant swaps related to US aid programs after WWII. The imported accessions used to make double-cross hybrids were also mixed with previously adapted germplasm originating from several more distant OPVs, supporting the transition to single cross-breeding. Many of these materials were deposited at the Maize Gene Bank of the Maize Research Institute Zemun Polje (MRIZP) between the 1960s and 1980s. A part of this Gene Bank (572 inbreds) was genotyped with Affymetrix Axiom Maize Genotyping Array with 616,201 polymorphic variants. Data were merged with two other genotyping datasets with mostly European flint (TUM dataset) and dent (DROPS dataset) germplasm. The final pan-European dataset consisted of 974 inbreds and 460,243 markers. Admixture analysis showed seven ancestral populations representing European flint, B73/B14, Lancaster, B37, Wf9/Oh07, A374, and Iodent pools. Subpanel of inbreds with SEE origin showed a lack of Iodent germplasm, marking its historical context. Several signatures of selection were identified at chromosomes 1, 3, 6, 7, 8, 9, and 10. The regions under selection were mined for protein-coding genes and were used for gene ontology (GO) analysis, showing a highly significant overrepresentation of genes involved in response to stress. Our results suggest the accumulation of favorable allelic diversity, especially in the context of changing climate in the genetic resources of SEE. Supplementary Information The online version contains supplementary material available at 10.1186/s12870-023-04336-2.