Figure - available via license: Creative Commons Attribution 2.0 Generic
Content may be subject to copyright.
Geographic map of the HapMap phase III world populations. ASW = Southwest USA residents with African ancestry; CEU = Utah residents with Northern and Western European ancestry; CHB = Han Chinese in Beijing, China; CHD = Chinese in Metropolitan Denver, Colorado; GIH = Gujarati Indians in Houston, Texas; JPT = Japanese in Tokyo, Japan; LWK = Luhya in Webuye, Kenya; MKK = Maasai in Kinyawa, Kenya; MXL = Mexicans in Los Angeles, California; TSI = Toscani in Italia; YRI = Yoruba in Ibadan, Nigeria.

Source publication
Article
Full-text available
Background Population stratification is a systematic difference in allele frequencies between subpopulations. This can lead to spurious association findings in the case–control genome wide association studies (GWASs) used to identify single nucleotide polymorphisms (SNPs) associated with disease-linked phenotypes. Methods such as self-declared ance...
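The mechanism described in this abstract can be illustrated with a small, purely hypothetical calculation. The sketch below assumes two subpopulations in which the allele has no effect on disease at all, but which differ in both allele frequency and disease prevalence. All names and numbers (`Subpopulation`, `pooled_allele_freqs`, the sizes, frequencies and prevalences) are invented for illustration and do not come from the source publication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subpopulation:
    name: str
    size: int            # number of individuals sampled
    allele_freq: float   # frequency of the allele in this subpopulation
    prevalence: float    # fraction of individuals who are cases


def pooled_allele_freqs(subpops):
    """Expected allele frequency among pooled cases and pooled controls.

    Within each subpopulation the allele is assumed independent of disease,
    so cases and controls from the same subpopulation share its allele_freq.
    """
    n_cases = sum(s.size * s.prevalence for s in subpops)
    n_controls = sum(s.size * (1 - s.prevalence) for s in subpops)
    case_freq = sum(s.size * s.prevalence * s.allele_freq for s in subpops) / n_cases
    control_freq = (
        sum(s.size * (1 - s.prevalence) * s.allele_freq for s in subpops) / n_controls
    )
    return case_freq, control_freq


# Hypothetical subpopulations: B has both a higher allele frequency
# and a higher disease prevalence than A.
subpops = [
    Subpopulation("A", size=1000, allele_freq=0.10, prevalence=0.05),
    Subpopulation("B", size=1000, allele_freq=0.50, prevalence=0.20),
]

case_freq, control_freq = pooled_allele_freqs(subpops)
print(f"pooled cases:    {case_freq:.3f}")
print(f"pooled controls: {control_freq:.3f}")
```

With these assumed numbers the pooled cases carry the allele noticeably more often than the pooled controls, even though the allele is unrelated to disease inside either subpopulation. That gap is the kind of spurious association the abstract attributes to stratification.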

Context in source publication

Context 1
... HapMap phase III datasets, released in 2009, contained 1458387 SNPs of 1397 subjects including 87 Southwest USA residents with African ancestry (ASW), 165 Utah residents with ancestry from Northern and Western Europe (CEU), 137 Han Chinese in Beijing, China (CHB), 109 metropolitan Denver, Colorado residents with Chinese ancestry (CHD), 101 Gujarati Indians in Houston, Texas (GIH), 113 Japanese in Tokyo, Japan (JPT), 110 individuals from the Luhya tribe in Webuye, Kenya (LWK), 86 Los Angeles, California residents with Mexican ancestry (MXL), 184 individuals from the Maasai tribe in Kinyawa, Kenya (MKK), 102 Toscani Italians (TSI), and 203 Yorubans in Ibadan, Nigeria (YRI). Figure 1 shows the geographic map of the HapMap III world populations. We utilize the HapMap III datasets to build predictive models for inferring sub-continental ancestry origins of Africans (LWK vs. MKK vs. YRI), Europeans (CEU vs. TSI), East Asians (CHB vs. JPT), North Americans (ASW vs. CEU vs. CHD vs. GIH vs. MXL), Kenyans (LWK vs. MKK), and Chinese (CHB vs. CHD). ...
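As a quick consistency check on the figures quoted in this excerpt, the per-population counts can be tallied and grouped by the classification tasks it lists. This is only a bookkeeping sketch: the dictionary and function names are hypothetical, and the numbers are copied from the excerpt rather than read from the HapMap files.

```python
# Sample counts per HapMap phase III population, as quoted in the excerpt.
HAPMAP3_SAMPLES = {
    "ASW": 87, "CEU": 165, "CHB": 137, "CHD": 109, "GIH": 101, "JPT": 113,
    "LWK": 110, "MXL": 86, "MKK": 184, "TSI": 102, "YRI": 203,
}

# Sub-continental classification tasks named in the excerpt
# (task label -> population codes being distinguished).
ANCESTRY_TASKS = {
    "Africans": ["LWK", "MKK", "YRI"],
    "Europeans": ["CEU", "TSI"],
    "East Asians": ["CHB", "JPT"],
    "North Americans": ["ASW", "CEU", "CHD", "GIH", "MXL"],
    "Kenyans": ["LWK", "MKK"],
    "Chinese": ["CHB", "CHD"],
}


def task_sizes(tasks, samples):
    """Number of subjects available for each classification task."""
    return {task: sum(samples[code] for code in codes) for task, codes in tasks.items()}


total_subjects = sum(HAPMAP3_SAMPLES.values())
sizes = task_sizes(ANCESTRY_TASKS, HAPMAP3_SAMPLES)

print(total_subjects)  # 1397, matching the total stated in the excerpt
for task, n in sizes.items():
    print(f"{task}: {n} subjects across {len(ANCESTRY_TASKS[task])} populations")
```

The eleven counts sum to the 1397 subjects stated in the excerpt. Note that some populations (for example CEU and CHB) appear in more than one task, so the task sizes do not add up to the total.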

Citations

... Additional potential applications of ML methods in the forensic DNA field beyond the focus of this review include forensic intelligence and inference, such as prediction of the body fluid origin [114,115], visual appearance traits [116], biogeographical ancestry [117][118][119], biological age [120][121][122], and post-mortem interval estimation [123,124]. Most of these applications require analysis of complex MPS data, which can be streamlined using ML solutions, as discussed above. ...
Article
Full-text available
Machine learning (ML) is a range of powerful computational algorithms capable of generating predictive models via intelligent autonomous analysis of relatively large and often unstructured data. ML has become an integral part of our daily lives with a plethora of applications, including web, business, automotive industry, clinical diagnostics, scientific research, and more recently, forensic science. In the field of forensic DNA, the manual analysis of complex data can be challenging, time-consuming, and error-prone. The integration of novel ML-based methods may aid in streamlining this process while maintaining the high accuracy and reproducibility required for forensic tools. Due to the relative novelty of such applications, the forensic community is largely unaware of ML capabilities and limitations. Furthermore, computer science and ML professionals are often unfamiliar with the forensic science field and its specific requirements. This manuscript offers a brief introduction to the capabilities of machine learning methods and their applications in the context of forensic DNA analysis and offers a critical review of the current literature in this rapidly developing field.
... Instead, our species tends to form relatively isolated or endogamic reproductive communities that present genetic differences between them, that internally share a genealogical or ancestral history (Peter, 2016;Vanegas et al., 2008). Population structure occurs in populations with recent admixture among subpopulations, reflected as systematic differences in allele frequency associated with the different ancestries of origin (Hajiloo et al., 2013;Santos et al., 2010). This phenomenon is the result of historic divisions between populations, including social, migratory, mating, and demographic differences (Hajiloo et al., 2013;Relethford, 2019;Sevini et al., 2013). ...
Article
Objectives Punta Arenas is a Chilean city situated on ancestral Aönikenk territory. The city was founded by 19th‐ and 20th‐century colonists from Chile (Chiloé) and Europe (Croatia). This work uses uniparental and ancestry‐informative markers (AIMs) to explore the effects of historic migratory and admixture patterns on the current genetic composition of Punta Arenas. Methods We analyzed mitochondrial DNA (mtDNA), Y‐chromosome single‐nucleotide polymorphisms (SNPs), and 141 AIMs obtained from 129 DNA samples from male residents with regional ancestry. After characterizing uniparental lineages and ancestry proportions, multivariate analysis was used to explore relationships among the various types of data. Results Punta Arenas has an admixed population with three main genetic components: European (56.5%), northern Native (11.3%), and south‐central Native (28.6%). The Native component is preponderant in the mtDNA (83.76%), while the foreign component predominates in the Y‐chromosome (92.25%). Non‐Native mtDNA lineages are associated with European genetic ancestry, and Native mtDNA lineages originated mainly in the southern and southernmost regions of Chile. Most non‐Native Y‐chromosome SNPs originated in Spain, and secondly, in Croatia. Conclusions The population of Punta Arenas is mainly of Chilote origin with south‐central Native and Spanish ancestral components, as well as some Croatian components. The persistence of local Native lineages is notable, suggesting continuity with the ancestral populations of the region such as the Kawésqar, Aönikenk, Yámana, or Selknam peoples. This study contributes to our knowledge of local history and its links to national and global developments in genetic ancestry.
... Ancestry estimation, or accounting for population stratification, affects the quality of GWAS because it allows correction of bias caused by population variation among the samples 11,12. Several statistical and machine learning approaches have reached the implementation phase 13,14,15,16,17,18,19. Separately, within our research group, we approached the ancestry study mainly by using the K-means algorithm 20. ...
... In addition to STRUCTURE and ADMIXTURE, SMARTPCA 16 and EIGENSTRAT 17 both adopted a principal component analysis (PCA)-based technique to analyze population stratification. A relatively recent and relevant study was that by Hajiloo et al., who developed ETHNOPRED by applying decision trees to the HapMap dataset to determine the continental and subcontinental ancestry of an individual 18. Aiming to investigate genomic ancestry in the Qatari population, Omberg et al. applied a machine-learning-based method referred to as "SupportMix" 19. ...
Article
Full-text available
Genomics study, as opposed to socio-anthropology, has been demonstrated as an excellent tool to picture biological relatedness and disease risk factors. To analyze the data obtained from such studies, the genome-wide association study (GWAS) has been the mainstay and most popular approach for more than a decade. Selecting confounding variables, such as ancestry estimates or population stratification, is essential to maintaining the quality of GWAS. Researchers have developed various methods for extracting population stratification information from high-dimensional genomics data, especially single nucleotide polymorphism (SNP) data. In the present study, we propose an implementation of a Principal Component Analysis (PCA)-complemented Gaussian Mixture Model (GMM) as an unsupervised model to estimate population stratification from samples. The results derived from this approach were compared with those from K-means and from the commonly used ancestry estimation software fastSTRUCTURE. We found that our improved approach outperformed the latter two, as shown by the average cluster and population scores. Furthermore, it was able to generate the probability distribution of each sample across all populations, despite its limited quality. These intriguing results are worth further investigation with much more comprehensive population coverage and more advanced algorithms.
... However, they used unsupervised learning (clustering) methods, such as STRUCTURE [35], to determine which populations cluster together and thus observed the ability of a SNP panel to infer ancestry. Important progress has been made in the use of genomic information for ancestry detection [36][37][38][39], however, significant challenges still remain. Although a panel with a small number of SNPs can produce sufficiently accurate continental-level ancestry classification, reliable sub-continental population detection using only limited number of marker SNPs is still a major challenge. ...
Article
Full-text available
Background: While inferring continental-level ancestry from genomic information is relatively simple, distinguishing between individuals from closely associated sub-populations (e.g., from the same continent) is still a difficult challenge. Methods: We study the problem of predicting human biogeographical ancestry from genomic data under resource constraints. In particular, we focus on the case where the analysis is constrained to using single nucleotide polymorphisms (SNPs) from just one chromosome. We propose methods to construct such ancestry informative SNP panels using correlation-based and outlier-based methods. Results: We assessed the performance of the proposed SNP panels derived from just one chromosome, using data from the 1000 Genomes Project, Phase 3. For continental-level ancestry classification, we achieved an overall classification rate of 96.75% using 206 SNPs. For sub-population level ancestry prediction, we achieved average pairwise binary classification rates as follows: subpopulations in Europe: 76.6% (58 SNPs); Africa: 87.02% (87 SNPs); East Asia: 73.30% (68 SNPs); South Asia: 81.14% (75 SNPs); America: 85.85% (68 SNPs). Conclusion: Our results demonstrate that one single chromosome (in particular, Chromosome 1), if carefully analyzed, could hold enough information for accurate prediction of human biogeographical ancestry. This has significant implications in terms of the computational resources required for analysis of ancestry, and in the applications of such analyses, such as in studies of genetic diseases, forensics, and soft biometrics.
... Standardized queries, such as those implemented by the Global Alliance for Genomics and Health (GA4GH) Beacon Project (GA4GH, 015b) and GA4GH Matchmaker Exchange (MatchmakerExchange, 2015), are gaining importance as data volumes increase and more organizations are using these data in clinical settings. Machine learning techniques on genome sequences are used to predict phenotype (Gonzalez-Recio and Forni, 2011; Ornella et al., 2014; Yoon et al., 2012), generate disease prognoses (Abraham et al., 2012; Kourou et al., 2014), strengthen genome-wide association studies (Botta et al., 2014; Mittag et al., 2012; Pirooznia et al., 2012; Roshan et al., 2011), and elucidate ancestry (Hajiloo et al., 2013). Finally, we feel this whole genome representation must be consistent: two researchers or clinicians, given the same called genome, should be able to generate the same representation of that genome. ...
Article
Full-text available
The scientific and medical community is reaching an era of inexpensive whole genome sequencing, opening the possibility of precision medicine for millions of individuals. Here we present tiling: a flexible representation of whole genome sequences that supports simple and consistent names, annotation, queries, machine learning, and clinical screening. We partitioned the genome into 10,655,006 tiles: overlapping, variable-length sequences that begin and end with unique 24-base tags. We tiled and annotated 680 public whole genome sequences from the 1000 Genomes Project Consortium (1KG) and Harvard Personal Genome Project (PGP) using ClinVar database information. These genomes cover 14.13 billion tile sequences (4.087 trillion high quality bases and 0.4321 trillion low quality bases) and 251 phenotypes spanning ICD-9 code ranges 140-289, 320-629, and 680-759. We used these data to build a Global Alliance for Genomics and Health Beacon and graph database. We performed principal component analysis (PCA) on the 680 public whole genomes, and by projecting the tiled genomes onto their first two principal components, we replicated the 1KG principal component separation by population ethnicity codes. Interestingly, we found the PGP self-reported ethnicities cluster consistently with 1KG ethnicity codes. We built a set of support-vector ABO blood-type classifiers using 75 PGP participants who had both a whole genome sequence and a self-reported blood type. Our classifier predicts A antigen presence to within 1% of the current state-of-the-art for in silico A antigen prediction. Finally, we found six PGP participants with previously undiscovered pathogenic BRCA variants, and using our tiling, gave them simple, consistent names, which can be easily and independently re-derived.
Given the near-future requirements of genomics research and precision medicine, we propose the adoption of tiling and invite all interested individuals and groups to view, rerun, copy, and modify these analyses at https://curover.se/su92l-j7d0g-swtofxa2rct8495
... Lasso has been extended to account for population structures through linear mixed models [105], which are gaining much popularity in association studies [106]. Machine learning methods also enable the detection of population substructures, for instance, by learning ensembles of decision trees that are capable of accurately predicting an individual's subcontinental ancestry [107]. Linkage disequilibrium (LD) tends to lead to the selection of highly correlated genetic features when using unpenalized modeling approaches [24]. ...
Article
Full-text available
Overview Compared to univariate analysis of genome-wide association (GWA) studies, machine learning–based models have been shown to provide improved means of learning such multilocus panels of genetic variants and their interactions that are most predictive of complex phenotypic traits. Many applications of predictive modeling rely on effective variable selection, often implemented through model regularization, which penalizes the model complexity and enables predictions in individuals outside of the training dataset. However, the different regularization approaches may also lead to considerable differences, especially in the number of genetic variants needed for maximal predictive accuracy, as illustrated here in examples from both disease classification and quantitative trait prediction. We also highlight the potential pitfalls of the regularized machine learning models, related to issues such as model overfitting to the training data, which may lead to over-optimistic prediction results, as well as identifiability of the predictive variants, which is important in many medical applications. While genetic risk prediction for human diseases is used as a motivating use case, we argue that these models are also widely applicable in nonhuman applications, such as animal and plant breeding, where accurate genotype-to-phenotype modeling is needed. Finally, we discuss some key future advances, open questions and challenges in this developing field, when moving toward low-frequency variants and cross-phenotype interactions.
Article
Full-text available
This study evaluates the performance of a set of machine learning techniques in predicting the prognosis of Hodgkin’s lymphoma using clinical factors and gene expression data. Analysed samples from 130 Hodgkin’s lymphoma patients included a small set of clinical variables and more than 54,000 gene features. Machine learning classifiers included three black-box algorithms (k-nearest neighbour, Artificial Neural Network, and Support Vector Machine) and two methods based on intelligible rules (Decision Tree and the innovative Logic Learning Machine method). Support Vector Machine clearly outperformed any of the other methods. Among the two rule-based algorithms, Logic Learning Machine performed better and identified a set of simple intelligible rules based on a combination of clinical variables and gene expressions. Decision Tree identified a non-coding gene (XIST) involved in the early phases of X chromosome inactivation that was overexpressed in females and in non-relapsed patients. XIST expression might be responsible for the better prognosis of female Hodgkin’s lymphoma patients.