Example 1 of axes of variation calculated from variance-covariance parameters and sample sizes.
The eigenvectors for the samples from three hypothetical populations, defined in Table 1, were calculated using Equation (27). Because the variance, within-population covariance and between-population covariance are all the same, the pattern shown here reflects purely the effect of sample sizes. Populations with small sample sizes are far away from the center, whereas populations with large sample sizes are around the center, as predicted by Equation (30).
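The sample-size effect described in the caption can be reproduced numerically. The following is a minimal sketch, not the paper's Equation (27): it assumes an individual-by-individual covariance matrix with one shared variance, one within-population covariance and one between-population covariance, and it assumes the matrix is double-centered before the eigendecomposition. The sample sizes (10, 30, 60), the parameter values and the function name `population_centers` are hypothetical and are not taken from Table 1.

```python
import numpy as np

def population_centers(sizes, var=1.0, cov_within=0.5, cov_between=0.2):
    """Coordinates of each population on the top K-1 eigenvectors.

    All parameter values are illustrative assumptions, not Table 1 values.
    """
    sizes = np.asarray(sizes)
    k = len(sizes)
    # Population label of every sampled individual, e.g. [0,0,...,1,1,...,2,...]
    labels = np.repeat(np.arange(k), sizes)
    n = labels.size

    # Covariance between individuals: same parameters for every population
    same_pop = labels[:, None] == labels[None, :]
    cov = np.where(same_pop, cov_within, cov_between)
    np.fill_diagonal(cov, var)

    # Double-centering (an assumption of this sketch) puts the overall
    # mean of every eigenvector at the origin
    h = np.eye(n) - np.ones((n, n)) / n
    centered = h @ cov @ h

    # eigh returns eigenvalues in ascending order; keep the K-1 largest
    _, eigvecs = np.linalg.eigh(centered)
    top = eigvecs[:, ::-1][:, : k - 1]

    # Individuals of one population share the same coordinates,
    # so the per-population mean is that shared point
    return np.array([top[labels == p].mean(axis=0) for p in range(k)])

sizes = (10, 30, 60)  # hypothetical sample sizes
centers = population_centers(sizes)
distances = np.linalg.norm(centers, axis=1)
for size, dist in zip(sizes, distances):
    print(f"n = {size:3d}  distance from center = {dist:.3f}")
```

With these hypothetical inputs the distances come out at roughly 0.300, 0.153 and 0.082, so the smallest sample sits farthest from the origin and the largest sits closest. Under the sketch's assumptions (unit-length eigenvectors, within-population covariance larger than between-population covariance) the distance for a population of size n_k out of N individuals equals sqrt(1/n_k - 1/N), which depends only on the sample sizes.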

Source publication
Article
The Eigenstrat method, based on principal components analysis (PCA), is commonly used both to quantify population relationships in population genetics and to correct for population stratification in genome-wide association studies. However, it can be difficult to make appropriate inference about population relationships from the principal component...

Similar publications

Article
Azuki bean [Vigna angularis (Willd.) Ohwi and Ohashi] is an important crop in East Asia. However, little is known about the wild and weedy relatives and their relationship with the cultigen. This study was conducted to obtain information on the population genetic diversity of the azuki bean complex germplasm and relate this information to breeding,...
Article
Model-based (likelihood and Bayesian) and non-model-based (PCA and K-means clustering) methods were developed to identify populations and assign individuals to the identified populations using marker genotype data. Model-based methods are favoured because they are based on a probabilistic model of population genetics with biologically meaningful p...
Article
Deciphering genetic structure and inferring migration routes of insects with high migratory ability have been challenging, due to weak genetic differentiation and limited resolution offered by traditional genotyping methods. Here, we tested the ability of double digest restriction-site associated DNA sequencing (ddRADseq)-based single nucleotide po...
Preprint
Population structure inference with genetic data has been motivated by a variety of applications in population genetics and genetic association studies. Several approaches have been proposed for the identification of genetic ancestry differences in samples where study participants are assumed to be unrelated, including principal components analysis...
Article
In this study, we used microsatellite markers to examine the genetic structures of Centropomus undecimalis (Bloch, 1792) and Epinephelus marginatus (Lowe, 1834) populations collected from artisanal fishing sites along a stretch of coastline in southeastern Brazil. Based on F-statistics, there was no significant genetic differentiation evident in an...

Citations

... PCA-based methods have been developed for ancestry inference (Wang et al., 2015; Zhang et al., 2020) and the connection between eigenvectors and ancestral proportions was established (Ma and Amos, 2010). We will first consider the PCA with population-specific SNPs and show that the principal scores are unbiased estimates of the ancestral proportions with maximum variances. ...
... The first component of $\lambda_k$ depends on the MAFs of SNPs specific to population k, and the second consists of the intrapopulation variance of the SNPs. When plotting the PC associated with $\lambda_k$, that is, the k-th eigenvector scaled by $\sqrt{M\lambda_k}$, the individuals from population k share the same coordinates, which represent the center of population k in this dimension (Ma and Amos, 2010). In real data analysis, the coordinates of the individuals from population k are distributed around the center and will converge to it as $M_k$ increases. ...
Article
With the advance of sequencing technology, an increasing number of populations have been sequenced to study the histories of worldwide populations, including their divergence, admixtures, migration, and effective sizes. The variants detected in sequencing studies are largely rare and mostly population specific. Population-specific variants are often recent mutations and are informative for revealing substructures and admixtures in populations; however, computational methods and tools to analyze them are still lacking. In this work, we propose using reference populations and single nucleotide polymorphisms (SNPs) specific to the reference populations. Ancestral information, the best linear unbiased estimator (BLUE) of the ancestral proportion, is proposed, which can be used to infer ancestral proportions in recently admixed target populations and measure the extent to which reference populations serve as good proxies for the admixing sources. Based on the same panel of SNPs, the ancestral information is comparable across samples from different studies and is not affected by genetic outliers, related samples, or the sample sizes of the admixed target populations. In addition, ancestral spectrum is useful for detecting genetic outliers or exploring co-ancestry between study samples and the reference populations. The methods are implemented in a program, Ancestral Spectrum Analyzer (ASA), and are applied in analyzing high-coverage sequencing data from the 1000 Genomes Project and the Human Genome Diversity Project (HGDP). In the analyses of American populations from the 1000 Genomes Project, we demonstrate that recent admixtures can be dissected from ancient admixtures by comparing ancestral spectra with and without indigenous Americans being included in the reference populations.
... Our theoretical framework assumes that the observed genotypes correspond to the sampling of K discrete populations. Decomposing the genotype matrix as a sum of between and within-population matrices, we extend the results obtained in [10,16,19,25]. Our main result states that the mean value of FST over loci is equal to the squared Hilbert-Schmidt norm of the between-population matrix, which can be computed by a spectral analysis. ...
Article
Wright's inbreeding coefficient, FST, is a fundamental measure in population genetics. Assuming a predefined population subdivision, this statistic is classically used to evaluate population structure at a given genomic locus. With large numbers of loci, unsupervised approaches such as principal component analysis (PCA) have, however, become prominent in recent analyses of population structure. In this study, we describe the relationships between Wright's inbreeding coefficients and PCA for a model of K discrete populations. Our theory provides an equivalent definition of FST based on the decomposition of the genotype matrix into between and within-population matrices. The average value of Wright's FST over all loci included in the genotype matrix can be obtained from the PCA of the between-population matrix. Assuming that a separation condition is fulfilled and for reasonably large data sets, this value of FST approximates the proportion of genetic variation explained by the first (K - 1) principal components accurately. The new definition of FST is useful for computing inbreeding coefficients from surrogate genotypes, for example, obtained after correction of experimental artifacts or after removing adaptive genetic variation associated with environmental variables. The relationships between inbreeding coefficients and the spectrum of the genotype matrix not only allow interpretations of PCA results in terms of population genetic concepts but extend those concepts to population genetic analyses accounting for temporal, geographical and environmental contexts.
... The EGRM Z, the mathematical expectation of GRM Z, depends only on the population sizes $N_1, N_2, \cdots, N_K$ and allele frequencies of the M SNPs in K populations $[f_{k1}, f_{k2}, \cdots, f_{kM}]$, $k = 1, 2, \cdots, K$. Here, we treat the SNP number M and the allele frequencies as fixed numbers. A theoretical formulation of the PCA that considers genotypic values as a random vector and allele frequencies in different populations being random was proposed in Ma and Amos, 2010 [23]. Based on different assumptions, a genotypic variance-covariance matrix of the same structure was attained; nevertheless, elements of the EGRM are different from those of the variance-covariance matrix in [23]. ...
... $x_k^T 1_N / N$ represents the center of the total population in the k-th dimension, where $1_N$ is a column vector of dimension N with each element equal to 1. Due to the structure of Z, individuals from the same population share the same coordinates in the K-dimensional space, and the common points denote the representative points of the populations, or centers of the populations [23]. We define ...
Article
Background: Population stratification is a known confounder of genome-wide association studies, as it can lead to false positive results. The principal component analysis (PCA) method is widely applied in the analysis of population structure with common variants. However, its performance when rare variants are used remains unclear. Results: We derive a mathematical expectation of the genetic relationship matrix. Variance and covariance elements of the expected matrix depend explicitly on allele frequencies of the genetic markers used in the PCA analysis. We show that inter-population variance is solely contained in K principal components (PCs) and mostly in the largest K-1 PCs, where K is the number of populations in the samples. We propose $F_{PC}$, the ratio of the inter-population variance to the intra-population variance in the K population informative PCs, and $d^2$, the sum of squared distances among populations, as measures of population divergence. We show analytically that when allele frequencies become small, the ratio $F_{PC}$ abates, the population distance $d^2$ decreases, and the portion of variance explained by the K PCs diminishes. The results are validated in the analysis of the 1000 Genomes Project data. The ratio $F_{PC}$ is 93.85, the population distance $d^2$ is 444.38, and the variance explained by the largest five PCs is 17.09% when using common variants with allele frequencies between 0.4 and 0.5. However, the ratio, distance and percentage decrease to 1.83, 17.83 and 0.74%, respectively, with rare variants of frequencies between 0.0001 and 0.01. Conclusions: The PCA of population stratification performs worse with rare variants than with common ones. It is necessary to restrict the selection to only the common variants when analyzing population stratification with sequencing data.
... Our theoretical framework assumes that the observed genotypes correspond to the sampling of K discrete populations. Decomposing the genotype matrix as a sum of between and within-population matrices, we extend the results obtained in [10,16,19,25]. Our main result states that the mean value of FST over loci is equal to the squared norm of the between-population matrix. ...
Preprint
Wright’s inbreeding coefficient, FST, is a fundamental measure in population genetics. Assuming a predefined population subdivision, this statistic is classically used to evaluate population structure at a given genomic locus. With large numbers of loci, unsupervised approaches such as principal component analysis (PCA) have, however, become prominent in recent analyses of population structure. In this study, we describe the relationships between Wright’s inbreeding coefficients and PCA for a model of K discrete populations. Our theory provides an equivalent definition of FST based on the decomposition of the genotype matrix into between and within-population matrices. The average value of Wright’s FST over all loci included in the genotype matrix can be obtained from the PCA of the between-population matrix. Assuming that a separation condition is fulfilled and for reasonably large data sets, this value of FST approximates the proportion of genetic variation explained by the first (K - 1) principal components accurately. The new definition of FST is useful for computing inbreeding coefficients from surrogate genotypes, for example, obtained after correction of experimental artifacts or after removing adaptive genetic variation associated with environmental variables. The relationships between inbreeding coefficients and the spectrum of the genotype matrix not only allow interpretations of PCA results in terms of population genetic concepts but extend those concepts to population genetic analyses accounting for temporal, geographical and environmental contexts.
Author’s summary: Principal component analysis (PCA) is the most-frequently used approach to describe population genetic structure from large population genomic data sets. In this study, we show that PCA not only estimates ancestries of sampled individuals, but also computes the average value of Wright’s inbreeding coefficient over the loci included in the genotype matrix. Our result shows that inbreeding coefficients and PCA eigenvalues provide equivalent descriptions of population structure. As a consequence, PCA extends the definition of this coefficient beyond the framework of allelic frequencies. We give examples on how FST can be computed from ancient DNA samples for which genotypes are corrected for coverage, and in an ecological genomic example where a proportion of genetic variation is explained by environmental variables.
... This is the process in which the "longevity" or "vulnerability" alleles or genotypes can modify genetic structure in the populations of the old and oldest-old compared to the younger groups of individuals. This property indicates that controlling for possible population stratification, for example, due to the differences in ancestry (Ma and Amos 2010; Price et al. 2006; Yang et al. 2011), has to be done with care because it could substantially reduce association of genetic variants with aging and longevity traits. ...
Article
Background and objective: To clarify mechanisms of genetic regulation of human aging and longevity traits, a number of genome-wide association studies (GWAS) of these traits have been performed. However, the results of these analyses did not meet expectations of the researchers. Most detected genetic associations have not reached a genome-wide level of statistical significance, and suffered from the lack of replication in the studies of independent populations. The reasons for slow progress in this research area include low efficiency of statistical methods used in data analyses, genetic heterogeneity of aging and longevity related traits, possibility of pleiotropic (e.g., age dependent) effects of genetic variants on such traits, underestimation of the effects of (i) mortality selection in genetically heterogeneous cohorts, (ii) external factors and differences in genetic backgrounds of individuals in the populations under study, the weakness of conceptual biological framework that does not fully account for above mentioned factors. One more limitation of conducted studies is that they did not fully realize the potential of longitudinal data that allow for evaluating how genetic influences on life span are mediated by physiological variables and other biomarkers during the life course. The objective of this paper is to address these issues. Data and methods: We performed GWAS of human life span using different subsets of data from the original Framingham Heart Study cohort corresponding to different quality control (QC) procedures and used one subset of selected genetic variants for further analyses. We used simulation study to show that approach to combining data improves the quality of GWAS. We used FHS longitudinal data to compare average age trajectories of physiological variables in carriers and non-carriers of selected genetic variants. 
We used a stochastic process model of human mortality and aging to investigate genetic influence on hidden biomarkers of aging and on dynamic interaction between aging and longevity. We investigated properties of genes related to selected variants and their roles in signaling and metabolic pathways. Results: We showed that the use of different QC procedures results in different sets of genetic variants associated with life span. We selected 24 genetic variants negatively associated with life span. We showed that the joint analyses of genetic data at the time of bio-specimen collection and follow up data substantially improved significance of associations of selected 24 SNPs with life span. We also showed that aging related changes in physiological variables and in hidden biomarkers of aging differ for the groups of carriers and non-carriers of selected variants. Conclusions: The results of these analyses demonstrated benefits of using biodemographic models and methods in genetic association studies of these traits. Our findings showed that the absence of a large number of genetic variants with deleterious effects may make substantial contribution to exceptional longevity. These effects are dynamically mediated by a number of physiological variables and hidden biomarkers of aging. The results of this research demonstrated benefits of using integrative statistical models of mortality risks in genetic studies of human aging and longevity.
... He showed that the projection of samples onto the principal components could be obtained from the pairwise coalescence times between study individuals. Ma and Amos (2010) proposed a formulation of PCA based on the variance-covariance matrix of the sample allele frequencies. ...
Article
Principal component analysis (PCA) is widely used in genome-wide association studies (GWAS), and the principal component axes often represent perpendicular gradients in geographic space. The explanation of PCA results is of major interest for geneticists to understand fundamental demographic parameters. Here, we provide an interpretation of PCA based on relatedness measures, which are described by the probability that sets of genes are identical-by-descent (IBD). An approximately linear transformation between ancestral proportions (AP) of individuals with multiple ancestries and their projections onto the principal components is found. In addition, a new method of eigenanalysis "EIGMIX" is proposed to estimate individual ancestries. EIGMIX is a method of moments with computational efficiency suitable for millions of SNP data, and it is not subject to the assumption of linkage equilibrium. With the assumptions of multiple ancestries and their surrogate ancestral samples, EIGMIX is able to infer ancestral proportions (APs) of individuals. The methods were applied to the SNP data from the HapMap Phase 3 project and the Human Genome Diversity Panel. The APs of individuals inferred by EIGMIX are consistent with the findings of the program ADMIXTURE. In conclusion, EIGMIX can be used to detect population structure and estimate genome-wide ancestral proportions with a relatively high accuracy.
... 1 are the centroids of the ancestral populations along the first principal axis (Bryc et al. 2010). Note that variation between individuals within a population is represented by the smaller eigenvalues and corresponding eigenvectors (Ma and Amos 2010). ...
Article
Admixture between long-separated populations is a defining feature of the genomes of many species. The mosaic block structure of admixed genomes can provide information about past contact events, including the time and extent of admixture. Here, we describe an improved wavelet-based technique that better characterizes ancestry block structure from observed genomic patterns. Principal Components Analysis is first applied to genomic data to identify the primary population structure, followed by wavelet decomposition to develop a new characterization of local ancestry information along the chromosomes. For testing purposes, this method is applied to human genome-wide genotype data from Indonesia, as well as virtual genetic data generated using genome-scale sequential coalescent simulations under a wide range of admixture scenarios. Time of admixture is inferred using an approximate Bayesian computation framework, providing robust estimates of both admixture times and their associated levels of uncertainty. Crucially, we demonstrate that this revised wavelet approach, which we have released as the R package adwave, provides improved statistical power over existing wavelet-based techniques and can be used to address a broad range of admixture questions. Copyright © 2015, The Genetics Society of America.
... A set of randomly picked SNPs located throughout the genome was used to estimate genetic relatedness between populations due to migrations or possible statistical artifacts. This methodology was first used by Cavalli-Sforza and colleagues over 30 years ago to study the evolutionary history of human populations and reconstruct patterns of migration, and it continues to be used today (Ma & Amos, 2010). If the component extracted from the IQ increasing alleles really reflects selection pressures on intelligence, it will predict IQ also after partialling out the confounding element due to population migrations. ...
Preprint
Principal components analysis on allele frequencies for 14 and 50 populations (from 1K Genomes and ALFRED databases) produced a factor accounting for over half of the variance, which indicates selection pressure on intelligence or genotypic IQ. Very high correlations between this factor and phenotypic IQ, educational achievement were observed (r>0.9 and r>0.8), also after partialling out GDP and the Human Development Index. Regression analysis was used to estimate a genotypic (predicted) IQ also for populations with missing data for phenotypic IQ. Socio-economic indicators (GDP and Human Development Index) failed to predict residuals, not providing evidence for the effects of environmental factors on intelligence. Another analysis revealed that the relationship between IQ and the genotypic factor was not mediated by race, implying that it exists at a finer resolution, a finding which in turn suggests selective pressures postdating sub-continental population splits. Genotypic height and IQ were inversely correlated but this correlation was mostly mediated by race. In at least two cases (Native Americans vs East Asians and Africans vs Papuans) genetic distance inferred from evolutionarily neutral genetic markers contrasts markedly with the resemblance observed for IQ and height increasing alleles. A principal component analysis on a random sample of 20 SNPs revealed two factors representing genetic relatedness due to migrations. However, the correlation between IQ and the intelligence PC was not mediated by them. In fact, the intelligence PC emerged as an even stronger predictor of IQ after entering the “migratory” PCs in a regression, indicating that it represents selection pressure instead of migrational effects. Finally, some observations on the high IQ of Mongoloid people are made which lend support to the “cold winters theory” on the evolution of intelligence.
... When association studies knowingly or unknowingly involve sampling individuals with a phenotype from one population and individuals without that phenotype from a different population, they are often plagued by false positive associations between the phenotype and any genetic variant that has frequency differences between those populations [11]. This 'stratification' issue can be overcome to some degree by assessing the ancestries or genetic backgrounds of the individuals in the study, although this strategy is not necessarily trivial or likely to work in all settings [11][12][13][14]. In addition, there are many phenotypes that have emerged in different populations due to different genetic variants or sets of causative variants arising in those populations (i.e. the genetic causes of these phenotypes manifest locus or allelic heterogeneity) [8]. ...
... In essence, this assumes that a PC reflects the variation in the individuals' genetic profiles that is attributable mainly to (the degree of) the individuals' origins with respect to two ancestral populations. Where each individual sits along (the PC) coordinates reflects the contribution of the two ancestral populations to the genotype data, and this can be used to estimate the degree to which each individual's unique genetic profile reflects ancestry shared with those populations [13]. In actual datasets containing sufficiently dense genotype data collected on a relatively large number (e.g. ...
... In actual datasets containing sufficiently dense genotype data collected on a relatively large number (e.g. hundreds) of unrelated individuals of varying ancestry, this approach is valid for at least a few PCs that explain the most variance [13]. However, the primary PCA component may actually reflect phenomena, such as genealogical or familial relatedness, linkage disequilibrium or stratification by a genotyping method, and this has to be taken into account as a result [11]. ...
Article
All human populations exhibit some level of genetic differentiation. This differentiation, or population stratification, has many interacting sources, including historical migrations, population isolation over time, genetic drift, and selection and adaptation. If differentiated populations remained isolated from each other over a long period of time such that there is no mating of individuals between those populations, then some level of global consanguinity within those populations will lead to the formation of gene pools that will become more and more distinct over time. Global genetic differentiation of this sort can lead to overt phenotypic differences between populations if phenotypically relevant variants either arise uniquely within those populations or begin to exhibit frequency differences across the populations. This can occur at the single variant level for monogenic phenotypes or at the level of aggregate variant frequency differences across the many loci that contribute to a phenotype with a multifactorial or polygenic basis. However, if individuals begin to interbreed (or 'admix') from populations with different frequencies of phenotypically relevant genetic variants, then these admixed individuals will exhibit the phenotype to varying degrees. The level of phenotypic expression will depend on the degree to which the admixed individuals have inherited causative variants that have descended from the ancestral population in which those variants were present (or, more likely, simply more frequent). We review studies that consider the association between the degree of admixture (or ancestry) and phenotypes of clinical relevance. We find a great deal of literature-based evidence for associations between the degree of admixture and phenotypic variation for a number of admixed populations and phenotypes, although not all this evidence is confirmatory. 
We also consider the implications of such associations for gene-mapping initiatives as well as general clinical epidemiology studies and medical practice. We end with some thoughts on the future of studies exploring phenotypic differences among admixed individuals as well as individuals with different ancestral backgrounds. © 2014 S. Karger AG, Basel.
... Detecting and genotyping inversions using PCA. We first briefly review the PCA-based approach to detecting and genotyping inversions. In a previous publication (Ma and Amos 2010), we developed a theoretical formulation of PCA, in which individuals are treated as features and markers (SNPs) as "realizations" of a random vector. In the space spanned by the first few eigenvectors, individuals sampled from different populations are distributed into different clusters. ...
Article
Although inversions have occasionally been found to be associated with disease susceptibility through interrupting a gene or its regulatory region, or by increasing the risk for deleterious secondary rearrangements, no association study has been specifically conducted for risks associated with inversions, mainly because existing approaches to detecting and genotyping inversions do not readily scale to a large number of samples. Based on our recently proposed approach to identifying and genotyping inversions using principal components analysis (PCA), we herein develop a method of detecting association between inversions and disease in a genome-wide fashion. Our method uses genotype data for single nucleotide polymorphisms (SNPs), and is thus cost-efficient and computationally fast. For an inversion polymorphism, local PCA around the inversion region is performed to infer the inversion genotypes of all samples. For many inversions, we found that some of the SNPs inside an inversion region are fixed in the two lineages of different orientations and thus can serve as surrogate markers. Our method can be applied to case-control and quantitative trait association studies to identify inversions that may interrupt a gene or the connection between a gene and its regulatory agents. Our method also offers a new venue to identify inversions that are responsible for disease-causing secondary rearrangements. We illustrated our proposed approach to case-control data for psoriasis and identified novel associations with a few inversion polymorphisms.