Schematic Flow Chart Illustrating the Procedure Employed for Computing Type I Error and Power by Way of Data Simulation
(A) shows type I error, and (B) shows power by way of data simulation.


Source publication
Article
Full-text available
Because current molecular haplotyping methods are expensive and not amenable to automation, many researchers rely on statistical methods to infer haplotype pairs from multilocus genotypes, and subsequently treat these inferred haplotype pairs as observations. These procedures are prone to haplotype misclassification. We examine the effect of these...

Context in source publication

Context 1
... investigate the behavior of these test statistics in a variety of situations, we applied these statistical tests to many simulated datasets. Figure 1 illustrates the procedure we used to simulate the data and to evaluate the false-positive rate (type I error) and power at fixed significance levels for each statistic. For each simulated replicate dataset, the multilocus genotype data from cases and controls were pooled to infer haplotype pairs for each individual. ...
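The simulation loop described in the excerpt above can be illustrated in simplified form: simulate many replicate case-control datasets, apply a test to each, and report the rejection fraction at a fixed significance level. This is a minimal sketch using a single-SNP comparison and a two-proportion z-test on allele counts; the allele frequencies, sample sizes, and choice of test here are illustrative assumptions, not the haplotype-based statistics evaluated in the source article.

```python
import math
import random

random.seed(0)

def two_prop_z_pvalue(x1, n1, x2, n2):
    """Two-sided p-value of a two-proportion z-test on allele counts."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (x1 / n1 - x2 / n2) / se
    # 2 * (1 - Phi(|z|)) via the complementary error function
    return math.erfc(abs(z) / math.sqrt(2))

def rejection_rate(p_case, p_control, n_cases=200, n_controls=200,
                   n_reps=1000, alpha=0.05):
    """Fraction of simulated replicates in which H0 is rejected at level alpha.
    Equal allele frequencies estimate the type I error (as in panel A);
    unequal frequencies estimate the power (as in panel B)."""
    hits = 0
    for _ in range(n_reps):
        # Two alleles per diploid individual, drawn independently (illustrative).
        case_alt = sum(random.random() < p_case for _ in range(2 * n_cases))
        ctrl_alt = sum(random.random() < p_control for _ in range(2 * n_controls))
        if two_prop_z_pvalue(case_alt, 2 * n_cases,
                             ctrl_alt, 2 * n_controls) < alpha:
            hits += 1
    return hits / n_reps

type1 = rejection_rate(0.3, 0.3)    # null scenario: estimate of type I error
power = rejection_rate(0.4, 0.25)   # alternative scenario: estimate of power
```

In the article's setting, the per-replicate test would operate on inferred haplotype pairs rather than single-SNP allele counts, but the outer loop, the fixed significance level, and the null-versus-alternative contrast are the same.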

Similar publications

Article
Full-text available
Linkage disequilibrium (LD) plays a fundamental role in population genetics and in the current surge of studies to screen for subtle genetic variants affecting complex traits. Methods widely implemented in LD analyses require samples to be randomly collected, which, however, are usually ignored and thus raise the general question to the LD communit...
Article
Full-text available
A commonly used tool in disease association studies is the search for discrepancies between the haplotype distribution in the case and control populations. In order to find this discrepancy, the haplotypes frequency in each of the populations is estimated from the genotypes. We present a new method HAPLOFREQ to estimate haplotype frequencies over a...
Article
Full-text available
The development of online software designed for genetic studies has been exponentially growing, providing numerous benefits to the scientific community. However, they should be used with care, since some require adjustments. The efficiency of two programs for haplogroup prediction was tested with 119 samples of known haplotypes and haplogroups from...

Citations

... Alleles at multiple loci along a single chromosome are referred to as a haplotype. Haplotype information is essential for explaining the relationships between genotypes and phenotypes [1][2][3], mapping disease genes comprehensively [4] and describing genetic ancestry completely [5]. Although the diploid nature of the genome has been observed for over 50 years [6][7][8], phasing a diploid is still a laborious task. To date, karyotyping is the gold standard in clinical laboratories. ...
Article
Full-text available
Although the diploid nature of the genome has been observed for over 50 years, phasing the diploid is still a laborious task. The speed and throughput of next generation sequencing have largely increased in the past decades. However, the short read-length remains one of the biggest challenges of haplotype analysis. For instance, reads as short as 150 bp span no more than one variant in most cases. Numerous experimental technologies have been developed to overcome this challenge. Distance, complexity and accuracy of the linkages obtained are the main factors to evaluate the efficiency of whole genome haplotyping methods. Here, we review these experimental technologies, evaluating their efficiency in obtaining linkages and their system complexity. The technologies are organized into four categories based on their strategy: (i) chromosome separation, (ii) dilution pools, (iii) crosslinking and proximity ligation, (iv) long-read technologies. Within each category, several subsections are listed to classify each technology. Innovative experimental strategies are expected to have high-quality performance, low cost and be labor-saving, which will be largely desired in the future.
... Different methods have been proposed to increase power in the presence of non-differential genotype error. [14][15][16][17][18] However, genotype measurement errors may vary systematically between cases and controls when the two groups are subject to different experimental conditions. 19 Genotype error may depend on factors such as, among others, the quality of blood samples, the working condition of the genotyping instrument and the expertise of laboratory researchers. ...
Article
Differential genotype error in case-control association studies occurs when cases and controls are genotyped under different conditions. Existence of differential errors can considerably bias the association test, resulting in inflation of type I error and spurious significance. With the availability of high-throughput genotyping technologies such as the SNPchip, null markers that are unlinked with the disease can be used to correct for the bias caused by differential errors. A similar method, known as the genomic control, had been used to correct for population stratification in association studies. In this paper, we show that the same idea can be used to correct for the bias caused by differential errors, under the assumption that the null markers and the candidate marker are subject to the same or similar genotyping error model. The variance inflation is shown to be minor and the bias in the association test is the major source of type I error inflation in the presence of differential errors. Our method centralizes the test statistic by deducting the bias estimated from null markers through a quadratic regression method, which adjusts for the variability of null marker allele frequencies. Simulation results show that the proposed method performs very well in correcting for the type I error inflations. Journal of Human Genetics advance online publication, 18 July 2013; doi:10.1038/jhg.2013.74.
... In particular cases, these methods are known to have high accuracy (Adkins 2004; Avery et al. 2005). However, the process of phasing fundamentally represents a missing data problem, and can create complex statistical artefacts that are difficult to detect and can heavily influence downstream analyses (Levenstien et al. 2006; Lin & Huang 2007; Browning & Browning 2009). Additionally, these methods can be statistically inconsistent, potentially converging on a set of haplotypes and haplotype frequencies that are incorrect, even as more data are added (Andrés et al. 2007; Uddin et al. 2008). ...
Article
Full-text available
While standard DNA-sequencing approaches readily yield genotypic sequence data, haplotype information is often of greater utility for population genetic analyses. However, obtaining individual haplotype sequences can be costly and time-consuming and sometimes requires statistical reconstruction approaches that are subject to bias and error. Advancements have recently been made in determining individual chromosomal sequences in large-scale genomic studies, yet few options exist for obtaining this information from large numbers of highly polymorphic individuals in a cost-effective manner. As a solution, we developed a simple PCR-based method for obtaining sequence information from individual DNA strands using standard laboratory equipment. The method employs a water-in-oil emulsion to separate the PCR mixture into thousands of individual microreactors. PCR within these small vesicles results in amplification from only a single starting DNA template molecule and thus a single haplotype. We improved upon previous approaches by including SYBR Green I and a melted agarose solution in the PCR, allowing easy identification and separation of individually amplified DNA molecules. We demonstrate the use of this method on a highly polymorphic estuarine population of the copepod Eurytemora affinis for which current molecular and computational methods for haplotype determination have been inadequate.
... To increase heterozygosity, genetic information, and statistical power, haplotypes are usually reconstructed from observed genotypes to identify the individuals carrying the risk haplotypes. Haplotypes are also valuable for exploring cis-effects of specific combinations of intragenic polymorphisms, such as those variations located in gene promoters, or polymorphisms in closely linked genes where there may be interaction that affects gene expression (Levenstien et al., 2006;Fan et al., 2011). ...
Article
Full-text available
When susceptibility to diseases is caused by cis-effects of multiple alleles at adjacent polymorphic sites, it may be difficult to assess with confidence the genetic phase and identify individuals carrying the risk haplotype. Experimental assessment of genetic phase is still challenging and most population studies use statistical approaches to infer haplotypes given the observed genotypes. While these statistical approaches are powerful and have been proven very useful in large scale genetic population studies, they may be prone to errors in studies with small sample size, especially in the presence of compound heterozygotes. Here, we describe a simple and novel approach using the popular PCR–RFLP based strategy to assess the genetic phase in compound heterozygotes. We apply this method to two extensively studied SNPs in two clustered immune-related genes: The −308 (G > A) and the +252 (A > G) SNPs of the tumor necrosis factor (TNF) alpha and the lymphotoxin alpha (LTA) genes, respectively. Using this method, we successfully determined the genetic phase of these two SNPs in known compound heterozygous individuals and in every sample tested. We show that the A allele of TNF −308 is carried on the same chromosome as the LTA +252(G) allele.
... Software-based statistical methods (for example, PHASE 27 , which is a software for haplotype reconstruction) are commonly used in haplotyping, particularly when molecular haplotyping is not available. However, previous studies have suggested that inference from these statistical methods might result in haplotype misclassification 28 , particularly when linkage disequilibrium is not strong. Indeed, the PHASE software analysis of 10q26 led to relatively low accuracy (84.4%, Supplementary Table S8). ...
Article
Full-text available
Completion of the Human Genome Project and the HapMap Project has led to increasing demands for mapping complex traits in humans to understand the aetiology of diseases. Identifying variations in the DNA sequence, which affect how we develop disease and respond to pathogens and drugs, is important for this purpose, but it is difficult to identify these variations in large sample sets. Here we show that through a combination of capillary sequencing and polymerase chain reaction assisted by gold nanoparticles, it is possible to identify several DNA variations that are associated with age-related macular degeneration and psoriasis on significant regions of human genomic DNA. Our method is accurate and promising for large-scale and high-throughput genetic analysis of susceptibility towards disease and drug resistance.
... Haplotyping is an important component of genetic analysis. It improves the power of genetic association, and is useful in inferring evolutionary scenarios, historical recombination events, and detecting cis-regulatory events [1,2]. Given the importance of the problem, a variety of computational and experimental techniques have been developed to phase chromosomes, and we discuss a few here to put our work in context. ...
Article
Full-text available
Humans are diploid, carrying two copies of each chromosome, one from each parent. Separating the paternal and maternal chromosomes is an important component of genetic analyses such as determining genetic association, inferring evolutionary scenarios, computing recombination rates, and detecting cis-regulatory events. As the two chromosomes are mostly identical to each other, linking together the alleles at heterozygous sites is sufficient to phase, or separate, the two chromosomes. In Haplotype Assembly, the linking is done by sequenced fragments that overlap two heterozygous sites. While there has been a lot of research on correcting errors to achieve accurate haplotypes via assembly, relatively little work has been done on designing sequencing experiments to get long haplotypes. Here, we describe the different design parameters that can be adjusted with next generation and upcoming sequencing technologies, and study the impact of design choice on the length of the haplotype. We show that a number of parameters influence haplotype length, with the most significant one being the advance length (distance between two fragments of a clone). Given technologies like strobe sequencing that allow for large variations in advance lengths, we design and implement a simulated annealing algorithm to sample a large space of distributions over advance lengths. Extensive simulations on individual genomic sequences suggest that a non-trivial distribution over advance lengths results in a 1-2 order of magnitude improvement in median haplotype length. Our results suggest that haplotyping of large, biologically important genomic regions is feasible with current technologies.
... The importance of studying haplotypes ranges from elucidating the exact biological role played by neighbouring amino-acids on the protein structure, to providing information about ancient ancestral chromosome segments that harbour alleles influencing human traits [23]. Moreover, haplotype association methods are considered to be more powerful compared to single marker analyses [24,25], even though this is questioned by some researchers [26]. A major problem in haplotype analyses is that in order for the analysis to be performed we need to reconstruct or infer the haplotypes, usually with an approach based on missing data imputation [27][28][29]. ...
Article
Full-text available
Meta-analysis is a popular methodology in several fields of medical research, including genetic association studies. However, the methods used for meta-analysis of association studies that report haplotypes have not been studied in detail. In this work, methods for performing meta-analysis of haplotype association studies are summarized, compared and presented in a unified framework along with an empirical evaluation of the literature. We present multivariate methods that use summary-based data as well as methods that use binary and count data in a generalized linear mixed model framework (logistic regression, multinomial regression and Poisson regression). The methods presented here avoid the inflation of the type I error rate that could result from the traditional approach of comparing a haplotype against the remaining ones, and they can be fitted using standard software. Moreover, formal global tests are presented for assessing the statistical significance of the overall association. Although the methods presented here assume that the haplotypes are directly observed, they can be easily extended to allow for such uncertainty by weighting the haplotypes by their probability. An empirical evaluation of the published literature and a comparison against the meta-analyses that use single nucleotide polymorphisms suggests that the studies reporting meta-analysis of haplotypes contain approximately half as many included studies and produce significant results twice as often. We show that this excess of statistically significant results stems from the sub-optimal method of analysis used and, in approximately half of the cases, the statistical significance is refuted if the data are properly re-analyzed. Illustrative examples of code are given in Stata and it is anticipated that the methods developed in this work will be widely applied in the meta-analysis of haplotype association studies.
... Interestingly, the haplotype uncertainty added through the genotype error appeared to be independent from haplotype frequencies in contrast to the haplotype reconstruction error (Lamina et al., 2008). An alternative approach to assess haplotype misclassification probabilities might be molecular haplotyping (Levenstien et al., 2006). However, laboratory-assessed haplotypes are also subject to error possibly to a larger extent than SNP genotypes and cannot be considered a gold standard. ...
Article
Haplotypes are an important concept for genetic association studies, but involve uncertainty due to statistical reconstruction from single nucleotide polymorphism (SNP) genotypes and genotype error. We developed a re-sampling approach to quantify haplotype misclassification probabilities and implemented the MC-SIMEX approach to tackle this as a 3 x 3 misclassification problem. Using a previously published approach as a benchmark for comparison, we evaluated the performance of our approach by simulations and exemplified it on real data from 15 SNPs of the APM1 gene. Misclassification due to reconstruction error was small for most, but notable for some, especially rarer haplotypes. Genotype error added misclassification to all haplotypes resulting in a non-negligible drop in sensitivity. In our real data example, the bias of association estimates due to reconstruction error alone reached -48.2% for a 1% genotype error, indicating that haplotype misclassification should not be ignored if high genotype error can be expected. Our 3 x 3 misclassification view of haplotype error adds a novel perspective to currently used methods based on genotype intensities and expected number of haplotype copies. Our findings give a sense of the impact of haplotype error under realistic scenarios and underscore the importance of high-quality genotyping, in which case the bias in haplotype association estimates is negligible.
... Two such examples of technological exploitation providing increased power are double sampling, which consists of genotyping all individuals in the study and then using gene sequencing (viewed as a perfect method of genotyping) on a subset of the individuals (Gordon et al. 2004, Ji et al. 2005, Barral et al. 2006), and duplicate genotyping, which consists of re-running a subset of the samples (Tintle et al. 2007, Rice, Holmans 2003). Double sampling has been shown to be a cost-effective design when genotyping errors are present and increasingly so as the cost of genotyping declines (Ji et al. 2005, Levenstien, Ott & Gordon 2006). Including duplicate genotype data in the test of association has been shown to increase power when duplicate data has already been gathered (i.e. if you already have the duplicates, use them) (Tintle et al. 2007), but has not been evaluated for its a priori cost-effectiveness (i.e. should you collect duplicates?). ...
Article
We consider a modification to the traditional genome wide association (GWA) study design: duplicate genotyping. Duplicate genotyping (re-genotyping some of the samples) has long been suggested for quality control reasons; however, it has not been evaluated for its statistical cost-effectiveness. We demonstrate that when genotyping error rates are at least m%, duplicate genotyping provides a cost-effective (more statistical power for the same price) design alternative when relative genotype to phenotype/sample acquisition costs are no more than m%. In addition to cost and error rate, duplicate genotyping is most cost-effective for SNPs with low minor allele frequency. In general, relative genotype to phenotype/sample acquisition costs will be low when following up a limited number of SNPs in the second stage of a two-stage GWA study design, and, thus, duplicate genotyping may be useful in these situations. In cases where many SNPs are being followed up at the second stage, duplicate genotyping only low-quality SNPs with low minor allele frequency may be cost-effective. We also find that in almost all cases where duplicate genotyping is cost-effective, the most cost-effective design strategy involves duplicate genotyping all samples. Free software is provided which evaluates the cost-effectiveness of duplicate genotyping based on user inputs.
... Due to the lack of a gold standard, we can only provide an estimation of expected haplotype misclassification based on the frequencies of observed haplotypes. Levenstien et al. [31] presented a method which uses molecular haplotypes on a subset of individuals to estimate haplotype misclassification and adjust the likelihood ratio test for it in the setting of case-control studies. However, due to the absence of high-throughput procedures for molecular haplotyping, this method is too time- and money-consuming in most cases. ...
Article
Full-text available
Statistically reconstructing haplotypes from single nucleotide polymorphism (SNP) genotypes can lead to falsely classified haplotypes. This can be an issue when interpreting haplotype association results or when selecting subjects with certain haplotypes for subsequent functional studies. It was our aim to quantify haplotype reconstruction error and to provide tools for it. Through numerous simulation scenarios, we systematically investigated several error measures, including discrepancy, error rate, and R(2), and introduced the sensitivity and specificity to this context. We exemplified several measures in the KORA study, a large population-based study from Southern Germany. We find that the specificity is slightly reduced only for common haplotypes, while the sensitivity was decreased for some, but not all, rare haplotypes. The overall error rate generally increased with increasing number of loci, increasing minor allele frequency of SNPs, decreasing correlation between the alleles and increasing ambiguity. We conclude that, with the analytical approach presented here, haplotype-specific error measures can be computed to gain insight into the haplotype uncertainty. This method indicates whether a specific risk haplotype can be expected to be reconstructed with essentially no or with high misclassification, and thus the magnitude of expected bias in association estimates. We also illustrate that sensitivity and specificity separate two dimensions of the haplotype reconstruction error, which completely describe the misclassification matrix and thus provide the prerequisite for methods accounting for misclassification.