A General Framework for Estimating the Relative Pathogenicity of Human Genetic Variants

February 2014
Nature Genetics 46(3)

DOI:10.1038/ng.2892

Source
PubMed

Authors:

Martin Kircher

Universitätsklinikum Schleswig - Holstein

Preti Jain

HudsonAlpha Institute for Biotechnology

Current methods for annotating and interpreting human genetic variation tend to exploit a single information type (for example, conservation) and/or are restricted in scope (for example, to missense changes). Here we describe Combined Annotation-Dependent Depletion (CADD), a method for objectively integrating many diverse annotations into a single measure (C score) for each variant. We implement CADD as a support vector machine trained to differentiate 14.7 million high-frequency human-derived alleles from 14.7 million simulated variants. We precompute C scores for all 8.6 billion possible human single-nucleotide variants and enable scoring of short insertions-deletions. C scores correlate with allelic diversity, annotations of functionality, pathogenicity, disease severity, experimentally measured regulatory effects and complex trait associations, and they highly rank known pathogenic variants within individual genomes. The ability of CADD to prioritize functional, deleterious and pathogenic variants across many functional categories, effect sizes and genetic architectures is unmatched by any current single-annotation method.

Relationship between scaled C scores and genetic variation. (a) Mean DAF by scaled C score for variants listed by the 1000 Genomes Project¹⁴ or ESP²⁴. Dashed lines indicate mean DAF values, and confidence intervals indicate 1.96 × s.e.m. for DAFs in each bin. (b) Under-representation of polymorphic sites in 1000 Genomes Project data. (c) Under-representation of chimpanzee lineage–derived variants. Under-representation is defined as the proportion of 1000 Genomes Project (b) or chimpanzee-derived (c) variants in a specific scaled C score bin divided by the frequency with which that scaled C score is observed for all possible mutations of the human reference assembly (10^(C score/−10)). The stronger under-representation of chimpanzee-derived variants relative to 1000 Genomes Project variants is expected given that the former are mostly fixed or high-frequency variants (and have survived many generations of purifying selection), whereas the latter are mostly low-frequency variants. Depletion values in b,c for C score bins other than 0 are significantly different from expectation (binomial proportion test, all P < 1 × 10⁻¹¹).

Sensitivity of methods in distinguishing pathogenic and benign variants. Receiver operating characteristics (ROCs) are shown discriminating curated, pathogenic mutations defined by the ClinVar database²⁷ from matched, likely benign ESP alleles (DAF ≥ 5%)²⁴ with the same categorical consequence. (a) Genome-wide variants for which GerpS, PhCons and phyloP scores are defined (n = 16,334). (b) Analysis limited to missense changes (n = 15,154), with missing values imputed to an upper limit of each score. (c) Analysis limited to missense changes for which PolyPhen, SIFT and Grantham scores are all defined (n = 13,358). Versions of the plot in c that exclude overlap between PolyPhen training data and the ClinVar database or use a CADD model trained without PolyPhen as a feature are shown in the Supplementary Figures. Area under the curve (AUC) values are provided for each of the scores used.

Ranking of pathogenic ClinVar variants among the variants identified by whole-genome sequencing in 11 human individuals from diverse populations. (a) Cumulative distribution of the rankings of 9,831 pathogenic ClinVar variants when 'spiked' into each of 11 personal genomes. For example, the C scores of ~30% of ClinVar variants rank in the top 0.1% of all variants within a personal genome, and most rank in the top 1%. About 25% of pathogenic ClinVar SNVs are not scored by PolyPhen or SIFT because of missing values or the restriction of these methods to missense variation; note also that rankings for PolyPhen and SIFT are computed among missense variants only and are therefore derived from far fewer total variants (see a plot restricted to missense variation in Supplementary 6). (b) Quantile-quantile plot of C scores for the SNVs identified in the 11 individual genomes and pathogenic ClinVar SNVs. For a given scaled C score observed in an individual, the fraction of that individual's variants with a C score at least that high was computed (y axis). The C score corresponding to this quantile of the distribution of all possible variants is displayed on the x axis. High C scores are under-represented compared to the set of all possible variants. In contrast, known disease-causal variants from ClinVar have large C scores relative to the set of all possible variants. This fact can be exploited to prioritize causal variants identified from whole-genome sequencing of individual genomes as in a (see also Supplementary Tables 10 and 11).

C scores for GWAS SNPs are higher than for nearby control SNPs and are dependent on study sample size. The average scaled C score (y axis) is plotted for each category of SNPs, as indicated by color, relative to the sample size of the association study in which the SNP was identified (x axis). Sample size bins are log2 scaled and mutually exclusive; for example, the bin labeled 1,024 represents all SNPs from studies with between 512 and 1,024 samples. Error bars, ±1 s.e.m. Each shaded rectangle represents overall (across all sample sizes) scaled C score mean ± 1 s.e.m. for each category as indicated by color.

A general framework for estimating the relative pathogenicity of human genetic variants

Martin Kircher¹,*, Daniela M. Witten²,*, Preti Jain³,⁴, Brian J. O’Roak¹,⁴, Gregory M. Cooper³,#, and Jay Shendure¹,#

Martin Kircher: mkircher@uw.edu; Daniela M. Witten: dwitten@u.washington.edu; Preti Jain: pjain@hudsonalpha.org;

Brian J. O’Roak: oroak@uw.edu; Gregory M. Cooper: gcooper@hudsonalpha.org; Jay Shendure: shendure@uw.edu

¹Department of Genome Sciences, University of Washington, Seattle, WA, USA

²Department of Biostatistics, University of Washington, Seattle, WA, USA

³HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA

Abstract

Our capacity to sequence human genomes has exceeded our ability to interpret genetic variation.

Current genomic annotations tend to exploit a single information type (e.g. conservation) and/or

are restricted in scope (e.g. to missense changes). Here, we describe Combined Annotation

Dependent Depletion (CADD), a framework that objectively integrates many diverse annotations

into a single, quantitative score. We implement CADD as a support vector machine trained to

differentiate 14.7 million high-frequency human derived alleles from 14.7 million simulated

variants. We pre-compute “C-scores” for all 8.6 billion possible human single nucleotide variants

and enable scoring of short insertions/deletions. C-scores correlate with allelic diversity,

annotations of functionality, pathogenicity, disease severity, experimentally measured regulatory

effects, and complex trait associations, and highly rank known pathogenic variants within

individual genomes. The ability of CADD to prioritize functional, deleterious, and pathogenic

variants across many functional categories, effect sizes and genetic architectures is unmatched by

any current annotation.

#To whom correspondence should be addressed: shendure@uw.edu, gcooper@hudsonalpha.org.

*These authors contributed equally to this work

⁴Present address: Department of Molecular & Medical Genetics, Oregon Health & Science University, Portland, OR, USA

G.C. and J.S. designed the study; M.K. processed the annotation data and scores, developed and implemented the simulator and scripts

required for scoring; P.J. and B.O. prepared and provided data sets and annotations; D.W. and M.K. developed the model and

performed model training; D.W. performed the analysis of individual features and interactions; M.K., D.W., G.C., and J.S. analyzed

the model’s performance on different data sets; G.C. analyzed the GWAS data; J.S., G.C., M.K. and D.W. wrote the manuscript with

input from all authors.

Users may view, print, copy, download and text and data-mine the content in such documents, for the purposes of academic research,

subject always to the full Conditions of use: http://www.nature.com/authors/editorial_policies/license.html#terms

NIH Public Access

Author Manuscript

Nat Genet. Author manuscript; available in PMC 2014 September 01.

Published in final edited form as:

Nat Genet. 2014 March ; 46(3): 310–315. doi:10.1038/ng.2892.

Technical Report

A strength of genomic approaches to study disease is the replacement of informed but biased

hypotheses with unbiased but generic ones, like the “equal treatment” of all genetic variants

in genome-wide association studies (GWAS). However, for both rare variants of large effect

and common variants of weak effect, the use of prior knowledge can be critical for disease

gene discovery¹⁻⁴. For example, exome sequencing is an effective discovery strategy

because it focuses on protein-altering variation, which is enriched for causal effects⁵.

While many existing annotations are useful for prioritizing causal variants to boost

discovery power (e.g. PolyPhen⁶, SIFT⁷, and GERP⁸), current approaches tend to suffer

from one or more of four major limitations. First, annotations vary widely with respect to

both inputs and outputs. For example, conservation metrics⁸⁻¹⁰ are defined genome-wide

but do not use functional information and are not allele-specific, while protein-based

metrics⁶,⁷ apply only to coding, and often only to missense, variants, thereby excluding

>99% of human genetic variation. Second, each annotation has its own metric and these

metrics are rarely comparable, making it difficult to evaluate the relative importance of

distinct variant categories or annotations. Third, annotations trained on known pathogenic

mutations are subject to major ascertainment biases and may not generalize. Fourth, it is a

major practical challenge to obtain, let alone to objectively evaluate or combine, the existing

panoply of partially correlated and partially overlapping annotations; this challenge will only

magnify as large-scale projects like ENCODE¹¹ continually increase the amount of relevant

data available. The net result of these limitations is that many potentially relevant

annotations are ignored, while the subset that are used are applied and combined in ad hoc

and subjective ways that undermine their utility.

Here, we describe a general framework, Combined Annotation Dependent Depletion

(CADD), for integrating diverse genome annotations and scoring any possible human single

nucleotide variant (SNV) or small insertion/deletion (indel) event. The basis of CADD is to

contrast the annotations of fixed or nearly fixed derived alleles in humans relative to

simulated variants. Deleterious variants – that is, variants that reduce organismal fitness –

are depleted by natural selection in fixed but not simulated variation. CADD therefore

measures deleteriousness, a property that strongly correlates with both molecular

functionality and pathogenicity¹². Importantly, metrics of deleteriousness, in contrast with

pathogenicity or molecular functionality, have major advantages. Whereas the latter are

limited in scope to a small set of genetically or experimentally well-characterized mutations

and subject to major ascertainment biases, deleteriousness can be measured systematically

across the genome assembly (see refs 8, 9, 10 and below). Further, selective constraint on

genetic variants is related to the totality of their phenotype-relevant effects rather than any

individual molecular or phenotypic consequence. Measures of deleteriousness can therefore

provide, in principle, a genome-wide, data-rich, functionally generic, and organismally

relevant estimate of variant impact.

We identified differences between human genomes and the inferred human-chimpanzee

ancestral genome¹³ where humans carry a derived allele with a frequency of at least 95%

(14.9 million SNVs and 1.7 million indels). Nearly all of these events are fully fixed in the

human lineage, with fewer than 5% appearing as nearly fixed polymorphisms in the 1000

Genomes Project¹⁴ variant catalog (derived allele frequency (DAF) ≥ 95%). To simulate an

equivalent number of de novo mutations, we used an empirical model of sequence evolution

with CpG dinucleotide-specific rates and mutation rates locally estimated at a 1 megabase

(Mb) scale (Supplementary Note). Mutation rate parameters as well as the size distribution

of indels were estimated from six-way primate genome alignments¹⁵.

To generate annotations, we used the Ensembl Variant Effect Predictor¹⁶ (VEP), data from

the ENCODE project¹¹ and information from UCSC genome browser tracks¹⁷

(Supplementary Table 1). The annotations span a range of data types including conservation

metrics like GERP⁸, phastCons⁹, and phyloP¹⁰; regulatory information¹¹ like genomic

regions of DNase hypersensitivity¹⁸ and transcription factor binding¹⁹; transcript

information like distance to exon-intron boundaries or expression levels in commonly

studied cell lines¹¹; and protein-level scores like Grantham²⁰, SIFT⁷, and PolyPhen⁶. The

resulting variant-by-annotation matrix contained 29.4 million variants (half fixed or nearly

fixed human derived alleles (“observed”), half simulated de novo mutations (“simulated”))

and 63 distinct annotations, some of which are composites that summarize many underlying

annotations (Supplementary Note, Supplementary Tables 1–2).

We first assessed the validity of our general approach by constructing a series of univariate

models that contrast observed and simulated variants using each of the 63 annotations as

individual predictors (Supplementary Note). Nearly all models were highly significant

(Supplementary Tables 3–5) and consistent with expectation. For example, we find a nearly

20-fold depletion of nonsense variants, a 2-fold depletion of missense variants, and no

depletion of intergenic or upstream/downstream variants (Supplementary Table 6).

Nonsense and missense mutations that occur near the starts of cDNAs were more depleted

than those occurring near the ends (Supplementary Table 7), and variants within 20, and

especially within 2, nucleotides of splice junctions were also depleted (Supplementary Fig.

1). The best performing individual annotations were protein-level metrics such as PolyPhen⁶

and SIFT⁷, but these evaluated only missense variants (0.63% of all variants in the training

data are missense; of these, 88% had defined PolyPhen values and 90% had defined SIFT

values). Conservation metrics were the strongest individual genome-wide annotations

(Supplementary Table 3).
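The fold-depletion figures quoted above compare how often an annotation occurs among simulated variants with how often it occurs among observed variants. The sketch below shows only that arithmetic; the function name and all counts are hypothetical and chosen so the ratios land near the reported 20-fold and 2-fold values. It does not reproduce the univariate models described in the Supplementary Note.

```python
def fold_depletion(observed_with, observed_total, simulated_with, simulated_total):
    """Ratio of the simulated to the observed fraction of variants carrying an annotation.

    Values above 1 mean the annotation is rarer among observed (fixed or nearly
    fixed) variants than among simulated ones, i.e. it is depleted.
    """
    observed_fraction = observed_with / observed_total
    simulated_fraction = simulated_with / simulated_total
    return simulated_fraction / observed_fraction


# Hypothetical counts, chosen only to illustrate the arithmetic.
nonsense = fold_depletion(500, 1_000_000, 10_000, 1_000_000)
missense = fold_depletion(6_000, 1_000_000, 12_000, 1_000_000)
print(f"nonsense: {nonsense:.1f}-fold, missense: {missense:.1f}-fold")
```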

We also examined correlations between annotations (Supplementary Fig. 2) and the value of

adding interaction terms between annotations (Supplementary Fig. 3). Many annotations

were correlated and many interactions were statistically significant, but only a handful of

interacting pairs meaningfully improved a simple additive model. Overall, these analyses

demonstrate that substantial biological differences are present between the observed and

simulated variants with respect to the 63 annotations, and that linear models capture much of

this information.

We next trained a support vector machine²¹ (SVM) with a linear kernel on features derived

from the 63 annotations, supplemented by a limited number of interaction terms

(Supplementary Note, Supplementary Tables 1–2, Supplementary Fig. 4). Ten models,

independently trained on observed variants and different samples of simulated variants, were

highly correlated (all pairwise Spearman rank correlations >0.99; Supplementary Fig. 5). An

average of these models was applied to score all 8.6 billion possible SNVs of the human

reference genome (GRCh37). To simplify interpretation in some contexts, we also defined

phred-like²² scores (“scaled C-scores”) based on the rank of the C-score of each variant

relative to all 8.6 billion possible SNVs, ranging from 1 to 99 (Supplementary Note). For

example, substitutions with the highest 10% (10⁻¹) of all scores - that is, least likely to be

observed human alleles under our model - were assigned values of 10 or greater (“≥C10”),

while variants in the highest 1% (10⁻²), 0.1% (10⁻³), etc. were assigned scores ≥C20, ≥C30,

etc.
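The training setup described above (a linear SVM separating observed from simulated variants, fitted ten times on different samples of simulated variants and then averaged) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the two synthetic features, the Pegasos-style stochastic subgradient solver, the hyperparameters, and the omission of a bias term. The actual features, solver and settings are described in the Supplementary Note and are not reproduced here.

```python
import random


def train_linear_svm(rows, labels, lam=0.01, epochs=20, seed=0):
    """Pegasos-style stochastic subgradient descent for a linear SVM.

    Minimises hinge loss plus an L2 penalty; no bias term is fitted, which is
    adequate here only because the toy data are symmetric about the origin.
    """
    rng = random.Random(seed)
    w = [0.0] * len(rows[0])
    order = list(range(len(rows)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)
            margin = labels[i] * sum(wj * xj for wj, xj in zip(w, rows[i]))
            # Shrink the weights (L2 penalty) ...
            w = [(1.0 - eta * lam) * wj for wj in w]
            # ... and step toward the example if it violates the margin.
            if margin < 1.0:
                w = [wj + eta * labels[i] * xj for wj, xj in zip(w, rows[i])]
    return w


def make_variants(n, mean, rng):
    """Synthetic two-feature 'annotation' vectors (purely illustrative)."""
    return [[rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)] for _ in range(n)]


rng = random.Random(1)
observed = make_variants(200, -1.5, rng)         # label -1
simulated_pool = make_variants(2000, 1.5, rng)   # label +1

# Ten models: same observed variants, a different sample of simulated ones each time.
models = []
for k in range(10):
    simulated = rng.sample(simulated_pool, len(observed))
    rows = observed + simulated
    labels = [-1] * len(observed) + [1] * len(simulated)
    models.append(train_linear_svm(rows, labels, seed=k))

# Average the ten weight vectors into a single scoring model.
avg_w = [sum(m[j] for m in models) / len(models) for j in range(2)]


def raw_score(x, w=avg_w):
    """Higher values mean 'looks more like a simulated variant'."""
    return sum(wj * xj for wj, xj in zip(w, x))


correct = sum(raw_score(x) < 0 for x in observed) + sum(raw_score(x) > 0 for x in simulated_pool)
accuracy = correct / (len(observed) + len(simulated_pool))
print(f"averaged weights: {avg_w}, accuracy on toy data: {accuracy:.3f}")
```

Averaging the weight vectors is equivalent to averaging the model outputs only because the models are linear; a non-linear kernel would require averaging the scores instead.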

We first calculated the proportion of all possible substitutions with a given scaled C-score

having specific functional consequences (Fig. 1; Supplementary Table 8). Although trained

solely on the difference between observed and simulated variants, rather than on sets of

known disease causing variants that might introduce ascertainment bias, the C-scores of

potential nonsense variants are highest (median 37), followed by missense and canonical

splice site variants (median 15) and with intergenic variants comprising the bottom of the

list (median 2). At the same time, 76% of potential SNVs with ≥C20 are non-coding (i.e.

categories other than missense, nonsense, canonical splice or stop loss), while 74% of

potential missense and 18% of potential nonsense SNVs are below C20. Further, within each

functional class there are distinctions that are biologically relevant and likely predictively

useful. For example, potential nonsense variants – often treated as a homogeneous group in

disease studies – in olfactory receptors score lower than in other genes, while potential

nonsense variants in genes found previously to be “essential”²³ score higher (Fig. 1 lower

panel, Supplementary Fig. 6). C-scores thus capture considerable information both between

and within functional categories. Of note, these same distinctions are absent or muted with

other measures, either due to missingness (e.g., for missense-only measures) or lack of

functional awareness (e.g., conservation measures cannot distinguish between a nonsense

and missense allele at a given position).
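The scaled C-scores used throughout this section are phred-like transforms of a variant's rank among all possible SNVs. A minimal sketch of that mapping follows, assuming the standard phred formula, −10 · log10(rank / total), with rank 1 assigned to the highest raw score. The exact rank convention, tie handling and rounding are given in the Supplementary Note and may differ; the function names are hypothetical.

```python
import math

TOTAL_SNVS = 8_600_000_000  # all possible SNVs of the reference, as stated in the text


def scaled_c_score(rank, total=TOTAL_SNVS):
    """Phred-like score from a 1-based rank (1 = highest raw C-score)."""
    return -10.0 * math.log10(rank / total)


def top_fraction(scaled):
    """Fraction of all possible SNVs scoring at least `scaled`."""
    return 10.0 ** (-scaled / 10.0)


for fraction in (0.1, 0.01, 0.001):
    rank = int(TOTAL_SNVS * fraction)
    print(f"top {fraction:.1%} of SNVs -> scaled score {scaled_c_score(rank):.1f}")

print(f"scaled score 23.2 -> top {top_fraction(23.2):.2%} of possible SNVs")
```

Under this formula the single top-ranked variant out of 8.6 billion would receive a score just above 99, in line with the stated range, and a scaled score of 23.2 corresponds to roughly the top 0.5% of possible SNVs.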

We next compared scaled C-scores with levels of genetic diversity. C-scores are

negatively correlated with the DAF of variants identified in the 1000 Genomes Project¹⁴ or

the Exome Sequencing Project²⁴ (ESP) (Fig. 2a; Supplementary Figs. 7–9), and variants with

high C-scores are depleted from the 1000 Genomes Project catalog (Fig. 2b) and

from chimp-derived variants (Fig. 2c). Importantly, these validation datasets have minimal

overlap with the “observed” subset of the training data, which consists only of fixed or

nearly fixed (>95% DAF) human derived alleles. Furthermore, although we cannot fully

eliminate confounding by these factors, the negative correlation between C-scores and the

DAF of standing variation is robust to controlling for variation in background selection,

local GC content, local CpG density, and site-based conservation (Supplementary Fig. 9).
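The depletion plotted in Fig. 2b,c is a ratio of an observed proportion to the proportion expected among all possible SNVs, which the Fig. 2 legend gives as 10^(C score/−10). The sketch below uses one self-consistent reading of that definition, the cumulative one (scaled score at least C on both sides of the ratio). This reading is an assumption, the per-bin calculation used for the figure may differ, and the catalog of scores is invented.

```python
def under_representation(scaled_scores, threshold):
    """Observed fraction of variants with scaled score >= threshold, divided by
    the fraction expected among all possible SNVs, 10 ** (-threshold / 10).

    Values below 1 indicate depletion of high-scoring variants.
    """
    observed = sum(s >= threshold for s in scaled_scores) / len(scaled_scores)
    expected = 10.0 ** (-threshold / 10.0)
    return observed / expected


# Hypothetical scaled C-scores for a small variant catalog.
catalog = [1.2, 3.4, 0.5, 12.0, 7.7, 2.1, 4.1, 4.4, 9.9, 0.8,
           5.0, 1.1, 2.9, 6.3, 0.2, 3.3, 6.0, 8.1, 1.9, 2.2]

ratio = under_representation(catalog, 10.0)
print(f"under-representation at C10: {ratio:.2f}")
```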

We next sought to assess the utility of CADD to prioritize functional and disease-relevant

variation within five distinct contexts.

First, for MLL2, the gene mutated in Kabuki syndrome, C-scores enable discrimination of a

diverse set of disease-associated alleles²⁵ versus rare, likely benign variants from ESP²⁴

(Wilcoxon rank sum test p = 9.9 × 10⁻⁹⁴; n = 210/679). Other metrics were markedly

inferior in terms of accuracy or comprehensiveness (Supplementary Fig. 10).

Second, for HBB, the gene mutated in beta-thalassemia, C-scores of disease-associated

alleles²⁶ – a set of indels (n=93) and SNVs (n=119) with regulatory/upstream (n=54),

splicing (n=37), missense (n=22), nonsense (n=18) and other effects – are significantly, and

more strongly than other measures, correlated with three levels of phenotypic severity

(Kruskal-Wallis rank sum test p = 2.4 × 10⁻⁷; n = 48/65/99, Supplementary Fig. 11).

Third, pathogenic variants curated by the NIH ClinVar database²⁷ are well separated from

likely benign alleles (ESP²⁴ DAF ≥ 5%) matched to the same categorical consequences

(Wilcoxon rank sum test p < 10⁻³⁰⁰, n = 8174/8174, Fig. 3; Supplementary Figs. 12–16).

We note that there is substantial overlap between ClinVar and the training data underlying

PolyPhen. When these sites are excluded from the test dataset, or when PolyPhen is

excluded as a training feature from CADD, C-scores continue to outperform all or nearly all

missense-only metrics and conservation measures (Supplementary Fig. 12).

Fourth, C-scores strongly correlate with the number of observations for somatic cancer

mutations in p53 reported to the International Agency for Research on Cancer (Spearman

rank correlation 0.38, p = 6 × 10⁻⁷³, n = 2068, Supplementary Note).

Fifth, we examined two enhancers²⁸ and one promoter²⁹ in which we previously performed

saturation mutagenesis. C-scores are significantly correlated, and overall more so than

measures of sequence conservation, with the experimentally measured absolute expression

fold change of individual variants (Spearman rank correlation of combined data = 0.31, p =

1.9 × 10⁻⁶⁵, n = 2847; Supplementary Fig. 17).
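Several of the comparisons above are reported as Spearman rank correlations between C-scores and a measured quantity. As a reminder of what that statistic computes, here is a small dependency-free sketch: values are converted to ranks (ties share their average rank) and the Pearson correlation of the ranks is taken. The example data are invented, and the sketch does not compute p-values.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical C-scores and absolute expression fold changes for six variants.
c_scores = [5.1, 12.3, 8.8, 20.4, 15.0, 2.2]
abs_fold_change = [0.1, 0.6, 0.2, 0.9, 0.4, 0.3]
print(f"Spearman rho = {spearman(c_scores, abs_fold_change):.2f}")
```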

Collectively, these analyses demonstrate that CADD is quantitatively predictive of

deleteriousness, pathogenicity, and molecular functionality, both protein-altering and

regulatory, in a variety of experimental and disease contexts. Within each of these contexts,

CADD’s predictive utility is much better than measures of sequence conservation, the only

comprehensive type of variant score, and also tends to be better, in most cases substantially

so, than function-specific metrics when restricted to the appropriate variant subsets.

We next considered how CADD may be useful in evaluating candidate variation within

exome or genome-wide studies.

First, we analyzed de novo exome variants (SNVs and indels) identified in children with

autism spectrum disorders³⁰⁻³⁴ (ASD) and intellectual disability³⁵,³⁶ (ID) along with

unaffected siblings or controls, including 88 nonsense, 1,015 missense, 359 synonymous, 32

canonical splice site, and 150 other variants, including indels. Variants in affected children

are significantly more deleterious than those in unaffected siblings/controls, considering

each disease separately (Supplementary Table 9) or combined (ASD+ID Wilcoxon rank sum

test p = 2.0 × 10⁻⁴, n = 1130/514). Additionally, de novo variants in ID probands are

significantly more deleterious than those of ASD probands (p = 4.7 × 10⁻⁵, n=170/960),

suggesting a more deleterious global mutation burden in ID, consistent with the observation

of increased sizes and numbers of copy number variants in ID relative to ASD³⁷.

Second, it is well established that annotations like PolyPhen and conservation are valuable

in the sequencing-based identification of disease-causal genes by virtue of their ability to

highly rank pathogenic variants¹,²,³⁸. We therefore examined the distribution of C-scores in

the genomes of 11 individuals representing diverse populations³⁹,⁴⁰, and find that CADD

highly ranks known disease-causal variants (ClinVar pathogenic) within the complete

spectrum of variation in personal genomes (Fig. 4; Supplementary Fig. 16 and

Supplementary Tables 10–11). Furthermore, CADD is both more quantitative and

comprehensive in this task (e.g., ~27% of pathogenic ClinVar SNVs are not scored by

PolyPhen because of missing values or its restriction to missense variation). Given its

considerable superiority over the best available protein-based and conservation metrics in

terms of ranking known pathogenic variants in the complete spectrum of variation within

personal genomes, it is likely that CADD will improve the power of sequence-based disease

studies beyond current standard approaches.
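The ranking analysis above asks where a known pathogenic variant would fall if it were placed among all variants of one personal genome. A minimal sketch of that calculation follows, assuming the rank is summarised as the fraction of the genome's variants that score strictly higher than the candidate. The function name and the toy score list are hypothetical, and the handling of ties or missing scores in the actual analysis is not specified here.

```python
from bisect import bisect_right


def fraction_scoring_higher(genome_scores, candidate_score):
    """Fraction of a genome's variants with a score strictly above the candidate's."""
    ordered = sorted(genome_scores)
    n_higher = len(ordered) - bisect_right(ordered, candidate_score)
    return n_higher / len(ordered)


# Toy 'personal genome' of 1,000 variants with scores 0..999 (illustrative only).
genome = list(range(1000))
spiked_variant_score = 998.5

fraction = fraction_scoring_higher(genome, spiked_variant_score)
print(f"spiked variant is outranked by {fraction:.1%} of the genome's variants")
```

Sorting once and using binary search keeps repeated lookups cheap when many candidate variants are ranked against the same genome; in that case the sorted list would be computed outside the function.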

Finally, we analyzed CADD scores for single nucleotide polymorphisms (SNPs) identified

by GWAS of complex traits, contrasting them with nearby control SNPs matched for allele

frequency and genotyping array availability (Fig. 5, Supplementary Note). We find that lead

GWAS SNPs have significantly higher C-scores than control SNPs (one-sided Wilcoxon

rank sum test, p-value = 1.3 × 10⁻¹², n = 5498/5498); nearby SNPs in linkage disequilibrium

with lead SNPs (“tags”) score lower on average than leads but are also significantly higher

than their matched controls (p-value = 5.1 × 10⁻¹⁰⁷). C-score differences remain significant

after controlling for properties like gene-body effect, gene expression level, conservation,

and regulatory element overlap; each of these are significantly different between associated

and control SNPs but none can fully explain the C-score discrepancy (Supplementary Note).

C-scores of trait-associated SNPs furthermore correlate with the size of the underlying

association study and with statistical significance of the association itself (Fig. 5;

Supplementary Figure 16; Supplementary Note), likely due to the increased ability of larger

studies and stronger association statistics to enrich for causal variants. Although most are

not themselves causal, our analysis suggests that GWAS-identified SNPs, especially strongly

associated lead SNPs from large studies, are enriched for causal variants, consistent with

previously observed GWAS enrichments for individual annotations¹¹,⁴¹⁻⁴⁴.

With CADD, we describe a generic, expandable framework for integrating information

contained in diverse annotations of genetic variation to a single score. We demonstrate that

in a variety of contexts this approach is better, in some cases modestly but in many cases

dramatically, than other widely used annotations at prioritizing functional and pathogenic

variants. Further, beyond utility in any one setting, there are practical and conceptual

advantages to CADD that should prove of major value to genetic studies of human disease.

First, the information content of many individual annotations is objectively merged into a

single value, which is far preferable to ad hoc approaches for combining annotations and

likely to improve performance, consistent with benefits seen for “consensus” methods in

missense-specific annotation⁴⁵. Second, CADD can readily incorporate expansions to

existing annotations and entirely new annotations. The ability to indefinitely and readily

integrate new information is crucial in light of projects like ENCODE, which are

continuously and rapidly expanding available annotations¹¹. Third, CADD combines the

generality of conservation-based metrics with the specificity of subset-relevant functional

metrics (e.g. PolyPhen), exploiting the advantages of both while attenuating their respective

disadvantages.

CADD also has a number of limitations which may restrict its utility for certain analyses or

represent areas for improvement. First, C-scores measure reductions in variation, which

correlate with deleteriousness but are also affected by local mutation rate, background

selection, biased gene conversion, and other phenomena, potentially limiting accuracy.

Second, C-scores reflect the proportion of variants with a given annotation pattern that are

visible to selection but may not capture differences in selective intensity; other approaches,

such as polymorphism-to-divergence comparisons, may be more accurate for estimating

selective coefficients⁴⁶. Third, there is a strong need for more “gold standard” data,

particularly for non-coding regions of the genome, the current paucity of which limits the

development of better annotations as well as our ability to validate predictions. Fourth, it is

at present not possible to precisely calibrate the relationship between CADD-estimated

deleteriousness and the likelihood that a variant is pathogenic. As such, C-scores are best

interpreted in terms of “likelihood of deleteriousness” rather than “likelihood of

pathogenicity”, e.g. the quantifiable extent of depletion of a given C-score from chimp-

derived alleles (Fig. 2c, Supplementary Table 11). Especially for discovering causal

variants, CADD should be treated as one piece of information contributing to the totality of

evidence for pathogenicity, and evaluated as a supplement, not a replacement, for genetic

information.

The “one-stop” nature of CADD is likely to be of great practical and conceptual value to

future sequencing studies. It will minimize the scope and diversity of annotations that have

to be generated, tracked, and evaluated by a lab or project, and reduce the need for ad hoc

combinations of filters, scores, and parameters as is now routinely done. For example, an

oft-used approach in exome studies is to merge missense (with or without an annotation of

“damage” or given level of conservation), nonsense, and splice-disrupting variants into a

single, internally unranked list of “protein-altering” variants prior to genetic analysis⁵. With

CADD, one might avoid arbitrary filters/thresholds altogether, including both coding and

non-coding variants on a single, meaningfully ranked list. For example, a recent study of

recessive, non-syndromic pancreatic agenesis identified 5 causal non-coding variants that

disrupt function of a distal enhancer of PTF1A⁴⁷. C-scores for these non-coding, disease-

causal variants (scaled scores between 23.2 and 24.5) rank them above 99.5% of all possible

human SNVs, above 97% of missense SNVs in a typical exome, and higher than 56% of

Mendelian pathogenic SNVs in ClinVar²⁷.

Both in research and in the clinic, our capacity to define catalogs of genetic variants exceeds

our ability to systematically evaluate their potential impacts. This challenge will deepen as

sequencing accelerates, as genomes displace exomes, and as the array of functional

categories and annotations expand. A unified, quantitative, and scalable framework capable

of exploiting many genomic annotations will be essential to meet this challenge. We

anticipate that the model described here and the accompanying freely available pre-

computed scores for all possible GRCh37/hg19 SNVs (http://cadd.gs.washington.edu/) will

be broadly useful immediately, and improve over time, enabling better interpretation of

variants of uncertain significance in a clinical setting and improving discovery power for

genetic studies of both Mendelian and complex diseases.

Online Methods

Simulated and observed variants

The basis of the CADD framework is to capture correlates of selective constraint as

manifested in differences between simulated variants and observed human derived changes.

For the simulated variants, we developed a genome-wide simulator of de novo germline

variation. The simulator was motivated by the parameters of the General Time Reversible

(GTR) model50, but because the standard GTR does not naturally accommodate asymmetric

CpG-specific mutation rates, we use a fully empirical model of sequence evolution with a

separate rate for CpG dinucleotides and local adjustment of mutation rates (see

Supplementary Note). Simulation parameters were obtained from Ensembl Enredo-Pecan-

Ortheus (EPO)13,15 whole genome alignments of six primate species (Ensembl Compara

release 66). A custom script and the associated rate matrices underlying the genome-wide

simulator are available as Supplementary File 1. We applied these parameters to simulate

single nucleotide (SNV) and insertion/deletion (indel) variants based on the human reference

sequence (GRCh37).
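
The simulator itself is distributed as Supplementary File 1 and is not reproduced here. As a rough illustration of what a separate CpG rate means in practice, the toy sketch below mutates a sequence with a higher per-site probability at CpG dinucleotides. All rates and substitution weights are placeholder values chosen for illustration, not the empirical parameters derived from the EPO alignments, and the local rate adjustment and indels are omitted.

```python
import random

# Hypothetical per-site mutation probabilities (illustrative values only).
RATE_NON_CPG = 0.01
RATE_CPG = 0.10  # CpG sites get their own, higher rate

# Hypothetical substitution weights that favour transitions over transversions.
SUBSTITUTION_WEIGHTS = {
    "A": {"G": 0.6, "C": 0.2, "T": 0.2},
    "C": {"T": 0.6, "A": 0.2, "G": 0.2},
    "G": {"A": 0.6, "C": 0.2, "T": 0.2},
    "T": {"C": 0.6, "A": 0.2, "G": 0.2},
}


def in_cpg(seq, i):
    """True if position i is the C or the G of a CpG dinucleotide."""
    if seq[i] == "C" and i + 1 < len(seq) and seq[i + 1] == "G":
        return True
    if seq[i] == "G" and i > 0 and seq[i - 1] == "C":
        return True
    return False


def simulate_snvs(seq, rng):
    """Return a list of (position, ref, alt) SNVs simulated on seq."""
    variants = []
    for i, ref in enumerate(seq):
        if ref not in SUBSTITUTION_WEIGHTS:
            continue  # skip N and other ambiguous bases
        rate = RATE_CPG if in_cpg(seq, i) else RATE_NON_CPG
        if rng.random() < rate:
            alts = list(SUBSTITUTION_WEIGHTS[ref])
            weights = [SUBSTITUTION_WEIGHTS[ref][a] for a in alts]
            alt = rng.choices(alts, weights=weights, k=1)[0]
            variants.append((i, ref, alt))
    return variants


example_seq = "ACGTTACGGATCCGNA" * 50
example_variants = simulate_snvs(example_seq, random.Random(0))
```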

For observed human derived changes, we extracted sites where the human reference genome

differs from the inferred human-chimp ancestral genome from the Ensembl EPO 6 primate

alignments defined above, excluding variants in the most recent 1000 Genomes Project14

data (1000G, variant release 3, 20101123) with a frequency of greater than 5%, and

including variants where the human reference carries an ancestral allele (i.e. matching the

inferred human-chimp ancestor sequence) but where the derived allele is observed with

frequency above 95% in the 1000G data. We identified a total of 14,893,290 SNVs, and

627,071 insertions and 1,107,414 deletions (less than 50bp in length).
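
The selection rule for observed human-derived changes can be sketched as a single predicate. This is an illustrative reading only: it assumes that the 5% and 95% thresholds both refer to the frequency of the non-reference allele in 1000G, and the function and argument names are hypothetical.

```python
def is_human_derived_site(ref, ancestral, alt=None, alt_freq=None):
    """Decide whether a site counts as an observed human-derived change.

    ref       -- allele in the human reference genome
    ancestral -- inferred human-chimp ancestral allele
    alt       -- non-reference allele seen in 1000G, or None if the site
                 is not variable there
    alt_freq  -- frequency of `alt` in 1000G (assumed meaning of "frequency")
    """
    if ref != ancestral:
        # Reference carries the derived allele: keep the site unless a
        # 1000G variant at this position has a frequency above 5%.
        return alt is None or alt_freq <= 0.05
    # Reference carries the ancestral allele: keep the site only if the
    # derived (non-reference) allele is observed above 95% frequency.
    return alt is not None and alt_freq > 0.95
```

Both branches select alleles that are fixed or nearly fixed in humans, which is the property the training set relies on.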

Variant annotation matrix

We used the Ensembl Variant Effect Predictor (VEP, Ensembl Gene annotation v68)16 to

obtain gene model annotation for single nucleotide and indel variants. For single nucleotide

variants within coding sequence, we also obtained SIFT7 and PolyPhen-26 scores from VEP.

We combined output lines describing MotifFeatures with the other annotation lines,

reformatted them into a purely tabular format, reduced the different Consequence output values

to 17 levels, and implemented a four-level hierarchy for overlapping annotations (see

Supplementary Note). To the 6 VEP input derived columns (chromosome, start, reference

allele, alternative allele, variant type: SNV/INS/DEL, length) and 26 actual VEP output

derived columns, we added 56 columns providing diverse annotations (e.g. mappability

scores and segmental duplication annotation as distributed by UCSC51,52; PhastCons and

phyloP conservation scores53 for three multi-species alignments9 excluding the human

reference sequence in score calculation; GERP++ single-nucleotide scores, element scores

and p-values54, also defined from alignments with the human reference excluded;

background selection score40,55; expression value, H3K27 acetylation, H3K4 methylation,

H3K4 trimethylation, nucleosome occupancy and open chromatin tracks provided for

ENCODE cell lines in the UCSC super tracks52; genomic segment type assignment from

Segway56; predicted transcription factor binding sites and motifs11; overlapping ENCODE

ChIP-seq transcription factors11; 1000 Genomes14 and Exome Sequencing Project57

variant status and frequencies; Grantham scores20 associated with a reported amino acid

substitution). The Supplementary Note provides a full description and Supplementary Table

1 lists all columns of the obtained annotation matrix.

Imputation and final training data set

Of the annotations described above, some columns were not useful for model training or

had to be excluded from training because they differ between the simulated variants and the

human-chimpanzee ancestor differences for technical reasons (see Supplementary Note for a

complete list; note that no allele frequency information was used in model training). In order

to fit models, we imputed missing values in genome-wide measures by the genome average

obtained from the simulated data, or set missing values to 0 where appropriate

(Supplementary Table 2). Further, we created an “undefined” category for the categorical

annotations in order to accommodate missing values. In order to deal with missing values in

annotations that are not defined on a subset of variants (e.g. information only available for

protein-coding genes), we set the missing values to zero and also created indicator variables

that contain a 1 if the corresponding variant is undefined, and a 0 otherwise. Since insertions

and deletions may produce arbitrary length Ref/Alt and nAA/oAA columns (and thus not a

fixed number of categorical levels), these values were set to N for Ref/Alt and set to

“undefined” for nAA/oAA.
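
The three imputation rules described above can be sketched as follows. This is a simplified illustration, not the authors' code: the assignment of columns to each rule and the genome-average value are assumptions made for the example (the actual per-column treatment is listed in Supplementary Table 2), and the `_undefined` suffix for indicator variables is a hypothetical naming choice.

```python
def impute_variant(row, genome_means, subset_columns, categorical_columns):
    """Return a copy of one annotation row (a dict) with missing values filled in.

    Missing values are represented as None.
    genome_means        -- genome-wide averages for genome-wide measures
    subset_columns      -- columns defined only for a subset of variants
    categorical_columns -- categorical columns
    """
    out = dict(row)
    for col, value in row.items():
        if col in subset_columns:
            # Indicator is 1 if the value is undefined for this variant, else 0.
            out[col + "_undefined"] = 1 if value is None else 0
        if value is not None:
            continue
        if col in categorical_columns:
            out[col] = "undefined"        # extra category for missing values
        elif col in subset_columns:
            out[col] = 0.0                # zero, flagged by the indicator above
        else:
            out[col] = genome_means[col]  # genome average from simulated data
    return out


example = impute_variant(
    {"priPhCons": None, "Grantham": None, "oAA": None},
    genome_means={"priPhCons": 0.1},  # placeholder value, not from the paper
    subset_columns={"Grantham"},
    categorical_columns={"oAA"},
)
```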

Sites from the simulation were labeled +1 and human derived variants as −1. Only insertions

and deletions shorter than 50bp were considered for model training and the Length column

was capped at 49 for the prediction of longer events. The ratio of indel events to SNV events

in the training data was set to that obtained for the simulation (1:8.46).
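
As a quick consistency check, the per-class counts reported under Model training reproduce the 1:8.46 indel-to-SNV ratio. The sketch below also shows the labeling and length-capping conventions; the constant and function names are ours.

```python
SIMULATED_LABEL = +1   # sites from the simulation
OBSERVED_LABEL = -1    # human-derived variants
MAX_LENGTH = 49        # cap applied to the Length column


def capped_length(length):
    """Length value used by the model; longer events are capped at 49."""
    return min(length, MAX_LENGTH)


# Counts sampled per class for each training data set (from the text).
n_snvs = 13_141_299
n_insertions = 627_071
n_deletions = 926_968

snvs_per_indel = n_snvs / (n_insertions + n_deletions)
print(f"indel:SNV ratio = 1:{snvs_per_indel:.2f}")
```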

Model training

We generated ten training data sets by sampling an equal number of 13,141,299 SNVs,

627,071 insertions and 926,968 deletions from both the simulated variant and observed

variant datasets. In order to train each support vector machine (SVM) model, the processed

data was converted to a sparse matrix representation after converting all n-level categorical

values to n individual Boolean flags. 1% of sites (~132,000 SNVs, 6,000 insertions and

9,000 deletions each) were randomly selected and used as a test data set. All other sites were

used to train linear SVMs using the LIBOCAS v0.96 library21. The SVM model fits a

hyperplane as defined below. X1,…,Xn are the 63 annotations described above (which

expand to 166 features due to the treatment of categorical annotations), W1,…,W11 are the

Boolean features that indicate whether a given feature (out of cDNApos, relcDNApos,

CDSpos, relCDSpos, protPos, relProtPos, Grantham, PolyPhenVal, SIFTval, as well as

Dst2Splice ACCEPTOR and DONOR) is undefined, 1{A} is an indicator variable for

whether the event A holds, and D is the set of bStatistic, cDNApos, CDSpos, Dst2Splice,

GerpN, GerpS, mamPhCons, mamPhyloP, minDistTSE, minDistTSS, priPhCons, priPhyloP,

protPos, relcDNApos, relCDSpos, relProtPos, verPhCons, and verPhyloP. Due to the coding

of categorical values using Boolean variables, the total number of features in this model is

949.
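
The conversion of n-level categorical values into n Boolean flags, and the random 1% hold-out, can be sketched as below. This is a minimal illustration and not the authors' pipeline: the sparse representation here is a plain dictionary keyed by feature name, whereas the actual input format for LIBOCAS is not described in this section, and the helper names are hypothetical.

```python
import random


def one_hot_sparse(row, categorical_levels):
    """Encode one variant as a sparse {feature_name: value} dict.

    Each n-level categorical column expands to n Boolean flags; only the
    flag that is set is stored. Numeric zeros are omitted.
    """
    sparse = {}
    for col, value in row.items():
        if col in categorical_levels:
            if value not in categorical_levels[col]:
                raise ValueError(f"unknown level {value!r} for column {col}")
            sparse[f"{col}={value}"] = 1.0
        elif value != 0:
            sparse[col] = float(value)
    return sparse


def split_train_test(items, test_fraction, rng):
    """Randomly hold out a fraction of items; returns (train, test)."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


levels = {"Ref": ["A", "C", "G", "T", "N"]}
encoded = one_hot_sparse({"Ref": "A", "GerpS": 0.0, "Length": 2}, levels)
train, test = split_train_test(range(1000), 0.01, random.Random(0))
```

Storing only non-zero entries keeps the expanded matrix small, since each variant sets exactly one flag per categorical column.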

SVM models were trained using various values for the generalization parameter (C), which

assigns the cost of misclassifications. Supplementary Fig. 4 shows the model training

convergence in 2000 iterations (~70h) for different settings of C. These results indicate that

model training only converges within a reasonable amount of time for C values around

0.0025 and below. We therefore trained models for all ten training data sets with C=0.0025.

We then averaged the model parameters across the ten fitted models and used this average model.
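
Averaging the parameters of several linear models can be sketched as follows. The representation of a model as a (weights, bias) pair with weights keyed by feature name is an assumption made for illustration, and the numbers are invented.

```python
def average_models(models):
    """Average weights and bias over a list of (weights_dict, bias) models."""
    n = len(models)
    features = set()
    for weights, _ in models:
        features.update(weights)
    avg_weights = {
        f: sum(weights.get(f, 0.0) for weights, _ in models) / n for f in features
    }
    avg_bias = sum(bias for _, bias in models) / n
    return avg_weights, avg_bias


def decision_value(weights, bias, sparse_x):
    """Signed distance-like score of a linear model for one sparse feature vector."""
    return bias + sum(weights.get(f, 0.0) * v for f, v in sparse_x.items())


# Two toy models with invented parameters.
toy_models = [({"a": 1.0, "b": 2.0}, 0.5), ({"a": 3.0}, 1.5)]
avg_w, avg_b = average_models(toy_models)
score = decision_value(avg_w, avg_b, {"a": 1.0, "b": 2.0})
```

Because the models are linear, the decision value of the averaged model equals the mean of the individual models' decision values for any input.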

Model testing and validation

We annotated all 8.6 billion possible substitutions in the human reference genome

(GRCh37), and applied the model to score all possible substitutions. When scoring sites with

multiple VEP annotation lines, we score all possible annotations first and then report the one

with the highest deleteriousness after applying the four hierarchy levels. We mapped the C-

scores to a phred-like scale (“scaled C-scores”) ranging from 1 to 99 based on their rank

relative to all possible substitutions in the human reference genome, i.e.

scaled C-score = −10 × log10(rank / total number of substitutions).
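
The rank-based scaling can be sketched in a few lines. This is an illustration under stated assumptions: rank 1 is taken to be the substitution with the highest raw score, ties are broken arbitrarily, and the function name is ours.

```python
import math


def scaled_scores(raw_scores):
    """Map raw scores to phred-like scaled scores based on their rank.

    Rank 1 is assigned to the highest raw score (assumed to be the most
    deleterious); the lowest-ranked substitution receives a scaled score of 0.
    """
    total = len(raw_scores)
    order = sorted(range(total), key=lambda i: raw_scores[i], reverse=True)
    scaled = [0.0] * total
    for rank, i in enumerate(order, start=1):
        scaled[i] = -10 * math.log10(rank / total)
    return scaled


# Toy example with 1,000 raw scores; the real computation ranks 8.6 billion.
toy_scaled = scaled_scores(list(range(1000)))
```

By construction, a scaled score of 10 or more marks the top 10% of all ranked substitutions, 20 or more the top 1%, and 30 or more the top 0.1%.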

We used several datasets extracted from the literature and public databases to look at the

performance of the model scores (see Supplementary Note for details): (1) C-scores in

specific gene classes motivated by the analysis performed by Khurana et al.58 (i.e.

HGMD48, non-immune essential genes described by Liao et al.23, GWAS genes as available

from the Genome.gov catalog, LoF genes from MacArthur et al.49 and olfactory genes from

the Ensembl 68 gene build). (2) 210 mutations in MLL2 associated with Kabuki syndrome

from Makrythanasis et al.25. We complemented those with 679 putatively benign variants

observed in the Exome Sequencing Project (ESP)57. (3) We downloaded a total of 119

SNVs, 30 insertions and 63 deletions (all required to be at most 50nt) within or near HBB

that give rise to thalassemia from HbVar26. Disease categories were used as defined by

HbVar, except that all types that are not “beta0” or “beta+” were pooled into one category,

“other”. (4) We obtained the NCBI ClinVar27 data set (release date June 16 2012) and

extracted variants that were marked “pathogenic” or “non-pathogenic (benign)”. We also

selected a set of apparently benign (≥5% allele frequency) variants from ESP that were

matched to the pathogenic ClinVar sites in terms of their Consequence annotations. In

addition, we generated a data set where we matched ESP and ClinVar variants by alternative

allele frequency, rounded to three decimal places. Due to the overlap of ClinVar and

ESP variants with the PolyPhen training data set, we trained a separate classifier without the

PolyPhen features and we also checked the performance on the subset of ClinVar and ESP

variants not used for PolyPhen training. To compare the performance of CADD with other

publicly available missense annotations not used in model training, we downloaded scores

from dbNSFP 2.059. (5) We combined high confidence de novo mutations from five family

based autism exome sequencing studies30–34, a total of 948 ASD probands and 590

unaffected siblings. Further, we obtained the coding variants as described above for two

family-based intellectual disability (ID) studies35,36, 151 ID and 20 unrelated control

families. (6) We obtained the expression fold change for each base substitution in ALDOB

and ECR11 from Patwardhan et al.28. This data set contains a total of 777 variants for

ALDOB and 1,860 variants for ECR11. Further, we obtained the HBB promoter data of

Patwardhan et al.29. The promoter data set contains a total of 210 variants associated with an

expression fold change. (7) We obtained a list of 23,788 single nucleotide somatic cancer

mutations in p53 which were reported to the International Agency for Research on Cancer

(IARC). These mutations correspond to 2,068 distinct variants; we recorded the number of

times that each variant was reported. (8) We obtained GATK VCF variant call files for all

autosomes and the X chromosome from shotgun sequencing of eleven men originating from

diverse human populations40. (9) We obtained the NHGRI genome-wide association study

(GWAS) catalog on December 18, 2012, and obtained 9,977 distinct SNP-trait associations

spanning 7,531 unique SNPs in 1000 Genomes; these variants are referred to as “lead

SNPs”. We used the Genome Variation Server (GVS, http://gvs.gs.washington.edu/GVS137/)

to find all SNPs within 100 kb of a lead SNP that have a pairwise correlation of

R² ≥ 0.8 within Utah residents with ancestry from northern and western Europe (CEU).

This resulted in an additional 56,538 unique SNPs, referred to as “tag SNPs”. We also

developed “control” SNP sets, selected to match trait-associated SNPs for a variety of

features that may bias SNPs found by GWAS in the absence of any causal effects.
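
The tag SNP definition in (9) amounts to a distance filter combined with a linkage disequilibrium threshold. The sketch below illustrates that rule on invented data; the tuple layout, the R² lookup table, and the SNP identifiers are hypothetical stand-ins for what the Genome Variation Server provides.

```python
def select_tag_snps(lead, candidates, r2_lookup, max_distance=100_000, min_r2=0.8):
    """Return IDs of candidate SNPs that tag the lead SNP.

    lead, candidates -- (snp_id, chromosome, position) tuples
    r2_lookup        -- dict mapping frozenset({id_a, id_b}) to pairwise R^2
                        in the reference population (CEU in the paper)
    """
    lead_id, lead_chrom, lead_pos = lead
    tags = []
    for snp_id, chrom, pos in candidates:
        if snp_id == lead_id or chrom != lead_chrom:
            continue
        if abs(pos - lead_pos) > max_distance:
            continue  # farther than 100 kb from the lead SNP
        if r2_lookup.get(frozenset((lead_id, snp_id)), 0.0) >= min_r2:
            tags.append(snp_id)
    return tags


# Invented identifiers, positions and R^2 values.
lead_snp = ("rsLead", "1", 500_000)
candidate_snps = [
    ("rsA", "1", 550_000),  # close and in strong LD
    ("rsB", "1", 700_000),  # strong LD but too far away
    ("rsC", "1", 450_000),  # close but weak LD
    ("rsD", "2", 500_000),  # different chromosome
]
r2_values = {
    frozenset(("rsLead", "rsA")): 0.90,
    frozenset(("rsLead", "rsB")): 0.95,
    frozenset(("rsLead", "rsC")): 0.50,
    frozenset(("rsLead", "rsD")): 0.99,
}
tag_ids = select_tag_snps(lead_snp, candidate_snps, r2_values)
```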

Supplementary Material

Refer to Web version on PubMed Central for supplementary material.

Acknowledgments

We thank P. Green and members of the Shendure Lab for helpful discussions and suggestions. Our work was

supported by National Institutes of Health (N.I.H.) grants U54HG006493 (to J.S. and G.C.), DP5OD009145 (to

D.W.) and DP1HG007811 (to J.S.).

References

1. Cooper GM, et al. Single-nucleotide evolutionary constraint scores highlight disease-causing

mutations. Nat Methods. 2010; 7:250–1. [PubMed: 20354513]

2. Cooper GM, Shendure J. Needles in stacks of needles: finding disease-causal variants in a wealth of

genomic data. Nat Rev Genet. 2011; 12:628–40. [PubMed: 21850043]

3. Musunuru K, et al. From noncoding variant to phenotype via SORT1 at the 1p13 cholesterol locus.

Nature. 2010; 466:714–9. [PubMed: 20686566]

4. Ward LD, Kellis M. Interpreting noncoding genetic variation in complex traits and human disease.

Nat Biotechnol. 2012; 30:1095–106. [PubMed: 23138309]

5. Ng SB, et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature.

2009; 461:272–6. [PubMed: 19684571]

6. Adzhubei IA, et al. A method and server for predicting damaging missense mutations. Nat Methods.

2010; 7:248–9. [PubMed: 20354512]

7. Ng PC, Henikoff S. SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids

Res. 2003; 31:3812–4. [PubMed: 12824425]

8. Cooper GM, et al. Distribution and intensity of constraint in mammalian genomic sequence.

Genome Res. 2005; 15:901–13. [PubMed: 15965027]

9. Siepel A, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes.

Genome Res. 2005; 15:1034–50. [PubMed: 16024819]

10. Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on

mammalian phylogenies. Genome Res. 2010; 20:110–21. [PubMed: 19858363]

11. ENCODE Project Consortium et al. An integrated encyclopedia of DNA elements in the human

genome. Nature. 2012; 489:57–74. [PubMed: 22955616]

12. Kimura, M. The neutral theory of molecular evolution. Vol. xv. Cambridge University Press,

Cambridge Cambridgeshire; New York: 1983. p. 367

13. Paten B, et al. Genome-wide nucleotide-level mammalian ancestor reconstruction. Genome Res.

2008; 18:1829–43. [PubMed: 18849525]

14. The 1000 Genomes Project Consortium et al. An integrated map of genetic variation from 1,092

human genomes. Nature. 2012; 491:56–65. [PubMed: 23128226]

15. Paten B, Herrero J, Beal K, Fitzgerald S, Birney E. Enredo and Pecan: genome-wide mammalian

consistency-based multiple alignment with paralogs. Genome Res. 2008; 18:1814–28. [PubMed:

18849524]

16. McLaren W, et al. Deriving the consequences of genomic variants with the Ensembl API and SNP

Effect Predictor. Bioinformatics. 2010; 26:2069–70. [PubMed: 20562413]

17. Meyer LR, et al. The UCSC Genome Browser database: extensions and updates 2013. Nucleic

Acids Res. 2013; 41:D64–9. [PubMed: 23155063]

18. Boyle AP, et al. High-resolution mapping and characterization of open chromatin across the

genome. Cell. 2008; 132:311–22. [PubMed: 18243105]

19. Johnson DS, Mortazavi A, Myers RM, Wold B. Genome-wide mapping of in vivo protein-DNA

interactions. Science. 2007; 316:1497–502. [PubMed: 17540862]

20. Grantham R. Amino acid difference formula to help explain protein evolution. Science. 1974;

185:862–4. [PubMed: 4843792]

21. Franc V, Sonnenburg S. Optimized cutting plane algorithm for large-scale risk minimization.

J Mach Learn Res. 2009; 10:2157–92.

22. Ewing B, Green P. Base-calling of automated sequencer traces using phred. II. Error probabilities.

Genome Res. 1998; 8:186–94. [PubMed: 9521922]

23. Liao BY, Zhang J. Null mutations in human and mouse orthologs frequently result in different

phenotypes. Proc Natl Acad Sci U S A. 2008; 105:6987–92. [PubMed: 18458337]

24. Fu W, et al. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding

variants. Nature. 2012

25. Makrythanasis P, et al. MLL2 mutation detection in 86 patients with Kabuki syndrome: a

genotype-phenotype study. Clin Genet. 2013

26. Giardine B, et al. HbVar database of human hemoglobin variants and thalassemia mutations: 2007

update. Hum Mutat. 2007; 28:206. [PubMed: 17221864]

27. Baker M. One-stop shop for disease genes. Nature. 2012; 491:171. [PubMed: 23135443]

28. Patwardhan RP, et al. Massively parallel functional dissection of mammalian enhancers in vivo.

Nat Biotechnol. 2012; 30:265–70. [PubMed: 22371081]

29. Patwardhan RP, et al. High-resolution analysis of DNA regulatory elements by synthetic saturation

mutagenesis. Nat Biotechnol. 2009; 27:1173–5. [PubMed: 19915551]

30. O’Roak BJ, et al. Exome sequencing in sporadic autism spectrum disorders identifies severe de

novo mutations. Nat Genet. 2011; 43:585–9. [PubMed: 21572417]

31. O’Roak BJ, et al. Sporadic autism exomes reveal a highly interconnected protein network of de

novo mutations. Nature. 2012; 485:246–50. [PubMed: 22495309]

32. Sanders SJ, et al. De novo mutations revealed by whole-exome sequencing are strongly associated

with autism. Nature. 2012; 485:237–41. [PubMed: 22495306]

33. Neale BM, et al. Patterns and rates of exonic de novo mutations in autism spectrum disorders.

Nature. 2012; 485:242–5. [PubMed: 22495311]

34. Iossifov I, et al. De novo gene disruptions in children on the autistic spectrum. Neuron. 2012;

74:285–99. [PubMed: 22542183]

35. Rauch A, et al. Range of genetic mutations associated with severe non-syndromic sporadic

intellectual disability: an exome sequencing study. Lancet. 2012

36. de Ligt J, et al. Diagnostic exome sequencing in persons with severe intellectual disability.

N Engl J Med. 2012

37. Cooper GM, et al. A copy number variation morbidity map of developmental delay. Nat Genet.

2011; 43:838–46. [PubMed: 21841781]

38. Ng SB, et al. Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome. Nat

Genet. 2010; 42:790–3. [PubMed: 20711175]

39. Rohland N, Reich D. Cost-effective, high-throughput DNA sequencing libraries for multiplexed

target capture. Genome Res. 2012; 22:939–46. [PubMed: 22267522]

40. Meyer M, et al. A high-coverage genome sequence from an archaic Denisovan individual. Science.

2012; 338:222–6. [PubMed: 22936568]

41. Hindorff LA, et al. Potential etiologic and functional implications of genome-wide association loci

for human diseases and traits. Proc Natl Acad Sci U S A. 2009; 106:9362–7. [PubMed: 19474294]

42. Nicolae DL, et al. Trait-associated SNPs are more likely to be eQTLs: annotation to enhance

discovery from GWAS. PLoS Genet. 2010; 6:e1000888. [PubMed: 20369019]

43. Gerstein MB, et al. Architecture of the human regulatory network derived from ENCODE data.

Nature. 2012; 489:91–100. [PubMed: 22955619]

44. Schaub MA, Boyle AP, Kundaje A, Batzoglou S, Snyder M. Linking disease associations with

regulatory information in the human genome. Genome Res. 2012; 22:1748–59. [PubMed:

22955986]

45. Gonzalez-Perez A, Lopez-Bigas N. Improving the assessment of the outcome of nonsynonymous

SNVs with a consensus deleteriousness score, Condel. Am J Hum Genet. 2011; 88:440–9.

[PubMed: 21457909]

46. Arbiza L, et al. Genome-wide inference of natural selection on human transcription factor binding

sites. Nat Genet. 2013; 45:723–9. [PubMed: 23749186]

47. Weedon MN, et al. Recessive mutations in a distal PTF1A enhancer cause isolated pancreatic

agenesis. Nat Genet. 2013; 46:61–4. [PubMed: 24212882]

48. Stenson PD, et al. The Human Gene Mutation Database: 2008 update. Genome Med. 2009; 1:13.

[PubMed: 19348700]

49. MacArthur DG, et al. A systematic survey of loss-of-function variants in human protein-coding

genes. Science. 2012; 335:823–8. [PubMed: 22344438]

50. Tavaré S. Some probabilistic and statistical problems in the analysis of DNA sequences. Lect Math

Life Sci. 1986; 17:57–86.

51. Fujita PA, et al. The UCSC Genome Browser database: update 2011. Nucleic Acids Res. 2011;

39:D876–82. [PubMed: 20959295]

52. Rosenbloom KR, et al. ENCODE whole-genome data in the UCSC Genome Browser: update

2012. Nucleic Acids Res. 2012; 40:D912–7. [PubMed: 22075998]

53. Hubisz MJ, Pollard KS, Siepel A. PHAST and RPHAST: phylogenetic analysis with space/time

models. Brief Bioinform. 2011; 12:41–51. [PubMed: 21278375]

54. Davydov EV, et al. Identifying a high fraction of the human genome to be under selective

constraint using GERP++. PLoS Comput Biol. 2010; 6:e1001025. [PubMed: 21152010]

55. McVicker G, Gordon D, Davis C, Green P. Widespread genomic signatures of natural selection in

hominid evolution. PLoS Genet. 2009; 5:e1000471. [PubMed: 19424416]

56. Hoffman MM, et al. Unsupervised pattern discovery in human chromatin structure through

genomic segmentation. Nat Methods. 2012; 9:473–6. [PubMed: 22426492]

57. Tennessen JA, et al. Evolution and functional impact of rare coding variation from deep

sequencing of human exomes. Science. 2012; 337:64–9. [PubMed: 22604720]

58. Khurana E, Fu Y, Chen J, Gerstein M. Interpretation of genomic variants using a unified biological

network approach. PLoS Comput Biol. 2013; 9:e1002886. [PubMed: 23505346]

59. Liu X, Jian X, Boerwinkle E. dbNSFP: a lightweight database of human nonsynonymous SNPs

and their functional predictions. Hum Mutat. 2011; 32:894–9. [PubMed: 21520341]

Figure 1.

Relationship of scaled C-scores and categorical variant consequences. The upper plot shows the proportion of substitutions with

a specific consequence for each scaled C-score bin, while the middle panel shows the proportion of substitutions with a specific

consequence after first normalizing by the total number of variants observed in that category. The legend indicates the median

and range of scaled C-score values for each category. Consequences are obtained from the Ensembl Variant Effect Predictor16

(Supplementary Note), e.g. “noncoding change” refers to changes in annotated non-coding transcripts. Detailed counts of

functional assignments in each C-score bin are in Supplementary Table 8. The lower panel shows violin plots of the median C-

scores of potential nonsense (stop-gained) variants for genes that: harbor at least 5 known pathogenic mutations48 (“disease”);

are predicted to be “essential”23; harbor variants associated with complex traits41 (“GWAS”); harbor at least 2 loss-of-function

mutations in 1000 Genomes49 (“LoF”); encode olfactory receptor proteins; or are in a random selection of 500 genes (“Other”;

see Supplementary Note).

Figure 2.

Relationship between scaled C-scores and: the average derived allele frequency (DAF) of variants identified in the 1000

Genomes Project14 or ESP24 (upper panel); the under-representation of polymorphic sites in 1000 Genomes (middle panel); and

chimpanzee lineage derived variants (lower panel). The dashed lines in the upper plot indicate the mean DAF and confidence

intervals indicate 1.96 × the standard error of the mean (SEM) of the DAF in each bin. Under-representation is defined as the proportion

of 1000 Genomes (middle panel) or chimpanzee-derived (lower panel) variants in a specific scaled C-score bin divided by the

frequency with which that scaled C-score is observed for all possible mutations of the human reference assembly (10^(C-score/−10)).

The stronger under-representation of chimpanzee-derived variants relative to 1000 Genomes variants is expected given that the

former are mostly fixed or high-frequency variants (and have survived many generations of purifying selection) while the latter

are mostly low-frequency variants. Depletion values in both panels for C-score bins other than 0 are significantly different from

expectation (binomial proportion test, all p-values < 10^−11).

Figure 3.

Receiver operating characteristics (ROC) for discriminating curated, pathogenic mutations defined by the NIH ClinVar

database27 matched to apparently benign ESP alleles (DAF ≥ 5%)24 with the same categorical consequence. The left panel

shows genome-wide variants for which GerpS, PhCons, and PhyloP scores are defined (n=16,334), while the middle panel limits

the analysis to missense changes (n=15,154), with missing values imputed to an upper value limit of each score, and the right panel

limits it to missense changes for which PolyPhen, SIFT and Grantham scores are all defined (n=13,358). Versions of the right panel that

exclude the overlap between PolyPhen training data and the ClinVar database or use a CADD model trained without PolyPhen

as a feature are shown in Supplementary Fig. 12. Area under the curve (AUC) values are provided in the figure legend for each

of the scores used.
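
For readers unfamiliar with the AUC summary, it equals the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen benign variant, counting ties as one half. The sketch below computes it by direct pairwise comparison on invented scores; it illustrates the statistic and is not the code used to produce the figure.

```python
def roc_auc(pathogenic_scores, benign_scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5).

    Quadratic in the number of variants, so suitable only for small examples.
    """
    wins = 0.0
    for p in pathogenic_scores:
        for b in benign_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(pathogenic_scores) * len(benign_scores))


# Invented scores: three of the four pathogenic/benign pairs are ordered correctly.
toy_auc = roc_auc([0.9, 0.4], [0.5, 0.1])
```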

Figure 4.

Ranking of pathogenic ClinVar variants among the variants identified by whole genome sequencing of eleven human

individuals from diverse populations. Left panel: Cumulative distributions of the ranks of 9,831 pathogenic ClinVar variants

when “spiked in” to each of 11 personal genomes. For example, C-scores of ~30% of ClinVar variants rank in the top 0.1% of

all variants within a personal genome, and most rank in the top 1%. About 25% of pathogenic ClinVar SNVs are not scored by

PolyPhen/SIFT because of missing values or its restriction to missense variation; note also that ranks for PolyPhen/SIFT are

computed among missense variants only and are therefore derived from far fewer total variants (see a plot restricted to missense

variation in Supplementary Fig. 16). Right panel: A QQ-plot of the C-scores of the SNVs identified from the eleven individuals

and pathogenic ClinVar SNVs. For a given scaled C-score observed in an individual, the fraction of that individual’s variants

with a C-score at least that large was computed (y-axis). The C-score corresponding to this quantile of the distribution of all

possible variants is displayed on the x-axis. High C-scores are underrepresented compared to the set of all possible variants. In

contrast, known disease-causal variants from ClinVar have large C-scores relative to the set of all possible variants. This fact

can be exploited to prioritize causal variants identified from whole genome sequencing of individual genomes (left panel and

Supplementary Tables 10–11).
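
The "spike-in" ranking in the left panel can be illustrated as follows: a known pathogenic variant is added to the variants of one personal genome, and its rank is expressed as the fraction of variants scoring at least as high. This is a minimal sketch on invented scores; the function name and the handling of ties are our assumptions.

```python
def top_fraction(spiked_score, genome_scores):
    """Fraction of variants (genome plus the spiked-in one) whose C-score
    is at least as high as that of the spiked-in variant."""
    at_least_as_high = 1 + sum(1 for s in genome_scores if s >= spiked_score)
    return at_least_as_high / (len(genome_scores) + 1)


# Toy genome with 999 variants scoring 0..998.
toy_genome = list(range(999))
best_case = top_fraction(1000, toy_genome)   # outranks every genome variant
worst_case = top_fraction(0, toy_genome)     # outranks none of them
```

A value of 0.001 or less corresponds to a variant ranking in the top 0.1% of a personal genome.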

Figure 5.

C-scores for GWAS SNPs are higher than those of nearby control SNPs and depend on study sample size. The average scaled C-score

(y-axis) is plotted for each category of SNP, as indicated by color, relative to the sample sizes of the association studies in which

the SNPs were identified (x-axis). Sample size bins are log2-scaled and mutually exclusive; for example, the bin labeled “1024”

represents all SNPs from studies with between 512 and 1024 samples. Error bars are ±1 standard error of the mean (SEM).

Shaded rectangles represent the overall, i.e. across all sample sizes, scaled C-score means ±1 SEM for each category as

indicated by the color.
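
The log2-scaled, mutually exclusive sample-size bins can be sketched as below. The caption does not state how the bin boundaries are handled; this illustration assumes each bin is labeled by its upper bound and includes it, so that the bin "1024" covers studies with 513 to 1024 samples.

```python
def sample_size_bin(n_samples):
    """Label of the log2-scaled bin for a study: the smallest power of two
    that is greater than or equal to the sample size."""
    if n_samples < 1:
        raise ValueError("sample size must be a positive integer")
    return 1 << (n_samples - 1).bit_length()


for n in (512, 513, 1000, 1024, 1025):
    print(n, "->", sample_size_bin(n))
```

Integer bit operations are used instead of floating-point logarithms to avoid rounding problems at exact powers of two.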

Inflammatory bowel disease and rheumatoid arthritis share a common genetic structure

Article

Full-text available

Jun 2024

Background The comorbidity rate of inflammatory bowel disease (IBD) and rheumatoid arthritis (RA) is high; nevertheless, the reasons behind this high rate remain unclear. Their similar genetic makeup probably contributes to this comorbidity. Methods Based on data obtained from the genome-wide association study of IBD and RA, we first assessed an overall genetic association by performing the linkage disequilibrium score regression (LDSC) analysis. Further, a local correlation analysis was performed by estimating the heritability in summary statistics. Next, the causality between the two diseases was analyzed by two-sample Mendelian randomization (MR). A genetic overlap was analyzed by the conditional/conjoint false discovery rate (cond/conjFDR) method.LDSC with specific expression of gene analysis was performed to identify related tissues between the two diseases. Finally, GWAS multi-trait analysis (MTAG) was also carried out. Results IBD and RA are correlated at the genomic level, both overall and locally. The MR results suggested that IBD induced RA. We identified 20 shared loci between IBD and RA on the basis of a conjFDR of <0.01. Additionally, we identified two tissues, namely spleen and small intestine terminal ileum, which were commonly associated with both IBD and RA. Conclusion Herein, we proved the presence of a polygenic overlap between the genetic makeup of IBD and RA and provided new insights into the genetic architecture and mechanisms underlying the high comorbidity between these two diseases.

Clinical and genetic characteristics of ALS patients with variants in genes regulating DNA methylation

Article

Full-text available

Jun 2024
J NEUROL

Background Aberrant DNA methylation alterations are implicated in amyotrophic lateral sclerosis (ALS). Nevertheless, the influence of genetic variants in genes regulating DNA methylation on ALS patients is not well understood. Therefore, we aim to provide a comprehensive variant profile of genes related to DNA methylation (DNMT1, DNMT3A, DNMT3B, DNMT3L) and demethylation (TET1, TET2, TET3, TDG) and to investigate the association of these variants with ALS. Methods Variants were screened in a cohort of 2240 ALS patients from Southwest China, using controls from the Genome Aggregation Database (n = 9976) and the China Metabolic Analytics Project (n = 10,588). The over-representation of rare variants and their association with ALS risk were evaluated using Fisher’s exact test with Bonferroni correction at both allele and gene levels. Kaplan–Meier analysis and Cox regression analysis were employed to explore the relationship between variants and survival. Results A total of 210 variants meeting the criteria were identified. Gene-based burden analysis identified a significant increase in ALS risk associated with rare variants in the TET2 gene (OR = 1.95, 95% CI = 1.29–2.88, P = 0.001). Survival analysis demonstrated that patients carrying variants in demethylation-related genes had a higher risk of death compared to those with methylation-related gene variants (HR = 1.29, 95% CI = 1.03–1.86, P = 0.039). Conclusions This study provides a genetic variant profile of genes involved in DNA methylation and demethylation regulation, along with the clinical characteristics of ALS patients carrying these variants. The findings offer genetic evidence implicating disrupted DNA methylation dynamics in ALS.

Shared and Unique Genetic Links between Neuroticism and Gastrointestinal Tract Diseases

Article

Full-text available

Jun 2024
DEPRESS ANXIETY

Objective. The association between neuroticism and gastrointestinal tract (GIT) diseases may not be attributable to the genetic overlaps between neuroticism and psychiatric disorders. We aim to explore the genetic links and mechanisms of neuroticism and GIT diseases. Materials and Methods. We obtained European genome-wide association data for neuroticism (n = 390,278) and its subclusters (depressed affect, n = 357,957; worry, n = 348,219) and for six GIT diseases: gastroesophageal reflux disease (GERD, n = 456,327), inflammatory bowel disease (IBD, n = 456,327), peptic ulcer disease (PUD, n = 456,327), irritable bowel syndrome (IBS, n = 486,601), Crohn’s disease (CD, n = 20,883), and ulcerative colitis (UC, n = 21,895). We performed genetic correlation analysis (high-definition likelihood method and cross-trait linkage disequilibrium score regression), pairwise pleiotropic analysis, single-nucleotide polymorphism annotation, Bayesian colocalization, gene-level analysis, transcriptome-wide association analysis, and gene set enrichment analysis. Results. Neuroticism and its subclusters are associated with most GIT diseases (15 of 18 trait-pairs). GERD and PUD were highly correlated with depressed affect. We identified pleiotropic loci 11q23.2 (mapped gene: NCAM1/DRD2) and 18q12.2 (mapped gene: CELF4) in neuroticism and IBS/GERD, supporting the genetic overlap between neuroticism and depression. We found that 16q12.1 (mapped gene: NKD1/ZNF423/NOD2) and 2q37.1 (mapped gene: ATG16L1/SP140) are highlighted only in the depressed affect/neuroticism and CD trait pairs, revealing pleiotropic loci with dissimilarities between neuroticism and different GIT diseases. Mendelian randomization (MR) analysis suggested that genetic liability to neuroticism is associated with increased risks of IBS, PUD, and GERD. Conclusion. Our findings document the genetic links between neuroticism and six GIT diseases, highlighting the genetic overlaps and heterogeneity between neuroticism and psychiatric disorders in the context of gastrointestinal disorders. Both the shared and unique pleiotropic loci identified between neuroticism and different GIT diseases could facilitate mechanistic understandings and may stimulate further translational implications.

Identification and analyses of exonic and copy number variants in spastic paraplegia

Article

Jun 2024

Hereditary spastic paraplegias are a diverse group of degenerative disorders that are clinically categorized as isolated, with involvement of lower limb spasticity, or symptomatic, where spastic paraplegia is complicated by further neurological features. We sought to identify the underlying genetic causes of these disorders in the participating patients. Three consanguineous families with multiple affected members were identified by visiting special schools in the Punjab Province. DNA was extracted from blood samples of the participants. Exome sequencing was performed for selected patients from the three families, and the data were filtered to identify rare homozygous variants. ExomeDepth was used for the delineation of the copy number variants. All patients had varying degrees of intellectual disabilities, poor speech development, spasticity, a wide-based gait or an inability to walk, and hypertonia. In family RDHR07, a homozygous deletion involving multiple exons and introns of SPG11 (NC000015.9:g.44894055_449028del) was found and correlated with the phenotype of the patients who had spasticity and other complex movement disorders, but not those who exhibited ataxic or indeterminate symptoms as well. In families ANMD03 and RDFA06, a nonsense variant in DDHD2, c.985C>T;p.(Arg329Ter), and a frameshift insertion-deletion variant in AP4B1, c.965-967delACTinsC;p.(Tyr322SerfsTer14), were identified; both were homozygous in the patients, while the obligate carriers in the respective pedigrees were heterozygous. All variants were ultra-rare, with no or very few carriers identified in the public databases. The three loss-of-function variants are likely to cause nonsense-mediated decay of the respective transcripts. Our research adds to the genetic variability associated with the SPG11 and AP4B1 variants and emphasizes the genetic heterogeneity of hereditary spastic paraplegia.

Identification and in silico structural analysis for the first de novo mutation in the cystic fibrosis transmembrane conductance regulator protein in Iran: case report and developmental insight using microsatellite markers

Article

Jun 2024
Ther Adv Respir Dis

Plain language summary: Identifying the first de novo mutation in the cystic fibrosis transmembrane conductance regulator protein in Iran: a case report with insights from microsatellite markers. A child can develop cystic fibrosis (CF) if both parents pass on mutated genes. In some rare cases, new genetic mutations occur spontaneously, causing CF. This report discusses a unique case where a child has one gene with a spontaneous mutation and inherits another gene mutation from the mother. We used a method called Sanger sequencing to find the two different gene changes in the affected person. We also used computer analysis to predict how these changes might affect the protein responsible for this genetic disease. To confirm that the child's new change is not inherited, we used a type of genetic marker called microsatellite markers. The mutation inherited from the mother and the new spontaneous mutation resulted in a unique change in the responsible protein. This mutation is located in a specific part of the protein called the lasso motif. Our computer simulations show that this mutation disrupts the interaction between the lasso motif and another part of the protein called the R-domain, which ultimately affects the protein's function. This case is significant because it is the first reported instance of a de novo mutation causing CF in Asia. It has important implications for genetic testing, counseling, and understanding how recessive genetic disorders like CF occur within the Iranian population.

Differences in Genomic Alterations and Accumulations of Heavy Metals Between Advanced Non-small Cell Lung Cancer Patients with and without Bone Metastasis

Article

Jan 2024

A novel variant in ADPRS disrupts ARH3 stability and subcellular localization in children with neurodegeneration and respiratory failure

Preprint

Jun 2024

Purpose: ADP-ribosylation is a post-translational modification involving the transfer of one or more ADP-ribose units from NAD+ to target proteins. Dysregulation of ADP-ribosylation is implicated in neurodegenerative diseases. Here we report a novel homozygous variant in the ADPRS gene (c.545A>G, p.His182Arg) encoding the mono(ADP-ribosyl) hydrolase ARH3 found in 2 patients with childhood-onset neurodegeneration with stress-induced ataxia and seizures (CONDSIAS). Methods: Genetic testing via exome sequencing was used to identify the underlying disease cause in two siblings with developmental delay, seizures, progressive muscle weakness, and respiratory failure following an episodic course. Studies in a cell culture model uncover biochemical and cellular consequences of the identified genetic change. Results: The ARH3 H182R variant affects a highly conserved residue in the active site of ARH3, leading to protein instability, degradation, and reduced expression. ARH3 H182R additionally fails to localize to the nucleus. The combination of reduced expression and mislocalization of ARH3 H182R resulted in accumulation of mono-ADP ribosylated species in cells. Conclusions: The children’s clinical course combined with the biochemical characterization of their genetic variant develops our understanding of the pathogenic mechanisms driving CONDSIAS and highlights a critical role for ARH3-regulated ADP ribosylation in nervous system integrity.

Next generation sequencing identifies WNT signalling as a significant pathway in Autosomal Recessive Polycystic Kidney Disease (ARPKD) manifestation and may be linked to disease severity

Article

Jun 2024
BBA-MOL BASIS DIS

Genetic burden of dysregulated cytoskeletal organisation in the pathogenesis of pulmonary fibrosis

Preprint

Jun 2024

Background: Pulmonary fibrosis (PF) is a shared characteristic of chronic interstitial lung diseases of mixed aetiology. Previous studies on PF highlight a pathogenic role for common and rare genetic variants. This study aimed to identify rare pathogenic variants that are enriched in distinct biological pathways and dysregulated gene expression. Methods: Rare variants were identified using whole genome sequencing (WGS) from two independent PF cohorts, the PROFILE study and the Genomics England 100K (GE100KGP) cohort, with the gnomAD database as a reference. Four pathogenic variant categories were defined: loss of function variants, missense variants, protein altering variants, and protein truncating variants. Gene burden testing was performed for rare variants defined as having a minor allele frequency <0.1%. Overrepresentation analysis of gene ontology terms and gene concept network analysis were used to interpret functional pathways. Integration of publicly available transcriptomic datasets was performed using weighted gene co-expression network analysis of idiopathic pulmonary fibrosis (IPF) lung tissue compared with healthy controls. Results: Burden testing was performed on 507 patients from the PROFILE study and 451 PF patients from GE100KGP cohort, compared with 76,156 control participants from the gnomAD database. Ninety genes containing significantly more pathogenic rare variants in cases than in controls were observed in both cohorts. Fifty-six genes included missense variants and 87 genes included protein altering variants. For missense variants, HMCN1, encoding hemicentin-1, and RGPD1, encoding a protein with a RanBD1 domain, were highly associated with PF in both PROFILE (p=5.70E-22 and p=4.48E-51, respectively) and GE100KGP cohorts (p=2.27E-24 and p=1.59E-36, respectively). 56 of 90 genes with significant burden were observed within modules correlated with disease in transcriptomic analysis, including HMCN1 and RGPD1. Enriched functional categories from genetic and transcriptomic analyses included pathways involving extracellular matrix constituents, cell adhesion properties and microtubule organisation. Conclusions: Rare pathogenic variant burden testing and weighted gene co-expression network analysis of transcriptomic data provided complementary evidence for pathways regulating cytoskeletal dynamics in PF pathogenesis. Functional validation of candidates could provide novel targets for intervention strategies.

Spectrum and genotype–phenotype relationship of ALPK3 variants in Chinese patients with hypertrophic cardiomyopathy

Article

Jun 2024

Exome sequencing in sporadic autism spectrum disorders identifies severe de novo mutations

Article

Mar 2012

Evidence for the etiology of autism spectrum disorders (ASDs) has consistently pointed to a strong genetic component complicated by substantial locus heterogeneity(1,2). We sequenced the exomes of 20 individuals with sporadic ASD (cases) and their parents, reasoning that these families would be enriched for de novo mutations of major effect. We identified 21 de novo mutations, 11 of which were protein altering. Protein-altering mutations were significantly enriched for changes at highly conserved residues. We identified potentially causative de novo events in 4 out of 20 probands, particularly among more severely affected individuals, in FOXP1, GRIN2B, SCN1A and LAMC3. In the FOXP1 mutation carrier, we also observed a rare inherited CNTNAP2 missense variant, and we provide functional support for a multi-hit model for disease risk(3). Our results show that trio-based exome sequencing is a powerful approach for identifying new candidate genes for ASDs and suggest that de novo mutations may contribute substantially to the genetic etiology of ASDs.

Recessive mutations in a distal PTF1A enhancer cause isolated pancreatic agenesis

Article

Nov 2013
Nat Genet

The contribution of cis-regulatory mutations to human disease remains poorly understood. Whole-genome sequencing can identify all noncoding variants, yet the discrimination of causal regulatory mutations represents a formidable challenge. We used epigenomic annotation in human embryonic stem cell (hESC)-derived pancreatic progenitor cells to guide the interpretation of whole-genome sequences from individuals with isolated pancreatic agenesis. This analysis uncovered six different recessive mutations in a previously uncharacterized ∼400-bp sequence located 25 kb downstream of PTF1A (encoding pancreas-specific transcription factor 1a) in ten families with pancreatic agenesis. We show that this region acts as a developmental enhancer of PTF1A and that the mutations abolish enhancer activity. These mutations are the most common cause of isolated pancreatic agenesis. Integrating genome sequencing and epigenomic annotation in a disease-relevant cell type can thus uncover new noncoding elements underlying human development and disease.

An integrated map of genetic variation from 1,092 human genomes (1000 Genomes Project Consortium, Nature 2012; 491: 56-65; doi:10.1038/nature11632)

Article

Nov 2012

By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.

Genome-wide inference of natural selection on human transcription factor binding sites

Article

Jun 2013
Nat Genet

For decades, it has been hypothesized that gene regulation has had a central role in human evolution, yet much remains unknown about the genome-wide impact of regulatory mutations. Here we use whole-genome sequences and genome-wide chromatin immunoprecipitation and sequencing data to demonstrate that natural selection has profoundly influenced human transcription factor binding sites since the divergence of humans from chimpanzees 4-6 million years ago. Our analysis uses a new probabilistic method, called INSIGHT, for measuring the influence of selection on collections of short, interspersed noncoding elements. We find that, on average, transcription factor binding sites have experienced somewhat weaker selection than protein-coding genes. However, the binding sites of several transcription factors show clear evidence of adaptation. Several measures of selection are strongly correlated with predicted binding affinity. Overall, regulatory elements seem to contribute substantially to both adaptive substitutions and deleterious polymorphisms with key implications for human evolution and disease.

An integrated map of genetic variation from 1

Article

Jan 2012

An integrated encyclopedia of DNA elements in the human genome

Article

Sep 2012

The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

A Systematic Survey of Loss-of-Function Variants in Human Protein-Coding Genes

Article

Apr 2012

D.G. MacArthur

De Novo Point Mutations, Revealed by Whole-Exome Sequencing, Are Strongly Associated with Autism Spectrum Disorders

Conference Paper

May 2012

Background: Multiple studies have confirmed the contribution of rare variations in chromosomal structure to the risk for Autism Spectrum Disorders (ASD). Large, multigenic de novo copy number variations (CNVs) have been found in 5-10% of probands from families with only a single affected individual, carrying markedly greater risks than those associated with common genetic polymorphisms. However, the overall contribution of de novo single nucleotide variants (SNVs) to ASD remains to be characterized. Objectives: To assess the frequency and distribution of de novo SNVs in ASD-affected individuals and in their unaffected siblings; to determine if de novo SNVs carry risk for ASD; and to identify specific disease-associated de novo SNVs. Methods: Whole-exome sequencing was performed on 872 individuals in 224 families selected from the Simons Simplex Collection (SSC). These were made up of 200 quartet families (father, mother, proband with ASD and unaffected sibling) and 24 trio families (father, mother and proband). De novo variants were predicted from the sequencing data and confirmed by PCR and Sanger sequencing. Results: We found that de novo, non-synonymous SNVs are significantly more common in probands than in unaffected siblings (p=0.01; OR=1.88; 95%CI: 1.08-3.28). This difference is more significant when we consider only those non-synonymous mutations present in brain-expressed genes (p=0.006; OR=2.15; CI: 1.10-4.20). In probands we estimate that at least 19% of all de novo SNVs, 41% of non-synonymous de novo SNVs in brain-expressed genes and 77% of nonsense/splice site mutations in brain-expressed genes carry risk for ASD. Based on the de novo mutation rate observed in unaffected siblings, we demonstrate that the observation of multiple independent de novo non-synonymous SNVs in the same brain-expressed gene among unrelated probands can reliably differentiate risk alleles from neutral substitutions. 
In the current study, among a total of 279 identified de novo coding mutations, there is only a single instance in probands, and none in siblings, in which two independent nonsense substitutions disrupt the same gene, SCN2A (Sodium Channel, Voltage-Gated, Type II, Alpha Subunit), a result that is unlikely by chance (p=0.01). Conclusions: In simplex families de novo SNVs carry risk for ASD. This risk is most readily apparent for non-synonymous variants and in brain-expressed genes. Specific mutations can be associated with ASD by virtue of multiple observations from different samples in the same gene and this approach offers a clear route to identify multiple ASD risk-associated genes in larger cohorts.

The Neutral Theory Of Molecular Evolution

Article

Oct 1983

MOTOO KIMURA

Motoo Kimura, as founder of the neutral theory, is uniquely placed to write this book. He first proposed the theory in 1968 to explain the unexpectedly high rate of evolutionary change and very large amount of intraspecific variability at the molecular level that had been uncovered by new techniques in molecular biology. The theory - which asserts that the great majority of evolutionary changes at the molecular level are caused not by Darwinian selection but by random drift of selectively neutral mutants - has caused controversy ever since. This book is the first comprehensive treatment of this subject and the author synthesises a wealth of material - ranging from a historical perspective, through recent molecular discoveries, to sophisticated mathematical arguments - all presented in a most lucid manner.

Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences

Article

Nov 1985

Simon Tavaré
