
MutationTaster2: Mutation prediction for the deep-sequencing age

March 2014
Nature Methods 11(4):361-2


DOI:10.1038/nmeth.2890

Source
PubMed

Authors:

Jana Marie Schwarz

Charité Universitätsmedizin Berlin

David N Cooper

Cardiff University

Markus Schuelke

Charité Universitätsmedizin Berlin

Dominik Seelow

Berliner Institut für Gesundheitsforschung


NATURE METHODS | VOL.11 NO.4 | APRIL 2014 | 361

CORRESPONDENCE

COMPETING FINANCIAL INTERESTS

The authors declare no competing financial interests.

Eric Lubeck¹,², Ahmet F Coskun¹,², Timur Zhiyentayev¹, Mubhij Ahmad¹ & Long Cai¹

¹Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California, USA. ²These authors contributed equally to this work.

e-mail: lcai@caltech.edu

1. Lubeck, E. & Cai, L. Nat. Methods 9, 743–748 (2012).

2. Ke, R. et al. Nat. Methods 10, 857–860 (2013).

3. Levesque, M.J., Ginart, P., Wei, Y. & Raj, A. Nat. Methods 10, 865–867 (2013).

4. Levesque, M.J. & Raj, A. Nat. Methods 10, 246–248 (2013).

MutationTaster2: mutation prediction for the deep-sequencing age

To the Editor:

The majority of the gene variants discovered by next-generation sequencing (NGS) projects are either intronic or synonymous. These variants are difficult to interpret because their effects on protein expression and function tend to be less obvious than those of missense or nonsense variants. Here we present MutationTaster2 (http://www.mutationtaster.org/), the latest version of our web-based software MutationTaster¹, which evaluates the pathogenic potential of DNA sequence alterations. It is designed to predict the functional consequences of not only amino acid substitutions but also intronic and synonymous alterations, short insertion and/or deletion (indel) mutations and variants spanning intron-exon borders.

MutationTaster2 includes all publicly available single-nucleotide polymorphisms (SNPs) and indels from the 1000 Genomes Project² (hereafter referred to as 1000G) as well as known disease variants from ClinVar³ and HGMD Public⁴. Alterations found more than four times in the homozygous state in 1000G or in HapMap⁵ are automatically regarded as neutral. Variants marked as pathogenic in ClinVar are automatically predicted to be disease causing, and the disease phenotype is displayed. We have integrated tests for regulatory features, including data from the ENCODE project⁶ and JASPAR⁷, and score the evolutionary conservation around DNA variants (Supplementary Methods). To reduce the number of false positive splice-site

four barcodes left out). We first immobilized cells on glass surfaces (Supplementary Methods). The DNA probes were hybridized, imaged and then removed by DNase I treatment (88.5% ± 11.0% efficiency (± standard deviation); Supplementary Fig. 2 and Supplementary Note). The remaining signal was photobleached (Supplementary Fig. 3). Even after six hybridizations, mRNAs were observed at 70.9% ± 21.8% of the original intensity (Supplementary Fig. 4). We observed that 77.9% ± 5.6% of the spots that colocalized in the first two hybridizations also colocalized with the third hybridization (Fig. 1b and Supplementary Figs. 5 and ). We quantified the mRNA abundances by counting the occurrence of corresponding barcodes in the cell (n = 37 cells; Supplementary Figs. 7 and ). We also show that mRNAs can be stripped and rehybridized efficiently in adherent mammalian cells (Supplementary Figs. 9 and ).

Sequential barcoding has many advantages. First, it scales up quickly; with even two dyes the coding capacity is in principle unlimited. Second, during each hybridization, all available FISH probes against a transcript can be used, thereby increasing the brightness of the FISH signal. Last, barcode readout is robust, enabling full z stacks on native samples.
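The scaling argument can be illustrated with a small enumeration. This is only a sketch under an assumption that the text does not spell out: a barcode is taken to be the ordered sequence of dye colors one transcript shows across hybridization rounds, so F dyes and N rounds allow F**N sequences. The dye labels and the function name are placeholders.

```python
from itertools import product


def enumerate_barcodes(dyes, rounds):
    """Return every ordered dye sequence of length `rounds`."""
    return list(product(dyes, repeat=rounds))


# Placeholder dye labels; the figure caption describes four-color imaging.
four_dyes = ["dye_A", "dye_B", "dye_C", "dye_D"]

two_rounds = enumerate_barcodes(four_dyes, 2)
print(len(two_rounds))      # 4 ** 2 = 16 possible barcodes
print(len(two_rounds) - 4)  # 12 remain if four barcodes are left unused

# Even with two dyes the count grows geometrically with the number of rounds.
print(len(enumerate_barcodes(["dye_A", "dye_B"], 10)))  # 2 ** 10 = 1024
```

With four colors and two rounds this gives 16 sequences, which is consistent with the twelve encoded genes and four unused barcodes mentioned in the letter.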

This barcoding scheme is conceptually akin to sequencing transcripts in single cells with FISH. In contrast with the technique used by Ke et al.², our method takes advantage of the high hybridization efficiency of FISH (>95% of the mRNAs are detected¹,³) and the fact that base-pair resolution is usually not needed to uniquely identify a transcript. We note that FISH probes can also be designed to resolve a large number of splice isoforms and single-nucleotide polymorphisms³, as well as chromosome loci⁴, in single cells. In combination with our previous report of super-resolution FISH¹, the sequential barcoding method will enable the transcriptome to be directly imaged at single-cell resolution in complex samples such as brain tissue.

Note: Any Supplementary Information and Source Data files are available in the online version of the paper (doi:10.1038/nmeth.2892).

ACKNOWLEDGMENTS

This work is funded by US National Institutes of Health single-cell analysis program award R01HD075605.

[Figure 1 image. Panel a labels: Hyb 1 (FISH probes with purple dye on mRNA), DNase I and rehybridization, Hyb 2 (same probes with blue dye), continuing to Hyb N (same probes with green dye); barcode number scales with N hybridizations. Panel b labels: composite four-color FISH images for hybridization 1 (probe set 1), hybridization 2 (probe set 2) and hybridization 3 (probe set 1). Scale bars, 5 μm (main images) and 1 μm.]

Figure 1 | Sequential barcoding. (a) Schematic of sequential barcoding. In each round of hybridization, 24 probes are hybridized on each transcript, imaged and then stripped by DNase I treatment. The same probe sequences are used in different rounds of hybridization (hyb), but probes are coupled to different fluorophores. (b) Composite four-color FISH data from three rounds of hybridizations on multiple yeast cells. Twelve genes are encoded by two rounds of hybridization, with the third hybridization using the same probes as hybridization 1. The boxed regions are magnified in the bottom right corner of each image. Spots colocalizing between hybridizations are detected (as outlined in insets) and have their barcodes extracted. Spots without colocalization are due to nonspecific binding of probes in the cell as well as mishybridization. The number of instances of each barcode can be quantified to provide the abundances of the corresponding transcripts in single cells.


with a slight increase in the simple_aae model (from 87.2% in MutationTaster to 88.6% in MutationTaster2) and substantial changes in the without_aae model (from 82.7% to 92.2%) and the complex_aae model (from 79.3% to 90.7%) (Supplementary Table 2).

We compared the predictions of the web versions of MutationTaster2, SIFT (http://sift.jcvi.org/), PolyPhen-2 (http://genetics.bwh.harvard.edu/pph2/) and PROVEAN (http://provean.jcvi.org/index.php) on 1,100 polymorphisms and 1,100 disease mutations with variants causing single amino acid exchanges. MutationTaster2 had the highest accuracy (88%) of the tools tested (Table 1). The actual performance of MutationTaster2 is even better because the program automatically detects and categorizes confirmed polymorphisms and known disease mutations. In a real-world example using exome data, MutationTaster2 yielded a false positive rate of 1% for homozygous alterations (Supplementary Table 3 and Supplementary Methods).
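The automatic categorization mentioned here rests on the two database rules stated earlier in the letter: alterations seen more than four times in the homozygous state in 1000G or HapMap are treated as neutral, and variants marked pathogenic in ClinVar are reported as disease causing together with the phenotype. The following is a minimal sketch of such a pre-check. The record type, field names and function name are hypothetical, and the order in which the two rules are tested is an assumption, since the letter does not say which takes precedence.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# "More than four times in the homozygous state" (threshold from the letter).
HOMOZYGOUS_NEUTRAL_CUTOFF = 4


@dataclass
class KnownVariantEvidence:
    """Hypothetical summary of database lookups for one alteration."""
    homozygous_count_1000g: int = 0
    homozygous_count_hapmap: int = 0
    clinvar_pathogenic: bool = False
    clinvar_phenotype: Optional[str] = None


def automatic_call(ev: KnownVariantEvidence) -> Optional[Tuple[str, Optional[str]]]:
    """Return (call, phenotype) if a database rule applies, else None."""
    # Assumption: the ClinVar rule is checked first.
    if ev.clinvar_pathogenic:
        return ("disease causing", ev.clinvar_phenotype)
    if (ev.homozygous_count_1000g > HOMOZYGOUS_NEUTRAL_CUTOFF
            or ev.homozygous_count_hapmap > HOMOZYGOUS_NEUTRAL_CUTOFF):
        return ("neutral", None)
    # No rule applies: the alteration would go on to the classifier.
    return None


print(automatic_call(KnownVariantEvidence(homozygous_count_1000g=5)))
print(automatic_call(KnownVariantEvidence(homozygous_count_hapmap=4)))
print(automatic_call(KnownVariantEvidence(clinvar_pathogenic=True,
                                          clinvar_phenotype="example phenotype")))
```

Variants settled by such a rule never reach the classifier, which is why benchmark accuracy measured on the classifier alone would understate performance on real data.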

The major drawback of MutationTaster2 is its limitation to intragenic variants. With the advance of whole-genome sequencing projects, it should be possible to overcome this limitation in the future. It should be noted that MutationTaster2 has been designed specifically to aid the identification of rare variants with severe impact (as in monogenic disorders) and is not intended to predict the consequences of common variants with small effects.

Note: Any Supplementary Information and Source Data files are available in the online version of the paper (doi:10.1038/nmeth.2890).

ACKNOWLEDGMENTS

This work is supported by grants from the Deutsche Forschungsgemeinschaft (SFB665 TP-C4) to M.S., the Einsteinstiftung Berlin (A-2011-63) to J.M.S. and M.S. and the German Bundesministerium für Bildung und Forschung (mitoNET 01GM1113D) to D.S. and M.S. M.S. is a member of the NeuroCure Center of Excellence (Exc 257).

COMPETING FINANCIAL INTERESTS

The authors declare competing financial interests: details are available in the online version of the paper (doi:10.1038/nmeth.2890).

Jana Marie Schwarz¹,², David N Cooper³, Markus Schuelke¹,² & Dominik Seelow¹,²

¹Department of Neuropediatrics, Charité – Universitätsmedizin Berlin, Berlin, Germany. ²NeuroCure Clinical Research Center, Charité – Universitätsmedizin Berlin, Berlin, Germany. ³Institute of Medical Genetics, Cardiff University, Cardiff, UK.

e-mail: dominik.seelow@charite.de

1. Schwarz, J.M., Rödelsperger, C., Schuelke, M. & Seelow, D. Nat. Methods 7, 575–576 (2010).

2. The 1000 Genomes Project Consortium. Nature 491, 56–65 (2012).

3. Landrum, M.J. et al. Nucleic Acids Res. 42, D980–D985 (2014).

4. Stenson, P.D. et al. Hum. Genet. 133, 1–9 (2014).

5. Altshuler, D.M. et al. Nature 467, 52–58 (2010).

6. The ENCODE Project Consortium. Nature 489, 57–74 (2012).

7. Portales-Casamar, E. et al. Nucleic Acids Res. 38, D105–D110 (2010).

8. Seelow, D., Schwarz, J.M. & Schuelke, M. PLoS ONE 3, e3874 (2008).

predictions, MutationTaster2 considers loss or decreased strength of splice sites only at existing intron-exon borders. A sequence change within 2 base pairs of an intron-exon junction is regarded as the loss of a splice site. As a further improvement, MutationTaster2 is able to analyze sequence alterations spanning an intron-exon junction; these are likely to perturb normal splicing and hence have considerable pathogenic potential.
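The two junction rules in this paragraph can be written down as a simple interval check. The sketch below is illustrative only and makes its own assumptions: coordinates are 1-based and inclusive, the junction is taken to lie between position `junction_left` and `junction_left + 1`, and "within 2 base pairs" is read as touching any of the two bases on either side of that boundary. The letter does not specify these conventions, and the function name and labels are hypothetical.

```python
def classify_against_junction(var_start, var_end, junction_left):
    """Classify an alteration relative to one intron-exon junction.

    var_start, var_end: first and last altered base (1-based, inclusive).
    junction_left: last base on the left side of the junction; the
    junction lies between junction_left and junction_left + 1.
    """
    if var_start > var_end:
        raise ValueError("var_start must not exceed var_end")
    # The alteration covers bases on both sides of the boundary.
    if var_start <= junction_left and var_end >= junction_left + 1:
        return "spans_junction"
    # The alteration touches one of the two bases flanking either side.
    if var_start <= junction_left + 2 and var_end >= junction_left - 1:
        return "within_2bp"
    return "distant"


# Junction between positions 100 and 101.
print(classify_against_junction(101, 101, 100))  # within_2bp
print(classify_against_junction(99, 102, 100))   # spans_junction
print(classify_against_junction(103, 103, 100))  # distant
```

Under the letter's rules, the first label would be treated as loss of a splice site and the second as an alteration likely to perturb splicing; anything labeled "distant" is not scored as a splice-site loss, which is how restricting the test to existing borders limits false positives.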

We were able to substantially increase the speed of MutationTaster2 by caching BLASTP results from protein-conservation analysis and by implementing our own function to search for changes in the amino acid sequence. A single analysis now takes less than 0.10 seconds on average.
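The benefit of caching an expensive alignment step can be shown with a generic memoization sketch. This is not the MutationTaster2 implementation: the letter does not describe how results are stored, so an in-memory cache from the Python standard library is assumed here, and the function below is a stand-in that only counts how often the "expensive" step actually runs.

```python
from functools import lru_cache

# Counts real executions of the expensive step.
calls = {"count": 0}


@lru_cache(maxsize=None)
def conservation_for_protein(protein_id):
    """Stand-in for a costly per-protein analysis such as a BLASTP run."""
    calls["count"] += 1
    return f"alignment-for-{protein_id}"  # placeholder result


# Many variants fall in the same protein, so most lookups hit the cache.
for pid in ["P1", "P2", "P1", "P1"]:
    conservation_for_protein(pid)

print(calls["count"])  # 2: one execution per distinct protein
```

Because the conservation result depends on the protein and not on the individual variant, it needs to be computed only once per protein, however many variants are analyzed in it.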

For the rapid and user-friendly analysis of NGS results, we created a dedicated query engine. Users can upload VCF files and adjust several parameters, such as confining consideration to homozygous variants or certain regions and filtering for known polymorphisms. Job-scheduler software processes the genotypes in a highly parallel fashion (500,000 alterations per hour). Users can opt to be notified by e-mail when the process is complete. The results can be filtered, prioritized and inspected in a web browser or downloaded. We integrated our candidate-gene search engine, GeneDistiller⁸, to let users determine the most likely candidate genes among the potentially deleterious variants. In addition, we developed a web interface for single queries using chromosomal positions. MutationTaster2 automatically maps the variant to all suitable genes and transcripts, analyzes the variant in all of them and displays a table summarizing the predictions for all transcripts and detailed results for each transcript.
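The kind of restriction described for VCF uploads (homozygous variants only, optionally within a region) can be sketched with plain text parsing. This is a minimal illustration and not the query engine itself: the function names are hypothetical, only the first sample column is read, multi-allelic sites are not split, and the example records are invented.

```python
def is_homozygous_alt(genotype):
    """True for diploid genotypes such as 1/1 or 1|1 (same non-reference allele)."""
    alleles = genotype.replace("|", "/").split("/")
    return (len(alleles) == 2 and alleles[0] == alleles[1]
            and alleles[0] not in ("0", "."))


def filter_vcf(lines, region=None):
    """Keep homozygous non-reference records, optionally inside `region`.

    region: optional (chrom, start, end) tuple with inclusive coordinates.
    Returns a list of (chrom, pos, ref, alt) tuples.
    """
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta-information and the header
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        # Pair FORMAT keys with the first sample's values to find GT.
        sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
        if not is_homozygous_alt(sample.get("GT", ".")):
            continue
        if region is not None:
            r_chrom, r_start, r_end = region
            if chrom != r_chrom or not (r_start <= pos <= r_end):
                continue
        kept.append((chrom, pos, ref, alt))
    return kept


# Invented example records.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1",
    "1\t1000\t.\tA\tG\t50\tPASS\t.\tGT\t1/1",
    "1\t2000\t.\tC\tT\t50\tPASS\t.\tGT\t0/1",
    "2\t3000\t.\tG\tA\t50\tPASS\t.\tGT:DP\t1|1:20",
]
print(filter_vcf(example))
print(filter_vcf(example, region=("1", 1, 1500)))
```

The first call keeps the two homozygous records and drops the heterozygous one; the second additionally limits the result to the first 1,500 bases of chromosome 1.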

As with its predecessor, MutationTaster2 uses a Bayes classifier to generate predictions. Because alterations with different effects on the protein sequence require different tests, we use three classification models, designed for alterations that lead to single amino acid substitutions (‘simple_aae’), involve more than one amino acid (‘complex_aae’) or are noncoding or synonymous (‘without_aae’). MutationTaster2 was trained and tested with single base exchanges and short indels, comprising >6,000,000 validated polymorphisms from 1000G and (with permission from BIOBASE) >100,000 known disease mutations from HGMD Professional (Supplementary Table 1).
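How the three models and a Bayes classifier might fit together can be sketched as follows. Several things here are assumptions made for illustration: the letter says only "Bayes classifier", so a naive Bayes form with conditionally independent test outcomes is assumed; choosing the model from the number of affected amino acids is a simplification of the descriptions above; and all probabilities in the example are invented.

```python
def choose_model(n_affected_residues):
    """Pick a model name from the number of affected amino acids (simplified)."""
    if n_affected_residues == 0:
        return "without_aae"   # noncoding or synonymous
    if n_affected_residues == 1:
        return "simple_aae"    # single amino acid substitution
    return "complex_aae"       # more than one amino acid involved


def naive_bayes_posterior(prior_disease, likelihood_pairs):
    """Posterior probability of 'disease causing' under naive Bayes.

    likelihood_pairs: one (P(outcome | disease), P(outcome | neutral))
    pair per test outcome observed for the alteration.
    """
    p_disease = prior_disease
    p_neutral = 1.0 - prior_disease
    for given_disease, given_neutral in likelihood_pairs:
        p_disease *= given_disease
        p_neutral *= given_neutral
    return p_disease / (p_disease + p_neutral)


# Invented likelihoods for two test outcomes of a missense alteration.
print(choose_model(1))
print(round(naive_bayes_posterior(0.5, [(0.8, 0.2), (0.6, 0.3)]), 3))
```

In this toy case the two outcomes raise the probability of "disease causing" from the prior of 0.5 to about 0.889. Keeping separate models means each one only uses tests that are meaningful for its class of alteration.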

We were able to improve the accuracy in all classification models,

Table 1 | Comparison between MutationTaster2 and other prediction tools

Tool             n      NPV    PPV    Sensitivity  Specificity  Accuracy
PPH2-var         2,200  0.808  0.875  0.789        0.887        0.838
PPH2-div         2,200  0.853  0.827  0.858        0.821        0.840
PROVEAN          2,200  0.798  0.865  0.778        0.878        0.828
SIFT             2,200  0.832  0.854  0.827        0.858        0.843
MutationTaster1  2,200  0.850  0.870  0.846        0.874        0.860
MutationTaster2  2,200  0.886  0.875  0.887        0.874        0.880

Details about the methods and further statistics are presented in Supplementary Methods and at http://www.mutationtaster.org/info/statistics.html. n, number of cases; NPV, negative prediction value; PPV, positive prediction value; PPH2-div, PolyPhen-2 with HumDiv classifier; PPH2-var, PolyPhen-2 with HumVar classifier.
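The columns of Table 1 follow from a two-by-two confusion matrix. The sketch below recomputes the MutationTaster2 row from counts that are not reported in the letter: they were back-calculated here on the assumption that the 2,200 cases consist of 1,100 disease mutations and 1,100 polymorphisms and that every case received a prediction.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Standard test statistics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # disease mutations called disease causing
        "specificity": tn / (tn + fp),   # polymorphisms called neutral
        "ppv": tp / (tp + fp),           # disease calls that are correct
        "npv": tn / (tn + fn),           # neutral calls that are correct
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Back-calculated (assumed) counts for the MutationTaster2 row.
mt2 = diagnostic_metrics(tp=976, fn=124, tn=961, fp=139)
for name, value in mt2.items():
    print(f"{name}: {value:.3f}")
```

These counts reproduce the tabulated values (0.887, 0.874, 0.875, 0.886 and 0.880). Because the two classes are the same size, accuracy equals the mean of sensitivity and specificity, which also holds for the other rows of the table.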


An integrated map of genetic variation from 1,092 human genomes Consortium GP Nature 2012 491 56 65 10.1038/nature11632

Article

Full-text available

Nov 2012

By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.

ClinVar: Public archive of relationships among sequence variation and human phenotype

Article

Full-text available

Nov 2013
NUCLEIC ACIDS RES

ClinVar (http://www.ncbi.nlm.nih.gov/clinvar/) provides a freely available archive of reports of relationships among medically important variants and phenotypes. ClinVar accessions submissions reporting human variation, interpretations of the relationship of that variation to human health and the evidence supporting each interpretation. The database is tightly coupled with dbSNP and dbVar, which maintain information about the location of variation on human assemblies. ClinVar is also based on the phenotypic descriptions maintained in MedGen (http://www.ncbi.nlm.nih.gov/medgen). Each ClinVar record represents the submitter, the variation and the phenotype, i.e. the unit that is assigned an accession of the format SCV000000000.0. The submitter can update the submission at any time, in which case a new version is assigned. To facilitate evaluation of the medical importance of each variant, ClinVar aggregates submissions with the same variation/phenotype combination, adds value from other NCBI databases, assigns a distinct accession of the format RCV000000000.0 and reports if there are conflicting clinical interpretations. Data in ClinVar are available in multiple formats, including html, download as XML, VCF or tab-delimited subsets. Data from ClinVar are provided as annotation tracks on genomic RefSeqs and are used in tools such as Variation Reporter (http://www.ncbi.nlm.nih.gov/variation/tools/reporter), which reports what is known about variation based on user-supplied locations.

The Human Gene Mutation Database: Building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine

Article

Full-text available

Sep 2013
HUM GENET

The Human Gene Mutation Database (HGMD(®)) is a comprehensive collection of germline mutations in nuclear genes that underlie, or are associated with, human inherited disease. By June 2013, the database contained over 141,000 different lesions detected in over 5,700 different genes, with new mutation entries currently accumulating at a rate exceeding 10,000 per annum. HGMD was originally established in 1996 for the scientific study of mutational mechanisms in human genes. However, it has since acquired a much broader utility as a central unified disease-oriented mutation repository utilized by human molecular geneticists, genome scientists, molecular biologists, clinicians and genetic counsellors as well as by those specializing in biopharmaceuticals, bioinformatics and personalized genomics. The public version of HGMD ( http://www.hgmd.org ) is freely available to registered users from academic institutions/non-profit organizations whilst the subscription version (HGMD Professional) is available to academic, clinical and commercial users under license via BIOBASE GmbH.

The ENCODE Project Consortium: An integrated encyclopedia of DNA elements in the human genome. 2012. Nature 489: 57–74

Sep 2012

The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research. The human genome sequence provides the underlying code for human biology. Despite intensive study, especially in identifying protein-coding genes, our understanding of the genome is far from complete, particularly with regard to non-coding RNAs, alternatively spliced transcripts and regulatory sequences. Systematic analyses of transcripts and regulatory information are essential for the identification of genes and regulatory regions, and are an important resource for the study of human biology and disease. Such analyses can also provide comprehensive views of the organization and variability of genes and regulatory information across cellular contexts, species and individuals.

The International HapMap 3 Consortium. Integrating common and rare genetic variation in diverse human populations. Nature 467: 52-58

Sep 2010
NATURE

Despite great progress in identifying genetic variants that influence human disease, most inherited risk remains unexplained. A more complete understanding requires genome-wide studies that fully examine less common alleles in populations with a wide range of ancestry. To inform the design and interpretation of such studies, we genotyped 1.6 million common single nucleotide polymorphisms (SNPs) in 1,184 reference individuals from 11 global populations, and sequenced ten 100-kilobase regions in 692 of these individuals. This integrated data set of common and rare alleles, called 'HapMap 3', includes both SNPs and copy number polymorphisms (CNPs). We characterized population-specific differences among low-frequency variants, measured the improvement in imputation accuracy afforded by the larger reference panel, especially in imputing SNPs with a minor allele frequency of ≤5%, and demonstrated the feasibility of imputing newly discovered CNPs and SNPs. This expanded public resource of genome variants in global populations supports deeper interrogation of genomic variation and its role in human disease, and serves as a step towards a high-resolution map of the landscape of human genetic variation.

JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res 38:D105-D110

Nov 2009
NUCLEIC ACIDS RES

JASPAR (http://jaspar.genereg.net) is the leading open-access database of matrix profiles describing the DNA-binding patterns of transcription factors (TFs) and other proteins interacting with DNA in a sequence-specific manner. Its fourth major release is the largest expansion of the core database to date: the database now holds 457 non-redundant, curated profiles. The new entries include the first batch of profiles derived from ChIP-seq and ChIP-chip whole-genome binding experiments, and 177 yeast TF binding profiles. The introduction of a yeast division brings the convenience of JASPAR to an active research community. As binding models are refined by newer data, the JASPAR database now uses versioning of matrices: in this release, 12% of the older models were updated to improved versions. Classification of TF families has been improved by adopting a new DNA-binding domain nomenclature. A curated catalog of mammalian TFs is provided, extending the use of the JASPAR profiles to additional TFs belonging to the same structural family. The changes in the database set the system ready for more rapid acquisition of new high-throughput data sources. Additionally, three new special collections provide matrix profile data produced by recent alternative high-throughput approaches.

GeneDistiller—Distilling Candidate Genes from Linkage Intervals

Feb 2008
PLOS ONE

Linkage studies often yield intervals containing several hundred positional candidate genes. Different manual or automatic approaches exist for the determination of the gene most likely to cause the disease. While the manual search is very flexible and takes advantage of the researchers' background knowledge and intuition, it may be very cumbersome to collect and study the relevant data. Automatic solutions on the other hand usually focus on certain models, remain "black boxes" and do not offer the same degree of flexibility. We have developed a web-based application that combines the advantages of both approaches. Information from various data sources such as gene-phenotype associations, gene expression patterns and protein-protein interactions was integrated into a central database. Researchers can select which information for the genes within a candidate interval or for single genes shall be displayed. Genes can also interactively be filtered, sorted and prioritised according to criteria derived from the background knowledge and preconception of the disease under scrutiny. GeneDistiller provides knowledge-driven, fully interactive and intuitive access to multiple data sources. It displays maximum relevant information, while saving the user from drowning in the flood of data. A typical query takes less than two seconds, thus allowing an interactive and explorative approach to the hunt for the candidate gene. ACCESS: GeneDistiller can be freely accessed at http://www.genedistiller.org.

MutationTaster Evaluates Disease-causing Potential of Sequence Alterations

Aug 2010
Br J Pharmacol

Portales-Casamar, E. et al. Nucleic Acids Res. 38, D105-D110 (2010).
