MutationTaster2: Mutation prediction for the deep-sequencing age

NATURE METHODS | VOL.11 NO.4 | APRIL 2014 | 361
CORRESPONDENCE
COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
Eric Lubeck1,2, Ahmet F Coskun1,2, Timur Zhiyentayev1,
Mubhij Ahmad1 & Long Cai1
1Division of Chemistry and Chemical Engineering, California Institute of Technology,
Pasadena, California, USA. 2These authors contributed equally to this work.
e-mail: lcai@caltech.edu
1. Lubeck, E. & Cai, L. Nat. Methods 9, 743–748 (2012).
2. Ke, R. et al. Nat. Methods 10, 857–860 (2013).
3. Levesque, M.J., Ginart, P., Wei, Y. & Raj, A. Nat. Methods 10, 865–867 (2013).
4. Levesque, M.J. & Raj, A. Nat. Methods 10, 246–248 (2013).
MutationTaster2: mutation prediction
for the deep-sequencing age
To the Editor:
The majority of the gene variants discovered by next-
generation sequencing (NGS) projects are either intronic or synony-
mous. These variants are difficult to interpret because their effects on
protein expression and function tend to be less obvious than those
of missense or nonsense variants. Here we present MutationTaster2
(http://www.mutationtaster.org/), the latest version of our web-based
software MutationTaster1, which evaluates the pathogenic potential
of DNA sequence alterations. It is designed to predict the functional
consequences of not only amino acid substitutions but also intronic
and synonymous alterations, short insertion and/or deletion (indel)
mutations and variants spanning intron-exon borders.
MutationTaster2 includes all publicly available single-nucleotide
polymorphisms (SNPs) and indels from the 1000 Genomes Project2
(hereafter referred to as 1000G) as well as known disease variants from
ClinVar3 and HGMD Public4. Alterations found more than four times
in the homozygous state in 1000G or in HapMap5 are automatically
regarded as neutral. Variants marked as pathogenic in ClinVar are
automatically predicted to be disease causing, and the disease phe-
notype is displayed. We have integrated tests for regulatory features,
including data from the ENCODE project6 and JASPAR7, and score
the evolutionary conservation around DNA variants (Supplementary
Methods). To reduce the number of false positive splice-site
four barcodes left out). We first immobilized cells on glass
surfaces (Supplementary Methods). The DNA probes were hybridized,
imaged and then removed by DNase I treatment (88.5% ± 11.0%
efficiency (± standard deviation); Supplementary Fig. 2 and
Supplementary Note). The remaining signal was photobleached
(Supplementary Fig. 3). Even after six hybridizations, mRNAs were
observed at 70.9% ± 21.8% of the original intensity (Supplementary
Fig. 4). We observed that 77.9% ± 5.6% of the spots that colocalized
in the first two hybridizations also colocalized with the third
hybridization (Fig. 1b and Supplementary Figs. 5 and 6). We
quantified the mRNA abundances by counting the occurrence of
corresponding barcodes in the cell (n = 37 cells; Supplementary
Figs. 7 and 8). We also show that mRNAs can be stripped and
rehybridized efficiently in adherent mammalian cells (Supplementary
Figs. 9 and 10).
Sequential barcoding has many advantages. First, it scales up
quickly; with even two dyes the coding capacity is in principle
unlimited. Second, during each hybridization, all available FISH
probes against a transcript can be used, thereby increasing the
brightness of the FISH signal. Last, barcode readout is robust,
enabling full z stacks on native samples.
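The scaling argument can be illustrated with a minimal Python sketch. It assumes that a barcode is simply the ordered sequence of dyes seen on a transcript across rounds, so that F dyes and N rounds give F^N barcodes; the function names and the dye name "red" are hypothetical and not taken from the letter.

```python
from itertools import product


def coding_capacity(num_dyes: int, num_rounds: int) -> int:
    """Number of distinct barcodes for F dyes and N rounds: F ** N."""
    return num_dyes ** num_rounds


def enumerate_barcodes(dyes, num_rounds):
    """List every ordered dye sequence of length num_rounds."""
    return list(product(dyes, repeat=num_rounds))


# Hypothetical dye labels; only the count of dyes matters here.
dyes = ["purple", "blue", "green", "red"]
barcodes = enumerate_barcodes(dyes, 2)

print(coding_capacity(len(dyes), 2))   # 16
print(coding_capacity(2, 10))          # 1024
```

With four colors and two rounds the sketch yields 16 barcodes, which would be consistent with twelve encoded genes and four barcodes left out, as mentioned in the text and in Figure 1.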
This barcoding scheme is conceptually akin to sequencing tran-
scripts in single cells with FISH. In contrast with the technique
used by Ke et al.2, our method takes advantage of the high hybrid-
ization efficiency of FISH (>95% of the mRNAs are detected1,3) and
the fact that base-pair resolution is usually not needed to uniquely
identify a transcript. We note that FISH probes can also be designed
to resolve a large number of splice isoforms and single-nucleotide
polymorphisms3, as well as chromosome loci4, in single cells. In
combination with our previous report of super-resolution FISH1,
the sequential barcoding method will enable the transcriptome to
be directly imaged at single-cell resolution in complex samples such
as brain tissue.
Note: Any Supplementary Information and Source Data files are available in the online
version of the paper (doi:10.1038/nmeth.2892).
ACKNOWLEDGMENTS
This work is funded by US National Institutes of Health single-cell analysis
program award R01HD075605.
[Figure 1 panels. (a) Schematic: FISH probes with purple dye (Hyb 1);
DNase I and rehybridization; same probes with blue dye (Hyb 2); DNase I
and rehybridization over N hybs; same probes with green dye (Hyb N);
barcode number scales as F^N. (b) Composite four-color FISH images:
Hybridization 1 – probe set 1; Hybridization 2 – probe set 2;
Hybridization 3 – probe set 1. Scale bars, 5 μm and 1 μm.]
Figure 1 | Sequential barcoding. (a) Schematic
of sequential barcoding. In each round of
hybridization, 24 probes are hybridized on each
transcript, imaged and then stripped by DNase I
treatment. The same probe sequences are used
in different rounds of hybridization (hyb), but
probes are coupled to different fluorophores. (b)
Composite four-color FISH data from three rounds
of hybridizations on multiple yeast cells. Twelve
genes are encoded by two rounds of hybridization,
with the third hybridization using the same
probes as hybridization 1. The boxed regions are
magnified in the bottom right corner of each
image. Spots colocalizing between hybridizations
are detected (as outlined in insets) and have their
barcodes extracted. Spots without colocalization
are due to nonspecific binding of probes in the
cell as well as mishybridization. The number of
instances of each barcode can be quantified to
provide the abundances of the corresponding
transcripts in single cells.
npg © 2014 Nature America, Inc. All rights reserved.
with a slight increase in the simple_aae
model (from 87.2% in MutationTaster
to 88.6% in MutationTaster2) and
substantial changes in the without_aae
model (from 82.7% to 92.2%) and the
complex_aae model (from 79.3% to 90.7%)
(Supplementary Table 2).
We compared the predictions of the web versions of MutationTaster2,
SIFT (http://sift.jcvi.org/), PolyPhen-2
(http://genetics.bwh.harvard.edu/pph2/) and PROVEAN
(http://provean.jcvi.org/index.php) on 1,100 polymorphisms and 1,100
disease mutations with variants causing single amino acid exchanges.
MutationTaster2 had the highest accuracy (88%) of the tools tested
(Table 1). The actual performance of MutationTaster2 is even better
because the program automatically detects and categorizes confirmed
polymorphisms and known disease mutations. In a real-world example
using exome data, MutationTaster2 yielded a false positive rate of 1%
for homozygous alterations (Supplementary Table 3 and Supplementary
Methods).
The major drawback of MutationTaster2 is its limitation to intra-
genic variants. With the advance of whole-genome sequencing proj-
ects, it should be possible to overcome this limitation in the future.
It should be noted that MutationTaster2 has been designed specifi-
cally to aid the identification of rare variants with severe impact
(as in monogenic disorders) and is not intended to predict the
consequences of common variants with small effects.
Note: Any Supplementary Information and Source Data files are available in the online
version of the paper (doi:10.1038/nmeth.2890).
ACKNOWLEDGMENTS
This work is supported by grants from the Deutsche Forschungsgemeinschaft (SFB665
TP-C4) to M.S., the Einsteinstiftung Berlin (A-2011-63) to J.M.S. and M.S. and the
German Bundesministerium für Bildung und Forschung (mitoNET 01GM1113D) to
D.S. and M.S. M.S. is a member of the NeuroCure Center of Excellence (Exc 257).
COMPETING FINANCIAL INTERESTS
The authors declare competing financial interests: details are available in the online
version of the paper (doi:10.1038/nmeth.2890).
Jana Marie Schwarz1,2, David N Cooper3, Markus Schuelke1,2 &
Dominik Seelow1,2
1Department of Neuropediatrics, Charité – Universitätsmedizin Berlin, Berlin,
Germany. 2NeuroCure Clinical Research Center, Charité – Universitätsmedizin
Berlin, Berlin, Germany. 3Institute of Medical Genetics, Cardiff University,
Cardiff, UK.
e-mail: dominik.seelow@charite.de
1. Schwarz, J.M., Rödelsperger, C., Schuelke, M. & Seelow, D. Nat. Methods 7,
575–576 (2010).
2. The 1000 Genomes Project Consortium. Nature 491, 56–65 (2012).
3. Landrum, M.J. et al. Nucleic Acids Res. 42, D980–D985 (2014).
4. Stenson, P. D. et al. Hum. Genet. 133, 1–9 (2014).
5. Altshuler, D.M. et al. Nature 467, 52–58 (2010).
6. The ENCODE Project Consortium. Nature 489, 57–74 (2012).
7. Portales-Casamar, E. et al. Nucleic Acids Res. 38, D105–D110 (2010).
8. Seelow, D., Schwarz, J.M. & Schuelke, M. PLoS ONE 3, e3874 (2008).
predictions, MutationTaster2 considers loss or decreased strength of
splice sites only at existing intron-exon borders. A sequence change
within 2 base pairs of an intron-exon junction is regarded as the loss
of a splice site. As a further improvement, MutationTaster2 is able to
analyze sequence alterations spanning an intron-exon junction; these
are likely to perturb normal splicing and hence have considerable
pathogenic potential.
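The automatic rules described in the letter (homozygous counts in 1000G or HapMap, ClinVar status, and the splice-site checks at existing intron-exon borders) can be sketched as follows. This is an illustration under stated assumptions, not the MutationTaster2 implementation: the `Variant` fields, the function names, the order in which the ClinVar and frequency rules are applied, and the reading of "within 2 base pairs" as two bases on either side of the junction are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variant:
    start: int  # first altered position (1-based, inclusive)
    end: int    # last altered position (1-based, inclusive)
    homozygous_count_1000g: int = 0
    homozygous_count_hapmap: int = 0
    clinvar_pathogenic: bool = False


def automatic_call(v: Variant) -> Optional[str]:
    """Return a call made without the classifier, or None if none applies."""
    # Assumption: the ClinVar rule is checked first; the letter gives no order.
    if v.clinvar_pathogenic:
        return "disease causing"
    # "More than four times in the homozygous state" in 1000G or HapMap.
    if v.homozygous_count_1000g > 4 or v.homozygous_count_hapmap > 4:
        return "neutral"
    return None


def splice_flag(v: Variant, boundary: int) -> Optional[str]:
    """Check a variant against one existing intron-exon junction.

    The junction is modeled as lying between positions `boundary` and
    `boundary + 1` (a hypothetical coordinate convention).
    """
    if v.start <= boundary < v.end:
        return "spans junction"
    # Assumption: "within 2 bp" covers two bases on each side of the junction.
    if v.end >= boundary - 1 and v.start <= boundary + 2:
        return "splice site lost"
    return None


print(automatic_call(Variant(100, 100, homozygous_count_1000g=5)))  # neutral
print(splice_flag(Variant(102, 102), boundary=100))  # splice site lost
```

Restricting the splice check to known junctions, as in `splice_flag`, mirrors the stated goal of reducing false positive splice-site predictions.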
We were able to substantially increase the speed of
MutationTaster2 by caching BLASTP results from protein-
conservation analysis and by implementing our own function to
search for changes in the amino acid sequence. A single analysis
now takes less than 0.10 seconds on average.
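The effect of caching can be illustrated with a small in-memory sketch. It assumes that the expensive step depends only on the protein being analyzed, so that repeated variants in the same protein reuse one result; `run_conservation_search` is a hypothetical stand-in and does not call BLASTP, and the letter does not say how or where the real cache is stored.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the expensive step really runs


def run_conservation_search(protein_id: str) -> str:
    """Hypothetical stand-in for an expensive conservation search."""
    CALLS["count"] += 1
    return f"alignment-for-{protein_id}"


@lru_cache(maxsize=None)
def cached_conservation(protein_id: str) -> str:
    """Return the search result, computing it only once per protein."""
    return run_conservation_search(protein_id)


# Four variants in two proteins trigger only two searches.
for pid in ["P1", "P1", "P2", "P1"]:
    cached_conservation(pid)

print(CALLS["count"])  # 2
```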
For the rapid and user-friendly analysis of NGS results, we cre-
ated a dedicated query engine. Users can upload VCF files and
adjust several parameters, such as confining consideration to
homozygous variants or certain regions and filtering for known
polymorphisms. Job-scheduler software processes the genotypes
in a highly parallel fashion (500,000 alterations per hour). Users
can opt to be notified by e-mail when the process is complete. The
results can be filtered, prioritized and inspected in a web browser
or downloaded. We integrated our candidate-gene search engine,
GeneDistiller8, to let users determine the most likely candidate
genes among the potentially deleterious variants. In addition, we
developed a web interface for single queries using chromosomal
positions. MutationTaster2 automatically maps the variant to all
suitable genes and transcripts, analyzes the variant in all of them
and displays a table summarizing the predictions for all transcripts
and detailed results for each transcript.
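The kind of pre-filtering the query engine offers (homozygous variants only, or a chosen region) can be sketched on plain VCF text. This is a minimal illustration, not the MutationTaster2 code: it assumes tab-separated VCF records with a FORMAT column and reads only the first sample; the function names are hypothetical.

```python
def is_homozygous_alt(genotype: str) -> bool:
    """True for diploid genotypes such as 1/1 or 1|1 (not 0/0, 0/1 or ./.)."""
    alleles = genotype.replace("|", "/").split("/")
    return (len(alleles) == 2 and alleles[0] == alleles[1]
            and alleles[0] not in ("0", "."))


def filter_vcf(lines, homozygous_only=True, region=None):
    """Keep (chrom, pos, ref, alt) records passing the chosen filters.

    region: optional (chrom, start, end), 1-based and inclusive.
    """
    kept = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        if region and not (chrom == region[0] and region[1] <= pos <= region[2]):
            continue
        if homozygous_only:
            if len(fields) < 10:
                continue  # no genotype columns present
            keys = fields[8].split(":")
            if "GT" not in keys:
                continue
            genotype = fields[9].split(":")[keys.index("GT")]  # first sample
            if not is_homozygous_alt(genotype):
                continue
        kept.append((chrom, pos, ref, alt))
    return kept


example = [
    "##fileformat=VCFv4.1",
    "1\t1000\t.\tA\tG\t50\tPASS\t.\tGT\t1/1",
    "1\t2000\t.\tC\tT\t50\tPASS\t.\tGT\t0/1",
    "2\t3000\t.\tG\tA\t50\tPASS\t.\tGT:DP\t1|1:20",
]
print(filter_vcf(example))  # [('1', 1000, 'A', 'G'), ('2', 3000, 'G', 'A')]
```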
As with its predecessor, MutationTaster2 uses a Bayes classi-
fier to generate predictions. Because alterations with different
effects on the protein sequence require different tests, we use
three classification models, designed for alterations that lead
to single amino acid substitutions (‘simple_aae’), involve more
than one amino acid (‘complex_aae’) or are noncoding or syn-
onymous (‘without_aae’). MutationTaster2 was trained and
tested with single base exchanges and short indels, comprising
>6,000,000 validated polymorphisms from 1000G and (with permission
from BIOBASE) >100,000 known disease mutations from HGMD Professional
(Supplementary Table 1).
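The choice among the three models, and the way a Bayes classifier combines evidence, can be sketched as follows. This is a hedged illustration only: it assumes a naive Bayes formulation with conditionally independent features, which the letter does not spell out, and the function names, the feature name `conserved` and all probabilities are made up for the example.

```python
import math


def choose_model(ref_protein: str, alt_protein: str) -> str:
    """Pick a model name from the effect on the protein sequence."""
    if ref_protein == alt_protein:
        return "without_aae"      # noncoding or synonymous
    if len(ref_protein) == len(alt_protein):
        diffs = sum(a != b for a, b in zip(ref_protein, alt_protein))
        if diffs == 1:
            return "simple_aae"   # single amino acid substitution
    return "complex_aae"          # more than one amino acid involved


def naive_bayes_posterior(features, likelihoods, prior_disease=0.5):
    """P(disease | features) under the naive independence assumption.

    likelihoods[name][value] = (P(value | disease), P(value | polymorphism))
    """
    log_d = math.log(prior_disease)
    log_p = math.log(1.0 - prior_disease)
    for name, value in features.items():
        p_d, p_p = likelihoods[name][value]
        log_d += math.log(p_d)
        log_p += math.log(p_p)
    top = max(log_d, log_p)       # subtract the maximum for numerical stability
    e_d, e_p = math.exp(log_d - top), math.exp(log_p - top)
    return e_d / (e_d + e_p)


# Illustrative numbers only, not values from the paper.
toy_likelihoods = {"conserved": {True: (0.8, 0.2), False: (0.2, 0.8)}}
print(choose_model("MKTAY", "MKSAY"))  # simple_aae
posterior = naive_bayes_posterior({"conserved": True}, toy_likelihoods)
```

Keeping one likelihood table per model would let each model use only the tests that make sense for its class of alteration, which is the motivation the letter gives for having three models.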
We were able to improve the accuracy in all classification models,
Table 1 | Comparison between MutationTaster2 and other prediction tools
Tool             n      NPV    PPV    Sensitivity  Specificity  Accuracy
PPH2-var         2,200  0.808  0.875  0.789        0.887        0.838
PPH2-div         2,200  0.853  0.827  0.858        0.821        0.840
PROVEAN          2,200  0.798  0.865  0.778        0.878        0.828
SIFT             2,200  0.832  0.854  0.827        0.858        0.843
MutationTaster1  2,200  0.850  0.870  0.846        0.874        0.860
MutationTaster2  2,200  0.886  0.875  0.887        0.874        0.880
Details about the methods and further statistics are presented in Supplementary Methods and at
http://www.mutationtaster.org/info/statistics.html. n, number of cases; NPV, negative prediction value;
PPV, positive prediction value; PPH2-div, PolyPhen-2 with HumDiv classifier; PPH2-var, PolyPhen-2 with
HumVar classifier.
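The columns of Table 1 are linked by standard definitions, which the following sketch uses to recompute NPV, PPV and accuracy from sensitivity and specificity. It assumes that disease mutations are the positive class, that the test set holds 1,100 cases of each class as stated in the text, and that every case received a prediction; the function name is hypothetical.

```python
def metrics_from_rates(sensitivity, specificity, n_disease=1100, n_neutral=1100):
    """Derive PPV, NPV and accuracy from sensitivity and specificity."""
    tp = sensitivity * n_disease      # disease mutations called disease causing
    fn = n_disease - tp               # disease mutations called neutral
    tn = specificity * n_neutral      # polymorphisms called neutral
    fp = n_neutral - tn               # polymorphisms called disease causing
    return {
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "Accuracy": (tp + tn) / (n_disease + n_neutral),
    }


# MutationTaster2 row of Table 1: sensitivity 0.887, specificity 0.874.
mt2 = metrics_from_rates(0.887, 0.874)
for name, value in mt2.items():
    print(f"{name}: {value:.3f}")
```

Because the two classes are the same size, accuracy reduces to the mean of sensitivity and specificity, which agrees with the tabulated value to within rounding.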
... Third, variants that were in genes determined not to be expressed in NS LUAD as determined by Illumina expression microarray analysis of a panel 30 BCCA samples were removed using a pipeline previously described [28]. Lastly, amino acid functional change was predicted with SIFT [29], LRT [30], MutationTaster [31], MutationAssessor [32], FATHMM [33][34][35], and MetaSVM [36] by ANNOVAR [17] and if at least half of the algorithms predicted that a variant was tolerable to protein function, the variant was removed, unless it was an indel or had a SPIDEX [27]| DPSI z-score|≥2. ...
... Following removal of non-coding and silent variants, we searched for variants in genes that are expressed in LUADs from NSs, as it has previously been demonstrated that genes not expressed in specific cancers have a higher mutation rate due to limited selective pressure [28] (Fig. 1a). Finally, amino acid functional prediction by various algorithms [29][30][31][32][33][34][35][36] as annotated through ANNOVAR [17] was assessed and only variants with at least half of the algorithms predicting a deleterious consequence to protein function were considered as a functional amino acid substitution. Variants that were either indels, functional amino acid substitutions, or predicted to affect splicing (SPIDEX|DPSI z-score|≥2) were considered candidates, resulting in 428 variants with a predicted functional consequence across the 15 samples (Fig. 1a). ...
Article
Full-text available
Background Identification of driver mutations and development of targeted therapies has considerably improved outcomes for lung cancer patients. However, significant limitations remain with the lack of identified drivers in a large subset of patients. Here, we aimed to assess the genomic landscape of lung adenocarcinomas (LUADs) from individuals without a history of tobacco use to reveal new genetic drivers of lung cancer. Methods Integrative genomic analyses combining whole-exome sequencing, copy number, and mutational information for 83 LUAD tumors was performed and validated using external datasets to identify genetic variants with a predicted functional consequence and assess association with clinical outcomes. LUAD cell lines with alteration of identified candidates were used to functionally characterize tumor suppressive potential using a conditional expression system both in vitro and in vivo. Results We identified 21 genes with evidence of positive selection, including 12 novel candidates that have yet to be characterized in LUAD. In particular, SNF2 Histone Linker PHD RING Helicase ( SHPRH ) was identified due to its frequency of biallelic disruption and location within the familial susceptibility locus on chromosome arm 6q. We found that low SHPRH mRNA expression is associated with poor survival outcomes in LUAD patients. Furthermore, we showed that re-expression of SHPRH in LUAD cell lines with inactivating alterations for SHPRH reduces their in vitro colony formation and tumor burden in vivo. Finally, we explored the biological pathways associated SHPRH inactivation and found an association with the tolerance of LUAD cells to DNA damage. Conclusions These data suggest that SHPRH is a tumor suppressor gene in LUAD, whereby its expression is associated with more favorable patient outcomes, reduced tumor and mutational burden, and may serve as a predictor of response to DNA damage. 
Thus, further exploration into the role of SHPRH in LUAD development may make it a valuable biomarker for predicting LUAD risk and prognosis.
... , CADD(Rentzsch et al., 2021) , NCBI PubMed (National Center for Biotechnology Information, 1988), Ensembl Variant Effect Predictor(McLaren et al., 2016), ClinVar(Landrum et al., 2018), Mutation Taster(Schwarz et al., 2014), SIFT(Vaser et al., 2016), PolyPhen(Adzhubei et al., 2010), and VarSome(Kopanos et al., 2019). Annotation included chromosome and physical positions (GRCh37/hg19), dbSNP rsID, associated genes, genomic alterations, variant consequences, exonic functions, and pathogenicity classifications. ...
Preprint
Full-text available
Essential Hypertension (EH) is a medical condition characterized by persistent high blood pressure, and it is one of the most significant public health problems globally, causing approximately 9.4 million deaths annually. The prevalence of EH varies according to genetic ancestry and affects differently specific populations, with 17% in the Americas, 19.2% in the Western Pacific, 23.2% in Europe, 25.1% in Southeast Asia, 26.3% in the Eastern Mediterranean, and 27.2% in Africa. EH is a multifactorial disease, that is, it is determined by genetic factors and influenced by environmental factors. Although genetic factors are estimated to contribute to around 30-60% of the variation in blood pressure, the genetic complexity of hypertension is not yet fully understood due to the limited knowledge of candidate genes, susceptibility loci, and population-specific differences. Various approaches such as candidate gene-based, genome-wide linkage analysis (GWLA), and family- and population-based genome-wide association studies (GWAS) have been used to identify genetic factors contributing to EH, but part of the heritability of blood pressure-related phenotypes is not explained by the genetic factors known so far, mainly due to methodological limitations. The main objective of this study was to investigate the genetic basis of EH by mapping regions of interest (ROI) and investigate candidate genes and variants influencing the Essential Hypertension in African-derived individuals from partially isolated populations of quilombo remnants in Vale do Ribeira (Sao Paulo - Brazil), which were previously well characterized from the clinical, genealogical and population genetics point of view. Samples from 431 individuals (167 affected, 261 unaffected and 3 with an unknown phenotype) from eight quilombo remnants populations were genotyped using SNP array with approximately 650.000 SNPs. 
The global ancestry proportions of these populations were estimated to be 47%, 36%, and 16% for African, European, and Native American ancestries, respectively. In addition, genealogical information from 673 individuals was used to construct six pedigrees comprising a total of 1104 individuals. The mapping strategy consisted of a multi-level computational approach. Pedigrees were constructed (GenoPro v.3) based on interviews and kinship coefficient (King v.2.2, MORGAN v.3.4 and PBAP v.1). The dataset was pruned (King v.2.2 and PBAP v.1) to obtain three non-overlapping markers subpanels (PBAP v.2). Haplotype phasing and local ancestry (SNPFlip v.0.0.6, SHAPEIT2 and RFMIX v.2) were performed to obtain SNP allele frequency (ADMIXFRQ v.1) to account for admixture. Genome-wide and dense linkage analyses were performed using the three subpanels of markers (MORGAN v.3.4). Finally, we performed fine-mapping using family-based association studies (GENESIS v.2.24) based on population (MINIMAC v.4) and pedigree (GIGI2) imputed data, and EH-related genes and variant investigation (relying on publicly available databases). The linkage analysis strategy resulted in the mapping of 22 ROIs with LOD score 1.45, containing markers co-segregating with the phenotype. The LOD score range was 1.45-3.03 considering all the linked segments and these 22 ROIs encompassed 2363 genes. As our first fine-mapping strategy, we identified 60 EH-related genes as potential candidates to contribute to high blood pressure in quilombo remnants pedigrees. In addition, as our second fine-mapping strategy, we identified 118 suggestive or significant variants through family-based association studies. 
Considering only the common results between the two strategies, we found that 14 genes - PHGDH and S100A10 (ROI1), MFN2 (ROI2), RYR2, EDARADD and MTR (ROI3), SERTAD2 (ROI4), LPP (ROI5), KCNT1 (ROI11), TENM4 (ROI13), P2RX1, ZZEF1 and RPA1 (ROI18), and ALPK2 (ROI20) - were identified within the mapped regions with strong and sufficient evidence in the literature attesting relatedness with the phenotype. These genes harbor 29 SNPs that were either within them or very close, and they were pointed out by family-based association studies as showing suggestive or significant association with hypertension. We also identified 46 other genes within the mapped regions, but with less evidence of relatedness to the phenotype since they were not replicated through family-based association studies. Overall, the results obtained in this study revealed, through a complementary approach - combining admixture-adjusted genome-wide linkage analysis based on Markov chain Monte Carlo (MCMC) methods, association studies on imputed data, and in silico investigations - genetic regions, variants and candidate genes that shed light on the genetic basis of essential hypertension. These genes are responsible for encoding proteins that play crucial roles in regulating blood pressure, including the regulation of sodium and potassium levels, and the aldosterone pathway, among others. Our findings reveal genes and variants with distinct potential to explain the genetic etiology of essential hypertension observed in quilombo remnant populations.
... The BigDye Terminator Cycle sequencing kit (Thermo Fisher Scientific, Foster City, CA, USA, TF) was employed for amplicon sequencing, and a 3130/XL Genetic Analyzer (TF) was utilized for analysis. Bioinformatic tools, including Mutation Taster, 26 Polyphen-2, 27 CADD, 28 FATHMM, 29 SIFT, 30 and PROVEAN, 31 were employed to investigate variant pathogenicity. The 3D structure of the protein was analyzed by modeling in the UCSF ChimeraX software package (version 1.3). ...
Article
Full-text available
Plain language summary Identifying the first de novo mutation in the cystic fibrosis transmembrane conductance regulator protein in Iran: a case report with insights from microsatellite markers A child can develop Cystic Fibrosis (CF) if both parents pass on mutated genes. In some rare cases, new genetic mutations occur spontaneously, causing CF. This report discusses a unique case where a child has one gene with a spontaneous mutation and inherits another gene mutation from the mother. We used a method called Sanger sequencing to find the two different gene changes in the affected person. We also used computer analysis to predict how these changes might affect the protein responsible for this genetic disease. To confirm that the child's new change is not inherited, we used a type of genetic marker called microsatellite markers. The mutation inherited from the mother and the new spontaneous mutation resulted in a unique change in the responsible protein. This mutation is located in a specific part of the protein called the lasso motif. Our computer simulations show that this mutation disrupts the interaction between the lasso motif and another part of the protein called the R-domain, which ultimately affects the protein's function. This case is significant because it is the first reported instance of a de novo mutation causing CF in Asia. It has important implications for genetic testing, counseling, and understanding how recessive genetic disorders like CF occur within the Iranian population.
... ARH3 H182R is predicted to be pathogenic by multiple tools based on evolutionary conservation and protein structure-function relationships (Table S4). [30][31][32] Further, His182 is a highly evolutionarily conserved residue in the active site of ARH3 that other studies found to be essential for binding ADPr (Figure 1D-E). 28,29,33 ARH3 H182R was not found in gnomAD 34 , a collection of sequencing data from over 195,000 individuals taken from many projects, nor in Al Mena 35 , a collection of 2,115 individuals from the Middle East and North Africa. ...
Preprint
Purpose ADP-ribosylation is a post-translational modification involving the transfer of one or more ADP-ribose units from NAD+ to target proteins. Dysregulation of ADP-ribosylation is implicated in neurodegenerative diseases. Here we report a novel homozygous variant in the ADPRS gene (c.545A>G, p.His182Arg) encoding the mono(ADP-ribosyl) hydrolase ARH3 found in 2 patients with childhood-onset neurodegeneration with stress-induced ataxia and seizures (CONDSIAS). Methods Genetic testing via exome sequencing was used to identify the underlying disease cause in two siblings with developmental delay, seizures, progressive muscle weakness, and respiratory failure following an episodic course. Studies in a cell culture model uncover biochemical and cellular consequences of the identified genetic change. Results The ARH3 H182R variant affects a highly conserved residue in the active site of ARH3, leading to protein instability, degradation, and reduced expression. ARH3 H182R additionally fails to localize to the nucleus. The combination of reduced expression and mislocalization of ARH3 H182R resulted in accumulation of mono-ADP ribosylated species in cells. Conclusions The children’s clinical course combined with the biochemical characterization of their genetic variant develops our understanding of the pathogenic mechanisms driving CONDSIAS and highlights a critical role for ARH3-regulated ADP ribosylation in nervous system integrity.
Article
Full-text available
Inherited and developmental eye diseases are quite diverse and numerous, and determining their genetic cause is challenging due to their high allelic and locus heterogeneity. New molecular approaches, such as whole exome sequencing (WES), have proven to be powerful molecular tools for addressing these cases. The present study used WES to identify the genetic etiology in ten unrelated Mexican pediatric patients with complex ocular anomalies and other systemic alterations of unknown etiology. The WES approach allowed us to identify five clinically relevant variants in the GZF1, NFIX, TRRAP, FGFR2 and PAX2 genes associated with Larsen, Malan, developmental delay with or without dysmorphic facies and autism, LADD1 and papillorenal syndromes. Mutations located in GZF1 and NFIX were classified as pathogenic, those in TRRAP and FGFR2 were classified as likely pathogenic variants, and those in PAX2 were classified as variants of unknown significance. Protein modeling of the two missense FGFR2 p.(Arg210Gln) and PAX2 p.(Met3Thr) variants showed that these changes could induce potential structural alterations in important functional regions of the proteins. Notably, four out of the five variants were not previously reported, except for the TRRAP gene. Consequently, WES enabled the identification of the genetic cause in 40% of the cases reported. All the syndromes reported herein are very rare, with phenotypes that may overlap with other genetic entities.
Article
Steroid-resistant nephrotic syndrome is the second leading cause of chronic kidney disease among patients < 25 years of age. Through exome sequencing, identification of > 65 monogenic causes has revealed insights into disease mechanisms of nephrotic syndrome (NS). To elucidate novel monogenic causes of NS, we combined homozygosity mapping with exome sequencing in a worldwide cohort of 1649 pediatric patients with NS. We identified homozygous missense variants in MYO1C in two unrelated children with NS (c.292C > T, p.R98W; c.2273 A > T, p.K758M). We evaluated publicly available kidney single-cell RNA sequencing datasets and found MYO1C to be predominantly expressed in podocytes. We then performed structural modeling for the identified variants in PyMol using aligned shared regions from two available partial structures of MYO1C (4byf and 4r8g). In both structures, calmodulin, a common regulator of myosin activity, is shown to bind to the IQ motif. At both residue sites (K758; R98), there are ion-ion interactions stabilizing intradomain and ligand interactions: R98 binds to nearby D220 within the myosin motor domain and K758 binds to E14 on a calmodulin molecule. Variants of these charged residues to non-charged amino acids could ablate these ionic interactions, weakening protein structure and function establishing the impact of these variants. We here identified recessive variants in MYO1C as a potential novel cause of NS in children. A higher resolution version of the Graphical abstract is available as Supplementary information
Article
Full-text available
By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.
Article
Full-text available
ClinVar (http://www.ncbi.nlm.nih.gov/clinvar/) provides a freely available archive of reports of relationships among medically important variants and phenotypes. ClinVar accessions submissions reporting human variation, interpretations of the relationship of that variation to human health and the evidence supporting each interpretation. The database is tightly coupled with dbSNP and dbVar, which maintain information about the location of variation on human assemblies. ClinVar is also based on the phenotypic descriptions maintained in MedGen (http://www.ncbi.nlm.nih.gov/medgen). Each ClinVar record represents the submitter, the variation and the phenotype, i.e. the unit that is assigned an accession of the format SCV000000000.0. The submitter can update the submission at any time, in which case a new version is assigned. To facilitate evaluation of the medical importance of each variant, ClinVar aggregates submissions with the same variation/phenotype combination, adds value from other NCBI databases, assigns a distinct accession of the format RCV000000000.0 and reports if there are conflicting clinical interpretations. Data in ClinVar are available in multiple formats, including html, download as XML, VCF or tab-delimited subsets. Data from ClinVar are provided as annotation tracks on genomic RefSeqs and are used in tools such as Variation Reporter (http://www.ncbi.nlm.nih.gov/variation/tools/reporter), which reports what is known about variation based on user-supplied locations.
Article
Full-text available
The Human Gene Mutation Database (HGMD(®)) is a comprehensive collection of germline mutations in nuclear genes that underlie, or are associated with, human inherited disease. By June 2013, the database contained over 141,000 different lesions detected in over 5,700 different genes, with new mutation entries currently accumulating at a rate exceeding 10,000 per annum. HGMD was originally established in 1996 for the scientific study of mutational mechanisms in human genes. However, it has since acquired a much broader utility as a central unified disease-oriented mutation repository utilized by human molecular geneticists, genome scientists, molecular biologists, clinicians and genetic counsellors as well as by those specializing in biopharmaceuticals, bioinformatics and personalized genomics. The public version of HGMD ( http://www.hgmd.org ) is freely available to registered users from academic institutions/non-profit organizations whilst the subscription version (HGMD Professional) is available to academic, clinical and commercial users under license via BIOBASE GmbH.
The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research. The human genome sequence provides the underlying code for human biology. Despite intensive study, especially in identifying protein-coding genes, our understanding of the genome is far from complete, particularly with regard to non-coding RNAs, alternatively spliced transcripts and regulatory sequences. Systematic analyses of transcripts and regulatory information are essential for the identification of genes and regulatory regions, and are an important resource for the study of human biology and disease. Such analyses can also provide comprehensive views of the organization and variability of genes and regulatory information across cellular contexts, species and individuals.
Despite great progress in identifying genetic variants that influence human disease, most inherited risk remains unexplained. A more complete understanding requires genome-wide studies that fully examine less common alleles in populations with a wide range of ancestry. To inform the design and interpretation of such studies, we genotyped 1.6 million common single nucleotide polymorphisms (SNPs) in 1,184 reference individuals from 11 global populations, and sequenced ten 100-kilobase regions in 692 of these individuals. This integrated data set of common and rare alleles, called 'HapMap 3', includes both SNPs and copy number polymorphisms (CNPs). We characterized population-specific differences among low-frequency variants, measured the improvement in imputation accuracy afforded by the larger reference panel, especially in imputing SNPs with a minor allele frequency of ≤5%, and demonstrated the feasibility of imputing newly discovered CNPs and SNPs. This expanded public resource of genome variants in global populations supports deeper interrogation of genomic variation and its role in human disease, and serves as a step towards a high-resolution map of the landscape of human genetic variation.
JASPAR (http://jaspar.genereg.net) is the leading open-access database of matrix profiles describing the DNA-binding patterns of transcription factors (TFs) and other proteins interacting with DNA in a sequence-specific manner. Its fourth major release is the largest expansion of the core database to date: the database now holds 457 non-redundant, curated profiles. The new entries include the first batch of profiles derived from ChIP-seq and ChIP-chip whole-genome binding experiments, and 177 yeast TF binding profiles. The introduction of a yeast division brings the convenience of JASPAR to an active research community. As binding models are refined by newer data, the JASPAR database now uses versioning of matrices: in this release, 12% of the older models were updated to improved versions. Classification of TF families has been improved by adopting a new DNA-binding domain nomenclature. A curated catalog of mammalian TFs is provided, extending the use of the JASPAR profiles to additional TFs belonging to the same structural family. The changes in the database set the system ready for more rapid acquisition of new high-throughput data sources. Additionally, three new special collections provide matrix profile data produced by recent alternative high-throughput approaches.
Linkage studies often yield intervals containing several hundred positional candidate genes. Different manual or automatic approaches exist for the determination of the gene most likely to cause the disease. While the manual search is very flexible and takes advantage of the researchers' background knowledge and intuition, it may be very cumbersome to collect and study the relevant data. Automatic solutions on the other hand usually focus on certain models, remain "black boxes" and do not offer the same degree of flexibility. We have developed a web-based application that combines the advantages of both approaches. Information from various data sources such as gene-phenotype associations, gene expression patterns and protein-protein interactions was integrated into a central database. Researchers can select which information for the genes within a candidate interval or for single genes shall be displayed. Genes can also interactively be filtered, sorted and prioritised according to criteria derived from the background knowledge and preconception of the disease under scrutiny. GeneDistiller provides knowledge-driven, fully interactive and intuitive access to multiple data sources. It displays maximum relevant information, while saving the user from drowning in the flood of data. A typical query takes less than two seconds, thus allowing an interactive and explorative approach to the hunt for the candidate gene. ACCESS: GeneDistiller can be freely accessed at http://www.genedistiller.org.
Portales-Casamar, E. et al. Nucleic Acids Res. 38, D105–D110 (2010).