NATURE METHODS | VOL.11 NO.4 | APRIL 2014 | 361
CORRESPONDENCE
COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
Eric Lubeck1,2, Ahmet F Coskun1,2, Timur Zhiyentayev1,
Mubhij Ahmad1 & Long Cai1
1Division of Chemistry and Chemical Engineering, California Institute of Technology,
Pasadena, California, USA. 2These authors contributed equally to this work.
e-mail: lcai@caltech.edu
1. Lubeck, E. & Cai, L. Nat. Methods 9, 743–748 (2012).
2. Ke, R. et al. Nat. Methods 10, 857–860 (2013).
3. Levesque, M.J., Ginart, P., Wei, Y. & Raj, A. Nat. Methods 10, 865–867 (2013).
4. Levesque, M.J. & Raj, A. Nat. Methods 10, 246–248 (2013).
MutationTaster2: mutation prediction
for the deep-sequencing age
To the Editor:
The majority of the gene variants discovered by next-
generation sequencing (NGS) projects are either intronic or synony-
mous. These variants are difficult to interpret because their effects on
protein expression and function tend to be less obvious than those
of missense or nonsense variants. Here we present MutationTaster2
(http://www.mutationtaster.org/), the latest version of our web-based
software MutationTaster1, which evaluates the pathogenic potential
of DNA sequence alterations. It is designed to predict the functional
consequences of not only amino acid substitutions but also intronic
and synonymous alterations, short insertion and/or deletion (indel)
mutations and variants spanning intron-exon borders.
MutationTaster2 includes all publicly available single-nucleotide
polymorphisms (SNPs) and indels from the 1000 Genomes Project2
(hereafter referred to as 1000G) as well as known disease variants from
ClinVar3 and HGMD Public4. Alterations found more than four times
in the homozygous state in 1000G or in HapMap5 are automatically
regarded as neutral. Variants marked as pathogenic in ClinVar are
automatically predicted to be disease causing, and the disease phe-
notype is displayed. We have integrated tests for regulatory features,
including data from the ENCODE project6 and JASPAR7, and score
the evolutionary conservation around DNA variants (Supplementary
Methods). To reduce the number of false positive splice-site
four barcodes left out). We first immobilized cells on glass
surfaces (Supplementary Methods). The DNA probes were hybridized,
imaged and then removed by DNase I treatment (88.5% ± 11.0%
efficiency (± standard deviation); Supplementary Fig. 2 and
Supplementary Note). The remaining signal was photobleached
(Supplementary Fig. 3). Even after six hybridizations, mRNAs were
observed at 70.9% ± 21.8% of the original intensity (Supplementary
Fig. 4). We observed that 77.9% ± 5.6% of the spots that colocalized
in the first two hybridizations also colocalized with the third
hybridization (Fig. 1b and Supplementary Figs. 5 and 6). We
quantified the mRNA abundances by counting the occurrence of
corresponding barcodes in the cell (n = 37 cells; Supplementary
Figs. 7 and 8). We also show that mRNAs can be stripped and
rehybridized efficiently in adherent mammalian cells (Supplementary
Figs. 9 and 10).
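The counting step described above (spots that colocalize across hybridizations are assigned a barcode, and barcode occurrences give transcript abundance) can be illustrated with a minimal sketch. It is not the authors' image-analysis pipeline: the greedy nearest-neighbour matching, the `max_shift` threshold, the dye labels and the spot coordinates are all assumptions made for illustration.

```python
from collections import Counter
from math import dist


def extract_barcodes(rounds, max_shift=1.0):
    """Count barcodes from spots that colocalize across hybridization rounds.

    rounds: one list per hybridization, each holding (x, y, dye) spots.
    max_shift: hypothetical maximum distance for two spots to count as
               colocalized (same units as the coordinates).
    Returns a Counter mapping barcode tuples (one dye per round) to counts.
    """
    counts = Counter()
    for x, y, dye in rounds[0]:
        barcode = [dye]
        for later in rounds[1:]:
            # Nearest spot in the later round (None if that round is empty).
            nearest = min(later, key=lambda s: dist((x, y), s[:2]), default=None)
            if nearest is None or dist((x, y), nearest[:2]) > max_shift:
                barcode = None  # no colocalizing partner: discard this spot
                break
            barcode.append(nearest[2])
        if barcode is not None:
            counts[tuple(barcode)] += 1
    return counts


# Invented toy coordinates for two hybridization rounds in one cell.
hyb1 = [(10.0, 10.0, "purple"), (20.0, 5.0, "blue"), (30.0, 30.0, "purple")]
hyb2 = [(10.2, 9.9, "blue"), (20.1, 5.1, "blue")]
barcode_counts = extract_barcodes([hyb1, hyb2])
```

In this toy input the third spot of the first round has no partner within `max_shift` and is dropped, mirroring the treatment of spots without colocalization as nonspecific binding or mishybridization.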
Sequential barcoding has many advantages. First, it scales up
quickly; with even two dyes the coding capacity is in principle
unlimited. Second, during each hybridization, all available FISH
probes against a transcript can be used, thereby increasing the
brightness of the FISH signal. Last, barcode readout is robust,
enabling full z stacks on native samples.
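The scaling argument can be made concrete with a short sketch. With F dyes and N rounds of hybridization, each transcript receives one dye per round, so F^N distinct barcodes are available. The dye labels below are hypothetical placeholders, and reading "four-color" imaging as four dyes is an assumption.

```python
from itertools import product


def coding_capacity(num_dyes: int, num_rounds: int) -> int:
    """Number of distinct barcodes: one dye choice per hybridization round."""
    return num_dyes ** num_rounds


def enumerate_barcodes(dyes, num_rounds):
    """List every possible barcode as a tuple of dyes, one per round."""
    return list(product(dyes, repeat=num_rounds))


# Hypothetical dye labels; the actual fluorophores are not named here.
dyes = ["dye_A", "dye_B", "dye_C", "dye_D"]
two_round_codes = enumerate_barcodes(dyes, 2)

print(len(two_round_codes))      # four dyes, two rounds
print(coding_capacity(2, 10))    # two dyes, ten rounds
```

Under the four-dye assumption, two rounds give 16 barcodes; leaving four unused is consistent with the twelve genes encoded in Figure 1b. Even with only two dyes, ten rounds would already give 1,024 barcodes.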
This barcoding scheme is conceptually akin to sequencing tran-
scripts in single cells with FISH. In contrast with the technique
used by Ke et al.2, our method takes advantage of the high hybrid-
ization efficiency of FISH (>95% of the mRNAs are detected1,3) and
the fact that base-pair resolution is usually not needed to uniquely
identify a transcript. We note that FISH probes can also be designed
to resolve a large number of splice isoforms and single-nucleotide
polymorphisms3, as well as chromosome loci4, in single cells. In
combination with our previous report of super-resolution FISH1,
the sequential barcoding method will enable the transcriptome to
be directly imaged at single-cell resolution in complex samples such
as brain tissue.
Note: Any Supplementary Information and Source Data files are available in the online
version of the paper (doi:10.1038/nmeth.2892).
ACKNOWLEDGMENTS
This work is funded by US National Institutes of Health single-cell analysis
program award R01HD075605.
[Figure 1 image. Panel a labels: FISH probes with purple dye (Hyb 1); DNase I and rehyb; same probes with blue dye (Hyb 2); same probes with green dye (Hyb N); over N hybs the barcode number scales as F^N. Panel b labels: composite four-color FISH images for hybridization 1 (probe set 1), hybridization 2 (probe set 2) and hybridization 3 (probe set 1); scale bars, 5 μm and 1 μm.]
Figure 1 | Sequential barcoding. (a) Schematic
of sequential barcoding. In each round of
hybridization, 24 probes are hybridized on each
transcript, imaged and then stripped by DNase I
treatment. The same probe sequences are used
in different rounds of hybridization (hyb), but
probes are coupled to different fluorophores. (b)
Composite four-color FISH data from three rounds
of hybridizations on multiple yeast cells. Twelve
genes are encoded by two rounds of hybridization,
with the third hybridization using the same
probes as hybridization 1. The boxed regions are
magnified in the bottom right corner of each
image. Spots colocalizing between hybridizations
are detected (as outlined in insets) and have their
barcodes extracted. Spots without colocalization
are due to nonspecific binding of probes in the
cell as well as mishybridization. The number of
instances of each barcode can be quantified to
provide the abundances of the corresponding
transcripts in single cells.
npg © 2014 Nature America, Inc. All rights reserved.
with a slight increase in the simple_aae
model (from 87.2% in MutationTaster
to 88.6% in MutationTaster2) and
substantial changes in the without_aae
model (from 82.7% to 92.2%) and the
complex_aae model (from 79.3% to 90.7%)
(Supplementary Table 2).
We compared the predictions of the web versions of MutationTaster2,
SIFT (http://sift.jcvi.org/), PolyPhen-2
(http://genetics.bwh.harvard.edu/pph2/) and PROVEAN
(http://provean.jcvi.org/index.php) on 1,100 polymorphisms and 1,100
disease mutations with variants causing single amino acid exchanges.
MutationTaster2 had the highest accuracy (88%) of the tools tested
(Table 1). The actual performance of MutationTaster2 is even
better because the program automatically detects and categorizes
confirmed polymorphisms and known disease mutations. In a real-world
example using exome data, MutationTaster2 yielded a false
positive rate of 1% for homozygous alterations (Supplementary
Table 3 and Supplementary Methods).
The major drawback of MutationTaster2 is its limitation to intra-
genic variants. With the advance of whole-genome sequencing proj-
ects, it should be possible to overcome this limitation in the future.
It should be noted that MutationTaster2 has been designed specifi-
cally to aid the identification of rare variants with severe impact
(as in monogenic disorders) and is not intended to predict the
consequences of common variants with small effects.
Note: Any Supplementary Information and Source Data files are available in the online
version of the paper (doi:10.1038/nmeth.2890).
ACKNOWLEDGMENTS
This work is supported by grants from the Deutsche Forschungsgemeinschaft (SFB665
TP-C4) to M.S., the Einsteinstiftung Berlin (A-2011-63) to J.M.S. and M.S. and the
German Bundesministerium für Bildung und Forschung (mitoNET 01GM1113D) to
D.S. and M.S. M.S. is a member of the NeuroCure Center of Excellence (Exc 257).
COMPETING FINANCIAL INTERESTS
The authors declare competing financial interests: details are available in the online
version of the paper (doi:10.1038/nmeth.2890).
Jana Marie Schwarz1,2, David N Cooper3, Markus Schuelke1,2 &
Dominik Seelow1,2
1Department of Neuropediatrics, Charité – Universitätsmedizin Berlin, Berlin,
Germany. 2NeuroCure Clinical Research Center, Charité – Universitätsmedizin
Berlin, Berlin, Germany. 3Institute of Medical Genetics, Cardiff University,
Cardiff, UK.
e-mail: dominik.seelow@charite.de
1. Schwarz, J.M., Rödelsperger, C., Schuelke, M. & Seelow, D. Nat. Methods 7,
575–576 (2010).
2. The 1000 Genomes Project Consortium. Nature 491, 56–65 (2012).
3. Landrum, M.J. et al. Nucleic Acids Res. 42, D980–D985 (2014).
4. Stenson, P.D. et al. Hum. Genet. 133, 1–9 (2014).
5. Altshuler, D.M. et al. Nature 467, 52–58 (2010).
6. The ENCODE Project Consortium. Nature 489, 57–74 (2012).
7. Portales-Casamar, E. et al. Nucleic Acids Res. 38, D105–D110 (2010).
8. Seelow, D., Schwarz, J.M. & Schuelke, M. PLoS ONE 3, e3874 (2008).
predictions, MutationTaster2 considers loss or decreased strength of
splice sites only at existing intron-exon borders. A sequence change
within 2 base pairs of an intron-exon junction is regarded as the loss
of a splice site. As a further improvement, MutationTaster2 is able to
analyze sequence alterations spanning an intron-exon junction; these
are likely to perturb normal splicing and hence have considerable
pathogenic potential.
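The rule-based shortcuts described in this letter can be sketched as follows. This is an illustration under stated assumptions, not the MutationTaster2 implementation: the field names, the return labels, the precedence of the ClinVar rule over the frequency rule, and the coordinate convention for the junction (the border lies between positions `junction` and `junction + 1`, with two bases counted on either side) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variant:
    # Hypothetical field names for the annotations mentioned in the text.
    homozygous_count_1000g: int = 0
    homozygous_count_hapmap: int = 0
    clinvar_pathogenic: bool = False


def automatic_call(v: Variant) -> Optional[str]:
    """Return an automatic call, or None if the classifier must decide."""
    if v.clinvar_pathogenic:
        return "disease causing"  # precedence over the next rule is assumed
    if v.homozygous_count_1000g > 4 or v.homozygous_count_hapmap > 4:
        return "neutral"  # seen more than four times in the homozygous state
    return None


def splice_effect(start: int, end: int, junction: int, window: int = 2) -> Optional[str]:
    """Classify a variant covering [start, end] against one intron-exon border.

    The border is assumed to lie between `junction` and `junction + 1`.
    """
    if start <= junction and end >= junction + 1:
        return "spans junction"
    # Overlap with the `window` bases on either side of the border.
    if end >= junction - window + 1 and start <= junction + window:
        return "splice site lost"
    return None
```

For example, `splice_effect(100, 100, 100)` reports a lost splice site, whereas a single-base change at position 103 relative to the same border returns `None` and is left to the other tests.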
We were able to substantially increase the speed of
MutationTaster2 by caching BLASTP results from protein-
conservation analysis and by implementing our own function to
search for changes in the amino acid sequence. A single analysis
now takes less than 0.10 seconds on average.
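The benefit of caching can be illustrated with a minimal sketch. It assumes an in-memory cache keyed by the protein sequence; how MutationTaster2 actually stores its BLASTP results is not stated here, and `expensive_conservation_search` is a hypothetical stand-in for the slow search step.

```python
from functools import lru_cache

search_calls = 0  # counts how often the expensive step actually runs


def expensive_conservation_search(protein_sequence: str) -> str:
    """Hypothetical stand-in for a slow homology search such as BLASTP."""
    global search_calls
    search_calls += 1
    return f"result for a {len(protein_sequence)}-residue query"


@lru_cache(maxsize=None)
def cached_conservation_search(protein_sequence: str) -> str:
    """Run the search once per distinct sequence and reuse the result."""
    return expensive_conservation_search(protein_sequence)


first = cached_conservation_search("MKTAYIAKQR")
second = cached_conservation_search("MKTAYIAKQR")  # served from the cache
```

Because many variants fall in the same protein, repeated queries for one sequence trigger the expensive step only once.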
For the rapid and user-friendly analysis of NGS results, we cre-
ated a dedicated query engine. Users can upload VCF files and
adjust several parameters, such as confining consideration to
homozygous variants or certain regions and filtering for known
polymorphisms. Job-scheduler software processes the genotypes
in a highly parallel fashion (500,000 alterations per hour). Users
can opt to be notified by e-mail when the process is complete. The
results can be filtered, prioritized and inspected in a web browser
or downloaded. We integrated our candidate-gene search engine,
GeneDistiller8, to let users determine the most likely candidate
genes among the potentially deleterious variants. In addition, we
developed a web interface for single queries using chromosomal
positions. MutationTaster2 automatically maps the variant to all
suitable genes and transcripts, analyzes the variant in all of them
and displays a table summarizing the predictions for all transcripts
and detailed results for each transcript.
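The kind of filtering offered by the query engine (restricting the analysis to homozygous variants or to certain regions) can be sketched for a single-sample VCF file. This is a minimal illustration in plain Python, not the MutationTaster2 query engine; the function names and the example records are invented, and only the GT field of the first sample is considered.

```python
def parse_vcf_line(line):
    """Parse one tab-separated VCF data line (first sample only)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    format_keys = fields[8].split(":")
    sample = dict(zip(format_keys, fields[9].split(":")))
    return {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt,
            "gt": sample.get("GT", "./.")}


def is_homozygous_alt(gt):
    """True for genotypes such as 1/1 or 1|1 (both alleles non-reference)."""
    alleles = gt.replace("|", "/").split("/")
    return (len(alleles) == 2 and alleles[0] == alleles[1]
            and alleles[0] not in ("0", "."))


def filter_variants(lines, homozygous_only=True, region=None):
    """Keep variants matching the filters; region is (chrom, start, end), inclusive."""
    kept = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        variant = parse_vcf_line(line)
        if homozygous_only and not is_homozygous_alt(variant["gt"]):
            continue
        if region is not None:
            chrom, start, end = region
            if variant["chrom"] != chrom or not (start <= variant["pos"] <= end):
                continue
        kept.append(variant)
    return kept


# Invented example records.
example_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
    "1\t1000\t.\tA\tG\t50\tPASS\t.\tGT\t1/1",
    "1\t2000\t.\tC\tT\t50\tPASS\t.\tGT\t0/1",
    "2\t3000\t.\tG\tA\t50\tPASS\t.\tGT:DP\t1|1:30",
]
homozygous_hits = filter_variants(example_vcf)
regional_hits = filter_variants(example_vcf, region=("1", 1, 5000))
```

Here the heterozygous record is dropped by the homozygosity filter, and the region filter additionally removes the variant on chromosome 2.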
As with its predecessor, MutationTaster2 uses a Bayes classi-
fier to generate predictions. Because alterations with different
effects on the protein sequence require different tests, we use
three classification models, designed for alterations that lead
to single amino acid substitutions (‘simple_aae’), involve more
than one amino acid (‘complex_aae’) or are noncoding or syn-
onymous (‘without_aae’). MutationTaster2 was trained and
tested with single base exchanges and short indels, compris-
ing >6,000,000 validated polymorphisms from 1000G and (with
permission from BIOBASE) >100,000 known disease mutations
from HGMD Professional (Supplementary Table 1).
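The idea of routing each alteration to one of three models and then applying a Bayes classifier can be sketched as follows. Several points are assumptions: the routing by number of affected amino acids is a simplification of the model definitions, the classifier shown is a naive Bayes variant with Laplace smoothing, and the single feature and the training rows are invented toy values rather than MutationTaster2 tests or data.

```python
from collections import Counter, defaultdict
from math import exp, log


def choose_model(affected_amino_acids: int) -> str:
    """Pick a classification model from the number of affected amino acids."""
    if affected_amino_acids == 0:
        return "without_aae"   # noncoding or synonymous
    if affected_amino_acids == 1:
        return "simple_aae"    # single amino acid substitution
    return "complex_aae"       # more than one amino acid involved


class NaiveBayes:
    """Categorical naive Bayes with Laplace smoothing (illustrative only)."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.values = defaultdict(set)              # feature -> observed values

    def fit(self, rows, labels):
        for row, label in zip(rows, labels):
            self.class_counts[label] += 1
            for name, value in row.items():
                self.feature_counts[(label, name)][value] += 1
                self.values[name].add(value)

    def predict_proba(self, row):
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, n in self.class_counts.items():
            score = log(n / total)  # log prior
            for name, value in row.items():
                count = self.feature_counts[(label, name)][value]
                score += log((count + 1) / (n + len(self.values[name])))
            log_scores[label] = score
        top = max(log_scores.values())
        unnormalized = {k: exp(v - top) for k, v in log_scores.items()}
        norm = sum(unnormalized.values())
        return {k: v / norm for k, v in unnormalized.items()}


# Invented toy training set with one hypothetical binary feature.
rows = [{"conserved": True}, {"conserved": True}, {"conserved": False},
        {"conserved": False}, {"conserved": False}, {"conserved": True}]
labels = ["disease_causing"] * 3 + ["polymorphism"] * 3
model = NaiveBayes()
model.fit(rows, labels)
posterior = model.predict_proba({"conserved": True})
```

In a three-model design, one such classifier would be trained per model name returned by `choose_model`, each with the tests appropriate to that class of alteration.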
We were able to improve the accuracy in all classification models,
Table 1 | Comparison between MutationTaster2 and other prediction tools
Tool             n      NPV    PPV    Sensitivity  Specificity  Accuracy
PPH2-var         2,200  0.808  0.875  0.789        0.887        0.838
PPH2-div         2,200  0.853  0.827  0.858        0.821        0.840
PROVEAN          2,200  0.798  0.865  0.778        0.878        0.828
SIFT             2,200  0.832  0.854  0.827        0.858        0.843
MutationTaster1  2,200  0.850  0.870  0.846        0.874        0.860
MutationTaster2  2,200  0.886  0.875  0.887        0.874        0.880
Details about the methods and further statistics are presented in Supplementary Methods and at
http://www.mutationtaster.org/info/statistics.html. n, number of cases; NPV, negative predictive value; PPV, positive
predictive value; PPH2-div, PolyPhen-2 with HumDiv classifier; PPH2-var, PolyPhen-2 with HumVar classifier.
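The relationship between the columns of Table 1 can be checked with a short sketch. The confusion-matrix counts are not reported in the letter; they are reconstructed here by rounding the published sensitivity and specificity against the 1,100 disease mutations and 1,100 polymorphisms, on the assumption that disease mutations are the positive class.

```python
def metrics(tp, fp, tn, fn):
    """Standard performance measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


def counts_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Reconstruct (tp, fp, tn, fn) by rounding; an approximation, not reported data."""
    tp = round(sensitivity * n_pos)
    tn = round(specificity * n_neg)
    return tp, n_neg - tn, tn, n_pos - tp


# MutationTaster2 row of Table 1: sensitivity 0.887, specificity 0.874.
tp, fp, tn, fn = counts_from_rates(0.887, 0.874, 1100, 1100)
mt2 = metrics(tp, fp, tn, fn)
print({name: round(value, 3) for name, value in mt2.items()})
```

With these reconstructed counts the sketch returns PPV 0.875, NPV 0.886 and accuracy 0.880, in agreement with the MutationTaster2 row. Because the two classes are of equal size, accuracy is simply the mean of sensitivity and specificity.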