MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification

Efficient analysis of very large amounts of raw data for peptide identification and protein quantification is a principal challenge in mass spectrometry (MS)-based proteomics. Here we describe MaxQuant, an integrated suite of algorithms specifically developed for high-resolution, quantitative MS data. Using correlation analysis and graph theory, MaxQuant detects peaks, isotope clusters and stable amino acid isotope-labeled (SILAC) peptide pairs as three-dimensional objects in m/z, elution time and signal intensity space. By integrating multiple mass measurements and correcting for linear and nonlinear mass offsets, we achieve mass accuracy in the p.p.b. range, a sixfold increase over standard techniques. We increase the proportion of identified fragmentation spectra to 73% for SILAC peptide pairs via unambiguous assignment of isotope and missed-cleavage state and individual mass precision. MaxQuant automatically quantifies several hundred thousand peptides per SILAC-proteome experiment and allows statistically robust identification and quantification of >4,000 proteins in mammalian cell lysates.
Jürgen Cox & Matthias Mann
Data analysis in MS-based proteomics is much more challenging than
for other high-throughput technologies such as microarrays1 and
remains a principal bottleneck in proteomics2,3. In one popular format
of MS-based proteomics, proteins are enzymatically digested to
peptides, which are analyzed online by liquid chromatography (LC)
coupled to electrospray and tandem MS (MS/MS)4. MS spectra
contain peptide mass and intensity information, and the identity of
the peptides is deduced by matching the MS/MS spectra against a
sequence database5,6. Typically, peaks are extracted from raw data, the
peptide mass is estimated from the scan from which the peak was
‘picked’ for sequencing, and the peak files are sent to a search engine.
Results consist of tables of identified proteins. In a quantitative
proteomics experiment using stable isotopes, peptide and protein
ratios are obtained by direct comparison of the signals of the ‘light’
and ‘heavy’ isotope in the same LC run7,8.
There is already a substantial literature on ‘computational proteomics’
(reviewed in refs. 3,9–12). However, these efforts were usually
not directed at high-resolution data of the type readily attainable
today and they do not approach the quality of a skilled human expert.
Here we describe a set of algorithms that efficiently and robustly
extracts information from raw MS data and allows very high peptide
identification rates as well as high-accuracy protein quantification for
several thousand proteins in complex proteomes.
RESULTS
Analysis pipeline
MaxQuant incorporates all steps needed in a computational proteo-
mics platform but currently uses Mascot13 to generate peptide
candidates for MS/MS spectra. Below, we describe the analysis frame-
work and illustrate its performance with SILAC-treated HeLa cells that
were stimulated for 2 h with epidermal growth factor (EGF)14. These
data were obtained by triplicate analysis of 24 peptide fractions from
isoelectric focusing using an LTQ Orbitrap mass spectrometer. We
describe conceptual issues and computational analysis. A detailed
explanation of algorithms is provided in Supplementary Notes online
and their C# source code in Supplementary Data online.
Feature detection and quantification
The high resolution of modern mass spectrometers and the need for
quantification in functional proteomics led us to start the data analysis
with ‘features’ in the MS spectra (mass and intensity of the peptide
peaks) rather than focus on the fragmentation spectra. This is already
commonly done in MS-based biomarker discovery9. In MaxQuant,
peaks are detected in each MS scan by fitting a gaussian peak shape
to the three central raw data points and then assembled into three-
dimensional (3D) peak hills over the m/z-retention time plane
(Fig. 1a–c). Smoothed intensity profiles over retention time are split
at significant local minima. From the centroid masses we obtain a high
precision, intensity-weighted estimate of mass for the 3D peak
(Fig. 1d). For each 3D peak an individual mass precision is calculated
by bootstrap replication (Supplementary Notes).
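The peak-level calculations described above can be sketched as follows. This is a minimal illustration, not the MaxQuant source: the function names, the assumption of equally spaced raw data points, and the choice of 200 bootstrap replicates are ours.

```python
import math
import random


def gaussian_centroid(mz, intensity):
    """Centre of a gaussian through three equally spaced (m/z, intensity) points.

    A gaussian is a parabola in log-intensity, so the apex of the parabola
    through the three log-transformed points gives the peak centre.
    """
    x0, x1, _ = mz
    y0, y1, y2 = (math.log(v) for v in intensity)
    step = x1 - x0
    return x1 + 0.5 * step * (y0 - y2) / (y0 - 2.0 * y1 + y2)


def weighted_mass(centroids, intensities):
    """Intensity-weighted mean of the centroid masses of one 3D peak."""
    total = sum(intensities)
    return sum(m * w for m, w in zip(centroids, intensities)) / total


def bootstrap_precision(centroids, intensities, n_boot=200, seed=0):
    """Standard deviation of the weighted mass over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(centroids)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        estimates.append(
            weighted_mass([centroids[i] for i in idx],
                          [intensities[i] for i in idx])
        )
    mean = sum(estimates) / n_boot
    return math.sqrt(sum((e - mean) ** 2 for e in estimates) / (n_boot - 1))
```

Each 3D peak would pass its per-scan centroids and intensities to `weighted_mass` and `bootstrap_precision`; the resulting precision is what the later steps use as an individual error estimate.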
Received 27 May; accepted 31 October; published online 30 November 2008; doi:10.1038/nbt.1511
Department for Proteomics and Signal Transduction, Max-Planck Institute for Biochemistry, Am Klopferspitz 18, D-82152 Martinsried, Germany. Correspondence should
be addressed to J.C. (cox@biochem.mpg.de) or M.M. (mmann@biochem.mpg.de).
NATURE BIOTECHNOLOGY VOLUME 26 NUMBER 12 DECEMBER 2008 1367
ARTICLES
©2008 Nature Publishing Group http://www.nature.com/naturebiotechnology
Each of the 72 LC-MS runs of the HeLa proteome resulted in
~382,000 3D peaks, on average. It is not trivial to efficiently and
reliably determine isotope patterns, and we employ a graph theoretical
data structure to construct an undirected graph with the 3D peaks as
vertices. An edge is inserted between two peaks when the difference in
mass equals the difference in isotope mass of an average amino acid
(‘averagine’15) within bootstrap errors, with an additional error
tolerance due to unknown atomic composition and when the intensity
profiles have a sufficiently high overlap in retention time. The
resulting graph contains millions of edges, connecting ‘pre-isotope’
patterns, which, however, are not necessarily consistent in terms of
charge state. We then iteratively determine the longest, consistent sub-
graphs. In the LC-MS run in Figure 2, the number of 3D peaks was
317,658, assembling into 31,806 isotope patterns. Thus, isotope
patterns reduce data features tenfold and are a potent noise filter. A
particularly dense region is greatly enlarged in Figure 2c. In this small
region of a few m/z units, three overlapping isotope patterns
are automatically and correctly assigned, despite the overlap caused
by peptides of different charge states (z = 5 versus z = 2) and
near-identical masses of two co-eluting peptides. This would have been
difficult from the MS information alone, even for an expert human
scientist (Fig. 2d).
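The graph construction can be illustrated with a small sketch. It covers only the first stage (edges and connected ‘pre-isotope’ patterns), not the iterative search for charge-consistent subgraphs. The isotope spacing constant, the fixed m/z tolerance (standing in for the bootstrap-derived error), the overlap measure and all names are assumptions made for illustration.

```python
from collections import defaultdict

# Assumed averagine isotope mass difference in Da (illustrative value).
ISOTOPE_SPACING = 1.00286864


def rt_overlap(a, b):
    """Shared elution time as a fraction of the shorter of the two peaks."""
    shared = min(a["rt_end"], b["rt_end"]) - max(a["rt_start"], b["rt_start"])
    shorter = min(a["rt_end"] - a["rt_start"], b["rt_end"] - b["rt_start"])
    return max(shared, 0.0) / shorter if shorter > 0 else 0.0


def build_graph(peaks, max_charge=5, tol=0.005, min_overlap=0.5):
    """Undirected graph: 3D peaks are vertices, isotope-like spacings are edges."""
    edges = defaultdict(set)
    for i, a in enumerate(peaks):
        for j in range(i + 1, len(peaks)):
            b = peaks[j]
            gap = abs(a["mz"] - b["mz"])
            for z in range(1, max_charge + 1):
                close = abs(gap - ISOTOPE_SPACING / z) <= tol
                if close and rt_overlap(a, b) >= min_overlap:
                    edges[i].add(j)
                    edges[j].add(i)
    return edges


def connected_components(n, edges):
    """Group vertices 0..n-1 into connected 'pre-isotope' patterns."""
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            v = stack.pop()
            component.append(v)
            for w in edges.get(v, ()):
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(sorted(component))
    return components
```

Note that for a doubly charged pattern the first and third peaks are also one full isotope spacing apart, so this sketch links them as if singly charged. Edges of that kind are why a separate charge-consistency step is needed.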
In our example we carried out SILAC16 using arginine and lysine.
To detect heavy-light SILAC partners we consider all possible pairs of
isotope patterns. Potential SILAC pairs are first required to have
sufficient intensity correlation over elution time (allowing for some
retention-time shift due to isotope effects) and to have equal charges.
By default we assume at most three labeled amino acids per peptide.
Therefore, pairs could contain lysine (K), arginine (R), KK, KR, RR,
KKK, KKR, KRR and RRR. For each of these cases we convolute the
two measured isotope patterns with the theoretical isotope patterns of
[Figures 1 and 2: panels plot m/z against retention time t, with signal intensity in counts/s. Peptides labeled in Figure 2c: LHHVSSLAWLDEHTLVTTSHDASVK, light, M = 2782.4038, z = 5; VIVPNMEFR, heavy, M = 1103.5798, z = 2; LGINSLQELK, light, M = 1113.6394, z = 2.]
Figure 1 Three-dimensional peak detection.
(a) Two-dimensional (2D) peaks whose intensity
drops to zero on both sides. The centroid mass of
a 2D peak is calculated as a fit of a gaussian
peak shape to the three central raw data points.
(b) Peaks are broken up at local intensity minima.
(c) 2D peaks in adjacent MS scans are assembled
to 3D peak hills over the m/z-retention time
plane. Two peaks in neighboring scans are
connected whenever their centroid m/z positions
are sufficiently close. (d) 3D peak eluting over
1.5 min represented with color-coded intensity,
decreasing from green over yellow to white, in the
mass-retention time plane. Forty-nine centroids
(dotted red line) have been joined to form this 3D
peak. Note that fluctuations in mass become
larger at low abundance. (e) 3D representation of
the same peak. (f) Eleven 3D peaks forming two
isotope patterns. The masses of the upper and
lower isotope patterns are identical. The sixth
peak of the lower isotope pattern has just been
detected, whereas the sixth peak of the upper
isotope pattern has just escaped detection.
Figure 2 Automatic large-scale SILAC pair
detection. (a) Overview of the part of the mass-
retention time plane capturing most of the
peptides in one LC-MS run of an OFFGEL
fraction of HeLa cell lysate. 5,666 SILAC pairs
have been detected in this run and are coded in
different colors. (b) Zoom into the region
indicated by the black rectangle in a. Several
SILAC pairs can be seen with charges ranging
up to five. MS/MS sequencing events are
indicated either by squares, in case they led to
a peptide identification, or by crosses. (c) Zoom
into the region indicated by the black rectangle
in b showing a challenging case for isotope
pattern detection involving three peptides.
Note that MaxQuant correctly assigned the
monoisotopic mass, whereas the instrument
software picked the C13 peak for sequencing.
The heavy-labeled blue peptide has a small
peak at the low-mass side of the monoisotopic
peak because of the usual impurities of the
commercially available heavy amino acids.
(d) The mass spectrum corresponding to the
dotted rectangle in c.
the difference atoms, that is, the atoms that have to be added so that
both peptides would have the same atomic composition. If the mass
differences are within the combined bootstrap error and if there is
sufficient intensity correlation of the two isotope patterns in m/z
dimension, the peaks are associated as a SILAC pair. Figure 2 contains
5,666 SILAC pairs.
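The nine label combinations and the mass-difference test can be sketched as below. The heavy labels are not specified in this section, so the shifts for Lys8 (13C6 15N2) and Arg10 (13C6 15N4) are an assumption, as are the function names and the fixed tolerance that stands in for the combined bootstrap error. The isotope-pattern convolution and the intensity correlation checks are not shown.

```python
from itertools import combinations_with_replacement

# Assumed heavy-label mass shifts in Da (Lys8 and Arg10); illustrative only.
LABEL_SHIFT = {"K": 8.014199, "R": 10.008269}


def label_combinations(max_labels=3):
    """All multisets of up to max_labels labeled residues: K, R, KK, KR, ..."""
    combos = []
    for n in range(1, max_labels + 1):
        for combo in combinations_with_replacement("KR", n):
            combos.append("".join(combo))
    return combos


def match_silac_pair(light_mass, heavy_mass, tol_da=0.01):
    """Label combinations whose total shift explains the observed mass gap."""
    delta = heavy_mass - light_mass
    return [
        combo
        for combo in label_combinations()
        if abs(delta - sum(LABEL_SHIFT[aa] for aa in combo)) <= tol_da
    ]
```

A pair of isotope patterns with equal charge would be tested with `match_silac_pair`; an empty list means no label combination explains the mass difference.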
The resulting isotope patterns are then scaled to each other using all
ratios, starting with a least-square solution and determining the best
median fit iteratively by bisection. This yields the fold-change between
the two SILAC peptides (Supplementary Notes). For triple-labeling
SILAC experiments17 more cases need to be considered but the
procedures are very similar. In each LC-MS run, we normalize peptide
ratios so that the median of their logarithms is zero, which corrects for
unequal protein loading, assuming that the majority of proteins show
no differential regulation.
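The per-run normalization amounts to shifting all log-ratios by their median. A minimal sketch, with a hypothetical function name and base-2 logarithms chosen for convenience (the base does not affect the result):

```python
import math
from statistics import median


def normalize_ratios(ratios):
    """Rescale ratios so that the median of their logarithms is zero."""
    logs = [math.log2(r) for r in ratios]
    shift = median(logs)
    return [2.0 ** (value - shift) for value in logs]
```

For example, ratios of 1, 2 and 4 have a median log2 of 1, so they are rescaled to 0.5, 1 and 2.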
Improving peptide mass accuracy
The peptide mass is calculated as the intensity-weighted average of all
MS peak centroids in the 3D peaks within the isotope patterns
belonging to a SILAC pair or triplet. The statistics of the number of
mass measurements per SILAC peptide is given in Supplementary
Figure 1 online.
We use the several hundred SILAC charge pairs in every LC-MS run
for recalibration without knowing their identity and minimize differ-
ences between mass estimates from different charge states. The
resulting polynomial remaps experimental m/z values to their corrected
values. Nonlinear mass corrections are about 1 p.p.m. for the
LTQ Orbitrap mass spectrometer (Supplementary Fig. 2 online).
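The quantity minimized during recalibration is the disagreement between mass estimates of the same peptide from different charge states. The sketch below shows only how such a disagreement could be computed, not the polynomial fit itself; the function names are hypothetical and protonated ions are assumed.

```python
PROTON_MASS = 1.007276466  # Da


def neutral_mass(mz, charge):
    """Neutral peptide mass from an observed m/z of a protonated ion."""
    return charge * (mz - PROTON_MASS)


def ppm_difference(mass_a, mass_b):
    """Relative difference of mass_a from mass_b in parts per million."""
    return (mass_a - mass_b) / mass_b * 1e6
```

For a peptide seen at z = 2 and z = 3, `ppm_difference(neutral_mass(mz2, 2), neutral_mass(mz3, 3))` would be one residual entering the fit.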
We next use the two masses of peptide charge pairs to derive an
estimate of the mass accuracy (deviation from the true value) from the
estimate of the mass precision (repeatability of the measurement) by
requiring that mass estimates are within the error range. We then scale
the bootstrap errors by the required factor, which is between two and three in
our data. As in similar cases18, this factor is
likely due to autocorrelation between the
centroid determinations in subsequent spectra.
To correct for global expansion or contraction
of the mass scale we use well-identified pep-
tides and minimize the mass deviation of
these peptides weighted by their individual
mass precisions.
We plotted the corrected mass precisions for
the 477,511 SILAC pairs in our data set as a
function of peptide signal (Fig. 3a). Mass pre-
cisions are extremely high (p.p.b.) and roughly
proportional to one over the square root of the
peptide signal. Figure 3b shows that 50% of
the peaks have corrected mass precisions better
than 393 p.p.b. In agreement with this, the
actual mass deviations of all identified peptides
(measured minus calculated mass) have a s.d.
of 409 p.p.b. and average absolute mass devia-
tion (average of the absolute value of the
difference between measured and calculated
masses) of 278 p.p.b. (Fig. 3c).
Peptide mass estimates are usually taken
from the MS peak that leads to selecting the
peptide for fragmentation (Fig. 3d). Average
absolute mass accuracy in this standard
approach is 1.8 p.p.m. and s.d. is 2.5 p.p.m.
Thus mass accuracy measured as s.d., a key
performance parameter in proteomics19, improved sixfold using our
computational approach. We suspect the improvement would have
been even greater if we had not used the lock mass option20. Worse,
even including the lock mass, the normal approach would have
necessitated a maximum allowed mass deviation of 10 p.p.m. for all
peptides, whereas searches are performed with much tighter and
individualized mass tolerances in MaxQuant.
Peptide and protein identification
Because the SILAC state of most isotope patterns is known beforehand,
we can treat the label modifications as fixed in the database search. By
counting the number of arginines and lysines, the SILAC state
distinguishes limit tryptic peptides from incompletely cleaved ones.
This a priori information decreases the search space about tenfold. For
fragmentation spectra not associated with a SILAC pair, a conventional
database search is performed. After a database search, the list of top ten
sequences matching a fragmentation spectrum is sorted according to
their peptide score or P-score21 and filtered for consistency with
a priori information, retaining the best scoring one. We allow a
deviation between the measured and calculated mass of four s.d. of
the individual bootstrap error for each peptide.
We use a database containing all true protein sequences, concatenated
with reversed nonsense versions of these sequences22,23. To avoid
spurious correlations because half of the reversed tryptic peptides have
the same mass as the forward sequence, we also swapped every
arginine and lysine with the preceding amino acid in the reversed
sequences. This approach still retains the local amino acid relations—
leading to the same length and mass distributions of peptides
(Supplementary Notes).
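One reading of the decoy construction is sketched below: reverse the protein sequence, then swap each lysine or arginine with the residue before it. The left-to-right processing order and the handling of consecutive K/R residues are our assumptions; the Supplementary Notes are the authoritative description.

```python
def make_decoy(sequence):
    """Reverse a protein sequence, then swap each K/R with its predecessor."""
    residues = list(reversed(sequence))
    for i in range(1, len(residues)):
        if residues[i] in "KR":
            residues[i - 1], residues[i] = residues[i], residues[i - 1]
    return "".join(residues)
```

For the toy sequence ACDKEFGR, plain reversal gives RGFEKDCA, whose tryptic peptide GFEK has the same composition as the forward peptide EFGR up to the terminal residue. After the swap the decoy is RGFKEDCA, which breaks that correspondence while keeping length and amino acid composition.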
To assess the likelihood of false identification we generate two lists
of peptides, one for the hits in the forward sequences and one in the
reversed sequences. We construct two histograms by gaussian kernel
smoothing (Fig. 4). They can be interpreted as approximations to the
[Figure 3: panels a–d. Axes: intensity (1e4 to 1e8) and counts/bin against corrected mass precision (p.p.m.) and measured – calculated mass (p.p.m.). Histogram regions are annotated with the percentage of the distribution they contain.]
Figure 3 Accurate masses and individual peptide mass errors. (a) Mass precision corrected for
autocorrelation of >477,000 SILAC pairs as a function of integrated signal intensity. Precision is
inversely proportional to the square root of peptide intensity. (b) Same data as a but binned by
corrected mass precision. (c) Mass deviation of all identified peptides. (d) Mass deviation without
MaxQuant: precursor masses were taken directly from instrument software (‘monoisotopic M/Z’). The
scaled distribution from c is shown in red for comparison.
total and the conditional probability densities $p(s, L)$ and $p(s, L \mid X = \mathrm{false})$,
where the Boolean variable $X$ indicates ‘true’ (forward) or ‘false’ (reverse) sequences,
$s$ is the peptide database score and $L$ the
peptide length. The probability of a false hit, given the peptide
identification score and the length of the peptide, is then
\[ p(X = \mathrm{false} \mid s, L) = \frac{p(s, L \mid X = \mathrm{false})\, p(X = \mathrm{false})}{p(s, L)}, \]
the posterior error probability (PEP) of each individual peptide. We
use the PEP only as input for calculating the false-discovery rate
(FDR) below. The a priori probability p(X = false) is a constant with no
effect on the final list of accepted peptides at a given FDR. Longer
peptides, which are less frequent in the database, are automatically
accepted with lower scores.
To determine a cutoff score for a specific
FDR, we sort all peptide identifications, from
the forward and the reverse database, by their
PEP, starting with the best. Peptides are
accepted until 1% of reverse hits/forward hits
has accumulated. The fraction of wrong iden-
tifications in the forward database is then 1%
as well.
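The target-decoy cutoff can be sketched as follows. This version accepts the longest PEP-sorted prefix in which reverse hits divided by forward hits stays at or below the threshold, which is one reasonable reading of the procedure; the function name and the input layout are illustrative assumptions.

```python
def accept_at_fdr(hits, fdr=0.01):
    """Return the forward hits accepted at the given FDR.

    hits: iterable of (pep, is_reverse) tuples, one per peptide identification.
    """
    ranked = sorted(hits, key=lambda hit: hit[0])  # best (lowest) PEP first
    forward = reverse = cutoff = 0
    for n, (_, is_reverse) in enumerate(ranked, start=1):
        if is_reverse:
            reverse += 1
        else:
            forward += 1
        if forward and reverse / forward <= fdr:
            cutoff = n  # longest prefix still within the allowed FDR
    return [hit for hit in ranked[:cutoff] if not hit[1]]
```

The same routine would apply at the protein level, with protein PEPs in place of peptide PEPs.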
In this run, 11,299 sequencing events led to
7,307 peptide identifications (identification
rate of 64.7%, Fig. 5). Sequencing events
associated with SILAC pairs have identifica-
tion rates of 84.4%. Identifications (red
squares) cluster in particular regions of the
contour plot (Fig. 5a), with characteristic
polymer patterns devoid of peptide identifi-
cations (Fig. 5b) and fragmentation events in
peptide-rich regions almost uniformly identi-
fied (Fig. 5c). Note that many SILAC pairs
were not targeted for sequencing at all (32.3%
in this run).
We next assemble peptide hits into protein
hits, a nontrivial step in shotgun proteo-
mics24. Whenever the set of identified
peptides in one protein is equal to or com-
pletely contained in the set of identified peptides of another protein
these two proteins are joined in a protein group. Shared peptides are
most parsimoniously associated with the group with the highest
number of identified peptides (‘razor’ peptides24) but remain in all
groups where they occur. Protein quantification may then be per-
formed based only on unique peptides, including razor peptides, or
using all peptides. By default we use unique and razor peptides as a
compromise between unequivocal peptide assignment and most-
accurate quantification.
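Protein grouping and razor-peptide assignment can be sketched with plain set operations. This is a simplified illustration under our own assumptions: proteins are visited from most to fewest peptides, a protein whose peptides are contained in several groups joins the first one found, and ties between equally large groups go to the earlier group.

```python
def group_proteins(protein_peptides):
    """Merge proteins whose peptide set is contained in another protein's set.

    protein_peptides: dict mapping protein id -> set of identified peptides.
    """
    ordered = sorted(protein_peptides,
                     key=lambda p: (-len(protein_peptides[p]), p))
    groups = []
    for protein in ordered:
        peptides = protein_peptides[protein]
        for group in groups:
            if peptides <= group["peptides"]:  # equal to or contained in
                group["proteins"].append(protein)
                break
        else:
            groups.append({"proteins": [protein], "peptides": set(peptides)})
    return groups


def razor_assignment(groups):
    """Map each peptide to the index of the group with the most peptides."""
    razor = {}
    for index, group in enumerate(groups):
        for peptide in group["peptides"]:
            best = razor.get(peptide)
            if best is None or len(group["peptides"]) > len(groups[best]["peptides"]):
                razor[peptide] = index
    return razor
```

A shared peptide remains listed in every group that contains it; `razor_assignment` only records which group it would count toward when quantifying with unique and razor peptides.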
We assign to each protein group a PEP by multiplying their peptide
PEPs. Only peptides with distinct sequences and only the highest-
scoring identified spectra are used to avoid bias due to dependent
peptides. Similarly to the peptide PEP, the protein PEP serves to sort
the list of hits from forward and reverse databases. Using a protein
FDR of 1% and requiring that each protein group contain a unique
peptide, we identified 4,149 proteins in the cell line proteome
(Supplementary Table 1 online).
[Figure 4: eight panels for peptide lengths L = 6, 8, 10, 12, 14, 16, 20 and 24. x-axis: P-score (0 to 300); y-axis: counts/bin.]
Figure 4 Peptide score (P-score) distributions. The panels show the distributions of scores in the
forward (blue) and reverse (red) database with peptide length (L) as the parameter. MaxQuant filters
potential hits by a priori information, which moves the reverse hit distribution far to the left. These
distributions are used to calculate the false-positive rate for peptide identification as a function of
peptide length.
[Figure 5: panels a–c. x-axis: m/z; y-axis: retention time t (min).]
Figure 5 High rate of identified MS/MS spectra. MS/MS sequencing events are indicated in the mass-retention time plane (contour plot). Identified and
unidentified MS/MS spectra are represented by red squares and blue crosses, respectively. (a) Peptides elute between 40 and 120 min and peptide
identifications are shifted to higher m/z values at later points in the gradient. (b) Left rectangle of a. In this region, characteristic polymer patterns that do
not lead to peptide identifications are prevalent. (c) In contrast, in a peptide-rich region of the contour plot (right rectangle in a), almost all fragmentation
events lead to successful peptide identification.
Protein quantification
Many of the isotope patterns that have not been assembled into SILAC
pairs are nevertheless identified by database search. For these peptides
the m/z-elution time shapes of the 3D peaks belonging to the
identified SILAC version are translated to the location of the
missing SILAC partner and after integration of intensities, ratios are
calculated in the same way as for SILAC pairs that were detected
before identification.
Protein ratios are calculated as the median of all SILAC peptide
ratios, minimizing the effect of outliers. We normalize the protein
ratios to correct for unequal protein amounts.
We next calculate an outlier significance score for log protein
ratios (significance A). To create a robust and asymmetrical estimate
of the s.d. of the main distribution we calculate the 15.87, 50 and
84.13 percentiles $r_{-1}$, $r_0$ and $r_1$. $r_1 - r_0$ and $r_0 - r_{-1}$ are right- and left-sided
robust s.d. For a normal distribution, these would be equal
to each other and to the conventional definition of an s.d. A
suitable measure for a ratio $r > r_0$ being significantly far away from
the main distribution is the distance to $r_0$ measured in terms of the
right s.d.
\[ z = \frac{r - r_0}{r_1 - r_0} \]
As a P-value for detection of significant outlier ratios we define
\[ \text{significance A} = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{z}{\sqrt{2}}\right) = \frac{1}{\sqrt{2\pi}} \int_z^{\infty} e^{-t^2/2}\, dt \]
which is the probability of obtaining a log-ratio of at least this mag-
nitude under the null hypothesis that the distribution of log-ratios has
normal upper and lower tails (Supplementary Fig. 3 online).
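Significance A can be computed directly from the three percentiles. In the sketch below, the linear-interpolation percentile and the mirrored treatment of ratios below the median (using the left-sided s.d.) are our assumptions; the text defines z explicitly only for ratios above the median. The input is assumed to have a nonzero spread on both sides.

```python
import math


def percentile(sorted_values, q):
    """Percentile with linear interpolation between ranks, q in [0, 100]."""
    position = (len(sorted_values) - 1) * q / 100.0
    lower, upper = math.floor(position), math.ceil(position)
    fraction = position - lower
    return sorted_values[lower] * (1.0 - fraction) + sorted_values[upper] * fraction


def significance_a(log_ratios):
    """One-sided outlier P-value for each log protein ratio."""
    values = sorted(log_ratios)
    r_minus, r_zero, r_plus = (percentile(values, q) for q in (15.87, 50.0, 84.13))
    result = []
    for r in log_ratios:
        if r > r_zero:
            z = (r - r_zero) / (r_plus - r_zero)    # right-sided robust s.d.
        else:
            z = (r_zero - r) / (r_zero - r_minus)   # left-sided robust s.d.
        result.append(0.5 * math.erfc(z / math.sqrt(2.0)))
    return result
```

A ratio at the median gets a value of 0.5, and values shrink toward zero as the ratio moves away from the main distribution.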
For highly abundant proteins the statistical spread of unregulated
proteins is much more focused than for low abundance ones8 (Fig. 6).
To capture this effect, we define another quantity, significance B,
which is calculated only on the protein subsets obtained by intensity
binning. We define bins of equal occupancy such that each contains at
least 300 proteins.
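The intensity binning behind significance B could look like the sketch below, which splits proteins into as many equally filled bins as possible while keeping at least 300 per bin. The function name and the exact splitting rule are assumptions; within each bin, the significance A calculation would then be applied to that bin's log-ratios.

```python
def equal_occupancy_bins(intensities, min_size=300):
    """Indices of proteins grouped into intensity bins of near-equal size."""
    order = sorted(range(len(intensities)), key=lambda i: intensities[i])
    n_bins = max(1, len(order) // min_size)
    bins = []
    for b in range(n_bins):
        start = b * len(order) // n_bins
        end = (b + 1) * len(order) // n_bins
        bins.append(order[start:end])
    return bins
```

With fewer than 600 proteins this falls back to a single bin, in which case significance B reduces to significance A.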
We quantified 4,100 proteins, comparable to the number of
significant messages in a microarray experiment on the same cell
type14 (Supplementary Table 1). If a minimum of three quantifica-
tion events (three SILAC pairs) is required, quantification becomes
very reliable25,26 because an outlier ratio has no effect on the median.
Strikingly, 99.3% of proteins were within 50% of the one-to-one ratio.
This implies excellent SILAC partner identification as wrongly part-
nered peptides would have ratios strongly deviating from 1. We found
48 proteins to be significantly upregulated based on significance B with
a Benjamini-Hochberg27 FDR <5% (Supplementary Table 2 online).
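For reference, the standard Benjamini-Hochberg step-up procedure is short enough to state in code. This is the textbook form, applied here to a list of P-values such as significance B; the function name is ours.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * fdr / m:
            k_max = rank  # largest rank that passes its threshold
    return sorted(order[:k_max])
```

All hypotheses ranked at or above the largest passing rank are rejected, even if some of them individually miss their own threshold.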
Notably, two of the most heavily upregulated proteins after 2 h of
EGF stimulation were the transcription factors JunB and the orphan
nuclear receptor NR4A1, also termed early-response protein NAK1
(Fig. 6). Both are known to be regulated by growth stimuli. Among
the most upregulated proteins in Figure 6 there is a conserved dual-
specificity tyrosine-serine phosphatase (MTM1), widely studied in
relation to myotubular myopathy28 and, like PTEN, a lipid phospha-
tase29. The completely uncharacterized protein C1orf52 is tightly
associated with the tumor suppressor BCL10 and therefore also called
BAG for BCL10-associated gene. Neither of these proteins was
known to be induced upon EGF stimulation. Many of the other
significantly regulated proteins also have potential connection to
growth factor signaling (Supplementary Table 1). Proteins encoded
by genes having regulatory binding sites for SREBP-1 are shown to be
significantly upregulated when analyzed by TRANSFAC30. SREBP-1
likely mediates the effects of EGF stimulation on cancer-relevant
proteins like FAS31.
DISCUSSION
We have introduced a set of computational proteomics algorithms
with several useful features. Efficient extraction of mass information
allows us to search protein databases with maximum allowed mass
deviations that adjust themselves to the precision with which the
peptide is measured. The mass accuracies achieved here are the highest
yet reported in large-scale proteomics32 and sharply limit the number
of candidate peptides in database searches. With low-resolution data,
only a few percent of fragmentation events lead to successful identi-
fication33, whereas the mass accuracy and feature extraction in
MaxQuant allow 73% of the fragmentation events associated with
SILAC peptide pairs to be identified. Thus, standard ion trap frag-
mentation is extremely information rich, and nontryptic and modified
peptides do not constitute the majority of fragmented peptides. The
MaxQuant algorithms recently enabled comprehensive quantification
of the yeast proteome34. Although we identified essentially the
complete proteome, we found only three (<1%) of the 814 ‘dubious’
open reading frames (ORFs) (http://www.yeastgenome.org/), which
are not expected to be expressed from evidence such as comparative
genome sequencing. This provides independent evidence that our
FDR estimates of peptide and protein identifications are very stringent
(Supplementary Fig. 4 online). Much higher identification rates
among dubious ORFs (3%) were found in genome-wide tagging
experiments35,36. Likewise, aggregate data from yeast proteome
resources cover 12% of these dubious ORFs37, the same percentage
as their occurrence in the genome.
We have already applied MaxQuant to quantify >5,000 proteins in
the mouse stem cell proteome38 and several other proteomes in similar
depth. We conclude that the computational tools for proteome-wide
quantification are now in hand. With further advances in instrumen-
tation, particularly in the dynamic range of measurements39,40,
proteomics should be suitable for routine ‘functional genomics’
experiments, for which microarrays have so far been the only option.
[Figure 6: scatter plot. x-axis: protein ratio (0.1 to 10); y-axis: intensity (1e6 to 1e10). Labeled proteins: JUNB, C1orf52, NR4A1, OBSCN, MTM1, KIAA1429.]
Figure 6 Proteome-wide accurate quantification and significance.
Normalized protein ratios are plotted against summed peptide intensities.
The spread of the cloud is lower at high abundance, indicating that
quantification is more precise. The data points are colored by their
‘significance B’, with blue crosses having values >0.05, red squares
between 0.05 and 0.01, yellow diamonds between 0.01 and 0.001 and
green circles <0.001.
METHODS
Software development and availability of MaxQuant. MaxQuant is developed
for the .NET framework and written in the C# language. The interactive 3D
data viewer was developed on the basis of DirectX. MaxQuant executables are
available via http://www.maxquant.org/, whereas the source code of algorithms
is available in Supplementary Data. It runs on Windows desktop computers
and is compatible with XP and Vista. Processing time is currently about 20 min
per raw file and per processing core. Detailed description of the algorithms used
in MaxQuant can be found in Supplementary Notes.
Data processing. The Mascot program version 2.2.04 was used to generate up
to ten peptide sequence candidates per fragmentation spectrum (Matrix
Science), and International Protein Index (IPI) version 3.48 was searched.
The database search is done with an initial maximum allowed mass deviation of
7 p.p.m. for the peptide mass and 0.5 m/z units for fragmentation peaks, which
is optimal for linear ion trap data41.
Gene Ontology, Pfam domain and TRANSFAC overrepresentation analysis.
P-values for overrepresentation in regulated proteins were calculated with the
Wilcoxon-Mann-Whitney test on the continuous significance B values calcu-
lated by the MaxQuant software.
Data used in analysis. The data used in this analysis have been published in
reference 14. SILAC was performed as described42. Briefly, HeLa cells were
stimulated with EGF for 2 h and mass spectrometric analysis performed as
described20. ‘Heavy’ (EGF stimulated) and ‘light’ (control) SILAC cell popula-
tions were combined and lysed. Proteins were digested in solution with trypsin,
and the resulting peptides were separated by isoelectric focusing into 24 fractions
with an Agilent 3100 OFFGEL Fractionator. Each fraction was purified with
StageTips43 and analyzed by liquid chromatography combined with electrospray
tandem mass spectrometry on a Thermo Scientific LTQ Orbitrap mass spectro-
meter with lock mass calibration20. The experiment was performed in triplicate.
Raw mass spectrometric data files and evidence tables containing
peptide and protein data can be downloaded from Tranche at
http://tranche.proteomecommons.org/.
Note: Supplementary information is available on the Nature Biotechnology website.
ACKNOWLEDGMENTS
We thank all the other members of the Proteomics and Signal Transduction
group for help with the development of MaxQuant. Shubin Ren helped in
developing the 3D data viewer used in MaxQuant. Nina Hubner measured the
data used in this analysis. This work was supported by the Max-Planck Society
and by the 6th Framework Program of the European Union (Interaction
Proteome LSHG-CT-2003-505520 and HEROIC LSHG-CT-2005-018883).
Published online at http://www.nature.com/naturebiotechnology/
Reprints and permissions information is available online at
http://npg.nature.com/reprintsandpermissions/
1. Allison, D.B., Cui, X., Page, G.P. & Sabripour, M. Microarray data analysis: from disarray
to consolidation and consensus. Nat. Rev. Genet. 7, 55–65 (2006).
2. Patterson, S.D. & Aebersold, R.H. Proteomics: the first decade and beyond. Nat. Genet.
33 Suppl, 311–323 (2003).
3. Nesvizhskii, A.I., Vitek, O. & Aebersold, R. Analysis and validation of proteomic data
generated by tandem mass spectrometry. Nat. Methods 4, 787–797 (2007).
4. Aebersold, R. & Mann, M. Mass spectrometry-based proteomics. Nature 422, 198–207
(2003).
5. Steen, H. & Mann, M. The ABC’s (and XYZ’s) of peptide sequencing. Nat. Rev. Mol. Cell
Biol. 5, 699–711 (2004).
6. Sadygov, R.G., Cociorva, D. & Yates, J.R. III. Large-scale database searching using
tandem mass spectra: looking up the answer in the back of the book. Nat. Methods 1,
195–202 (2004).
7. Ong, S.E. & Mann, M. Mass spectrometry-based proteomics turns quantitative. Nat.
Chem. Biol. 1, 252–262 (2005).
8. Bantscheff, M., Schirle, M., Sweetman, G., Rick, J. & Kuster, B. Quantitative mass
spectrometry in proteomics: a critical review. Anal. Bioanal. Chem. 389, 1017–1031
(2007).
9. Listgarten, J. & Emili, A. Statistical and computational methods for comparative
proteomic profiling using liquid chromatography-tandem mass spectrometry. Mol.
Cell. Proteomics 4, 419–434 (2005).
10. Colinge, J. & Bennett, K.L. Introduction to computational proteomics. PLoS Comput.
Biol. 3, e114 (2007).
11. Matthiesen, R. Methods, algorithms and tools in computational proteomics: a practical
point of view. Proteomics 7, 2815–2832 (2007).
12. Mead, J.A., Shadforth, I.P. & Bessant, C. Public proteomic MS repositories and
pipelines: available tools and biological applications. Proteomics 7, 2769–2786
(2007).
13. Perkins, D.N., Pappin, D.J., Creasy, D.M. & Cottrell, J.S. Probability-based protein
identification by searching sequence databases using mass spectrometry data.
Electrophoresis 20, 3551–3567 (1999).
14. Cox, J. & Mann, M. Is proteomics the new genomics? Cell 130, 395–398
(2007).
15. Senko, M.W., Beu, S.C. & McLafferty, F.W. Determination of monoisotopic masses and
ion populations for large biomolecules from resolved isotopic distributions. J. Am. Soc.
Mass Spectrom. 6, 229–233 (1995).
16. Ong, S.E. et al. Stable isotope labeling by amino acids in cell culture, SILAC, as a
simple and accurate approach to expression proteomics. Mol. Cell. Proteomics 1,
376–386 (2002).
17. Blagoev, B., Ong, S.E., Kratchmarova, I. & Mann, M. Temporal analysis of
phosphotyrosine-dependent signaling networks by quantitative proteomics. Nat.
Biotechnol. 22, 1139–1145 (2004).
18. Sokal, A.D. Monte Carlo Methods in Statistical Physics: Foundations and New
Algorithms (Lausanne, Switzerland, 1996).
19. Zubarev, R. & Mann, M. On the proper use of mass accuracy in proteomics. Mol. Cell.
Proteomics 6, 377–381 (2007).
20. Olsen, J.V. et al. Parts per million mass accuracy on an Orbitrap mass spectrometer via
lock mass injection into a C-trap. Mol. Cell. Proteomics 4, 2010–2021 (2005).
21. Olsen, J.V. & Mann, M. Improved peptide identification in proteomics by two
consecutive stages of mass spectrometric fragmentation. Proc. Natl. Acad. Sci. USA 101,
13417–13422 (2004).
22. Elias, J.E. & Gygi, S.P. Target-decoy search strategy for increased confidence in
large-scale protein identifications by mass spectrometry. Nat. Methods 4, 207–214
(2007).
23. Käll, L., Storey, J.D., MacCoss, M.J. & Noble, W.S. Assigning significance to peptides
identified by tandem mass spectrometry using decoy databases. J. Proteome Res. 7,
29–34 (2008).
24. Nesvizhskii, A.I. & Aebersold, R. Interpretation of shotgun proteomic data: the protein
inference problem. Mol. Cell. Proteomics 4, 1419–1440 (2005).
25. Selbach, M. et al. Widespread changes in protein synthesis induced by microRNAs.
Nature 455, 58–63 (2008).
26. Bonaldi, T. et al. Combined use of RNAi and quantitative proteomics to study gene
function in Drosophila. Mol. Cell 31, 762–772 (2008).
27. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and
powerful approach to multiple testing. J. R. Statist. Soc. B 57, 289–300 (1995).
28. Laporte, J. et al. MTM1 mutations in X-linked myotubular myopathy. Hum. Mutat. 15,
393–409 (2000).
29. Wishart, M.J. & Dixon, J.E. PTEN and myotubularin phosphatases: from
3-phosphoinositide dephosphorylation to disease. Trends Cell Biol. 12, 579–585 (2002).
30. Matys, V. et al. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic
Acids Res. 31, 374–378 (2003).
31. Swinnen, J.V. et al. Stimulation of tumor-associated fatty acid synthase expression by
growth factor activation of the sterol regulatory element-binding protein pathway.
Oncogene 19, 5173–5181 (2000).
32. Liu, T., Belov, M.E., Jaitly, N., Qian, W.J. & Smith, R.D. Accurate mass measurements
in proteomics. Chem. Rev. 107, 3621–3653 (2007).
33. Kuster, B., Schirle, M., Mallick, P. & Aebersold, R. Scoring proteomes with proteotypic
peptide probes. Nat. Rev. Mol. Cell Biol. 6, 577–583 (2005).
34. de Godoy, L.M. et al. Comprehensive mass-spectrometry-based proteome quantification
of haploid versus diploid yeast. Nature 455, 1251–1254 (2008).
35. Huh, W.K. et al. Global analysis of protein localization in budding yeast. Nature 425,
686–691 (2003).
36. Ghaemmaghami, S. et al. Global analysis of protein expression in yeast. Nature 425,
737–741 (2003).
37. King, N.L. et al. Analysis of the Saccharomyces cerevisiae proteome with PeptideAtlas.
Genome Biol. 7, R106 (2006).
38. Graumann, J. et al. SILAC-labeling and proteome quantitation of mouse embryonic
stem cells to a depth of 5111 proteins. Mol. Cell. Proteomics 7, 672–683 (2008).
39. Eriksson, J. & Fenyo, D. Improving the success rate of proteome analysis by modeling
protein-abundance distributions and experimental designs. Nat. Biotechnol. 25,
651–655 (2007).
40. Mann, M. & Kelleher, N.L. Special feature: precision proteomics: The case for high
resolution and high mass accuracy. Proc. Natl. Acad. Sci. USA published online,
doi:10.1073/pnas.0800788105 (25 September 2008).
41. Cox, J., Hubner, N.C. & Mann, M. How much peptide sequence information is
contained in ion trap tandem mass spectra? J. Am. Soc. Mass Spectrom. published
online, doi:10.1016/j.jasms.2008.07.024 (7 August 2008).
42. Ong, S.E. & Mann, M. A practical recipe for stable isotope labeling by amino acids in
cell culture (SILAC). Nat. Protocols 1, 2650–2660 (2006).
43. Rappsilber, J., Mann, M. & Ishihama, Y. Protocol for micro-purification, enrichment,
pre-fractionation and storage of peptides for proteomics using StageTips. Nat. Protocols
2, 1896–1906 (2007).
1372 VOLUME 26 NUMBER 12 DECEMBER 2008 NATURE BIOTECHNOLOGY
©2008 Nature Publishing Group http://www.nature.com/naturebiotechnology