
MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification

December 2008
Nature Biotechnology 26(12):1367-72


DOI:10.1038/nbt.1511


Authors:

Juergen Cox

Max Planck Institute of Biochemistry

Efficient analysis of very large amounts of raw data for peptide identification and protein quantification is a principal challenge in mass spectrometry (MS)-based proteomics. Here we describe MaxQuant, an integrated suite of algorithms specifically developed for high-resolution, quantitative MS data. Using correlation analysis and graph theory, MaxQuant detects peaks, isotope clusters and stable amino acid isotope-labeled (SILAC) peptide pairs as three-dimensional objects in m/z, elution time and signal intensity space. By integrating multiple mass measurements and correcting for linear and nonlinear mass offsets, we achieve mass accuracy in the p.p.b. range, a sixfold increase over standard techniques. We increase the proportion of identified fragmentation spectra to 73% for SILAC peptide pairs via unambiguous assignment of isotope and missed-cleavage state and individual mass precision. MaxQuant automatically quantifies several hundred thousand peptides per SILAC-proteome experiment and allows statistically robust identification and quantification of >4,000 proteins in mammalian cell lysates.



Jürgen Cox & Matthias Mann


Data analysis in MS-based proteomics is much more challenging than for other high-throughput technologies such as microarrays¹ and remains a principal bottleneck in proteomics²,³. In one popular format of MS-based proteomics, proteins are enzymatically digested to peptides, which are analyzed online by liquid chromatography (LC) coupled to electrospray and tandem MS (MS/MS)⁴. MS spectra contain peptide mass and intensity information, and the identity of the peptides is deduced by matching the MS/MS spectra against a sequence database⁵,⁶. Typically, peaks are extracted from raw data, the peptide mass is estimated from the scan from which the peak was ‘picked’ for sequencing and the peak files are sent to a search engine. Results consist of tables of identified proteins. In a quantitative proteomics experiment using stable isotopes, peptide and protein ratios are obtained by direct comparison of the signals of the ‘light’ and ‘heavy’ isotope in the same LC run⁷,⁸.

There is already a substantial literature on ‘computational proteomics’ (reviewed in refs. 3,9–12). However, these efforts were usually not directed at high-resolution data of the type readily attainable today and they do not approach the quality of a skilled human expert. Here we describe a set of algorithms that efficiently and robustly extracts information from raw MS data and allows very high peptide identification rates as well as high-accuracy protein quantification for several thousand proteins in complex proteomes.

RESULTS

Analysis pipeline

MaxQuant incorporates all steps needed in a computational proteomics platform but currently uses Mascot¹³ to generate peptide candidates for MS/MS spectra. Below, we describe the analysis framework and illustrate its performance with SILAC-treated HeLa cells that were stimulated for 2 h with epidermal growth factor (EGF)¹⁴. These data were obtained by triplicate analysis of 24 peptide fractions from isoelectric focusing using an LTQ Orbitrap mass spectrometer. We describe conceptual issues and computational analysis. A detailed explanation of algorithms is provided in Supplementary Notes online and their C# source code in Supplementary Data online.

Feature detection and quantification

The high resolution of modern mass spectrometers and the need for quantification in functional proteomics led us to start the data analysis with ‘features’ in the MS spectra (mass and intensity of the peptide peaks) rather than focus on the fragmentation spectra. This is already commonly done in MS-based biomarker discovery⁹. In MaxQuant, peaks are detected in each MS scan by fitting a gaussian peak shape to the three central raw data points and then assembled into three-dimensional (3D) peak hills over the m/z-retention time plane (Fig. 1a–c). Smoothed intensity profiles over retention time are split at significant local minima. From the centroid masses we obtain a high-precision, intensity-weighted estimate of mass for the 3D peak (Fig. 1d). For each 3D peak an individual mass precision is calculated by bootstrap replication (Supplementary Notes).
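The three-point gaussian fit can be illustrated with a minimal Python sketch. It assumes that the fit is done by passing a parabola through the log-intensities (which is exact, because the logarithm of a gaussian is a parabola); the function name `gaussian_centroid` and the example values are hypothetical and are not taken from the MaxQuant source code.

```python
import math


def gaussian_centroid(mz, intensity):
    """Centroid m/z of the gaussian through three (m/z, intensity) points.

    Illustrative only: the log of a gaussian is a parabola, so the centroid
    is the vertex of the parabola through the three log-intensities.
    """
    x0, x1, x2 = mz
    y0, y1, y2 = (math.log(v) for v in intensity)
    d01 = (y1 - y0) / (x1 - x0)          # first divided differences
    d12 = (y2 - y1) / (x2 - x1)
    curvature = (d12 - d01) / (x2 - x0)  # quadratic coefficient
    if curvature >= 0:
        raise ValueError("points do not describe a peak")
    # Vertex of the Newton-form parabola through the three points.
    return 0.5 * (x0 + x1) - d01 / (2.0 * curvature)


# Hypothetical raw data points sampled from a gaussian centred at 558.0123.
true_mz, sigma = 558.0123, 0.01
grid = (558.00, 558.01, 558.02)
signal = [1e6 * math.exp(-((x - true_mz) ** 2) / (2 * sigma ** 2)) for x in grid]
centroid = gaussian_centroid(grid, signal)
```

For noise-free gaussian data the sketch recovers the true centroid to within floating-point error; with real data the three points only approximate a gaussian, which is one reason the per-scan centroids are subsequently averaged over the 3D peak.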

Received 27 May; accepted 31 October; published online 30 November 2008; doi:10.1038/nbt.1511

Department for Proteomics and Signal Transduction, Max-Planck Institute for Biochemistry, Am Klopferspitz 18, D-82152 Martinsried, Germany. Correspondence should be addressed to J.C. (cox@biochem.mpg.de) or M.M. (mmann@biochem.mpg.de).

NATURE BIOTECHNOLOGY VOLUME 26 NUMBER 12 DECEMBER 2008 1367

ARTICLES

Each of the 72 LC-MS runs of the HeLa proteome resulted in ~382,000 3D peaks, on average. It is not trivial to efficiently and reliably determine isotope patterns, and we employ a graph theoretical data structure to construct an undirected graph with the 3D peaks as vertices. An edge is inserted between two peaks when the difference in mass equals the difference in isotope mass of an average amino acid (‘averagine’¹⁵) within bootstrap errors, with an additional error tolerance due to unknown atomic composition, and when the intensity profiles have a sufficiently high overlap in retention time. The resulting graph contains millions of edges, connecting ‘pre-isotope’ patterns, which, however, are not necessarily consistent in terms of charge state. We then iteratively determine the longest, consistent subgraphs. In the LC-MS run in Figure 2, the number of 3D peaks was 317,658, assembling into 31,806 isotope patterns. Thus, isotope patterns reduce data features tenfold and are a potent noise filter. A particularly dense region is greatly enlarged in Figure 2c. In this small region of a few m/z units, three overlapping isotope patterns are automatically and correctly assigned, despite the overlap caused by peptides of different charge states (z = 5 versus z = 2) and near-identical masses of two co-eluting peptides. This would have been difficult from the MS information alone, even for an expert human scientist (Fig. 2d).
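The graph construction described above can be sketched in Python as follows. This is a simplified illustration, not the MaxQuant implementation: it uses a fixed m/z tolerance instead of per-peak bootstrap errors, it assumes an averagine isotope spacing of 1.00286864 Da (a value not given in the text), and it stops at connected components (‘pre-isotope’ patterns) without the subsequent search for the longest charge-consistent subgraphs. All names and parameters are hypothetical.

```python
AVERAGINE_DELTA = 1.00286864  # assumed mean isotope spacing in Da (not from the text)


def build_isotope_graph(peaks, max_charge=6, tol=0.005, min_overlap=0.5):
    """Undirected graph over 3D peaks given as (m/z, rt_start, rt_end) tuples.

    Two peaks are linked when their m/z difference matches the averagine
    isotope spacing for some charge and their elution intervals overlap.
    """
    graph = {i: set() for i in range(len(peaks))}
    order = sorted(range(len(peaks)), key=lambda i: peaks[i][0])
    for pos, i in enumerate(order):
        for j in order[pos + 1:]:
            dmz = peaks[j][0] - peaks[i][0]
            if dmz > AVERAGINE_DELTA + tol:
                break  # peaks are sorted by m/z, so later ones are too far away
            overlap = min(peaks[i][2], peaks[j][2]) - max(peaks[i][1], peaks[j][1])
            if overlap < min_overlap:
                continue
            if any(abs(dmz - AVERAGINE_DELTA / z) <= tol
                   for z in range(1, max_charge + 1)):
                graph[i].add(j)
                graph[j].add(i)
    return graph


def connected_components(graph):
    """Connected components ('pre-isotope' patterns) by depth-first search."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node] - seen:
                seen.add(neighbour)
                stack.append(neighbour)
        components.append(sorted(component))
    return components


# Hypothetical peaks: a doubly charged pattern (indices 0-2), an unrelated
# peak (3) and a peak with matching m/z that elutes much later (4).
peaks = [
    (558.00000, 10.0, 11.0),
    (558.50143, 10.0, 11.0),
    (559.00287, 10.0, 11.0),
    (700.00000, 10.0, 11.0),
    (558.50143, 20.0, 21.0),
]
patterns = connected_components(build_isotope_graph(peaks))
```

Sorting by m/z and breaking out of the inner loop early keeps the pairwise comparison tractable for several hundred thousand peaks, since only peaks within about 1 m/z unit of each other are ever compared.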

[Figures 1 and 2: image panels not reproduced. Recoverable axis labels: m/z, counts/s. Peptide annotations: LHHVSSLAWLDEHTLVTTSHDASVK, light, M = 2782.4038, z = 5; VIVPNMEFR, heavy, M = 1103.5798, z = 2; LGINSLQELK, light, M = 1113.6394, z = 2.]

Figure 1 Three-dimensional peak detection. (a) Two-dimensional (2D) peaks whose intensity drops to zero on both sides. The centroid mass of a 2D peak is calculated as a fit of a gaussian peak shape to the three central raw data points. (b) Peaks are broken up at local intensity minima. (c) 2D peaks are assembled into 3D peak hills over the m/z-retention time plane. Two peaks in neighboring scans are connected whenever their centroid m/z positions are sufficiently close. (d) 3D peak eluting over 1.5 min represented with color-coded intensity, decreasing from green over yellow to white, in the mass-retention time plane. Forty-nine centroids (dotted red line) have been joined to form this 3D peak. Note that fluctuations in mass become larger at low abundance. (e) 3D representation of the same peak. (f) Eleven 3D peaks forming two isotope patterns. The masses of the upper and lower isotope patterns are identical. The sixth peak of the lower isotope pattern has just been detected, whereas the sixth peak of the upper isotope pattern has just escaped detection.

Figure 2 Automatic large-scale SILAC pair detection. (a) Overview of the part of the mass-retention time plane capturing most of the peptides in one LC-MS run of an OFFGEL fraction of HeLa cell lysate. 5,666 SILAC pairs have been detected in this run and are coded in different colors. (b) Zoom into the region indicated by the black rectangle in a. Several SILAC pairs can be seen with charges ranging up to five. MS/MS sequencing events are indicated either by squares, in case they led to a peptide identification, or by crosses. (c) Zoom into the region indicated by the black rectangle in b showing a challenging case for isotope pattern detection involving three peptides. Note that MaxQuant correctly assigned the monoisotopic mass, whereas the instrument software picked the C13 peak for sequencing. The heavy-labeled blue peptide has a small peak at the low-mass side of the monoisotopic peak because of the usual impurities of the commercially available heavy amino acids. (d) The mass spectrum corresponding to the dotted rectangle in c.

In our example we carried out SILAC¹⁶ using arginine and lysine. To detect heavy-light SILAC partners we consider all possible pairs of isotope patterns. Potential SILAC pairs are first required to have sufficient intensity correlation over elution time (allowing for some retention-time shift due to isotope effects) and to have equal charges. By default we assume at most three labeled amino acids per peptide. Therefore, pairs could contain lysine (K), arginine (R), KK, KR, RR, KKK, KKR, KRR and RRR. For each of these cases we convolute the two measured isotope patterns with the theoretical isotope patterns of the difference atoms, that is, the atoms that have to be added so that both peptides would have the same atomic composition. If the mass differences are within the combined bootstrap error and if there is sufficient intensity correlation of the two isotope patterns in the m/z dimension, the peaks are associated as a SILAC pair. Figure 2 contains 5,666 SILAC pairs.
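The nine label states listed above are simply all multisets of K and R with one to three members. The following Python sketch enumerates them together with the expected heavy-light mass shift. The per-residue shifts are an assumption: the text does not say which heavy amino acids were used, so the values for ¹³C₆¹⁵N₂-lysine and ¹³C₆¹⁵N₄-arginine are taken purely for illustration, and the function names are hypothetical.

```python
from itertools import combinations_with_replacement

# Assumed label mass shifts in Da (Lys8 and Arg10); not stated in the text.
LABEL_SHIFT = {"K": 8.014199, "R": 10.008269}


def silac_label_states(max_labels=3):
    """Map each K/R label combination to its heavy-light mass difference."""
    states = {}
    for n in range(1, max_labels + 1):
        for combo in combinations_with_replacement("KR", n):
            states["".join(combo)] = sum(LABEL_SHIFT[aa] for aa in combo)
    return states


def expected_mz_shift(state_shift, charge):
    """m/z distance between light and heavy partner at a given charge."""
    return state_shift / charge


states = silac_label_states()
kr_shift_z2 = expected_mz_shift(states["KR"], 2)
```

A candidate pair of isotope patterns with equal charge would then be tested against each of these nine shifts within its combined mass error.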

The resulting isotope patterns are then scaled to each other using all ratios, starting with a least-square solution and determining the best median fit iteratively by bisection. This yields the fold-change between the two SILAC peptides (Supplementary Notes). For triple-labeling SILAC experiments¹⁷ more cases need to be considered but the procedures are very similar. In each LC-MS run, we normalize peptide ratios so that the median of their logarithms is zero, which corrects for unequal protein loading, assuming that the majority of proteins show no differential regulation.
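The normalization step amounts to subtracting the median log-ratio from every log-ratio within a run. A minimal Python sketch, with hypothetical function and variable names and made-up ratios:

```python
import math
import statistics


def normalize_ratios(ratios):
    """Rescale ratios so that the median of their log2 values is zero."""
    logs = [math.log2(r) for r in ratios]
    shift = statistics.median(logs)
    return [2.0 ** (x - shift) for x in logs]


# Hypothetical run in which the heavy sample was loaded twofold in excess.
raw_ratios = [0.5, 1.0, 2.0, 4.0, 8.0]
normalized = normalize_ratios(raw_ratios)
```

Using the median rather than the mean keeps the correction insensitive to the minority of peptides that are genuinely regulated.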

Improving peptide mass accuracy

The peptide mass is calculated as the intensity-weighted average of all MS peak centroids in the 3D peaks within the isotope patterns belonging to a SILAC pair or triplet. The statistics of the number of mass measurements per SILAC peptide are given in Supplementary Figure 1 online.
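As a rough illustration of an intensity-weighted mass estimate with a bootstrap precision, consider the following Python sketch. It is a generic bootstrap over centroid measurements under simplifying assumptions (independent resampling of centroids, a fixed number of replicates); the exact procedure used in MaxQuant is described in the Supplementary Notes and may differ. All names and values are hypothetical.

```python
import random
import statistics


def weighted_mass(masses, intensities):
    """Intensity-weighted average of centroid masses."""
    total = sum(intensities)
    return sum(m * w for m, w in zip(masses, intensities)) / total


def bootstrap_precision(masses, intensities, n_boot=200, seed=0):
    """Standard deviation of the weighted mass over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(masses)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        estimates.append(weighted_mass([masses[i] for i in idx],
                                       [intensities[i] for i in idx]))
    return statistics.stdev(estimates)


# Hypothetical centroid measurements of one peptide across scans.
centroids = [1103.5796, 1103.5799, 1103.5798, 1103.5801, 1103.5797]
signals = [2e5, 8e5, 1e6, 6e5, 3e5]
mass = weighted_mass(centroids, signals)
precision = bootstrap_precision(centroids, signals)
```

Because the weights favor intense scans, where centroid fluctuations are smallest, the combined estimate is more precise than any single-scan centroid.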

We use the several hundred SILAC charge pairs in every LC-MS run for recalibration without knowing their identity and minimize differences between mass estimates from different charge states. The resulting polynomial remaps experimental m/z values to their corrected values. Nonlinear mass corrections are about 1 p.p.m. for the LTQ Orbitrap mass spectrometer (Supplementary Fig. 2 online).

We next use the two masses of peptide charge pairs to derive an estimate of the mass accuracy (deviation from the true value) from the estimate of the mass precision (repeatability of the measurement) by requiring that mass estimates are within the error range. We then scale the bootstrap errors by the required factor, which is between two and three in our data. As in similar cases¹⁸, this factor is likely due to autocorrelation between the centroid determinations in subsequent spectra. To correct for global expansion or contraction of the mass scale we use well-identified peptides and minimize the mass deviation of these peptides weighted by their individual mass precisions.

We plotted the corrected mass precisions for the 477,511 SILAC pairs in our data set as a function of peptide signal (Fig. 3a). Mass precisions are extremely high (in the p.p.b. range) and roughly proportional to one over the square root of the peptide signal. Figure 3b shows that 50% of the peaks have corrected mass precisions better than 393 p.p.b. In agreement with this, the actual mass deviations of all identified peptides (measured minus calculated mass) have a s.d. of 409 p.p.b. and average absolute mass deviation (average of the absolute value of the difference between measured and calculated masses) of 278 p.p.b. (Fig. 3c).

Peptide mass estimates are usually taken from the MS peak that leads to selecting the peptide for fragmentation (Fig. 3d). Average absolute mass accuracy in this standard approach is 1.8 p.p.m. and s.d. is 2.5 p.p.m. Thus mass accuracy measured as s.d., a key performance parameter in proteomics¹⁹, improved sixfold using our computational approach. We suspect the improvement would have been even greater if we had not used the ‘lock mass option’²⁰. Worse, even including the lock mass, the normal approach would have necessitated a maximum allowed mass deviation of 10 p.p.m. for all peptides, whereas searches are performed with much tighter and individualized mass tolerances in MaxQuant.
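As a worked check, the sixfold improvement follows from the two standard deviations quoted in the text (2.5 p.p.m. for the standard approach and 409 p.p.b. with MaxQuant):

$$\frac{2.5\ \text{p.p.m.}}{409\ \text{p.p.b.}} = \frac{2500\ \text{p.p.b.}}{409\ \text{p.p.b.}} \approx 6.1$$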

Peptide and protein identification

Because the SILAC state of most isotope patterns is known beforehand, we can treat the label modifications as fixed in the database search. By counting the number of arginines and lysines, the SILAC state distinguishes limit tryptic peptides from incompletely cleaved ones. This a priori information decreases the search space about tenfold. For fragmentation spectra not associated with a SILAC pair, a conventional database search is performed. After a database search, the list of top ten sequences matching a fragmentation spectrum is sorted according to their peptide score or P-score²¹ and filtered for consistency with a priori information, retaining the best scoring one. We allow a deviation between the measured and calculated mass of four s.d. of the individual bootstrap error for each peptide.

We use a database containing all true protein sequences, concatenated with reversed nonsense versions of these sequences²²,²³. To avoid spurious correlations because half of the reversed tryptic peptides have the same mass as the forward sequence, we also swapped every arginine and lysine with the preceding amino acid in the reversed sequences. This approach still retains the local amino acid relations, leading to the same length and mass distributions of peptides (Supplementary Notes).
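One way to read the decoy construction described above is sketched below in Python. It reverses the sequence and then, scanning left to right, swaps each K or R with the residue immediately before it. How MaxQuant handles edge cases such as consecutive K/R residues is not stated in the text, so the left-to-right scan is an assumption, and `make_decoy` is a hypothetical name.

```python
def make_decoy(sequence):
    """Reverse a protein sequence, then swap each K/R with the preceding residue."""
    residues = list(reversed(sequence))
    for i in range(1, len(residues)):
        if residues[i] in "KR":
            # Move the cleavage residue one position towards the N terminus.
            residues[i - 1], residues[i] = residues[i], residues[i - 1]
    return "".join(residues)


# Hypothetical toy sequence: "MAKLR" is reversed to "RLKAM", then K and L swap.
decoy = make_decoy("MAKLR")
```

The swap changes which residues fall between consecutive cleavage sites, so decoy tryptic peptides no longer share their amino acid composition (and hence mass) with a forward peptide, while the overall composition and length of the protein are preserved.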

[Figure 3: image panels not reproduced. Recoverable axis labels: intensity; corrected mass precision (p.p.m.); counts/bin; measured minus calculated mass (p.p.m.).]

Figure 3 Accurate masses and individual peptide mass errors. (a) Mass precision corrected for autocorrelation of >477,000 SILAC pairs as a function of integrated signal intensity. Precision is inversely proportional to the square root of peptide intensity. (b) Same data as a but binned by corrected mass precision. (c) Mass deviation of all identified peptides. (d) Mass deviation without MaxQuant: precursor masses were taken directly from instrument software (‘monoisotopic M/Z’). The scaled distribution from c is shown in red for comparison.

To assess the likelihood of false identification we generate two lists of peptides, one for the hits in the forward sequences and one in the reversed sequences. We construct two histograms by gaussian kernel smoothing (Fig. 4). They can be interpreted as approximations to the total and the conditional probability densities

$$p(s, L) \quad \text{and} \quad p(s, L \mid X = \text{false})$$

where the Boolean variable $X$ indicates ‘true or false’ (forward) or ‘false’ (reverse) sequences, $s$ is the peptide database score and $L$ the peptide length. The probability of a false hit, given the peptide identification score and the length of the peptide, is then

$$p(X = \text{false} \mid s, L) = \frac{p(s, L \mid X = \text{false})\, p(X = \text{false})}{p(s, L)}$$

the posterior error probability (PEP) of each individual peptide. We use the PEP only as input for calculating the false-discovery rate (FDR) below. The a priori probability $p(X = \text{false})$ is a constant with no effect on the final list of accepted peptides at a given FDR. Longer peptides, which are less frequent in the database, are automatically accepted with lower scores.

To determine a cutoff score for a specific FDR, we sort all peptide identifications, from the forward and the reverse database, by their PEP, starting with the best. Peptides are accepted until 1% of reverse hits/forward hits has accumulated. The fraction of wrong identifications in the forward database is then 1% as well.
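The cutoff procedure can be sketched in Python as follows. The sketch assumes that each identification is given as a pair of PEP and a flag marking reverse-database hits, and that acceptance stops the first time the ratio of reverse to forward hits exceeds the target; the function name and the toy data are hypothetical.

```python
def accept_at_fdr(hits, fdr=0.01):
    """Return the PEPs of forward hits accepted at the given target-decoy FDR.

    hits: iterable of (pep, is_reverse) tuples.
    """
    accepted = []
    forward = reverse = 0
    for pep, is_reverse in sorted(hits, key=lambda h: h[0]):
        if is_reverse:
            reverse += 1
        else:
            forward += 1
        if reverse > fdr * forward:
            break  # the reverse/forward ratio has passed the target FDR
        if not is_reverse:
            accepted.append(pep)
    return accepted


# Hypothetical identifications: 300 forward hits and three reverse hits.
forward_hits = [(i / 1000, False) for i in range(1, 301)]
reverse_hits = [(0.1505, True), (0.1515, True), (0.2505, True)]
accepted = accept_at_fdr(forward_hits + reverse_hits)
```

In this toy example the first reverse hit is tolerated (1 reverse versus 150 forward hits) and acceptance stops at the second one, so the PEP of the last accepted forward hit acts as the cutoff.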

In this run, 11,299 sequencing events led to 7,307 peptide identifications (identification rate of 64.7%, Fig. 5). Sequencing events associated with SILAC pairs have identification rates of 84.4%. Identifications (red squares) cluster in particular regions of the contour plot (Fig. 5a), with characteristic polymer patterns devoid of peptide identifications (Fig. 5b) and fragmentation events in peptide-rich regions almost uniformly identified (Fig. 5c). Note that many SILAC pairs were not targeted for sequencing at all (32.3% in this run).

We next assemble peptide hits into protein hits, a nontrivial step in shotgun proteomics²⁴. Whenever the set of identified peptides in one protein is equal to or completely contained in the set of identified peptides of another protein, these two proteins are joined in a protein group. Shared peptides are most parsimoniously associated with the group with the highest number of identified peptides (‘razor’ peptides²⁴) but remain in all groups where they occur. Protein quantification may then be performed based on unique peptides only, on unique plus razor peptides, or on all peptides. By default we use unique and razor peptides as a compromise between unequivocal peptide assignment and most-accurate quantification.
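The grouping rule and the razor-peptide assignment can be illustrated with a small Python sketch. It is a simplification under stated assumptions: proteins are processed from the largest peptide set to the smallest, a protein whose peptide set is contained in several groups joins the first one found, and ties between equally large groups are broken by order. The data structures and names are hypothetical.

```python
def group_proteins(protein_peptides):
    """Join proteins whose identified peptide set is contained in another's.

    protein_peptides: dict mapping protein name to a set of peptide sequences.
    """
    groups = []
    ranked = sorted(protein_peptides.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for protein, peptides in ranked:
        for group in groups:
            if peptides <= group["peptides"]:  # equal to or contained in the group
                group["proteins"].append(protein)
                break
        else:
            groups.append({"proteins": [protein], "peptides": set(peptides)})
    return groups


def razor_assignment(groups):
    """Map every peptide to the index of the largest group that contains it."""
    razor = {}
    for idx, group in enumerate(groups):
        for peptide in group["peptides"]:
            best = razor.get(peptide)
            if best is None or len(group["peptides"]) > len(groups[best]["peptides"]):
                razor[peptide] = idx
    return razor


# Hypothetical identifications: B is contained in A; C shares p3 with A.
identified = {"A": {"p1", "p2", "p3"}, "B": {"p1", "p2"}, "C": {"p3", "p4"}}
groups = group_proteins(identified)
razor = razor_assignment(groups)
```

Here the shared peptide p3 becomes a razor peptide of the larger group containing A and B, while still being listed in the group of C.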

We assign to each protein group a PEP by multiplying its peptide PEPs. Only peptides with distinct sequences and only the highest-scoring identified spectra are used to avoid bias due to dependent peptides. Similarly to the peptide PEP, the protein PEP serves to sort the list of hits from forward and reverse databases. Using a protein FDR of 1% and requiring that each protein group contain a unique peptide, we identified 4,149 proteins in the cell line proteome (Supplementary Table 1 online).

[Figure 4: image panels not reproduced. Panels are shown for peptide lengths L = 6, 8, 10, 12, 14, 16, 20 and 24; axes are P-score (0 to 300) and counts/bin.]

Figure 4 Peptide score (P-score) distributions. The panels show the distributions of scores in the forward (blue) and reverse (red) database with peptide length (L) as the parameter. MaxQuant filters potential hits by a priori information, which moves the reverse hit distribution far to the left. These distributions are used to calculate the false-positive rate for peptide identification as a function of peptide length.

[Figure 5: image panels a–c not reproduced. Recoverable axis label: m/z.]

Figure 5 High rate of identified MS/MS spectra. MS/MS sequencing events are indicated in the mass-retention time plane (contour plot). Identified and unidentified MS/MS spectra are represented by red squares and blue crosses, respectively. (a) Peptides elute between 40 and 120 min and peptide identifications are shifted to higher m/z values at later points in the gradient. (b) Left rectangle of a. In this region, characteristic polymer patterns that do not lead to peptide identifications are prevalent. (c) In contrast, in a peptide-rich region of the contour plot (right rectangle in a), almost all fragmentation events lead to successful peptide identification.


Protein quantification

Many of the isotope patterns that have not been assembled into SILAC pairs are nevertheless identified by database search. For these peptides the m/z-elution time shapes of the 3D peaks belonging to the identified SILAC version are translated to the location of the missing SILAC partner and, after integration of intensities, ratios are calculated in the same way as for SILAC pairs that were detected before identification.

Protein ratios are calculated as the median of all SILAC peptide ratios, minimizing the effect of outliers. We normalize the protein ratios to correct for unequal protein amounts.

We next calculate an outlier significance score for log protein ratios (significance A). To create a robust and asymmetrical estimate of the s.d. of the main distribution we calculate the 15.87, 50 and 84.13 percentiles $r_{-1}$, $r_0$ and $r_1$. $r_1 - r_0$ and $r_0 - r_{-1}$ are the right- and left-sided robust s.d. For a normal distribution, these would be equal to each other and to the conventional definition of an s.d. A suitable measure for a ratio $r > r_0$ being significantly far away from the main distribution is the distance to $r_0$ measured in terms of the right s.d.

$$z = \frac{r - r_0}{r_1 - r_0}$$

As a P-value for detection of significant outlier ratios we define

$$\text{significance A} = \frac{1}{2}\,\operatorname{erfc}\!\left(\frac{z}{\sqrt{2}}\right) = \frac{1}{\sqrt{2\pi}} \int_{z}^{\infty} e^{-t^{2}/2}\, dt$$

which is the probability of obtaining a log-ratio of at least this magnitude under the null hypothesis that the distribution of log-ratios has normal upper and lower tails (Supplementary Fig. 3 online).
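A minimal Python sketch of significance A follows. The text defines z only for ratios above the median; the sketch assumes the mirror-image definition with the left-sided s.d. for ratios below the median, and it uses a simple linear-interpolation percentile. Both choices, and all names, are assumptions made for illustration; the sketch also assumes the percentiles are distinct, so that neither robust s.d. is zero.

```python
import math


def percentile(sorted_values, q):
    """Percentile by linear interpolation between order statistics (q in 0-100)."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def significance_a(log_ratios):
    """One-sided outlier P-value for each log-ratio, using robust one-sided s.d."""
    values = sorted(log_ratios)
    r_minus1, r0, r1 = (percentile(values, q) for q in (15.87, 50.0, 84.13))
    p_values = []
    for r in log_ratios:
        if r > r0:
            z = (r - r0) / (r1 - r0)        # distance in right-sided s.d.
        else:
            z = (r0 - r) / (r0 - r_minus1)  # assumed mirror case, left-sided s.d.
        p_values.append(0.5 * math.erfc(z / math.sqrt(2.0)))
    return p_values


# Hypothetical, symmetric set of log protein ratios.
log_ratios = [-2.0, -1.0, 0.0, 1.0, 2.0]
p_values = significance_a(log_ratios)
```

A ratio at the median receives a value of 0.5, and the value falls towards zero as the ratio moves away from the main distribution on either side.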

For highly abundant proteins the statistical spread of unregulated proteins is much more focused than for low abundance ones⁸ (Fig. 6). To capture this effect, we define another quantity, significance B, which is calculated only on the protein subsets obtained by intensity binning. We define bins of equal occupancy such that each contains at least 300 proteins.
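One simple way to obtain intensity bins of (nearly) equal occupancy with at least 300 members each is sketched below in Python. The rule for choosing the number of bins is an assumption, since the text only states the constraint; the function name and test data are hypothetical. Significance B would then be obtained by applying the significance A calculation separately within each bin.

```python
def equal_occupancy_bins(intensities, min_size=300):
    """Split protein indices into intensity-sorted bins of near-equal size.

    The number of bins is the largest one for which every bin still holds
    at least `min_size` proteins (a single bin if there are fewer proteins).
    """
    order = sorted(range(len(intensities)), key=lambda i: intensities[i])
    n = len(order)
    n_bins = max(1, n // min_size)
    return [order[k * n // n_bins:(k + 1) * n // n_bins] for k in range(n_bins)]


# Hypothetical summed intensities for 1,000 proteins (a shuffled 0..999).
intensities = [float((i * 37) % 1000) for i in range(1000)]
bins = equal_occupancy_bins(intensities)
bin_sizes = [len(b) for b in bins]
```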

We quantified 4,100 proteins, comparable to the number of significant messages in a microarray experiment on the same cell type¹⁴ (Supplementary Table 1). If a minimum of three quantification events (three SILAC pairs) is required, quantification becomes very reliable²⁵,²⁶ because a single outlier ratio has no effect on the median. Strikingly, 99.3% of proteins were within 50% of the one-to-one ratio. This implies excellent SILAC partner identification as wrongly partnered peptides would have ratios strongly deviating from 1. We found 48 proteins to be significantly upregulated based on significance B with a Benjamini-Hochberg²⁷ FDR <5% (Supplementary Table 2 online).
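For reference, the standard Benjamini-Hochberg step-up procedure applied to a list of P-values (such as significance B values) can be written in a few lines of Python. This is the textbook procedure rather than code from MaxQuant, and the function name and example values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans marking the hypotheses rejected at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Largest rank whose P-value lies under the line rank * fdr / m.
        if p_values[i] <= rank * fdr / m:
            cutoff_rank = rank
    rejected = [False] * m
    for i in order[:cutoff_rank]:
        rejected[i] = True
    return rejected


# Hypothetical P-values for four proteins.
example_p = [0.041, 0.001, 0.216, 0.008]
significant = benjamini_hochberg(example_p)
```

All hypotheses up to the largest rank that satisfies the inequality are rejected, even if an individual P-value at a lower rank lies above its own threshold.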

Notably, two of the most heavily upregulated proteins after 2 h of EGF stimulation were the transcription factors JunB and the orphan nuclear receptor NR4A1, also termed early-response protein NAK1 (Fig. 6). Both are known to be regulated by growth stimuli. Among the most upregulated proteins in Figure 6 there is a conserved dual-specificity tyrosine-serine phosphatase (MTM1), widely studied in relation to myotubular myopathy²⁸ and, like PTEN, a lipid phosphatase²⁹. The completely uncharacterized protein C1orf52 is tightly associated with the tumor suppressor BCL10 and therefore also called BAG for BCL10-associated gene. Neither of these proteins was known to be induced upon EGF stimulation. Many of the other significantly regulated proteins also have potential connection to growth factor signaling (Supplementary Table 1). Proteins encoded by genes having regulatory binding sites for SREBP-1 are shown to be significantly upregulated when analyzed by TRANSFAC³⁰. SREBP-1 likely mediates the effects of EGF stimulation on cancer-relevant proteins like FAS³¹.

DISCUSSION

We have introduced a set of computational proteomics algorithms with several useful features. Efficient extraction of mass information allows us to search protein databases with maximum allowed mass deviations that adjust themselves to the precision with which the peptide is measured. The mass accuracies achieved here are the highest yet reported in large-scale proteomics³² and sharply limit the number of candidate peptides in database searches. With low-resolution data, only a few percent of fragmentation events lead to successful identification³³, whereas the mass accuracy and feature extraction in MaxQuant allow 73% of the fragmentation events associated with SILAC peptide pairs to be identified. Thus, standard ion trap fragmentation is extremely information rich, and nontryptic and modified peptides do not constitute the majority of fragmented peptides. The MaxQuant algorithms recently enabled comprehensive quantification of the yeast proteome³⁴. Although we identified essentially the complete proteome, we found only three (<1%) of the 814 ‘dubious’ open reading frames (ORFs) (http://www.yeastgenome.org/), which are not expected to be expressed from evidence such as comparative genome sequencing. This provides independent evidence that our FDR estimates of peptide and protein identifications are very stringent (Supplementary Fig. 4 online). Much higher identification rates among dubious ORFs (3%) were found in genome-wide tagging experiments³⁵,³⁶. Likewise, aggregate data from yeast proteome resources cover 12% of these dubious ORFs³⁷, the same percentage as their occurrence in the genome.

We have already applied MaxQuant to quantify >5,000 proteins in the mouse stem cell proteome³⁸ and several other proteomes in similar depth. We conclude that the computational tools for proteome-wide quantification are now in hand. With further advances in instrumentation, particularly in the dynamic range of measurements³⁹,⁴⁰, proteomics should be suitable for routine ‘functional genomics’ experiments, for which microarrays have so far been the only option.

[Figure 6: image not reproduced. Axes are protein ratio and intensity; labeled proteins are JUNB, C1orf52, NR4A1, OBSCN, MTM1 and KIAA1429.]

Figure 6 Proteome-wide accurate quantification and significance. Normalized protein ratios are plotted against summed peptide intensities. The spread of the cloud is lower at high abundance, indicating that quantification is more precise. The data points are colored by their ‘significance B’, with blue crosses having values >0.05, red squares between 0.05 and 0.01, yellow diamonds between 0.01 and 0.001 and green circles <0.001.


METHODS

Software development and availability of MaxQuant. MaxQuant is developed for the .NET framework and written in the C# language. The interactive 3D data viewer was developed on the basis of DirectX. MaxQuant executables are available via http://www.maxquant.org/, whereas the source code of algorithms is available in Supplementary Data. MaxQuant runs on Windows desktop computers and is compatible with XP and Vista. Processing time is currently about 20 min per raw file and per processing core. A detailed description of the algorithms used in MaxQuant can be found in Supplementary Notes.

Data processing. The Mascot program version 2.2.04 (Matrix Science) was used to generate up to ten peptide sequence candidates per fragmentation spectrum, and International Protein Index (IPI) version 3.48 was searched. The database search is done with an initial maximum allowed mass deviation of 7 p.p.m. for the peptide mass and 0.5 m/z units for fragmentation peaks, which is optimal for linear ion trap data⁴¹.

Gene Ontology, Pfam domain and TRANSFAC overrepresentation analysis. P-values for overrepresentation in regulated proteins were calculated with the Wilcoxon-Mann-Whitney test on the continuous significance B values calculated by the MaxQuant software.

Data used in analysis. The data used in this analysis have been published in reference 14. SILAC was performed as described42. Briefly, HeLa cells were stimulated with EGF for 2 h, and mass spectrometric analysis was performed as described20. ‘Heavy’ (EGF stimulated) and ‘light’ (control) SILAC cell populations were combined and lysed. Proteins were digested in solution with trypsin, and the resulting peptides were separated by isoelectric focusing into 24 fractions with an Agilent 3100 OFFGEL Fractionator. Each fraction was purified with StageTips43 and analyzed by liquid chromatography combined with electrospray tandem mass spectrometry on a Thermo Scientific LTQ Orbitrap mass spectrometer with lock mass calibration20. The experiment was performed in triplicate.

Raw mass spectrometric data files and evidence tables containing peptide and protein data can be downloaded from Tranche at http://tranche.proteomecommons.org/.

Note: Supplementary information is available on the Nature Biotechnology website.

ACKNOWLEDGMENTS

We thank all the other members of the Proteomics and Signal Transduction

group for help with the development of MaxQuant. Shubin Ren helped in

developing the 3D data viewer used in MaxQuant. Nina Hubner measured the

data used in this analysis. This work was supported by the Max-Planck Society

and by the 6th Framework Program of the European Union (Interaction

Proteome LSHG-CT-2003-505520 and HEROIC LSHG-CT-2005-018883).

Published online at http://www.nature.com/naturebiotechnology/

Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions/

1. Allison, D.B., Cui, X., Page, G.P. & Sabripour, M. Microarray data analysis: from disarray to consolidation and consensus. Nat. Rev. Genet. 7, 55–65 (2006).

2. Patterson, S.D. & Aebersold, R.H. Proteomics: the first decade and beyond. Nat. Genet. 33 Suppl, 311–323 (2003).

3. Nesvizhskii, A.I., Vitek, O. & Aebersold, R. Analysis and validation of proteomic data generated by tandem mass spectrometry. Nat. Methods 4, 787–797 (2007).

4. Aebersold, R. & Mann, M. Mass spectrometry-based proteomics. Nature 422, 198–207 (2003).

5. Steen, H. & Mann, M. The ABC’s (and XYZ’s) of peptide sequencing. Nat. Rev. Mol. Cell Biol. 5, 699–711 (2004).

6. Sadygov, R.G., Cociorva, D. & Yates, J.R. III. Large-scale database searching using tandem mass spectra: looking up the answer in the back of the book. Nat. Methods 1, 195–202 (2004).

7. Ong, S.E. & Mann, M. Mass spectrometry-based proteomics turns quantitative. Nat. Chem. Biol. 1, 252–262 (2005).

8. Bantscheff, M., Schirle, M., Sweetman, G., Rick, J. & Kuster, B. Quantitative mass spectrometry in proteomics: a critical review. Anal. Bioanal. Chem. 389, 1017–1031 (2007).

9. Listgarten, J. & Emili, A. Statistical and computational methods for comparative proteomic profiling using liquid chromatography-tandem mass spectrometry. Mol. Cell. Proteomics 4, 419–434 (2005).

10. Colinge, J. & Bennett, K.L. Introduction to computational proteomics. PLoS Comput. Biol. 3, e114 (2007).

11. Matthiesen, R. Methods, algorithms and tools in computational proteomics: a practical point of view. Proteomics 7, 2815–2832 (2007).

12. Mead, J.A., Shadforth, I.P. & Bessant, C. Public proteomic MS repositories and pipelines: available tools and biological applications. Proteomics 7, 2769–2786 (2007).

13. Perkins, D.N., Pappin, D.J., Creasy, D.M. & Cottrell, J.S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551–3567 (1999).

14. Cox, J. & Mann, M. Is proteomics the new genomics? Cell 130, 395–398 (2007).

15. Senko, M.W., Beu, S.C. & McLafferty, F.W. Determination of monoisotopic masses and ion populations for large biomolecules from resolved isotopic distributions. J. Am. Soc. Mass Spectrom. 6, 229–233 (1995).

16. Ong, S.E. et al. Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol. Cell. Proteomics 1, 376–386 (2002).

17. Blagoev, B., Ong, S.E., Kratchmarova, I. & Mann, M. Temporal analysis of phosphotyrosine-dependent signaling networks by quantitative proteomics. Nat. Biotechnol. 22, 1139–1145 (2004).

18. Sokal, A.D. Monte Carlo Methods in Statistical Physics: Foundations and New Algorithms (Lausanne, Switzerland, 1996).

19. Zubarev, R. & Mann, M. On the proper use of mass accuracy in proteomics. Mol. Cell. Proteomics 6, 377–381 (2007).

20. Olsen, J.V. et al. Parts per million mass accuracy on an Orbitrap mass spectrometer via lock mass injection into a C-trap. Mol. Cell. Proteomics 4, 2010–2021 (2005).

21. Olsen, J.V. & Mann, M. Improved peptide identification in proteomics by two consecutive stages of mass spectrometric fragmentation. Proc. Natl. Acad. Sci. USA 101, 13417–13422 (2004).

22. Elias, J.E. & Gygi, S.P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 4, 207–214 (2007).

23. Käll, L., Storey, J.D., MacCoss, M.J. & Noble, W.S. Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. J. Proteome Res. 7, 29–34 (2008).

24. Nesvizhskii, A.I. & Aebersold, R. Interpretation of shotgun proteomic data: the protein inference problem. Mol. Cell. Proteomics 4, 1419–1440 (2005).

25. Selbach, M. et al. Widespread changes in protein synthesis induced by microRNAs. Nature 455, 58–63 (2008).

26. Bonaldi, T. et al. Combined use of RNAi and quantitative proteomics to study gene function in Drosophila. Mol. Cell 31, 762–772 (2008).

27. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B 57, 289–300 (1995).

28. Laporte, J. et al. MTM1 mutations in X-linked myotubular myopathy. Hum. Mutat. 15, 393–409 (2000).

29. Wishart, M.J. & Dixon, J.E. PTEN and myotubularin phosphatases: from 3-phosphoinositide dephosphorylation to disease. Trends Cell Biol. 12, 579–585 (2002).

30. Matys, V. et al. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 31, 374–378 (2003).

31. Swinnen, J.V. et al. Stimulation of tumor-associated fatty acid synthase expression by growth factor activation of the sterol regulatory element-binding protein pathway. Oncogene 19, 5173–5181 (2000).

32. Liu, T., Belov, M.E., Jaitly, N., Qian, W.J. & Smith, R.D. Accurate mass measurements in proteomics. Chem. Rev. 107, 3621–3653 (2007).

33. Kuster, B., Schirle, M., Mallick, P. & Aebersold, R. Scoring proteomes with proteotypic peptide probes. Nat. Rev. Mol. Cell Biol. 6, 577–583 (2005).

34. de Godoy, L.M. et al. Comprehensive mass-spectrometry-based proteome quantification of haploid versus diploid yeast. Nature 455, 1251–1254 (2008).

35. Huh, W.K. et al. Global analysis of protein localization in budding yeast. Nature 425, 686–691 (2003).

36. Ghaemmaghami, S. et al. Global analysis of protein expression in yeast. Nature 425, 737–741 (2003).

37. King, N.L. et al. Analysis of the Saccharomyces cerevisiae proteome with PeptideAtlas. Genome Biol. 7, R106 (2006).

38. Graumann, J. et al. SILAC-labeling and proteome quantitation of mouse embryonic stem cells to a depth of 5111 proteins. Mol. Cell. Proteomics 7, 672–683 (2008).

39. Eriksson, J. & Fenyo, D. Improving the success rate of proteome analysis by modeling protein-abundance distributions and experimental designs. Nat. Biotechnol. 25, 651–655 (2007).

40. Mann, M. & Kelleher, N.L. Special feature: precision proteomics: The case for high resolution and high mass accuracy. Proc. Natl. Acad. Sci. USA published online, doi:10.1073/pnas.0800788105 (25 September 2008).

41. Cox, J., Hubner, N.C. & Mann, M. How much peptide sequence information is contained in ion trap tandem mass spectra? J. Am. Soc. Mass Spectrom. published online, doi:10.1016/j.jasms.2008.07.024 (7 August 2008).

42. Ong, S.E. & Mann, M. A practical recipe for stable isotope labeling by amino acids in cell culture (SILAC). Nat. Protocols 1, 2650–2660 (2006).

43. Rappsilber, J., Mann, M. & Ishihama, Y. Protocol for micro-purification, enrichment, pre-fractionation and storage of peptides for proteomics using StageTips. Nat. Protocols 2, 1896–1906 (2007).

