Shivakumar Keerthikumar and Suresh Mathivanan (eds.), Proteome Bioinformatics, Methods in Molecular Biology,
vol. 1549, DOI 10.1007/978-1-4939-6740-7_12, © Springer Science+Business Media LLC 2017
Chapter 12
Bioinformatics Methods to Deduce Biological
Interpretation from Proteomics Data
Krishna Patel, Manika Singh, and Harsha Gowda
Abstract
High-throughput proteomics studies generate large amounts of data. Biological interpretation of these
large-scale datasets is often challenging. Over the years, several computational tools have been developed
to facilitate meaningful interpretation of large-scale proteomics data. In this chapter, we describe various
analyses that can be performed and bioinformatics tools and resources that enable users to do the analyses.
Many Web-based and stand-alone tools are relatively user-friendly and can be used by most biologists
without significant assistance.
Key words Gene ontology, FunRich, Reactome, NetPath, Phosphoproteome, Pathways, Enrichment,
Post-translational modifications
1 Introduction
High-throughput proteomics studies identify and quantify thousands of proteins in a biological specimen. These studies are often carried out to determine dynamic changes in proteins, including differential expression between biological conditions, activation of specific signaling pathways, and changes in protein complexes. To achieve this, mass spectrometry-based methods are often employed to measure the relative abundance of proteins or of post-translational modifications including phosphorylation, acetylation, glycosylation, and ubiquitination. Although such large-scale studies generate an enormous amount of data, biological interpretation of these data poses a significant challenge for biologists.
Several commercial and open source tools have been developed over the years to facilitate biological interpretation of proteomics data. These tools allow biologists to disentangle complexity in large datasets and identify meaningful patterns. Most biological processes are driven not by a single protein but by many proteins acting in concert. If two biological conditions or cell phenotypes are compared using quantitative proteomics, one can expect the set of proteins that regulate these distinct phenotypes or conditions to be differentially expressed. Tools developed for gene set enrichment or overrepresentation analysis enable identification of such patterns in large-scale datasets. Such enrichment analysis can also facilitate functional annotation of orphan molecules based on their association with other well-characterized molecules. Here, we describe several tools that can be used for such analysis in mammalian systems, particularly those that have well-annotated data, including human.
2 Materials
Several commercial and open source tools are available for bioinformatics analysis of high-throughput datasets. For each type of analysis, the relevant section of this chapter lists tools that can be used, and step-by-step instructions are provided for one tool in each section. A general outline of the workflow and of the different kinds of analysis that can be carried out is provided in Fig. 1.
3 Methods
3.1 Gene Ontology Enrichment Analysis
The Gene Ontology (GO) Consortium has developed a controlled vocabulary to represent biological functions, processes, and cellular localization [1]. The terms are linked to the corresponding genes based on current understanding of gene function and localization. These annotations are used extensively for GO enrichment analysis, which provides insight into the biological functions and processes enriched in a large-scale proteomics dataset. Several tools have been developed that carry out enrichment analysis on a gene/protein list supplied as input. FunRich [2] is a user-friendly stand-alone tool for GO enrichment analysis. It allows users to upload or paste gene symbols, gene IDs, UniProt IDs, or RefSeq protein IDs as input. Results of the enrichment analysis are produced in various graphical formats such as bar graph, pie chart, Venn diagram, heat map, and doughnut chart. Multiple gene sets can be uploaded for comparative GO enrichment and pathway enrichment analysis, and the tool provides several graphical options for visualizing the comparative results.
One of the most widely used Web-based tools is DAVID (Database for Annotation, Visualization, and Integrated Discovery; https://david.ncifcrf.gov/) [3]. It provides a comprehensive set of functional annotation tools that not only identify enriched biological themes, particularly GO terms, but also discover functionally related enriched gene groups based on popular pathway databases including KEGG [4] and BioCarta [5]. Here we describe a step-by-step guide for GO enrichment using DAVID.
Krishna Patel et al.
Fig. 1 A general framework and outline of the various bioinformatics analysis approaches that can be used for high-throughput proteomic data
High Throughput Proteomic Analysis
DAVID offers two major tools for functional annotation and classification of gene lists: Functional Annotation and Gene Functional Classification. Both can be accessed through the links at the top left corner of the home page.
1. To begin the analysis, click on “Functional annotation”.
2. The resulting Web page shows three tabs—Upload, List, and
Background.
3. In the “Upload” tab, either paste gene list into the box or
browse and upload the list where there is a single column with
each row representing a single gene (see Note 1).
4. The ‘List’ tab in DAVID allows users to limit gene annotations to one or more species. The default is Homo sapiens.
5. For enrichment analysis, the user has to choose a background using the ‘Background’ tab. The default background in DAVID is the whole Homo sapiens genome. The user can instead choose a custom background.
6. DAVID recognizes gene lists with various identifiers, including official gene symbols and accession numbers. For proteomics datasets, it is best to use official gene symbols in gene lists and to select that identifier type in Step 2 of the ‘Upload’ tab.
7. In Step 3 of the ‘Upload’ tab, choose whether the uploaded list should be used as ‘Gene List’ or ‘Background’. For data from human samples, choose ‘Gene List’, because the whole Homo sapiens genome is used as the default background.
8. Click ‘Submit List’ button. The results provided by DAVID
include ‘Functional Annotation Clustering’, ‘Functional
Annotation Chart’ and ‘Functional Annotation Table’. These
results provide a quick glance of major biological functions
enriched in the gene list.
9. For GO enrichment analysis, click on Gene_Ontology and select GOTERM_BP_ALL for biological process, GOTERM_CC_ALL for subcellular localization, and GOTERM_MF_ALL for molecular function as the annotation categories for the GO enrichment analysis. Click on “Functional Annotation Clustering” and DAVID will generate clusters of terms with similar biological meaning based on shared or similar gene members. The significance of each enrichment is calculated as a modified Fisher exact P-value.
10. The top panel of the result window is the parameter panel, which the user can modify as needed to rerun the analysis without submitting the input again. Higher stringency is recommended to obtain small, concise, and meaningful clusters rather than broad and vague clusters of proteins. The default setting is medium stringency; however, the user can change this option to suit the analysis. A higher enrichment score indicates that members of the annotation terms are more strongly overrepresented in the uploaded input.
11. Result table displays annotation categories, enriched functional
annotation, enrichment scores of each cluster, number of genes
contributing to clustering of similar GO terms, and modified
Fisher Exact P-value.
12. To analyze the most enriched clusters, the user can filter for clusters with the highest enrichment scores and the lowest P-values for biological process, molecular function, and subcellular localization (see Note 2).
13. The ‘G’ link at the top of each cluster can be used to extract the set of proteins contributing to enrichment of that cluster. The matrix icon draws a heat map for the cluster and provides the GO term count matrix for each protein, which can be used for plotting graphs.
14. Pathway and functional domain enrichment analysis can also be carried out in DAVID by selecting “Pathway”, “Functional categories”, and “Protein domains” as the reference databases for functional annotation. However, the Web resource Reactome offers a user-friendly graphical interface for pathway analysis, which is explained in detail below.
Table 1 lists other widely used open source gene set enrichment analysis tools.
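The modified Fisher exact P-value and the cluster enrichment score mentioned in steps 9 to 11 can be illustrated with a short calculation. The sketch below is not DAVID's implementation. It assumes a one-sided hypergeometric test, approximates the modification by discounting one annotated hit before testing, and takes the enrichment score of a cluster to be the mean of the negative log10 P-values of its terms. The function names and the example counts are hypothetical.

```python
from math import comb, log10


def hypergeom_upper_tail(hits, list_size, annotated, background):
    """P(X >= hits) when list_size proteins are drawn without replacement
    from a background in which `annotated` proteins carry the term."""
    if hits <= 0:
        return 1.0
    most = min(list_size, annotated)
    favourable = sum(
        comb(annotated, i) * comb(background - annotated, list_size - i)
        for i in range(hits, most + 1)
    )
    return favourable / comb(background, list_size)


def modified_p_value(hits, list_size, annotated, background):
    """Conservative variant (assumption): discount one annotated hit."""
    return hypergeom_upper_tail(hits - 1, list_size, annotated, background)


def cluster_enrichment_score(p_values):
    """Mean of -log10(p) over the terms grouped into one cluster."""
    return sum(-log10(p) for p in p_values) / len(p_values)


# Hypothetical counts: 3 of 300 submitted proteins carry a term that
# annotates 40 of 20000 background proteins.
plain = hypergeom_upper_tail(3, 300, 40, 20000)
modified = modified_p_value(3, 300, 40, 20000)
print(f"unmodified P = {plain:.4g}, modified P = {modified:.4g}")
print(f"cluster score = {cluster_enrichment_score([plain, 0.001, 0.02]):.2f}")
```

Because one hit is discounted, the modified P-value is never smaller than the unmodified one, so terms supported by only a few proteins are less likely to appear significant.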
3.2 Pathway Analysis
Proteins regulate most cellular processes. Several proteins work in concert to regulate these processes and are often grouped into specific pathways in which they carry out their functions. Over the years, the pathways and processes regulated by specific proteins have been systematically annotated. Based on protein expression data, it is possible to infer the pathways and processes that are active in a biological sample. In addition to expression, some of the most widely studied signaling mechanisms involve the dynamic interplay of kinases and phosphatases, which add or remove phosphorylation on proteins. Differential protein expression data or phosphoproteomics data can therefore be used for pathway enrichment analysis. If the expression or phosphorylation levels of certain proteins change in a biological sample relative to an appropriate control, it is possible to predict pathways that are potentially differentially regulated.
Reactome [14] is a manually curated, open access Web-based resource of biological pathways that allows users to browse, search, and map proteins onto pathways. It also lists interactors of pathway nodes, acquired from the IntAct [15] molecular interaction database.
Here we describe the use of Reactome for pathway analysis.
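Before data are submitted to a pathway tool, the quantified proteins are usually reduced to a list of regulated ones. A minimal sketch of that filtering step follows. The twofold cutoff, the function name, and the example ratios are assumptions made for illustration and are not prescribed by this chapter.

```python
from math import log2


def differential_proteins(ratios, min_abs_log2_fc=1.0):
    """Split proteins into up- and down-regulated lists.

    `ratios` maps a gene symbol to its sample/control abundance ratio.
    Non-positive ratios cannot be log-transformed and are skipped.
    """
    up, down = [], []
    for gene, ratio in ratios.items():
        if ratio <= 0:
            continue
        log_fc = log2(ratio)
        if log_fc >= min_abs_log2_fc:
            up.append(gene)
        elif log_fc <= -min_abs_log2_fc:
            down.append(gene)
    return sorted(up), sorted(down)


# Hypothetical sample/control ratios from a quantitative experiment.
example = {"EGFR": 2.5, "MAPK1": 1.1, "PTEN": 0.4, "AKT1": 2.0}
up, down = differential_proteins(example)
print("up:", up)
print("down:", down)
```

Either list, or their union, can then be pasted into a pathway analysis tool as the input protein list.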
Table 1
List of tools that can be used for gene ontology and gene set enrichment analysis
Name | Description | Availability and link | Reference
GSEA | Gene set enrichment analysis (GSEA) is an expression analytics tool. It compares gene set enrichment between conditions and provides enriched sets of genes with their statistical significance scores to interpret biological data | Stand-alone; http://www.broadinstitute.org/gsea/ | [6]
FunRich | FunRich is a downloadable tool for pathway and GO enrichment analysis of genes and proteins. It can process genes/proteins irrespective of the source of the sample, as the user can load a customized database in addition to the default background database | Stand-alone; http://funrich.org/ | [2]
GoMiner | GoMiner leverages Gene Ontology by providing a framework to visualize and integrate “omics” data. It builds clusters of genes and their expression profiles, which can be analyzed for their biological significance. Each gene is linked to BioCarta, Entrez Genome, NCBI structures, PubMed, and MedMiner for greater clarity | Stand-alone, Web; http://discover.nci.nih.gov/gominer | [7]
GOstat | GOstat uses the GO terms database to find statistically overrepresented genes in a dataset. The results list significant sets of genes for biological interpretation | Web; http://gostat.wehi.edu.au | [8]
GOToolBox | GOToolBox is used for functional annotation of genes. It is a Perl-based program that can be automated in any gene expression analysis pipeline. GOToolBox also includes the GO-Diet and PRODISTIN frameworks, which can be used to study protein–protein interactions | Web; http://genome.crg.es/GOToolBox/ | [9]
(continued)
1. Reactome (http://www.reactome.org/) allows users to map a list of proteins onto pathways and to carry out enrichment analysis, which determines whether the input data contain an overrepresentation of proteins involved in certain pathways (see Note 3).
2. Click on “Analyze Data”. This is a three-step process that begins with pasting the protein list, with an appropriate header, into the Web page. The tool also accepts accession numbers and other identifiers as input. The next step allows projection of the data onto human annotation if they come from a different species, and inclusion of interactors from the IntAct molecular interaction database. After making the appropriate selections, click on “Analyze”.
3. The resulting page is divided into four panels. The ‘Hierarchy’ panel on the left of the Web page lists enriched pathways with the corresponding FDR; the ‘Viewport’ panel shows a graphical overview of these pathways with various navigation options; the top panel provides configuration options; and the bottom panel provides details of the objects selected in the pathway diagram. A detailed manual for understanding and navigating this pathway analysis tool can be found at http://wiki.reactome.org/index.php/Usersguide.
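The FDR shown next to each pathway corrects the enrichment P-values for the large number of pathways tested. As an illustration of the idea, the sketch below applies the Benjamini-Hochberg procedure to a list of P-values. Whether a given tool uses exactly this procedure should be checked in its documentation; the function name, pathway labels, and input values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical enrichment P-values for four pathways.
pathway_p = {"Pathway A": 0.005, "Pathway B": 0.01, "Pathway C": 0.03, "Pathway D": 0.04}
fdr = benjamini_hochberg(list(pathway_p.values()))
for name, q in zip(pathway_p, fdr):
    print(f"{name}\tP = {pathway_p[name]}\tFDR = {q:.3f}")
```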
Table 1
(continued)
Name | Description | Availability and link | Reference
GeneMerge | GeneMerge enables overrepresentation analysis of gene attributes in a given set of genes as compared to the genome background | Stand-alone, Web; http://www.genemerge.net/ | [10]
GO::TermFinder | GO::TermFinder is a tool that helps to find significant GO terms shared among a list of genes. It has GO::TermFinder libraries that enable visualization of results | Stand-alone; http://search.cpan.org/dist/GO-TermFinder/ | [11]
agriGO | agriGO is a specialized data analytics tool for the agricultural community. The database has 38 agricultural species comprising 274 data types | Web; http://bioinfo.cau.edu.cn/agriGO/ | [12]
FatiGO | FatiGO helps to find significant overrepresentation of functional annotations in one gene set compared to another | Web; http://babelomics.bioinfo.cipf.es | [13]
Various commercial tools, such as QIAGEN Ingenuity Pathway Analysis (IPA) and Agilent Genomics GeneSpring, are also available for functional and pathway enrichment analysis. Table 2 lists some of the widely used pathway resources and network analysis tools.
Table 2
List of pathway resources and network analysis tools
Name | Description | Availability and link | Reference
NetPath | NetPath is a manually curated resource of signal transduction pathways. Pathway data can be browsed, visualized, or downloaded in PSI-MI, BioPAX, and SBML formats. These standard formats enable visualization using external tools like Cytoscape | Web; www.netpath.org | [16]
PANTHER | Protein ANalysis THrough Evolutionary Relationships (PANTHER) is an analysis framework with multiple tools for evolutionary and functional classification of proteins. The PANTHER pathway resource allows visualization of protein expression data in the context of pathway diagrams | Web; http://www.pantherdb.org/pathway | [17]
KEGG | Kyoto encyclopedia of genes and genomes (KEGG) is an integrated database resource. Pathway maps and annotations in KEGG are widely used for pathway enrichment analysis | Web; http://www.genome.jp/kegg/ | [4]
STRING | Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) is a database of protein–protein interactions | Web; http://string-db.org/ | [18]
FunRich | FunRich is a downloadable tool for pathway and GO enrichment analysis of genes and proteins. It can process genes/proteins irrespective of the source of the sample, as the user can load a customized database in addition to the default background database | Stand-alone; http://funrich.org/ | [2]
MINT | MINT (Molecular INTeraction) is a curated molecular interaction database | Web, stand-alone; http://mint.bio.uniroma2.it/mint/Welcome.do | [19]
NetworKIN | The NetworKIN database provides an interface to analyze cellular phosphorylation networks. It allows users to query precomputed kinase–substrate relations or obtain predictions on novel phosphoproteins | Web, stand-alone; http://networkin.info | [20]
3.3 Post-translational Modification Analysis
Post-translational modifications (PTMs) play an important role in regulating various cellular processes. One of the most widely studied PTMs is phosphorylation. It acts as a switch for activation and deactivation of specific proteins and their associated signaling pathways, and serves as a rapid and reversible means to modulate protein activity and transduce signals. The advent of mass spectrometry has revolutionized our ability to map PTMs. Mass spectrometry-based studies have provided a comprehensive view of the proteins that undergo modification, along with the specific sites. Based on current understanding of enzyme–substrate relationships and of the specific motifs targeted for post-translational modification, a number of computational tools have been developed to predict PTMs. These tools can be used to evaluate the validity of sites identified in large-scale studies (based on known sites in the databases) or to predict potential modifications.
The Human Protein Reference Database (HPRD) [21] is a repository of manually curated PTM sites. Phospho.ELM [22] is a resource of experimentally validated phosphorylation sites that are manually curated from the literature. The RESID [23] database provides PTM information with literature citations, protein feature tables, molecular models, structure diagrams, and Gene Ontology cross-references. PhosphoSitePlus [24] is a comprehensive repository of curated phosphosites containing reference and orthologous residues in other species. O-GLYCBASE [25] is a resource containing experimentally verified O-linked glycosylation sites. Unimod [26] is a comprehensive public domain database of protein modifications for mass spectrometry applications.
Phosphorylation is the most extensively studied PTM. Protein kinases add phosphate moieties to Tyr, Ser, or Thr residues. Mass spectrometry is used extensively to investigate protein phosphorylation in a high-throughput manner. Phosphorylation either increases or decreases the activity of the target protein. Overlaying phosphoproteomic data on curated pathways can provide insights into activation or deactivation of a particular signaling pathway. PhosphoSitePlus [24] and PHOSIDA [27] are comprehensive repositories of curated phosphosites containing reference and orthologous residues in other species. Protein sequences can be analyzed with various phosphosite prediction tools, such as KinasePhos 2.0 [28], NetPhos 2.0 [29], and DISPHOS 1.3 [30].
Several computational approaches have been developed to predict acetylation sites. NetAcet [31] is a neural network-based N-terminal acetylation site prediction tool; N-Ace [32] predicts acetylation sites based on physicochemical properties of the protein together with accessible surface area; PSKAcePred [33] uses evolutionary similarity along with physicochemical properties to predict lysine acetylation sites; and Species Specific Prediction of Lysine Acetylation (SSPKA) [34] is a computational framework that incorporates predicted secondary structure information and combines functional features with sequence features to predict species-specific acetylation sites across six species: H. sapiens, R. norvegicus, M. musculus, E. coli, S. typhimurium, and S. cerevisiae.
Ubiquitination is one of the most difficult PTMs to identify because of its low abundance, size, and dynamic regulation. Because ubiquitin is much larger than other PTMs, it is difficult to capture by mass spectrometry. However, several ubiquitination sites have been mapped in the last few years based on the diglycine-modified lysine tag, which can be identified by mass spectrometry. Several tools, including UbPred [35], UbiPred [36], E3Miner [37], hCKSAAP_UbSite [38], and iUbiq-Lys [39], have been developed over the years for prediction of ubiquitination sites. hUbiquitome [40] is a comprehensive repository of experimentally verified human ubiquitination enzymes and substrates.
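The diglycine tag adds a fixed mass to the modified lysine, which is what allows such sites to be recognized in mass spectra. The worked example below computes the monoisotopic mass of a peptide with and without the tag. It assumes the commonly used GlyGly mass shift of 114.042927 Da and standard monoisotopic residue masses; the peptide sequence and the function name are hypothetical.

```python
# Standard monoisotopic residue masses in Da (amino acid minus water).
MONOISOTOPIC_RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565          # one water per peptide (free N- and C-termini)
GLYGLY_SHIFT = 114.042927  # assumed diglycine remnant on lysine


def peptide_mass(sequence, diglycine_sites=()):
    """Monoisotopic mass of a peptide; sites are 1-based lysine positions."""
    mass = WATER + sum(MONOISOTOPIC_RESIDUE_MASS[aa] for aa in sequence)
    for position in diglycine_sites:
        if sequence[position - 1] != "K":
            raise ValueError(f"position {position} is not a lysine")
        mass += GLYGLY_SHIFT
    return mass


# Hypothetical tryptic peptide with a modified lysine at position 3.
peptide = "AEKLR"
print(f"unmodified: {peptide_mass(peptide):.4f} Da")
print(f"diglycine on K3: {peptide_mass(peptide, [3]):.4f} Da")
```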
Small ubiquitin-like modifier (SUMO) attaches to various target proteins and modulates cellular processes such as DNA replication, transcription, cell division, nuclear trafficking, and DNA damage response. SUMOylation affects the half-life, localization, or binding partners of its targets and is a crucial mechanism that allows cells to adapt to stress stimuli. Identification of SUMO sites has revealed a strong dependency of SUMOylation events on other PTMs [41]. SUMOsp [42] and GPS-SUMO [43] predict SUMO sites on proteins.
Glycosylation is a common PTM that plays a crucial role in protein folding, cell–cell interaction, antigenicity, transport, and half-life. There are four types of glycosylation: N-linked, O-linked, C-mannosylation, and GPI anchor attachment. EnsembleGly [44] predicts both O- and N-linked glycosylation sites, NetCGlyc [45] predicts C-mannosylation sites, NetOGlyc [46] predicts O-glycosylation sites, and NetNGlyc [47] predicts N-glycosylation sites; PredGPI [48] and GPI-SOM [49] predict GPI anchor sites in a protein.
Scansite [50] is a tool for analyzing protein sequences for phosphorylation motifs recognized by many kinases, and Motif-X [51] allows prediction of various PTM site motifs by identifying overrepresented residues in the flanking regions. ProMEX [52] is a database of mass spectra of tryptic peptides from plant proteins and phosphoproteins.
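The idea of looking for overrepresented residues in the flanking regions can be sketched by aligning fixed-width windows around each modified site and counting residues per position. This is only the counting step, not the statistical procedure of Motif-X or Scansite, which compare such counts against a background. The function names, window width, and sequences are hypothetical.

```python
from collections import Counter


def flanking_window(sequence, position, flank=3, pad="_"):
    """Return the residue at a 1-based position with `flank` residues on
    each side, padded where the window runs past the protein termini."""
    index = position - 1
    left = sequence[max(0, index - flank):index]
    right = sequence[index + 1:index + 1 + flank]
    return left.rjust(flank, pad) + sequence[index] + right.ljust(flank, pad)


def position_counts(windows):
    """Count residues at each aligned window position."""
    width = len(windows[0])
    return [Counter(window[i] for window in windows) for i in range(width)]


# Hypothetical windows centred on phosphoserines (centre index = 2).
windows = ["AASPK", "GRSPL", "TKSPE"]
for offset, counts in enumerate(position_counts(windows), start=-2):
    print(offset, counts.most_common(1))
```

In this toy input, proline at offset +1 is present in every window, the kind of pattern a motif tool would then test for significance.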
Here we describe PTM analysis using the commonly used PTM database Phospho.ELM [22] and the phosphorylation site predictor NetPhos 2.0 [29].
1. To identify experimentally validated PTMs of a given protein, browse the Phospho.ELM database (http://phospho.elm.eu.org/index.html). The database can be queried using protein name, UniProt accession, or Ensembl identifier.
2. The result page of the Phospho.ELM database consists of a table detailing the residue, the position of the residue in the protein, the flanking sequence around the PTM site, the kinase, the PubMed reference for each reported site, the conservation score, a cross-reference to the eukaryotic linear motif resource (ELM: http://elm.eu.org/), the phospho-peptide binding domain, SMART domains, and a cross-reference to PDB, along with other information such as the substrate and cross-references to PHOSIDA [27], PhosphoSitePlus [24], MINT [19], and GO terms [1].
3. Computational prediction of phosphorylation can be done using the NetPhos 2.0 server (http://www.cbs.dtu.dk/services/NetPhos/). Users can submit protein sequences in FASTA format and select the target residues for phosphorylation (tyrosine, serine, or threonine). By default, all three residues are checked in the analysis. Select the corresponding checkbox to generate graphical output.
4. Click on “Submit” to initiate the analysis. Up to 2000 protein sequences can be analyzed in a single query by this Web-based tool.
5. The result page displays a table detailing the submitted protein ID, residue position, PTM site with flanking sequence, and score. Separate tables are generated for serine, threonine, and tyrosine.
6. A graphical result depicts the propensity of the residue at a given position to be a PTM site. Peaks in three different colors are used for the three residues (S, T, Y) on an X-Y plane, where the X-axis is sequence position and the Y-axis is phosphorylation potential.
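The input and the layout of the result table in steps 3 to 5 can be pictured with a small script that reads FASTA text and lists every serine, threonine, and tyrosine with its sequence context. It does not predict anything and does not reproduce the scoring of NetPhos; it only enumerates the candidate residues that a predictor would score. The function names, flank width, and sequence are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of {identifier: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else "unnamed"
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts).upper() for key, parts in records.items()}


def candidate_sites(sequence, residues="STY", flank=4):
    """List (1-based position, residue, context) for each candidate residue."""
    sites = []
    for index, amino_acid in enumerate(sequence):
        if amino_acid in residues:
            context = sequence[max(0, index - flank):index + flank + 1]
            sites.append((index + 1, amino_acid, context))
    return sites


# Hypothetical protein in FASTA format, wrapped over two lines.
fasta_text = ">demo_protein hypothetical example\nMASTK\nYLE\n"
for protein_id, sequence in parse_fasta(fasta_text).items():
    for position, residue, context in candidate_sites(sequence, flank=2):
        print(protein_id, position, residue, context, sep="\t")
```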
3.4 Visualization Tools
A multitude of tools are available for data integration and visualization of “omics” datasets (Table 3). Most visualization tools focus on biomolecular interactions and pathways. These tools commonly employ 2D graphs for data representation. The usefulness of these tools depends largely on their compatibility with other tools and databases.
4 Notes
1. It is preferable to use ‘Gene Symbol’ as the unique identifier for genes. DAVID has an ID conversion tool that can be used to prepare lists with uniform identifiers.
2. Enrichment analysis methods often involve statistical tests to determine whether the input data contain an overrepresentation of proteins involved in certain functions, processes, or pathways beyond what is expected by chance. This is calculated with respect to the background database used by the respective tool. Many tools also give users the option of using a custom database as background. Knowledge of the statistical approach employed in such tools allows the user to make relevant selections for different kinds of datasets and to identify the most enriched gene/protein clusters.
3. Pathway enrichment analysis is carried out against the pathway database used in the background. The back-end pathway database used for analysis will directly influence the outcome of the pathway analysis. This should be taken into consideration, and users should select the pathway annotation resource most suitable for the intended pathway analysis.
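As a practical complement to Note 1, a gene list exported from a spreadsheet often contains blank rows, stray spaces, and repeated symbols. The sketch below tidies such a list into one symbol per row, ready for pasting into a Web form. It is a convenience script under simple assumptions (one symbol per line, duplicates detected without regard to letter case) and does not convert between identifier types; the function name and symbols are illustrative.

```python
def clean_gene_list(raw_text):
    """Return unique gene symbols in their original order.

    Blank lines and surrounding whitespace are dropped. Duplicates are
    detected case-insensitively, and the first spelling seen is kept.
    """
    seen = set()
    cleaned = []
    for line in raw_text.splitlines():
        symbol = line.strip()
        if not symbol:
            continue
        key = symbol.upper()
        if key not in seen:
            seen.add(key)
            cleaned.append(symbol)
    return cleaned


# Illustrative single-column export with a blank row and repeats.
raw = "EGFR\n egfr \n\nTP53\nC1orf112\nTP53\n"
print("\n".join(clean_gene_list(raw)))
```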
Table 3
List of pathway analysis and visualization tools
Name | Description | Availability and link | Reference
GenMAPP | GenMAPP is a Web-based visualization tool for gene/protein expression profiles. It has the MAPPBuilder tool for creating MAPP files (.mapp), which give a graphical pathway representation of genes, and the MAPPFinder tool to annotate the pathway. Each gene is identified by a unique gene ID from GenBank. MAPP files can be shared and manipulated by the user | Stand-alone; http://www.genmapp.org | [53]
Cytoscape | Cytoscape is a Java-based stand-alone tool that supports large-scale network analysis. Both protein–protein and protein–gene networks can be visualized and edited. The standard file format of Cytoscape is the Cytoscape Session File (.cys). Input can be a delimited text table or an Excel workbook, and all major input formats are supported. Results can be exported in formats such as SIF, GML, XGMML, and PSI-MI | Stand-alone; http://www.cytoscape.org/ | [54]
Medusa | Medusa is a Java application for visualization of complex pathways. Results from the STRING pathway database can be analyzed in Medusa. Medusa is less suited to big datasets | Stand-alone; https://sites.google.com/site/medusa3visualization/ | [55]
Perseus | Perseus is a statistical analysis and visualization tool for proteomics data. It incorporates multiple statistical methods such as t-test, clustering, and enrichment analysis, as well as data normalization. It provides various graphs for data visualization, such as scatter plots and volcano plots | Stand-alone; http://www.biochem.mpg.de/5111810/perseus | [56]
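Among the formats named in Table 3, the Simple Interaction Format (SIF) used by Cytoscape is the easiest to produce by hand: each line gives a source node, an interaction type, and a target node. The sketch below writes a small interaction list in that layout, using tab-delimited fields. The interaction type label "pp", the function name, and the protein pairs are assumptions chosen for illustration.

```python
def to_sif(interactions):
    """Format (source, interaction_type, target) triples as SIF text,
    one tab-delimited interaction per line."""
    return "".join(
        f"{source}\t{interaction_type}\t{target}\n"
        for source, interaction_type, target in interactions
    )


# Hypothetical protein-protein interactions.
network = [("EGFR", "pp", "GRB2"), ("GRB2", "pp", "SOS1")]
sif_text = to_sif(network)
print(sif_text, end="")

# The text can be saved with a .sif extension and opened as a network file:
# with open("network.sif", "w") as handle:
#     handle.write(sif_text)
```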
References
1. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene ontology: tool for the unification of biology. The gene ontology consortium. Nat Genet 25(1):25–29. doi:10.1038/75556
2. Pathan M, Keerthikumar S, Ang CS, Gangoda L, Quek CY, Williamson NA, Mouradov D, Sieber OM, Simpson RJ, Salim A, Bacic A, Hill AF, Stroud DA, Ryan MT, Agbinya JI, Mariadason JM, Burgess AW, Mathivanan S (2015) FunRich: an open access standalone functional enrichment and interaction network analysis tool. Proteomics 15(15):2597–2601. doi:10.1002/pmic.201400515
3. Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA (2003) DAVID: database for annotation, visualization, and integrated discovery. Genome Biol 4(5):P3
4. Kanehisa M, Goto S (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28(1):27–30
5. Nishimura D (2004) BioCarta. Biotech Software Internet Report 2:117–120. doi:10.1089/152791601750294344
6. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102(43):15545–15550. doi:10.1073/pnas.0506580102
7. Zeeberg BR, Feng W, Wang G, Wang MD, Fojo AT, Sunshine M, Narasimhan S, Kane DW, Reinhold WC, Lababidi S, Bussey KJ, Riss J, Barrett JC, Weinstein JN (2003) GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol 4(4):R28
8. Beissbarth T, Speed TP (2004) GOstat: find statistically overrepresented gene ontologies within a group of genes. Bioinformatics 20(9):1464–1465. doi:10.1093/bioinformatics/bth088
9. Martin D, Brun C, Remy E, Mouren P, Thieffry D, Jacq B (2004) GOToolBox: functional analysis of gene datasets based on Gene Ontology. Genome Biol 5(12):R101. doi:10.1186/gb-2004-5-12-r101
10. Castillo-Davis CI, Hartl DL (2003) Gene
Merge—post-genomic analysis, data mining,
and hypothesis testing. Bioinformatics
19(7):891–892
11. Boyle EI, Weng S, Gollub J, Jin H, Botstein D,
Cherry JM, Sherlock G (2004) GO::Term
Finder—open source software for accessing
gene ontology information and finding signifi-
cantly enriched gene ontology terms associated
with a list of genes. Bioinformatics 20(18):3710–
3715. doi:10.1093/bioinformatics/bth456
12. Du Z, Zhou X, Ling Y, Zhang Z, Su Z (2010) agriGO: a GO analysis toolkit for the agricultural community. Nucleic Acids Res 38(Web Server Issue):64–70. doi:10.1093/nar/gkq310
13. Al-Shahrour F, Minguez P, Tarraga J, Medina I, Alloza E, Montaner D, Dopazo J (2007) FatiGO +: a functional profiling tool for genomic data. Integration of functional annotation, regulatory motifs and interaction data with microarray experiments. Nucleic Acids Res 35(Web Server Issue):91–96. doi:10.1093/nar/gkm260
14. Joshi-Tope G, Gillespie M, Vastrik I, D'Eustachio P, Schmidt E, de Bono B, Jassal B, Gopinath GR, Wu GR, Matthews L, Lewis S, Birney E, Stein L (2005) Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 33(Database issue):D428–D432. doi:10.1093/nar/gki072
15. Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A, Margalit H, Armstrong J, Bairoch A, Cesareni G, Sherman D, Apweiler R (2004) IntAct: an open source molecular interaction database. Nucleic Acids Res 32(Database issue):D452–D455. doi:10.1093/nar/gkh052
16. Kandasamy K, Mohan SS, Raju R, Keerthikumar S, Kumar GS, Venugopal AK, Telikicherla D, Navarro JD, Mathivanan S, Pecquet C, Gollapudi SK, Tattikota SG, Mohan S, Padhukasahasram H, Subbannayya Y, Goel R, Jacob HK, Zhong J, Sekhar R, Nanjappa V, Balakrishnan L, Subbaiah R, Ramachandra YL, Rahiman BA, Prasad TS, Lin JX, Houtman JC, Desiderio S, Renauld JC, Constantinescu SN, Ohara O, Hirano T, Kubo M, Singh S, Khatri P, Draghici S, Bader GD, Sander C, Leonard WJ, Pandey A (2010) NetPath: a public resource of curated signal transduction pathways. Genome Biol 11(1):R3. doi:10.1186/gb-2010-11-1-r3
17. Mi H, Poudel S, Muruganujan A, Casagrande
JT, Thomas PD (2016) PANTHER version 10:
expanded protein families and functions, and
analysis tools. Nucleic Acids Res 44(D1):D336–
D342. doi:10.1093/nar/gkv1194
18. von Mering C, Huynen M, Jaeggi D, Schmidt
S, Bork P, Snel B (2003) STRING: a database
of predicted functional associations between
proteins. Nucleic Acids Res 31(1):258–261
19. Zanzoni A, Montecchi-Palazzi L, Quondam
M, Ausiello G, Helmer-Citterich M, Cesareni
G (2002) MINT: a molecular INTeraction
database. FEBS Lett 513(1):135–140
20. Linding R, Jensen LJ, Pasculescu A, Olhovsky
M, Colwill K, Bork P, Yaffe MB, Pawson T
(2008) NetworKIN: a resource for exploring
cellular phosphorylation networks. Nucleic
Acids Res 36(Database issue):D695–D699.
doi:10.1093/nar/gkm902
21. Peri S, Navarro JD, Kristiansen TZ, Amanchy
R, Surendranath V, Muthusamy B, Gandhi
TK, Chandrika KN, Deshpande N, Suresh S,
Rashmi BP, Shanker K, Padma N, Niranjan V,
Harsha HC, Talreja N, Vrushabendra BM,
Ramya MA, Yatish AJ, Joy M, Shivashankar
HN, Kavitha MP, Menezes M, Choudhury
DR, Ghosh N, Saravana R, Chandran S,
Mohan S, Jonnalagadda CK, Prasad CK,
Kumar-Sinha C, Deshpande KS, Pandey A
(2004) Human protein reference database as a
discovery resource for proteomics. Nucleic
Acids Res 32(Database issue):D497–D501.
doi:10.1093/nar/gkh070
22. Diella F, Cameron S, Gemund C, Linding R,
Via A, Kuster B, Sicheritz-Ponten T, Blom N,
Gibson TJ (2004) Phospho.ELM: a database of
experimentally verified phosphorylation sites in
eukaryotic proteins. BMC Bioinformatics 5:79.
doi:10.1186/1471-2105-5-79
23. Garavelli JS (2004) The RESID database of
protein modifications as a resource and anno-
tation tool. Proteomics 4(6):1527–1533.
doi:10.1002/pmic.200300777
24. Hornbeck PV, Zhang B, Murray B, Kornhauser
JM, Latham V, Skrzypek E (2015)
PhosphoSitePlus, 2014: mutations, PTMs and
recalibrations. Nucleic Acids Res 43(Database
issue):D512–D520. doi:10.1093/nar/gku1267
25. Gupta R, Birch H, Rapacki K, Brunak S,
Hansen JE (1999) O-GLYCBASE version 4.0:
a revised database of O-glycosylated proteins.
Nucleic Acids Res 27(1):370–372
26. Creasy DM, Cottrell JS (2004) Unimod: pro-
tein modifications for mass spectrometry.
Proteomics 4(6):1534–1536.
doi:10.1002/pmic.200300744
27. Gnad F, Ren S, Cox J, Olsen JV, Macek B,
Oroshi M, Mann M (2007) PHOSIDA (phos-
phorylation site database): management, struc-
tural and evolutionary investigation, and
prediction of phosphosites. Genome Biol
8(11):R250. doi:10.1186/gb-2007-8-11-r250
28. Huang HD, Lee TY, Tzeng SW, Horng JT
(2005) KinasePhos: a web tool for identifying
protein kinase-specific phosphorylation sites.
Nucleic Acids Res 33(Web Server Issue):W226–
W229. doi:10.1093/nar/gki471
29. Blom N, Gammeltoft S, Brunak S (1999)
Sequence and structure-based prediction of
eukaryotic protein phosphorylation sites.
J Mol Biol 294(5):1351–1362.
doi:10.1006/jmbi.1999.3310
30. Iakoucheva LM, Radivojac P, Brown CJ,
O'Connor TR, Sikes JG, Obradovic Z, Dunker
AK (2004) The importance of intrinsic disorder
for protein phosphorylation. Nucleic Acids Res
32(3):1037–1049. doi:10.1093/nar/gkh253
31. Kiemer L, Bendtsen JD, Blom N (2005)
NetAcet: prediction of N-terminal acetylation
sites. Bioinformatics 21(7):1269–1270.
doi:10.1093/bioinformatics/bti130
32. Lee TY, Hsu JB, Lin FM, Chang WC, Hsu PC,
Huang HD (2010) N-Ace: using solvent acces-
sibility and physicochemical properties to identify
protein N-acetylation sites. J Comput Chem
31(15):2759–2771. doi:10.1002/jcc.21569
33. Suo SB, Qiu JD, Shi SP, Sun XY, Huang SY,
Chen X, Liang RP (2012) Position-specific anal-
ysis and prediction for protein lysine acetylation
based on multiple features. PLoS One
7(11):e49108. doi:10.1371/journal.pone.0049108
34. Li Y, Wang M, Wang H, Tan H, Zhang Z,
Webb GI, Song J (2014) Accurate in silico
identification of species-specific acetylation
sites by integrating protein sequence-derived
and functional features. Sci Rep 4:5765.
doi:10.1038/srep05765
35. Radivojac P, Vacic V, Haynes C, Cocklin RR,
Mohan A, Heyen JW, Goebl MG, Iakoucheva
LM (2010) Identification, analysis, and predic-
tion of protein ubiquitination sites. Proteins
78(2):365–380. doi:10.1002/prot.22555
36. Tung CW, Ho SY (2008) Computational
identification of ubiquitylation sites from pro-
tein sequences. BMC Bioinformatics 9:310.
doi:10.1186/1471-2105-9-310
37. Lee H, Yi GS, Park JC (2008) E3Miner: a text
mining tool for ubiquitin-protein ligases.
Nucleic Acids Res 36(Web Server Issue):W416–
W422. doi:10.1093/nar/gkn286
38. Chen Z, Zhou Y, Song J, Zhang Z (2013)
hCKSAAP_UbSite: improved prediction of
human ubiquitination sites by exploiting amino
acid pattern and properties. Biochim Biophys
Acta 1834(8):1461–1467.
doi:10.1016/j.bbapap.2013.04.006
39. Qiu WR, Xiao X, Lin WZ, Chou KC (2015)
iUbiq-Lys: prediction of lysine ubiquitination
sites in proteins by extracting sequence evolu-
tion information via a gray system model.
J Biomol Struct Dyn 33(8):1731–1742.
doi:10.1080/07391102.2014.968875
40. Du Y, Xu N, Lu M, Li T (2011) hUbiquitome:
a database of experimentally verified ubiquiti-
nation cascades in humans. Database (Oxford)
2011:bar055. doi:10.1093/database/bar055
41. Eifler K, Vertegaal AC (2015) Mapping the
SUMOylated landscape. FEBS J 282(19):3669–
3680. doi:10.1111/febs.13378
42. Xue Y, Zhou F, Fu C, Xu Y, Yao X (2006)
SUMOsp: a web server for sumoylation site
prediction. Nucleic Acids Res 34(Web Server
Issue):W254–W257. doi:10.1093/nar/gkl207
43. Zhao Q, Xie Y, Zheng Y, Jiang S, Liu W, Mu
W, Liu Z, Zhao Y, Xue Y, Ren J (2014) GPS-
SUMO: a tool for the prediction of sumoylation
sites and SUMO-interaction motifs. Nucleic
Acids Res 42(Web Server Issue):W325–W330.
doi:10.1093/nar/gku383
44. Caragea C, Sinapov J, Silvescu A, Dobbs D,
Honavar V (2007) Glycosylation site predic-
tion using ensembles of Support Vector
Machine classifiers. BMC Bioinformatics
8:438. doi:10.1186/1471-2105-8-438
45. Julenius K (2007) NetCGlyc 1.0: prediction of
mammalian C-mannosylation sites. Glycobiology
17(8):868–876. doi:10.1093/glycob/cwm050
46. Hansen JE, Lund O, Tolstrup N, Gooley AA,
Williams KL, Brunak S (1998) NetOglyc: pre-
diction of mucin type O-glycosylation sites
based on sequence context and surface acces-
sibility. Glycoconj J 15(2):115–130
47. Gupta R, Jung E, Brunak S (2004)
NetNGlyc 1.0 Server. Center for Biological
Sequence Analysis, Technical University of
Denmark
(http://www.cbs.dtu.dk/services/NetNGlyc)
48. Pierleoni A, Martelli PL, Casadio R (2008)
PredGPI: a GPI-anchor predictor. BMC
Bioinformatics 9:392.
doi:10.1186/1471-2105-9-392
49. Fankhauser N, Maser P (2005) Identification
of GPI anchor attachment signals by a
Kohonen self-organizing map. Bioinformatics
21(9):1846–1852.
doi:10.1093/bioinformatics/bti299
50. Obenauer JC, Cantley LC, Yaffe MB
(2003) Scansite 2.0: proteome-wide predic-
tion of cell signaling interactions using short
sequence motifs. Nucleic Acids Res 31(13):
3635–3641
51. Chou MF, Schwartz D (2011) Biological
sequence motif discovery using motif-x. Curr
Protoc Bioinformatics 13:15–24.
doi:10.1002/0471250953.bi1315s35
52. Hummel J, Niemann M, Wienkoop S, Schulze
W, Steinhauser D, Selbig J, Walther D,
Weckwerth W (2007) ProMEX: a mass spec-
tral reference database for proteins and protein
phosphorylation sites. BMC Bioinformatics
8:216. doi:10.1186/1471-2105-8-216
53. Dahlquist KD, Salomonis N, Vranizan K,
Lawlor SC, Conklin BR (2002) GenMAPP, a
new tool for viewing and analyzing microarray
data on biological pathways. Nat Genet
31(1):19–20. doi:10.1038/ng0502-19
54. Shannon P, Markiel A, Ozier O, Baliga
NS, Wang JT, Ramage D, Amin N, Schwikowski
B, Ideker T (2003) Cytoscape: a software envi-
ronment for integrated models of biomolecular
interaction networks. Genome Res
13(11):2498–2504. doi:10.1101/gr.1239303
55. Hooper SD, Bork P (2005) Medusa: a simple
tool for interaction graph analysis. Bioinformatics
21(24):4432–4433.
doi:10.1093/bioinformatics/bti696
56. Tyanova S, Temu T, Sinitcyn P, Carlson A,
Hein M, Geiger T, Mann M, Cox J
(2016) The Perseus computational platform
for comprehensive analysis of (prote)omics
data. Nat Methods 13(9):731–740.
doi:10.1038/nmeth.3901