
GPS-Uber: A hybrid-learning framework for prediction of general and E3-specific lysine ubiquitination sites

Abstract

As an important post-translational modification, lysine ubiquitination participates in numerous biological processes and is involved in human diseases, whereas the site specificity of ubiquitination is mainly decided by ubiquitin-protein ligases (E3s). Although numerous ubiquitination predictors have been developed, computational prediction of E3-specific ubiquitination sites is still a great challenge. Here, we carefully reviewed the existing tools for the prediction of general ubiquitination sites. Also, we developed a tool named GPS-Uber for the prediction of general and E3-specific ubiquitination sites. From the literature, we manually collected 1311 experimentally identified site-specific E3-substrate relations, which were classified into different clusters based on corresponding E3s at different levels. To predict general ubiquitination sites, we integrated 10 types of sequence and structure features, as well as three types of algorithms including penalized logistic regression, deep neural network and convolutional neural network. Compared with other existing tools, the general model in GPS-Uber exhibited a highly competitive accuracy, with an area under the curve (AUC) value of 0.7649. Then, transfer learning was adopted for each E3 cluster to construct E3-specific models, and in total 112 individual E3-specific predictors were implemented. Using GPS-Uber, we conducted a systematic prediction of human cancer-associated ubiquitination events, which could be helpful for further experimental consideration. GPS-Uber will be regularly updated, and its online service is free for academic research at http://gpsuber.biocuckoo.cn/.
Chenwei Wang is a postdoc scientist at Huazhong University of Science and Technology. He mainly focuses on the development of new algorithms to predict
functional PTM events from multi-omic data. He developed a number of computational methods including iCMod, cMAK, iFPS and iFIP to predict functional
kinases, substrates and interacting partners involved in regulating circadian, autophagy and ageing. He was also a major developer of EPSD for collecting known
protein phosphorylation sites in eukaryotes.
Xiaodan Tan is a master's student at Huazhong University of Science and Technology. She mainly focuses on the collection and annotation of protein lysine
modifications.
Dachao Tang is a PhD student at Huazhong University of Science and Technology. His major research is identification of potentially important proteins involved
in regulating autophagy.
Yujie Gou is a PhD student at Huazhong University of Science and Technology. She focuses on the development of new deep learning frameworks to process
biomedical imaging data.
Cheng Han is a PhD student at Huazhong University of Science and Technology. He is working on the computational prediction of PTM sites.
Wanshan Ning is a postdoc scientist at Huazhong University of Science and Technology. His major research interest is focused on using artificial intelligence
methods to analyze sequence, multi-omics and imaging data. He built a new hybrid-learning architecture named HybridSucc for predicting general and
species-specific succinylation sites. He also developed GPS-Palm to predict palmitoylation sites, HUST-19 to predict COVID-19 clinical outcomes and POC-19 to
prioritize protein biomarkers of COVID-19.
Shaofeng Lin is a PhD student at Huazhong University of Science and Technology. His major research interest is focused on the integration of PTM data. He was a
major developer of EPSD. He was also a major developer of iUUCD 2.0, a database of ubiquitin and ubiquitin-like conjugations.
Weizhi Zhang is a PhD student at Huazhong University of Science and Technology. He mainly focuses on developing machine learning algorithms based on
multi-omic data. He developed a new method named iCAL to predict cancer mutations that change autophagy selectivity.
Miaomiao Chen is a PhD student at Huazhong University of Science and Technology. She is working on the prediction of phosphorylation sites using deep
learning algorithms.
Di Peng is a postdoc scientist at Huazhong University of Science and Technology. His major research interests are focused on experimentally discovering new PTM
regulators, substrates and sites in the regulation of diverse biological processes, with the combination of computational predictions.
Yu Xue is a professor at Huazhong University of Science and Technology. He has worked in the field of PTM bioinformatics since 2004. He is interested in
using both computational and experimental approaches to exploit how functional PTM events can be precisely orchestrated to regulate various biological
processes, such as autophagy, circadian and cell fate determination. He is also involved in the establishment of a new interdisciplinary field, artificial intelligence
biology (AIBIO) in China.
Received: September 3, 2021. Revised: December 11, 2021. Accepted: December 14, 2021
© The Author(s) 2022. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
Briefings in Bioinformatics, 2022, 1–15
https://doi.org/10.1093/bib/bbab574
Review
Chenwei Wang, Xiaodan Tan, Dachao Tang, Yujie Gou, Cheng Han, Wanshan Ning, Shaofeng Lin, Weizhi Zhang,
Miaomiao Chen, Di Peng and Yu Xue
Corresponding author: Yu Xue, Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and
Molecular Imaging Key Laboratory, Center for Artificial Intelligence Biology, College of Life Science and Technology, Huazhong University of Science and
Technology, Wuhan, Hubei 430074, China. Tel.: +86-27-87793903; Fax: +86-27-87793172; E-mail: xueyu@hust.edu.cn.
These authors contributed equally to this work.
Keywords: Post-translational modification, lysine ubiquitination, ubiquitin-protein ligase, site-specific E3-substrate relation, deep
learning
Introduction
As one of the most indispensable post-translational
modifications (PTMs), lysine ubiquitination regulates a
wide spectrum of biological processes including protein
degradation and turnover, membrane trafficking, cell
cycle and deoxyribonucleic acid (DNA) damage repair
[1–3]. In 1978, Ciechanover et al. discovered a 76-amino
acid protein, ubiquitin, which can be covalently attached
Downloaded from https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab574/6509047 by Huazhong University of Science and Technology, Yu Xue on 18 January 2022
to lysine residues in protein substrates through a
cascade of biochemical reactions catalyzed by ubiquitin-
activating enzymes (E1s), ubiquitin-conjugating enzymes
(E2s) and ubiquitin-protein ligases (E3s) [4,5]. E3s are
structurally diverse enzymes and play a critical role in
determining the substrate specificity and efficiency of
ubiquitination reactions [6,7]. Aberrations in E3s
and ubiquitinated targets have been associated with
numerous human diseases, such as cancer, autoimmune
diseases, metabolic syndromes and neurodegenerative
diseases [7–9]. Thus, identification of E3-specific targets
and site-specific E3-substrate relations (ssESRs) is fun-
damental for understanding the molecular mechanisms
and regulatory roles of lysine ubiquitination.
Conventionally, biochemical identification of E3-
specific targets and ubiquitination sites is low-throughput
(LTP), labor-intensive and time-consuming. In recent
years, a number of high-throughput (HTP)
experimental methods have been developed, such as
yeast two-hybrid screening, phage display, global protein
stability profiling, affinity purification-tandem mass
spectrometry (AP-MS/MS) and Gly–Gly (diGly) remnant
affinity purification [10–12]. For example, in 2008, Yen
et al. developed a fluorescence-based system called
global protein stability profiling, which could monitor
the protein turnover under different physiological and
disease conditions [13]. Using this method, Yen et al.
systematically identified 359 highly potential substrates
of the Skp1-cullin-F-box (SCF) ubiquitin ligase, and most
of the known SCF targets were covered [14]. With the
help of AP-MS/MS, Low et al. identified 221 potential
SCFβ-TrCP substrates that contained the DpSGXX(X)pS
motif, a primary degron specifically recognized
by SCFβ-TrCP [15]. In addition, Elia et al. identified 33 503
ubiquitination sites using the diGly remnant affinity
purification strategy and discovered EXO1 as a new SCF
target in response to DNA damage [16].
Besides the LTP and HTP experimental assays, computational
prediction of E3-substrate interactions (ESIs)
or ubiquitination sites has also emerged as a highly
useful approach. For the prediction of ESIs, in 2017,
Li et al. incorporated multiple types of informative
features including orthologous ESI, enriched domain
pair, enriched Gene Ontology (GO) term pair, network
topology and E3 recognition motif (aka ‘primary degron’)
and developed a naïve Bayesian-based method named
UbiBrowser [17]. Recently, UbiBrowser 2.0 was released
to cover more species, and prediction of deubiquitinase-
substrate interactions was also implemented [18].
In parallel, Chen et al. integrated transcriptomics-,
proteomics-, network- and pathway-based associations
and used recursive feature elimination and random
forest (RF) algorithms to develop a new method for
predicting ESIs [19]. Through further experiments, they
validated 3 and 5 potentially new targets of SCFSKP2 and
SCFFBXL6, respectively [19]. For the prediction of general or
species-specific ubiquitination sites, various tools have
also been developed, including UbiPred [20], UbPred [21],
UbSite [22], CKSAAP_UbSite [23], WPNNA [24], UbiProber
[25], hCKSAAP_UbSite [26], RUBI [27], iUbiq-Lys [28],
UbiSite [29], ESA-UbiSite [30], PTM-ssMP [31], PTMscape
[32], ModPred [33], deepUbiquitylation [34], DeepUbi
[35], MUscADEL [36], DL-plant-ubsites-prediction [37],
MusiteDeep [38], UbiSite-XGBoost [39], UbiComb [40],
CNNAthUbi [41], DeepTL-Ubi [42] and MultiLyGAN
[43]. Although numerous efforts have been made in
computational analysis of ubiquitination, prediction of
exact ssESRs remains a great challenge.
Here, we first provided a brief review of currently
available tools for predicting general and species-
specific ubiquitination sites. Then, we developed an
online service named group-based prediction system
for ubiquitin E3 ligase-substrate relations (GPS-Uber),
which could predict general and E3-specific lysine
ubiquitination sites from protein sequences. For train-
ing models in GPS-Uber, seven sequence- and three
structure-based features were considered, and three
machine learning algorithms including two-dimensional
(2D) convolutional neural network (CNN), deep neural
network (DNN) and penalized logistic regression (PLR)
were integrated into a hybrid-learning architecture.
Compared with other existing tools, GPS-Uber showed
a highly competitive accuracy, with an area under
the curve (AUC) value of 0.7649 for the prediction of
general ubiquitination sites. With the help of transfer
learning, 111 individual E3-specific predictors were
also constructed (Figure 1). To investigate the potential
relationships between ubiquitination and cancer, the
ubiquitination sites of known cancer proteins were
predicted by GPS-Uber at the E3 group level and could
serve as a useful resource for further experimental
consideration. Taken together, we anticipate that GPS-
Uber could be helpful to facilitate the research on E3-
mediated ubiquitination.
Methods
Data collection and preparation
First, the combinations of keywords including ‘ubiq-
uitination’, ‘ubiquitinated’ and ‘ubiquitylation’ were
added with suffixes such as ‘lysine’, ‘residue’, ‘site’
and ‘proteomic’ to search experimentally identified
ssESRs from PubMed. Only known ssESRs in Homo
sapiens were collected, because far fewer ssESRs
were identified in other species. Through the literature
biocuration, we obtained 1311 known ssESRs between
1117 human ubiquitination sites of 391 proteins and 177
E3s (Supplementary Table S1). More details on collection
of known ssESRs are provided in the Supplementary
Methods.
Figure 1. The experimental procedure of this study. First, experimentally
identified ubiquitination sites were taken from PLMD [44], and homologous
sites were eliminated through the CD-HIT clustering [45] to generate
the initial training data set, which contained 61 161 ubiquitination sites.
Then, 10 types of sequence- and structure-based features, including
GPS, PseAAC, CKSAAP, OBC, AAindex, ACF, PSSM, ASA, SS and BTA, were
encoded for model training with three algorithms including DNN, PLR and
CNN. Meanwhile, known E3-specific ubiquitination sites were manually
collected from PubMed and classified to various E3 clusters based on the
information of iUUCD [1]. Transfer learning was performed for different
E3 clusters to construct E3-specific models based on the general model.
Finally, a user-friendly online service was developed for researchers in this
field.
In 2017, we developed the protein lysine modification
database (PLMD), which contained 121 742 experimentally
identified lysine ubiquitination sites in 25 103
proteins [44]. For the prediction of general ubiquitination
sites, these sites were taken as the benchmark data
set. A widely used clustering program, CD-HIT [45], was
adopted to classify this data set into different clusters
with a threshold of 40% sequence similarity. To avoid
the homologous redundancy, only one representative
sequence in each cluster was extracted into training
data. Then, we defined a ubiquitination site peptide
USP(m,n) as a lysine residue flanked by upstream m
residues and downstream n residues, and USP(10,10) was
chosen in this study for rapid training. As previously
described [46], the USP(10,10) items around known
ubiquitination sites were regarded as positive data,
whereas USP(10,10) items from other non-ubiquitinated
lysine residues were taken as negative data. For lysine
residues located near the N- or C-terminus of the protein
sequences, one or multiple characters were added
to complement the USP(10,10) items. Prior to model
training, the redundant USP(10,10) items were removed.
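The construction of USP(m,n) items can be illustrated with a minimal Python sketch. It is not the GPS-Uber implementation: the function names are hypothetical, and the padding character '*' is an assumption, since the text only states that one or multiple characters were added near the termini. Redundant peptides are dropped here simply by storing them in sets.

```python
def extract_usp(sequence, m=10, n=10, pad="*"):
    """Return (1-based position, peptide) pairs for every lysine in `sequence`.

    `pad` is an assumed placeholder for residues missing near the termini.
    """
    padded = pad * m + sequence + pad * n
    peptides = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Residue i of `sequence` sits at index i + m of `padded`, so the
            # slice [i, i + m + n + 1) has the lysine at offset m.
            peptides.append((i + 1, padded[i:i + m + n + 1]))
    return peptides


def split_positive_negative(sequence, known_sites, m=10, n=10):
    """Split USP(m,n) items into positive and negative sets (duplicates removed)."""
    positives, negatives = set(), set()
    for position, peptide in extract_usp(sequence, m, n):
        if position in known_sites:
            positives.add(peptide)
        else:
            negatives.add(peptide)
    return positives, negatives
```

For example, `extract_usp("MKAAK", 2, 2)` yields the two padded windows centred on the lysines at positions 2 and 5.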
Before the E3-specific training, the hierarchical clas-
sifications of human E3s at different levels, including
class, group, subgroup, family and single E3, were
downloaded from integrated annotations for Ubiqui-
tin and Ubiquitin-like Conjugation Database (iUUCD)
(http://iuucd.biocuckoo.org/) [1], and the 1311 known
ssESRs were classified into different E3 clusters at group,
subgroup, family and single E3 levels. Only E3 clusters
with ≥3 ubiquitination sites were kept for further
training. For each E3 cluster, positive and negative data
were generated in the same way as in general training.
Finally, we got 111 E3 clusters with ≥3 known sites.
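The grouping of ssESRs into E3 clusters can be sketched as follows. This is only an illustration under stated assumptions: the record layout (E3, protein, position), the cluster labels and the function name are hypothetical, and the minimum of three sites per cluster is our reading of the filtering criterion described above.

```python
from collections import defaultdict


def build_e3_clusters(ssesrs, e3_to_clusters, min_sites=3):
    """Group sites by E3 cluster and keep clusters with at least `min_sites` sites.

    ssesrs: iterable of (e3, protein, position) records (hypothetical layout).
    e3_to_clusters: maps an E3 to the labels of all clusters it belongs to,
        e.g. its group, subgroup, family and the single E3 itself.
    """
    sites_by_cluster = defaultdict(set)
    for e3, protein, position in ssesrs:
        for cluster in e3_to_clusters.get(e3, ()):
            # A site is counted once per cluster, even if several E3s
            # of that cluster modify it.
            sites_by_cluster[cluster].add((protein, position))
    return {
        cluster: sorted(sites)
        for cluster, sites in sites_by_cluster.items()
        if len(sites) >= min_sites
    }
```

Because one E3 contributes to clusters at several levels, a single ssESR can appear in up to four clusters in this sketch.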
Performance evaluation measurements
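For the evaluation of our methods, four widely used
measurements, including sensitivity (Sn), specificity (Sp),
accuracy (Ac) and Matthews correlation coefficient (MCC),
were calculated as below, where TP, TN, FP and FN denote the
numbers of true positives, true negatives, false positives
and false negatives, respectively:
$$Sn = \frac{TP}{TP + FN}$$
$$Sp = \frac{TN}{TN + FP}$$
$$Ac = \frac{TP + TN}{TP + FP + TN + FN}$$
$$MCC = \frac{(TP \times TN) - (FN \times FP)}{\sqrt{(TP + FN) \times (TN + FP) \times (TP + FP) \times (TN + FN)}}$$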
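As a worked illustration, the four measurements can be computed from the confusion-matrix counts with a few lines of Python. The function name is hypothetical, and returning an MCC of 0 when the denominator vanishes is a common convention that we assume here rather than something stated in the text.

```python
import math


def evaluate(tp, tn, fp, fn):
    """Compute Sn, Sp, Ac and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    ac = (tp + tn) / (tp + fp + tn + fn)
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # Assumed convention: report MCC = 0 when the denominator is zero.
    mcc = ((tp * tn) - (fn * fp)) / denominator if denominator else 0.0
    return {"Sn": sn, "Sp": sp, "Ac": ac, "MCC": mcc}
```

With 40 true positives, 45 true negatives, 5 false positives and 10 false negatives, this gives Sn = 0.80, Sp = 0.90, Ac = 0.85 and an MCC of about 0.70.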
For the prediction of general ubiquitination sites, 4-,
6-, 8- and 10-fold cross-validations were performed to
evaluate the accuracy and robustness of finally deter-
mined models, using the training data set that contained
61 161 known ubiquitination sites. For the comparison of
GPS-Uber with other existing tools, a timestamp-based
strategy [38] was adopted to split the initial benchmark
data set into a secondary training data set containing
55 426 sites reported before 2016, and an independent
testing data set with 5735 sites released after 2016. Addi-
tional models were generated with this training data set
using the algorithm of GPS-Uber, and the testing data
set was then used to evaluate the performance of GPS-
Uber and other tools. The initial benchmark data set, the
secondary training data set and the independent testing
data set are freely downloadable at: http://gpsuber.biocuckoo.cn/userguide.php.
For predicting E3-specific sites,
the robustness of models with ≥30 sites was tested with
10-fold cross-validations 20 times, and leave-one-out
(LOO) validations were performed for other models
with <30 sites. For each model, the receiver operating
characteristic (ROC) curve was illustrated based on Sn
and 1-Sp scores, from which the AUC value was further
calculated.
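The derivation of a ROC curve from Sn and 1-Sp scores, and of the AUC value from that curve, can be sketched as below. This is a generic illustration rather than the code used in this study: the function name is hypothetical, a peptide is assumed to be called positive when its score is at or above the cutoff, and the area is obtained with the trapezoidal rule.

```python
def roc_curve_and_auc(scores, labels):
    """Return the ROC points (1 - Sp, Sn) and the trapezoidal AUC.

    labels use 1 for positive and 0 for negative items.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative item")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        cutoff = ranked[i][0]
        # Items sharing the same score enter the curve together.
        while i < len(ranked) and ranked[i][0] == cutoff:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    auc = sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
    return points, auc
```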
The algorithm of GPS-Uber
In 2020, we developed a hybrid-learning architecture
called HybridSucc which combined a PLR algorithm with
a DNN algorithm for the prediction of succinylation sites
[47]. Through the integration of conventional machine
learning and deep learning algorithms, the performance
of the predictor was significantly improved. Later, a
parallel CNN framework was constructed, and a new
tool, GPS-Palm, was released for S-palmitoylation site
prediction, achieving promising accuracy with graphical
features [48].
In this study, a novel hybrid-learning framework was
designed to incorporate PLR, DNN and CNN algorithms.
First, seven types of sequence features including the
peptide similarity encoded by the GPS method [46],
pseudo amino acid composition (PseAAC), composition
of k-spaced amino acid pairs (CKSAAPs), orthogonal
binary coding (OBC), physicochemical properties in the
Amino Acid index (AAindex) database, autocorrelation
functions (ACFs) and position-specific scoring matrix
(PSSM), and three structural features including accessible
surface area (ASA), secondary structure (SS) and
backbone torsion angles (BTAs) were encoded to one-dimensional
(1D) vectors for PLR and DNN and 2D
matrices for CNN (Figure 1) [20–30,34,35,37,40–43,47,49–52].
More details on feature encoding are provided in
Supplementary Methods. For each feature, a PLR model
with the ridge (L2) penalty was constructed by scikit-learn
v0.23.2, and the ‘lbfgs’ solver was adopted for
parameter optimization. A four-layer DNN framework
was implemented in Keras 2.4.3 (http://github.com/fchollet/keras)
based on tf-nightly-gpu 2.5.0 dev20201028
for the same encoded vectors. Similarly, an 11-layer 2D
CNN framework containing four convolutional and four
pooling layers was realized for the encoded matrices. The
rectified linear unit (ReLU) was adopted as the activation
function, which was defined as:
$$\mathrm{ReLU}(x) = \begin{cases} x, & x \ge 0 \\ 0, & x < 0 \end{cases}$$
In the output layer, one neuron with the sigmoid function
was taken to calculate the final score for a given
USP(10,10):
$$\mathrm{sigmoid}(y) = \frac{1}{1 + e^{-y}}$$
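For reference, the two activation functions can be written in plain Python as follows. This is a scalar sketch for illustration only; in practice the deep learning framework applies its own vectorized versions. The sigmoid is written in a numerically stable form, which is an implementation choice of this sketch and is mathematically equivalent to the standard definition.

```python
import math


def relu(x):
    """Rectified linear unit: x for x >= 0, otherwise 0."""
    return x if x >= 0 else 0.0


def sigmoid(y):
    """Logistic sigmoid 1 / (1 + exp(-y)), evaluated without overflow."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    # For negative y, exp(-y) may overflow, so use the equivalent
    # form exp(y) / (1 + exp(y)).
    z = math.exp(y)
    return z / (1.0 + z)
```

The sigmoid maps any real-valued output of the last neuron to a score between 0 and 1, which is why it is suited to the final layer.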
To rapidly determine the optimal parameters of deep
learning networks, we randomly extracted 1/10 ubiqui-
tination sites from the training data set for general pre-
diction, and different combinations of parameters were
tested with this data set to determine the combination
with the highest AUC value. The optimized parameters
including number of neurons, dropout ratio, learning rate
and pool size were provided in Supplementary Table S2.
For each USP(10,10), 10 features (f1, f2, f3, ..., f10) were
separately scored by DNN (D1, D2, D3, ..., D10), PLR (P1,
P2, P3, ..., P10) and CNN (C1, C2, C3, ..., C10). Then a 30-dimensional
vector containing 30 scores was generated
as follows:
$$V = (D_1, D_2, D_3, \ldots, D_{10}, P_1, P_2, P_3, \ldots, P_{10}, C_1, C_2, C_3, \ldots, C_{10})$$
To integrate the information from various features and
algorithms, the vector V was then used as the secondary
feature and a new four-layer DNN model was constructed
to get a final score.
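The assembly of the 30-dimensional secondary feature can be sketched as below. The function name and the dictionary-based input are hypothetical, and the assignment of f1 to f10 to the feature order listed in Figure 1 is an assumption of this sketch. The four-layer DNN that consumes the vector is not reproduced here.

```python
# Assumed feature order for f1..f10 (taken from the Figure 1 caption).
FEATURES = ("GPS", "PseAAC", "CKSAAP", "OBC", "AAindex",
            "ACF", "PSSM", "ASA", "SS", "BTA")


def secondary_vector(dnn_scores, plr_scores, cnn_scores):
    """Concatenate per-feature scores into V = (D1..D10, P1..P10, C1..C10).

    Each argument maps a feature name to the score that the corresponding
    algorithm assigned to one USP(10,10) item.
    """
    vector = []
    for scores in (dnn_scores, plr_scores, cnn_scores):
        missing = [name for name in FEATURES if name not in scores]
        if missing:
            raise KeyError("missing scores for: " + ", ".join(missing))
        vector.extend(scores[name] for name in FEATURES)
    return tuple(vector)
```

Fixing the order of the 30 entries matters because the secondary model learns one weight pattern per position of the vector.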
For E3-specific prediction, the ssESRs in each clus-
ter were used to fine-tune the general models through
the transfer learning strategy, and the optimized models
were assigned to corresponding E3 predictors. For each
predictor, three thresholds including high, medium and
low were determined based on Sp values of 95%, 90% and
85%, respectively. In the online service of GPS-Uber, the
medium threshold was chosen as the default.
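One way to derive score cutoffs from target Sp values is sketched below. It is an illustration under assumptions, not the procedure used for GPS-Uber: the function names are hypothetical, a site is assumed to be predicted as positive when its score is at or above the cutoff, and the cutoff is taken as the smallest observed negative score that reaches the target Sp.

```python
def cutoff_at_specificity(negative_scores, sp_percent):
    """Smallest cutoff at which at least sp_percent% of negatives score below it."""
    total = len(negative_scores)
    for cutoff in sorted(set(negative_scores)):
        below = sum(1 for s in negative_scores if s < cutoff)
        # Integer comparison avoids floating-point rounding of the target Sp.
        if below * 100 >= sp_percent * total:
            return cutoff
    # No observed score reaches the target: nothing is predicted as positive.
    return float("inf")


def thresholds_for_predictor(negative_scores):
    """High, medium and low thresholds at Sp values of 95%, 90% and 85%."""
    targets = {"high": 95, "medium": 90, "low": 85}
    return {
        name: cutoff_at_specificity(negative_scores, percent)
        for name, percent in targets.items()
    }
```

A higher Sp target gives a higher cutoff, so the high threshold returns fewer but more stringent predictions than the low one.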
A computer with an NVIDIA GeForce RTX 3090 GPU,
a Genuine Intel(R) CPU @ 2.30 GHz and 128 GB RAM
was used for the training of computational models.
The hypergeometric test
For the enrichment analysis of E3-specific substrates,
GO annotation files (released on 1 May 2021) [53] were
downloaded from the Gene Ontology Resource (http://geneontology.org/),
containing 19 762 human proteins with
at least one GO term. For each GO term t with E3 group e,
we defined the following:
N= number of genes annotated by at least one
GO term.
n= number of genes annotated by GO term t.
M= number of e’s substrates annotated by at least one
GO term.
m= number of e’s substrates annotated by GO term t.
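The enrichment ratio (E-ratio) of t was then computed,
and the P-value was calculated with the hypergeometric
distribution as follows:
$$E\text{-ratio} = \frac{m/M}{n/N}$$
$$P\text{-value} = \sum_{m'=m}^{n} \frac{\binom{M}{m'}\binom{N-M}{n-m'}}{\binom{N}{n}} \quad (E\text{-ratio} \ge 1), \text{ or}$$
$$P\text{-value} = \sum_{m'=0}^{m} \frac{\binom{M}{m'}\binom{N-M}{n-m'}}{\binom{N}{n}} \quad (E\text{-ratio} < 1).$$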
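The enrichment calculation can be reproduced with the Python standard library, as in the following sketch. The function name is hypothetical, and the code assumes that M and n are positive and that m does not exceed n. The sums are accumulated as exact integers and divided once at the end.

```python
from math import comb


def go_enrichment(N, n, M, m):
    """Return (E-ratio, P-value) for one GO term and one E3 group.

    N: genes annotated by at least one GO term; n: genes annotated by term t;
    M: substrates annotated by at least one GO term; m: substrates with term t.
    """
    e_ratio = (m / M) / (n / N)
    if e_ratio >= 1:
        overlaps = range(m, n + 1)   # upper tail: over-representation
    else:
        overlaps = range(0, m + 1)   # lower tail: under-representation
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible overlaps contribute nothing to the sum.
    favourable = sum(comb(M, k) * comb(N - M, n - k) for k in overlaps)
    return e_ratio, favourable / comb(N, n)
```

For instance, with N = 10, n = 5, M = 5 and m = 5, the E-ratio is 2 and the P-value is 1/252.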
The hypergeometric test was also adopted for the GO-
based enrichment analyses of cancer proteins predicted
to be ubiquitinated by E3 groups. A total of 707 cancer
proteins were downloaded from the Cancer Gene Census
in Catalogue of Somatic Mutations in Cancer (COSMIC)
(https://cancer.sanger.ac.uk/census, v94) [54].
The analysis of primary degrons
Previously, it was reported that E3 recognition motifs
act as primary degrons to determine the ubiquitination
specificity at the substrate level [55]. The Eukary-
otic Linear Motif Database (ELM, http://elm.eu.org)
provides a comprehensive dataset of experimentally
characterized short linear motifs, including known E3-
specific primary degrons [56]. Here, we downloaded
the file ‘elm_classes.tsv’ that contained 317 motif
classes and associated regular expressions from ELM.
The information from the columns ‘ELMIdentifier’,
‘FunctionalSiteName’ and ‘Description’ was extracted,
and 27 known degrons were retained for 11 E3 clusters
in GPS-Uber if available (Supplementary Table S3). Using
the ‘re’ module of Python, the sequence profile of
each E3-specific degron was used to search the protein
sequences of the corresponding E3-specific substrates,
using the 391 proteins containing 1311 known ssESRs.
More details on collection of known ssESRs were shown
in Supplementary Methods. For each identified degron
motif, the distance to its proximal ubiquitination site
modified by the same E3 was counted.
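A minimal sketch of the degron search with the 're' module is given below. The function names are hypothetical, the regular expression used in the example is illustrative rather than an entry copied from ELM, and the distance definition (number of residues between the site and the nearest edge of the motif, 0 if the site lies inside it) is an assumption, since the exact counting rule is not specified here.

```python
import re


def _gap(site, start, end):
    """Residues between a 1-based site and the motif span [start, end]."""
    if site < start:
        return start - site
    if site > end:
        return site - end
    return 0


def degron_to_site_distances(sequence, degron_regex, ub_sites):
    """For each degron match, report (start, end, nearest site, distance).

    Positions are 1-based; re.finditer reports non-overlapping matches only.
    """
    if not ub_sites:
        return []
    hits = []
    for match in re.finditer(degron_regex, sequence):
        start, end = match.start() + 1, match.end()
        nearest = min(ub_sites, key=lambda site: _gap(site, start, end))
        hits.append((start, end, nearest, _gap(nearest, start, end)))
    return hits
```

Calling `degron_to_site_distances("MAAKDSGIHSGAAAAK", r"DSG.{2,3}[ST]", [4, 16])` finds one motif spanning residues 5 to 10, whose proximal site is the lysine at position 4.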
The gene expression and proteomic data
Files (.mRNAseq_Pre-process.Level_3.) containing the
mRNA expression levels of 37 cancer types of The Cancer
Genome Atlas (TCGA) program were downloaded from
BROAD Institute (http://gdac.broadinstitute.org/runs/stddata__latest/) [57].
We mapped this data set to 177
E3s and 391 substrates of the E3-specific data set to get
their mRNA expression profiles. Time-course proteomic
data generated from a previous study [58] were also
used, containing 6205 proteins mutually quantified
from normal rat kidney cells treated with 16-nm silica
nanoparticles at 60 μg/ml for 0, 8, 16, 20 and 24 h. For rat
proteins, their human orthologs of E3s and substrates
were computationally identified by reciprocal best
hits [59].
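The reciprocal best hit criterion can be expressed compactly once the best hit of each protein in the other proteome is known. The sketch below assumes that these best hits have already been computed by a sequence search and are supplied as dictionaries; the function name and the identifiers in the usage example are hypothetical.

```python
def reciprocal_best_hits(best_rat_to_human, best_human_to_rat):
    """Keep rat-human pairs that are each other's best hit.

    best_rat_to_human: maps each rat protein to its best human hit.
    best_human_to_rat: maps each human protein to its best rat hit.
    """
    return {
        rat: human
        for rat, human in best_rat_to_human.items()
        if best_human_to_rat.get(human) == rat
    }
```

For example, if a rat protein points to human RAC1 and RAC1 points back to the same rat protein, the pair is retained, whereas one-directional best hits are discarded.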
The data visualization
For each E3 group, the USP(10,10) items in positive data
were directly uploaded to the web service of pLogo
(https://plogo.uconn.edu/), and corresponding negative
data were selected as background. Then the sequence
logo was generated automatically. The heat map was
diagrammed by a previously developed tool HemI [60],
and Cytoscape [61] was used to visualize networks.
In addition, the functional domain and predicted
ubiquitinated sites of RAC1 were illustrated using DOG
2.0 [62].
Results
A summary of available methods for the
prediction of ubiquitination sites
Besides the large-scale identification of ESIs and ubiquitination
sites with HTP experimental methods [10–14,16],
computational predictions also provided an
alternative approach to facilitate the research of ubiquitination.
Because fewer studies have been conducted
on the prediction of ESIs [17–19], here we mainly
focused on review of the 28 available methods for
predicting general or species-specific ubiquitination sites
(Supplementary Table S4).
In 2008, Tung et al. developed the first ubiquitination
site predictor named UbiPred [20]. After the evaluation
of different features and classifiers, the combination
of 31 informative physicochemical properties from
AAindex and support vector machine (SVM) algorithm
was adopted for training the final model [20]. Over the
next 10 years, SVM was widely used for predicting
ubiquitination sites. For example, Chen et al. designed
CKSAAP_UbSite based on the CKSAAP feature, and
SVM was used to predict yeast ubiquitination sites [23].
For the prediction of human ubiquitination sites, the
authors released hCKSAAP_UbSite, in which additional
features including binary amino acid compositions,
AAindex properties and protein aggregation propensity
were encoded to construct SVM classifiers [26]. In 2013,
Chen et al. combined PseAAC, k-nearest neighbor (KNN)
and AAindex to construct UbiProber for both general
and species-specific predictions [25]. Using an iterative
approach, Walsh et al. reported RUBI as a rapid genome-
scale predictor for lysine ubiquitination, whereas bi-
directional recurrent neural networks were incorporated
with SVM to integrate the sequence- and structure-
based features [27]. Later, iUbiq-Lys was released by
incorporating PseAAC, PSSM and gray system model [28].
By developing UbiSite with a two-layered SVM model,
Huang et al. adopted four widely used features including
PseAAC, PSSM, positional-weighted matrix (PWM) and
ASA and extracted substrate motifs using the MDDLogo
[29]. To evaluate the performance of different features,
Nguyen et al. developed a new framework using SVM, and
the motif-based models derived from MDDLogo exhibited
the best accuracy [52]. Also, Wang et al. constructed an
SVM-based method known as ESA-UbiSite, in which 31
AAindex properties were selected by an optimization
approach [30]. In 2018, Liu et al. reported a comprehensive
web server called PTM-site-specific modification profile
(ssMP), which provided predictions for multiple types
of PTM sites including lysine ubiquitination sites. For
each PTM type, ssMP was generated from both local
sequence and proximal PTMs, and SVM classifier was
then adopted to construct the computational model [31].
Through the integration of various features including
AAindex, ASA, SS, BTA and PSSM, Li et al. developed an R
package named PTMscape for the prediction of various
PTM sites including lysine ubiquitination sites, based on
linear SVM [32].
Besides SVM, machine learning algorithms including
RF and KNN were also adopted to predict lysine ubiquitination.
In 2010, Radivojac et al. used RF to construct
UbPred, which integrated 586 sequence features [21]. Based
on the same RF algorithm, Zhao et al. integrated
four features including PseAAC, PSSM, AAindex and dis-
order score and developed an ensemble model via voting
[49]. Also, Lee et al. reported UbSite based on a radial
basis function network, which combined the features
of PseAAC, CKSAAP, PSSM and ASA [22]. Using a fea-
ture selection procedure, 456 features including PSSM,
AAindex and disorder score were extracted by Cai et al.,
and KNN algorithm was adopted to develop a novel
ubiquitination site predictor [50]. Similar features were
also incorporated by WPNNA, a new classifier based
on an optimized KNN algorithm [24]. Moreover, Pejaver
et al. designed an LR-based tool named ModPred for the
prediction of >20 types of PTM sites, and four feature
types including sequence-based, physicochemical, struc-
tural properties and evolutionary properties were inte-
grated for model training [33]. Later, Nguyen et al. used
profile hidden Markov model to build several models
based on identified motifs of existing sites [51]. In addi-
tion, the eXtreme gradient boosting (XGBoost) algorithm
was adopted by Liu et al. to develop UbiSite-XGBoost,
a new predictor for general ubiquitination sites, and
various features including PseAAC, CKSAAP, AAindex,
PsePSSM, BLOSUM62, adapted normal distribution bi-
profile Bayes and encoding based on grouped weight
were integrated [39].
Recently, with the accumulation of ubiquitination
sites, advances in deep learning provided a great
opportunity for big data training. In 2018, He et al.
constructed deepUbiquitylation, which combined DNN
and CNN to encode three features, namely OBC, AAindex
and PSSM [34]. Later, Fu et al. designed a CNN-based
framework DeepUbi [35]. The performances of four
features including OBC, AAindex, PseAAC and CKSAAP
were evaluated, and the combination of OBC and
CKSAAP obtained the highest AUC value. Meanwhile,
a new computational tool, MUscADEL, was reported
by Chen et al. for lysine PTMs prediction [36]. An
extended RNN framework was constructed with a word
embedding layer to extract sequence features. More
recently, DeepTL-Ubi was constructed with a densely
connected CNN, and transfer learning was performed
to extend the prediction for multiple species with the
feature of OBC [42]. In contrast, MultiLyGAN, released by
Yang et al., adopted a conditional Wasserstein generative
adversarial network to eliminate data imbalance, and the
RF algorithm was used to generate models for multiple
lysine modifications [43]. Beyond general prediction, the
development of tools to predict plant ubiquitylation sites
is also prevalent. In 2020, Wang et al. released a CNN-
based architecture called DL-plant-ubsites-prediction,
which implemented a word-embedding method based
on features of PseAAC, CKSAAP, PWM and sequence logo
[37]. In parallel, MusiteDeep was developed to provide
efficient predictors for numerous types of PTM sites
including ubiquitination sites [38]. In MusiteDeep, two
CNN-based networks were integrated to generate the
final model for each PTM type with OBC feature [38].
By integrating CNN with long short-term memory, Siraj
et al. constructed UbiComb for the prediction of plant
ubiquitination sites [40]. In addition, an Arabidopsis
thaliana-specific predictor, CNNAthUbi, was designed by
Wang et al. using a CNN framework [41]. However, no tools
have been developed for the prediction of exact ssESRs from
protein sequences.
The data statistics of known E3-specific
ubiquitination sites
Considering that PLMD provides no information on
upstream E3s, we collected 1311 experimentally iden-
tified ssESRs from the literature for the development of
E3-specific models. Using the hierarchical classifications
of iUUCD [1], these ssESRs were hierarchically clustered
at different levels, including 6 groups, 4 subgroups, 15
families and 93 single E3s, and positive and negative
data sets were generated for each cluster.
In our results, four groups of Really Interesting
New Gene (RING), Cullin RING, RING-between RING–
RING (RBR) and Homologous to the E6AP Carboxyl
Terminus (HECT) covered 98.17% (1287) of total ssESRs,
and RING contained the largest positive data set with
754 ssESRs in 221 proteins (Figure 2A). In contrast,
only 9 ssESRs were reported to be ubiquitinated by
Recognition components of the N-end rule pathway
(N-recognin) E3s, and 17 ssESRs from 6 ubiquitinated
proteins were classified into the other group. Notably,
these sites were abundant in eight E3 clusters
at the family level, including RING/RING, RBR/RBR,
RING/U-box, HECT/HECT, Cullin RING/DDB1-CUL4-
X-box/DDB1-binding WD40 protein (Cullin RING/DCX/DWD),
Cullin RING/Skp1-Cullin-F-box protein/F-box
(Cullin RING/SCF/F-box), Cullin RING/BTB, Cul3
and RBX1 form a Cul3-based ligase/BTB_3-box (Cullin
RING/BCR/BTB_3-box) and Cullin RING/ECS/Suppressors
Of Cytokine Signalling_Von-Hippel Lindau_BC-box
(Cullin RING/ECS/SOCS_VHL_BC-box) (Figure 2A).
Again, the RING/RING family accounted for the most
ssESRs (676). The comparison of substrates
across the four major E3 groups with ≥30 protein
substrates demonstrated a low overlap among different
groups. Only 7 proteins were known to be modified by 3
types of E3 groups, and 21 substrates were shared by
RING and Cullin RING groups (Figure 2B). In addition,
the sequence logos of these groups were generated for
the investigation of potential substrate motifs, which
demonstrated diverse patterns for E3 groups (Figure 2C).
For example, besides a high frequency of serine (S)
detected at position +4 for both RING and Cullin RING
groups, glutamic acid (E) and proline (P) showed high
probabilities at positions −3 and +3 of RING, respectively,
whereas a signature of arginine (R) was found at position
−6 of Cullin RING. Similarly, different patterns of amino
acids were detected for the RBR and HECT groups, such as
aspartic acid (D) at position −3 for RBR and valine (V) at
position −4 for HECT (Figure 2C). The results suggested
that different E3s prefer to recognize different sequence
profiles for substrate ubiquitination.
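The position-specific preferences that a sequence logo summarizes can be sketched as a frequency table over aligned peptides centered on the modified lysine. The snippet below is a minimal illustration, not the procedure used to draw Figure 2C: the function name `position_frequencies`, the pad character `*` and the toy peptides are hypothetical and do not represent real substrates.

```python
from collections import Counter


def position_frequencies(peptides, upstream, pad="*"):
    """Return {relative_position: {residue: frequency}} for aligned peptides.

    All peptides must have the same length, with the modified lysine at
    index `upstream` (relative position 0). Pad characters are ignored.
    """
    if len({len(p) for p in peptides}) != 1:
        raise ValueError("peptides must be aligned to the same length")
    table = {}
    for column in range(len(peptides[0])):
        counts = Counter(p[column] for p in peptides if p[column] != pad)
        total = sum(counts.values())
        table[column - upstream] = {aa: n / total for aa, n in counts.items()}
    return table


# Toy peptides (invented), with 3 residues on each side of the central K.
toy = ["AEAKPAS", "GEGKPGS", "*DAKAAS"]
freqs = position_frequencies(toy, upstream=3)
```

With real data, the same table computed separately for each E3 group would expose differences such as an enriched residue at a given upstream or downstream position.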
Next, GO-based enrichment analyses were conducted
to detect the biological processes regulated by the four
major E3 groups with ≥30 substrates (Figure 2D).
Interestingly, the overlap of biological processes was
much higher than that of substrates. For example, the process
‘positive regulation of transcription by RNA polymerase
II’ (GO:0045944) was among the top five
enriched biological processes of RING, Cullin RING and
HECT at the same time. Also, ‘negative regulation of
apoptotic process’ (GO: 0043066) and ‘protein deubiq-
uitination’ (GO:0016579) were detected simultaneously
with two different E3 groups. The results indicated that
Figure 2. The analysis of the known E3-specific ubiquitination sites. (A) The number of known substrates of E3 groups and families with ≥30 known
ubiquitination sites. More details are shown in Supplementary Table S1. (B) The overlap of protein substrates from four major E3 groups with ≥30
substrates. (C) The sequence logos of four major E3 groups. (D) The GO-based enrichment analysis of protein substrates from four major E3 groups.
a considerable number of processes were mutually
regulated by different types of E3s. Using the mRNA
expression data from TCGA [57], the correlation of
the 177 E3s and 391 substrates was analyzed. The
average Spearman’s rank correlation coefficient (ρ)
was calculated as 0.0487 (Supplementary Figure S1A),
indicating a weak correlation of E3s and their targets
at the transcriptional level. Moreover, we re-analyzed
the time-course quantitative proteomic data from a
recently published study [58], and the average ρ of 0.0614
supported a weak correlation of E3s and their targets at
the translational level (Supplementary Figure S1B).
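For readers who wish to reproduce this kind of analysis, the following sketch computes Spearman's ρ for one E3-substrate pair and averages it over several pairs. It is a minimal illustration in plain Python under stated assumptions: the function names are hypothetical, ties receive average ranks, pairs with a constant profile are skipped, and the numeric profiles are invented rather than taken from TCGA or the proteomic data set.

```python
import math
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a constant profile
    return cov / math.sqrt(var_x * var_y)


def mean_pairwise_rho(pairs):
    """Average rho over (e3_profile, substrate_profile) pairs."""
    rhos = [spearman_rho(a, b) for a, b in pairs]
    rhos = [r for r in rhos if not math.isnan(r)]
    return mean(rhos) if rhos else float("nan")


# Invented expression profiles across four samples.
toy_pairs = [([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]),
             ([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0])]
average_rho = mean_pairwise_rho(toy_pairs)
```

An average close to zero, as in this toy case, indicates that positively and negatively correlated pairs cancel out or that most pairs are uncorrelated.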
Development of a hybrid-learning framework for
the prediction of ubiquitination sites
In the past two decades, various features have been
adopted to construct the predictors for ubiquitination
sites, and performance improvement was observed
Figure 3. The hybrid-learning framework of GPS-Uber, as well as the performance values of different features. (A) For each USP(10,10) item, 10 types
of features were encoded for model training with three algorithms including DNN, PLR and CNN. Then a vector containing 30 predicted scores was
generated as input for an additional DNN framework to produce the final score. The 4-, 6-, 8- and 10-fold cross-validations and LOO were performed to
evaluate the robustness of models. (B) For each feature, the AUC values of DNN, CNN and PLR were calculated for the general and E3-specific predictors
with ≥30 known ubiquitination sites. (C) The distribution of AUC values from 10-fold cross-validations for the general predictor, on different types of
features.
with the combination of different features [20–43,
49–52]. Meanwhile, algorithms based on conventional
machine learning and deep learning were both widely
used, whereas a systematic evaluation was yet to be
performed. More importantly, although numerous tools
were developed, prediction of exact ssESRs was still
a great challenge. Here, we designed a new architec-
ture, GPS-Uber, for the prediction of both general and
E3-specific ubiquitination sites (Figure 3A). For each
USP(10,10) in the training data set, 10 types of sequence-
or structure-based features were individually encoded
into 1D vectors and 2D matrices. For each
feature, three models were separately constructed
based on PLR, DNN and CNN algorithms, using the
corresponding encoded vector or matrix. As a result,
30 scores were generated through the combination of
features and algorithms, and vectors containing these
scores were adopted by a new DNN model as inputs to
obtain the final prediction scores for all USP(10,10) items.
The predictor for general ubiquitination sites was first
constructed, and transfer learning was then adopted
for implementation of E3-specific models. In total,
112 individual predictors were constructed by GPS-
Uber. Using the training data set with 61 161 known
ubiquitination sites, 10-fold cross-validations were
conducted to evaluate the performance of models
with ≥30 ubiquitination sites (Supplementary Table S5).
The results demonstrated that the AUC values of
sequence features were generally higher than those
of structural features, especially in GPS-based peptide
similarity, PseAAC and CKSAAP (Figure 3B). In general,
deep learning algorithms showed higher AUC values
than PLR (Figure 3C). For general prediction, most of
the features exhibited small fluctuations of AUC values
through 10-fold cross-validations (Figure 3C), supporting
the robustness and stability of computational models.
The final model that combined the three algorithms and
10 features reached an AUC value of 0.8106, which was
significantly improved compared with single algorithms
or features (Figure 3C). At the group level, Cullin RING
achieved the highest AUC value of 0.8967 compared with
RING (0.8188), RBR (0.8074) and HECT (0.8204), whereas
the AUC values of E3 families ranged from 0.7767
(RING) to 0.9396 (BTB_3-box) (Figure 3B). Meanwhile,
LOO validations were conducted for predictors with
<30 ubiquitination sites (Supplementary Table S5), and
the incorporation of 10 types of features and three
types of algorithms significantly improved the prediction
performance for all data sets, further supporting the
superiority of GPS-Uber.
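The two-level design described above (10 features x 3 algorithms giving 30 first-level scores, which are then merged by a second-level model) can be sketched structurally as follows. This is not the GPS-Uber implementation: the feature names are placeholders, the first-level scorers are deterministic stand-ins for trained PLR, DNN and CNN models, and the second-level DNN is replaced by a simple logistic combiner with uniform weights purely for illustration.

```python
import math

FEATURES = ["feature_%d" % i for i in range(1, 11)]  # placeholder names
ALGORITHMS = ["PLR", "DNN", "CNN"]


def make_placeholder_scorer(offset):
    """Stand-in for a trained base model: a deterministic score in [0, 1]."""
    def scorer(peptide):
        total = sum(ord(ch) for ch in peptide) + offset
        return (total % 101) / 100.0
    return scorer


def build_base_models():
    """One scorer per (feature, algorithm) combination: 10 x 3 = 30 models."""
    models = {}
    for i, feature in enumerate(FEATURES):
        for j, algorithm in enumerate(ALGORITHMS):
            models[(feature, algorithm)] = make_placeholder_scorer(7 * i + j)
    return models


def score_vector(peptide, models):
    """First level: collect the 30 scores in a fixed order."""
    return [models[(f, a)](peptide) for f in FEATURES for a in ALGORITHMS]


def combine(scores, weights, bias=0.0):
    """Second level: logistic combiner, a stand-in for the final DNN."""
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))


models = build_base_models()
vector = score_vector("MSEAAQTLDSKRAPLEESGAV", models)  # invented 21-mer
final_score = combine(vector, weights=[1.0 / 30] * 30)
```

In a real system, the second-level model would be trained on first-level scores obtained from held-out data, so that it does not simply learn which base models overfit the training set.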
Performance evaluation and comparison
Besides 10-fold cross-validations, 4-, 6- and 8-fold cross-
validations were also performed using the initial training
data set. For the general model, the AUC values were
calculated as 0.8102, 0.8107, 0.8105 and 0.8106 for
4-, 6-, 8- and 10-fold cross-validations, respectively
(Figure 4A). To demonstrate the superiority of GPS-
Uber, a comparison was conducted between GPS-
Uber and other existing tools with the independent
testing data. Although 28 methods were reported for
predicting general or species-specific ubiquitination
sites, applicable online services or executable codes were
provided by only six tools, including hCKSAAP_UbSite
[26], RUBI [27], ESA_Ubisite [30], DL-plant-ubsites-
prediction [37], CNNAthUbi [41] and MusiteDeep [38].
These tools were designed for general prediction and
no E3-specific model was provided. Using a timestamp-
based method [38], 55 426 sites reported before 2016
were used for additional model training by GPS-Uber,
whereas 5735 sites released after 2016 were directly
submitted to GPS-Uber and other existing tools for an
unbiased comparison. For each tool, the ROC curve was
illustrated and AUC value was calculated (Figure 4B). The
results demonstrated that GPS-Uber exhibited a highly
competitive accuracy compared with other tools, and the
AUC values were 0.7649, 0.6993/0.6866, 0.6698, 0.6670,
0.6411, 0.5774 and 0.5262 for GPS-Uber, CNNAthUbi, RUBI,
DL-plant-ubsites-prediction, MusiteDeep, ESA_Ubisite
and hCKSAAP_UbSite, respectively (Figure 4B).
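The AUC values compared here have a simple probabilistic reading: the AUC equals the probability that a randomly chosen positive site receives a higher score than a randomly chosen negative site, with ties counted as one half. The sketch below computes it directly from that definition. The function name `roc_auc` and the toy labels and scores are illustrative, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: iterable of 1 (positive site) or 0 (negative site).
    scores: predicted scores in the same order; higher means more likely positive.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required to compute the AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive-negative pairs are ranked correctly.
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.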
To further evaluate the robustness of GPS-Uber,
we conducted an additional validation. For the initial
training data set containing 61 161 ubiquitination
sites, the USP(10,10) items were randomly separated
into five equal parts, with the same distribution of
positive data versus negative data. Then, four parts
were taken as a new training data set, and the 10-
fold cross-validation was performed for parameter
optimization, whereas the remaining one part was taken
as an independent testing data set for performance
evaluation. This procedure was repeated five times
until each of the five parts was used as the testing
data set for one time. The AUC values ranged from
0.7449 to 0.7536 (Supplementary Figure S2A), supporting
the stability and superiority of GPS-Uber. In addition,
we tested 1D CNN directly using the vectors encoded
from the 10 features, and the performance was slightly
reduced against 2D CNN when integrated in GPS-Uber
(Supplementary Figure S2B).
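The five-part validation described in this paragraph can be sketched as a stratified partition followed by five train/test rounds. The code below is an illustration under stated assumptions, not the authors' script: the function names, the fixed random seed and the round-robin assignment used to preserve the positive-to-negative ratio are all illustrative choices.

```python
import random


def stratified_parts(labels, n_parts=5, seed=0):
    """Split item indices into n_parts with the same class distribution."""
    rng = random.Random(seed)
    parts = [[] for _ in range(n_parts)]
    for cls in sorted(set(labels)):
        indices = [i for i, l in enumerate(labels) if l == cls]
        rng.shuffle(indices)
        for rank, index in enumerate(indices):
            parts[rank % n_parts].append(index)  # deal out round-robin
    return parts


def train_test_rounds(labels, n_parts=5, seed=0):
    """Yield (train_indices, test_indices), holding out each part once."""
    parts = stratified_parts(labels, n_parts, seed)
    for held_out in range(n_parts):
        test = list(parts[held_out])
        train = [i for p, part in enumerate(parts) if p != held_out for i in part]
        yield train, test


# Toy data: 10 positive and 40 negative items.
toy_labels = [1] * 10 + [0] * 40
rounds = list(train_test_rounds(toy_labels))
```

Within each round, the training indices would be used for cross-validated parameter optimization, and the held-out indices only for the final performance estimate.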
Of note, GPS-Uber provided multiple unique predictors
to predict E3-specific ubiquitination sites for the
first time, and additional 4-, 6- and 8-fold cross-validations
were also conducted for models with ≥30
ubiquitination sites. Owing to page limitations, only the ROC
curves of four E3 families, including Cullin RING/SCF/F-box,
Cullin RING/DCX/DWD, RING/RING and RBR/RBR,
are shown (Figure 4C). For Cullin RING/SCF/F-box, the
AUC values of 4-, 6-, 8- and 10-fold cross-validations were
0.8831, 0.8862, 0.8885 and 0.8979, respectively. Similar results were
observed for Cullin RING/DCX/DWD with AUC values of
0.8709, 0.8668, 0.8681 and 0.8866, respectively, whereas
satisfying performance values were also detected
with RING/RING and RBR/RBR. To investigate the site
specificity of E3s, four E3s that belonged to HECT group
(NEDD4 and ITCH) and RING group (MDM2 and STUB1)
were selected. For each E3-specific predictor, the training
data sets of the remaining three E3s were individually
used to test its performance. The results showed
that each E3-specific predictor exhibited
a much higher accuracy only on its own training data set,
supporting a strong specificity of E3s in recognizing their
target sites (Supplementary Figure S2C). Taken together,
our results indicated the promising robustness and
accuracy of GPS-Uber for both general and E3-specific
predictions.
For convenience, a user-friendly web server was devel-
oped for GPS-Uber (Figure 4D). The clickable hierarchical
classification tree of E3s was located in the left panel,
Figure 4. Performance evaluation and comparison of GPS-Uber with other existing tools. (A) The ROC curves and AUC values of n-fold cross-validations
of the general model. (B) Comparison of the general model of GPS-Uber with other existing predictors using an independent testing data set. (C) The
accuracy values of E3-specific predictors for E3 families, including Cullin RING/SCF/F-box, Cullin RING/DCX/DWD, RING/RING and RBR/RBR. (D) Interface
of the online service of GPS-Uber.
which enabled various combinations for the prediction of
different E3s. Then, single or multiple protein sequences
in FASTA format could be submitted after the selection of
E3s, and the prediction results would be presented after
a few seconds, containing potential ubiquitination
sites with seven types of information, including ‘ID’,
‘Position’, ‘Code’, ‘E3 enzymes’, ‘Peptides’, ‘Score’ and
‘Cutoff’ (Figure 4D). Also, we implemented an option
‘View experimental information’, which could be ticked
to additionally present a column of ‘Source’ in the
prediction page (Figure 4D). The experimental evidence
of predicted sites could be viewed by clicking on
‘Exp.’ if available (Figure 4D). Furthermore, an option
‘Filtered by Name’ was added to enable the rapid
search of E3s in the left hierarchical tree (Figure 4D).
Moreover, an additional module was implemented for
the prediction using gene names, protein names and/or
UniProt accession numbers for eight model species
including H. sapiens, Mus musculus, Rattus norvegicus,
Danio rerio, Drosophila melanogaster, Caenorhabditis elegans,
A. thaliana and Saccharomyces cerevisiae, and could be
accessed at http://gpsuber.biocuckoo.cn/online_name.php
(Supplementary Figure S3). In addition, the Ras-
related C3 botulinum toxin substrate 1 (RAC1) protein
was used as an example for new users. The online
service of GPS-Uber was implemented using PHP and
JavaScript and has been tested on multiple web browsers
including Google Chrome 92.0, Mozilla Firefox 89, Opera
77.0 and Safari 14.1.1. In summary, GPS-Uber was
designed to provide a handy resource for the research
of ubiquitination.
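To illustrate the kind of input and output handled by such a service, the sketch below parses FASTA text and lists every lysine as a candidate site together with its flanking peptide. It is an offline illustration only and does not call the GPS-Uber server: the function names are hypothetical, the pad character `*` for positions beyond the protein termini is an assumption, the 'Code' field is assumed to hold the residue letter, and the scoring step ('Score' and 'Cutoff') is deliberately left out.

```python
def parse_fasta(text):
    """Parse FASTA text into {identifier: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else "unnamed"
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}


def candidate_sites(records, flank=10, pad="*"):
    """List every lysine with `flank` residues on each side, padded at the termini."""
    rows = []
    for protein_id, sequence in records.items():
        for index, residue in enumerate(sequence):
            if residue != "K":
                continue
            left = sequence[max(0, index - flank):index].rjust(flank, pad)
            right = sequence[index + 1:index + 1 + flank].ljust(flank, pad)
            rows.append({"ID": protein_id,
                         "Position": index + 1,  # 1-based position
                         "Code": "K",
                         "Peptide": left + "K" + right})
    return rows


# Invented two-line FASTA record.
toy_fasta = ">P1 demo protein\nMKA\nAK\n"
sites = candidate_sites(parse_fasta(toy_fasta), flank=2)
```

With `flank=10`, each peptide corresponds to a USP(10,10) item, which a trained predictor would then score against a chosen cutoff.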
Prediction of potential cancer-associated
ubiquitination events
Considering that numerous signaling pathways were
regulated by ubiquitination in human, and aberrant
E3s or ubiquitination events have been reported to be
associated with cancers, we considered whether GPS-
Uber could be used to reveal new relationships between
ubiquitination and cancers, which would provide novel
insight for the treatment of cancers. A total of 707 cancer
proteins maintained in COSMIC [54] were downloaded
as input for GPS-Uber, and the ubiquitination sites
regulated by the six E3 groups were predicted with the
medium threshold. Strikingly, only two cancer proteins
were excluded with no site predicted, and 674 (95.33%)
proteins were predicted to be ubiquitinated by three or
more types of E3s (Figure 5A). Moreover, counting the
predicted ubiquitination sites showed that 644 proteins
had >10 sites (Figure 5B), which might
be partly related to the length of protein sequences
(Supplementary Table S6). These results demonstrated
the prevalence and importance of cancer-associated
ubiquitination events.
GO-based enrichment analyses were performed for
154 cancer proteins with predicted ubiquitination sites of
all the six E3 groups, and transcription-related pathways
were enriched as the dominating process, including ‘reg-
ulation of transcription, DNA-templated’ (GO:0045893
and GO:0006355), ‘regulation of transcription by RNA
polymerase II’ (GO:0045944 and GO:0000122) and ‘chro-
matin remodelling’ (GO:0006338) pathways (Figure 5C).
In addition, biological processes associated with cell
cycle and cellular homeostasis were also detected.
Similar results were observed when the same analyses
were conducted for individual E3 categories (Figure 5D).
Interestingly, phosphorylation-related pathways such
as ‘positive regulation of kinase activity’ (GO: 0033674)
and ‘peptidyl-tyrosine phosphorylation’ (GO:0018108)
were also identified, which suggested cancer-related
crosstalk between ubiquitination and phosphorylation.
The human RAC1, an important cancer protein, was
selected as an example of E3-specific ubiquitination
site prediction by GPS-Uber. It has been reported
that TRAF6, an E3 belonging to RING/RING family, could
aggravate ischemic stroke through the ubiquitination
of RAC1 at K16 [63]. In addition, K147 in RAC1 was
found to be ubiquitinated by IAPs, which were also
classified into RING/RING family [64]. Beyond RING/RING
family, E3s from Cullin RING/SCF/F-box family could
also regulate the degradation of RAC1 through the
ubiquitination at K166 [65]. The predictions of GPS-
Uber were highly consistent with these experimental
results when the categories RING/RING and Cullin
RING/SCF/F-box were chosen, and more information
could be provided by using other predictors (Figure 5E).
Discussion
Since the discovery of ubiquitin in 1978, the underlying
mechanisms of ubiquitination have remained a hotspot
in the field of PTMs [2, 4, 66]. A broad spectrum of biological
processes and diseases, such as protein degradation
and cancers, has been reported to be regulated by ubiqui-
tination. As a reversible covalent modification, multiple
enzymes were involved in the processes of ubiquitina-
tion and deubiquitination, and the substrate specificity
of ubiquitination was largely controlled by E3s. Thus,
the identification of ubiquitination sites and upstream
E3s plays a crucial role in understanding the molecular
mechanisms of ubiquitination. Besides conventional LTP
experimental strategies, the development of a variety
of HTP assays enabled the large-scale identification of
ubiquitination sites and led to the emergence of vari-
ous databases, such as mUbiSIDa [67] and PLMD [44].
Based on these data resources, numerous computational
methods have been developed with different features
and algorithms and facilitated the rapid identification of
potential ubiquitination sites.
In this study, 10 types of widely used features were first
integrated to construct a model for general ubiquitina-
tion sites prediction. For the integration of conventional
machine learning and deep learning algorithms, a
hybrid-learning architecture based on PLR, DNN and
CNN was constructed. The final model was evaluated by
10-fold cross-validation and exhibited an
AUC value of 0.7649 on the independent testing data
set (Figure 4B). Compared with six existing tools, GPS-
Uber showed a highly competitive accuracy on the
prediction of general ubiquitination sites (Figure 4B).
However, the existing tools mainly focused on general
predictions, and the prediction of E3-specific ubiquitination
sites was not yet available. To fill this gap, a total of 1311
experimentally identified ssESRs were collected from 637
LTP studies, and E3s were carefully mapped to the human
proteome. In 2017, we developed a database called
iUUCD for ubiquitin and ubiquitin-like conjugations that
contained comprehensive annotations and hierarchical
classifications for 1153 known E3s from multi-species [1].
In this study, E3s were manually classified at four levels
using the information of iUUCD, and transfer learning
was then conducted with each E3 category for model
Figure 5. Cancer-associated ubiquitination events predicted by GPS-Uber. (A) The distribution of cancer proteins predicted to be ubiquitinated by six E3
groups with GPS-Uber. (B) The distribution of numbers of predicted ubiquitination sites from cancer proteins. (C) The GO-based enrichment analysis of
154 cancer proteins predicted to be ubiquitinated by all the six E3 groups. (D) A network of pathways predicted to be regulated by different E3 groups.
(E) Predicted ubiquitination sites and upstream E3 families of RAC1_HUMAN with GPS-Uber were visualized by DOG 2.0 [62].
construction. Finally, we implemented the online service
of GPS-Uber, which provided 111 E3-specific predictors
and one general predictor for ubiquitination sites. In GPS-
Uber, E3 clusters with ≥3 ubiquitination sites were kept,
and the reliability of these models might be relatively
lower. However, including these E3 clusters could provide
a more comprehensive prediction and facilitate further
experimental validations. For example, we previously
developed a tool named GPS 2.0 for the prediction of
kinase-specific phosphorylation sites [68]. A predictor,
CK1/VRK, was trained by only four sites. Based on
the prediction of GPS 2.0, Choi et al. successfully
validated a novel phosphorylation site, Ser6 in hnRNP
A1, to be specifically modified by VRK1, and such a
phosphorylation event plays a critical role in telomere
maintenance [69]. Since previous studies suggested that
ubiquitination takes part in the regulation of various
cancers, we asked whether GPS-Uber could be
used to reveal novel cancer-associated ubiquitination
signatures. Predictors of six E3 groups were adopted for
predicting potential ubiquitination sites in known cancer
proteins, and the GO-based enrichment results were
highly consistent with known studies. Transcription-
related pathways were shown to be regulated by all
types of E3s, whereas similar functions have already been
observed by other researchers [70,71].
In GPS-Uber, the basic hypothesis is that short peptides
around lysine residues provide the major specificity
of E3-specific ubiquitination. Subsequent evaluations
revealed a promising accuracy of GPS-Uber in predicting
E3-specific ubiquitination sites (Figures 3B, 4C, and
Supplementary Figure S2A), supporting the existence
of such a modification specificity at the site level.
However, ubiquitination sites are not the binding sites
of E3s, which specifically recognize primary degrons
in substrates for the interaction [55]. Previously, it was
reported that ubiquitination sites tend to appear very
close to primary degrons (often within 20 residues)
[55]. Using the sequence motifs of 27 known degrons
for 11 E3 clusters (Supplementary Table S3), potential
degron sequences were detected from their known E3-
specific substrates, and the distances to their proximal
ubiquitination sites modified by the same E3s were
counted. The results showed that most of
the E3-specific ubiquitination sites are not located close
to their corresponding primary degrons, as only
36.00%, 32.14% and 27.27% of anaphase-promoting
complex/cyclosome (APC/C)-, BRCA1- and VHL-specific
ubiquitination sites had a proximal primary degron
within 20 residues (Supplementary Figure S4A). In
addition, we extended USP(m,n) to USP(15, 15), USP(20,
20), USP(25, 25) and USP(30, 30) for four E3 clus-
ters, including RING/RING/MDM2, RING/RING/BRCA1,
RING/RING/TRAF6 and Cullin RING/SCF and compared
the performance to USP(10, 10). The results showed
that the AUC values were slightly increased
with longer flanking regions (Supplementary Figure S4B–E),
indicating that considering potential primary degrons
did not significantly improve the accuracy for the
prediction of E3-specific ubiquitination sites.
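The degron-distance analysis described above can be sketched with regular expressions. The code below is a minimal illustration, not the authors' procedure: the function names are hypothetical, the motif `R..L` is a simplified destruction-box-like pattern used only as an example (it is not taken from Supplementary Table S3), the distance is assumed to be measured to the nearest boundary of the motif match, and the sequence is invented.

```python
import re


def degron_spans(sequence, motif):
    """Return 1-based inclusive (start, end) spans of all motif matches.

    A lookahead is used so that overlapping matches are also reported.
    """
    pattern = re.compile("(?=(" + motif + "))")
    return [(m.start(1) + 1, m.end(1)) for m in pattern.finditer(sequence)]


def distance_to_nearest_degron(site, spans):
    """Residue distance from a 1-based site to the closest degron boundary."""
    if not spans:
        return None
    distances = []
    for start, end in spans:
        if start <= site <= end:
            distances.append(0)
        else:
            distances.append(min(abs(site - start), abs(site - end)))
    return min(distances)


def fraction_within(sites, spans, max_distance=20):
    """Fraction of sites with a degron no more than max_distance residues away."""
    if not sites:
        return 0.0
    hits = 0
    for site in sites:
        distance = distance_to_nearest_degron(site, spans)
        if distance is not None and distance <= max_distance:
            hits += 1
    return hits / len(sites)


# Invented sequence: motif at positions 5-8, lysines at positions 10 and 34.
toy_sequence = "AAAA" + "RQAL" + "A" + "K" + "A" * 23 + "K"
toy_spans = degron_spans(toy_sequence, "R..L")
toy_fraction = fraction_within([10, 34], toy_spans, max_distance=20)
```

Applied per E3 cluster, `fraction_within` would yield percentages of the kind reported for the within-20-residue comparison.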
Because only the features of flanking sequences
around ubiquitination sites were considered in GPS-Uber,
the ESIs should be predetermined by HTP experiments
or computational predictions, followed by
LTP experimental validations. Incorporation of the state-
of-the-art computational methods for predicting ESIs
into GPS-Uber will be our future plan, and such an
integration will be crucial to accurately predict both ESIs
and ssESRs and provide more useful clues to identify
functionally important ubiquitination events in vivo.
Also, we will extend the benchmark data sets used in
this study. More general and E3-specific ubiquitination
sites will be integrated to improve the performance
of GPS-Uber, and new features and algorithms should
be considered. Moreover, a similar strategy could be
adopted with deubiquitinating enzymes for the research
of the whole ubiquitination system. In addition, since the
crosstalk between ubiquitination and other PTMs like
phosphorylation has been confirmed to be important in
many cellular processes [72,73], an improved algorithm
incorporated with relationships among different PTM
types will be useful to further improve the prediction
accuracy. Nevertheless, GPS-Uber will be continuously
maintained and improved for academic research.
Key Points
• We reviewed existing tools for predicting general
ubiquitination sites, including the information of
various features, algorithms and training data sets.
• We developed a novel hybrid-learning framework
for predicting lysine ubiquitination sites, which
integrated 10 types of features and three types
of machine learning algorithms including penalized
logistic regression (PLR), deep neural network
(DNN) and convolutional neural network (CNN).
• We constructed 111 individual E3-specific predictors
through further transfer learning and developed
a new tool named GPS-Uber for predicting
both general and E3-specific lysine ubiquitination
sites, exhibiting a higher accuracy than other
existing tools.
Supplementary Data
Supplementary data are available online at
https://academic.oup.com/bib.
Acknowledgement
The authors thank Drs Han Cheng, Haodong Xu and Hang
Xu for their helpful comments.
Funding
Chinese Postdoctoral Science Foundation (2020M682395,
2018M642816 and 2019T120648), Natural Science Foundation
of China (31930021 and 31970633), Fundamental
Research Funds for the Central Universities (2019kfyRCPY043)
and Changjiang Scholars Program of China.
References
1. Zhou J, Xu Y, Lin S, et al. iUUCD 2.0: an update with rich annota-
tions for ubiquitin and ubiquitin-like conjugations. Nucleic Acids
Res 2018;46:D447–53.
2. Simoneschi D, Rona G, Zhou N, et al. CRL4(AMBRA1) is a master
regulator of D-type cyclins. Nature 2021;592:789–93.
3. Pohl C, Dikic I. Cellular quality control by the ubiquitin-
proteasome system and autophagy. Science 2019;366:818–22.
4. Ciehanover A, Hod Y, Hershko A. A heat-stable polypeptide
component of an ATP-dependent proteolytic system from retic-
ulocytes. Biochem Biophys Res Commun 1978;81:1100–5.
5. Scheffner M, Nuber U, Huibregtse JM. Protein ubiquitination
involving an E1-E2-E3 enzyme ubiquitin thioester cascade.
Nature 1995;373:81–3.
6. Zheng N, Shabek N. Ubiquitin ligases: structure, function, and
regulation. Annu Rev Biochem 2017;86:129–57.
7. Bernassola F, Chillemi G, Melino G. HECT-type E3 ubiquitin
ligases in cancer. Trends Biochem Sci 2019;44:1057–75.
8. Popovic D, Vucic D, Dikic I. Ubiquitination in disease pathogen-
esis and treatment. Nat Med 2014;20:1242–53.
9. Manasanch EE, Orlowski RZ. Proteasome inhibitors in cancer
therapy. Nat Rev Clin Oncol 2017;14:417–33.
10. Iconomou M, Saunders DN. Systematic approaches to identify
E3 ligase substrates. Biochem J 2016;473:4083–101.
11. O’Connor HF, Huibregtse JM. Enzyme-substrate relationships in
the ubiquitin system: approaches for identifying substrates of
ubiquitin ligases. Cell Mol Life Sci 2017;74:3363–75.
12. Rayner SL, Morsch M, Molloy MP, et al. Using proteomics to
identify ubiquitin ligase-substrate pairs: how novel methods
may unveil therapeutic targets for neurodegenerative diseases.
Cell Mol Life Sci 2019;76:2499–510.
13. Yen HC, Xu Q, Chou DM, et al. Global protein stability profiling
in mammalian cells. Science 2008;322:918–23.
14. Yen HC, Elledge SJ. Identification of SCF ubiquitin ligase sub-
strates by global protein stability profiling. Science 2008;322:
923–9.
15. Low TY, Peng M, Magliozzi R, et al. A systems-wide screen iden-
tifies substrates of the SCFbetaTrCP ubiquitin ligase. Sci Signal
2014;7:rs8.
16. Elia AE, Boardman AP, Wang DC, et al. Quantitative proteomic
atlas of ubiquitination and acetylation in the DNA damage
response. Mol Cell 2015;59:867–81.
17. Li Y, Xie P, Lu L, et al. An integrated bioinformatics platform for
investigating the human E3 ubiquitin ligase-substrate interac-
tion network. Nat Commun 2017;8:347.
18. Wang X, Li Y, He M, et al. UbiBrowser 2.0: a comprehen-
sive resource for proteome-wide known and predicted ubiq-
uitin ligase/deubiquitinase-substrate interactions in eukary-
otic species. Nucleic Acids Res 2021, gkab962. doi:
https://doi.org/10.1093/nar/gkab962.
19. Chen D, Liu X, Xia T, et al. A multidimensional characterization
of E3 ubiquitin ligase and substrate interaction network. iScience
2019;16:177–91.
20. Tung CW, Ho SY. Computational identification of ubiquitylation
sites from protein sequences. BMC Bioinformatics 2008;9:310.
21. Radivojac P, Vacic V, Haynes C, et al. Identification, analysis,
and prediction of protein ubiquitination sites. Proteins 2010;78:
365–80.
22. Lee TY, Chen SA, Hung HY, et al. Incorporating distant sequence
features and radial basis function networks to identify ubiquitin
conjugation sites. PLoS One 2011;6:e17331.
23. Chen Z, Chen YZ, Wang XF, et al. Prediction of ubiquitination
sites by using the composition of k-spaced amino acid pairs. PLoS
One 2011;6:e22930.
24. Feng KY, Huang T, Feng KR, et al. Using WPNNA classifier in
ubiquitination site prediction based on hybrid features. Protein
Pept Lett 2013;20:318–23.
25. Chen X, Qiu JD, Shi SP, et al. Incorporating key position
and amino acid residue features to identify general and
species-specific ubiquitin conjugation sites. Bioinformatics
2013;29:1614–22.
26. Chen Z, Zhou Y, Song J, et al. hCKSAAP_UbSite: improved
prediction of human ubiquitination sites by exploiting amino
acid pattern and properties. Biochim Biophys Acta 2013;1834:
1461–7.
27. Walsh I, Di Domenico T, Tosatto SC. RUBI: rapid proteomic-
scale prediction of lysine ubiquitination and factors influencing
predictor performance. Amino Acids 2014;46:853–62.
28. Qiu WR, Xiao X, Lin WZ, et al. iUbiq-Lys: prediction of lysine
ubiquitination sites in proteins by extracting sequence evolution
information via a gray system model. J Biomol Struct Dyn 2015;33:
1731–42.
29. Huang CH, Su MG, Kao HJ, et al. UbiSite: incorporating two-
layered machine learning method with substrate motifs to
predict ubiquitin-conjugation site on lysines. BMC Syst Biol
2016;10(Suppl 1):6.
30. Wang JR, Huang WL, Tsai MJ, et al. ESA-UbiSite: accurate pre-
diction of human ubiquitination sites by identifying a set of
effective negatives. Bioinformatics 2017;33:661–8.
31. Liu Y, Wang MH, Xi JN, et al. PTM-ssMP: a web server for predict-
ing different types of post-translational modification sites using
novel site-specific modification profile. Int J Biol Sci 2018;14:
946–56.
32. Li GXH, Vogel C, Choi H. PTMscape: an open source tool to predict
generic post-translational modifications and map modification
crosstalk in protein domains and biological processes. Mol Omics
2018;14:197–209.
33. Pejaver V, Hsu WL, Xin FX, et al. The structural and functional
signatures of proteins that undergo multiple events of post-
translational modification. Protein Sci 2014;23:1077–93.
34. He F, Wang R, Li JG, et al. Large-scale prediction of protein
ubiquitination sites using a multimodal deep architecture. BMC
Syst Biol 2018;12.
35. Fu H, Yang Y, Wang X, et al. DeepUbi: a deep learning framework
for prediction of ubiquitination sites in proteins. BMC Bioinfor-
matics 2019;20:86.
36. Chen Z, Liu X, Li F, et al. Large-scale comparative assessment
of computational predictors for lysine post-translational modi-
fication sites. Brief Bioinform 2019;20:2267–90.
37. Wang H, Wang Z, Li Z, et al. Incorporating deep learning with
word embedding to identify plant ubiquitylation sites. Front Cell
Dev Biol 2020;8:572195.
38. Wang D, Liu D, Yuchi J, et al. MusiteDeep: a deep-learning based
webserver for protein post-translational modification site pre-
diction and visualization. Nucleic Acids Res 2020;48:W140–6.
39. Liu Y, Jin S, Song L, et al. Prediction of protein ubiquitination sites
via multi-view features based on eXtreme gradient boosting
classifier. J Mol Graph Model 2021;107:107962.
40. Siraj A, Lim DY, Tayara H, et al. UbiComb: a hybrid deep learning
model for predicting plant-specific protein ubiquitylation sites.
Genes 2021;12:717.
41. Wang XF, Yan RX, Chen YZ, et al. Computational identification
of ubiquitination sites in Arabidopsis thaliana using convolutional
neural networks. Plant Mol Biol 2021;105:601–10.
42. Liu Y, Li A, Zhao XM, et al. DeepTL-Ubi: a novel deep transfer
learning method for effectively predicting ubiquitination sites
of multiple species. Methods 2021;192:103–11.
43. Yang Y, Wang H, Li W, et al. Prediction and analysis of multiple
protein lysine modified sites based on conditional Wasserstein
generative adversarial networks. BMC Bioinformatics 2021;22:171.
44. Xu H, Zhou J, Lin S, et al. PLMD: an updated data resource of
protein lysine modifications. J Genet Genomics 2017;44:243–50.
45. Fu L, Niu B, Zhu Z, et al. CD-HIT: accelerated for clustering the
next-generation sequencing data. Bioinformatics 2012;28:3150–2.
46. Wang C, Xu H, Lin S, et al. GPS 5.0: an update on the prediction
of kinase-specific phosphorylation sites in proteins. Genomics
Proteomics Bioinformatics 2020;18:72–80.
47. Ning W, Xu H, Jiang P, et al. HybridSucc: a hybrid-learning
architecture for general and species-specific succinylation site
prediction. Genomics Proteomics Bioinformatics 2020;18:194–207.
48. Ning W, Jiang P, Guo Y, et al. GPS-Palm: a deep learning-
based graphic presentation system for the prediction of
S-palmitoylation sites in proteins. Brief Bioinform 2021;22:
1836–47.
49. Zhao X, Li X, Ma Z, et al. Prediction of lysine ubiquitylation with
ensemble classifier and feature selection. Int J Mol Sci 2011;12:
8347–61.
50. Cai Y, Huang T, Hu L, et al. Prediction of lysine ubiquitination
with mRMR feature selection and analysis. Amino Acids 2012;42:
1387–95.
51. Nguyen VN, Huang KY, Huang CH, et al. Characterization and
identification of ubiquitin conjugation sites with E3 ligase recog-
nition specificities. BMC Bioinformatics 2015;16(Suppl 1):S1.
52. Nguyen VN, Huang KY, Huang CH, et al. A new scheme to
characterize and identify protein ubiquitination sites. IEEE/ACM
Trans Comput Biol Bioinform 2017;14:393–403.
53. Huntley RP, Sawford T, Mutowo-Meullenet P, et al. The GOA
database: gene ontology annotation updates for 2015. Nucleic
Acids Res 2015;43:D1057–63.
54. Forbes SA, Beare D, Gunasekaran P, et al. COSMIC: exploring
the world’s knowledge of somatic mutations in human cancer.
Nucleic Acids Res 2015;43:D805–11.
55. Guharoy M, Bhowmick P, Sallam M, et al. Tripartite degrons
confer diversity and specificity on regulated protein degrada-
tion in the ubiquitin-proteasome system. Nat Commun 2016;7:
10239.
56. Kumar M, Michael S, Alvarado-Valverde J, et al. The eukary-
otic linear motif resource: 2022 release. Nucleic Acids Res 2021,
gkab975. doi: https://doi.org/10.1093/nar/gkab975.
57. Liu J, Lichtenberg T, Hoadley KA, et al. An integrated TCGA
pan-cancer clinical data resource to drive high-quality survival
outcome analytics. Cell 2018;173:400–416.e11.
58. Ruan C, Wang C, Gong X, et al. An integrative multi-omics
approach uncovers the regulatory role of CDK7 and CDK4 in
autophagy activation induced by silica nanoparticles. Autophagy
2021;17:1426–47.
59. Tatusov RL, Koonin EV, Lipman DJ. A genomic perspective on
protein families. Science 1997;278:631–7.
60. Deng W, Wang Y, Liu Z, et al. HemI: a toolkit for illustrating
heatmaps. PLoS One 2014;9:e111988.
61. Shannon P, Markiel A, Ozier O, et al. Cytoscape: a software
environment for integrated models of biomolecular interaction
networks. Genome Res 2003;13:2498–504.
62. Ren J, Wen L, Gao X, et al. DOG 1.0: illustrator of protein domain
structures. Cell Res 2009;19:271–3.
63. Li T, Qin JJ, Yang X, et al. The ubiquitin E3 ligase TRAF6 exac-
erbates ischemic stroke by ubiquitinating and activating Rac1.
J Neurosci 2017;37:12123–40.
64. Oberoi-Khanuja TK, Rajalingam K. IAPs as E3 ligases of Rac1:
shaping the move. Small GTPases 2012;3:131–6.
65. Zhao J, Mialki RK, Wei J, et al. SCF E3 ligase F-box protein complex
SCF(FBXL19) regulates cell migration by mediating Rac1 ubiqui-
tination and degradation. FASEB J 2013;27:2611–9.
66. Swatek KN, Komander D. Ubiquitin modifications. Cell Res
2016;26:399–422.
67. Chen T, Zhou T, He B, et al. mUbiSiDa: a comprehensive
database for protein ubiquitination sites in mammals. PLoS One
2014;9:e85744.
68. Xue Y, Ren J, Gao X, et al. GPS 2.0, a tool to predict kinase-specific
phosphorylation sites in hierarchy. Mol Cell Proteomics 2008;7:
1598–608.
69. Choi YH, Lim JK, Jeong MW, et al. HnRNP A1 phosphorylated by
VRK1 stimulates telomerase and its binding to telomeric DNA
sequence. Nucleic Acids Res 2012;40:8499–518.
70. Zhang F, Yu X. WAC, a functional partner of RNF20/40, regu-
lates histone H2B ubiquitination and gene transcription. Mol Cell
2011;41:384–97.
71. Han TY, Guo M, Gan MX, et al. TRIM59 regulates autophagy
through modulating both the transcription and the ubiquitina-
tion of BECN1. Autophagy 2018;14:2035–48.
72. Worden EJ, Hoffmann NA, Hicks CW, et al. Mechanism of cross-
talk between H2B ubiquitination and H3 methylation by Dot1L.
Cell 2019;176:1490–501.
73. Vu LD, Gevaert K, De Smet I. Protein language: post-translational
modifications talking to each other. Trends Plant Sci 2018;23:
1068–80.
Article
Ubiquitination is one of the most important post-translational modifications which involves in many biological processes. Because mass spectrometry-based ubiquitination site identification methods are costly and time consuming, computational approaches provide alternative ways to the determination of ubiquitination sites. Although machine learning based methods can effectively predict ubiquitination sites, most of them rely on feature engineering, which may lead to bias or incomplete feature. Recently, deep learning has achieved great success in prediction of post-translational modification sites. However, deep learning method has not been explored in the prediction of species-specific ubiquitination sites. In this paper, we propose a novel transfer deep learning method, named DeepTL-Ubi, for predicting ubiquitination sites of multiple species. DeepTL-Ubi enhances the performance of species-specific ubiquitination site prediction by transferring common knowledge from the large amount of human data to other species, which effectively solves the problem of insufficient training data for other species. Besides, we train and test our model by collecting ubiquitination sites for multiple species from several sources. Experiment results show that our transfer learning technique can effectively improve the predictive performance of species with small sample size, and DeepTL-Ubi is superior to existing tools in many species. The source code and training data of DeepTL-Ubi are publicly deposited at https://github.com/USTC-HIlab/DeepTL-Ubi