Article
To read the full-text of this research, you can request a copy directly from the authors.

Abstract

Methylation, which is one of the most prominent post-translational modifications on proteins, regulates many important cellular functions. Though several model-based methylation site predictors have been reported, all existing methods employ machine learning strategies, such as support vector machines and random forests, to predict sites of methylation based on a set of “hand-selected” features. As a consequence, the resulting models may be biased toward one set of features. Moreover, due to the large number of features, model development can often be computationally expensive. In this paper, we propose an alternative approach based on deep learning to predict arginine methylation sites. Our model, which we termed DeepRMethylSite, is computationally less expensive than traditional feature-based methods while eliminating potential biases that can arise through feature selection. Based on independent testing on our dataset, DeepRMethylSite achieved a sensitivity (SN) of 68%, a specificity (SP) of 82% and a Matthews correlation coefficient (MCC) of 0.51. Importantly, in side-by-side comparisons with other state-of-the-art methylation site predictors, our method performs on par or better in all scoring metrics tested.
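The three metrics reported in the abstract can be computed from the counts of a confusion matrix. The following is a minimal sketch of that arithmetic; the function name `classification_scores` and the counts in the usage lines are hypothetical and are not taken from the paper.

```python
import math


def classification_scores(tp, fp, tn, fn):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts."""
    # Sensitivity: fraction of true methylation sites that were predicted positive.
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of non-sites that were predicted negative.
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    # Matthews correlation coefficient; defined as 0 when any marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, mcc


# Hypothetical counts, chosen only to illustrate the calculation.
sn, sp, mcc = classification_scores(tp=68, fp=18, tn=82, fn=32)
print(f"SN={sn:.2f} SP={sp:.2f} MCC={mcc:.2f}")
```

Unlike SN and SP, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside them when positive and negative sites are unevenly represented.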

... The arginine methylation of protein includes the addition of one or two methyl groups to the arginine residue, thereby imparting a key regulatory mechanism for gene expression, transcriptional regulation, signal transduction, RNA processing, DNA repair, ... new computational methods using the most recent available large-scale experimental datasets. Recently, Chaudhari et al. proposed DeepRMethylSite [33], a deep learning-based method using one-hot and embedding integer encodings for identifying arginine methylation in proteins. The prediction performance of DeepRMethylSite [33] helps in exploring a robust and accurate computational method for predicting protein arginine methylation sites. ...
... was utilized. For fair comparison and generalization of the model, the same dataset as that of DeepRMethylSite [33] was used. ...
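The one-hot and integer encodings mentioned in these excerpts can be illustrated with a short sketch. It assumes a fixed-length window centred on the candidate arginine and a `-` padding symbol for positions that fall outside the sequence; the window size, the alphabet order, and all function names are illustrative assumptions, not the published DeepRMethylSite configuration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumed ordering)
PAD = "-"                             # assumed padding symbol for sequence ends
ALPHABET = AMINO_ACIDS + PAD
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def extract_window(sequence, site, flank):
    """Return 2*flank+1 residues centred on the 0-based index `site`, padded at the ends."""
    if sequence[site] != "R":
        raise ValueError("candidate site must be an arginine (R)")
    left = sequence[max(0, site - flank):site]
    right = sequence[site + 1:site + 1 + flank]
    return PAD * (flank - len(left)) + left + "R" + right + PAD * (flank - len(right))


def integer_encode(window):
    """Map each residue to an integer index, the usual input to an embedding layer."""
    return [INDEX[aa] for aa in window]


def one_hot_encode(window):
    """Map each residue to a binary vector with a single 1 at its alphabet index."""
    return [[1 if i == INDEX[aa] else 0 for i in range(len(ALPHABET))] for aa in window]


window = extract_window("MKRAG", site=2, flank=3)
print(window)                       # padded window around the arginine
print(integer_encode(window))       # one integer per position
print(len(one_hot_encode(window)))  # one 21-dimensional vector per position
```

Non-standard residue codes such as `X` are not handled here and would raise a `KeyError`; a real pipeline would need an explicit policy for them.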
Article
Protein methylation is one of the most prominent post-translational modifications and regulates several essential biological processes in eukaryotes. Therefore, identification of arginine methylation sites is crucial in deciphering their characteristics and functions in cell biology, disease mechanisms, and guided drug development. Computational methods address the long-standing bottleneck of cost, time, and labor that experimental methods impose on large-scale identification of protein arginine methylation sites. In this study, we proposed a robust machine learning-based computational tool known as iIRMethyl, employing the primary sequence and physicochemical properties of proteins along with a two-step feature selection method for optimal selection of feature descriptors. Moreover, the performance of iIRMethyl was comprehensively evaluated via k-fold cross-validation on a benchmark dataset and on an independent test dataset. iIRMethyl outperformed the state-of-the-art method and achieved an average area under the curve value of 0.99 for both k-fold cross-validation and the independent test set in the identification of protein arginine methylation sites. Furthermore, the outcomes reveal that iIRMethyl is a robust and accurate computational tool for large-scale identification of arginine methylation sites that would facilitate the understanding of their functional mechanisms and accelerate their application in drug development and clinical therapy. Additionally, the prediction mechanism of the proposed model iIRMethyl is interpreted using the SHapley Additive exPlanation algorithm.
... For MOAC, the interaction of phosphate-containing peptides and the metal oxide occurs by reversible Lewis acid-base chemistry [75]. TiO2 was the first metal oxide utilized for phosphopeptide enrichment and is extensively used [76,77], but ZrO2, Al2O3, Fe3O4, and other metal oxides have been developed for this purpose as well [78][79][80]. TiO2 predominantly enriches singly phosphorylated peptides, while IMAC and PolyMAC tend to enrich more multiply phosphorylated peptides [81,82]. ...
... The following autocorrelation variants were proposed as protein descriptors. [Fig. 1: Illustration depicting the lag between two amino acid residues on a sequence] ...
... In addition, since feature extraction requires programming and mathematical skills, various stand-alone tools and web servers have been developed to facilitate feature extraction for researchers building ML-based approaches. These tools include PROFEAT [66], Pseudo Amino Acid Composition (PseAAC) [75], PseAAC-Builder [76], propy [65], PseAAC-General [77], protr/ProtrWeb [78], Rcpi [12], PseKRAAC [79], iFeature [80], and Seq2Feature [81] (Tables 10 and 11). However, in addition to extracting these features, the ability to integrate (combine) multiple features so that they can be readily input to the machine learning apparatus, and to extract features from a large number of protein sequences, becomes crucial. ...
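The autocorrelation descriptors and the notion of lag mentioned in the excerpts above can be made concrete with a small sketch. It implements one common variant, the normalized Moreau-Broto autocorrelation, which averages the product of a standardized residue property over all residue pairs separated by `lag` positions. The choice of Kyte-Doolittle hydropathy as the property and the function name `moreau_broto` are assumptions made for illustration; the cited chapter may define its variants differently.

```python
import statistics

# Kyte-Doolittle hydropathy values, used here only as an example residue property.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Standardize the property over the 20 residues (zero mean, unit variance).
_values = list(KYTE_DOOLITTLE.values())
_mean = statistics.fmean(_values)
_std = statistics.pstdev(_values)
Z = {aa: (v - _mean) / _std for aa, v in KYTE_DOOLITTLE.items()}


def moreau_broto(sequence, lag):
    """Normalized Moreau-Broto autocorrelation of the property at the given lag."""
    n = len(sequence)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(sequence)")
    values = [Z[aa] for aa in sequence]
    # Average the property product over all residue pairs that are `lag` apart.
    return sum(values[i] * values[i + lag] for i in range(n - lag)) / (n - lag)


peptide = "GRGRGGFGGRGGFRG"  # hypothetical glycine/arginine-rich peptide
print([round(moreau_broto(peptide, d), 3) for d in range(1, 5)])
```

Computing the descriptor for several lags and several properties yields a fixed-length feature vector regardless of sequence length, which is what makes it convenient as classifier input.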
Chapter
Posttranslational modification (PTM) is an important biological mechanism that promotes functional diversity among proteins. So far, a wide range of PTMs has been identified. Among them, glycation is considered one of the most important. Glycation is associated with different neurological disorders, including Parkinson's and Alzheimer's diseases. It has also been shown to be responsible for other diseases, including vascular complications of diabetes mellitus. Despite the efforts made so far, the prediction performance of computational methods for glycation sites remains limited. Here we present a newly developed machine learning tool called iProtGly-SS that utilizes sequential and structural information as well as a Support Vector Machine (SVM) classifier to enhance lysine glycation site prediction accuracy. The performance of iProtGly-SS was investigated using the three most popular benchmarks used for this task. Our results demonstrate that iProtGly-SS is able to achieve 81.61%, 93.62%, and 92.95% prediction accuracies on these benchmarks, which are significantly better than the results reported in previous studies. iProtGly-SS is implemented as a web-based tool which is publicly available at http://brl.uiu.ac.bd/iprotgly-ss/ .
... Computational approaches for predicting protein methylation sites can be an inexpensive, highly accurate, and fast alternative for massive data sets. The commonly used computational approaches are support vector machine (SVM) (Chen et al., 2006; Shao et al., 2009; Shien et al., 2009; Shi et al., 2012; Lee et al., 2014; Qiu et al., 2014; Wen et al., 2016), group-based prediction system (GPS) (Deng et al., 2017), Random Forest (Wei et al., 2017), and neural network (NN) (Hasan & Khatun, 2018; Chaudhari et al., 2020). ...
... The application of machine learning to predict possible methylation sites on protein sequences has been examined in numerous previous studies. The latest and most relevant to our study were conducted by Chen et al. (2018) and Chaudhari et al. (2020). Chen et al. (2018) developed MUscADEL (Multiple Scalable Accurate Deep Learner for lysine PTMs), a methylation site prediction model that was trained and tested on human and mouse protein data sets. ...
... Meanwhile, Chen et al. (2018) hypothesized that the order of amino acids in the protein sequence has a significant influence on the location where the methylation process can occur. The other model is DeepRMethylSite which was developed by Chaudhari et al. (2020). The model was implemented with the combination of convolutional neural network (CNN) and LSTM. ...
Article
Full-text available
Background: Conventional in vivo methods for post-translational modification site prediction such as spectrophotometry, Western blotting, and chromatin immune precipitation can be very expensive and time-consuming. Neural networks (NN) are one of the computational approaches that can effectively predict post-translational modification sites. We developed a neural network model, namely the Sequential and Spatial Methylation Fusion Network (SSMFN), to predict possible methylation sites on protein sequences. Method: We designed our model to be able to extract spatial and sequential information from amino acid sequences. A convolutional neural network (CNN) is applied to harness spatial information, while long short-term memory (LSTM) is applied for sequential data. The latent representations of the CNN and LSTM branches are then fused. Afterwards, we compared the performance of our proposed model to the state-of-the-art methylation site prediction models on the balanced and imbalanced datasets. Results: Our model appeared to be better in almost all measurements when trained on the balanced training dataset. On the imbalanced training dataset, all of the models gave better performance since they were trained on more data. In several metrics, our model also surpasses the PRMePred model, which requires a laborious effort for feature extraction and selection. Conclusion: Our models achieved the best performance across different environments in almost all measurements. Also, our results suggest that an NN model trained on a balanced training dataset and tested on an imbalanced dataset will offer high specificity and low sensitivity. Thus, an NN model for methylation site prediction should be trained on an imbalanced dataset, since in actual applications there are far more negative samples than positive samples.
... These algorithms can learn complex patterns from large datasets and extract sequence features, enabling accurate and efficient predictions. For example, some researchers used a deep learning model [9] to identify protein methylation sites from amino acid sequences and achieved an accuracy of 87.04%. Among the research endeavors, one notable example is the iMethyl-PseAAC [10], an innovative Support Vector Machine (SVM) model designed specifically for predicting protein methylation sites. ...
... For a fair comparison of the RMSxAI and state-of-the-art predictors, the identical data set for arginine methylation was used. The comparison results of the RMSxAI with existing predictors are shown in Table 2. RMSxAI was compared with MeMo [11], BPB-PPMS [44], PMeS [45], MASA [46], iMethyl-PseAAC [10], PSSMe [47], MePred-RF [48], PRmePRed [26], DeepRMethylSite [9], and SSMFN [49] on the same data set delivered by Kumar et al. [26]. The performance values of PMeS, BPB-PPMS, MASA, MeMo, PSSMe, iMethyl-PseAAC, MePred-RF, PRmePRed, DeepRMethylSite, and SSMFN were taken from Lumbanraja et al. [49]. ...
Article
Full-text available
Protein methylation is a vital regulator of many biological processes at the post-translational level, and accurate prediction of protein methylation sites is essential for research and drug discovery. In this paper, we present a new method, namely RMSxAI, to predict the arginine methylation sites from primary sequences using machine learning algorithms and describe the predictions using explainable artificial intelligence (XAI) techniques. Leveraging experimentally validated methylated and unmethylated protein sequences from diverse organisms, we deduced several sequence features, encompassing physicochemical properties, amino acid composition, and evolutionary insights. Our results show that the proposed RMSxAI can predict protein methylation sites with high accuracy, bringing the F1 score up to 0.88 and overall accuracy up to 88.4%. We use various XAI methods to explain the output results. These include key features, partial occupancy maps, and local variation models that provide insight into key features and interactions that lead to predictions. Overall, our approach is relevant to research and drug discovery, and our results demonstrate the potential of machine learning algorithms and XAI methods to provide accurate and meaningful prediction of arginine methylation sites.
... These algorithms can learn complex patterns from large datasets and extract sequence features, enabling accurate and efficient predictions. For example, some researchers used a deep learning model Chaudhari, Thapa, Roy, Newman, Saigo and Dukka (2020) to identify protein methylation sites from amino acid sequences and achieved an accuracy of 87.04%. Among the research endeavors, one notable example is the iMethyl-PseAAC Qiu, Xiao, Lin and Chou (2014), an innovative Support Vector Machine (SVM) model designed specifically for predicting protein methylation sites. ...
... For a fair comparison of the RMSxAI and state-of-the-art predictors, the identical data set for arginine methylation was used. The comparison results of the RMSxAI with existing predictors are shown in Table 2 (2012), MASA Shien, Lee, Chang, Hsu, Horng, Hsu, Wang and Huang (2009), iMethyl-PseAAC Qiu et al. (2014), PSSMe Wen, Shi, Xu, Wang and Qiu (2016), MePred-RF Wei, Xing, Shi, Ji and Zou (2017), PRmePRed Kumar et al. (2017), DeepRMethylSite Chaudhari et al. (2020), and SSMFN Lumbanraja, Mahesworo, Cenggoro, Sudigyo and Pardamean (2021) on the same data set delivered by Kumar et al. (2017). ...
Preprint
Full-text available
Protein methylation is a vital regulator of many biological processes at the post-translational level, and accurate prediction of protein methylation sites is essential for research and drug discovery. In this paper, we present a new method, namely RMSxAI, to predict the arginine methylation sites from primary sequences using machine learning algorithms and describe the predictions using explainable artificial intelligence (XAI) techniques. Leveraging experimentally validated methylated and unmethylated protein sequences from diverse organisms, we deduced several sequence features, encompassing physicochemical properties, amino acid composition, and evolutionary insights. Our results show that the proposed RMSxAI can predict protein methylation sites with high accuracy, bringing the F1 score up to 0.88 and overall accuracy up to 88.4%. We use various XAI methods to explain the forecast. These include key features, partial occupancy maps, and local variation models that provide insight into key features and interactions that lead to forecasts. Overall, our approach is relevant to research and drug discovery, and our results demonstrate the potential of machine learning algorithms and XAI methods to provide accurate and meaningful prediction of arginine methylation sites.
... CTD-RF, an arginine methylation prediction method developed by Hou et al. [37], integrates RF with distribution, composition, and transition features. Some researchers have also used convolutional neural network (CNN) and long short-term memory (LSTM) deep learning algorithms for the prediction of arginine methylation sites [38,39]. ...
... The proposed model was retrained using their training and validation data set and then tested using the independent test set to assess the proposed model. The PRMxAI was compared with BPB-PPMS [33], PMeS [42], iMethyl-PseAAC [32], MASA [31], MeMo [30], PSSMe [43], MePred-RF [35], DeepRMethylSite [38], and SSMFN [39] (see Table 9). The performances of PMeS, BPB-PPMS, MASA, MeMo, PSSMe, iMethyl-PseAAC, MePred-RF, DeepRMethylSite, and SSMFN were reported by Lumbanraja et al. [39] on the same data set. ...
Article
Full-text available
Background: Protein methylation, a post-translational modification, is crucial in regulating various cellular functions. Knowledge of arginine methylation is required to understand crucial biochemical activities and biological functions, such as gene regulation and signal transduction. Some experimental methods, including ChIP-chip, mass spectrometry, and methylation-specific antibodies, exist for the identification of methylated proteins. However, these experimental methods are expensive and tedious. As a result, computational methods based on machine learning play an efficient role in predicting arginine methylation sites. Results: In this research, a novel method called PRMxAI has been proposed to predict arginine methylation sites. The proposed PRMxAI extracts sequence-based features, such as dipeptide composition, physicochemical properties, amino acid composition, and information theory-based features (Arimoto, Havrda-Charvat, Renyi, and Shannon entropy), to represent the protein sequences in numerical format. Various machine learning algorithms, such as Decision trees, Naive Bayes, Random Forest, Support vector machines, and K-nearest neighbors, are implemented to select the best classifier. The random forest algorithm is selected as the underlying classifier for the PRMxAI model. The performance of PRMxAI is evaluated by employing 10-fold cross-validation, and it yields 87.17% and 90.40% accuracy on mono-methylarginine and di-methylarginine data sets, respectively. This research also examines the impact of various features on both data sets using explainable artificial intelligence. Conclusions: The proposed PRMxAI shows the effectiveness of the features for predicting arginine methylation sites. Additionally, the SHapley Additive exPlanation method is used to interpret the predictive mechanism of the proposed model. The results indicate that the proposed PRMxAI model outperforms other state-of-the-art predictors.
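The 10-fold cross-validation used in this abstract partitions the data into ten folds and tests on each fold in turn while training on the other nine. Below is a minimal sketch of such a split. Stratifying by class label, so that each fold keeps roughly the same ratio of methylated to unmethylated sites, is an assumption made here and is not stated in the abstract; the function name `stratified_k_fold` and the toy labels are likewise hypothetical.

```python
import random


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs with each class spread evenly over k folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    # Shuffle within each class, then deal the indices round-robin into the folds.
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    # Each fold serves as the test set exactly once.
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test


# Hypothetical toy labels: 30 methylated (1) and 70 unmethylated (0) sites.
labels = [1] * 30 + [0] * 70
for train, test in stratified_k_fold(labels, k=10):
    positives = sum(labels[i] for i in test)
    print(len(train), len(test), positives)
```

The reported accuracy is then typically the mean over the ten test folds, with a classifier trained from scratch for each split.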
... However, CNN may fail to capture the information on remote dependencies between residues in protein sequences. Another deep learning classifier, DeepRMethylSite [21], combined a CNN and a bidirectional long short-term memory network (BiLSTM) for identifying arginine methylation sites. Similar to the DeepRMethylSite model, Sohoko-kcr was proposed as a deep learning and bidirectional gated recurrent unit-based model to identify crotonylation sites [22]. ...
... To fairly compare the models, we re-implemented BERT-Kcr [26], DeepRMethylSite [21], MusiteDeep [45], AlexNet [52] and LSTM [53] on the high-level arginine methylation data set constructed by us. The comparison results of methylation prediction tools based on the arginine methylation independent test set are listed in Table 3. ...
Article
Protein arginine methylation is an important posttranslational modification (PTM) associated with protein functional diversity and pathological conditions including cancer. Identification of methylation binding sites facilitates a better understanding of the molecular function of proteins. Recent developments in the field of deep neural networks have led to a proliferation of deep learning-based methylation identification studies because of their fast and accurate prediction. In this paper, we propose DeepGpgs, an advanced deep learning model incorporating Gaussian prior and gated attention mechanism. We introduce a residual network channel to extract the evolutionary information of proteins. Then we combine the adaptive embedding with bidirectional long short-term memory networks to form a context-shared encoder layer. A gated multi-head attention mechanism is followed to obtain the global information about the sequence. A Gaussian prior is injected into the sequence to assist in predicting PTMs. We also propose a weighted joint loss function to alleviate the false negative problem. We empirically show that DeepGpgs improves Matthews correlation coefficient by 6.3% on the arginine methylation independent test set compared with the existing state-of-the-art methylation site prediction methods. Furthermore, DeepGpgs has good robustness in phosphorylation site prediction of SARS-CoV-2, which indicates that DeepGpgs has good transferability and the potential to be extended to other modification sites prediction. The open-source code and data of the DeepGpgs can be obtained from https://github.com/saizhou1/DeepGpgs.
... In the deep-learning-based models, CapsNet contained a multi-layer CNN for predicting PRme sites, which outperformed other well-known tools in most cases (Wang D. et al., 2019). The deep-learning model DeepRMethylSite was constructed with the integration of One-Hot and embedding integer encodings (Chaudhari et al., 2020). The development of these models has contributed significantly to the discovery of PRme sites. ...
... To examine the predictive quality of the proposed CNNArginineMe, we compare it with reported PRme site predictors. DeepRMethylSite is the latest deep-learning predictor with the best performance compared to other reported ones (Chaudhari et al., 2020). To fairly compare CNNArginineMe and DeepRMethylSite, we used the dataset to construct DeepRMethylSite to rebuild CNNArginineMe and employed its independent test set for evaluation. ...
Article
Full-text available
Protein arginine methylation (PRme), as one post-translational modification, plays a critical role in numerous cellular processes and regulates critical cellular functions. Though several in silico models for predicting PRme sites have been reported, new models may be required to develop due to the significant increase of identified PRme sites. In this study, we constructed multiple machine-learning and deep-learning models. The deep-learning model CNN combined with the One-Hot coding showed the best performance, dubbed CNNArginineMe. CNNArginineMe performed best in AUC scoring metrics in comparisons with several reported predictors. Additionally, we employed CNNArginineMe to predict arginine methylation proteome and performed functional analysis. The arginine methylated proteome is significantly enriched in the amyotrophic lateral sclerosis (ALS) pathway. CNNArginineMe is freely available at https://github.com/guoyangzou/CNNArginineMe.
... To fill this gap and to support wet lab experiments, researchers are developing accurate and effective computational tools to predict post-translational modification sites. A large number of computational methods have already been developed for identifying different types of post-translational modifications in different amino acid residues, particularly for Acetylation (Chen et al., 2018b; Yang et al., 2018), Citrullination, Malonylation (Bao et al., 2019; Thapa et al., 2020b), Methylation (Chaudhari et al., 2020), Phosphorylation (Luo et al., 2019; Fenoy et al., 2019; Long et al., 2020), Sumoylation (Yang et al., 2018; He et al., 2019; Chen et al., 2018a), Succinylation (Hasan et al., 2019; Wang et al., 2019a; Kao et al., 2020; Thapa et al., 2020a; Zhu et al., 2020; Islam et al., 2017), Sulfenylation (Wang et al., 2019b; Butt and Khan, 2019), Pupylation (Singh et al., 2021), and Ubiquitination (He et al., 2019; Yadav et al., 2018). ...
... To objectively evaluate a novel computational method for post-translational modification prediction, different statistical analyses, particularly k-fold cross-validation, the jackknife test and the independent test, are commonly used in state-of-the-art post-translational modification predictors (Ning et al., 2019; Chaudhari et al., 2020; Long et al., 2020; Kao et al., 2020; Thapa et al., 2020a; Jia et al., 2019; Ju and Wang, 2020; Hasan et al., 2017c, 2020). For estimating performance, most researchers prefer to use sensitivity, specificity and accuracy and, in some cases, the Matthews correlation coefficient (MCC) and the area under the ROC curve (AUC). ...
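Of the metrics listed in this excerpt, the area under the ROC curve is the only one that does not depend on a decision threshold. A minimal sketch follows, using the rank interpretation of AUC: the probability that a randomly chosen positive site receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the example scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Compare every positive score against every negative score.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predictor scores for four candidate sites.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

This pairwise form is quadratic in the number of samples, which is acceptable for a sketch; sorting-based implementations are preferable for large independent test sets.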
Article
Formylation is one of the newly discovered post-translational modifications of lysine residues and is implicated in different kinds of diseases. In this work, a novel predictor, named predForm-Site, has been developed to predict formylation sites with higher accuracy. We have integrated multiple sequence features to develop a more informative representation of formylation sites. Moreover, the decision function of the underlying classifier has been optimized on the skewed formylation dataset during prediction model training to improve prediction quality. On the dataset used by the LFPred and Formator predictors, predForm-Site achieved 99.5% sensitivity, 99.8% specificity and 99.8% overall accuracy with an AUC of 0.999 in the jackknife test. In the independent test, it also achieved more than 97% sensitivity and 99% specificity. Similarly, in benchmarking against the recent method CKSAAP_FormSite, the proposed predictor performed significantly better in all the measures, particularly sensitivity by around 20%, specificity by nearly 30% and overall accuracy by more than 22%. These experimental results show that the proposed predForm-Site can be used as a complementary tool for the fast exploration of formylation sites. For the convenience of the scientific community, predForm-Site has been deployed as an online tool, accessible at http://103.99.176.239:8080/predForm-Site.
... Recently, deep learning has been used to solve various problems in bioinformatics (Tang et al., 2019; Chaudhari et al., 2020; Thapa et al., 2020; Wang D. et al., 2020). One of the most serious problems associated with deep learning stems from data dependence. ...
... To our knowledge, the resulting models, termed DTL-DephosSite-ST and DTL-DephosSite-Y, are the first general dephosphorylation site predictors for S/T and Y dephosphorylation, respectively. Deep learning-based models have recently been developed for several important PTMs, including phosphorylation, methylation, acetylation, and succinylation, to name a few (Wang et al., 2017;Luo et al., 2019;Wu et al., 2019;Al-barakati et al., 2020;Chaudhari et al., 2020;Thapa et al., 2020;Ahmed et al., 2021). Similar to previous deep learning-based models, our models did not require any hand selected features during model development. ...
Article
Full-text available
Phosphorylation, which is mediated by protein kinases and opposed by protein phosphatases, is an important post-translational modification that regulates many cellular processes, including cellular metabolism, cell migration, and cell division. Due to its essential role in cellular physiology, a great deal of attention has been devoted to identifying sites of phosphorylation on cellular proteins and understanding how modification of these sites affects their cellular functions. This has led to the development of several computational methods designed to predict sites of phosphorylation based on a protein’s primary amino acid sequence. In contrast, much less attention has been paid to dephosphorylation and its role in regulating the phosphorylation status of proteins inside cells. Indeed, to date, dephosphorylation site prediction tools have been restricted to a few tyrosine phosphatases. To fill this knowledge gap, we have employed a transfer learning strategy to develop a deep learning-based model to predict sites that are likely to be dephosphorylated. Based on independent test results, our model, which we termed DTL-DephosSite, achieved scores for phosphoserine/phosphothreonine residues of 84%, 84% and 0.68 with respect to sensitivity (SN), specificity (SP) and Matthews correlation coefficient (MCC), respectively. Similarly, DTL-DephosSite achieved scores of 75%, 88% and 0.64 for phosphotyrosine residues with respect to SN, SP, and MCC.
... For lysine, MufeSPM provided higher evaluation results than MePred-RF and iMethyl-PseAAC. Table 8 presents the comparison of MufeSPM with MeMo [25], MASA [27], BPB-PPMS [26], PMeS [29], iMethyl-PseAAC [30], PSSMe [51], MePred-RF [32], PRmePRed [36] and DeepRMethylSite [52] by using the data set Data-3 for arginine methylation. MufeSPM showed an accuracy of 93.10%, a sensitivity of 90.26%, a specificity of 96.38% and an MCC of 0.86 for arginine methylation. ...
Article
Full-text available
Integrated studies (multi-omics studies) comprising genetic, proteomic and epigenetic data analyses have become an emerging topic in biomedical research. Protein methylation is a posttranslational modification that plays an essential role in various cellular activities. The prediction of methylation sites (arginine and lysine) is vital to understand the molecular processes of protein methylation. However, current experimental techniques used for methylation site identification are tedious and expensive. Hence, computational techniques for predicting methylation sites in proteins are necessary, and various computational methods have been proposed in recent years. Most existing methods require structural and evolutionary information for retrieving features, and acquiring this information is not always convenient. Thus, we proposed a novel method, called multi-factorial feature extraction and site prognosis model (MufeSPM), for the prediction of protein methylation sites based on information theory features (Renyi, Shannon, Havrda–Charvat and Arimoto entropy), amino acid composition and physicochemical properties acquired from protein methylation data. A random forest algorithm was used to predict methylation sites in protein sequences. This paper also studied the impact of different features and classifiers on arginine and lysine methylation data sets. For the R methylation data set, MufeSPM yielded 82.45% (± 3.47) accuracy, and for the K methylation data set, it provided an average accuracy of 71.94% (± 2.12). Additionally, the area under the receiver operating characteristic curve for different classifiers in predicting methylation sites was provided. The experimental results signify that MufeSPM performs better than the state-of-the-art predictors.
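Two of the information theory features named in this abstract, Shannon and Renyi entropy, can be sketched on top of the amino acid composition of a peptide. The example below is only an illustration of the standard definitions; the function names, the base-2 logarithm, the Renyi order `alpha=2`, and the example peptide are assumptions, and the Havrda–Charvat and Arimoto entropies are omitted.

```python
import math
from collections import Counter


def composition(sequence):
    """Relative frequency of each residue that occurs in the sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    return {aa: c / len(sequence) for aa, c in counts.items()}


def shannon_entropy(sequence):
    """Shannon entropy in bits: -sum(p * log2(p)) over observed residues."""
    return -sum(p * math.log2(p) for p in composition(sequence).values())


def renyi_entropy(sequence, alpha=2.0):
    """Renyi entropy of order alpha in bits; alpha = 1 falls back to Shannon entropy."""
    if alpha == 1:
        return shannon_entropy(sequence)
    power_sum = sum(p ** alpha for p in composition(sequence).values())
    return math.log2(power_sum) / (1 - alpha)


peptide = "GRGRGGFGGRGGFRG"  # hypothetical window around a candidate site
print(composition(peptide))
print(shannon_entropy(peptide), renyi_entropy(peptide, alpha=2.0))
```

Both entropies are zero for a window made of a single residue type and grow as the composition becomes more uniform, so they summarize how diverse the sequence context of a site is.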
... Chaudhari et al. proposed DeepRMethylSite [80], a deep learning-based approach for predicting arginine methylation sites in proteins. The authors collected 10,429 positive samples from 5725 proteins. ...
Chapter
Full-text available
Posttranslational modification (PTM) is a ubiquitous phenomenon in both eukaryotes and prokaryotes which gives rise to enormous proteomic diversity. PTM mostly comes in two flavors: covalent modification to the polypeptide chain and proteolytic cleavage. Understanding and characterization of PTM is a fundamental step toward understanding the underpinning of biology. Recent advances in experimental approaches, mainly mass-spectrometry-based approaches, have immensely helped in obtaining and characterizing PTMs. However, experimental approaches are not enough to understand and characterize more than 450 different types of PTMs, and complementary computational approaches are becoming popular. Recently, due to the various advancements in the field of Deep Learning (DL), along with the explosion of applications of DL to various fields, the field of computational prediction of PTM has also witnessed the development of a plethora of DL-based approaches. In this book chapter, we first review some recent DL-based approaches in the field of PTM site prediction. In addition, we also review the recent advances in the not-so-studied PTM, that is, proteolytic cleavage predictions. We describe advances in PTM prediction by highlighting the deep learning architecture, feature encoding, novelty of the approaches, and availability of the tools/approaches. Finally, we provide an outlook and possible future research directions for DL-based approaches for PTM prediction.
Key words: Deep learning, Posttranslational modification site, Proteolytic cleavage, Phosphorylation, Machine learning
... Differentially treated cells can then be analyzed simultaneously by MS, and true methyl groups are identified by the presence of both heavy and light versions. To aid with analyzing such data, open-source programs have recently been developed, MethylQuant 86 and hmSEEKER, 87 though both are limited in their compatibility with analytical pipelines. ...
Article
Protein methylation is a key post-translational modification whose effects on gene expression have been intensively studied over the last two decades. Recently, renewed interest in non-histone protein methylation has gained momentum for its role in regulating important cellular processes and the activity of many proteins, including transcription factors, enzymes, and structural complexes. The extensive and dynamic role that protein methylation plays within the cell also highlights its potential for bioengineering applications. Indeed, while synthetic histone protein methylation has been extensively used to engineer gene expression, engineering of non-histone protein methylation has not been fully explored yet. Here, we report the latest findings, highlighting how non-histone protein methylation is fundamental for certain cellular functions and is implicated in disease, and review recent efforts in the engineering of protein methylation.
... In recent years, deep learning (DL) based methods have been used to predict the PTM sites in cellular proteins. Typical applications include DeepSuccinylSite [54], MusiteDeep [55], DeepRMethylSite [56], and DeepPhos [57]. In DL, a suitable raw vector is given to the architecture and transformed into highly abstract features by propagating through whole model. ...
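The "raw vector" mentioned in this snippet is typically an encoded sequence window. The following sketch shows two common encodings, integer encoding (usually fed to an embedding layer) and one-hot encoding. It is an illustrative example only: the alphabet ordering, the extra index for unknown residues and the function names are assumptions, not details taken from any of the cited tools.

```python
# The 20 standard amino acids in an arbitrary (assumed) order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
UNKNOWN = len(AMINO_ACIDS)  # assumed index for any non-standard residue


def integer_encode(window):
    """Map each residue to an integer index, suitable as input to an embedding layer."""
    return [INDEX.get(aa, UNKNOWN) for aa in window]


def one_hot_encode(window):
    """Map each residue to a 21-dimensional indicator vector."""
    size = len(AMINO_ACIDS) + 1
    rows = []
    for idx in integer_encode(window):
        row = [0] * size
        row[idx] = 1
        rows.append(row)
    return rows


# Hypothetical window with an arginine (R) at the centre.
window = "GGRGA"
print(integer_encode(window))
print(len(one_hot_encode(window)), "rows of", len(one_hot_encode(window)[0]))
```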
Article
Full-text available
S-Nitrosylation modification is one of the most important post-translational modifications; it plays a critical role in a vast variety of biological processes and is related to various diseases. Identification of S-Nitrosylation sites in proteins is crucial for understanding and controlling basic biological processes. The conventional experimental identification methods are laborious and cost-inefficient. To overcome these issues, computational biological approaches are under consideration, including use of machine learning and deep learning algorithms. All existing S-Nitrosylation predictors use the handcrafted feature extraction method and could be improved upon. We propose an end-to-end deep learning based S-Nitrosylation site predictor with an embedded layer and bidirectional long short-term memory. The proposed method uses protein sequences as inputs without any need for complex feature interventions. This sequence-based protein prediction method is associated with a significant improvement in identification of S-Nitrosylation sites. More specifically, the best prediction of the proposed architecture showed an improvement in MCC of 3% on 5-fold cross validation and 5% on an independent test dataset. Finally, the user-friendly publicly available webserver is accessible at http://nsclbio.jbnu.ac.kr/tools/RecSNO/.
... During the past few years, a wide range of methods has been proposed to predict Glutarylation sites using many machine learning approaches [20][21][22][23][24][25]. Recently, many deep learning models have been used to predict different types of PTMs [6,[26][27][28][29]. In one of the earliest studies, Tan et al. detected 23 Glutarylation sites in 13 unique proteins from HeLa cells [16]. ...
Article
Full-text available
Post Translational Modification (PTM) is defined as the alteration of protein sequence upon interaction with different macromolecules after the translation process. Glutarylation is considered one of the most important PTMs, which is associated with a wide range of cellular functioning, including metabolism, translation, and specified separate subcellular localizations. During the past few years, a wide range of computational approaches has been proposed to predict Glutarylation sites. However, despite all the efforts that have been made so far, the prediction performance of the Glutarylation sites has remained limited. One of the main challenges to tackle this problem is to extract features with significant discriminatory information. To address this issue, we propose a new machine learning method called BiPepGlut using the concept of a bi-peptide-based evolutionary method for feature extraction. To build this model, we also use the Extra-Trees (ET) classifier for the classification purpose, which, to the best of our knowledge, has never been used for this task. Our results demonstrate BiPepGlut is able to significantly outperform previously proposed models to tackle this problem. BiPepGlut achieves 92.0%, 84.8%, 95.6%, 0.82, and 0.88 in accuracy, sensitivity, specificity, Matthew’s Correlation Coefficient, and F1-score, respectively. BiPepGlut is implemented as a publicly available online predictor.
Preprint
Full-text available
Post-translational modifications (PTMs) are pivotal in modulating protein functions, influencing key cellular processes such as signaling, localization, and protein degradation. The complexity of these biological interactions necessitates efficient predictive methodologies. In this work, we introduce PTMGPT2, an interpretable protein language model that utilizes prompt-based fine-tuning to improve its accuracy and generalizability in precisely predicting PTMs. Drawing inspiration from recent advancements in GPT-based architectures, PTMGPT2 adopts an unsupervised learning approach to identify PTMs. It utilizes a custom prompt to guide the model through the subtle linguistic patterns encoded in amino acid sequences, generating tokens indicative of PTM sites. To provide interpretability, we visualize attention profiles from the model’s final decoder layer, elucidating sequence motifs essential for molecular recognition and modification variability. Furthermore, we conducted analyses to investigate the effects of mutations at or near PTM sites, thereby offering deeper insights into protein functionality. Our analysis encompasses a comprehensive dataset comprising 388,084 modification sites across 19 distinct PTM types, facilitating the identification of novel PTM sites. Comparative assessments reveal that PTMGPT2 outperforms existing methods by an average 5.45% in MCC, underscoring its potential in identifying novel therapeutic strategies, disease associations, and drug targets.
Article
Protein methylation is a form of post-translational modification of proteins, which is crucial for various cellular processes, including transcription activity and DNA repair. Correctly predicting protein methylation sites is fundamental for research and drug discovery. Some experimental techniques, such as methyl-specific antibodies, chromatin immunoprecipitation and mass spectrometry, exist for identifying protein methylation sites, but these techniques are time-consuming and costly. The ability to predict methylation sites using in silico techniques may help researchers identify potential candidate sites for future examination and make it easier to carry out site-specific investigations and downstream characterizations. In this research, we proposed a novel deep learning-based predictor, named DeepPRMS, to identify protein methylation sites in primary sequences. DeepPRMS utilizes the gated recurrent unit (GRU) and convolutional neural network (CNN) algorithms to extract the sequential and spatial information from the primary sequences. GRU is used to extract sequential information, while CNN is used for spatial information. We combined the latent representations of the GRU and CNN models to have a better interaction among them. Based on the independent test data set, DeepPRMS obtained an accuracy of 85.32%, a specificity of 84.94%, Matthew’s correlation coefficient of 0.71 and a sensitivity of 85.80%. The results indicate that DeepPRMS can predict protein methylation sites with high accuracy and outperform the state-of-the-art models. DeepPRMS is expected to effectively guide future research experiments for identifying potential methylated protein sites. The web server is available at http://deepprms.nitsri.ac.in/.
Article
Protein methylation is one of the most important reversible post-translational modifications (PTMs), playing a vital role in the regulation of gene expression. Protein methylation sites serve as biomarkers in cardiovascular and pulmonary diseases, influencing various aspects of normal cell biology and pathogenesis. Nonetheless, the majority of existing computational methods for predicting protein methylation sites (PMSP) have been constructed based on protein sequences, with few methods leveraging the topological information of proteins. To address this issue, we propose an innovative framework for predicting Methylation Sites using Graphs (GraphMethySite) that employs graph convolution network in conjunction with Bayesian Optimization (BO) to automatically discover the graphical structure surrounding a candidate site and improve the predictive accuracy. In order to extract the most optimal subgraphs associated with methylation sites, we extend GraphMethySite by coupling it with a hybrid Bayesian optimization (together named GraphMethySite$^+$) to determine and visualize the topological relevance among amino-acid residues. We evaluated our framework on two extended protein methylation datasets, and empirical results demonstrate that it outperforms existing state-of-the-art methylation prediction methods.
Article
Phosphorylation is one of the most important post-translational modifications and plays a pivotal role in various cellular processes. Although there exist several computational tools to predict phosphorylation sites, existing tools have not yet harnessed the knowledge distilled by pretrained protein language models. Herein, we present a novel deep learning-based approach called LMPhosSite for the general phosphorylation site prediction that integrates embeddings from the local window sequence and the contextualized embedding obtained using global (overall) protein sequence from a pretrained protein language model to improve the prediction performance. Thus, the LMPhosSite consists of two base-models: one for capturing effective local representation and the other for capturing global per-residue contextualized embedding from a pretrained protein language model. The output of these base-models is integrated using a score-level fusion approach. LMPhosSite achieves a precision, recall, Matthew's correlation coefficient, and F1-score of 38.78%, 67.12%, 0.390, and 49.15%, for the combined serine and threonine independent test data set and 34.90%, 62.03%, 0.298, and 44.67%, respectively, for the tyrosine independent test data set, which is better than the compared approaches. These results demonstrate that LMPhosSite is a robust computational tool for the prediction of the general phosphorylation sites in proteins.
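The F1-scores quoted in this abstract can be cross-checked from the reported precision and recall, since F1 is their harmonic mean. The short sketch below does that arithmetic; the function name is illustrative, and because the inputs are the rounded percentages from the abstract, the result is only expected to agree with the reported F1 to within rounding.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, converted to fractions.
f1_ser_thr = f1_score(0.3878, 0.6712)  # combined serine/threonine test set
f1_tyr = f1_score(0.3490, 0.6203)      # tyrosine test set

print(f"S/T F1: {100 * f1_ser_thr:.2f}%  Y F1: {100 * f1_tyr:.2f}%")
```

The low precision alongside moderate recall is the usual signature of a heavily imbalanced independent test set, which is why F1 and MCC are reported rather than accuracy alone.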
Article
OMIC is a novel approach that analyses entire genetic or molecular profiles in humans and other organisms. It involves identifying and quantifying biological molecules that contribute to a species' structure, function, and dynamics. Finding the secrets of OMIC is like deciphering the biochemical code, but building data-driven models to mine the hidden phenotypic trait information has been a research hotspot. Transcriptome analysis is a popular biological technology for characterizing living systems' overall health, including cells and tissues. Individual transcript expression levels are known to be correlated with those of other transcripts. Nevertheless, most computational studies do not fully exploit these inter-feature correlations. Differential expression analyses, for example, assume that the expression levels of the transcripts are independent. Thus, we propose extracting these inter-feature correlations using the convolutional neural network (CNN) and transforming the transcriptomic features into a new space of convolutional transcriptomic (LaCOme) features. On most transcriptomic datasets in use, a series of comprehensive experiments have demonstrated that engineered LaCOme features outperform the original transcriptomic features in classification performances. Based on experimental results, OMIC data from biological samples could be further enriched using CNN to enhance computational analysis results. Also, feature rough screening can be used to extract valuable information from OMIC, regardless of the algorithm used to select features. It may always be better to create a novel feature than to keep the original. Furthermore, we investigated the feasibility of the feature construction method through cross-validation and independent verification, hoping to develop a more efficient and effective method.
Article
Dense granule proteins (GRAs) are secreted by Apicomplexa protozoa, which are closely related to an extensive variety of farm animal diseases. Predicting GRAs is an integral part of the prevention and treatment of parasitic diseases. Considering that biological experimental approaches are time-consuming and labor-intensive, computational methods are a superior choice. Hence, developing an effective computational method for GRA prediction is urgent. In this paper, we present a novel computational method named GRA-GCN based on a graph convolutional network. In terms of graph theory, GRA prediction can be regarded as a node classification task. GRA-GCN leverages the k-nearest neighbor algorithm to construct the feature graph for aggregating more informative representations. To our knowledge, this is the first attempt to utilize a computational approach for GRA prediction. Evaluated by 5-fold cross-validation, the GRA-GCN method achieves satisfactory performance, and is superior to four classic machine learning-based methods and three state-of-the-art models. The analysis of the comprehensive experiment results and a case study could offer valuable information for understanding complex mechanisms, and would contribute to accurate prediction of GRAs. Moreover, we also implement a web server at http://dgpd.tlds.cc/GRAGCN/index/ to facilitate the use of our model.
Article
Full-text available
Protein phosphorylation, which is one of the most important post-translational modifications (PTMs), is involved in regulating myriad cellular processes. Herein, we present a novel deep learning based approach for organism-specific protein phosphorylation site prediction in Chlamydomonas reinhardtii, a model algal phototroph. An ensemble model combining convolutional neural networks and long short-term memory (LSTM) achieves the best performance in predicting phosphorylation sites in C. reinhardtii. For this model, named Chlamy-EnPhosSite, the best measured AUC and MCC are 0.90 and 0.64, respectively, for a combined dataset of serine (S) and threonine (T) in independent testing, higher than those measures for other predictors. When applied to the entire C. reinhardtii proteome (totaling 1,809,304 S and T sites), Chlamy-EnPhosSite yielded 499,411 phosphorylated sites with a cut-off value of 0.5 and 237,949 phosphorylated sites with a cut-off value of 0.7. These predictions were compared to an experimental dataset of phosphosites identified by liquid chromatography-tandem mass spectrometry (LC–MS/MS) in a blinded study and approximately 89.69% of 2,663 C. reinhardtii S and T phosphorylation sites were successfully predicted by Chlamy-EnPhosSite at a probability cut-off of 0.5 and 76.83% of sites were successfully identified at a more stringent 0.7 cut-off. Interestingly, Chlamy-EnPhosSite also successfully predicted experimentally confirmed phosphorylation sites in a protein sequence (e.g., RPS6 S245) which did not appear in the training dataset, highlighting prediction accuracy and the power of leveraging predictions to identify biologically relevant PTM sites. These results demonstrate that our method represents a robust and complementary technique for high-throughput phosphorylation site prediction in C. reinhardtii. It has potential to serve as a useful tool to the community.
Chlamy-EnPhosSite will contribute to the understanding of how protein phosphorylation influences various biological processes in this important model microalga.
Preprint
Full-text available
Protein phosphorylation is one of the most important post-translational modifications (PTMs) and is involved in myriad cellular processes. Although many non-organism-specific computational phosphorylation site prediction tools and a few tools for organism-specific phosphorylation site prediction exist, none are currently available for Chlamydomonas reinhardtii. Herein, we present a novel deep learning (DL) based approach for organism-specific protein phosphorylation site prediction in Chlamydomonas reinhardtii, a model algal phototroph. Our novel approach, called Chlamy-EnPhosSite (based on an ensemble approach combining convolutional neural networks (CNN) and long short-term memory (LSTM)), produces AUC and MCC of 0.90 and 0.64, respectively, for a combined dataset of serine (S) and threonine (T) in independent testing. When applied to the entire C. reinhardtii proteome (totaling 1,809,304 S and T sites), Chlamy-EnPhosSite yielded 499,411 phosphorylated sites with a cut-off value of 0.5 and 237,949 phosphorylated sites with a cut-off value of 0.7. These predictions were compared to an experimental dataset of phosphosites identified by liquid chromatography-tandem mass spectrometry (LC-MS/MS) in a blinded study and approximately 90% of 2,663 C. reinhardtii S and T phosphorylation sites were successfully predicted by Chlamy-EnPhosSite at a probability cut-off of 0.5 and 77% of sites were successfully identified at a more stringent 0.7 cut-off. Interestingly, Chlamy-EnPhosSite also successfully predicted experimentally confirmed phosphorylation sites in a protein sequence (e.g., RPS6 S245) which did not appear in the training dataset, highlighting prediction accuracy and the power of leveraging predictions to identify biologically relevant PTM sites. These results demonstrate that our method represents a robust and complementary technique for high-throughput phosphorylation site prediction in C. reinhardtii. It has potential to serve as a useful tool to the community.
Chlamy-EnPhosSite will contribute to the understanding of how protein phosphorylation influences various biological processes in this important model microalga.
Article
Full-text available
Background: Protein succinylation has recently emerged as an important and common post-translation modification (PTM) that occurs on lysine residues. Succinylation is notable both in its size (e.g., at 100 Da, it is one of the larger chemical PTMs) and in its ability to modify the net charge of the modified lysine residue from + 1 to - 1 at physiological pH. The gross local changes that occur in proteins upon succinylation have been shown to correspond with changes in gene activity and to be perturbed by defects in the citric acid cycle. These observations, together with the fact that succinate is generated as a metabolic intermediate during cellular respiration, have led to suggestions that protein succinylation may play a role in the interaction between cellular metabolism and important cellular functions. For instance, succinylation likely represents an important aspect of genomic regulation and repair and may have important consequences in the etiology of a number of disease states. In this study, we developed DeepSuccinylSite, a novel prediction tool that uses deep learning methodology along with embedding to identify succinylation sites in proteins based on their primary structure. Results: Using an independent test set of experimentally identified succinylation sites, our method achieved efficiency scores of 79%, 68.7% and 0.48 for sensitivity, specificity and MCC respectively, with an area under the receiver operator characteristic (ROC) curve of 0.8. In side-by-side comparisons with previously described succinylation predictors, DeepSuccinylSite represents a significant improvement in overall accuracy for prediction of succinylation sites. Conclusion: Together, these results suggest that our method represents a robust and complementary technique for advanced exploration of protein succinylation.
Article
Full-text available
Background: DNA methylation (DNAm) is an epigenetic regulator of gene expression programs that can be altered by environmental exposures, aging, and in pathogenesis. Traditional analyses that associate DNAm alterations with phenotypes suffer from multiple hypothesis testing and multi-collinearity due to the high-dimensional, continuous, interacting and non-linear nature of the data. Deep learning analyses have shown much promise to study disease heterogeneity. DNAm deep learning approaches have not yet been formalized into user-friendly frameworks for execution, training, and interpreting models. Here, we describe MethylNet, a DNAm deep learning method that can construct embeddings, make predictions, generate new data, and uncover unknown heterogeneity with minimal user supervision. Results: The results of our experiments indicate that MethylNet can study cellular differences, grasp higher order information of cancer sub-types, estimate age and capture factors associated with smoking in concordance with known differences. Conclusion: The ability of MethylNet to capture nonlinear interactions presents an opportunity for further study of unknown disease, cellular heterogeneity and aging processes.
Article
Full-text available
Among the widespread and increasing number of identified post-translational modifications (PTMs), arginine methylation is catalyzed by the protein arginine methyltransferases (PRMTs) and regulates fundamental processes in cells, such as gene regulation, RNA processing, translation, and signal transduction. As epigenetic regulators, PRMTs play key roles in pluripotency, differentiation, proliferation, survival, and apoptosis, which are essential biological programs leading to development, adult homeostasis but also pathological conditions including cancer. A full understanding of the molecular mechanisms that underlie PRMT-mediated gene regulation requires the genome wide mapping of each player, i.e., PRMTs, their substrates and epigenetic marks, methyl-marks readers as well as interaction partners, in a thorough and unambiguous manner. However, despite the tremendous advances in high throughput sequencing technologies and the numerous efforts from the scientific community, the epigenomic profiling of PRMTs as well as their histone and non-histone substrates still remains a big challenge owing to obvious limitations in tools and methodologies. This review will summarize the present knowledge about the genome wide mapping of PRMTs and their substrates as well as the technical approaches currently in use. The limitations and pitfalls of the technical tools along with conventional approaches will be then discussed in detail. Finally, potential new strategies for chromatin profiling of PRMTs and histone substrates will be proposed and described.
Article
Full-text available
Background: Determination of genome-wide DNA methylation is significant for both basic research and drug development. As a key epigenetic modification, this biochemical process can modulate gene expression to influence the cell differentiation which can possibly lead to cancer. Due to the involuted biochemical mechanism of DNA methylation, obtaining a precise prediction is a considerably tough challenge. Existing approaches have yielded good predictions, but the methods either need to combine plenty of features and prerequisites or deal with only hypermethylation and hypomethylation. Results: In this paper, we propose a deep learning method for prediction of the genome-wide DNA methylation, in which the Methylation Regression is implemented by Convolutional Neural Networks (MRCNN). Through minimizing the continuous loss function, experiments show that our model is convergent and more precise than the state-of-the-art method (DeepCpG) according to results of the evaluation. MRCNN also achieves the discovery of de novo motifs by analysis of features from the training process. Conclusions: Genome-wide DNA methylation could be evaluated based on the corresponding local DNA sequences of target CpG loci. With the autonomous learning pattern of deep learning, MRCNN enables accurate predictions of genome-wide DNA methylation status without predefined features and discovers some de novo methylation-related motifs that match known motifs by extracting sequence patterns.
Article
Full-text available
The purpose of this study is to examine existing deep learning techniques for addressing class imbalanced data. Effective classification with imbalanced data is an important area of research, as high class imbalance is naturally inherent in many real-world applications, e.g., fraud detection and cancer detection. Moreover, highly imbalanced data poses added difficulty, as most learners will exhibit bias towards the majority class, and in extreme cases, may ignore the minority class altogether. Class imbalance has been studied thoroughly over the last two decades using traditional machine learning models, i.e. non-deep learning. Despite recent advances in deep learning, along with its increasing popularity, very little empirical work in the area of deep learning with class imbalance exists. Having achieved record-breaking performance results in several complex domains, investigating the use of deep neural networks for problems containing high levels of class imbalance is of great interest. Available studies regarding class imbalance and deep learning are surveyed in order to better understand the efficacy of deep learning when applied to class imbalanced data. This survey discusses the implementation details and experimental results for each study, and offers additional insight into their strengths and weaknesses. Several areas of focus include: data complexity, architectures tested, performance interpretation, ease of use, big data application, and generalization to other domains. We have found that research in this area is very limited, that most existing work focuses on computer vision tasks with convolutional neural networks, and that the effects of big data are rarely considered. Several traditional methods for class imbalance, e.g. data sampling and cost-sensitive learning, prove to be applicable in deep learning, while more advanced methods that exploit neural network feature learning abilities show promising results. 
The survey concludes with a discussion that highlights various gaps in deep learning from class imbalanced data for the purpose of guiding future research.
Article
Full-text available
The UniProt Knowledgebase is a collection of sequences and annotations for over 120 million proteins across all branches of life. Detailed annotations extracted from the literature by expert curators have been collected for over half a million of these proteins. These annotations are supplemented by annotations provided by rule based automated systems, and those imported from other resources. In this article we describe significant updates that we have made over the last 2 years to the resource. We have greatly expanded the number of Reference Proteomes that we provide and in particular we have focussed on improving the number of viral Reference Proteomes. The UniProt website has been augmented with new data visualizations for the subcellular localization of proteins as well as their structure and interactions. UniProt resources are available under a CC-BY (4.0) license via the web at https://www.uniprot.org/.
Article
Full-text available
Motivation: Computational methods for phosphorylation site prediction play important roles in protein function studies and experimental design. Most existing methods are based on feature extraction, which may result in incomplete or biased features. Deep learning as the cutting-edge machine learning method has the ability to automatically discover complex representations of phosphorylation patterns from the raw sequences, and hence it provides a powerful tool for improvement of phosphorylation site prediction. Results: We present MusiteDeep, the first deep-learning framework for predicting general and kinase-specific phosphorylation sites. MusiteDeep takes raw sequence data as input and uses convolutional neural networks with a novel two-dimensional attention mechanism. It achieves over a 50% relative improvement in the area under the precision-recall curve in general phosphorylation site prediction and obtains competitive results in kinase-specific prediction compared to other well-known tools on the benchmark data. Availability and implementation: MusiteDeep is provided as an open-source tool available at https://github.com/duolinwang/MusiteDeep . Contact: xudong@missouri.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Full-text available
Protein methylation is an important post-translational modification (PTM) of proteins. Arginine methylation carries out and regulates several important biological functions, including gene regulation and signal transduction. Experimental identification of arginine methylation sites is a daunting task as it is costly as well as time and labour intensive. Hence, reliable prediction tools play an important role in rapid screening and identification of possible methylation sites in proteomes. Our preliminary assessment using the available prediction methods on collected data yielded unimpressive results. This motivated us to perform a comprehensive data analysis and appraisal of features relevant in the context of biological significance, which led to the development of a prediction tool, PRmePRed, with better performance. PRmePRed performs reasonably well with an accuracy of 84.10%, 82.38% sensitivity, 83.77% specificity, and Matthew's correlation coefficient of 66.20% in 10-fold cross-validation. PRmePRed is freely available at http://bioinfo.icgeb.res.in/PRmePRed/
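The four metrics quoted in this and neighbouring abstracts (accuracy, sensitivity, specificity and Matthew's correlation coefficient) all derive from the same confusion matrix. The sketch below shows the standard formulas; the function name and the confusion-matrix counts are hypothetical and are not taken from any of the cited studies.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    # MCC is conventionally set to 0 when any marginal total is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for a balanced test set of 100 positives and 100 negatives.
metrics = classification_metrics(tp=80, fp=15, tn=85, fn=20)
print(metrics)
```

MCC lies in [-1, 1], so a value reported as a percentage, such as 66.20%, corresponds to 0.662 on the usual scale.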
Article
Full-text available
Motivation: A large number of protein sequences are becoming available through the application of novel high-throughput sequencing technologies. Experimental functional characterization of these proteins is time-consuming and expensive, and is often only done rigorously for few selected model organisms. Computational function prediction approaches have been suggested to fill this gap. The functions of proteins are classified using the Gene Ontology (GO), which contains over 40,000 classes. Additionally, proteins have multiple functions, making function prediction a large-scale, multi-class, multi-label problem. Results: We have developed a novel method to predict protein function from sequence. We use deep learning to learn features from protein sequences as well as a cross-species protein-protein interaction network. Our approach specifically outputs information in the structure of the GO and utilizes the dependencies between GO classes as background information to construct a deep learning model. We evaluate our method using the standards established by the Computational Assessment of Function Annotation (CAFA) and demonstrate a significant improvement over baseline methods such as BLAST, in particular for predicting cellular locations. Availability and implementation: Web server: http://deepgo.bio2vec.net, Source code: https://github.com/bio-ontology-research-group/deepgo. Contact: robert.hoehndorf@kaust.edu.sa. Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Full-text available
Imbalanced-learn is an open-source python toolbox aiming at providing a wide range of methods to cope with the problem of imbalanced dataset frequently encountered in machine learning and pattern recognition. The implemented state-of-the-art methods can be categorized into 4 groups: (i) under-sampling, (ii) over-sampling, (iii) combination of over- and under-sampling, and (iv) ensemble learning methods. The proposed toolbox only depends on numpy, scipy, and scikit-learn and is distributed under MIT license. Furthermore, it is fully compatible with scikit-learn and is part of the scikit-learn-contrib supported project. Documentation, unit tests as well as integration tests are provided to ease usage and contribution. The toolbox is publicly available in GitHub: https://github.com/scikit-learn-contrib/imbalanced-learn.
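To make the first of the four method groups concrete, the sketch below implements plain random under-sampling using only the Python standard library. It is a simplified illustration of the idea, not the imbalanced-learn implementation or its API; the function name, the seed parameter and the toy data are assumptions.

```python
import random
from collections import defaultdict


def random_under_sample(samples, labels, seed=0):
    """Randomly discard samples so every class matches the size of the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    target = min(len(items) for items in by_class.values())

    kept_samples, kept_labels = [], []
    for label, items in by_class.items():
        # rng.sample draws `target` distinct items without replacement.
        for sample in rng.sample(items, target):
            kept_samples.append(sample)
            kept_labels.append(label)
    return kept_samples, kept_labels


# Toy data: 2 positive windows and 8 negative windows (hypothetical identifiers).
samples = list(range(10))
labels = [1, 1] + [0] * 8
balanced_samples, balanced_labels = random_under_sample(samples, labels)
print(balanced_labels)
```

Under-sampling of this kind is a common way to balance positive and negative site windows before training, at the cost of discarding some majority-class data.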
Article
Protein methylation is an essential posttranslational modification (PTM) that mostly occurs at lysine and arginine residues and regulates a variety of cellular processes. Owing to rapid progress in the large-scale identification of methylation sites, the available data set has expanded dramatically, and more attention has been paid to identifying the specific methylation types of modified residues. Here, we briefly summarize current progress in the computational prediction of methylation sites, which provides an accurate, rapid and efficient alternative to labor-intensive experiments. We collected 5421 methyllysines and methylarginines in 2592 proteins from the literature, and classified most of the sites into different types. Data analyses demonstrated that different types of methylated proteins were preferentially involved in different biological processes and pathways, whereas a unique sequence preference was observed for each type of methylation site. Thus, we developed a predictor, GPS-MSP, which can predict mono-, di- and tri-methylation types for specific lysines, and mono-, symmetric di- and asymmetric di-methylation types for specific arginines. We critically evaluated the performance of GPS-MSP and compared it with other existing tools. The results showed that classifying methylation sites into different types for training can considerably improve prediction accuracy. Taken together, we anticipate that our study provides a new lead for future computational analysis of protein methylation, and that predicting the methylation types of covalently modified lysine and arginine residues can generate more useful information for further experimental manipulation.
Article
Owing to the importance of the post-translational modifications (PTMs) of proteins in regulating biological processes, the dbPTM (http://dbPTM.mbc.nctu.edu.tw/) was developed as a comprehensive database of experimentally verified PTMs from several databases with annotations of potential PTMs for all UniProtKB protein entries. For this 10th anniversary of dbPTM, the updated resource provides not only a comprehensive dataset of experimentally verified PTMs, supported by the literature, but also an integrative interface for accessing all available databases and tools that are associated with PTM analysis. As well as collecting experimental PTM data from 14 public databases, this update manually curates over 12 000 modified peptides, including the emerging S-nitrosylation, S-glutathionylation and succinylation, from approximately 500 research articles, which were retrieved by text mining. As the number of available PTM prediction methods increases, this work compiles a non-homologous benchmark dataset to evaluate the predictive power of online PTM prediction tools. An increasing interest in the structural investigation of PTM substrate sites motivated the mapping of all experimental PTM peptides to protein entries of Protein Data Bank (PDB) based on database identifier and sequence identity, which enables users to examine spatially neighboring amino acids, solvent-accessible surface area and side-chain orientations for PTM substrate sites on tertiary structures. Since drug binding in PDB is annotated, this update identified over 1100 PTM sites that are associated with drug binding. The update also integrates metabolic pathways and protein-protein interactions to support the PTM network analysis for a group of proteins. Finally, the web interface is redesigned and enhanced to facilitate access to this resource.
Article
PhosphoSitePlus® (PSP, http://www.phosphosite.org/), a knowledgebase dedicated to mammalian post-translational modifications (PTMs), contains over 330 000 non-redundant PTMs, including phospho, acetyl, ubiquityl and methyl groups. Over 95% of the sites are from mass spectrometry (MS) experiments. In order to improve data reliability, early MS data have been reanalyzed, applying a common standard of analysis across over 1 000 000 spectra. Site assignments with P > 0.05 were filtered out. Two new downloads are available from PSP. The 'Regulatory sites' dataset includes curated information about modification sites that regulate downstream cellular processes, molecular functions and protein-protein interactions. The 'PTMVar' dataset, an intersect of missense mutations and PTMs from PSP, identifies over 25 000 PTMVars (PTMs Impacted by Variants) that can rewire signaling pathways. The PTMVar data include missense mutations from UniProtKB, TCGA and other sources that cause over 2000 diseases or syndromes (MIM) and polymorphisms, or are associated with hundreds of cancers. PTMVars include 18 548 phosphorylation sites, 3412 ubiquitylation sites, 2316 acetylation sites, 685 methylation sites and 245 succinylation sites. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
Article
Before becoming native proteins during biosynthesis, the polypeptide chains produced when ribosomes translate mRNA undergo a series of "product-forming" steps, such as cutting, folding, and posttranslational modification (PTM). Knowledge of PTMs in proteins is crucial for dynamic proteome analysis of various human diseases and epigenetic inheritance. One of the most important PTMs is Arg- or Lys-methylation, which occurs on arginine or lysine, respectively. Given a protein, which of its Arg (or Lys) sites can be methylated, and which cannot? This is the first important problem for understanding the methylation mechanism in depth and for drug development. With the avalanche of protein sequences generated in the postgenomic age, its urgency has become self-evident. To address this problem, we proposed a new predictor, called iMethyl-PseAAC. In the prediction system, a peptide sample was formulated as a 346-dimensional vector, formed by incorporating its physicochemical, sequence evolution, biochemical, and structural disorder information into the general form of pseudo amino acid composition. The rigorous jackknife test and an independent dataset test showed that iMethyl-PseAAC was superior to any of the existing predictors in this area.
Article
Protein methylation is predominantly found on lysine and arginine residues, and carries many important biological functions, including gene regulation and signal transduction. Given their important involvement in gene expression, protein methylation and their regulatory enzymes are implicated in a variety of human disease states such as cancer, coronary heart disease and neurodegenerative disorders. Thus, identification of methylation sites can be very helpful for the drug designs of various related diseases. In this study, we developed a method called PMeS to improve the prediction of protein methylation sites based on an enhanced feature encoding scheme and support vector machine. The enhanced feature encoding scheme was composed of the sparse property coding, normalized van der Waals volume, position weight amino acid composition and accessible surface area. The PMeS achieved a promising performance with a sensitivity of 92.45%, a specificity of 93.18%, an accuracy of 92.82% and a Matthew's correlation coefficient of 85.69% for arginine as well as a sensitivity of 84.38%, a specificity of 93.94%, an accuracy of 89.16% and a Matthew's correlation coefficient of 78.68% for lysine in 10-fold cross validation. Compared with other existing methods, the PMeS provides better predictive performance and greater robustness. It can be anticipated that the PMeS might be useful to guide future experiments needed to identify potential methylation sites in proteins of interest. The online service is available at http://bioinfo.ncu.edu.cn/inquiries_PMeS.aspx.
Article
Chromatin is not an inert structure, but rather an instructive DNA scaffold that can respond to external cues to regulate the many uses of DNA. A principal component of chromatin that plays a key role in this regulation is the modification of histones. There is an ever-growing list of these modifications and the complexity of their action is only just beginning to be understood. However, it is clear that histone modifications play fundamental roles in most biological processes that are involved in the manipulation and expression of DNA. Here, we describe the known histone modifications, define where they are found genomically and discuss some of their functional consequences, concentrating mostly on transcription where the majority of characterisation has taken place.
Article
Protein methylation is one type of reversible post-translational modification (PTM), which plays vital roles in many cellular processes such as transcription activity and DNA repair. Experimental identification of methylation sites on proteins without prior knowledge is costly and time-consuming. In silico prediction of methylation sites might not only provide researchers with information on the candidate sites for further determination, but also facilitate downstream characterizations and site-specific investigations. In the present study, a novel approach based on Bi-profile Bayes feature extraction combined with support vector machines (SVMs) was employed to develop the model for Prediction of Protein Methylation Sites (BPB-PPMS) from primary sequence. Methylation can occur at many residues including arginine, lysine, histidine, glutamine, and proline. At present, BPB-PPMS is only designed to predict the methylation status for lysine and arginine residues on polypeptides, because there are not enough experimentally verified data to build and train prediction models for other residues. The performance of BPB-PPMS is measured with a sensitivity of 74.71%, a specificity of 94.32% and an accuracy of 87.98% for arginine as well as a sensitivity of 70.05%, a specificity of 77.08% and an accuracy of 75.51% for lysine in 5-fold cross validation experiments. Results obtained from cross-validation experiments and tests on independent data sets suggest that BPB-PPMS might facilitate the identification and annotation of protein methylation. In addition, BPB-PPMS can be extended to build predictors for other types of PTM sites with ease. For public access, BPB-PPMS is available at http://www.bioinfo.bio.cuhk.edu.hk/bpbppms.
Article
Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
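The contrast between decaying error backflow and constant error flow can be illustrated with a deliberately simplified calculation. The sketch below assumes a single linear recurrent unit, so that the error signal is multiplied by the recurrent weight once per time step; it ignores activation-function derivatives and gating, and the function name `error_backflow` is an assumption made for this example, not part of any LSTM implementation.

```python
def error_backflow(recurrent_weight, steps):
    """Scale factor applied to an error signal propagated back through `steps` time steps.

    Simplifying assumption: one linear unit whose only recurrent connection
    has the given weight, so each step multiplies the signal by that weight.
    """
    signal = 1.0
    for _ in range(steps):
        signal *= recurrent_weight
    return signal


# A weight below 1 makes the error vanish over long time lags.
decaying = error_backflow(0.9, 1000)

# A self-connection fixed at 1.0 keeps the error constant,
# which is the idea behind the constant error carousel.
constant = error_backflow(1.0, 1000)
```

Over 1000 steps the first value is vanishingly small while the second stays exactly 1.0, which is why a unit with a fixed self-connection of 1.0 can carry error across long time lags.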
Article
Protein methylation is a stable post-translational modification (PTM) with important biological functions. It occurs predominantly on arginine and lysine residues with varying numbers of methyl groups, such as mono-, di- or trimethyl lysine. Existing methods for identifying methylation sites are laborious, require large amounts of sample and cannot be applied to complex mixtures. We have previously described stable isotope labeling by amino acids in cell culture (SILAC) for quantitative comparison of proteomes. In heavy methyl SILAC, cells metabolically convert [(13)CD(3)]methionine to the sole biological methyl donor, [(13)CD(3)]S-adenosyl methionine. Heavy methyl groups are fully incorporated into in vivo methylation sites, directly labeling the PTM. This provides markedly increased confidence in identification and relative quantitation of protein methylation by mass spectrometry. Using antibodies targeted to methylated residues and analysis by liquid chromatography-tandem mass spectrometry, we identified 59 methylation sites, including previously unknown sites, considerably extending the number of in vivo methylation sites described in the literature.
Article
Protein methylation is an important and reversible post-translational modification of proteins (PTMs), which governs cellular dynamics and plasticity. Experimental identification of the methylation site is labor-intensive and often limited by the availability of reagents, such as methyl-specific antibodies and optimization of enzymatic reaction. Computational analysis may facilitate the identification of potential methylation sites with ease and provide insight for further experimentation. Here we present a novel protein methylation prediction web server named MeMo, protein methylation modification prediction, implemented in Support Vector Machines (SVMs). Our present analysis is primarily focused on methylation on lysine and arginine, two major protein methylation sites. However, our computational platform can be easily extended into the analyses of other amino acids. The accuracies for prediction of protein methylation on lysine and arginine have reached 67.1 and 86.7%, respectively. Thus, the MeMo system is a novel tool for predicting protein methylation and may prove useful in the study of protein methylation function and dynamics. The MeMo web server is available at: http://www.bioinfo.tsinghua.edu.cn/~tigerchen/memo.html.
Conference Paper
Recurrent neural networks (RNNs) stand at the forefront of many recent developments in deep learning. Yet a major difficulty with these models is their tendency to overfit, with dropout shown to fail when applied to recurrent layers. Recent results at the intersection of Bayesian modelling and deep learning offer a Bayesian interpretation of common deep learning techniques such as dropout. This grounding of dropout in approximate Bayesian inference suggests an extension of the theoretical results, offering insights into the use of dropout with RNN models. We apply this new variational inference based dropout technique in LSTM and GRU models, assessing it on language modelling and sentiment analysis tasks. The new approach outperforms existing techniques, and to the best of our knowledge improves on the single model state-of-the-art in language modelling with the Penn Treebank (73.4 test perplexity). This extends our arsenal of variational tools in deep learning.
Article
Protein methylation, an important post-translational modification, plays crucial roles in many cellular processes. The accurate prediction of protein methylation sites is fundamentally important for revealing the molecular mechanisms undergoing methylation. In recent years, computational prediction based on machine learning algorithms has emerged as a powerful and robust approach for identifying methylation sites, and much progress has been made in predictive performance improvement. However, the predictive performance of existing methods is not satisfactory in terms of overall accuracy. Motivated by this, we propose a novel random-forest-based predictor called MePred-RF, integrating several discriminative sequence-based feature descriptors and improving feature representation capability using a powerful feature selection technique. Importantly, unlike other methods based on multiple, complex information inputs, our proposed MePred-RF is based on sequence information alone. Comparative studies on benchmark datasets via rigorous jackknife tests indicate that our proposed MePred-RF method remarkably outperforms other state-of-the-art predictors, leading by a 4.5% average in terms of overall accuracy. A user-friendly webserver that implements the proposed method has been established for researchers' convenience, and is now freely available for public use through http://server.malab.cn/MePred-RF. We anticipate our research tool to be useful for the large-scale prediction and analysis of protein methylation sites.
Article
As one of the most important reversible types of post-translational modification, protein methylation catalyzed by methyltransferases carries many pivotal biological functions and is involved in many essential biological processes. Identification of methylation sites is a prerequisite for decoding methylation regulatory networks in living cells and understanding their physiological roles. Experimental methods are labor-intensive and time-consuming. In silico approaches offer a cost-effective and high-throughput way to predict potential methylation sites, but previous predictors only have a mixed model and their prediction performance is not yet fully satisfactory. Recently, with the increasing availability of quantitative methylation datasets in diverse species (especially in eukaryotes), there is a growing need to develop a species-specific predictor. Here, we designed a tool named PSSMe based on an information gain (IG) feature optimization method for species-specific methylation site prediction. The IG method was adopted to analyze the importance and contribution of each feature, then to select the valuable feature dimensions and reassemble them into a new ordered feature vector, which was used to build the final prediction model. Our method improves prediction accuracy by about 15% compared with single features. Furthermore, our species-specific model significantly improves predictive performance compared with other general methylation prediction tools. Hence, our prediction results serve as useful resources to elucidate the mechanism of arginine or lysine methylation and facilitate hypothesis-driven experimental design and validation. Availability and Implementation: The online service is implemented in the C# language and is freely available at http://bioinfo.ncu.edu.cn/PSSMe.aspx. Contact: jdqiu@ncu.edu.cn. Supplementary information: Supplementary data are available at Bioinformatics online.
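Information gain, as used above for ranking features, is the reduction in label entropy obtained by conditioning on a feature. The sketch below computes it for discrete features using the standard definition IG = H(Y) - H(Y | X). It is a minimal illustration under the assumption of categorical feature values, not the PSSMe implementation; the feature names and toy labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        subset = [lab for feat, lab in zip(feature_values, labels) if feat == value]
        conditional += len(subset) / total * entropy(subset)
    return entropy(labels) - conditional


# Hypothetical toy data: 1 = methylated site, 0 = non-methylated site.
labels = [1, 1, 0, 0]
features = {
    "informative": [1, 1, 0, 0],    # perfectly separates the classes
    "uninformative": [1, 0, 1, 0],  # independent of the class
}

gains = {name: information_gain(values, labels) for name, values in features.items()}

# Rank features from most to least informative.
ranking = sorted(gains, key=gains.get, reverse=True)
```

Features with higher gain are placed first in the ranking, and a cutoff on that ranking gives the reduced feature vector.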
Article
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions. The method is straightforward to implement and is based on adaptive estimates of lower-order moments of the gradients. The method is computationally efficient, has little memory requirements and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The method exhibits invariance to diagonal rescaling of the gradients by adapting to the geometry of the objective function. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, by which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. We demonstrate that Adam works well in practice when experimentally compared to other stochastic optimization methods.
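The "adaptive estimates of lower-order moments" can be made concrete with a scalar sketch of the update rule: exponential moving averages of the gradient and of its square, each corrected for initialization bias, set the step size. The default hyper-parameter values below are the commonly cited ones; the one-dimensional quadratic objective, the learning rate of 0.1 and the function name `adam_minimize` are assumptions chosen for this example.

```python
import math


def adam_minimize(grad, x0, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimize a scalar function with Adam, given its gradient function."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # first-moment (mean) estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction for zero init
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0, lr=0.1, steps=500)
```

Because the update divides by the root of the second-moment estimate, rescaling the gradient by a constant factor leaves the step size essentially unchanged, which is the invariance property mentioned above.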
Article
The structural, functional, and mechanistic characterization of several types of post-translational modifications (PTMs) is well-documented. PTMs, however, may interact or interfere with one another when regulating protein function. Yet, characterization of the structural and functional signatures of their crosstalk has been hindered by the scarcity of data. To this end, we developed a unified sequence-based predictor of 23 types of PTM sites that, we believe, is a useful tool in guiding biological experiments and data interpretation. We then used experimentally determined and predicted PTM sites to investigate two particular cases of potential PTM crosstalk in eukaryotes. First, we identified proteins statistically enriched in multiple types of PTM sites and found that they show preferences toward intrinsically disordered regions as well as functional roles in transcriptional, posttranscriptional, and developmental processes. Second, we observed that target sites modified by more than one type of PTM, referred to as shared PTM sites, show even stronger preferences toward disordered regions than their single-PTM counterparts; we explain this by the need for these regions to accommodate multiple partners. Finally, we investigated the influence of single and shared PTMs on differential regulation of protein-protein interactions. We provide evidence that molecular recognition features (MoRFs) show significant preferences for PTM sites, particularly shared PTM sites, implicating PTMs in the modulation of this specific type of macromolecular recognition. We conclude that intrinsic disorder is a strong structural prerequisite for complex PTM-based regulation, particularly in context-dependent protein-protein interactions related to transcriptional and developmental processes. Availability: www.modpred.org.
Article
Flagellar proteins (flagellins) from several species of bacteria have been studied by Weibull1 and, more recently, by Koffler2. The present investigations have revealed differences in the amino-acid compositions of the flagellins from Proteus vulgaris2 and Salmonella typhimurium (Table 1). Also, the latter protein has been found to contain ε-N-methyl-lysine, an amino-acid that has not been previously found to occur naturally.
Article
Studies over the last few years have identified protein methylation on histones and other proteins that are involved in the regulation of gene transcription. Several works have developed approaches to identify computationally the potential methylation sites on lysine and arginine. Studies of protein tertiary structure have demonstrated that the sites of protein methylation are preferentially in regions that are easily accessible. However, previous studies have not taken into account the solvent-accessible surface area (ASA) that surrounds the methylation sites. This work presents a method named MASA that combines the support vector machine with the sequence and structural characteristics of proteins to identify methylation sites on lysine, arginine, glutamate, and asparagine. Since most experimental methylation sites are not associated with corresponding protein tertiary structures in the Protein Data Bank, the effective solvent-accessible prediction tools have been adopted to determine the potential ASA values of amino acids in proteins. Evaluation of predictive performance by cross-validation indicates that the ASA values around the methylation sites can improve the accuracy of prediction. Additionally, an independent test reveals that the prediction accuracies for methylated lysine and arginine are 80.8 and 85.0%, respectively. Finally, the proposed method is implemented as an effective system for identifying protein methylation sites. The developed web server is freely available at http://MASA.mbc.nctu.edu.tw/.
Article
We describe a method that allows for the concurrent proteomic analysis of both membrane and soluble proteins from complex membrane-containing samples. When coupled with multidimensional protein identification technology (MudPIT), this method results in (i) the identification of soluble and membrane proteins, (ii) the identification of post-translational modification sites on soluble and membrane proteins, and (iii) the characterization of membrane protein topology and relative localization of soluble proteins. Overlapping peptides produced from digestion with the robust nonspecific protease proteinase K facilitate the identification of covalent modifications (phosphorylation and methylation). High-pH treatment disrupts sealed membrane compartments without solubilizing or denaturing the lipid bilayer to allow mapping of the soluble domains of integral membrane proteins. Furthermore, coupling protease protection strategies to this method permits characterization of the relative sidedness of the hydrophilic domains of membrane proteins.
Article
Arginine methylation is now coming out of the shadows of protein phosphorylation and entering the mainstream, largely due to the identification of the family of enzymes that lay down this modification. In addition, modification-specific antibodies and proteomic approaches have facilitated the identification of an array of substrates for the protein arginine methyltransferases. This review describes recent insights into the molecular processes regulated by arginine methylation in normal and diseased cells.
Article
Covalent modifications of histone tails have fundamental roles in chromatin structure and function. One such modification, lysine methylation, has important functions in many biological processes that include heterochromatin formation, X-chromosome inactivation and transcriptional regulation. Here, we summarize recent advances in our understanding of how lysine methylation functions in these diverse biological processes, and raise questions that need to be addressed in the future.
Keras: The Python Deep Learning library
A. J. Bannister and T. Kouzarides, Regulation of chromatin by histone modifications, Cell Res., 2011, 21, 381-395.
P. V. Hornbeck, B. Zhang, B. Murray, J. M. Kornhauser, V. Latham and E. Skrzypek, PhosphoSitePlus, 2014: mutations, PTMs and recalibrations, Nucleic Acids Res., 2015, 43, D512-520.
M. D. Zeiler, ADADELTA: An adaptive learning rate method, arXiv:1212.5701, 2012.
Horng, P.-C. Hsu, T.-Y. Wang and H.-D. Huang, Incorporating structural characteristics for identification of protein methylation sites, J. Comput. Chem., 2009, 30, 1532-1543.