Figure - available from: BMC Bioinformatics
The AUC values of the baseline method and our method under different settings

Source publication
Article
Full-text available
Background Long non-coding RNA (lncRNA) plays important roles in many biological and pathological processes, including transcriptional regulation and gene regulation. As lncRNA interacts with multiple proteins, predicting lncRNA-protein interactions (lncRPIs) is an important way to study the functions of lncRNA. Up to now, there have been a few wor...

Similar publications

Article
Full-text available
Oral cancer is one of the leading malignant tumors worldwide. Despite the advent of multidisciplinary approaches, the overall prognosis of patients with oral cancer is poor, mainly due to late diagnosis. There is an urgent need to develop valid biomarkers for early detection and effective therapies. Long non-coding RNAs (lncRNAs) are recognized as...

Citations

... Zheng et al. created a protein similarity network from multi-source data like protein sequences, employing RF technology and known LPIs for potential LPI prediction (LPI-PPSN). 26 Shen et al. assessed lncRNA-protein similarity using methods like fast kernel learning (FKL), employing kernel ridge regression for potential LPI prediction (LPI-MFFKL). 27 Li et al. built a heterogeneous network from lncRNA similarity, protein similarity, and lncRNA-protein networks using RWR technology for potential LPI prediction (LPI-HN). ...
... Protein-related GO data, gathered from the GOA database, 55 are utilized in the calculations to enhance the initial protein features. 26 Assuming that the GO data for a protein are denoted as gp, the relationship matrix of the protein is calculated using the Jaccard coefficient: ...
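The formula itself is elided in the snippet above, so the following is only a minimal sketch of the general idea: a Jaccard coefficient computed over sets of GO terms, one set per protein. The function names, the protein labels, and the toy annotations are hypothetical, and the handling of empty annotation sets is an assumption, not something the cited paper states.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient |a & b| / |a | b|.

    Assumption: two empty annotation sets get similarity 0.0.
    """
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def go_similarity_matrix(go_terms: Dict[str, Set[str]]) -> List[List[float]]:
    """Pairwise Jaccard similarity between proteins, ordered by sorted protein ID."""
    proteins = sorted(go_terms)
    return [
        [jaccard(go_terms[p], go_terms[q]) for q in proteins]
        for p in proteins
    ]


# Hypothetical toy annotations, for illustration only.
go_terms = {
    "P1": {"GO:0003723", "GO:0005634"},
    "P2": {"GO:0003723", "GO:0005737"},
    "P3": {"GO:0006397"},
}
matrix = go_similarity_matrix(go_terms)
for protein, row in zip(sorted(go_terms), matrix):
    print(protein, [round(value, 3) for value in row])
```

In this toy case P1 and P2 share one of three distinct terms, so their similarity is 1/3, while P3 shares nothing with either and scores 0.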
Article
Full-text available
Long non-coding RNAs (lncRNAs) are important factors involved in biological regulatory networks. Accurately predicting lncRNA-protein interactions (LPIs) is vital for clarifying lncRNA’s functions and pathogenic mechanisms. Existing deep learning models have yet to yield satisfactory results in LPI prediction. Recently, graph autoencoders (GAEs) have seen rapid development, excelling in tasks like link prediction and node classification. We employed GAE technology for LPI prediction, devising the FMSRT-LPI model based on path masking and degree regression strategies and thereby achieving satisfactory outcomes. This represents the first known integration of path masking and degree regression strategies into the GAE framework for potential LPI inference. The effectiveness of our FMSRT-LPI model primarily relies on four key aspects. First, within the GAE framework, our model integrates multi-source relationships of lncRNAs and proteins with LPN’s topological data. Second, the implemented masking strategy efficiently identifies LPN’s key paths, reconstructs the network, and reduces the impact of redundant or incorrect data. Third, the integrated degree decoder balances degree and structural information, enhancing node representation. Fourth, the PolyLoss function we introduced is more appropriate for LPI prediction tasks. The results on multiple public datasets further demonstrate our model’s potential in LPI prediction.
... Five data sets are used in this paper, each of which contains protein sequence information, lncRNA sequence information and LPI network. Datasets human 1-human 3 are obtained from Li [35], Zheng [36] and Zhang et al. [37], respectively, and the three obtained data sets are preprocessed. UniProt [38], NPInter [39], NONCODE [40] and SUPERFAMILY [41] (2) interaction data with only one relevant lncRNA or protein interaction and no sequence or protein expression information were removed. ...
Article
Full-text available
LncRNA–protein interactions are ubiquitous in organisms and play a crucial role in a variety of biological processes and complex diseases. Many computational methods have been reported for lncRNA–protein interaction prediction. However, the experimental techniques to detect lncRNA–protein interactions are laborious and time-consuming. Therefore, to address this challenge, this paper proposes a reweighting boosting feature selection (RBFS) method to select key features. Specifically, a reweighting approach adjusts the contribution of each observed sample to model fitting, so that samples with higher weights have more influence than those with lower weights. Feature selection with boosting can efficiently rank and iterate over important features to obtain the optimal feature subset. In the experiments, the RBFS method is applied to the prediction of lncRNA–protein interactions. The experimental results demonstrate that our method achieves higher accuracy and less redundancy with fewer features.
... In the kinase-specific phosphorylation site prediction tool PKSPS, Gao et al. [28] argued that topologically similar kinases might share common substrates, which supported the use of kinase-kinase similarity in the prediction of kinase-specific phosphorylation sites. Zheng et al. [29] fused multiple protein-protein similarity networks, namely protein sequences-, protein domains-, protein GO terms-, and the STRING score-related similarities, in their model to predict long non-coding RNA and protein interactions. Their performance was superior to predictive models using only one protein-protein similarity network [29]. ...
... Zheng et al. [29] fused multiple protein-protein similarity networks, namely protein sequences-, protein domains-, protein GO terms-, and the STRING score-related similarities, in their model to predict long non-coding RNA and protein interactions. Their performance was superior to predictive models using only one protein-protein similarity network [29]. Inspired by their method, a kinase-kinase specificity similarity network was developed among human kinases. ...
... Inspired by their method, a kinase-kinase specificity similarity network was developed among human kinases. The sequence similarity was derived from the normalized Smith-Waterman score, functional similarity and protein domain similarity were based on the Jaccard score, and STRING similarity came from the STRING score, as per the strategies used in [29]. ...
Article
Phosphorylation is an essential mechanism for regulating protein activities. Determining kinase-specific phosphorylation sites by experiments involves time-consuming and expensive analyses. Although several studies proposed computational methods to model kinase-specific phosphorylation sites, they typically required abundant experimentally verified phosphorylation sites to yield reliable predictions. Nevertheless, the number of experimentally verified phosphorylation sites for most kinases is relatively small, and the target phosphorylation sites are still unidentified for some kinases. In fact, there is little research related to these understudied kinases in the literature. Thus, this study aims to create predictive models for these understudied kinases. A kinase–kinase similarity network was generated by merging the sequence-, functional-, protein-domain- and ‘STRING’-related similarities. Thus, besides sequence data, protein–protein interactions and functional pathways were also considered to aid predictive modelling. This similarity network was then integrated with a classification of kinase groups to yield highly similar kinases to a specific understudied type of kinase. Their experimentally verified phosphorylation sites were leveraged as positive sites to train predictive models. The experimentally verified phosphorylation sites of the understudied kinase were used for validation. Results demonstrate that 82 out of 116 understudied kinases were predicted with adequate performance via the proposed modelling strategy, achieving a balanced accuracy of 0.81, 0.78, 0.84, 0.84, 0.85, 0.82, 0.90, 0.82 and 0.85, for the ‘TK’, ‘Other’, ‘STE’, ‘CAMK’, ‘TKL’, ‘CMGC’, ‘AGC’, ‘CK1’ and ‘Atypical’ groups, respectively. Therefore, this study demonstrates that web-like predictive networks can reliably capture the underlying patterns in such understudied kinases by harnessing relevant sources of similarities to predict their specific phosphorylation sites.
... In this work three human and two plant LPI datasets, derived from the NPInter database, were used as training for the classifier. These datasets were processed using several filters, similar to previous works (Li et al., 2015;Zheng et al., 2017;Zheng et al., 2017;Zhang et al., 2018;Bai et al., 2019). Multiple features of lncRNAs and proteins were calculated from their sequences using Pyfeat (Muhammod et al., 2019) and BioProt (Márquez and Castro Amaya, 2019). ...
Article
Full-text available
Understanding how RNAs interact with proteins, RNAs, or other molecules remains a challenge of main interest in biology, given the importance of these complexes in both normal and pathological cellular processes. Since experimental datasets are starting to be available for hundreds of functional interactions between RNAs and other biomolecules, several machine learning and deep learning algorithms have been proposed for predicting RNA-RNA or RNA-protein interactions. However, most of these approaches were evaluated on a single dataset, making performance comparisons difficult. With this review, we aim to summarize recent computational methods, developed in this broad research area, highlighting feature encoding and machine learning strategies adopted. Given the magnitude of the effect that dataset size and quality have on performance, we explored the characteristics of these datasets. Additionally, we discuss multiple approaches to generate datasets of negative examples for training. Finally, we describe the best-performing methods to predict interactions between proteins and specific classes of RNA molecules, such as circular RNAs (circRNAs) and long non-coding RNAs (lncRNAs), and methods to predict RNA-RNA or RNA-RBP interactions independently of the RNA type.
... Li et al. applied LPIHN based on implementing random walk with restart on the heterogeneous network, including lncRNA-lncRNA similarity network, lncRNA-protein interactions network and protein-protein interaction network (Li et al., 2015). Methods proposed respectively by Zheng et al. and Yang et al. and the model of PLIPCOM extracted topological information of ncRNA-protein interactions by calculating the HeteSim scores on the relevance paths of the heterogeneous network (Yang et al., 2016;Zheng et al., 2017;Deng et al., 2018). Yao et al. used the knowledge graph with auto-encoder to detect protein complexes (Yao et al., 2020). ...
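As a rough illustration of the random walk with restart mentioned in the snippet above, here is a minimal sketch on a tiny toy graph. It is not the LPIHN implementation: the four-node network, the restart probability of 0.5, and all function names are assumptions chosen for illustration, and a real heterogeneous network would weight lncRNA-lncRNA, lncRNA-protein, and protein-protein edges separately.

```python
from typing import List, Sequence


def column_normalize(adj: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale each column to sum to 1; all-zero columns are left as zeros."""
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [
        [adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def random_walk_with_restart(
    adj: Sequence[Sequence[float]],
    seeds: Sequence[int],
    restart: float = 0.5,   # assumed value, not taken from the cited work
    tol: float = 1e-10,
    max_iter: int = 1000,
) -> List[float]:
    """Iterate p <- (1 - restart) * W p + restart * p0 until the L1 change < tol."""
    n = len(adj)
    w = column_normalize(adj)
    p0 = [0.0] * n
    for s in seeds:
        p0[s] = 1.0 / len(seeds)
    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            (1.0 - restart) * sum(w[i][j] * p[j] for j in range(n)) + restart * p0[i]
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical toy network: nodes 0-1 are lncRNAs, nodes 2-3 are proteins.
# Edges: 0-1 (lncRNA similarity), 0-2 and 1-3 (interactions), 2-3 (protein link).
adjacency = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]
scores = random_walk_with_restart(adjacency, seeds=[0])
print([round(s, 4) for s in scores])
```

Nodes closer to the seed end up with higher stationary scores, which is the property such methods use to rank candidate proteins for a given lncRNA.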
Article
Full-text available
Non-coding RNAs (ncRNAs) have essential effects on biological processes such as gene regulation. One critical way in which ncRNAs carry out biological functions is through interactions with RNA-binding proteins (RBPs). Identifying the proteins involved in ncRNA-protein interactions helps to clarify the function of ncRNAs. Many high-throughput experiments have been applied to recognize these interactions. Because these approaches are time- and labor-consuming, a great number of computational methods have been developed to advance ncRNA-protein interaction research. However, these methods may not be applicable to all RNAs and proteins, particularly new RNAs and proteins. Additionally, most of them cannot handle long sequences well. In this work, a computational method, SAWRPI, is proposed to predict ncRNA-protein interactions from sequence information. More specifically, the raw features of proteins and ncRNAs are first extracted through the k-mer sparse matrix with SVD reduction and by learning nucleic acid symbols using natural language processing with a local fusion strategy, respectively. Then, to ease classification, the Hilbert transformation is used to map the raw feature data to a new feature space. Finally, a stacking ensemble strategy is adopted to learn high-level abstraction features automatically and generate the final prediction results. To confirm robustness and stability, three different datasets containing two kinds of interactions are used. In comparison with state-of-the-art methods and other classification or feature extraction strategies, SAWRPI achieved high performance on the three datasets, which contain two kinds of lncRNA-protein interactions. Based on our findings, SAWRPI is trustworthy, robust, yet simple, and can serve as a beneficial supplement to the task of predicting ncRNA-protein interactions.
... Ge et al. [27] and Zhao et al. [21] explored two bipartite network-based LPI inference models. Zheng et al. [28] found a few LPIs based on the built multiple protein-protein similarity networks. Zhang et al. [29] used the KATZ measure to identify the linkages between lncRNAs and proteins. ...
... Dataset 1 was compiled by Li et al. [25] and contains 3479 associations between 59 proteins and 935 lncRNAs after our removing lncRNAs and proteins without any sequence information in NONCODE [42], NPInter [43], and UniProt [44] databases. Dataset 2 was built by Zheng et al. [28] and contains 3265 associations from 84 proteins and 885 lncRNAs after preprocessing similar to dataset 1. Dataset 3 was retrieved by Zhang et al. [31] and contains 4158 associations between 27 proteins and 990 lncRNAs. The three datasets are from human. ...
Article
Full-text available
Background Long noncoding RNAs (lncRNAs) have dense linkages with various biological processes. Identifying interacting lncRNA-protein pairs contributes to understanding the functions and mechanisms of lncRNAs. Wet experiments are costly and time-consuming. Most computational methods failed to account for the imbalanced characteristics of lncRNA-protein interaction (LPI) data. More importantly, they were evaluated on a single dataset, which produced prediction bias. Results In this study, we develop an Ensemble framework (LPI-EnEDT) with Extra tree and Decision Tree classifiers to implement imbalanced LPI data classification. First, five LPI datasets are arranged. Second, lncRNAs and proteins are separately characterized based on Pyfeat and BioTriangle and concatenated as a vector to represent each lncRNA-protein pair. Finally, an ensemble framework with Extra tree and decision tree classifiers is developed to classify unlabeled lncRNA-protein pairs. The comparative experiments demonstrate that LPI-EnEDT outperforms four classical LPI prediction methods (LPI-BLS, LPI-CatBoost, LPI-SKF, and PLIPCOM) under cross validations on lncRNAs, proteins, and LPIs. The average AUC values on the five datasets are 0.8480, 0.7078, and 0.9066 under the three cross validations, respectively. The average AUPRs are 0.8175, 0.7265, and 0.8882, respectively. Case analyses suggest that there are underlying associations between HOTTIP and Q9Y6M1, NRON and Q15717. Conclusions By fusing diverse biological features of lncRNAs and proteins and exploiting an ensemble learning model with Extra tree and decision tree classifiers, this work focuses on imbalanced LPI data classification as well as interaction information inference for a new lncRNA (or protein).
... Deng et al. [20] integrated diffusion and HeteSim features on the heterogeneous lncRNA-protein network (PLIPCOM). Zheng et al. [21] fused sequences, domains, GO terms of proteins and the STRING database and built a more informative model. Zhang et al. [22] proposed a linear neighborhood propagation method (LPLNP) for LPI mining. ...
... Each dataset contains lncRNA sequences, protein sequences, and an LPI network. Datasets 1, 2, and 3 were from human and were provided by Li et al. [17], Zheng et al. [21], and Zhang et al. [22], respectively. We preprocess the three datasets by removing lncRNAs and proteins involved in one associated protein (or lncRNA) or without sequence or expression information in UniProt [46], NPInter [47], NONCODE [48], and SUPERFAMILY [49]. ...
Article
Full-text available
Background Long noncoding RNAs (lncRNAs) have dense linkages with a plethora of important cellular activities. lncRNAs exert functions by linking with corresponding RNA-binding proteins. Since experimental techniques to detect lncRNA-protein interactions (LPIs) are laborious and time-consuming, a few computational methods have been reported for LPI prediction. However, computation-based LPI identification methods have the following limitations: (1) Most methods were evaluated on a single dataset, and researchers may thus fail to measure their generalization ability. (2) The majority of methods were validated under cross validation on lncRNA-protein pairs but did not investigate the performance under other cross validations, especially cross validation on independent lncRNAs and independent proteins. (3) lncRNAs and proteins have abundant biological information, and how to select informative features needs further investigation. Results Under a hybrid framework (LPI-HyADBS) integrating feature selection based on AdaBoost, and classification models including deep neural network (DNN), extreme gradient boosting (XGBoost), and SVM with a penalty coefficient of misclassification (C-SVM), this work focuses on finding new LPIs. First, five datasets are arranged. Each dataset contains lncRNA sequences, protein sequences, and an LPI network. Second, biological features of lncRNAs and proteins are acquired based on Pyfeat. Third, the obtained features of lncRNAs and proteins are selected based on AdaBoost and concatenated to depict each LPI sample. Fourth, DNN, XGBoost, and C-SVM are used to classify lncRNA-protein pairs based on the concatenated features. Finally, a hybrid framework is developed to integrate the classification results from the above three classifiers.
LPI-HyADBS is compared to six classical LPI prediction approaches (LPI-SKF, LPI-NRLMF, Capsule-LPI, LPI-CNNCP, LPLNP, and LPBNI) on five datasets under 5-fold cross validations on lncRNAs, proteins, lncRNA-protein pairs, and independent lncRNAs and independent proteins. The results show LPI-HyADBS has the best LPI prediction performance under four different cross validations. In particular, LPI-HyADBS obtains better classification ability than the other six approaches on the constructed independent dataset. Case analyses suggest that there is relevance between ZNF667-AS1 and Q15717. Conclusions By integrating a feature selection approach based on AdaBoost with three classification techniques (DNN, XGBoost, and C-SVM), this work develops a hybrid framework to identify new linkages between lncRNAs and proteins.
... Finally, we obtain 3,479 LPIs from 935 lncRNAs and 59 proteins. Dataset 2 was provided by Zheng et al. 22 . Noncoding RNA-protein interaction, lncRNA and protein sequences were downloaded from NPInter 2.0 67 , NONCODE 4.0 68 , and UniProt 65 , respectively. ...
Article
Full-text available
Long noncoding RNAs (lncRNAs) regulate many biological processes by interacting with corresponding RNA-binding proteins. The identification of lncRNA–protein interactions (LPIs) is important for characterizing the biological functions and mechanisms of lncRNAs. Existing computational methods have been effectively applied to LPI prediction. However, the majority of them were evaluated on only one LPI dataset, thereby resulting in prediction bias. More importantly, some models did not discover possible LPIs for new lncRNAs (or proteins). In addition, the prediction performance remains limited. To solve the above problems, in this study, we develop a Deep Forest-based LPI prediction method (LPIDF). First, five LPI datasets are obtained and the corresponding sequence information of lncRNAs and proteins is collected. Second, features of lncRNAs and proteins are constructed based on four-nucleotide composition and BioSeq2vec with encoder-decoder structure, respectively. Finally, a deep forest model with cascade forest structure is developed to find new LPIs. We compare LPIDF with four classical association prediction models based on three fivefold cross validations on lncRNAs, proteins, and LPIs. LPIDF obtains better average AUCs of 0.9012, 0.6937 and 0.9457, and the best average AUPRs of 0.9022, 0.6860, and 0.9382, respectively, for the three CVs, significantly outperforming other methods. The results show that the lncRNA FTX may interact with the protein P35637 and needs further validation.
... In 2017, Zheng et al. came up with a new method for predicting lncRNA-protein interactions by fusing four protein-protein similarity networks, which were obtained from protein sequences, protein domains, protein functional annotations and the STRING database, respectively (Zheng et al. 2017). As can be seen from Fig. 2 (the workflow of predicting the interactions between lncRNA and protein by fusing multiple protein-protein similarity networks), the similarity network fusion algorithm (Wang et al. 2014) and the HeteSim algorithm (Zeng et al. 2017) are also employed to improve the performance of the model. ...
Article
Full-text available
Recent transcriptomics and bioinformatics studies have shown that ncRNAs can affect chromosome structure and gene transcription, participate in the epigenetic regulation, and take part in diseases such as tumorigenesis. Biologists have found that most ncRNAs usually work by interacting with the corresponding RNA-binding proteins. Therefore, ncRNA-protein interaction is a very popular study in both the biological and medical fields. However, due to the limitations of manual experiments in the laboratory, machine-learning methods for predicting ncRNA-protein interactions are increasingly favored by the researchers. In this review, we summarize several machine learning predictive models of ncRNA-protein interactions over the past few years, and briefly describe the characteristics of these machine learning models. In order to optimize the performance of machine learning models to better predict ncRNA-protein interactions, we give some promising future computational directions at the end.
... The similarity integration strategy proposed in this study is a linear network fusion (LNF) method. In order to verify the superior integration performance of the LNF, we compared LNF with two common similarity fusion strategies: similarity network fusion (SNF) (Zheng et al., 2017) and similarity kernel fusion (SKF) (Jiang et al., 2018;Xie et al., 2019). As shown in Figure 3, based on the LOOCV scheme, we plotted the ROC curve of three different integration methods. ...
Article
Full-text available
Many microbes in the human body play a vital role in cell physiology. In recent years, accumulating evidence has indicated that microbes are closely related to many complex human diseases. In-depth investigation of disease-associated microbes can contribute to understanding the pathogenesis of diseases and thus provide novel strategies for the treatment, diagnosis, and prevention of diseases. To date, many computational models have been proposed for predicting microbe–disease associations using available similarity networks. However, these similarity networks are not effectively fused. In this study, we proposed a novel computational model based on multi-data integration and network consistency projection for Human Microbe–Disease Associations Prediction (HMDA-Pred), which fuses multiple similarity networks by a linear network fusion method. HMDA-Pred yielded AUC values of 0.9589 and 0.9361 ± 0.0037 in the experiments of leave-one-out cross validation (LOOCV) and 5-fold cross validation (5-fold CV), respectively. Furthermore, in case studies, 10, 8, and 10 out of the top 10 predicted microbes of asthma, colon cancer, and inflammatory bowel disease, respectively, were confirmed by the literature.