
Prediction of Protein-Protein Interactions from Protein Sequence Using Local Descriptors

Abstract and Figures

With the huge amount of protein sequence data now available, computational methods for protein-protein interaction (PPI) prediction that use only protein sequence information have drawn increasing interest. In this article, we propose a sequence-based method built on a novel representation of local protein sequence descriptors. Local descriptors account for the interactions between residues in both continuous and discontinuous regions of a protein sequence, so this method enables us to extract more PPI information from the sequence. A series of experiments is performed to optimize the prediction model by varying the parameter k and the distance function of the k-nearest neighbors learning system, as well as the way a protein pair is coded. When applied to the PPI data of Saccharomyces cerevisiae, the method achieved 86.15% prediction accuracy with 81.03% sensitivity at a precision of 90.24%. An independent data set of 986 Escherichia coli PPIs was used to evaluate this prediction model, and the prediction accuracy was 73.02%. Given the complex nature of PPIs, the performance of our method is promising, and it can be a helpful supplement for PPI prediction.
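The tuning described in the abstract (the value of k, the distance function, and the pair coding) can be illustrated with a minimal k-nearest neighbors sketch. Everything named here is an assumption for illustration: the function names, the toy feature vectors, and in particular the choice to code a pair by concatenating the two protein vectors, which the abstract does not specify.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def encode_pair(vec_a, vec_b):
    # Assumption: a protein pair is coded by concatenating the two vectors.
    return list(vec_a) + list(vec_b)

def knn_predict(train_x, train_y, query, k=3, distance=euclidean):
    # Rank training pairs by distance to the query, then vote among the k nearest.
    ranked = sorted(zip(train_x, train_y), key=lambda item: distance(item[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy feature vectors (hypothetical); label 1 = interacting, 0 = non-interacting.
train_x = [
    encode_pair([0.1, 0.2], [0.1, 0.3]),
    encode_pair([0.2, 0.1], [0.2, 0.2]),
    encode_pair([0.9, 0.8], [0.7, 0.9]),
    encode_pair([0.8, 0.9], [0.9, 0.8]),
]
train_y = [0, 0, 1, 1]
query = encode_pair([0.85, 0.8], [0.8, 0.85])
prediction = knn_predict(train_x, train_y, query, k=3, distance=manhattan)
```

Note that concatenation makes the coding order-dependent, so the pair (A, B) and the pair (B, A) get different vectors. This is one reason the way a pair is coded is worth treating as a tunable choice.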
... In general, recent sequence-based methods have focused on identifying new feature extraction methods from sequence information, while others have focused on developing predictive models. For example, Guo et al. [6] proposed using the auto-covariance descriptors (ADs) to convert amino acid sequences within a protein into feature vectors, while other authors such as Yang [7], You [8,9], and Zhou [10] suggested using multi-scale continuous and discontinuous region encoders to transform protein sequences into feature vectors. Considering feature fusion techniques to build higher-quality features for PPI prediction, Chen et al. [11] proposed the LightGBM-PPI model and used a combination of multiple descriptors, including Pseudo-Amino Acid Composition (PseAAC), Autocorrelation (AC), and CT to capture the information in encoding protein sequences. ...
... Inspired by these observations, we propose a novel PPI predictive model called DF-PPI (Deep Fusion-PPI). In our model, we employ a feature extraction step that uses three descriptors: F-vector [27], LD [7], and APAACplus (a new variant of APAAC [28] that we introduce here). To learn protein sequence embeddings, we use Doc2vec. ...
... The local descriptor (LD) was introduced by Yang et al. [7]. The LD encodes information about specific segments in a protein sequence. ...
Article
Understanding protein–protein interactions (PPIs) helps to identify protein functions and supports other important applications such as drug preparation and protein–disease relationship identification. Deep-learning-based approaches are being intensely researched for PPI determination to reduce the cost and time of earlier experimental methods. In this work, we integrate deep learning with feature fusion, harnessing the strengths of both handcrafted features and protein sequence embedding. The accuracies of the proposed model using five-fold cross-validation on the Yeast core and Human datasets are 96.34% and 99.30%, respectively. In the task of predicting interactions in important PPI networks, our model correctly predicted all interactions in one-core, Wnt-related, and cancer-specific networks. The experimental results on cross-species datasets, including Caenorhabditis elegans, Helicobacter pylori, Homo sapiens, Mus musculus, and Escherichia coli, also show that our feature fusion method helps increase the generalization capability of the PPI prediction model.
... In this context, exclusion criteria are often applied to prevent the inclusion of false-negative interactions in the dataset. The main criteria involve adding only negative pairs that have different subcellular localization information [38-43] or adding only negative pairs with minimum structural dissimilarity to positive pairs (reducing the homologous-sequence bias) [21,20,26]. ...
Article
Machine Learning (ML) algorithms have been important tools for the extraction of useful knowledge from biological sequences, particularly in healthcare, agriculture, and the environment. However, the categorical and unstructured nature of these sequences usually requires additional feature engineering steps before an ML algorithm can be efficiently applied. The addition of these steps to the ML algorithm creates a processing pipeline, known as end-to-end ML. Despite the excellent results obtained by applying end-to-end ML to biotechnology problems, the performance obtained depends on the expertise of the user in the components of the pipeline. In this work, we propose an end-to-end ML-based framework called BioPrediction-RPI, which can identify implicit interactions between sequences, such as pairs of non-coding RNA and proteins, without the need for specialized expertise in end-to-end ML. This framework applies feature engineering to represent each sequence by structural and topological features. These features are divided into feature groups and used to train partial models, whose partial decisions are combined into a final decision, which provides insights to the user through an interpretability report. In our experiments, the developed framework was competitive when compared with various expert-created models. We assessed BioPrediction-RPI on 12 datasets, where it presented equal or better performance than all tools in 40% to 100% of cases, depending on the experiment. Finally, BioPrediction-RPI can fine-tune models based on new data and perform at the same level as ML experts, democratizing end-to-end ML and increasing its access to those working in biological sciences.
... Figure 3 describes and illustrates the architecture of the neural network used to learn the embedding matrix. The Amino Acid Encoding algorithm has two stages: the training dataset generation stage (steps 1-3) and the neural network training stage (steps 4-11). Let maxlen be the maximum length of the protein sequences in , from that, the computational complexity of the algorithm is determined by the formula, (maxlen × | | + × | |). ...
Article
Understanding protein-protein interactions (PPIs) helps to identify protein functions and supports other important applications such as drug preparation and protein-disease relationship identification. Machine learning methods have been developed for the PPI prediction task in order to reduce the cost and time of earlier experimental methods. In this paper, we study a method for determining PPIs using deep learning and protein sequence representation learning. In our method, a word embedding technique is used for protein sequence representation learning. This technique captures the semantic relationship between amino acids in protein sequences. The semantic relationship is then used as the input information, which is fed into a neural network to help recognize the interaction signature of the input protein pair. Unlike previous studies, we integrate the protein sequence embedding mechanism into a neural network model, so that the protein sequence embedding is better controlled for PPI prediction by our neural network model. We evaluate our method on benchmark datasets including Yeast, Human, and eight different independent sets. In addition, we conduct an extensive comparison with other existing methods. Our results show that the proposed method is superior to other existing methods and achieves high efficiency in predicting cross-species PPIs. The dataset and our source code are available at https://github.com/thnhub/BoostPPIP.git.
... Supplementary Sections A to G; Figs. S1 to S5; Tables S1 to S3; References [62-77] ...
Article
Background: In real-world drug discovery, human experts typically grasp molecular knowledge of drugs and proteins from multimodal sources including molecular structures, structured knowledge from knowledge bases, and unstructured knowledge from biomedical literature. Existing multimodal approaches in AI drug discovery integrate either structured or unstructured knowledge independently, which compromises the holistic understanding of biomolecules. Besides, they fail to address the missing modality problem, where multimodal information is missing for novel drugs and proteins. Methods: In this work, we present KEDD, a unified, end-to-end deep learning framework that jointly incorporates both structured and unstructured knowledge for vast AI drug discovery tasks. The framework first incorporates independent representation learning models to extract the underlying characteristics from each modality. Then, it applies a feature fusion technique to calculate the prediction results. To mitigate the missing modality problem, we leverage sparse attention and a modality masking technique to reconstruct the missing features based on top relevant molecules. Results: Benefiting from structured and unstructured knowledge, our framework achieves a deeper understanding of biomolecules. KEDD outperforms state-of-the-art models by an average of 5.2% on drug–target interaction prediction, 2.6% on drug property prediction, 1.2% on drug–drug interaction prediction, and 4.1% on protein–protein interaction prediction. Through qualitative analysis, we reveal KEDD’s promising potential in assisting real-world applications. Conclusions: By incorporating biomolecular expertise from multimodal knowledge, KEDD bears promise in accelerating drug discovery.
... The application of computational methods for predicting Protein-Protein Interactions (PPI) can be divided into two main stages. The initial phase was dominated by Machine Learning technologies [25], involving the construction of linear relationships and the training of classifiers [26], including Weighted Sparse Representation-based Classifier [27], SVM (Support Vector Machine) [28-30], Random Forest [31], Rotation Forest [32], KNN (K-Nearest Neighbors) [33], Extreme Learning Machine (ELM) [34], and other Support Vector Machines [35]. ...
Article
Purpose: Sequence-based Protein–Protein Interaction (PPI) prediction is a pivotal area of study in biology, playing a crucial role in elucidating the mechanistic underpinnings of diseases and facilitating the design of novel therapeutic interventions. Conventional methods for extracting features through experimental processes have proven to be both costly and exceedingly complex. In light of these challenges, the scientific community has turned to computational approaches, particularly those grounded in deep learning methodologies. Despite the progress achieved by current deep learning technologies, their effectiveness diminishes when applied to larger, unfamiliar datasets. Results: This paper introduces a novel deep learning framework, termed DL-PPI, for predicting PPIs based on sequence data. The proposed framework comprises two key components aimed at improving the accuracy of feature extraction from individual protein sequences and capturing relationships between proteins in unfamiliar datasets. (1) Protein Node Feature Extraction Module: To enhance the accuracy of feature extraction from individual protein sequences and facilitate the understanding of relationships between proteins in unknown datasets, we devised a novel protein node feature extraction module utilizing the Inception method. This module efficiently captures relevant patterns and representations within protein sequences, enabling more informative feature extraction. (2) Feature-Relational Reasoning Network (FRN): In the Global Feature Extraction module of our model, we developed a novel FRN that leverages Graph Neural Networks to determine interactions between pairs of input proteins. The FRN effectively captures the underlying relational information between proteins, contributing to improved PPI predictions. The DL-PPI framework demonstrates state-of-the-art performance in sequence-based PPI prediction.
... Seven different physicochemical properties of amino acids were considered, including hydrophobicity, hydrophilicity, side chain volume, polarity, polarizability, solvent-accessible surface area and net charge index of side chains. Yang et al. [64] presented a computational method that extracted sequence features by utilizing local descriptors (LD, including composition, transition, and distribution), which considered the effect of discontinuous amino acids but ignored global information. Zhou et al. [65] utilized the codon pair frequency difference to identify PPIs and showed comparable performance to other sequence-based methods. ...
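The three LD components named above (composition, transition, and distribution) can be sketched for a single sequence region as follows. This is an illustrative sketch, not the cited implementation: the seven-class amino acid grouping, the function name `ctd_descriptor`, and the five distribution checkpoints (first, 25%, 50%, 75%, last) are assumptions, and the splitting of a protein into continuous and discontinuous regions is not shown.

```python
import math

# Assumption: a seven-class grouping of the 20 standard amino acids.
# The source text does not state which grouping is used.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_GROUP = {aa: i for i, members in enumerate(GROUPS) for aa in members}

def ctd_descriptor(region):
    """Composition, transition and distribution features for one region."""
    codes = [AA_TO_GROUP[aa] for aa in region if aa in AA_TO_GROUP]
    n = len(codes)
    if n == 0:
        raise ValueError("region contains no standard amino acids")
    n_groups = len(GROUPS)

    # Composition: fraction of residues that fall in each group.
    composition = [codes.count(g) / n for g in range(n_groups)]

    # Transition: frequency of adjacent residues from two different groups,
    # with both directions (g1 -> g2 and g2 -> g1) pooled.
    adjacent = list(zip(codes, codes[1:]))
    transition = []
    for g1 in range(n_groups):
        for g2 in range(g1 + 1, n_groups):
            hits = sum(1 for a, b in adjacent if {a, b} == {g1, g2})
            transition.append(hits / (n - 1) if n > 1 else 0.0)

    # Distribution: relative position of the first, 25%, 50%, 75% and last
    # residue of each group (0.0 when the group does not occur).
    distribution = []
    for g in range(n_groups):
        positions = [i + 1 for i, c in enumerate(codes) if c == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                distribution.append(0.0)
            else:
                idx = max(math.ceil(frac * len(positions)) - 1, 0)
                distribution.append(positions[idx] / n)

    return composition + transition + distribution

vec = ctd_descriptor("AGVILC")  # toy region, for illustration only
```

With seven groups, this yields 7 composition, 21 transition, and 35 distribution values per region. A full protein vector would then be built by concatenating the per-region vectors.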
Article
Predicting potential protein-protein interactions and non-interactions is vital for studying the mechanisms of protein function. Traditional experimental technologies are expensive, time-consuming and laborious. Numerous computational methods have been developed to detect potential interacting and non-interacting protein partners. This paper reviews recent advances in effective computational models for predicting protein-protein interactions and non-interactions. We classified the computational methods into five categories based on the type of protein information used and introduced the main ideas, advantages and disadvantages of the algorithms in each category. To obtain a high-quality dataset, we analyzed the collection methods and composition of positive and negative samples in detail and described some applications of real non-interacting protein pairs. Finally, we summarized some challenges and open issues for future work.
Article
Proteins interact with diverse ligands to perform a large number of biological functions, such as gene expression and signal transduction. Accurate identification of these protein–ligand interactions is crucial to the understanding of molecular mechanisms and the development of new drugs. However, traditional biological experiments are time-consuming and expensive. With the development of high-throughput technologies, an increasing amount of protein data is available. In the past decades, many computational methods have been developed to predict protein–ligand interactions. Here, we review a comprehensive set of over 160 protein–ligand interaction predictors, which cover protein–protein, protein−nucleic acid, protein−peptide and protein−other ligands (nucleotide, heme, ion) interactions. We have carried out a comprehensive analysis of the above four types of predictors from several significant perspectives, including their inputs, feature profiles, models, availability, etc. The current methods primarily rely on protein sequences, especially utilizing evolutionary information. The significant improvement in predictions is attributed to deep learning methods. Additionally, sequence-based pretrained models and structure-based approaches are emerging as new trends.
Article
Exploring protein-protein interaction (PPI) is of paramount importance for elucidating the intrinsic mechanism of various biological processes. Nevertheless, experimental determination of PPI can be both time-consuming and expensive, motivating the exploration of data-driven deep learning technologies as a viable, efficient, and accurate alternative. Nonetheless, most current deep learning-based methods regarded a pair of proteins to be predicted for possible interaction as two separate entities when extracting PPI features, thus neglecting the knowledge sharing among the collaborative protein and the target protein. Aiming at the above issue, a collaborative learning framework CollaPPI was proposed in this study, where two kinds of collaboration, i.e., protein-level collaboration and task-level collaboration, were incorporated to achieve not only the knowledge-sharing between a pair of proteins, but also the complementation of such shared knowledge between biological domains closely related to PPI (i.e., protein function, and subcellular location). Evaluation results demonstrated that CollaPPI obtained superior performance compared to state-of-the-art methods on two PPI benchmarks. Besides, evaluation results of CollaPPI on the additional PPI type prediction task further proved its excellent generalization ability.
Conference Paper
Protein-protein interaction (PPI) is vital for understanding protein functions and various cellular biological functions like DNA replication and transcription, signaling cascades, metabolic cycles, and metabolism. Various experimental techniques exist for detecting protein-protein interactions, e.g., mass spectrometry, protein arrays, and yeast two-hybrid, but these techniques are expensive and tedious, so computational approaches are needed to facilitate the prediction of protein-protein interactions. Computational methods offer a low-cost way to discover protein interactions and complement experimental methods. Methods based only on primary sequence data are more generic than methods based on additional details or protein-specific assumptions. This paper proposes a sequence-based model that combines local descriptors with Shannon entropy and the Hurst exponent to detect PPIs. Here, features are extracted directly from primary sequences, and the Support Vector Machine algorithm is used as a classifier. On the DIP (Database of Interacting Proteins) dataset, the proposed model gives 96.71% accuracy with 94.94% precision and 98.58% recall. The findings indicate that the proposed model performs better than various state-of-the-art predictors for protein-protein interactions.
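As an illustration of one of the features mentioned above, the Shannon entropy of a sequence can be computed from its residue frequencies. This is a minimal sketch using the textbook definition. The abstract does not say how the entropy is actually computed over the sequence, the function name is hypothetical, and the Hurst exponent is not shown.

```python
import math
from collections import Counter

def shannon_entropy(sequence):
    """Shannon entropy (in bits) of the residue frequency distribution."""
    counts = Counter(sequence)
    n = len(sequence)
    # Sum p * log2(1 / p) over the residue types that occur in the sequence.
    return sum((c / n) * math.log2(n / c) for c in counts.values())

entropy_uniform = shannon_entropy("ACDE")  # four equally frequent residues
entropy_single = shannon_entropy("AAAA")   # a single residue type
```

A sequence made of one repeated residue has zero entropy, and the value grows as the residue composition becomes more even.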
Article
It is well known that most of the binding free energy of protein interaction is contributed by a few key hot spot residues. These residues are crucial for understanding the function of proteins and studying their interactions. Experimental hot spot detection methods such as alanine scanning mutagenesis are not applicable on a large scale since they are time-consuming and expensive. Therefore, reliable and efficient computational methods for identifying hot spots are greatly desired and urgently required. In this work, we introduce an efficient approach that uses support vector machine (SVM) to predict hot spot residues in protein interfaces. We systematically investigate a wide variety of 62 features from a combination of protein sequence and structure information. Then, to remove redundant and irrelevant features and improve the prediction performance, feature selection is employed using the F-score method. Based on the selected features, nine individual-feature based predictors are developed to identify hot spots using SVMs. Furthermore, a new ensemble classifier, namely APIS (A combined model based on Protrusion Index and Solvent accessibility), is developed to further improve the prediction accuracy. The results on two benchmark datasets, ASEdb and BID, show that this proposed method yields significantly better prediction accuracy than those previously published in the literature. In addition, we also demonstrate the predictive power of our proposed method by modelling two protein complexes: the calmodulin/myosin light chain kinase complex and the heat shock locus gene products U and V complex, which indicates that our method can identify more hot spots in these two complexes compared with other state-of-the-art methods. We have developed an accurate prediction model for hot spot residues, given the structure of a protein complex.
A major contribution of this study is to propose several new features based on the protrusion index of amino acid residues, which has been shown to significantly improve the prediction performance of hot spots. Moreover, we identify a compact and useful feature subset that has an important implication for identifying hot spot residues. Our results indicate that these features are more effective than the conventional evolutionary conservation, pairwise residue potentials and other traditional features considered previously, and that the combination of our and traditional features may support the creation of a discriminative feature set for efficient prediction of hot spot residues. The data and source code are available on web site http://home.ustc.edu.cn/~jfxia/hotspot.html.
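The F-score feature selection mentioned in this abstract can be sketched as below, assuming it refers to the widely used feature-ranking statistic that divides the between-class separation of a feature by the sum of its within-class sample variances. The abstract does not give the formula, so this reading, the function name, and the toy values are assumptions.

```python
from statistics import mean, variance

def f_score(pos_values, neg_values):
    """F-score of one feature from its values in the two classes."""
    pos_values, neg_values = list(pos_values), list(neg_values)
    overall = mean(pos_values + neg_values)
    # Between-class term: squared distance of each class mean from the overall mean.
    between = (mean(pos_values) - overall) ** 2 + (mean(neg_values) - overall) ** 2
    # Within-class term: sample variances (n - 1 denominator) of each class.
    within = variance(pos_values) + variance(neg_values)
    if within == 0:
        # Degenerate case: no spread inside either class.
        return 0.0 if between == 0 else float("inf")
    return between / within

# Toy feature values (hypothetical): hot spots vs. non-hot spots.
score = f_score([2.0, 4.0], [0.0, 2.0])
```

Features would then be ranked by this score and the top-ranked ones kept. Each class needs at least two values, because the sample variance is undefined for a single observation.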
Article
We propose a sequence-based multiple classifier system, i.e., rotation forest, to infer protein-protein interactions (PPIs). Moreover, the Moran autocorrelation descriptor is used to encode an interacting protein pair. Experimental results on Saccharomyces cerevisiae and Helicobacter pylori datasets show that our approach outperforms those previously published in the literature, which demonstrates the effectiveness of the proposed method.
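A minimal sketch of the Moran autocorrelation at a given lag is shown below, assuming the common definition: the lagged covariance of a per-residue property divided by the property's variance along the sequence. The function name is hypothetical, and mapping residues to a physicochemical property scale is left to the caller because the abstract names no scale.

```python
def moran_autocorrelation(values, lag):
    """Moran autocorrelation of per-residue property values at one lag."""
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(values)")
    mean = sum(values) / n
    # Variance of the property over the whole sequence (denominator).
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0:
        return 0.0  # constant property profile: no correlation signal
    # Average product of deviations for residues separated by `lag` positions.
    cov = sum((values[i] - mean) * (values[i + lag] - mean)
              for i in range(n - lag)) / (n - lag)
    return cov / var

# Toy property profile (hypothetical values), alternating along the sequence.
profile = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
lag1 = moran_autocorrelation(profile, 1)
lag2 = moran_autocorrelation(profile, 2)
```

For the alternating toy profile, neighbors are perfectly anti-correlated at lag 1 and perfectly correlated at lag 2. A descriptor vector is typically formed by evaluating several lags for each property.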
Article
Based on pseudo amino acid (PseAA) composition and a novel hybrid feature selection framework, this paper presents a computational system to predict PPIs (protein-protein interactions) using 8796 protein pairs. These pairs are coded by PseAA composition, resulting in 114 features. A hybrid feature selection system, mRMR-KNNs-wrapper, is applied to obtain an optimized feature set by excluding poorly performing and/or redundant features, leaving 103 features. Using the optimized 103-feature subset, a prediction model is trained and tested in the k-nearest neighbors (KNNs) learning system. This prediction model achieves an overall prediction accuracy of 76.18%, evaluated by a 10-fold cross-validation test, which is 1.46% higher than with the initial 114 features and 6.51% higher than with the 20 features coded by amino acid composition. The PPI predictor developed for this research is available for public use at http://chemdata.shu.edu.cn/ppi.
Article
Identification of protein interaction sites has significant impact on understanding protein function, elucidating signal transduction networks and drug design studies. With the exponentially growing protein sequence data, predictive methods using sequence information only for protein interaction site prediction have drawn increasing interest. In this article, we propose a predictive model for identifying protein interaction sites. Without using any structure data, the proposed method extracts a wide range of features from protein sequences. A random forest-based integrative model is developed to effectively utilize these features and to deal with the imbalanced data classification problem commonly encountered in binding site predictions. We evaluate the predictive method using 2829 interface residues and 24,616 non-interface residues extracted from 99 polypeptide chains in the Protein Data Bank. The experimental results show that the proposed method performs significantly better than two other sequence-based predictive methods and can reliably predict residues involved in protein interaction sites. Furthermore, we apply the method to predict interaction sites and to construct three protein complexes: the DnaK molecular chaperone system, 1YUW and 1DKG, which provide new insight into the sequence-function relationship. We show that the predicted interaction sites can be valuable as a first approach for guiding experimental methods investigating protein-protein interactions and localizing the specific interface residues. Datasets and software are available at http://ittc.ku.edu/~xwchen/bindingsite/prediction.
Article
Many proteins have evolved to form specific molecular complexes and the specificity of this interaction is essential for their function. The network of the necessary inter-residue contacts must consequently constrain the protein sequences to some extent. In other words, the sequence of an interacting protein must reflect the consequence of this process of adaptation. It is reasonable to assume that the sequence changes accumulated during the evolution of one of the interacting proteins must be compensated by changes in the other. Here we apply a method for detecting correlated changes in multiple sequence alignments to a set of interacting protein domains and show that positions where changes occur in a correlated fashion in the two interacting molecules tend to be close to the protein-protein interfaces. This leads to the possibility of developing a method for predicting contacting pairs of residues from the sequence alone. Such a method would not need the knowledge of the structure of the interacting proteins, and hence would be both radically different and more widely applicable than traditional docking methods. We indeed demonstrate here that the information about correlated sequence changes is sufficient to single out the right inter-domain docking solution amongst many wrong alternatives of two-domain proteins. The same approach is also used here in one case (haemoglobin) where we attempt to predict the interface of two different proteins rather than two protein domains. Finally, we report here a prediction about the inter-domain contact regions of the heat-shock protein Hsc70 based only on sequence information.
Article
LIBSVM is a library for support vector machines (SVM). Its goal is to help users to easily use SVM as a tool. In this document, we present all its implementation details. For the use of LIBSVM, the README file included in the package and the LIBSVM FAQ provide the information.
Article
The annotation of proteins can be achieved by classifying the protein of interest into a certain known protein family to induce its functional and structural features. This paper presents a new method for classifying protein sequences based upon the hydropathy blocks occurring in protein sequences. First, a fixed-dimensional feature vector is generated for each protein sequence using the frequency of the hydropathy blocks occurring in the sequence. Then, the support vector machine (SVM) classifier is utilized to classify the protein sequences into the known protein families. The experimental results have shown that the proteins belonging to the same family or subfamily can be identified using features generated from the hydropathy blocks.
Article
Deciphering the network of protein interactions that underlies cellular operations has become one of the main tasks of proteomics and computational biology. Recently, a set of bioinformatics approaches has emerged for the prediction of possible interactions by combining sequence and genomic information. Even though the initial results are very promising, the current methods are still far from perfect. We propose here a new way of discovering possible protein–protein interactions based on the comparison of the evolutionary distances between the sequences of the associated protein families, an idea based on previous observations of correspondence between the phylogenetic trees of associated proteins in systems such as ligands and receptors. Here, we extend the approach to different test sets, including the statistical evaluation of their capacity to predict protein interactions. To demonstrate the possibilities of the system to perform large-scale predictions of interactions, we present the application to a collection of more than 67 000 pairs of E. coli proteins, of which 2742 are predicted to correspond to interacting proteins.