Distribution chart of six-nuclei CSs in beta-hairpin and not beta-hairpin motifs. doi:10.1371/journal.pone.0139280.g001 


Source publication
Article
Full-text available
Successful prediction of the beta-hairpin motif will be helpful for understanding fold recognition. Some algorithms have been proposed for the prediction of beta-hairpin motifs. However, the parameters used by these methods were primarily based on the amino acid sequences. Here, we proposed a novel model for predicting beta-hairpin struc...

Context in source publication

Context 1
... analyzed the average chemical shifts of six nuclei in the beta-hairpin and not beta-hairpin datasets. As shown in Fig 1, we found that the distributions of the CSs of the six nuclei differ between the beta-hairpin and not beta-hairpin datasets. The average chemical shift values of the C, Cα, Cβ, Hα, and N nuclei are higher in the not beta-hairpin dataset than in the beta-hairpin dataset. ...

Similar publications

Article
Full-text available
While atherosclerotic cardiovascular disease (ASCVD) is known to be common among modern people exposed to various risk factors, recent paleopathological studies have shown that it affected ancient populations much more frequently than expected. In 2010, we investigated a 17th century Korean female mummy with presumptive ASCVD signs. Although the re...
Article
Full-text available
Glycosylation is one of the most complex post-translational modifications in eukaryotic cells. Almost 50% of the human proteome is glycosylated, as glycosylation plays a vital role in various biological functions such as antigen recognition, cell-cell communication, gene expression, and protein folding. It is a significant challenge to identify gl...

Citations

... Both algorithms have also been used in the classification of peptides and proteins, for example, LightGBM has been used in the prediction of anti-cancer peptides [66], protein structural class [67], protein-protein interactions [68], protein-ATP binding residues [69], and ion channels [70], among others. On the other hand, QDA has been used in the prediction of tumor T-cell antigens [71], antimicrobial peptides [72,73], protein motifs [74], and protein subcellular location [75], among others. ...
Article
Full-text available
Protein toxins are defense mechanisms and adaptations found in various organisms and microorganisms, and their use in scientific research as therapeutic candidates is gaining relevance due to their effectiveness and specificity against cellular targets. However, discovering these toxins is time-consuming and expensive. In silico tools, particularly those based on machine learning and deep learning, have emerged as valuable resources to address this challenge. Existing tools primarily focus on binary classification, determining whether a protein is a toxin or not, and occasionally identifying specific types of toxins. For the first time, we propose a novel approach capable of classifying protein toxins into 27 distinct categories based on their mode of action within cells. To accomplish this, we assessed multiple machine learning techniques and found that an ensemble model incorporating the Light Gradient Boosting Machine and Quadratic Discriminant Analysis algorithms exhibited the best performance. During the tenfold cross-validation on the training dataset, our model exhibited notable metrics: 0.840 accuracy, 0.827 F1 score, 0.836 precision, 0.840 sensitivity, and 0.989 AUC. In the testing stage, using an independent dataset, the model achieved 0.846 accuracy, 0.838 F1 score, 0.847 precision, 0.849 sensitivity, and 0.991 AUC. These results present a powerful next-generation tool called MultiToxPred 1.0, accessible through a web application. We believe that MultiToxPred 1.0 has the potential to become an indispensable resource for researchers, facilitating the efficient identification of protein toxins. By leveraging this tool, scientists can accelerate their search for these toxins and advance their understanding of their therapeutic potential.
... Preprocessing techniques like dimension reduction and class balancing can be useful to improve the performance of the prediction model (Liang, Ma, Yang, Wang, & Ma, 2018). In some of the recent studies of protein subcellular localization, Linear Discriminant Analysis (LDA) has been applied for the dimension reduction (Feng & Kou, 2015) (Wang & Yang, 2009) of protein feature vector. To improve the reliability and performance of the model, Synthetic Minority Oversampling Technique (SMOTE) is applied (Xiao, Cheng, Chen, Mao, & Chou, 2019) in the skewed distribution of the PSL dataset (Soleimani & Miller, 2019). ...
... Features based on the physicochemical (Biochem & Professi, 1986) (Chou, 2000) (Chou, 2005) and evolutionary (Chen, Hu, & Xue, 2019) information are used for the prediction of subcellular localization. Discriminant analysis (Feng & Kou, 2015) has been applied for the analysis of gene expression data (Liang et al., 2018) and the prediction of protein subcellular localization from Gram-Negative bacterial protein sequence data (Wang & Yang, 2009). Different oversampling methods were applied for the protein subcellular localization of the apoptosis, bacterial (Xiao et al., 2019), and ZD98 (Zhang & Duan, 2018) protein sequence datasets, although subcellular localization datasets are generally skewed and every localization has similar importance. ...
... Various feature extraction techniques are applied in this work to extract sequence, physicochemical, and evolutionary information from the protein sequences, which results in a high-dimensional feature vector. A high-dimensional feature vector preserves the sequence order information and protein residue properties of the protein sequences, but at the same time, the less discriminative and redundant behavior of some descriptors can affect the performance of the model (Feng & Kou, 2015). In this paper, LDA is applied to remove the redundant and inadequate features (Wang & Liu, 2015). ...
Article
Functional characterization of the Unknown Protein Sequence (UPS) is significant for biological research such as disease diagnosis and drug design. In this work, a Neuro-Fuzzy based machine learning framework is proposed with two levels of prediction, for functional characterization of UPS using augmented features and subcellular localization of the protein sequences. In the first level, the Neuro-Fuzzy Approach (NFA) is applied for the categorization of UPS as bacterial or non-bacterial protein sequences. NFA is able to overcome the likelihood prediction problem of supervised learning algorithms with untrained samples such as UPS. In the next level, functions of bacterial protein sequences are characterized using a Protein Subcellular Localization (PSL) model. Physicochemical and evolutionary information of the protein sequences is extracted and augmented as a protein feature vector that preserves the protein-residue-property and sequence-order-information. Various individual and ensemble classifiers such as Decision-Tree (C-4.5), k-Nearest-Neighbor (k-NN), Multi-Layer-Perceptron (MLP), Naïve-Bayes (NB), AdaBoost, and Gradient-Boosting-Machine (GBM) are used for the formation of the PSL model. The PSL model is trained with augmented features of the known Gram-Negative Bacterial Protein Sequence (GN_BPS) dataset with 10-fold cross-validation, and 97.94% accuracy is achieved through the C-4.5 classifier. The validated PSL model is further utilized for the functional characterization of the Unknown G- Bacterial Protein Sequences (Unk_GN_BPS) such as the Unk_GN_156 and Unk_GN_61 datasets. The accuracy achieved for Unk_GN_156 is 78.20% with C-4.5 and 79.32% for Unk_GN_61 through k-NN.
... Of note, other classification methods were also tested (as shown in S1 Text and S1 Fig), but LDA obtained the highest performance scores among all of them. To the best of our knowledge, such a method has only been used once before in protein NMR, to detect beta-hairpin regions [17]. ...
Article
Full-text available
NMR spectroscopy is key in the study of intrinsically disordered proteins (IDPs). Yet, even the first step in such an analysis, the assignment of observed resonances to particular nuclei, is often problematic due to low peak dispersion in the spectra of IDPs. We show that the assignment process can be aided by finding “hidden” chemical shift patterns specific to the amino acid residue types. We find such patterns in the training data from the Biological Magnetic Resonance Bank using linear discriminant analysis, and then use them to classify spin systems in an α-synuclein sample prepared by us. We describe two situations in which the procedure can greatly facilitate the analysis of NMR spectra. The first involves the mapping of spin systems chains onto the protein sequence, which is part of the assignment procedure, a prerequisite for any NMR-based protein analysis. In the second, the method supports assignment transfer between similar samples. We conducted experiments to demonstrate these cases, and both times the majority of spin systems could be unambiguously assigned to the correct residue types.
... Below, we propose a statistical method based on Linear Discriminant Analysis (LDA) for the automatic recognition of amino acid residue types. To the best of our knowledge, such a method has only been used once before in protein NMR, to detect beta-hairpin regions [17]. ...
Preprint
NMR spectroscopy is key in the study of intrinsically disordered proteins (IDPs). Yet, even the first step in such an analysis, the assignment of observed resonances to particular nuclei, is often problematic due to low peak dispersion in the spectra of IDPs. We show that the assignment process can be aided by finding “hidden” chemical shift patterns specific to the amino acid residue types. We find such patterns in the training data from the Biological Magnetic Resonance Bank using linear discriminant analysis, and then use them to classify spin systems in an α-synuclein sample prepared by us. We describe two situations in which the procedure can greatly facilitate the analysis of NMR spectra. The first involves the mapping of spin systems chains onto the protein sequence, which is part of the assignment procedure, a prerequisite for any NMR-based protein analysis. In the second, the method supports assignment transfer between similar samples. We conducted experiments to demonstrate these cases, and both times the majority of spin systems could be unambiguously assigned to the correct residue types.
Author summary: Intrinsically disordered proteins dynamically change their conformation, which allows them to fulfil many biologically significant functions, mostly related to process regulation. Their relation to many civilization diseases makes them essential objects to study. Nuclear magnetic resonance spectroscopy (NMR) is the method for such research, as it provides atomic-scale information on these proteins. However, the first step of the analysis, the assignment of experimentally measured NMR chemical shifts to particular atoms of the protein, is more complex than in the case of structured proteins. The methods routinely used for structured proteins are no longer sufficient. We have developed a method of resolving ambiguities occurring during the assignment process. In a nutshell, we show that an advanced statistical method known as linear discriminant analysis makes it possible to determine chemical shift patterns specific to different types of amino acid residues. It allows assigning the chemical shifts more efficiently, opening the way to a plethora of structural and dynamical information on intrinsically disordered proteins.
... The support vector machine algorithm can be used to identify the β-hairpin in enzymes, where it participates in the formation of ligand binding sites [129]. The chemical shift function and quadratic discriminant analysis of experimental NMR data are robust algorithms for predicting the beta hairpin [130]. The HTHquery web service (http://www.ebi.ac.uk/thornton-srv/databases/HTHquery; accessed on 21 July 2021) can be used to predict helical pairs. ...
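As a rough illustration of the quadratic discriminant analysis mentioned in the snippet above, the following minimal sketch fits a one-feature QDA classifier in plain Python. It is a simplification under stated assumptions, not the cited method: the cited work uses several chemical-shift features, whereas this sketch uses a single feature, and the function names (`fit_qda_1d`, `qda_score`, `predict`), class labels, and toy numbers are all hypothetical.

```python
import math
from statistics import mean, pvariance


def fit_qda_1d(samples_by_class):
    """Estimate (prior, mean, variance) for each class from 1-D samples."""
    n_total = sum(len(xs) for xs in samples_by_class.values())
    return {
        label: (len(xs) / n_total, mean(xs), pvariance(xs))
        for label, xs in samples_by_class.items()
    }


def qda_score(x, prior, mu, var):
    """Log prior plus Gaussian log-likelihood (shared constant dropped).

    Each class keeps its own variance, which makes the decision
    boundary quadratic in x rather than linear.
    """
    return math.log(prior) - 0.5 * math.log(var) - (x - mu) ** 2 / (2 * var)


def predict(model, x):
    """Return the class label with the highest discriminant score."""
    return max(model, key=lambda label: qda_score(x, *model[label]))


# Arbitrary toy values for illustration only (not measured chemical shifts).
toy_data = {
    "class_0": [1.0, 1.2, 0.8, 1.1],
    "class_1": [2.0, 2.5, 1.5, 2.2],
}
model = fit_qda_1d(toy_data)
print(predict(model, 0.9), predict(model, 2.3))
```

The sketch assumes every class has non-zero variance; a multivariate version would replace the variance with a per-class covariance matrix.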
Article
Full-text available
Proteins expressed during the cell cycle determine cell function, topology, and responses to environmental influences. The development and improvement of experimental methods in the field of structural biology provide valuable information about the structure and functions of individual proteins. This work is devoted to the study of supersecondary structures of proteins and determination of their structural motifs, description of experimental methods for their detection, databases, and repositories for storage, as well as methods of molecular dynamics research. The interest in the study of supersecondary structures in proteins is due to their autonomous stability outside the protein globule, which makes it possible to study folding processes, conformational changes in protein isoforms, and aberrant proteins with high productivity.
... Some recent studies have been dedicated to dimension reduction algorithms for protein subcellular localization [11]. Discriminant analysis is applied to a combined feature vector to develop a protein secondary structure prediction model [39]. Linear discriminant analysis is utilized to improve the performance of the protein subcellular localization prediction model [40]. ...
Article
Advances in high-throughput techniques have produced a large number of unknown protein sequences (UPS). Functional characterization of UPS is significant for the investigation of disease symptoms and drug repositioning. Protein subcellular localization is imperative for the functional characterization of protein sequences. Diverse techniques are used on protein sequences for feature extraction. However, a single feature extraction technique often leads to poor prediction performance. In this paper, two feature augmentations are described through sequence-induced, physicochemical, and evolutionary information of the amino acid residues. The augmented features preserve the sequence-order-information and protein-residue-properties. Two bacterial protein datasets, Gram-Positive (G+) and Gram-Negative (G-), are utilized for the experimental work. After performing essential preprocessing on the protein datasets, two sets of feature vectors are obtained. These feature vectors are used separately to train different individual and ensemble classifiers such as decision tree (C4.5), k-nearest neighbor (k-NN), multi-layer perceptron (MLP), Naïve Bayes (NB), support vector machine (SVM), AdaBoost, gradient boosting machine (GBM), and random forest (RF) with fivefold cross-validation. Prediction results of the model demonstrate that the highest overall accuracy is reported by C4.5: 99.57% on the G+ and 97.47% on the G- datasets with known protein sequences. Similarly, for the UPS, the overall accuracy is 85.17% on G+ with SVM and 82.45% on G- with MLP.
... While we have obtained encouraging results, we believe the performance can be further improved by using different deep neural network models and additional features. For instance, the chemical shift information has been used in many protein structure studies [35][36][37]. We consider using such information to further improve the model performance in future work. ...
Article
Full-text available
Protein gamma-turn prediction is useful in protein function studies and experimental design. Several methods for gamma-turn prediction have been developed, but the results were unsatisfactory, with Matthews correlation coefficients (MCC) around 0.2–0.4. Hence, it is worthwhile exploring new methods for the prediction. A cutting-edge deep neural network, named Capsule Network (CapsuleNet), provides a new opportunity for gamma-turn prediction. Even when the number of input samples is relatively small, the capsules from CapsuleNet are effective at extracting high-level features for classification tasks. Here, we propose a deep inception capsule network for gamma-turn prediction. Its performance on the gamma-turn benchmark GT320 achieved an MCC of 0.45, which significantly outperformed the previous best method with an MCC of 0.38. This is the first gamma-turn prediction method utilizing deep neural networks. Also, to our knowledge, it is the first published bioinformatics application utilizing a capsule network, which will provide a useful example for the community. Executable and source code can be downloaded at http://dslsrv8.cs.missouri.edu/~cf797/MUFoldGammaTurn/download.html.
... Because existing algorithms, namely support vector machine [28,29], random forest [30,31], covariance discriminant (CD) [32], neural network [33], conditional random field [34], SLLE algorithm [35], K-nearest neighbor [36], OET-KNN [36], fuzzy K-nearest neighbor [37], and ML-KNN algorithm [38], can only handle vectors of equal length [37,38], the notion of pseudo k-tuple nucleotide composition (PseKNC) [39][40][41] was adopted to express protein sequences as equal-length vectors that retain sequence order information. The effectiveness of PseKNC has further been demonstrated in predicting recombination spots [41], predicting nucleosome [41], identifying splicing sites [42], identifying translation initiation site [42], predicting promoters [41], identifying RNA and DNA modification [43,44], identifying origin of replication [45], and others [46]. The simplest and easiest way to represent a DNA sample by using a discrete model is nucleic acid composition (NAC), which is given below: ...
Article
Background and Objectives: Enhancers are pivotal DNA elements that are widely used in eukaryotes for the activation of gene transcription. On the basis of enhancer strength, they are further classified into two groups: strong enhancers and weak enhancers. Given the large number of available DNA sequences, a fast, reliable, and robust intelligent computational method is needed that not only identifies enhancers but also determines their strength. Considerable progress has been achieved in this regard; however, timely and precise identification of enhancers is still a challenging task. Methods: A two-level intelligent computational model for the identification of enhancers and their subgroups is proposed. Two feature extraction techniques, di-nucleotide composition and tri-nucleotide composition, were adopted for the extraction of numerical descriptors. Four classification methods, namely probabilistic neural network, support vector machine, k-nearest neighbor, and random forest, were utilized for classification. Results: The proposed method yielded 77.25% accuracy for dataset S1, which contains enhancers and non-enhancers, and 64.70% accuracy for dataset S2, which comprises strong enhancer and weak enhancer sequences, using the jackknife cross-validation test. Conclusion: The predictive results validated that the proposed method is better than the existing approaches reported in the literature so far. It is thus anticipated that the developed method will be useful and expedient for basic research and academia.
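The di-nucleotide and tri-nucleotide composition features named in the abstract above can be sketched as normalized counts of overlapping k-mers. This is a minimal sketch under assumptions, not the authors' implementation: the function name `kmer_composition`, the choice to skip windows containing characters outside `ACGT`, and the lexicographic ordering of the output are all illustrative choices.

```python
from itertools import product


def kmer_composition(seq, k=2, alphabet="ACGT"):
    """Normalized frequencies of all overlapping k-mers of a DNA sequence.

    Returns a list of len(alphabet) ** k values in lexicographic k-mer
    order, so sequences of any length map to vectors of equal length.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


# Hypothetical input sequence used only to show the call shape.
dinuc = kmer_composition("ACGTACGT", k=2)   # 16 features
trinuc = kmer_composition("ACGTACGT", k=3)  # 64 features
print(len(dinuc), len(trinuc))
```

With k=2 the vector has 16 entries and with k=3 it has 64, which is why such descriptors can be fed to fixed-input classifiers regardless of sequence length.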
... (Jia et al., 2013). In 2015, a quadratic discriminant algorithm based on chemical shifts was developed to identify beta-hairpin motifs, obtaining a sensitivity of 92%, a specificity of 94%, and a Matthews correlation coefficient of 0.85 (Feng and Kou, 2015). ...
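The three metrics quoted in the snippet above (sensitivity, specificity, and the Matthews correlation coefficient) follow from the four counts of a binary confusion matrix. The sketch below shows the standard definitions; the function name `binary_metrics` and the example counts are hypothetical and are not taken from any of the cited studies.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, mcc) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Hypothetical counts chosen only for illustration.
sn, sp, mcc = binary_metrics(tp=90, fn=10, tn=80, fp=20)
print(f"Sn={sn:.2f} Sp={sp:.2f} MCC={mcc:.3f}")
```

Unlike accuracy, the MCC uses all four counts, so it stays informative when the two classes have very different sizes.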
Article
Full-text available
β-hairpins in enzymes, a special kind of protein with catalytic functions, contain many binding sites which are essential for enzyme function. With the increasing number of observed enzyme protein sequences, it is especially important to use bioinformatics techniques to quickly and accurately identify β-hairpins in enzyme proteins for further advanced annotation of enzyme structure and function. In this work, the proposed method was trained and tested on a non-redundant enzyme β-hairpin database containing 2818 β-hairpins and 1098 non-β-hairpins. With 5-fold cross-validation on the training dataset, an overall accuracy of 90.08% and a Matthews correlation coefficient (Mcc) of 0.74 were obtained, while on the independent test dataset, an overall accuracy of 88.93% and an Mcc of 0.76 were achieved. Furthermore, the method was validated on 845 β-hairpins with ligand binding sites. With 5-fold cross-validation on the training dataset and an independent test on the test dataset, the overall accuracies were 85.82% (Mcc of 0.71) and 84.78% (Mcc of 0.70), respectively. With an integration of mRMR feature selection and the SVM algorithm, a reasonably high accuracy was achieved, indicating the method to be an effective tool for further studies of β-hairpins in enzyme structure. Additionally, as a novelty for function prediction of enzymes, β-hairpins with ligand binding sites were predicted. Based on this work, a web server was constructed to predict β-hairpin Motifs in Enzymes (http://202.207.29.251:8080/).
Chapter
Due to the advancement in various sequencing technologies, the gap between the number of protein sequences and the number of experimental protein structures is ever increasing. Community-wide initiatives like CASP have resulted in considerable efforts in the development of computational methods to accurately model protein structures from sequences. Sequence-based prediction of super-secondary structure has direct application in protein structure prediction, and there have been significant efforts in the prediction of super-secondary structure in the last decade. In this chapter, we first introduce the protein structure prediction problem and highlight some of the important progress in the field of protein structure prediction. Next, we discuss recent methods for the prediction of super-secondary structures. Finally, we discuss applications of super-secondary structure prediction in structure prediction/analysis of proteins. We also discuss prediction of protein structures that are composed of simple super-secondary structure repeats and protein structures that are composed of complex super-secondary structure repeats. Finally, we also discuss the recent trends in the field.