A sequence as a matrix. Representation of a sequence as a matrix (2D array) after encoding the raw amino acid sequence. The x-axis represents 21 fixed positions: 20 for the amino acids and 1 for non-amino acids. The y-axis represents the sequence, with a fixed length of at most 1000 amino acid positions. https://doi.org/10.1371/journal.pone.0258625.g001

Source publication
Article
Although genes carry information, proteins play the main role in providing all the functionalities of a living organism. Massive numbers of different proteins are involved in every function that occurs in a cell. These amino acid sequences can be hierarchically classified into a set of families and subfamilies depending on their evolutionary relat...

Context in source publication

Context 1
... 3: Representing each amino acid as a vector in 2D space: We take x-axis as the ordered amino acid codes and y-axis as the amino acids in the sequence as shown in Fig 1. We used non-amino acid positions as zeros and amino acid positions as ones with 2 exceptions as mentioned in the next step 4. ...
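The encoding described in this excerpt can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the alphabetical column order, the name `encode_sequence`, the choice to mark any other symbol in the 21st column, the zero-padding of unused rows, and the truncation at 1000 positions are all assumptions. The two exceptions mentioned for step 4 are not reproduced because the excerpt does not describe them.

```python
# Assumed column order: the 20 standard amino acids, alphabetical by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
NON_AA_COLUMN = len(AMINO_ACIDS)  # 21st column (index 20) for non-amino-acid symbols
MAX_LEN = 1000                    # fixed number of rows, as in the figure caption


def encode_sequence(seq, max_len=MAX_LEN):
    """Return a max_len x 21 matrix of 0/1 values for a raw sequence string."""
    column_of = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    # Start from all zeros; rows beyond the end of the sequence stay zero (assumed padding).
    matrix = [[0] * (len(AMINO_ACIDS) + 1) for _ in range(max_len)]
    # Sequences longer than max_len are truncated (assumption).
    for row, symbol in enumerate(seq.upper()[:max_len]):
        col = column_of.get(symbol, NON_AA_COLUMN)
        matrix[row][col] = 1
    return matrix


if __name__ == "__main__":
    m = encode_sequence("ACX")
    print(len(m), len(m[0]))  # matrix dimensions
    print(m[0][:3], m[2][NON_AA_COLUMN])
```

A fixed matrix shape like this lets sequences of different lengths be fed to a model that expects a constant input size.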

Citations

... Existing works available in protein membrane classification have been listed. Sandaruwan et al. [6] have proposed DeepFam with ProtCNN to categorize proteins into their types. Here, non-hierarchical classification of proteins is performed efficiently. ...
... This classifier [1,11] calculates the ability of the classifier by correctly classifying the true positives as represented in Eq. (6). ...
Article
Membrane proteins play a significant part in cellular activities. The role of membrane proteins is inevitable in drug interactions and in all living organisms. Membrane protein classification is used to identify the relationships between proteins. With the help of amino acid composition, proteins get classified. A novel protein classification scheme is proposed using a Tri-code Embedding vector. This proposed method forms triplet subgroups which are assigned unique code words. Then a triplet subgroup is formed from the amino acid subgroup, which is provided as input to the Bidirectional Long Short-Term Memory (BiLSTM) and SoftMax layer for classification. Two data sets are utilized and classified, with 7582 membrane proteins and 4684 membrane proteins. The results are investigated applying the self-consistency test, the Matthews correlation coefficient and the independent data set. Moreover, the proposed method shows its improvement in the protein classification process in terms of accuracy, specificity, sensitivity, precision, recall and F-measure. Thus, the proposed scheme provides an effective protein classification scheme that incorporates the optimistic features of deep learning. The results depict that the overall accuracy obtained for data set1 is 99.48% and for data set2 is 99.87%. The proposed method achieves the highest overall classification accuracy with minimum execution time when compared to the other methods.
... We selected an ANN model for our third and main classification model. Such deep-learning models have been quite effective for various classification tasks in bioinformatics over the years such as [46], [47], [48], [49]. ...
Preprint
The dynamic evolution of the SARS-CoV-2 virus is largely driven by mutations in its genetic sequence, culminating in the emergence of variants with increased capability to evade host immune responses. Accurate prediction of such mutations is fundamental in mitigating pandemic spread and developing effective control measures. In this study, we introduce a robust and interpretable deep-learning approach called PRIEST. This innovative model leverages time-series viral sequences to foresee potential viral mutations. Our comprehensive experimental evaluations underscore PRIEST's proficiency in accurately predicting immune-evading mutations. Our work represents a substantial step forward in the utilization of deep-learning methodologies for anticipatory viral mutation analysis and pandemic response.
... It is very common to use a traditional "flat" classification, whereby deep learning models do not explicitly consider evolutionary relationships between taxa (Hansen et al., 2020;Kasinathan et al., 2021;Kittichai et al., 2021;Xia et al., 2018). However, those relationships are hierarchical in nature, and hierarchical classification has been researched in different application domains (Silla and Freitas, 2011;Salakhutdinov et al., 2013;Park and Kim, 2020) such as diatom images (Dimitrovski et al., 2012), disease detection and protein families (Sandaruwan and Wannige, 2021). For animals such as arthropods (Tresson et al., 2021) and fish (Gupta et al., 2022), hierarchical classification has been investigated with the object detector YOLOv3 (Redmon and Farhadi, 2018) designed to detect and classify species using a "flat" multi-class structure. ...
... It is most common to use a traditional "flat" classification, whereby deep learning models do not explicitly consider evolutionary relationships between taxa (Hansen et al., 2020;Kasinathan et al., 2021;Kittichai et al., 2021;Xia et al., 2018). However, those relationships are hierarchical in nature, and hierarchical classification has been researched in different application domains (Silla and Freitas, 2011;Salakhutdinov et al., 2013;Park and Kim, 2020) such as diatom images (Dimitrovski et al., 2012), disease detection and protein families (Sandaruwan and Wannige, 2021). For animals such as arthropods (Tresson et al., 2021) and fish species (Gupta et al., 2022) hierarchical classification has been investigated with the object detector YOLOv3 (Redmon and Farhadi, 2018) designed to detect and classify species using a "flat" multi-class structure. ...
Preprint
Cameras and computer vision are revolutionising the study of insects, creating new research opportunities within agriculture, epidemiology, evolution, ecology and monitoring of biodiversity. However, the diversity of insects and the close resemblance of many species pose a major challenge: computer vision is often not sufficient to classify large numbers of insect species, some of which cannot be identified at the species level. Here, we present an algorithm to hierarchically classify insects from images, leveraging a simple taxonomy to (1) classify specimens across multiple taxonomic ranks simultaneously, and (2) highlight the lowest rank at which a reliable classification can be reached. Specifically, we propose multitask learning, a loss function incorporating class dependency at each taxonomic rank, and anomaly detection based on outlier analysis for quantification of uncertainty. First, we compile a dataset of 41,731 images of insects, combining images from time-lapse monitoring of floral scenes with images from the Global Biodiversity Information Facility (GBIF). Second, we adapt state-of-the-art convolutional neural networks, ResNet and EfficientNet, for the hierarchical classification of insects belonging to three orders, five families and nine species. Third, we assess model generalization for 11 species unseen by the trained models. Here, anomaly detection is used to predict the higher rank of the species not present in the training set. We found that incorporating a simple taxonomy into our model increased accuracy at higher taxonomic ranks. As expected, our algorithm correctly classified new insect species at higher taxonomic ranks, while classification was uncertain at lower taxonomic ranks. Anomaly detection can effectively flag novel taxa that are visually distinct from species in the training data. However, five novel taxa were consistently mistaken for visually similar species in the training data.
Above all, we have demonstrated a practical approach to hierarchical classification based on species taxonomy and uncertainty during automated in situ monitoring of live insects. Our method is simple and versatile and could be implemented to classify a wide range of insects as well as other organisms.
... Many biological data sets are hierarchical in nature, which means they have classes (or labels) that can be further divided into other classes, such as organism taxonomy [2,3], structural domains of proteins [4][5][6][7], metabolic pathways [8,9], enzyme classifications [9,10], among others. In contrast to flat classification, where classes are considered unrelated and independent, hierarchical classification associates labels to different classification levels [11]. ...
Article
The rate of biological data generation has increased dramatically in recent years, which has driven the importance of databases as a resource to guide innovation and the generation of biological insights. Given the complexity and scale of these databases, automatic data classification is often required. Biological data sets are often hierarchical in nature, with varying degrees of complexity, imposing different challenges to train, test and validate accurate and generalizable classification models. While some approaches to classify hierarchical data have been proposed, no guidelines regarding their utility, applicability and limitations have been explored or implemented. These include 'Local' approaches considering the hierarchy, building models per level or node, and 'Global' hierarchical classification, using a flat classification approach. To fill this gap, here we have systematically contrasted the performance of 'Local per Level' and 'Local per Node' approaches with a 'Global' approach applied to two different hierarchical datasets: BioLip and CATH. The results show how different components of hierarchical data sets, such as variation coefficient and prediction by depth, can guide the choice of appropriate classification schemes. Finally, we provide guidelines to support this process when embarking on a hierarchical classification task, which will help optimize computational resources and predictive performance.
Article
MicroRNAs (miRNAs) are short sequences of nucleotides, typically consisting of 21-25 base pairs, which play a crucial role in the regulation of genes throughout several biological processes. The identification of these miRNAs is challenging and intricate owing to their short read duration. Hence, the use of modern computational methodologies may provide significant benefits in accurately discerning these sequences. In recent years, there has been a growing use of computational methodologies for the categorization of diverse biological datasets. This work used publicly accessible miRNA sequences for the purpose of binary classification. Additionally, a dictionary was employed to numerically represent the nucleotide sequences, which were of a consistent length of 22 nucleotides. Various deep learning approaches, including Bidirectional Gated Recurrent Unit (Bi-GRU), Convolutional Neural Network (CNN), a mix of CNN and Long Short-Term Memory (LSTM), and LSTM, were used in the research investigation. All of the models exhibited much higher efficiency in comparison to the models documented in existing literature. Additionally, it was noted that the hybrid model combining Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) has superior performance compared to the other models, with a maximum classification accuracy of 92.8% on the testing dataset. The hybrid model presented in this study represents the first development of a classification model specifically designed for the categorization of miRNA sequences derived from either plant or animal sources. Our hybrid model classifies the data efficiently because it uses two different algorithms in model building.
Article
With the rapid development of NGS technology, the number of protein sequences has increased exponentially. Computational methods have been introduced in protein functional studies because the analysis of large numbers of proteins through biological experiments is costly and time-consuming. In recent years, new approaches based on deep learning have been proposed to overcome the limitations of conventional methods. Although deep learning-based methods effectively utilize features of protein function, they are limited to sequences of fixed length and consider information from adjacent amino acids. Therefore, new protein analysis tools that extract functional features from proteins of flexible length and train models are required. We introduce DeepPI, a deep learning-based tool for analyzing proteins in large-scale databases. The proposed model, which utilizes Global Average Pooling, is applied to proteins of flexible length and leads to reduced information loss compared to existing algorithms that use fixed sizes. The image generator converts a one-dimensional sequence into a distinct two-dimensional structure, which can extract common parts of various shapes. Finally, filtering techniques automatically detect representative data from the entire database and ensure coverage of large protein databases. We demonstrate that DeepPI has been successfully applied to large databases such as the Pfam-A database. Comparative experiments on four types of image generators illustrated the impact of structure on feature extraction. The filtering performance was verified by varying the parameter values and proved to be applicable to large databases. Compared to existing methods, DeepPI achieves higher family classification accuracy for protein function inference.
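To illustrate why Global Average Pooling suits inputs of flexible length, here is a minimal sketch that is not taken from DeepPI. It assumes a hypothetical feature map stored as one list of channel values per position and averages each channel over all positions, so the output size depends only on the number of channels. The function name `global_average_pool` and the 1D layout are illustrative assumptions; the abstract describes 2D image inputs.

```python
def global_average_pool(feature_map):
    """Average each channel over all positions.

    feature_map: non-empty list of positions, each a list of channel values
    (hypothetical layout chosen for illustration).
    Returns one value per channel, regardless of how many positions there are.
    """
    if not feature_map:
        raise ValueError("feature_map must contain at least one position")
    n_positions = len(feature_map)
    n_channels = len(feature_map[0])
    return [
        sum(position[c] for position in feature_map) / n_positions
        for c in range(n_channels)
    ]


if __name__ == "__main__":
    short_input = [[1.0, 2.0], [3.0, 4.0]]  # 2 positions, 2 channels
    long_input = [[1.0, 0.0]] * 5           # 5 positions, 2 channels
    # Both results have length 2 (one value per channel).
    print(global_average_pool(short_input))
    print(global_average_pool(long_input))
```

Because the pooled vector has a fixed length, a classifier placed after it does not need the input to be padded or cropped to a fixed size.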
Article
As complex molecules, proteins have various roles for living things. Proteins are organic molecules formed from twenty amino acid combinations with various functions for living things, such as transportation systems, a catalyst of chemical reactions for metabolism, and food reserves. This research aims to classify protein families based on sequences of amino acids as the primary structure. There are 300 amino acid fragments obtained from the Pfam database. A subset of the protein family database with three sub-sample classes was obtained, including 1-cysPrx_C, 4HBT, and ABC_Tran. In this research, first- and second-order Markov chains were applied for feature extraction. Moreover, we use a Probabilistic Neural Network (PNN) as a classifier and compare it to the joint probability technique with Markov assumptions. We evaluate the results by comparing the sensitivity and specificity of both classification techniques. The evaluation results show that overall, PNN has slightly better performance than the joint probability technique for classifying protein families.
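As a rough illustration of first-order Markov chain feature extraction, the sketch below estimates transition probabilities between consecutive amino acids and flattens them into a 400-value feature vector. It is not the method from the publication above: the function name, the alphabetical ordering, the row-wise normalization, and the handling of unseen residues (zeros) are all assumptions.

```python
from collections import Counter

# Assumed ordering of the 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def first_order_transition_features(seq):
    """Return 400 estimated probabilities P(next = b | current = a), row by row.

    Pairs containing symbols outside AMINO_ACIDS are ignored, and rows for
    residues that never appear as a predecessor are left as zeros (assumptions).
    """
    pair_counts = Counter(zip(seq, seq[1:]))  # counts of consecutive pairs
    features = []
    for a in AMINO_ACIDS:
        row_total = sum(pair_counts[(a, b)] for b in AMINO_ACIDS)
        for b in AMINO_ACIDS:
            features.append(pair_counts[(a, b)] / row_total if row_total else 0.0)
    return features


if __name__ == "__main__":
    f = first_order_transition_features("AACA")
    # Transitions observed: A->A, A->C, C->A
    print(len(f), f[0], f[1], f[20])
```

A second-order variant would condition on the previous two residues instead of one, giving 8000 values under the same assumptions.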
Article
Spartina alterniflora is a halophyte that can survive in high-salinity environments, and it is phylogenetically close to important cereal crops, such as maize and rice. It is of scientific interest to understand why S. alterniflora can live under such extremely stressful conditions. The molecular mechanism underlying its high-saline tolerance is still largely unknown. Here we investigated the possibility that high-affinity K+ transporters (HKTs), which function in salt tolerance and maintenance of ion homeostasis in plants, are responsible for salt tolerance in S. alterniflora. To overcome the imprecision and instability of the gene screening method caused by conventional sequence alignment, we used a deep learning method, DeepGOPlus, to automatically extract sequence and protein characteristics from our newly assembled S. alterniflora genome to identify SaHKTs. Results showed that a total of 16 HKT genes were identified. The number of S. alterniflora HKTs (SaHKTs) is larger than that in all other investigated plant species except wheat. Phylogenetically related SaHKT members had similar gene structures, conserved protein domains and cis-elements. Expression profiling showed that most SaHKT genes are expressed in specific tissues and are differentially expressed under salt stress. Yeast complementation expression analysis showed that type I members SaHKT1;2, SaHKT1;3 and SaHKT1;8 and type II members SaHKT2;1, SaHKT2;3 and SaHKT2;4 had low-affinity K+ uptake ability, that type II members showed stronger K+ affinity than rice and Arabidopsis HKTs, and that most SaHKTs showed a preference for Na+ transport. We believe that deep learning-based methods are powerful approaches to uncovering new functional genes, and the SaHKT genes identified are important resources for breeding new varieties of salt-tolerant crops.