Cracking the Genetic Codes: Exploring DNA Sequence Classification with Machine Learning Algorithms and Voting Ensemble Strategies

2024 International Conference on Advances in Computing,
Communication, Electrical, and Smart Systems (iCACCESS), 8-9
March, Dhaka, Bangladesh
1st Arifur Rahman
Department of Computer Science
and Engineering.
Khulna University of Engineering
& Technology.
Khulna, Bangladesh.
rarifkhan652@gmail.com
2nd Sakib Zaman
Department of Computer Science
and Engineering.
Khulna University of Engineering
& Technology.
Khulna, Bangladesh.
sakibzaman169@gmail.com
3rd Dola Das
Department of Computer Science
and Engineering.
Khulna University of Engineering
& Technology.
Khulna, Bangladesh.
dola.das@cse.kuet.ac.bd
Abstract—In the domain of bioinformatics, DNA sequence
classification is an indispensable tool that spans various scientific disciplines: it contributes to scientists’ understanding of biology and aids in the identification of genes, regulatory elements, and the functional significance of distinct genomic regions. It also plays a vital role in disease diagnosis, treatment strategies, drug discovery, evolutionary studies, agriculture, forensic identification, environmental monitoring, and more. The classification process maps DNA sequences to distinct classes based on the arrangement of their nucleotides, so even a small mutation in a sequence can change the assigned class. Each numeric class label is associated with a specific gene lineage. In this study, DNA sequences were preprocessed with K-mer counting followed by count vectorization. We then applied seven classifier models, namely Multinomial naive Bayes (MNB), Logistic regression (LR), Random forest (RF), LightGBM (LGBM), XGBoost (XGB), K-nearest neighbors (KNN), and Decision tree (DT), to three DNA sequence datasets (Human, Chimpanzee, and Dog) to identify each sequence’s gene class (0, 1, 2, 3, 4, 5, or 6). The top three and top five classifier models were then selected by accuracy, and both soft voting and hard voting ensembles were built on these sets of base models to leverage their collective predictive strength. The soft voting ensemble on the best three models consistently reached the highest accuracy across all three datasets. With this ensemble, the human, chimpanzee, and dog datasets achieved accuracy, precision, recall, and f1-score of (98.42%, 98.41%, 98.40%, 98.40%), (92.28%, 92.40%, 92.30%, 92.10%), and (70.12%, 73.10%, 70.10%, 69.20%), respectively.
Index Terms—Bioinformatics, DNA sequence classification, K-
mer counting, CountVectorizer, BoW (Bag of Words), classifier
models, soft voting ensemble, hard voting ensemble.
I. INTRODUCTION
DNA sequence classification is a cornerstone of genomics. It plays a pivotal role in advancing our understanding of life processes and genetics, and it supports comparative genomics, agricultural applications, the identification of genetic variations associated with diseases, and drug target identification. The chemical structure of DNA is a double helix: two spiraled nucleotide chains, connected by hydrogen bonds, that run in opposite directions [1]. DNA is built from nucleotides, each carrying one of four nitrogenous bases, Adenine (A), Thymine (T), Guanine (G), or Cytosine (C), and the order of these bases varies from sequence to sequence [2], [3], [4]. The two strands of the double helix complement each other according to a simple rule: if one strand has A, the other must have T, and similarly, C always pairs with G [5]. DNA sequencing is the process of determining the order of nucleotides in DNA, revealing the sequence of nucleic acid bases through various identification techniques. Gene prediction methods in machine learning can be grouped into two approaches: similarity-based and content-based [6]. These methodologies leverage several sequence attributes, including GC content, sequence length, and codon usage. Academic researchers pioneered DNA sequencing in the early 1970s. Fluorescence-based sequencing methods, which use a DNA sequencer, were introduced afterwards [7].
II. LITERATURE REVIEW
Using feature descriptors derived from different physicochemical properties and six classifiers, the authors of [8] created a stacked ensemble model to identify enhancers. The model outperformed previous methods in accuracy, specificity, sensitivity, and correlation coefficient. The researchers in [9] used machine learning to classify DNA sequences using label and k-mer encoding, distinguishing infected from normal genes. Juneja et al. [10] applied a classification algorithm to three datasets to predict gene class, splitting the sequences into substrings of a defined length for analysis. Mathur et al. [11] proposed a machine learning-based DNA sequence classifier with a feature extraction technique built on a hot vector matrix, in which each word pair is represented as a binary matrix of nucleotide positions.
The study in [12] analyzes DNA sequences to distinguish between regular and disease-affected genes using ML techniques, particularly AdaBoost and the Random forest classifier for bagging and detection, respectively. Furthermore, an identification cascade structure reduced false-positive results and enhanced reliability. In [13], the authors predicted gene families from human, chimpanzee, and dog DNA sequences using SVM and classification. Combining machine learning techniques with a pattern-matching algorithm, the study in [14] suggests a model incorporating SVM Linear and Naive Bayes to perform DNA sequence classification. Upadhyay et al. [15] predicted genetic defects in 22083 patient records and worked with human, chimpanzee, and dog DNA. They performed labelling, correlation, and analysis using five classifier models such as the SVC classifier, Gradient Boosting, and CatBoost.
979-8-3503-5028-9/24/$31.00 ©2024 IEEE
Fig. 1: Overview of the suggested ensemble approaches on the three datasets.
III. METHODOLOGY
A. Dataset Insight
In our research, we obtained the gene sequence datasets from a publicly available DNA sequence repository on Kaggle. The datasets can be downloaded from the following link: https://www.kaggle.com/code/khalidmostafa/dna-sequence-classification-using-machine-learning/input. Three datasets are provided: the Human dataset, the Chimpanzee dataset, and the Dog dataset. These datasets are in FASTA format, a standardized text-based format for representing nucleotide or amino acid sequences in which each nucleotide or amino acid is denoted by a single-letter code, such as [A, C, G, T, N]. For nucleotide sequences, A stands for Adenine, C for Cytosine, G for Guanine, and T for Thymine, while N serves as a wildcard for any of these bases. Tab. I gives an overview of the three datasets.
TABLE I: Frequency count of each gene class for the three datasets.
Dataset name  Dataset size  Training set size  Testing set size  Class label  Count
Human         4380          3504 (80%)         876 (20%)         0            531
                                                                 1            534
                                                                 2            349
                                                                 3            672
                                                                 4            711
                                                                 5            240
                                                                 6            1343
Chimpanzee    1682          1346 (80%)         336 (20%)         0            234
                                                                 1            185
                                                                 2            144
                                                                 3            228
                                                                 4            261
                                                                 5            109
                                                                 6            521
Dog           820           656 (80%)          164 (20%)         0            131
                                                                 1            75
                                                                 2            64
                                                                 3            95
                                                                 4            135
                                                                 5            60
                                                                 6            260
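To make the FASTA conventions described in this subsection concrete, the following is a minimal Python sketch of a FASTA reader. It is an illustration only, not the loading code used in the study: the function name `parse_fasta` and the two example records are hypothetical, and it assumes the usual layout in which a header line starts with `>` and the sequence follows on one or more lines.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A header line opens a new record; drop the leading '>'.
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines are appended to the current record.
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical two-record example (not taken from the datasets).
example = """>seq1 hypothetical record
ATGGGGCACC
TTNA
>seq2
GGCATN
"""

records = parse_fasta(example)
for name, sequence in records.items():
    print(name, sequence)
```

Sequence lines belonging to one record are concatenated, so `seq1 hypothetical record` maps to the single string `ATGGGGCACCTTNA`.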
B. Feature Matrix Generation
Both K-mer counting and CountVectorizer were used to generate the feature matrix.
Fig. 2: Hexamer substrings (K=6).
1) K-mer Counting: K-mer counting converted the DNA sequence strings into k-mer words with a K value of 6, known as hexamers. Fig. 2 depicts the hexamer substrings generated for the DNA sequence ATGGGGCACC. Converting DNA sequences into k-mers breaks the genetic information down into smaller, overlapping units. These k-mers form the elemental vocabulary for deciphering the genetic language encoded in DNA, and they act as the primary features.
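As a minimal sketch of the hexamer extraction described above (not the authors' code; the function name `kmers` is hypothetical), the following Python snippet slides a window of length K over the example sequence ATGGGGCACC and joins the resulting words into a sentence that a vectorizer could consume:

```python
def kmers(sequence, k=6):
    """Return every overlapping substring of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


words = kmers("ATGGGGCACC", k=6)
sentence = " ".join(words)

print(words)     # ['ATGGGG', 'TGGGGC', 'GGGGCA', 'GGGCAC', 'GGCACC']
print(sentence)  # ATGGGG TGGGGC GGGGCA GGGCAC GGCACC
```

A sequence of length L yields L - K + 1 overlapping k-mers, so the 10-nucleotide example produces 5 hexamers.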
2) CountVectorizer: CountVectorizer was applied to build a BoW (Bag of Words) model based on the counts of 4-grams (tetragrams). The sentence of k-mer words produced by K-mer counting served as input to the count vectorizer, yielding a bag-of-words model that captures the genetic features encoded in the original DNA sequences. The count vectorizer constructs a sparse matrix in which each entry is the count of a specific 4-gram in a given genetic sequence. Tab. II shows a portion of the generated sparse matrix.
TABLE II: A portion of sparse matrix generated by Count
Vectorization technique.
(index, column) value
(0, 181326) 1
(0, 178989) 1
(0, 55066) 1
(0, 217067) 1
(0, 171189) 1
(0, 216740) 1
(0, 169929) 1
(0, 211678) 1
(0, 151026) 1
(0, 135341) 1
(0, 74165) 1
(0, 58623) 1
(0, 231147) 1
(0, 227458) 1
This sparse matrix representation of a genetic sequence shows how CountVectorizer converts raw genetic text data into a structured numerical format suitable for subsequent analysis and machine learning tasks.
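The bag-of-words step can be sketched with the standard library alone. The snippet below is a simplified stand-in for CountVectorizer, not the library itself and not the authors' code. It assumes that a 4-gram means four consecutive k-mer words of the sentence, and the two input sentences and all function names are hypothetical. It prints the non-zero entries in the same (row, column) value form as Tab. II.

```python
from collections import Counter


def word_ngrams(sentence, n=4):
    """Return every run of n consecutive whitespace-separated words."""
    tokens = sentence.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(sentences, n=4):
    """Build a vocabulary and a sparse {(row, column): count} matrix."""
    counts = [Counter(word_ngrams(s, n)) for s in sentences]
    # Columns are assigned in sorted order of the distinct n-grams.
    vocabulary = {gram: col for col, gram in enumerate(sorted(set().union(*counts)))}
    sparse = {
        (row, vocabulary[gram]): count
        for row, row_counts in enumerate(counts)
        for gram, count in row_counts.items()
    }
    return vocabulary, sparse


# Hypothetical hexamer sentences for two short sequences.
sentences = [
    "ATGGGG TGGGGC GGGGCA GGGCAC GGCACC",
    "TGGGGC GGGGCA GGGCAC GGCACC GCACCT",
]
vocabulary, sparse = bag_of_words(sentences, n=4)

for (row, col), count in sorted(sparse.items()):
    print((row, col), count)
```

Only non-zero counts are stored, which is why the representation stays compact even when the vocabulary of 4-grams is very large.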
C. Dataset splitting
After the sparse matrix was generated, each dataset was split into two parts: 80% of the data for training and the remaining 20% for testing.
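The 80/20 split can be illustrated as follows. This is a sketch under assumptions: the study does not state the random seed or whether the split was stratified by class, so the seeded shuffle and the function name `split_indices` are illustrative choices only. The sample counts are those reported in Tab. I.

```python
import random


def split_indices(n_samples, test_fraction=0.2, seed=42):
    """Shuffle sample indices and return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility (assumption)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


for name, size in [("Human", 4380), ("Chimpanzee", 1682), ("Dog", 820)]:
    train, test = split_indices(size)
    print(name, len(train), len(test))
```

With these dataset sizes the sketch gives 3504/876, 1346/336, and 656/164 training/testing samples, matching Tab. I.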
D. Classifier models
In this study, several classifier models were trained, including Multinomial naive Bayes (MNB), Logistic regression (LR), Random forest (RF), LightGBM (LGBM), XGBoost (XGB), K-nearest neighbors (KNN), and Decision tree (DT). To obtain the highest accuracy from the KNN classifier, a loop iteratively evaluated its performance for numbers of neighbors K ranging from 1 to 199. Fig. 3 plots how the accuracy of the KNN classifier changes with K for the Human, Chimpanzee, and Dog datasets. Tab. III lists the best K value and the corresponding accuracy for each of the three datasets.
TABLE III: Best K value with corresponding accuracies for
the three datasets
Dataset name Best K Value Accuracy
Human K=1 85.84%
Chimpanzee K=1 84.87%
Dog K=3 51.22%
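The search over K can be sketched as a simple loop. The example below is not the authors' implementation: it uses a small pure-Python nearest-neighbor classifier with squared Euclidean distance as a stand-in for a library KNN, a hypothetical two-class toy dataset, and K values from 1 to 6 rather than the 1 to 199 range used in the study.

```python
from collections import Counter


def knn_predict(train_x, train_y, x, k):
    """Predict the majority label among the k nearest training points."""
    ranked = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def accuracy_for_k(train_x, train_y, test_x, test_y, k):
    """Fraction of test samples classified correctly with k neighbors."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y for x, y in zip(test_x, test_y))
    return hits / len(test_y)


def best_k(train_x, train_y, test_x, test_y, k_values):
    """Evaluate every K and return (best K, {K: accuracy})."""
    scores = {k: accuracy_for_k(train_x, train_y, test_x, test_y, k) for k in k_values}
    # max() keeps the first maximum, so ties go to the smallest K.
    return max(scores, key=scores.get), scores


# Hypothetical toy data: four points of class 0, two points of class 1.
train_x = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (6, 6)]
train_y = [0, 0, 0, 0, 1, 1]
test_x = [(0.5, 0.5), (5.5, 5.5)]
test_y = [0, 1]

k_star, scores = best_k(train_x, train_y, test_x, test_y, range(1, 7))
print(k_star, scores[k_star])
```

On this toy data small values of K classify both test points correctly, while K=5 and K=6 let the larger class outvote the smaller one, which mirrors the kind of accuracy-versus-K curve that such a loop is meant to expose.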
E. Ensemble model
1) Soft voting ensemble: The soft voting ensemble combines the predictions of multiple base models by averaging their predicted class probabilities (here with equal weights) and choosing the class with the highest average. Mathematically, let $P_{i,j}$ denote the probability that the $i$-th base model assigns to the $j$-th class for a given sample, and let $N$ be the number of base models. The soft voting ensemble probability $P_{\text{ensemble},j}$ for the $j$-th class is computed as follows:
$$P_{\text{ensemble},j} = \frac{1}{N} \sum_{i=1}^{N} P_{i,j}$$
After picking the best three and best five models, we utilized
soft voting ensemble on these sets of models.
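The soft voting rule can be illustrated with a few lines of Python. This is a sketch with equal model weights, not the authors' code; the function name `soft_vote` and the three probability vectors are hypothetical.

```python
def soft_vote(model_probabilities):
    """Average class probabilities over models; return (label, averages).

    model_probabilities[i][j] is the probability that base model i
    assigns to class j for a single sample.
    """
    n_models = len(model_probabilities)
    n_classes = len(model_probabilities[0])
    averaged = [
        sum(probs[j] for probs in model_probabilities) / n_models
        for j in range(n_classes)
    ]
    label = max(range(n_classes), key=averaged.__getitem__)
    return label, averaged


probabilities = [
    [0.90, 0.10],  # hypothetical model A, confident in class 0
    [0.40, 0.60],  # hypothetical model B, leaning towards class 1
    [0.45, 0.55],  # hypothetical model C, leaning towards class 1
]
label, averaged = soft_vote(probabilities)
print(label, [round(p, 3) for p in averaged])
```

Here the averaged probabilities are about 0.583 for class 0 and 0.417 for class 1, so soft voting selects class 0 even though two of the three models individually prefer class 1. A confident model can therefore outweigh two uncertain ones.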
2) Hard voting ensemble: Hard voting is an ensemble technique that combines the predictions of multiple base models by selecting the class label that receives the majority of votes. Mathematically, let $M$ be the number of base models in the ensemble, let $y_{m,i}$ be the class label predicted by the $m$-th model for input sample $i$, and let $c$ range over the $C$ classes of the classification task. The hard voting ensemble prediction $E_{\text{hard},i}$ for sample $i$ is determined as follows:
$$E_{\text{hard},i} = \arg\max_{c} \sum_{m=1}^{M} I(y_{m,i} = c)$$
where $I(\cdot)$ is the indicator function, equal to 1 when its argument is true and 0 otherwise.
After choosing the best three and best five models, we employed hard voting ensemble on these sets of models.
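A minimal Python sketch of the majority-vote rule follows. It is illustrative only: the function name `hard_vote` is hypothetical, and breaking ties in favor of the smallest class label is an arbitrary choice of this sketch, since the text does not specify a tie-breaking rule.

```python
from collections import Counter


def hard_vote(predicted_labels):
    """Return the label predicted by the most base models.

    predicted_labels holds one class label per base model for a single
    sample. Ties are broken by the smallest label (assumption).
    """
    counts = Counter(predicted_labels)
    return min(counts, key=lambda label: (-counts[label], label))


print(hard_vote([0, 1, 1]))  # 1
print(hard_vote([2, 0, 1]))  # 0 (three-way tie, smallest label wins)
```

Unlike soft voting, this rule ignores how confident each model is: only the predicted labels are counted.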
IV. EXPERIMENTAL RESULTS
In the analysis of the Human.txt, Chimpanzee.txt, and Dog.txt datasets, the soft voting ensemble built on the top three classifier models consistently provided the highest accuracy, as shown in Tab. IV.
Human dataset: On the human dataset, Multinomial Naive Bayes, Logistic Regression, Random Forest, LightGBM, XGBoost, K-Nearest Neighbors, and Decision Tree recorded accuracies of 98.40%, 93.95%, 92.24%, 91.21%, 89.84%, 85.84%, and 81.15%, respectively. Among the individual classifiers, Multinomial Naive Bayes took the lead, with accuracy, precision, recall, and f1-score all at 98.40%. Logistic Regression earned the second-highest accuracy, with precision, recall, and f1-score of 94.80%, 93.90%, and 94.00%. The soft voting ensemble on the top three models (MNB+LR+RF) achieved the highest accuracy among all proposed models for the human dataset, with accuracy, precision, recall, and f1-score of 98.42%, 98.41%, 98.40%, and 98.40%, respectively. The soft voting ensemble on the top five models (MNB+LR+RF+LGBM+XGB) ranked second among the ensemble models, with an accuracy of 96.23% and precision, recall, and f1-score of 96.50%, 96.20%, and 96.20%, respectively.
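The selection of the top three and top five base models by accuracy can be sketched as a simple ranking. The accuracies below are the human-dataset values reported in Tab. IV; the function name `top_models` is hypothetical, and this is an illustration rather than the authors' code.

```python
# Individual-classifier accuracies (%) on the human dataset, from Tab. IV.
human_accuracy = {
    "MNB": 98.40, "LR": 93.95, "RF": 92.24, "LGBM": 91.21,
    "XGB": 89.84, "KNN": 85.84, "DT": 81.15,
}


def top_models(accuracy_by_model, n):
    """Return the names of the n most accurate models, best first."""
    ranked = sorted(accuracy_by_model, key=accuracy_by_model.get, reverse=True)
    return ranked[:n]


top3 = top_models(human_accuracy, 3)
top5 = top_models(human_accuracy, 5)
print(top3)  # ['MNB', 'LR', 'RF']
print(top5)  # ['MNB', 'LR', 'RF', 'LGBM', 'XGB']
```

The two lists reproduce the MNB+LR+RF and MNB+LR+RF+LGBM+XGB combinations used for the human-dataset ensembles.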
Chimpanzee dataset: For the Chimpanzee dataset, Multinomial Naive Bayes, Logistic Regression, LightGBM, XGBoost, K-Nearest Neighbors, Random Forest, and Decision Tree achieved accuracy scores of 91.39%, 89.91%, 88.13%, 86.05%, 84.87%, 84.27%, and 79.23%, respectively. Among these, Multinomial Naive Bayes had the highest accuracy, precision, recall, and f1-score at 91.39%, 91.80%, 91.40%, and 91.40%. The soft voting ensemble on the top three models (MNB+LR+LGBM) was the best-performing model for the Chimpanzee dataset, with accuracy, precision, recall, and f1-score of 92.28%, 92.40%, 92.30%, and 92.10%, respectively. This ensemble demonstrated superior classification capabilities by leveraging the strengths of the individual models.
Dog dataset: Finally, for the Dog dataset, Multinomial Naive Bayes, Logistic Regression, Random Forest, LightGBM, XGBoost, K-Nearest Neighbors, and Decision Tree achieved accuracy scores of 70.10%, 59.77%, 56.71%, 64.63%, 59.76%, 51.22%, and 53.66%, respectively. Among the individual classifiers, Multinomial Naive Bayes had the best accuracy, along with precision, recall, and f1-score of 73.20%, 70.10%, and 69.40%. The soft voting ensemble on the top three models reached an accuracy of 70.12%, the best among all suggested models, with precision, recall, and f1-score of 73.10%, 70.10%, and 69.20%. Additionally, owing to the limited amount of data, the accuracy scores on the dog dataset differed substantially from those on the other two datasets.
TABLE IV: Performance evaluation of all recommended models across the Human, Chimpanzee, and Dog datasets (all values in %).
Model                                     Accuracy  Precision  Recall  F1-score  MAE     MSE     RMSE    RAE     RRSE
Human dataset (best performer: Soft voting on top 3 models, MNB+LR+RF)
MNB                                       98.40%    98.40%     98.40%  98.40%    5.10%   20.90%  45.70%  1.50%   13.00%
LR                                        93.95%    94.80%     93.90%  94.00%    19.60%  80.60%  89.80%  5.60%   25.50%
RF                                        92.24%    93.40%     92.20%  92.40%    20.50%  69.60%  83.40%  5.80%   23.70%
LGBM                                      91.21%    92.00%     91.20%  91.20%    29.10%  12.34%  11.11%  8.30%   31.60%
XGB                                       89.84%    91.20%     89.80%  90.00%    37.60%  169.1%  130.0%  10.70%  37.00%
KNN                                       85.84%    92.60%     85.80%  87.40%    32.90%  87.20%  93.40%  9.40%   26.60%
DT                                        81.15%    82.80%     81.50%  81.90%    59.80%  242.9%  155.9%  17.00%  44.40%
Soft voting, top 3 (MNB+LR+RF)            98.42%    98.41%     98.40%  98.40%    7.30%   28.30%  53.20%  2.10%   15.10%
Hard voting, top 3 (MNB+LR+RF)            95.67%    95.90%     95.70%  95.70%    14.70%  60.80%  78.00%  4.20%   22.20%
Soft voting, top 5 (MNB+LR+RF+LGBM+XGB)   96.23%    96.50%     96.20%  96.20%    11.00%  36.10%  60.10%  3.10%   17.10%
Hard voting, top 5 (MNB+LR+RF+LGBM+XGB)   94.29%    94.80%     98.40%  98.40%    16.40%  65.30%  80.80%  4.70%   23.00%
Chimpanzee dataset (best performer: Soft voting on top 3 models, MNB+LR+LGBM)
MNB                                       91.39%    91.80%     91.40%  91.40%    21.40%  72.40%  85.10%  5.60%   22.20%
LR                                        89.91%    91.10%     89.90%  89.70%    30.00%  119.6%  109.4%  7.80%   28.60%
RF                                        84.27%    87.30%     84.30%  84.10%    54.60%  238.0%  154.3%  14.30%  40.30%
LGBM                                      88.13%    89.10%     88.10%  87.90%    34.10%  127.9%  113.1%  8.90%   29.60%
XGB                                       86.05%    86.90%     86.10%  85.80%    38.60%  147.2%  121.3%  10.10%  31.70%
KNN                                       84.87%    89.40%     84.90%  85.00%    47.20%  197.9%  140.7%  12.30%  36.80%
DT                                        79.23%    79.70%     79.20%  79.20%    54.90%  189.0%  137.5%  14.40%  35.90%
Soft voting, top 3 (MNB+LR+LGBM)          92.28%    92.40%     92.30%  92.10%    20.20%  74.80%  86.50%  5.30%   22.60%
Hard voting, top 3 (MNB+LR+LGBM)          91.10%    91.80%     91.10%  90.80%    26.70%  104.5%  102.2%  7.00%   26.70%
Soft voting, top 5 (MNB+LR+LGBM+XGB+KNN)  89.91%    91.60%     89.90%  89.80%    28.20%  110.7%  105.2%  7.40%   27.50%
Hard voting, top 5 (MNB+LR+LGBM+XGB+KNN)  89.90%    91.20%     89.90%  89.70%    27.00%  100.0%  100.0%  7.10%   26.10%
Dog dataset (best performer: Soft voting on top 3 models, MNB+LGBM+LR)
MNB                                       70.10%    73.20%     70.10%  69.40%    101.2%  430.5%  207.5%  29.40%  60.30%
LR                                        59.77%    71.20%     59.80%  57.60%    145.7%  650.6%  255.1%  42.40%  74.20%
RF                                        56.71%    64.40%     56.70%  53.40%    162.8%  756.7%  275.1%  47.30%  80.00%
LGBM                                      64.63%    68.10%     64.60%  63.40%    108.5%  420.7%  205.1%  31.60%  59.60%
XGB                                       59.76%    63.50%     59.80%  58.90%    131.1%  538.4%  232.0%  38.10%  67.50%
KNN                                       51.22%    67.50%     51.20%  45.50%    179.9%  839.6%  289.8%  52.30%  84.30%
DT                                        53.66%    53.30%     53.70%  52.50%    142.7%  545.1%  233.5%  41.50%  67.90%
Soft voting, top 3 (MNB+LGBM+LR)          70.12%    73.10%     70.10%  69.20%    100.6%  425.0%  206.2%  29.30%  59.90%
Hard voting, top 3 (MNB+LGBM+LR)          66.50%    71.90%     66.50%  64.60%    109.8%  458.5%  214.1%  31.90%  62.30%
Soft voting, top 5 (MNB+LGBM+LR+XGB+RF)   67.10%    69.70%     67.10%  65.60%    134.8%  601.8%  245.3%  39.20%  71.30%
Hard voting, top 5 (MNB+LGBM+LR+XGB+RF)   62.20%    67.60%     62.20%  60.10%    134.1%  576.8%  240.2%  39.00%  69.80%
Fig. 3: Evaluation of a KNN classifier (accuracy metric) as the number of nearest neighbors changes, for the three datasets. Panels: (a) K value vs. accuracy (Human); (b) K value vs. accuracy (Chimpanzee); (c) K value vs. accuracy (Dog).
Fig. 4: Confusion matrices for the Human dataset for all suggested models. Panels: (a) Decision tree; (b) LightGBM; (c) Logistic regression; (d) Multinomial naive Bayes; (e) Random forest; (f) XGBoost; (g) Soft voting on top 3 (MNB+LR+RF); (h) Hard voting on top 3 (MNB+LR+RF); (i) Soft voting on top 5 (MNB+LR+RF+LGBM+XGB); (j) Hard voting on top 5 (MNB+LR+RF+LGBM+XGB).
Fig. 5: Confusion matrices for the Chimpanzee dataset for all suggested models. Panels: (a) Decision tree; (b) LightGBM; (c) Logistic regression; (d) Multinomial naive Bayes; (e) Random forest; (f) XGBoost; (g) Soft voting on top 3 (MNB+LR+LGBM); (h) Hard voting on top 3 (MNB+LR+LGBM); (i) Soft voting on top 5 (MNB+LR+LGBM+XGB+KNN); (j) Hard voting on top 5 (MNB+LR+LGBM+XGB+KNN).
Fig. 6: Confusion matrices for the Dog dataset for all suggested models. Panels: (a) Decision tree; (b) LightGBM; (c) Logistic regression; (d) Multinomial naive Bayes; (e) Random forest; (f) XGBoost; (g) Soft voting on top 3 (MNB+LGBM+LR); (h) Hard voting on top 3 (MNB+LGBM+LR); (i) Soft voting on top 5 (MNB+LGBM+LR+XGB+RF); (j) Hard voting on top 5 (MNB+LGBM+LR+XGB+RF).
V. DISCUSSION AND CONCLUSION
Both the soft and hard voting ensemble models were applied to all three species datasets to assess their cross-species performance. In this study, the choice of K in k-mer counting was significant, as it determines the length of the subsequences considered. This parameter is crucial for capturing specific patterns and characteristics within the genetic data. We explored the models’ performance for K values from 1 to 6. Our observations indicate that the recommended algorithms perform best at K=6, emphasizing the importance of a substring length of 6 nucleotides; deviating from this value led to a decline in performance. In addition, CountVectorizer was applied to build a BoW (Bag of Words) model based on the counts of 4-grams. Tetragram vectorization gave the top accuracy, which led us to opt for tetragram tokenization. We also noticed that the soft voting ensemble consistently had a small accuracy advantage over the hard voting ensemble. Likewise, the ensembles built on the top three models consistently surpassed those built on the top five models in accuracy. From this study, we can therefore conclude that adding more classifier models can degrade the performance of voting ensemble models.
REFERENCES
[1] Chou KC, Shen HB. Predicting eukaryotic protein subcellular location
by fusing optimized evidence-theoretic K-Nearest Neighbor classifiers. J
Proteome Res. 2006 Aug;5(8):1888-97. doi: 10.1021/pr060167c. PMID:
16889410.
[2] Akhtar, M., Epps, J., & Ambikairajah, E. (2007). On DNA Numerical
Representations for Period-3 Based Exon Prediction. 2007 IEEE Inter-
national Workshop on Genomic Signal Processing and Statistics, 1-4.
[3] M. Akhtar, J. Epps and E. Ambikairajah, “Signal Processing in Sequence
Analysis: Advances in Eukaryotic Gene Prediction,” in IEEE Journal of
Selected Topics in Signal Processing, vol. 2, no. 3, pp. 310-321, June
2008, doi: 10.1109/JSTSP.2008.923854.
[4] Ramachandran P, Lu WS, Antoniou A. Filter-based methodology for the
location of hot spots in proteins and exons in DNA. IEEE Trans Biomed
Eng. 2012 Jun;59(6):1598-609. doi: 10.1109/TBME.2012.2190512.
Epub 2012 Mar 9. PMID: 22410955.
[5] W. Kinsner, “Towards cognitive analysis of DNA,” 9th IEEE International
Conference on Cognitive Informatics (ICCI’10), Beijing, China,
2010, pp. 6-7, doi: 10.1109/COGINF.2010.5599728.
[6] Wang Z, Chen Y, Li Y. A brief review of computational gene prediction
methods. Genomics Proteomics Bioinformatics. 2004 Nov;2(4):216-
21. doi: 10.1016/s1672-0229(04)02028-5. PMID: 15901250; PMCID:
PMC5187414.
[7] Olsvik O, Wahlberg J, Petterson B, Uhlén M, Popovic T, Wachsmuth IK,
Fields PI. Use of automated sequencing of polymerase chain reaction-
generated amplicons to identify three types of cholera toxin subunit B in
Vibrio cholerae O1 strains. J Clin Microbiol. 1993 Jan;31(1):22-5. doi:
10.1128/jcm.31.1.22-25.1993. PMID: 7678018; PMCID: PMC262614.
[8] B. A. Mir, M. U. Rehman, H. Tayara, and K. T. Chong, “Improving
enhancer identification with a multi-classifier stacked ensemble model,”
Journal of Molecular Biology, vol. 435, no. 23, p. 168314, 2023.
[9] S. Sarkar, K. Mridha, A. Ghosh, and R. N. Shaw, “Machine learning
in bioinformatics: New technique for dna sequencing classification,”
in Advanced Computing and Intelligent Technologies: Proceedings of
ICACIT 2022. Springer, 2022, pp. 335–355.
[10] S. Juneja, A. Dhankhar, A. Juneja, and S. Bali, “An approach to dna
sequence classification through machine learning: Dna sequencing, k
mer counting, thresholding, sequence analysis,” International Journal of
Reliable and Quality E-Healthcare (IJRQEH), vol. 11, no. 2, pp. 1–15,
2022.
[11] G. Mathur, A. Pandey, and S. Goyal, “A comprehensive tool for rapid
and accurate prediction of disease using dna sequence classifier,” Journal
of Ambient Intelligence and Humanized Computing, vol. 14, no. 10, pp.
13869–13885, 2023.
[12] S. S. Kanumalli, S. Swathi, K. Sukanya, V. Yamini, and N. Nagalakshmi,
“Classification of dna sequence using machine learning,” in Soft Com-
puting for Security Applications: Proceedings of ICSCS 2022. Springer,
2022, pp. 723–732.
[13] J. Rexie, K. Raimond, D. Brindha, and A. K. Prabavathy, “K-mer
based prediction of gene family by applying multinomial naïve bayes
algorithm in dna sequence,” in AIP Conference Proceedings, vol. 2914,
no. 1. AIP Publishing, 2023.
[14] B. A. Hamed, O. A. S. Ibrahim, and T. Abd El-Hafeez, “Optimizing
classification efficiency with machine learning techniques for pattern
matching,” Journal of Big Data, vol. 10, no. 1, p. 124, 2023.
[15] V. Upadhyay, S. Harbhajanka, S. Pangaonkar, and R. Gunjan, “Ex-
ploratory data analysis and prediction of human genetic disorder and
species using dna sequencing,” in Proceedings of the Future Technolo-
gies Conference. Springer, 2023, pp. 197–213.
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
The study proposes a novel model for DNA sequence classification that combines machine learning methods and a pattern-matching algorithm. This model aims to effectively categorize DNA sequences based on their features and enhance the accuracy and efficiency of DNA sequence classification. The performance of the proposed model is evaluated using various machine learning algorithms, and the results indicate that the SVM linear classifier achieves the highest accuracy and F1 score among the tested algorithms. This finding suggests that the proposed model can provide better overall performance than other algorithms in DNA sequence classification. In addition, the proposed model is compared to two suggested algorithms, namely FLPM and PAPM, and the results show that the proposed model outperforms these algorithms in terms of accuracy and efficiency. The study further explores the impact of pattern length on the accuracy and time complexity of each algorithm. The results show that as the pattern length increases, the execution time of each algorithm varies. For a pattern length of 5, SVM Linear and EFLPM have the lowest execution time of 0.0035 s. However, at a pattern length of 25, SVM Linear has the lowest execution time of 0.0012 s. The experimental results of the proposed model show that SVM Linear has the highest accuracy and F1 score among the tested algorithms. SVM Linear achieved an accuracy of 0.963 and an F1 score of 0.97, indicating that it can provide the best overall performance in DNA sequence classification. Naive Bayes also performs well with an accuracy of 0.838 and an F1 score of 0.94. The proposed model offers a valuable contribution to the field of DNA sequence analysis by providing a novel approach to pre-processing and feature extraction. The model’s potential applications include drug discovery, personalized medicine, and disease diagnosis. 
The study’s findings highlight the importance of considering the impact of pattern length on the accuracy and time complexity of DNA sequence classification algorithms.
Article
Full-text available
Machine learning (ML) has been instrumental in optimal decision making through relevant historical data, including the domain of bioinformatics. In bioinformatics classification of natural genes and the genes that are infected by disease called invalid gene is a very complex task. In order to find the applicability of a fresh protein through genomic research, DNA sequences need to be classified. The current work identifies classes of DNA sequence using machine learning algorithm. These classes are basically dependent on the sequence of nucleotides. With a fractional mutation in sequence, there is a corresponding change in the class. Each numeric instance representing a class is linked to a gene family including G protein coupled receptors, tyrosine kinase, synthase, etc. In this paper, the authors applied the classification algorithm on three types of datasets to identify which gene class they belong to. They converted sequences into substrings with a defined length. That ‘k value' defines the length of substring which is one of the ways to analyze the sequence.
Article
Full-text available
In the current pandemic situation where the coronavirus is spreading very fast that can jump from one human to another. Along with this, there are millions of viruses for example Ebola, SARS, etc. that can spread as fast as the coronavirus due to the mobilization and globalization of the population and are equally deadly. Earlier identification of these viruses can prevent the outbreaks that we are facing currently as well as can help in the earlier designing of drugs. Identification of disease at a prior stage can be achieved through DNA sequence classification as DNA carries most of the genetic information about organisms. This is the reason why the classification of DNA sequences plays an important role in computational biology. This paper has presented a solution in which samples collected from NCBI are used for the classification of DNA sequences. DNA sequence classification will in turn gives the pattern of various diseases; these patterns are then compared with the samples of a newly infected person and can help in the earlier identification of disease. However, feature extraction always remains a big issue. In this paper, a machine learning-based classifier and a new technique for extracting features from DNA sequences based on a hot vector matrix have been proposed. In the hot vector representation of the DNA sequence, each pair of the word is represented using a binary matrix which represents the position of each nucleotide in the DNA sequence. The resultant matrix is then given as an input to the traditional CNN for feature extraction. The results of the proposed method have been compared with 5 well-known classifiers namely Convolution neural network (CNN), Support Vector Machines (SVM), K-Nearest Neighbor (KNN) algorithm, Decision Trees, Recurrent Neural Networks (RNN) on several parameters including precision rate and accuracy and the result shows that the proposed method gives an accuracy of 93.9%, which is highest compared to other classifiers.
Chapter
Expressing genetic information by developing sequencing models for DNA/RNA proteins with machine learning algorithms is a major area of exploration and a growing need. The aim was to identify, predict, and classify gene families based on DNA sequences with medical anomalies, for early diagnosis of genetic variation. This study assessed gene sequences from three DNA sequence text files, including 4380 human, 1682 chimpanzee, and 820 dog DNA sequences. The genetic disorder dataset includes 35 features that were used to predict genetic abnormalities across 22083 patient records. Labelling, correlation analysis, exploratory data analysis, and prediction systems were carried out for both datasets. Prediction systems were built using Logistic Regression, Gaussian Naive Bayes, K Neighbors, Decision Tree, Random Forest, Gradient Boosting, CatBoost, Multinomial Naive Bayes Classifier, and SVC Classifier algorithms. The Multinomial Naive Bayes Classifier achieved the best accuracy of 94.42% on the DNA sequencing dataset, followed by the K Neighbors Classifier, Decision Tree Classifier, Random Forest Classifier, and SVC Classifier with 71.98%, 74.85%, 86.82%, and 79.6%, respectively. For the genetic disorder dataset, the best-performing model was CatBoost with a 54.72% R2CV score. Logistic Regression, Gaussian Naive Bayes, K Neighbors, Decision Tree, Random Forest, Extreme Gradient Boosting, Light Gradient Boosting Machine, and Gradient Boosting Classifier achieved R2CV scores of 47.36%, 34.16%, 45.27%, 40.83%, 52.36%, 48.89%, 48.75%, and 53.34%, respectively. In the future, genetic disorders will be classified based on extensive medical history, sequence data, deep learning models, federated machine learning, and transfer learning.
Chapter
In the field of medical information research, the genetic series is widely used as a component of a category. One of the application areas of machine learning (ML) is biochemistry. Bioinformatics is an interdisciplinary science that uses computer and communication science to understand biological data. One of its most difficult tasks is to distinguish between regular genes and disease-causing genes. In genomic research, gene sequences are classified into existing categories to discover the functions of novel proteins. It is therefore critical to identify and categorize such genes. We employ ML approaches to distinguish between infected and normal genes using classification methods. AdaBoost has a high degree of precision; relative to the bagging algorithm and the Random Forest algorithm, AdaBoost fully considers the weight of each classifier. To generate a sequence of weak classifiers, an AdaBoost-based learning approach is used to find the most 'informative' or 'discriminating' features. The identification cascade structure can also help to limit false-positive results. This study provides an overview of the mechanics of gene sequence classification using ML techniques, including a brief introduction to bioinformatics and important challenges in DNA sequencing with ML.
Chapter
The extraction of useful information from deoxyribonucleic acid (DNA) is a major component of bioinformatics research, and DNA sequence categorization has a variety of applications, including genomic and biomedical data processing. DNA sequence classification is a critical problem in a general computational framework for biomedical data processing, and numerous machine learning techniques have been used to complete this task in recent years. Machine learning is a data processing technique that uses training data to create judgments, predictions, classifications, and recognitions. To learn the functions of a new protein, genomic researchers classify DNA sequences into known categories. As a result, it is critical to discover and characterize those genes. We employ machine learning approaches to distinguish between infected and normal genes using classification methods. In this study, we used the multinomial Naive Bayes classifier, SVM, KNN, and others to classify DNA sequences using label and k-mer encoding. Different categorization metrics are used to evaluate the models. The multinomial Naive Bayes classifier, SVM, KNN, decision tree, random forest, and logistic regression with k-mer encoding all have good accuracy on testing data, with 93.16% and 93.13%, respectively.
Keywords: DNA; Bioinformatics; Computational framework; Machine learning; Classification
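The k-mer encoding mentioned in the abstract above can be sketched as overlapping substring counts. This is a minimal illustration under stated assumptions: the function name `kmer_counts` and the default `k=6` are hypothetical, and the abstract does not state which k or which tooling the authors used.

```python
from collections import Counter


def kmer_counts(sequence, k=6):
    """Count overlapping k-mers (substrings of length k) in a DNA sequence.

    Hypothetical helper for illustration; k=6 is an assumed default.
    Sequences shorter than k yield an empty Counter.
    """
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    counts = kmer_counts("ATGATGCA", k=3)
    for kmer, n in sorted(counts.items()):
        print(kmer, n)
```

Count vectors of this kind are non-negative integers, which is one reason they pair naturally with a multinomial Naive Bayes classifier such as the one named in the abstract.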
Conference Paper
Summary form only given. Deoxyribonucleic acid (DNA) has become one of the most examined molecules on the planet. Scientists around the world have been trying to unravel its secrets for many purposes. For example, genetic information is currently used to raise better plants and animals, create enhanced pharmaceuticals for humans, and for gene therapy in medicine. Science as a whole has benefited from the study of genetics because of the increased understanding of the biological processes that all organisms share. In recent decades, a significant amount of research has been directed towards sequencing and understanding the entire human genome through the Human Genome Project (HGP) launched in 1986. The goal of the HGP was to find the location of the approximately 1×10⁵ human genes, and to read the entire sequence of the human genome (about 3×10⁹ base pairs, bp). An exponential growth rate of that research resulted in reaching the goal by 2003. Similarly, the speed of finding genes and their locations is also increasing rapidly. On the other hand, the traditional methods of finding genes and their locations on chromosomes through testing their biological function have been inherently slow. Although numerous faster techniques have been developed, there is still a need to augment them with new approaches. Therefore, robust computational solutions to the gene-finding problem could provide a valuable resource for the HGP and for the molecular-biology community. Most of the current research in deciphering the meaning of DNA sequences is approached from the lowest base-pair level. Its main objective is to search for patterns or correlations existing in the DNA sequence related to codons, amino acids, and proteins. A number of gene-finding systems have been developed in recent decades.
These systems use a variety of sophisticated computational data-mining techniques, including neural networks, dynamic programming, rule-based methods, decision trees, probability reasoning, hidden Markov chains, genetic programming, and support vector machines. Most of these approaches are based on local measures only. In addition, many of the techniques rely on the statistical qualities of exons in the gene, thus using only the known gene pool as a training set for their classification. Although the techniques have demonstrated limited success, better techniques should be developed. An approach to finding such improved techniques is to consider long-range relations (in addition to short-range relations) in the DNA sequence, spanning 10⁴ nucleotides. If we had a good technique to measure such long-range relations, we would be able to estimate any existing self-affinity (fractality) in the DNA sequence, without any a priori assumptions about its structure. This would be a data-driven approach, rather than the common model-driven approach. Along those lines, preliminary results have already been reported in the literature on a local self-similarity with a 180 bp periodicity in mammalian nuclear DNA sequence. Other publications have provided evidence that the long-range fractal correlations appear in DNA sequences with different values in different regions of the sequence. This paper describes such a multiscale approach, together with an algorithm based on a multifractal analysis, and demonstrates that multifractal estimates can be used to characterize DNA sequences [1], [2], [3]. This multifractal approach appears to be new, and may provide a key to cognitive analysis of DNA sequences. It should be clear that the DNA sequencing and gene-finding techniques constitute a subset of bioinformatics, the science of using information to understand biology, with its numerous tools.
In turn, bioinformatics is a subset of computational biology which is the application of quantitative analytical techniques in modelling biological systems. Very often, for structural biologists, DNA is not just a sequence of symbols, but implies 3D structures, molecular shapes an