Cracking the Genetic Codes: Exploring DNA Sequence Classification with Machine Learning Algorithms and Voting Ensemble Strategies

March 2024

DOI:10.1109/iCACCESS61735.2024.10499483

Conference: 2024 International Conference on Advances in Computing, Communication, Electrical, and Smart Systems (iCACCESS)

Authors:

Arifur Rahman

Khulna University of Engineering and Technology

Sakib Zaman

United International University


2024 International Conference on Advances in Computing, Communication, Electrical, and Smart Systems (iCACCESS), 8-9 March, Dhaka, Bangladesh

Cracking the Genetic Codes: Exploring DNA Sequence Classification with Machine Learning Algorithms and Voting Ensemble Strategies

1st Arifur Rahman
Department of Computer Science and Engineering
Khulna University of Engineering & Technology
Khulna, Bangladesh
rarifkhan652@gmail.com

2nd Sakib Zaman
Department of Computer Science and Engineering
Khulna University of Engineering & Technology
Khulna, Bangladesh
sakibzaman169@gmail.com

3rd Dola Das
Department of Computer Science and Engineering
Khulna University of Engineering & Technology
Khulna, Bangladesh
dola.das@cse.kuet.ac.bd

Abstract—In bioinformatics, DNA sequence classification is an indispensable tool that spans many scientific disciplines: it contributes to scientists' understanding of biology and aids in identifying genes, regulatory elements, and the functional significance of distinct genomic regions. It also plays a vital role in disease diagnosis, treatment strategies, drug discovery, evolutionary studies, agriculture, forensic identification, environmental monitoring, and more. Classification maps DNA sequences to distinct classes based on the arrangement of their nucleotides, so even a small mutation in a sequence can shift its assigned class. Each numeric class label is associated with a specific gene lineage. In this study, DNA sequences were preprocessed with K-mer counting followed by count vectorization. We then trained a variety of classifier models, namely Multinomial naive Bayes (MNB), Logistic regression (LR), Random forest (RF), LightGBM (LGBM), XGBoost (XGB), K-nearest neighbors (KNN), and Decision tree (DT), on three DNA sequence datasets (Human, Chimpanzee, and Dog) to identify each sequence's gene class (0, 1, 2, 3, 4, 5, or 6). The top three and top five classifiers were then selected by accuracy, and both soft voting and hard voting ensembles were built on these sets of base models to leverage their collective predictive strength. The soft voting ensemble on the best three models consistently reached the highest accuracy across all three datasets. With this ensemble, the human, chimpanzee, and dog datasets reached accuracy, precision, recall, and F1-score of (98.42%, 98.41%, 98.40%, 98.40%), (92.28%, 92.40%, 92.30%, 92.10%), and (70.12%, 73.10%, 70.10%, 69.20%), respectively.

Index Terms—Bioinformatics, DNA sequence classification, K-mer counting, CountVectorizer, BoW (Bag of Words), classifier models, soft voting ensemble, hard voting ensemble.

I. INTRODUCTION

DNA sequence classification is a cornerstone of genomics. It plays a pivotal role in advancing our understanding of life processes, genetics, and comparative genomics, and it supports agricultural applications, the identification of genetic variations associated with diseases, and drug target identification. The chemical structure of DNA is a double helix: two spiraled nucleotide chains, connected by hydrogen bonds and running in opposite directions [1]. DNA is built from nucleotides carrying one of four nitrogenous bases, Adenine (A), Thymine (T), Guanine (G), and Cytosine (C), which can occur in various orders [2], [3], [4]. The two strands of the double helix complement each other according to a simple rule: if one strand has A, the other must have T, and similarly C always pairs with G [5]. DNA sequencing is the process of determining the order of nucleotides in DNA, revealing the sequence of nucleic acid bases through various identification techniques. Machine learning methods for gene prediction fall into two groups: similarity-based approaches and content-based approaches [6]. These methods rely on sequence attributes such as GC content, sequence length, and codon usage.

Academic researchers first traced DNA sequences in the early 1970s. Fluorescence-based sequencing methods, which use a DNA sequencer, were introduced afterwards [7].

II. LITERATURE REVIEW

Using feature descriptors derived from different physicochemical properties and six classifiers, the authors of [8] created a stacked ensemble model to identify enhancers. The model outperformed previous methods in accuracy, specificity, sensitivity, and correlation coefficient. The researchers in [9] used machine learning with label and k-mer encoding to classify DNA sequences, distinguishing infected from normal genes. Juneja et al. [10] applied a classification algorithm to three datasets to identify gene classes, splitting the sequences into substrings of a defined length for analysis. Mathur et al. [11] proposed a machine learning-based DNA sequence classifier whose feature extraction relies on a hot vector matrix, in which word pairs are represented as a binary matrix of nucleotide positions.

The study [12] analyzes DNA sequences to distinguish between regular and disease-affected genes using ML techniques, particularly AdaBoost and Random forest classifiers for bagging and detection, respectively. Furthermore, an identification cascade structure reduced false-positive results and enhanced reliability. In [13], the authors predicted gene families from human, chimpanzee, and dog DNA sequences using SVM classification. Combining machine learning techniques with a pattern-matching algorithm, the study [14] proposes a model incorporating linear SVM and Naive Bayes for DNA sequence classification. Upadhyay et al. [15] worked with human, chimpanzee, and dog DNA sequences and predicted genetic disorders across 22083 patient records. They performed labeling, correlation, and exploratory analysis, and built prediction systems with classifier models such as SVC, Gradient Boosting, and CatBoost.

Fig. 1: Overview of the suggested ensemble approaches on the three datasets.

III. METHODOLOGY

A. Dataset Insight

In our research, we obtained the gene sequence datasets from a publicly available DNA sequence repository on Kaggle. The datasets can be downloaded from the following link: https://www.kaggle.com/code/khalidmostafa/dna-sequence-classification-using-machine-learning/input. Three datasets are provided: the Human dataset, the Chimpanzee dataset, and the Dog dataset. They are stored in FASTA format, a standard text-based format in bioinformatics for representing nucleotide or amino acid sequences, in which each nucleotide or amino acid is denoted by a single-letter code such as [A, C, G, T, N]. Here A stands for Adenine, C for Cytosine, G for Guanine, and T for Thymine, while N serves as a wildcard for any of these. Tab. I gives an overview of the three datasets.

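TABLE I: Frequency count of each gene class for the three datasets.

| Dataset name | Dataset size | Training set size | Testing set size | Class label | Count |
|---|---|---|---|---|---|
| Human | 4380 | 3504 (80%) | 876 (20%) | 0 | 531 |
| | | | | 1 | 534 |
| | | | | 2 | 349 |
| | | | | 3 | 672 |
| | | | | 4 | 711 |
| | | | | 5 | 240 |
| | | | | 6 | 1343 |
| Chimpanzee | 1682 | 1346 (80%) | 336 (20%) | 0 | 234 |
| | | | | 1 | 185 |
| | | | | 2 | 144 |
| | | | | 3 | 228 |
| | | | | 4 | 261 |
| | | | | 5 | 109 |
| | | | | 6 | 521 |
| Dog | 820 | 656 (80%) | 164 (20%) | 0 | 131 |
| | | | | 1 | 75 |
| | | | | 2 | 64 |
| | | | | 3 | 95 |
| | | | | 4 | 135 |
| | | | | 5 | 60 |
| | | | | 6 | 260 |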
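The class counts in Tab. I can be turned into class shares to make the imbalance between gene classes visible. The following is a minimal sketch that only re-uses the counts from the table; the names `CLASS_COUNTS` and `class_shares` are illustrative and do not come from the original study.

```python
# Class counts copied from Tab. I (labels 0..6 for each dataset).
CLASS_COUNTS = {
    "Human": [531, 534, 349, 672, 711, 240, 1343],
    "Chimpanzee": [234, 185, 144, 228, 261, 109, 521],
    "Dog": [131, 75, 64, 95, 135, 60, 260],
}


def class_shares(counts):
    """Return the fraction of samples that falls in each class."""
    total = sum(counts)
    return [count / total for count in counts]


for name, counts in CLASS_COUNTS.items():
    shares = class_shares(counts)
    # Index of the most frequent class label.
    largest = max(range(len(counts)), key=lambda label: counts[label])
    print(f"{name}: {sum(counts)} sequences, "
          f"largest class = {largest} ({shares[largest]:.1%})")
```

In all three datasets class 6 is the most frequent label, which is worth keeping in mind when reading the accuracy figures.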

B. Feature Matrix Generation

Both K-mer counting and CountVectorizer were used to generate the feature matrix.

Fig. 2: Hexamer substrings (K=6).

1) K-mer Counting: K-mer counting converted the DNA sequence strings into k-mer words with a K value of 6, known as hexamers (K=6). Fig. 2 depicts the hexamer substrings generated for the DNA sequence ATGGGGCACC. Converting DNA sequences into k-mers breaks the genetic information down into smaller, overlapping units. These k-mers serve as the elemental vocabulary for deciphering the genetic language encoded in DNA, and they act as the primary features.
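The k-mer step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `get_kmers` and `to_sentence` are hypothetical, and joining the k-mers with spaces is an assumption about how the "string sentence" for the vectorizer is formed.

```python
def get_kmers(sequence, k=6):
    """Return all overlapping substrings of length k (stride 1)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def to_sentence(sequence, k=6):
    """Join the k-mers with spaces so they can be tokenized like words."""
    return " ".join(get_kmers(sequence, k))


hexamers = get_kmers("ATGGGGCACC")
print(hexamers)
print(to_sentence("ATGGGGCACC"))
```

For the example sequence ATGGGGCACC this yields the five hexamers ATGGGG, TGGGGC, GGGGCA, GGGCAC, and GGCACC.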

2) CountVectorizer: The CountVectorizer was applied to build a BoW (Bag of Words) model based on the counts of 4-grams (tetragrams). The string sentence produced by K-mer counting served as input to the count vectorizer, yielding a bag-of-words model that captures the genetic features encoded in the original DNA sequences. Using the BoW approach, the count vectorizer constructed a sparse matrix in which each entry is the count of a specific 4-gram in a given genetic sequence. Tab. II shows a portion of the generated sparse matrix.

TABLE II: A portion of the sparse matrix generated by the count vectorization technique.

| (index, column) | value |
|---|---|
| (0, 181326) | 1 |
| (0, 178989) | 1 |
| (0, 55066) | 1 |
| (0, 217067) | 1 |
| (0, 171189) | 1 |
| (0, 216740) | 1 |
| (0, 169929) | 1 |
| (0, 211678) | 1 |
| (0, 151026) | 1 |
| (0, 135341) | 1 |
| (0, 74165) | 1 |
| (0, 58623) | 1 |
| (0, 231147) | 1 |
| (0, 227458) | 1 |

The sparse matrix representation of a genetic sequence shows how the CountVectorizer converts raw genetic text data into a structured, numerical format suitable for subsequent analysis and machine learning tasks.
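To make the bag-of-words step concrete, the sketch below counts 4-grams in plain Python and prints the non-zero entries in the same (index, column) form as Tab. II. It is an illustration under stated assumptions: each k-mer is treated as one word, so a 4-gram is a run of four consecutive k-mers, and the helper names are hypothetical. In scikit-learn, `CountVectorizer(ngram_range=(4, 4))` would be the library counterpart; the exact settings used in the study are not given in the text.

```python
from collections import Counter


def word_ngrams(sentence, n=4):
    """Return all runs of n consecutive space-separated tokens."""
    tokens = sentence.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(sentences, n=4):
    """Map every distinct n-gram to a column index (sorted for determinism)."""
    grams = sorted({g for s in sentences for g in word_ngrams(s, n)})
    return {gram: column for column, gram in enumerate(grams)}


def to_sparse_rows(sentences, vocabulary, n=4):
    """Return one {column: count} dict per sentence, storing non-zeros only."""
    rows = []
    for sentence in sentences:
        counts = Counter(word_ngrams(sentence, n))
        rows.append({vocabulary[g]: c for g, c in counts.items() if g in vocabulary})
    return rows


# Hypothetical k-mer sentences (hexamers of two toy sequences).
sentences = [
    "ATGGGG TGGGGC GGGGCA GGGCAC GGCACC",
    "ATGGGG TGGGGC GGGGCA GGGCAT GGCATC",
]
vocabulary = build_vocabulary(sentences)
rows = to_sparse_rows(sentences, vocabulary)
for index, row in enumerate(rows):
    for column, value in sorted(row.items()):
        print((index, column), value)
```

Only non-zero counts are stored, which is why the real matrix in Tab. II can have column indices above 200000 while each row holds comparatively few entries.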

C. Dataset splitting

After the sparse matrix was generated, each dataset was split into two parts: 80% of the data for training and the remaining 20% for testing.
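The 80/20 split can be sketched with the standard library alone. This is a minimal illustration under assumptions: the split is a plain shuffled split (the text does not say whether it was stratified), the seed is arbitrary, and the test-set size is rounded down, which happens to reproduce the set sizes listed in Tab. I.

```python
import random


def train_test_split_indices(n_samples, test_percent=20, seed=0):
    """Shuffle sample indices and split them; the test size is rounded down."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = n_samples * test_percent // 100
    return indices[n_test:], indices[:n_test]


for name, size in [("Human", 4380), ("Chimpanzee", 1682), ("Dog", 820)]:
    train_idx, test_idx = train_test_split_indices(size)
    print(name, len(train_idx), len(test_idx))
```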

D. Classifier models

In this study, several classifier models were trained, including Multinomial naive Bayes (MNB), Logistic regression (LR), Random forest (RF), LightGBM (LGBM), XGBoost (XGB), K-nearest neighbors (KNN), and Decision tree (DT). To obtain the highest accuracy from the K-nearest neighbors model, a loop iteratively evaluated the KNN classifier with the number of neighbors ranging from 1 to 199. Fig. 3 plots how the accuracy of the KNN classifier changes with K for the human, chimpanzee, and dog datasets. Tab. III lists the best K value and the corresponding accuracy for each of the three datasets.

TABLE III: Best K value with corresponding accuracies for the three datasets.

| Dataset name | Best K value | Accuracy |
|---|---|---|
| Human | K=1 | 85.84% |
| Chimpanzee | K=1 | 84.87% |
| Dog | K=3 | 51.22% |
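The search over the number of neighbors can be sketched as a loop that scores each K on the test set and keeps the best one. The example below is a toy, self-contained illustration: the data points, the squared Euclidean distance, and the tie-breaking toward the smaller K are all assumptions, and the study itself swept K from 1 to 199 on the count-vectorized sequences.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Predict the majority label among the k nearest training points."""
    by_distance = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    votes = Counter(train_y[i] for i in by_distance[:k])
    return votes.most_common(1)[0][0]


def accuracy_for_k(train_x, train_y, test_x, test_y, k):
    """Fraction of test points whose predicted label is correct."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_y)


def best_k(train_x, train_y, test_x, test_y, k_values):
    """Return (k, accuracy) with the highest accuracy; ties keep the smaller k."""
    scores = [(accuracy_for_k(train_x, train_y, test_x, test_y, k), k)
              for k in k_values]
    top_accuracy = max(score for score, _ in scores)
    return min(k for score, k in scores if score == top_accuracy), top_accuracy


# Hypothetical 2-D toy data; the point (4, 4.5) is deliberately mislabeled.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (4, 4.5)]
train_y = [0, 0, 0, 1, 1, 1, 0]
test_x = [(1, 1), (4, 4)]
test_y = [0, 1]

print(best_k(train_x, train_y, test_x, test_y, range(1, 8)))
```

In this toy case a single noisy neighbor makes K=1 misclassify one test point, while slightly larger K values recover; on the real datasets the opposite happened for Human and Chimpanzee, where K=1 scored best.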

E. Ensemble model

1) Soft voting ensemble: A soft voting ensemble combines the predictions of multiple base models by aggregating their predicted class probabilities and choosing the class with the highest combined score. Mathematically, for a given sample, let $P_{i,j}$ denote the probability of the $j$-th class predicted by the $i$-th base model, and let $N$ be the number of base models. The soft voting ensemble score $P_{\text{ensemble},j}$ for the $j$-th class is computed as follows:

$$P_{\text{ensemble},j} = \sum_{i=1}^{N} P_{i,j}$$

The predicted class is the one with the largest $P_{\text{ensemble},j}$; dividing the sum by $N$ gives the average probability without changing which class is selected. After picking the best three and best five models, we applied the soft voting ensemble to each of these sets of models.
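Soft voting as defined here can be sketched by summing the per-class probabilities over the base models and taking the class with the largest total. The probabilities below are hypothetical and use three classes for brevity (the task in this paper has seven); the function name `soft_vote` is illustrative.

```python
def soft_vote(model_probabilities):
    """Sum per-class probabilities over models; return (class, scores)."""
    n_classes = len(model_probabilities[0])
    scores = [sum(probs[j] for probs in model_probabilities)
              for j in range(n_classes)]
    predicted = max(range(n_classes), key=lambda j: scores[j])
    return predicted, scores


# Hypothetical class probabilities from three base models for one sample.
probabilities = [
    [0.6, 0.3, 0.1],  # model 1
    [0.2, 0.5, 0.3],  # model 2
    [0.1, 0.5, 0.4],  # model 3
]
label, scores = soft_vote(probabilities)
print(label, [round(s, 2) for s in scores])
```

Here class 1 wins with a combined score of 1.3 even though model 1 on its own prefers class 0, because the other two models assign class 1 more probability mass.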

2) Hard voting ensemble: Hard voting is an ensemble technique that combines the predictions of multiple base models by selecting the class label that receives the majority of votes. Mathematically, let $M$ represent the number of base models in the ensemble and $C$ the number of classes in the classification task, and let $y_{m,i}$ be the class label predicted by the $m$-th model for input sample $i$. The hard voting ensemble prediction $E_{\text{hard},i}$ for sample $i$ is determined as follows:

$$E_{\text{hard},i} = \arg\max_{c} \sum_{m=1}^{M} I(y_{m,i} = c)$$

where $c$ ranges over the $C$ classes and $I(\cdot)$ is the indicator function, equal to 1 when its argument is true and 0 otherwise. After choosing the best three and best five models, we applied the hard voting ensemble to each of these sets of models.
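Hard voting as defined here amounts to counting, for each sample, how many models predicted each label and returning the most frequent one. The sketch below uses hypothetical predictions; breaking ties toward the smallest label is an assumption, since the text does not specify a tie rule.

```python
from collections import Counter


def hard_vote(predicted_labels):
    """Return the label predicted by the most models for one sample."""
    votes = Counter(predicted_labels)
    top = max(votes.values())
    # Tie-break on the smallest label (an assumption, not from the paper).
    return min(label for label, count in votes.items() if count == top)


def hard_vote_batch(predictions_per_model):
    """predictions_per_model[m][i] is model m's label for sample i."""
    return [hard_vote(sample_votes)
            for sample_votes in zip(*predictions_per_model)]


# Hypothetical labels from three base models for four samples.
predictions = [
    [6, 3, 0, 4],  # model 1
    [6, 3, 1, 2],  # model 2
    [4, 3, 1, 0],  # model 3
]
print(hard_vote_batch(predictions))
```

Unlike soft voting, this rule ignores how confident each model is, so a three-way disagreement such as the last sample can only be resolved by the tie rule.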

IV. EXPERIMENTAL RESULTS

In the analysis of the Human.txt, Chimpanzee.txt, and Dog.txt datasets, the soft voting ensemble built on the top three classifier models consistently provided the highest accuracy, as shown in Tab. IV.
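Choosing the "top three" and "top five" models amounts to ranking the base classifiers by test accuracy. A minimal sketch using the human-dataset accuracies reported in Tab. IV is given below; the dictionary and function names are illustrative.

```python
# Test accuracies (%) of the base classifiers on the human dataset (Tab. IV).
HUMAN_ACCURACY = {
    "MNB": 98.40, "LR": 93.95, "RF": 92.24, "LGBM": 91.21,
    "XGB": 89.84, "KNN": 85.84, "DT": 81.15,
}


def top_models(accuracy_by_model, n):
    """Return the names of the n most accurate models, best first."""
    ranked = sorted(accuracy_by_model, key=accuracy_by_model.get, reverse=True)
    return ranked[:n]


print(top_models(HUMAN_ACCURACY, 3))
print(top_models(HUMAN_ACCURACY, 5))
```

For the human dataset this ranking gives MNB, LR, and RF as the top three, with LGBM and XGB completing the top five, which matches the ensembles listed in Tab. IV.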

Human dataset – On the human dataset, Multinomial Naive Bayes, Logistic Regression, Random Forest, LightGBM, XGBoost, K-Nearest Neighbors, and Decision Tree recorded accuracies of 98.40%, 93.95%, 92.24%, 91.21%, 89.84%, 85.84%, and 81.15%, respectively. Among the individual classifiers, Multinomial Naive Bayes took the lead, with accuracy, precision, recall, and F1-score all at 98.40%. Logistic Regression had the second-highest accuracy, with precision, recall, and F1-score of 94.80%, 93.90%, and 94.00%. The soft voting ensemble on the top three models (MNB+LR+RF) achieved the highest accuracy of all proposed models for the human dataset, with accuracy, precision, recall, and F1-score of 98.42%, 98.41%, 98.40%, and 98.40%, respectively. The soft voting ensemble on the top five models (MNB+LR+RF+LGBM+XGB) ranked second among the ensemble models, with an accuracy of 96.23% and precision, recall, and F1-score of 96.50%, 96.20%, and 96.20%, respectively.

Chimpanzee dataset – On the chimpanzee dataset, Multinomial Naive Bayes, Logistic Regression, LightGBM, XGBoost, K-Nearest Neighbors, Random Forest, and Decision Tree achieved accuracies of 91.39%, 89.91%, 88.13%, 86.05%, 84.87%, 84.27%, and 79.23%, respectively. Among these, Multinomial Naive Bayes had the highest accuracy, precision, recall, and F1-score, at 91.39%, 91.80%, 91.40%, and 91.40%. The soft voting ensemble on the top three models (MNB+LR+LGBM) was the best-performing model for the chimpanzee dataset, with accuracy, precision, recall, and F1-score of 92.28%, 92.40%, 92.30%, and 92.10%, respectively. This ensemble improved on every individual classifier by leveraging the strengths of its base models.

Dog dataset – Finally, on the dog dataset, Multinomial Naive Bayes, Logistic Regression, Random Forest, LightGBM, XGBoost, K-Nearest Neighbors, and Decision Tree achieved accuracies of 70.10%, 59.77%, 56.71%, 64.63%, 59.76%, 51.22%, and 53.66%, respectively. Among the individual classifiers, Multinomial Naive Bayes had the best accuracy, with precision, recall, and F1-score of 73.20%, 70.10%, and 69.40%. The soft voting ensemble on the top three models (MNB+LGBM+LR) reached an accuracy of 70.12%, making it the best of all suggested models, with precision, recall, and F1-score of 73.10%, 70.10%, and 69.20%. Accuracy on the dog dataset was substantially lower than on the other two datasets, which we attribute to the limited amount of data.

TABLE IV: Performance evaluation of all recommended models across the Human, Chimpanzee, and Dog datasets.

| Dataset | Model | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) | MAE (%) | MSE (%) | RMSE (%) | RAE (%) | RRSE (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| Human | MNB | 98.40 | 98.40 | 98.40 | 98.40 | 5.10 | 20.90 | 45.70 | 1.50 | 13.00 |
| Human | LR | 93.95 | 94.80 | 93.90 | 94.00 | 19.60 | 80.60 | 89.80 | 5.60 | 25.50 |
| Human | RF | 92.24 | 93.40 | 92.20 | 92.40 | 20.50 | 69.60 | 83.40 | 5.80 | 23.70 |
| Human | LGBM | 91.21 | 92.00 | 91.20 | 91.20 | 29.10 | 12.34 | 11.11 | 8.30 | 31.60 |
| Human | XGB | 89.84 | 91.20 | 89.80 | 90.00 | 37.60 | 169.1 | 130.0 | 10.70 | 37.00 |
| Human | KNN | 85.84 | 92.60 | 85.80 | 87.40 | 32.90 | 87.20 | 93.40 | 9.40 | 26.60 |
| Human | DT | 81.15 | 82.80 | 81.50 | 81.90 | 59.80 | 242.9 | 155.9 | 17.00 | 44.40 |
| Human | Soft voting on top 3 models (MNB+LR+RF) | 98.42 | 98.41 | 98.40 | 98.40 | 7.30 | 28.30 | 53.20 | 2.10 | 15.10 |
| Human | Hard voting on top 3 models (MNB+LR+RF) | 95.67 | 95.90 | 95.70 | 95.70 | 14.70 | 60.80 | 78.00 | 4.20 | 22.20 |
| Human | Soft voting on top 5 models (MNB+LR+RF+LGBM+XGB) | 96.23 | 96.50 | 96.20 | 96.20 | 11.00 | 36.10 | 60.10 | 3.10 | 17.10 |
| Human | Hard voting on top 5 models (MNB+LR+RF+LGBM+XGB) | 94.29 | 94.80 | 98.40 | 98.40 | 16.40 | 65.30 | 80.80 | 4.70 | 23.00 |
| Chimpanzee | MNB | 91.39 | 91.80 | 91.40 | 91.40 | 21.40 | 72.40 | 85.10 | 5.60 | 22.20 |
| Chimpanzee | LR | 89.91 | 91.10 | 89.90 | 89.70 | 30.00 | 119.6 | 109.4 | 7.80 | 28.60 |
| Chimpanzee | RF | 84.27 | 87.30 | 84.30 | 84.10 | 54.60 | 238.0 | 154.3 | 14.30 | 40.30 |
| Chimpanzee | LGBM | 88.13 | 89.10 | 88.10 | 87.90 | 34.10 | 127.9 | 113.1 | 8.90 | 29.60 |
| Chimpanzee | XGB | 86.05 | 86.90 | 86.10 | 85.80 | 38.60 | 147.2 | 121.3 | 10.10 | 31.70 |
| Chimpanzee | KNN | 84.87 | 89.40 | 84.90 | 85.00 | 47.20 | 197.9 | 140.7 | 12.30 | 36.80 |
| Chimpanzee | DT | 79.23 | 79.70 | 79.20 | 79.20 | 54.90 | 189.0 | 137.5 | 14.40 | 35.90 |
| Chimpanzee | Soft voting on top 3 models (MNB+LR+LGBM) | 92.28 | 92.40 | 92.30 | 92.10 | 20.20 | 74.80 | 86.50 | 5.30 | 22.60 |
| Chimpanzee | Hard voting on top 3 models (MNB+LR+LGBM) | 91.10 | 91.80 | 91.10 | 90.80 | 26.70 | 104.5 | 102.2 | 7.00 | 26.70 |
| Chimpanzee | Soft voting on top 5 models (MNB+LR+LGBM+XGB+KNN) | 89.91 | 91.60 | 89.90 | 89.80 | 28.20 | 110.7 | 105.2 | 7.40 | 27.50 |
| Chimpanzee | Hard voting on top 5 models (MNB+LR+LGBM+XGB+KNN) | 89.90 | 91.20 | 89.90 | 89.70 | 27.00 | 100.0 | 100.0 | 7.10 | 26.10 |
| Dog | MNB | 70.10 | 73.20 | 70.10 | 69.40 | 101.2 | 430.5 | 207.5 | 29.40 | 60.30 |
| Dog | LR | 59.77 | 71.20 | 59.80 | 57.60 | 145.7 | 650.6 | 255.1 | 42.40 | 74.20 |
| Dog | RF | 56.71 | 64.40 | 56.70 | 53.40 | 162.8 | 756.7 | 275.1 | 47.30 | 80.00 |
| Dog | LGBM | 64.63 | 68.10 | 64.60 | 63.40 | 108.5 | 420.7 | 205.1 | 31.60 | 59.60 |
| Dog | XGB | 59.76 | 63.50 | 59.80 | 58.90 | 131.1 | 538.4 | 232.0 | 38.10 | 67.50 |
| Dog | KNN | 51.22 | 67.50 | 51.20 | 45.50 | 179.9 | 839.6 | 289.8 | 52.30 | 84.30 |
| Dog | DT | 53.66 | 53.30 | 53.70 | 52.50 | 142.7 | 545.1 | 233.5 | 41.50 | 67.90 |
| Dog | Soft voting on top 3 models (MNB+LGBM+LR) | 70.12 | 73.10 | 70.10 | 69.20 | 100.6 | 425.0 | 206.2 | 29.30 | 59.90 |
| Dog | Hard voting on top 3 models (MNB+LGBM+LR) | 66.50 | 71.90 | 66.50 | 64.60 | 109.8 | 458.5 | 214.1 | 31.90 | 62.30 |
| Dog | Soft voting on top 5 models (MNB+LGBM+LR+XGB+RF) | 67.10 | 69.70 | 67.10 | 65.60 | 134.8 | 601.8 | 245.3 | 39.20 | 71.30 |
| Dog | Hard voting on top 5 models (MNB+LGBM+LR+XGB+RF) | 62.20 | 67.60 | 62.20 | 60.10 | 134.1 | 576.8 | 240.2 | 39.00 | 69.80 |

Best performer model per dataset: Human, soft voting on top 3 models (MNB+LR+RF); Chimpanzee, soft voting on top 3 models (MNB+LR+LGBM); Dog, soft voting on top 3 models (MNB+LGBM+LR).

Fig. 3: Evaluation of the KNN classifier (accuracy metric) as the number of nearest neighbors changes, for the three datasets. Panels: (a) K value vs. accuracy (Human); (b) K value vs. accuracy (Chimpanzee); (c) K value vs. accuracy (Dog).

Fig. 4: Confusion matrices for the Human dataset across all suggested models. Panels: (a) Decision tree; (b) LightGBM; (c) Logistic regression; (d) Multinomial naive Bayes; (e) Random forest; (f) XGBoost; (g) Soft voting on top 3 (MNB+LR+RF); (h) Hard voting on top 3 (MNB+LR+RF); (i) Soft voting on top 5 (MNB+LR+RF+LGBM+XGB); (j) Hard voting on top 5 (MNB+LR+RF+LGBM+XGB).

Fig. 5: Confusion matrices for the Chimpanzee dataset across all suggested models. Panels: (a) Decision tree; (b) LightGBM; (c) Logistic regression; (d) Multinomial naive Bayes; (e) Random forest; (f) XGBoost; (g) Soft voting on top 3 (MNB+LR+LGBM); (h) Hard voting on top 3 (MNB+LR+LGBM); (i) Soft voting on top 5 (MNB+LR+LGBM+XGB+KNN); (j) Hard voting on top 5 (MNB+LR+LGBM+XGB+KNN).

Fig. 6: Confusion matrices for the Dog dataset across all suggested models. Panels: (a) Decision tree; (b) LightGBM; (c) Logistic regression; (d) Multinomial naive Bayes; (e) Random forest; (f) XGBoost; (g) Soft voting on top 3 (MNB+LGBM+LR); (h) Hard voting on top 3 (MNB+LGBM+LR); (i) Soft voting on top 5 (MNB+LGBM+LR+XGB+RF); (j) Hard voting on top 5 (MNB+LGBM+LR+XGB+RF).

V. DISCUSSION AND CONCLUSION

Both our soft and hard voting ensemble models were applied to all three species datasets to assess cross-species performance. The choice of K in k-mer counting was significant, as it determines the length of the subsequences considered and therefore which patterns and characteristics of the genetic data are captured. We explored the models' performance for K values from 1 to 6. The algorithms performed best at K=6, that is, with substrings of 6 nucleotides, and deviating from this value reduced performance. In addition, the CountVectorizer was used to build a BoW (Bag of Words) model based on the counts of 4-grams; tetragram vectorization gave the highest accuracy, which led us to choose tetragram tokenization. We also found that the soft voting ensemble was consistently more accurate than the hard voting ensemble, and that ensembles built on the top three models consistently outperformed those built on the top five. From this study we therefore conclude that adding more classifier models can degrade the performance of a voting ensemble.
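As a rough illustration of why the choice of K in k-mer counting matters, the sketch below lists, for K from 1 to 6, how many distinct k-mers are possible and how many overlapping k-mers a 10-nucleotide sequence such as ATGGGGCACC contains. It assumes the four-letter alphabet A, C, G, T and ignores the N wildcard; the function names are illustrative.

```python
ALPHABET = "ACGT"


def max_kmer_vocabulary(k, alphabet=ALPHABET):
    """Number of distinct k-mers that can be formed from the alphabet."""
    return len(alphabet) ** k


def kmers_per_sequence(length, k):
    """Number of overlapping k-mers in a sequence of the given length."""
    return max(length - k + 1, 0)


for k in range(1, 7):
    print(k, max_kmer_vocabulary(k), kmers_per_sequence(10, k))
```

Larger K values give a much richer vocabulary (4096 possible hexamers versus 4 single nucleotides) at the cost of slightly fewer tokens per sequence.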

REFERENCES

[1] K. C. Chou and H. B. Shen, "Predicting eukaryotic protein subcellular location by fusing optimized evidence-theoretic K-Nearest Neighbor classifiers," J. Proteome Res., vol. 5, no. 8, pp. 1888–1897, Aug. 2006, doi: 10.1021/pr060167c. PMID: 16889410.

[2] M. Akhtar, J. Epps, and E. Ambikairajah, "On DNA numerical representations for period-3 based exon prediction," in 2007 IEEE International Workshop on Genomic Signal Processing and Statistics, 2007, pp. 1–4.

[3] M. Akhtar, J. Epps, and E. Ambikairajah, "Signal processing in sequence analysis: Advances in eukaryotic gene prediction," IEEE Journal of Selected Topics in Signal Processing, vol. 2, no. 3, pp. 310–321, June 2008, doi: 10.1109/JSTSP.2008.923854.

[4] P. Ramachandran, W. S. Lu, and A. Antoniou, "Filter-based methodology for the location of hot spots in proteins and exons in DNA," IEEE Trans. Biomed. Eng., vol. 59, no. 6, pp. 1598–1609, Jun. 2012, doi: 10.1109/TBME.2012.2190512. PMID: 22410955.

[5] W. Kinsner, "Towards cognitive analysis of DNA," in 9th IEEE International Conference on Cognitive Informatics (ICCI'10), Beijing, China, 2010, pp. 6–7, doi: 10.1109/COGINF.2010.5599728.

[6] Z. Wang, Y. Chen, and Y. Li, "A brief review of computational gene prediction methods," Genomics Proteomics Bioinformatics, vol. 2, no. 4, pp. 216–221, Nov. 2004, doi: 10.1016/s1672-0229(04)02028-5. PMID: 15901250; PMCID: PMC5187414.

[7] O. Olsvik, J. Wahlberg, B. Petterson, M. Uhlén, T. Popovic, I. K. Wachsmuth, and P. I. Fields, "Use of automated sequencing of polymerase chain reaction-generated amplicons to identify three types of cholera toxin subunit B in Vibrio cholerae O1 strains," J. Clin. Microbiol., vol. 31, no. 1, pp. 22–25, Jan. 1993, doi: 10.1128/jcm.31.1.22-25.1993. PMID: 7678018; PMCID: PMC262614.

[8] B. A. Mir, M. U. Rehman, H. Tayara, and K. T. Chong, "Improving enhancer identification with a multi-classifier stacked ensemble model," Journal of Molecular Biology, vol. 435, no. 23, p. 168314, 2023.

[9] S. Sarkar, K. Mridha, A. Ghosh, and R. N. Shaw, "Machine learning in bioinformatics: New technique for DNA sequencing classification," in Advanced Computing and Intelligent Technologies: Proceedings of ICACIT 2022. Springer, 2022, pp. 335–355.

[10] S. Juneja, A. Dhankhar, A. Juneja, and S. Bali, "An approach to DNA sequence classification through machine learning: DNA sequencing, k mer counting, thresholding, sequence analysis," International Journal of Reliable and Quality E-Healthcare (IJRQEH), vol. 11, no. 2, pp. 1–15, 2022.

[11] G. Mathur, A. Pandey, and S. Goyal, "A comprehensive tool for rapid and accurate prediction of disease using DNA sequence classifier," Journal of Ambient Intelligence and Humanized Computing, vol. 14, no. 10, pp. 13869–13885, 2023.

[12] S. S. Kanumalli, S. Swathi, K. Sukanya, V. Yamini, and N. Nagalakshmi, "Classification of DNA sequence using machine learning," in Soft Computing for Security Applications: Proceedings of ICSCS 2022. Springer, 2022, pp. 723–732.

[13] J. Rexie, K. Raimond, D. Brindha, and A. K. Prabavathy, "K-mer based prediction of gene family by applying multinomial naïve Bayes algorithm in DNA sequence," in AIP Conference Proceedings, vol. 2914, no. 1. AIP Publishing, 2023.

[14] B. A. Hamed, O. A. S. Ibrahim, and T. Abd El-Hafeez, "Optimizing classification efficiency with machine learning techniques for pattern matching," Journal of Big Data, vol. 10, no. 1, p. 124, 2023.

[15] V. Upadhyay, S. Harbhajanka, S. Pangaonkar, and R. Gunjan, "Exploratory data analysis and prediction of human genetic disorder and species using DNA sequencing," in Proceedings of the Future Technologies Conference. Springer, 2023, pp. 197–213.

ResearchGate has not been able to resolve any citations for this publication.

Optimizing classification efficiency with machine learning techniques for pattern matching

Article

Full-text available

Jul 2023

The study proposes a novel model for DNA sequence classification that combines machine learning methods and a pattern-matching algorithm. This model aims to effectively categorize DNA sequences based on their features and enhance the accuracy and efficiency of DNA sequence classification. The performance of the proposed model is evaluated using various machine learning algorithms, and the results indicate that the SVM linear classifier achieves the highest accuracy and F1 score among the tested algorithms. This finding suggests that the proposed model can provide better overall performance than other algorithms in DNA sequence classification. In addition, the proposed model is compared to two suggested algorithms, namely FLPM and PAPM, and the results show that the proposed model outperforms these algorithms in terms of accuracy and efficiency. The study further explores the impact of pattern length on the accuracy and time complexity of each algorithm. The results show that as the pattern length increases, the execution time of each algorithm varies. For a pattern length of 5, SVM Linear and EFLPM have the lowest execution time of 0.0035 s. However, at a pattern length of 25, SVM Linear has the lowest execution time of 0.0012 s. The experimental results of the proposed model show that SVM Linear has the highest accuracy and F1 score among the tested algorithms. SVM Linear achieved an accuracy of 0.963 and an F1 score of 0.97, indicating that it can provide the best overall performance in DNA sequence classification. Naive Bayes also performs well with an accuracy of 0.838 and an F1 score of 0.94. The proposed model offers a valuable contribution to the field of DNA sequence analysis by providing a novel approach to pre-processing and feature extraction. The model’s potential applications include drug discovery, personalized medicine, and disease diagnosis. 
The study’s findings highlight the importance of considering the impact of pattern length on the accuracy and time complexity of DNA sequence classification algorithms.

An Approach to DNA Sequence Classification Through Machine Learning: DNA Sequencing, K Mer Counting, Thresholding, Sequence Analysis

Article

Full-text available

Jan 2022

Machine learning (ML) has been instrumental in optimal decision making through relevant historical data, including the domain of bioinformatics. In bioinformatics classification of natural genes and the genes that are infected by disease called invalid gene is a very complex task. In order to find the applicability of a fresh protein through genomic research, DNA sequences need to be classified. The current work identifies classes of DNA sequence using machine learning algorithm. These classes are basically dependent on the sequence of nucleotides. With a fractional mutation in sequence, there is a corresponding change in the class. Each numeric instance representing a class is linked to a gene family including G protein coupled receptors, tyrosine kinase, synthase, etc. In this paper, the authors applied the classification algorithm on three types of datasets to identify which gene class they belong to. They converted sequences into substrings with a defined length. That ‘k value' defines the length of substring which is one of the ways to analyze the sequence.

A comprehensive tool for rapid and accurate prediction of disease using DNA sequence classifier

Article

Full-text available

Jun 2022

In the current pandemic situation where the coronavirus is spreading very fast that can jump from one human to another. Along with this, there are millions of viruses for example Ebola, SARS, etc. that can spread as fast as the coronavirus due to the mobilization and globalization of the population and are equally deadly. Earlier identification of these viruses can prevent the outbreaks that we are facing currently as well as can help in the earlier designing of drugs. Identification of disease at a prior stage can be achieved through DNA sequence classification as DNA carries most of the genetic information about organisms. This is the reason why the classification of DNA sequences plays an important role in computational biology. This paper has presented a solution in which samples collected from NCBI are used for the classification of DNA sequences. DNA sequence classification will in turn gives the pattern of various diseases; these patterns are then compared with the samples of a newly infected person and can help in the earlier identification of disease. However, feature extraction always remains a big issue. In this paper, a machine learning-based classifier and a new technique for extracting features from DNA sequences based on a hot vector matrix have been proposed. In the hot vector representation of the DNA sequence, each pair of the word is represented using a binary matrix which represents the position of each nucleotide in the DNA sequence. The resultant matrix is then given as an input to the traditional CNN for feature extraction. The results of the proposed method have been compared with 5 well-known classifiers namely Convolution neural network (CNN), Support Vector Machines (SVM), K-Nearest Neighbor (KNN) algorithm, Decision Trees, Recurrent Neural Networks (RNN) on several parameters including precision rate and accuracy and the result shows that the proposed method gives an accuracy of 93.9%, which is highest compared to other classifiers.

K-mer based prediction of gene family by applying multinomial naïve bayes algorithm in DNA sequence

Conference Paper

Jan 2023

Exploratory Data Analysis and Prediction of Human Genetic Disorder and Species Using DNA Sequencing

Chapter

Nov 2023

Expressing genetic information through sequencing models for DNA/RNA proteins built with machine learning algorithms is a major area of exploration and a growing need. This work was intended to identify, predict, and classify gene families based on DNA sequences with medical anomalies for the early diagnosis of genetic variation. The study assessed gene sequences from three DNA sequence text files containing 4380 human, 1682 chimpanzee, and 820 dog DNA sequences. The genetic disorder dataset includes 35 features that were used to predict genetic abnormalities across 22083 patient records. Labelling, correlation analysis, exploratory data analysis, and prediction systems were carried out for both datasets. The prediction systems were built using Logistic Regression, Gaussian Naive Bayes, K Neighbors, Decision Tree, Random Forest, Gradient Boosting, CatBoost, Multinomial Naive Bayes, and SVC classifiers. The Multinomial Naive Bayes classifier achieved the best accuracy, 94.42%, on the DNA sequencing dataset, while the K Neighbors, Decision Tree, Random Forest, and SVC classifiers achieved 71.98%, 74.85%, 86.82%, and 79.6%, respectively. For the genetic disorder dataset, the best-performing model was CatBoost, with a 54.72% R2CV score. Logistic Regression, Gaussian Naive Bayes, K Neighbors, Decision Tree, Random Forest, Extreme Gradient Boosting, Light Gradient Boosting Machine, and Gradient Boosting classifiers achieved R2CV scores of 47.36%, 34.16%, 45.27%, 40.83%, 52.36%, 48.89%, 48.75%, and 53.34%, respectively. In future work, genetic disorders will be classified based on extensive medical history, sequence data, deep learning models, federated machine learning, and transfer learning.

Improving Enhancer Identification with a Multi-Classifier Stacked Ensemble Model

Article

Oct 2023
J MOL BIOL

Classification of DNA Sequence Using Machine Learning

Chapter

Sep 2022

In the field of medical information research, the genetic series is widely used as a component of a category. One of the applications of ML is biochemistry. Bioinformatics is an interdisciplinary science that uses computer and communication science to understand biological data. One of its most difficult tasks is to distinguish between regular genes and disease-causing genes. The classification of gene sequences into existing categories is used in genomic research to discover the functions of novel proteins. As a result, it is critical to identify and categorize such genes. We employ ML classification methods to distinguish between infected and normal genes. AdaBoost has a high degree of precision; relative to the bagging and random forest algorithms, AdaBoost fully considers the weight of each classifier. To generate a sequence of weak classifiers, an AdaBoost-based learning approach is used to find the most ‘informative’ or ‘discriminating’ features. The identification cascade structure can also help to limit false-positive results. This study provides an overview of the mechanics of gene sequence classification using ML techniques, including a brief introduction to bioinformatics and important challenges in DNA sequencing with ML.

Machine Learning in Bioinformatics: New Technique for DNA Sequencing Classification

Chapter

Aug 2022

The extraction of useful information from deoxyribonucleic acid (DNA) is a major component of bioinformatics research, and DNA sequence categorization has a variety of applications, including genomic and biomedical data processing. DNA sequence classification is a critical problem in a general computational framework for biomedical data processing, and numerous machine learning techniques have been used to complete this task in recent years. Machine learning is a data processing technique that uses training data to create judgments, predictions, classifications, and recognitions. To learn the functions of a new protein, genomic researchers classify DNA sequences into known categories. As a result, it is critical to discover and characterize those genes. We employ machine learning approaches to distinguish between infected and normal genes using classification methods. In this study, we used the multinomial Naive Bayes classifier, SVM, KNN, and others to classify DNA sequences using label and k-mer encoding. Different categorization metrics are used to evaluate the models. The multinomial Naive Bayes classifier, SVM, KNN, decision tree, random forest, and logistic regression with k-mer encoding all have good accuracy on testing data, with 93.16% and 93.13%, respectively.

Keywords: DNA, Bioinformatics, Computational framework, Machine learning, Classification
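
The k-mer encoding mentioned in this abstract can be illustrated with a small sketch: a sequence is split into overlapping substrings of length k, and the counts of those substrings serve as features for a classifier such as multinomial Naive Bayes. The function name `kmer_counts` and the default `k=6` are assumptions chosen for illustration; the abstract does not state the value of k that was used.

```python
from collections import Counter


def kmer_counts(sequence, k=6):
    """Count overlapping k-mers (substrings of length k) in a DNA sequence.

    The default k=6 is an illustrative assumption, not a value taken
    from the cited work. Sequences shorter than k yield an empty Counter.
    """
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    counts = kmer_counts("ATGCATGC", k=4)
    for kmer, count in sorted(counts.items()):
        print(kmer, count)
```

In practice the per-sequence counts would be aligned into a fixed vocabulary of k-mers so that every sequence maps to a vector of the same length before being passed to a classifier.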

On DNA numerical representations for period-3 based exon prediction

Article

Jan 2007

Towards cognitive analysis of DNA

Conference Paper

Aug 2010

Witold Kinsner

Summary form only given. Deoxyribonucleic acid (DNA) has become one of the most examined molecules on the planet. Scientists around the world have been trying to unravel its secrets for many purposes. For example, genetic information is currently used to raise better plants and animals, create enhanced pharmaceuticals for humans, and for gene therapy in medicine. Science as a whole has benefited from the study of genetics because of the increased understanding of biological processes that all organisms share. In recent decades, a significant amount of research has been directed towards sequencing and understanding the entire human genome through the Human Genome Project (HGP) launched in 1986. The goal of the HGP was to find the location of the approximately 1×10^5 human genes, and read the entire sequence of the human genome (about 3×10^9 base pairs, bp). An exponential growth rate of that research resulted in reaching the goal by 2003. Similarly, the speed of finding genes and their locations is also increasing rapidly. On the other hand, the traditional methods of finding genes and their locations on chromosomes through testing their biological function have been inherently slow. Although numerous faster techniques have been developed, there is still a need to augment them with new approaches. Therefore, robust computational solutions to the gene-finding problem could provide a valuable resource for the HGP and for the molecular-biology community. Most of the current research in deciphering the meaning of DNA sequences is approached from the lowest base-pair level. Its main objective is to search for patterns or correlations existing in the DNA sequence related to codons, amino acids, and proteins. A number of gene-finding systems have been developed in recent decades.
These systems use a variety of sophisticated computational data-mining techniques, including neural networks, dynamic programming, rule-based methods, decision trees, probability reasoning, hidden Markov chains, genetic programming, and support vector machines. Most of these approaches are based on local measures only. In addition, many of the techniques rely on the statistical qualities of exons in the gene, thus using only the known gene pool as a training set for their classification. Although the techniques have demonstrated limited success, better techniques should be developed. An approach to finding such improved techniques is to consider long-range relations (in addition to short-range relations) in the DNA sequence, spanning 10^4 nucleotides. If we had a good technique to measure such long-range relations, we would be able to estimate any existing self-affinity (fractality) in the DNA sequence, without any a priori assumptions about its structure. This would be a data-driven approach, rather than the common model-driven approach. Along those lines, preliminary results have already been reported in the literature on a local self-similarity with a 180 bp periodicity in mammalian nuclear DNA sequence. Other publications have provided evidence that the long-range fractal correlations appear in DNA sequences with different values in different regions of the sequence. This paper describes such a multiscale approach, together with an algorithm based on a multifractal analysis, and demonstrates that multifractal estimates can be used to characterize DNA sequences [1], [2], [3]. This multifractal approach appears to be new, and may provide a key to cognitive analysis of DNA sequences. It should be clear that the DNA sequencing and gene finding techniques constitute a subset of bioinformatics, the science of using information to understand biology, with its numerous tools.
In turn, bioinformatics is a subset of computational biology which is the application of quantitative analytical techniques in modelling biological systems. Very often, for structural biologists, DNA is not just a sequence of symbols, but implies 3D structures, molecular shapes an
