
Artificial intelligence-based colon cancer prediction by identifying genomic biomarkers

March 2022
Medical Records 4(2)


DOI:10.37990/medr.1077024

Authors:

Nur Paksoy

Kahramanmaras Sutcu Imam University

Fatma Hilal Yagin

Inonu University



Med Records 2022;4(2):196-202


MEDICAL RECORDS-International Medical Journal

Articial Intelligence-based Colon Cancer Prediction by

Identifying Genomic Biomarkers


Nur Paksoy, Fatma Hilal Yagin

Malatya Fahri Kayahan Healthcare Center, Department of Family Medicine Physician, Malatya, Turkey

Inonu University, Faculty of Medicine, Department of Biostatistics and Medical Informatics, Malatya, Turkey

Content of this journal is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Received: 22.02.2022 Accepted: 28.04.2022

Corresponding Author: Fatma Hilal Yagin, Inonu University, Faculty of Medicine, Department of Biostatistics and

Medical Informatics, Malatya, Turkey, E-mail: hilal.yagin@inonu.edu.tr

Abstract

Aim: Colon cancer is the third most common type of cancer worldwide. Because of the poor prognosis and unclear preoperative staging, genetic biomarkers have become more important in the diagnosis and treatment of the disease. In this study, we aimed to determine biomarker candidate genes for colon cancer and to develop a model that can predict colon cancer based on these genes.

Material and Methods: A dataset containing the expression levels of 2000 genes from 62 samples (22 healthy and 40 tumor tissues), obtained by the Princeton University Gene Expression Project and shared in the figshare database, was used. Data were summarized as mean ± standard deviation. The independent samples t-test was used for statistical analysis. The SMOTE method was applied before feature selection to eliminate the class imbalance problem in the dataset. The 13 most important genes that may be associated with colon cancer were selected with the LASSO feature selection method. Random Forest (RF), Decision Tree (DT), and Gaussian Naive Bayes methods were used in the modeling phase.

Results: All 13 genes selected by LASSO differed significantly between normal and tumor samples. In the model created with RF, accuracy, specificity, F1-score, sensitivity, and negative and positive predictive values were all calculated as 1. The RF method offered the highest performance when compared to DT and Gaussian Naive Bayes.

Conclusion: In this study, we identified genomic biomarkers of colon cancer and classified the disease with a high-performance model. According to our results, the LASSO+RF approach can be recommended for modeling high-dimensional microarray data.

Keywords: Colon cancer, microarray, genomics, LASSO, random forest, decision tree, gaussian naive bayes


Research Article


INTRODUCTION

According to the World Health Organization, cancer is the second leading cause of death after cardiovascular disease. Colon cancer ranks third worldwide in terms of incidence and fourth in terms of cancer-related mortality. With the introduction of screening programs in the USA over the last 30 years, cancer prognosis has improved thanks to early diagnosis; a similar screening program has been implemented in Turkey since 2009 (1,2). Because staging can only be performed postoperatively and prognosis is determined by staging alone, biomarkers and genetic evaluation have gained importance in colon cancer. For this reason, examining colon cancer on the basis of genetic biomarkers is very important in the diagnosis and treatment of the disease (3).

Microarray technology allows the expression of thousands of genes to be measured simultaneously. Identifying disease-related biomarker candidate genes from microarray gene expression datasets and distinguishing (classifying) disease samples from non-disease samples has been an important research topic in biomedicine and medicine. However, the resulting large-scale datasets pose many obstacles for computational techniques. Most microarray gene expression datasets suffer from the high-dimensionality problem: the number of features is high (up to tens of thousands of genes) while the sample size is small (typically up to a few hundred). In addition, the high noise-to-variability ratio of microarray experiments adds to the difficulties (4).

Machine learning methods are frequently used to overcome these challenges. Machine learning can be defined as the extraction of previously unknown, valid and actionable information from large volumes of data through a dynamic process. Many techniques are used in this process, such as clustering, data summarization, learning classification rules, finding dependency networks, developing predictive models, variability analysis and anomaly detection. With machine learning, hidden information is extracted from database systems that hold large volumes of data. This process draws on statistics, mathematical disciplines, modeling techniques, database technology and various computer programs (5,6).

Before constructing classification models with high-dimensional microarray datasets, an important step is to select the disease-related genes from the dataset using feature (gene) selection methods. In this way, biomarker candidate genes can be identified and the performance of the resulting classification models is improved (7).

In this study, we aimed to determine biomarker candidate genes for colon cancer using a gene expression dataset and to develop a classification model that can provide clinical decision support to healthcare professionals.

MATERIAL AND METHOD

Dataset

In this study, an open-source colon cancer gene expression dataset obtained by the Princeton University Gene Expression Project and shared in the figshare database (https://figshare.com/articles/dataset/The_microarray_dataset_of_colon_cancer_in_csv_format_/13658790/1) was used (8). The dataset includes the expression levels of 2000 genes from 62 samples (22 healthy and 40 tumor tissues).

Statistical Evaluation

Data were summarized as mean ± standard deviation. Conformity to the normal distribution was assessed with the Kolmogorov-Smirnov test. The independent samples t-test was used for statistical analysis. Test results with a p-value below 0.05 were considered significant. All statistical analyses were performed using IBM SPSS Statistics for Windows version 26.0 (New York, USA).
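As an illustration of the independent samples t-test named above, the following minimal Python sketch computes the pooled-variance (Student) t statistic for one gene. It is not the SPSS procedure used in the study; the function name `independent_t` and the expression values are assumptions made up for the example. The statistic is computed as normal minus tumor, so a positive value indicates higher mean expression in normal tissue.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(sample_a, sample_b):
    """Student's t statistic for two independent samples (pooled variance)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Pooled variance: each sample variance weighted by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / (n_a + n_b - 2)
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Made-up expression values for one gene (not taken from the dataset).
normal = [2.1, 2.5, 1.9, 2.3]
tumor = [1.4, 1.6, 1.2, 1.5]
t_value = independent_t(normal, tumor)
print(f"t = {t_value:.2f}")
```

The p-value would then be read from the t distribution with n_a + n_b - 2 degrees of freedom, which a statistics package normally provides.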

Data Preprocessing and Modeling

In datasets with a class imbalance problem, most machine learning techniques neglect the minority class and therefore perform poorly on it. One approach to such datasets is to oversample the minority class; a widely used method of this kind is the Synthetic Minority Oversampling Technique, or SMOTE for short (9). To eliminate the class imbalance in the colon cancer gene expression dataset (22 normal and 40 tumor tissues), the SMOTE method was applied before feature selection. In this way, the group sizes were equalized at 40 normal and 40 tumor samples.
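The core idea of SMOTE can be sketched in a few lines of plain Python: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbours. This is a simplified sketch, not the implementation used in the study; the function name `smote`, the choice of k = 5 neighbours, the seeds and the two-feature toy data are all assumptions.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [s for s in minority if s is not base]
        others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy stand-in for the 22 normal samples with two "genes" each (values are made up).
toy_rng = random.Random(1)
normal = [[toy_rng.gauss(0, 1), toy_rng.gauss(0, 1)] for _ in range(22)]

# Add 18 synthetic samples so that the normal class reaches 40, as in the text.
balanced_normal = normal + smote(normal, n_new=40 - len(normal))
print(len(balanced_normal))
```

Because every synthetic point is a convex combination of two real samples, it always lies inside the range spanned by the original minority class.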

Afterwards, the 13 most important genes that may be associated with colon cancer were selected with the LASSO feature selection method. To assess the generalizability of the model, the dataset was split into a training set (80%) and a test set (20%). Random Forest, Decision Tree and Gaussian Naive Bayes classification methods were used to predict colon cancer based on the selected genes. The performance of the models was evaluated with accuracy, specificity, sensitivity, F1-score, negative predictive value and positive predictive value.

LASSO Feature Selection

The LASSO method was first introduced by Robert Tibshirani in 1996. Regularization and feature selection are its two main tasks. LASSO places a constraint on the sum of the absolute values of the model parameters: the sum must be less than a fixed value (upper bound). To achieve this, the method applies a shrinkage (regularization) process that penalizes the coefficients of the regression variables and shrinks some of them to zero. During feature selection, the variables that still have a non-zero coefficient after shrinkage are selected for the model. This procedure aims to minimize the prediction error. In practice, the tuning parameter that controls the strength of the penalty is of great importance. When it is large enough, coefficients are forced to exactly zero, which reduces the dimensionality; the larger the parameter, the more coefficients are reduced to zero. Using LASSO has several advantages. First, it can provide very good prediction accuracy, since shrinking and removing coefficients can reduce variance without a substantial increase in bias. This is especially useful when the dataset has few observations and many variables. LASSO also improves the interpretability of the model by eliminating irrelevant variables that are not associated with the response variable, which helps to address overfitting as well (10,11).
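The shrinkage behaviour described above can be illustrated with a small coordinate-descent sketch in plain Python. It minimizes (1/2n)·Σ(y − Xb)² + penalty·Σ|b| without an intercept, which is one common way of writing the LASSO objective; it is not the software used in the study, and the function names, the toy design matrix and the penalty values are assumptions chosen so that the result is easy to check by hand.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero; anything inside [-penalty, penalty] becomes 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_iter=100):
    """Cyclic coordinate descent for the LASSO (no intercept, features not rescaled)."""
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                pred_without_j = sum(X[i][k] * coef[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
            rho /= n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            coef[j] = soft_threshold(rho, penalty) / scale
    return coef


# Toy data: two orthogonal features; y depends strongly on the first (weight 2)
# and only weakly on the second (weight 0.1). Values are made up.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [2.1, -1.9, 1.9, -2.1]

coef = lasso_coordinate_descent(X, y, penalty=0.5)
selected = [j for j, b in enumerate(coef) if b != 0.0]
print(coef, selected)
```

With the moderate penalty the weak feature is driven to exactly zero and only the first feature is kept, while a sufficiently large penalty removes every feature; this is the sense in which features with non-zero coefficients are the ones selected.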

Random Forest

The Random Forest (RF) algorithm, an ensemble learning method, aims to increase classification accuracy by generating multiple decision trees during the classification process. Because it combines random sampling with the improved properties of ensemble techniques, RF generalizes better and makes more reliable predictions than conventional machine learning methods. The accuracy of RF is attributed to its low bias and the low correlation between trees. Low bias is obtained by growing rather large trees, and low correlation is achieved by making the trees as different from each other as possible. Individually built classification and regression trees come together to form the decision forest. Each decision tree is built on a randomly selected subset of the dataset. The results of the individual trees are combined to produce the final prediction. For classification, trees are grown until each leaf node contains only members of one class. For regression, trees continue to split until only a small number of units remain in each leaf node (12).
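The two sources of randomness and the final combination step can be sketched as follows. This is only a skeleton of the ensemble logic, with tree training omitted; the helper names are hypothetical. The per-split random feature subset is part of the standard Random Forest algorithm rather than something stated in the text above, and the subset size of 3 out of 13 features is an arbitrary illustrative choice.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement; each tree is trained on one such sample."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, subset_size, rng):
    """Pick the candidate features considered at a single split."""
    return sorted(rng.sample(range(n_features), subset_size))


def majority_vote(tree_predictions):
    """Combine the class labels predicted by the individual trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)
sample_ids = list(range(10))                 # stand-in for training rows
print(bootstrap_sample(sample_ids, rng))     # some rows repeat, some are left out
print(random_feature_subset(13, 3, rng))     # e.g. 3 of the 13 selected genes
print(majority_vote(["tumor", "normal", "tumor"]))
```

Training each tree on a different bootstrap sample and restricting each split to a random feature subset is what keeps the trees weakly correlated, so that their errors tend to cancel in the vote.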

Decision Trees

Decision trees (DT) consist of a root node, branches and leaves. The leaves are where classification occurs, and the branches represent the outcomes of the tests that lead to them. The tree is built by successive splitting from the root node down to the leaf nodes. A decision node can have one or more branches. A decision tree can handle both categorical and numerical data. Building a decision tree involves two basic steps: splitting and pruning (13). The most important step when creating a DT is deciding which attribute to split on and which branches to create. In the information gain and gain ratio approaches, which are based on entropy, every available attribute is tested and the attribute with the highest information gain is selected for branching. DT is a classification method that builds a model in the form of a tree of decision nodes and leaf nodes from the features and the target. The decision tree algorithm proceeds by dividing the dataset into progressively smaller subsets (14,15).
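The entropy-based splitting criterion mentioned above can be made concrete with a short Python sketch. It computes the Shannon entropy of a set of class labels and the information gain of splitting a numerical attribute at a threshold; the function names, the expression values and the thresholds are assumptions for illustration only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting at values <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    weighted = sum(
        len(part) / len(labels) * entropy(part) for part in (left, right) if part
    )
    return entropy(labels) - weighted


# Made-up expression values of one gene and the corresponding tissue labels.
expression = [0.1, 0.2, 0.8, 0.9]
tissue = ["normal", "normal", "tumor", "tumor"]

print(information_gain(expression, tissue, threshold=0.5))   # perfect split
print(information_gain(expression, tissue, threshold=0.15))  # weaker split
```

A tree-building algorithm would evaluate such candidate splits for every attribute and branch on the one with the highest gain.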

Gaussian Naive Bayes

Naive Bayes is a simply structured classifier based on conditional probability that assumes all attributes are equally important and independent of each other. Classification is performed by combining the effects of the different attributes on the outcome. Naive Bayes classifies using statistical methods and is regarded as an important algorithm in terms of performance. The Gaussian Naive Bayes (GNB) classifier is the Naive Bayes variant that assumes the feature values follow a Gaussian distribution given the class label. GNB assigns each sample to the closest class. However, instead of using the Euclidean distance, it measures closeness by taking into account the distance from the class mean and the class variance (16).
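A minimal Gaussian Naive Bayes classifier can be written in plain Python to show how the class mean and variance enter the decision. This is a sketch under the independence assumption described above, not the implementation used in the study; the function names, the small variance-smoothing constant and the toy data are assumptions.

```python
from math import log, pi
from statistics import mean, pvariance


def fit_gnb(X, y):
    """Estimate the prior and the per-feature mean and variance of each class."""
    model = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        columns = list(zip(*rows))
        model[label] = {
            "prior": len(rows) / len(y),
            "means": [mean(col) for col in columns],
            # A small constant keeps every variance strictly positive.
            "vars": [pvariance(col) + 1e-9 for col in columns],
        }
    return model


def predict_gnb(model, x):
    """Return the class with the highest log-posterior for sample x."""
    def log_posterior(params):
        total = log(params["prior"])
        for value, mu, var in zip(x, params["means"], params["vars"]):
            # Gaussian log-density: squared distance to the mean, scaled by variance.
            total += -0.5 * log(2 * pi * var) - (value - mu) ** 2 / (2 * var)
        return total

    return max(model, key=lambda label: log_posterior(model[label]))


# Made-up expression values for two genes.
X = [[1.0, 0.2], [1.2, 0.1], [0.9, 0.3], [3.0, 1.1], [3.2, 0.9], [2.9, 1.2]]
y = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

model = fit_gnb(X, y)
print(predict_gnb(model, [1.1, 0.2]), predict_gnb(model, [3.1, 1.0]))
```

The term `(value - mu) ** 2 / (2 * var)` is the variance-scaled distance to the class mean, which plays the role that Euclidean distance would play in a nearest-centroid rule.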

RESULTS

Table 1 contains descriptive statistics for the 13 genes selected by LASSO feature selection. As Table 1 shows, all 13 genes selected by LASSO differed significantly between normal and tumor samples. Hsa.8125, Hsa.2710, Hsa.8147, Hsa.36689, Hsa.31933, Hsa.1387 and Hsa.865 were expressed at lower levels in tumor samples, while Hsa.3306, Hsa.22762, Hsa.3016, Hsa.5392, Hsa.1410 and Hsa.2928 were expressed at higher levels in tumor samples.

Table 1. Descriptive statistics for selected genes

| Gene name | Normal (mean ± SD) | Tumor (mean ± SD) | t value | p-value |
|---|---|---|---|---|
| Hsa.8125 | 2.144 ± 0.496 | 1.444 ± 0.442 | 6.87 | <0.001 |
| Hsa.2710 | 1.289 ± 0.392 | 0.89 ± 0.359 | 5.3 | <0.001 |
| Hsa.8147 | 2.092 ± 0.799 | 0.725 ± 0.637 | 9.97 | <0.001 |
| Hsa.36689 | 0.741 ± 0.42 | -0.01 ± 0.318 | 9.83 | <0.001 |
| Hsa.3306 | 0.289 ± 0.504 | 1.138 ± 0.482 | -8.07 | <0.001 |
| Hsa.22762 | -0.242 ± 0.564 | 0.337 ± 0.759 | -4.05 | 0.003 |
| Hsa.31933 | -0.107 ± 0.263 | -0.475 ± 0.377 | 5.24 | <0.001 |
| Hsa.3016 | -0.222 ± 1.074 | 0.962 ± 1.049 | -5.16 | <0.001 |
| Hsa.5392 | -1.064 ± 0.655 | -0.486 ± 0.486 | -4.82 | <0.001 |
| Hsa.1410 | -0.794 ± 0.73 | 0.002 ± 0.694 | -5.53 | <0.001 |
| Hsa.2928 | -1.312 ± 0.563 | -0.526 ± 0.518 | -6.98 | <0.001 |
| Hsa.1387 | 0.827 ± 0.648 | 0.017 ± 0.779 | 5.4 | <0.001 |
| Hsa.865 | 0.45 ± 0.393 | 0.06 ± 0.568 | 3.78 | 0.006 |

SD: Standard deviation


Table 2 presents the performance measures of the RF, DT, and GNB classification models. For the RF model, specificity, accuracy, F1-score, sensitivity, and negative and positive predictive values were all calculated as 1; that is, the RF model correctly predicted all samples in the test set. For the DT model, all performance measures were 0.9. Finally, for the GNB model, accuracy was 0.95, specificity 1, F1-score 0.95, sensitivity 0.9, negative predictive value 0.9091, and positive predictive value 1. The RF method offered the highest performance compared to DT and GNB.

Table 2. Performance measures for the classification models

| Metric | Random Forest | Gaussian Naive Bayes | Decision Trees |
|---|---|---|---|
| Accuracy | 1 | 0.95 | 0.9 |
| Sensitivity | 1 | 0.9 | 0.9 |
| Specificity | 1 | 1 | 0.9 |
| PPV | 1 | 1 | 0.9 |
| NPV | 1 | 0.9091 | 0.9 |
| F1 score | 1 | 0.95 | 0.9 |

PPV: Positive predictive value, NPV: Negative predictive value
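The six measures in Table 2 all derive from the four cells of a confusion matrix, as the following Python sketch shows. The confusion matrices themselves are not reported in the paper; the counts used here (9 true positives, 0 false positives, 10 true negatives, 1 false negative) are a hypothetical example chosen because they are consistent with the Gaussian Naive Bayes column after rounding. The function name is likewise an assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the performance measures of Table 2 from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # share of tumor samples correctly detected
    specificity = tn / (tn + fp)   # share of normal samples correctly detected
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    npv = tn / (tn + fn)           # negative predictive value
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
    }


# Hypothetical counts (not reported in the paper) consistent with the GNB column.
gnb = classification_metrics(tp=9, fp=0, tn=10, fn=1)
for name, value in gnb.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed counts the single misclassified sample is a tumor predicted as normal, which is why sensitivity and NPV fall below 1 while specificity and PPV stay at 1.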

DISCUSSION

Because knowing the biological functions of genes helps in understanding the origin, causes and treatment of many diseases, genomics has been on the agenda of the scientific world for years. In addition to their biological functions, the detection of genes in the same biological pathway and of the relationships among them brings microarray studies to the fore. With the identification of clusters of possibly related genes, the detection and treatment of diseases has become easier (17). Based on this information, in the current study we developed a model that can predict the disease by identifying the genes associated with colon cancer, in order to provide clinical decision support to physicians.

In this study, we used the LASSO feature selection method to identify colon cancer-related genes. According to the LASSO method, the genes Hsa.8125, Hsa.36689, Hsa.3306, Hsa.3016, Hsa.8147, Hsa.2710, Hsa.22762, Hsa.31933, Hsa.5392, Hsa.1410, Hsa.2928, Hsa.1387 and Hsa.865 may be associated with colon cancer. Some of the biomarker candidate genes we identified are in agreement with the literature. Shaik et al. showed differential expression of the Hsa.8125, Hsa.36689 and Hsa.3306 genes in colon cancer (genes1). In another study, Hsa.8125 and Hsa.3306 were among 100 genes associated with colon cancer (18).

Hsa.8125 is a gene whose product enables RNA binding activity, is involved in nucleocytoplasmic transport, and is located in the endoplasmic reticulum, the nucleus and the perinuclear region of the cytoplasm. Yan et al. showed that this gene, also known as ANP32A, is overexpressed in colorectal cancer patients and that ANP32A levels are higher in poorly differentiated tumors (19). Velmurugan et al. reported that this gene is associated with lymph node metastasis (20).

The main function of the Hsa.36689 gene is guanylate cyclase activation in the colon. When its relationship with colon cancer was examined, Yang et al. identified this gene among the five most related genes (21). The Hsa.3016 and Hsa.8147 genes that we detected were also among the most frequently selected genes in that study.

The Hsa.3306 gene plays a role in cell proliferation, and its expression is increased in cancer. In another study examining the colon gene dataset, it was identified as one of the ten genes, out of 2000, most closely associated with colon cancer. Hsa.8125, one of the genes we detected, was also among the genes identified in that study. This gene, whose functions are important in the construction of intestinal villi, has been shown to be increased in normal cells and decreased in colon cancer cells (22).

Hsa.8147, also known as the desmin gene, is responsible for the production of desmin, a smooth muscle-type intermediate filament protein that is expressed by smooth muscle cells and also in fibrotic tissue during wound healing and in the ‘desmoplastic’ stroma of tumors. Desmin is also produced by pericytes, which surround the vasculature during angiogenesis in capillaries, and it plays a role in angiogenesis in cancer tissue. Studies have shown increased desmin expression in patients with advanced cancer (23). In a study of patients with gallbladder cancer, down-regulation of the desmin gene was detected (24).

The Hsa.3016 gene, which we observed to be strongly associated with colon cancer, is one of the genes encoding the S100P protein. S100 proteins are involved in many processes such as the regulation of calcium homeostasis, cell proliferation, apoptosis, and cell migration. The S100 protein family plays a role in many stages of cancer formation and progression. S100P acts as an inducer of metastasis: its overexpression increases the expression of S100A6 and cathepsin D, which are involved in cellular invasion. Furthermore, S100P promotes the transendothelial migration of tumor cells (25).

The Hsa.2710 gene is one of the genes responsible for producing fibulin-1, a secreted glycoprotein that is incorporated into the fibrillar extracellular matrix (ECM). It is involved in cell adhesion and migration along protein fibers within the ECM. Thought to play a role in cellular transformation and tumor invasion, it acts as a tumor suppressor (26). Xu et al. showed that fibulin downregulation is associated with colorectal cancer (27).

Nucleolin is a multifunctional protein found in the nucleolus, nucleoplasm, and cytoplasm. Hsa.22762 is one of the genes involved in the synthesis of nucleolin. Nucleolin is involved in regulating the translation and stability of oncogenic mRNAs in the nucleoplasm. In our study, this gene was found to be significantly associated with colon cancer. Other studies have also shown that nucleolin is overexpressed in many cancer types, such as stomach, pancreatic, breast, cervical and prostate cancers, leukemias, melanomas and colorectal cancers (28).

The Hsa.31933 gene, which we detected in our study, is one of the genes that help Autographa californica multiple nuclear polyhedrosis virus (AcMNPV), a member of the Baculovirus family, to initiate the expression of viral genes by preparing the host environment and controlling subsequent viral gene expression, as other DNA viruses do to infect their hosts. Viral genes that are expressed immediately after infection play a critical role in the early infection process; Hsa.31933 (Immediate-Early Regulatory Protein IE-N gene) is one of these genes. AcMNPV has been studied as a gene therapy vector. Ono et al. found that AcMNPV induces acquired antitumor immunity; they showed that AcMNPV can act as an effective immune-inducing virus and as a eukaryotic expression vector for gene delivery, and that it has potential as a tumor therapy agent (29).

In another study, recombinant DNA obtained with this virus enabled the production of a natural carcinoma-associated antigen in mice (30). Although there are no studies of this viral DNA in colon cancer yet, the data in our study indicated a strong relationship between colon cancer and this gene. We think that more comprehensive studies on the use of AcMNPV as a vector in the treatment of colon cancer could yield meaningful results.

The Hsa.5392 gene is also known as ribosomal protein L24 (RPL24) and is one of the genes responsible for the expression of ribosomal proteins. It encodes the ribosomal protein L24, a homolog of the cytosolic RPL24 found in higher eukaryotes. The overexpression of a number of ribosomal protein genes in human tumors and their contribution to tumorigenesis have been studied (31).

Hsa.1410 is the gene responsible for the synthesis of the eukaryotic translation initiation factor eIF-2. Changes in protein synthesis play an important role in cancer development and progression. Studies show that ribosomal protein synthesis plays a direct role during tumor initiation. Translation initiation is the rate-limiting step of protein synthesis in eukaryotes and involves a group of eukaryotic translation initiation factors (eIFs). Previous studies have shown that the eIF3 subunits eIF3A, eIF3B and eIF3M are significantly overexpressed in colorectal cancer patients, that eIF3C acts as an oncogene, and that eIF4 subunits are also increased in cancer cells (32).

In these studies, eIF2α expression was described as transiently increased in normal cells, whereas constitutive overexpression indicated tumor initiation and progression. Golob-Schwarzl et al. also showed that eIF2 is overexpressed in colorectal cancers (32).

Among the genes we identified, Hsa.2928 corresponds to the mRNA responsible for the expression of P-cadherin. Cadherins are calcium-dependent cell adhesion proteins that maintain cell architecture and integrity, and their disruption is often associated with human cancer (33). Neo-expression or up-regulation of placental cadherin (P-cadherin) has been reported in a variety of carcinomas, including colorectal and bladder carcinomas (34).

Hsa.1387 (human 11 beta-hydroxysteroid dehydrogenase type II mRNA) is a gene that showed a strong association with colon cancer in our study and has been found to be associated with colon carcinomas. The enzyme 11 beta-hydroxysteroid dehydrogenase type II (11 beta-HSD2), which is also present in the colon and has an important role in water and electrolyte homeostasis, confers specificity on the mineralocorticoid receptor (35).

MAP kinases, also known as ERKs, are encoded by the Hsa.865 (ERK-1, M84490) gene. They are regulated by extracellular signals and act in a signaling cascade that controls various cellular processes such as proliferation, differentiation and the cell cycle. The tumor suppressor pathway is stimulated by ERK-1 phosphorylation (36). The relationship between colon cancer and ERK-1 has been shown in many studies (37-39), and our study also showed this relationship.

In a similar study using the same dataset, PCA and PLS feature extraction methods were applied and colon cancer was then classified with the support vector machine (SVM) method with an accuracy of 0.9516 (40). Another study found that the combined use of PSO and SVM outperformed a model built with the SVM algorithm alone in terms of accuracy (0.94) and was also faster (41). In the current study, three models were built with the RF, DT and GNB classifiers based on the biomarker candidate genes determined by the LASSO feature selection method. According to the performance measures obtained, the LASSO+RF model showed the best performance, correctly classifying all samples in the test set.

CONCLUSION

In conclusion, this study identified genomic biomarkers of colon cancer and classified the disease with a high-performance model. The LASSO method yielded genomic biomarkers that are compatible with the literature. For this reason, genes selected with LASSO can provide clinical decision support to physicians in the diagnosis and treatment of colon cancer. In addition, the LASSO+RF approach can be suggested for modeling high-dimensional data in medicine.

Financial disclosures: The authors declared that this study has received no financial support.

Conflict of Interest: The authors declare that they have no competing interest.

Ethical approval: Ethics committee approval is not required in this study.

REFERENCES

1. Globocan W. Estimated cancer incidence, mortality and prevalence worldwide in 2012. Int Agency Res Cancer. 2012.

2. Labianca R, Beretta G, Gatta G, et al. Colon cancer. Critical Reviews Oncology Hematology. 2004;51:145-70.

3. Loboda A, Nebozhyn MV, Watters JW, et al. EMT is the dominant program in human colon cancer. BMC Med Genomics. 2011;4:1-10.

4. Xu C, Meng LB, Duan YC, et al. Screening and identification of biomarkers for systemic sclerosis via microarray technology. Int J Molecular Med. 2019;44:1753-70.

5. Ahmad MA, Eckert C, Teredesai A. Interpretable machine learning in healthcare. Proceedings of the 2018 ACM international conference on bioinformatics, Computational Biology Health Informatics. 2018.

6. Yagin FH, Yagin B, Arslan AK, Çolak C. Comparison of Performances of Associative Classification Methods for Cervical Cancer Prediction: Observational Study. Turkey Clinics J Biostatistics. 2021;13:13:266-72.

7. Khaire UM, Dhanalakshmi R. High-dimensional microarray dataset classification using an improved adam optimizer (iAdam). J Ambient Intelligence Humanized Computing. 2020;11:5187-204.

8. Hameed SS, Hassan R, Hassan WH, et al. HDG-select: A novel GUI based application for gene selection and classification in high dimensional datasets. PloS One. 2021;16:e0246039.

9. Mulla GA, Demir Y, Hassan M. Combination of PCA with SMOTE Oversampling for Classification of High-Dimensional Imbalanced Data. Bitlis Eren University Science and Technology Journal. 2021;10:858-69.

10. Güçkiran K, Cantürk İ, Özyilmaz L. DNA microarray gene expression data classification using SVM, MLP, and RF with feature selection methods relief and LASSO. Journal of Suleyman Demirel University Institute of Science and Technology. 2019;23:126-32.

11. Akyol K, Bayir Ş, Baha Ş. Importance of Attribute Selection for Parkinson Disease. Academic Platform J Engineering Sci. 2020;8:175-80.

12. Yilmaz R, Yagin FH. Early detection of coronary heart disease based on machine learning methods. Med Records. 2022;4:1-6.

13. Secgin Y, Oner Z, Turan MK, Oner S. Gender prediction with parameters obtained from pelvis computed tomography images and decision tree algorithm. Med Science. 2021;10:356-61.

14. Doğan Ş, Türkoğlu İ. Hypothyroidi and hyperthyroidi detection from thyroid hormone parameters by using decision trees. Fırat University Journal of Oriental Studies. 2007;5:163-9.

15. Pulat M, Kocakoç ID. Machine Learning and Decision in Turkey. Bibliometric Analysis of Published Theses in the Field of Trees. Journal of Management and Economics. 2021;28:287-308.

16. Kamel H, Abdulah D, Al-Tuwaijari JM. Cancer classification using gaussian naive bayes algorithm. 2019 Int Engineering Conference (IEC); 2019:36:165-5.

17. Quackenbush J. Microarray analysis and tumor classification. New England J Med. 2006;354:2463-72.

18. Jose A. Gene selection by 1-d discrete wavelet transform for classifying cancer samples using DNA microarray data. Ph.D. thesis, University of Akron, 2009.

19. Yan W, Bai Z, Wang J, et al. ANP32A modulates cell growth by regulating p38 and Akt activity in colorectal cancer. Oncology Reports. 2017;38:1605-12.

20. Velmurugan BK, Yeh K-T, Lee C-H, et al. Acidic leucine-rich nuclear phosphoprotein-32A (ANP32A) association with lymph node metastasis predicts poor survival in oral squamous cell carcinoma patients. Oncotarget. 2016;7:10879.

21. Liu Q, Tan Y, Huang T, et al. TF-centered downstream gene set enrichment analysis: Inference of causal regulators by integrating TF-DNA interactions and protein post-translational modifications information. BMC Bioinformatics. 2010;11:1-17.

22. Mora JAM, Ordoñez FM, Bonilla DA. Improvement of k-means clustering algorithm performance in gene expression data analysis through pre-processing with principal component analysis and boosting. 2017;3:53-9.

23. Arentz G, Chataway T, Price TJ, et al. Desmin expression in colorectal cancer stroma correlates with advanced stage disease and marks angiogenic microvessels. Clinical Proteomics. 2011;8:1-13.

24. Bhunia S, Barbhuiya MA, Gupta S, et al. Epigenetic downregulation of desmin in gall bladder cancer reveals its potential role in disease progression. Indian J Med Research. 2020;151:311.

25. Chen H, Xu C, Qing’e Jin ZL. S100 protein family in human cancer. Am J Cancer Res. 2014;4:89.

26. Twal WO, Czirok A, Hegedus B, et al. Fibulin-1 suppression

of bronectin-regulated cell adhesion and motility. J Cell Sci.

2001;114:4587-98.

27. Xu Z, Chen H, Liu D, Huo J. Fibulin-1 is downregulated through

promoter hypermethylation in colorectal cancer: a CONSORT

study. Med (Baltimore). 2015;94.e663

28. Tong X, Mirzoeva S, Veliceasa D, et al. Chemopreventive

apigenin controls UVB-induced cutaneous proliferation

and angiogenesis through HuR and thrombospondin-1.

Oncotarget. 2014;5:11413.

29. Ono C, Sato M, Taka H, et al. Tightly regulated expression

of Autographa californica multicapsid nucleopolyhedrovirus

immediate early genes emerges from their interactions and

possible collective behaviors. Plos One. 2015;10:e0119580.

30. Strassburg CP, Kasai Y, Seng BA, et al. Baculovirus

recombinant expressing a secreted form of a transmembrane

carcinoma-associated antigen. Cancer Res. 1992;52:815-21.

31. Loging WT, Reisman D. Elevated expression of ribosomal

202

Med Records 2022;4(2):196-202

DOI: 10.37990/medr.1077024

protein genes L37, RPP-1, and S2 in the presence of mutant

p53. Cancer Epidemiology and Prevention Biomarkers.

1999;8:1011-6.

32. Golob-Schwarzl N, Schweiger C, Koller C, et al. Separation

of low and high grade colon and rectum carcinoma by

eukaryotic translation initiation factors 1, 5 and 6. Oncotarget.

2017;8:101224.

33. Oliveira P, Sanges R, Huntsman D, et al. Characterization

of the intronic portion of cadherin superfamily members,

common cancer orchestrators. European J Human Genetics.

2012;20:878-83.

34. Van Marck V, Stove C, Jacobs K, et al. Pcadherin in adhesion

and invasion: Opposite roles in colon and bladder carcinoma.

Int J Cancer. 2011;128:1031-44.

35. Takahashi K, Sasano H, Fukushima K, et al. 11 beta-

hydroxysteroid dehydrogenase type II in human colon: a

new marker of fetal development and differentiation in

neoplasms. Anticancer Res. 1998;18:3381-8.

36. Baba Y, Nosho K, Shima K, et al. Prognostic signicance

of AMP-activated protein kinase expression and modifying

effect of MAPK3/1 in colorectl cancer. British J Cancer.

2010;103:1025-33.

37. Esteve-Puig R, Canals F, Colome N, et al. Uncoupling of the

LKB1-AMPKα energy sensor pathway by growth factors and

oncogenic BRAFV600E. PloS One. 2009;4:e4771.

38. Zheng B, Jeong JH, Asara JM, et al. Oncogenic B-RAF

negatively regulates the tumor suppressor LKB1 to promote

melanoma cell proliferation. Molecular Cell. 2009;33:237-47.

39. Kim MJ, Park IJ, Yun H, et al. AMP-activated protein kinase

antagonizes pro-apoptotic extracellular signal-regulated

kinase activation by inducing dual-specicity protein

phosphatases in response to glucose deprivation in HCT116

carcinoma. J Bio Chemistry. 2010;285:14617-27.

40. Arowolo MO, Isiaka RM, Abdulsalam SO, et al. A comparative

analysis of feature extraction methods for classifying colon

cancer microarray data. EAI Endorsed Transactions Scalable

Information Systems. 2017;4:1-6.

41. Al Rajab M, Lu J, Xu Q. Examining applying high performance

genetic data feature selection and classication algorithms

for colon cancer diagnosis. Computer Methods Programs

Bio Med. 2017;146:11-24.


Comparison of Performances of Associative Classification Methods for Cervical Cancer Prediction: Observational Study

Article

Full-text available

Jan 2021

Early Detection of Coronary Heart Disease Based on Machine Learning Methods

Article

Full-text available

Jan 2022

Aim: Heart disease detection using machine learning methods has been an outstanding research topic as heart diseases continue to be a burden on healthcare systems around the world. Therefore, in this study, the performances of machine learning methods for predictive classification of coronary heart disease were compared.Material and Method: In the study, three different models were created with Random Forest (RF), Logistic Regression (LR), and Support Vector Machine (SVM) algorithms for the classification of coronary heart disease. For hyper parameter optimization, 3-repeats 10-fold repeated cross validation method was used. The performance of the models was evaluated based on Accuracy, F1 Score, Specificity, Sensitivity, Positive Predictive Value, Negative Predictive Value, and Confusion Matrix (Classification matrix).Results: RF 0.929, SVM 0.897 and LR 0.861 classified coronary heart disease with accuracy. Specificity, Sensitivity, F1-score, Negative predictive and Positive predictive values of the RF model were calculated as 0.929, 0.928, 0.928, 0.929 and 0.928, respectively. The Sensitivity value of the SVM model was higher compared to the RF. Conclusion: Considering the accurate classification rates of Coronary Heart disease, the RF model outperformed the SVM and LR models. Also, the RF model had the highest sensitivity value. We think that this result, which has a high sensitivity criterion in order to minimize overlooked heart patients, is clinically very important.

Combination of PCA with SMOTE Oversampling for Classification of High-Dimensional Imbalanced Data

Article

Full-text available

Jul 2021

Imbalanced data classification is a common issue in data mining where the classifiers are skewed towards the larger data class. Classification of high-dimensional skewed (imbalanced) data is of great interest to decision-makers as it is more difficult to. The dimension reduction method, a process in which variables are reduced, allows high dimensional datasets to be interpreted more easily with a certain loss. This study, a method combiningSMOTE oversampling with principal component analysis is proposed to solve the imbalance problem in high dimensional data. Three classification algorithms consisting of Logistic Regression, K-Nearest Neighbor, DecisionTree methods and two separate datasets were utilised to evaluate the suggested method's efficacy and determine the classifiers' performance. Respectively, raw datasets, converted datasets by PCA, SMOTE and SMOTE+PCA(SMOTE and PCA) methods, were analyzed with the given algorithms. Analyzes were made using WEKA.Analysis results suggest that almost all classification algorithms improve their classification performance usingPCA, SOMTE, and SMOTE+PCA methods. However, the SMOTE method gave more efficient results than PCA and PCA+SMOTE methods for data rebalancing. Experimental results also suggest that K-Nearest Neighbor classifier provided higher classification performance compared to other algorithms.

Gender prediction with parameters obtained from pelvis computed tomography images and decision tree algorithm

Article

Full-text available

Mar 2021

Gender prediction is among the most critical topics in forensic medicine and anthropology since it is the basis of identity (height, weight, ancestry, age). Today, osteometry which is a low-cost, easily accessible method that requires no expertise is preferred when compared to DNA technology, which has several disadvantages such as high cost, accessibility, laboratory facilities, and expert personnel requirements. The Computed Tomography (CT) method, which is little affected by orientation and provides reconstruction opportunities, was selected instead of traditional methods for osteometry. This study aims to predict high and accurate gender with the Decision Tree (DT) algorithms used in the field of health recently. In the present study, CT images of 300 individuals (150 females, 150 males) without a pathology on the pelvic skeleton and between the ages of 25 and 50 were transformed into orthogonal form, landmarks were placed on promontorium, sacroiliac joint, iliac crest, terminal line, anterior superior iliac spine, anterior inferior iliac spine, greater trochanter, obturator foramen, lesser trochanter, femoral head, femoral neck, the body of femur, ischial tuberosity, acetabulum, and pubic symphysis, and the coordinates of these landmarks were determined. Then, parameters such as angle and length were obtained with various combinations. These parameters were analyzed with the DT algorithm.The analysis conducted with the DT algorithm revealed that accuracy (Acc) was 0.93, sensitivity was 0.95, specificity was 0.90, and the Matthews correlation coefficient was 0.86 for the pelvic skeleton. It was observed that the accuracy was quite high and more realistic when determined with the DT algorithm. In conclusion, the DT algorithm with multiple parameters and samples on pelvic CT images could improve the Acc of gender prediction. [Med-Science 2021; 10(2.000): 356-61]

HDG-select: A novel GUI based application for gene selection and classification in high dimensional datasets

Article

Full-text available

Jan 2021
PLOS ONE

The selection and classification of genes is essential for the identification of related genes to a specific disease. Developing a user-friendly application with combined statistical rigor and machine learning functionality to help the biomedical researchers and end users is of great importance. In this work, a novel stand-alone application, which is based on graphical user interface (GUI), is developed to perform the full functionality of gene selection and classification in high dimensional datasets. The so-called HDG-select application is validated on eleven high dimensional datasets of the format CSV and GEO soft. The proposed tool uses the efficient algorithm of combined filter-GBPSO-SVM and it was made freely available to users. It was found that the proposed HDG-select outperformed other tools reported in literature and presented a competitive performance, accessibility, and functionality.

Epigenetic downregulation of desmin in gall bladder cancer reveals its potential role in disease progression

Article

Apr 2020
INDIAN J MED RES

Background & objectives: Gall bladder cancer (GBC) is a fatal neoplasm with globally variable incidence rates. To improve the survival rate of patients, a newer set of biomarkers needs to be discovered for its early detection and better prognosis. Our earlier studies on GBC proteomics and whole-genome methylome data revealed expression of desmin to be significantly downregulated with correlated promoter hypermethylation during gall bladder carcinogenesis. Thus, to evaluate desmin as a potential biomarker for GBC, we carried out a detailed follow-up study. Methods: Methylation-specific polymerase chain reaction (MS-PCR) (n=17, GBC and n=23, non-tumour control), real-time quantitative reverse transcription-polymerase chain reaction (qRT-PCR) [n=14, GBC and n=14, adjacent non-tumour (ANT)], immunohistochemistry (n=27, GBC and n=14, non-tumour) and immunoblotting (n=13, GBC and n=13, ANT) were performed in surgically removed gall bladder tissue samples. Results: MS-PCR analysis showed methylation of desmin in 88.23 per cent (15/17) gall bladder tumour samples as compared to non-tumour tissues (39.13%, 9/23). Real-time qRT-PCR analysis revealed a significant downregulation of desmin expression in GBC as compared to ANT tissue. This was further confirmed by western blot, showing reduced expression of desmin protein in GBC, as compared to non-tumour tissue. Immunohistochemical analysis also showed a decreased level of desmin i.e., more than 95 per cent (26/27) in tumour cells compared to non-tumours (35.71%, 5/14). Interpretation & conclusions: The increased frequency of desmin promoter methylation, which could be responsible for its significant downregulation, indicates its potential as a candidate biomarker for GBC. This requires further validation in a large group of patients to evaluate its clinical utility.

High-dimensional microarray dataset classification using an improved adam optimizer (iAdam)

Article

Nov 2020

Classifying data samples into their respective categories is a challenging task, especially when the dataset has many features and only a few samples. A robust model is essential for the accurate classification of data samples. The logistic sigmoid model is one of the simplest models for binary classification. Among the various optimization techniques for the sigmoid function, the Adam optimization technique iteratively updates network weights based on training data. The traditional Adam optimizer fails to converge the model within a given number of epochs when the initial parameter values lie in a gentle region of the error surface. The continuous movement of the convergence curve in the direction of history can overshoot the goal and oscillate back and forth incessantly before converging to the global minimum. The traditional Adam optimizer with a higher learning rate collapses after several epochs on a high-dimensional dataset. The proposed Improved Adam (iAdam) technique is a combination of the look-ahead mechanism and an adaptive learning rate for each parameter. It improves the momentum of traditional Adam by evaluating the gradient after applying the current velocity. iAdam also acts as a correction factor to the momentum of Adam. Further, it works efficiently for high-dimensional datasets and converges considerably to the smallest error within the specified epochs even at higher learning rates. The proposed technique is compared with several traditional methods, which demonstrates that iAdam is suitable for the classification of high-dimensional data and that it also prevents the model from overfitting by effectively handling bias-variance trade-offs.

Screening and identification of biomarkers for systemic sclerosis via microarray technology

Article

Sep 2019

Systemic sclerosis (SSc) is a complex autoimmune disease. The pathogenesis of SSc is currently unclear, although like other rheumatic diseases its pathogenesis is complicated. However, the ongoing development of bioinformatics technology has enabled new approaches to research this disease using microarray technology to screen and identify differentially expressed genes (DEGs) in the skin of patients with SSc compared with individuals with healthy skin. Publicly available data were downloaded from the Gene Expression Omnibus (GEO) database and intra‑group data repeatability tests were conducted using Pearson's correlation test and principal component analysis. DEGs were identified using an online tool, GEO2R. Functional annotation of DEGs was performed using Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis. Finally, the construction and analysis of the protein‑protein interaction (PPI) network and identification and analysis of hub genes was carried out. A total of 106 DEGs were detected by the screening of SSc and healthy skin samples. A total of 10 genes [interleukin‑6, bone morphogenetic protein 4, calumenin (CALU), clusterin, cysteine rich angiogenic inducer 61, serine protease 23, secretogranin II, suppressor of cytokine signaling 3, Toll‑like receptor 4 (TLR4), tenascin C] were identified as hub genes with degrees ≥10, and which could sensitively and specifically predict SSc based on receiver operator characteristic curve analysis. GO and KEGG analysis showed that variations in hub genes were mainly enriched in positive regulation of nitric oxide biosynthetic processes, negative regulation of apoptotic processes, extracellular regions, extracellular spaces, cytokine activity, chemo‑attractant activity, and the phosphoinositide 3 kinase‑protein kinase B signaling pathway. In summary, bioinformatics techniques proved useful for the screening and identification of biomarkers of disease. A total of 106 DEGs and 10 hub genes were linked to SSc, in particular the TLR4 and CALU genes.

Interpretable Machine Learning in Healthcare

Article

Aug 2018

Cancer Classification Using Gaussian Naive Bayes Algorithm

Conference Paper

Jun 2019
