Artificial intelligence-based colon cancer prediction by identifying genomic biomarkers

Abstract

Aim: Colon cancer is the third most common type of cancer worldwide. Because of the poor prognosis and unclear preoperative staging, genetic biomarkers have become more important in the diagnosis and treatment of the disease. In this study, we aimed to determine the biomarker candidate genes for colon cancer and to develop a model that can predict colon cancer based on these genes.

Material and Methods: In the study, a dataset containing the expression levels of 2000 genes from 62 different samples (22 healthy and 40 tumor tissues) obtained by the Princeton University Gene Expression Project and shared in the figshare database was used. Data were summarized as mean ± standard deviation. Independent Samples T-Test was used for statistical analysis. The SMOTE method was applied before the feature selection to eliminate the class imbalance problem in the dataset. The 13 most important genes that may be associated with colon cancer were selected with the LASSO feature selection method. Random Forest (RF), Decision Tree (DT), and Gaussian Naive Bayes methods were used in the modeling phase.

Results: All 13 genes selected by LASSO had a statistically significant difference between normal and tumor samples. In the model created with RF, all the accuracy, specificity, f1-score, sensitivity, negative and positive predictive values were calculated as 1. The RF method offered the highest performance when compared to DT and Gaussian Naive Bayes.

Conclusion: In the study, we identified the genomic biomarkers of colon cancer and classified the disease with a high-performance model. According to our results, it can be recommended to use the LASSO+RF approach when modeling high-dimensional microarray data.
Med Records 2022;4(2):196-202
DOI: 10.37990/medr.1077024
MEDICAL RECORDS-International Medical Journal
Articial Intelligence-based Colon Cancer Prediction by
Identifying Genomic Biomarkers
Genomik Biyobelirteçleri Belirleyerek Yapay Zeka Tabanlı Kolon Kanseri
Tahmini
Nur Paksoy, Fatma Hilal Yagin
Malatya Fahri Kayahan Healthcare Center, Department of Family Medicine Physician, Malatya, Turkey
Inonu University, Faculty of Medicine, Department of Biostatistics and Medical Informatics, Malatya, Turkey
Copyright@Author(s) - Available online at www.dergipark.org.tr/tr/pub/medr
Content of this journal is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Received: 22.02.2022 Accepted: 28.04.2022
Corresponding Author: Fatma Hilal Yagin, Inonu University, Faculty of Medicine, Department of Biostatistics and
Medical Informatics, Malatya, Turkey, E-mail: hilal.yagin@inonu.edu.tr
Keywords: Colon cancer, microarray, genomics, LASSO, random forest, decision tree, gaussian naive bayes
Research Article
INTRODUCTION
According to the World Health Organization, cancer is the
second leading cause of death after cardiovascular disease.
Colon cancer ranks third worldwide in terms of incidence.
Since screening programs were introduced in the USA over
the last 30 years, cancer prognosis has improved thanks to
early diagnosis; a similar screening program has been
implemented in Turkey since 2009 (1,2).
Because staging is performed postoperatively and prognosis
is determined by staging alone, biomarkers and genetic
evaluation gain importance in colon cancer. For this
reason, examining colon cancer on the basis of genetic
biomarkers is very important in the diagnosis and treatment
of the disease (3).
Microarray technology allows the expression of thousands
of genes to be measured simultaneously. Identifying
disease-related biomarker candidate genes from microarray
gene expression datasets and distinguishing (classifying)
disease samples from non-disease samples has been an
important research topic in biomedicine. However, the
resulting large-scale datasets pose many challenges for
computational techniques. Most microarray gene expression
datasets suffer from the high-dimensionality problem: the
number of genes is large (up to tens of thousands) while
the sample size is small (typically up to a few hundred).
The high noise-to-variability ratio of microarray
experiments adds to the difficulties (4).
Machine learning methods are frequently used to overcome
these challenges. Machine learning can be defined as
obtaining previously unknown, valid, and applicable
information from large volumes of data through a dynamic
process. In this process, many techniques such as clustering,
data summarization, learning classification rules, finding
dependency networks, developing predictive models,
variability analysis, and anomaly detection are used. With
machine learning, hidden information is extracted from
database systems comprising large volumes of data. This
process draws on statistics, mathematics, modeling
techniques, database technology, and various computer
programs (5,6).
Before constructing classication models in machine
learning with high-dimensional microarray datasets, it is an
important step to remove disease-related genes from the
dataset using trait (gene) selection methods. In this way,
both biomarker candidate genes can be selected and the
performance of the classication models to be created will
be improved (7).
In this study, we aimed to determine biomarker candidate
genes for colon cancer using a gene expression dataset
and to develop a classification model that can provide
clinical decision support to healthcare professionals.
MATERIAL AND METHOD
Dataset
In this study, an open-source colon cancer gene expression
dataset obtained by the Princeton University Gene Expression
Project and shared in the figshare database
(https://figshare.com/articles/dataset/The_microarray_dataset_of_colon_cancer_in_csv_format_/13658790/1)
was used (8). The dataset includes expression levels of 2000
genes from 62 different samples (22 healthy and 40 tumor
tissues).
Statistical Evaluation
Data were summarized as mean ± standard deviation.
Conformity to the normal distribution was assessed with
the Kolmogorov-Smirnov test. The Independent Samples t-test
was used for statistical analysis. Results with a p-value
below 0.05 were considered significant. All statistical
analyses were performed using IBM SPSS Statistics for
Windows version 26.0 (New York, USA).
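As an illustration of the group comparison described above, the following minimal Python sketch computes an independent-samples t statistic. It assumes the pooled-variance (Student) form; the paper does not state which variant SPSS reported. The function name and the expression values are hypothetical and are not taken from the dataset.

```python
import math
import statistics

def pooled_t_statistic(group_a, group_b):
    """Student's independent-samples t statistic with pooled variance.

    Returns the t value and the degrees of freedom (n_a + n_b - 2).
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err, n_a + n_b - 2

# Hypothetical expression values for one gene (illustration only).
normal = [2.1, 2.4, 1.9, 2.6, 2.0]
tumor = [1.3, 1.6, 1.2, 1.7, 1.5]
t_value, dof = pooled_t_statistic(normal, tumor)
print(f"t = {t_value:.2f}, degrees of freedom = {dof}")
```

A positive t value here means the first group (normal) has the higher mean, which matches the sign convention used for the t values in Table 1.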
Data Preprocessing and Modeling
In datasets with a class imbalance problem, most machine
learning techniques neglect the minority class and
therefore underperform on it. One approach to such datasets
is to oversample the minority class with the Synthetic
Minority Oversampling Technique, or SMOTE for short (9). To
eliminate the class imbalance in the colon cancer gene
expression dataset (22 normal and 40 tumor tissues), the
SMOTE method was applied before feature selection. In this
way, the group sizes were equalized at 40 normal and 40
tumor tissues.
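The oversampling idea can be sketched in a few lines of plain Python. This is a simplified illustration of SMOTE-style interpolation, not the implementation used in the study; the function name `smote`, the neighbour count `k=5`, and the two-gene toy data are assumptions made for the example.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` inside the minority class (itself excluded).
        neighbours = sorted(
            (s for s in minority if s is not base),
            key=lambda s: sq_dist(base, s),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical toy data: 22 "normal" samples with 2 genes each,
# to be balanced against 40 tumor samples.
data_rng = random.Random(1)
normal = [[data_rng.gauss(2.0, 0.5), data_rng.gauss(1.0, 0.4)] for _ in range(22)]
balanced_normal = normal + smote(normal, n_new=40 - 22)
print(len(balanced_normal))
```

Each synthetic sample lies on the line segment between two existing minority samples, so the new points stay inside the region already occupied by the minority class.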
Afterwards, the 13 most important genes that may be
associated with colon cancer were selected with the
LASSO feature selection method. To assess the
generalizability of the model, the dataset was split into a
training set (80%) and a test set (20%). Random Forest,
Decision Tree, and Gaussian Naive Bayes classification
methods were used to predict colon cancer based on the
selected genes. The performance of the models was evaluated
with accuracy, specificity, sensitivity, f1-score, negative
predictive value, and positive predictive value.
LASSO Feature Selection
The LASSO method was first introduced by Robert Tibshirani
in 1996. Regularization and feature selection are the two
main tasks of the method. LASSO places a constraint on the
sum of the absolute values of the model parameters: the sum
must be less than a fixed value (upper bound). To do this,
the method applies a shrinkage (regularization) process that
penalizes the coefficients of the regression variables and
shrinks some of them to exactly zero. During feature
selection, the variables that still have a non-zero
coefficient after shrinkage are selected for the model.
This operation minimizes the prediction error. In practice,
the parameter that controls the strength of the penalty is
of great importance. When it is large enough, coefficients
are forced to zero and the dimensionality is reduced; the
larger the parameter, the more coefficients are reduced to
zero. There are several advantages to using the LASSO
method. First, it can provide very good prediction accuracy,
since shrinking and removing coefficients can reduce
variance without a significant increase in bias. It is
especially useful when there are few observations and many
variables in the dataset. LASSO also improves the
interpretability of the model by eliminating irrelevant
variables that are not associated with the response
variable, which also helps to address overfitting (10,11).
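The effect of the penalty parameter can be illustrated with a small coordinate-descent sketch in plain Python. It assumes the common penalized form of the LASSO objective for linear regression, (1/2n)·||y − Xb||² + penalty·||b||₁, with no intercept term; the study does not report which solver or objective variant was used, and the function names and toy data below are hypothetical.

```python
import random

def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return 0 inside [-penalty, penalty]."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def lasso_coordinate_descent(X, y, penalty, n_iter=200):
    """Minimise (1/2n)*||y - Xb||^2 + penalty*||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * coef[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            coef[j] = soft_threshold(rho, penalty) / scale
    return coef

# Hypothetical toy data: only features 0 and 1 influence the response.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(30)]
y = [2.0 * row[0] - 1.5 * row[1] + rng.gauss(0, 0.1) for row in X]

weak = lasso_coordinate_descent(X, y, penalty=0.1)
strong = lasso_coordinate_descent(X, y, penalty=100.0)
selected = [j for j, c in enumerate(weak) if c != 0.0]
print("selected features:", selected)
print("coefficients under a very large penalty:", strong)
```

Features whose coefficient remains non-zero are the ones "selected"; with a very large penalty every coefficient is driven to zero, which is the behaviour described in the paragraph above.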
Random Forest
The Random Forest (RF) algorithm, an ensemble learning
method, aims to increase classification performance by
generating multiple decision trees during the classification
process. Because it combines random sampling with the
improved properties of ensemble techniques, the RF method
generalizes better and makes more valid predictions than
conventional machine learning methods. RF yields accurate
estimates because it has low bias and low correlation
between trees. Low bias is obtained by growing rather large
trees, and low correlation is achieved by making the trees
as different from each other as possible. Individually
created classification and regression trees come together
to form the forest ensemble. Each decision tree is built on
a randomly selected subset of the dataset. The results of
the individual trees are combined to make the final
prediction. For classification, trees are grown until each
leaf node contains only members of one class. For
regression, trees continue to split until only a small
number of units remain in the leaf node (12).
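The two ingredients described above, random resampling and combining the votes of many trees, can be sketched as follows. To keep the example short, each "tree" is a depth-1 tree (a stump) on one randomly chosen feature, whereas a real random forest grows deep trees and samples features at every split. All names and the toy data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(X, y, feature):
    """Depth-1 tree: choose the threshold on one feature with the fewest training errors."""
    best = None
    for threshold in sorted({row[feature] for row in X}):
        left = [label for row, label in zip(X, y) if row[feature] <= threshold]
        right = [label for row, label in zip(X, y) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(l != left_label for l in left) + sum(l != right_label for l in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label

def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # bootstrap sample (with replacement)
        feature = rng.randrange(len(X[0]))        # random feature for this tree
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest

def predict(forest, row):
    """Majority vote over all trees."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical, well-separated toy data with two genes per sample.
normal = [[3.0 + 0.05 * i, 3.0 - 0.05 * i] for i in range(10)]
tumor = [[0.0 + 0.05 * i, 0.5 - 0.05 * i] for i in range(10)]
X = normal + tumor
y = ["normal"] * 10 + ["tumor"] * 10

forest = fit_forest(X, y)
print(predict(forest, [3.2, 2.8]), predict(forest, [0.0, 0.0]))
```

Because every tree sees a different bootstrap sample and a different feature, the trees differ from one another, and the majority vote smooths out their individual errors.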
Decision Trees
Decision trees (DT) consist of a root node, branches, and
leaves. The leaves are where classification occurs, and the
branches represent the outcomes of the tests. The tree is
built by recursive splitting from the root node to the leaf
nodes. A decision node can have one or more branches. A
decision tree can handle both categorical and numerical
data. Building a decision tree involves two basic steps:
splitting and pruning (13). The most important step when
creating a DT is to decide which attribute to split on and
which branches to create. In the information gain and gain
ratio approaches, which are based on entropy, every
available attribute is tested and the attribute with the
highest information gain is selected for branching. DT is a
classification method that builds a model in the form of a
tree structure consisting of decision nodes and leaf nodes
from the attributes and the target. The decision tree
algorithm proceeds by dividing the dataset into smaller and
smaller pieces (14,15).
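The entropy-based splitting criterion can be made concrete with a short Python sketch. It computes the Shannon entropy of a set of class labels and the information gain of a single threshold split on one numerical attribute; the function names and the six expression values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting at values <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty gains nothing
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

# Hypothetical expression values of one gene and their class labels.
values = [0.2, 0.4, 0.6, 1.5, 1.8, 2.1]
labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]

good_split = information_gain(values, labels, threshold=1.0)
poor_split = information_gain(values, labels, threshold=0.3)
print(good_split, poor_split)
```

The threshold at 1.0 separates the two classes perfectly and therefore attains the maximum gain of 1 bit for a balanced two-class problem, while the threshold at 0.3 leaves a mixed branch and gains less.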
Gaussian Naive Bayes
A simple structured classication based on conditional
probability, which is assumed to be equal and independent
of each other in the classication of all attributes based
on conditional probability. The classication process is
done by combining the effects of different attributes on
the result. Naive Bayes classies using statistical methods
and is an important algorithm in terms of performance. The
importance of qualications is considered equal in all. The
Gaussian Naive Bayes (GNB) classier is the Naive Bayes
method, which is created by assuming that the class label
is a Gaussian distribution on the given property values. GNB
assigns all data to the closest location. However, instead of
using Euclidean distance to calculate the distance between
them, it calculates by taking into account the distance from
the average and the class variance (16).
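The variance-scaled distance to the class mean mentioned above appears directly in the Gaussian log-density. The following minimal Python sketch of a Gaussian Naive Bayes classifier shows this; the function names and the two-gene toy data are hypothetical and serve only as an illustration.

```python
import math
import statistics

def fit_gnb(X, y):
    """Estimate, per class, the prior and the mean and variance of every feature."""
    model = {}
    for label in set(y):
        rows = [row for row, l in zip(X, y) if l == label]
        columns = list(zip(*rows))
        model[label] = (
            len(rows) / len(X),
            [statistics.mean(col) for col in columns],
            [statistics.variance(col) for col in columns],
        )
    return model

def predict_gnb(model, row):
    """Return the class with the highest log-posterior under independent Gaussians."""
    def log_posterior(prior, means, variances):
        score = math.log(prior)
        for x, mu, var in zip(row, means, variances):
            # Gaussian log-density: squared distance to the class mean, scaled by the variance.
            score += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        return score
    return max(model, key=lambda label: log_posterior(*model[label]))

# Hypothetical expression values of two genes (illustration only).
X = [[2.0, 0.3], [2.3, 0.1], [1.9, 0.5], [1.2, 1.1], [1.5, 1.3], [1.4, 0.9]]
y = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

model = fit_gnb(X, y)
print(predict_gnb(model, [2.1, 0.2]), predict_gnb(model, [1.3, 1.2]))
```

Summing the per-feature log-densities is where the independence assumption enters: each gene contributes to the class score separately.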
RESULTS
Table 1 contains descriptive statistics for the 13 genes
selected by LASSO feature selection. As Table 1 shows, all
13 genes selected by LASSO differed significantly between
normal and tumor samples. Hsa.8125,
Hsa.2710, Hsa.8147, Hsa.36689, Hsa.31933, Hsa.1387
and Hsa.865 were expressed lower in tumor samples, while
Hsa.3306, Hsa.22762, Hsa.3016, Hsa.5392, Hsa.1410 and
Hsa.2928 were expressed higher in tumor samples.
Table 1. Descriptive statistics for selected genes
Gene name    Normal (mean ± SD)   Tumor (mean ± SD)   t value   p-value
Hsa.8125     2.144 ± 0.496        1.444 ± 0.442        6.87     <0.001
Hsa.2710     1.289 ± 0.392        0.89 ± 0.359         5.3      <0.001
Hsa.8147     2.092 ± 0.799        0.725 ± 0.637        9.97     <0.001
Hsa.36689    0.741 ± 0.42         -0.01 ± 0.318        9.83     <0.001
Hsa.3306     0.289 ± 0.504        1.138 ± 0.482       -8.07     <0.001
Hsa.22762    -0.242 ± 0.564       0.337 ± 0.759       -4.05     0.003
Hsa.31933    -0.107 ± 0.263       -0.475 ± 0.377       5.24     <0.001
Hsa.3016     -0.222 ± 1.074       0.962 ± 1.049       -5.16     <0.001
Hsa.5392     -1.064 ± 0.655       -0.486 ± 0.486      -4.82     <0.001
Hsa.1410     -0.794 ± 0.73        0.002 ± 0.694       -5.53     <0.001
Hsa.2928     -1.312 ± 0.563       -0.526 ± 0.518      -6.98     <0.001
Hsa.1387     0.827 ± 0.648        0.017 ± 0.779        5.4      <0.001
Hsa.865      0.45 ± 0.393         0.06 ± 0.568         3.78     0.006
SD: Standard deviation
Table 2 presents the performance measures of the RF, DT,
and GNB classification models. The specificity, accuracy,
f1-score, sensitivity, and negative and positive predictive
values obtained from the RF model were all calculated as 1;
that is, the RF model correctly predicted all samples in
the test set. For the DT model, all performance measures
were 0.9. Finally, for the model created with the GNB
method, the performance measures were accuracy 0.95,
specificity 1, f1-score 0.95, sensitivity 0.9, negative
predictive value 0.9091, and positive predictive value 1.
The RF method offered the highest performance compared to
DT and GNB.
Table 2. Performance measures for the classification models
Metric        Random Forest   Gaussian Naive Bayes   Decision Trees
Accuracy      1               0.95                   0.9
Sensitivity   1               0.9                    0.9
Specificity   1               1                      0.9
PPV           1               1                      0.9
NPV           1               0.9091                 0.9
F1 score      1               0.95                   0.9
PPV: Positive predictive value, NPV: Negative predictive value
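The measures in Table 2 all follow from the four cells of a confusion matrix, as the sketch below shows. The paper does not report the confusion matrices, so the counts used here are an assumption: they are one hypothetical set of counts (10 tumor and 10 normal test samples) that reproduces the reported values for the Gaussian Naive Bayes and Decision Tree models. The function name is likewise hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Performance measures from confusion-matrix counts (tumor = positive class)."""
    sensitivity = tp / (tp + fn)   # share of tumor samples detected
    specificity = tn / (tn + fp)   # share of normal samples correctly identified
    ppv = tp / (tp + fp)           # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }

# Hypothetical counts chosen to match the reported values; not taken from the paper.
gnb = classification_metrics(tp=9, fn=1, tn=10, fp=0)
dt = classification_metrics(tp=9, fn=1, tn=9, fp=1)
for name, result in (("GNB", gnb), ("DT", dt)):
    print(name, {metric: round(value, 4) for metric, value in result.items()})
```

Under these assumed counts, one missed tumor sample and no false alarms give accuracy 0.95, sensitivity 0.9, specificity 1, PPV 1, and NPV 0.9091, and the f1-score of about 0.947 rounds to the reported 0.95.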
DISCUSSION
Since knowing the biological functions of genes helps in
understanding the origin, causes, and treatment of many
diseases, studies in the field of genomics have been on
the agenda of the scientific world for years. In addition to
their biological functions, the detection of genes in the
same biological pathway and of the relationships among them
brings microarray studies to the fore. By identifying
clusters of possibly related genes, the detection and
treatment of diseases has become easier (17). Based on this
information, in the current study we developed a model that
can predict the disease by identifying the genes associated
with colon cancer, in order to provide clinical decision
support to physicians.
In this study, we used the LASSO feature selection method
to identify colon cancer-related genes. According to the
LASSO method, the Hsa.8125, Hsa.36689, Hsa.3306, Hsa.3016,
Hsa.8147, Hsa.2710, Hsa.22762, Hsa.31933, Hsa.5392,
Hsa.1410, Hsa.2928, Hsa.1387 and Hsa.865 genes may
be associated with colon cancer. Some of the biomarker
candidate genes we identified were in agreement with the
literature. Shaik et al. showed differential expression of
Hsa.8125, Hsa.36689 and Hsa.3306 genes in colon cancer
(genes1). In another study, Hsa.8125 and Hsa.3306 were
among 100 genes associated with colon cancer (18).
Hsa.8125 is a gene that enables RNA binding activity,
is involved in nucleocytoplasmic transport, and is located
in the endoplasmic reticulum, nucleus, and perinuclear
region of the cytoplasm. Yan et al. showed that this gene,
also known as ANP32A, is overexpressed in colorectal cancer
patients and that ANP32A levels are higher in poorly
differentiated tumors (19). Velmurugan et al. reported that
this gene is associated with lymph node metastasis (20).
Examining the relationship between colon cancer and the
Hsa.36689 gene, whose main task is guanylate cyclase
activation in the colon, Yang et al. identified this gene
among the top 5 most related genes (21). The Hsa.3016
and Hsa.8147 genes that we detected were also among the
most frequently selected genes in that study.
The Hsa.3306 gene plays a role in cell proliferation and
is increased in cancer. In another study examining the
colon gene dataset, it was identified as one of the ten
genes most closely associated with colon cancer among the
2000 genes. Hsa.8125, one of the genes we detected, was
also among the genes identified in that study. It has been
shown that this gene, whose functions are important in the
construction of intestinal villi, is increased in normal
cells and decreased in colon cancer cells (22).
Hsa.8147, also known as the desmin gene, is the gene
responsible for the production of desmin, a smooth
muscle-type intermediate filament protein expressed by
smooth muscle cells, but also in fibrotic tissue in wound
healing and in tumor ‘desmoplastic’ stroma. Desmin is also
produced by pericytes and surrounds the vasculature
during angiogenesis in capillaries. It also plays a role in
angiogenesis in cancer tissue. Studies have shown an
increase in desmin expression in advanced cancer patients
(23). In a study conducted in patients with gallbladder
cancer, down-regulation of the desmin gene was detected
(24).
The Hsa.3016 gene, which we have observed to be
strongly associated with colon cancer, is one of the
genes responsible for coding the S-100P protein. S100
proteins are involved in many events such as regulation
of calcium homeostasis, cell proliferation, apoptosis, and
cell migration. The S100 protein family plays a role in many
stages of cancer formation and progression. S-100P acts
as an inducer of metastasis: overexpression of S-100P
increases the expression of S-100A6 and Cathepsin D,
which are involved in cellular invasion. Furthermore, S-100P
promotes transendothelial migration of tumor cells (25).
The Hsa.2710 gene is one of the genes responsible for
making Fibulin-1, a secreted glycoprotein that is
incorporated into the fibrillar extracellular matrix. It is
involved in cell adhesion and migration along protein fibers
within the extracellular matrix (ECM). Considered to have a
role in cellular transformation and tumor invasion, it acts
as a tumor suppressor (26). Xu et al. showed that fibulin
downregulation is associated with colorectal cancer (27).
Nucleolin is a multifunctional protein found in the
nucleolus, nucleoplasm, and cytoplasm. Hsa.22762
is one of the genes involved in the synthesis of nucleolin.
It is involved in the regulation of translation and stability
of oncogenic mRNAs in the nucleoplasm. In our study, this
gene was found to be significantly associated with colon
cancer. Other studies have also shown that nucleolin is
overexpressed in many cancer types, such as stomach,
pancreatic, breast, cervical, and prostate cancers,
leukemias, melanomas, and colorectal cancers (28).
The Hsa.31933 gene, which we detected in our study, is
one of the genes that help Autographa californica multiple
nuclear polyhedrosis virus (AcMNPV), a member of the
Baculovirus family, to initiate viral gene expression by
preparing the host environment and controlling subsequent
viral gene expression, as other DNA viruses do when
infecting their hosts. Viral genes that are expressed
immediately after infection play a critical role in the
early infection process; Hsa.31933 (Immediate-Early
Regulatory Protein IE-N gene) is one of these genes.
AcMNPV has been studied as a gene therapy vector. Ono et
al. determined that AcMNPV induces acquired antitumor
immunity; they showed that AcMNPV can act as an effective
immune-inducing virus and as a eukaryotic expression vector
for gene delivery, and that it has the potential to be a
tumor therapy agent (29).
In another study, recombinant DNA obtained with this virus
enabled the production of a natural antigen associated
with carcinoma in mice (30). Although there are no studies
yet on this viral DNA in colon cancer, the data in our
study showed a strong relationship between colon cancer
and this gene. We think that more comprehensive studies on
the use of AcMNPV as a vector in the treatment of colon
cancer may yield meaningful results.
The Hsa.5392 gene is also known as ribosomal protein
L24 (RPL24). It is one of the genes responsible for the
expression of ribosomal proteins. It encodes the ribosomal
protein L24, a homolog of the cytosolic RPL24 found in
higher eukaryotes. Several studies have examined the
overexpression of ribosomal protein genes in human tumors
and their contribution to tumorigenesis (31).
Hsa.1410 is the gene responsible for the synthesis of
the eukaryotic translation initiation factor eIF-2.
Changes in protein synthesis play an important role in
cancer development and progression. Studies show that
ribosomal protein synthesis plays a direct role during
tumor initiation. Translation initiation is the
rate-limiting step of protein synthesis in eukaryotes and
involves a group of eukaryotic translation initiation
factors (eIFs). Previous studies have shown significant
overexpression of the eIF3 subunits eIF3A, eIF3B, and eIF3M
in colorectal cancer patients; eIF3C, which is an oncogene,
and eIF4 subunits are also increased in cancer cells (32).
In previous studies, eIF2a expression was described as
transiently increased in normal cells, whereas constitutive
overexpression indicated tumor initiation and progression.
Golob-Schwarzl et al. also showed that eIF2 is
overexpressed in colorectal cancers (32).
Among the genes we determined, Hsa.2928 is the mRNA
gene responsible for the expression of P-cadherin.
Cadherins are calcium-dependent cell adhesion proteins
that provide cell architecture and integrity, and their
degradation is often associated with human cancer (33).
Neo-expression or up-regulation of placental cadherin
(P-cadherin) has been reported in a variety of carcinomas,
including colorectal and bladder carcinomas (34).
The Hsa.1387 gene (human 11 beta-hydroxysteroid
dehydrogenase type II mRNA) has a strong association with
colon cancer and has been found to be associated with colon
carcinomas. The 11 beta-hydroxysteroid dehydrogenase type II
enzyme (11 beta HSD2), which is also located in the colon
and has an important role in water and electrolyte
homeostasis, gives specificity to the mineralocorticoid
receptor (35).
MAP kinases, also known as ERKs, are encoded by the Hsa.865
(ERK-1, M84490) gene. They are regulated by extracellular
signals and act in a signaling cascade that regulates
various cellular processes such as proliferation,
differentiation, and the cell cycle. The tumor suppressor
pathway is stimulated by ERK-1 phosphorylation (36). The
relationship between colon cancer and ERK-1 has been shown
in many studies (37-39), and our study also showed this
relationship.
In a similar study using the same dataset, PCA and PLS
feature extraction methods were applied and colon cancer
was then classified with the support vector machine (SVM)
method with an accuracy of 0.9516 (40). Another study found
that the combined use of PSO and SVM outperformed the model
created with the SVM algorithm alone in terms of accuracy
(0.94) and performance, and was faster in terms of run time
(41). In the current study, three models were created using
RF, DT and GNB classifiers based on the biomarker candidate
genes determined by the LASSO feature selection method.
According to the performance criteria obtained, the LASSO
+ RF model showed the best performance by correctly
classifying all samples.
CONCLUSION
In conclusion, this study identied genomic biomarkers
of colon cancer and classied the disease with a high-
performance model. According to the results obtained,
the LASSO method gave results compatible with the
literature while determining the genomic biomarkers.
For this reason, genes selected with LASSO can provide
clinical decision support to physicians in the diagnosis and
treatment of colon cancer. In addition, it can be suggested
that the LASSO+RF approach be used in modeling high-
dimensional data in medicine.
Financial disclosures: The authors declared that this study
has received no financial support.
Conflict of Interest: The authors declare that they have no
competing interest.
Ethical approval: Ethics committee approval is not required
in this study.
REFERENCES
1. Globocan W. Estimated cancer incidence, mortality and
prevalence worldwide in 2012. Int Agency Res Cancer. 2012.
2. Labianca R, Beretta G, Gatta G, et al. Colon cancer. Critical
Reviews Oncology Hematology. 2004;51:145-70.
3. Loboda A, Nebozhyn MV, Watters JW, et al. EMT is the
dominant program in human colon cancer. BMC Med
Genomics. 2011;4:1-10.
4. Xu C, Meng LB, Duan YC, et al. Screening and identification of
biomarkers for systemic sclerosis via microarray technology.
Int J Molecular Med. 2019;44:1753-70.
5. Ahmad MA, Eckert C, Teredesai A. Interpretable machine
learning in healthcare. Proceedings of the 2018 ACM
international conference on bioinformatics, Computational
Biology Health Informatics. 2018.
6. Yagin FH, Yagin B, Arslan AK, Çolak C. Comparison of
Performances of Associative Classification Methods for
Cervical Cancer Prediction: Observational Study. Turkey
Clinics J Biostatistics. 2021;13:13:266-72.
7. Khaire UM, Dhanalakshmi R. High-dimensional microarray
dataset classification using an improved adam optimizer
(iAdam). J Ambient Intelligence Humanized Computing.
2020;11:5187-204.
8. Hameed SS, Hassan R, Hassan WH, et al. HDG-select: A novel
GUI based application for gene selection and classification
in high dimensional datasets. PloS One. 2021;16:e0246039.
9. Mulla GA, Demir Y, Hassan M. Combination of PCA
with SMOTE Oversampling for Classification of
High-Dimensional Imbalanced Data. Bitlis Eren University
Science and Technology Journal. 2021;10:858-69.
10. Güçkiran K, Cantürk İ, Özyilmaz L. DNA microarray gene
expression data classification using SVM, MLP, and RF
with feature selection methods relief and LASSO. Journal
of Suleyman Demirel University Institute of Science and
Technology. 2019;23:126-32.
11. Akyol K, Bayir Ş, Baha Ş. Importance of Attribute Selection
for Parkinson Disease. Academic Platform J Engineering
Sci. 2020;8:175-80.
12. Yilmaz R, Yagin FH. Early detection of coronary heart
disease based on machine learning methods. Med Records.
2022;4:1-6.
13. Secgin Y, Oner Z, Turan MK, Oner S. Gender prediction with
parameters obtained from pelvis computed tomography
images and decision tree algorithm. Med Science.
2021;10:356-61
14. Doğan Ş, Türkoğlu İ. Hypothyroidi and hyperthyroidi
detection from thyroid hormone parameters by using
decision trees. Fırat University Journal of Oriental Studies.
2007;5:163-9.
15. Pulat M, Kocakoç ID. Machine Learning and Decision in
Turkey. Bibliometric Analysis of Published Theses in the
Field of Trees. Journal of Management and Economics.
2021;28:287-308.
16. Kamel H, Abdulah D, Al-Tuwaijari JM. Cancer classification
using gaussian naive bayes algorithm. 2019 Int Engineering
Conference (IEC); 2019:36:165-5.
17. Quackenbush J. Microarray analysis and tumor
classification. New England J Med. 2006;354:2463-72.
18. Jose A. Gene selection by 1-d discrete wavelet transform
for classifying cancer samples using DNA microarray data.
Ph.D. thesis, University of Akron, 2009.
19. Yan W, Bai Z, Wang J, et al. ANP32A modulates cell growth
by regulating p38 and Akt activity in colorectal cancer.
Oncology Reports. 2017;38:1605-12.
20. Velmurugan BK, Yeh K-T, Lee C-H, et al. Acidic leucine-
rich nuclear phosphoprotein-32A (ANP32A) association
with lymph node metastasis predicts poor survival in
oral squamous cell carcinoma patients. Oncotarget.
2016;7:10879.
21. Liu Q, Tan Y, Huang T, et al. TF-centered downstream gene
set enrichment analysis: Inference of causal regulators by
integrating TF-DNA interactions and protein post-translational
modications information. BMC Bioinformatics. 2010;11:1-
17.
22. Mora JAM, Ordoñez FM, Bonilla DA. Improvement of k-means
clustering algorithm performance in gene expression data
analysis through pre-processing with principal component
analysis and boosting. 2017;3:53-9.
23. Arentz G, Chataway T, Price TJ, et al. Desmin expression in
colorectal cancer stroma correlates with advanced stage
disease and marks angiogenic microvessels. Clinical
Proteomics. 2011;8:1-13.
24. Bhunia S, Barbhuiya MA, Gupta S, et al. Epigenetic
downregulation of desmin in gall bladder cancer reveals
its potential role in disease progression. Indian J Med
Research. 2020;151:311.
25. Chen H, Xu C, Jin Q, Liu Z. S100 protein family in human
cancer. Am J Cancer Res. 2014;4:89.
26. Twal WO, Czirok A, Hegedus B, et al. Fibulin-1 suppression
of bronectin-regulated cell adhesion and motility. J Cell Sci.
2001;114:4587-98.
27. Xu Z, Chen H, Liu D, Huo J. Fibulin-1 is downregulated through
promoter hypermethylation in colorectal cancer: a CONSORT
study. Med (Baltimore). 2015;94:e663.
28. Tong X, Mirzoeva S, Veliceasa D, et al. Chemopreventive
apigenin controls UVB-induced cutaneous proliferation
and angiogenesis through HuR and thrombospondin-1.
Oncotarget. 2014;5:11413.
29. Ono C, Sato M, Taka H, et al. Tightly regulated expression
of Autographa californica multicapsid nucleopolyhedrovirus
immediate early genes emerges from their interactions and
possible collective behaviors. PloS One. 2015;10:e0119580.
30. Strassburg CP, Kasai Y, Seng BA, et al. Baculovirus
recombinant expressing a secreted form of a transmembrane
carcinoma-associated antigen. Cancer Res. 1992;52:815-21.
31. Loging WT, Reisman D. Elevated expression of ribosomal
protein genes L37, RPP-1, and S2 in the presence of mutant
p53. Cancer Epidemiology and Prevention Biomarkers.
1999;8:1011-6.
32. Golob-Schwarzl N, Schweiger C, Koller C, et al. Separation
of low and high grade colon and rectum carcinoma by
eukaryotic translation initiation factors 1, 5 and 6. Oncotarget.
2017;8:101224.
33. Oliveira P, Sanges R, Huntsman D, et al. Characterization
of the intronic portion of cadherin superfamily members,
common cancer orchestrators. European J Human Genetics.
2012;20:878-83.
34. Van Marck V, Stove C, Jacobs K, et al. P-cadherin in adhesion
and invasion: Opposite roles in colon and bladder carcinoma.
Int J Cancer. 2011;128:1031-44.
35. Takahashi K, Sasano H, Fukushima K, et al. 11 beta-
hydroxysteroid dehydrogenase type II in human colon: a
new marker of fetal development and differentiation in
neoplasms. Anticancer Res. 1998;18:3381-8.
36. Baba Y, Nosho K, Shima K, et al. Prognostic significance
of AMP-activated protein kinase expression and modifying
effect of MAPK3/1 in colorectal cancer. British J Cancer.
2010;103:1025-33.
37. Esteve-Puig R, Canals F, Colome N, et al. Uncoupling of the
LKB1-AMPKα energy sensor pathway by growth factors and
oncogenic BRAFV600E. PloS One. 2009;4:e4771.
38. Zheng B, Jeong JH, Asara JM, et al. Oncogenic B-RAF
negatively regulates the tumor suppressor LKB1 to promote
melanoma cell proliferation. Molecular Cell. 2009;33:237-47.
39. Kim MJ, Park IJ, Yun H, et al. AMP-activated protein kinase
antagonizes pro-apoptotic extracellular signal-regulated
kinase activation by inducing dual-specificity protein
phosphatases in response to glucose deprivation in HCT116
carcinoma. J Bio Chemistry. 2010;285:14617-27.
40. Arowolo MO, Isiaka RM, Abdulsalam SO, et al. A comparative
analysis of feature extraction methods for classifying colon
cancer microarray data. EAI Endorsed Transactions Scalable
Information Systems. 2017;4:1-6.
41. Al Rajab M, Lu J, Xu Q. Examining applying high performance
genetic data feature selection and classification algorithms
for colon cancer diagnosis. Computer Methods Programs
Bio Med. 2017;146:11-24.
... Hsa.36689, Hsa.8417, Hsa.8125, Hsa.692, Hsa.6814 and Hsa.37937 were among the top 10 genes associated with colon cancer shown in the study [13]. In another study, Hsa.8125, Hsa.8147, Hsa.36689 and Hsa.2928 were among the top 13 genes associated with colon cancer [14]. Hsa.8125 is a gene located in the endoplasmic reticulum, nucleus and perinuclear region of the cytoplasm, is involved in nucleocytoplasmic transport and activates RNA binding activity. ...
... In a similar study using the same dataset in the literature, LASSO feature selection methods were applied and then they classified colon cancer with random forest, Gaussian Naive Bayes and Decision Tree methods with an accuracy of 1.0, 0.95 and 0.9 respectively [14]. In another study they used PCA and PLS feature extraction methods and used a support vector machine for the classification of colon cancer with an accuracy of 0.82 and 0.95 respectively [25]. ...
Preprint
Colon cancer is one of the most common types of cancer worldwide, and early detection is crucial for effective treatment. Microarray technology has emerged as a powerful tool for identifying gene expression patterns associated with colon cancer. This study aimed to identify potential biomarker genes responsible for colon cancer and to develop a machine learning model that can predict colon cancer based on these genes. A microarray dataset containing the expression levels of 2000 genes in 62 samples (22 normal and 40 abnormal tissues), obtained by the Queen's University Belfast Cancer Research and shared on the Kaggle website, was used in this study. Data were summarized as mean ± standard deviation, and the independent samples t-test was used for statistical analysis. SMOTE-Tomek data sampling was applied before feature selection to address the class imbalance problem in the dataset. The 10 most important genes that may contribute to colon cancer were selected using the Extra Tree Classifier (ETC) as the feature selection technique. Random Forest (RF), Decision Tree (DT), and Logistic Regression (LR) methods were used in the modeling phase. The 10 genes selected by ETC showed statistically significant differences between normal and abnormal samples. In the model created with RF, the accuracy, F1-score, sensitivity, specificity, and negative and positive predictive values were all calculated as 1. The RF model showed the best performance in comparison to DT and LR. The study identified genomic biomarkers of colon cancer and achieved the highest performance with the RF model. The results also indicate that the ETC+RF model can be used when dealing with high-dimensional microarray data.
... These are then fine-tuned with the help of the objective task. Whenever there is a scarcity of original dataset, it has been shown to be an effective strategy [22][23][24][25]. Due to the difficulties of obtaining medical image collections, this is a regular occurrence in medical imaging. ...
Article
Colonic adenocarcinoma is a major contributor to global mortality, highlighting the crucial need for efficient detection and classification techniques. This research presents a new method called XceptionTS for classifying and detecting colon cancer using colonoscopy pictures. The XceptionTS method utilizes deep transfer learning techniques by leveraging the Xception model architecture. Nonlinear Mean Filtering (NMF) is used as a noise reduction method in image processing to improve the quality of colonoscopy pictures. We combine the MobileNetV2 and ResNet-50 models for healthcare image segmentation and feature extraction, respectively. The XceptionTS classifier efficiently gives accurate class labels to medical photos by combining Tabu Search Optimization with the strong Xception architecture. The assessment of the effectiveness of XceptionTS model is done using a dataset of 1560 colonoscopy images. An extensive comparison study is undertaken by analyzing the efficacy of our suggested approach with existing research. The XceptionTS system outperforms previous methodologies in colon cancer classification and detection tasks, showing higher accuracy and robustness according to experimental results. Our findings indicate that the XceptionTS technique shows potential as an advanced tool to increase the effectiveness of Colonic adenocarcinoma diagnosis, which could lead to better patient outcomes and healthcare management.
... Artificial intelligence (AI) plays a crucial role in numerous clinical decision support systems, facilitating the use of computational methods to make inferences that are comparable to human reasoning processes (14). The strategies presented in this context are founded upon medical information that has been either explicitly encoded or automatically generated from medical data using machine learning techniques. ...
Article
Introduction Acute heart failure (AHF) is a serious medical problem that necessitates hospitalization and often results in death. Patients hospitalized in the emergency department (ED) should therefore receive an immediate diagnosis and treatment. Unfortunately, there is not yet a fast and accurate laboratory test for identifying AHF. The purpose of this research is to apply the principles of explainable artificial intelligence (XAI) to the analysis of hematological indicators for the diagnosis of AHF. Methods In this retrospective analysis, 425 patients with AHF and 430 healthy individuals served as assessments. Patients’ demographic and hematological information was analyzed to diagnose AHF. Important risk variables for AHF diagnosis were identified using the Least Absolute Shrinkage and Selection Operator (LASSO) feature selection. To test the efficacy of the suggested prediction model, Extreme Gradient Boosting (XGBoost), a 10-fold cross-validation procedure was implemented. The area under the receiver operating characteristic curve (AUC), F1 score, Brier score, Positive Predictive Value (PPV), and Negative Predictive Value (NPV) were all computed to evaluate the model’s efficacy. Permutation-based analysis and SHAP were used to assess the importance and influence of the model’s incorporated risk factors. Results White blood cell (WBC), monocytes, neutrophils, neutrophil-lymphocyte ratio (NLR), red cell distribution width-standard deviation (RDW-SD), RDW-coefficient of variation (RDW-CV), and platelet distribution width (PDW) values were significantly higher than the healthy group (p < 0.05). On the other hand, erythrocyte, hemoglobin, basophil, lymphocyte, mean platelet volume (MPV), platelet, hematocrit, mean erythrocyte hemoglobin (MCH), and procalcitonin (PCT) values were found to be significantly lower in AHF patients compared to healthy controls (p < 0.05). 
When XGBoost was used in conjunction with LASSO to diagnose AHF, the resulting model had an AUC of 87.9%, an F1 score of 87.4%, and a Brier score of 0.036. PDW, age, RDW-SD, and PLT were identified as the most crucial risk factors in differentiating AHF. Conclusion The results of this study showed that XAI combined with ML could successfully diagnose AHF. SHAP descriptions show that advanced age, low platelet count, high RDW-SD, and PDW are the primary hematological parameters for the diagnosis of AHF.
... Paksoy et al. [21] studied colon cancer using artificial intelligence and genomic biomarkers: they identified biomarker candidate genes for colon cancer and developed a model for predicting the disease based on these genes. They used a dataset with gene expression levels from 62 samples (22 healthy and 40 tumor tissues). ...
Article
Cancer is a leading cause of death globally. The majority of cancer cases are only diagnosed in the late stages of cancer due to the use of conventional methods. This reduces the chance of survival for cancer patients. Therefore, early detection consequently followed by early diagnoses are important tasks in cancer research. Gene expression microarray technology has been applied to detect and diagnose most types of cancers in their early stages and has gained encouraging results. In this paper, we address the problem of classifying cancer based on gene expression for handling the class imbalance problem and the curse of dimensionality. The oversampling technique is utilized to overcome this problem by adding synthetic samples. Another common issue related to the gene expression dataset addressed in this paper is the curse of dimensionality. This problem is addressed by applying chi-square and information gain feature selection techniques. After applying these techniques individually, we proposed a method to select the most significant genes by combining those two techniques (CHiS and IG). We investigated the effect of these techniques individually and in combination. Four benchmarking biomedical datasets (Leukemia-subtypes, Leukemia-ALLAML, Colon, and CuMiDa) were used. The experimental results reveal that the oversampling techniques improve the results in most cases. Additionally, the performance of the proposed feature selection technique outperforms individual techniques in nearly all cases. In addition, this study provides an empirical study for evaluating several oversampling techniques along with ensemble-based learning. The experimental results also reveal that SVM-SMOTE, along with the random forests classifier, achieved the highest results, with a reporting accuracy of 100%. The obtained results surpass the findings in the existing literature as well.
... AI plays a crucial role in numerous clinical decision support systems, facilitating the use of computational methods to make inferences that are comparable to human reasoning processes [15]. The strategies presented in this context are founded upon medical information that has been either explicitly encoded or automatically generated from medical data using machine learning techniques. ...
Article
Background: Heart failure (HF) causes high morbidity and mortality worldwide. The prevalence of HF with preserved ejection fraction (HFpEF) is increasing compared with HF with reduced ejection fraction (HFrEF). Patients with HFpEF are a patient group with a high rate of hospitalization despite medical treatment. Early diagnosis is very important in this group of patients, and early treatment can improve their prognosis. Although electrocardiographic (ECG) findings have been adequately studied in patients with HFrEF, there are not enough studies on these parameters in patients with HFpEF. There are very few studies in the literature, especially on gender-specific changes. The current research aims to compare gender-specific ECG parameters in patients with HFpEF based on the implications of artificial intelligence (AI). Methods: A total of 118 patients participated in the study, of which 66 (56%) were women with HFpEF and 52 (44%) were men with HFpEF. Demographic, echocardiographic, and electrocardiographic characteristics of the patients were analyzed to compare gender-specific ECG parameters in patients with HFpEF. The AI approach combined with machine learning approaches (gradient boosting machine, k-nearest neighbors, logistic regression, random forest, and support vector machines) was applied for distinguishing male patients with HFpEF from female patients with HFpEF. Results: After determining the parameters (demographic, echocardiographic, and electrocardiographic) to distinguish male patients with HFpEF from female patients with HFpEF, machine learning methods were applied, and among these methods, the random forest model achieved an average accuracy of 84.7%. The random forest algorithm results showed that smoking, P-wave dispersion, P-wave amplitude, T-end P/(PQ*Age), Cornell product, and P-wave duration were the most influential parameters for distinguishing male patients with HFpEF from female patients with HFpEF. 
Conclusions: The proposed model serves as a valuable tool for physicians, facilitating the diagnosis, treatment, and follow-up for distinguishing male patients with HFpEF from female patients with HFpEF. Analyzing readily accessible electrocardiographic parameters empowers medical professionals to make informed decisions and provide enhanced care to a wide range of individuals.
... Artificial intelligence (AI) plays a crucial role in numerous clinical decision support systems, facilitating the use of computational methods to make inferences that are comparable to human reasoning processes [14]. The strategies presented in this context are founded upon medical information that has been either explicitly encoded or automatically generated from medical data using machine learning techniques. ...
Preprint
Background: Acute heart failure (AHF) is a serious medical problem that necessitates hospitalisation and often results in death. Patients admitted to the emergency department (ED) should therefore receive an immediate diagnosis and treatment. Unfortunately, there is not yet a fast and accurate laboratory test for identifying AHF. The purpose of this research is to apply the principles of explainable artificial intelligence (XAI) to the analysis of hematological predictors for AHF. Methods: In this retrospective analysis, 425 patients with AHF and 430 healthy individuals served as assessments. Patients' demographic and hematological information was analyzed to determine AHF. Important risk variables for AHF diagnosis were identified using LASSO feature selection. To test the efficacy of the suggested prediction model (XGBoost), a 10-fold cross-validation procedure was implemented. The area under the receiver operating characteristic curve (AUC), F1 score, Brier score, Positive Predictive Value (PPV), and Negative Predictive Value (NPV) were all computed to evaluate the model's efficacy. Permutation-based analysis and SHAP were used to assess the importance and influence of the model's incorporated risk factors. Results: White blood cell (WBC), monocytes, neutrophils, neutrophil-lymphocyte ratio (NLR), red cell distribution width-standard deviation (RDW-SD), RDW-coefficient of variation (RDW-CV), and platelet distribution width (PDW) values were significantly higher than the healthy group (p < 0.05). On the other hand, erythrocyte, hemoglobin, basophil, lymphocyte, mean platelet volume (MPV), platelet, hematocrit, mean erythrocyte hemoglobin (MCH), and procalcitonin (PCT) values were found to be significantly lower in AHF patients compared to healthy controls (p < 0.05). When XGBoost was used in conjunction with LASSO to estimate AHF, the resulting model had an AUC of 87.9%, an F1 score of 87.4%, and a Brier score of 0.036. 
PDW, age, RDW-SD, and PLT were identified as the most crucial risk factors in differentiating AHF. Conclusions: The XGBoost model demonstrated exceptional performance in accurately estimating Acute Heart Failure, and the application of Explainable Artificial Intelligence effectively provided intuitive explanations for the model's estimations. The suggested interpretable model holds potential for the identification of patients at high risk, thereby facilitating the optimization of treatment and planning for follow-up in cases of AHF.
... Artificial intelligence (AI) is a key component of many clinical decision support systems, allowing procedures to infer judgements computationally similar to human thinking processes [12]. These strategies are based on medical knowledge explicitly encoded with rules or automatically derived from medical data via machine learning. ...
Article
Remodeling of the left ventricle (LV) after myocardial infarction (MI) is a process of infarct enlargement. Despite the relevance of the inflammatory response and healing process in LV remodeling after MI, the mechanisms that begin and govern these processes remain unknown. Based on the important information highlighted in different studies, the current research aims to investigate potential biomarkers for left ventricular remodeling after acute MI based on the interpretation of the explainable artificial intelligence (XAI). The project research from which the public dataset was obtained was designed in an experimental type. A cohort study involving 66 patients with coronary heart disease and 34 healthy community controls provided the platelet samples for the current research, which used available omics data on those samples. For discovering significant mechanistic connections between metabolites and glycans, the metabolomics and glycomics datasets were analyzed using biostatistics/metabolomics and explainable artificial intelligence techniques. Metabolomics data of 100 patients (AMI=66; Control=34) including 75 males and 25 females were evaluated in this study. As a result of experimental omics analyses, 102 metabolite levels of the patients were obtained. When FC values were examined, creatinine and dl-pipecolic acid levels were 0.50 and 0.55-fold down-regulated and glutamine, myoinositol, and cytosine levels were 1.34, 1.33, and 1.53-fold up-regulated in the AMI group compared to the control group. Findings of metabolomics data and XAI analyses revealed that five lipid metabolites may be used as potential predictors of AMI.
... Therefore, we considered only metabolites overlapping the two different statistical approaches for further analysis (FDR-corrected p-value < 0.05 and AUC > 0.70). Multivariate analyses were performed using the ROC curve method with biomarker candidate metabolites based on linear support vector machine (SVM) [17], partial least squares discrimination analysis (PLS-DA) [18], and random forest (RF) [19] algorithms. These methods have proved to be robust for high-dimensional data and are widely used for other types of 'omics' data analysis. ...
Article
Colorectal cancer (CRC) is one of the most common and lethal diseases among all types of cancer, and metabolites play a significant role in the development of this complex disease. This study aimed to identify potential biomarkers and targets in the diagnosis and treatment of CRC using high-throughput metabolomics. Metabolite data extracted from the feces of CRC patients and healthy volunteers were normalized with the median normalization and Pareto scale for multivariate analysis. Univariate ROC analysis, the t-test, and analysis of fold changes (FCs) were applied to identify biomarker candidate metabolites in CRC patients. Only metabolites that overlapped the two different statistical approaches (false-discovery-rate-corrected p-value < 0.05 and AUC > 0.70) were considered in the further analysis. Multivariate analysis was performed with biomarker candidate metabolites based on linear support vector machines (SVM), partial least squares discrimination analysis (PLS-DA), and random forests (RF). The model identified five biomarker candidate metabolites that were significantly and differently expressed (adjusted p-value < 0.05) in CRC patients compared to healthy controls. The metabolites were succinic acid, aminoisobutyric acid, butyric acid, isoleucine, and leucine. Aminoisobutyric acid was the metabolite with the highest discriminatory potential in CRC, with an AUC equal to 0.806 (95% CI = 0.700–0.897), and was down-regulated in CRC patients. The SVM model showed the most substantial discrimination capacity for the five metabolites selected in the CRC screening, with an AUC of 0.985 (95% CI: 0.94–1).
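The two-criterion screen described in this abstract (FDR-corrected p-value < 0.05 and AUC > 0.70) can be illustrated with a minimal Python sketch. It assumes the Benjamini-Hochberg procedure as the FDR correction and treats AUC as direction-agnostic (taking the larger of AUC and 1 - AUC so that down-regulated metabolites also qualify); neither choice is confirmed by the abstract. The function names, metabolite names, intensities, and raw p-values are hypothetical and are not data from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def auc(cases, controls):
    """Probability that a random case exceeds a random control (ties count 0.5)."""
    wins = sum((c > h) + 0.5 * (c == h) for c in cases for h in controls)
    return wins / (len(cases) * len(controls))


def screen(features, p_values, fdr=0.05, min_auc=0.70):
    """Keep features passing both the FDR threshold and the AUC threshold."""
    q_values = benjamini_hochberg(p_values)
    kept = []
    for (name, cases, controls), q in zip(features, q_values):
        a = auc(cases, controls)
        a = max(a, 1.0 - a)  # assumption: direction of change does not matter
        if q < fdr and a > min_auc:
            kept.append(name)
    return kept


# Hypothetical toy data: (name, values in patients, values in controls).
features = [
    ("metabolite_a", [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]),
    ("metabolite_b", [1.0, 5.0, 3.0], [2.0, 4.0, 6.0]),
]
p_values = [0.001, 0.2]  # hypothetical raw p-values, e.g. from a t-test
kept = screen(features, p_values)
print(kept)
```

Requiring both criteria is stricter than either alone: a feature must be statistically distinguishable after multiple-testing correction and also separate the groups well enough to be useful as a classifier input.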
Article
Cancer is a deadly disease that affects the lives of people all over the world. Finding a few genes relevant to a single cancer disease can lead to effective treatments. The difficulty with microarray datasets is their high dimensionality; they have a large number of features in comparison to the small number of samples. Additionally, microarray data typically exhibit significant asymmetry in dimensionality as well as high levels of redundancy and noise. It is widely held that the majority of genes lack informative value about the classes under study. Recent research has attempted to reduce this high dimensionality by employing various feature selection techniques. This paper presents new ensemble feature selection techniques via the Wilcoxon Sign Rank Sum test (WCSRS) and the Fisher's test (F-test). In the first phase of the experiment, data preprocessing was performed; subsequently, feature selection was performed via the WCSRS and F-test, with the p-values (probability values) of the WCSRS and F-test adopted for cancerous gene identification. The extracted gene set was used to classify cancer patients using ensemble learning models (ELMs): random forest (RF), extreme gradient boosting (XGBoost), CatBoost, and AdaBoost. To boost the performance of the ELMs, we optimized the parameters of all the ELMs using the Grey Wolf optimizer (GWO). The experimental analysis was performed on a colon cancer dataset, which included 2000 genes from 62 patients (40 malignant and 22 benign). Using the WCSRS test for feature selection, the optimized XGBoost demonstrated 100% accuracy. The optimized CatBoost, on the other hand, demonstrated 100% accuracy using the F-test for feature selection. This represents a 15% improvement over previously reported values in the literature.
Article
Non-small cell lung cancer (NSCLC) is a significant public health concern with high mortality rates. Recent advancements in genomic data, bioinformatics tools, and the utilization of biomarkers have improved the possibilities for early diagnosis, effective treatment, and follow-up in NSCLC. Biomarkers play a crucial role in precision medicine by providing measurable indicators of disease characteristics, enabling tailored treatment strategies. The integration of big data and artificial intelligence (AI) further enhances the potential for personalized medicine through advanced biomarker analysis. However, challenges remain in the impact of new biomarkers on mortality and treatment efficacy due to limited evidence. Data analysis, interpretation, and the adoption of precision medicine approaches in clinical practice pose additional challenges and emphasize the integration of biomarkers with advanced technologies such as genomic data analysis and artificial intelligence (AI), which enhance the potential of precision medicine in NSCLC. Despite these obstacles, the integration of biomarkers into precision medicine has shown promising results in NSCLC, improving patient outcomes and enabling targeted therapies. Continued research and advancements in biomarker discovery, utilization, and evidence generation are necessary to overcome these challenges and further enhance the efficacy of precision medicine. Addressing these obstacles will contribute to the continued improvement of patient outcomes in non-small cell lung cancer.
Article
Aim: Heart disease detection using machine learning methods has been an outstanding research topic as heart diseases continue to be a burden on healthcare systems around the world. Therefore, in this study, the performances of machine learning methods for predictive classification of coronary heart disease were compared. Material and Method: In the study, three different models were created with Random Forest (RF), Logistic Regression (LR), and Support Vector Machine (SVM) algorithms for the classification of coronary heart disease. For hyperparameter optimization, 10-fold cross-validation repeated 3 times was used. The performance of the models was evaluated based on Accuracy, F1 Score, Specificity, Sensitivity, Positive Predictive Value, Negative Predictive Value, and Confusion Matrix (Classification matrix). Results: RF, SVM, and LR classified coronary heart disease with accuracies of 0.929, 0.897, and 0.861, respectively. Specificity, Sensitivity, F1-score, Negative predictive and Positive predictive values of the RF model were calculated as 0.929, 0.928, 0.928, 0.929 and 0.928, respectively. The Sensitivity value of the SVM model was higher compared to the RF. Conclusion: Considering the accurate classification rates of coronary heart disease, the RF model outperformed the SVM and LR models. Also, the RF model had the highest sensitivity value. We think that this result is clinically very important, since high sensitivity minimizes the number of overlooked heart patients.
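The evaluation metrics named in this abstract can all be derived from the four counts of a binary confusion matrix. The following is a minimal Python sketch of those standard definitions; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    sensitivity: float
    specificity: float
    ppv: float
    npv: float
    f1: float


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> BinaryMetrics:
    """Derive common diagnostic metrics from confusion-matrix counts."""

    def ratio(num: int, den: int) -> float:
        # Return NaN instead of raising when a denominator is empty.
        return num / den if den else float("nan")

    return BinaryMetrics(
        accuracy=ratio(tp + tn, tp + fp + tn + fn),
        sensitivity=ratio(tp, tp + fn),    # true positive rate (recall)
        specificity=ratio(tn, tn + fp),    # true negative rate
        ppv=ratio(tp, tp + fp),            # positive predictive value (precision)
        npv=ratio(tn, tn + fn),            # negative predictive value
        f1=ratio(2 * tp, 2 * tp + fp + fn),  # harmonic mean of PPV and sensitivity
    )


# Hypothetical counts for illustration only.
m = binary_metrics(tp=90, fp=10, tn=85, fn=15)
print(m)
```

Reporting sensitivity alongside accuracy matters in screening settings, because a model can reach high accuracy on imbalanced data while still missing many true cases.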
Article
Imbalanced data classification is a common issue in data mining where the classifiers are skewed towards the larger data class. Classification of high-dimensional skewed (imbalanced) data is of great interest to decision-makers as it is more difficult to classify. Dimension reduction, a process in which the number of variables is reduced, allows high-dimensional datasets to be interpreted more easily at the cost of some information loss. In this study, a method combining SMOTE oversampling with principal component analysis (PCA) is proposed to solve the imbalance problem in high-dimensional data. Three classification algorithms (Logistic Regression, K-Nearest Neighbor, and Decision Tree) and two separate datasets were utilised to evaluate the suggested method's efficacy and determine the classifiers' performance. The raw datasets and the datasets transformed by the PCA, SMOTE, and SMOTE+PCA methods were each analyzed with the given algorithms. Analyses were performed using WEKA. The results suggest that almost all classification algorithms improve their classification performance using the PCA, SMOTE, and SMOTE+PCA methods. However, the SMOTE method gave more efficient results than the PCA and SMOTE+PCA methods for data rebalancing. Experimental results also suggest that the K-Nearest Neighbor classifier provided higher classification performance compared to other algorithms.
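The core idea of SMOTE, as generally described, is to create synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below is a simplified pure-Python illustration of that idea under stated assumptions; it is not the WEKA implementation used in the study, and the function name `smote`, its parameters, and the toy points are hypothetical.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority-class neighbours of the base point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical minority-class samples with two features each.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote(minority, n_new=4, k=2)
print(new_points)
```

Because every synthetic point lies on a segment between two existing minority samples, the oversampled class stays within the region the minority data already occupies. Oversampling is normally applied to training data only, so that evaluation data contains no synthetic samples.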
Article
Gender prediction is among the most critical topics in forensic medicine and anthropology since it is the basis of identity (height, weight, ancestry, age). Today, osteometry, a low-cost, easily accessible method that requires no expertise, is preferred over DNA technology, which has several disadvantages such as high cost, limited accessibility, and the need for laboratory facilities and expert personnel. The Computed Tomography (CT) method, which is little affected by orientation and provides reconstruction opportunities, was selected instead of traditional methods for osteometry. This study aims to predict gender with high accuracy using the Decision Tree (DT) algorithms that have recently been used in the field of health. In the present study, CT images of 300 individuals (150 females, 150 males) without a pathology on the pelvic skeleton and between the ages of 25 and 50 were transformed into orthogonal form, landmarks were placed on promontorium, sacroiliac joint, iliac crest, terminal line, anterior superior iliac spine, anterior inferior iliac spine, greater trochanter, obturator foramen, lesser trochanter, femoral head, femoral neck, the body of femur, ischial tuberosity, acetabulum, and pubic symphysis, and the coordinates of these landmarks were determined. Then, parameters such as angle and length were obtained with various combinations. These parameters were analyzed with the DT algorithm. The analysis conducted with the DT algorithm revealed that accuracy (Acc) was 0.93, sensitivity was 0.95, specificity was 0.90, and the Matthews correlation coefficient was 0.86 for the pelvic skeleton. It was observed that the accuracy was quite high and more realistic when determined with the DT algorithm. In conclusion, the DT algorithm with multiple parameters and samples on pelvic CT images could improve the Acc of gender prediction. [Med-Science 2021;10(2):356-61]
The selection and classification of genes is essential for identifying genes related to a specific disease. Developing a user-friendly application that combines statistical rigor with machine learning functionality to help biomedical researchers and end users is of great importance. In this work, a novel stand-alone application based on a graphical user interface (GUI) is developed to perform the full functionality of gene selection and classification in high-dimensional datasets. The application, called HDG-select, is validated on eleven high-dimensional datasets in CSV and GEO SOFT formats. The proposed tool uses the efficient combined filter-GBPSO-SVM algorithm and has been made freely available to users. It was found that HDG-select outperformed other tools reported in the literature and offered competitive performance, accessibility, and functionality.
Background & objectives: Gall bladder cancer (GBC) is a fatal neoplasm with globally variable incidence rates. To improve the survival rate of patients, a newer set of biomarkers needs to be discovered for its early detection and better prognosis. Our earlier studies on GBC proteomics and whole-genome methylome data revealed expression of desmin to be significantly downregulated with correlated promoter hypermethylation during gall bladder carcinogenesis. Thus, to evaluate desmin as a potential biomarker for GBC, we carried out a detailed follow-up study. Methods: Methylation-specific polymerase chain reaction (MS-PCR) (n=17, GBC and n=23, non-tumour control), real-time quantitative reverse transcription-polymerase chain reaction (qRT-PCR) [n=14, GBC and n=14, adjacent non-tumour (ANT)], immunohistochemistry (n=27, GBC and n=14, non-tumour) and immunoblotting (n=13, GBC and n=13, ANT) were performed in surgically removed gall bladder tissue samples. Results: MS-PCR analysis showed methylation of desmin in 88.23 per cent (15/17) of gall bladder tumour samples as compared to non-tumour tissues (39.13%, 9/23). Real-time qRT-PCR analysis revealed a significant downregulation of desmin expression in GBC as compared to ANT tissue. This was further confirmed by western blot, showing reduced expression of desmin protein in GBC as compared to non-tumour tissue. Immunohistochemical analysis also showed a decreased level of desmin in more than 95 per cent (26/27) of tumour samples compared to non-tumours (35.71%, 5/14). Interpretation & conclusions: The increased frequency of desmin promoter methylation, which could be responsible for its significant downregulation, indicates its potential as a candidate biomarker for GBC. This requires further validation in a large group of patients to evaluate its clinical utility.
Classifying data samples into their respective categories is a challenging task, especially when the dataset has many features and only a few samples. A robust model is essential for the accurate classification of data samples. The logistic sigmoid model is one of the simplest models for binary classification. Among the various optimization techniques for the sigmoid function, the Adam optimization technique iteratively updates network weights based on training data. The traditional Adam optimizer fails to converge the model within a given number of epochs when the initial parameter values are situated in a gentle region of the error surface. The continuous movement of the convergence curve in the direction of its history can overshoot the goal and oscillate back and forth before converging to the global minimum. The traditional Adam optimizer with a higher learning rate collapses after several epochs on high-dimensional datasets. The proposed Improved Adam (iAdam) technique combines a look-ahead mechanism with an adaptive learning rate for each parameter. It improves the momentum of traditional Adam by evaluating the gradient after applying the current velocity. iAdam also acts as a correction factor to the momentum of Adam. Further, it works efficiently on high-dimensional datasets and converges to the smallest error within the specified epochs even at higher learning rates. The proposed technique is compared with several traditional methods, which demonstrates that iAdam is suitable for the classification of high-dimensional data and that it also prevents the model from overfitting by effectively handling the bias-variance trade-off.
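The look-ahead idea described above (evaluating the gradient after applying the current velocity) can be sketched on a small logistic model. This is one possible reading of that sentence, written as a Nesterov-style variant of the standard Adam update; it is not the authors' iAdam implementation, whose exact update rule the abstract does not give. The function names, the `look_ahead` flag, the hyperparameters, and the toy data are all assumptions.

```python
import math


def loss_and_grad(w, data):
    """Mean logistic loss and its gradient for weights w = [bias, slope]."""
    loss, grad = 0.0, [0.0, 0.0]
    for x, y in data:
        z = w[0] + w[1] * x
        p = 1.0 / (1.0 + math.exp(-z))
        # Numerically stable form of -[y*log(p) + (1-y)*log(1-p)]
        loss += math.log1p(math.exp(-z if y == 1 else z))
        grad[0] += p - y
        grad[1] += (p - y) * x
    n = len(data)
    return loss / n, [g / n for g in grad]


def adam(data, steps=200, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8,
         look_ahead=False):
    """Standard Adam; with look_ahead=True the gradient is probed one
    velocity step ahead (an assumed, Nesterov-style interpretation)."""
    w, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
    last_step = [0.0, 0.0]
    for t in range(1, steps + 1):
        if look_ahead:
            # Evaluate the gradient where the current velocity would take us.
            probe = [wi + beta1 * si for wi, si in zip(w, last_step)]
        else:
            probe = w
        _, g = loss_and_grad(probe, data)
        for i in range(2):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]       # first moment
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2  # second moment
            m_hat = m[i] / (1 - beta1 ** t)                # bias correction
            v_hat = v[i] / (1 - beta2 ** t)
            last_step[i] = -lr * m_hat / (math.sqrt(v_hat) + eps)
            w[i] += last_step[i]
    return w, loss_and_grad(w, data)[0]


# Hypothetical, non-separable toy data: (feature, label) pairs.
data = [(-2.0, 0), (-1.0, 0), (-0.5, 1), (0.5, 0), (1.0, 1), (2.0, 1)]

_, plain_loss = adam(data)
_, ahead_loss = adam(data, look_ahead=True)
print(f"Adam loss: {plain_loss:.4f}  look-ahead loss: {ahead_loss:.4f}")
```

On a problem this small both variants reach similar losses; the sketch is meant only to show where the look-ahead changes the update, not to reproduce the performance claims of the abstract.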
Systemic sclerosis (SSc) is a complex autoimmune disease. The pathogenesis of SSc is currently unclear and, as with other rheumatic diseases, is complicated. However, the ongoing development of bioinformatics technology has enabled new approaches to researching this disease, using microarray technology to screen and identify differentially expressed genes (DEGs) in the skin of patients with SSc compared with individuals with healthy skin. Publicly available data were downloaded from the Gene Expression Omnibus (GEO) database, and intra-group data repeatability tests were conducted using Pearson's correlation test and principal component analysis. DEGs were identified using an online tool, GEO2R. Functional annotation of DEGs was performed using Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis. Finally, a protein-protein interaction (PPI) network was constructed and analyzed, and hub genes were identified and analyzed. A total of 106 DEGs were detected by screening SSc and healthy skin samples. A total of 10 genes [interleukin-6, bone morphogenetic protein 4, calumenin (CALU), clusterin, cysteine rich angiogenic inducer 61, serine protease 23, secretogranin II, suppressor of cytokine signaling 3, Toll-like receptor 4 (TLR4), tenascin C] were identified as hub genes with degrees ≥10, and these could sensitively and specifically predict SSc based on receiver operating characteristic curve analysis. GO and KEGG analysis showed that variations in hub genes were mainly enriched in positive regulation of nitric oxide biosynthetic processes, negative regulation of apoptotic processes, extracellular regions, extracellular spaces, cytokine activity, chemo-attractant activity, and the phosphoinositide 3 kinase-protein kinase B signaling pathway. In summary, bioinformatics techniques proved useful for the screening and identification of biomarkers of disease. A total of 106 DEGs and 10 hub genes were linked to SSc, in particular the TLR4 and CALU genes.
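The hub-gene criterion mentioned above (nodes with degree ≥10 in the PPI network) amounts to counting each gene's distinct interaction partners and applying a threshold. A minimal sketch follows, assuming the network is available as an edge list; the function name `hub_genes`, the placeholder gene labels, and the toy threshold of 3 are hypothetical, and no real interactions are represented.

```python
from collections import Counter


def hub_genes(edges, min_degree):
    """Return {gene: degree} for genes with at least min_degree partners."""
    # Deduplicate undirected edges and ignore self-interactions.
    unique = {frozenset(pair) for pair in edges if pair[0] != pair[1]}
    degree = Counter()
    for pair in unique:
        for gene in pair:
            degree[gene] += 1
    return {gene: d for gene, d in degree.items() if d >= min_degree}


# Hypothetical toy network with placeholder gene labels.
edges = [
    ("G1", "G2"), ("G1", "G3"), ("G1", "G4"),
    ("G2", "G3"), ("G4", "G5"),
    ("G2", "G1"),  # duplicate of G1-G2 in reverse order
]

print(hub_genes(edges, min_degree=3))
```

For a real PPI network the same function would be called with `min_degree=10` to match the threshold stated in the abstract.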