Content available from Computational Intelligence and Neuroscience
Research Article
Machine Learning Hybrid Model for the Prediction of Chronic
Kidney Disease
Hira Khalid,1 Ajab Khan,1 Muhammad Zahid Khan,2 Gulzar Mehmood,3 and Muhammad Shuaib Qureshi4
1 Department of Information Technology, Abbottabad University of Science and Technology, Havelian 22500, Abbottabad, Pakistan
2 Department of Computer Science and I.T, Network Systems and Security Research Group, University of Malakand, Chakdara 18800, Khyber Pakhtunkhwa, Pakistan
3 Department of Computer Science, IQRA National University, Swat Campus 19220, Peshawar, Pakistan
4 Department of Computer Science, School of Arts and Sciences, University of Central Asia, Bishkek, Kyrgyzstan
Correspondence should be addressed to Muhammad Shuaib Qureshi; muhammad.qureshi@ucentralasia.org
Received 25 July 2022; Revised 6 September 2022; Accepted 19 September 2022; Published 14 March 2023
Academic Editor: Farman Ali
Copyright © 2023 Hira Khalid et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
To diagnose an illness, doctors typically conduct physical exams and review the patient’s medical history, followed by diagnostic tests and procedures to determine the underlying cause of symptoms. Chronic kidney disease (CKD) is currently a leading cause of death, with a rapidly increasing number of patients, resulting in 1.7 million deaths annually. While various diagnostic methods are available, this study utilizes machine learning due to its high accuracy. We use a hybrid technique to build our proposed model and the Pearson correlation for feature selection. In the first step, the best models were selected on the basis of a critical literature analysis. In the second step, a combination of these models is used in the proposed hybrid model: Gaussian Naïve Bayes, gradient boosting, and decision tree classifiers are used as base classifiers, and the random forest classifier is used as the meta-classifier. The objective of this study is to evaluate the best machine learning classification techniques and identify the best-performing classifier in terms of accuracy. The proposed model addresses overfitting and achieves the highest accuracy. The study also highlights some of the challenges that affect performance. We critically review the existing machine learning classification techniques, evaluate them in terms of accuracy, and present a comprehensive analytical evaluation of the related work in tabular form. In the implementation, we use the top four models and build a hybrid model for prediction using the UCI chronic kidney disease dataset. Gradient boosting achieves around 99% accuracy, random forest achieves 98%, the decision tree classifier achieves 96%, and our proposed hybrid model performs best, reaching 100% accuracy on the same dataset. Some of the main machine learning algorithms used to predict the occurrence of CKD are Naïve Bayes, decision tree, K-nearest neighbor, random forest, support vector machine, LDA, GB, and neural network. In this study, we apply GB (gradient boosting), Gaussian Naïve Bayes, and decision tree along with random forest on the same set of features and compare the accuracy scores.
1. Introduction
Nowadays, chronic kidney disease (CKD) is a rapidly growing disease, and millions of people die due to lack of timely, affordable treatment. Many chronic kidney disease patients live in low- and middle-income countries [1, 2].
Hindawi
Computational Intelligence and Neuroscience
Volume 2023, Article ID 9266889, 14 pages
https://doi.org/10.1155/2023/9266889
In 2013, about one million people died due to chronic kidney disease [3]. The developing world suffers more from chronic kidney disease: low- and middle-income countries contain a total of 387.5 million CKD patients, of whom 177.4 million are male and 210.1 million are female [4]. These figures show that a large number of people in developing countries suffer from chronic kidney disease, and this number is increasing day by day. A lot of work has been done on the early diagnosis of chronic kidney disease so that the disease can be treated at an early stage. In this article, we focus on machine learning prediction models for chronic kidney disease, with emphasis on accuracy.
Chronic kidney disease is a common type of kidney disease that occurs when both kidneys are damaged, and CKD patients suffer from this condition over a long term. Here, kidney damage means any kidney condition that can cause improper functioning of the kidney. This could be caused by any disorder or by a reduction in the glomerular filtration rate (GFR) [5]. Our proposed prediction model takes the clinical symptoms as input and predicts the results using a stacking classifier with the random forest algorithm as the meta-classifier.
Machine learning is gaining significance in healthcare diagnosis as it enables intricate analysis, thereby minimizing human errors and enhancing the precision of predictions. Machine learning algorithms and classifiers are now considered the most reliable techniques for the diagnosis of different diseases, such as heart disease, diabetes, tumors, and liver disease [6].
Naïve Bayes, SVM, and the decision tree have been used for classification, while random forest, logistic regression, and linear regression have been used for regression in medical prediction tasks. With efficient use of these algorithms, the death rate can be reduced through early-stage diagnosis and timely treatment. Along with managing the clinical symptoms, chronic kidney disease patients should include physical activity in daily life: they should exercise, drink water, and avoid junk food. The common symptoms of chronic kidney disease are shown in Figure 1.
This article delivers an overview and analysis, followed by an implementation and evaluation, of the machine learning classifiers used in CKD diagnosis. Further, this article discusses the importance of machine learning classifiers in healthcare and explains how they can make more accurate predictions. Figure 2 shows the block diagram of the chronic kidney disease prediction model.
The core objective of this article is to propose and implement a hybrid machine learning prediction model for chronic kidney disease in which due importance is given to accuracy. We analyze the accuracy of different machine learning algorithms on the same dataset and compare their accuracy scores to obtain a better model. Our focus is on solving the overfitting problem using cross-validation while achieving the highest accuracy, building a hybrid model from a combination of popular machine learning classifiers: decision tree, gradient boosting, Gaussian Naïve Bayes, and random forest. The ultimate goal is to deliver accurate and effective treatment to CKD patients at a reduced cost. Before proceeding further, we need to know a little more about common diseases of the kidney. Table 1 lists some of the most common kidney diseases (Table 2).
The remaining portion of the article is organized as follows. Section 2 contains the literature survey along with a tabular comparison of the different machine learning algorithms used and an analysis of the results. Section 3 contains the proposed methodology. Section 4 contains the dataset details. Section 5 contains results and discussion. Section 6 contains the conclusion and future work.
2. Literature Review
This section covers research work related to these algorithms and assesses some of them based on their accuracy. In [7], the authors report that data mining applied to the analysis of clinical records is a good method. The decision tree method reached 91% accuracy, higher than that of the Naïve Bayes method. The classification algorithm for the diabetes dataset had 94% specificity and 95% sensitivity. They also found that mining helps retrieve correlations among attributes that are not direct indicators of the class they are trying to predict. Similar work still needs to be done to improve the overall accuracy of the prediction engine in the statistical analysis of neural networks and clustering algorithms.
Figure 1: Symptoms in CKD patients [7]. (Bar chart titled “Symptoms in CKD Patients”; vertical axis in %, from 0 to 80; categories: ethnicity, marital status, educational level, diabetes, hypertension, cerebrovascular disease, myocardial infarction, malignancy, psychiatric disease, BMI, albumin, haemoglobin, and proteinuria.)
Figure 2: Block diagram of the machine learning hybrid model. (Blocks: CKD, EDA, feature selection, hybrid model, predicted output.)
In [8], the authors described prediction models for CKD using machine learning techniques including K-nearest neighbor (KNN), support vector machine (SVM), logistic regression (LR), and decision tree classifiers. From the experiment, it was concluded that the SVM classifier provides the highest accuracy, 98.3%. SVM also has the best sensitivity after training and testing with the proposed method. Therefore, according to this comparison, it could be concluded that an SVM classifier is suitable for predicting chronic kidney disease.
In [9], the authors chose four different algorithms and compared them to obtain an accurate prediction rate over the dataset. Of all the approaches presented, they got the best results from the gradient boosting classifier. The model achieves an accuracy of 99.80%, whereas AdaBoost and LDA achieve the lowest value, 97.91%. The gradient boosting classifier takes more time to make a prediction than the others, but it has a higher predictive value in both curves (ROC and AUC). Hence, an accurate prediction depends strongly on the preprocessing strategy, and the preprocessing methods must be chosen cautiously to achieve reliable results.
In [7], the authors investigated the ability of machine learning, supported by predictive analysis, to predict CKD early. An experimental procedure was performed on a dataset of 400 cases collected by Apollo Hospitals, India. Two labels were used as outputs (patients having CKD and healthy individuals), and four different machine learning classifiers were implemented. In the comparison of these classifiers, the classification and regression tree (RPART) model showed remarkably better results in terms of accuracy. They used the information gain ratio as the splitting criterion, and the optimal splitting reduces the noise of the resulting feature subsets. In this study, the RPART splitting limit was five, meaning that splits occur repeatedly for the five instances present in the leaf node. In addition, they specified equal prior probabilities for the class attribute. The RPART prediction model used seven terminal nodes for the early prediction of CKD. The experimental results showed that the highest AUC and TPR were obtained with the multilayer perceptron (MLP) model, whereas the highest TNR (1.00) was achieved with RPART. The RPART model can be described as a set of decision rules. However, the major drawback of RPART is that it considers a single factor as the parameter in every division procedure, whereas considering combinations of parameters could result in better CKD predictions. The MLP model, in contrast, gives the lowest error rate. The major reason is that the MLP can adapt to and handle complex predictions. Hidden nodes are useful because they allow neural networks to model complex relationships between parameters and to deal with nonlinearity in the data. The overall results indicate that machine learning algorithms offer an inspiring and feasible methodology for earlier CKD prediction.
Table 1: Description of common diseases of the kidney.
Diseases | Description
CKD | Chronic kidney disease (CKD) can occur when a disease or condition impairs kidney function, causing kidney damage to worsen over a few months or years.
Kidney stones | Kidney stones (also called renal calculi) are hard deposits made of salts and minerals that form inside the kidney.
Glomerulonephritis | Glomerulonephritis causes infection and damage to the filtering part of the kidneys (glomerulus). It can occur quickly or over a longer period. Poisons, metabolic wastes, and surplus fluid are not properly filtered into the urine. Instead, they build up in the body, producing inflammation and fatigue.
Polycystic kidney disease | Polycystic kidney disease (PKD) is a genetic disorder in which many fluid-filled cysts grow inside the kidneys. Usually, they are harmless. The cysts can change the shape of the kidneys while making them much bigger.
Table 2: Equations for accuracy measurement.
S. no | Authors | Accuracy equations
1 | Padmanaban and Parthiban [8] | $\text{Precision}_i = TP_i / (TP_i + FP_i)$
2 | Charleonnan et al. [9] | $\text{ACC} = (TP + TN) / (P + N)$
3 | Ghosh et al. [7] | The results of performance degree indices are dependent on TP, TN, FP, and FN
4 | Fu et al. [10] | Ext. values = points $> Q_3 + 1.5\,(\text{IQR})$ or points $< Q_1 - 1.5\,(\text{IQR})$
5 | Devika et al. [11] | Accuracy = number of properly classified samples / total number of samples
6 | Revathy et al. [12] | $\text{Accuracy} = (TP + TN) / (TP + TN + FP + FN)$
7 | Nishat et al. [14] | $\text{Accuracy} = (TP + TN) / (TP + TN + FP + FN)$
8 | Rabby et al. [13] | Descriptive analysis of the data as well as the experimental results
9 | Pouriyeh et al. [15] | Finding the most significant feature using the chi-square test
10 | Jabbar et al. [16] | Experimental results only
True positive (TP) = instances that are correctly categorized as having CKD. False positive (FP) = instances that are incorrectly categorized as having CKD. True negative (TN) = instances that are correctly categorized as not having CKD. False negative (FN) = instances that are incorrectly categorized as not having CKD.
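The accuracy equations in Table 2 and the TPR and TNR values quoted above all derive from the four confusion-matrix counts. The following is a minimal sketch of that arithmetic in plain Python; the function names and the toy label lists are illustrative assumptions and do not come from any of the cited studies.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary problem."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Accuracy, precision, TPR (sensitivity), and TNR (specificity)."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "tpr": tp / (tp + fn) if (tp + fn) else 0.0,
        "tnr": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Hypothetical labels (not from the paper): 1 = CKD, 0 = not CKD.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
counts = confusion_counts(y_true, y_pred)
scores = metrics(*counts)
```

On a class-imbalanced dataset, reporting TPR and TNR next to accuracy helps show whether a high accuracy comes from both classes or mainly from the majority class.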
As we have already seen, different machine learning prediction models and learning programs are available to assist practitioners. In [5], the authors used a new selection guide for predicting CKD. In this work, CKD is predicted using specific classifiers, with a comparative study of their overall performance. They evaluated the Naïve Bayes, random forest, and artificial neural network classifiers and concluded that the random forest classifier performs better than the other classifiers. The value of forecasting CKD has been growing. Several evolutionary strategies can be used to improve the outcomes of the suggested classifiers. Here, Naïve Bayes, random forest, and KNN were applied to predict CKD. Early diagnosis of CKD helps to treat those affected in good time and to prevent the disease from progressing to a worse stage. Early detection and well-timed treatment of this type of disease is one of the main objectives of the medical field.
In [10], a machine learning prediction model was developed for the early prediction of CKD. The input features were gathered from the CKD dataset, and the models were tested and validated on these features. Decision tree, random forest, and support vector classifiers were constructed for the diagnosis of CKD. The performance of the models was assessed on the basis of their accuracy scores. The comparison showed that the random forest classifier performs much better at predicting CKD than the decision tree and support vector classifiers.
The kidneys not only filter toxins from the body but also play a vital role in maintaining blood pressure, acid-base balance, and electrolyte balance. Malfunction is responsible for illnesses ranging from insignificant to fatal, in addition to dysfunction in other body organs. Therefore, researchers all over the world have dedicated themselves to finding techniques to accurately diagnose and effectively treat chronic kidney disease. As machine learning classifiers are increasingly used in the medical field for diagnosis, CKD is now also among the diseases that can be predicted using machine learning classifiers. Research on detecting CKD with ML algorithms has progressively improved the procedure and the accuracy of the results. The authors proposed the random forest classifier (99.75% accuracy) as the most efficient of all the classifiers. The study demonstrates effective handling of missing values through four techniques, namely, the mode, mean, median, and zero-point methods. It also evaluates the performance of machine learning models under two scenarios, with and without hyperparameter tuning, and observes significant improvement in the classifiers’ performance, which is presented visually through graphs [11].
Overall, the motive of the study is to examine the applicability of specific supervised machine learning classifiers in the field of bioinformatics and to assess their suitability for detecting serious diseases, such as diagnosing CKD at an early stage [12].
They built an updated and proficient machine learning (ML) application that can detect and predict the state of chronic kidney disease. In this work, the ten most important machine learning methods for predicting chronic kidney disease were considered. The authors report that the accuracy of the classification algorithms they used was as good as they wanted.
For the prediction of disease, the first essential step is detecting the disease, which is costly in developing countries such as Pakistan and Bangladesh. The people of these countries suffer most from this. Currently, the proportion of CKD patients is increasing rapidly in Pakistan and Bangladesh. In that article, the authors therefore tried to develop a system that helps in predicting the risk of CKD. In the proposed model, they processed UCI datasets and real-time datasets, dealt with missing data, and trained the model using random forest and ANN classifiers. They implemented these two algorithms in Python. The accuracy obtained with the random forest algorithm is 97.12% and that with ANN is 94.5%, which is relatively good. With this proposed method, risk prediction of CKD at an early stage is possible.
In [13], the authors predicted CKD based on sugar levels, albumin levels, and red blood cell percentage. In that work, five classifiers were applied, namely, Naïve Bayes, logistic regression, decision table, random tree, and random forest, and for each classifier, the results were noted (i) without preprocessing, (ii) with SMOTE and resampling, and (iii) with a class equalizer. The random forest classifier was observed to give the highest accuracy, 98.93%, with SMOTE and resampling.
2.1. Comparison of Machine Learning Classifiers for CKD. In this section, a comprehensive comparison of the state of the art is presented in the form of a table. The evaluation is made in terms of accuracy, as shown in Table 3. The table has eight features that are described below:
Author: this column contains the names of the authors of each article along with the reference.
Year: this column provides the year of the paper’s publication.
Input data: this column shows the type of dataset that was used as input for the machine learning classifiers.
Disease type: this column shows the type of disease that was predicted using the different classifiers. It shows the best classifier found in the research paper, which is the classifier with the maximum accuracy.
Table 3: Comparison of classifiers for CKD. Within each row, the entries in the Classifiers and Accuracy columns are listed in the same order.
S. no | Authors | Year | Input data | Disease type | Tools | Classifiers | Cross-validation | Accuracy
1 | Padmanaban and Parthiban [8] | 2016 | Diabetic patients, UCI machine learning | CKD | WEKA, YALE | Naïve Bayes; Decision tree | 10 folds | 86%; 91%
2 | Charleonnan et al. [9] | 2016 | Clinical data | CKD | WEKA, MATLAB | SVM; Logistic regression; Decision tree; KNN | 5 folds | 98.3%; 96.55%; 94.81%; 98.1%
3 | Ghosh et al. [7] | 2020 | Apollo Hospitals India | CKD | Python | SVM; AB; LDA; GB | 5 folds | 99.56%; 97.91%; 97.91%; 99.80%
4 | Fu et al. [10] | 2018 | UCI repository (CKD dataset) | CKD | Python | RPART; SVM; LOGR; MLP | No cross-validation | 98.2%; 97.3%; 99.4%; 99.5%
5 | Devika et al. [11] | 2019 | UCI repository (CKD dataset) | Chronic renal disorder | C Sharp | Naïve Bayes; KNN; Random forest | No cross-validation | 99.63%; 87.78%; 99.84%
6 | Revathy et al. [12] | 2019 | UCI repository (CKD dataset) | CKD | Python | Decision tree; SVM; Random forest | No cross-validation | 94.16%; 98.33%; 99.16%
7 | Nishat et al. [14] | 2021 | Learning repository of University of California, Irvine | CKD | Python | CNN; LR; DT; RF; SVM; NB; MLP; QDA | No cross-validation | 78%; 98.25%; 99%; 99.75%; 85%; 96.5%; 81.25%; 37.5%
8 | Rabby et al. [13] | 2019 | UCI repository (CKD dataset) | CKD | Python | K-nearest neighbor; RF; SVM; GNB; AB; DT; LDA; GB; LR; ANN | No cross-validation | 71.25%; 98.75%; 97.50%; 100%; 98.75%; 100%; 97.50%; 98.75%; 97.50%; 65%
9 | Pouriyeh et al. [15] | 2020 | UCI repository (CKD dataset) | CKD | Python | RF; ANN | 10 folds | 97.12%; 94.5%
10 | Jabber et al. [16] | 2020 | UCI repository (CKD dataset) | CKD | Python | Decision tree; Logistic regression; Naïve Bayes; Random forest | No cross-validation | 96.79%; 97.86%; 97.33%; 98.93%
11 | Bmc [17] | 2013 | UCI repository | Diabetic kidney disease | MATLAB | SVM; PLS; FFNN; RPART; Random forest; Naïve Bayes; C5.0 | No cross-validation | 0.91; 0.83; 0.85; 0.87; 0.91; 0.86; 0.90
12 | Ramya and Radha [18] | 2016 | UCI repository | Chronic kidney disease | R | BP; RBF; Random forest (RF) | No cross-validation | 80.4; 85.3; 78.6
13 | Kumar [19] | 2016 | UCI repository | CKD | MATLAB | RF; SMO; Naïve Bayes; RBF; MLPC; SLG | No cross-validation | 95.67; 90; 87.64; 83.78; 89; 87
14 | Basarslan and Kayaalp [20] | 2019 | UCI repository | Chronic kidney disease | MATLAB | K-nearest neighbor; Naïve Bayes; LR; RF | No cross-validation | 97; 96.5; 97.56; 99
15 | Dowluru and Rayavarapu [21] | 2012 | UCI repository | Kidney stone | WEKA tool | Naïve Bayes classification; Logistic regression; J48 algorithm; Random forest | No cross-validation | 0.99; 1.00; 0.97; 0.98
15 | Dowluru and Rayavarapu [21] | 2012 | UCI repository | Kidney stone | Orange tool | Naïve Bayes; KNN; Classification tree; C4.5; SVM; Random forest | No cross-validation | 0.79; 0.7377; 0.9352; 0.9352; 0.9198; 0.9352
Bold values represent the highest accuracy in the relevant paper.
Classifiers: this column lists the different machine learning classifiers that were used in the research and compared with each other.
Tool: this column gives the programming language or framework that was used in building the model. The researchers used these tools to preprocess the input data, create a prediction model, and finally carry out the testing stage.
Cross-validation: this column gives information about the validation of the classifiers and allows a comparison of the research papers in terms of the number of cross-validation folds used.
Accuracy: this column gives the accuracy of the outcomes of the proposed model. If the article makes a comparison, the accuracy column only contains the accuracy percentage of the best classifier confirmed by the author.
2.2. ML Classifier with Highest Accuracy. The machine learning algorithms that we analyzed from the above literature are listed in Table 4 and Figure 3.
3. Proposed Methodology
The proposed hybrid model is implemented in Python with pandas, sklearn, Matplotlib, Plotly, and other essential libraries. We downloaded the CKD dataset from the UCI repository. The dataset contains two groups: CKD, represented by 1, and non-CKD, represented by 0. The machine learning algorithms with the best accuracy are selected for analysis and implementation so that the results can be reproduced. We have also developed a hybrid model based on the knowledge gained during the analysis and implementation. The hybrid model consists of Gaussian Naïve Bayes, gradient boosting, and decision tree as base classifiers and random forest as the meta-classifier. We selected tree-based machine learning algorithms to achieve the highest accuracy while also handling the overfitting problem. We detect outliers with violin plots, as shown in Figure 4. To address overfitting, we implement the k-fold technique and design our model so that it reduces overfitting while achieving the highest accuracy. The classifiers are discussed below.
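The k-fold technique mentioned above can be illustrated with a minimal sketch in plain Python. It only shows how sample indices are partitioned into folds; the function name, the seed, and the use of 400 samples (the size of the dataset described in this article) are assumptions for illustration, not the authors’ code.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


# 400 records split into 10 folds: each record is held out exactly once.
folds = list(k_fold_indices(n_samples=400, k=10))
```

Because every record is used for validation exactly once, the averaged validation score is less sensitive to one lucky or unlucky split than a single hold-out estimate.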
3.1. Naïve Bayes (NB). The NB classifier belongs to the group of probabilistic classifiers and is built on Bayes’ theorem. It assumes strong independence between the features, and this assumption is central to how the classifier makes predictions. It can be built easily and is well suited to the medical field for the prediction of different diseases [15].
3.2. Decision Tree (DT). The decision tree classifier has a tree-like or flowchart-like structure. It consists of branches, leaf/child nodes, and a root/parent node. The inner nodes represent the features, whereas the branches represent the outcome of the test at each node. The decision tree is one of the most commonly used classifiers because it does not require extensive domain knowledge or parameter settings to work [15].
3.3. Random Forest (RF). In the ensemble and stacking classification approach, the random forest (RF) is the most effective algorithm among the machine learning algorithms considered. The random forest (RF) algorithm is used for prediction and probability estimation. A random forest (RF) classifier consists of many decision trees. Tin Kam Ho of Bell Labs introduced the concept of random forest in 1995; each decision tree casts a vote to determine the object’s class. The RF method combines bagging with random selection of attributes. The random forest classifier has three hyperparameters to tune [16]:
(i) Number of decision trees (ntree) used by the random forest classifier
(ii) Minimum size of a node in the trees
(iii) Number of attributes employed in splitting every node of every tree (mtry), where m is the number of attributes
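As a rough illustration, the three hyperparameters listed above map onto constructor arguments of scikit-learn’s `RandomForestClassifier`. The sketch below uses synthetic stand-in data and arbitrary example values; the mapping of “minimum node size” to `min_samples_leaf` is an assumption, and none of the values are the settings used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (NOT the CKD dataset): 400 rows, 14 features.
X, y = make_classification(n_samples=400, n_features=14, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # (i) number of decision trees (ntree)
    min_samples_leaf=1,   # (ii) minimum node size (assumed mapping)
    max_features="sqrt",  # (iii) attributes considered per split (mtry)
    random_state=0,
)
forest.fit(X, y)
n_trees = len(forest.estimators_)
```

After fitting, `forest.feature_importances_` gives one importance score per attribute, which is one way to inspect which inputs the trees rely on.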
Table 4: Machine learning algorithms and classifiers.
Articles | Classifiers | Highest accuracy (%)
1 | Decision tree | 91
2 | SVM | 98.3
3 | GB | 99.80
4 | MLP | 99.5
5 | Random forest | 99.84
6 | Random forest | 99.16
7 | Random forest | 99.75
8 | GNB | 100
8 | Decision tree | 100
9 | Random forest | 97.12
10 | Random forest | 98.93
Bold values represent the highest accuracy in the literature.
Figure 3: Comparison of machine learning classifiers. (Chart titled “Highest Accuracy” showing the classifiers and accuracy values listed in Table 4.)
Figure 4(a): Violin plots of the attributes age, bp, sg, al, su, bgr, and bu by class (0.0 and 1.0).
Some of the advantages of the random forest classifier are listed as follows.
(i) For ensemble learning algorithms, the random forest is the most appropriate choice
(ii) For large datasets, the random forest classifier performs well
(iii) Random forest (RF) is able to handle hundreds of input attributes
(iv) Random forest can estimate which attributes are more important in classification
(v) Missing values can be handled by using the random forest classifier
(vi) Random forest handles the balancing error for classes in unbalanced datasets
3.4. Gaussian Naïve Bayes (GNB). Gaussian Naïve Bayes (GNB) calculates the mean and standard deviation of each attribute at the training stage. The mean and standard deviation are then used to calculate the probabilities for the test data. For this reason, attribute values that lie far above or below the calculated mean affect the classifier’s performance: when test patterns contain such values, the classifier sometimes gives wrong output labels [22].
Figure 4(b): Violin plots of the attributes wc, rc, pcv, sod, hemo, and pot by class (0.0 and 1.0).
Figure 4: Violin plot of attributes.
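The role of the per-attribute mean and standard deviation in Gaussian Naïve Bayes can be made concrete with a small from-scratch sketch. It is a simplified illustration in plain Python, not the implementation used in this study; the function names and the two-attribute toy data are hypothetical.

```python
import math
from collections import defaultdict


def fit_gnb(X, y):
    """Store the per-class prior, mean, and standard deviation of each attribute."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        stds = [
            # A small floor avoids division by zero for constant attributes.
            max(math.sqrt(sum((v - m) ** 2 for v in col) / len(col)), 1e-9)
            for col, m in zip(columns, means)
        ]
        model[label] = (len(rows) / len(X), means, stds)
    return model


def predict_gnb(model, row):
    """Return the class with the highest log posterior for one sample."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, stds) in model.items():
        score = math.log(prior)
        for value, mean, std in zip(row, means, stds):
            # Log of the Gaussian density N(value; mean, std^2).
            score += -math.log(std * math.sqrt(2 * math.pi))
            score -= (value - mean) ** 2 / (2 * std ** 2)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical two-attribute toy data; 1 = CKD, 0 = not CKD.
X_train = [[15.0, 1.0], [14.5, 0.9], [15.5, 1.1], [9.0, 4.0], [8.5, 5.0], [9.5, 4.5]]
y_train = [0, 0, 0, 1, 1, 1]
model = fit_gnb(X_train, y_train)
```

The squared distance from the class mean enters the score directly, which is why a test value far from every class mean can dominate the decision and lead to a wrong label.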
3.5. Hybrid Model. We use the concept of stacking for our hybrid model. Stacking is a type of ensemble technique in which multiple classification models are combined through a main (meta) classifier. The models are arranged in layers, one after the other: the models in the lower layer receive the attributes of the original data as input and pass their predictions upward, and the topmost model makes the final prediction on the basis of the combined outputs of the base models. The stacking technique thus uses multiple independent machine learning models to process the original data. The meta-classifier then predicts from the input together with the output of each machine learning model, and the weights of the individual algorithms are estimated. The algorithms that perform best are selected, and those with low performance are removed. In this technique, multiple classifiers based on different machine learning algorithms are trained as base models on the same dataset and then combined through a meta-classifier [23]. Figure 5 shows the flow diagram for the proposed hybrid model.
The execution of the model follows the sequence of steps given below:
(i) Collect the CKD data from the UCI repository
(ii) Perform exploratory data analysis (EDA) on the dataset
(iii) Split the dataset into two parts: test data and train data
(iv) Apply 10-fold cross-validation
(v) Train the base models Gaussian Naïve Bayes, gradient boosting, and decision tree with the train set, giving the predictions M1, M2, and M3, respectively
(vi) Use the outputs of the base models M1, M2, and M3 and the test set data as training input for the random forest
(vii) Once the random forest is trained, it gives its prediction on the basis of the training dataset and the output predictions of the base models
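The steps above can be sketched with scikit-learn’s `StackingClassifier`. This is a hedged illustration under stated assumptions, not the authors’ code: synthetic data stands in for the CKD dataset, all hyperparameters are library defaults, and the meta-classifier is trained only on cross-validated base predictions from the training split (with `passthrough=True` so that it also sees the original attributes), which may differ from step (vi) as worded.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the CKD data (assumption): 400 rows, 14 features.
X, y = make_classification(n_samples=400, n_features=14, random_state=0)

# 80% training data, 20% test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Base classifiers producing the predictions M1, M2, and M3.
base_models = [
    ("gnb", GaussianNB()),
    ("gb", GradientBoostingClassifier(random_state=0)),
    ("dt", DecisionTreeClassifier(random_state=0)),
]

hybrid = StackingClassifier(
    estimators=base_models,
    final_estimator=RandomForestClassifier(random_state=0),
    cv=10,             # 10-fold cross-validated base predictions
    passthrough=True,  # meta-classifier also sees the original attributes
)
hybrid.fit(X_train, y_train)
test_accuracy = hybrid.score(X_test, y_test)
```

Keeping the test split out of every fitting step means `test_accuracy` estimates performance on unseen records; the value obtained on synthetic data says nothing about the accuracies reported in this article.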
In this study, we considered the UCI CKD dataset and split it into two parts. 80% of the data is used for training, as input to the machine learning algorithms. We applied Gaussian Naïve Bayes, gradient boosting, decision tree, and a stacking classifier with the random forest algorithm to predict chronic kidney disease on the remaining 20% test data, then plotted the predicted values and compared them. Our proposed methodology has the following advantages.
(i) We implemented four machine learning algorithms: decision tree, gradient boosting, Gaussian Naïve Bayes, and random forest. We applied stacking classifiers to build the hybrid model that combines these four algorithms.
(ii) We analyzed the accuracy on the same dataset with respect to different machine learning algorithms and compared their accuracy scores to get the best model
(iii) We implemented a stacking classifier technique to build a new model with improved accuracy
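The per-model comparison in item (ii) can be illustrated as follows. This is a hedged sketch only: it uses synthetic data in place of the CKD dataset, default hyperparameters, and a single 80/20 split, so the scores it prints are not the accuracies reported in this paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in for the CKD data (400 rows, 14 numeric features).
X, y = make_classification(
    n_samples=400, n_features=14, n_informative=8, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

models = {
    "Gradient boosting": GradientBoostingClassifier(random_state=1),
    "Gaussian Naive Bayes": GaussianNB(),
    "Decision tree": DecisionTreeClassifier(random_state=1),
    "Random forest": RandomForestClassifier(random_state=1),
}

# Fit every classifier on the same train split and score it on the same test split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

best_model = max(scores, key=scores.get)
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
print(f"Best single model on this split: {best_model}")
```

Evaluating all classifiers on identical splits keeps the accuracy scores directly comparable.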
4. Dataset Details
We selected 14 attributes from the UCI repository dataset of chronic kidney disease as input features, as shown in Table 5: age is the patient’s age, bp indicates the blood pressure, sg indicates the specific gravity of the urine, al indicates the level of albumin in the patient’s urine, bgr (blood glucose random) indicates the random blood glucose level, su represents the sugar level, bu indicates the blood urea, sod indicates the amount of sodium, sc indicates the serum creatinine, pot indicates the amount of potassium, hemo indicates the hemoglobin, and pcv indicates the packed cell volume. Further, wc indicates the white blood cell count, and rc indicates the red blood cell count.
To identify the number of chronic kidney disease patients and the number of healthy individuals, we visualized the class distribution of the CKD dataset, shown in the histogram plot in Figure 6. Here 0.0 represents the healthy cases, while 1.0 represents the chronic kidney disease patients. The dataset contains 250 chronic kidney disease patients and 150 healthy people.
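The class counts behind such a histogram can be tabulated with pandas. The sketch below rebuilds only the label column from the counts stated above (250 and 150); the column name `class` and the 0.0/1.0 encoding follow the figure, and no other part of the dataset is reproduced.

```python
import pandas as pd

# Label column rebuilt from the stated counts: 1.0 = CKD, 0.0 = healthy.
labels = pd.Series([1.0] * 250 + [0.0] * 150, name="class")

counts = labels.value_counts().sort_index()
proportions = counts / counts.sum()

print(counts.to_dict())                # {0.0: 150, 1.0: 250}
print(proportions.round(3).to_dict())  # {0.0: 0.375, 1.0: 0.625}
```

The classes are moderately imbalanced (62.5% CKD), which is worth keeping in mind when reading accuracy scores.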
The Pearson correlation feature selection method is used to get the best combination of features for the prediction of chronic kidney disease. The correlation of the 14 attributes and 1 output label is presented in Figure 7.
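As an illustration of correlation-based selection, the sketch below computes the Pearson correlation matrix with pandas and keeps the features whose absolute correlation with the label exceeds a cutoff. The data are synthetic, the three column names are borrowed from Table 5 purely as labels, and the threshold of 0.3 is a hypothetical value, since no cutoff is stated in this paper.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 400
label = rng.integers(0, 2, size=n)

# Synthetic columns (assumption): two constructed to depend on the label, one not.
df = pd.DataFrame({
    "hemo": 15.0 - 4.0 * label + rng.normal(0.0, 1.0, n),  # built to correlate negatively
    "sc": 1.0 + 3.0 * label + rng.normal(0.0, 1.0, n),     # built to correlate positively
    "pot": rng.normal(4.5, 0.5, n),                        # built independent of the label
    "class": label,
})

# Pearson correlation matrix (the quantity a heat map would visualize).
corr = df.corr(method="pearson")

# Keep features whose absolute correlation with the label passes the cutoff.
threshold = 0.3  # hypothetical cutoff, not taken from the paper
target_corr = corr["class"].drop("class")
selected = target_corr[target_corr.abs() >= threshold].index.tolist()

print(target_corr.round(2).to_dict())
print("Selected features:", selected)
```

Using the absolute value matters because a strongly negative correlation is as informative for prediction as a strongly positive one.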
Moving from the exploratory data analysis stage to the pair plot visualization proved very helpful, as the pair plot exposes the relationships between attributes for both the categorical and continuous variables. We use the Seaborn library to produce the pair plot. It presents the information about all the attributes clearly in one picture, with the statistical information in a readable format, as shown in Figure 8.
Violin plots are used in the exploratory data analysis for all the attributes used in the hybrid model. They give additional useful information such as the density trace and the distribution of the dataset. The violin plots show the whole range of the dataset, which cannot be shown by a box plot. The violin plots of all 14 attributes are given in Figure 4. Figure 9 shows the comparison of the different models’ accuracy scores in the form of a chart.
5. Results and Discussion
Machine learning algorithms such as gradient boosting, Gaussian Naïve Bayes, decision tree, and random forest classifier were used in the proposed hybrid model. These different machine learning classifiers were used in combination for the chronic kidney disease predictions. This also overcomes the overfitting problem and results in higher
[Flowchart nodes: start, CKD dataset, exploratory data analysis, data preprocessing, data splitting (train set, test set), cross-validation, implementing base models, trained base models (M1, M2, M3), random forest, prediction, final model, output (ckd, not-ckd)]
Figure 5: Flowchart for the proposed model.
Table 5: The attribute set with their data types.
#   Attributes  Full form                  Data type  Nonempty values  Missing values
0   age         Age                        float64    400              0
1   bp          Blood pressure             float64    400              0
2   sg          Specific gravity of urine  float64    400              0
3   al          Level of albumin           float64    400              0
4   su          Sugar level                float64    400              0
5   bgr         Blood glucose random       float64    400              0
6   bu          Blood urea                 float64    400              0
7   sc          Serum creatinine           float64    400              0
8   sod         Amount of sodium           float64    400              0
9   pot         Amount of potassium        float64    400              0
10  hemo        Hemoglobin                 float64    400              0
11  pcv         Packed cell volume         float64    400              0
12  wc          White blood cell count     float64    400              0
13  rc          Red blood cell count       float64    400              0
accuracy. To improve accuracy and to offer a novel approach compared with the existing work, we implemented the proposed hybrid model with the best combination of GB, GNB, and decision tree, along with the random forest classifier [24–27]. The results in Table 6 show that the diagnosis of chronic kidney disease is effective when the random forest is combined with the other classifiers through stacking in the hybrid model. Gradient boosting achieves 99% accuracy, random forest achieves 98% accuracy, and our hybrid model achieves 100% accuracy while, at the same time, reducing the chances of overfitting.
To identify where contributions to the development of prediction models for chronic kidney disease originate, a region-wise analysis is performed. As discussed in the Introduction, the population of developing countries suffers more from chronic kidney disease, and it was observed that most of the research work is performed in developing countries. A summary of this region-wise contribution is presented in Figure 10.
[Histogram; x-axis: class (0.0, 1.0); y-axis: count (0 to 250)]
Figure 6: Histogram plot.
[Heat map; both axes: age, bp, sg, al, su, bgr, bu, sc, sod, pot, hemo, pcv, wc, rc, class; color scale from −0.6 to 1.0]
Figure 7: Heat map of chosen attributes.
Figure 8: Pair plot of each attribute.
[Bar chart of accuracy (%): hybrid model 100, gradient boosting 99, random forest 98, decision tree 96, Gaussian Naïve Bayes 93]
Figure 9: Accuracy score of implemented machine learning classifiers.
Table 6: Accuracy score of implemented machine learning classifiers.
ML algorithms         Accuracy (%)
Gradient boosting     99
Gaussian Naïve Bayes  93
Decision tree         96
Random forest         98
Hybrid model          100
[Pie chart of region-wise contributions: Asia 50%, Europe 20%, America 20%, Africa 10%]
Figure 10: Region-wise contributions.
6. Conclusion
Chronic kidney disease is considered one of the prominent life-threatening diseases in the developing world. The most obvious cause seems to be lack of physical exercise. Medical practitioners use a number of diagnostic processes and procedures, of which machine learning is a recent development. In this paper, we selected machine learning because, in terms of accuracy, it performs better than other available approaches. We used the Pearson correlation feature selection method and applied it with the machine learning classifiers. GB, GNB, decision tree, and random forest are the base classifiers for the stacking algorithm, and they are implemented with cross-validation and compared on the basis of accuracy score. We evaluated these algorithms on the same dataset. Furthermore, we used the CKD dataset from the UCI repository, which contains 14 attributes and 400 instances. On the basis of these attributes, our proposed stacking model is able to predict whether a person is a CKD patient or not with 100% accuracy. The best features are selected using the Pearson correlation method, and the stacking algorithm is implemented with the best machine learning classifiers. The cross-validation enhances the performance of the stacking model. As we worked on chronic kidney disease data with binary classes, the stacking algorithm performs better with these combinations of algorithms. The stacking technique can also be implemented for the prediction of other diseases to get a better accuracy score.
Data Availability
No data were used to support this study.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
References
[1] V. Jha, G. Garcia-Garcia, K. Iseki et al., “Chronic kidney disease: global dimension and perspectives,” The Lancet, vol. 382, no. 9888, pp. 260–272, 2013.
[2] R. Ruiz-Arenas, “A summary of worldwide national activities
in chronic kidney disease (CKD) testing, the electronic
journal of the international federation of,” Clinical Chemistry
and Laboratory Medicine, vol. 28, no. 4, pp. 302–314, 2017.
[3] Thedailystar, “Over 35,000 develop kidney failure in Bangladesh every year,” 2019, https://www.thedailystar.net/city/news/18m-kidney-patients-bangladesh-every-year-1703665.
[4] Prothomalo, “Women more affected by kidney diseases,” 2018, https://en.prothomalo.com/bangladesh/Womenmore-affected-by-kidney-diseases.
[5] Scottish Intercollegiate Guidelines Network (Sign), Diagnosis
and Management of Chronic Kidney Disease: A National
Clinical Guideline, SIGN, Victoria, Australia, 2008.
[6] M. Kavitha, G. Gnaneswar, R. Dinesh, Y. R. Sai, and
R. S. Suraj, “Heart disease prediction using hybrid machine
learning model,” in Proceedings of the 2021 6th International
Conference on Inventive Computation Technologies (ICICT),
Coimbatore, India, January 2021.
[7] P. Ghosh, F. M. Javed Mehedi Shamrat, S. Shultana, S. Afrin, A. A. Anjum, and A. A. Khan, “Optimization of prediction method of chronic kidney disease using machine learning algorithm,” in Proceedings of the 2020 15th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), Bangkok, Thailand, November 2020.
[8] K. R. A. Padmanaban and G. Parthiban, “Applying machine learning techniques for predicting the risk of chronic kidney disease,” Indian Journal of Science and Technology, vol. 9, no. 29, 2016.
[9] A. Charleonnan, T. Fufaung, T. Niyomwong, W. Chokchueypattanakit, S. Suwannawach, and N. Ninchawee, “Predictive analytics for chronic kidney disease using machine learning techniques,” in Proceedings of the 2016 Management and Innovation Technology International Conference (MITicon), Bang-San, Thailand, October 2016.
[10] G.-S. Fu, Y. Levin-Schwartz, Q.-H. Lin, and D. Zhang,
“Machine learning for medical imaging,” Journal of healthcare
engineering, vol. 2019, pp. 1-2, 2019.
[11] R. Devika, S. V. Avilala, and V. Subramaniyaswamy, “Comparative study of classifier for chronic kidney disease prediction using naive Bayes, KNN and random forest,” in Proceedings of the 2019 3rd International Conference on Computing Methodologies and Communication (ICCMC), Erode, India, March 2019.
[12] S. Revathy, B. Bharathi, P. Jeyanthi, and M. Ramesh, “Chronic kidney disease prediction using machine learning models,” International Journal of Engineering and Advanced Technology, vol. 9, no. 1, pp. 6364–6367, 2019.
[13] A. S. A. Rabby, R. Mamata, M. A. Laboni, Ohidujjaman, and S. Abujar, “Machine learning applied to kidney disease prediction: comparison study,” in Proceedings of the 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kanpur, India, July 2019.
[14] M. Nishat, F. Faisal, R. Dip et al., “A comprehensive analysis
on detecting chronic kidney disease by employing machine
learning algorithms,” EAI Endorsed Transactions on Pervasive
Health and Technology, vol. 7, Article ID 170671, 2018.
[15] S. Pouriyeh, S. Vahid, G. Sannino, G. De Pietro, H. Arabnia, and J. Gutierrez, “A comprehensive investigation and comparison of Machine Learning Techniques in the domain of heart disease,” in Proceedings of the 2017 IEEE Symposium on Computers and Communications (ISCC), Heraklion, Greece, July 2017.
[16] M. A. Jabbar, B. L. Deekshatulu, and P. Chandra, “Intelligent
heart disease prediction system using random forest and
evolutionary approach,” Journal of network and innovative
computing, vol. 4, pp. 175–184, 2016.
[17] Bmc, “Biomedcentral,” 2022.
[18] S. Ramya and N. Radha, “Diagnosis of chronic kidney disease using machine learning algorithms,” International Journal of Innovative Research in Computer and Communication Engineering, vol. 4, no. 1, 2016.
[19] M. Kumar, “Prediction of chronic kidney disease using
random forest machine learning algorithm,” International
Journal of Computer Science and Mobile Computing, vol. 5,
pp. 24–33, 2016.
[20] M. S. Basarslan and F. Kayaalp, “Performance analysis of fuzzy rough set-based and correlation-based attribute selection methods on detection of chronic kidney disease with various classifiers,” in Proceedings of the 2019 Scientific Meeting on Electrical-Electronics and Biomedical Engineering and Computer Science (EBBT), April 2019.
[21] S. K. Dowluru and A. K. Rayavarapu, “Statistical and data mining aspects on kidney stones: a systematic review and meta-analysis,” Open Access Scientific Reports, vol. 1, no. 12, 2012.
[22] S. M. M. Hasan, M. A. Mamun, M. P. Uddin, and M. A. Hossain, “Comparative analysis of classification approaches for heart disease prediction,” in Proceedings of the 2018 International Conference on Computer, Communication, Chemical, Material and Electronic Engineering (IC4ME2), pp. 1–4, Rajshahi, Bangladesh, February 2018.
[23] C. B. C. Latha and S. C. Jeeva, “Improving the accuracy of prediction of heart disease risk based on ensemble classification techniques,” Informatics in Medicine Unlocked, vol. 16, Article ID 100203, 2019.
[24] A. J. Aljaaf, A.-J. Dhiya, H. M. Hussein et al., “Early prediction
of chronic kidney disease using machine learning supported
by predictive analytics,” in Proceedings of the 2018 IEEE
Congress on Evolutionary Computation (CEC), Rio de Janeiro,
Brazil, July 2018.
[25] S. Khan, M. Z. Khan, P. Khan, G. Mehmood, A. Khan, and M. Fayaz, “An ant-hocnet routing protocol based on optimized fuzzy logic for swarm of UAVs in FANET,” Wireless Communications and Mobile Computing, vol. 2022, Article ID 6783777, 12 pages, 2022.
[26] M. Fayaz, G. Mehmood, A. Khan, S. Abbas, M. Fayaz, and J. Gwak, “Counteracting selfish nodes using reputation based system in mobile ad hoc networks,” Electronics, vol. 11, no. 2, p. 185, 2022.
[27] M. Z. U. Haq, M. Z. Khan, H. U. Rehman et al., “An adaptive topology management scheme to maintain network connectivity in Wireless Sensor Networks,” Sensors, vol. 22, no. 8, p. 2855, 2022.